Backport the performance improvement from llama.cpp

It would be very cool if the performance improvements from https://github.com/ggerganov/llama.cpp/pull/613 could be backported to this repo.

I couldn't find an existing issue for this. If there is one, I'm happy to close this one.