Forked from gaxler/llama2.rs

Inference Llama 2 in one file of pure F#

llama2.fs 🦀

Have you ever wanted to inference a baby Llama 2 model in pure F#? No? Well, now you can!

Full Llama2 Support 🚀🚀

We can now run the full llama2-7B! There is no memory mapping for now, so all the weights must fit in memory (~26 GB). On my Codespaces VM with 16 cores and 64 GB of memory, inference runs at 1.4 tokens per second.
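
To make the memory requirement concrete, here is a minimal sketch of loading without memory mapping, assuming a llama2.c-style checkpoint layout (a small config header followed by raw float32 tensors); `readTensor` and the usage below are illustrative, not the project's actual code:

```fsharp
open System.IO

// Minimal sketch, assuming a llama2.c-style checkpoint: a small config
// header followed by raw float32 tensors. Without mmap, every tensor is
// read into a managed array and stays resident in RAM for the whole run,
// hence the ~26 GB requirement for 7B.
let readTensor (reader: BinaryReader) (count: int) : float32[] =
    Array.init count (fun _ -> reader.ReadSingle())

// Hypothetical usage once vocabSize and dim have been parsed from the header:
// use reader = new BinaryReader(File.OpenRead "model.bin")
// let tokenEmbedding = readTensor reader (vocabSize * dim)
```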

Performance

It's fast: ~1 token per second on the 7B chat model, on a 6-core machine.

Keeping up with the original

The upstream llama2.c is likely to move fast and pick up lots of contributions.

So any contribution that helps this port keep pace is welcome!

Contribution Ideas

  • 8-bit vectorization (a sketch of one possible approach follows this list).
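
One possible shape for that idea, sketched with .NET's portable SIMD type `System.Numerics.Vector<'T>`; the quantization scheme (sbyte values with a single float32 scale) and the `dotQ8` name are assumptions for illustration, not the project's code:

```fsharp
open System.Numerics

// Illustrative int8 dot product (hypothetical quantization scheme: sbyte
// values with one float32 scale). Lanes are widened sbyte -> int16 -> int32,
// so each product (at most 128 * 128 = 16384) fits in int16 and the int32
// accumulator has plenty of headroom. Assumes the length is a multiple of
// Vector<sbyte>.Count, for brevity.
let dotQ8 (w: sbyte[]) (x: sbyte[]) (scale: float32) : float32 =
    let step = Vector<sbyte>.Count
    let mutable acc = Vector<int>.Zero
    let mutable i = 0
    while i + step <= w.Length do
        let wLo, wHi = Vector.Widen(Vector<sbyte>(w, i))
        let xLo, xHi = Vector.Widen(Vector<sbyte>(x, i))
        let a, b = Vector.Widen(wLo * xLo)
        let c, d = Vector.Widen(wHi * xHi)
        acc <- acc + a + b + c + d
        i <- i + step
    scale * float32 (Vector.Dot(acc, Vector<int>.One)) // horizontal sum
```

With AVX2, `Vector<sbyte>.Count` is 32, so each iteration consumes 32 quantized values; the same pattern would drop into the matmul inner loops.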

License

MIT
