v0.6.0 significantly reduced performance #225
That is a huge slowdown! Can you profile the application at all to see where the time is going (e.g. using the Visual Studio diagnostic tools)? The vast majority of the time should be spent inside of llama.cpp. If there is a slowdown inside llama.cpp then the only reason I can think of is that we're currently using a bit of an obsolete inference system: llama.cpp replaced llama_eval with llama_decode not long ago, and we haven't switched over yet.
One way to check might be to pull this PR and to try out the batched decoding example.
@martindevans it is a little outside my area of expertise to do the profiling, however I can add that I can reproduce the results on two vastly different setups.
If you can't do the profiling, would you mind trying the batched decoding example? If that's the same speed in llama.cpp and LLamaSharp I think it will confirm what the problem is.
I just tried rewriting the Stateless executor to use llama_decode (it was very rough, so I'm not committing this work yet). There was absolutely no speed difference, so it's probably not due to the old llama_eval-based inference path.
@martindevans I have checked out https://github.com/martindevans/LLamaSharp/tree/batch_decoding to see if there are any performance differences in your branch in the non-batched chat version - it is still on par with v0.6.0. I will try to find out if I can manage to profile both versions.
@martindevans my attempt at profiling and comparing the two versions: the only difference I can discern is that 0.6.0 has a lengthy …
Unfortunately that's a false lead.
I had a guess that maybe we were hitting some pathological edge case in async evaluation. Unfortunately that doesn't seem to be true. I just tried out 4 tests with the stateless executor, each time running the same prompt.
It's interesting that #3 is a little bit faster, and we might want to look into switching in the future, but it doesn't explain the massive performance regression.
Yeah that was what made me wonder about the async system in my previous comment. Completely removing async from the stateless executor produced no real change though, so I'm not sure how else to investigate that.
Actually I guess I should check, @lexxsoft do you see the same regression in the stateless executor? I've been using that for testing assuming you do, but all my tests might be useless if you don't!
@martindevans, that's basically the same result: stateless execution (with InferAsync) times: 14.869 (v0.5.1) vs 1:18.076 (v0.6.0)
I've had a guess: maybe the required memory increased in 0.6.0, in which case inference would be greatly slowed down by memory pages being swapped in and out. @lexxsoft Could you please have a look at the memory usage when running the model inference?
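For reference, a minimal sketch of how such a check could be done from C#: sample the process memory counters on a timer while inference runs and watch whether the working set collapses under paging. The inference call below is a placeholder stand-in.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class MemoryWatcher
{
    static void Main()
    {
        var process = Process.GetCurrentProcess();

        // Sample the memory counters once a second on a timer thread
        // while the (placeholder) inference workload runs below.
        using var timer = new Timer(_ =>
        {
            process.Refresh();
            Console.WriteLine(
                $"WorkingSet: {process.WorkingSet64 / 1024 / 1024} MB, " +
                $"Paged: {process.PagedMemorySize64 / 1024 / 1024} MB");
        }, null, 0, 1000);

        RunInference();
    }

    // Stand-in for the actual model inference call.
    static void RunInference() => Thread.Sleep(10_000);
}
```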
If my guess unfortunately turns out wrong, I think we should profile the single native evaluation call.
@AsakusaRinne I'm afraid it's a bit over my head to delve deeper into the profiling. I wonder, am I the only one with this problem, or is anyone else seeing it too?
@lexxsoft You're the first one to report this performance problem. However, until we find the reason behind it we should assume it may be a common problem, which I think makes it high priority.
Ok I've created an extremely simple test case we can use to narrow this down some more, here it is. It may require some small adjustments for older versions of LLamaSharp. Here are my results so far:

#223: Batch (100): 34490ms
master (321d0b5): Batch (100): 32817ms
Tag 0.5.1: Batch (100): 3006ms

Definitely reproducing the results!
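The linked test case isn't reproduced inline, but it boils down to timing a fixed amount of generation with a Stopwatch. A rough sketch of the idea follows; the model path and prompt are placeholders, and the type names follow the current high-level API, which (as noted above) may need small adjustments for older versions.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

class Benchmark
{
    static async Task Main()
    {
        // Model path and prompt are placeholders.
        var parameters = new ModelParams("llama-2-7b-chat.Q5_K_M.gguf");
        using var weights = LLamaWeights.LoadFromFile(parameters);
        var executor = new StatelessExecutor(weights, parameters);

        var timer = Stopwatch.StartNew();
        await foreach (var _ in executor.InferAsync("Once upon a time", new InferenceParams { MaxTokens = 100 }))
        {
            // Consume the token stream; only total wall-clock time matters here.
        }
        timer.Stop();

        Console.WriteLine($"Batch (100): {timer.ElapsedMilliseconds}ms");
    }
}
```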
(Starting a git bisect now)
Ok I think I've identified the commit, but I don't think it's good news.
Note that if you want to test this for yourself you might need to grab your own libllama.dll from the appropriate release on the main repo. The ones I use during development are usually AVX512. |
Anyone else have any ideas to debug this further? I spent a while double checking the changes in that commit (making sure it's not doing something crazy like setting batchsize or nthreads to one) but I can't spot anything. I'm at a bit of a dead end right now. Everything points to the updated llama.cpp binaries.
@martindevans Thanks for your effort in narrowing the problem down to one commit. I've reviewed that PR and the corresponding llama.cpp code. Though I'm unfortunately not sure what caused the problem, I'd like to share some of my ideas. First: we can always trust the implementation of llama.cpp. Based on that assumption, what I first come up with is wrong parameters being passed to the native APIs, which also disturbed me early in this project. That is when you see … Here's some code which I'd suspect on my side:
Here's a method to check for this kind of problem, drawn from my previous debugging experience:
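The steps themselves were lost in formatting here. As a general illustration of this kind of check (not necessarily the method being referred to), one way to catch a mismatched native struct is to print the managed size and field offsets and compare them against a sizeof/offsetof printout on the C side. The struct below is a made-up stand-in, not a real LLamaSharp type.

```csharp
using System;
using System.Runtime.InteropServices;

// Made-up stand-in for a native parameter struct; a real check would be
// run against the actual structs passed to llama.cpp.
[StructLayout(LayoutKind.Sequential)]
struct ExampleContextParams
{
    public uint Seed;
    public int ContextSize;
    public int BatchSize;
    [MarshalAs(UnmanagedType.I1)] public bool UseMemoryLock;
}

class LayoutCheck
{
    static void Main()
    {
        // Compare these numbers against sizeof()/offsetof() printed from C.
        Console.WriteLine($"Managed size: {Marshal.SizeOf<ExampleContextParams>()}");
        foreach (var field in typeof(ExampleContextParams).GetFields())
        {
            var offset = Marshal.OffsetOf<ExampleContextParams>(field.Name);
            Console.WriteLine($"{field.Name}: offset {offset}");
        }
    }
}
```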
I know that the debugging work is time-consuming, so if you hit obstacles please tell me what I can do to help. If you're too busy these days I'd like to do the checks above this weekend (on weekdays I'm a little busy). :)
I just tried the LLama.Web project (v0.6.0) with llama-2-7b-chat.Q4_0.gguf and LLamaSharp.Backend.Cpu, and noticed that the text stream takes about 3-5 seconds for each word that is displayed. I did not change anything in the default configuration. Is this the problem from the thread creator, or is my processor just too slow? I am using an i7 2.20GHz, 4 cores with 8 logical processors. It amazes me that every single word loads so slowly :)
Try comparing the same project in 0.5.1; that'll tell you if it's just a hardware limitation or this issue.
@AsakusaRinne Thanks for all the suggestions. I've worked my way through them, double checking everything in docs/testing:
If the property has an invisible backing field (i.e. an auto-implemented property) …
You guessed right, it's equivalent. The …
It shouldn't matter in that specific case since …
I'm fairly sure it won't cause any issues, making it a …
As far as I'm aware it's just a compiler thing and doesn't actually have any runtime effect (e.g. reflection can modify readonly fields). I can't actually find a definitive answer in the docs though. Just to be sure I went through and removed all of the readonly modifiers.
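As an aside, the reflection behaviour mentioned there is easy to demonstrate in a few lines; this is purely an illustration of the language point, unrelated to the regression itself.

```csharp
using System;
using System.Reflection;

class Box
{
    private readonly int _value = 1;
    public int Value => _value;
}

class Demo
{
    static void Main()
    {
        var box = new Box();

        // readonly is enforced by the compiler, not by the runtime:
        // reflection can still write to an instance readonly field.
        var field = typeof(Box).GetField("_value", BindingFlags.NonPublic | BindingFlags.Instance);
        field.SetValue(box, 42);

        Console.WriteLine(box.Value); // prints 42
    }
}
```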
While testing the above I rewrote the test script to use as little of LLamaSharp as possible. Here it is. It directly calls into llama.cpp through the native API with no other C# code in the way. Still slow!
Well... good news, I guess. Given the above test I couldn't see how it could be anything but a problem in the binary. I just grabbed a new binary from llama.cpp (bin-win-avx2 from here) and it's fast. I'm going to start a new build action, update all of the binaries from scratch, and hope that fixes the issue. Edit: Did this, and it's slow again! There's something wrong with our build process.
Ok, finally this should be resolved. Phew. The problem with the build process was that by default the binaries were being built without AVX2 support. I've fixed that, rebuilt all the binaries and added them to #223. If you can confirm that fixes the issue for you please leave a comment on that PR. I'll merge it (which should immediately auto-release) as soon as some people have confirmed.
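One quick way to sanity-check which instruction sets a given libllama binary was actually compiled with is llama.cpp's own system info string. A minimal P/Invoke sketch; the library name and search path may need adjusting per platform.

```csharp
using System;
using System.Runtime.InteropServices;

class SystemInfo
{
    // llama.cpp's llama_print_system_info returns a static string listing
    // the instruction sets the binary was compiled with (AVX, AVX2, ...).
    [DllImport("libllama", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr llama_print_system_info();

    static void Main()
    {
        var info = Marshal.PtrToStringAnsi(llama_print_system_info());
        Console.WriteLine(info); // expect to see "AVX2 = 1" in a fixed binary
    }
}
```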
That's a good catch! I'll add a benchmark test to the CI this weekend to reduce the risk of such performance problems happening again.
Yes, it seems that fixed the problem; it is now as performant as it was before.
0.7.0 still seems slower...
Could you please provide the performance comparison between 0.5.1 and 0.7.0?
Not to sidetrack, but there has been another driver-dependent performance issue affecting applications that use lots of VRAM. If anyone is running Windows, they should check that they have the latest drivers, and also switch off the System Memory Fallback option for their program, because it causes a CPU bottleneck.
@hswlab could you open a separate issue with details on your setup so we can look into that?
I'll close this one now since 0.7.0 has resolved the performance issue.
The only thing I'm changing is upgrading to 0.7.0. Using the Cuda12 backend on a 4090. GpuLayerCount is 30, UseMemoryLock is true, MaxTokens is 128. Seed: 42 in both runs.
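For context, those settings map onto LLamaSharp's parameter objects roughly as follows. This is only a sketch: the model path is a placeholder and the property names follow the 0.7-era API.

```csharp
using LLama.Common;

// Model path is a placeholder; the settings match the ones described above.
var parameters = new ModelParams("model.gguf")
{
    GpuLayerCount = 30,
    UseMemoryLock = true,
    Seed = 42,
};

var inferenceParams = new InferenceParams
{
    MaxTokens = 128,
};
```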
@AlShadi How large is your model? Would you like to run the same test with a small model that can be fully offloaded to your GPU? I guess the bottleneck in 0.7.0 is entirely on the CPU side.
There shouldn't be any slowdown on the CPU side in 0.7.0 now that it's properly using AVX2 :/
@AsakusaRinne airoboros-l2-70b-3.1.2.Q5_K_M.gguf. CPU is an AMD Ryzen 9 5900X; System.Runtime.Intrinsics.X86.Avx2.IsSupported returns true. Trying with llama-2-7b-chat.f16.gguf with 35 of 35 layers offloaded. Another observation: the 7b model produced the same output with the same seed between versions, while the 70b model produced different output with the same seed between 0.5.1 and 0.7.0.
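For anyone else wanting to check what their CPU reports from C#, the runtime exposes the intrinsics flags directly. Note this reflects the CPU and the .NET runtime, not which instruction sets the native libllama binary was built with.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

class CpuFlags
{
    static void Main()
    {
        // What the .NET runtime reports for this CPU; separate from
        // what the native libllama binary was compiled with.
        Console.WriteLine($"SSE4.2: {Sse42.IsSupported}");
        Console.WriteLine($"AVX:    {Avx.IsSupported}");
        Console.WriteLine($"AVX2:   {Avx2.IsSupported}");
        Console.WriteLine($"FMA:    {Fma.IsSupported}");
    }
}
```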
@martindevans It seems the problem still exists on the CPU side, since GPU inference cost is the same. However, I have no idea now. :( @AlShadi Could you please run the 7B model with 10 layers offloaded to the GPU to see if there's a performance gap between 0.5.1 and 0.7.0? Besides, there's another experiment that would help show whether this is a llama.cpp issue or a problem in the LLamaSharp implementation: compile the llama.cpp versions corresponding to v0.5.1 and v0.7.0 with the same settings, then run the 70B model and print the time cost. I know this may be time-consuming, so I won't require it, but I'd appreciate it if you'd like to help with that.
@AsakusaRinne Sure. I'll have time this weekend to work on it. |
@martindevans, unfortunately this only partially fixes the performance problem.
I've just opened up a PR that should fix this: #245. It creates a new variable which contains all the compiler defines common to all platforms and then uses it everywhere. This makes it less likely that one particular platform will be missed by changes in the future. This new variable includes …
@lexxsoft there's a test build linked in that PR; when it's finished running, would you mind downloading and testing the binaries from there to confirm this fixes the issue for you?
@martindevans, no, I don't mind at all, though I'll wait until the tests are no longer failing.
@AsakusaRinne
CPU backend: llama-2-70b-chat.Q5_K_M.gguf
Cuda12 backend, 10 layers: llama-2-70b-chat.Q5_K_M.gguf
Cuda12 backend, 35 layers: llama-2-70b-chat.Q5_K_M.gguf
The new binaries ended up being in #249 instead; that's been merged into master now.
Solved by the latest release v0.8.0. Please feel free to reopen it if there's any other problem.
After upgrading to v0.6.0 I have noticed a greatly increased processing time.
I have compared current and previous versions, and also the llama.cpp binaries themselves (from around the time each LLamaSharp version was released):
The degraded speed affects both CPU and Cuda12 backends. All tests were performed with the https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF model, specifically the Q5_K_M version. For benchmark reproduction, use:
a) llama.cpp:
b) LLamaSharp v0.5.1 with LINQPad
c) LLamaSharp v0.6.0 with LINQPad