v0.6.0 significantly reduced performance #225

Closed
lexxsoft opened this issue Oct 28, 2023 · 51 comments · Fixed by #227
Labels
bug Something isn't working

Comments

@lexxsoft

After upgrading to v0.6.0 I noticed greatly increased processing time.

I have compared the current and previous versions, and also the llama.cpp binaries themselves (released around the time each LLamaSharp version was published):

| version | time | comment |
| --- | --- | --- |
| llama-b1184-bin-win-avx2-x64 | 15.451 | released on the same date as LLamaSharp v0.5.1 |
| llama-b1426-bin-win-avx2-x64 | 15.655 | released on the same date as LLamaSharp v0.6.0 |
| llama-b1440-bin-win-avx2-x64 | 15.744 | latest as of raising this ticket |
| LLamaSharp v0.5.1 | 14.561 | ok |
| LLamaSharp v0.6.0 | 1:07.869 | extremely slow |

The degraded speed affects both CPU and Cuda12 backends.

All tests were performed with the https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF model, specifically the Q5_K_M version.

For benchmark reproduction, use

a) llama.cpp:

time ./main --color --n_predict 100 --ctx_size 4096 --seed "0" --model ../../../models/Llama-2-7b-Chat-GGUF/llama-2-7b-chat.Q5_K_M.gguf -p "### Instruction:\nWhat is a llama?\n### Response:\n"

b) LLamaSharp v0.5.1 with LINQPad

<Query Kind="Program">
  <NuGetReference Version="0.5.1">LLamaSharp</NuGetReference>
  <NuGetReference Version="0.5.1">LLamaSharp.Backend.Cpu</NuGetReference>
  <Namespace>LLama</Namespace>
  <Namespace>LLama.Common</Namespace>
  <Namespace>System.Drawing</Namespace>
  <Namespace>System.Threading.Tasks</Namespace>
</Query>

async Task Main()
{
	var path = @"C:\LLM\models\Llama-2-7b-Chat-GGUF\llama-2-7b-chat.Q5_K_M.gguf";
	
	var parameters = new ModelParams(modelPath: path)
	{
		ContextSize = 1024 * 4,
		Seed = 0,
		GpuLayerCount = 0,
	};
	
	using var model = LLamaWeights.LoadFromFile(parameters);
	using var context = model.CreateContext(parameters);
	var executor = new InteractiveExecutor(context);
	var session = new ChatSession(executor);

	var prompt = """
### Instruction:
What is a llama?
### Response:
""";

	var watch = Stopwatch.StartNew();
	var outputs = session.ChatAsync(prompt, new InferenceParams()
	{
		AntiPrompts = new List<string>() {
				"### Instruction:",
				"### Human:",
				"### User:",
				"<|im_start|>user",
			},
		MaxTokens = 100,
	});

	await foreach (var output in outputs)
	{
		Console.Write(output);
	}

	watch.Elapsed.Dump("duration");
}

c) LLamaSharp v0.6.0 with LINQPad

<Query Kind="Program">
  <NuGetReference Version="0.6.0">LLamaSharp</NuGetReference>
  <NuGetReference Version="0.6.0">LLamaSharp.Backend.Cpu</NuGetReference>
  <Namespace>LLama</Namespace>
  <Namespace>LLama.Common</Namespace>
  <Namespace>System.Drawing</Namespace>
  <Namespace>System.Threading.Tasks</Namespace>
</Query>

async Task Main()
{
	var path = @"C:\LLM\models\Llama-2-7b-Chat-GGUF\llama-2-7b-chat.Q5_K_M.gguf";
	
	var parameters = new ModelParams(modelPath: path)
	{
		ContextSize = 1024 * 4,
		Seed = 0,
		GpuLayerCount = 0,
	};
	
	using var model = LLamaWeights.LoadFromFile(parameters);
	using var context = model.CreateContext(parameters);
	var executor = new InteractiveExecutor(context);
	var session = new ChatSession(executor);

	var prompt = """
### Instruction:
What is a llama?
### Response:
""";

	var watch = Stopwatch.StartNew();
	var outputs = session.ChatAsync(prompt, new InferenceParams()
	{
		AntiPrompts = new List<string>() {
				"### Instruction:",
				"### Human:",
				"### User:",
				"<|im_start|>user",
			},
		MaxTokens = 100,
	});

	await foreach (var output in outputs)
	{
		Console.Write(output);
	}

	watch.Elapsed.Dump("duration");
}
@martindevans (Member)

That is a huge slowdown! Can you profile the application at all to see where the time is going (e.g. using the Visual Studio diagnostic tools)? The vast majority of the time should be spent inside llama.cpp (it'll show up as SafeLLamaContextHandle.Eval in a trace).

If there is a slowdown inside llama.cpp then the only reason I can think of is that we're currently using a bit of an obsolete inference system. Not long ago llama_eval was replaced with llama_decode, but we haven't switched over the internals of LLamaSharp yet (I'm still experimenting with it). All that said: I wouldn't actually expect llama_eval to be that much slower, so I may be completely wrong!

@martindevans (Member)

I wouldn't actually expect llama_eval to be that much slower, so I may be completely wrong!

One way to check might be to pull this PR and try out the Batch Decoding example. That uses the new mechanism; you can run it against this example in llama.cpp to compare speed.

@lexxsoft (Author)

@martindevans profiling is a little outside my area of expertise; however, I can add that I can reproduce the results on two vastly different setups.

@martindevans (Member)

If you can't do the profiling would you mind trying the batched decoding example? If that's the same speed in llama.cpp and LLamaSharp I think it will confirm what the problem is.

martindevans mentioned this issue Oct 28, 2023
@martindevans (Member)

I just tried rewriting the Stateless executor to use llama_decode (it was very rough, so I'm not committing this work yet). There was absolutely no speed difference, so it's probably not due to llama_eval/llama_decode.

@lexxsoft (Author)

@martindevans I have checked out https://github.com/martindevans/LLamaSharp/tree/batch_decoding to see if there are any performance differences in your branch in the non-batched chat version - it is still on par with v0.6.0.

I will try to find out if I can manage to profile both versions.

@lexxsoft (Author)

@martindevans my attempt at profiling and comparing the two versions:

v0.5.1:
[profiling screenshot]

v0.6.0:
[profiling screenshot]

The only difference I can discern is that 0.6.0 has a lengthy LLama.LLamaContext.Eval, but that's not where most of the time is spent - I am unable to drill down inside "Stack traces without user methods".

@lexxsoft (Author)

Update: another profiling run, with extended tracing:

v0.5.1:
[profiling screenshot]

v0.6.0:
[profiling screenshot]

So it seems that the main difference is between llama_eval_with_pointer versus llama_eval.

@martindevans (Member)

Unfortunately that's a false lead, llama_eval and llama_eval_with_pointer are just different names for the same thing!

@martindevans (Member)

I had a guess that maybe we were hitting some pathological edge case in async evaluation. Unfortunately that doesn't seem to be true. Just tried out 4 tests with the stateless executor (each time running the Stateless test and timing it):

  1. 17543ms (Eval inside Task.Run)
  2. 17245ms (Eval not in Task.Run)
  3. 17065ms (prompt Eval in Task.StartNew+LongRunning)
  4. 17014ms (removed async entirely)

It's interesting that #3 is a little bit faster, and we might want to look into switching in the future (see the sketch below), but it doesn't explain the massive performance regression.
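For reference, a minimal sketch of variant #3 above: pushing the blocking native evaluation onto a dedicated long-running task instead of a thread-pool thread. EvalNative here is a hypothetical stand-in for whatever ultimately calls llama_eval; this is not the actual executor code.

using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the blocking native evaluation call.
static int EvalNative(int[] tokens, int nPast)
{
    // ... calls into llama.cpp and blocks until evaluation is complete ...
    return nPast + tokens.Length;
}

// Variant #3: run the blocking eval on a dedicated (non-thread-pool) thread.
static Task<int> EvalOnDedicatedThreadAsync(int[] tokens, int nPast)
{
    return Task.Factory.StartNew(
        () => EvalNative(tokens, nPast),
        CancellationToken.None,
        TaskCreationOptions.LongRunning,   // hints to the scheduler to use a dedicated thread
        TaskScheduler.Default);
}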

@AsakusaRinne (Collaborator)

AsakusaRinne commented Oct 29, 2023

[profiling screenshot]

In 0.6.0 there's a huge gap between MoveNext and MoveNextAsync, while in 0.5.1 this difference is near zero. I haven't come up with an idea of what actually happened, but maybe it provides a thread to pull on to find the reason.

@martindevans (Member)

Yeah that was what made me wonder about the async system in my previous comment. Completely removing async from the stateless executor produced no real change though, so I'm not sure how else to investigate that.

@martindevans (Member)

Actually I guess I should check, @lexxsoft do you see the same regression in the stateless executor? I've been using that for testing assuming you do, but all my tests might be useless if you don't!

@lexxsoft (Author)

Actually I guess I should check, @lexxsoft do you see the same regression in the stateless executor? I've been using that for testing assuming you do, but all my tests might be useless if you don't!

@martindevans, it's basically the same result: stateless execution (with InferAsync) times are 14.869 vs 1:18.076.

@AsakusaRinne (Collaborator)

I've had a guess: maybe the required memory increased in 0.6.0, in which case inference would slow down greatly whenever memory pages are swapped in and out. @lexxsoft Could you please have a look at the memory usage while running model inference?
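For anyone who wants to check this, a minimal sketch of logging process memory around the inference call; it uses only standard .NET counters and nothing LLamaSharp-specific.

using System;
using System.Diagnostics;

static void DumpMemory(string label)
{
    var process = Process.GetCurrentProcess();
    process.Refresh();
    Console.WriteLine(
        $"{label}: working set {process.WorkingSet64 / (1024 * 1024)} MB, " +
        $"private {process.PrivateMemorySize64 / (1024 * 1024)} MB, " +
        $"managed heap {GC.GetTotalMemory(false) / (1024 * 1024)} MB");
}

// e.g. call DumpMemory("before inference") and DumpMemory("after inference")
// around the ChatAsync / InferAsync loop from the repro at the top of this issue.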

@AsakusaRinne (Collaborator)

If my guess unfortunately turns out to be wrong, I think we should profile the single eval function rather than including wrappers such as the executor and session in the test, to see whether the problem is in the native call or in the wrappers.

@lexxsoft (Author)

@AsakusaRinne I'm afraid delving deeper into the profiling is a bit over my head.
However, both rigs I have tested on pack a generous amount of RAM (64GB and 100GB).

I wonder, am I alone with this problem, or is anyone else getting it too?

@AsakusaRinne (Collaborator)

@lexxsoft You're the first one to report this performance problem. However, until we find the reason behind it, it may well be a common problem, which I think makes it high priority. In LLamaSharp the native APIs are public, so the only work is writing the code to test. Since it's late at night here, I'll write such code tomorrow, or maybe @martindevans could help now. :)

@martindevans (Member)

Ok, I've created an extremely simple test case we can use to narrow this down some more; here it is. It may require some small adjustments for older versions of LLamaSharp.

Here's my results so far:

#223

Batch (100): 34490ms
Single Tokens (x5): 1869ms

master (321d0b5)

Batch (100): 32817ms
Single Tokens (x5): 1876ms

Tag 0.5.1

Batch (100): 3006ms
Single Tokens (x5): 433ms

Definitely reproducing the results!

@martindevans (Member)

(Starting a git bisect now)

@martindevans (Member)

martindevans commented Oct 29, 2023

Ok I think I've identified the commit, but I don't think it's good news.

  • 54b3eb0 is fast.
  • The very first commit in that PR (bca55ea) is slow.
    • Probably the largest change in this commit is a new DLL. Since it was the start of development I would have just grabbed that DLL directly from the llama.cpp release, rather than using our own GitHub Actions build process.

Note that if you want to test this for yourself you might need to grab your own libllama.dll from the appropriate release on the main repo. The ones I use during development are usually AVX512.

@martindevans (Member)

Anyone else have any ideas to debug this further? I spent a while double checking the changes in that commit (making sure it's not doing something crazy like setting batchsize or nthreads to one) but I can't spot anything. I'm at a bit of a dead end right now.

Everything points to the llama_eval call simply being 4x slower, but that doesn't match up with the report in the original comment which checked 3 versions of llama.cpp for speed.

@AsakusaRinne (Collaborator)

AsakusaRinne commented Oct 30, 2023

@martindevans Thanks for your effort in narrowing the problem down to one commit. I've reviewed that PR and the corresponding llama.cpp code. Though I'm unfortunately not sure what caused the problem, I'd like to share some of my ideas.


First of all, we can always trust the implementation of llama.cpp. Based on this assumption, the first thing that comes to mind is wrong parameters being passed to the native APIs, which also tripped me up early in this project. That is, when you see a = -1 on the C# side, the value passed to the native library might be a completely different value.

Here is some code which I would suspect on my side (see the sketch after this list):

  1. The properties used in structs for native APIs, such as LLamaContextParams. To be honest, I don't know much about the internals of P/Invoke, so I never add properties to these structs. If you're sure it won't introduce problems, please just skip this point.
  2. bool members of native structs are not decorated with [MarshalAs(UnmanagedType.I1)]. I'm a bit ashamed that I don't know the exact behaviour behind it; experience has just taught me to do so. (I noticed that you convert the members to sbyte, not sure if that's an equivalent thing.)
  3. Using record struct without [StructLayout(LayoutKind.Sequential)], such as LLamaPos. Still, as a personal preference I'm perhaps overly careful about native structs, so I prefer to never use record struct. Please don't worry about it if you're sure everything works with it.
  4. Using readonly in native structs. Theoretically, I believe this keyword only matters in C# and will not prevent the values being changed in C++. If nothing is found after all the other checks, I'd like to try removing this keyword.
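To illustrate points 2 and 3 above, a minimal sketch of a defensively declared native struct. The type and field names (ExamplePos, ExampleNativeParams, n_ctx, use_mmap, use_mlock) are illustrative only, not the actual LLamaSharp definitions.

using System.Runtime.InteropServices;

// Point 3: give even a trivial record struct wrapper an explicit sequential layout.
[StructLayout(LayoutKind.Sequential)]
public record struct ExamplePos(int Value);

// Points 1, 2 and 4: plain public fields, bools marshalled as 1-byte integers, no readonly.
[StructLayout(LayoutKind.Sequential)]
public struct ExampleNativeParams
{
    public uint n_ctx;

    [MarshalAs(UnmanagedType.I1)]
    public bool use_mmap;

    [MarshalAs(UnmanagedType.I1)]
    public bool use_mlock;
}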

Here's a method to check for this kind of problem, drawn from my previous debugging experience:

  1. Narrow the range of code as much as possible, using the fewest native APIs.
  2. Compile the DLL with some logging added, printing the values inside the C API implementations in the file llama.cpp. Also print the structs passed to the native APIs on the C# side before the call, then compare the outputs to check for differences. When checking native APIs with a return value, such as llama_model_default_params, the struct should be printed on the C# side after the call.

I know that the debugging work is time-consuming, so if you hit obstacles please tell me what I can do to help. If you're too busy these days, I'd like to do the checks above this weekend (I'm a little busy on weekdays). :)

@hswlab (Contributor)

hswlab commented Oct 30, 2023

I just tried the LLama.Web project (v0.6.0) with llama-2-7b-chat.Q4_0.gguf and LLamaSharp.Backend.Cpu and noticed that the text stream takes about 3-5 seconds for each word that is displayed. I did not change anything in the default configuration. Is this the problem the thread creator describes, or is my processor just too slow? I am using an i7 2.20GHz with 4 cores and 8 logical processors. It amazes me that every single word loads so slowly :)

@martindevans (Member)

Try comparing the same project in 0.5.1, that'll tell you if it's just a hardware limitation or this issue.

@martindevans (Member)

@AsakusaRinne Thanks for all the suggestions. I've worked my way through them, double-checking everything in the docs and in testing:

The properties used in structs for native apis, such as LLamaContextParams. To be honest I don't know much about the internal process of P/Invoke so I never add properties to these structs.

If the property has an invisible backing field (i.e. Foo { get; set; }) then I think that'll get handled like any other field. Not 100% sure and I'd avoid doing this on a p-invoked struct. The properties in LLamaContextParams are intentionally set up to have an explicit backing field, so there's nothing special about them.

bool members of native structs are not decorated by [MarshalAs(UnmanagedType.I1)] ... (noticed that you convert the members to sbyte, not sure if it's equivalent thing)

You guessed right, it's equivalent. The MarshalAs attribute basically just tells the marshaller to convert bools into 1-byte integers (i.e. bytes). I can't remember the exact reason, but I think the reason they're done with properties and an explicit conversion like this is that the attribute didn't work properly on .NET Framework 4.5.
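A minimal sketch of the sbyte-plus-property pattern being described here, with hypothetical member names rather than the real LLamaContextParams fields; only the private sbyte field takes part in marshalling, the property is purely a C#-side convenience.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct ExampleContextParams
{
    // The native side sees a single 1-byte integer...
    private sbyte _embedding;

    // ...while C# callers get a bool, equivalent to [MarshalAs(UnmanagedType.I1)] on a bool field.
    public bool Embedding
    {
        get => _embedding != 0;
        set => _embedding = value ? (sbyte)1 : (sbyte)0;
    }
}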

Using record struct without [StructLayout(LayoutKind.Sequential)], such as LLamaPos

It shouldn't matter in that specific case since LLamaPos is just a wrapper around a single integer, but I think it's a bad habit not to put it on everything being passed into native code, so I've added it on. 👍

I'm a bit too careful about native structs so I prefer to never use record struct.

I'm fairly sure it won't cause any issues; making it a record struct instead of a struct just adds some convenience things like proper equality etc. Just to double check, I converted all of the record structs in the project to plain structs and it made no difference to the speed test.

Using readonly in native structs.

As far as I'm aware it's just a compiler thing and doesn't actually have any runtime effect (e.g. reflection can modify readonly fields). I can't actually find a definitive answer in the docs though.

Just to be sure I went through and removed all readonly in the native structs. No difference in speed 😢

@martindevans (Member)

While testing the above I rewrote the test script to use as little of LLamaSharp as possible. Here it is. It directly calls into llama.cpp through the native API with no other C# code in the way. Still slow!

@martindevans (Member)

martindevans commented Oct 30, 2023

Well... good news I guess.

Given the above test I couldn't see how it could be anything but a problem in the binary. I just grabbed a new binary from llama.cpp (bin-win-avx2 from here) and it's fast.

I'm going to start a new build action, update all of the binaries from scratch and hope that fixes the issue.

Edit: Did this, it's slow again! There's something wrong with our build process.

martindevans reopened this Oct 30, 2023
@martindevans (Member)

Ok, finally this should be resolved. Phew.

The problem with the build process was that, by default, march=native is on! So despite the explicit flags saying that AVX2 should be turned on, it was compiled without it (presumably the GitHub runners don't support AVX2, which is surprising).

I've fixed that, rebuilt all the binaries and added them to #223. If you can confirm that fixes the issue for you please leave a comment on that PR. I'll merge it (which should immediately auto-release) as soon as some people have confirmed.

@AsakusaRinne (Collaborator)

That's a good catch! I'll add a benchmark test to the CI this weekend to reduce the risk of such performance problems happening again.

@lexxsoft (Author)

Yes, it seems that fixed the problem; it is now as performant as it was before.

@AlShadi

AlShadi commented Oct 31, 2023

0.7.0 still seems slower...

@AsakusaRinne (Collaborator)

0.7.0 still seems slower...

Could you please provide the performance comparison between 0.5.1 and 0.7.0?

@atonalfreerider

Not to derail the thread, but there has been another driver-dependent performance issue affecting applications that use lots of VRAM. If anyone is running Windows, they should check that they have the latest drivers, and also switch off the System Memory Fallback option for their program, because it causes a CPU bottleneck:

ggerganov/llama.cpp#3806 (comment)

@hswlab (Contributor)

hswlab commented Oct 31, 2023

Not sure if I'm doing something wrong, but with v0.7.0 I'm getting this error in LLama.Web.Common.ModelOptions when I start the debugger.
[exception screenshot]

I'm testing LLama.Web from the latest master branch with LLamaSharp 0.7.0 and LLamaSharp.Backend.Cpu 0.7.0 on Windows 10. I disconnected the LLamaSharp project from the Web project and replaced it with the libraries from the NuGet package manager. It seems to be a GPU setting that causes the exception, but I'm trying to use the CPU.

@martindevans (Member)

@hswlab could you open a separate issue with details on your setup so we can look into that? The TensorSplitsCollection is a change I introduced (in the main project) but I'm not very familiar with the Web stuff so it's possible I did something wrong there.

@martindevans (Member)

martindevans commented Nov 1, 2023

I'll close this one now since 0.7.0 has resolved the performance issue.

@AlShadi

AlShadi commented Nov 3, 2023

The only thing I'm changing is upgrading to 0.7.0. Using the Cuda12 backend on a 4090. GpuLayerCount is 30, UseMemoryLock is true, MaxTokens is 128.

Seed: 42
Write a short 100 word story about fish.
In the deep, dark abyss of the ocean, lived a unique and mysterious creature - the Fishy Fable. It had scales that shimmered with every color imaginable underwater sunlight. Its tail was long and elegant, swaying gracefully as it glided through currents. But what made this fish truly extraordinary were its eyes; they held secrets from centuries past, stories untold by any other marine creature. Every time someone gazed into those profound orbs, they felt an inexplicable connection to the vast history of the sea. The Fishy Fable was more
Duration: 00:02:30.4995150

Seed: 42
Write a short 100 word story about fish.
In the deep, dark abyss of the ocean, where sunlight could not penetrate and silence was absolute, lived a creature that defied all logic and reason. It was a fish unlike any other - its scales shimmered with an ethereal glow, casting beams of light in every direction. This luminescent being illuminated the depths with its radiance, creating a world within the darkness for those creatures who dared to venture into this uncharted territory. Its existence was shrouded in mystery and intrigue; it had no name but simply referred to as "The
Duration: 00:07:20.0257408
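For reference, a sketch of the configuration described above plugged into the repro from the opening post. The model path and ContextSize are placeholders carried over from earlier in this thread, and it assumes UseMemoryLock is the ModelParams property referred to above and that ChatSession.ChatAsync has the same shape as in the earlier repro.

var parameters = new ModelParams(modelPath: @"C:\LLM\models\airoboros-l2-70b-3.1.2.Q5_K_M.gguf")
{
    ContextSize = 1024 * 4,
    Seed = 42,
    GpuLayerCount = 30,
    UseMemoryLock = true,
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var session = new ChatSession(new InteractiveExecutor(context));

var watch = Stopwatch.StartNew();
await foreach (var output in session.ChatAsync(
    "Write a short 100 word story about fish.",
    new InferenceParams { MaxTokens = 128 }))
{
    Console.Write(output);
}
Console.WriteLine($"\nDuration: {watch.Elapsed}");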

@AsakusaRinne (Collaborator)

@AlShadi How large is your model? Would you like to do the same test with a small model which can be fully offloaded to your GPU? I suspect the bottleneck in 0.7.0 is coming entirely from the CPU.

@martindevans (Member)

There shouldn't be any slowdown on the CPU side from 0.7.0 now that it's properly using AVX2 :/
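For anyone following along, a quick way to confirm the host CPU reports AVX2 support (the same check AlShadi runs below). Note that this only describes the CPU itself, not how the shipped libllama binary was compiled, which was the actual problem earlier in this issue.

using System;
using System.Runtime.Intrinsics.X86;

Console.WriteLine($"AVX  supported: {Avx.IsSupported}");
Console.WriteLine($"AVX2 supported: {Avx2.IsSupported}");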

@AlShadi

AlShadi commented Nov 3, 2023

@AsakusaRinne airoboros-l2-70b-3.1.2.Q5_K_M.gguf. CPU is an AMD Ryzen 9 5900X, System.Runtime.Intrinsics.X86.Avx2.IsSupported returns true.

Trying with llama-2-7b-chat.f16.gguf with 35 of 35 layers.
0.5.1 - Duration: 00:00:02.6431234
0.7.0 - Duration: 00:00:02.4879262

Another observation is that the 7b model had the same output with the same seed between versions. The 70b model had different output with the same seed between 0.5.1 and 0.7.0.

@AsakusaRinne (Collaborator)

Trying with llama-2-7b-chat.f16.gguf with 35 of 35 layers.
0.5.1 - Duration: 00:00:02.6431234
0.7.0 - Duration: 00:00:02.4879262

@martindevans It seems the problem still exists on the CPU side, since the GPU inference cost is the same. However, I have no idea why at the moment. :(

@AlShadi Could you please run the 7B model with 10 layers offloaded to the GPU to see if there's a performance gap between 0.5.1 and 0.7.0?

Besides, there's another experiment that would help show whether this is a llama.cpp issue or a problem in the LLamaSharp implementation: compile the llama.cpp versions corresponding to v0.5.1 and v0.7.0 with the same settings, then run the 70B model and print the time cost. I know this may be time-consuming, so I won't require it, but I'd appreciate it if you were willing to help with that.

@AlShadi

AlShadi commented Nov 3, 2023

@AsakusaRinne
0.5.1 - Duration: 00:00:39.6480427
0.7.0 - Duration: 00:00:53.3219226

Sure. I'll have time this weekend to work on it.

@lexxsoft (Author)

lexxsoft commented Nov 4, 2023

@martindevans, unfortunately this only partially fixes the performance problem:
It is now fine with LLamaSharp + LLamaSharp.Backend.Cpu; however, LLamaSharp + LLamaSharp.Backend.Cuda12 with GpuLayerCount = 0 has the same problem: 14.992 (v0.5.1) vs 1:08.703 (v0.7.0). So the AVX2 fix seems to affect only the CPU variant, and the other backends still need the same configuration?

@martindevans (Member)

I've just opened up a PR that should fix this: #245

It creates a new variable which contains all the compiler defines common to all platforms and then uses it everywhere. This makes it less likely that one particular platform will be missed when changes are made in the future.

This new variable includes -DLLAMA_NATIVE=OFF, which is the actual fix for this issue.

martindevans reopened this Nov 4, 2023
@martindevans (Member)

@lexxsoft there's a test build linked in that PR; when it's finished running, would you mind downloading and testing the binaries from there to confirm this fixes the issue for you?

@lexxsoft (Author)

lexxsoft commented Nov 4, 2023

@martindevans, no, I don't mind at all, though I'll wait until the tests are no longer failing.

@AlShadi

AlShadi commented Nov 4, 2023

@AsakusaRinne
CPU Backend - llama-2-7b-chat.f16.gguf
0.5.1 - Duration: 00:00:51.3044133
0.7.0 - Duration: 00:00:49.1195524

CPU Backend - llama-2-70b-chat.Q5_K_M.gguf
0.5.1 - Duration: 00:03:33.5422383
0.7.0 - Duration: 00:03:35.3132351

Cuda12 Backend - 10 layers - llama-2-70b-chat.Q5_K_M.gguf
0.5.1 - Duration: 00:03:18.6230830
0.7.0 - Duration: 00:10:15.7618287

Cuda12 Backend - 35 layers - llama-2-70b-chat.Q5_K_M.gguf
0.5.1 - Duration: 00:02:13.3029884
0.7.0 - Duration: 00:06:00.6600464

@AsakusaRinne (Collaborator)

@AlShadi Thank you so much for providing this test. It further indicates that the problem comes from the compile flags of the Cuda12 binary package. It will be solved in #245.

martindevans added the bug label Nov 6, 2023
@martindevans (Member)

The new binaries ended up in #249 instead; that's been merged into master now.

@AsakusaRinne (Collaborator)

Solved by the latest release v0.8.0. Please feel free to reopen it if there's any other problem.
