
Instructions for running it with local models is lacking. #943

Closed
iswarpatel123 opened this issue Dec 31, 2023 · 14 comments
Labels
documentation Improvements or additions to documentation

Comments

@iswarpatel123

Policy and info

  • Maintainers will close issues that have been stale for 14 days if they contain relevant answers.
  • Adding the label "sweep" will automatically turn the issue into a coded pull request. Works best for mechanical tasks. More info/syntax at: https://docs.sweep.dev/

Description

Instructions:

Running the Example
Once the API is set up, you can find the host and the exposed TCP port by checking your Runpod dashboard.

Then, you can use the port and host to run the following example using WizardCoder-Python-34B hosted on Runpod:

OPENAI_API_BASE=http://<host>:<port>/v1 python -m gpt_engineer.cli.main benchmark/pomodoro_timer --steps benchmark TheBloke_WizardCoder-Python-34B-V1.0-GPTQ

What is this example? What does it do? What's gpt_engineer.cli.main?

How do I run the main command "gpte projects/my-new-project" after I have a local LLM running on localhost:8000?
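Concretely, what I am trying to do is something like this (just a sketch of my intent, assuming the local server exposes an OpenAI-compatible API, which is what the docs seem to expect):

# a local model server exposing an OpenAI-compatible API on localhost:8000
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=dummy   # presumably still required, even if the local server ignores it

gpte projects/my-new-project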

Suggestion

Please provide more step-by-step instructions.

@iswarpatel123 iswarpatel123 added documentation Improvements or additions to documentation triage Interesting but stale issue. Will be close if inactive for 3 more days after label added. labels Dec 31, 2023
@ATheorell ATheorell removed the triage Interesting but stale issue. Will be close if inactive for 3 more days after label added. label Jan 2, 2024
@viborc viborc self-assigned this Jan 31, 2024
@viborc
Collaborator

viborc commented Jan 31, 2024

As a quick update for the community, we are actively working on this issue and experimenting with using several local models to see how well they can work with gpt-engineer. After that, based on our experiments, we will update the documentation with relevant info.

@definitiontv

I just got the Docker container working transparently using a dummy Cloudflare-hosted external address. On the server, a combination of

ollama serve
litellm

simulates the OpenAI API,

and in the .env file add OPENAI_API_BASE=https://ai.mydomain.com

I experimentally pointed ollama at Mistral and Code Llama; memory seems to produce code, but no files are written so far.

The Cloudflare tunnel is only there because my Docker setup does not recognize host.docker.internal and I didn't want to add a network to the docker compose. But you could modify the docker compose to point to the host's localhost:8000 if your setup supports that.
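Roughly, the server side looks like this (just a sketch of my setup; the exact litellm invocation and the domain are examples, so check the litellm docs for your version):

# ollama hosts the local model
ollama serve &
ollama pull mistral

# litellm puts an OpenAI-compatible proxy in front of it
litellm --model ollama/mistral

# and gpt-engineer's .env points at that endpoint (directly, or via the tunnel)
OPENAI_API_BASE=https://ai.mydomain.com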

@zigabrencic
Collaborator

Hey.

As discussed, here is a proposal for local LLM support. Please provide feedback before I dive in.

ollama and the rest can be added later following the same approach.

Requirements:

  • If possible, don't add extra dependencies to gpte.
  • Use the langchain package for interacting with the LLMs.
  • Add a minimal amount of code to gpte for local LLM support.

Support for llama.cpp-compatible models

The custom LLM would be supplied to gpte via an --open-llm flag:

gpte --open-llm "path/to/my_llm_langchain_interface.py"

With my_llm_langchain_interface.py along these lines, as per the Langchain docs:

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp
from langchain_core.language_models import BaseLanguageModel

# LlamaCpp is a completion-style model, so the broader BaseLanguageModel
# type is used here rather than BaseChatModel.
chat_model: BaseLanguageModel = LlamaCpp(
    model_path="path/to/model/model.bin",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)

The above also requires the user to install the Python package llama-cpp-python.
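For reference, the extra installation would be something like this (package names as given in the llama-cpp-python and langchain docs):

# llama.cpp bindings used by the LlamaCpp wrapper
pip install llama-cpp-python

# the LlamaCpp wrapper itself lives in the langchain-community package
pip install langchain-community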

TODOs

  • Develop under feature/open-llm-docs.
  • Add the field self.open_llm = None to class AI.__init__.
  • The custom chat_model object should then be used if supplied here. So, roughly:
if self.open_llm is not None:
    # TODO: check that self.open_llm is a valid file path
    # import_module expects a module name, so load the user-supplied
    # interface from its file path via importlib.util instead
    from importlib import util

    spec = util.spec_from_file_location("open_llm_interface", self.open_llm)
    module = util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.chat_model
  • Add the --open-llm flag to the gpte CLI entry point.
  • Add a gpt-engineer/docs/intro/open_llms.md file with the installation instructions. Add links to the existing documentation for langchain and llama-cpp-python.

@viborc viborc assigned zigabrencic and unassigned viborc Feb 2, 2024
@viborc
Collaborator

viborc commented Feb 2, 2024

@ATheorell @captivus I'm fine with @zigabrencic's proposal. Are you guys okay with it, too, or would you like to change something?

@ATheorell
Collaborator

I know too little about this to have a strong opinion. What does @AntonOsika say? I want to add that this is a priority issue for me, and it is clear that we need at least one example of setting this up in our own docs, which we maintain, so that we can refer users to a text we know is accurate whenever this question comes up, which happens frequently.

@captivus
Collaborator

captivus commented Feb 3, 2024

This looks good to me. Thanks for picking this up @zigabrencic!

@definitiontv

> @ATheorell @captivus I'm fine with @zigabrencic's proposal. Are you guys okay with it, too, or would you like to change something?

I am rather late to notice progress here, but I have a few caveats to add which may or may not be pertinent. GPTE is working brilliantly for me so far; I'm just diving into improving the workflow and hope to send some suggestions upstream. It was annoying for me that gpte didn't work "out of the box", but the solution is out there already.

1:/ I personally think the langchain code base is buggy and it keeps shifting; I wouldn't rely on it personally. I had many nightmares with certain "impossible" situations due to introduced breaking changes.
2:/ There are so many (and still developing) ways to do local inference efficiently. Most of the "older" methods assume you have a GPU, and you end up with all sorts of CUDA incompatibilities and nightmares. I would strongly suggest that this project does not attempt to navigate these at all, to be honest.
3:/ Projects like ollama and litellm have to handle all of these use cases as their core capability. I would strongly suggest that the gpte project keeps a very high-level focus instead of trying to replicate their functionality. I have dropped many projects because they had buggy local LLM support.
4:/ I now have gpte working perfectly just using its standard OpenAI-compatible calls pointing to a litellm proxy, which I in turn point to my local ollama instances (a sketch follows below). Note that this setup lets me use OpenAI, Azure, llama-cpp-python bindings, whatever, with a very simple change to litellm's single config file or the addition of a model to ollama. I can even map incompatible parameters or use custom prompts to change models on the fly, or do cost management/dynamic model selection.
5:/ It would be a very simple matter to add a clean version of ollama/litellm to the docker compose; I would really look at that config before making any internal changes whatsoever.

See more here https://docs.litellm.ai/docs/
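To make point 4 concrete, here is a minimal sketch of the kind of mapping I mean (the model names, file name and port are illustrative, not a tested config; see the litellm docs for the exact schema and defaults):

# a litellm proxy config that maps the model name gpte asks for to a local ollama model
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: gpt-4          # the name gpt-engineer requests
    litellm_params:
      model: ollama/mistral    # what litellm actually calls
EOF

litellm --config litellm_config.yaml

# gpte then keeps using its normal OpenAI settings, pointed at the proxy
# (or put this in gpt-engineer's .env; adjust the port to whatever litellm serves on)
export OPENAI_API_BASE=http://localhost:8000/v1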

brgds

James

@zigabrencic
Collaborator

Hey @definitiontv

Thanks for all the inputs.

1.) I experienced something similar when trying to add open LLMs using langchain so far, and I am worried on the same front, since they (langchain) try to be an everything-lib.

2.) By "older" methods, do you mean PyTorch and TensorFlow here, or also tools like llama.cpp?

3.) Good point.

4.) & 5.) This sounds like what we need.

Could you maybe provide us with your working code/setup?

So we don't reinvent the wheel? If you have a working docker-compose, please share it so we can build on top of your solution.

Extra from me:

6.) How do you find the inference speed with the stack of ollama and litellm you described? Fast/slow?

Cheers

Ziga

@definitiontv

> Hey @definitiontv
>
> Thanks for all the inputs.
>
> 1.) I experienced something similar when trying to add open LLMs using langchain so far, and I am worried on the same front, since they (langchain) try to be an everything-lib.
>
> 2.) By "older" methods, do you mean PyTorch and TensorFlow here, or also tools like llama.cpp?

Yes, but to be honest I am finding that many (most?) open-source LLM implementations are annoyingly opinionated in their assumptions about the user's setup. It's new; consensus hasn't settled in.

> 3.) Good point.
>
> 4.) & 5.) This sounds like what we need.
>
> Could you maybe provide us with your working code/setup?

I was just going to quickly copy my setup, then realised there were multiple gotchas that I fixed locally (happy to send them to you directly (how???), but I don't want to steer people wrong on the thread).
I decided it was easier to reference each project's Docker implementation, and then found multiple errors in their implementations too; too tedious to go through here.
So instead I will create a new, clean, working Dockerfile/docker compose setup, test it on a clean server and paste it here for review. Watch this space...

> So we don't reinvent the wheel? If you have a working docker-compose, please share it so we can build on top of your solution.
>
> Extra from me:
>
> 6.) How do you find the inference speed with the stack of ollama and litellm you described? Fast/slow?

I run on a free-tier ARM CPU remote processor, using only models needing 5-6 GB (mostly Mistral 7B), and I get nice fast inference.
Streaming returns at 60 to 100 wpm let a new small project be created in just a few (5) minutes. If I don't get an infinite loop by mistake!!!

> Cheers
>
> Ziga

@captivus
Collaborator

The question raised by @zigabrencic as point 6 is a particularly important one. We will want to compare the inference performance of proposed solutions prior to implementing, given issues observed in other testing Ziga and I have been working on.

@definitiontv

Sorry, I have been rushing. This was a bit more complex than I thought, so I had to wrap it in a pull request in order to share.
I am not sure the automatic ollama model download is working, so you may have to exec into the container and run /start.sh
or
ollama pull 'model'
ollama serve

AHAHH, as I type this I see my error!!! The commands are the wrong way round.
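So, for the record, the right order should be:

ollama serve
ollama pull 'model'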

See #1015; some notes were added to the bottom of the existing docker/README.

I will expand the notes etc. once a few options are pinned down.

@zigabrencic
Collaborator

Hey.

Thanks for submitting this. I checked the PR. I have only one question: how's the performance and underlying hardware access in Docker?

For further chats, I propose that you reach out on http://discord.com/users/820749115197227138 so we can speed this up a little ;)

@definitiontv

> Hey.
>
> Thanks for submitting this. I checked the PR. I have only one question: how's the performance and underlying hardware access in Docker?
>
> For further chats, I propose that you reach out on http://discord.com/users/820749115197227138 so we can speed this up a little ;)

Well, if you are relying on a GPU, then you just need to make sure your Docker instance has access to it, which it should unless you are running a very weird setup. CPU and memory should of course be near native; that's the point of Docker.
I actually store my models on extremely slow external disks, since they need to get loaded into memory anyway. Ollama keeps the last model used in memory, so after the first loading overhead it should make no difference, Docker or no Docker. But the compose file should give a one-line test of speed/hardware compatibility etc. Not sure if I documented it; just set the model to use in .env and exec into the gpte container to run gpte. BTW, that Discord link didn't seem to do anything.

@zigabrencic
Collaborator

zigabrencic commented Feb 16, 2024

Hey

1.) Thanks for the GPU/CPU point. I must admit I'm not that familiar with Docker internals.

2.) Discord: strange, it works for me and others. How about this one: https://discord.com/channels/1119885301872070706/1120698764445880350? That's the link to the community; you can find me there under ziga.ai.
