
Instructions for running it with local models is lacking. #943

Closed
iswarpatel123 opened this issue Dec 31, 2023 · 14 comments
Labels
documentation Improvements or additions to documentation

Comments

@iswarpatel123

Policy and info

  • Maintainers will close issues that have been stale for 14 days if they contain relevant answers.
  • Adding the label "sweep" will automatically turn the issue into a coded pull request. Works best for mechanical tasks. More info/syntax at: https://docs.sweep.dev/

Description

Instructions:

Running the Example
Once the API is set up, you can find the host and the exposed TCP port by checking your Runpod dashboard.

Then, you can use the port and host to run the following example using WizardCoder-Python-34B hosted on Runpod:

OPENAI_API_BASE=http://<host>:<port>/v1 python -m gpt_engineer.cli.main benchmark/pomodoro_timer --steps benchmark TheBloke_WizardCoder-Python-34B-V1.0-GPTQ

What is this example? What does it do? What's gpt_engineer.cli.main?

How do I run the main command "gpte projects/my-new-project" after I have a local LLM running on localhost:8000?
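Concretely, what I am trying to do is something like this (just a sketch of my intent, assuming the local server exposes an OpenAI-compatible API, which is what the docs seem to expect):

# a local model server exposing an OpenAI-compatible API on localhost:8000
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=dummy   # presumably still required, even if the local server ignores it

gpte projects/my-new-project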

Suggestion

Please provide more step-by-step instructions.

@iswarpatel123 iswarpatel123 added documentation Improvements or additions to documentation triage Interesting but stale issue. Will be close if inactive for 3 more days after label added. labels Dec 31, 2023
@ATheorell ATheorell removed the triage Interesting but stale issue. Will be close if inactive for 3 more days after label added. label Jan 2, 2024
@viborc viborc self-assigned this Jan 31, 2024
@viborc
Collaborator

viborc commented Jan 31, 2024

As a quick update for the community, we are actively working on this issue and experimenting with using several local models to see how well they can work with gpt-engineer. After that, based on our experiments, we will update the documentation with relevant info.

@definitiontv

I just got the Docker container working transparently using a dummy Cloudflare-hosted external address. On the server, a combination of

ollama serve
litellm

simulates the OpenAI API,

and in the .env file add OPENAI_API_BASE=https://ai.mydomain.com

I experimentally pointed ollama at Mistral and Code Llama; memory seems to produce code, but no files are written so far.

The Cloudflare tunnel is only there because my Docker setup does not recognize host.docker.internal and I didn't want to add a network to the docker compose. But you could modify the docker compose to point to the host's localhost:8000 if your setup supports that.
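Roughly, the server side looks like this (just a sketch of my setup; the exact litellm invocation and the domain are examples, so check the litellm docs for your version):

# ollama hosts the local model
ollama serve &
ollama pull mistral

# litellm puts an OpenAI-compatible proxy in front of it
litellm --model ollama/mistral

# and gpt-engineer's .env points at that endpoint (directly, or via the tunnel)
OPENAI_API_BASE=https://ai.mydomain.com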

@zigabrencic
Collaborator

Hey.

As discussed, here is a proposal for local LLM support. Please provide feedback before I dive in.

ollama and the rest can be added later following the same approach.

Requirements:

  • If possible, don't add extra dependencies to gpte.
  • Use the langchain package for interacting with the LLMs.
  • Add a minimal amount of code to gpte for local LLM support.

Support for llama.cpp-compatible models

The custom LLM would be supplied to gpte via an --open-llm flag:

gpte --open-llm "path/to/my_llm_langchain_interface.py"

With my_llm_langchain_interface.py along these lines, as per the Langchain docs:

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import LlamaCpp
from langchain_core.language_models import BaseLanguageModel

# LlamaCpp is a completion-style model, so the broader BaseLanguageModel
# type is used here rather than BaseChatModel.
chat_model: BaseLanguageModel = LlamaCpp(
    model_path="path/to/model/model.bin",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)

The above also requires the user to install the Python package llama-cpp-python.
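For reference, the extra installation would be something like this (package names as given in the llama-cpp-python and langchain docs):

# llama.cpp bindings used by the LlamaCpp wrapper
pip install llama-cpp-python

# the LlamaCpp wrapper itself lives in the langchain-community package
pip install langchain-community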

TODOs

  • Develop under feature/open-llm-docs.
  • Add the field self.open_llm = None to class AI.__init__.
  • The custom chat_model object should then be used if supplied here. So, roughly:
if self.open_llm is not None:
    # TODO: check that self.open_llm is a valid file path
    # import_module expects a module name, so load the user-supplied
    # interface from its file path via importlib.util instead
    from importlib import util

    spec = util.spec_from_file_location("open_llm_interface", self.open_llm)
    module = util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.chat_model
  • Add the --open-llm flag to the gpte CLI entry point.
  • Add a gpt-engineer/docs/intro/open_llms.md file with the installation instructions. Add links to the existing documentation for langchain and llama-cpp-python.

@viborc viborc assigned zigabrencic and unassigned viborc Feb 2, 2024
@viborc
Collaborator

viborc commented Feb 2, 2024

@ATheorell @captivus I'm fine with @zigabrencic's proposal. Are you guys okay with it, too, or would you like to change something?

@ATheorell
Collaborator

I know too little about this to have a strong opinion. What does @AntonOsika say? I want to add that this is a priority issue for me, and it is clear that we need at least one example of setting this up in our own docs, which we maintain, so that we can refer users to a text we know is accurate whenever this question comes up, which happens frequently.

@captivus
Collaborator

captivus commented Feb 3, 2024

This looks good to me. Thanks for picking this up @zigabrencic!

@definitiontv

> @ATheorell @captivus I'm fine with @zigabrencic's proposal. Are you guys okay with it, too, or would you like to change something?

I am rather late to notice progress here, but I have a few caveats to add which may or may not be pertinent. GPTE is working brilliantly for me so far; I'm just diving into improving the workflow and hope to send some suggestions upstream. It was annoying for me that gpte didn't work "out of the box", but the solution is out there already.

1:/ I personally think the langchain code base is buggy and it keeps shifting; I wouldn't rely on it personally. I had many nightmares with certain "impossible" situations due to introduced breaking changes.
2:/ There are so many (and still developing) ways to do local inference efficiently. Most of the "older" methods assume you have a GPU, and you end up with all sorts of CUDA incompatibilities and nightmares. I would strongly suggest that this project does not attempt to navigate these at all, to be honest.
3:/ Projects like ollama and litellm have to handle all of these use cases as their core capability. I would strongly suggest that the gpte project keeps a very high-level focus instead of trying to replicate their functionality. I have dropped many projects because they had buggy local LLM support.
4:/ I now have gpte working perfectly just using its standard OpenAI-compatible calls pointing to a litellm proxy, which I in turn point to my local ollama instances (a sketch follows below). Note that this setup lets me use OpenAI, Azure, llama-cpp-python bindings, whatever, with a very simple change to litellm's single config file or the addition of a model to ollama. I can even map incompatible parameters or use custom prompts to change models on the fly, or do cost management/dynamic model selection.
5:/ It would be a very simple matter to add a clean version of ollama/litellm to the docker compose; I would really look at that config before making any internal changes whatsoever.

See more here https://docs.litellm.ai/docs/
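To make point 4 concrete, here is a minimal sketch of the kind of mapping I mean (the model names, file name and port are illustrative, not a tested config; see the litellm docs for the exact schema and defaults):

# a litellm proxy config that maps the model name gpte asks for to a local ollama model
cat > litellm_config.yaml <<'EOF'
model_list:
  - model_name: gpt-4          # the name gpt-engineer requests
    litellm_params:
      model: ollama/mistral    # what litellm actually calls
EOF

litellm --config litellm_config.yaml

# gpte then keeps using its normal OpenAI settings, pointed at the proxy
# (or put this in gpt-engineer's .env; adjust the port to whatever litellm serves on)
export OPENAI_API_BASE=http://localhost:8000/v1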

brgds

James

@zigabrencic
Collaborator

Hey @definitiontv

Thanks for all the inputs.

1.) I experienced something similar when trying to add open LLMs using langchain so far, and I am worried on the same front, since they (langchain) try to be an everything-lib.

2.) By "older" methods, do you mean PyTorch and TensorFlow here, or also tools like llama.cpp?

3.) Good point.

4.) & 5.) This sounds like what we need.

Could you maybe provide us with your working code/setup?

So we don't reinvent the wheel? If you have a working docker-compose, please share it so we can build on top of your solution.

Extra from me:

6.) How do you find the inference speed with the stack of ollama and litellm you described? Fast/slow?

Cheers

Ziga

@definitiontv

> Hey @definitiontv
>
> Thanks for all the inputs.
>
> 1.) I experienced something similar when trying to add open LLMs using langchain so far, and I am worried on the same front, since they (langchain) try to be an everything-lib.
>
> 2.) By "older" methods, do you mean PyTorch and TensorFlow here, or also tools like llama.cpp?

Yes, but to be honest I am finding that many (most?) open-source LLM implementations are annoyingly opinionated in their assumptions about the user's setup. It's new; consensus hasn't settled in.

> 3.) Good point.
>
> 4.) & 5.) This sounds like what we need.
>
> Could you maybe provide us with your working code/setup?

I was just going to quickly copy my setup, then realised there were multiple gotchas that I fixed locally (happy to send them to you directly (how???), but I don't want to steer people wrong on the thread).
I decided it was easier to reference each project's Docker implementation, and then found multiple errors in their implementations too; too tedious to go through here.
So instead I will create a new, clean, working Dockerfile/docker compose setup, test it on a clean server and paste it here for review. Watch this space...

> So we don't reinvent the wheel? If you have a working docker-compose, please share it so we can build on top of your solution.
>
> Extra from me:
>
> 6.) How do you find the inference speed with the stack of ollama and litellm you described? Fast/slow?

I run on a free-tier ARM CPU remote processor, using only models needing 5-6 GB (mostly Mistral 7B), and I get nice fast inference.
Streaming returns at 60 to 100 wpm let a new small project be created in just a few (5) minutes. If I don't get an infinite loop by mistake!!!

> Cheers
>
> Ziga

@captivus
Collaborator

The question raised by @zigabrencic as point 6 is a particularly important one. We will want to compare the inference performance of proposed solutions prior to implementing, given issues observed in other testing Ziga and I have been working on.

@definitiontv

Sorry, I have been rushing. This was a bit more complex than I thought, so I had to wrap it in a pull request in order to share.
I am not sure the automatic ollama model download is working, so you may have to exec into the container and run /start.sh
or
ollama pull 'model'
ollama serve

AHAHH, as I type this I see my error!!! The commands are the wrong way round.
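So, for the record, the right order should be:

ollama serve
ollama pull 'model'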

See #1015; some notes were added to the bottom of the existing docker/README.

I will expand the notes etc. once a few options are pinned down.

@zigabrencic
Collaborator

Hey.

Thanks for submitting this. I checked the PR. I have only one question: how's the performance and underlying hardware access in Docker?

For further chats, I propose that you reach out on http://discord.com/users/820749115197227138 so we can speed this up a little ;)

@definitiontv

> Hey.
>
> Thanks for submitting this. I checked the PR. I have only one question: how's the performance and underlying hardware access in Docker?
>
> For further chats, I propose that you reach out on http://discord.com/users/820749115197227138 so we can speed this up a little ;)

Well, if you are relying on a GPU, then you just need to make sure your Docker instance has access to it, which it should unless you are running a very weird setup. CPU and memory should of course be near native; that's the point of Docker.
I actually store my models on extremely slow external disks, since they need to get loaded into memory anyway. Ollama keeps the last model used in memory, so after the first loading overhead it should make no difference, Docker or no Docker. But the compose file should give a one-line test of speed/hardware compatibility etc. Not sure if I documented it; just set the model to use in .env and exec into the gpte container to run gpte. BTW, that Discord link didn't seem to do anything.

@zigabrencic
Collaborator

zigabrencic commented Feb 16, 2024

Hey

1.) Thanks for the GPU/CPU point. I must admit I'm not that familiar with Docker internals.

2.) Discord: strange, it works for me and others. How about this one: https://discord.com/channels/1119885301872070706/1120698764445880350? That's the link to the community; you can find me there under ziga.ai.
