Integrate APPS benchmarking #1051

Merged 24 commits into gpt-engineer-org:main on Mar 23, 2024

Conversation

azrv
Contributor

@azrv azrv commented Mar 7, 2024

Integrate APPS dataset for benchmarking.

Changes:

  1. I ran into a few git-diff-related errors while parsing the LLM's response, which I wrapped into DiffError. Such errors are ignored and the run continues; I expect this is a temporary change until we polish the git diff feature.
  2. A regex dependency was added in order to time out endless regex matching inside the parse_diffs function (done in "Add timeout while searching for git diffs in LLMs response" #1067); see the sketch after this list.
  3. run.py had to be adjusted to work with multiple commands/inputs, since APPS may provide multiple inputs and outputs per problem.
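
A minimal sketch of the timeout idea from item 2, assuming the third-party regex package's timeout keyword on its matching calls (the pattern below is illustrative, not the real one in parse_diffs):

```python
import regex  # third-party "regex" package, a drop-in replacement for "re"

# Illustrative stand-in for the real diff-block pattern used in parse_diffs.
DIFF_BLOCK_PATTERN = regex.compile(r"diff --git .*?(?=diff --git |\Z)", regex.DOTALL)


def find_diff_blocks(llm_response: str, timeout_s: float = 5.0) -> list:
    """Return all diff blocks, giving up instead of hanging on pathological input."""
    try:
        # The regex package accepts a timeout (in seconds) on its matching calls,
        # which guards against catastrophic backtracking on malformed responses.
        return DIFF_BLOCK_PATTERN.findall(llm_response, timeout=timeout_s)
    except TimeoutError:
        return []
```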

How to run?

python3 gpt_engineer/benchmark gpt_engineer/core/default/simple_agent.py apps

I tried to run it on the first 50 problems; here is the output (the total time is not right, because the generated code is cached from previous runs):

--- Results ---
Total time: 0.38s
Completely correct tasks: 12/50
Total correct assertions: 229/500
Average success rate: 45.79999999999998% on 50 tasks
--- Results ---

Benchmark against three simple problems
Iterate over predefined set of problems
TODOs left
- try using command line arguments
- integrate self-healing
Add parse_diffs timeout
Temporary except diffs related issues (Maybe some `try` statements are not necessary at this point)
Handle problems with starter_code
Improve results
@AntonOsika
Collaborator

Super excited about this, wanted to use it today

I'll add some comments to suggest simplifications.

Apart from that, are we ready to merge, or is there anything you want to fix first (apart from resolving pyproject.toml and running poetry lock)?

@ATheorell
Collaborator

I think @azrv wanted to do some cleanup.

Currently, this PR comes with two new agent implementations, the benchmark_agent and the self_healing_agent. For the "base" PR of APPS, probably neither of these should be included; it should be sufficient to run the default_agent, even though performance may be deplorable. It is then the fun of coming PRs to add smart self-healing etc. that improves performance.

@AntonOsika
Collaborator

Thanks for the PR @azrv

Some good things here!

I'll be direct here though: we need higher standards to merge any PR, specifically:

  1. There is no explanation of why to do things in a new way (unless obvious)
  2. The PR adds both a new agent to evaluate and a new benchmark. They should be separate PRs.
  3. The PR completely changes the benchmark code, without explanations of why (and it seemingly adds "agent logic" in the PR?)

Referring to this in point 3: [screenshot omitted]

@AntonOsika
Collaborator

Finally, I don't think we need to add inputs:

You can just set the command to None, and then create a dict comprehension of assertions, each of which loops over all the inputs and calls the command. Makes sense?
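
One reading of that suggestion, as a minimal sketch: the Task keeps command=None and each input/output pair becomes its own assertion that runs the command itself. The Assertable interface (env.popen) and the helper names here are assumptions, not the final implementation:

```python
from typing import Callable, Dict


def output_assertion(command: str, expected: str) -> Callable:
    """Build one assertion that runs the candidate program and checks its stdout."""

    def check(assertable) -> bool:
        # Run the generated program with this problem input passed on the command line.
        process = assertable.env.popen(command)
        stdout, _ = process.communicate(timeout=2)
        return stdout.decode().strip() == expected.strip()

    return check


def make_assertions(inputs: list, outputs: list) -> Dict[str, Callable]:
    # One dictionary entry per input/output pair; the Task itself keeps command=None.
    return {
        f"expected_output_{i}": output_assertion(f'python main "{inp}"', out)
        for i, (inp, out) in enumerate(zip(inputs, outputs))
    }
```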

@ATheorell
Collaborator

ATheorell commented Mar 9, 2024

> Finally, I don't think we need to add inputs:
>
> You can just set the command to None, and then create a dict comprehension of assertions, each of which loops over all the inputs and calls the command. Makes sense?

Ah, turning the inputs and assertions into a list was my invention, but yes, I believe the design where the inputs are stored in the assertions is nicer. I would vouch for that.

Drop self_healing_agent and benchmark_agent
Store commands within assertions
Store commands within assertions
@azrv
Contributor Author

azrv commented Mar 10, 2024

> Thanks for the PR @azrv
>
> Some good things here!
>
> I'll be direct here though: we need higher standards to merge any PR, specifically:
>
> 1. There is no explanation of why to do things in a new way (unless obvious)
> 2. The PR adds both a new agent to evaluate and a new benchmark. They should be separate PRs.
> 3. The PR completely changes the benchmark code, without explanations of why (and it seemingly adds "agent logic" in the PR?)
>
> Referring to this in point 3: [screenshot omitted]

Hey!

  1. Sure, I updated the PR description
  2. Both agents were dropped
  3. I cleaned things up and removed the 'agent-based' logic that was added solely for self-healing for the sake of better performance

Here are a few things I think could be handy for gpt-engineer benchmarks. Please share your thoughts on whether this is something we would consider doing later:

  1. Retries, a.k.a. self-healing, if most of the assertions didn't succeed
  2. Running the benchmark in multiple threads

@azrv
Contributor Author

azrv commented Mar 10, 2024

> Finally, I don't think we need to add inputs:
>
> You can just set the command to None, and then create a dict comprehension of assertions, each of which loops over all the inputs and calls the command. Makes sense?

Not sure this is achievable with plain dictionaries, because you need to store the command somewhere. For now I simply extended the Assertion class; let me know if that works for you, or please elaborate on the dict approach.

Yes, the other benchmarks are still broken. I will fix them after we agree on a single solution :)

@azrv azrv requested a review from AntonOsika March 10, 2024 21:12
@azrv azrv marked this pull request as ready for review March 11, 2024 18:21
@azrv azrv changed the title WIP Integrate APPS benchmarking Integrate APPS benchmarking Mar 11, 2024
@ATheorell
Collaborator

> Finally, I don't think we need to add inputs:
> You can just set the command to None, and then create a dict comprehension of assertions, each of which loops over all the inputs and calls the command. Makes sense?

> Not sure this is achievable with plain dictionaries, because you need to store the command somewhere. For now I simply extended the Assertion class; let me know if that works for you, or please elaborate on the dict approach.
>
> Yes, the other benchmarks are still broken. I will fix them after we agree on a single solution :)

The code is getting more general, which is great! I think we can still scale back a lot of the changes to the general run.py file by implementing the system that @AntonOsika suggested. It is not necessary to extend the Assertion class. One way of doing this (which I think is quite nice) is to use the AppsAssertion class that I implemented. The idea goes:

For each input/output pair, create an AppsAssertion object. Pass the code, (timeout), command + input, expected output and an execution environment to the constructor of the AppsAssertion object. In the evaluate method, run the code through the execution command and check whether the assertion holds. Timeouts and runtime failures make the assertion fail in the same way as a wrong output (but we could log them differently in the future).

A drawback of this method may be that it complicates parallelization, which could, however, still be added from run.py. Another idea would be to parallelize at the problem level. Input on this, @AntonOsika?

Hope this helps
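
A rough sketch of the AppsAssertion idea described above; the constructor signature and the Assertable interface are assumptions here, not the exact class in this PR:

```python
import subprocess


class AppsAssertion:
    def __init__(self, expected: str, command: str):
        self.expected_output = expected
        self.command = command  # e.g. 'python main "input1 input2"'

    def evaluate(self, assertable) -> bool:
        # Run the generated code on this problem input and compare stdout with the
        # expected output; a timeout or crash fails the assertion like a wrong answer.
        process = assertable.env.popen(self.command)
        try:
            stdout, _ = process.communicate(timeout=2)
        except subprocess.TimeoutExpired:
            process.kill()
            return False
        return self.expected_output.strip() in stdout.decode()
```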


codecov bot commented Mar 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.29%. Comparing base (164730a) to head (ce35057).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1051   +/-   ##
=======================================
  Coverage   84.29%   84.29%           
=======================================
  Files          27       27           
  Lines        1566     1566           
=======================================
  Hits         1320     1320           
  Misses        246      246           


Collaborator

@ATheorell ATheorell left a comment

Great work and this is more or less working. Just requesting a bit of housekeeping now really.

[result for result in self.assertion_results if list(result.values())[0]]
)

return succeeded / len(self.assertion_results)
Collaborator

I get: "ZeroDivisionError: division by zero" here when I run the first 5 problems.

Collaborator

Has this been addressed? I think it happened because I ran only a few problems, the first 5...
Or might the list have been empty because of the catch clause?

Contributor Author

I can't reproduce it, but I added a check at the beginning of the method:

if not self.assertion_results:
    return 0.0

Can you still catch it?
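
For reference, a sketch of how the guard plausibly sits in the success-rate computation, reconstructed from the diff context in this thread (names may differ slightly from the actual method):

```python
def success_rate(self) -> float:
    # No assertion results at all (e.g. code generation failed): report 0.0
    # instead of dividing by zero.
    if not self.assertion_results:
        return 0.0

    succeeded = len(
        [result for result in self.assertion_results if list(result.values())[0]]
    )
    return succeeded / len(self.assertion_results)
```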

- files_dict = agent.improve(task.initial_code, task.prompt, task.command)
+ try:
+     files_dict = agent.improve(task.initial_code, task.prompt)
+ except DiffError: # Temporary catch errors related to git diffs
Collaborator

Is this still a problem? I see that this was necessary during development, but I think we should aim to fix this on the diff side, rather than having a specific catch in the benchmarks.

+ "\nThe program, including its inputs, should be run from the command "
"line like 'python main \"input1 input2 etc \"', with all inputs inside "
"the quotation marks. The program should not read inputs from stdin.",
assertions=[
Collaborator

We can remove the list wrapping again and just use a dictionary comprehension like in the original design

assertion_name: assertion(exec_result)
for assertion_name, assertion in task.assertions.items()
},
assertion_results=[
Collaborator

Same here, we should be able to revert now that we are using the AppsAssertion design. Each AppsAssertion will be an entry in the dictionary.

@@ -57,7 +57,7 @@ class Task:
     initial_code: Optional[FilesDict]
     command: Optional[str]
     prompt: str
-    assertions: Optional[Dict[str, Assertion]]
+    assertions: Optional[List[OrderedDict[str, Assertion]]]
Collaborator

Again, revert this

@@ -72,5 +72,14 @@ class Benchmark:
 @dataclass
 class TaskResult:
     task_name: str
-    assertion_results: dict[str, bool]
+    assertion_results: List[dict[str, bool]]
Collaborator

Again, revert.

@@ -134,21 +140,25 @@ def parse_diffs(diff_string: str) -> dict:
     - dict: A dictionary of Diff objects keyed by filename.
     """
     # Regex to match individual diff blocks
-    diff_block_pattern = re.compile(
+    diff_block_pattern = regex.compile(
Collaborator

It is great that you are improving the diff-pipeline! Formally, this should be a separate PR. Would you be able to make a separate PR with the changes to chat_to_files?

Collaborator

I can merge that quickly

Contributor Author

Will do

pyproject.toml Outdated
@@ -32,6 +32,8 @@ langchain_openai = "*"
 toml = ">=0.10.2"
 pyperclip = "^1.8.2"
 langchain-anthropic = "^0.1.1"
+datasets = "^2.17.1"
+regex = "^2023.12.25"
Collaborator

It would be nice if you could comment on what functionality makes the new dependencies necessary.

        self.command = command

    def evaluate(self, assertable: Assertable) -> bool:
        pro = assertable.env.popen(self.command)
Collaborator

This may sound a little counterintuitive to the architecture, but I think it may be preferable to create a fresh DiskExecutionEnvironment, upload the code and run from there, instead of using the global one. The reason is that executing the code may have side effects, like creating and removing files and folders. If this is the case, multiple executions of the same program with different inputs will not be independent.
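
A sketch of that suggestion, assuming gpt-engineer's DiskExecutionEnv exposes upload() and popen() and that the Assertable carries the generated files (both are assumptions here, not the exact merged code):

```python
from gpt_engineer.core.default.disk_execution_env import DiskExecutionEnv


class AppsAssertion:
    def __init__(self, expected: str, command: str):
        self.expected_output = expected
        self.command = command

    def evaluate(self, assertable) -> bool:
        # A fresh environment per evaluation, so files created or deleted by one run
        # cannot leak into the run for the next input.
        env = DiskExecutionEnv()
        env.upload(assertable.files)
        process = env.popen(self.command)
        stdout, _ = process.communicate(timeout=2)
        return self.expected_output.strip() in stdout.decode()
```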

Contributor Author

I see, will do

- files_dict = agent.improve(task.initial_code, task.prompt, task.command)
+ try:
+     files_dict = agent.improve(task.initial_code, task.prompt)
+ except RecursionError: # Temporary catch errors related to git diffs
Contributor Author

@ATheorell either the diff errors were indeed fixed in the latest master or I added the catch prematurely in the first place, but now the only thing left is the validate_and_correct function getting into recursion.

I suggest I temporarily catch it here and file an issue for it.

Collaborator

I actually suggest you remove this catch. We merge without it and let it fail. Then we can use it to repair whatever is broken in the diffs.

Contributor Author

Done

@azrv azrv requested a review from ATheorell March 21, 2024 19:44
@azrv
Contributor Author

azrv commented Mar 21, 2024

@ATheorell @AntonOsika All fixed.
The only thing left is the RecursionError explained above.

@ATheorell ATheorell merged commit 75b5438 into gpt-engineer-org:main Mar 23, 2024
6 checks passed
@azrv
Contributor Author

azrv commented Mar 23, 2024

Solves #819
