Integrate APPS benchmarking #1051

Merged
24 commits merged on Mar 23, 2024
Changes from 18 commits

Commits (24)
29f4680  Initial APPS benchmark integration (azrv, Feb 19, 2024)
4ddfb5e  Download and cache dataset (azrv, Feb 21, 2024)
cf3db0e  WIP APPS benchmarking (azrv, Feb 28, 2024)
ccb00ee  Merge remote-tracking branch 'origin/main' into apps_self_healing (azrv, Mar 5, 2024)
3c6c4aa  Multiple inputs, results (azrv, Mar 6, 2024)
833e3c9  Add subprocess execution timeout (azrv, Mar 7, 2024)
5e54e5b  Remove unused code (azrv, Mar 10, 2024)
e804c0b  Drop retry logic and outer loop in run.py (azrv, Mar 10, 2024)
95c88f3  Get rid of inputs (azrv, Mar 10, 2024)
5f6f7d6  Get rid of inputs (azrv, Mar 10, 2024)
c25c1d1  Merge branch 'apps2' into apps_self_healing (azrv, Mar 10, 2024)
b6ecace  Remove `AppsAssertion` class (azrv, Mar 11, 2024)
a526a91  Merge remote-tracking branch 'origin/main' into apps_self_healing (azrv, Mar 11, 2024)
37df506  Merge branch 'main' into apps_self_healing (azrv, Mar 16, 2024)
cb0d9ca  Add `AppsAssertion` class, simplify run.py (azrv, Mar 16, 2024)
50cd771  Return run.py to its initial state (azrv, Mar 16, 2024)
20890d3  Run pre-commit; Update lock file (azrv, Mar 16, 2024)
befa702  Run pre-commit on all files (azrv, Mar 16, 2024)
adb60a2  Merge remote-tracking branch 'origin/main' into apps_self_healing (azrv, Mar 21, 2024)
450c7e2  Revert assertions back to using dicts (azrv, Mar 21, 2024)
b175938  Create new execution environment for every run to avoid side effects (azrv, Mar 21, 2024)
911e092  Clean up (azrv, Mar 21, 2024)
d380f05  Revert DiffError; Temporary except `RecursionError` in benchmarks (azrv, Mar 21, 2024)
ce35057  Remove RecursionError try-catch (azrv, Mar 22, 2024)
3 changes: 3 additions & 0 deletions .gitignore
@@ -87,3 +87,6 @@ webapp/.next/

#ignore tox files
.tox

# locally saved datasets
gpt_engineer/benchmark/benchmarks/apps/dataset
114 changes: 114 additions & 0 deletions gpt_engineer/benchmark/benchmarks/apps/load.py
@@ -0,0 +1,114 @@
"""
Module for loading APPS evaluation tasks.

This module provides functionality to load tasks for evaluating GPT-based models
on problems from the APPS dataset. It defines a set of tasks with predefined prompts
and assertions to benchmark the performance of AI models.

Functions
---------
load_apps : function
Loads the APPS benchmark, which consists of a series of coding problems.
"""
from collections import OrderedDict
from pathlib import Path
from subprocess import TimeoutExpired
from typing import Union

from datasets import Dataset, DatasetDict, load_dataset, load_from_disk

from gpt_engineer.benchmark.benchmarks.apps.problem import Problem
from gpt_engineer.benchmark.benchmarks.apps.problems import PROBLEM_IDS
from gpt_engineer.benchmark.types import Assertable, Benchmark, Task
from gpt_engineer.core.files_dict import FilesDict

DATASET_PATH = Path("gpt_engineer/benchmark/benchmarks/apps/dataset")
MAX_N_TEST_EXAMPLES = 10


class AppsAssertion:
def __init__(self, expected: str, command: str):
self.expected_output = self._format(expected)
self.command = command

def evaluate(self, assertable: Assertable) -> bool:
pro = assertable.env.popen(self.command)
Collaborator:
This may sound a little counterintuitive to the architecture, but I think it may be preferable to create a fresh DiskExecutionEnv, upload the code, and run from there instead of using the global one. The reason is that executing the code may have side effects, like creating and removing files and folders. If that is the case, multiple executions of the same program with different inputs will not be independent.

Contributor (author):
I see, will do.

try:
stdout, stderr = pro.communicate(timeout=2)
stdout, stderr = stdout.decode("utf-8"), stderr.decode("utf-8")
except TimeoutExpired:
print("Execution Timeout")
return False

return self.expected_output in self._format(stdout)

def _format(self, string: str) -> str:
return string.replace(" ", "").replace("\n", "")


def _get_dataset() -> Union[Dataset, DatasetDict]:
try:
return load_from_disk(str(DATASET_PATH))
except FileNotFoundError:
print("Dataset not found locally, downloading...")

dataset = load_dataset("codeparrot/apps")
dataset.save_to_disk(DATASET_PATH)

return dataset


def load_apps():
"""
Loads the APPS benchmark, which consists of a series of coding problems.

Returns
-------
Benchmark
A Benchmark object containing a list of Task objects for the APPS evaluation.
"""
dataset = _get_dataset()
tasks = []

problems = [
Problem(
id=problem["problem_id"],
question=problem["question"],
input_output=problem["input_output"],
starter_code=problem["starter_code"],
)
for problem in dataset["test"]
if problem["problem_id"] in PROBLEM_IDS
]

for problem in problems:
tasks.append(
Task(
name=str(problem.id),
initial_code=FilesDict({"main.py": problem.starter_code}),
command=None, # Explicitly setting `None` because each assertion specifies its command
prompt=problem.question
+ "\nThe program, including its inputs, should be run from the command "
"line like 'python main \"input1 input2 etc \"', with all inputs inside "
"the quotation marks. The program should not read inputs from stdin.",
assertions=[
Collaborator:
We can remove the list wrapping again and just use a dictionary comprehension, like in the original design.

OrderedDict(
{
"correct output": AppsAssertion(
expected=problem.outputs[i],
command="python main.py"
+ ' "'
+ problem.inputs[i]
+ '"',
).evaluate
}
)
for i in range(min(len(problem.outputs), MAX_N_TEST_EXAMPLES))
],
)
)

return Benchmark(
name="APPS",
tasks=tasks,
)
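A minimal sketch of the collaborator's suggestion above, creating a fresh execution environment inside evaluate so runs stay independent. This is illustrative only and not part of the diff; it assumes DiskExecutionEnv exposes upload() and popen() the same way run.py uses them.

from subprocess import TimeoutExpired

from gpt_engineer.benchmark.types import Assertable
from gpt_engineer.core.default.disk_execution_env import DiskExecutionEnv


class AppsAssertion:
    def __init__(self, expected: str, command: str):
        self.expected_output = self._format(expected)
        self.command = command

    def evaluate(self, assertable: Assertable) -> bool:
        # Fresh environment per evaluation, so file-system side effects of one
        # run cannot leak into the next run of the same program.
        env = DiskExecutionEnv()
        env.upload(assertable.files)
        pro = env.popen(self.command)
        try:
            stdout, _ = pro.communicate(timeout=2)
        except TimeoutExpired:
            print("Execution Timeout")
            return False
        return self.expected_output in self._format(stdout.decode("utf-8"))

    def _format(self, string: str) -> str:
        return string.replace(" ", "").replace("\n", "")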
25 changes: 25 additions & 0 deletions gpt_engineer/benchmark/benchmarks/apps/problem.py
@@ -0,0 +1,25 @@
import json

from dataclasses import dataclass
from functools import cached_property
from typing import List


@dataclass(frozen=True)
class Problem:
id: int
question: str
input_output: str
starter_code: str

@property
def inputs(self) -> List[str]:
return self._parsed_inputs_outputs["inputs"]

@property
def outputs(self) -> List[str]:
return self._parsed_inputs_outputs["outputs"]

@cached_property
def _parsed_inputs_outputs(self):
return json.loads(self.input_output.replace("\n", ""))
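For orientation, a small illustrative example of how Problem parses the APPS input_output field; the values below are made up, not taken from the dataset.

from gpt_engineer.benchmark.benchmarks.apps.problem import Problem

problem = Problem(
    id=0,
    question="Read two integers from the command line and print their sum.",
    input_output='{"inputs": ["1 2"], "outputs": ["3"]}',
    starter_code="",
)
assert problem.inputs == ["1 2"]  # parsed lazily via the cached property
assert problem.outputs == ["3"]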
9 changes: 9 additions & 0 deletions gpt_engineer/benchmark/benchmarks/apps/problems.py
@@ -0,0 +1,9 @@
INTRODUCTORY_PROBLEM_IDS = list(range(2500, 2600))

INTERVIEW_PROBLEM_IDS = list(range(0, 100))

COMPETITION_PROBLEM_IDS = list()

# TODO: Pick problems
# Temporary testing against these problems
PROBLEM_IDS = list(range(0, 50))
3 changes: 2 additions & 1 deletion gpt_engineer/benchmark/benchmarks/load.py
@@ -9,14 +9,15 @@
get_benchmark : function
Retrieves a Benchmark object by name. Raises ValueError if the benchmark is unknown.
"""

from gpt_engineer.benchmark.benchmarks.apps.load import load_apps
from gpt_engineer.benchmark.benchmarks.gpteng.load import load_gpteng
from gpt_engineer.benchmark.benchmarks.gptme.load import load_gptme
from gpt_engineer.benchmark.types import Benchmark

BENCHMARKS = {
"gptme": load_gptme,
"gpteng": load_gpteng,
"apps": load_apps,
}


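A hedged usage sketch of the registry: assuming get_benchmark (referenced in the module docstring, collapsed in this diff) simply looks the name up in BENCHMARKS and calls the loader, the new entry makes the APPS suite reachable by name.

from gpt_engineer.benchmark.benchmarks.load import get_benchmark

benchmark = get_benchmark("apps")  # loads or downloads the dataset and builds the tasks
print(benchmark.name, len(benchmark.tasks))
# get_benchmark("unknown") would raise ValueError, per the module docstring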
65 changes: 45 additions & 20 deletions gpt_engineer/benchmark/run.py
@@ -18,6 +18,7 @@

from gpt_engineer.benchmark.types import Assertable, Benchmark, TaskResult
from gpt_engineer.core.base_agent import BaseAgent
from gpt_engineer.core.chat_to_files import DiffError
from gpt_engineer.core.default.disk_execution_env import DiskExecutionEnv


@@ -48,8 +49,20 @@ def run(
"""
task_results = []
for task in benchmark.tasks:
print(f"--> Running task: {task.name}\n")

t0 = time.time()
files_dict = agent.improve(task.initial_code, task.prompt, task.command)
try:
files_dict = agent.improve(task.initial_code, task.prompt)
except DiffError: # Temporary catch errors related to git diffs
Collaborator:
Is this still a problem? I see that this was necessary during development, but I think we should aim to fix this on the diff side rather than having a specific catch in the benchmarks.

task_results.append(
TaskResult(
task_name=task.name,
duration=time.time() - t0,
assertion_results=[],
)
)
continue
t1 = time.time()

env = DiskExecutionEnv()
@@ -61,6 +74,7 @@
stdout, stderr = stdout.decode("utf-8"), stderr.decode("utf-8")
else:
p, stdout, stderr = None, None, None

exec_result = Assertable(
files=files_dict,
env=env,
@@ -72,13 +86,17 @@
task_results.append(
TaskResult(
task_name=task.name,
assertion_results={
assertion_name: assertion(exec_result)
for assertion_name, assertion in task.assertions.items()
},
assertion_results=[
Collaborator:
Same here, we should be able to revert now that we are using the AppsAssertion design. Each APPS assertion will be an entry in the dictionary.

{
key: assertion(exec_result)
for key, assertion in task.assertions[i].items()
}
for i in range(len(task.assertions))
],
duration=t1 - t0,
)
)

if verbose:
print_results(task_results)
return task_results
@@ -100,32 +118,39 @@ def print_results(results: list[TaskResult]):
for task_result in results:
print(f"\n--- Results for {task_result.task_name} ---")
print(f"{task_result.task_name} ({task_result.duration:.2f}s)")
for assertion_name, assertion_result in task_result.assertion_results.items():
checkmark = "✅" if assertion_result else "❌"
print(f" {checkmark} {assertion_name}")
for assertion_results_dict in task_result.assertion_results:
for assertion_name, assertion_result in assertion_results_dict.items():
checkmark = "✅" if assertion_result else "❌"
print(f" {checkmark} {assertion_name}")
print()
print()

success_rates = [task_result.success_rate for task_result in results]
avg_success_rate = sum(success_rates) / len(results)

total_time = sum(task_result.duration for task_result in results)
print(f"Total time: {total_time:.2f}s")

correct_assertions = sum(
sum(
assertion_result
for assertion_result in task_result.assertion_results.values()
for assertion_results_dict in task_result.assertion_results
for assertion_result in assertion_results_dict.values()
)
for task_result in results
)
total_assertions = sum(
len(task_result.assertion_results) for task_result in results
)
print(f"Total correct assertions: {correct_assertions}/{total_assertions}")

correct_tasks = sum(
all(
assertion_result
for assertion_result in task_result.assertion_results.values()
)
len(assertion_results_dict)
for task_result in results
for assertion_results_dict in task_result.assertion_results
)
print(f"Correct tasks: {correct_tasks}/{len(results)}")
correct_tasks = [
task_result for task_result in results if task_result.success_rate == 1
]

print("--- Results ---")
print(f"Total time: {total_time:.2f}s")
print(f"Completely correct tasks: {len(correct_tasks)}/{len(results)}")
print(f"Total correct assertions: {correct_assertions}/{total_assertions}")
print(f"Average success rate: {avg_success_rate * 100}% on {len(results)} tasks")
print("--- Results ---")
print()
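A rough sketch of the revert the reviewers ask for here and in types.py; this is an assumption about the intended design, not code from this diff. task.assertions would go back to a single OrderedDict and assertion_results to a flat dict, with one entry per APPS test case; the per-index key names below are made up to keep the entries unique.

from collections import OrderedDict

# In apps/load.py: build one OrderedDict per task instead of a list of
# single-entry dicts (AppsAssertion, problem, MAX_N_TEST_EXAMPLES as in the diff above):
assertions = OrderedDict(
    {
        f"correct output {i}": AppsAssertion(
            expected=problem.outputs[i],
            command='python main.py "' + problem.inputs[i] + '"',
        ).evaluate
        for i in range(min(len(problem.outputs), MAX_N_TEST_EXAMPLES))
    }
)

# In run.py: the aggregation then stays flat, as in the original design
# (task, exec_result, t0, t1, task_results as in the diff above):
task_results.append(
    TaskResult(
        task_name=task.name,
        assertion_results={
            assertion_name: assertion(exec_result)
            for assertion_name, assertion in task.assertions.items()
        },
        duration=t1 - t0,
    )
)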
15 changes: 12 additions & 3 deletions gpt_engineer/benchmark/types.py
@@ -22,7 +22,7 @@
"""
from dataclasses import dataclass
from subprocess import Popen
from typing import Callable, Dict, Optional
from typing import Callable, List, Optional, OrderedDict

from gpt_engineer.core.base_execution_env import BaseExecutionEnv
from gpt_engineer.core.files_dict import FilesDict
@@ -57,7 +57,7 @@ class Task:
initial_code: Optional[FilesDict]
command: Optional[str]
prompt: str
assertions: Optional[Dict[str, Assertion]]
assertions: Optional[List[OrderedDict[str, Assertion]]]
Collaborator:
Again, revert this.



@dataclass
@@ -72,5 +72,14 @@ class Benchmark:
@dataclass
class TaskResult:
task_name: str
assertion_results: dict[str, bool]
assertion_results: List[dict[str, bool]]
Collaborator:
Again, revert.

duration: float

# Returns success rate from 0.00 up to 1.00
@property
def success_rate(self) -> float:
succeeded = len(
[result for result in self.assertion_results if list(result.values())[0]]
)

return succeeded / len(self.assertion_results)
Collaborator:
I get "ZeroDivisionError: division by zero" here when I run the first 5 problems.

Collaborator:
Has this been addressed? I think it happened because I only ran a few problems, the first 5. Or might it have been the catch clause leaving the list empty?

Contributor (author):
I don't manage to reproduce it, but I added a check at the beginning of the method:

if not self.assertion_results:
    return 0.0

Can you still catch it?
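A worked example of the success_rate property as defined above, using made-up values; each assertion_results entry is the single-key dict produced by one AppsAssertion.

from gpt_engineer.benchmark.types import TaskResult

result = TaskResult(
    task_name="0",
    assertion_results=[{"correct output": True}, {"correct output": False}],
    duration=1.2,
)
assert result.success_rate == 0.5  # one of the two test cases passed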

24 changes: 17 additions & 7 deletions gpt_engineer/core/chat_to_files.py
@@ -26,13 +26,19 @@

from typing import Dict, Tuple

import regex

from gpt_engineer.core.diff import ADD, REMOVE, RETAIN, Diff, Hunk
from gpt_engineer.core.files_dict import FilesDict, file_to_lines_dict

# Initialize a logger for this module
logger = logging.getLogger(__name__)


class DiffError(ValueError):
pass


def chat_to_files_dict(chat: str) -> FilesDict:
"""
Converts a chat string containing file paths and code blocks into a FilesDict object.
@@ -134,21 +140,25 @@
- dict: A dictionary of Diff objects keyed by filename.
"""
# Regex to match individual diff blocks
diff_block_pattern = re.compile(
diff_block_pattern = regex.compile(
Collaborator:
It is great that you are improving the diff pipeline! Formally, this should be a separate PR. Would you be able to make a separate PR with the changes to chat_to_files?

Collaborator:
I can merge that quickly.

Contributor (author):
Will do.
r"```.*?\n\s*?--- .*?\n\s*?\+\+\+ .*?\n(?:@@ .*? @@\n(?:[-+ ].*?\n)*?)*?```",
re.DOTALL,
)

diffs = {}
for block in diff_block_pattern.finditer(diff_string):
diff_block = block.group()
try:
for block in diff_block_pattern.finditer(diff_string, timeout=1):
diff_block = block.group()

# Parse individual diff blocks and update the diffs dictionary
diffs.update(parse_diff_block(diff_block))
# Parse individual diff blocks and update the diffs dictionary
diffs.update(parse_diff_block(diff_block))
except TimeoutError:
raise DiffError("`diff_block_pattern.finditer` has timed out")

Codecov (codecov/patch) warning: added lines 155-156 in gpt_engineer/core/chat_to_files.py were not covered by tests.

if not diffs:
print(
"GPT did not provide any proposed changes. Please try to reselect the files for uploading and edit your prompt file."
raise DiffError(
"GPT did not provide any proposed changes. "
"Please try to reselect the files for uploading and edit your prompt file."
)

Codecov (codecov/patch) warning: added line 159 in gpt_engineer/core/chat_to_files.py was not covered by tests.

return diffs