Merge pull request #40 from Significant-Gravitas/feat/basics

addition of basic challenges, easier challenge creation, --mock flag, adding mini-agi
pull/5155/head
merwanehamadi 2023-06-27 18:50:23 -07:00 committed by GitHub
commit 11303e2ef7
26 changed files with 573 additions and 222 deletions

3
.env.example Normal file
View File

@ -0,0 +1,3 @@
AGENT_NAME=mini-agi
AGENT_TIMEOUT=60
MOCK_TEST=False

134
README.md
View File

@ -2,73 +2,94 @@
A repo built to benchmark the performance of agents far and wide, regardless of how they are set up and how they work
## As a user
1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to run and kill your agent (see the sketch after this list)
3. `agbenchmark start`
- `--category challenge_category` to run tests in a specific category
- `--mock` to only run mock tests if they exist for each test
- `--noreg` to skip any tests that have passed in the past. When you run without this flag and a previously passing challenge fails, it is removed from the regression tests
4. We call boilerplate code for your agent
5. Show pass rate of tests, logs, and any other metrics
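Below is a minimal sketch of what that agent-side boilerplate might look like, assuming the agent exposes a script that takes the task as a command-line argument (this mirrors how the `run_agent` fixture in `conftest.py` invokes `python miniagi.py <task>`); the file name and run loop are placeholders, not a fixed API:
```python
# hypothetical agent entry point; agbenchmark runs it with the task as argv[1]
import sys


def run_agent(task: str) -> None:
    """Run the agent on a single benchmark task and exit when finished."""
    # replace this with your agent's own run loop and stop condition
    print(f"Received task: {task}")


if __name__ == "__main__":
    run_agent(sys.argv[1] if len(sys.argv) > 1 else "")
```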
## Contributing
##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x
### To run the existing mocks
1. clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `cp .env.example .env`
6. `agbenchmark start --mock`
Keep config the same and watch the logs :)
### To run with mini-agi
1. Navigate to `auto-gpt-benchmarks/agent/mini-agi`
2. `pip install -r requirements.txt`
3. `cp .env_example .env`, set `PROMPT_USER=false` and add your `OPENAI_API_KEY=`. Set `MODEL="gpt-3.5-turbo"` if you don't have access to `gpt-4` yet. Also make sure you have Python 3.10+ installed
4. Follow the commands above, but drop the mock flag: `agbenchmark start`
- To add a dependency, use `poetry add <package>`.
Feel free to create PRs to merge with `main` at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.
If you push at any point and break things (it'll happen to everyone), fix it ASAP. Step 1 is to revert `master` to the last working commit.
Let people know what your beautiful code does; document everything well.
Share your progress :)
## How this works
1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to your agent to start a webserver (run loop and stop condition)
3. `agbenchmark start --category challenge_category` (omit the category flag to run all tests). Specify the hostname, port, and workspace directory in the config
4. We call the server to run the agent for each test
5. Show pass rate of tests, logs, and any other metrics
#### Bonuses
- You can add tests by git cloning auto-gpt-benchmarks into your repo
- The agent is abstracted from the benchmark, so no extra setup is needed other than starting the server
- Simple, easy to use
- Don't have to deal with cloud or parallelization yet
### Pytest
To create a test:
An example test is below. Use it as a template and change the class name, the .json file name, the test's name and dependencies (the `depends` marker), and the scoring logic.
```python
import pytest
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os


class TestWriteFile(BasicChallenge):
    """Testing if LLM can write to a file"""

    def get_file_path(self) -> str:  # all tests must implement this method
        return os.path.join(os.path.dirname(__file__), "w_file_data.json")

    @pytest.mark.depends(on=[], name="basic_write_file")
    def test_method(self, workspace):
        # implement scoring logic by looking at the workspace
        files_contents = self.open_files(workspace, self.data.ground.files)

        scores = []
        for file_content in files_contents:
            score = self.scoring(file_content, self.data.ground)
            scores.append(score)

        assert 1 in scores
```
## API
All challenges inherit from a parent class that carries the pytest mark and any methods specific to their category.
```python
@pytest.mark.basic
class BasicChallenge(Challenge):
    pass
```
To create a file for a challenge to test against, add this fixture to the challenge class; it will create the file before the agent runs.
```python
@pytest.fixture(
    scope="module", autouse=True
)  # this is specific to setting up a file for the test, not all tests have this
def setup_module(self, workspace):
    Challenge.write_to_file(
        workspace, self.data.ground.files[0], "this is how we're doing"
    )
```
#### The main Challenge class has all the parametrization and loading logic so that all tests can inherit from it. It lives within [this file](https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/blob/master/agbenchmark/Challenge.py)
## Workspace
Defined by the user in the config.
If the `--mock` flag is used it is set to `agbenchmark/mocks/workspace`. Otherwise, for mini-agi it is at `C:/Users/<name>/miniagi` - it will be automatically set in the config.
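A minimal sketch of that resolution logic, following the behavior in `start_benchmark.py` and `conftest.py` shown further down in this diff (the helper name is illustrative, not part of the codebase):
```python
import os
from pathlib import Path


def resolve_workspace(mock: bool) -> str:
    """Illustrative helper: pick the mock workspace when --mock is set, else the agent's own one."""
    if mock:
        workspace = "agbenchmark/mocks/workspace"
    else:
        # default suggested by the CLI prompt; can be overridden in config.json
        workspace = os.path.join(Path.home(), "miniagi")
    os.makedirs(os.path.abspath(workspace), exist_ok=True)
    return workspace
```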
#### Dataset
@ -80,9 +101,9 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
|-- auto-gpt-benchmarks/ **main project directory**
| |-- metrics.py **combining scores, metrics, final evaluation**
| |-- start_benchmark.py **entry point from cli**
| |-- conftest.py **config, workspace creation + teardown, regression test markers, parameterization**
| |-- Challenge.py **easy challenge creation class**
| |-- config.json **workspace folder**
| |-- challenges/ **challenges across different domains**
| | |-- adaptability/
| | |-- basic_abilities/
@ -91,28 +112,7 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
| | |-- retrieval/
| | |-- web_navigation/
| | |-- writing/
| |-- tests/
| | |-- basic_abilities/ **every llm should pass these challenges**
| | |-- regression/ **challenges that already passed**
```
### Easy Challenge Creation
TBD, but potentially a shared Challenge class that challenges instantiate, since challenges need different utils/metrics for evaluation
#### Written Challenges
For code and writing challenges we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore
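A minimal sketch of reference-based scoring, using the stdlib `difflib` as a stand-in for METEOR/BERTScore/BARTScore (which would come from external packages such as `nltk` or `bert-score`); the threshold is an assumption, not something the benchmark defines:
```python
from difflib import SequenceMatcher


def reference_score(output: str, reference: str) -> float:
    """Return a [0, 1] similarity between the agent's output and a reference text."""
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio()


# usage: a score above some chosen threshold (e.g. 0.8) could count as a pass
score = reference_score(
    "Washington DC is the capital.",
    "Washington DC is the capital of the United States of America",
)
```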
#### Validators
Designed to handle specific types of output (e.g., text, code, structured data)
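A possible shape for such validators (illustrative only; nothing like this exists in the repo yet):
```python
import json
from abc import ABC, abstractmethod


class Validator(ABC):
    """Base class for validators that check one kind of agent output."""

    @abstractmethod
    def validate(self, content: str) -> float:
        """Return a score in [0, 1] for the given output."""


class JsonValidator(Validator):
    """Validator for structured (JSON) output."""

    def validate(self, content: str) -> float:
        try:
            json.loads(content)
            return 1.0
        except json.JSONDecodeError:
            return 0.0
```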
#### Logging
Log different requests coming in - write file, change file, etc. Maybe a db in the future for metrics, logs, etc
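A minimal sketch of such request logging, writing one JSON line per workspace event (the log path and event names are assumptions):
```python
import json
import time


def log_request(log_path: str, action: str, filename: str) -> None:
    """Append a single workspace event (e.g. 'write_file', 'change_file') as a JSON line."""
    event = {"timestamp": time.time(), "action": action, "filename": filename}
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")


# usage
log_request("agbenchmark/logs/requests.jsonl", "write_file", "file_to_check.txt")
```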
Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility

View File

@ -1,11 +1,63 @@
import os
from typing import Optional
import glob
import pytest
from abc import ABC, abstractmethod
from agbenchmark.challenges.define_task_types import Ground
from agbenchmark.challenges.define_task_types import ChallengeData
from dotenv import load_dotenv, set_key
load_dotenv()
mock_test_str = os.getenv("MOCK_TEST")
MOCK_TEST = mock_test_str.lower() == "true" if mock_test_str else False
class Challenge(ABC):
    """The parent class to all specific challenges classes.
    Defines helper methods for running a challenge"""

    @abstractmethod
    def get_file_path(self) -> str:
        """This should be implemented by any class which inherits from BasicChallenge"""
        pass

    @property
    def data(self) -> ChallengeData:
        return ChallengeData.deserialize(self.get_file_path())

    @property
    def mock(self):
        return self.data.mock.mock_func if self.data.mock else None

    @property
    def task(self):
        return (
            self.data.mock.mock_task if self.data.mock and MOCK_TEST else self.data.task
        )

    @property
    def dependencies(self) -> list:
        print("self.data.dependencies", self.data.dependencies)
        return self.data.dependencies

    @property
    def name(self) -> str:
        print("self.data.name", self.data.name)
        return self.data.name

    @pytest.mark.parametrize(
        "run_agent",
        [(task, mock)],
        indirect=True,
    )
    @pytest.mark.parametrize(
        "challenge_data",
        [data],
        indirect=True,
    )
    def test_method(self, workspace):
        raise NotImplementedError
@staticmethod
def open_file(workspace: str, filename: str):
script_dir = os.path.abspath(workspace)
@ -13,6 +65,26 @@ class Challenge:
with open(workspace_dir, "r") as f:
return f.read()
    @staticmethod
    def open_files(workspace: str, file_patterns: list):
        script_dir = os.path.abspath(workspace)
        files_contents = []

        for file_pattern in file_patterns:
            # Check if it is a file extension
            if file_pattern.startswith("."):
                # Find all files with the given extension in the workspace
                matching_files = glob.glob(os.path.join(script_dir, "*" + file_pattern))
            else:
                # Otherwise, it is a specific file
                matching_files = [os.path.join(script_dir, file_pattern)]

            for file_path in matching_files:
                with open(file_path, "r") as f:
                    files_contents.append(f.read())

        return files_contents
@staticmethod
def write_to_file(workspace: str, filename: str, content: str):
script_dir = os.path.abspath(workspace)
@ -30,3 +102,24 @@ class Challenge:
for filename in os.listdir(workspace)
if os.path.isfile(os.path.join(workspace, filename))
]
    def scoring(self, content: str, ground: Ground):
        if ground.should_contain:
            for should_contain_word in ground.should_contain:
                if should_contain_word not in content:
                    return 0.0
                else:
                    print(
                        f"Word that should exist: {should_contain_word} exists in the content"
                    )

        if ground.should_not_contain:
            for should_not_contain_word in ground.should_not_contain:
                if should_not_contain_word in content:
                    return 0.0
                else:
                    print(
                        f"Word that should not exist: {should_not_contain_word} does not exist in the content"
                    )

        return 1.0

View File

@ -4,40 +4,49 @@
Input:

- **name** (str): Name of the challenge.
- **category** (str[]): Category of the challenge such as 'basic', 'retrieval', 'comprehension', etc. _This is not currently used; it may be needed in the future._
- **task** (str): The task that the agent needs to solve.
- **dependencies** (str[]): The dependencies that the challenge needs to run. Needs to be the full node to the test function.
- **ground** (dict): The ground truth.
  - **answer** (str): The raw text of the ground truth answer.
  - **should_contain** (list): The exact strings that are required in the final answer.
  - **should_not_contain** (list): The exact strings that should not be in the final answer.
  - **files** (list): Files that are used for retrieval. Can specify a file here or an extension.
- **mock** (dict): Mock response for testing.
  - **mock_func** (str): Function to mock the agent's response. This is used for testing purposes.
  - **mock_task** (str): Task to provide to the mock function.
- **info** (dict): Additional info about the challenge.
  - **difficulty** (str): The difficulty of this query.
  - **description** (str): Description of the challenge.
  - **side_effects** (str[]): Describes the side effects of the challenge.
Example:
```json
{
    "name": "basic_write_file",
    "category": ["basic"],
    "task": "Print the the capital of America to a .txt file",
    "dependencies": [],
    "ground": {
        "answer": "Washington",
        "should_contain": ["Washington"],
        "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
        "files": [".txt"]
    },
    "mock": {
        "mock_func": "basic_write_file_mock",
        "mock_task": "What is the capital of America?"
    },
    "info": {
        "difficulty": "basic",
        "description": "Tests the writing to file",
        "side_effects": ["tests if there is in fact an LLM attached"]
    }
}
```
Current Output:
- **score** (float): scores range from [0, 1]
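As an illustration of how the `ground` fields above map to that score (mirroring the `Challenge.scoring` helper shown earlier in this diff), assuming the example challenge data above:
```python
def score_content(content: str, should_contain: list, should_not_contain: list) -> float:
    """All-or-nothing scoring: every required string present, no forbidden string present."""
    if any(word not in content for word in should_contain):
        return 0.0
    if any(word in content for word in should_not_contain):
        return 0.0
    return 1.0


# e.g. for the basic_write_file example above:
score_content(
    "Washington DC is the capital of the United States of America",
    ["Washington"],
    ["New York", "Los Angeles", "San Francisco"],
)  # -> 1.0
```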

View File

@ -4,27 +4,40 @@ import json
import os
class Mock(BaseModel):
    mock_func: str
    mock_task: Optional[str] = None


class Info(BaseModel):
    difficulty: str
    description: str
    side_effects: List[str]


class Ground(BaseModel):
    answer: str
    should_contain: Optional[List[str]] = None
    should_not_contain: Optional[List[str]] = None
    files: List[str]


class ChallengeData(BaseModel):
    name: str
    category: List[str]
    task: str
    dependencies: List[str]
    ground: Ground
    mock: Optional[Mock] = None
    info: Info

    def serialize(self, path: str) -> None:
        with open(path, "w") as file:
            file.write(self.json())

    @staticmethod
    def deserialize(path: str) -> "ChallengeData":
        print("Deserializing", path)
        with open(path, "r") as file:
            data = json.load(file)
        return ChallengeData(**data)

View File

@ -1,27 +1,9 @@
from agbenchmark.Challenge import Challenge
import pytest


@pytest.mark.retrieval
class RetrievalChallenge(Challenge):
    """Challenge for information-retrieval"""

    pass

View File

@ -1,12 +1,21 @@
{
    "name": "retrieval1",
    "category": ["basic"],
    "task": "Print the the capital of America to a .txt file",
    "dependencies": [],
    "ground": {
        "answer": "Washington",
        "should_contain": ["Washington"],
        "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
        "files": [".txt"]
    },
    "mock": {
        "mock_func": "basic_write_file_mock",
        "mock_task": "What is the capital of America?"
    },
    "info": {
        "difficulty": "basic",
        "description": "Tests the writing to file",
        "side_effects": ["tests if there is in fact an LLM attached"]
    }
}

View File

@ -1,25 +1,22 @@
import pytest
from agbenchmark.challenges.retrieval.Retrieval import RetrievalChallenge
from agbenchmark.challenges.define_task_types import ChallengeData, Ground
import os


class TestRetrieval1(RetrievalChallenge):
    """The first information-retrieval challenge"""

    def get_file_path(self) -> str:  # all tests must implement this method
        return os.path.join(os.path.dirname(__file__), "r1_data.json")

    def test_method(self, workspace):
        files_contents = self.open_files(workspace, self.data.ground.files)

        scores = []
        for file_content in files_contents:
            score = self.scoring(file_content, self.data.ground)
            print("Your score is:", score)
            scores.append(score)

        assert 1 in scores

View File

@ -1,5 +1,3 @@
{
    "hostname": "localhost"
}

View File

@ -4,20 +4,28 @@ import pytest
import shutil
from agbenchmark.tests.regression.RegressionManager import RegressionManager
import requests
from requests.exceptions import RequestException
from agbenchmark.mocks.MockManager import MockManager
import subprocess
from agbenchmark.Challenge import Challenge
from dotenv import load_dotenv
load_dotenv()
@pytest.fixture(scope="module")
def config(request):
    config_file = os.path.abspath("agbenchmark/config.json")
    print(f"Config file: {config_file}")
    with open(config_file, "r") as f:
        config = json.load(f)

    if request.config.getoption("--mock"):
        config["workspace"] = "agbenchmark/mocks/workspace"

    return config
@pytest.fixture(scope="module")
def workspace(config):
    yield config["workspace"]

    # teardown after test function completes
@ -32,61 +40,87 @@ def workspace(config):
print(f"Failed to delete {file_path}. Reason: {e}")
def pytest_addoption(parser):
    parser.addoption("--mock", action="store_true", default=False)
AGENT_NAME = os.getenv("AGENT_NAME")
AGENT_TIMEOUT = os.getenv("AGENT_TIMEOUT")
@pytest.fixture(autouse=True)
def server_response(request, config):
def run_agent(request, config):
"""Calling to get a response"""
if isinstance(request.param, tuple):
task = request.param[0] # The task is passed in indirectly
mock_function_name = request.param[1]
mock_function_name = request.param[1] or None
else:
task = request.param
mock_function_name = None
# print(f"Server starting at {request.module}")
# try:
# response = requests.post(
# f"{config['hostname']}:{config['port']}", data={"task": task}
# )
# response.raise_for_status() # This will raise an HTTPError if the status is 4xx or 5xx
# except RequestException:
# # If an exception occurs (could be connection, timeout, or HTTP errors), we use the mock
if mock_function_name:
mock_manager = MockManager(
task
) # workspace doesn't need to be passed in, stays the same
print("Server unavailable, using mock", mock_function_name)
mock_manager.delegate(mock_function_name)
if mock_function_name != None and (request.config.getoption("--mock")):
if mock_function_name:
mock_manager = MockManager(
task
) # workspace doesn't need to be passed in, stays the same
print("Server unavailable, using mock", mock_function_name)
mock_manager.delegate(mock_function_name)
else:
print("No mock provided")
else:
print("No mock provided")
path = os.path.join(os.getcwd(), "agent", AGENT_NAME)
# else:
# # This code is run if no exception occurred
# print(f"Request succeeded with status code {response.status_code}")
try:
timeout = int(AGENT_TIMEOUT) if AGENT_TIMEOUT is not None else 60
subprocess.run(
["python", "miniagi.py", task],
check=True,
cwd=path,
timeout=timeout
# text=True,
# capture_output=True
)
except subprocess.TimeoutExpired:
print("The subprocess has exceeded the time limit and was terminated.")
regression_json = "agbenchmark/tests/regression/regression_tests.json"

regression_manager = RegressionManager(regression_json)
# this is to get the challenge_data from every test
@pytest.fixture(autouse=True)
def challenge_data(request):
return request.param
def pytest_runtest_makereport(item, call):
    """Called for each test report. Generated for each stage
    of a test run (setup, call, teardown)."""
    if call.when == "call":
        challenge_data = item.funcargs.get("challenge_data", None)
        difficulty = challenge_data.info.difficulty if challenge_data else "unknown"
        dependencies = challenge_data.dependencies if challenge_data else []

        test_details = {
            "difficulty": difficulty,
            "dependencies": dependencies,
            "test": item.nodeid,
        }

        print("pytest_runtest_makereport", test_details)
        if call.excinfo is None:
            regression_manager.add_test(item.nodeid.split("::")[1], test_details)
        else:
            regression_manager.remove_test(item.nodeid.split("::")[1])
def pytest_collection_modifyitems(items):
    """Called once all test items are collected. Used
    to add regression and depends markers to collected test items."""
    for item in items:
        print("pytest_collection_modifyitems", item.nodeid)

        # regression add
        if item.nodeid.split("::")[1] in regression_manager.tests:
            print(regression_manager.tests)
            item.add_marker(pytest.mark.regression)
@ -94,3 +128,26 @@ def pytest_collection_modifyitems(items):
def pytest_sessionfinish():
    """Called at the end of the session to save regression tests"""
    regression_manager.save()


# this is so that all tests can inherit from the Challenge class
def pytest_generate_tests(metafunc):
    if "challenge_data" in metafunc.fixturenames:
        # Get the instance of the test class
        test_class = metafunc.cls()

        # Generate the parameters
        params = test_class.data

        # Add the parameters to the test function
        metafunc.parametrize("challenge_data", [params], indirect=True)

    if "run_agent" in metafunc.fixturenames:
        # Get the instance of the test class
        test_class = metafunc.cls()

        # Generate the parameters
        params = [(test_class.task, test_class.mock)]

        # Add the parameters to the test function
        metafunc.parametrize("run_agent", params, indirect=True)

View File

@ -1,20 +0,0 @@
import json
import openai
def basic_gpt_agent(query) -> str:
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo-0613", messages=[{"role": "user", "content": query}]
)
answer = response["choices"][0]["message"]["content"] # type: ignore
print("QUERY : ", query)
print("AGENT ANSWER: ", answer)
return answer
if __name__ == "__main__":
# server boilerplate example here
basic_gpt_agent("")

View File

@ -0,0 +1,24 @@
from agbenchmark.Challenge import Challenge
def basic_read_file_mock(task: str, workspace: str):
    """
    This mock reads a file and returns its content.
    """

    file_contents = Challenge.open_file(workspace, "file_to_check.txt")

    Challenge.write_to_file(
        workspace, "file_to_check.txt", f"random string: {file_contents}"
    )


def basic_write_file_mock(task: str, workspace: str):
    """
    This mock writes to a file (creates one if it doesn't exist)
    """
    Challenge.write_to_file(
        workspace,
        "file_to_check.txt",
        "Washington DC is the capital of the United States of America",
    )

View File

@ -1,4 +1,3 @@
from ..basic_gpt_agent import basic_gpt_agent
from agbenchmark.Challenge import Challenge
@ -6,8 +5,4 @@ from agbenchmark.Challenge import Challenge
# Prerequisites here would be writing to a file (basic_abilities test).
# Should also check if prerequisites exists in regression file
def retrieval_1_mock(task: str, workspace: str):
    pass

View File

@ -2,6 +2,10 @@ import click
import pytest
import json
import os
from pathlib import Path
from dotenv import load_dotenv, set_key
load_dotenv()
@click.group()
@ -12,8 +16,8 @@ def cli():
@cli.command()
@click.option("--category", default=None, help="Specific category to run")
@click.option("--noreg", is_flag=True, help="Skip regression tests")
@click.option("--mock", is_flag=True, help="Run with mock")
def start(category, noreg, mock):
    """Start the benchmark tests. If a category flag is provided, run the categories with that mark."""
config_file = "agbenchmark/config.json"
@ -23,12 +27,9 @@ def start(category, noreg):
if not os.path.exists(config_dir) or os.stat(config_dir).st_size == 0:
config = {}
config["hostname"] = click.prompt(
"\nPlease enter a new hostname", default="localhost"
)
config["port"] = click.prompt("Please enter a new port", default=8080)
config["workspace"] = click.prompt(
"Please enter a new workspace path", default="agbenchmark/mocks/workspace"
"Please enter a new workspace path",
default=os.path.join(Path.home(), "miniagi"),
)
with open(config_dir, "w") as f:
@ -38,13 +39,17 @@ def start(category, noreg):
with open(config_dir, "r") as f:
config = json.load(f)
set_key(".env", "MOCK_TEST", "True" if mock else "False")
if mock:
config["workspace"] = "agbenchmark/mocks/workspace"
# create workspace directory if it doesn't exist
workspace_path = os.path.abspath(config["workspace"])
if not os.path.exists(workspace_path):
os.makedirs(workspace_path, exist_ok=True)
    regression_path = os.path.abspath(
        "agbenchmark/tests/regression/regression_tests.json"
    )
if not os.path.exists(regression_path):
with open(regression_path, "a"):
@ -74,6 +79,9 @@ def start(category, noreg):
else:
print("Running all categorys") # run all categorys
if mock:
pytest_args.append("--mock")
# Run pytest with the constructed arguments
pytest.main(pytest_args)

View File

@ -0,0 +1,9 @@
import pytest
from agbenchmark.Challenge import Challenge
from agbenchmark.challenges.define_task_types import ChallengeData
from abc import abstractmethod
@pytest.mark.basic
class BasicChallenge(Challenge):
    pass

View File

@ -0,0 +1,19 @@
{
"name": "basic_read_file",
"category": ["basic"],
"task": "Write the string 'random string' before any existing text to the file called file_to_check.txt",
"dependencies": ["basic_write_file"],
"ground": {
"answer": "random string: this is how we're doing",
"should_contain": ["random string: this is how we're doing"],
"files": ["file_to_check.txt"]
},
"mock": {
"mock_func": "basic_read_file_mock"
},
"info": {
"description": "This reads the file quickly",
"difficulty": "basic",
"side_effects": [""]
}
}

View File

@ -0,0 +1,31 @@
import pytest
from agbenchmark.Challenge import Challenge
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os
class TestReadFile(BasicChallenge):
    """Testing if LLM can read a file"""

    @pytest.fixture(scope="module", autouse=True)
    def setup_module(self, workspace):
        Challenge.write_to_file(
            workspace, self.data.ground.files[0], "this is how we're doing"
        )

    def get_file_path(self) -> str:  # all tests must implement this method
        return os.path.join(os.path.dirname(__file__), "r_file_data.json")

    @pytest.mark.depends(on=["basic_write_file"], name="basic_read_file")
    def test_method(
        self, workspace
    ):  # test_method is a common name that all tests must implement
        files_contents = self.open_files(workspace, self.data.ground.files)

        scores = []
        for file_content in files_contents:
            score = self.scoring(file_content, self.data.ground)
            print("Your score is:", score)
            scores.append(score)

        assert 1 in scores

View File

@ -0,0 +1,21 @@
{
"name": "basic_write_file",
"category": ["basic"],
"task": "Print the the capital of America to a .txt file",
"dependencies": [],
"ground": {
"answer": "Washington",
"should_contain": ["Washington"],
"should_not_contain": ["New York", "Los Angeles", "San Francisco"],
"files": [".txt"]
},
"mock": {
"mock_func": "basic_write_file_mock",
"mock_task": "What is the capital of America?"
},
"info": {
"difficulty": "basic",
"description": "Tests the writing to file",
"side_effects": ["tests if there is in fact an LLM attached"]
}
}

View File

@ -0,0 +1,23 @@
import pytest
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os
class TestWriteFile(BasicChallenge):
    """Testing if LLM can write to a file"""

    def get_file_path(self) -> str:  # all tests must implement this method
        return os.path.join(os.path.dirname(__file__), "w_file_data.json")

    @pytest.mark.depends(on=[], name="basic_write_file")
    def test_method(self, workspace):
        print("my workspace is ", workspace)
        files_contents = self.open_files(workspace, self.data.ground.files)

        scores = []
        for file_content in files_contents:
            score = self.scoring(file_content, self.data.ground)
            print("Your score is:", score)
            scores.append(score)

        assert 1 in scores

View File

@ -1,3 +1,6 @@
import json
class RegressionManager:
"""Abstracts interaction with the regression tests file"""
@ -6,17 +9,21 @@ class RegressionManager:
self.load()
    def load(self) -> None:
        try:
            with open(self.filename, "r") as f:
                self.tests = json.load(f)
        except (FileNotFoundError, json.decoder.JSONDecodeError):
            self.tests = {}

    def save(self) -> None:
        with open(self.filename, "w") as f:
            json.dump(self.tests, f, indent=4)

    def add_test(self, test_name: str, test_details: dict) -> None:
        self.tests[test_name] = test_details
        self.save()

    def remove_test(self, test_name: str) -> None:
        if test_name in self.tests:
            del self.tests[test_name]
            self.save()

View File

@ -0,0 +1,7 @@
{
"TestWriteFile": {
"difficulty": "basic",
"dependencies": [],
"test": "agbenchmark/tests/basic_abilities/write_file/write_file_test.py::TestWriteFile::test_method[challenge_data0-run_agent0]"
}
}

65
poetry.lock generated
View File

@ -368,6 +368,20 @@ files = [
{file = "frozenlist-1.3.3.tar.gz", hash = "sha256:58bcc55721e8a90b88332d6cd441261ebb22342e238296bb330968952fbb3a6a"},
]
[[package]]
name = "future-fstrings"
version = "1.2.0"
description = "A backport of fstrings to python<3.6"
optional = false
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
files = [
{file = "future_fstrings-1.2.0-py2.py3-none-any.whl", hash = "sha256:90e49598b553d8746c4dc7d9442e0359d038c3039d802c91c0a55505da318c63"},
{file = "future_fstrings-1.2.0.tar.gz", hash = "sha256:6cf41cbe97c398ab5a81168ce0dbb8ad95862d3caf23c21e4430627b90844089"},
]
[package.extras]
rewrite = ["tokenize-rt (>=3)"]
[[package]]
name = "idna"
version = "3.4"
@ -473,6 +487,24 @@ files = [
{file = "multidict-6.0.4.tar.gz", hash = "sha256:3666906492efb76453c0e7b97f2cf459b0682e7402c0489a95484965dbc1da49"},
]
[[package]]
name = "networkx"
version = "3.1"
description = "Python package for creating and manipulating graphs and networks"
optional = false
python-versions = ">=3.8"
files = [
{file = "networkx-3.1-py3-none-any.whl", hash = "sha256:4f33f68cb2afcf86f28a45f43efc27a9386b535d567d2127f8f61d51dec58d36"},
{file = "networkx-3.1.tar.gz", hash = "sha256:de346335408f84de0eada6ff9fafafff9bcda11f0a0dfaa931133debb146ab61"},
]
[package.extras]
default = ["matplotlib (>=3.4)", "numpy (>=1.20)", "pandas (>=1.3)", "scipy (>=1.8)"]
developer = ["mypy (>=1.1)", "pre-commit (>=3.2)"]
doc = ["nb2plots (>=0.6)", "numpydoc (>=1.5)", "pillow (>=9.4)", "pydata-sphinx-theme (>=0.13)", "sphinx (>=6.1)", "sphinx-gallery (>=0.12)", "texext (>=0.6.7)"]
extra = ["lxml (>=4.6)", "pydot (>=1.4.2)", "pygraphviz (>=1.10)", "sympy (>=1.10)"]
test = ["codecov (>=2.1)", "pytest (>=7.2)", "pytest-cov (>=4.0)"]
[[package]]
name = "openai"
version = "0.27.8"
@ -595,6 +627,37 @@ tomli = {version = ">=1.0.0", markers = "python_version < \"3.11\""}
[package.extras]
testing = ["argcomplete", "attrs (>=19.2.0)", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "setuptools", "xmlschema"]
[[package]]
name = "pytest-depends"
version = "1.0.1"
description = "Tests that depend on other tests"
optional = false
python-versions = "*"
files = [
{file = "pytest-depends-1.0.1.tar.gz", hash = "sha256:90a28e2b87b75b18abd128c94015248544acac20e4392e9921e5a86f93319dfe"},
{file = "pytest_depends-1.0.1-py3-none-any.whl", hash = "sha256:a1df072bcc93d77aca3f0946903f5fed8af2d9b0056db1dfc9ed5ac164ab0642"},
]
[package.dependencies]
colorama = "*"
future-fstrings = "*"
networkx = "*"
pytest = ">=3"
[[package]]
name = "python-dotenv"
version = "1.0.0"
description = "Read key-value pairs from a .env file and set them as environment variables"
optional = false
python-versions = ">=3.8"
files = [
{file = "python-dotenv-1.0.0.tar.gz", hash = "sha256:a8df96034aae6d2d50a4ebe8216326c61c3eb64836776504fcca410e5937a3ba"},
{file = "python_dotenv-1.0.0-py3-none-any.whl", hash = "sha256:f5971a9226b701070a4bf2c38c89e5a3f0d64de8debda981d1db98583009122a"},
]
[package.extras]
cli = ["click (>=5.0)"]
[[package]]
name = "requests"
version = "2.31.0"
@ -765,4 +828,4 @@ multidict = ">=4.0"
[metadata]
lock-version = "2.0"
python-versions = "^3.9"
content-hash = "a13e69f2bd9e511e1af92ed02b155a90dec38a9b8d983a711e1b67931b467d38"
content-hash = "f8de5e973c92360108aaca1cecc2fdd505f10a9c2975b46c83ea9c24b4af3cfe"

View File

@ -14,6 +14,8 @@ click = "^8.1.3"
requests = "^2.31.0"
openai = "^0.27.8"
pydantic = "^1.10.9"
pytest-depends = "^1.0.1"
python-dotenv = "^1.0.0"
[build-system]
@ -28,7 +30,8 @@ testpaths = [
]
markers = [
"retrieval",
"regression"
"regression",
"basic",
]
[tool.poetry.scripts]