Merge pull request #40 from Significant-Gravitas/feat/basics
addition of basic challenges, easier challenge creation, --mock flag, adding mini-agi
commit 11303e2ef7
@ -0,0 +1,3 @@
AGENT_NAME=mini-agi
AGENT_TIMEOUT=60
MOCK_TEST=False
README.md
@ -2,73 +2,94 @@
|
|||
|
||||
A repo built to benchmark the performance of agents far and wide, regardless of how they are set up and how they work.
|
||||
|
||||
## As a user
|
||||
|
||||
1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to run and kill your agent (see the sketch after this list)
3. `agbenchmark start`
   - `--category challenge_category` to run tests in a specific category
   - `--mock` to only run mock tests if they exist for each test
   - `--noreg` to skip any tests that have passed in the past. If you run without this flag and a previously passing challenge fails, it is removed from the regression tests
4. We call the boilerplate code for your agent
5. Show pass rate of tests, logs, and any other metrics
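The boilerplate in step 2 is agent-specific. Later in this diff, `conftest.py` simply shells out to the agent with `subprocess.run` and a timeout; a hypothetical minimal sketch of such boilerplate (the entry-point file name is an assumption, mirroring mini-agi):

```python
import subprocess


def run_agent(task: str, agent_dir: str, timeout: int = 60) -> None:
    """Start the agent on a task and kill it if it runs past the timeout."""
    try:
        subprocess.run(
            ["python", "miniagi.py", task],  # entry point varies per agent
            check=True,
            cwd=agent_dir,
            timeout=timeout,  # the process is terminated when this expires
        )
    except subprocess.TimeoutExpired:
        print("The agent exceeded the time limit and was terminated.")
```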
|
||||
|
||||
## Contributing
|
||||
|
||||
##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x
|
||||
|
||||
### To run the basic existing mock (June 21)
|
||||
### To run the existing mocks
|
||||
|
||||
1. clone the repo `auto-gpt-benchmarks`
|
||||
2. `pip install poetry`
|
||||
3. `poetry shell`
|
||||
4. `poetry install`
|
||||
5. `agbenchmark start`
|
||||
5. `cp .env_example .env`
|
||||
6. `agbenchmark start --mock`
|
||||
Keep config the same and watch the logs :)
|
||||
|
||||
### To run with mini-agi
|
||||
|
||||
1. Navigate to `auto-gpt-benchmarks/agent/mini-agi`
|
||||
2. `pip install -r requirements.txt`
|
||||
3. `cp .env_example .env`, set `PROMPT_USER=false` and add your `OPENAI_API_KEY=`. Set `MODEL="gpt-3.5-turbo"` if you don't have access to `gpt-4` yet. Also make sure you have Python ^3.10 installed
4. Make sure to follow the commands above, and remove the mock flag: `agbenchmark start`
|
||||
|
||||
- To add requirements, use `poetry add <requirement>`.
|
||||
|
||||
Feel free to create PRs to merge with `main` at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.
|
||||
|
||||
If you push at any point and break things (it'll happen to everyone), fix it ASAP. Step 1 is to revert `main` to the last working commit
|
||||
If you push at any point and break things (it'll happen to everyone), fix it ASAP. Step 1 is to revert `master` to the last working commit
|
||||
|
||||
Let people know what your beautiful code does; document everything well
|
||||
|
||||
Share your progress :)
|
||||
|
||||
## How this works
|
||||
|
||||
1. `pip install auto-gpt-benchmarks`
|
||||
2. Add boilerplate code to start a webserver for your agent (run loop and stop condition)
3. `agbenchmark start --category challenge_category`. Remove the category flag to run all tests. Specify hostname, port, and workspace directory in the config
4. We call the server to run the agent for each test (see the sketch after this list)
|
||||
5. Show pass rate of tests, logs, and any other metrics
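Step 4 is currently stubbed out in `conftest.py` (the `requests` call is commented out in favour of the `--mock` path); a minimal sketch of what that client call looks like, assuming the agent's server accepts a form-encoded `task` at its root URL — the `http://` scheme and return shape are assumptions, not part of this PR:

```python
import requests


def call_agent_server(hostname: str, port: int, task: str, timeout: int = 60) -> str:
    """Hypothetical client call; mirrors the commented-out request in conftest.py."""
    response = requests.post(
        f"http://{hostname}:{port}",  # conftest omits the scheme
        data={"task": task},
        timeout=timeout,
    )
    response.raise_for_status()  # raise on 4xx/5xx, as in the commented-out code
    return response.text
```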
|
||||
|
||||
### To run the basic existing mock (June 21)
|
||||
|
||||
1. clone the repo `auto-gpt-benchmarks`
|
||||
2. `pip install poetry`
|
||||
3. `poetry shell`
|
||||
4. `poetry install`
|
||||
5. `agbenchmark start`
|
||||
Keep config the same and watch the logs :)
|
||||
|
||||
#### Bonuses
|
||||
|
||||
- You can add tests by git cloning auto-gpt-benchmarks into your repo
- Agent is abstracted from benchmark, don't need to do any extra setup other than starting the server
|
||||
- Simple, easy to use
|
||||
- Don't have to deal with cloud or parallelization yet
|
||||
|
||||
### Pytest
|
||||
|
||||
To create a test:
An example of a test is below. Use it as a template and change the class name, the .json name, what the test depends on and its name, and the scoring logic
|
||||
|
||||
```
@pytest.mark.parametrize(
    "server_response",
    ["VARIABLE"],  # VARIABLE = the query/goal you provide to the model
    indirect=True,
)
@pytest.mark.(VARIABLE)  # VARIABLE = category of the test
def test_file_in_workspace(workspace):  # VARIABLE = the actual test that asserts
    assert os.path.exists(os.path.join(workspace, "file_to_check.txt"))
```

```python
import pytest
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os


class TestWriteFile(BasicChallenge):
    """Testing if LLM can write to a file"""

    def get_file_path(self) -> str:  # all tests must implement this method
        return os.path.join(os.path.dirname(__file__), "w_file_data.json")

    @pytest.mark.depends(on=[], name="basic_write_file")
    def test_method(self, workspace):
        # implement scoring logic by looking at workspace
```
|
||||
|
||||
## Api
|
||||
All challenges will inherit from a parent class which has the mark and any specific methods for their category

FastAPI with REST; `requests` is imported in auto-gpt-benchmarks to call it. Boilerplate code is given to the agent project to start the server
|
||||
```python
@pytest.mark.basic
class BasicChallenge(Challenge):
    pass
```
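A minimal sketch of the server-side boilerplate described above (FastAPI with a single REST endpoint). The route, form parsing, and host/port are assumptions, not part of this PR; the pre-existing `config.json` defaulted to `localhost:8080`:

```python
import uvicorn
from fastapi import FastAPI, Form

app = FastAPI()


@app.post("/")
def run_task(task: str = Form(...)) -> dict:  # form parsing needs python-multipart
    # hand the task to the agent's run loop here and block until its stop condition
    return {"response": f"agent output for: {task}"}


if __name__ == "__main__":
    uvicorn.run(app, host="localhost", port=8080)
```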
|
||||
|
||||
To create a file for testing a challenge, add this fixture to the challenge file; it will create the file before running the server
|
||||
|
||||
```python
@pytest.fixture(
    scope="module", autouse=True
)  # this is specific to setting up a file for the test, not all tests have this
def setup_module(self, workspace):
    Challenge.write_to_file(
        workspace, self.data.ground.files[0], "this is how we're doing"
    )
```
|
||||
|
||||
#### The main Challenge class has all the parametrization and loading logic so that all tests can inherit from it. It lives within [this file](https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/blob/master/agbenchmark/Challenge.py)
|
||||
|
||||
## Workspace
|
||||
|
||||
Defined by the user in the config.
If the `--mock` flag is used it is at `agbenchmark/mocks/workspace`. Otherwise, for mini-agi it is at `C:/Users/<name>/miniagi` - it will be set automatically in the config
|
||||
|
||||
#### Dataset
|
||||
|
||||
|
@ -80,9 +101,9 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
|
|||
|-- auto-gpt-benchmarks/ **main project directory**
|
||||
| |-- metrics.py **combining scores, metrics, final evaluation**
|
||||
| |-- start_benchmark.py **entry point from cli**
|
||||
| |-- conftest.py **shared fixtures across all tests**
|
||||
| |-- Challenge.py **easy challenge creation class?**
|
||||
| |-- config.json **hostname, port, workspace folder**
|
||||
| |-- conftest.py **config, workspace creation + teardown, regression test markers, parameterization**
|
||||
| |-- Challenge.py **easy challenge creation class**
|
||||
| |-- config.json **workspace folder**
|
||||
| |-- challenges/ **challenges across different domains**
|
||||
| | |-- adaptability/
|
||||
| | |-- basic_abilities/
|
||||
|
@ -91,28 +112,7 @@ Manually created, existing challenges within Auto-Gpt, https://osu-nlp-group.git
|
|||
| | |-- retrieval/
|
||||
| | |-- web_navigation/
|
||||
| | |-- writing/
|
||||
| |-- tests/ **challenges across different metrics**
|
||||
| | |-- basic_abilities/
|
||||
| | |-- interface/
|
||||
| |-- workspace/ **workspace related func**
|
||||
| | |-- **init**.py
|
||||
| | |-- workspace_manager.py **creation, deletion**
|
||||
| |-- tests/
|
||||
| | |-- basic_abilities/ **every llm should pass these challenges**
|
||||
| | |-- regression/ **challenges that already passed**
|
||||
```
|
||||
|
||||
### Easy Challenge Creation
|
||||
|
||||
TBD, but potentially a shared Challenge class that challenges instantiate, since challenges need different utils/metrics for evaluation
|
||||
|
||||
#### Written Challenges
|
||||
|
||||
For code and writing challenges we can create a reference text and use metrics like METEOR, BERTScore, and BARTScore, as sketched below
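A minimal sketch of scoring generated text against a reference with METEOR, assuming `nltk` is installed and its WordNet data has been downloaded (none of this is wired into the benchmark yet):

```python
# pip install nltk && python -c "import nltk; nltk.download('wordnet')"
from nltk.translate.meteor_score import meteor_score

reference = "Washington is the capital of the United States".split()
candidate = "the capital of the United States is Washington".split()

# recent nltk versions expect pre-tokenized references and hypothesis
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")
```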
|
||||
|
||||
#### Validators
|
||||
|
||||
Designed to handle specific types of output (e.g., text, code, structured data)
|
||||
|
||||
#### Logging
|
||||
|
||||
Log different requests coming in - write file, change file, etc. Maybe a db in the future for metrics, logs, etc
|
||||
|
||||
Later: GitHub Actions integration, OpenAPI?, good versioning and backward compatibility
|
||||
|
|
|
@ -1,11 +1,63 @@
|
|||
import os
|
||||
from typing import Optional
|
||||
import glob
|
||||
import pytest
|
||||
from abc import ABC, abstractmethod
|
||||
from agbenchmark.challenges.define_task_types import Ground
|
||||
from agbenchmark.challenges.define_task_types import ChallengeData
|
||||
from dotenv import load_dotenv, set_key
|
||||
|
||||
load_dotenv()
|
||||
|
||||
mock_test_str = os.getenv("MOCK_TEST")
|
||||
MOCK_TEST = mock_test_str.lower() == "true" if mock_test_str else False
|
||||
|
||||
|
||||
class Challenge:
|
||||
class Challenge(ABC):
|
||||
"""The parent class to all specific challenges classes.
|
||||
Defines helper methods for running a challenge"""
|
||||
|
||||
@abstractmethod
|
||||
def get_file_path(self) -> str:
|
||||
"""This should be implemented by any class which inherits from BasicChallenge"""
|
||||
pass
|
||||
|
||||
@property
|
||||
def data(self) -> ChallengeData:
|
||||
return ChallengeData.deserialize(self.get_file_path())
|
||||
|
||||
@property
|
||||
def mock(self):
|
||||
return self.data.mock.mock_func if self.data.mock else None
|
||||
|
||||
@property
|
||||
def task(self):
|
||||
return (
|
||||
self.data.mock.mock_task if self.data.mock and MOCK_TEST else self.data.task
|
||||
)
|
||||
|
||||
@property
|
||||
def dependencies(self) -> list:
|
||||
print("self.data.dependencies", self.data.dependencies)
|
||||
return self.data.dependencies
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
print("self.data.name", self.data.name)
|
||||
return self.data.name
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"run_agent",
|
||||
[(task, mock)],
|
||||
indirect=True,
|
||||
)
|
||||
@pytest.mark.parametrize(
|
||||
"challenge_data",
|
||||
[data],
|
||||
indirect=True,
|
||||
)
|
||||
def test_method(self, workspace):
|
||||
raise NotImplementedError
|
||||
|
||||
@staticmethod
|
||||
def open_file(workspace: str, filename: str):
|
||||
script_dir = os.path.abspath(workspace)
|
||||
|
@ -13,6 +65,26 @@ class Challenge:
|
|||
with open(workspace_dir, "r") as f:
|
||||
return f.read()
|
||||
|
||||
@staticmethod
|
||||
def open_files(workspace: str, file_patterns: list):
|
||||
script_dir = os.path.abspath(workspace)
|
||||
files_contents = []
|
||||
|
||||
for file_pattern in file_patterns:
|
||||
# Check if it is a file extension
|
||||
if file_pattern.startswith("."):
|
||||
# Find all files with the given extension in the workspace
|
||||
matching_files = glob.glob(os.path.join(script_dir, "*" + file_pattern))
|
||||
else:
|
||||
# Otherwise, it is a specific file
|
||||
matching_files = [os.path.join(script_dir, file_pattern)]
|
||||
|
||||
for file_path in matching_files:
|
||||
with open(file_path, "r") as f:
|
||||
files_contents.append(f.read())
|
||||
|
||||
return files_contents
|
||||
|
||||
@staticmethod
|
||||
def write_to_file(workspace: str, filename: str, content: str):
|
||||
script_dir = os.path.abspath(workspace)
|
||||
|
@ -30,3 +102,24 @@ class Challenge:
|
|||
for filename in os.listdir(workspace)
|
||||
if os.path.isfile(os.path.join(workspace, filename))
|
||||
]
|
||||
|
||||
def scoring(self, content: str, ground: Ground):
|
||||
if ground.should_contain:
|
||||
for should_contain_word in ground.should_contain:
|
||||
if should_contain_word not in content:
|
||||
return 0.0
|
||||
else:
|
||||
print(
|
||||
f"Word that should exist: {should_contain_word} exists in the content"
|
||||
)
|
||||
|
||||
if ground.should_not_contain:
|
||||
for should_not_contain_word in ground.should_not_contain:
|
||||
if should_not_contain_word in content:
|
||||
return 0.0
|
||||
else:
|
||||
print(
|
||||
f"Word that should not exist: {should_not_contain_word} does not exist in the content"
|
||||
)
|
||||
|
||||
return 1.0
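For illustration (not part of this PR), a small sketch of how `scoring` behaves with a `Ground` object; the throwaway subclass only exists so the ABC can be instantiated:

```python
from agbenchmark.Challenge import Challenge
from agbenchmark.challenges.define_task_types import Ground


class _StubChallenge(Challenge):
    def get_file_path(self) -> str:  # required by the ABC, unused here
        return ""


ground = Ground(
    answer="Washington",
    should_contain=["Washington"],
    should_not_contain=["New York"],
    files=[".txt"],
)

# 1.0: every should_contain string is present and no should_not_contain string appears
print(_StubChallenge().scoring("Washington is the capital", ground))
# 0.0: "Washington" is missing (and "New York" is forbidden)
print(_StubChallenge().scoring("New York is the capital", ground))
```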
|
||||
|
|
|
@ -4,40 +4,49 @@
|
|||
|
||||
Input:
|
||||
|
||||
- **category** (str): information-retrieval
|
||||
- **difficulty**(str): the difficulty of this query. choices from
|
||||
|
||||
## Information-retrieval challenges
|
||||
|
||||
Input:
|
||||
|
||||
- **category** (str): information-retrieval
|
||||
- **task** (str): the question the agent needs to solve.
|
||||
- **name** (str): Name of the challenge.
|
||||
- **category** (str[]): Category of the challenge such as 'basic', 'retrieval', 'comprehension', etc. _This is not currently used; it may be needed in the future._
|
||||
- **task** (str): The task that the agent needs to solve.
|
||||
- **dependencies** (str[]): The dependencies that the challenge needs to run. Needs to be the full node to the test function.
|
||||
- **ground** (dict): The ground truth.
|
||||
- **answer** (str): The raw text of ground truth answer
- **should_contain** (list): the exact strings that are required in the final answer
- **should_not_contain** (list): the exact strings that should not be in the final answer
- **files**: files that are used for retrieval. Can specify a file here or an extension **TODO:** like .txt
- **difficulty** (str): the difficulty of this query. choices from
- **mock_func**: function to mock the agent's response. This is used for testing purposes
|
||||
- **answer** (str): The raw text of the ground truth answer.
|
||||
- **should_contain** (list): The exact strings that are required in the final answer.
|
||||
- **should_not_contain** (list): The exact strings that should not be in the final answer.
|
||||
- **files** (list): Files that are used for retrieval. Can specify file here or an extension.
|
||||
- **mock** (dict): Mock response for testing.
|
||||
- **mock_func** (str): Function to mock the agent's response. This is used for testing purposes.
|
||||
- **mock_task** (str): Task to provide for the mock function.
|
||||
- **info** (dict): Additional info about the challenge.
|
||||
- **difficulty** (str): The difficulty of this query.
|
||||
- **description** (str): Description of the challenge.
|
||||
- **side_effects** (str[]): Describes the effects of the challenge.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
{
|
||||
"category": "retrieval",
|
||||
"task": "What is the capital of America?",
|
||||
"name": "basic_write_file",
|
||||
"category": ["basic"],
|
||||
"task": "Print the the capital of America to a .txt file",
|
||||
"dependencies": [],
|
||||
"ground": {
|
||||
"answer": "Washington",
|
||||
"should_contain": ["Washington"],
|
||||
"should_not_contain": ["New York", "Los Angeles", "San Francisco"],
|
||||
"files": ["file_to_check.txt"]
|
||||
"files": [".txt"]
|
||||
},
|
||||
"difficulty": "easy"
|
||||
"mock": {
|
||||
"mock_func": "basic_write_file_mock",
|
||||
"mock_task": "What is the capital of America?"
|
||||
},
|
||||
"info": {
|
||||
"difficulty": "basic",
|
||||
"description": "Tests the writing to file",
|
||||
"side_effects": ["tests if there is in fact an LLM attached"]
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
Output:
|
||||
Current Output:
|
||||
|
||||
- **score** (float): scores range from [0, 1]
|
||||
|
|
|
@ -4,27 +4,40 @@ import json
|
|||
import os
|
||||
|
||||
|
||||
class Mock(BaseModel):
|
||||
mock_func: str
|
||||
mock_task: Optional[str] = None
|
||||
|
||||
|
||||
class Info(BaseModel):
|
||||
difficulty: str
|
||||
description: str
|
||||
side_effects: List[str]
|
||||
|
||||
|
||||
class Ground(BaseModel):
|
||||
answer: str
|
||||
should_contain: Optional[List[str]]
|
||||
should_not_contain: Optional[List[str]]
|
||||
should_contain: Optional[List[str]] = None
|
||||
should_not_contain: Optional[List[str]] = None
|
||||
files: List[str]
|
||||
|
||||
|
||||
class Challenge(BaseModel):
|
||||
category: str
|
||||
class ChallengeData(BaseModel):
|
||||
name: str
|
||||
category: List[str]
|
||||
task: str
|
||||
dependencies: List[str]
|
||||
ground: Ground
|
||||
difficulty: str
|
||||
mock_func: Optional[str] = None
|
||||
mock: Optional[Mock] = None
|
||||
info: Info
|
||||
|
||||
def serialize(self, path: str) -> None:
|
||||
with open(path, "w") as file:
|
||||
file.write(self.json())
|
||||
|
||||
@staticmethod
|
||||
def deserialize(path: str) -> "Challenge":
|
||||
def deserialize(path: str) -> "ChallengeData":
|
||||
print("Deserializing", path)
|
||||
with open(path, "r") as file:
|
||||
data = json.load(file)
|
||||
return Challenge(**data)
|
||||
return ChallengeData(**data)
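A minimal usage sketch for these models (not part of this PR; the path assumes a challenge definition such as `w_file_data.json` sits next to the caller):

```python
import os

from agbenchmark.challenges.define_task_types import ChallengeData

data = ChallengeData.deserialize(
    os.path.join(os.path.dirname(__file__), "w_file_data.json")
)

print(data.name, data.category, data.dependencies)
print(data.ground.should_contain)
if data.mock:
    print("mock:", data.mock.mock_func, data.mock.mock_task)
```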
|
||||
|
|
|
@ -1,27 +1,9 @@
|
|||
from agbenchmark.Challenge import Challenge
|
||||
from agbenchmark.challenges.define_task_types import Ground
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.retrieval
|
||||
class RetrievalChallenge(Challenge):
|
||||
"""Challenge for information-retrieval"""
|
||||
|
||||
def scoring(self, content: str, ground: Ground):
|
||||
if ground.should_contain:
|
||||
for should_contain_word in ground.should_contain:
|
||||
if should_contain_word not in content:
|
||||
return 0.0
|
||||
else:
|
||||
print(
|
||||
f"Word that should exist: {should_contain_word} exists in the content"
|
||||
)
|
||||
|
||||
if ground.should_not_contain:
|
||||
for should_not_contain_word in ground.should_not_contain:
|
||||
if should_not_contain_word in content:
|
||||
return 0.0
|
||||
else:
|
||||
print(
|
||||
f"Word that should not exist: {should_not_contain_word} does not exist in the content"
|
||||
)
|
||||
|
||||
return 1.0
|
||||
pass
|
||||
|
|
|
@ -1,12 +1,21 @@
|
|||
{
|
||||
"category": "retrieval",
|
||||
"task": "What is the capital of America?",
|
||||
"name": "retrieval1",
|
||||
"category": ["basic"],
|
||||
"task": "Print the the capital of America to a .txt file",
|
||||
"dependencies": [],
|
||||
"ground": {
|
||||
"answer": "Washington",
|
||||
"should_contain": ["Washington"],
|
||||
"should_not_contain": ["New York", "Los Angeles", "San Francisco"],
|
||||
"files": ["file_to_check.txt"]
|
||||
"files": [".txt"]
|
||||
},
|
||||
"difficulty": "easy",
|
||||
"mock_func": "retrieval_1_mock"
|
||||
"mock": {
|
||||
"mock_func": "basic_write_file_mock",
|
||||
"mock_task": "What is the capital of America?"
|
||||
},
|
||||
"info": {
|
||||
"difficulty": "basic",
|
||||
"description": "Tests the writing to file",
|
||||
"side_effects": ["tests if there is in fact an LLM attached"]
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,25 +1,22 @@
|
|||
import pytest
|
||||
from agbenchmark.challenges.retrieval.Retrieval import RetrievalChallenge
|
||||
from agbenchmark.challenges.define_task_types import Challenge, Ground
|
||||
from agbenchmark.challenges.define_task_types import ChallengeData, Ground
|
||||
import os
|
||||
|
||||
data = Challenge.deserialize(os.path.join(os.path.dirname(__file__), "r1_data.json"))
|
||||
|
||||
|
||||
class TestRetrieval1(RetrievalChallenge):
|
||||
"""The first information-retrieval challenge"""
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"server_response",
|
||||
[(data.task, data.mock_func)],
|
||||
indirect=True,
|
||||
)
|
||||
@pytest.mark.retrieval
|
||||
def test_retrieval(self, workspace):
|
||||
file = self.open_file(workspace, data.ground.files[0])
|
||||
def get_file_path(self) -> str: # all tests must implement this method
|
||||
return os.path.join(os.path.dirname(__file__), "r1_data.json")
|
||||
|
||||
score = self.scoring(file, data.ground)
|
||||
def test_method(self, workspace):
|
||||
files_contents = self.open_files(workspace, self.data.ground.files)
|
||||
|
||||
print("You score is:", score)
|
||||
scores = []
|
||||
for file_content in files_contents:
|
||||
score = self.scoring(file_content, self.data.ground)
|
||||
print("Your score is:", score)
|
||||
scores.append(score)
|
||||
|
||||
assert score
|
||||
assert 1 in scores
|
||||
|
|
|
@ -1,5 +1,3 @@
|
|||
{
|
||||
"hostname": "localhost",
|
||||
"port": 8080,
|
||||
"workspace": "agbenchmark/mocks/workspace"
|
||||
"hostname": "localhost"
|
||||
}
|
||||
|
|
|
@ -4,20 +4,28 @@ import pytest
|
|||
import shutil
|
||||
from agbenchmark.tests.regression.RegressionManager import RegressionManager
|
||||
import requests
|
||||
from requests.exceptions import RequestException
|
||||
from agbenchmark.mocks.MockManager import MockManager
|
||||
import subprocess
|
||||
from agbenchmark.Challenge import Challenge
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def config():
|
||||
def config(request):
|
||||
config_file = os.path.abspath("agbenchmark/config.json")
|
||||
print(f"Config file: {config_file}")
|
||||
with open(config_file, "r") as f:
|
||||
config = json.load(f)
|
||||
|
||||
if request.config.getoption("--mock"):
|
||||
config["workspace"] = "agbenchmark/mocks/workspace"
|
||||
|
||||
return config
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope="module")
|
||||
def workspace(config):
|
||||
yield config["workspace"]
|
||||
# teardown after test function completes
|
||||
|
@ -32,61 +40,87 @@ def workspace(config):
|
|||
print(f"Failed to delete {file_path}. Reason: {e}")
|
||||
|
||||
|
||||
def pytest_addoption(parser):
|
||||
parser.addoption("--mock", action="store_true", default=False)
|
||||
|
||||
|
||||
AGENT_NAME = os.getenv("AGENT_NAME")
|
||||
AGENT_TIMEOUT = os.getenv("AGENT_TIMEOUT")
|
||||
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def server_response(request, config):
|
||||
def run_agent(request, config):
|
||||
"""Calling to get a response"""
|
||||
if isinstance(request.param, tuple):
|
||||
task = request.param[0] # The task is passed in indirectly
|
||||
mock_function_name = request.param[1]
|
||||
mock_function_name = request.param[1] or None
|
||||
else:
|
||||
task = request.param
|
||||
mock_function_name = None
|
||||
# print(f"Server starting at {request.module}")
|
||||
# try:
|
||||
# response = requests.post(
|
||||
# f"{config['hostname']}:{config['port']}", data={"task": task}
|
||||
# )
|
||||
# response.raise_for_status() # This will raise an HTTPError if the status is 4xx or 5xx
|
||||
# except RequestException:
|
||||
# # If an exception occurs (could be connection, timeout, or HTTP errors), we use the mock
|
||||
|
||||
if mock_function_name:
|
||||
mock_manager = MockManager(
|
||||
task
|
||||
) # workspace doesn't need to be passed in, stays the same
|
||||
print("Server unavailable, using mock", mock_function_name)
|
||||
mock_manager.delegate(mock_function_name)
|
||||
if mock_function_name != None and (request.config.getoption("--mock")):
|
||||
if mock_function_name:
|
||||
mock_manager = MockManager(
|
||||
task
|
||||
) # workspace doesn't need to be passed in, stays the same
|
||||
print("Server unavailable, using mock", mock_function_name)
|
||||
mock_manager.delegate(mock_function_name)
|
||||
else:
|
||||
print("No mock provided")
|
||||
else:
|
||||
print("No mock provided")
|
||||
path = os.path.join(os.getcwd(), "agent", AGENT_NAME)
|
||||
|
||||
# else:
|
||||
# # This code is run if no exception occurred
|
||||
# print(f"Request succeeded with status code {response.status_code}")
|
||||
try:
|
||||
timeout = int(AGENT_TIMEOUT) if AGENT_TIMEOUT is not None else 60
|
||||
|
||||
subprocess.run(
|
||||
["python", "miniagi.py", task],
|
||||
check=True,
|
||||
cwd=path,
|
||||
timeout=timeout
|
||||
# text=True,
|
||||
# capture_output=True
|
||||
)
|
||||
except subprocess.TimeoutExpired:
|
||||
print("The subprocess has exceeded the time limit and was terminated.")
|
||||
|
||||
|
||||
regression_txt = "agbenchmark/tests/regression/regression_tests.txt"
|
||||
regression_json = "agbenchmark/tests/regression/regression_tests.json"
|
||||
|
||||
regression_manager = RegressionManager(regression_txt)
|
||||
regression_manager = RegressionManager(regression_json)
|
||||
|
||||
|
||||
# this is to get the challenge_data from every test
|
||||
@pytest.fixture(autouse=True)
|
||||
def challenge_data(request):
|
||||
return request.param
|
||||
|
||||
|
||||
def pytest_runtest_makereport(item, call):
|
||||
"""Called for each test report. Generated for each stage
|
||||
of a test run (setup, call, teardown)."""
|
||||
if call.when == "call":
|
||||
if (
|
||||
call.excinfo is None
|
||||
): # if no error in the call stage, add it as a regression test
|
||||
regression_manager.add_test(item.nodeid)
|
||||
else: # otherwise, :(
|
||||
regression_manager.remove_test(item.nodeid)
|
||||
challenge_data = item.funcargs.get("challenge_data", None)
|
||||
difficulty = challenge_data.info.difficulty if challenge_data else "unknown"
|
||||
dependencies = challenge_data.dependencies if challenge_data else []
|
||||
|
||||
test_details = {
|
||||
"difficulty": difficulty,
|
||||
"dependencies": dependencies,
|
||||
"test": item.nodeid,
|
||||
}
|
||||
|
||||
print("pytest_runtest_makereport", test_details)
|
||||
if call.excinfo is None:
|
||||
regression_manager.add_test(item.nodeid.split("::")[1], test_details)
|
||||
else:
|
||||
regression_manager.remove_test(item.nodeid.split("::")[1])
|
||||
|
||||
|
||||
def pytest_collection_modifyitems(items):
|
||||
"""Called once all test items are collected. Used
|
||||
to add regression marker to collected test items."""
|
||||
to add regression and depends markers to collected test items."""
|
||||
for item in items:
|
||||
print("pytest_collection_modifyitems", item.nodeid)
|
||||
if item.nodeid + "\n" in regression_manager.tests:
|
||||
# regression add
|
||||
if item.nodeid.split("::")[1] in regression_manager.tests:
|
||||
print(regression_manager.tests)
|
||||
item.add_marker(pytest.mark.regression)
|
||||
|
||||
|
@ -94,3 +128,26 @@ def pytest_collection_modifyitems(items):
|
|||
def pytest_sessionfinish():
|
||||
"""Called at the end of the session to save regression tests"""
|
||||
regression_manager.save()
|
||||
|
||||
|
||||
# this is so that all tests can inherit from the Challenge class
|
||||
def pytest_generate_tests(metafunc):
|
||||
if "challenge_data" in metafunc.fixturenames:
|
||||
# Get the instance of the test class
|
||||
test_class = metafunc.cls()
|
||||
|
||||
# Generate the parameters
|
||||
params = test_class.data
|
||||
|
||||
# Add the parameters to the test function
|
||||
metafunc.parametrize("challenge_data", [params], indirect=True)
|
||||
|
||||
if "run_agent" in metafunc.fixturenames:
|
||||
# Get the instance of the test class
|
||||
test_class = metafunc.cls()
|
||||
|
||||
# Generate the parameters
|
||||
params = [(test_class.task, test_class.mock)]
|
||||
|
||||
# Add the parameters to the test function
|
||||
metafunc.parametrize("run_agent", params, indirect=True)
|
||||
|
|
|
@ -1,20 +0,0 @@
|
|||
import json
|
||||
import openai
|
||||
|
||||
|
||||
def basic_gpt_agent(query) -> str:
|
||||
response = openai.ChatCompletion.create(
|
||||
model="gpt-3.5-turbo-0613", messages=[{"role": "user", "content": query}]
|
||||
)
|
||||
|
||||
answer = response["choices"][0]["message"]["content"] # type: ignore
|
||||
|
||||
print("QUERY : ", query)
|
||||
print("AGENT ANSWER: ", answer)
|
||||
|
||||
return answer
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# server boilerplate example here
|
||||
basic_gpt_agent("")
|
|
@ -0,0 +1,24 @@
|
|||
from agbenchmark.Challenge import Challenge
|
||||
|
||||
|
||||
def basic_read_file_mock(task: str, workspace: str):
|
||||
"""
|
||||
This mock reads a file and returns its content.
|
||||
"""
|
||||
|
||||
file_contents = Challenge.open_file(workspace, "file_to_check.txt")
|
||||
|
||||
Challenge.write_to_file(
|
||||
workspace, "file_to_check.txt", f"random string: {file_contents}"
|
||||
)
|
||||
|
||||
|
||||
def basic_write_file_mock(task: str, workspace: str):
|
||||
"""
|
||||
This mock writes to a file (creates one if it doesn't exist)
|
||||
"""
|
||||
Challenge.write_to_file(
|
||||
workspace,
|
||||
"file_to_check.txt",
|
||||
"Washington DC is the capital of the United States of America",
|
||||
)
|
|
@ -1,4 +1,3 @@
|
|||
from ..basic_gpt_agent import basic_gpt_agent
|
||||
from agbenchmark.Challenge import Challenge
|
||||
|
||||
|
||||
|
@ -6,8 +5,4 @@ from agbenchmark.Challenge import Challenge
|
|||
# Prerequisites here would be writing to a file (basic_abilities test).
|
||||
# Should also check if prerequisites exists in regression file
|
||||
def retrieval_1_mock(task: str, workspace: str):
|
||||
# Call the basic_gpt_agent to get a response.
|
||||
response = basic_gpt_agent(task)
|
||||
|
||||
# Open the file in write mode.
|
||||
Challenge.write_to_file(workspace, "file_to_check.txt", response)
|
||||
pass
|
||||
|
|
|
@ -2,6 +2,10 @@ import click
|
|||
import pytest
|
||||
import json
|
||||
import os
|
||||
from pathlib import Path
|
||||
from dotenv import load_dotenv, set_key
|
||||
|
||||
load_dotenv()
|
||||
|
||||
|
||||
@click.group()
|
||||
|
@ -12,8 +16,8 @@ def cli():
|
|||
@cli.command()
|
||||
@click.option("--category", default=None, help="Specific category to run")
|
||||
@click.option("--noreg", is_flag=True, help="Skip regression tests")
|
||||
def start(category, noreg):
|
||||
"""Start the benchmark tests. If a category flag is is provided, run the categories with that mark."""
|
||||
@click.option("--mock", is_flag=True, help="Run with mock")
|
||||
def start(category, noreg, mock):
|
||||
"""Start the benchmark tests. If a category flag is provided, run the categories with that mark."""
|
||||
config_file = "agbenchmark/config.json"
|
||||
|
||||
|
@ -23,12 +27,9 @@ def start(category, noreg):
|
|||
if not os.path.exists(config_dir) or os.stat(config_dir).st_size == 0:
|
||||
config = {}
|
||||
|
||||
config["hostname"] = click.prompt(
|
||||
"\nPlease enter a new hostname", default="localhost"
|
||||
)
|
||||
config["port"] = click.prompt("Please enter a new port", default=8080)
|
||||
config["workspace"] = click.prompt(
|
||||
"Please enter a new workspace path", default="agbenchmark/mocks/workspace"
|
||||
"Please enter a new workspace path",
|
||||
default=os.path.join(Path.home(), "miniagi"),
|
||||
)
|
||||
|
||||
with open(config_dir, "w") as f:
|
||||
|
@ -38,13 +39,17 @@ def start(category, noreg):
|
|||
with open(config_dir, "r") as f:
|
||||
config = json.load(f)
|
||||
|
||||
set_key(".env", "MOCK_TEST", "True" if mock else "False")
|
||||
if mock:
|
||||
config["workspace"] = "agbenchmark/mocks/workspace"
|
||||
|
||||
# create workspace directory if it doesn't exist
|
||||
workspace_path = os.path.abspath(config["workspace"])
|
||||
if not os.path.exists(workspace_path):
|
||||
os.makedirs(workspace_path, exist_ok=True)
|
||||
|
||||
regression_path = os.path.abspath(
|
||||
"agbenchmark/tests/regression/regression_tests.txt"
|
||||
"agbenchmark/tests/regression/regression_tests.json"
|
||||
)
|
||||
if not os.path.exists(regression_path):
|
||||
with open(regression_path, "a"):
|
||||
|
@ -74,6 +79,9 @@ def start(category, noreg):
|
|||
else:
|
||||
print("Running all categorys") # run all categorys
|
||||
|
||||
if mock:
|
||||
pytest_args.append("--mock")
|
||||
|
||||
# Run pytest with the constructed arguments
|
||||
pytest.main(pytest_args)
|
||||
|
||||
|
|
|
@ -0,0 +1,9 @@
|
|||
import pytest
|
||||
from agbenchmark.Challenge import Challenge
|
||||
from agbenchmark.challenges.define_task_types import ChallengeData
|
||||
from abc import abstractmethod
|
||||
|
||||
|
||||
@pytest.mark.basic
|
||||
class BasicChallenge(Challenge):
|
||||
pass
|
|
@ -0,0 +1,19 @@
|
|||
{
|
||||
"name": "basic_read_file",
|
||||
"category": ["basic"],
|
||||
"task": "Write the string 'random string' before any existing text to the file called file_to_check.txt",
|
||||
"dependencies": ["basic_write_file"],
|
||||
"ground": {
|
||||
"answer": "random string: this is how we're doing",
|
||||
"should_contain": ["random string: this is how we're doing"],
|
||||
"files": ["file_to_check.txt"]
|
||||
},
|
||||
"mock": {
|
||||
"mock_func": "basic_read_file_mock"
|
||||
},
|
||||
"info": {
|
||||
"description": "This reads the file quickly",
|
||||
"difficulty": "basic",
|
||||
"side_effects": [""]
|
||||
}
|
||||
}
|
|
@ -0,0 +1,31 @@
|
|||
import pytest
|
||||
from agbenchmark.Challenge import Challenge
|
||||
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
|
||||
import os
|
||||
|
||||
|
||||
class TestReadFile(BasicChallenge):
|
||||
"""Testing if LLM can read a file"""
|
||||
|
||||
@pytest.fixture(scope="module", autouse=True)
|
||||
def setup_module(self, workspace):
|
||||
Challenge.write_to_file(
|
||||
workspace, self.data.ground.files[0], "this is how we're doing"
|
||||
)
|
||||
|
||||
def get_file_path(self) -> str: # all tests must implement this method
|
||||
return os.path.join(os.path.dirname(__file__), "r_file_data.json")
|
||||
|
||||
@pytest.mark.depends(on=["basic_write_file"], name="basic_read_file")
|
||||
def test_method(
|
||||
self, workspace
|
||||
):  # test_method is a common name that all tests must implement
|
||||
files_contents = self.open_files(workspace, self.data.ground.files)
|
||||
|
||||
scores = []
|
||||
for file_content in files_contents:
|
||||
score = self.scoring(file_content, self.data.ground)
|
||||
print("Your score is:", score)
|
||||
scores.append(score)
|
||||
|
||||
assert 1 in scores
|
|
@ -0,0 +1,21 @@
|
|||
{
|
||||
"name": "basic_write_file",
|
||||
"category": ["basic"],
|
||||
"task": "Print the the capital of America to a .txt file",
|
||||
"dependencies": [],
|
||||
"ground": {
|
||||
"answer": "Washington",
|
||||
"should_contain": ["Washington"],
|
||||
"should_not_contain": ["New York", "Los Angeles", "San Francisco"],
|
||||
"files": [".txt"]
|
||||
},
|
||||
"mock": {
|
||||
"mock_func": "basic_write_file_mock",
|
||||
"mock_task": "What is the capital of America?"
|
||||
},
|
||||
"info": {
|
||||
"difficulty": "basic",
|
||||
"description": "Tests the writing to file",
|
||||
"side_effects": ["tests if there is in fact an LLM attached"]
|
||||
}
|
||||
}
|
|
@ -0,0 +1,23 @@
|
|||
import pytest
|
||||
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
|
||||
import os
|
||||
|
||||
|
||||
class TestWriteFile(BasicChallenge):
|
||||
"""Testing if LLM can write to a file"""
|
||||
|
||||
def get_file_path(self) -> str: # all tests must implement this method
|
||||
return os.path.join(os.path.dirname(__file__), "w_file_data.json")
|
||||
|
||||
@pytest.mark.depends(on=[], name="basic_write_file")
|
||||
def test_method(self, workspace):
|
||||
print("my workspace is ", workspace)
|
||||
files_contents = self.open_files(workspace, self.data.ground.files)
|
||||
|
||||
scores = []
|
||||
for file_content in files_contents:
|
||||
score = self.scoring(file_content, self.data.ground)
|
||||
print("Your score is:", score)
|
||||
scores.append(score)
|
||||
|
||||
assert 1 in scores
|
|
@ -1,3 +1,6 @@
|
|||
import json
|
||||
|
||||
|
||||
class RegressionManager:
|
||||
"""Abstracts interaction with the regression tests file"""
|
||||
|
||||
|
@ -6,17 +9,21 @@ class RegressionManager:
|
|||
self.load()
|
||||
|
||||
def load(self) -> None:
|
||||
with open(self.filename, "r") as f:
|
||||
self.tests = f.readlines()
|
||||
try:
|
||||
with open(self.filename, "r") as f:
|
||||
self.tests = json.load(f)
|
||||
except (FileNotFoundError, json.decoder.JSONDecodeError):
|
||||
self.tests = {}
|
||||
|
||||
def save(self) -> None:
|
||||
with open(self.filename, "w") as f:
|
||||
f.writelines(self.tests)
|
||||
json.dump(self.tests, f, indent=4)
|
||||
|
||||
def add_test(self, test_id) -> None:
|
||||
if f"{test_id}\n" not in self.tests:
|
||||
self.tests.append(f"{test_id}\n")
|
||||
def add_test(self, test_name: str, test_details: dict) -> None:
|
||||
self.tests[test_name] = test_details
|
||||
self.save()
|
||||
|
||||
def remove_test(self, test_id) -> None:
|
||||
if f"{test_id}\n" in self.tests:
|
||||
self.tests.remove(f"{test_id}\n")
|
||||
def remove_test(self, test_name: str) -> None:
|
||||
if test_name in self.tests:
|
||||
del self.tests[test_name]
|
||||
self.save()
|
||||
|
|
|
@ -0,0 +1,7 @@
|
|||
{
|
||||
"TestWriteFile": {
|
||||
"difficulty": "basic",
|
||||
"dependencies": [],
|
||||
"test": "agbenchmark/tests/basic_abilities/write_file/write_file_test.py::TestWriteFile::test_method[challenge_data0-run_agent0]"
|
||||
}
|
||||
}
|
|
@ -368,6 +368,20 @@ files = [
|
|||
{file = "frozenlist-1.3.3.tar.gz", hash = "sha256:58bcc55721e8a90b88332d6cd441261ebb22342e238296bb330968952fbb3a6a"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "future-fstrings"
|
||||
version = "1.2.0"
|
||||
description = "A backport of fstrings to python<3.6"
|
||||
optional = false
|
||||
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
|
||||
files = [
|
||||
{file = "future_fstrings-1.2.0-py2.py3-none-any.whl", hash = "sha256:90e49598b553d8746c4dc7d9442e0359d038c3039d802c91c0a55505da318c63"},
|
||||
{file = "future_fstrings-1.2.0.tar.gz", hash = "sha256:6cf41cbe97c398ab5a81168ce0dbb8ad95862d3caf23c21e4430627b90844089"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
rewrite = ["tokenize-rt (>=3)"]
|
||||
|
||||
[[package]]
|
||||
name = "idna"
|
||||
version = "3.4"
|
||||
|
@ -473,6 +487,24 @@ files = [
|
|||
{file = "multidict-6.0.4.tar.gz", hash = "sha256:3666906492efb76453c0e7b97f2cf459b0682e7402c0489a95484965dbc1da49"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "networkx"
|
||||
version = "3.1"
|
||||
description = "Python package for creating and manipulating graphs and networks"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "networkx-3.1-py3-none-any.whl", hash = "sha256:4f33f68cb2afcf86f28a45f43efc27a9386b535d567d2127f8f61d51dec58d36"},
|
||||
{file = "networkx-3.1.tar.gz", hash = "sha256:de346335408f84de0eada6ff9fafafff9bcda11f0a0dfaa931133debb146ab61"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
default = ["matplotlib (>=3.4)", "numpy (>=1.20)", "pandas (>=1.3)", "scipy (>=1.8)"]
|
||||
developer = ["mypy (>=1.1)", "pre-commit (>=3.2)"]
|
||||
doc = ["nb2plots (>=0.6)", "numpydoc (>=1.5)", "pillow (>=9.4)", "pydata-sphinx-theme (>=0.13)", "sphinx (>=6.1)", "sphinx-gallery (>=0.12)", "texext (>=0.6.7)"]
|
||||
extra = ["lxml (>=4.6)", "pydot (>=1.4.2)", "pygraphviz (>=1.10)", "sympy (>=1.10)"]
|
||||
test = ["codecov (>=2.1)", "pytest (>=7.2)", "pytest-cov (>=4.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "openai"
|
||||
version = "0.27.8"
|
||||
|
@ -595,6 +627,37 @@ tomli = {version = ">=1.0.0", markers = "python_version < \"3.11\""}
|
|||
[package.extras]
|
||||
testing = ["argcomplete", "attrs (>=19.2.0)", "hypothesis (>=3.56)", "mock", "nose", "pygments (>=2.7.2)", "requests", "setuptools", "xmlschema"]
|
||||
|
||||
[[package]]
|
||||
name = "pytest-depends"
|
||||
version = "1.0.1"
|
||||
description = "Tests that depend on other tests"
|
||||
optional = false
|
||||
python-versions = "*"
|
||||
files = [
|
||||
{file = "pytest-depends-1.0.1.tar.gz", hash = "sha256:90a28e2b87b75b18abd128c94015248544acac20e4392e9921e5a86f93319dfe"},
|
||||
{file = "pytest_depends-1.0.1-py3-none-any.whl", hash = "sha256:a1df072bcc93d77aca3f0946903f5fed8af2d9b0056db1dfc9ed5ac164ab0642"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
colorama = "*"
|
||||
future-fstrings = "*"
|
||||
networkx = "*"
|
||||
pytest = ">=3"
|
||||
|
||||
[[package]]
|
||||
name = "python-dotenv"
|
||||
version = "1.0.0"
|
||||
description = "Read key-value pairs from a .env file and set them as environment variables"
|
||||
optional = false
|
||||
python-versions = ">=3.8"
|
||||
files = [
|
||||
{file = "python-dotenv-1.0.0.tar.gz", hash = "sha256:a8df96034aae6d2d50a4ebe8216326c61c3eb64836776504fcca410e5937a3ba"},
|
||||
{file = "python_dotenv-1.0.0-py3-none-any.whl", hash = "sha256:f5971a9226b701070a4bf2c38c89e5a3f0d64de8debda981d1db98583009122a"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
cli = ["click (>=5.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "requests"
|
||||
version = "2.31.0"
|
||||
|
@ -765,4 +828,4 @@ multidict = ">=4.0"
|
|||
[metadata]
|
||||
lock-version = "2.0"
|
||||
python-versions = "^3.9"
|
||||
content-hash = "a13e69f2bd9e511e1af92ed02b155a90dec38a9b8d983a711e1b67931b467d38"
|
||||
content-hash = "f8de5e973c92360108aaca1cecc2fdd505f10a9c2975b46c83ea9c24b4af3cfe"
|
||||
|
|
|
@ -14,6 +14,8 @@ click = "^8.1.3"
|
|||
requests = "^2.31.0"
|
||||
openai = "^0.27.8"
|
||||
pydantic = "^1.10.9"
|
||||
pytest-depends = "^1.0.1"
|
||||
python-dotenv = "^1.0.0"
|
||||
|
||||
|
||||
[build-system]
|
||||
|
@ -28,7 +30,8 @@ testpaths = [
|
|||
]
|
||||
markers = [
|
||||
"retrieval",
|
||||
"regression"
|
||||
"regression",
|
||||
"basic",
|
||||
]
|
||||
|
||||
[tool.poetry.scripts]
|
||||
|
|