Remove the submodule, reference OpenAI directly rather than running it on the command line, fix logging (#16)
* Removed submodule, refactor, docker on pip, async docker logging, running our own tool on CLI rather than OpenAI's
parent f00ced6612
commit 625d6e72ec
|
@ -127,3 +127,5 @@ dmypy.json
|
|||
|
||||
# Pyre type checker
|
||||
.pyre/
|
||||
|
||||
/data
|
||||
|
|
|
@ -1,3 +0,0 @@
|
|||
[submodule "Auto-GPT"]
|
||||
path = auto_gpt_benchmarking/Auto-GPT
|
||||
url = https://github.com/Significant-Gravitas/Auto-GPT.git
|
127
README.md
|
@ -1,69 +1,97 @@
|
|||
# Auto-GPT-Benchmarks
|
||||
A set of standardised benchmarks to assess the performance of Auto-GPTs.
|
||||
|
||||
# What is next?
|
||||
|
||||
- [ ] Build longer form tasks, (code fix backed by testing)
|
||||
- [ ] Explicitly note the common failure modes in the test harness and fix them. Most of these appear to be failure modes with the core AutoGPT project
|
||||
- [ ] Switch to a ubuntu container so it can do more things (git, bash, etc)
|
||||
- [ ] Lower priority, but put this in a webserver backend so we have a good API rather than doing container and file management for our interface between evals and our agent.
|
||||
- [ ] Get token counting data from the model Add scores to result files based on pricing associated with tokens and models used
|
||||
- [ ] Think about how this can be applied to other projects besides AutoGPT so we can be THE agent evaluation framework.
|
||||
- [ ] Copy the OpenAI Eval files from the tmp file they are saved to somewhere we can track the results
|
||||
- [ ] Support multi-threaded evals. OpenAI has great support for this. The docker system built here doesn't.
|
||||
|
||||
|
||||
## Understanding OpenAI Evals
|
||||
|
||||
The Evals docs are here and very good: https://github.com/openai/evals/tree/main/docs
|
||||
|
||||
The basic idea is this:
|
||||
1. Use a completion function to point to the language model or in our case AutoGPT, the model you want to test.
|
||||
2. Register that completion function with the evals framework with a yaml in a `completion_fns` dir.
|
||||
3. Run the evals against the completion function.
|
||||
|
||||
Then you can make more yaml defined evals and run them against the completion function as needed.
|
||||
|
||||
### Completions Functions
|
||||
|
||||
See our yaml file in `completion_fns` dir for the registration of the completion function.
|
||||
See our completion function itself in CompletionFn.py
|
||||
That points to the AutoGPT model we want to test which is spun up dynamically in a docker container in AutoGPTAgent.py
|
||||
|
||||
A set of standardised benchmarks to assess the performance of Auto-GPT.
|
||||
This currently uses the OpenAI Evals framework to run the benchmarks.
|
||||
|
||||
## Setup
|
||||
|
||||
You must add the auto_gpt_benchmarking dir to the python path
|
||||
Do this with a path file in your venv. OpenAI evals needs to import it.
|
||||
|
||||
These instructions currently assume Ubuntu 22.04.
|
||||
They should be fairly adaptable to the Windows/macOS equivalents. Please submit a PR if you would like to see your OS
|
||||
documented.
|
||||
|
||||
Clone the repo with:
|
||||
|
||||
`git clone git@github.com:Significant-Gravitas/Auto-GPT-Benchmarks.git`
|
||||
`cd Auto-GPT-Benchmarks`
|
||||
|
||||
Create a venv with
|
||||
|
||||
`python3.9 -m venv venv`
|
||||
`python3.9 -m venv venv`
|
||||
|
||||
|
||||
Activate it with
|
||||
|
||||
`source venv/bin/activate`
|
||||
`source venv/bin/activate`
|
||||
|
||||
Add a file to `venv/lib/python3.9/site-packages/benchmarking.pth` with the contents:
|
||||
`/PATH/TO/REPO/Auto-GPT-Benchmarks-fork`
|
||||
Install the requirements with:
|
||||
|
||||
This is because evals tries to import it directly.
|
||||
`pip install -r requirements.txt`
|
||||
|
||||
Install the requirements with
|
||||
If you haven't already, clone the AutoGPT repo somewhere else on your machine.
|
||||
DO NOT CLONE IT INTO A SUBDIR OF THIS REPO.
|
||||
|
||||
`pip install -r requirements.txt`
|
||||
`cd somewhere/else`
|
||||
`git clone git@github.com:Significant-Gravitas/Auto-GPT.git`
|
||||
|
||||
You must have a docker container built corresponding to the submodule below or the docker run command starting the agent will fail.
|
||||
You will need to update the .env file in the Auto-GPT repo to have your OpenAI API key. The file in question is at:
|
||||
|
||||
Cd into the AutoGPT submodule and build/tag the dockerfile so the agent can be instantiated.
|
||||
`cd auto_gpt_benchmarks/Auto-GPT`
|
||||
`Auto-GPT/.env`
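
The benchmark runner reads `OPENAI_API_KEY` out of that file before it starts the evals. A minimal sketch of that lookup, assuming plain `KEY=value` lines; the helper name `read_openai_key` is illustrative and not part of the repo:

```python
from pathlib import Path
from typing import Optional


def read_openai_key(env_path: Path) -> Optional[str]:
    """Return the OPENAI_API_KEY value from a KEY=value style .env file, or None."""
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values containing "=" stay intact.
        key, sep, value = line.partition("=")
        if sep and key.strip() == "OPENAI_API_KEY":
            return value.strip()
    return None
```

If this returns `None` for your `.env`, the key line is probably missing or commented out.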
|
||||
|
||||
Build the container so we can run it procedurally!
|
||||
`docker build -t autogpt .`
|
||||
Finally, we assume you have a docker container built from the Dockerfile in the Auto-GPT repo.
|
||||
|
||||
## Running the tests
|
||||
Build this with:
|
||||
|
||||
EVALS_THREADS=1 EVALS_THREAD_TIMEOUT=600 oaieval auto_gpt_completion_fn test-match --registry_path $PWD/auto_gpt_benchmarking
|
||||
`cd Auto-GPT`
|
||||
`docker build -t autogpt .`
|
||||
|
||||
If you want to run with redis as your memory system, you can stand up a redis image in the AutoGPT repo with
|
||||
|
||||
`docker compose up`
|
||||
|
||||
Then you will need to adjust some variables in your .env file to use the redis memory backend.
|
||||
See the AutoGPT docs on how to do that.
|
||||
|
||||
Run your first eval with:
|
||||
|
||||
`cd Auto-GPT-Benchmarks`
|
||||
`python3 -m auto_gpt_benchmarking test-match --auto-gpt-path /your/path/to/Auto-GPT`
|
||||
|
||||
You should only need to use the --auto-gpt-path flag the first time you run it. Afterwards, that will be saved in
|
||||
|
||||
`auto_gpt_benchmarking/completion_fns/auto_gpt_completion_fn.yaml`.
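
The precedence between the flag and the saved value can be sketched as follows. This is a simplified illustration that leaves out the yaml reading and writing; `resolve_auto_gpt_path` and its parameter names are hypothetical:

```python
from pathlib import Path
from typing import Optional


def resolve_auto_gpt_path(cli_value: Optional[str], saved_value: Optional[str]) -> Path:
    """Pick the Auto-GPT path: the flag wins, else the saved value, else fail."""
    chosen = cli_value if cli_value is not None else saved_value
    if chosen is None:
        # Neither the flag nor the yaml file supplied a path.
        raise ValueError("Pass --auto-gpt-path once so it can be saved for later runs")
    return Path(chosen).absolute()
```

Passing the flag again on a later run simply overrides the saved value.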
|
||||
|
||||
To see a full list of available flags, run `python3 -m auto_gpt_benchmarking --help`.
|
||||
Some of these are inherited from the OpenAI Evals framework and do not work quite as intended as they are not applicable
|
||||
to this use case.
|
||||
|
||||
This saves a file in `Auto-GPT-Benchmarks/data/records.jsonl`
|
||||
This location is the default and can be changed with the `--record_path` flag. You will have to specify the fully
|
||||
qualified path.
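
To get a quick score out of the records file, something like the sketch below may help. It assumes each line is a JSON object and that match results look like the `"type": "match"` event in the example output of this README; `count_matches` is an illustrative helper, not part of the repo:

```python
import json
from pathlib import Path
from typing import Tuple


def count_matches(record_path: Path) -> Tuple[int, int]:
    """Return (correct, total) over the "match" events in a records.jsonl file."""
    correct = 0
    total = 0
    for line in record_path.read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        # Only "match" events carry a data.correct field; other lines are skipped.
        if event.get("type") != "match":
            continue
        total += 1
        if event.get("data", {}).get("correct"):
            correct += 1
    return correct, total
```

Call it with the same fully qualified path you passed to `--record_path`, or with the default location if you did not set one.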
|
||||
|
||||
## Currently Supported Benchmarks:
|
||||
From OpenAI Evals
|
||||
- [x] test-match
|
||||
- [x] test-fuzzy-match
|
||||
- [ ] Everything else they have...
|
||||
|
||||
## Understanding OpenAI Evals
|
||||
|
||||
The Evals docs are here and very good: https://github.com/openai/evals/tree/main/docs
|
||||
|
||||
The basic idea is this though:
|
||||
1. Use a completion function to point to the model you want to test: a language model or, in our case, AutoGPT.
|
||||
2. Register that completion function with the evals framework with a yaml in a `completion_fns` dir.
|
||||
3. Run the evals against the completion function.
|
||||
|
||||
You can then define more evals in yaml and run them against the completion function as needed.
|
||||
|
||||
### Completion Functions
|
||||
|
||||
See our yaml file in `completion_fns` dir for the registration of the completion function.
|
||||
See our completion function itself in CompletionFn.py
|
||||
That points to the AutoGPT model we want to test which is spun up dynamically in a docker container in AutoGPTAgent.py
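
The shape of a completion function can be sketched without the evals package. The classes below are stand-ins that mirror the pattern (a callable that returns a result object exposing `get_completions`); the names `SketchCompletionFn`, `SketchCompletionResult`, and `run_agent` are assumptions for illustration, and the real classes in CompletionFn.py subclass the evals base classes instead:

```python
from typing import Callable, List


class SketchCompletionResult:
    """Holds one response and exposes it as a list of completions."""

    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> List[str]:
        return [self.response.strip()]


class SketchCompletionFn:
    """Callable that forwards a prompt to some agent and wraps the answer."""

    def __init__(self, run_agent: Callable[[str], str]) -> None:
        # run_agent stands in for starting the Docker-hosted agent.
        self.run_agent = run_agent

    def __call__(self, prompt: str, **kwargs) -> SketchCompletionResult:
        return SketchCompletionResult(self.run_agent(prompt))
```

Injecting `run_agent` keeps the sketch testable without Docker; in the repo, that role is played by the agent class in AutoGPTAgent.py.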
|
||||
|
||||
|
||||
# Example final output:
|
||||
|
@ -79,3 +107,12 @@ EVALS_THREADS=1 EVALS_THREAD_TIMEOUT=600 oaieval auto_gpt_completion_fn test-mat
|
|||
{"run_id": "230417220821DPM75QNS", "event_id": 5, "sample_id": "test-match.s1.0", "type": "match", "data": {"correct": false, "expected": "time", "picked": null, "sampled": "Once upon a time", "options": ["time"]}, "created_by": "", "created_at": "2023-04-17 22:12:04.691064+00:00"}
|
||||
(venv) douglas@douglas-XPS-15-9500:~/AGI/Auto-GPT-Benchmarks-fork$
|
||||
|
||||
# What is next?
|
||||
|
||||
- [ ] Run the rest of the OpenAI Evals, especially the model-graded ones
|
||||
- [ ] Build longer form tasks, (code fix backed by testing)
|
||||
- [ ] Explicitly note the common failure modes in the test harness and fix them. Most of these appear to be failure modes with the core AutoGPT project
|
||||
- [ ] Get token counting data from the model. Add scores to result files based on pricing associated with tokens and models used
|
||||
- [ ] Think about how this can be applied to other projects besides AutoGPT so we can be THE agent evaluation framework.
|
||||
- [ ] Figure out how the OpenAI Evals results are saved...
|
||||
- [ ] Support multi-threaded evals. OpenAI has great support for this. The docker system built here doesn't.
|
||||
|
|
|
@ -1 +0,0 @@
|
|||
Subproject commit 97d62cc16bf45fcd406efeb33d042ebd58c24670
|
|
@ -10,7 +10,9 @@ The model is instantiated with a prompt from the AutoGPT completion function.
|
|||
Eventually we will also save and log all of the associated output and thinking for the model as well
|
||||
"""
|
||||
from pathlib import Path
|
||||
import os
|
||||
import docker
|
||||
import asyncio
|
||||
import aiodocker
|
||||
|
||||
|
||||
class AutoGPTAgent:
|
||||
|
@ -36,12 +38,34 @@ class AutoGPTAgent:
|
|||
if self.file_logger.exists():
|
||||
self.file_logger.unlink()
|
||||
|
||||
def _copy_ai_settings(self):
|
||||
def _copy_ai_settings(self) -> None:
|
||||
self.ai_settings_dest.write_text(self.ai_settings_file.read_text())
|
||||
|
||||
def _copy_prompt(self):
|
||||
def _copy_prompt(self) -> None:
|
||||
self.prompt_file.write_text(self.prompt)
|
||||
|
||||
async def _stream_logs(self, container: aiodocker.containers.DockerContainer) -> None:
|
||||
try:
|
||||
async for line in container.log(stdout=True, stderr=True, follow=True, tail="all"):
|
||||
print(line.strip())
|
||||
await asyncio.sleep(1)
|
||||
except aiodocker.exceptions.DockerError as e:
|
||||
# Handle Docker errors (e.g., container is killed or removed)
|
||||
print('Docker error: {}'.format(e))
|
||||
|
||||
async def _run_stream_logs(self) -> None:
|
||||
"""
|
||||
This grabs the docker container's id and streams the logs to the console with aiodocker.
|
||||
:return: None
|
||||
"""
|
||||
async with aiodocker.Docker() as docker_client:
|
||||
try:
|
||||
container = docker_client.containers.container(self.container.id)
|
||||
await self._stream_logs(container)
|
||||
except aiodocker.exceptions.DockerError as e:
|
||||
# Handle cases when the container is not found
|
||||
print('Container not found: {}'.format(e))
|
||||
|
||||
def _start_agent(self):
|
||||
"""
|
||||
This starts the agent in the docker container.
|
||||
|
@ -51,9 +75,26 @@ class AutoGPTAgent:
|
|||
You also must set up the .env file in the Auto-GPT repo.
|
||||
:return:
|
||||
"""
|
||||
client = docker.from_env()
|
||||
env_file = self.auto_gpt_path / ".env"
|
||||
# run it in continuous mode and skip re-prompts
|
||||
os.system(f"docker run -it --env-file={env_file} -v {self.auto_workspace}:/home/appuser/auto_gpt_workspace -v {self.auto_gpt_path}/autogpt:/home/appuser/autogpt autogpt --continuous -C '/home/appuser/auto_gpt_workspace/ai_settings.yaml'")
|
||||
envs = [
|
||||
f"{line.strip()}" for line in open(
|
||||
env_file
|
||||
) if line.strip() != "" and line.strip()[0] != "#" and line.strip()[0] != "\n"]
|
||||
|
||||
self.container = client.containers.run(
|
||||
image="autogpt",
|
||||
command="--continuous -C '/home/appuser/auto_gpt_workspace/ai_settings.yaml'",
|
||||
environment=envs,
|
||||
volumes={
|
||||
self.auto_workspace: {"bind": "/home/appuser/auto_gpt_workspace", "mode": "rw"},
|
||||
f"{self.auto_gpt_path}/autogpt": {"bind": "/home/appuser/autogpt", "mode": "rw"},
|
||||
},
|
||||
stdin_open=True,
|
||||
tty=True,
|
||||
detach=True
|
||||
)
|
||||
asyncio.run(self._run_stream_logs())
|
||||
|
||||
def _poll_for_output(self):
|
||||
"""
|
||||
|
@ -64,8 +105,8 @@ class AutoGPTAgent:
|
|||
if self.output_file.exists():
|
||||
return self.output_file.read_text()
|
||||
|
||||
def __init__(self, prompt):
|
||||
self.auto_gpt_path = Path(__file__).parent / "Auto-GPT"
|
||||
def __init__(self, prompt, auto_gpt_path: str):
|
||||
self.auto_gpt_path = Path(auto_gpt_path)
|
||||
self.auto_workspace = self.auto_gpt_path / "auto_gpt_workspace"
|
||||
self.prompt_file = self.auto_workspace / "prompt.txt"
|
||||
self.output_file = self.auto_workspace / "output.txt"
|
||||
|
@ -76,16 +117,33 @@ class AutoGPTAgent:
|
|||
self._clean_up_workspace()
|
||||
self._copy_ai_settings()
|
||||
self._copy_prompt()
|
||||
self.container = None
|
||||
self.killing = False
|
||||
self.logging_task = None
|
||||
|
||||
def start(self):
|
||||
self._start_agent()
|
||||
answer = self._poll_for_output()
|
||||
print('about to do clean up')
|
||||
print(answer)
|
||||
self._clean_up_workspace()
|
||||
print('did clean up')
|
||||
print(f"Prompt was: {self.prompt}, Answer was: {answer}")
|
||||
self.kill()
|
||||
return answer
|
||||
|
||||
def kill(self):
|
||||
if self.killing:
|
||||
return
|
||||
self.killing = True
|
||||
self._clean_up_workspace()
|
||||
if self.container:
|
||||
# kill the container
|
||||
try:
|
||||
self.container.kill()
|
||||
self.container.remove()
|
||||
except docker.errors.APIError:
|
||||
print('Couldn\'t find container to kill. Assuming container successfully killed itself.')
|
||||
if self.logging_task:
|
||||
self.logging_task.cancel()
|
||||
self.killing = False
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
ai_goals:
|
||||
- Evaluate the prompt in `prompt.txt` and find the best answer in the format provided.
|
||||
- Get the correct answer to the question in the fewest number of steps possible. You are scored first on if you get the correct answer, and second on how many tokens you take to get the right answer so keep your thinking and tool usage as minimal as possible while still ensuring you get the correct answer.
|
||||
- Save the final answer and output to the `output.txt` file, the only file you should write to then immediately exit the program.
|
||||
- Save the final answer and output to the `output.txt` file, the only file you should write to, then immediately exit the program because you are done.
|
||||
ai_name: EvaluationAgent
|
||||
ai_role: an ai that is tested on how effectively it can efficiently evaluate questions and answer them correctly while using as few resources as possible
|
||||
|
|
|
@ -1,5 +1,3 @@
|
|||
import importlib
|
||||
from typing import Optional
|
||||
from evals.api import CompletionFn, CompletionResult
|
||||
|
||||
from evals.prompt.base import CompletionPrompt
|
||||
|
@ -16,12 +14,21 @@ class AutoGPTCompletionResult(CompletionResult):
|
|||
|
||||
|
||||
class AutoGPTCompletionFn(CompletionFn):
|
||||
def __init__(self, **kwargs) -> None:
|
||||
pass
|
||||
|
||||
def __init__(self, auto_gpt_path, **kwargs) -> None:
|
||||
self.auto_gpt_path = auto_gpt_path
|
||||
self.agent = None
|
||||
|
||||
def __call__(self, prompt, **kwargs) -> AutoGPTCompletionResult:
|
||||
prompt = CompletionPrompt(prompt).to_formatted_prompt()
|
||||
agent = AutoGPTAgent(prompt)
|
||||
response = agent.start()
|
||||
self.kill_agent()
|
||||
self.agent = AutoGPTAgent(prompt, self.auto_gpt_path)
|
||||
response = self.agent.start()
|
||||
record_sampling(prompt=prompt, sampled=response)
|
||||
return AutoGPTCompletionResult(response)
|
||||
|
||||
def kill_agent(self):
|
||||
if self.agent:
|
||||
self.agent.kill()
|
||||
|
||||
|
||||
|
|
|
@ -0,0 +1,61 @@
|
|||
"""
|
||||
The evaluator class actually executes the evals.
|
||||
"""
|
||||
from evals.cli import oaieval
|
||||
from evals.registry import Registry
|
||||
from pathlib import Path
|
||||
from typing import List, Optional, Tuple
|
||||
import sys
|
||||
|
||||
|
||||
class OAIRunArgs:
|
||||
def __init__(
|
||||
self,
|
||||
completion_fn: str,
|
||||
eval: str,
|
||||
extra_eval_params: str = "",
|
||||
max_samples: int = None,
|
||||
cache: bool = True,
|
||||
visible: bool = None,
|
||||
seed: int = 20220722,
|
||||
user: str = "",
|
||||
record_path: str = None,
|
||||
log_to_file: str = None,
|
||||
debug: bool = False,
|
||||
local_run: bool = True,
|
||||
dry_run: bool = False,
|
||||
dry_run_logging: bool = True,
|
||||
):
|
||||
self.completion_fn = completion_fn
|
||||
self.eval = eval
|
||||
self.extra_eval_params = extra_eval_params
|
||||
self.max_samples = max_samples
|
||||
self.cache = cache
|
||||
self.visible = visible
|
||||
self.seed = seed
|
||||
self.user = user
|
||||
self.record_path = record_path
|
||||
self.log_to_file = log_to_file
|
||||
self.debug = debug
|
||||
self.local_run = local_run
|
||||
self.dry_run = dry_run
|
||||
self.dry_run_logging = dry_run_logging
|
||||
# create the record and logging paths if they don't exist
|
||||
Path(self.record_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
# Path(self.log_to_file).parent.mkdir(parents=True, exist_ok=True)
|
||||
# Registry path should be the auto_gpt_benchmarking folder
|
||||
self.registry_path = None
|
||||
|
||||
|
||||
class Evaluator:
|
||||
def __init__(self, oai_run_args: OAIRunArgs):
|
||||
self.oai_run_args = oai_run_args
|
||||
registry_path = Path(__file__).parent
|
||||
|
||||
# add registry path to the python system path
|
||||
sys.path.append(str(registry_path))
|
||||
self.oai_run_args.registry_path = [registry_path]
|
||||
# self.registry = Registry([registry_path])
|
||||
|
||||
def run(self):
|
||||
oaieval.run(self.oai_run_args)
|
|
@ -1,34 +0,0 @@
|
|||
import importlib
|
||||
from typing import Optional
|
||||
from evals.api import CompletionFn, CompletionResult
|
||||
|
||||
from langchain.llms import BaseLLM
|
||||
|
||||
from evals.prompt.base import CompletionPrompt
|
||||
from evals.record import record_sampling
|
||||
|
||||
|
||||
class LangChainLLMCompletionResult(CompletionResult):
|
||||
def __init__(self, response) -> None:
|
||||
self.response = response
|
||||
|
||||
def get_completions(self) -> list[str]:
|
||||
return [self.response.strip()]
|
||||
|
||||
|
||||
class LangChainLLMCompletionFn(CompletionFn):
|
||||
def __init__(self, llm: str, llm_kwargs: Optional[dict] = {}, **kwargs) -> None:
|
||||
# Import and resolve self.llm to an instance of llm argument here, assuming it's always a subclass of BaseLLM
|
||||
module = importlib.import_module("langchain.llms")
|
||||
LLMClass = getattr(module, llm)
|
||||
|
||||
if issubclass(LLMClass, BaseLLM):
|
||||
self.llm = LLMClass(**llm_kwargs)
|
||||
else:
|
||||
raise ValueError(f"{llm} is not a subclass of BaseLLM")
|
||||
|
||||
def __call__(self, prompt, **kwargs) -> LangChainLLMCompletionResult:
|
||||
prompt = CompletionPrompt(prompt).to_formatted_prompt()
|
||||
response = self.llm(prompt)
|
||||
record_sampling(prompt=prompt, sampled=response)
|
||||
return LangChainLLMCompletionResult(response)
|
|
@ -0,0 +1,144 @@
|
|||
"""
|
||||
This is the main evaluation file. In it you can specify the following:
|
||||
|
||||
1. The number of threads to use for evaluation. This is set to 1 by default and will remain that way until we can spin
|
||||
up containers on command
|
||||
2. The timeout for each thread. This is set to 300 seconds by default. This is the amount of time each thread will run
|
||||
for before it is killed when evaluating an agent
|
||||
3. The path to the AutoGPT code. This is a required parameter as we do not know where your code lives.
|
||||
4. The evals you would like to run. The options here are any OpenAI eval, or any of the evals defined in this repository
|
||||
|
||||
|
||||
What this file does is it parses the params given and then runs the evals with OpenAI's evals framework.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
import yaml
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("eval", type=str, help="Name of an eval. See registry.")
|
||||
parser.add_argument(
|
||||
"--completion-fn",
|
||||
type=str,
|
||||
dest="completion_fn",
|
||||
default="auto_gpt_completion_fn",
|
||||
help="One or more CompletionFn URLs, separated by commas (,). "
|
||||
"A CompletionFn can either be the name of a model available in the OpenAI API or a key in the registry "
|
||||
"(see evals/registry/completion_fns).",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--timeout",
|
||||
type=int,
|
||||
default=300,
|
||||
help="The timeout for each thread",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--auto-gpt-path",
|
||||
type=str,
|
||||
default=None,
|
||||
help="The path to the AutoGPT code. This updates auto_gpt_completion_fn.yaml in completion fns. "
|
||||
"So you only need to set this once.",
|
||||
)
|
||||
parser.add_argument("--extra_eval_params", type=str, default="")
|
||||
parser.add_argument("--max_samples", type=int, default=None)
|
||||
parser.add_argument("--cache", action=argparse.BooleanOptionalAction, default=True)
|
||||
parser.add_argument("--visible", action=argparse.BooleanOptionalAction, default=None)
|
||||
parser.add_argument("--seed", type=int, default=20220722)
|
||||
parser.add_argument("--user", type=str, default="")
|
||||
parser.add_argument("--record_path", type=str, default=str(Path(__file__).parent.parent / "data" / "records.jsonl"))
|
||||
parser.add_argument(
|
||||
"--log_to_file", type=str, default=None,#default=str(
|
||||
# Path(__file__).parent.parent / "data" / "log" / "log.txt"
|
||||
# ), help="Log to a file instead of stdout"
|
||||
)
|
||||
parser.add_argument("--debug", action=argparse.BooleanOptionalAction, default=False)
|
||||
parser.add_argument("--local-run", action=argparse.BooleanOptionalAction, default=True)
|
||||
parser.add_argument("--dry-run", action=argparse.BooleanOptionalAction, default=False)
|
||||
parser.add_argument("--dry-run-logging", action=argparse.BooleanOptionalAction, default=True)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def update_yaml_with_auto_gpt_path(yaml_path: str, auto_gpt_path: str or None) -> Path:
|
||||
"""
|
||||
If there is a given auto_gpt_path, then we need to update the yaml file to include it in the system path
|
||||
If we don't have one. Then we get the path from the yaml.
|
||||
If none exists in the yaml and we don't have a path then we raise an exception.
|
||||
:param yaml_path: The path to the yaml file
|
||||
:param auto_gpt_path: The path to the AutoGPT code
|
||||
:return: The path to the AutoGPT code
|
||||
"""
|
||||
with open(yaml_path, "r") as f:
|
||||
yaml_data = yaml.safe_load(f)
|
||||
if yaml_data["auto_gpt_completion_fn"]["args"]["auto_gpt_path"] is None and auto_gpt_path is None:
|
||||
raise Exception("You must specify an auto_gpt_path in the yaml file or pass it in as a parameter")
|
||||
if auto_gpt_path is None:
|
||||
auto_gpt_path = yaml_data["auto_gpt_completion_fn"]["args"]["auto_gpt_path"]
|
||||
if auto_gpt_path is not None:
|
||||
yaml_data["auto_gpt_completion_fn"]["args"]["auto_gpt_path"] = auto_gpt_path
|
||||
with open(yaml_path, "w") as f:
|
||||
yaml.safe_dump(yaml_data, f)
|
||||
|
||||
return Path(auto_gpt_path).absolute()
|
||||
|
||||
|
||||
def load_env_file(env_path: Path):
|
||||
if not env_path.exists():
|
||||
raise FileNotFoundError('You must set the OpenAI key in the AutoGPT env file. '
|
||||
'We need your api keys to start the AutoGPT agent and use OpenAI evals')
|
||||
with open(env_path, "r") as f:
|
||||
# find the OPENAI_API_KEY key split it from the equals sign and assign it so OpenAI evals can use it.
|
||||
for line in f.readlines():
|
||||
if line.startswith("OPENAI_API_KEY"):
|
||||
os.environ["OPENAI_API_KEY"] = line.split("=", 1)[1].strip()
|
||||
break
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
args = parse_args()
|
||||
# Do not run in multiprocessing mode. We do not use this right now, as it disables OpenAI's timeouts :(
|
||||
# os.environ["EVALS_SEQUENTIAL"] = "1"
|
||||
os.environ["EVALS_THREAD_TIMEOUT"] = str(args.timeout)
|
||||
os.environ["EVALS_THREADS"] = str(1)
|
||||
|
||||
# Update the yaml file with the auto_gpt_path
|
||||
autogpt_path = update_yaml_with_auto_gpt_path(
|
||||
str(Path(__file__).parent / "completion_fns" / "auto_gpt_completion_fn.yaml"),
|
||||
args.auto_gpt_path
|
||||
)
|
||||
|
||||
# Add the benchmarks path to the system path so we can import auto_gpt_benchmarking
|
||||
sys.path.append(str(Path(__file__).parent.parent.absolute()))
|
||||
|
||||
# load all of the environment variables in the auto-gpt path/.env file
|
||||
load_env_file(Path(autogpt_path) / ".env")
|
||||
|
||||
# Obviously, a top level import would be better. This allows us to set the API key with the env file, as it gets
|
||||
# set in the evaluator. We can't set it before the import because the import will fail without an API key.
|
||||
from auto_gpt_benchmarking.Evaluator import Evaluator, OAIRunArgs
|
||||
run_args = OAIRunArgs(
|
||||
completion_fn=args.completion_fn,
|
||||
eval=args.eval,
|
||||
extra_eval_params=args.extra_eval_params,
|
||||
max_samples=args.max_samples,
|
||||
cache=args.cache,
|
||||
visible=args.visible,
|
||||
seed=args.seed,
|
||||
user=args.user,
|
||||
record_path=args.record_path,
|
||||
log_to_file=args.log_to_file,
|
||||
debug=args.debug,
|
||||
local_run=args.local_run,
|
||||
dry_run=args.dry_run,
|
||||
dry_run_logging=args.dry_run_logging)
|
||||
|
||||
# Run the evals
|
||||
evaluator = Evaluator(
|
||||
run_args
|
||||
)
|
||||
evaluator.run()
|
|
@ -1,2 +1,4 @@
|
|||
auto_gpt_completion_fn:
|
||||
args:
|
||||
auto_gpt_path:
|
||||
class: auto_gpt_benchmarking.CompletionFn:AutoGPTCompletionFn
|
|
@ -1 +1,81 @@
|
|||
evals
|
||||
aiodocker==0.21.0
|
||||
aiohttp==3.8.4
|
||||
aiosignal==1.3.1
|
||||
asn1crypto==1.5.1
|
||||
async-timeout==4.0.2
|
||||
attrs==23.1.0
|
||||
backoff==2.2.1
|
||||
blobfile==2.0.1
|
||||
cachetools==5.3.0
|
||||
certifi==2022.12.7
|
||||
cffi==1.15.1
|
||||
charset-normalizer==2.1.1
|
||||
click==8.1.3
|
||||
colorama==0.4.6
|
||||
contourpy==1.0.7
|
||||
cryptography==40.0.2
|
||||
cycler==0.11.0
|
||||
dataclasses-json==0.5.7
|
||||
docker==6.0.1
|
||||
evals==1.0.2.post1
|
||||
filelock==3.11.0
|
||||
fire==0.5.0
|
||||
fonttools==4.39.3
|
||||
frozenlist==1.3.3
|
||||
gptcache==0.1.13
|
||||
greenlet==2.0.2
|
||||
idna==3.4
|
||||
importlib-resources==5.12.0
|
||||
joblib==1.2.0
|
||||
kiwisolver==1.4.4
|
||||
langchain==0.0.142
|
||||
langdetect==1.0.9
|
||||
lxml==4.9.2
|
||||
lz4==4.3.2
|
||||
marshmallow==3.19.0
|
||||
marshmallow-enum==1.5.1
|
||||
matplotlib==3.7.1
|
||||
mock==5.0.2
|
||||
multidict==6.0.4
|
||||
mypy==1.2.0
|
||||
mypy-extensions==1.0.0
|
||||
nltk==3.8.1
|
||||
numexpr==2.8.4
|
||||
numpy==1.24.2
|
||||
openai==0.27.4
|
||||
openapi-schema-pydantic==1.2.4
|
||||
oscrypto==1.3.0
|
||||
packaging==23.1
|
||||
pandas==1.5.3
|
||||
Pillow==9.5.0
|
||||
portalocker==2.7.0
|
||||
pyarrow==10.0.1
|
||||
pycparser==2.21
|
||||
pycryptodomex==3.17
|
||||
pydantic==1.10.7
|
||||
PyJWT==2.6.0
|
||||
pyOpenSSL==23.1.1
|
||||
pyparsing==3.0.9
|
||||
python-dateutil==2.8.2
|
||||
pytz==2023.3
|
||||
PyYAML==6.0
|
||||
pyzstd==0.15.6
|
||||
regex==2023.3.23
|
||||
requests==2.28.2
|
||||
sacrebleu==2.3.1
|
||||
setuptools-scm==7.1.0
|
||||
six==1.16.0
|
||||
snowflake-connector-python==3.0.2
|
||||
SQLAlchemy==1.4.47
|
||||
tabulate==0.9.0
|
||||
tenacity==8.2.2
|
||||
termcolor==2.2.0
|
||||
tiktoken==0.3.3
|
||||
tomli==2.0.1
|
||||
tqdm==4.65.0
|
||||
typing-inspect==0.8.0
|
||||
typing_extensions==4.5.0
|
||||
urllib3==1.26.15
|
||||
websocket-client==1.5.1
|
||||
yarl==1.8.2
|
||||
zipp==3.15.0
|
||||
|
|