## Challenges Data Schema of Benchmark

### General challenges

Input:

- name (str): Name of the challenge.
- category (str[]): Categories of the challenge, such as 'basic', 'retrieval', 'comprehension', etc. This field is not currently used, but it may be needed in the future.
- task (str): The task that the agent needs to solve.
- dependencies (str[]): The dependencies that the challenge needs to run. Each entry needs to be the full node to the test function.
- ground (dict): The ground truth.
  - answer (str): The raw text of the ground truth answer.
  - should_contain (list): The exact strings that are required in the final answer.
  - should_not_contain (list): The exact strings that should not be in the final answer.
  - files (list): Files that are used for retrieval. Can specify either a file name or an extension.
- mock (dict): Mock response for testing (see the sketch after the example below).
  - mock_func (str): Function to mock the agent's response. This is used for testing purposes.
  - mock_task (str): Task to provide to the mock function.
- info (dict): Additional info about the challenge.
  - difficulty (str): The difficulty of this query.
  - description (str): Description of the challenge.
  - side_effects (str[]): Describes the side effects of the challenge.

Example:

```python
{
  "category": ["basic"],
  "task": "Print the capital of America to a .txt file",
  "dependencies": ["TestWriteFile"], # the class name of the test
  "ground": {
    "answer": "Washington",
    "should_contain": ["Washington"],
    "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
    "files": [".txt"],
    "type": "file"
  },
  "info": {
    "difficulty": "basic",
    "description": "Tests the writing to file",
    "side_effects": ["tests if there is in fact an LLM attached"]
  }
}
```
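
The example above omits the optional `mock` section. If present, it might look along these lines; the function name and task text are illustrative placeholders rather than actual benchmark values:

```python
"mock": {
  "mock_func": "write_file_mock",                 # illustrative mock function name
  "mock_task": "What is the capital of America?"  # illustrative task passed to the mock
}
```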

Current Output:

- score (float): scores range from [0, 1]
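
As a rough illustration only (not the benchmark's actual implementation), a score for a text answer could be derived from the `should_contain` / `should_not_contain` lists along these lines:

```python
def score_answer(answer: str, should_contain: list[str], should_not_contain: list[str]) -> float:
    """Hypothetical scoring sketch: full marks only when every required
    string is present and no forbidden string appears."""
    if any(bad in answer for bad in should_not_contain):
        return 0.0
    if all(good in answer for good in should_contain):
        return 1.0
    return 0.0


# e.g. score_answer("Washington", ["Washington"], ["New York"]) -> 1.0
```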

## Add files to challenges:

### artifacts_in

This folder contains all the files you want the agent to have in its workspace BEFORE the challenge starts.

### artifacts_out

This folder contains all the files you would like the agent to generate. It is used to mock the agent, which allows you to run `agbenchmark start --test=TestExample --mock` and make sure the challenge actually works.
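
Conceptually, a mock run can be thought of as copying the expected files from `artifacts_out` into the workspace instead of letting a real agent produce them. Here is a minimal sketch of that idea (the helper name and paths are illustrative, not the benchmark's actual code):

```python
import shutil
from pathlib import Path

def mock_agent_run(artifacts_out: Path, workspace: Path) -> None:
    """Hypothetical sketch: simulate a successful agent run by copying the
    expected output files into the workspace, so the normal ground-truth
    checks can run against them."""
    workspace.mkdir(parents=True, exist_ok=True)
    for artifact in artifacts_out.iterdir():
        if artifact.is_file():
            shutil.copy(artifact, workspace / artifact.name)
```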

### custom_python

This folder contains files that will be copied into the agent's workspace and run after the challenge is completed. For example, we can put a test.py in it and run this file in the workspace to easily import and check code generated by the agent. See the TestBasicCodeGeneration challenge for an example.
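
A `custom_python/test.py` could look something like the sketch below; the module and function names are made up for illustration, and the real TestBasicCodeGeneration challenge defines its own checks:

```python
# custom_python/test.py -- copied into the agent's workspace and executed there
# after the challenge. Hypothetical example: import code the agent was asked to
# generate and assert on its behavior.
from generated_code import multiply  # assumes the agent wrote generated_code.py

def test_multiply() -> None:
    assert multiply(2, 3) == 6, "multiply(2, 3) should equal 6"

if __name__ == "__main__":
    test_multiply()
    print("All tests passed")
```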