Directory contents:

- adapatability/
- code/
- interface/
- memory/
- retrieval/
- safety/
- README.md
- __init__.py
- data_types.py
- test_all.py
# Challenges Data Schema of Benchmark

## General challenges
Input:
- name (str): Name of the challenge.
- category (str[]): Categories of the challenge, such as 'basic', 'retrieval', 'comprehension', etc. This field is not currently used, but may be needed in the future.
- task (str): The task that the agent needs to solve.
- dependencies (str[]): The dependencies that the challenge needs in order to run. Each entry needs to be the full node of the test function.
- ground (dict): The ground truth.
- answer (str): The raw text of the ground truth answer.
- should_contain (list): The exact strings that are required in the final answer.
- should_not_contain (list): The exact strings that should not be in the final answer.
- files (list): Files used for retrieval. You can specify an exact file name or a file extension here.
- mock (dict): Mock response for testing.
- mock_func (str): Function to mock the agent's response. This is used for testing purposes.
- mock_task (str): Task to provide for the mock function.
- info (dict): Additional info about the challenge.
- difficulty (str): The difficulty of this query.
- description (str): Description of the challenge.
- side_effects (str[]): Describes any side effects of the challenge.
Example:

```python
{
    "category": ["basic"],
    "task": "Print the capital of America to a .txt file",
    "dependencies": ["TestWriteFile"],  # the class name of the test
    "ground": {
        "answer": "Washington",
        "should_contain": ["Washington"],
        "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
        "files": [".txt"],
        "type": "file"
    },
    "info": {
        "difficulty": "basic",
        "description": "Tests the writing to file",
        "side_effects": ["tests if there is in fact an LLM attached"]
    }
}
```
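For reference, the fields above map naturally onto typed models. The following is a minimal sketch using pydantic; the class and field names mirror the schema described here, but they are illustrative and not guaranteed to match the actual definitions in data_types.py.

```python
from typing import List, Optional

from pydantic import BaseModel


class Ground(BaseModel):
    answer: str
    should_contain: Optional[List[str]] = None
    should_not_contain: Optional[List[str]] = None
    files: List[str] = []
    type: Optional[str] = None  # e.g. "file", as in the example above


class Info(BaseModel):
    difficulty: str
    description: str
    side_effects: List[str] = []


class Mock(BaseModel):
    mock_func: str
    mock_task: Optional[str] = None


class ChallengeData(BaseModel):
    name: Optional[str] = None
    category: List[str]
    task: str
    dependencies: List[str]
    ground: Ground
    mock: Optional[Mock] = None
    info: Info
```

With models like these, a challenge JSON file can be validated at load time (e.g. `ChallengeData(**json.loads(raw))`), so schema errors surface before the benchmark runs.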
## Current Output

- score (float): A score in the range [0, 1].
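As a rough illustration of how the ground dict translates into that score (a simplified sketch, not the benchmark's actual scoring logic), a binary should_contain / should_not_contain check could look like this:

```python
def score_answer(answer_text: str, ground: dict) -> float:
    """Return 1.0 if answer_text satisfies the ground truth, else 0.0.

    Simplified sketch: the real benchmark may combine or weight its
    checks differently.
    """
    for phrase in ground.get("should_contain") or []:
        if phrase not in answer_text:
            return 0.0
    for phrase in ground.get("should_not_contain") or []:
        if phrase in answer_text:
            return 0.0
    return 1.0


# Example, using the ground dict from the challenge above:
# score_answer("The capital is Washington", ground) -> 1.0
```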
## Add files to challenges
### artifacts_in

This folder contains all the files you want the agent to have in its workspace BEFORE the challenge starts.
### artifacts_out

This folder contains all the files you would like the agent to generate. It is also used to mock the agent, which lets you run `agbenchmark start --test=TestExample --mock` and verify that the challenge actually works.
### custom_python

This folder contains files that will be copied into the agent's workspace and run after the challenge is completed. For example, we can put a test.py in it and run that file in the workspace to easily import and check the code generated by the agent. See the TestBasicCodeGeneration challenge, and the illustrative sketch below.
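For illustration, a hypothetical custom_python/test.py could look like the following. The module name `sample_code` and the function `multiply_int` are assumptions made for this example, not necessarily the names used by the real TestBasicCodeGeneration challenge:

```python
# Hypothetical custom_python/test.py. It is copied into the agent's
# workspace after the challenge runs, so it can import files the agent
# generated there. `sample_code` and `multiply_int` are assumed names
# used only for this illustration.
from sample_code import multiply_int


def test_multiply_int() -> None:
    result = multiply_int(4)
    print(result)
    assert result == 8, f"expected 8, got {result}"


if __name__ == "__main__":
    test_multiply_int()
```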