AutoGPT/agbenchmark/challenges/SUITES.md

5.5 KiB

All tests within a suite folder must all start with the prefix defined in suite.json. There are two types of suites.

same_task

If same_task is set to true, all of the data.jsons are combined into one test. A single test runs, but multiple regression tests, internal_infos, dependencies, and reports are created. The artifacts_in/out and custom python should be in the suite folder as it's shared between tests. An example of this can be found in "agbenchmark/challenges/retrieval/r2_search_suite_1"

{
  "same_task": true,
  "prefix": "TestRevenueRetrieval",
  "dependencies": ["TestBasicRetrieval"],
  "cutoff": 60,
  "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
  "shared_category": ["retrieval"]
}

The structure for a same_task report looks like this:

"TestRevenueRetrieval": {
            "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1",
            "task": "Write tesla's exact revenue in 2022 into a .txt file. Use the US notation, with a precision rounded to the nearest million dollars (for instance, $31,578 billion).",
            "category": [
                "retrieval"
            ],
            "metrics": {
                "percentage": 100.0,
                "highest_difficulty": "intermediate",
                "run_time": "0.016 seconds"
            },
            "tests": {
                "TestRevenueRetrieval_1.0": {
                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/1_tesla_revenue/data.json",
                    "is_regression": false,
                    "answer": "It was $81.462 billion in 2022.",
                    "description": "A no guardrails search for info",
                    "metrics": {
                        "difficulty": "novice",
                        "success": true,
                        "success_%": 100.0
                    }
                },
                "TestRevenueRetrieval_1.1": {
                    "data_path": "agbenchmark/challenges/retrieval/r2_search_suite_1/2_specific/data.json",
                    "is_regression": false,
                    "answer": "It was $81.462 billion in 2022.",
                    "description": "This one checks the accuracy of the information over r2",
                    "metrics": {
                        "difficulty": "novice",
                        "success": true,
                        "success_%": 0
                    }
                },
            },
            "reached_cutoff": false
        },

same_task

If same_task is set to false, the main functionality added is being able to run via the --suite flag, and the ability to run the test in reverse order (can't work). Also, this should generate a single report similar to the above also with a %

{
  "same_task": false,
  "reverse_order": true,
  "prefix": "TestReturnCode"
}

The structure for a non same_task report looks like this:

"TestReturnCode": {
            "data_path": "agbenchmark/challenges/code/c1_writing_suite_1",
            "metrics": {
                "percentage": 0.0,
                "highest_difficulty": "No successful tests",
                "run_time": "15.972 seconds"
            },
            "tests": {
                "TestReturnCode_Simple": {
                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/1_return/data.json",
                    "is_regression": false,
                    "category": [
                        "code",
                        "iterate"
                    ],
                    "task": "Return the multiplied number in the function multiply_int in code.py. You can make sure you have correctly done this by running test.py",
                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
                    "description": "Simple test if a simple code instruction can be executed",
                    "metrics": {
                        "difficulty": "basic",
                        "success": false,
                        "fail_reason": "assert 1 in [0.0]",
                        "success_%": 0.0,
                        "run_time": "15.96 seconds"
                    },
                    "reached_cutoff": false
                },
                "TestReturnCode_Write": {
                    "data_path": "agbenchmark/challenges/code/c1_writing_suite_1/2_write/data.json",
                    "is_regression": false,
                    "category": [
                        "code",
                        "iterate"
                    ],
                    "task": "Add a function called multiply_int in code.py that multiplies numbers by 2. You can make sure you have correctly done this by running test.py",
                    "answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8",
                    "description": "Small step up, just writing the function with a name as well as the return statement.",
                    "metrics": {
                        "difficulty": "novice",
                        "success": false,
                        "fail_reason": "agbenchmark/challenges/test_all.py::TestReturnCode_Write::test_method[challenge_data0] depends on agbenchmark/challenges/test_all.py::TestReturnCode_Simple::test_method[challenge_data0]",
                        "success_%": 0.0,
                        "run_time": "0.004 seconds"
                    },
                    "reached_cutoff": false
                },
            }
        }