{"nodes": [{"color": "#d62728", "data": {"name": "TestAdaptSimpleTypoWithGuidance", "category": ["adaptability"], "task": "1- Run code.py.\n2- Read test.py.\n3- Modify test.py.\nRepeat step 1, 2 and 3 until code.py runs without errors.\n", "dependencies": ["TestDebugSimpleTypoWithGuidance"], "cutoff": 75, "ground": {"answer": "Same as the original DebugWithGuidance test - [0, 1] [2, 5] [0, 3]", "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "intermediate", "description": "A version of the debug challenge where the wrong files to test and change are given. The llm must figure the challenges out", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestAdaptSimpleTypoWithGuidance::test_method[challenge_data0]", "label": "AdaptSimpleTypoWithGuidance", "shape": "dot"}, {"color": "#d62728", "data": {"name": "TestAdaptTeslaRevenue", "category": ["adaptability"], "task": "I'm a financial planner, please help me write tesla's r in 2022 into a .txt file.", "cutoff": 60, "dependencies": ["TestRevenueRetrieval_1.0"], "ground": {"answer": "It was $81.462 billion in 2022.", "should_contain": ["81"], "should_not_contain": [], "files": [".txt"], "eval": {"type": "file"}}, "info": {"difficulty": "intermediate", "description": "Given the context, adapt to the word 'revenue' missing from the task.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestAdaptTeslaRevenue::test_method[challenge_data0]", "label": "AdaptTeslaRevenue", "shape": "dot"}, {"color": "#d62728", "data": {"name": "TestAdaptLink", "category": ["adaptability"], "task": "Write the price of the book in this url 'books.toscrape.com/catalogue/meditations_33/index.html' to a .txt file.", "dependencies": ["TestBasicRetrieval"], "cutoff": 60, "ground": {"answer": "\u00a325.89", "should_contain": ["25.89"], "should_not_contain": [], "files": [".txt"], "eval": {"type": "file"}}, "info": {"difficulty": "novice", 
"description": "An advanced version of this -> remove.html as well. Same as TestBasicRetrieval but link is slightly broken, supposed to be http:// at the start.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestAdaptLink::test_method[challenge_data0]", "label": "AdaptLink", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestReturnCode_Tests", "category": ["code", "iterate"], "task": "First, modify test.py to fill in the test case to be able to test the code in code.py. Next, modify the multiply_int function in code.py to be able to pass in a 'multiplier' argument to multiply the 'num' by 'multiplier'. Both arguments are integers. You can make sure you have correctly done this by running test.py that you previously modified.", "dependencies": ["TestReturnCode_Modify"], "cutoff": 120, "ground": {"answer": "Just a simple multiple by 2 function. Num is 4 so answer is 8", "should_contain": ["8", "49", "-12"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "advanced", "description": "Small step up, just writing the function with a name as well as the return statement.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestReturnCode_Tests::test_method[challenge_data0]", "label": "ReturnCode_Tests", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestReturnCode_Modify", "category": ["code", "iterate"], "task": "Modify the multiply_int function in code.py to be able to pass in a 'multiplier' argument to multiply the 'num' by 'multiplier'. Both arguments are integers. 
You can make sure you have correctly done this by running test.py", "dependencies": ["TestReturnCode_Write"], "cutoff": 120, "ground": {"answer": "def multiply_int(num, multiplier):\n return num * multiplier\n", "should_contain": ["8", "49", "-12"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "intermediate", "description": "Builds on the previous function to also take a multiplier.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestReturnCode_Modify::test_method[challenge_data0]", "label": "ReturnCode_Modify", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestReturnCode_Write", "category": ["code", "iterate"], "task": "Add a function called multiply_int in code.py that multiplies numbers by 2. You can make sure you have correctly done this by running test.py", "dependencies": ["TestReturnCode_Simple"], "cutoff": 120, "ground": {"answer": "Just a simple multiply by 2 function. Num is 4 so answer is 8", "should_contain": ["8"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "novice", "description": "Small step up, just writing the function with a name as well as the return statement.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestReturnCode_Write::test_method[challenge_data0]", "label": "ReturnCode_Write", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestReturnCode_Simple", "category": ["code", "iterate"], "task": "Return the multiplied number in the function multiply_int in code.py. You can make sure you have correctly done this by running test.py", "dependencies": ["TestReadFile"], "cutoff": 120, "ground": {"answer": "Just a simple multiply by 2 function. 
Num is 4 so answer is 8", "should_contain": ["8"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "basic", "description": "Simple test if a simple code instruction can be executed", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestReturnCode_Simple::test_method[challenge_data0]", "label": "ReturnCode_Simple", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestDebugSimpleTypoWithGuidance", "category": ["code", "iterate"], "task": "1- Run test.py.\n2- Read code.py.\n3- Modify code.py.\nRepeat step 1, 2 and 3 until test.py runs without errors.\n", "dependencies": ["TestReadFile"], "cutoff": 75, "ground": {"answer": "[0, 1] [2, 5] [0, 3]", "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "novice", "description": "Tests ability for the agent to debug python code with a simple typo in it.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestDebugSimpleTypoWithGuidance::test_method[challenge_data0]", "label": "DebugSimpleTypoWithGuidance", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestDebugSimpleTypoWithoutGuidance", "category": ["code", "iterate"], "task": "Make test.py run without errors.", "dependencies": ["TestDebugSimpleTypoWithGuidance"], "cutoff": 75, "ground": {"answer": "[0, 1] [2, 5] [0, 3]", "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "intermediate", "description": "Tests ability for the agent to debug python code with a simple typo in it, using a very broad prompt without guidance", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestDebugSimpleTypoWithoutGuidance::test_method[challenge_data0]", "label": "DebugSimpleTypoWithoutGuidance", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestDebugMultipleTypo", "category": ["code", 
"iterate"], "task": "Make test.py run without errors.", "dependencies": ["TestDebugSimpleTypoWithoutGuidance"], "cutoff": 90, "ground": {"answer": "[0, 1] [2, 5] [0, 3]", "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "advanced", "description": "Now it's not just the typo error, but also an incomplete import statement", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestDebugMultipleTypo::test_method[challenge_data0]", "label": "DebugMultipleTypo", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestThreeSum", "category": ["code", "iterate"], "task": "Create a three_sum function in a file called code.py. Given an array of integers, return indices of the three numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice. Example: Given nums = [2, 7, 11, 15], target = 20, Because nums[0] + nums[1] + nums[2] = 2 + 7 + 11 = 20, return [0, 1, 2].", "dependencies": ["TestFunctionCodeGeneration"], "cutoff": 60, "ground": {"answer": "The three_sum function coded properly.", "should_contain": ["[0, 1, 2]", "[0, 2, 5]", "[0, 2, 3]"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "advanced", "description": "Tests ability for the agent to create the three_sum function.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestThreeSum::test_method[challenge_data0]", "label": "ThreeSum", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestFunctionCodeGeneration", "category": ["code"], "task": "Create a two_sum function in a file called code.py. Given an array of integers, return indices of the two numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice. 
Example: Given nums = [2, 7, 11, 15], target = 9, Because nums[0] + nums[1] = 2 + 7 = 9, return [0, 1].", "dependencies": ["TestReturnCode_Write"], "cutoff": 90, "ground": {"answer": "The two_sum function coded properly.", "should_contain": ["[0, 1]", "[2, 5]", "[0, 3]"], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "advanced", "description": "Tests ability for the agent to create the two_sum function.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestFunctionCodeGeneration::test_method[challenge_data0]", "label": "FunctionCodeGeneration", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestPasswordGenerator_Easy", "category": ["code"], "task": "Create a random password generator. The password should have between 8 and 16 characters and should contain letters, numbers and symbols. The password should be printed to the console. The entry point will be a python file that can be run this way: python password_generator.py", "dependencies": ["TestWriteFile"], "cutoff": 90, "ground": {"answer": "password_generator.py is created and satisfies the requirements.", "should_contain": [], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "basic", "description": "Tests ability for the agent to create a random password generator.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestPasswordGenerator_Easy::test_method[challenge_data0]", "label": "PasswordGenerator_Easy", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestWritingCLI_FileOrganizer", "category": ["code"], "task": "Create a file organizer CLI tool in Python that sorts files in a directory based on their file types (e.g., images, documents, audio) and moves them into these corresponding folders: 'images', 'documents', 'audio'. 
The entry point will be a python file that can be run this way: python organize_files.py --directory_path=YOUR_DIRECTORY_PATH", "dependencies": ["TestPasswordGenerator_Easy"], "cutoff": 90, "ground": {"answer": "The correct python file is written and organizes the files accordingly", "should_contain": [], "should_not_contain": [], "files": ["test.py"], "eval": {"type": "python"}}, "info": {"difficulty": "basic", "description": "Tests ability for the agent to code a file organizer.", "side_effects": []}}, "id": "agbenchmark/generate_test.py::TestWritingCLI_FileOrganizer::test_method[challenge_data0]", "label": "WritingCLI_FileOrganizer", "shape": "dot"}, {"color": "#1f77b4", "data": {"name": "TestWebApp_ListAnimals", "category": ["code"], "task": "Build a web page with a list of animals. When someone clicks on the word 'Dog', a message should appear that says 'Dogs are known as man's best friend!'. You'll need to make a list with the name 'Dog' and then write a little bit of JavaScript to make the message appear when the name is clicked. Mark the div containing dog with the id 'dog'. Put the message inside a