AutoGPT

Commit Graph

Author	SHA1	Message	Date
Reinier van der Leer	db0e726954	fix(agent, benchmark): Specify `path_type=Path` for CLI path options/arguments Without `path_type=Path`, an option/argument with `type=click.Path()` will return a `str`.	2024-07-03 15:21:04 -06:00
Krzysztof Czerwinski	7cb4d4a903	feat(forge, agent, benchmark): Upgrade to Pydantic v2 (#7280) Update Pydantic dependency of `autogpt`, `forge` and `benchmark` to `^2.7` [Pydantic Migration Guide](https://docs.pydantic.dev/2.7/migration/) - Migrate usages of now-deprecated functions to their replacements - Update `Field` definitions - Ellipsis `...` for required fields is deprecated - `Field` no longer supports extra `kwargs`, replace use of this feature with field metadata - Replace `Config` class for specifying model configuration with `model_config = ConfigDict(..)` - Removed `ModelContainer` in `BaseAgent`, component configuration dict is now directly serialized using Pydantic v2 helper functions - Forked `agent-protocol` and updated `packages/client/python` for Pydantic v2 support: https://github.com/Significant-Gravitas/agent-protocol --------- Co-authored-by: Reinier van der Leer <pwuts@agpt.co>	2024-07-02 20:45:32 +02:00
Reinier van der Leer	cbae8b5c14	chore(agent, forge, benchmark): Clean up dependencies (#7286) * Remove unused dependencies * Move dependencies for moved code from `autogpt` to `forge` * Loosen dependency for `uvicorn` to improve compatibility	2024-06-28 02:21:36 +02:00
Reinier van der Leer	fbb3891e79	chore(forge, agent, benchmark): Update `pytest-asyncio` to v0.23.x Resolves #7283	2024-06-27 14:09:36 -06:00
SwiftyOS	1e4ef7b313	chore(benchmark): delete notebooks	2024-06-11 11:30:46 +02:00
Reinier van der Leer	f2cb553c9a	chore(agent, forge, benchmark): Update `pyright` to v1.1.366	2024-06-08 21:48:25 +02:00
RainRat	cb9ad6f64d	fix typos (#7123) * fix typos in various places * Revert changes to NOTICES --------- Co-authored-by: Nicholas Tindle <nicholas.tindle@agpt.co>	2024-05-31 11:16:23 +02:00
Reinier van der Leer	738c8ffff0	fix(benchmark): Improve output and debug logging of pytest evals	2024-05-30 17:16:17 +02:00
Reinier van der Leer	f107ff8cf0	Set up unified pre-commit + CI w/ linting + type checking & FIX EVERYTHING (#7171 ) - FIX ALL LINT/TYPE ERRORS IN AUTOGPT, FORGE, AND BENCHMARK ### Linting - Clean up linter configs for `autogpt`, `forge`, and `benchmark` - Add type checking with Pyright - Create unified pre-commit config - Create unified linting and type checking CI workflow ### Testing - Synchronize CI test setups for `autogpt`, `forge`, and `benchmark` - Add missing pytest-cov to benchmark dependencies - Mark GCS tests as slow to speed up pre-commit test runs - Repair `forge` test suite - Add `AgentDB.close()` method for test DB teardown in db_test.py - Use actual temporary dir instead of forge/test_workspace/ - Move left-behind dependencies for moved `forge`-code to from autogpt to forge ### Notable type changes - Replace uses of `ChatModelProvider` by `MultiProvider` - Removed unnecessary exports from various __init__.py - Simplify `FileStorage.open_file` signature by removing `IOBase` from return type union - Implement `S3BinaryIOWrapper(BinaryIO)` type interposer for `S3FileStorage` - Expand overloads of `GCSFileStorage.open_file` for improved typing of read and write modes Had to silence type checking for the extra overloads, because (I think) Pyright is reporting a false-positive: https://github.com/microsoft/pyright/issues/8007 - Change `count_tokens`, `get_tokenizer`, `count_message_tokens` methods on `ModelProvider`s from class methods to instance methods - Move `CompletionModelFunction.schema` method -> helper function `format_function_def_for_openai` in `forge.llm.providers.openai` - Rename `ModelProvider` -> `BaseModelProvider` - Rename `ChatModelProvider` -> `BaseChatModelProvider` - Add type `ChatModelProvider` which is a union of all subclasses of `BaseChatModelProvider` ### Removed rather than fixed - Remove deprecated and broken autogpt/agbenchmark_config/benchmarks.py - Various base classes and properties on base classes in 
`forge.llm.providers.schema` and `forge.models.providers` ### Fixes for other issues that came to light - Clean up `forge.agent_protocol.api_router`, `forge.agent_protocol.database`, and `forge.agent.agent` - Add fallback behavior to `ImageGeneratorComponent` - Remove test for deprecated failure behavior - Fix `agbenchmark.challenges.builtin` challenge exclusion mechanism on Windows - Fix `_tool_calls_compat_extract_calls` in `forge.llm.providers.openai` - Add support for `any` (= no type specified) in `JSONSchema.typescript_type`	2024-05-28 05:04:21 +02:00
Swifty	2cca4fa47f	clean(benchmark): Remove Deprecated Challenges (#7144) * Remove deprecated challenges * Update license and pyproject.toml	2024-05-20 15:01:36 +02:00
Reinier van der Leer	c26c79c34c	fix(benchmark/reports): Resolve error in format.py on `attempt.cost` is `None`	2024-02-29 19:01:47 +01:00
Reinier van der Leer	d5f2bbf093	fix(benchmark/reports): Make format.py executable	2024-02-20 14:50:32 +01:00
Albert Örwall	4ef912d734	fix(benchmark/challenges): Improve spec and eval of TicTacToe challenge * In challenge specification, specify `subprocess.PIPE` for `stdin` and `stderr` for completeness * Additional tweak: let Pytest load only the current file when running the test file as a script Co-authored-by: Reinier van der Leer <pwuts@agpt.co>	2024-02-20 11:52:59 +01:00
Reinier van der Leer	bfd479a50b	feat(benchmark): Add reports/format.py script to convert report.json to markdown	2024-02-19 17:13:05 +01:00
Reinier van der Leer	3a17011129	feat(benchmark): Include Steps in Report	2024-02-19 17:08:24 +01:00
Reinier van der Leer	7f71d6d9fd	debug(benchmark): Improve `TestResult` validation error output format	2024-02-18 17:10:14 +01:00
Reinier van der Leer	4ede773f5a	debug(benchmark): Add more debug code to pinpoint cause of rare crash Target: https://github.com/Significant-Gravitas/AutoGPT/actions/runs/7941977633/job/21684817491	2024-02-17 15:48:57 +01:00
Reinier van der Leer	e2b519ef3b	debug(benchmark): Make sure `TestResult` validator error output is sufficient to debug	2024-02-17 13:36:17 +01:00
Reinier van der Leer	09c307d679	debug(benchmark): Add log statement to validator on `TestResult` Validation errors don't mention the values causing the error, making it hard to debug. This happened a few times in autogpts-benchmark.yml, so let's put this log statement here until we figure out what makes it crash.	2024-02-17 13:32:22 +01:00
Reinier van der Leer	63e6014b27	fix(benchmark): Fix `TestResult.fail_reason` assignment condition The condition must be the same as for `success`, because otherwise it causes a crash when `call.excinfo` evaluates to `False` but is not `None`.	2024-02-16 19:05:00 +01:00
Reinier van der Leer	f9792ed7f3	fix(benchmark): Unbreak `-N`/`--attempts` option	2024-02-16 18:43:37 +01:00
Reinier van der Leer	21f1e64559	feat(benchmark): Get agent task cost from `Step.additional_output`	2024-02-16 18:10:46 +01:00
Reinier van der Leer	752bac099b	feat(benchmark/report): Add and record `TestResult.n_steps` - Added `n_steps` attribute to `TestResult` type - Added logic to record the number of steps to `BuiltinChallenge.test_method`, `WebArenaChallenge.test_method`, and `.reports.add_test_result_to_report`	2024-02-16 17:53:19 +01:00
Reinier van der Leer	483c01b681	lint(benchmark): Remove unnecessary `pass` statement in __main__.py	2024-02-16 17:27:56 +01:00
Reinier van der Leer	2a55efb322	fix(benchmark): Include `WebArenaSiteInfo.additional_info` (e.g. credentials) in task input Without the `additional_info`, it is impossible to get past the login page on challenges where that is necessary.	2024-02-16 17:20:44 +01:00
Reinier van der Leer	23d58a3cc0	feat(benchmark/cli): Add `challenge list`, `challenge info` subcommands - Add `challenge list` command with options `--all`, `--names`, `--json` - Add `tabular` dependency - Add `.utils.utils.sorted_by_enum_index` function to easily sort lists by an enum value/property based on the order of the enum's definition - Add `challenge info [name]` command with option `--json` - Add `.utils.utils.pretty_print_model` routine to pretty-print Pydantic models - Refactor `config` subcommand to use `pretty_print_model`	2024-02-16 15:17:11 +01:00
Reinier van der Leer	70e345b2ce	refactor(benchmark): `load_webarena_challenges` - Reduce duplicate and nested statements - Add `skip_unavailable` parameter Related changes: - Add `available` and `unavailable_reason` attributes to `ChallengeInfo` and `WebArenaChallengeSpec` - Add `pytest.skip` statement to `WebArenaChallenge.test_method` to make sure unavailable challenges are not run	2024-02-16 15:11:48 +01:00
Reinier van der Leer	679339d00c	feat(benchmark): Make report output folder configurable - Make `AgentBenchmarkConfig.reports_folder` directly configurable (through `REPORTS_FOLDER` env variable). The default is still `./agbenchmark_config/reports`. - Change all mentions of `REPORT_LOCATION` (which fulfilled the same function at some point in the past) to `REPORTS_FOLDER`.	2024-02-15 18:07:45 +01:00
Reinier van der Leer	d0c9b7c405	lint(benchmark): Remove unused imports	2024-02-14 01:34:30 +01:00
Reinier van der Leer	327fb1f916	fix(benchmark): Mock mode, python evals, `--attempts` flag, challenge definitions - Fixed `--mock` mode - Moved interrupt to beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code is properly executed after executing a single step. - Implemented mock mode in `WebArenaChallenge` - Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted - Fixed handling of `python`/`pytest` evals in `BuiltinChallenge` - Disabled left-over Helicone code (see `056163e`) - Fixed a couple of challenge definitions - WebArena task 107: fix spelling of months (Sepetember, Octorbor lmao) - synthesize/1_basic_content_gen (SynthesizeInfo): remove empty string from `should_contain` list - Added some debug logging in agent_api_interface.py and challenges/builtin.py	2024-02-14 01:05:34 +01:00
Reinier van der Leer	91cec515d4	chore(benchmark): Update `python-multipart` dependency to mitigate vulnerability - python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55	2024-02-13 12:36:00 +01:00
Reinier van der Leer	e641cccb42	chore(benchmark): Update `aiohttp` and `fastapi` dependencies to mitigate vulnerabilities Addressed vulnerabilities: - python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55 Dependants: - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/53 - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/48 - aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/46 - aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/43	2024-02-13 12:21:52 +01:00
Reinier van der Leer	a0cae78ba3	feat(benchmark): Add `-N`, `--attempts` option for multiple attempts per challenge LLMs are probabilistic systems. Reproducibility of completions is not guaranteed. It only makes sense to account for this, by running challenges multiple times to obtain a success ratio rather than a boolean success/failure result. Changes: - Add `-N`, `--attempts` option to CLI and `attempts_per_challenge` parameter to `main.py:run_benchmark`. - Add dynamic `i_attempt` fixture through `pytest_generate_tests` hook in conftest.py to achieve multiple runs per challenge. - Modify `pytest_runtest_makereport` hook in conftest.py to handle multiple reporting calls per challenge. - Refactor report_types.py, reports.py, process_report.py to allow multiple results per challenge. - Calculate `success_percentage` from results of the current run, rather than all known results ever. - Add docstrings to a number of models in report_types.py. - Allow `None` as a success value, e.g. for runs that did not render any results before being cut off. - Make SingletonReportManager thread-safe.	2024-01-22 17:16:55 +01:00
Reinier van der Leer	488f40a20f	feat(benchmark): JungleGym WebArena (#6691) * feat(benchmark): Add JungleGym WebArena challenges - Add `WebArenaChallenge`, `WebArenaChallengeSpec`, and other logic to make these challenges work - Add WebArena challenges to Pytest collection endpoint generate_test.py * feat(benchmark/webarena): Add hand-picked selection of WebArena challenges	2024-01-19 20:34:04 +01:00
Reinier van der Leer	05b018a837	fix(benchmark/report): Fix and clean up logic in `update_challenges_already_beaten` - `update_challenges_already_beaten` incorrectly marked challenges as beaten if it was present in the report file but set to `false`	2024-01-19 19:52:09 +01:00
Reinier van der Leer	9e4dfd8058	fix(benchmark): Fix challenge input artifact upload	2024-01-19 17:29:53 +01:00
Reinier van der Leer	9012ff4db2	refactor(benchmark): Interface & type consolidation, and arch change, to allow adding challenge providers Squashed commit of the following: commit `7d6476d329` Author: Reinier van der Leer <pwuts@agpt.co> Date: Tue Jan 9 18:10:45 2024 +0100 refactor(benchmark/challenge): Set up structure to support more challenge providers - Move `Challenge`, `ChallengeData`, `load_challenges` to `challenges/builtin.py` and rename to `BuiltinChallenge`, `BuiltinChallengeSpec`, `load_builtin_challenges` - Create `BaseChallenge` to serve as interface and base class for different challenge implementations - Create `ChallengeInfo` model to serve as universal challenge info object - Create `get_challenge_from_source_uri` function in `challenges/__init__.py` - Replace `ChallengeData` by `ChallengeInfo` everywhere except in `BuiltinChallenge` - Add strong typing to `task_informations` store in app.py - Use `call.duration` in `finalize_test_report` and remove `timer` fixture - Update docstring on `challenges/__init__.py:get_unique_categories` - Add docstring to `generate_test.py` commit `5df2aa7939` Author: Reinier van der Leer <pwuts@agpt.co> Date: Tue Jan 9 16:58:01 2024 +0100 refactor(benchmark): Refactor & rename functions in agent_interface.py and agent_api_interface.py - `copy_artifacts_into_temp_folder` -> `copy_challenge_artifacts_into_workspace` - `copy_agent_artifacts_into_folder` -> `download_agent_artifacts_into_folder` - Reorder parameters of `run_api_agent`, `copy_challenge_artifacts_into_workspace`; use `Path` instead of `str` commit `6a256fef4c` Author: Reinier van der Leer <pwuts@agpt.co> Date: Tue Jan 9 16:02:25 2024 +0100 refactor(benchmark): Refactor & typefix report generation and handling logic - Rename functions in reports.py and ReportManager.py to better reflect what they do - `get_previous_test_results` -> `get_and_update_success_history` - `generate_single_call_report` -> `initialize_test_report` - `finalize_reports` ->
`finalize_test_report` - `ReportManager.end_info_report` -> `SessionReportManager.finalize_session_report` - Modify `pytest_runtest_makereport` hook in conftest.py to finalize the report immediately after the challenge finishes running instead of after teardown - Move result processing logic from `initialize_test_report` to `finalize_test_report` in reports.py - Use `Test` and `Report` types from report_types.py where possible instead of untyped dicts: reports.py, utils.py, ReportManager.py - Differentiate `ReportManager` into `SessionReportManager`, `RegressionTestsTracker`, `SuccessRateTracker` - Move filtering of optional challenge categories from challenge.py (`Challenge.skip_optional_categories`) to conftest.py (`pytest_collection_modifyitems`) - Remove unused `scores` fixture in conftest.py commit `370d6dbf5d` Author: Reinier van der Leer <pwuts@agpt.co> Date: Tue Jan 9 15:16:43 2024 +0100 refactor(benchmark): Simplify models in report_types.py - Removed ForbidOptionalMeta and BaseModelBenchmark classes. - Changed model attributes to optional: `Metrics.difficulty`, `Metrics.success`, `Metrics.success_percentage`, `Metrics.run_time`, and `Test.reached_cutoff`. - Added validator to `Metrics` model to require `success` and `run_time` fields if `attempted=True`. - Added default values to all optional model fields. - Removed duplicate imports. - Added condition in process_report.py to prevent null lookups if `metrics.difficulty` is not set.	2024-01-18 15:19:06 +01:00
Reinier van der Leer	0a4185a919	chore(benchmark): Upgrade OpenAI client lib from v0 to v1	2024-01-16 15:49:46 +01:00
Reinier van der Leer	056163ee57	refactor(benchmark): Disable Helicone integrations We want to upgrade the OpenAI library, but `helicone` does not support `openai@^1.0.0`, so we're disabling the Helicone integration for now.	2024-01-16 15:38:47 +01:00
Reinier van der Leer	25cc6ad6ae	AGBenchmark codebase clean-up (#6650 ) * refactor(benchmark): Deduplicate configuration loading logic - Move the configuration loading logic to a separate `load_agbenchmark_config` function in `agbenchmark/config.py` module. - Replace the duplicate loading logic in `conftest.py`, `generate_test.py`, `ReportManager.py`, `reports.py`, and `__main__.py` with calls to `load_agbenchmark_config` function. * fix(benchmark): Fix type errors, linting errors, and clean up CLI validation in __main__.py - Fixed type errors and linting errors in `__main__.py` - Improved the readability of CLI argument validation by introducing a separate function for it * refactor(benchmark): Lint and typefix app.py - Rearranged and cleaned up import statements - Fixed type errors caused by improper use of `psutil` objects - Simplified a number of `os.path` usages by converting to `pathlib` - Use `Task` and `TaskRequestBody` classes from `agent_protocol_client` instead of `.schema` * refactor(benchmark): Replace `.agent_protocol_client` by `agent-protcol-client`, clean up schema.py - Remove `agbenchmark.agent_protocol_client` (an offline copy of `agent-protocol-client`). - Add `agent-protocol-client` as a dependency and change imports to `agent_protocol_client`. - Fix type annotation on `agent_api_interface.py::upload_artifacts` (`ApiClient` -> `AgentApi`). - Remove all unused types from schema.py (= most of them). * refactor(benchmark): Use pathlib in agent_interface.py and agent_api_interface.py * refactor(benchmark): Improve typing, response validation, and readability in app.py - Simplified response generation by leveraging type checking and conversion by FastAPI. - Introduced use of `HTTPException` for error responses. - Improved naming, formatting, and typing in `app.py::create_evaluation`. - Updated the docstring on `app.py::create_agent_task`. - Fixed return type annotations of `create_single_test` and `create_challenge` in generate_test.py. 
- Added default values to optional attributes on models in report_types_v2.py. - Removed unused imports in `generate_test.py` * refactor(benchmark): Clean up logging and print statements - Introduced use of the `logging` library for unified logging and better readability. - Converted most print statements to use `logger.debug`, `logger.warning`, and `logger.error`. - Improved descriptiveness of log statements. - Removed unnecessary print statements. - Added log statements to unspecific and non-verbose `except` blocks. - Added `--debug` flag, which sets the log level to `DEBUG` and enables a more comprehensive log format. - Added `.utils.logging` module with `configure_logging` function to easily configure the logging library. - Converted raw escape sequences in `.utils.challenge` to use `colorama`. - Renamed `generate_test.py::generate_tests` to `load_challenges`. * refactor(benchmark): Remove unused server.py and agent_interface.py::run_agent - Remove unused server.py file - Remove unused run_agent function from agent_interface.py * refactor(benchmark): Clean up conftest.py - Fix and add type annotations - Rewrite docstrings - Disable or remove unused code - Fix definition of arguments and their types in `pytest_addoption` * refactor(benchmark): Clean up generate_test.py file - Refactored the `create_single_test` function for clarity and readability - Removed unused variables - Made creation of `Challenge` subclasses more straightforward - Made bare `except` more specific - Renamed `Challenge.setup_challenge` method to `run_challenge` - Updated type hints and annotations - Made minor code/readability improvements in `load_challenges` - Added a helper function `_add_challenge_to_module` for attaching a Challenge class to the current module * fix(benchmark): Fix and add type annotations in execute_sub_process.py * refactor(benchmark): Simplify const determination in agent_interface.py - Simplify the logic that determines the value of `HELICONE_GRAPHQL_LOGS` * 
fix(benchmark): Register category markers to prevent warnings - Use the `pytest_configure` hook to register the known challenge categories as markers. Otherwise, Pytest will raise "unknown marker" warnings at runtime. * refactor(benchmark/challenges): Fix indentation in 4_revenue_retrieval_2/data.json * refactor(benchmark): Update agent_api_interface.py - Add type annotations to `copy_agent_artifacts_into_temp_folder` function - Add note about broken endpoint in the `agent_protocol_client` library - Remove unused variable in `run_api_agent` function - Improve readability and resolve linting error * feat(benchmark): Improve and centralize pathfinding - Search path hierarchy for applicable `agbenchmark_config`, rather than assuming it's in the current folder. - Create `agbenchmark.utils.path_manager` with `AGBenchmarkPathManager` and exporting a `PATH_MANAGER` const. - Replace path constants defined in __main__.py with usages of `PATH_MANAGER`. * feat(benchmark/cli): Clean up and improve CLI - Updated commands, options, and their descriptions to be more intuitive and consistent - Moved slow imports into the entrypoints that use them to speed up application startup - Fixed type hints to match output types of Click options - Hid deprecated `agbenchmark start` command - Refactored code to improve readability and maintainability - Moved main entrypoint into `run` subcommand - Fixed `version` and `serve` subcommands - Added `click-default-group` package to allow using `run` implicitly (for backwards compatibility) - Renamed `--no_dep` to `--no-dep` for consistency - Fixed string formatting issues in log statements * refactor(benchmark/config): Move AgentBenchmarkConfig and related functions to config.py - Move the `AgentBenchmarkConfig` class from `utils/data_types.py` to `config.py`. - Extract the `calculate_info_test_path` function from `utils/data_types.py` and move it to `config.py` as a private helper function `_calculate_info_test_path`. 
- Move `load_agent_benchmark_config()` to `AgentBenchmarkConfig.load()`. - Changed simple getter methods on `AgentBenchmarkConfig` to calculated properties. - Update all code references according to the changes mentioned above. * refactor(benchmark): Fix ReportManager init parameter types and use pathlib - Fix the type annotation of the `benchmark_start_time` parameter in `ReportManager.__init__`, was mistyped as `str` instead of `datetime`. - Change the type of the `filename` parameter in the `ReportManager.__init__` method from `str` to `Path`. - Rename `self.filename` with `self.report_file` in `ReportManager`. - Change the way the report file is created, opened and saved to use the `Path` object. * refactor(benchmark): Improve typing surrounding ChallengeData and clean up its implementation - Use `ChallengeData` objects instead of untyped `dict` in app.py, generate_test.py, reports.py. - Remove unnecessary methods `serialize`, `get_data`, `get_json_from_path`, `deserialize` from `ChallengeData` class. - Remove unused methods `challenge_from_datum` and `challenge_from_test_data` from `ChallengeData class. - Update function signatures and annotations of `create_challenge` and `generate_single_test` functions in generate_test.py. - Add types to function signatures of `generate_single_call_report` and `finalize_reports` in reports.py. - Remove unnecessary `challenge_data` parameter (in generate_test.py) and fixture (in conftest.py). * refactor(benchmark): Clean up generate_test.py, conftest.py and __main__.py - Cleaned up generate_test.py and conftest.py - Consolidated challenge creation logic in the `Challenge` class itself, most notably the new `Challenge.from_challenge_spec` method. - Moved challenge selection logic from generate_test.py to the `pytest_collection_modifyitems` hook in conftest.py. - Converted methods in the `Challenge` class to class methods where appropriate. - Improved argument handling in the `run_benchmark` function in `__main__.py`. 
* refactor(benchmark/config): Merge AGBenchmarkPathManager into AgentBenchmarkConfig and reduce fragmented/global state - Merge the functionality of `AGBenchmarkPathManager` into `AgentBenchmarkConfig` to consolidate the configuration management. - Remove the `.path_manager` module containing `AGBenchmarkPathManager`. - Pass the `AgentBenchmarkConfig` and its attributes through function arguments to reduce global state and improve code clarity. * feat(benchmark/serve): Configurable port for `serve` subcommand - Added `--port` option to `serve` subcommand to allow for specifying the port to run the API on. - If no `--port` option is provided, the port will default to the value specified in the `PORT` environment variable, or 8080 if not set. * feat(benchmark/cli): Add `config` subcommand - Added a new subcommand `config` to the AGBenchmark CLI, to display information about the present AGBenchmark config. * fix(benchmark): Gracefully handle incompatible challenge spec files in app.py - Added a check to skip deprecated challenges - Added logging to allow debugging of the loading process - Added handling of validation errors when parsing challenge spec files - Added missing `spec_file` attribute to `ChallengeData` * refactor(benchmark): Move `run_benchmark` entrypoint to main.py, use it in `/reports` endpoint - Move `run_benchmark` and `validate_args` from __main__.py to main.py - Replace agbenchmark subprocess in `app.py:run_single_test` with `run_benchmark` - Move `get_unique_categories` from __main__.py to challenges/__init__.py - Move `OPTIONAL_CATEGORIES` from __main__.py to challenge.py - Reduce operations on updates.json (including `initialize_updates_file`) outside of API * refactor(benchmark): Remove unused `/updates` endpoint and all related code - Remove `updates_json_file` attribute from `AgentBenchmarkConfig` - Remove `get_updates` and `_initialize_updates_file` in app.py - Remove `append_updates_file` and `create_update_json` functions in 
agent_api_interface.py - Remove call to `append_updates_file` in challenge.py * refactor(benchmark/config): Clean up and update docstrings on `AgentBenchmarkConfig` - Add and update docstrings - Change base class from `BaseModel` to `BaseSettings`, allow extras for backwards compatibility - Make naming of path attributes on `AgentBenchmarkConfig` more consistent - Remove unused `agent_home_directory` attribute - Remove unused `workspace` attribute * fix(benchmark): Restore mechanism to select (optional) categories in agent benchmark config * fix(benchmark): Update agent-protocol-client to v1.1.0 - Fixes issue with fetching task artifact listings	2024-01-02 22:23:09 +01:00
Reinier van der Leer	b106a61352	Clean up & fix GitHub workflows (#6313 ) * ci: Mitigate security issues in autogpt-ci.yml - Remove unnecessary pull_request_target paths and related variables and config - Set permissions for contents to read only * ci: Simplify steps in autogpt-ci.yml workflow using GitHub CLI - Simplify step in 'autogpt-ci.yml' by using GitHub CLI instead of API for adding label and comment functionality - Replace curl command with 'gh issue edit' to add "behaviour change" label to the pull request - Replace gh api command with 'gh issue comment' to leave a comment about the changed behavior of AutoGPT in the pull request * ci: Fix issues in workflows - Move environment variable definition to top level in benchmark-ci.yml (because the other job also needs it) - Removed invalid 'branches: [hackathon]' restriction in hackathon.yml workflow - Removed redundant 'ref' and 'repository' fields in the 'checkout' step of both workflows. * ci: Delete legacy benchmarks.yml workflow * ci: Add triggers for CI workflows - Add triggers to run CI workflows when they are edited. - Update the paths for the CI workflows in the trigger configuration. * fix: Fix benchmark lint error - Removed unnecessary blank lines in report_types.py - Fixed string quotes in challenge.py to maintain consistency * fix: Update task description in password generator data.json - Update task description in `data.json` file for the password generator challenge to clarify the input requirements and error handling. - This change is made in an attempt to make the Benchmark CI pass. * fix: Fix PasswordGenerator challenge in CI - Fix the behavior of the reference password_generator.py to align with the task description - Use default password length 8 instead of a random length in the generate_password function - Retrieve the password length from the command line arguments if "--length" is provided, else set it to 8	2023-11-21 10:58:54 +01:00
SwiftyOS	fa357dd139	fix: Fixing Benchmarking - Importing missing metadata field in Test class in report_types.py - Adding GAIA categories 1, 2, and 3 in data_types.py	2023-11-09 10:00:50 +01:00
Silen Naihin	e5e0c4bf9d	reverting new challenges	2023-10-20 21:13:09 -07:00
Silen Naihin	825c3adf62	case sensitivity, updating challenges	2023-10-20 08:26:29 -07:00
Silen Naihin	09f6a37292	fix capitalization, rename	2023-10-20 07:21:41 -07:00
Silen Naihin	655bc8b08e	fix data challenges	2023-10-19 17:42:24 -07:00
Silen Naihin	7ddef39918	scrape synthesize challenge additions	2023-10-19 17:39:09 -07:00
Silen Naihin	344ef3bf8b	fixing password gen and revenue retrieval 2 challenges	2023-10-17 20:28:49 -07:00
Reinier van der Leer	10aececc6a	Fix subproject dependency compatibility	2023-10-17 10:36:05 -07:00
Silen Naihin	74ee69daf1	Update data.json	2023-10-14 08:04:37 -07:00
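
The `-N`/`--attempts` commit (a0cae78ba3) above describes replacing a boolean pass/fail result with a success ratio over several attempts, and allowing `None` for attempts that produced no result. A minimal sketch of that idea follows. The function name, its signature, and the choice to leave `None` entries out of the ratio are assumptions made for illustration, not the benchmark's actual code.

```python
from typing import Optional, Sequence


def success_percentage(results: Sequence[Optional[bool]]) -> Optional[float]:
    """Return the share of successful attempts as a percentage.

    `None` entries stand for attempts that produced no result (for example,
    runs that were cut off). Assumption: this sketch leaves them out of the
    ratio; the real benchmark may count them differently.
    """
    completed = [r for r in results if r is not None]
    if not completed:
        return None  # no completed attempts, so there is no ratio to report
    return 100 * sum(completed) / len(completed)


# Three completed attempts (two passes, one failure) and one cut-off run.
ratio = success_percentage([True, False, True, None])
```

Returning `None` when no attempt completed keeps "no data" distinct from a 0% success rate.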


165 Commits (6ffa644fb636b196ae3154276c4f099b0d244534)
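
Commit 23d58a3cc0 mentions a helper, `sorted_by_enum_index`, that sorts lists by an enum value in the order the enum's members are defined. The sketch below shows one way such a helper could work. The name `sort_by_definition_order`, its parameters, and the `Difficulty` enum are hypothetical and are not taken from the repository.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, List, Type, TypeVar

T = TypeVar("T")
E = TypeVar("E", bound=Enum)


def sort_by_definition_order(
    items: Iterable[T],
    enum: Type[E],
    key: Callable[[T], E],
    reverse: bool = False,
) -> List[T]:
    """Sort `items` by the position of `key(item)` in the enum's definition."""
    # Iterating over an Enum class yields its members in definition order.
    order: Dict[E, int] = {member: index for index, member in enumerate(enum)}
    return sorted(items, key=lambda item: order[key(item)], reverse=reverse)


class Difficulty(Enum):  # hypothetical enum, for illustration only
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


challenges = [("b", Difficulty.HARD), ("a", Difficulty.EASY), ("c", Difficulty.MEDIUM)]
ordered = sort_by_definition_order(challenges, Difficulty, key=lambda c: c[1])
```

Sorting by the member's string value would place `hard` before `medium`; looking up each member's definition index avoids that.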