Commit Graph

5345 Commits (785a40ff9defd01be0b57fc797ec940627f00c46)

Author SHA1 Message Date
Reinier van der Leer 959377f54c
fix(ci/benchmark): Add `set +e` because we expect (some) challenges to fail 2024-02-17 15:56:55 +01:00
Reinier van der Leer 6bc83e925c
chore: Update `agbenchmark` dependency for agent and forge 2024-02-17 15:56:33 +01:00
Reinier van der Leer 4ede773f5a
debug(benchmark): Add more debug code to pinpoint cause of rare crash
Target: https://github.com/Significant-Gravitas/AutoGPT/actions/runs/7941977633/job/21684817491
2024-02-17 15:48:57 +01:00
Reinier van der Leer d5ad719757
ci: Allow telemetry for non-push events, as long as it's on `master`
Also disable telemetry for AutoGPT's unit/integration tests.
2024-02-17 15:12:43 +01:00
Reinier van der Leer 1ca9b9fa93
ci: Fix setting/passing `TELEMETRY_*` environment variables 2024-02-17 14:26:03 +01:00
Reinier van der Leer 15024fb5a1
chore: Update `agbenchmark` dependency for agent and forge 2024-02-17 14:18:02 +01:00
Reinier van der Leer fa4bdef17c
ci: Update actions to newest versions
- `actions/stale` -> `v9`
- `actions/cache` -> `v4`
- `actions/checkout` -> `v4`
- `actions/setup-node` -> `v4`
- `docker/login-action` -> `v3`
- `actions/setup-python` -> `v5`
- `codecov/codecov-action` -> `v4`
- `actions/upload-artifact` -> `v4`
- `subosito/flutter-action` -> `v2`
- `docker/build-push-action` -> `v5`
- `docker/setup-buildx-action` -> `v3`
2024-02-17 13:59:13 +01:00
Reinier van der Leer e2b519ef3b
debug(benchmark): Make sure `TestResult` validator error output is sufficient to debug 2024-02-17 13:36:17 +01:00
Reinier van der Leer 09c307d679
debug(benchmark): Add log statement to validator on `TestResult`
Validation errors don't mention the values causing the error, making it hard to debug. This happened a few times in autogpts-benchmark.yml, so let's put this log statement here until we figure out what makes it crash.
2024-02-17 13:32:22 +01:00
Reinier van der Leer 880c8e804c
fix(ci/benchmark): Allow workflow to continue regardless of challenge outcomes 2024-02-17 11:52:26 +01:00
Reinier van der Leer 5f0764b65c
chore: Update agbenchmark dependency for agent and forge 2024-02-16 19:07:37 +01:00
Reinier van der Leer 63e6014b27
fix(benchmark): Fix `TestResult.fail_reason` assignment condition
The condition must be the same as for `success`, because otherwise it causes a crash when `call.excinfo` evaluates to `False` but is not `None`.
2024-02-16 19:05:00 +01:00
Reinier van der Leer 83fcd9ad16
chore: Update `agbenchmark` dependency for agent and forge 2024-02-16 18:44:58 +01:00
Reinier van der Leer f9792ed7f3
fix(benchmark): Unbreak `-N`/`--attempts` option 2024-02-16 18:43:37 +01:00
Reinier van der Leer d6ab470c58
Rename autogpts-benchmark-nightly.yml to autogpts-benchmark.yml 2024-02-16 18:32:50 +01:00
Reinier van der Leer 666a5a8777
feat(agent/serve): Report task cost through `Step.additional_output`
- Added `task_cumulative_cost` and `task_total_cost` attributes to the `Step.additional_output` in the `AgentProtocolServer.execute_step` endpoint.
- Updated `agbenchmark` dependency in Agent and Forge
2024-02-16 18:19:04 +01:00
Reinier van der Leer 21f1e64559
feat(benchmark): Get agent task cost from `Step.additional_output` 2024-02-16 18:10:46 +01:00
Reinier van der Leer 752bac099b
feat(benchmark/report): Add and record `TestResult.n_steps`
- Added `n_steps` attribute to `TestResult` type
- Added logic to record the number of steps to `BuiltinChallenge.test_method`, `WebArenaChallenge.test_method`, and `.reports.add_test_result_to_report`
2024-02-16 17:53:19 +01:00
Reinier van der Leer a5de79beb6
ci(benchmark): Add nightly benchmark workflow
Added autogpts-benchmark-nightly.yml, which will run every night at 02:00 UTC with a selection of challenges.
2024-02-16 17:41:58 +01:00
Reinier van der Leer 483c01b681
lint(benchmark): Remove unnecessary `pass` statement in __main__.py 2024-02-16 17:27:56 +01:00
Reinier van der Leer 992b8874fc
chore: Update `agbenchmark` dependency for agent and forge 2024-02-16 17:22:58 +01:00
Reinier van der Leer 2a55efb322
fix(benchmark): Include `WebArenaSiteInfo.additional_info` (e.g. credentials) in task input
Without the `additional_info`, it is impossible to get past the login page on challenges where that is necessary.
2024-02-16 17:20:44 +01:00
Reinier van der Leer 23d58a3cc0
feat(benchmark/cli): Add `challenge list`, `challenge info` subcommands
- Add `challenge list` command with options `--all`, `--names`, `--json`
   - Add `tabular` dependency
   - Add `.utils.utils.sorted_by_enum_index` function to easily sort lists by an enum value/property based on the order of the enum's definition
- Add `challenge info [name]` command with option `--json`
   - Add `.utils.utils.pretty_print_model` routine to pretty-print Pydantic models
- Refactor `config` subcommand to use `pretty_print_model`
2024-02-16 15:17:11 +01:00
Reinier van der Leer 70e345b2ce
refactor(benchmark): `load_webarena_challenges`
- Reduce duplicate and nested statements
- Add `skip_unavailable` parameter

Related changes:
- Add `available` and `unavailable_reason` attributes to `ChallengeInfo` and `WebArenaChallengeSpec`
- Add `pytest.skip` statement to `WebArenaChallenge.test_method` to make sure unavailable challenges are not run
2024-02-16 15:11:48 +01:00
Reinier van der Leer 650a701317
chore: Update `agbenchmark` dependency for agent and forge 2024-02-15 18:19:06 +01:00
Reinier van der Leer 679339d00c
feat(benchmark): Make report output folder configurable
- Make `AgentBenchmarkConfig.reports_folder` directly configurable (through `REPORTS_FOLDER` env variable). The default is still `./agbenchmark_config/reports`.
- Change all mentions of `REPORT_LOCATION` (which fulfilled the same function at some point in the past) to `REPORTS_FOLDER`.
2024-02-15 18:07:45 +01:00
Reinier van der Leer fd5730b04a
feat(agent/telemetry): Distinguish between `production` and `dev` environment based on VCS state
- Added a helper function `.app.utils.vcs_state_diverges_from_master()`. This function determines whether the relevant part of the codebase diverges from our `master`.
- Updated `.app.telemetry._setup_sentry()` to determine the default environment name using `vcs_state_diverges_from_master`.
2024-02-15 16:00:30 +01:00
Reinier van der Leer b7f08cd0f7
feat(agent/telemetry): Enable performance tracing & update opt-in prompt accordingly 2024-02-15 14:46:36 +01:00
Reinier van der Leer 8762f7ab3d
fix(forge): Make `watchfiles` pattern more specific to prevent unwanted (breaking) reloads
This fixes the issue of changes in artifacts triggering an application reload (which caused connection errors for in-progress requests).
2024-02-15 13:42:38 +01:00
Reinier van der Leer a9b7b175ff
fix(agent/profile_generator): Improve robustness by leveraging `create_chat_completion`'s parse handling 2024-02-15 11:48:07 +01:00
Reinier van der Leer 52b93dd84e
fix(cli/agent start): Wait for applications to finish starting before returning
- Added a helper function `wait_until_conn_ready(port)` to wait for the benchmark and agent applications to finish starting
- Improved the CLI's own logging (within the `agent start` command)
2024-02-15 11:26:26 +01:00
Reinier van der Leer 6a09a44ef7
lint(agent): Fix telemetry.py linting error & formatting 2024-02-14 23:31:35 +01:00
Toran Bruce Richards 32a627eda9 Add Privacy Policy link to telementry opt-in. 2024-02-14 16:42:34 +00:00
Reinier van der Leer 67bafa6302
fix(autogpt/llm): `AssistantChatMessage.tool_calls` default `[]` instead of `None`
OpenAI ChatCompletion calls fail when `tool_calls = None`. This issue came to light after 22aba6d.
2024-02-14 14:34:04 +01:00
Reinier van der Leer 6017eefb32
ci: Enable telemetry in CI runs on `master` 2024-02-14 12:03:54 +01:00
Reinier van der Leer ae197fc85f
feat(agent/telemetry): Distinguish between users
This allows us to get a much better sense of how many users actually experience issues, and how issue occurrence is distributed among users.
2024-02-14 11:50:45 +01:00
Reinier van der Leer 22aba6dd8a
fix(agent/llm): Include bad response in parse-fix prompt in `OpenAIProvider.create_chat_completion`
Apparently I forgot to also append the response that caused the parse error before throwing it back to the LLM and letting it fix its mistake(s).
2024-02-14 11:20:31 +01:00
Reinier van der Leer 88bbdfc7fc
ci: Pick 3 challenges to run with `--mock` in smoke test CI 2024-02-14 02:30:03 +01:00
Reinier van der Leer d0c9b7c405
lint(benchmark): Remove unused imports 2024-02-14 01:34:30 +01:00
Reinier van der Leer e7698a4610
chore(agent): Update `forge` and `agbenchmark` dependencies 2024-02-14 01:32:28 +01:00
Reinier van der Leer ab05b7ae70
chore(forge): Update `agbenchmark` dependency 2024-02-14 01:27:07 +01:00
Reinier van der Leer 327fb1f916
fix(benchmark): Mock mode, python evals, `--attempts` flag, challenge definitions
- Fixed `--mock` mode
   - Moved interrupt to beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code is properly executed after executing a single step.
   - Implemented mock mode in `WebArenaChallenge`

- Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted

- Fixed handling of `python`/`pytest` evals in `BuiltinChallenge`

- Disabled left-over Helicone code (see 056163e)

- Fixed a couple of challenge definitions
   - WebArena task 107: fix spelling of months (Sepetember, Octorbor *lmao*)
   - synthesize/1_basic_content_gen (SynthesizeInfo): remove empty string from `should_contain` list

- Added some debug logging in agent_api_interface.py and challenges/builtin.py
2024-02-14 01:05:34 +01:00
Reinier van der Leer bb7f5abc6c
fix(agent/text_processing): Fix `extract_information` LLM response parsing
OpenAI's newest models return JSON with markdown fences around it, breaking the `json.loads` parser.

This commit adds an `extract_list_from_response` function to json_utils/utilities.py and uses this function to replace `json.loads` in `_process_text`.
2024-02-13 18:28:17 +01:00
Reinier van der Leer 393d6b97e6
feat(agent): Add Sentry integration for telemetry
* Add Sentry integration for telemetry
   - Add `sentry_sdk` dependency
   - Add setup logic and config flow using `TELEMETRY_OPT_IN` environment variable
      - Add app/telemetry.py with `setup_telemetry` helper routine
      - Call `setup_telemetry` in `cli()` in app/cli.py
      - Add `TELEMETRY_OPT_IN` to .env.template
      - Add helper function `env_file_exists` and routine `set_env_config_value` to app/utils.py
         - Add unit tests for `set_env_config_value` in test_utils.py
      - Add prompt to startup to ask whether the user wants to enable telemetry if the env variable isn't set

* Add `capture_exception` statements for LLM parsing errors and command failures
2024-02-13 18:10:52 +01:00
Reinier van der Leer 3b8d63dfb6
chore(agent): Update autogpt-forge and agbenchmark dependencies to propagate dependency updates
This also indirectly updates `python-multipart` and fixes "python-multipart vulnerable to Content-Type Header ReDoS" https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/57
2024-02-13 13:24:24 +01:00
Reinier van der Leer 6763196d78
chore(forge): Update agbenchmark dependency 2024-02-13 12:44:17 +01:00
Reinier van der Leer e1da58da02
chore(forge): Update aiohttp, fastapi, and python-multipart dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/56
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/52
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/49

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/45
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/42
2024-02-13 12:38:36 +01:00
Reinier van der Leer 91cec515d4
chore(benchmark): Update `python-multipart` dependency to mitigate vulnerability
- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55
2024-02-13 12:36:00 +01:00
Reinier van der Leer cc585a014f
chore(agent): Update aiohttp and fastapi dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/57
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/54
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/50

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/44
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/41
2024-02-13 12:30:12 +01:00
Reinier van der Leer e641cccb42
chore(benchmark): Update `aiohttp` and `fastapi` dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/53
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/48

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/46
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/43
2024-02-13 12:21:52 +01:00