Commit Graph

5113 Commits (32a627eda91f0c36645b78574981137c4c1563f3)

Author SHA1 Message Date
Toran Bruce Richards 32a627eda9 Add Privacy Policy link to telemetry opt-in. 2024-02-14 16:42:34 +00:00
Reinier van der Leer 67bafa6302
fix(autogpt/llm): `AssistantChatMessage.tool_calls` default `[]` instead of `None`
OpenAI ChatCompletion calls fail when `tool_calls = None`. This issue came to light after 22aba6d.
2024-02-14 14:34:04 +01:00
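A minimal sketch of the idea behind this fix, using a stdlib dataclass as a simplified stand-in for the project's actual message model. The `ToolCall` fields and the `content` field are assumptions for illustration; only the `tool_calls` default of `[]` comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    # Hypothetical, simplified stand-in for the project's tool call type
    name: str
    arguments: str


@dataclass
class AssistantChatMessage:
    # Simplified stand-in; the real model is assumed to have more fields
    content: Optional[str] = None
    # default_factory gives every message its own empty list instead of None,
    # so code that serializes or iterates over tool_calls never sees None
    tool_calls: list[ToolCall] = field(default_factory=list)


message = AssistantChatMessage(content="Done.")
print(message.tool_calls)  # prints []
```

Using `default_factory` rather than a literal `[]` avoids sharing one list object between all instances.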
Reinier van der Leer 6017eefb32
ci: Enable telemetry in CI runs on `master` 2024-02-14 12:03:54 +01:00
Reinier van der Leer ae197fc85f
feat(agent/telemetry): Distinguish between users
This allows us to get a much better sense of how many users actually experience issues, and how issue occurrence is distributed among users.
2024-02-14 11:50:45 +01:00
Reinier van der Leer 22aba6dd8a
fix(agent/llm): Include bad response in parse-fix prompt in `OpenAIProvider.create_chat_completion`
Apparently I forgot to also append the response that caused the parse error before throwing it back to the LLM and letting it fix its mistake(s).
2024-02-14 11:20:31 +01:00
Reinier van der Leer 88bbdfc7fc
ci: Pick 3 challenges to run with `--mock` in smoke test CI 2024-02-14 02:30:03 +01:00
Reinier van der Leer d0c9b7c405
lint(benchmark): Remove unused imports 2024-02-14 01:34:30 +01:00
Reinier van der Leer e7698a4610
chore(agent): Update `forge` and `agbenchmark` dependencies 2024-02-14 01:32:28 +01:00
Reinier van der Leer ab05b7ae70
chore(forge): Update `agbenchmark` dependency 2024-02-14 01:27:07 +01:00
Reinier van der Leer 327fb1f916
fix(benchmark): Mock mode, python evals, `--attempts` flag, challenge definitions
- Fixed `--mock` mode
   - Moved interrupt to beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code is properly executed after executing a single step.
   - Implemented mock mode in `WebArenaChallenge`

- Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted

- Fixed handling of `python`/`pytest` evals in `BuiltinChallenge`

- Disabled left-over Helicone code (see 056163e)

- Fixed a couple of challenge definitions
   - WebArena task 107: fix spelling of months (Sepetember, Octorbor *lmao*)
   - synthesize/1_basic_content_gen (SynthesizeInfo): remove empty string from `should_contain` list

- Added some debug logging in agent_api_interface.py and challenges/builtin.py
2024-02-14 01:05:34 +01:00
Reinier van der Leer bb7f5abc6c
fix(agent/text_processing): Fix `extract_information` LLM response parsing
OpenAI's newest models return JSON with markdown fences around it, breaking the `json.loads` parser.

This commit adds an `extract_list_from_response` function to json_utils/utilities.py and uses this function to replace `json.loads` in `_process_text`.
2024-02-13 18:28:17 +01:00
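The fence-stripping approach described above could look roughly like the sketch below. It is not the actual implementation from json_utils/utilities.py: the regular expression, the fallback to the raw response, and the error handling are assumptions. Only the function name is taken from the commit message.

```python
import json
import re
from typing import Any

# Build the three-backtick fence programmatically to keep this example readable
FENCE = "`" * 3

# Optional "json"/"JSON" tag after the opening fence, then the payload
_FENCED_BLOCK = re.compile(
    FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_list_from_response(response: str) -> list[Any]:
    """Parse a JSON list from an LLM response, with or without markdown fences."""
    match = _FENCED_BLOCK.search(response)
    # Fall back to the whole response when no fenced block is found
    payload = match.group(1) if match else response.strip()
    result = json.loads(payload)
    if not isinstance(result, list):
        raise ValueError(f"Expected a JSON list, got {type(result).__name__}")
    return result
```

With this sketch, a fenced response and a bare `["a", "b"]` response both parse to the same Python list, whereas plain `json.loads` raises on the fenced one.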
Reinier van der Leer 393d6b97e6
feat(agent): Add Sentry integration for telemetry
* Add Sentry integration for telemetry
   - Add `sentry_sdk` dependency
   - Add setup logic and config flow using `TELEMETRY_OPT_IN` environment variable
      - Add app/telemetry.py with `setup_telemetry` helper routine
      - Call `setup_telemetry` in `cli()` in app/cli.py
      - Add `TELEMETRY_OPT_IN` to .env.template
      - Add helper function `env_file_exists` and routine `set_env_config_value` to app/utils.py
         - Add unit tests for `set_env_config_value` in test_utils.py
      - Add prompt to startup to ask whether the user wants to enable telemetry if the env variable isn't set

* Add `capture_exception` statements for LLM parsing errors and command failures
2024-02-13 18:10:52 +01:00
Reinier van der Leer 3b8d63dfb6
chore(agent): Update autogpt-forge and agbenchmark dependencies to propagate dependency updates
This also indirectly updates `python-multipart` and fixes "python-multipart vulnerable to Content-Type Header ReDoS" https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/57
2024-02-13 13:24:24 +01:00
Reinier van der Leer 6763196d78
chore(forge): Update agbenchmark dependency 2024-02-13 12:44:17 +01:00
Reinier van der Leer e1da58da02
chore(forge): Update aiohttp, fastapi, and python-multipart dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/56
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/52
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/49

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/45
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/42
2024-02-13 12:38:36 +01:00
Reinier van der Leer 91cec515d4
chore(benchmark): Update `python-multipart` dependency to mitigate vulnerability
- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55
2024-02-13 12:36:00 +01:00
Reinier van der Leer cc585a014f
chore(agent): Update aiohttp and fastapi dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/57
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/54
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/50

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/44
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/41
2024-02-13 12:30:12 +01:00
Reinier van der Leer e641cccb42
chore(benchmark): Update `aiohttp` and `fastapi` dependencies to mitigate vulnerabilities
Addressed vulnerabilities:

- python-multipart vulnerable to Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/55
   Dependants:
   - FastAPI Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/53
   - Starlette Content-Type Header ReDoS - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/48

- aiohttp is vulnerable to directory traversal - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/46
- aiohttp's HTTP parser (the python one, not llhttp) still overly lenient about separators - https://github.com/Significant-Gravitas/AutoGPT/security/dependabot/43
2024-02-13 12:21:52 +01:00
Mahdi Karami cc73d4104b
fix(forge): incorrect import 'sdk' in .actions.finish (#6822) 2024-02-13 11:02:03 +01:00
Reinier van der Leer 250552cb3d
fix(agent/tests): Update test_config.py:test_initial_values 2024-02-12 13:26:47 +01:00
Reinier van der Leer 1d653973e9
feat(agent/llm): Use new OpenAI models as default `SMART_LLM`, `FAST_LLM`, and `EMBEDDING_MODEL`
- Change default `SMART_LLM` from `gpt-4` to `gpt-4-turbo-preview`
- Change default `FAST_LLM` from `gpt-3.5-turbo-16k` to `gpt-3.5-turbo-0125`
- Change default `EMBEDDING_MODEL` from `text-embedding-ada-002` to `text-embedding-3-small`
- Update .env.template, azure.yaml.template, and documentation accordingly
2024-02-12 13:19:37 +01:00
Reinier van der Leer 7bf9ba5502
chore(agent/llm): Update OpenAI model info
- Add `text-embedding-3-small` and `text-embedding-3-large` as `EMBEDDING_v3_S` and `EMBEDDING_v3_L` respectively
- Add `gpt-3.5-turbo-0125` as `GPT3_v4`
- Add `gpt-4-1106-vision-preview` as `GPT4_v3_VISION`
- Add GPT-4V models to info map
- Change chat model info mapping to derive info for aliases (e.g. `gpt-3.5-turbo`) from specific versions instead of the other way around
2024-02-12 13:17:20 +01:00
Reinier van der Leer 14c9773890
ci(agent): Add `GIT_REVISION` label to Docker builds 2024-02-12 12:31:04 +01:00
Reinier van der Leer 39fddb1214
fix(agent): Fix application of `extra_request_headers` in `OpenAIProvider` 2024-02-12 12:21:30 +01:00
Reinier van der Leer fe0923ba6c
feat(agent/web): Add browser extensions to deal with cookie walls and ads (#6778)
* Add `_sideload_chrome_extensions` subroutine to `open_page_in_browser` in web_selenium.py
   * Sideloads uBlock Origin and I Still Don't Care About Cookies, downloading them if necessary
* Add 2-second delay to `open_page_in_browser` to allow time for handling cookie walls
2024-02-02 18:30:37 +01:00
Reinier van der Leer dfaeda7cd5
lint(agent/tests): Fix line length in test_utils.py 2024-02-02 18:29:28 +01:00
Reinier van der Leer 9b7fee673e
fix(agent/tests): Update `test_utils.py:test_extract_json_from_response*` in accordance with 956cdc7
Commit 956cdc7 "fix(agent/json_utils): Decode as JSON rather than Python objects" broke these unit tests because they generated "JSON" by stringifying a Python object.
2024-02-02 18:21:19 +01:00
Reinier van der Leer 925269d17b
lint(agent): Fix line length in docstring of `EpisodicActionHistory.handle_compression` 2024-02-02 17:43:42 +01:00
Fernando Navarro Páez 266fe3a3f7
fix(forge): Fix "no module named 'forge.sdk.abilities'" (#6571)
Fixes #6537
2024-02-01 11:23:35 +01:00
Reinier van der Leer 66e0c87894
feat(agent): Add history compression to increase longevity and efficiency
* Compress steps in the prompt to reduce token usage, and to increase longevity when using models with limited context windows
* Move multiple copies of step formatting code to `Episode.format` method
* Add `EpisodicActionHistory.handle_compression` method to handle compression of new steps
2024-01-31 17:51:45 +01:00
Reinier van der Leer 55433f468a
feat(agent/web): Improve `read_webpage` information extraction abilities
* Implement `extract_information` function in `autogpt.processing.text` module. This function extracts pieces of information from a body of text based on a list of topics of interest.
* Add `topics_of_interest` and `get_raw_content` parameters to `read_webpage` command
   * Limit maximum content length if `get_raw_content=true` is specified
2024-01-31 15:08:08 +01:00
Reinier van der Leer 956cdc77fa
fix(agent/json_utils): Decode as JSON rather than Python objects
* Replace `ast.literal_eval` with `json.loads` in `extract_dict_from_response`

This fixes a bug where boolean values could not be decoded because of their required capitalization in Python.
2024-01-31 14:15:02 +01:00
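The difference between the two parsers can be shown with a small standalone example. The sample string and the helper names are made up for illustration; `ast.literal_eval` and `json.loads` are the standard library functions named in the commit.

```python
import ast
import json

# Example LLM output: valid JSON, but not a valid Python literal,
# because JSON spells booleans and null in lowercase
LLM_OUTPUT = '{"command": "finish", "done": true, "result": null}'


def parse_with_literal_eval(text: str):
    """Illustration only: return None when the text is not a Python literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None


def parse_with_json(text: str):
    """Decode the text as JSON, mapping true/false/null to True/False/None."""
    return json.loads(text)


print(parse_with_literal_eval(LLM_OUTPUT))  # prints None
print(parse_with_json(LLM_OUTPUT))
# prints {'command': 'finish', 'done': True, 'result': None}
```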
Reinier van der Leer 83a0b03523
fix(agent/prompting): Fix representation of (optional) command parameters in prompt 2024-01-31 14:10:22 +01:00
Reinier van der Leer 25b9e290a5
fix(agent/json_utils): Make `extract_dict_from_response` more robust
* Accommodate both ```json and ```JSON blocks in responses
2024-01-29 15:03:09 +01:00
Reinier van der Leer ab860981d8
feat(agent/llm): Add support for `gpt-4-0125-preview`
* Add `gpt-4-0125-preview` model to OpenAI model list
* Add `gpt-4-turbo-preview` alias to OpenAI model list
2024-01-29 11:22:32 +01:00
Reinier van der Leer a0cae78ba3
feat(benchmark): Add `-N`, `--attempts` option for multiple attempts per challenge
LLMs are probabilistic systems. Reproducibility of completions is not guaranteed. It only makes sense to account for this by running challenges multiple times to obtain a success ratio rather than a boolean success/failure result.

Changes:
- Add `-N`, `--attempts` option to CLI and `attempts_per_challenge` parameter to `main.py:run_benchmark`.
- Add dynamic `i_attempt` fixture through `pytest_generate_tests` hook in conftest.py to achieve multiple runs per challenge.
- Modify `pytest_runtest_makereport` hook in conftest.py to handle multiple reporting calls per challenge.
- Refactor report_types.py, reports.py, process_report.py to allow multiple results per challenge.
   - Calculate `success_percentage` from results of the current run, rather than all known results ever.
   - Add docstrings to a number of models in report_types.py.
   - Allow `None` as a success value, e.g. for runs that did not render any results before being cut off.
- Make SingletonReportManager thread-safe.
2024-01-22 17:16:55 +01:00
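A dynamic `i_attempt` fixture of the kind described above can be sketched with pytest's `pytest_generate_tests` hook. This is not the project's conftest.py: how the `attempts` option reaches pytest and the fallback to a single attempt are assumptions.

```python
def pytest_generate_tests(metafunc):
    """Run every test that requests `i_attempt` once per configured attempt."""
    if "i_attempt" not in metafunc.fixturenames:
        return

    # Assumption: the option is registered under the name "attempts".
    # Fall back to one attempt when it is unset.
    n_attempts = metafunc.config.getoption("attempts", 1) or 1

    # Parametrizing over range(N) yields N test items per challenge,
    # each receiving its attempt index as the `i_attempt` argument
    metafunc.parametrize("i_attempt", range(n_attempts))
```

Because the hook parametrizes at collection time, each attempt shows up as a separate test item, which is why the reporting hook has to handle several results per challenge.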
Reinier van der Leer 488f40a20f
feat(benchmark): JungleGym WebArena (#6691)
* feat(benchmark): Add JungleGym WebArena challenges
   - Add `WebArenaChallenge`, `WebArenaChallengeSpec`, and other logic to make these challenges work
   - Add WebArena challenges to Pytest collection endpoint generate_test.py

* feat(benchmark/webarena): Add hand-picked selection of WebArena challenges
2024-01-19 20:34:04 +01:00
Reinier van der Leer 05b018a837
fix(benchmark/report): Fix and clean up logic in `update_challenges_already_beaten`
- `update_challenges_already_beaten` incorrectly marked a challenge as beaten if it was present in the report file but set to `false`
2024-01-19 19:52:09 +01:00
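The class of bug described here can be illustrated as follows. The report layout (a flat mapping from challenge name to a boolean) and both helper names are assumptions made for this sketch, not the project's actual data model.

```python
def beaten_by_presence(report: dict[str, bool]) -> set[str]:
    """Buggy variant: treats any challenge listed in the report as beaten."""
    return set(report)


def beaten_by_value(report: dict[str, bool]) -> set[str]:
    """Fixed variant: only counts challenges whose recorded value is True."""
    return {name for name, beaten in report.items() if beaten}


# Hypothetical report contents
report = {"TestWriteFile": True, "TestReadFile": False}
print(sorted(beaten_by_presence(report)))  # prints ['TestReadFile', 'TestWriteFile']
print(sorted(beaten_by_value(report)))     # prints ['TestWriteFile']
```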
Reinier van der Leer fc37ffdfcf
feat(agent/llm/openai): Include compatibility tool call extraction in LLM response parse-fix loop 2024-01-19 19:23:17 +01:00
Reinier van der Leer 8c65f3c748
fix(agent/serve): Fix task cost tracking persistence in `AgentProtocolServer`
- Pydantic shallow-copies models when they are passed into a parent model, meaning they can't be updated through the original reference. This commit adds a fix for the resulting cost persistence issue.
2024-01-19 19:20:05 +01:00
Reinier van der Leer 354106be7b
feat(agent/llm): Add cost tracking and logging to `AgentProtocolServer` 2024-01-19 17:31:59 +01:00
Reinier van der Leer 9e4dfd8058
fix(benchmark): Fix challenge input artifact upload 2024-01-19 17:29:53 +01:00
Reinier van der Leer faf5f9e5a4
fix(agent): Fix `extract_dict_from_response` flakiness
- The `extract_dict_from_response` function, which is supposed to reliably extract a JSON object from an LLM's response, positively discriminated objects defined on a single line, causing issues.
2024-01-19 15:49:32 +01:00
Reinier van der Leer e4687e0f03
fix(agent): Fix "ChatModelResponse not subscriptable" errors in `summarize_text` and `QueryLanguageModel` ability
- `summarize_text` and `QueryLanguageModel.__call__` still tried to access `response["content"]`, which isn't possible since upgrading to the OpenAI v1 client library.
2024-01-19 15:45:31 +01:00
Reinier van der Leer c5b17851e0
fix(agent): Handle artifact modification properly
- When an Artifact's file is modified by the agent, set its `agent_created` attribute to `True` instead of registering a new Artifact
- Update the `autogpt-forge` dependency to the newest version, in which `AgentDB.update_artifact` has been implemented
2024-01-19 12:08:59 +01:00
Reinier van der Leer b238abac52
feat(forge/db): Add `AgentDB.update_artifact` method 2024-01-19 11:41:40 +01:00
Reinier van der Leer 9012ff4db2
refactor(benchmark): Interface & type consolidation, and arch change, to allow adding challenge providers
Squashed commit of the following:

commit 7d6476d329
Author: Reinier van der Leer <pwuts@agpt.co>
Date:   Tue Jan 9 18:10:45 2024 +0100

    refactor(benchmark/challenge): Set up structure to support more challenge providers

    - Move `Challenge`, `ChallengeData`, `load_challenges` to `challenges/builtin.py` and rename to `BuiltinChallenge`, `BuiltinChallengeSpec`, `load_builtin_challenges`
    - Create `BaseChallenge` to serve as interface and base class for different challenge implementations
    - Create `ChallengeInfo` model to serve as universal challenge info object
    - Create `get_challenge_from_source_uri` function in `challenges/__init__.py`
    - Replace `ChallengeData` by `ChallengeInfo` everywhere except in `BuiltinChallenge`
    - Add strong typing to `task_informations` store in app.py
    - Use `call.duration` in `finalize_test_report` and remove `timer` fixture
    - Update docstring on `challenges/__init__.py:get_unique_categories`
    - Add docstring to `generate_test.py`

commit 5df2aa7939
Author: Reinier van der Leer <pwuts@agpt.co>
Date:   Tue Jan 9 16:58:01 2024 +0100

    refactor(benchmark): Refactor & rename functions in agent_interface.py and agent_api_interface.py

    - `copy_artifacts_into_temp_folder` -> `copy_challenge_artifacts_into_workspace`
    - `copy_agent_artifacts_into_folder` -> `download_agent_artifacts_into_folder`
    - Reorder parameters of `run_api_agent`, `copy_challenge_artifacts_into_workspace`; use `Path` instead of `str`

commit 6a256fef4c
Author: Reinier van der Leer <pwuts@agpt.co>
Date:   Tue Jan 9 16:02:25 2024 +0100

    refactor(benchmark): Refactor & typefix report generation and handling logic

    - Rename functions in reports.py and ReportManager.py to better reflect what they do
       - `get_previous_test_results` -> `get_and_update_success_history`
       - `generate_single_call_report` -> `initialize_test_report`
       - `finalize_reports` -> `finalize_test_report`
       - `ReportManager.end_info_report` -> `SessionReportManager.finalize_session_report`
    - Modify `pytest_runtest_makereport` hook in conftest.py to finalize the report immediately after the challenge finishes running instead of after teardown
       - Move result processing logic from `initialize_test_report` to `finalize_test_report` in reports.py
    - Use `Test` and `Report` types from report_types.py where possible instead of untyped dicts: reports.py, utils.py, ReportManager.py
    - Differentiate `ReportManager` into `SessionReportManager`, `RegressionTestsTracker`, `SuccessRateTracker`
    - Move filtering of optional challenge categories from challenge.py (`Challenge.skip_optional_categories`) to conftest.py (`pytest_collection_modifyitems`)
    - Remove unused `scores` fixture in conftest.py

commit 370d6dbf5d
Author: Reinier van der Leer <pwuts@agpt.co>
Date:   Tue Jan 9 15:16:43 2024 +0100

    refactor(benchmark): Simplify models in report_types.py

    - Removed ForbidOptionalMeta and BaseModelBenchmark classes.
    - Changed model attributes to optional: `Metrics.difficulty`, `Metrics.success`, `Metrics.success_percentage`, `Metrics.run_time`, and `Test.reached_cutoff`.
    - Added validator to `Metrics` model to require `success` and `run_time` fields if `attempted=True`.
    - Added default values to all optional model fields.
    - Removed duplicate imports.
    - Added condition in process_report.py to prevent null lookups if `metrics.difficulty` is not set.
2024-01-18 15:19:06 +01:00
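The conditional requirement on the `Metrics` model from commit 370d6dbf5d (`success` and `run_time` required if `attempted=True`) can be sketched with a stdlib dataclass. The project uses its own model classes, so the field types and the use of `__post_init__` instead of a model validator are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metrics:
    # Field names follow the commit message; the types are assumed
    attempted: bool
    difficulty: Optional[str] = None
    success: Optional[bool] = None
    success_percentage: Optional[float] = None
    run_time: Optional[str] = None

    def __post_init__(self) -> None:
        # Optional fields default to None, but an attempted run
        # must report both its outcome and its run time
        if self.attempted:
            missing = [
                name for name in ("success", "run_time")
                if getattr(self, name) is None
            ]
            if missing:
                raise ValueError(
                    f"{', '.join(missing)} required when attempted=True"
                )
```

With this sketch, `Metrics(attempted=False)` is valid, while `Metrics(attempted=True)` raises a `ValueError` naming both missing fields.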
Reinier van der Leer f2595af362
refactor(agent/openai): Upgrade OpenAI library to v1
- Update `openai` dependency from ^v0.27.10 to ^v1.7.2
- Update poetry.lock
- Update code for changed endpoints and new output types of OpenAI library
- Replace uses of `AssistantChatMessageDict` by `AssistantChatMessage`
   - Update `PromptStrategy`, `BaseAgent`, and all of their subclasses accordingly
- Update `OpenAIProvider`, `OpenAICredentials`, azure.yaml.template, .env.template and test_config.py to work with new separate `AzureOpenAI` client
- Remove `_OpenAIRetryHandler` and implement retry mechanism with `tenacity`
- Rewrite pytest fixture `cached_openai_client` (renamed from `patched_api_requestor`) for OpenAI v1 library
2024-01-17 20:11:13 +01:00
Reinier van der Leer 39fd1d6be1
lint(forge): `black .` and `isort .` 2024-01-16 16:30:37 +01:00
Reinier van der Leer f0ede64ded
chore(forge): Upgrade OpenAI client lib and LiteLLM from v0 to v1
* Update `openai` dependency from `^0.27.8` to `^1.7.2`
* Update `litellm` dependency from `^0.1.821` to `^1.17.9`
* Migrate llm.py from OpenAI module-level client to client instance
* Update return types in llm.py for new OpenAI and LiteLLM versions
   * Also remove `Exception` as a return type because they are raised, not returned
   * Update tutorials/003_crafting_agent_logic.md accordingly

Note: this changes the output types of the functions in `forge.llm`: `chat_completion_request`, `create_embedding_request`, `transcribe_audio`
2024-01-16 16:14:52 +01:00