- Add `challenge list` command with options `--all`, `--names`, `--json`
- Add `tabulate` dependency
- Add `.utils.utils.sorted_by_enum_index` function to easily sort lists by an enum value/property, following the order in which the enum's members are defined (see the sketch below)
- Add `challenge info [name]` command with option `--json`
- Add `.utils.utils.pretty_print_model` routine to pretty-print Pydantic models
- Refactor `config` subcommand to use `pretty_print_model`
- Reduce duplicate and nested statements
- Add `skip_unavailable` parameter
Related changes:
- Add `available` and `unavailable_reason` attributes to `ChallengeInfo` and `WebArenaChallengeSpec`
- Add `pytest.skip` statement to `WebArenaChallenge.test_method` to make sure unavailable challenges are not run
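For reference, a minimal sketch of how a `sorted_by_enum_index` helper could be implemented; the actual signature in `.utils.utils` may differ, and the default `key` is an assumption:

```python
from enum import Enum
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def sorted_by_enum_index(
    items: Iterable[T],
    enum: type[Enum],
    key: Callable[[T], Enum] = lambda item: item,  # assumes items are enum members by default
    reverse: bool = False,
) -> list[T]:
    """Sort `items` by the position of their enum value in `enum`'s definition order."""
    order = list(enum)  # iterating an Enum yields its members in definition order
    return sorted(items, key=lambda item: order.index(key(item)), reverse=reverse)
```

For example, `sorted_by_enum_index(challenges, Difficulty, key=lambda c: c.difficulty)` (with a hypothetical `Difficulty` enum and `difficulty` attribute) orders challenges by how the difficulty levels are declared, not alphabetically.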
- Make `AgentBenchmarkConfig.reports_folder` directly configurable (through `REPORTS_FOLDER` env variable). The default is still `./agbenchmark_config/reports`.
- Change all mentions of `REPORT_LOCATION` (which previously served the same purpose) to `REPORTS_FOLDER`.
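As an illustration only, assuming a Pydantic v1-style `BaseSettings` class (the real config class may be structured differently), the configurable field could look like:

```python
from pathlib import Path

from pydantic import BaseSettings, Field

class AgentBenchmarkConfig(BaseSettings):
    # Overridable via the REPORTS_FOLDER env variable; the default matches the old behaviour
    reports_folder: Path = Field(Path("./agbenchmark_config/reports"), env="REPORTS_FOLDER")
```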
- Added a helper function `.app.utils.vcs_state_diverges_from_master()`. This function determines whether the relevant part of the codebase diverges from our `master`.
- Updated `.app.telemetry._setup_sentry()` to determine the default environment name using `vcs_state_diverges_from_master`.
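A rough sketch of how such a check can be done with plain `git`; the real helper in `.app.utils` may use a different mechanism and also consider staged or untracked files:

```python
import subprocess

def vcs_state_diverges_from_master(path: str = ".") -> bool:
    """Return True if files under `path` differ from the upstream master branch."""
    try:
        diff = subprocess.run(
            ["git", "diff", "--name-only", "origin/master", "--", path],
            capture_output=True, text=True, check=True,
        )
        return bool(diff.stdout.strip())  # any changed file counts as divergence
    except (subprocess.CalledProcessError, FileNotFoundError):
        return True  # if git or the ref is unavailable, err on the side of "diverges"
```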
- Added a helper function `wait_until_conn_ready(port)` to wait for the benchmark and agent applications to finish starting (see the sketch below)
- Improved the CLI's own logging (within the `agent start` command)
- Fixed `--mock` mode
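A minimal sketch of the `wait_until_conn_ready(port)` helper mentioned above, assuming it simply polls the port on localhost; the timeout value is an assumption:

```python
import socket
import time

def wait_until_conn_ready(port: int, timeout: float = 30.0) -> None:
    """Block until something accepts a TCP connection on localhost:<port>."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection(("localhost", port), timeout=1):
                return  # the application is up and accepting connections
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Port {port} did not become ready within {timeout}s")
            time.sleep(0.5)
```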
- Moved interrupt to beginning of the step iterator pipeline (from `BuiltinChallenge` to `agent_api_interface.py:run_api_agent`). This ensures that any finish-up code is properly executed after executing a single step.
- Implemented mock mode in `WebArenaChallenge`
- Fixed `fixture 'i_attempt' not found` error when `--attempts`/`-N` is omitted
- Fixed handling of `python`/`pytest` evals in `BuiltinChallenge`
- Disabled left-over Helicone code (see 056163e)
- Fixed a couple of challenge definitions
- WebArena task 107: fix spelling of months (Sepetember, Octorbor *lmao*)
- synthesize/1_basic_content_gen (SynthesizeInfo): remove empty string from `should_contain` list
- Added some debug logging in agent_api_interface.py and challenges/builtin.py
OpenAI's newest models return JSON with markdown fences around it, breaking the `json.loads` parser.
This commit adds an `extract_list_from_response` function to json_utils/utilities.py and uses this function to replace `json.loads` in `_process_text`.
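The gist of the fix, as a hedged sketch (the actual implementation in json_utils/utilities.py may handle fences differently):

```python
import json
import re

def extract_list_from_response(response_content: str) -> list:
    """Parse a JSON list from an LLM response, tolerating markdown code fences."""
    # If the model wrapped its output in ```json ... ``` fences, keep only the inside
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", response_content, re.DOTALL)
    if match:
        response_content = match.group(1)
    result = json.loads(response_content)
    if not isinstance(result, list):
        raise ValueError(f"Response does not contain a JSON list: {result!r}")
    return result
```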
* Add Sentry integration for telemetry
- Add `sentry_sdk` dependency
- Add setup logic and config flow using `TELEMETRY_OPT_IN` environment variable
- Add app/telemetry.py with `setup_telemetry` helper routine
- Call `setup_telemetry` in `cli()` in app/cli.py
- Add `TELEMETRY_OPT_IN` to .env.template
- Add helper function `env_file_exists` and routine `set_env_config_value` to app/utils.py (see the sketch below)
- Add unit tests for `set_env_config_value` in test_utils.py
- Add a startup prompt asking whether the user wants to enable telemetry if the env variable isn't set
* Add `capture_exception` statements for LLM parsing errors and command failures
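A simplified sketch of what `env_file_exists` and `set_env_config_value` could look like; the `.env` path and the handling of commented-out keys are assumptions:

```python
from pathlib import Path

ENV_FILE = Path(".env")  # assumed location of the env file

def env_file_exists() -> bool:
    return ENV_FILE.is_file()

def set_env_config_value(key: str, value: str) -> None:
    """Set `key=value` in .env, replacing an existing (possibly commented) line or appending."""
    lines = ENV_FILE.read_text().splitlines() if env_file_exists() else []
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.lstrip("# ").startswith(f"{key}="):
            lines[i] = new_line  # overwrite the existing entry for this key
            break
    else:
        lines.append(new_line)  # key not present yet: append it
    ENV_FILE.write_text("\n".join(lines) + "\n")
```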
- Change default `SMART_LLM` from `gpt-4` to `gpt-4-turbo-preview`
- Change default `FAST_LLM` from `gpt-3.5-turbo-16k` to `gpt-3.5-turbo-0125`
- Change default `EMBEDDING_MODEL` from `text-embedding-ada-002` to `text-embedding-3-small`
- Update .env.template, azure.yaml.template, and documentation accordingly
- Add `text-embedding-3-small` and `text-embedding-3-large` as `EMBEDDING_v3_S` and `EMBEDDING_v3_L` respectively
- Add `gpt-3.5-turbo-0125` as `GPT3_v4`
- Add `gpt-4-1106-vision-preview` as `GPT4_v3_VISION`
- Add GPT-4V models to info map
- Change chat model info mapping to derive info for aliases (e.g. `gpt-3.5-turbo`) from specific versions instead of the other way around
* Add `_sideload_chrome_extensions` subroutine to `open_page_in_browser` in web_selenium.py
* Sideloads uBlock Origin and "I Still Don't Care About Cookies", downloading them if necessary
* Add 2-second delay to `open_page_in_browser` to allow time for handling cookie walls
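Roughly, the sideloading works like the sketch below; the download URLs are placeholders and the cache location is an assumption, not the real routine's behaviour:

```python
from pathlib import Path

import requests
from selenium.webdriver.chrome.options import Options as ChromeOptions

# Placeholder URLs; the actual routine resolves real .crx download locations.
EXTENSION_URLS = {
    "ublock_origin.crx": "https://example.com/ublock-origin.crx",
    "istilldontcareaboutcookies.crx": "https://example.com/isdcac.crx",
}

def _sideload_chrome_extensions(options: ChromeOptions, cache_dir: Path) -> None:
    """Download the extensions once, cache them on disk, and register them with Chrome."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for filename, url in EXTENSION_URLS.items():
        crx_path = cache_dir / filename
        if not crx_path.exists():  # download only if not cached yet
            crx_path.write_bytes(requests.get(url, timeout=30).content)
        options.add_extension(str(crx_path))  # Selenium loads the .crx at browser startup
```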
Commit 956cdc7 "fix(agent/json_utils): Decode as JSON rather than Python objects" broke these unit tests because they generated "JSON" by stringifying a Python object.
* Compress steps in the prompt to reduce token usage, and to increase longevity when using models with limited context windows
* Move multiple copies of step formatting code to `Episode.format` method
* Add `EpisodicActionHistory.handle_compression` method to handle compression of new steps
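Schematically, the compression could work like the sketch below; the `Episode` fields, the `keep_last_n` cut-off, and the injected `summarize` callable are illustrative assumptions rather than the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Episode:  # minimal stand-in for the real Episode model
    action: str
    result: str
    summary: Optional[str] = None

    def format(self) -> str:
        """Render the episode for the prompt, preferring the compressed summary."""
        return f"{self.action} -> {self.summary or self.result}"

def handle_compression(
    episodes: list[Episode],
    summarize: Callable[[str], str],
    keep_last_n: int = 3,
) -> None:
    """Replace raw results of older episodes with short summaries to save tokens."""
    older = episodes[:-keep_last_n] if keep_last_n else episodes
    for episode in older:
        if episode.summary is None and episode.result:
            episode.summary = summarize(episode.result)
```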
* Implement `extract_information` function in `autogpt.processing.text` module. This function extracts pieces of information from a body of text based on a list of topics of interest.
* Add `topics_of_interest` and `get_raw_content` parameters to the `read_webpage` command
* Limit maximum content length if `get_raw_content=true` is specified
* Replace `ast.literal_eval` with `json.loads` in `extract_dict_from_response`
This fixes a bug where boolean values could not be decoded: `ast.literal_eval` expects Python's capitalized `True`/`False`, while the model's JSON responses use lowercase `true`/`false`.
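A quick illustration of why the swap matters:

```python
import ast
import json

response = '{"success": true, "data": null}'

json.loads(response)        # -> {'success': True, 'data': None}
ast.literal_eval(response)  # raises ValueError: 'true' and 'null' are not Python literals
```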
LLMs are probabilistic systems, so reproducibility of completions is not guaranteed. It makes sense to account for this by running each challenge multiple times and recording a success ratio rather than a binary success/failure result.
Changes:
- Add `-N`, `--attempts` option to CLI and `attempts_per_challenge` parameter to `main.py:run_benchmark`.
- Add dynamic `i_attempt` fixture through the `pytest_generate_tests` hook in conftest.py to achieve multiple runs per challenge (see the sketch below).
- Modify `pytest_runtest_makereport` hook in conftest.py to handle multiple reporting calls per challenge.
- Refactor report_types.py, reports.py, and process_report.py to allow multiple results per challenge.
- Calculate `success_percentage` from the results of the current run, rather than from all results ever recorded.
- Add docstrings to a number of models in report_types.py.
- Allow `None` as a success value, e.g. for runs that did not render any results before being cut off.
- Make `SingletonReportManager` thread-safe.
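A hedged sketch of the `i_attempt` parametrization in conftest.py; the real benchmark wires `--attempts` through its own CLI rather than `pytest_addoption`, so the option handling below is only illustrative:

```python
# conftest.py (sketch)
def pytest_addoption(parser):
    parser.addoption(
        "-N", "--attempts", action="store", type=int, default=1,
        help="Number of times to attempt each challenge",
    )

def pytest_generate_tests(metafunc):
    # Parametrizing here also guarantees the fixture exists (as 0) when -N is omitted,
    # which is what fixes the "fixture 'i_attempt' not found" error.
    if "i_attempt" in metafunc.fixturenames:
        attempts = metafunc.config.getoption("--attempts")
        metafunc.parametrize("i_attempt", range(attempts))
```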
* feat(benchmark): Add JungleGym WebArena challenges
- Add `WebArenaChallenge`, `WebArenaChallengeSpec`, and other logic to make these challenges work
- Add WebArena challenges to Pytest collection endpoint generate_test.py
* feat(benchmark/webarena): Add hand-picked selection of WebArena challenges