20251003 - Notes by Lenz

# From Little Acorns Grow: Contributions to Inspect Evals

<div class="pills-container">
  <span class="pill">Published: October 3, 2025</span>
  <span class="pill">Reading Time: 7 minutes</span>
</div>

_I co-authored this report as part of my capstone in [ARENA 6.0](https://www.arena.education/) alongside [Hugo Save](https://www.linkedin.com/in/hugo-save/), who found most of the bugs (I only found 2 bugs, 1 for each CTF suite). These are early audit notes. We recently audited the Inspect Evals (IE) implementation of Google DeepMind’s Capture the Flag (GDM-CTF) evaluations and found minor issues and inconsistencies. Fixing these issues did not noticeably change performance for the one model that we tested. However, the fixes might matter for other models and help ensure that the benchmark measures the intended property. This was presented at ARENA 6.0’s Demo Day. [Here is the link to the slides.](https://docs.google.com/presentation/d/1G1yiB8zzvIRBi0opY4cYZ7vUlyoqT2veP6BD0TcPHPk/edit?usp=sharing)_

---

## Executive summary

#### *What we found*

- [Total bugs](https://docs.google.com/spreadsheets/d/1TanhP5iOoz-ofRDKg14gI1zD9b-gCp8jjyoaCQcsnAk/edit?usp=sharing) identified: 6 across the GDM In-House and InterCode CTF suites
  - 2 environment bugs where bash commands produce unexpected output; these are non-critical but clutter transcripts and could confuse models
  - 4 reproduction bugs where the IE configuration does not match the GDM configuration

#### *What we changed*

- 2 issues opened ([issue 1](https://github.com/UKGovernmentBEIS/inspect_evals/issues/578), [issue 2](https://github.com/UKGovernmentBEIS/inspect_evals/issues/576)) regarding the environment bugs
- [1 pull request](https://github.com/UKGovernmentBEIS/inspect_evals/pull/588) opened on one of the environment bugs and one of the reproduction bugs for the GDM In-House suite
- [1 pull request](https://github.com/UKGovernmentBEIS/inspect_evals/pull/580) opened regarding one of the reproduction bugs in the GDM InterCode suite

#### *Concrete impact*

- Removed a guaranteed false negative by excluding an unsolvable task (InterCode Task 7), which needs internet access to be solved
- Reduced transcript noise by configuring container hostnames
- Identified a critical tool-interface issue that, if fixed, will prevent exceptions when agents `cat` binaries
- Updated the prompt to tell the agent explicitly where the helper wordlists are located

#### *Approximate time committed (~60 work hours total across 2 people)*

- Manual transcript analysis and code review (~16 hours)
  - Found 1 environment bug and 4 reproduction bugs
  - Including manual checks between the Inspect source code, the original repository’s source code, and the setup description in the original paper
- Semi-automated transcript analysis using [Docent](https://transluce.org/introducing-docent) (~8 hours)
  - Found 1 environment bug (but arguably the most important one)
- Analysis of suspicious behaviour (~20 hours)
  - Reproducing environments to confirm whether behaviour is intended
  - Detailed reading of the reference paper and implementations
- Implementing fixes, discussing solutions, and opening issues and pull requests (~10 hours)
- Meetings (~6 hours)

#### *Takeaways / future work*

- Assess how important a suspected bug would be to fix before spending time confirming that it is a bug.
  - We spent a lot of time checking arguably minor fixes
- Use Docent more to go through multiple transcripts quickly and qualitatively
- Next time, try using the Inspect Scanner

## Why GDM-CTF matters

The GDM-CTF benchmark is a suite of “capture the flag” challenges introduced by Google DeepMind as part of its dangerous capabilities evaluations of frontier models. These tasks cover common cybersecurity challenges, such as exploiting web app vulnerabilities and cracking passwords, where the model must find a hidden flag string by interacting with a live sandboxed system. In the setup, the model is allowed to use tools to carry out the steps of capturing the flag.

Since its introduction, the GDM-CTF suite has been widely referenced and reused. The original DeepMind evals used these challenges to test LLMs’ potential for automated cyberattacks given a directed prompt. [Subsequent frontier models like Gemini 2.5 have also been evaluated on these tasks.](https://arxiv.org/abs/2507.06261) In early 2024, the In-House challenges were [forecasted to be solved between early 2025 and mid-2026 (50% confidence interval)](https://static1.squarespace.com/static/65c9ec6f71740341d55cac12/t/662661582d34534ffc3cafd3/1713791323516/20240111+-+GDM+Public+Report+Version+2.pdf).

<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="20251003-01.png" style="width: 600px;" />
  <figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 1.</strong> Forecast of when Google DeepMind's In-House capture-the-flag challenges will be solved, taken from <a href="https://static1.squarespace.com/static/65c9ec6f71740341d55cac12/t/662661582d34534ffc3cafd3/1713791323516/20240111+-+GDM+Public+Report+Version+2.pdf">their report</a>.</figcaption>
</div>

The UK AISI ported GDM-CTF into its open-source Inspect Evals framework. This includes the 13 in-house challenges from GDM’s paper and the [InterCode CTF dataset of 100 challenges by Yang et al. (2023)](https://arxiv.org/abs/2306.14898), which GDM also used for its CTF suite. Given that, the GDM-CTF suite has likely informed government AI assessments of model progress in cybersecurity. It’s also worth noting that the GDM InterCode CTF eval is the most used evaluation in the Inspect framework. These CTF tasks are also being used in high-stakes eval settings, and any flaws in task implementation could mislead policymakers about a model’s true capabilities or risks.

Given its impact, we wanted to double-check that our implementation of GDM-CTF in Inspect is faithful and robust.
If the tasks have hidden bugs, or implementations that deviate from the original paper, the scores we get might be unfairly low or high. Recently released tools like Docent have also shown that benchmark failures are often due to broken setups or mislabeled tasks rather than the model’s inability to solve the challenge. Our goal was to catch and fix these issues to ensure that the eval is replicable.

## Our auditing approach

We conducted a mix of manual code review, log inspection, and automated analysis on the GDM-CTF evals for both the in-house and InterCode sets.

We read the original GDM paper and the InterCode-CTF paper to understand how the tasks should behave. We noted any assumptions and cross-checked whether our implementation followed them. This surfaced the following inconsistencies across the In-House and InterCode CTF suites:

- The original paper says that short user and password wordlist files are provided for password brute-forcing and cracking, but this does not hold in the Inspect implementation.
- In the original implementation, the agent is given 10 attempts of 20 shell interactions for each challenge, whereas the limit in the Inspect implementation was estimated at 50 messages.
- *For InterCode specifically:* 19 challenges were reported as omitted because they required internet access: 9, 16, 20, 28, 29, 35, 39, 41, 42, 43, 54, 57, 62, 66, 73, 87, 88, 89, 95. However, challenges 1 and 14 were also omitted, and challenge 7 was not omitted even though it needs internet access to complete (making it virtually unsolvable).

We also went through the Inspect code for these challenges, which involved checking Docker configurations, prompts, and scoring logic. We looked for inconsistencies like mismatched task counts, missing dependencies, or anything that obviously diverges from the paper (given the list above).
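
The InterCode discrepancy in the list above can be written as a simple set comparison. The sketch below is ours and is not taken from the Inspect or GDM code: the three sets are transcribed from the challenge numbers in this post, and `reconcile` is a hypothetical helper name.

```python
# Challenge IDs the paper reports as omitted because they need internet access
# (transcribed from the list in this post).
PAPER_OMITTED = {9, 16, 20, 28, 29, 35, 39, 41, 42, 43,
                 54, 57, 62, 66, 73, 87, 88, 89, 95}

# What we observed in the Inspect implementation: the paper's list plus
# challenges 1 and 14 (challenge 7 was still included).
OBSERVED_OMITTED = PAPER_OMITTED | {1, 14}

# Challenges we believe need internet access: the paper's list plus
# challenge 7, which we checked by hand.
NEEDS_INTERNET = PAPER_OMITTED | {7}


def reconcile(paper, observed, needs_internet):
    """Compare the documented exclusion list with the observed one."""
    return {
        # Excluded in the implementation but not listed in the paper.
        "extra_exclusions": sorted(observed - paper),
        # Needs internet but is still part of the eval.
        "missing_exclusions": sorted(needs_internet - observed),
    }


if __name__ == "__main__":
    print(reconcile(PAPER_OMITTED, OBSERVED_OMITTED, NEEDS_INTERNET))
```

Running a comparison like this whenever the exclusion list changes makes it obvious when the implementation and the paper drift apart.
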
<div style="display: flex; flex-direction: column; align-items: center;">
  <img src="20251003-02.png" style="width: 600px;" />
  <figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 2.</strong> Our methodology for bug-finding in Inspect evals.</figcaption>
</div>

We ran a full sweep of the In-House and InterCode CTF challenges using GPT-4o and examined the transcripts. Initially, we manually checked the step-by-step logs of the model’s tool use for each sample. Eventually, we switched to Docent, an AI tool designed to analyze agent transcripts for anomalies. It quickly highlighted a few patterns in the transcripts that warranted attention, which we detail below.

## Our findings

Our audit did not uncover any major bugs. The eval is largely doing what it’s supposed to. However, we did find a handful of minor issues and inconsistencies that could affect eval outcomes. We ran everything using the GPT-4o model.

#### *Environment bugs*

These are technical issues in the challenge environment.

One bug we found was an environment configuration issue in the `privesc_bash` challenge (a Linux privilege escalation task). Every time the model used `sudo` inside the challenge’s Docker container, a warning message appeared:

```bash
sudo: unable to resolve host [container-id]: Temporary failure in name resolution
```

This warning was not part of the intended challenge output. It was caused by the container’s hostname not being listed in `/etc/hosts`: `sudo` tried to resolve the container’s own hostname, the lookup failed, and it printed a warning. This is only a warning and doesn’t prevent the model from using `sudo`, but it does clutter the transcripts with a confusing line that could throw off the model in later steps. We verified that adding a one-line config to Docker Compose (setting the container’s hostname to `localhost`) suppresses the warning.
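
As a rough illustration of the condition behind the warning, the Python sketch below checks whether a hostname appears in the text of a hosts file. It is a simplification under our own assumptions: `hostname_in_hosts` is a hypothetical helper, the container ID is made up, and real name resolution also consults DNS and other sources that this ignores.

```python
def hostname_in_hosts(hosts_text: str, hostname: str) -> bool:
    """Return True if `hostname` is listed as a name on any hosts-file line."""
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # fields[0] is the address; the remaining fields are names and aliases.
        if hostname in fields[1:]:
            return True
    return False


# Example hosts file that only knows about localhost (illustrative content).
EXAMPLE_HOSTS = """\
127.0.0.1   localhost
::1         localhost ip6-localhost  # IPv6 loopback
"""

if __name__ == "__main__":
    # A made-up container ID used as hostname: not listed.
    print(hostname_in_hosts(EXAMPLE_HOSTS, "3f2a9c1b7d4e"))
    # After setting the hostname to localhost: listed.
    print(hostname_in_hosts(EXAMPLE_HOSTS, "localhost"))
```
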
We implemented this fix in a pull request and `sudo -l` now produces exactly the expected output with no prepended warning.

Another bug we found was a bit more subtle. We discovered that the Inspect `bash` tool interface cannot properly handle binary data. In these evals, the model often uses the `bash()` tool to read files or execute programs. However, if the model attempts to output binary content using `cat`, the Inspect framework tries to decode the shell output as UTF-8 text. Binary bytes like `0xff` are not valid UTF-8, which causes a decoding error in the tool. In our tests, whenever a model tried to `cat` a binary, the eval would throw an exception.

#### *Benchmark discrepancies*

These are differences we found between the reference implementation and the Inspect implementation.

In the In-House CTF, the GDM paper states that short user and password wordlists are provided to speed up the brute-forcing steps and that the agents were prompted to use those specific files. But the reference controller prompt in Appendix D.3 of the original paper does not actually mention the path of these helper files. Inspect’s implementation reused the same system prompt and therefore did not tell the agent where the wordlists are.

In the InterCode CTF, the 100 challenges include some that require internet access (e.g., downloading files from a URL). The original paper reports removing 19 of the original 100 challenges. When auditing, we discovered that the Inspect implementation additionally excluded challenges 1 and 14, while challenge 7 was still included. Challenge 7 requires downloading an executable from a link in a text file, a step that needs internet access.

## Anecdotes from bug-finding

This exercise taught us a few lessons about auditing agentic evals.

- Documentation saves a lot of time. A lot of our uncertainties were resolved by digging into the original papers and any available resources.
- Using Docent to triage transcripts was a game-changer for our workflow. Issues that may have been a needle in a haystack were surfaced through automated analysis.
- Automation aside, we still benefited from manually looking through the transcripts and understanding what’s going on. Automation helped pinpoint which transcripts to focus on, but manual analysis helped enforce rigor.
- Sometimes success might not mean solving a task. Models can sometimes guess the location of a flag rather than going through the intended exploit path.
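
As a footnote to the binary-output bug described under Environment bugs, here is a minimal Python sketch of the failure mode. It is not Inspect’s actual code and not necessarily the fix that will be adopted: `decode_tool_output` is a hypothetical helper, and the sample bytes are made up. It only shows why strict UTF-8 decoding raises on bytes like `0xff`, and one possible lenient fallback.

```python
def decode_tool_output(raw: bytes) -> str:
    """Decode shell output as UTF-8, falling back to lossy decoding.

    Invalid bytes are replaced with U+FFFD instead of raising, so a
    transcript can still be produced when a command prints binary data.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace")


# Made-up sample: an ELF-like header followed by bytes that are invalid UTF-8.
SAMPLE_BINARY = b"\x7fELF\xff\xfe"

if __name__ == "__main__":
    try:
        SAMPLE_BINARY.decode("utf-8")  # strict decoding, as in the bug
    except UnicodeDecodeError as exc:
        print(f"strict decode failed: {exc.reason}")
    print(repr(decode_tool_output(SAMPLE_BINARY)))
```

Replacing invalid bytes loses information, so a real fix might instead report that the output was binary or return it in an encoded form.
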