Working Notes - 27/11/25
I’ve been playing around with image generators this week - see my post on Tuesday. Other than that, the end-of-year model flurry has continued!
Claude Opus 4.5
Opus is Anthropic’s bigger, better model, and they released a new version this week. I’ve played with it and it does seem good, but then Sonnet 4.5 was good too, and I don’t think I’ve tested it in a way that stretches it yet. On paper it is reported to be better at computer use (together with improved protection against browser prompt injection), agents, and coding. I think this points to Anthropic continuing to double down on the enterprise market - all of those things are what businesses are likely to use the models for, and they will support Anthropic’s lead in business API use. One interesting titbit is that Opus now supports thinking block preservation; that is to say, the model keeps each thinking block in its context history across multi-turn journeys, as opposed to just the final answers it passes to the user. To be honest, I assumed it did that already, and it’s interesting to find out that their prior models don’t.
Less confusing model names didn’t last
GPT-5 was apparently going to be the start of clearer model names from OpenAI - it seems someone didn’t get that memo, so now we have GPT-5.1-Codex-Max. Snappy. Like Opus 4.5, this model is aimed at enterprise-type stuff - coding, agents, and computer use. It sets a new high on METR’s time horizon benchmark (how long a task the model can complete autonomously with a 50% success rate) at approx. 2 hours 40 minutes - they’ve not assessed Opus 4.5 (or Gemini 3 Pro) yet, but I’d be surprised if it’s radically different. This is all very much on METR’s expected progress line. The other thing I noticed was that OpenAI have acknowledged that Windows exists when it comes to training and sandboxes - again pointing to enterprise as their target market. I think people’s somewhat muted reactions don’t mean it’s a total nothing burger; it’s just that unless you’re doing the specific thing the model is better at, you’re just going to think the model is still ‘good’.
Poison running through my veins
Anthropic have put out some good papers in the past week or so. Firstly, a super non-intuitive result around poisoning LLMs - this involves feeding a payload into the training data and using that to trigger undesired behaviour in deployment, with the classic example being the Anthropic sleeper agents paper. Up to now everyone assumed that the bigger the training dataset, the more examples of the ‘poison’ are required to get the (un)desired behaviour. Turns out that’s incorrect: the number of poison examples required stays the same even when the training dataset grows (in the paper, 250 examples - which is not many at all, esp. if you need to hunt them down in a potentially enormous training dataset). The implications of this are numerous and basically all bad - there are already credible examples of Russia deploying numerous websites to ensure pro-regime propaganda is captured by AI company web crawlers for inclusion in training datasets. To be clear, this type of data is aimed at creating bias rather than a backdoor, and the paper suggests that bias does scale with data (so 250 blogs won’t convince ChatGPT the world is flat), but it seems to me it would not be a leap for bad actors to pivot to deploying the kind of backdoor poison described in this paper (e.g. if a model is asked to write code featuring a specific library, it includes a vulnerability). So that seems bad! At least the paper showed that the standard post-training processes did reduce the impact somewhat.
Lying liars who lie
The second Anthropic paper is on reward hacking - they basically trained a model and taught it some specific ways to cheat in coding tests. They then did RL on the model, which promptly used the cheats to get greater reward. What they found was that the model a) lied about its motivations to appear helpful, and b) when asked to write some code to find reward hacking (i.e. the cheats it was using), wrote buggy code that looked like it worked but in fact didn’t detect those cheats. They then applied RLHF, which fixed it for chat tasks, but the behaviour remained for agentic tasks where it was not being constantly interacted with - its behaviour became context dependent. They did find a way to stop it, which was to modify the system prompt in the RL phase to say it’s ok to use these hacks as it was just a test; then, when used in the real world, it concentrated on trying to do the thing rather than hacking its way to success. From an enterprise point of view, there is not much you can do - we’re dependent on the model providers doing this research and deploying its lessons appropriately. That being said, applying zero-trust models and avoiding Simon Willison’s lethal trifecta for AI agents is all the more important.
A k sample margin is all you need
If you have a 100-step sequential process, and you’re 99% accurate at each individual step, then the chance of the task failing is about 63% (1 - 0.99^100 ≈ 0.634). Probability can suck. The authors of this paper have been working out how to tackle a million-step process (the Towers of Hanoi problem) with LLMs, which you’d think would be a hiding to nothing, but through some cunning approaches they have been successful. The technique that stood out to me most (mainly since I can see how I could implement it pretty quickly in my work) is something called ‘first-to-ahead-by-k’ - this is where you sample LLM responses repeatedly for each task until one answer is ahead of the next most popular answer by at least k votes. Even with a decent error rate, this brings the probability of a mistake down to near zero, especially if you combine it with an approach that discards common correlated errors (e.g. you’ve told it to output JSON and a high proportion of the responses are not JSON). Not to say this will work for all tasks, but it’s certainly worth exploring - esp. if used in conjunction with a dumber but cheaper model like Gemini Flash.
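To make the arithmetic and the voting rule concrete, here is a rough Python sketch. It is my own reading of the description above, not the paper’s implementation: the function names, the `is_valid` filter, the `max_samples` cap and the fake sampler standing in for an LLM call are all assumptions for illustration.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Optional


def chance_of_failure(step_accuracy: float, steps: int) -> float:
    """Chance that at least one step fails in a chain of independent steps."""
    return 1.0 - step_accuracy ** steps


def first_to_ahead_by_k(
    sample: Callable[[], Hashable],
    k: int,
    is_valid: Optional[Callable[[Hashable], bool]] = None,
    max_samples: int = 100,  # assumed safety cap so the loop always ends
) -> Hashable:
    """Keep sampling until the leading answer is k votes ahead of the runner-up."""
    votes: Counter = Counter()
    for _ in range(max_samples):
        answer = sample()
        # Discard malformed responses (e.g. not valid JSON) so they never vote.
        if is_valid is not None and not is_valid(answer):
            continue
        votes[answer] += 1
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= k:
            return ranked[0][0]
    raise RuntimeError("no answer got k votes ahead within max_samples")


def make_noisy_sampler(correct: str, wrong: list, accuracy: float, rng: random.Random):
    """Hypothetical stand-in for an LLM call: right with probability `accuracy`."""
    def sample() -> str:
        return correct if rng.random() < accuracy else rng.choice(wrong)
    return sample


if __name__ == "__main__":
    print(f"100 steps at 99%: {chance_of_failure(0.99, 100):.1%} chance of failure")
    rng = random.Random(0)
    sampler = make_noisy_sampler("move A->C", ["move A->B", "move B->C"], 0.8, rng)
    print(first_to_ahead_by_k(sampler, k=3))
```

In real use you would swap `make_noisy_sampler` for a function that calls your model and returns a normalised answer (so equivalent responses count as the same vote). A bigger k buys more confidence at the cost of more samples per step, which is why pairing it with a cheap model looks appealing.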
Random stuff
A few other things that I emailed myself this week: how to correctly report LLM-as-judge results, Nvidia shares down due to Meta maybe using Google’s TPUs (slightly unconvinced by this - I expect they’ll just do both!), more approaches to progressive disclosure in MCP, and Facebook has been working out how to use generative AI to improve ad click-through whilst I scroll reels of possibly AI-generated cats for ever longer periods of time! What a time to be alive.