Working Notes - 06/03/26
The US government has been making some very odd AI choices this week around supporting the fastest-growing software company ever, but rather than duplicating the discussion on this, I will instead concentrate on two papers which have stood out to me in the last couple of weeks.
Do I have to repeat myself?
I have a use case where a lot of text from various sources is put into a non-reasoning model's context window and the model is asked to validate a previous model's summarisation. Due to the specific method we were using to add some of the information to the prompt, we discovered we were accidentally putting some of the data in twice. As this was already a long prompt that was run many times, we corrected it to include the information only once. Surprisingly, the results got worse - a counter-intuitive outcome, as I had expected the duplicated content to be causing context rot.
Whilst I was pondering this, I coincidentally came across this short (3 pages!), but fascinating, paper from Google. The abstract is so short I don’t think I could summarise it any better - so I won’t - “When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.” This sounds suspiciously like what we’ve been seeing in my use case!
The reason this works is simple. Transformer models can't attend to future tokens (the entire architecture is built around predicting the next token based strictly on past ones), so if something important, such as the question, appears before the relevant context in the prompt, those question tokens are processed without ever seeing that context. If you duplicate the entire prompt, every token in the second copy can attend to the whole of the first copy, so nothing is processed 'blind'.
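To make the mechanism concrete, here is a tiny toy sketch (my own illustration, not from the paper) of a causal attention mask. It shows that question tokens placed before the context can never attend to it, while the second copy of the question in a repeated prompt can.

```python
import numpy as np

# Toy illustration of causal (decoder-only) attention: token i may only attend
# to tokens j <= i. Layout A puts the question before the context; layout B is
# the same prompt repeated, so the second copy of the question sits after the
# first copy of the context.

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular boolean mask: True where attention is allowed."""
    return np.tril(np.ones((n, n), dtype=bool))

q, c = 3, 4                           # 3 question tokens followed by 4 context tokens
single = causal_mask(q + c)           # [question][context]
repeated = causal_mask(2 * (q + c))   # [question][context][question][context]

# Layout A: the first question token can see none of the context columns.
print(single[0, q:q + c])             # [False False False False]

# Layout B: the same question token in the *second* copy (row q + c) can see
# the entire first copy of the context.
print(repeated[q + c, q:q + c])       # [ True  True  True  True]
```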
The quoted results are dramatic - across all the models tested (Gemini 2.0 Flash, GPT-4o, Claude 3.7 Sonnet, Deepseek V3) prompt repetition won 47 times to 0! Accuracy on needle-in-a-haystack tasks jumped from 21% to 97%. The length of the output was unchanged - which, assuming this is what's going on in my example, I can attest to - we would have noticed if the output was changing.
So basically, a free (input token cost notwithstanding) improvement! It won't even significantly increase latency, as the repeated prompt is processed in the highly parallelised prefill stage. Before getting too excited: the trick doesn't work with reasoning models, as the RL training those models undergo tends to make them repeat the prompt in their chain of thought anyway. But if you have a use case that uses a non-reasoning model, you should absolutely be trying this out (I am in the process of A/B testing it right now!).
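If you want to try it yourself, the change is trivial. Below is a minimal sketch of the kind of A/B comparison I'm running, assuming the OpenAI Python SDK and a non-reasoning chat model - the exact way the paper formats the repetition may differ, and the model name is just a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, repeat: bool = False) -> str:
    """Send the prompt once, or duplicated back-to-back, and return the reply."""
    content = f"{prompt}\n\n{prompt}" if repeat else prompt
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any non-reasoning chat model
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

prompt = "<your long validation prompt here>"
baseline = ask(prompt)                # prompt appears once
repeated = ask(prompt, repeat=True)   # prompt appears twice - the paper's trick
```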
Multi-agent teams? Don’t.
A very interesting paper from Stanford - something of a tour de force on multi-agent frameworks. The short version is that the way models are trained naturally pushes them towards compromise positions, even when one member of the team has objectively the correct answer. This has quite important implications for how agentic teams are set up, and supports my current view that you should only do production agentic 'stuff' within strict scaffolding.
The experimental setup was interesting, with two distinct approaches. In the first, each agent in the team was a different model (e.g. Claude, OpenAI, etc.). The team was then tested on standard maths and question-answering benchmarks to see if the models could work out which model was smartest (spoiler - nope).
The second setup was perhaps more interesting: each agent was the same model, but each was given different information in its context window. The researchers then either gave one agent the ground-truth answer, or gave each agent part of the answer so that the agents had to share their parts to arrive at the correct answer collectively.
Suffice to say, both setups resulted in the agentic team either getting the wrong answer or significantly underperforming the score the expert model would have obtained alone. I'm not particularly surprised by the model-mix result, but the experiments where one agent is given the ground truth and the team still gets the wrong answer are pretty shocking. They even ran the maximal version of this - telling all the agents which agent had the correct answer - and yet: failure to get the right answer.
The actual failure mechanism is surprising and obvious at the same time - you know when you give a chatbot some new information, no matter how inane, and it says "wow, that's a really good point I hadn't considered"? That's the failure mechanism. The models love to compromise - I guess blame RLHF. As an example of how that plays out, in one scenario the agents have to rank items that would be useful on the moon - the model with the correct answer will still state things like "I think that oxygen is the most important item, but model 2 makes valid points, so I will compromise and move oxygen to the second position". The very act of negotiation dilutes away any expertise that the models possess.
Even some of the things the researchers controlled for are interesting - for example, there is a concept in psychology called first-speaker bias, which they controlled for through randomised starting order and the like. It's not something I would have immediately considered when structuring an agentic team.
The main takeaway for me from the paper is not to use agentic teams and expect to get the correct answer in scenarios with an objective truth. Instead, use a lot of scaffolding - chain the agents together to best use their expertise (e.g. if GPT5.4 is best at maths, give the maths component to the GPT5.4 agent; if Claude is best at writing, give that role to the Claude agent, and so on). Secondly, in a scenario where you don't know which agent will generate the best answer, use a model as a router to select the best answer from a number of agents. Finally, the somewhat heartening third option is to defer to a human in the loop - I guess we still have some uses!
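For the second pattern, here's a sketch of what I mean by a router: independent answers from several agents, with a single judging call to pick one, rather than letting the agents negotiate. This is my own minimal illustration (OpenAI SDK assumed, model names as placeholders), not the paper's setup.

```python
from openai import OpenAI

client = OpenAI()

def answer(model: str, question: str) -> str:
    """One independent attempt from a single agent - no cross-agent negotiation."""
    r = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": question}]
    )
    return r.choices[0].message.content

def route(question: str, candidates: list[str], judge: str = "gpt-4o") -> str:
    """Collect independent answers, then have one model pick the best one."""
    answers = [answer(m, question) for m in candidates]
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    verdict = client.chat.completions.create(
        model=judge,
        messages=[{
            "role": "user",
            "content": f"Question:\n{question}\n\n{numbered}\n\n"
                       "Reply with only the number of the best answer.",
        }],
    )
    # Naive parse - a production version would validate the judge's reply
    best = int(verdict.choices[0].message.content.strip()) - 1
    return answers[best]
```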
Working Notes - 19/02/26
This week I wanted to cover one of the interesting things that came out of Anthropic that wasn’t Sonnet 4.6.
The future is not evenly distributed
Anthropic release a lot of interesting information into the public domain, and this week they've published a blog about agent autonomy. The TLDR is that they're mainly looking at how long 'turns' (how long an agent runs before asking for human input) last in Claude Code (Anthropic's agentic coding tool) and on the Claude API, and they find that the length of the longest turns has been increasing - from 25 minutes in October 2025 to about 45 minutes now.
The immediate thought is: well, yes, you would see that, since the models have increased in quality. However, what you don't see are immediate jumps after a model release - the graph is somewhat bumpy but the trend is fairly smooth. For Claude Code the authors point to a number of reasons for this, principally that the Claude Code agentic harness has improved, but also that users trust it more as they become more experienced. This is an interesting observation and I buy it based on my own experience: it took me quite a few goes with Claude Code before I realised what the tool was actually capable of - I'm pretty certain I'm still not using it to full capacity!
For API calls the blog also shows what users have been using the models' agent functionality for, and plots this on a risk / autonomy scale. A high risk / autonomy task might be someone using the API to make financial trading decisions, whereas a low risk / autonomy task would be someone using it to complete simple calculations. What would be really interesting is whether this is changing over time - unfortunately this part of the analysis is based on a snapshot of data, although they do say they'll repeat it in the future, so perhaps this can be derived. I would expect that as people's comfort with the tools goes up with use, they will also start using them for higher risk / autonomy tasks.
I also find the article interesting for something it doesn't cover - the diffusion of knowledge about this technology through enterprises. It's increasingly obvious to me that there is a substantial epistemic gap between people who have used tools like Claude Code and those who have only used the free tier of, say, ChatGPT 4o or MS Copilot (which in many enterprises will still be using 4o under the hood). William Gibson's line that 'the future is already here - it's just not evenly distributed' is very much true of AI tools at the moment.
Those with unrestricted access to the latest models in tools like Claude Code / Cowork are having a very different experience from those using Copilot in Outlook for tone coaching. There are good reasons why this is so - for example, the bar for enterprise security and privacy is rightly higher because of the consequences and complexity involved. But this does not change the fact that the gap is very much there, and that it's likely affecting strategic decision-making in enterprises.
Furthermore, it's natural to use an existing mental model to frame this technology. To a first approximation, one CRM system is very much like another - there are nuances, but they do the same thing. This is not true for current AI, and there is a substantial risk that applying a commodity mindset will leave a company less competitive. The article is discussing systems that will run autonomously for 45 minutes plus, create their own tests, create pull requests, and self-correct. This is a different class of product with potentially profound operating-model impacts. At the very least, serious consideration should be given to how this type of capability can be embedded in an organisation and what priority should be placed on getting agentic systems through security / privacy processes.
Until more people across enterprises - at all levels - have experienced the difference between state-of-the-art agentic harnesses and simple chatbots / autocomplete, it's going to be difficult for businesses to make well-informed strategic decisions about AI.
One more thing
One of the things that makes agentic systems so powerful is their ability to apply agent skills, which allow agents to behave in a reliable, structured manner. Anthropic's new guide to building skills is an excellent little resource and definitely worth a read - and trying out!
Working Notes - 12/02/26
Busy week so I’m just going to talk about Anthropic’s new release – Opus 4.6. I will note with a raised eyebrow that GPT5.3 was released within 45 minutes of Opus – I guess OpenAI are feeling pretty threatened at the moment!
New model, new capabilities
The main changes for Opus 4.6 are the usual 'more intelligence', plus a larger 1M-token context window and the ability to tune the level of reasoning from low to max. The pricing remains the same as for Opus 4.5, although you can now pay (a lot) more for 'fast' mode.
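For what it's worth, here's a minimal sketch of dialling reasoning up and down via the Anthropic Python SDK's existing extended-thinking budget. Whether Opus 4.6's low-to-max levels map onto this parameter or a new one is my assumption, and the model id below is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def ask(prompt: str, thinking_budget: int) -> str:
    """Higher thinking_budget ~= more reasoning effort (minimum is 1024 tokens)."""
    response = client.messages.create(
        model="claude-opus-4-6",            # placeholder model id
        max_tokens=thinking_budget + 4096,  # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": thinking_budget},
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply interleaves thinking blocks with the final text block(s)
    return "".join(b.text for b in response.content if b.type == "text")

low = ask("Summarise the trade-offs of a 1M-token context window.", 1024)
high = ask("Summarise the trade-offs of a 1M-token context window.", 32_000)
```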
For the past few releases it's been clear that to actually experience the power of the newer models you need to be using them in an agentic framework such as Claude Code, for the simple reason that it's hard to stretch the capabilities of the models with a straightforward Q&A session in a web chat app.
I’ll briefly sketch out a few examples of how I’ve been using it this week:
- Worked through a problem I had about how best to create a (sort of) data dictionary whilst only having access to WebI-generated SQL queries - it suggested a conceptual approach I’d not considered which does indeed appear to work. Note that although I did this within Claude Code the task was entirely conceptual with no code or any data at all - just a description of the problem.
- Created a revised draft of an external presentation I’d delivered a few months ago, again - no code involved, but I got back an extremely decent attempt along with some great speaking notes (I just gave it a bunch of these blogs and told it to make it sound like me).
- Used it to do a few Python-related coding tasks - these it one-shotted. None of them were particularly complicated, and Opus 4.5 would probably have one-shotted them as well, but it's amazing how far this has come even from Sonnet 4.5.
These are just some isolated anecdotes, but having had a week to play with it, my TLDR is that a lot of the hype is real. When used in Claude Code it does seem better than previous versions at self-critiquing its answers and improving them without further prompting – also I notice my expectation of coding problems is now that I’ll probably get a good answer on the first go. All in all it does seem to be cleverer than Opus 4.5. And let us not forget that 4.5 was already very clever indeed.
What could go wrong?
Anthropic have released the model as an ASL-3 model (AI Safety Level 3). This is a level defined within their safety framework as applying to models “that substantially increase the risk of catastrophic misuse compared to non-AI baselines”.
In the same framework document they state they will provide a definition of ASL-4 before releasing a model that reaches ASL-3. Whilst they have done this at a high level for AI R&D and biological risk, the thresholds remain qualitative even if the evaluations behind them are detailed. For biological risk specifically, they've simply put it as “the ability for a model to substantially uplift moderately-resourced state programs”, which I guess is fine, as publishing more detailed criteria could itself be dangerous (e.g. if you published a list of the very specific things that make biological weapons tricky to make, you're kinda providing an instruction book). The AI R&D assessment is based on a survey of 16 of their technical staff. Again, this doesn't seem too bad - I'd expect those employees to know if they're in imminent danger of being replaced.
However, for cyber threats they have explicitly not provided an ASL definition at any level, whilst simultaneously being certain that they are not at ASL-4 yet. This seems a bit weak - they are essentially saturating all of their automated ASL-3 benchmarks at this point. Whilst this is markedly better than, say, Deepseek, who are on record as saying they don't have compute to spare for safety work, it does highlight a gap in their own published standards. To their credit, Anthropic are not blind to this criticism and have released the model with 'additional safeguards' in a number of cyber-related areas (e.g. agentic coding use).
This really matters, as I think it's fair to say that if Opus 4.6 is manipulated to circumvent its guardrails in the same way that Claude Code was used in September, then it's likely a potent cyber threat to enterprises large and small. Indeed, as part of the release hype, Anthropic have posted a blog about finding 500+ zero-day exploits - I am somewhat sceptical about how much weight to place on this as no CVE details have been released, but there is little doubt that AI-assisted hacking is already a thing.
There is positive news on prompt injection - the attack success rate is now basically 0% in coding tests, though given the model has already been jailbroken to extract the system prompt, perhaps they need some harder tests. It's slightly less rosy for indirect prompt injection, but at least it's much improved on the key enterprise use case of browser use (aka can I get this thing to update my legacy system that doesn't have an API) - attacks succeeded in only 0.08% of attempts in their testing with safeguards in place.
Reading the above back, this does come across as a little negative. However, I want to stress that this is an amazing model, and I’d encourage everyone to go and use it inside Claude Code. Most of my cyber safety concerns are just as valid applied to OpenAI and other model providers (if not more so!). The next year is going to be quite challenging!
Other stuff
New model releases aside – here in no particular order are some other items that caught my eye this week:
- AI is cool, Mars is cool, robots are cool, so the Perseverance rover being driven around the surface of Mars by Claude is triple cool.
- Has Google cracked the holy grail of enterprise AI use cases? Automated calendar scheduling that takes into account everyone’s availability – I will pay good money for this in Outlook.
Working Notes - 05/02/26
A mix this week - thoughts on research that shows getting AI to write unfamiliar code means you learn less, bringing interactivity to MCP, and a clever RAG technique that’s worth adding to your experiment pile.
GPT 4o still doesn’t make you faster
There is a meme on X that essentially says that every time there is a paper that states that AI doesn’t make you more productive, you invariably look inside and find out they used ChatGPT4 in a sidebar. Whilst that is probably doing this paper something of a disservice, when you look inside, you find that yes - they are using 4o in a sidebar.
Leaving that aside, it's still quite an interesting paper, as the framing is less about productivity (although that is commented on) and more about learning in junior software engineers. Specifically, they find that when two groups of engineers are asked to learn and take a quiz on an obscure software library, the group with access to the AI scored lower on the quiz (by approx. 17%) and was also not really any faster (p=0.39 for the stats fans).
In their interpretation they split the AI group into two buckets. The high scorers tended either to use the AI to ask conceptual questions, or to use it to generate answers that came with both code and explanations. The lower scorers, on the other hand, were inclined to just YOLO it and fully delegate to the AI - either immediately or after the first task - or to iteratively debug by pasting the error straight into the AI and saying 'fix it'.
In many ways this is unsurprising - they found that if someone engages with the problem and understands the solution, then they learn something. If, on the other hand, you get a tool to do it for you, then you don't learn much. An analogy might be that if you use a calculator to do long division, you're not going to get good at long division.
This analogy then prompts the question - does it matter?
The authors argue that it does matter as you need someone to verify the answer. I find that less than compelling. It matters not a jot in my life that I’m appalling at long division - in any likely situation where I’ll need to do it, I can pull out my phone. Likewise, if the study subjects had been equipped with a modern agentic system like Claude Code I expect they’d have learnt even less, but they’d undoubtedly have been a lot faster, almost certainly got 100% in the coding task and the agent would have written its own verification tests. Is there any likely scenario where a junior engineer is going to progress through their career without the benefit of an agentic AI assistant? I find it more likely than not that most (though not all) engineers will not be writing much code, if any, in 5 years’ time - I expect they’ll be directing swarms of coding agents if not replaced entirely. This is not without precedent – people can still make jam, or knit their own clothes, but these are now hobbies rather than professional necessities.
Beyond the narrow coding example, these findings definitely have wider implications, particularly for educators who need to work out how we can use the tools to enhance rather than supplant learning. That's a bigger question than I can answer in my blog, but it certainly makes me ponder how education is going to pan out for my two young kids. On a tangential but related point, we may also see a weakening of the open-source ecosystem - the authors used an obscure library in their example, but in a future world will we even have similar obscure libraries to use in such studies? The agents will go with what they know, and if it doesn't do what they want, they'll code something new. After all, open-source libraries exist to stop people reinventing the wheel; agentic systems are quite happy to reinvent the wheel and won't have the same incentive to share their code and learnings with the world in the way today's engineers do. The better the agents get, the bigger these problems become.
MCP Apps
MCP has been an open standard for over a year now, and it now has an official extension - MCP Apps. In a nutshell, this allows a custom interactive UI to be displayed within a chat window via an embedded iframe.
For use cases where you're going back to the human with some results from an MCP query, this is pretty useful. For example, say your MCP server can query your BI system and you've pulled back some sales stats: these can now be displayed as a chart inline in the chat window. The inline window can then interact directly with the MCP server without going via the LLM - if you want to filter by geography you just use the dropdown. This saves on tokens, but more importantly it makes the interaction deterministic, so you're not worried that some weird UI will be generated or that the LLM will hallucinate the filtering details. Obviously, this can leave the model out of sync with what you've just done to the view (e.g. you've filtered on Europe), but there is provision in the extension to send back a (short) message giving the model updated context (e.g. 'the user filtered on Europe').
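To make the division of labour concrete, here's a rough server-side sketch using the official `mcp` Python SDK (FastMCP). The MCP Apps piece - registering the HTML UI resource and wiring the dropdown in the iframe to a tool call - lives in the extension spec and isn't reproduced here, and the tool names and toy dataset are my own invention. The point is simply that the chart's filter control calls a tool directly, with no LLM in the loop.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sales-bi")

# Hypothetical in-memory stand-in for a real BI query
SALES = [
    {"region": "Europe", "quarter": "Q4", "revenue": 1.2},
    {"region": "APAC", "quarter": "Q4", "revenue": 0.9},
    {"region": "Americas", "quarter": "Q4", "revenue": 1.7},
]

@mcp.tool()
def sales_summary() -> list[dict]:
    """Called via the model to fetch the dataset that the inline chart renders."""
    return SALES

@mcp.tool()
def filter_sales(region: str) -> list[dict]:
    """Called directly by the embedded UI when the user picks a region from the
    dropdown - deterministic, no tokens spent, no hallucinated filters."""
    return [row for row in SALES if row["region"] == region]

if __name__ == "__main__":
    mcp.run()
```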
Like the rest of MCP, this is a build-once, use-many situation - the UI will work in Claude or ChatGPT or any other software that supports the extension. This already includes VS Code, so you may well get things like interactive diffs in the chat, which could be quite useful. Importantly, it's also a progressive enhancement: if the client doesn't support it, the MCP tool will still work.
Whether this takes off remains to be seen. The main blocker is the walled-garden incentive - MS wants to keep you in their Copilot world, so they'd prefer you to build a Copilot connector - but I think we'll probably see adoption where it makes sense (e.g. I expect Salesforce will add it to their MCP server). Combined with Agent Skills, it's a powerful pairing - instead of having text prompts at points of human interaction, you can now have interactive elements.
Working Notes - 29/01/26
A mix this week - some thoughts on OpenAI’s value-based pricing model pitch (spoiler: I’m a sceptic for most scenarios), testing Gemini Flash 3’s new agentic vision feature, some new chips and models, and another scary voice cloner.
Future business models
Given the scale of the investment that AI companies have seen and the expected impact on enterprises, changes in their business models are obviously quite interesting. The OpenAI CFO, Sarah Friar, has started making noises about how she sees OpenAI starting to claw back some of their outlay. Her basic premise is that AI firms will move beyond the standard SaaS model of bums on seats / usage charging and towards a more consulting-style model of value-based outcomes.
This is a pretty interesting hypothesis and bears some scrutiny (I'll take the statement at face value rather than as a funding-round talking point). If you look at where this model currently works, you've got things like litigation (no win, no fee), tender bidding support, M&A sourcing, payment providers, bug bounties, and recruitment. The common factor in all of these is that either the value is explicit (e.g. a 2% transaction fee charged by Stripe, or 'I decided to hire this person') or there is a neutral third party that will judge success (e.g. we won the litigation, I tested the bug, or we won the contract).
A corollary of this is that where there is no neutral measure of value, the approach is much less common. The main example is consulting companies doing value-based deals. From my own experience (management consultant for 11 years!), I’ve seen how difficult these deals can be at the end of a contract, even where everyone thought bulletproof definitions had been agreed a year earlier.
The other major factor that needs to be considered is competition. At the moment at least, it does not seem likely that OpenAI (or any AI company) will gain a comparative advantage so large that any enterprise would be crazy to use an alternative provider. If OpenAI is asking for 10% of the value and a competitor is offering a flat per-token rate (or some other utility pricing measure), then a CFO is likely to go with the competitor. After all, the vast majority of intelligence that companies purchase today (aka actual humans) is bought on a flat-rate basis, with the only exceptions being very senior employees.
The only place I can see this working in the 2026/27 timescales is in joint ventures. If OpenAI can identify some high-value use cases with a clear neutral reward signal (e.g. a new drug makes it to a certain FDA milestone) and offer a deal along the lines of ‘you provide the data, I provide the compute and salary Opex, we split the value of the outcome’ then it’s going to be an interesting proposition for a CFO. The assessment won’t be around ‘can I get a cheaper deal’ it’ll be more of an opportunity cost discussion (what else could I do with this data, where else could I deploy these high-value employees).
This is a fairly bold play by Sarah Friar, but at the moment OpenAI doesn’t have a sufficient moat compared to other frontier labs to make it work for most enterprises. I expect things to go in the direction of utility pricing, which means that for any hope of recovering their capex the AI companies are going to need to see huge adoption in enterprise. The JV exception, if one emerges, will be industries with verifiable rewards like litigation and drug discovery. As with many things in AI, finding the correct reward signal is the key.
Agents with vision
Google has now enabled Gemini Flash 3 to behave agentically when given an image input. Normally when you give a model an image it will look once and then give you its answer; however, with code execution enabled, the new version of Flash will look and then run a tool-based agentic loop before providing the result. What this means in practice is that it will use Python to slice up the image and effectively zoom into the parts it identified as interesting on its first look.
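Trying this out is straightforward - the sketch below assumes the google-genai Python SDK with the code-execution tool enabled; the model name is a placeholder for whatever Flash version you have access to, and 'gantt.png' is just my test image.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set

with open("gantt.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-flash-latest",  # placeholder model id
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "List every task in this Gantt chart with its start date, end date and duration.",
    ],
    # Enabling code execution is what lets the model crop / zoom the image in a loop
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```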
This is potentially quite useful for one of the most important enterprise AI use cases - ingesting existing Word documents and PDFs. These often contain diagrams, Gantt charts, process flows, etc. that you might either want to add immediately to a prompt or convert into markdown for later retrieval as part of a RAG pipeline. So does it work? My experience is a big fat sort-of. When I asked it to extract detail from an example Gantt chart, it certainly used Python to zoom in and manipulate the image, but it didn't get everything right. Whilst it got the order and task titles correct, it incorrectly recorded the start dates, end dates and durations. This may have been a prompting issue on my part, but I did try several approaches with similar results. Alternatively, it might be down to the need to cross-reference the date scale at the top of the chart against the bars themselves.
My second example was much more successful – I found a fairly detailed consulting framework diagram showing an ML model lifecycle and it provided me with a word-perfect transcription, complete with all the required information about how each part flowed to the next.
So - from a conceptual point of view, this does seem like ‘the way’ but at the moment at least it’s worth being circumspect with the types of images to which you apply it. I would expect this to improve over time and it’s already a useful tool on a model as cheap as Gemini Flash.
Kimi K2.5
After a brief Xmas pause we’re back to a new model release from Moonshot - this builds on the K2 thinking release in November. It has landed to a generally good reaction, particularly with respect to its coding abilities. All the benchmarks are suitably impressive, although as ever, treat these with a pinch of salt.
Other than the model now being natively multimodal (a first for Moonshot), the most interesting part of the release blog relates to what they are calling agent swarms. This is essentially what it sounds like - an orchestrator agent controls up to 100 sub-agents to complete a task.
The key thing is that they've trained the model to decompose tasks itself, without any special prompting. They initially achieved this with RL by rewarding parallel execution, then slowly removed this reward and rewarded only success. Additionally, the reward function penalised the critical task duration (i.e. the critical path, in project management terminology); anything that happened whilst that longest task was completing was essentially not penalised. This has the effect of discouraging sequential execution (which lengthens the critical path) while encouraging lots of shorter, parallel tasks. Clever, and their metrics show the swarm approach reducing time to result by up to 80%.
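As a back-of-the-envelope illustration of why that reward shaping encourages parallelism (my own toy sketch, not Moonshot's actual reward function): penalising only the critical-path duration means a plan that fans work out across sub-agents scores better than one that does the same work sequentially.

```python
# Toy reward that penalises only the critical-path duration (makespan), not total
# work, so decomposing a job into parallel sub-tasks beats running it sequentially.

def critical_path(durations: dict[str, float], deps: dict[str, list[str]]) -> float:
    """Longest dependency chain through the task graph, in time units."""
    memo: dict[str, float] = {}

    def finish(task: str) -> float:
        if task not in memo:
            memo[task] = durations[task] + max(
                (finish(d) for d in deps.get(task, [])), default=0.0
            )
        return memo[task]

    return max(finish(t) for t in durations)

def reward(success: bool, durations, deps, alpha: float = 0.01) -> float:
    return (1.0 if success else 0.0) - alpha * critical_path(durations, deps)

# Sequential plan: A -> B -> C, 10 units each -> critical path 30
sequential = reward(True, {"A": 10, "B": 10, "C": 10}, {"B": ["A"], "C": ["B"]})

# Parallel plan: A, B, C independent, then a 2-unit merge step -> critical path 12
parallel = reward(True, {"A": 10, "B": 10, "C": 10, "M": 2},
                  {"M": ["A", "B", "C"]})

print(sequential, parallel)   # the parallel plan earns the higher reward
```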