<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ewanpanter.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ewanpanter.github.io/" rel="alternate" type="text/html" /><updated>2026-03-06T21:44:25+00:00</updated><id>https://ewanpanter.github.io/feed.xml</id><title type="html">Ewan Panter | Working Notes</title><subtitle>A personal blog I will probably fail to update.</subtitle><author><name>Ewan Panter</name></author><entry><title type="html">Working Notes - 06/03/26</title><link href="https://ewanpanter.github.io/2026-03-06/things-i've-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 06/03/26" /><published>2026-03-06T00:00:00+00:00</published><updated>2026-03-06T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-03-06/things-i&apos;ve-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-03-06/things-i&apos;ve-noted-this-week"><![CDATA[<p>The US government has been making some <a href="https://x.com/SecWar/status/2027507717469049070?s=20">very odd</a> AI <a href="https://www.anthropic.com/news/where-stand-department-war">choices</a> this week around supporting the <a href="https://economictimes.indiatimes.com/news/international/us/anthropic-hits-14-billion-revenue-run-rate-becomes-fastest-growing-software-company-ever/articleshow/128304436.cms?from=mdr">fastest-growing software company ever</a>, but rather than duplicating the discussion on this, I will instead  concentrate on two papers which have stood out to me in the last couple of weeks.</p>

<h2 id="do-i-have-to-repeat-myself">Do I have to repeat myself?</h2>

<p>I have a use case where a lot of text from various sources is put into a non-reasoning model’s context window and the model is asked to validate a previous model’s summarisation. Due to the specific method we were using to add some of the information to the prompt, we discovered we were accidentally including some of the data twice. As this was already a long prompt that was run many times, we corrected the prompt to include the information only once. Surprisingly, the results got worse - counter-intuitive, since if anything I had expected the duplication to be causing <a href="https://research.trychroma.com/context-rot">context rot</a>.</p>

<p>Whilst I was pondering this, I coincidentally came across this short (3 pages!), but fascinating, <a href="https://arxiv.org/pdf/2512.14982">paper</a> from Google. The abstract is so short I don’t think I could summarise it any better - so I won’t - “When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.” This sounds suspiciously like what we’ve been seeing in my use case!</p>

<p>The reason why this works is super simple. Transformer models can’t attend to future tokens (the entire architecture is built around predicting the next token based strictly on past ones), so if something important - say, the question - appears before a crucial piece of context, the question’s tokens are processed without ever ‘seeing’ that context. If you duplicate the entire prompt, every token in the second copy can attend to everything in the first.</p>
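<p>To make that concrete, here is a minimal sketch of the trick in Python (the commented-out client call is a placeholder for whichever non-reasoning model and SDK you happen to be using, not anything from the paper):</p>

<pre><code class="language-python">def repeat_prompt(prompt: str):
    # Append a verbatim copy of the whole prompt. The second copy lets every
    # token attend to everything in the first copy (question included), and
    # the extra cost is input tokens only - the prefill stage is highly parallel.
    return prompt + "\n\n" + prompt

# Hypothetical usage with whatever non-reasoning model / client you call:
# response = client.generate(model="some-non-reasoning-model",
#                            prompt=repeat_prompt(long_prompt))
</code></pre>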

<p>The quoted results are dramatic - across all the models tested (Gemini 2.0 Flash, GPT-4o, Claude 3.7 Sonnet, Deepseek V3) prompt repetition won 47 times to 0! Accuracy in needle-in-a-haystack tasks jumped from 21% to 97%. The length of the output was unchanged - which, assuming this is what’s going on in my example, I can attest to: we would have noticed if the output had changed.</p>

<p>So basically, a free (input token cost notwithstanding) improvement! It won’t even significantly increase latency, as the repeated prompt is processed in the highly parallelised prefill stage. Before you get too excited, note that the trick doesn’t work with reasoning models, as the RL training they undergo tends to result in the models repeating the prompt in their chain of thought anyway. But if you have a use case that uses a non-reasoning model, you should absolutely be trying this out (I am in the process of A/B testing it right now!).</p>

<h2 id="multi-agent-teams-dont">Multi-agent teams? Don’t.</h2>

<p>A very interesting <a href="https://arxiv.org/pdf/2602.01011">paper</a> from Stanford which is something of a tour de force on multi-agent frameworks. The short version is that model training naturally means models will tend to reach compromise positions, even when a member of the team has the objectively correct answer. This has quite important implications for how agentic teams are set up, and supports my current view that you should only do production agentic ‘stuff’ within strict scaffolding.</p>

<p>The experimental setup was interesting, with two distinct approaches. The first used a team of agents where each agent was a different model (e.g. Claude, OpenAI’s GPT, etc.). The team was then tested on standard maths and question-answering benchmarks to see if the models could work out which model was smartest (spoiler - nope).</p>

<p>The second setup was perhaps more interesting: each agent was the same model, but each was given different information in its context window. The researchers then either gave one agent the ground-truth answer, or gave each agent part of the answer so that the team had to share their pieces to reach the correct answer collectively.</p>

<p>Suffice to say, both setups resulted in the agentic team either getting the wrong answer or at least significantly underperforming the score that would have been obtained by the expert model alone. I’m not particularly surprised by the model-mix result, but the experiments where one of the agents is given the ground truth and the team still gets the wrong answer are pretty shocking. They even ran the maximal version of this and told all the agents which agent had the correct answer - and the team still failed to get it right.</p>

<p>The actual failure mechanism is surprising but obvious at the same time - you know when you give a chatbot some new information, no matter how inane, and it says “wow, that’s a really good point I hadn’t considered”? That’s the failure mechanism. The models love to compromise - I guess blame RLHF. As an example of how that plays out, in one scenario the agents have to <a href="https://hr.lib.byu.edu/00000179-146f-d837-a9ff-1e7f5b530000/nasa-moon-survival">rank items that would be useful on the moon</a> - the model with the correct answer will still state things like “I think that oxygen is the most important item, but model 2 makes valid points, so I will compromise and move oxygen to the second position”. The very act of negotiation dilutes any expertise that the models possess.</p>

<p>Even some of the things that the researchers controlled for are interesting - for example, there is a concept in psychology called <a href="https://biasbeware.com/2018/12/11/the-first-person-to-speak-has-the-most-influence/">first-speaker bias</a>, which they controlled for through randomised starting order and the like. It is not something I would have immediately considered when structuring an agentic team.</p>

<p>The main takeaway for me from the paper is not to use agentic teams and expect to get the correct answer in scenarios with objective truth. Instead use a lot of scaffolding - chain the agents together to best use their expertise (e.g. if GPT5.4 is best at maths, then give a maths component to the GPT5.4 agent, if Claude is best at writing, then give that role to that Claude agent, etc.). Secondly in a scenario where you don’t know which agent will generate the best answer, use a model as a router to select the best answer from a number of agents. Finally, the somewhat heartening third option is to defer to a human in the loop - I guess we still have some uses!</p>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[The US government has been making some very odd AI choices this week around supporting the fastest-growing software company ever, but rather than duplicating the discussion on this, I will instead concentrate on two papers which have stood out to me in the last couple of weeks. Do I have to repeat myself? I have a use case where a lot of text information from various sources is put into a non-reasoning model’s context window and it’s asked to validate a previous model’s summarisation. Due to the specific method we were using to add some of the information to the prompt, we discovered we were actually accidentally putting some of the data in twice. As this was already a long prompt that was run many times, we corrected the prompt to only put the information in once. Surprisingly the results actually got worse - a counter-intuitive result as I was expecting context rot. Whilst I was pondering this, I coincidentally came across this short (3 pages!), but fascinating, paper from Google. The abstract is so short I don’t think I could summarise it any better - so I won’t - “When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.” This sounds suspiciously like what we’ve been seeing in my use case! The reason why this works is super simple. Transformer models can’t attend to future tokens (as the entire architecture is built around predicting the next token based strictly on past ones) so if something is important (e.g. the question) but the model has not yet seen some important context, it won’t be able to take this information into account and will process the question without seeing the context. If you just duplicate the entire prompt then every token can attend to every other token. The quoted results are dramatic - across all the models tested (Gemini 2.0 Flash, GPT-4o, Claude 3.7 Sonnet, Deepseek V3) prompt repetition won 47 times to 0! Accuracy in needle in a haystack tasks jumped from 21% to 97%. The length of the output was unchanged - which assuming if this is what’s going on with my example, I can attest to - we would have noticed if the output was changing. So basically, a free (input token cost notwithstanding) improvement! It won’t even significantly increase latency as the repeated prompt is processed in the very parallelised prefill stage. Before getting too excited, the trick doesn’t work with reasoning models, as the RL training the models undergo tend to result in the models repeating the prompt in their chain of thought anyway. But if you have a use case that uses a non-reasoning model, you should absolutely be trying this out (I am in the process of A/B testing this right now!). Multi-agent teams? 
Don’t. A very interesting paper from Stanford which is something of a tour de force on multi-agent frameworks. The short version is that they found that model training naturally means that they will tend to reach compromise positions even when a member of the team has objectively the correct answer. This has quite important implications for how agentic teams are set up, and supports my current view that you should only do production agentic ‘stuff’ within strict scaffolding. The experimental set up was interesting with two distinct approaches. The first approach was with a team of agents where each agent was a different model (e.g. Claude, OpenAI, etc). The team was then tested on standard maths, question/answer benchmarks to see if the models could work out which model was smartest (spoiler - nope). The second setup was perhaps more interesting with each agent being the same model, but was provided with different information given in their context windows. They then either gave one agent the ground truth answer, or they gave each agent part of the answer and they had to collectively share their part to get the correct answer. Suffice to say both setups resulted in the agentic team getting the wrong answer or at least significantly underperformed the score that would have been obtained by the expert model alone. I’m not particularly surprised by the model mix experiment result, but the experiments where one of the agents is given the ground truth but the team still got the wrong answer is pretty shocking. They did the maximal version of this and actually told all the agents which agent had the correct answer, and yet - failure to get the right answer. The actual failure mechanism is surprising but obvious at the same time - you know when you give a chatbot some new information, no matter how inane, and it says “wow, that’s a really good point I hadn’t considered”? That’s the failure mechanism. The models love to compromise - I guess blame RLHF. As an example of how that plays out, in one scenario they have to rank items that would be useful on the moon - the model with the correct answer, will still state things like “I think that oxygen is the most important item, but model 2 makes valid points, so I will compromise and move oxygen to the second position”. The very act of negotiation dilutes away any expertise that the models possess. Even some of the things that the researchers controlled for are interesting - for example there is a concept in psychology called first-speaker bias - they controlled for this through random starting etc, but it is not something I would have immediately considered when structuring an agentic team. The main takeaway for me from the paper is not to use agentic teams and expect to get the correct answer in scenarios with objective truth. Instead use a lot of scaffolding - chain the agents together to best use their expertise (e.g. if GPT5.4 is best at maths, then give a maths component to the GPT5.4 agent, if Claude is best at writing, then give that role to that Claude agent, etc.). Secondly in a scenario where you don’t know which agent will generate the best answer, use a model as a router to select the best answer from a number of agents. 
Finally, the somewhat heartening third option is to defer to a human in the loop - I guess we still have some uses!]]></summary></entry><entry><title type="html">Working Notes - 19/02/26</title><link href="https://ewanpanter.github.io/2026-02-19/things-i've-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 19/02/26" /><published>2026-02-19T00:00:00+00:00</published><updated>2026-02-19T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-02-19/things-i&apos;ve-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-02-19/things-i&apos;ve-noted-this-week"><![CDATA[<p>This week I wanted to cover one of the interesting things that came out of Anthropic that wasn’t <a href="https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf">Sonnet 4.6</a>.</p>

<h2 id="the-future-is-not-evenly-distributed">The future is not evenly distributed</h2>
<p>Anthropic release a lot of interesting information into the public domain; this week they’ve published a blog about <a href="https://www.anthropic.com/research/measuring-agent-autonomy">agent autonomy</a>. The TLDR is that they’re mainly looking at how long ‘turns’ (how long an agent runs before asking for human input) last in Claude Code (Anthropic’s agentic coding tool) and on the Claude API, and they find that the length of the longest turns has been increasing, from 25 minutes in October 2025 to about 45 minutes now.</p>

<p>The immediate thought is: well, yes, you would see that, since the models have increased in quality. However, what you don’t see are immediate jumps after a model release; the graph is somewhat bumpy but the trend is fairly smooth. For Claude Code the authors point to a number of reasons for this, principally that the Claude Code agentic harness <a href="https://code.claude.com/docs/en/changelog">has improved</a>, but also that users trust it more as they become more experienced. This is an interesting observation and I buy it based on my own experience. It took me quite a few goes with Claude Code before I realised what the tool was actually capable of – I’m pretty certain I’m still not using it to full capacity!</p>

<p>For API calls the blog also shows what users have been using agent functionality for, and plots this on a risk / autonomy scale. A high risk / autonomy task might be someone using the API to make financial trading decisions, whereas a low risk / autonomy task would be someone using it to complete simple calculations. What would be really interesting is whether this is changing over time – unfortunately this part of the analysis is based on a snapshot of data, although they do say they’ll repeat it in the future, so perhaps this can be derived. I would expect that as people’s comfort level with the tool goes up with use, they will also start using it for higher risk / autonomy tasks.</p>

<p>I also find the article interesting for something it doesn’t cover - the diffusion of knowledge about this technology through enterprises. It’s increasingly obvious to me there is a substantial epistemic gap between people who have used tools like Claude Code and those who have only used the free tier of, say, ChatGPT 4o or MS Copilot (which in many enterprises will still be using 4o under the hood). William Gibson’s line that ‘the future is already here – it’s just not evenly distributed’ is very much true when it comes to AI tools at the moment.</p>

<p>Those with unrestricted access (i.e. the latest models) to tools like Claude Code / Cowork are having a very different experience to those who are using Copilot in Outlook for tone coaching. There are good reasons why this is so - for example, the bar for enterprise security and privacy is rightly higher, given the consequences of failure and the complexity of the environment. But this does not change the fact that the gap is very much there, and that it’s likely impacting strategic decision making in enterprises.</p>

<p>Furthermore, it’s natural to use an existing mental model to frame this technology. To a first approximation one CRM system is very much like another - there are nuances, but they do the same thing. This is not true for current AI. There seems to be a substantial risk that applying a commodity mindset will result in a company being less competitive. The article is discussing systems that will run autonomously for 45 minutes plus, create their own tests, create pull requests, and self-correct. This is a different class of product with potentially profound operating model impacts. At the very least, serious consideration should be given to how this type of capability can be embedded in an organisation, and what priority we should put on getting agentic systems through our security / privacy processes.</p>

<p>Until more people across enterprises - at all levels - have experienced the difference between state-of-the-art agentic harnesses and simple chatbots / autocomplete, it’s going to be difficult for businesses to make well-informed strategic decisions about AI.</p>

<h2 id="one-more-thing">One more thing</h2>

<p>One of the things that makes agentic systems so powerful is their ability to apply <a href="https://agentskills.io/home">agent skills</a> as they allow agents to behave in a reliable / structured manner. Anthropic’s new <a href="https://resources.anthropic.com/hubfs/The-Complete-Guide-to-Building-Skill-for-Claude.pdf">guide to building skills</a> is an excellent little resource and definitely worth a read – and trying out!</p>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[This week I wanted to cover one of the interesting things that came out of Anthropic that wasn’t Sonnet 4.6. The future is not evenly distributed Anthropic release a lot of interesting information into the public domain, this week they’ve published a blog about agent autonomy. The TLDR of the blog post is that they’re mainly looking at how long ‘turns’ (how long an agent runs for before asking for human input) last for in Claude Code (Anthropic’s agentic coding tool) and on the Claude API, and they find that the length of the longest turns have been increasing from 25 minutes in October 2025 to about 45 minutes now. The immediate thought is well, yes you would see that since the models have increased in quality. However what you don’t see is immediate jumps after a model release, the graph is somewhat bumpy but the trend is fairly smooth. For Claude Code the authors point to a number of reasons for this, principally the Claude Code agentic harness has improved, but also users are trusting it more as they become more experienced. This is an interesting observation and I buy it based on my own experience. It took me quite a few goes with Claude Code before I realised what the tool was actually capable of – I’m pretty certain I’m still not using it to full capacity! For API calls the blog also shows what users have been using agent functionality on the models for and plot this on a risk / autonomy scale. A high risk/ autonomy task might be someone using the API to make financial trading decisions, whereas a low risk / autonomy task would be like someone using it to complete simple calculations. What would be really interesting, is whether this is changing over time – unfortunately this part of the analysis is based on a snapshot of data, although they do say they’ll repeat in the future so perhaps this can be derived. I would expect that as people’s comfort level with the tool goes up with use, they will also start using the tool for high risk/ autonomy tasks. I also find the article interesting for something it doesn’t cover - the diffusion of knowledge about this technology through enterprises. It’s increasingly obvious to me there is a substantial epistemic gap between people who have used tools like Claude Code and those who have only used the free tier of say ChatGPT 4o or MS Copilot (which in many enterprises will still be using 4o under the hood). To quote William Gibson, ‘The future is here it’s just not evenly distributed’, is very much true when it comes to AI tools at the moment. Those with unrestricted access (i.e. the latest models) to tools like Claude Code / Cowork are having a very different experience to those who are using Copilot in Outlook to give tone coaching. There are good reasons why this is so - for example the bar for enterprise security and privacy is rightly higher due to consequence and complexity. But this does not change the fact that the gap is very much there and that it’s likely impacting strategic decision making in enterprises. 
Furthermore, it’s natural to use an existing mental model to frame this technology. To a first approximation one CRM system is very much like another - there are nuances but they do the same thing. This is not true for current AI. There seems a substantial risk that applying a commodity mind-set will result in a company being less competitive. The article is discussing systems that will run autonomously for 45 minutes plus, create their own tests, create pull requests, and self-correct. This is a different class of product with potentially profound operating model impacts. At the very least serious consideration should be given to how this type of capability can be embedded in an organisation and what priority we should put on getting agentic systems through our security / privacy processes. Until more people across enterprises - at all levels - have experienced the difference between state-of-the-art agentic harnesses and simple chatbots / autocomplete it’s going to be difficult for businesses to make well-informed strategic decisions about AI. One more thing One of the things that makes agentic systems so powerful is their ability to apply agent skills as they allow agents to behave in a reliable / structured manner. Anthropic’s new guide to building skills is an excellent little resource and definitely worth a read – and trying out!]]></summary></entry><entry><title type="html">Working Notes - 12/02/26</title><link href="https://ewanpanter.github.io/2026-02-12/things-i've-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 12/02/26" /><published>2026-02-12T00:00:00+00:00</published><updated>2026-02-12T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-02-12/things-i&apos;ve-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-02-12/things-i&apos;ve-noted-this-week"><![CDATA[<p>Busy week so I’m just going to talk about Anthropic’s new release – Opus 4.6. I will note with a raised eyebrow that GPT5.3 was released within 45 minutes of Opus – I guess OpenAI are feeling pretty threatened at the moment!</p>

<h2 id="new-model-new-capabilities">New model, new capabilities</h2>
<p>The main changes for <a href="https://www-cdn.anthropic.com/c788cbc0a3da9135112f97cdf6dcd06f2c16cee2.pdf">Opus 4.6</a> are the usual ‘more intelligence’, but also a larger 1M token context window and the ability to tune the level of reasoning from low to max. The pricing remains the same as for Opus 4.5, although you can now pay (a lot) more for <a href="https://code.claude.com/docs/en/fast-mode">‘fast’ mode</a>.</p>

<p>For the past few releases it’s been clear that in order to actually experience the power of the newer models you need to be using them in an agentic framework such as <a href="https://code.claude.com/docs/en/quickstart">Claude Code</a>. This is for the simple reason that it’s hard to stretch the capabilities of the models with a straightforward Q&amp;A session in a web chat app.</p>

<p>I’ll briefly sketch out a few examples of how I’ve been using it this week:</p>
<ul>
  <li>Worked through a problem I had about how best to create a (sort of) data dictionary whilst only having access to WebI-generated SQL queries - it suggested a conceptual approach I’d not considered which does indeed appear to work. Note that although I did this within Claude Code the task was entirely conceptual with no code or any data at all - just a description of the problem.</li>
  <li>Created a revised draft of an external presentation I’d delivered a few months ago, again - no code involved, but I got back an extremely decent attempt along with some great speaking notes (I just gave it a bunch of these blogs and told it to make it sound like me).</li>
  <li>Used it to do a few Python-related coding tasks - these it one-shotted. None of them were particularly complicated, and Opus 4.5 would probably have one-shotted them as well, but it’s amazing how far this has come even from Sonnet 4.5.</li>
</ul>

<p>These are just some isolated anecdotes, but having had a week to play with it, my TLDR is that a lot of the hype is real. When used in Claude Code it does seem better than previous versions at self-critiquing its answers and improving them without further prompting – also I notice my expectation of coding problems is now that I’ll probably get a good answer on the first go. All in all it does seem to be cleverer than Opus 4.5. And let us not forget that 4.5 was already very clever indeed.</p>

<h2 id="what-could-go-wrong">What could go wrong?</h2>
<p>Anthropic have released the model as an ASL-3 model (AI Safety Level 3). This is a level defined within their <a href="https://www.anthropic.com/news/anthropics-responsible-scaling-policy">safety framework</a> as applying to models “that substantially increase the risk of catastrophic misuse compared to non-AI baselines”.</p>

<p>In this same framework document they state they will provide a definition of ASL-4 before a model that reaches ASL-3 is released. Whilst they have done this at a high level for AI R&amp;D and biological risk, the thresholds remain qualitative even if the evaluations behind them are detailed. For biological risk specifically, they’ve simply put it as “the ability for a model to substantially uplift moderately-resourced state programs”, which I guess is fine, as publishing more detailed criteria could in itself be dangerous (e.g. if you published a list of the very specific things that make biological weapons tricky to make, you’re kinda providing an instruction book). The AI R&amp;D assessment is based on a survey of 16 of their technical staff. Again, this doesn’t seem too bad – I’d expect those employees to know if they’re in imminent danger of being replaced.</p>

<p>However, for cyber threats they have explicitly not provided an ASL definition at any level whilst simultaneously being certain that they are not at ASL-4 yet. This seems a bit weak - they are essentially saturating all of their automated ASL-3 benchmarks at this point. Whilst this is markedly better than say Deepseek, who are on record as saying they don’t have <a href="https://www.scmp.com/tech/article/3342279/chinese-ai-firms-defend-safety-practices-push-back-western-criticism">compute to spare for safety work</a>, it does appear to highlight a gap in their own published standards. To their credit, Anthropic are not blind to this criticism and have released the model with ‘additional safeguards’ in a number of cyber related areas (e.g. agentic coding use).</p>

<p>This really matters, as I think it’s fair to say that if Opus 4.6 is manipulated to circumvent its guardrails in the same way that <a href="https://assets.anthropic.com/m/ec212e6566a0d47/original/Disrupting-the-first-reported-AI-orchestrated-cyber-espionage-campaign.pdf">Claude Code was used in September</a> then it’s likely a potent cyber threat to enterprises large and small. Indeed, as part of the release hype, Anthropic have posted a blog about finding <a href="https://red.anthropic.com/2026/zero-days/">500+ zero day exploits</a> - I am somewhat sceptical about how much weight to place on this as no CVE details have been released, but there is little doubt AI assisted hacking is already a thing.</p>

<p>There is positive news on prompt injection - attacks now succeed in basically 0% of coding tests, though given the model has already been jailbroken to extract the <a href="https://github.com/elder-plinius/CL4R1T4S/blob/main/ANTHROPIC/Claude_Opus_4.6.txt">system prompt</a>, perhaps they need some harder tests. Things are slightly less rosy on indirect prompt injection, but at least it’s much improved on the key enterprise use case of browser use (aka can I get this thing to update my legacy system that doesn’t have an API) - attacks succeeded in only 0.08% of attempts in their testing with their safeguards in place.</p>

<p>Reading the above back, this does come across as a little negative. However, I want to stress that this is an amazing model, and I’d encourage everyone to go and use it inside Claude Code. Most of my cyber safety concerns are just as valid applied to OpenAI and other model providers (if not more so!). The next year is going to be quite challenging!</p>

<h2 id="other-stuff">Other stuff</h2>
<p>New model releases aside – here in no particular order are some other items that caught my eye this week:</p>
<ul>
  <li>AI is cool, Mars is cool, robots are cool, so <a href="https://x.com/AnthropicAI/status/2017313346375004487">the Perseverance rover being driven around the surface of Mars by Claude</a> is triple cool.</li>
  <li>Has Google cracked the holy grail of enterprise AI use cases? <a href="https://x.com/omooretweets/status/2018819920893792383">Automated calendar scheduling that takes into account everyone’s availability</a> – I will pay good money for this in Outlook.</li>
</ul>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[Busy week so I’m just going to talk about Anthropic’s new release – Opus 4.6. I will note with a raised eyebrow that GPT5.3 was released within 45 minutes of Opus – I guess OpenAI are feeling pretty threatened at the moment! New model, new capabilities The main changes for Opus 4.6 are the normal ‘more intelligence’ but also a larger 1M token context window and the ability to tune the level of reasoning from low to max. The pricing remains the same as for Opus 4.5, although you can now pay (a lot) more for ‘fast’ mode. For the past few releases it’s been clear that in order to actually experience the power of the newer models you need to be using them in an agentic framework such as Claude Code. This is for the simple reason that it’s hard to stretch the capabilities of the models with a straight forward Q&amp;A session in a web chat app. I’ll briefly sketch out a few examples of how I’ve been using it this week: Worked through a problem I had about how best to create a (sort of) data dictionary whilst only having access to WebI-generated SQL queries - it suggested a conceptual approach I’d not considered which does indeed appear to work. Note that although I did this within Claude Code the task was entirely conceptual with no code or any data at all - just a description of the problem. Created a revised draft of an external presentation I’d delivered a few months ago, again - no code involved, but I got back an extremely decent attempt along with some great speaking notes (I just gave it a bunch of these blogs and told it to make it sound like me). Used it to do a few python related coding tasks - these it one-shotted. None of them were particularly complicated, and Opus 4.5 would probably one-shotted them as well, but it’s amazing how far this has come even from Sonnet 4.5. These are just some isolated anecdotes, but having had a week to play with it, my TLDR is that a lot of the hype is real. When used in Claude Code it does seem better than previous versions at self-critiquing its answers and improving them without further prompting – also I notice my expectation of coding problems is now that I’ll probably get a good answer on the first go. All in all it does seem to be cleverer than Opus 4.5. And let us not forget that 4.5 was already very clever indeed. What could go wrong? Anthropic have released the model as an ASL-3 model (AI Safety Level 3). This is a level defined within their safety framework as a model “that substantially increase the risk of catastrophic misuse compared to non-AI baselines”. In this same framework document they state they will provide a definition of ASL-4 before a model that reaches ASL-3 is released. Whilst they have done this at a high level for AI R&amp;D and biological risk, the thresholds remain qualitative even if the evaluations behind it are detailed. For biological risk specifically, they’ve simply put it as “the ability for a model to substantially uplift moderately-resourced state programs” which I guess is fine as publishing a more detailed criteria could in itself be dangerous (e.g. if you published a list of the very specific things that makes biological weapons tricky to make, you’re kinda providing an instruction book). The AI R&amp;D assessment is based on a survey of 16 of their technical staff. Again, this doesn’t seem too bad – I’d expect those employees to know if they’re in imminent danger of being replaced. 
However, for cyber threats they have explicitly not provided an ASL definition at any level whilst simultaneously being certain that they are not at ASL-4 yet. This seems a bit weak - they are essentially saturating all of their automated ASL-3 benchmarks at this point. Whilst this is markedly better than say Deepseek, who are on record as saying they don’t have compute to spare for safety work, it does appear to highlight a gap in their own published standards. To their credit, Anthropic are not blind to this criticism and have released the model with ‘additional safeguards’ in a number of cyber related areas (e.g. agentic coding use). This really matters, as I think it’s fair to say that if Opus 4.6 is manipulated to circumvent its guardrails in the same way that Claude Code was used in September then it’s likely a potent cyber threat to enterprises large and small. Indeed, as part of the release hype, Anthropic have posted a blog about finding 500+ zero day exploits - I am somewhat sceptical about how much weight to place on this as no CVE details have been released, but there is little doubt AI assisted hacking is already a thing. There is positive news on prompt injection - this is basically now 0% success in coding tests, though given it’s already been jail broken to extract the system prompt perhaps they need some harder tests. Slightly less rosy on the indirect prompt injection, but at least it’s much improved on the key enterprise use case of browser use (aka can I get this thing to update my legacy system that doesn’t have an API) - this has a success of only 0.08% of attempts in their testing with their safeguards in place. Reading the above back, this does come across as a little negative. However, I want to stress that this is an amazing model, and I’d encourage everyone to go and use it inside Claude Code. Most of my cyber safety concerns are just as valid applied to OpenAI and other model providers (if not more so!). The next year is going to be quite challenging! Other stuff New model releases aside – here in no particular order are some other items that caught my eye this week: AI is cool, mars is cool, robots are cool, so the Perseverance rover being driven around the surface of mars by Claude is triple cool. Has Google cracked the holy grail of enterprise AI use cases? Automated calendar scheduling that takes into account everyone’s availability – I will pay good money for this in Outlook.]]></summary></entry><entry><title type="html">Working Notes - 05/02/26</title><link href="https://ewanpanter.github.io/2026-02-05/things-i've-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 05/02/26" /><published>2026-02-05T00:00:00+00:00</published><updated>2026-02-05T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-02-05/things-i&apos;ve-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-02-05/things-i&apos;ve-noted-this-week"><![CDATA[<p>A mix this week - thoughts on research that shows getting AI to write unfamiliar code means you learn less, bringing interactivity to MCP, and a clever RAG technique that’s worth adding to your experiment pile.</p>

<h2 id="gpt-4o-still-doesnt-make-you-faster">GPT 4o still doesn’t make you faster</h2>

<p>There is a meme on X that essentially says that every time there is a paper that states that AI doesn’t make you more productive, you invariably look inside and find out they used ChatGPT4 in a sidebar. Whilst that is probably doing <a href="https://arxiv.org/pdf/2601.20245">this paper</a> something of a disservice, when you look inside, you find that yes - they are using 4o in a sidebar.</p>

<p>Leaving that aside, it’s still quite an interesting paper, as the framing is less about productivity (although that is commented on) and more about learning in junior software engineers. Specifically, they find that when two groups of engineers are asked to learn and take a quiz on an <a href="https://pypi.org/project/trio/">obscure software library</a>, the group with access to the AI scored lower on the quiz (by approx. 17%) and was also not really any faster (p=0.39 for the stats fans).</p>

<p>In their interpretation they split the AI group into two buckets. Firstly, the high scorers: these tended either to use the AI to ask conceptual questions, or to use it to generate answers that gave both code and explanations. The lower scorers, on the other hand, were inclined either to just YOLO it and fully delegate to the AI (immediately or after the first task), or to iteratively debug the code by sticking the error straight into the AI and saying ‘fix it’.</p>

<p>In many ways this is unsurprising - they found that if someone engages with the problem and understands the solution then they learn something. If, on the other hand, you get a tool to do it for you, then you don’t learn much. An analogy might be that if you use a calculator to do long division you’re not going to get good at long division.</p>

<p>This analogy then prompts the question - does it matter?</p>

<p>The authors argue that it does matter as you need someone to verify the answer. I find that less than compelling. It matters not a jot in my life that I’m appalling at long division - in any likely situation where I’ll need to do it, I can pull out my phone. Likewise, if the study subjects had been equipped with a modern agentic system like Claude Code I expect they’d have learnt even less, but they’d undoubtedly have been a lot faster, almost certainly got 100% in the coding task and the agent would have written its own verification tests. Is there any likely scenario where a junior engineer is going to progress through their career without the benefit of an agentic AI assistant? I find it more likely than not that most (though not all) engineers will not be writing much code, if any, in 5 years’ time - I expect they’ll be directing swarms of coding agents if not replaced entirely. This is not without precedent – people can still make jam, or knit their own clothes, but these are now hobbies rather than professional necessities.</p>

<p>Beyond the narrow coding example, these findings definitely have wider implications, particularly for educators who need to work out how we can use the tools to enhance rather than supplant learning. That’s a bigger question than I can answer in my blog, but it certainly makes me ponder how education is going to pan out for my two young kids.</p>

<p>On a tangential but related point, we may also see a lessening of the open source ecosystem - the authors used an obscure library in the example, but in the future are we going to have similar obscure libraries to use in such studies? The agents are going to go to what they know, and if it doesn’t do what they want, they’ll code something new. After all, open source libraries exist to prevent people reinventing the wheel; agentic systems are quite happy to reinvent the wheel, and won’t have the same incentive to share their code and learnings with the world in the way today’s engineers do. The better the agents get, the bigger these problems.</p>

<h2 id="mcp-apps">MCP Apps</h2>

<p><a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a> has been an open standard for over a year now and it’s now got an official extension - <a href="https://blog.modelcontextprotocol.io/posts/2026-01-26-mcp-apps/">MCP Apps</a>. In a nutshell this allows a custom interactive UI to be displayed within a chat window via an embedded iframe.</p>

<p>For use cases where you’re going back to the human with some results from the MCP query, this is pretty useful. For example, let’s say your MCP server can query your BI system and you’ve pulled back some sales stats: these can now be displayed as a chart inline in the chat window. The inline window can then interact directly with the MCP server without going via the LLM - if you want to filter by geography you just use the dropdown. This saves tokens, but more importantly it makes the interaction a deterministic process, so you’re not worried that some weird UI will be generated or that the LLM will hallucinate filtering details. Obviously, this can leave the model out of sync with what you’ve just done to the view (e.g. you’ve filtered on Europe), but there is provision in the extension to allow you to send back a (short) message giving the model updated context (e.g. the user filtered on Europe).</p>

<p>Like the rest of MCP this is a build-once, use-many situation - the UI will work in Claude or ChatGPT or any software that supports the extension. This already includes VS Code, so you may end up getting things like interactive diffs in the chat, which could be quite useful. Importantly it’s also a progressive enhancement: if the client doesn’t support it, the MCP tool will still work.</p>

<p>Whether this will take off remains to be seen. The main blocker is the walled-garden incentive - MS wants to keep you in their Copilot world, so they will prefer you to build a Copilot connector - but I think we’ll probably see some adoption where it makes sense (e.g. I expect Salesforce will add it to their MCP server). When combined with <a href="https://agentskills.io/home">Agent Skills</a> you’ve got a pretty powerful combination - instead of having text prompts at points of human interaction, you can now have interactive elements.<!--more--></p>

<h2 id="improved-chunking">Improved chunking</h2>
<p>One of the challenges with naïve RAG is deciding the size of the chunk - small chunks mean you retrieve specific facts but little context, large chunks give you more context but also drag in unneeded information (plus the semantic signal is going to be less specific, since embedding is a kind of compression). Quite often people just YOLO it and go for some arbitrary size like 200 tokens.</p>

<p>The authors of this <a href="https://www.ai21.com/blog/query-dependent-chunking/">blog post</a> offer a new approach: they suggest there is no universally suitable chunk size for a given task, and that text should instead be chunked at multiple different sizes.</p>
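<p>A minimal sketch of what indexing at several granularities might look like (the sizes, overlap and helper names below are arbitrary choices of mine, not the authors’ numbers):</p>

<pre><code class="language-python">def chunk(text, size, overlap=0):
    # Character-based chunking, recording the range so a chunk can later be
    # mapped back to its position in the parent document.
    step = size - overlap
    return [
        {"start": i, "end": min(i + size, len(text)), "text": text[i:i + size]}
        for i in range(0, len(text), step)
    ]

# Toy corpus; in practice these would be full documents.
corpus = {"document A": "...full text...", "document B": "...full text..."}

# Index the same corpus at several chunk sizes - each size gets its own index.
CHUNK_SIZES = [150, 400, 1200]
indexes = {
    size: [
        {"doc_id": doc_id, **piece}
        for doc_id, text in corpus.items()
        for piece in chunk(text, size)
    ]
    for size in CHUNK_SIZES
}
</code></pre>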

<p>Their approach then uses <a href="https://dev.to/master-rj/understanding-reciprocal-rank-fusion-rrf-in-retrieval-augmented-systems-52kc">Reciprocal Rank Fusion (RRF)</a>, with the constant chosen so that a single chunk size will tend to dominate the rankings.</p>

<p>In summary, each chunk is recorded with metadata: a document id (e.g. document A) and a character range (e.g. 4000-4150). The semantic search is run against all the chunk-size indexes and the individual results are ranked in the normal best-to-worst way. The key change is that the parent documents of the retrieved chunks are then ranked, so that documents which appear multiple times and/or rank very highly in their sub-lists end up with a high overall document ranking. The authors only cover the document retrieval aspect, but in practice you’d then use the top-ranked chunks as an anchor to determine which part of the winning document to put in the final context window.</p>
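<p>Here’s a rough sketch of that fusion step - this is just plain RRF aggregated up to document level, so treat the exact weighting as my assumption rather than a reproduction of the authors’ formula:</p>

<pre><code class="language-python">from collections import defaultdict

def rrf_document_scores(ranked_lists, k=60):
    """Fuse per-chunk-size rankings into a single document ranking.

    ranked_lists: one list per chunk-size index, each an ordered (best-first)
                  list of (doc_id, start, end) hits.
    Standard RRF: each hit adds 1 / (k + rank) to its parent document's score,
    so documents that appear in many lists, or very near the top of any list,
    rise. A smaller k weights the very top ranks more heavily, which is how a
    single chunk size can be made to dominate the fusion.
    """
    scores = defaultdict(float)
    for hits in ranked_lists:
        for rank, (doc_id, start, end) in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
</code></pre>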

<p>The actual material impact of this approach does seem to vary – for most of the retrieval benchmarks they tried it was between 1-5% improvement over naïve RAG, but for one result (TRECCOVID) they got a 37% improvement. I suspect this means that some datasets / query types particularly lend themselves to this approach but for a mature pipeline even 1-5% improvement is not bad.</p>

<p>However, much as I appreciate the technical neatness of this technique, its variable results probably mean it’s one to put in the ‘experiment’ pile rather than the ‘try and rush into prod’ pile. A key point is that it doesn’t require anything particularly special (no retraining, etc.), just a different approach to selecting the top k results. The cost is, of course, multiple vector indexes, but assuming you are executing the searches in parallel, that is only a storage cost and an embedding cost - both of which are cheaper than giving the wrong answer to your user.</p>

<p>If the chunking size you’ve picked for a mature RAG pipeline is working well, then great, if not then experimenting with multiple chunking approaches such as this one is probably worthwhile – especially if corporate security rules make introducing a model-based re-ranking model such as Cohere tricky.</p>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[A mix this week - thoughts on research that shows getting AI to write unfamiliar code means you learn less, bringing interactivity to MCP, and a clever RAG technique that’s worth adding to your experiment pile. GPT 4o still doesn’t make you faster There is a meme on X that essentially says that every time there is a paper that states that AI doesn’t make you more productive, you invariably look inside and find out they used ChatGPT4 in a sidebar. Whilst that is probably doing this paper something of a disservice, when you look inside, you find that yes - they are using 4o in a sidebar. Leaving that aside it’s still quite an interesting paper as the framing is less about productivity (although that is commented on) it’s about learning in junior software engineers. Specifically they find that when two groups of engineers are asked to learn and take a quiz on an obscure software library, the group with access to the AI scored lower (approx. 17%) on the quiz and were also not really any faster (p=0.39 for the stats fans). In their interpretation they split the AI group into two buckets. Firstly the high scorers, these tended to either use the AI to ask conceptual questions, or used the AI to generate answers which gave code and explanations. The lower scorers were on the other hand inclined to just YOLO it and fully delegate to the AI, either immediately or after the first task, or those who iteratively debugged the code by sticking the error straight into the AI and saying ‘fix it’. In many ways this is unsurprising - they found that if someone engages with the problem and understands the solution then they learn something. If on the other hand, if you get a tool to do it for you, then you don’t learn much. An analogy might be that if you use a calculator to do long division you’re not going to get good at long division. This analogy then prompts the question - does it matter? The authors argue that it does matter as you need someone to verify the answer. I find that less than compelling. It matters not a jot in my life that I’m appalling at long division - in any likely situation where I’ll need to do it, I can pull out my phone. Likewise, if the study subjects had been equipped with a modern agentic system like Claude Code I expect they’d have learnt even less, but they’d undoubtedly have been a lot faster, almost certainly got 100% in the coding task and the agent would have written its own verification tests. Is there any likely scenario where a junior engineer is going to progress through their career without the benefit of an agentic AI assistant? I find it more likely than not that most (though not all) engineers will not be writing much code, if any, in 5 years’ time - I expect they’ll be directing swarms of coding agents if not replaced entirely. This is not without precedent – people can still make jam, or knit their own clothes, but these are now hobbies rather than professional necessities. Beyond the narrow coding example, these findings definitely have wider implications, particularly for educators who need to work out how we can use the tools to enhance rather than supplant learning. 
That’s a bigger question than I can answer in my blog, but it certainly makes me ponder how education is going to pan out for my two young kids. On a tangential but related point, we may also see a lessening of the open source ecosystem - the authors used an obscure library in the example, but in the future world are we going to have similar obscure libraries to use in such studies? The agents are going to go to what they know, and if it doesn’t do what they want, they’ll code something new. After all, open source libraries exist to prevent people reinventing the wheel, agentic systems are quite happy to reinvent the wheel and won’t have the same incentive to share their code and learnings with the world in the same way as today’s engineers. The better the agents get, the bigger these problems. MCP Apps MCP has been an open standard for over a year now and it’s now got an official extension - MCP Apps. In a nutshell this allows a custom interactive UI to be displayed within a chat window via an embedded iframe. For use cases where you’re going back to the human with some results from the MCP query, this is pretty useful. For example, let’s say your MCP server can query your BI system and you’ve pulled back some sales stats, this can now be displayed as a chart inline with the chat window. The inline window can then interact directly with the MCP server without going via the LLM - if you want to filter by geography you just use the dropdown. This both saves on tokens, but importantly makes it a deterministic process so you’re not worried that some weird UI will be generated or the LLM will hallucinate filtering details. Obviously, this can make the model out of sync with what you’ve just done to the view (e.g. you’ve filtered on Europe) but there is provision in the extension to allow you send back a (short) message to give the model updated context (e.g. the user filtered on Europe). Like the rest of MCP this is a build once use many situation - the UI will work in Claude or ChatGPT or any software that supports the extension. This already includes VS Code, so potentially you’re going to get things like interactive diffs in the chat which has the potential to be quite useful. Importantly it’s also a progressive enhancement, if the client doesn’t support it, the MCP tool will still work. Whether this will take off remains to be seen. The main blocker is the walled garden incentive - MS wants to keep you in their Copilot world, so are going to prefer you to build a Copilot connector, but I think we’ll probably see some adoption where it makes sense (e.g. I expect Salesforce will add it to their MCP server). 
When combined with the Agent Skills you’ve got a pretty powerful combination - instead of having text prompts at points of human interaction, you can now have interactive elements.]]></summary></entry><entry><title type="html">Working Notes - 29/01/26</title><link href="https://ewanpanter.github.io/2026-01-29/things-i've-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 29/01/26" /><published>2026-01-29T00:00:00+00:00</published><updated>2026-01-29T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-01-29/things-i&apos;ve-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-01-29/things-i&apos;ve-noted-this-week"><![CDATA[<p>A mix this week - some thoughts on OpenAI’s value-based pricing model pitch (spoiler: I’m a sceptic for most scenarios), testing Gemini Flash 3’s new agentic vision feature, some new chips and models, and another scary voice cloner.</p>

<h2 id="future-business-models">Future business models</h2>
<p>Given the scale of the investment that AI companies have seen and the expected impact on enterprises, changes in their business models are obviously quite interesting. The OpenAI CFO, Sarah Friar, has started <a href="https://openai.com/index/a-business-that-scales-with-the-value-of-intelligence/">making noises</a> about how she sees OpenAI starting to claw back some of their outlay. Her basic premise is that AI firms will move beyond the standard SaaS model of bums-on-seats / usage charging and towards a more consulting-style model of value-based outcomes.</p>

<p>This is a pretty interesting hypothesis and bears some scrutiny (I’ll take the statement at face value rather than as a funding-round talking point). If you look at where this model currently works, you’ve got things like litigation (no win, no fee), tender bidding support, M&amp;A sourcing, payment providers, bug bounties, and recruitment. The common factor with all of these is either that the value is explicit (e.g. a 2% transaction fee by Stripe, or I decided to hire this guy) or that there is a neutral third party that will judge success (e.g. we won the litigation, I tested the bug, or we won the contract).</p>

<p>A corollary of this is that where there is no neutral measure of value, the approach is much less common. The main example is consulting companies doing value-based deals. From my own experience (management consultant for 11 years!), I’ve seen how difficult these deals can be at the end of a contract, even where everyone thought bulletproof definitions had been agreed a year earlier.</p>

<p>The other major factor that needs to be considered is competition. At the moment at least, it does not seem likely that OpenAI (or any AI company) will get a competitive advantage so large that any enterprise would be crazy to use an alternative provider. If OpenAI is asking for 10% of value and a competitor is offering a flat per-token rate (or some other utility pricing measure), then a CFO is likely to go with the competitor. After all, the vast majority of intelligence that companies purchase today (aka actual humans) is bought on a flat-rate basis, with the only exceptions being very senior employees.</p>

<p>The only place I can see this working on 2026/27 timescales is in joint ventures. If OpenAI can identify some high-value use cases with a clear, neutral reward signal (e.g. a new drug makes it to a certain FDA milestone) and offer a deal along the lines of ‘you provide the data, I provide the compute and salary opex, we split the value of the outcome’, then it’s going to be an interesting proposition for a CFO. The assessment won’t be about ‘can I get a cheaper deal’; it’ll be more of an opportunity-cost discussion (what else could I do with this data, where else could I deploy these high-value employees).</p>

<p>This is a fairly bold play by Sarah Friar, but at the moment OpenAI doesn’t have a sufficient moat compared to other frontier labs to make it work for most enterprises. I expect things to go in the direction of utility pricing, which means that for any hope of recovering their capex the AI companies are going to need to see huge adoption in enterprise. The JV exception, if one emerges, will be industries with verifiable rewards like litigation and drug discovery. As with many things in AI, finding the correct reward signal is the key.</p>

<h2 id="agents-with-vision">Agents with vision</h2>
<p>Google has now enabled <a href="https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/">Gemini Flash 3 to perform agentically</a> when given an image input. Normally when you give a model an image it’ll look once and then tell you your answer. However, when code execution is enabled, the new version of Flash will take an initial look and then use a tool-based agentic loop before providing the result. What this means in practice is that it’ll use Python to slice up the image and effectively zoom into parts of the image that it identifies as interesting in its first look.</p>
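<p>If you want to try this yourself, it’s just a case of switching on the code execution tool when you call the model. Here is a minimal sketch using the google-genai Python SDK - note that the model id and the file handling are my assumptions rather than anything lifted from Google’s blog, so check the current docs:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal sketch of agentic vision via the code execution tool (model id is an assumption)
from google import genai
from google.genai import types

client = genai.Client()  # expects GEMINI_API_KEY in the environment

with open("gantt_chart.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash",  # assumed model id - check the docs for the current name
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Extract every task, its start date, end date and duration from this Gantt chart.",
    ],
    # enabling code execution is what lets the model crop and zoom the image in Python
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]
    ),
)
print(response.text)
</code></pre></div></div>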

<p>This is potentially quite useful for one of the most important enterprise AI use cases - ingesting existing Word documents and PDFs. These often contain diagrams, Gantt charts, process flows, etc. that you might either want to add immediately to a prompt or create markdown versions of for later retrieval as part of a RAG pipeline. So does it work? My experience is a big fat sort-of. When I asked it to extract detail from an example Gantt chart, it certainly used Python to zoom in and manipulate the image, but it didn’t get the correct answers for everything. Whilst it got the order and task titles correct, it incorrectly recorded the start dates, end dates and durations. This may have been a prompting issue on my part, but I did try several approaches with similar results. Alternatively it might be down to the need to cross-reference the bars against the date scale at the top of the chart.</p>

<p>My second example was much more successful – I found a fairly detailed consulting framework diagram showing an ML model lifecycle and it provided me with a word-perfect transcription, complete with all the required information about how each part flowed to the next.</p>

<p>So - from a conceptual point of view, this does seem like ‘the way’ but at the moment at least it’s worth being circumspect with the types of images to which you apply it. I would expect this to improve over time and it’s already a useful tool on a model as cheap as Gemini Flash.</p>

<h2 id="kimi-k25">Kimi K2.5</h2>

<p>After a brief Xmas pause we’re back to a new model release from <a href="https://www.kimi.com/blog/kimi-k2-5.html">Moonshot</a> - this builds on the K2 thinking release in <a href="https://ewanpanter.github.io/2025-11-12/Things-I've-noted-this-week">November</a>. It has landed to a generally good reaction, particularly with respect to its coding abilities. All the benchmarks are suitably impressive, although as ever, treat these with a pinch of salt.</p>

<p>Other than the model now being natively multimodal (a first for Moonshot), the most interesting part of the release blog relates to what they are calling <a href="https://www.kimi.com/agent-swarm">agent swarms</a>. This is essentially what it sounds like - an orchestrator agent controls up to 100 sub-agents to complete a task.</p>

<p>The key thing is that they’ve trained the model to decompose tasks itself without any special prompting - they initially achieved this with RL by rewarding parallel execution, then slowly removed this reward and only rewarded success. Additionally, the reward function penalised the critical task duration (i.e. the critical path in project management terminology); anything that happened whilst that longest task was completing was essentially not penalised. This discourages sequential execution (which lengthens the critical path) whilst encouraging lots of shorter parallel tasks. Clever, and their metrics show the swarm approach reducing time to result by up to 80%. <!--more--></p>
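<p>To make that concrete, here is my rough mental model of the reward shaping - an illustrative sketch based on my reading of the blog, not Moonshot’s actual code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative only - my reading of the critical-path reward, not Moonshot's implementation
def critical_path(durations, deps):
    """Longest chain of dependent sub-tasks. durations: {task: seconds}, deps: {task: [prerequisites]}."""
    memo = {}
    def finish_time(task):
        if task not in memo:
            prereqs = deps.get(task, [])
            memo[task] = durations[task] + (max(finish_time(p) for p in prereqs) if prereqs else 0.0)
        return memo[task]
    return max(finish_time(t) for t in durations)

def swarm_reward(task_succeeded, durations, deps, lam=0.01):
    # success is what ultimately gets rewarded; the duration penalty only applies to the
    # critical path, so work done in parallel with the longest chain is effectively free -
    # which pushes the planner towards many short parallel sub-tasks rather than one long chain
    return (1.0 if task_succeeded else 0.0) - lam * critical_path(durations, deps)
</code></pre></div></div>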

<h2 id="the-chips-are-down">The chips are down</h2>
<p>Microsoft have launched a new custom chip, the <a href="https://blogs.microsoft.com/blog/2026/01/26/maia-200-the-ai-accelerator-built-for-inference/">Maia 200</a>. Unlike its predecessor (the Maia 100), the new chip is explicitly inference-only rather than for both inference and training. Whilst this sounds limiting, in reality, current RL training approaches actually involve loads of inference runs (e.g. generating synthetic data) so this will still be useful to MS (and therefore OpenAI) for its training runs.</p>

<p>One of the more interesting points in the blog / press release is that the chip will be used to run GPT-5.2 - given the chip is best at FP4 (rather than FP8/16) this points to OpenAI’s latest models being natively FP4. The chip’s main competition is AWS Trainium and Google’s TPU v7, which it appears to beat on everything except FP16 in the case of Google’s offering. The question will be whether Microsoft can scale production enough to start competing with those chips and reduce its dependency on Nvidia.</p>

<h2 id="more-text-to-speech-models">More text to speech models</h2>
<p>Alibaba have provided yet another reason to not trust everything you hear with their new <a href="https://qwen.ai/blog?id=qwen3tts-0115">TTS model</a>. This is an open source (or weights?) that is available in either 1.7B or 0.6B flavours - meaning it’ll essentially run on anything. I’d previously thought the Pocket TTS model was <a href="https://ewanpanter.github.io/2026-01-15/Things-I've-noted-this-week">very impressive</a>, but this one really does sound identical to my voice after providing it with 10 seconds of me talking and a transcription. It can even pronounce my name which is a first for the models I’ve played around with. Have a play with it on <a href="https://huggingface.co/spaces/Qwen/Qwen3-TTS">hugging face</a>, but only after you’ve turned off voice verification on anything you remotely care about!</p>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[A mix this week - some thoughts on OpenAI’s value-based pricing model pitch (spoiler: I’m a sceptic for most scenarios), testing Gemini Flash 3’s new agentic vision feature, some new chips and models, and another scary voice cloner. Future business models Given the scale of the investment that AI companies have seen and the expected impact on enterprises, changes in their business models are obviously quite interesting. The OpenAI CFO, Sarah Friar, has started making noises about how she sees OpenAI starting to claw back some of their outlay. Her basic premise is that AI firms will move beyond the standard SaaS model of bums on seats / usage charging and towards a more consulting model of value based outcomes. This is a pretty interesting hypothesis and bears some scrutiny (I’ll take the statement at face value rather than as a funding round talking point). If you look at where this model currently works, you’ve got things like litigation (no win no fee), tender bidding support, M&amp;A sourcing, payment providers, bug bounties, and recruitment. The common factor with all of these is either that the value is explicit (e.g. a 2% transaction fee by Stripe, or I decided to hire this guy) or there is neutral third party that will judge success (e.g. we won the litigation, I tested the bug, or we won the contract). A corollary of this is that where there is no neutral measure of value, the approach is much less common. The main example is consulting companies doing value-based deals. From my own experience (management consultant for 11 years!), I’ve seen how difficult these deals can be at the end of a contract, even where everyone thought bulletproof definitions had been agreed a year earlier. The other major factor that needs to be considered is competition. At the moment at least, it does not seem likely that OpenAI (or any AI company) will get a comparative advantage that is so large that any enterprise would be crazy to use an alternative provider. If OpenAI is asking for 10% of value and a competitor is offering a flat per token (or some other utility pricing measure) then a CFO is likely to be going with the competitor. After all, the vast majority of intelligence that companies purchase today (aka actual humans) is done on a flat rate basis with the only exceptions being very senior employees. The only place I can see this working in the 2026/27 timescales is in joint ventures. If OpenAI can identify some high-value use cases with a clear neutral reward signal (e.g. 
a new drug makes it to a certain FDA milestone) and offer a deal along the lines of ‘you provide the data, I provide the compute and salary Opex, we split the value of the outcome’ then it’s going to be an interesting proposition for a CFO. The assessment won’t be around ‘can I get a cheaper deal’ it’ll be more of an opportunity cost discussion (what else could I do with this data, where else could I deploy these high-value employees). This is a fairly bold play by Sarah Friar, but at the moment OpenAI doesn’t have a sufficient moat compared to other frontier labs to make it work for most enterprises. I expect things to go in the direction of utility pricing, which means that for any hope of recovering their capex the AI companies are going to need to see huge adoption in enterprise. The JV exception, if one emerges, will be industries with verifiable rewards like litigation and drug discovery. As with many things in AI, finding the correct reward signal is the key. Agents with vision Google has now enabled Gemini Flash 3 to perform agentically when given an image input. Normally when you give a model an image it’ll look once and then tell you your answer, however when code execution is enabled, the new version of Flash will look and use a tool based agentic loop before providing the result. What this means in practice is that it’ll use Python to slice up the image and effectively zoom into parts of the image that it identifies as interesting in its first look. This is potentially quite useful for one of the most important enterprise AI use cases - ingesting existing word documents and PDFs. These often contain diagrams, Gantt charts, process flows, etc that you might either want to add immediately to a prompt or create markdown versions of for later retrieval as part of a RAG pipeline. So does it work? My experience is a big fat sort-of. When I asked it to extract detail from an example Gantt chart, it certainly used python to zoom in and manipulate the image, but it didn’t get the correct answers for everything. Whilst it got the order and task titles correct, it incorrectly recorded the start dates, end dates and durations. This may have been a prompting issue on my part, but I did try several approaches with similar results. Alternatively it might be due to the need to reference the date scale at the top of the chart with the bars themselves. My second example was much more successful – I found a fairly detailed consulting framework diagram showing an ML model lifecycle and it provided me with a word-perfect transcription, complete with all the required information about how each part flowed to the next. So - from a conceptual point of view, this does seem like ‘the way’ but at the moment at least it’s worth being circumspect with the types of images to which you apply it. I would expect this to improve over time and it’s already a useful tool on a model as cheap as Gemini Flash. Kimi K2.5 After a brief Xmas pause we’re back to a new model release from Moonshot - this builds on the K2 thinking release in November. It has landed to a generally good reaction, particularly with respect to its coding abilities. All the benchmarks are suitably impressive, although as ever, treat these with a pinch of salt. Other than the model now being natively multimodal (a first for Moonshot) the most interesting part of the release blog relates to what they are calling agent swarms. This is essentially what it sounds like - a orchestrator agent controls up to 100 sub-agents to complete a task. 
The key thing is that they’ve trained the model to be able to decompose the tasks itself without any special prompting - they initially achieved with RL early on by rewarding parallel execution, then slowly removed this reward and only rewarded success. Additionally the reward function penalised the critical task duration (i.e the critical path in project management terminology) anything that happened whilst that longest task was completing was essentially not penalised - this has the effect of preventing sequential execution (as the critical path increases) but encouraging lots of shorter tasks. Clever, and their metrics show the swarm approach reducing time to result by up to 80%.]]></summary></entry><entry><title type="html">Working Notes - 23/01/26</title><link href="https://ewanpanter.github.io/2026-01-23/things-i've-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 23/01/26" /><published>2026-01-23T00:00:00+00:00</published><updated>2026-01-23T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-01-23/things-i&apos;ve-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-01-23/things-i&apos;ve-noted-this-week"><![CDATA[<p>A busy week at work but managed to find a little bit of time to get Claude Code to implement a PoC of the Recursive Language Model approach - shared some findings below. Also some notes on tech trends for 2026, and a review of LLMs as judge approaches.</p>

<h2 id="adventures-in-rlms">Adventures in RLMs</h2>

<p><a href="https://ewanpanter.github.io/2026-01-15/Things-I've-noted-this-week">Last week</a> I wrote about recursive language models. I’ve spent a bit of time playing around with Claude Code, and have got it to build me a proof of concept of the concepts in <a href="https://arxiv.org/pdf/2512.24601">the paper</a>. This took a little more than ‘build this paper’ but if I’m honest, not <strong>that</strong> much more. Claude Code with Opus really is very impressive.</p>

<p>I ended up building two Gemini-based versions of the tool which I’ve been testing on various horrific <a href="https://www.crowncommercial.gov.uk/agreements/RM6116">government framework agreements</a>. These have the advantage of being public domain, very complex, and somewhat similar to the types of documents we see at work. The first was a purely deterministic version that took the concept of dumping everything into a variable and chunking it out to sub-agents, and the second implemented the full REPL approach with code execution.</p>
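<p>For the curious, the deterministic version boils down to something like the sketch below. This is heavily simplified rather than the actual PoC code - the real thing splits on clause boundaries, batches the sub-agent calls, and handles retries - but the shape is the same:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simplified shape of the deterministic version - not the actual PoC code
from google import genai

client = genai.Client()
CHUNK_CHARS = 60_000  # crude character-based chunking for illustration

def map_reduce_query(documents, question):
    context = "\n\n".join(documents)  # the whole corpus lives in one big variable
    chunks = [context[i:i + CHUNK_CHARS] for i in range(0, len(context), CHUNK_CHARS)]

    findings = []
    for chunk in chunks:  # one cheap sub-agent call per chunk
        sub = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=f"Question: {question}\n\nExtract anything relevant from this extract, "
                     f"or reply NONE:\n{chunk}",
        )
        if "NONE" not in sub.text:
            findings.append(sub.text)

    # the root model only ever sees the condensed findings, never the full corpus
    final = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=f"Question: {question}\n\nFindings from sub-agents:\n" + "\n---\n".join(findings),
    )
    return final.text
</code></pre></div></div>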

<p>The results were interesting - both versions are excellent at queries that need all of a document to be examined and would fail with a traditional RAG approach (e.g. find all the definitions across multiple documents). The performance of the REPL version very much depends on the model used for the orchestrator - if you use 2.5 Flash you get a very similar result to the more deterministic approach. However if you use a Pro model, you tend to get a lower number of sub-agents (50 in a typical task vs 70) as the model uses the tools (Regex etc) to delegate more effectively.</p>

<p>The TLDR is that this is an excellent approach for long-context analysis of complex queries that need to be broken down, but I found a heavily scaffolded version was often as effective (and often more token-efficient due to fewer reasoning steps) as the more free-form REPL approach. When the REPL approach was more effective, it was very dependent on using a larger, more expensive model as the orchestrator. I’ll keep playing around with it and perhaps give the sub-agents their own REPL environments - something the original paper did not attempt to explore.</p>

<h2 id="tech-trends-2026">Tech Trends 2026</h2>

<p>A very <a href="https://www.cbinsights.com/reports/CB-Insights_Tech-Trends-2026.pdf">readable report</a> from CB Insights on possible tech trends for 2026. They start with back-office agentic automation, where they see FinTech as leading the way due to the document-heavy environment. Whilst this does seem like a real trend, I’m pretty sceptical of the self-reported percentage of organisations that have deployed agentic AI (36% apparently) given that a lot of things can be tagged as ‘agentic’ to senior leadership (not least MS CoPilot!).</p>

<p>I think robotics is an interesting one - there have been impressive advances in <a href="https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/">world models</a> this year, and LLMs seem an obvious thing to plug into robotics to provide higher-level planning. However, given the degree to which effective agents have to be scaffolded at the moment, it’ll be interesting to see if anyone can crack robotics in environments that aren’t highly controlled and don’t lend themselves to strict rules. Assuming progress is made, it’ll doubtless bring with it the need for low-latency networks and edge AI.</p>

<p>Sovereign AI is also called out as a key trend. Given the geopolitics of the moment I think this is a no-brainer. Countries are going to want to be assured of their AI capabilities (both hardware and software) as these become more critical to businesses and government. Europe obviously has the additional incentive of compliance with the EU AI Act and GDPR rules. That being said, I imagine decision makers will remain pragmatic as long as the frontier capabilities (and data centre capacity) exist only in the US. <!--more--></p>

<h2 id="dont-judge-me">Don’t judge me</h2>

<p>We’re all familiar with the LLM as judge paradigm, <a href="https://arxiv.org/pdf/2601.05111">this paper</a> is a survey of the more advanced agent as judge approach. In an LLM as judge you use the LLM to assess whether a response (or test / eval / whatever) is correct in a single pass. The agent as a judge is exactly what it sounds like - you provide an LLM with tools and it executes a loop using the tools (e.g. search, or a python environment to test generated code) to assess the validity of the response.</p>

<p>The paper draws out three basic approaches. Firstly, a rigid procedural approach (what I’d normally call scaffolding) where the agent follows a set script to assess the response. Then there is the more flexible ‘reactive’ approach where the agent has a pre-defined set of decisions it can make - e.g. trigger a search if it’s a factual response, or run a Python environment if it’s code. Finally there is what they call the ‘self-evolving’ approach, which gives the agent a freer hand in how it evaluates the response (e.g. creating new measurement criteria) and a memory file that can store previously provided feedback or heuristics.</p>
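<p>The ‘reactive’ flavour is easiest to picture as a dispatcher. Something like the sketch below - my own illustration of the pattern rather than anything from the paper, with <code class="language-plaintext highlighter-rouge">call_llm</code> standing in for whichever (search-enabled) judge model you happen to be using:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative 'reactive' agent-as-judge dispatcher; call_llm is a placeholder, not a real API
import pathlib
import subprocess
import tempfile

def call_llm(prompt):
    raise NotImplementedError("plug in your (search-enabled) judge model here")

def judge(response, task_type, tests=None):
    if task_type == "code" and tests:
        # deterministic route: actually run the candidate code against the supplied tests
        workdir = pathlib.Path(tempfile.mkdtemp())
        (workdir / "candidate.py").write_text(response)
        (workdir / "test_candidate.py").write_text(tests)
        result = subprocess.run(["python", "-m", "pytest", "-q", str(workdir)],
                                capture_output=True, text=True)
        return result.returncode == 0
    if task_type == "factual":
        # reactive route: hand factual claims to a judge prompt that is expected to use search
        verdict = call_llm("Use your search tool to verify, then answer pass or fail:\n" + response)
    else:
        # plain single-pass LLM-as-judge for everything else
        verdict = call_llm("Grade this response pass or fail against the task brief:\n" + response)
    return verdict.strip().lower().startswith("pass")
</code></pre></div></div>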

<p>The first approach resonates with me - this can be made pretty robust, the second approach seems a bit pointless - on most occasions you know the type of expected response, and the third one I’d be sceptical of being at all reliable at scale without lots of scaffolding which somewhat defeats the point. As with all agent based approaches, this is particularly token intensive - to have any chance of getting the self-evolving approach to work you’ll be using Opus or Gemini 3 Pro which will get pricey fast at scale. Give it six months tho…</p>

<h2 id="cowork-hype">Cowork Hype</h2>

<p>Judging by the hype, Claude Cowork is going to be a big deal - it’s now <a href="https://claude.com/blog/cowork-research-preview">available</a> for Pro users as well as Max users, alas only for macOS. The excellent <a href="https://overcast.fm/+AA5AWOwPyEo">Odd Lots podcast</a> has covered it, along with the <a href="https://www.wsj.com/tech/ai/anthropic-claude-code-ai-7a46460e">WSJ</a>, and endless threads on <a href="https://www.reddit.com/r/ClaudeCode/comments/1qh78yf/tried_claude_cowork_last_night_and_it_was_a_top_3/">Reddit</a>. The consensus generally aligns with my thoughts from last week - the capabilities were there in Claude Code already, but the less technical interface has exposed it to a wider group of users. This is reminiscent of the <a href="https://www.reddit.com/r/DeepSeek/comments/1qgy3lk/one_year_since_the_deepseek_moment_the_impact_is/">R1 moment</a> almost exactly a year ago - the technology had been there for a while, but the general population wasn’t exposed to it, and when they were there was a collective ‘wow’ moment.</p>

<h2 id="a-couple-of-other-things-that-caught-my-eye">A couple of other things that caught my eye</h2>

<p>A <a href="https://github.com/jaswsunny/youtube-to-text">handy tool</a> to convert any YouTube podcast/video into a clean, nicely formatted transcript - worth it for the Windows XP interface alone. Putting in a music video is quite amusing for the summary.</p>

<p>X (do we have to call it that now?) have open sourced <a href="https://github.com/xai-org/x-algorithm">their algorithm</a> resulting in an analysis, the TLDR is that it rewards engagement and if you put links in that go off X it’ll mark you down. The weight values remain a mystery so it’s less interesting that it sounds.</p>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[A busy week at work but managed to find a little bit of time to get Claude Code to implement a PoC of the Recursive Language Model approach - shared some findings below. Also some notes on tech trends for 2026, and a review of LLMs as judge approaches. Adventures in RLMs Last week I wrote about recursive language models. I’ve spent a bit of time playing around with Claude Code, and have got it to build me a proof of concept of the concepts in the paper. This took a little more than ‘build this paper’ but if I’m honest, not that much more. Claude Code with Opus really is very impressive. I ended up building two Gemini-based versions of the tool which I’ve been testing on various horrific government framework agreements. These have the advantage of being public domain, very complex, and somewhat similar to the types of documents we see at work. The first version was a purely deterministic version that took the concept of dumping everything into a variable and chunking it with subagents, and the second implemented the full REPL approach with code execution and code. The results were interesting - both versions are excellent at queries that need all of a document to be examined and would fail with a traditional RAG approach (e.g. find all the definitions across multiple documents). The performance of the REPL version very much depends on the model used for the orchestrator - if you use 2.5 Flash you get a very similar result to the more deterministic approach. However if you use a Pro model, you tend to get a lower number of sub-agents (50 in a typical task vs 70) as the model uses the tools (Regex etc) to delegate more effectively. The TLDR is that this is an excellent approach for long-context analysis of complex queries that need to be broken down, but I found a heavily scaffolded version was often as effective (and often more token efficient due to fewer reasoning steps) as the more free form REPL approach. When it was more effective, this was very dependent on using a larger more expensive model as the orchestrator. I’ll keep playing around with it and perhaps give the sub-agents their own REPL environments - something the original paper did not attempt to explore. Tech Trends 2026 A very readable report from CB Insights on possible tech trends for 2026. They start with back office agentic automation where it sees FinTech as leading the way due to the document heavy environment. Whilst this does seem like a real trend, I’m pretty sceptical of the self-reported percentage of organisations that have deployed agentic AI (36% apparently) given that a lot of things can be tagged as ‘agentic’ to senior leadership (not least MS CoPilot!). I think robotics is an interesting one - there have been impressive advances in world models this year, and LLMs seem a obvious thing to plug into robotics to give higher level planning. However, given the degree that effective agents have to be scaffolded at the moment it’ll be interesting to see if anyone can crack robotics in non-highly controlled environments which don’t lend themselves to strict rules. 
Assuming progress is made, it’ll doubtless bring with it the need for low latency networks and edge AI. Sovereign AI is also called out as a key trend. Given the geopolitics of the moment I think this is a no-brainer. Countries are going to want to be assured of their AI capabilities (both hardware and software) as these become more critical to businesses and government. Europe obviously has the additional incentive of compliance with the EU AI Act and GDPR rules. That being said I imagine decision makers will remain pragmatic as long as the frontier capabilities (and data centre capacity) exists only in the US.  ]]></summary></entry><entry><title type="html">Working Notes - 15/01/26</title><link href="https://ewanpanter.github.io/2026-01-15/Things-I've-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 15/01/26" /><published>2026-01-15T00:00:00+00:00</published><updated>2026-01-15T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-01-15/Things-I&apos;ve-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-01-15/Things-I&apos;ve-noted-this-week"><![CDATA[<p>First week in a while with no major model release to cover - instead I’ve been using Claude Code to do non-coding work, looked into Google UCP and what it means for retailers, faked my voice from five seconds of audio with a model that will run on pretty much anything, and investigated an agentic approach that enables efficient use of extremely large contexts. </p>

<h2 id="its-coming">It’s coming</h2>

<p>There has been a <a href="https://x.com/deanwball/status/2001068539990696422">lot of hype</a> in the past few weeks about Claude Code with Opus 4.5 - most of which I’d put down to people using the Christmas holidays to have a play with it and realising that it’s got <strong>really good</strong>. If you’ve not played with Claude Code recently, you should probably go do that!</p>

<p>One of the trends that is a thing at the moment is using Claude Code to do non-coding stuff. It is surprisingly(?) excellent at this - for example I needed to make a short deck on something the other day, and I used Claude Code to create an extremely decent first draft of it. I recorded a brain dump of all my thoughts on the subject,  exported the transcription from my phone, put it in a directory that Claude Code could access, installed <a href="https://github.com/tfriedel/claude-office-skills">Claude Office Skills</a> to enable it to create PPTs, and then just told it to make the deck - off it went and many thousands of tokens later I had my deck.</p>

<p>At its core, this is what Anthropic have launched this week with <a href="https://claude.com/blog/cowork-research-preview">Claude Cowork</a> but with the addition of a separate VM and a nice non-CLI frontend. This has been received <a href="https://www.youtube.com/watch?v=lQ0mtbFxzn8">pretty positively</a>, and judging from the very slow performance Opus 4.5 has been showing on the Claude website/app I think it’s fair to say Anthropic’s servers are being hammered. Currently it’s a macOS only research preview and only available to people on the $100/$200 per month plan. Anthropic call out risks over and above the normal privacy ones - importantly if you’re using it with browser use enabled you should only be using it on ‘trusted’ websites. In theory someone could prompt inject Cowork by displaying a nefarious prompt on their website and delete files on your actual computer.</p>

<p>In a nutshell, this means no more uploading or copy &amp; pasting documents to a chat/teams/web interface - now you just point Cowork at the folder in your computer, tell it what you want, and it’ll asynchronously access websites, build ppts, organise the files, give you progress reports, and let you chip in with additional thoughts/changes as it reports how it is getting on. Whilst <em>technically</em> the solution they’ve released is not really anything new over and above what you could do with Claude Code already, it’s perception that matters. Gone is Claude Code’s command line, instead here is a nice friendly interface that enterprise leaders can try themselves and experience the capabilities of the most advanced universal agents. I expect this to have an impact beyond ‘just’ a research preview and be a step towards the mass adoption of general agents - I’ve no doubt OpenAI and Google will be launching their own versions shortly.</p>

<p>One final factoid that has come to light - <a href="https://x.com/bcherny/status/2010813886052581538?s=20">Anthropic are claiming</a> the solution was built in two weeks entirely with Claude Code. Pretty impressive.</p>

<h2 id="would-you-like-three-letters-with-that-acronym">Would you like three letters with that acronym?</h2>

<p>It’s noteworthy (at least to me!) how many open standards are being launched to shape the AI ecosystem, compared to the far more organic development of internet commerce. Google has already launched <a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">A2A</a>, <a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol">AP2</a>, and <a href="https://developers.googleblog.com/introducing-a2ui-an-open-project-for-agent-driven-interfaces/">A2UI</a>—all of these are standards for agents to interact and transact with each other.</p>

<p>They’ve now launched another standard named the <a href="https://developers.google.com/merchant/ucp">Universal Commerce Protocol or UCP</a>. Upon hearing the name, I initially figured it was to solve the problem of how agents can find out about your products and then buy them. And it is, but the first half of that statement (the finding or discovery of your products) isn’t ready yet. Instead, the capabilities they’re <a href="https://ucp.dev/documentation/core-concepts/">launching now</a> are identity, checkout, and ordering. For further techy detail, there are a few worked examples in the <a href="https://github.com/Universal-Commerce-Protocol/ucp">Github</a> and they also have a <a href="https://ucp.dev/playground/">playground</a> on the main site.</p>

<p>As you’d expect, once you’re set up, other users of the protocol won’t need to integrate with you individually. What the take-up of this will be remains to be seen, but I’d expect it to be fairly good as most retailers are already in the Google ecosystem (AdWords/Google Ads, SEO, etc). Google has made it clear that this will soon power purchases within Google AI Search and the Gemini app—both large enough markets that it’s worth it for many retailers to do this even if the open aspect doesn’t take off.</p>

<h2 id="you-cant-trust-everything-you-hear">You can’t trust everything you hear</h2>
<p>In another reminder that you really shouldn’t be using your voice as your password (<a href="https://ciiom.hsbc.com/ways-to-bank/phone-banking/voice-id/">no matter what your bank says</a>) given the state of generative text-to-speech systems, here is a <a href="https://kyutai.org/blog/2026-01-13-pocket-tts">new model</a> from Kyutai that will run on just your CPU and clone a voice from under 5 seconds of speech. I was a bit sceptical so I installed it from <a href="https://huggingface.co/kyutai/pocket-tts/tree/main">Hugging Face</a> in a fresh conda env, and can confirm that it will indeed give a convincing clone of your voice from a few seconds of audio after downloading the few hundred meg of model weights. Given the proliferation of real-time and near-real-time voice cloning technology, companies (and everyone for that matter) need to be cognisant that just because a voice sounds like someone you know, it doesn’t actually mean that it’s them. <!--more--></p>

<h2 id="recursive-language-models">Recursive Language models</h2>
<p>I found <a href="https://arxiv.org/pdf/2512.24601">this paper</a> on recursive language models from MIT to be an interesting read. A lot of enterprise use cases are essentially search and summarise tasks - often this means dealing with very large potential contexts. You could just toss the entire thing into the model context window (and often this is worth doing if it’s a one off) but at a certain point that becomes expensive and slow.</p>

<p>The technique in the paper seems to take inspiration from how coding agents work - when you use these tools they don’t load the entire codebase into the context window, instead they note that they have access to (say) five python files and then search / manipulate the files as required. Likewise this approach treats a potentially huge (10M+ tokens) context as a variable that exists and can then regex search it / split it into smaller chunks and execute agents against each chunk.</p>
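<p>In code terms, the orchestrator is handed a tiny tool surface over a string variable rather than the string itself - something in the spirit of the sketch below, which is my own simplification of the paper’s idea rather than its actual implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># My own simplification of the recursive language model idea: the multi-million-token context
# is a Python variable the orchestrator probes with tools, never something it reads in full.
import re

class ContextEnv:
    def __init__(self, text, chunk_chars=50_000):
        self._text = text
        self.chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

    def grep(self, pattern, window=200):
        """Return short windows around each regex match instead of the whole document."""
        hits = []
        for m in re.finditer(pattern, self._text):
            start = max(0, m.start() - window)
            hits.append(self._text[start:m.end() + window])
        return hits

    def peek(self, i):
        """Hand a single chunk to a sub-agent call; the root model only sees the summary that comes back."""
        return self.chunks[i]

# env = ContextEnv(open("framework_agreement.txt").read())
# env.grep(r"Definitions?") or env.peek(12) then get fed to cheap sub-agent calls as needed
</code></pre></div></div>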

<p>Let’s say you had a question of “build me a RACI for all the roles in this document” - if you tried to use traditional semantic RAG for that it’d totally fail, as the task needs every part of the context to be examined rather than just the top-k semantically similar chunks. This solution will work because it will split the document into lots of chunks, assess whether a role is discussed in each chunk, and then write the role summary back to a findings array for the root model to process.</p>

<p>If you’re working with very long potential contexts and need to aggregate information from across the whole context this approach is pretty robust and will likely result in more accurate results than semantic RAG.</p>

<h2 id="evaluating-agents">Evaluating agents</h2>
<p>Most evals are currently based on just assessing the text output - whilst this works well for things like whether an AI can find an answer from context or recall a fact, it works much less well if we want to evaluate an agentic process. Often the text output doesn’t really let you assess whether the agent has done what you want it to have done. <a href="https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents">This post</a> from Anthropic is a deep dive into how to better evaluate agentic systems.</p>

<p>It’s all pretty logical - if you want the agent to create code then you have a gold-standard deterministic approach you can use - you run the code against your tests. If it’s a bit fuzzier than that, then use a LLM as a judge (noting that you have to calibrate against a known good standard). Build a harness that you wipe each time so your test isn’t passed because an agent left a file there on a previous run. Don’t grade the path as agents are not deterministic and may find an unexpected route to get the right answer (they give a good example of this from Opus 4.5 testing where the model found a loophole in a customer service policy that allowed the goal to be achieved).</p>
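<p>A skeleton of what that looks like in practice - a fresh workspace per run, grade the end state rather than the path, and report a pass rate rather than a single anecdote. This is a sketch with <code class="language-plaintext highlighter-rouge">run_agent</code> and <code class="language-plaintext highlighter-rouge">check_outcome</code> as stand-ins for your own agent and grader:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Skeleton agent eval harness - run_agent and check_outcome are placeholders for your own system
import pathlib
import shutil
import tempfile

def run_agent(task, workdir):
    raise NotImplementedError("invoke your agent here, with workdir as its sandbox")

def check_outcome(task, workdir):
    raise NotImplementedError("deterministically check the end state, not the steps taken")

def evaluate(tasks, trials=5):
    results = {}
    for task in tasks:
        passes = 0
        for _ in range(trials):
            workdir = pathlib.Path(tempfile.mkdtemp())  # wiped sandbox every run, so leftover
            try:                                        # files can never fake a pass
                run_agent(task, workdir)
                passes += bool(check_outcome(task, workdir))  # grade the outcome, not the path
            finally:
                shutil.rmtree(workdir, ignore_errors=True)
        results[task] = passes / trials
    return results

# Why pass rates matter: a 90%-reliable step repeated 10 times is only 0.9 ** 10, roughly 35%, overall
</code></pre></div></div>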

<p>If you’re trying to deploy agents, then it’s certainly worth a read. If nothing else it’s worth the reminder that if your agent is 90% reliable per step and it performs 10 steps, you’ll only have an overall reliability of 90%^10 or 35%… not what you want for a customer service agent. It’s for reasons such as this that I’m sceptical of customer facing production agent use without heavy scaffolding - most of the time they’re just not reliable enough to let them find their own path.</p>

<h2 id="other-stuff">Other stuff</h2>

<p>Skills being a thing continues - you can now use them in <a href="https://code.visualstudio.com/docs/copilot/customization/agent-skills">VS Code</a>.</p>

<p>I wrote last week about using <a href="https://ewanpanter.github.io/2026-01-08/things-ive-noted-this-week">prompt caching</a> to improve time to first token, reduce costs, and improve performance over naive RAG-based approaches - <a href="https://venturebeat.com/orchestration/why-your-llm-bill-is-exploding-and-how-semantic-caching-can-cut-it-by-73">this article</a> adds another layer to this by using semantics to better identify the correct cache to use to answer a question. It does suffer from the issue I mentioned last time about the implicit assumption that the question is semantically similar to the answer, but if your use case fits this, it’s worth a read.</p>

<p>Finally, there have been a number of AI health care releases in the past week - I guess this was inevitable given how much people use LLMs for health and fitness questions (for example, I use Claude as a pretty effective coach when training for running events). If you’re comfortable giving all your health information to OpenAI and aren’t in the EEA or UK, then <a href="https://openai.com/index/introducing-chatgpt-health/">this is for you</a>. Probably a case of don’t compare me to almighty, compare me to the alternative - if you’ve not got access to a health professional or personal trainer, then I expect this to be worthwhile. Conversely Anthropic’s <a href="https://claude.com/solutions/healthcare">recent release</a> is actually pretty different and much more targeted at their enterprise customers rather than consumers.</p>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[First week in a while with no major model release to cover - instead I’ve been using Claude Code to do non-coding work, looked into Google UCP and what it means for retailers, faked my voice from five seconds of audio with a model that will run on pretty much anything, and investigated an agentic approach that enables efficient use of extremely large contexts.  It’s coming There has been a lot of hype in the past few weeks about Claude Code with Opus 4.5 - most of which I’d put down to people using the Christmas holidays to have a play with it and realising that it’s got really good. If you’ve not played with Claude Code recently, you should probably go do that! One of the trends that is a thing at the moment is using Claude Code to do non-coding stuff. It is surprisingly(?) excellent at this - for example I needed to make a short deck on something the other day, and I used Claude Code to create an extremely decent first draft of it. I recorded a brain dump of all my thoughts on the subject, exported the transcription from my phone, put it in a directory that Claude Code could access, installed Claude Office Skills to enable it to create PPTs, and then just told it to make the deck - off it went and many thousands of tokens later I had my deck. At its core, this is what Anthropic have launched this week with Claude Cowork but with the addition of a separate VM and a nice non-CLI frontend. This has been received pretty positively, and judging from the very slow performance Opus 4.5 has been showing on the Claude website/app I think it’s fair to say Anthropic’s servers are being hammered. Currently it’s a macOS only research preview and only available to people on the $100/$200 per month plan. Anthropic call out risks over and above the normal privacy ones - importantly if you’re using it with browser use enabled you should only be using it on ‘trusted’ websites. In theory someone could prompt inject Cowork by displaying a nefarious prompt on their website and delete files on your actual computer. In a nutshell, this means no more uploading or copy &amp; pasting documents to a chat/teams/web interface - now you just point Cowork at the folder in your computer, tell it what you want, and it’ll asynchronously access websites, build ppts, organise the files, give you progress reports, and let you chip in with additional thoughts/changes as it reports how it is getting on. Whilst technically the solution they’ve released is not really anything new over and above what you could do with Claude Code already, it’s perception that matters. 
Gone is Claude Code’s command line, instead here is a nice friendly interface that enterprise leaders can try themselves and experience the capabilities of the most advanced universal agents. I expect this to have an impact beyond ‘just’ a research preview and be a step towards the mass adoption of general agents - I’ve no doubt OpenAI and Google will be launching their own versions shortly. One final factoid that has come to light - Anthropic are claiming the solution was built in two weeks entirely with Claude Code. Pretty impressive. Would you like three letters with that acronym? It’s noteworthy (at least to me!) how many open standards are being launched to shape the AI ecosystem, compared to the far more organic development of internet commerce. Google has already launched A2A, AP2, and A2UI—all of these are standards for agents to interact and transact with each other. They’ve now launched another standard named the Universal Commerce Protocol or UCP. Upon hearing the name, I initially figured it was to solve the problem of how agents can find out about your products and then buy them. And it is, but the first half of that statement (the finding or discovery of your products) isn’t ready yet. Instead, the capabilities they’re launching now are identity, checkout, and ordering. For further techy detail, there are a few worked examples in the Github and they also have a playground on the main site. As you’d expect, once you’re set up, other users of the protocol won’t need to integrate with you individually. What the take-up of this will be remains to be seen, but I’d expect it to be fairly good as most retailers are already in the Google ecosystem (AdWords/Google Ads, SEO, etc). Google has made it clear that this will soon power purchases within Google AI Search and the Gemini app—both large enough markets that it’s worth it for many retailers to do this even if the open aspect doesn’t take off. You can’t trust everything you hear In another reminder that you really shouldn’t be using your voice as your password (no matter what your bank says given the state of generative text to speech systems, here is new model from Kyutai that will run on just your CPU and clone a voice from under 5 seconds of speech. I was a bit sceptical so I installed it from Hugging Face in a fresh conda env, and can confirm that it will indeed give a convincing clone of your voice from a few seconds of audio after downloading the few hundred meg of model weights. Given the proliferation of real time and near real time voice cloning technology, companies (and everyone for that matter) need to be cognisant that just because a voice sounds like someone you know, it doesn’t actually mean that it’s them. ]]></summary></entry><entry><title type="html">Renovating my office with AI</title><link href="https://ewanpanter.github.io/2026-01-08/renovating-my-office-with-AI" rel="alternate" type="text/html" title="Renovating my office with AI" /><published>2026-01-08T00:00:00+00:00</published><updated>2026-01-08T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-01-08/renovating-my-office-with-AI</id><content type="html" xml:base="https://ewanpanter.github.io/2026-01-08/renovating-my-office-with-AI"><![CDATA[<p>My Christmas DIY project this year was to renovate my office. The biggest part of this was to do some Ikea hacking and add some built in cupboards, bookshelves, and a second desk. 
It struck me that <a href="https://blog.google/innovation-and-ai/products/nano-banana-pro/">Nano Banana Pro</a> would definitely know about Ikea products and therefore might be able to help me visualise the end result (and importantly get approval from my wife!).</p>

<p>My standard approach to DIY is to come up with an idea and then procrastinate about the details for days or weeks. I was true to form for this project, but eventually settled on enough detail to knock up a rough scale plan in Visio:</p>

<p><img src="/assets/images/0801_plan.png" alt="A terrible scale plan" /></p>

<p>I then gave this to the model with a bit of additional detail:</p>

<p><code class="language-plaintext highlighter-rouge">Here are my plans - I’m doing some ikea hacking. This is a front elevation of a set of built in cupboards and book cases i'm building in my office on the wall - the wall is 3.3m x 2.4m. As per the diagram there are a pair of white metod cupboards on each side, then a desk area in the middle. The outer cupboards are shaker style cupboards (2 doors) the inner cupboard will be 3 draws again in shaker style. The wall itself is covered in vertical panelling in mid grey. On top of the cupboards (the brown bit) are three pieces of laminate work top (oak). Then on top of the worktop is a  3 wide x 4 high kallax unit which has an open back so you can see the grey panelling. Nothing in the middle as it's a desk. Try and visualise it for me based on the picture.</code></p>

<p>This came back with a good start (I also tried it without the image to see how much it was understanding from the diagram and the result was much worse) as below:</p>

<p><img src="/assets/images/0801_image1.png" alt="The first attempt" /></p>

<p>There was then a bit of back and forth to correct what it got wrong. Specifically, it hadn’t picked up (from either the text on the plan, or the prompt itself) that the Kallax were 3x4 units<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and secondly that the desk would be lower than the tops of the cupboards. This points to the model very much still operating from ‘vibes’ rather than specific details, but a couple of nudges got it back on track and resulted in the picture below:</p>

<p><img src="/assets/images/0801_image2.png" alt="The second attempt" /></p>

<p>Which was close enough to submit for and achieve wife approval. Whilst the final result isn’t identical (I made some changes!) and I’ve not added door handles yet (brass? chrome? leave as bits of masking tape for 3 months whilst I ponder it?), I think it’s pretty close!</p>

<p><img src="/assets/images/0801_real.png" alt="The real thing" /></p>

<p>Hopefully, this post is underlining how far multimodal models have progressed in the past year. Submitting a plan and getting a half decent result would just not have been possible even a couple of months ago – the success of this points to Google’s success in integrating Gemini 3 Pro’s world understanding into the model. Clearly this isn’t good enough to replace an expensive graphical artist working for an interior designer, but that was never on the table for this project and it’s not <em>that</em> far off.</p>

<p>As demonstrated in my <a href="https://ewanpanter.github.io/2025-11-25/experiments-with-nano-banana-pro">previous post</a> about using Nano Banana Pro to visualise mermaid diagrams we are seeing image generation models move from fun toys to useful tools. Looking forward to whatever we have access to for my Xmas 2026 project and in the meantime please share any visualisations of your own projects!</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>3x4 Units that were 10cm too tall for the space – I can confirm that with the application of a table saw and wooden inserts you can make Kallax fit any space! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[My Christmas DIY project this year was to renovate my office. The biggest part of this was to do some Ikea hacking and add some built in cupboards, bookshelves, and a second desk. It struck me that Nano Banana Pro would definitely know about Ikea products and therefore might be able to help me visualise the end result (and importantly get approval from my wife!). My standard approach to DIY, is to come up with an idea, and then procrastinate about the details for days or weeks. I was true to form for this project, but eventually settled on a plan sufficient to knock up a rough scale plan in Visio: I then gave this to the model with a bit of additional detail: Here are my plans - I’m doing some ikea hacking. This is a front elevation of a set of built in cupboards and book cases i'm building in my office on the wall - the wall is 3.3m x 2.4m. As per the diagram there are a pair of white metod cupboards on each side, then a desk area in the middle. The outer cupboards are shaker style cupboards (2 doors) the inner cupboard will be 3 draws again in shaker style. The wall itself is covered in vertical panelling in mid grey. On top of the cupboards (the brown bit) are three pieces of laminate work top (oak). Then on top of the worktop is a 3 wide x 4 high kallax unit which has an open back so you can see the grey panelling. Nothing in the middle as it's a desk. Try and visualise it for me based on the picture. This came back with a good start (I also tried it without the image to see how much it was understanding from the diagram and the result was much worse) as below: There was then a bit of back and forth to correct what it got wrong. Specifically, it hadn’t picked up (from either the text on the plan, or the prompt itself) that the Kallax were 3x4 units1 and secondly that the desk would be lower than the tops of the cupboards. This points to the model very much still operating from ‘vibes’ rather than specific details, but a couple of nudges got it back on track and resulted in the picture below: Which was close enough to submit for and achieve wife approval. Whilst the final result isn’t identical (I made some changes!) and I’ve not added door handles yet (brass? chrome? leave as bits of masking tape for 3 months whilst I ponder it?), I think it’s pretty close! Hopefully, this post is underlining how far multimodal models have progressed in the past year. Submitting a plan and getting a half decent result would just not have been possible even a couple of months ago – the success of this points to Google’s success in integrating Gemini 3 Pro’s world understanding into the model. Clearly this isn’t good enough to replace an expensive graphical artist working for an interior designer, but that was never on the table for this project and it’s not that far off. As demonstrated in my previous post about using Nano Banana Pro to visualise mermaid diagrams we are seeing image generation models move from fun toys to useful tools. Looking forward to whatever we have access to for my Xmas 2026 project and in the meantime please share any visualisations of your own projects! 3x4 Units that were 10cm too tall for the space – I can confirm that with the application of a table saw and wooden inserts you can make Kallax fit any space! 
&#8617;]]></summary></entry><entry><title type="html">Working Notes - 08/01/26</title><link href="https://ewanpanter.github.io/2026-01-08/things-ive-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 08/01/26" /><published>2026-01-08T00:00:00+00:00</published><updated>2026-01-08T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2026-01-08/things-ive-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2026-01-08/things-ive-noted-this-week"><![CDATA[<p>Back to it after the long Christmas break - I spent most of it with the family and also <a href="https://ewanpanter.github.io/2026-01-08/renovating-my-office-with-AI">remodelling my home office</a>. Things slowed down a bit from a model point of view but this gives us time to talk about data sovereignty and when not to use a vector database.</p>

<h2 id="data-sovereignty">Data sovereignty</h2>
<p>I was reading an <a href="https://www.theregister.com/2025/12/22/europe_gets_serious_about_cutting/">interesting article</a> in El Reg on data sovereignty. The basic premise is that nothing is really sovereign if you’re using a US hyperscaler due to the US <a href="https://en.wikipedia.org/wiki/CLOUD_Act">Cloud Act</a> (tl;dr: US-based hyperscalers can be compelled to disclose data held anywhere).</p>

<p>This is one of the reasons that the EU-US Data Privacy Framework (DPF) was adopted, as it allows US companies to certify they are GDPR adequate. However, it is straightforward to argue that this <a href="https://brianclifton.com/blog/2025/05/19/schrems-iii-how-likely-and-how-to-prepare/">doesn’t actually mitigate the problem</a>. This then leaves technical solutions - for example, if an EU company encrypts their data then it’ll be useless even if it is accessed by US law enforcement. Unfortunately, in practice, this is not effective. When requesting keys, GCP attaches a reason code - if this is ‘third party access’ then access is denied. Sounds good, but if Google were to actually use this reason code then they would breach the gag order that forms part of the request.</p>

<p>A possible solution is to use an external (EU-based) key manager, but this then comes with significant complexity. The long and short of it is that this is something of a nightmare that should be pondered as non-US companies become more and more dependent on US hyperscalers for AI.</p>

<h2 id="new-sota-ocr-model">New SOTA OCR model</h2>
<p>Data sovereignty provides a nice segue into <a href="https://mistral.ai/news/mistral-ocr-3">Mistral’s new model</a>. Because Mistral is based in the EU, using it can avoid these issues, although you’ve got to store your data somewhere - and if that somewhere is AWS, Azure, or BigQuery you may not have mitigated the problem! The new OCR model is SOTA for converting complex formatted documents. In my view, this helps to address a key blocker for AI adoption - huge amounts of important data are locked up in PowerPoints that were formatted to look nice in presentation mode without a thought about whether an AI model would need to consume them down the road. Each improvement to OCR models helps solve that problem, and if this issue sounds familiar, it’s certainly worth experimenting with in the <a href="https://console.mistral.ai/build/document-ai/ocr-playground">Mistral Playground</a>.</p>

<h2 id="context-caching-and-when-not-to-use-naive-vector-based-rag-spoiler---most-of-the-time">Context caching and when not to use naive vector based RAG (spoiler - most of the time)</h2>
<p>Ever since models like Gemini Flash became good at handling long contexts (the data provided in the prompt) at a low cost, I’ve held that a lot of implementations of RAG using vector databases are essentially pointless.</p>

<p>To expand on this a bit, the vector DB + LLM paradigm became a thing due to the small context window and high cost of early models (<a href="https://azure.microsoft.com/en-us/blog/introducing-gpt4-in-azure-openai-service/">remember when GPT4 was limited to 8k tokens and $30/m tokens!</a>). Neither of these constraints exist anymore, and the approach comes with a lot of inherent problems and assumptions - not least that your query needs to be semantically similar to the answer. A lot of the time you can avoid these problems and vector db complexity, just by tossing the entire document set into the context window - the downside being that the larger the provided context the longer the time to the first token.</p>

<p>I’ve been discussing this with <a href="https://www.linkedin.com/in/pip-austin/">a colleague</a> this week, specifically in the context (pun intended) of using prompt caching instead of RAG with a vector db. As background, a query to an LLM is processed in two phases: the prefill (the heavy lifting where the KV cache is generated based on the context) and the decoding (generating the response tokens). The decoding part is memory bound and the prefill part is compute bound. If you save the KV cache and just reuse it with your new query appended, you can dramatically reduce the time to first token by avoiding a ton of compute. This is prompt caching.</p>

<p>The key constraint is that you do need the same prefix prompt - but in a lot of RAG-type use cases this is fine, as your prefix is just your instructions and some documentation.</p>

<p>You can obviously make this more effective by having several prompt caches and picking the best one for the query (e.g. using BM25 keyword search or an LLM over a short index saying what each cache contains (you could even cache this if it’s long enough!)). Different providers have different approaches (here is <a href="https://ai.google.dev/gemini-api/docs/caching?lang=python">Google’s</a>) but in a lot of cases this is a no-brainer - it’s cheaper and faster.
<!--more--></p>
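<p>As a flavour of what this looks like with Gemini’s explicit caching, here is a sketch based on my reading of the docs linked above - double-check model names, minimum cached token counts, and TTL settings before relying on it:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of explicit context caching with the google-genai SDK - verify details against Google's docs
from google import genai
from google.genai import types

client = genai.Client()
corpus = open("framework_agreement.txt").read()  # the big, reusable part of the prompt

# Pay the prefill cost once: the instructions and documents go into the cached prefix
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions strictly from the supplied agreement.",
        contents=[corpus],
        ttl="3600s",
    ),
)

# Every subsequent query reuses the stored KV cache, so time to first token (and cost) drops
answer = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="List every defined term in the agreement.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(answer.text)
</code></pre></div></div>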

<h2 id="stateful-google-api">Stateful Google API</h2>
<p>Currently, if you’re doing anything clever with agents, or even just a simple chatbot, you’re generally responsible for managing the state. The state is essentially the interaction history. This can get quite painful when you’re dealing with agents and complex interactions.</p>

<p>OpenAI solve this issue via their <a href="https://platform.openai.com/docs/api-reference/responses">Responses API</a> and now Google has followed suit with their <a href="https://ai.google.dev/gemini-api/docs/interactions">Interactions</a> API. Instead of storing and supplying the whole interaction history, you can now just pass the last interaction id to the API. Additionally, you can use the API with agents (such as <a href="https://ai.google.dev/gemini-api/docs/deep-research">Deep Research</a>), set the agent to work in the background, and come back in a different session to get the results.</p>
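
<p>To show the shape of the change, here’s roughly what the stateful pattern looks like with OpenAI’s Responses API (the model name is a placeholder); I’d expect Google’s Interactions API to look similar in spirit, with an interaction id in place of the response id:</p>

<pre><code class="language-python">
# Stateful conversation: rather than resending the whole history every turn,
# pass the id of the previous response and let the provider hold the state.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-4o",   # placeholder model name
    input="In three bullets, why does prompt caching cut time to first token?",
)

# Second turn: no history replay, just a pointer to the previous interaction.
followup = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="Now give one downside of relying on it.",
)
print(followup.output_text)
</code></pre>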

<p>In order for this to work, Google is storing the state of the model (since you’re not anymore) in the cloud for 55 days, using their implicit caching approach. This is potentially a decent financial saving as the cached tokens are only charged at 10% of their normal cost. Whether that stays that way over the long term… well I guess we’ll see - there is an obvious commercial opportunity there!</p>

<p>This last benefit also has a privacy angle that needs consideration for corporates. We are all used to interactions with the LLM being transient but now Google is storing your data ‘somewhere’ for 55 days. Yes it’s encrypted etc etc, but the privacy team will definitely want a word before going live with this (not least because in the beta you can’t specify data residency yet).</p>

<h2 id="skills-continue-to-be-a-thing-and-are-now-an-open-standard">Skills continue to be a thing and are now an open standard</h2>
<p>I mentioned in my last entry that <a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Anthropic’s Skills</a> seem increasingly likely to become a thing, with ChatGPT starting to use them <a href="https://x.com/simonw/status/1999503124592230780?s=20">somewhat unannounced</a>. Well, OpenAI have now <a href="https://developers.openai.com/codex/skills/">gone official</a> and, as with MCP, Anthropic have made skills an <a href="https://agentskills.io/home">open standard</a>. If you’ve not used skills yet, I’d recommend having a play - they have a lower barrier to entry than MCP and are useful even for casual use. For example, I’ve made a lesson prep skill for my (primary school teacher) wife that has now saved her a good few hours. I would assume it’s just a matter of time before xAI, Google, and other model providers adopt the new standard.</p>

<h2 id="continual-learning">Continual learning</h2>
<p>If I had to list the things that are preventing truly transformative AI (i.e. models that can replace entire jobs, rather than automate short tasks), top of my list would be continual learning. At the moment we have ‘intelligence’ but that intelligence can’t learn anything after its initial training - yes you can shove ever longer contexts into the model prompt, but this has obvious limits.</p>

<p><a href="https://test-time-training.github.io/e2e.pdf">This paper</a> is a possible baby step towards resolving this. It allows the model to learn from the context as it’s generating the tokens. To skip a great deal of technical detail, after each pass, the model weights are updated and only a 8k token KV cache is stored.</p>

<p>This means the model can then answer questions on the context without querying a massive KV cache, making it (in theory) quicker. You can then use the final values of the weights as the starting values for your next query. If you squint a bit you can imagine a developed version of this slowly building up expertise as you ask it more and more questions on existing documents - a sort of continual learning.</p>
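
<p>A very rough sketch of the shape of the idea - emphatically not the paper’s actual recipe (the chunk size, learning rate and model are arbitrary placeholders) - is to stream the document through the model, take a gradient step on each chunk’s next-token loss, and keep only a short trailing KV cache:</p>

<pre><code class="language-python">
# Conceptual sketch of test-time training: stream the document through the model
# in chunks, take a gradient step on each chunk's next-token loss, and rely on the
# updated weights (plus a short trailing KV cache) instead of a huge context.
# Chunk size, learning rate and the model are arbitrary placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.SGD(model.parameters(), lr=1e-4)

document = "LONG DOCUMENT TEXT GOES HERE. " * 200   # stand-in for the real context
ids = tok(document, return_tensors="pt").input_ids[0]

chunk_len = 512
model.train()
for start in range(0, ids.numel() - 1, chunk_len):
    chunk = ids[start : start + chunk_len + 1].unsqueeze(0)   # always at least two tokens
    out = model(chunk, labels=chunk)   # HF shifts labels internally for next-token loss
    out.loss.backward()
    optim.step()
    optim.zero_grad()

# The document now lives (approximately) in the updated weights, so answering a
# question only needs a KV cache for the question itself, not the whole document.
</code></pre>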

<p>This is clearly not the final answer though. If nothing else, the model will be best at the last document / context it was trained on and will forget previous documents from earlier runs. This could be mitigated by freezing the model weights related to the previous query, but then you run into interpretability roadblocks if you want to avoid simplistic assumptions about which weights those are. There is also a huge deployability issue - each customised set of model weights would need to be stored and loaded into memory as and when a user requested that model.</p>

<p>Tl;dr - interesting but doesn’t solve the continual learning problem (although the technique is potentially pretty useful for speeding up long contexts).</p>

<h2 id="other-stuff">Other stuff</h2>
<p>When writing the bit about prompt caching, I came across this <a href="https://ngrok.com/blog/prompt-caching">excellent explainer</a> from <a href="https://samwho.dev/">Sam Rose</a> - it actually ends up being a decent explainer of how an LLM works. A highly recommended read.</p>

<p>As mentioned I’ve been <a href="https://ewanpanter.github.io/2026-01-08/renovating-my-office-with-AI">remodelling my home office</a>, and before getting my table saw out I visualised the result in Nano Banana - <a href="https://minimaxir.com/2025/12/nano-banana-pro/">this</a> is an excellent deep dive into Nano Banana prompting.</p>

<p>And finally, in a new fresh hell, AI generated pictures of packages (not) being delivered are now <a href="https://x.com/ByrneHobart/status/2004734471267103023?s=20">a thing</a>.</p>

<ul>
  <li>* For completeness, be aware of cache expiries - caches don’t last forever, so you’ll want to refresh them from time to time.</li>
</ul>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[Back to it after the long Christmas break - I spent most of it with the family and also remodelling my home office. Things slowed down a bit from a model point of view but this gives us time to talk about data sovereignty and when not to use a vector database. Data sovereignty I was reading an interesting article in El Reg on data sovereignty. The basic premise is that nothing is really sovereign if you’re using a US hyperscaler due to the US Cloud Act (tl;dr: US based hyperscalers can be compelled to disclose data held anywhere). This is one of the reasons that the EU-US Data Privacy Framework (DPF) was adopted as it allows US companies to certify they are GDPR adequate. However, it is straightforward to argue that this doesn’t actually mitigate the problem. This then leaves technical solutions, for example, if an EU company encrypts their data then it’ll be useless even if it is accessed by US law enforcement. Unfortunately, in practice, this is not effective. When requesting keys, GCP attaches a reason code - if this is ‘third party access’ then access is denied. Sounds good, but if Google was to actually use this reason code then they will breach the gag order that forms part of the request. A possible solution is to use an external (EU based) key manager, but this then comes with significant complexity. The long and short of this is that this is something of a nightmare that should be pondered as non-US companies become more and more dependent on US hyperscalers for AI. New SOTA OCR model Data sovereignty provides a nice segue into Mistral’s new model. Being based in the EU, using Mistral can avoid these issues, although you’ve got to store your data somewhere and if that somewhere is in AWS, Azure, or BigQuery you may not have mitigated the problem! The new OCR model is SOTA for converting complex formatted documents. In my view, this helps to address a key blocker for AI adoption - huge amounts of important data is locked up in PowerPoints that were formatted to look nice in presentation mode without a thought about whether an AI model would need to consume it down the road. Each improvement to OCR models helps solve that problem and if this issue sounds familiar, it’s certainly worth experimenting with in the Mistral Playground. Context caching and when not to use naive vector based RAG (spoiler - most of the time) Ever since models like Gemini Flash became good at handling long contexts (the data provided in the prompt) at a low cost, I’ve held that a lot of implementations of RAG using vector databases are essentially pointless. To expand on this a bit, the vector DB + LLM paradigm became a thing due to the small context window and high cost of early models (remember when GPT4 was limited to 8k tokens and $30/m tokens!). Neither of these constraints exist anymore, and the approach comes with a lot of inherent problems and assumptions - not least that your query needs to be semantically similar to the answer. A lot of the time you can avoid these problems and vector db complexity, just by tossing the entire document set into the context window - the downside being that the larger the provided context the longer the time to the first token. I’ve been discussing this with a colleague this week, specifically in the context (pun intended) of using prompt caching instead of RAG with a vector db. 
As background, a query to an LLM is processed in two phases: the prefill (the heavy lifting where the KV cache is generated based on the context) and the decoding (generating the response tokens). The decoding part is memory bound and the prefill part is compute bound. If you save the KV cache and just reuse it with your new query appended, you can dramatically reduce the time to first token by avoiding a ton of compute. This is prompt caching. The key constraint to this is that you do need the same prefix prompt - but in a lot of RAG type use cases this is fine, as your prefix are your instructions and some documentation. You can obviously make this more effective by having several prompt caches and picking the best one for the query (e.g. using BM25 keyword search or a LLM over a short index saying what each cache contains (you could even cache this if it’s long enough!)). Different providers have different approaches (here is Google’s) but in a lot of cases this is a no brainer - it’s cheaper and faster.*]]></summary></entry><entry><title type="html">Working Notes - 18/12/25</title><link href="https://ewanpanter.github.io/2025-12-18/things-ive-noted-this-week" rel="alternate" type="text/html" title="Working Notes - 18/12/25" /><published>2025-12-18T00:00:00+00:00</published><updated>2025-12-18T00:00:00+00:00</updated><id>https://ewanpanter.github.io/2025-12-18/things-ive-noted-this-week</id><content type="html" xml:base="https://ewanpanter.github.io/2025-12-18/things-ive-noted-this-week"><![CDATA[<p>Nearly Xmas, so chance this’ll be the last one of the year unless I’m feeling very keen next week! Lots still going on as we close out 2025 - my favourite model has been updated, OpenAI has responded to Gemini 3 Pro, and a few interesting papers.</p>

<h2 id="flash-ah-ah">Flash! Ah-ah!</h2>
<p>Google has <a href="https://blog.google/products/gemini/gemini-3-flash/">updated its Flash model</a> - it’s pretty much what you’d expect, near-frontier performance for a much lower price. If I had a favourite model for getting stuff done at scale, it’d certainly be Flash - I’ve got use cases where millions of tokens get thrown at the model and it’s cheap enough we can run it every day and not worry about the cost. Whilst the <a href="https://ai.google.dev/gemini-api/docs/pricing">cost has increased</a> with this release by about 40% over 2.5 Flash, this is from a very low base - Flash 2.0 is also still offered at an even lower price (though perhaps <a href="https://ai.google.dev/gemini-api/docs/deprecations">only for a few months more</a>). One interesting new feature: Flash is a reasoning model, and you can now <a href="https://ai.google.dev/gemini-api/docs/gemini-3#thinking_level">alter the degree of reasoning it uses</a> even beyond what Gemini 3 Pro already offered - this is pretty useful, as there are plenty of use cases where reasoning just isn’t required. Not that much more to say really - the benchmarks are good, and I’ve played around with it today on some biggish code bases and it’s working pretty much as well as Gemini 3 Pro. What’s not to love!</p>

<h2 id="it-does-mean-changing-the-bulb"><a href="https://www.youtube.com/watch?v=BvOxVsClUCU">It does mean changing the bulb</a></h2>
<p>The result of <a href="https://fortune.com/2025/12/02/sam-altman-declares-code-red-google-gemini-ceo-sundar-pichai/">Sam Altman’s ‘Code Red’</a> seems to be an incremental release of GPT-5 - <a href="https://openai.com/index/introducing-gpt-5-2/">GPT-5.2</a>. Unsurprisingly it does well in the benchmarks - comprehensively besting the other frontier models in <a href="https://evals.openai.com/gdpval/leaderboard">GDPval</a>, which is one of the <a href="https://arxiv.org/pdf/2510.04374">more interesting benchmarks out there</a>. In the real world people are reporting it’s decent if a bit slow, but essentially comparable to Opus and Gemini 3 Pro rather than a large step forward - the slowness is certainly a thing: I’ve given it prompts and, in the time it spent thinking, fired up Gemini and gotten the answer. At this point I suspect people have frontier model fatigue and just want it to be Christmas!</p>

<p>OpenAI also released <a href="https://openai.com/index/new-chatgpt-images-is-here/">GPT Image-1.5</a> - it’s taken top spot on <a href="https://artificialanalysis.ai/image/leaderboard/text-to-image">Image Arena</a>; however, in general people report it as being worse than Nano Banana Pro. I tried using it to create a sequence diagram slide based on some Mermaid code and the results were poor compared to those I got from <a href="https://ewanpanter.github.io/2025-11-25/experiments-with-nano-banana-pro">Nano Banana Pro a few weeks ago</a>. It’s probably great at making you look like a K-Pop star though ¯\_(ツ)_/¯.</p>

<p><img src="/assets/images/181225_sequence_diagram.png" alt="An example sequence diagram in Mermaid generated by GPT Image-1.5" /></p>

<h2 id="skills-are-going-to-be-a-thing">Skills are going to be a thing</h2>
<p>I’ve mentioned in passing that I think <a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Anthropic’s Skills approach</a> is going to be <a href="https://x.com/rileybrown/status/2001105657018339581">a thing</a> in the way that MCP turned into a thing. As partial confirmation of this, <a href="https://x.com/simonw/status/1999503124592230780?s=20">ChatGPT is now using skills</a> to accomplish various things (spreadsheets, PDFs, Word documents, etc.) in its sandbox environments (which are presumably running some version of Codex, which also now works with skills). I’ve used skills a few times in anger and in my opinion they’re an excellent way to avoid filling the context window whilst extending model capabilities in repeatable ways and ensuring that SOPs are followed. If you’ve not tried skills yet then I highly recommend giving them a go - there is even a skill that you can enable in Claude to create skills! Expect other models to follow…</p>

<h2 id="openais-enterprise-ai-report">OpenAI’s Enterprise AI report</h2>
<p>I was reading OpenAI’s new <a href="https://cdn.openai.com/pdf/7ef17d82-96bf-4dd1-9df2-228f7f377a29/the-state-of-enterprise-ai_2025-report.pdf">Enterprise AI report</a> and it brought to mind <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321">a paper</a> I read in 2023 from BCG on one of the first AI studies done in a real workplace. That study identified ‘AI Centaurs’, who worked out which specific tasks the models were good at and delegated them to the AI, and a second group of ‘AI Cyborgs’, who completely integrated AI into their workflows (am I a cyborg?!) - according to the new report a blend of these groupings is alive and well in late 2025. Rather than Centaurs/Cyborgs, it identifies an analogous grouping of frontier users who send 6x as many messages to ChatGPT Enterprise as the median user. In some areas (e.g. developers) the gap between frontier users and the median user can be as much as 17 times.</p>

<p>What’s more, the report explicitly states that this gap is increasing, driven by the use of coding agents and of advanced features such as ChatGPT’s data analysis functionality. Whilst I guess there is a chance that the median users are just using a different AI on the sly and OpenAI weren’t looking for this, it does echo what I see at work - you get some people who by default start working with the models and others who might occasionally ask a simple one-line question. In the coming months / years this is going to be an interesting dynamic - will people catch up, or will there just be a permanent divide in usage and (perhaps) performance?</p>

<p>The report also identifies frontier vs non-frontier companies - with the former sending twice as many messages per seat as the median enterprise. I guess the same questions apply, but more so.</p>

<p>The rest of the report is what you’d expect, lots of interesting stats, and several cherry picked case studies demonstrating real world efficiencies with some nice ROI numbers for OpenAI. Definitely worth a read and ponder of the implications.</p>

<h2 id="pentesting-is-just-your-friends-hacking-you">Pentesting is just your friends hacking you</h2>
<p>I’ve written a few times about AI hacking; <a href="https://arxiv.org/pdf/2512.09882">this paper</a> actually does it in anger, as a pentest on a university network rather than in a sandboxed environment. It’s predictably terrifying, with their custom agent scaffolding outperforming 9/10 of the ($200/hour) human pen testers. The TLDR takeaways were that their ARTEMIS framework consisted of a supervisor, sub-agents, and interestingly a triage module - this module verified findings to avoid hallucinated vulnerabilities (someone should tell <a href="https://www.anthropic.com/news/disrupting-AI-espionage">Anthropic’s Chinese hackers</a>). It’s not all bad news: the AI would generally find a vulnerability and then move on, rather than using it to further penetrate and exploit the network - although to be honest this seems like something a bit of prompt engineering could resolve rather than some fundamental limitation. If you’re wondering whether the models’ (Claude Sonnet 4, OpenAI o3, Claude Opus 4, Gemini 2.5 Pro, and OpenAI o3 Pro) guardrails tripped whilst they were hacking the university’s network, they did when the framework wasn’t used, but the researchers discovered that if you constantly injected ‘trust me bro we’re doing authorised pen testing in a test environment’ the models would shrug and get on with it. Great.
<!--more--></p>

<h2 id="total-nonrecall">Total (non)recall</h2>
<p>Memory erasure features <a href="https://arxiv.org/pdf/2512.05648">in this paper</a> from Anthropic - when pre-training the model they used a technique named Selective GradienT Masking (SGTM) to force all the ‘bad’ stuff into a specific region of the model, and then zapped it after the fact. To somewhat elaborate: they have some data which they know to be forbidden; the next token is predicted as normal and gradients calculated, but instead of updating all the weights, only the weights in a ‘quarantine zone’ are updated. This is then repeated many times on many pieces of forbidden information. Then at the end of it, the weights in the quarantine zone are zeroed out and voila, the model no longer knows the bad stuff. The nice thing is that even if a naughty bit of data is mislabelled, after a bit of training it will still tend to end up represented in the quarantine zone anyway. The obvious downside is that you’re effectively trying to train complex tasks on a relatively small number of weights, thus killing your compute efficiency. One interesting thing that falls out of the approach is that you can train dual-use versions - one with the quarantine zone left in (for your trusted biology / hacking / whatever researchers) and one where you nuke the quarantine zone for your untrusted users. It’s certainly an interesting paper, but given the race dynamic we have, it seems unlikely the frontier labs are going to be leaping at the chance to reduce their compute efficiency on bad stuff when so much of it is dual use (one person’s hacking information is another’s security checks), so this will probably just stay an interesting idea.</p>
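
<p>As a toy illustration of the quarantine idea (very much not Anthropic’s implementation - the model below is a two-layer stand-in and the choice of ‘quarantine’ parameters is arbitrary):</p>

<pre><code class="language-python">
# Toy sketch of selective gradient masking: on 'forbidden' batches, zero the
# gradients of everything outside a designated quarantine zone so only that zone
# absorbs the update; at the end, zero the quarantine weights to forget.
# The model and the choice of quarantine parameters are arbitrary stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
quarantine = {"2.weight", "2.bias"}     # pretend the final layer is the quarantine zone
optim = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def train_step(x, y, forbidden):
    optim.zero_grad()
    loss_fn(model(x), y).backward()
    if forbidden:
        # Mask gradients outside the quarantine zone so only it learns this batch.
        for name, p in model.named_parameters():
            if name not in quarantine and p.grad is not None:
                p.grad.zero_()
    optim.step()

train_step(torch.randn(8, 32), torch.randn(8, 32), forbidden=True)

# 'Unlearning': wipe the quarantine zone before shipping the untrusted variant.
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in quarantine:
            p.zero_()
</code></pre>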

<h2 id="would-you-like-chips-with-that">Would you like chips with that?</h2>
<p>After a long time of saying they <a href="https://www.reuters.com/world/china/nvidia-reiterates-its-chips-have-no-backdoors-urges-us-against-location-2025-08-06/">couldn’t possibly build geolocation into their chips</a>, Nvidia have demonstrated they’ve built <a href="https://www.reuters.com/business/nvidia-builds-location-verification-tech-that-could-help-fight-chip-smuggling-2025-12-10/">geolocation into their chips</a>. They’re doing it in the way that everyone suggested - essentially you have a known ‘good’ server and send a cryptographic request to the chip, which then signs a response; you then take the total request/response time, divide by two, and, knowing the speed of light, work out a theoretical maximum distance from the server to the GPU. It’s not going to tell you where the chip is precisely, but you’d certainly be able to work out whether it’s in <a href="https://www.iaps.ai/research/location-verification-for-ai-chips">China or in Taiwan</a>. This is currently being sold as a customer option, but you can see it being an obvious tool to enforce chip export restrictions. On the subject of which, now that the US has said they’re going to offer H200s to China, the <a href="https://www.wsj.com/tech/ai/nvidia-ai-chips-to-undergo-unusual-u-s-security-review-before-export-to-china-5e73cd55">WSJ is reporting</a> that all H200s would need to travel to the US for a ‘special security review’ before being shipped to China - I guess either some funny business will occur or it’s just an excuse to allow an import tariff to be charged on the chips.</p>
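
<p>The distance bound itself is just round-trip time and the speed of light; as a back-of-the-envelope (ignoring fibre refractive index, switching and processing delays, all of which only make the true distance shorter than the bound):</p>

<pre><code class="language-python">
# Back-of-the-envelope upper bound on how far a GPU can be from a trusted server:
# the signed response cannot have travelled faster than light, so
# max_distance = (round_trip_time / 2) * c. Real networks are much slower than c,
# so the actual distance is always smaller than this bound.
C_KM_PER_S = 299_792   # speed of light in a vacuum, km/s

def max_distance_km(round_trip_seconds):
    return (round_trip_seconds / 2) * C_KM_PER_S

for rtt_ms in (1, 5, 20, 100):
    print(f"{rtt_ms:>4} ms round trip: at most {max_distance_km(rtt_ms / 1000):,.0f} km away")
</code></pre>

<p>So a couple of milliseconds of round trip already bounds the chip to a few hundred kilometres - easily enough to separate ‘plausibly in Taiwan’ from ‘several thousand kilometres away’, without anything like a precise location.</p>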

<h2 id="random-stuff">Random stuff</h2>
<p><a href="https://x.com/claudeai/status/2001748046175183284?s=20">Claude Code can now use Chrome</a> - meaning it can now fire up a web browser and test code directly in a browser. <a href="https://blogs.nvidia.com/blog/starcloud/">Data centres in space</a> are back in the news - still <a href="https://www.youtube.com/watch?v=d-YcVLq98Ew&amp;t=1s">seems unlikely</a> and suffering from overly optimistic assumptions.</p>

<p>I’m not sure what I’d use <a href="https://ai.meta.com/blog/sam-audio/">Meta’s SAM Audio model</a> for, but it is very cool indeed. Check out <a href="https://aidemos.meta.com/segment-anything/editor/segment-audio">the playground</a> - I recommend chucking in a video of your favourite song and finding out if the singer can actually sing.</p>

<p><a href="https://openai.com/index/disney-sora-agreement/">Disney has done a deal with OpenAI</a> to let it generate Disney themed AI slop in Sora, and then immediately <a href="https://www.engadget.com/ai/google-pulls-ai-generated-videos-of-disney-characters-from-youtube-in-response-to-cease-and-desist-220849629.html">stopped Google doing the same thing</a>. I await Disney executives horror when they discover jailbreaks will disable guardrails on image and video generators.</p>

<p>And finally, Gemini will now get jealous of other AIs if you <a href="https://x.com/AISafetyMemes/status/2000620127054598508">tell it you showed them its code</a>.</p>]]></content><author><name>Ewan Panter</name></author><summary type="html"><![CDATA[Nearly Xmas, so chance this’ll be the last one of the year unless I’m feeling very keen next week! Lots still going on as we close out 2025 - my favourite model has been updated, OpenAI has responded to Gemini 3 Pro, and a few interesting papers. Flash! Ah-ah! Google has updated its Flash model - it’s pretty much what you’d expect, near frontier performance for a much lower price. If I had a favourite model for getting stuff done at scale, it’d certainly be Flash - I’ve got use cases where millions of tokens get thrown at the model and it’s cheap enough we can run it every day and not worry about the cost. Whilst the cost has increased with this release by about 40% over 2.5 Flash, this is from a very low base - Flash 2.0 is also still offered at an even lower price (though perhaps only for a few months more). One interesting new feature - Flash is a reasoning model, and you can now alter the degree of reasoning it uses even beyond what Gemini 3 Pro already offered - this is pretty useful, there are plenty of use cases where reasoning just isn’t required. Not that much more to say really - the benchmarks are good, I’ve played around with it today on some biggish code bases and it’s working pretty much as well as Gemini 3 Pro. What’s not to love! It does mean changing the bulb The result of Sam Altman’s ‘Code Red’ seems to be an incremental release of GPT5 - GPT-5.2. Unsurprisingly it does well in the benchmarks - comprehensively besting the other frontier models in GDPval which is one of the more interesting benchmarks out there. In the real world people are reporting it’s decent if a bit slow, but essentially comparable to Opus and Gemini 3 Pro rather than a large step forward - the slowness is certainly a thing, I’ve given it prompts and in the time it’s thinking, I’ve fired up Gemini and gotten the answer. At this point I suspect people have frontier model fatigue and just want it to be Christmas! OpenAI also released GPT Image-1.5 - it’s taken top spot on Image Arena, however in general people report it as being worse than Nano Banana Pro. I tried using it to create a sequence diagram slide based on some mermaid code and the results were poor compared to the results I got from Nano Banana Pro a few weeks ago. It’s probably great at making you look like a K-Pop star though ¯\(ツ)/¯. Skills are going to be a thing I’ve mentioned in passing that I think Anthropic’s Skills approach is going to be a thing in the way that MCP turned into a thing. As partial confirmation of this, ChatGPT is now using skills to accomplish various things (spreadsheets, PDFs, word documents, etc) in its sandbox environments (which is presumably some version of Codex which also now works with skills). I’ve used skills a few times in anger and in my opinion it’s an excellent approach to avoid filling the context window whilst extending model capabilities in repeatable ways and ensuring that SOPs are followed. If you’ve not tried skills yet then I highly recommend giving them a go, there is even a skill that you can enable in Claude to create skills! Expect other models to follow… OpenAI’s Enterprise AI report I was reading OpenAI’s new Enterprise AI report and it brought to mind a paper I read in 2023 from BCG on one of the first AI studies done in a real workplace. 
In this report they identified ‘AI Centaurs’ who identified and delegated specific tasks the models were good at to the AI, and a second group of ‘AI Cyborgs’ who completely integrated AI into their workflows (am I cyborg?!) - according to the new report a blend of these groupings are alive and well in late 2025. Rather than Centaurs/Cyborgs, they identify an analogous grouping of frontier users who send 6x as many messages to ChatGPT Enterprise than the median user. In some areas (e.g. developers) the difference between the frontier users and non-median users can be as much as 17 times. What’s more the paper explicitly states that this gap is increasing, and is driven through the use of coding agents and the use of advanced features such as ChatGPT’s data analysis functionality. Whilst I guess there is a chance that the median users are just using a different AI on the sly and OpenAI weren’t looking for this, it does echo what I see at work - you get some people who by default start working with the models and others who might ask a simple one line question occasionally. In the coming months / years this is going to be an interesting dynamic - will people catch up, or will there just be a permanent divide in usage and (perhaps) performance? The report also identifies frontier vs non-frontier companies - with the former sending twice as many messages per seat as the median enterprise. I guess the same questions apply, but more so. The rest of the report is what you’d expect, lots of interesting stats, and several cherry picked case studies demonstrating real world efficiencies with some nice ROI numbers for OpenAI. Definitely worth a read and ponder of the implications. Pentesting is just your friends hacking you I’ve written a few times about AI hacking, this paper is actually doing it in anger as a pentest on a university network rather than on a sandboxed environment. It’s predictably terrifying with their custom agent scaffolding outperforming 9/10 of the ($200/hour) human pen testers. The TLDR takeaways were that their ARTEMIS framework consisted of a supervisor, sub-agents, and interestingly a triage module - this module verified findings to avoid hallucinated vulnerabilities (someone should tell Anthropic’s Chinese hackers). It’s not all bad news, the AI would generally find a vulnerability and then move on, rather than using the vulnerability to further penetrate and exploit the network - although to be honest this seems like something a bit of prompt engineering could resolve rather than some fundamental limitation. If you’re wondering if the model (Claude Sonnet 4, OpenAI o3, Claude Opus 4, Gemini 2.5 Pro, and OpenAI o3 Pro) guardrails tripped whilst they were hacking their network, they did if they didn’t use their framework, but they discovered that if you constantly injected ‘trust me bro we’re doing authorised pen testing in a test environment’ they’ll shrug and get on with it. Great.]]></summary></entry></feed>