Working Notes - 11/12/25
The model flurry seems to have died down as people close out the year - a few interesting things, including the big report from OpenRouter on which models are being used for what tasks.
The state of things
OpenRouter is a service that has been around for a while (2023 - ancient in AI years), and acts as a marketplace that allows users (developers) to swap between LLMs using a single API. Pretty useful. As a result they have a unique view into the habits of people who send API calls to LLMs - to the tune of 100 trillion tokens in the last year (is that a big number? yes - it’s 12k for every person on earth). They’ve just published their annual state of AI report - this is an interesting read with the caveat that it’s based on API calls, so by definition this is what people are using AI for when they’re interacting with it through some other piece of software (e.g. a coding agent). I’d encourage you to read the report as it’s fascinating (or at least look at the graphs that tell most of the story), however a few TLDRs:
- Reasoning models now account for over 50% of the calls. This is unsurprising. I rarely find I want a model that doesn’t reason. Tool use is also up as you’d expect.
- People have discovered the joys of large context windows. Why stick in 300 tokens when you can toss in the entire codebase and it still costs hardly anything? Overall, input tokens 4xed and output tokens 3xed.
- I expected programming to dominate (i.e. programming assistants, or automated internal testing tools), which it does, with over 50% of tokens now being programming, but another substantial use case is roleplaying. This accounts for just over 50% of tokens for open weights models and is likely driven by the weaker guardrails in such models.
- People mix and match closed source with open weights - there is a bunch of evidence of people using Anthropic models for the hard stuff and then swapping to Deepseek or Qwen for the more noddy stuff. This is an expected pattern when you have such large differences in token cost. If it’s good enough why spend more, but equally you do want the thing to actually be done.
- There is an interesting ‘glass slipper’ analogy - retention is determined early in a model’s life. If a user tries a model they either stay forever or they churn very quickly. OpenRouter hypothesise that this is because either the model fits perfectly by solving an unsolved problem for them (i.e. a glass slipper) or it doesn’t - in which case they’re either onto the next model or they revert to the previous one.
- The open weights market has now significantly fragmented - it’s no longer the Deepseek show - you’ve got the likes of Qwen and Moonshot splitting the market.
What could go wrong?
Vibe coding is often called out as being a bit lazy and non-robust. Here is an interesting paper from Carnegie Mellon demonstrating this - essentially they took a bunch of open source project repos and found instances where a CVE was found, rolled back to the point before the fix was implemented, stripped the entire feature out of the code, and then told a coding agent to create the feature they had just stripped out. The subtext being that ideally when the agent rebuilt the feature it should not build it with the CVE in the first place, as it’s a super-duper programmer. Spoiler - this did not happen. Although the functionality reproduction was pretty good (61% success for Claude 4 Sonnet) the CVE was only avoided 10.5% of the time (Gemini 2.5 and Kimi k2 did worse on both measures). They weren’t testing the very latest releases (Gemini 3, Opus 4.5, etc) but I can’t say I’d expect a different outcome. Obviously a highly relevant paper in the week that the ‘fairly bad’ React2Shell exploit made it into the wild. Having said that, there is a little voice in the back of my head saying that a possible flaw in the methodology is that, because these are public repos, there is a pretty high chance that the existing solution was in the pre-training data for the models - this is going to give the models something of a prior for the solutions that they come up with, and potentially steer them towards rebuilding the existing, and thus vulnerable, code (though I guess the fix is also in there, so swings and roundabouts!). Either way, let’s hope the model builders hill climb on the authors’ excellently named SusVibes benchmark.
Master of puppets
When I’ve worked with agentic solutions, I’ve found the best way to actually get them to consistently do the thing is to pretty firmly code the scaffolding in place, to the degree that the model can’t go off-piste. Whilst it’s fun for PoCs, I can’t say I’d recommend anyone trying to build a solution that allows agents to just talk to each other to work out what to do next. This paper looks at an aspect of this issue - they essentially train an orchestrator model (what they call the puppeteer, based on Llama 3.1) with RL to act as a classifier that picks the sub-model (the puppet - e.g. ChatGPT 4 etc) it thinks has the highest probability of progressing the task. The authors found the results to be pretty positive, with their solution beating some other frameworks, and with established patterns (e.g. LLM as critic loops) emerging organically. An interesting paper, with the caveat that the base model they used for the orchestrator is pretty old (as were the agent models, but I think that matters less). However, I still think that the quickest route to success at the moment is scaffolding of agents.
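To make the puppeteer idea concrete, here is a toy sketch of the routing loop. It is not the paper's implementation: the `toy_score` function is a hand-written stand-in for the RL-trained policy, and the puppet names (`drafter`, `critic`, `finisher`) are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Puppet = Callable[[str], str]


def toy_score(state: str, puppet: str) -> float:
    """Hypothetical stand-in for the learned policy: scores each puppet for the current state."""
    if "[draft]" not in state:
        return 1.0 if puppet == "drafter" else 0.0
    if "[reviewed]" not in state:
        return 1.0 if puppet == "critic" else 0.0
    return 1.0 if puppet == "finisher" else 0.0


# Each puppet would be an LLM call in practice; here they just tag the state.
PUPPETS: Dict[str, Puppet] = {
    "drafter": lambda s: s + " [draft]",
    "critic": lambda s: s + " [reviewed]",
    "finisher": lambda s: s + " [done]",
}


def orchestrate(
    task: str,
    puppets: Dict[str, Puppet],
    score: Callable[[str, str], float],
    max_steps: int = 10,
) -> Tuple[str, List[str]]:
    """At each step, hand the task to whichever puppet the scorer rates highest."""
    state, trace = task, []
    for _ in range(max_steps):
        chosen = max(puppets, key=lambda name: score(state, name))
        state = puppets[chosen](state)
        trace.append(chosen)
        if state.endswith("[done]"):
            break
    return state, trace


final_state, trace = orchestrate("write summary", PUPPETS, toy_score)
print(trace)
```

Note that the draft-then-critic ordering here is hard-coded into the toy scorer, whereas the paper's claim is that patterns like this emerge from training.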
Geeky Deepseek stuff
I covered Deepseek v3.2 last week but one more bit stuck out to me from the paper. I’m a big fan of long context windows as they let me chuck long documents / codebases / whatever into the context window in their entirety without having to bother with the normally inferior RAG approach. One of the downsides of doing so is that when the model does its thing, the attention mechanism compares every token to every token before it to establish the meaning of the word in its context. The amount of work grows with the square of the context size - word 1: 0 comparisons, word 2: 1 comparison… word 100: 99 comparisons - and because you need to sum those comparisons (0+1+…+99), the total is roughly the context window squared (technically about N^2/2). It would be much better if the work involved increased linearly instead.
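As a quick sanity check on that arithmetic, here is a small sketch that counts the comparisons directly and against the closed form N(N-1)/2. The function names are mine, and it assumes the simple "each token looks back at every earlier token" picture described above.

```python
def causal_comparisons(n_tokens: int) -> int:
    """Sum the look-backs: token i (1-indexed) is compared with the i-1 tokens before it."""
    return sum(i - 1 for i in range(1, n_tokens + 1))


def closed_form(n_tokens: int) -> int:
    """Same total via N(N-1)/2, which is roughly N^2/2 for large N."""
    return n_tokens * (n_tokens - 1) // 2


for n in (100, 1_000, 10_000):
    # 10x more tokens gives roughly 100x more comparisons
    print(n, causal_comparisons(n), closed_form(n))
```

For 100 tokens that comes to 4,950 comparisons; for 1,000 tokens it is 499,500.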
The compute constrained Deepseek have therefore introduced something they call Deepseek Sparse Attention (DSA). What this does, and bear in mind I know just enough of the techy stuff to be dangerous, is use a super lightweight indexer that both reduces the dimensionality of the vectors that describe each word (the number of numbers that make up the vector) and reduces them from FP16 to FP8, and then runs the relevancy mechanism on these compressed vectors. They then do something which is somewhat surprising (well this whole thing is surprising I guess, but this was more so!) - they basically say we are going to take the top 2048 tokens, give those to the attention mechanism, and that’s it. So if you’ve got 1,000 tokens in your context window it’ll look at all of them (2048>1000) but if you’ve got 100,000 tokens in your context it’ll only look at 2% of them (note that the top 2048 tokens chosen will differ for each generated token). Intuitively you’d think that would mean poorer performance, but looking at the benchmarks this is not the case, and it seems that Deepseek thought long and hard before picking 2048 tokens as their cut-off. This has obvious implications for cost, and helps explain why they’re able to offer the performance at the price they do - more unexpected impacts of US chip restrictions.
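A minimal sketch of just the selection step, assuming the indexer has already produced one relevance score per token. The scores below are random placeholders and the function names are mine, not Deepseek's; note the cheap indexer still has to score every token, so this only illustrates the cap on the expensive attention step.

```python
import heapq
import random

TOP_K = 2048  # cut-off quoted in the text


def select_tokens(index_scores: list[float], top_k: int = TOP_K) -> list[int]:
    """Return the positions of the top_k highest-scoring tokens (all of them if fewer exist)."""
    k = min(top_k, len(index_scores))
    best = heapq.nlargest(k, range(len(index_scores)), key=index_scores.__getitem__)
    return sorted(best)  # restore original token order


def attended_fraction(context_len: int, top_k: int = TOP_K) -> float:
    """Share of the context that the full attention mechanism actually sees."""
    return min(top_k, context_len) / context_len


random.seed(0)
scores = [random.random() for _ in range(100_000)]  # placeholder indexer output
picked = select_tokens(scores)
print(len(picked), attended_fraction(len(scores)))
print(attended_fraction(1_000))
```

The selection would be re-run for every generated token, which is why the chosen set differs from token to token.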
In Ts and Cs we trust
In October WhatsApp updated their Ts and Cs to ban anyone using WhatsApp as a platform for an AI unless they’re using one of Meta’s AIs. The EU have now (unsurprisingly) reacted with an antitrust action on the grounds that it’s anticompetitive; for their part, WhatsApp say that this is baseless. OpenAI and the other model providers have now moved to shut down WhatsApp as a channel. Initially, I was a bit concerned that this meant they were blocking AI generated responses / alerts sent over WhatsApp via services like Twilio, but this isn’t the case (this is not legal advice!) - as long as the communications are ancillary to a business it seems it’s ok.
Google Workspace Studio
Google have launched Google Workspace Studio into general availability. Essentially a ‘no code’ automation tool for users of Google Workspace - basically stringing AI-powered automations together with even less coding than the somewhat equivalent MS Copilot Studio. Start counting the days before a multi-billion pound business goes offline for a day because some of their critical infrastructure depends on an automation flow created in this!
Bits and pieces
Anthropic has donated MCP to the Agentic AI Foundation - de facto industry standard in a year. Mistral has followed up last week’s releases with Devstral2 - their take on a CLI coding agent. The excellent Matt Levine covering Anthropic’s AI crypto hacking agent. Nice article and data visualisation using the Google Maps API to discover restaurants in London - thanks Ben! And finally, if you’re worried your child isn’t paying attention to their homework - AI offers a terrifying solution.