Working Notes - 12/11/25

Not too many notes this week - I’ve been in training all week and have had a nasty cold (thanks kids!).

More progressive exploration

Anthropic is offering better practices for tool use within MCP. Instead of loading every tool definition into the context window, this approach lets the model progressively explore the available tools as required and write code to call them. The code is then executed in a sandboxed env. For example, if we had a tool that operated on a large text file, the model would write code to run the tool against the file within the env's file system rather than loading the large file into the actual context window. If the files and/or MCP server descriptions you are using are large, this obviously saves a great deal of the context window. This progressive discovery of tools and execution on a virtual file system does seem to be ‘the way’ and is very similar to Anthropic’s other recent work - Claude skills (which I love btw).
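To make the idea concrete, here's a minimal sketch of the pattern (tool and file names are hypothetical, not Anthropic's actual API): rather than the MCP client dumping a large file into the model's context, the model writes code like this, which runs in the sandbox and returns only a compact result.

```python
# Hypothetical model-written code, executed in the sandboxed env.
# The large file stays on the sandbox file system; only the small
# summary dict re-enters the context window.

def summarise_errors(path: str, keyword: str = "ERROR") -> dict:
    """Scan a large log file on the sandbox file system and return
    only the small result the model actually needs in context."""
    matches = []
    total = 0
    with open(path) as f:
        for line in f:
            total += 1
            if keyword in line:
                matches.append(line.rstrip())
    return {
        "total_lines": total,
        "error_count": len(matches),
        "sample": matches[:5],  # a few examples, not the whole file
    }

if __name__ == "__main__":
    # Create a stand-in "large" file so the sketch is runnable.
    with open("app.log", "w") as f:
        f.write("INFO ok\nERROR disk full\nINFO ok\nERROR timeout\n")
    print(summarise_errors("app.log"))
```

The point is the shape of the interaction: a multi-gigabyte file costs the model almost nothing, because only the returned dict ever touches the context window.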

A telco benchmark you say? With a dual sided agentic test harness? Tell me more!

Sierra Research has made a telco agent benchmark called tau2-bench. They’ve created a test harness which provides a simulated world both for the agent being evaluated and for a simulated user - this is quite novel: either party can make a change via tools and the other side will see that change. E.g. if the agent suggests you’re in airplane mode to explain why your data isn’t working, the user can toggle airplane mode on a simulated mobile phone and the agent can then evaluate the result. You can either use their reference agent architecture with whatever LLM you want to evaluate, or you can swap the agent out for your own customer service agent (assuming you hack it to output the required format, which tbf is pretty generic). The user is always simulated by the same LLM (for some reason they’ve chosen GPT-4.1, which probably seemed a good choice at the time). Clever stuff. You can look at the leaderboard here or read the paper here.
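The core idea of that dual-sided harness can be sketched in a few lines (this is my own illustration, not tau2-bench's real API): one shared world state that both the agent's tools and the simulated user's tools can read and mutate.

```python
# Illustrative sketch of a dual-control world state (hypothetical class,
# not tau2-bench's actual implementation).

class SimulatedPhone:
    """Shared world state visible to both sides of the conversation."""

    def __init__(self):
        self.airplane_mode = False

    # User-side tool: the simulated user acts on the agent's suggestion.
    def user_toggle_airplane_mode(self):
        self.airplane_mode = not self.airplane_mode
        return {"airplane_mode": self.airplane_mode}

    # Agent-side tool: the agent checks whether the fix worked.
    def agent_check_data_connection(self):
        return {"data_working": not self.airplane_mode}

phone = SimulatedPhone()
phone.airplane_mode = True                  # the fault scenario
print(phone.agent_check_data_connection())  # agent sees data is down
phone.user_toggle_airplane_mode()           # user follows the agent's advice
print(phone.agent_check_data_connection())  # both sides see the change
```

Because both parties act on the same object, the harness can score whether the agent's advice actually fixed the simulated fault, not just whether the conversation sounded plausible.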

EU kicking the can down the road

It’s being widely reported that the EU is going to water down the EU AI Act. I have mixed feelings on the act - because it is use-case focused, I think there is a lot of scope for constraining innovation, and from my experience in the real world it certainly has a chilling effect on AI implementation. The current reporting suggests that most of the rules for items defined as high risk within the act will be delayed by a year, and no one will be fined for a further year after that. In a lot of ways that just further muddies the water by kicking the can down the road - AI is likely to be quite different in two years’ time, and the rules might not even make sense then.


Working Notes - 6/11/25

Lots of things to cover this week - new models with terrible names and a deep dive into the water use of AI.

Does AI use incredible amounts of water?

Andy Masley offers an in-depth look into how much water AI uses. Spoiler: quite a bit, but then so does everything else. I find it quite useful when thinking about big numbers in the news to channel my inner David Spiegelhalter and ask ‘is that a big number?’. For example, the water footprint of a pair of jeans is nearly 11,000L, which is the equivalent of approximately 5.4m prompts (you can argue that number up or down an order of magnitude, but that doesn’t alter the point). As ever, context is everything (pun intended).
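Working the jeans comparison backwards gives a feel for the per-prompt figure it implies (the inputs are the numbers above; the per-prompt result is just the division, not a measured figure):

```python
# Quick sanity check of the jeans-vs-prompts comparison.
jeans_water_l = 11_000          # water footprint of one pair of jeans, litres
prompts_equivalent = 5_400_000  # prompts with the same footprint, per the post

water_per_prompt_ml = jeans_water_l / prompts_equivalent * 1000
print(f"{water_per_prompt_ml:.2f} mL per prompt")  # ≈ 2 mL
```

Roughly 2mL per prompt - about half a teaspoon - which is the sort of number that makes the ‘is that a big number?’ question easy to answer.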

Awful name, good model

MiniMax (yet another Chinese AI company) have released a new open-weights model called M2 (MoE with 10B active parameters etc etc). It gets good benchmarks for an open-weights model. I’ve not properly used it in anger, but gave it a complex task (compare my house’s council tax valuation with nearby properties) and it gave it a decent stab. As MiniMax is a Chinese-based company, corporate privacy and security (and implementing the TSA) may rule it out for actual prod use even if self-hosted.

Anthropic is corporate number one

Anthropic has seemingly overtaken OpenAI in corporate LLM token use according to various sources. This is pretty interesting, and doubtless driven by use of Claude Code. I’m a big fan of the Claude models, esp the new Claude skills functionality. I do note that the articles don’t mention whether MS Copilot token use is counted in the API share - I would guess not (not least as MS host their own instances of OpenAI models) and would assume including it might change what the graphs look like. Anthropic have also been doing interesting things in tailoring their products to specific industries / corporate use cases - I’m currently on the wait list for Claude for Excel, and it will be interesting to see if they’ve come up with a robust way of handling tabular spreadsheet data (as MS Copilot for Excel is not great).
