Working Notes - 15/01/26

First week in a while with no major model release to cover - instead I’ve been using Claude Code to do non-coding work, looked into Google UCP and what it means for retailers, faked my voice from five seconds of audio with a model that will run on pretty much anything, and investigated an agentic approach that enables efficient use of extremely large contexts. 

It’s coming

There has been a lot of hype in the past few weeks about Claude Code with Opus 4.5 - most of which I’d put down to people using the Christmas holidays to have a play with it and realising that it’s got really good. If you’ve not played with Claude Code recently, you should probably go do that!

One current trend is using Claude Code for non-coding work. It is surprisingly(?) excellent at this - for example I needed to make a short deck on something the other day, and I used Claude Code to create an extremely decent first draft of it. I recorded a brain dump of all my thoughts on the subject, exported the transcription from my phone, put it in a directory that Claude Code could access, installed Claude Office Skills to enable it to create PPTs, and then just told it to make the deck - off it went and many thousands of tokens later I had my deck.

At its core, this is what Anthropic have launched this week with Claude Cowork but with the addition of a separate VM and a nice non-CLI frontend. This has been received pretty positively, and judging from the very slow performance Opus 4.5 has been showing on the Claude website/app I think it’s fair to say Anthropic’s servers are being hammered. Currently it’s a macOS only research preview and only available to people on the $100/$200 per month plan. Anthropic call out risks over and above the normal privacy ones - importantly if you’re using it with browser use enabled you should only be using it on ‘trusted’ websites. In theory someone could prompt inject Cowork by displaying a nefarious prompt on their website and delete files on your actual computer.

In a nutshell, this means no more uploading or copy & pasting documents to a chat/teams/web interface - now you just point Cowork at the folder in your computer, tell it what you want, and it’ll asynchronously access websites, build ppts, organise the files, give you progress reports, and let you chip in with additional thoughts/changes as it reports how it is getting on. Whilst technically the solution they’ve released is not really anything new over and above what you could do with Claude Code already, it’s perception that matters. Gone is Claude Code’s command line, instead here is a nice friendly interface that enterprise leaders can try themselves and experience the capabilities of the most advanced universal agents. I expect this to have an impact beyond ‘just’ a research preview and be a step towards the mass adoption of general agents - I’ve no doubt OpenAI and Google will be launching their own versions shortly.

One final factoid that has come to light - Anthropic are claiming the solution was built in two weeks entirely with Claude Code. Pretty impressive.

Would you like three letters with that acronym?

It’s noteworthy (at least to me!) how many open standards are being launched to shape the AI ecosystem, compared to the far more organic development of internet commerce. Google has already launched A2A, AP2, and A2UI—all of these are standards for agents to interact and transact with each other.

They’ve now launched another standard named the Universal Commerce Protocol or UCP. Upon hearing the name, I initially figured it was to solve the problem of how agents can find out about your products and then buy them. And it is, but the first half of that statement (the finding or discovery of your products) isn’t ready yet. Instead, the capabilities they’re launching now are identity, checkout, and ordering. For further techy detail, there are a few worked examples in the GitHub repo and they also have a playground on the main site.

As you’d expect, once you’re set up, other users of the protocol won’t need to integrate with you individually. What the take-up of this will be remains to be seen, but I’d expect it to be fairly good as most retailers are already in the Google ecosystem (AdWords/Google Ads, SEO, etc). Google has made it clear that this will soon power purchases within Google AI Search and the Gemini app—both large enough markets that it’s worth it for many retailers to do this even if the open aspect doesn’t take off.

You can’t trust everything you hear

Another reminder that you really shouldn’t be using your voice as your password (no matter what your bank says), given the state of generative text-to-speech systems: here is a new model from Kyutai that will run on just your CPU and clone a voice from under 5 seconds of speech. I was a bit sceptical so I installed it from Hugging Face in a fresh conda env, and can confirm that it will indeed give a convincing clone of your voice from a few seconds of audio after downloading the few hundred meg of model weights. Given the proliferation of real time and near real time voice cloning technology, companies (and everyone for that matter) need to be cognisant that just because a voice sounds like someone you know, it doesn’t actually mean that it’s them.

Recursive language models

I found this paper on recursive language models from MIT to be an interesting read. A lot of enterprise use cases are essentially search and summarise tasks - often this means dealing with very large potential contexts. You could just toss the entire thing into the model context window (and often this is worth doing if it’s a one off) but at a certain point that becomes expensive and slow.

The technique in the paper seems to take inspiration from how coding agents work - when you use these tools they don’t load the entire codebase into the context window, instead they note that they have access to (say) five Python files and then search / manipulate the files as required. Likewise, this approach treats a potentially huge (10M+ tokens) context as a variable that the model can regex search or split into smaller chunks, executing agents against each chunk.

Let’s say you had a question of “build me a RACI for all the roles in this document” - if you tried to use traditional semantic RAG for that it’d totally fail, as the task needs every part of the context examined rather than just the top k semantically similar chunks. This solution will work, because it will split the document into lots of chunks, assess whether a role is discussed in each chunk, and then write the role summary back to a findings array for the root model to process.
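
To make the shape of that loop concrete, here is a minimal sketch of the chunk, check, and collect pattern. It is not the paper's implementation. The function names, the line-based chunking, and the 'Role: ...' convention are all my own assumptions for illustration, and find_roles_in_chunk is a regex stand-in for what would really be a sub-model call.

```python
import re
from typing import Callable, List


def find_roles_in_chunk(chunk: str) -> List[str]:
    """Hypothetical stand-in for a sub-model call.

    A real system would prompt an LLM with the chunk; this toy version
    just picks out lines written as 'Role: <name>'.
    """
    return [m.strip() for m in re.findall(r"Role: *([A-Za-z ]+)", chunk)]


def recursive_query(
    context: str,
    sub_call: Callable[[str], List[str]],
    lines_per_chunk: int = 2,
) -> List[str]:
    """Treat the context as a variable: split it, query each chunk, aggregate."""
    findings: List[str] = []
    lines = context.splitlines()
    for start in range(0, len(lines), lines_per_chunk):
        chunk = "\n".join(lines[start:start + lines_per_chunk])
        # Each chunk is handled on its own, so no single call sees the full context.
        findings.extend(sub_call(chunk))
    # Root step: de-duplicate while keeping first-seen order.
    return list(dict.fromkeys(findings))


document = (
    "Intro text\n"
    "Role: Project Manager\n"
    "Some detail\n"
    "Role: Data Engineer\n"
    "Role: Project Manager\n"
    "Closing"
)
roles = recursive_query(document, find_roles_in_chunk)
print(roles)
```

The point of the sketch is that every chunk gets examined, which is exactly what top k retrieval skips. In a real setup the root model would then turn the findings array into the RACI.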

If you’re working with very long potential contexts and need to aggregate information from across the whole context this approach is pretty robust and will likely result in more accurate results than semantic RAG.

Evaluating agents

Most evals are currently based on just assessing the text output - whilst this works well for things like whether an AI can find an answer from context or recall a fact, it works much less well if we want to evaluate an agentic process. Often the text output doesn’t really let you assess whether the agent has done what you want it to have done. This post from Anthropic is a deep dive into how to better evaluate agentic systems.

It’s all pretty logical - if you want the agent to create code then you have a gold-standard deterministic approach you can use - you run the code against your tests. If it’s a bit fuzzier than that, then use an LLM as a judge (noting that you have to calibrate against a known good standard). Build a harness that you wipe each time so your test isn’t passed because an agent left a file there on a previous run. Don’t grade the path as agents are not deterministic and may find an unexpected route to get the right answer (they give a good example of this from Opus 4.5 testing where the model found a loophole in a customer service policy that allowed the goal to be achieved).
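
As a rough illustration of two of those points (a harness that is wiped every run, and grading the outcome rather than the path), here is a minimal sketch. It is my own toy, not code from the Anthropic post: run_trial, pass_rate, toy_agent and outcome_grader are hypothetical names, and the "agent" is just a function that writes a file.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_trial(agent: Callable[[Path], None], grader: Callable[[Path], bool]) -> bool:
    """Run one trial in a brand-new directory so nothing leaks between runs."""
    with tempfile.TemporaryDirectory() as workdir:
        sandbox = Path(workdir)
        agent(sandbox)          # the agent can take any route it likes
        return grader(sandbox)  # only the end state is graded


def pass_rate(
    agent: Callable[[Path], None],
    grader: Callable[[Path], bool],
    trials: int = 5,
) -> float:
    """Repeat the trial, since agents are not deterministic."""
    results = [run_trial(agent, grader) for _ in range(trials)]
    return sum(results) / trials


def toy_agent(sandbox: Path) -> None:
    """Hypothetical stand-in for a real agent run."""
    (sandbox / "report.txt").write_text("total=42\n")


def outcome_grader(sandbox: Path) -> bool:
    """Deterministic check on the final artefact, not on the steps taken."""
    report = sandbox / "report.txt"
    return report.exists() and "total=42" in report.read_text()


print(pass_rate(toy_agent, outcome_grader, trials=3))
```

Because the directory is created and deleted per trial, an agent that does nothing cannot pass off the back of a file left over from a previous run.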

If you’re trying to deploy agents, then it’s certainly worth a read. If nothing else it’s worth the reminder that if your agent is 90% reliable per step and it performs 10 steps, you’ll only have an overall reliability of 90%^10 or 35%… not what you want for a customer service agent. It’s for reasons such as this that I’m sceptical of customer facing production agent use without heavy scaffolding - most of the time they’re just not reliable enough to let them find their own path.
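
The arithmetic is easy to check for yourself. This small sketch assumes every step succeeds or fails independently with the same probability, which is a simplification, and the function names are mine.

```python
def end_to_end_reliability(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach a target end-to-end reliability."""
    return target ** (1 / steps)


overall = end_to_end_reliability(0.9, 10)
needed = required_per_step(0.95, 10)
print(f"10 steps at 90% each: {overall:.1%}")
print(f"Per-step reliability needed for 95% overall: {needed:.2%}")
```

Flipping the question round is the sobering part: under the same assumptions, getting a 10 step agent to 95% overall needs each step to be roughly 99.5% reliable.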

Other stuff

Skills being a thing continues - you can now use them in VS Code.

I wrote last week about using prompt caching to improve TTL, reduce costs, and improve performance over naive RAG based approaches - this article adds another layer to this by using semantics to better identify the correct cache to use to answer a question. It does suffer from the issue I mentioned last time about the implicit assumption that the question is semantically similar to the answer, but if your use case fits this, it’s worth a read.

Finally, there have been a number of AI health care releases in the past week - I guess this was inevitable given how much people use LLMs for health and fitness questions (for example, I use Claude as a pretty effective coach when training for running events). If you’re comfortable giving all your health information to OpenAI and aren’t in the EEA or UK, then this is for you. Probably a case of “don’t compare me to the Almighty, compare me to the alternative” - if you’ve not got access to a health professional or personal trainer, then I expect this to be worthwhile. Conversely, Anthropic’s recent release is actually pretty different and much more targeted at their enterprise customers rather than consumers.