Working Notes - 04/12/25
The deluge of model releases continues - Mistral coming out with an EU AI Act friendly release and Deepseek showing the US export restrictions do seem to work a bit. Also - it's the blog's one-month anniversary - a few people mentioned it was useful, so I'll see if I can get to two months now!
The winds of change
Mistral have released a new family of models - not quite at the bleeding edge but pretty close. All are open weight models, including their largest model - the imaginatively named Mistral Large 3 - which is something of a change from their previous largest models. I suspect this is a direct reflection of the increasingly SOTA models that Chinese companies are releasing as open weights (Kimi K2, Qwen, etc.) - they've also directly released an fp4 version of their largest model, presumably to make it easy for people to choose them. Important to note that this is not a reasoning model, so we'll need to wait and see if they release a Magistral version of it. The quite brief release note mentions it was trained on 3k H200s, which suggests this is their old cluster, as they announced an 18k cluster a few months back.
This release is pretty important, not because of the actual performance of the model (which seems fine, to be clear, but not noticeably better than the Chinese models) but because it means there's a model you could call EU AI Act native - this is quite a differentiator, as they've already prepared most of the legal boilerplate for the Act. There is a lot of internal pushback in European corporates against using Chinese models, especially if the company is infrastructure related; having an open source model option that meets all the Act's requirements, from a provider that has signed up to the General-Purpose AI Code of Practice, is potentially quite attractive to a lot of companies / government organisations. Obviously Mistral know this, and if you look at where they're focusing their marketing - it's exactly these areas.
More Opus
Been playing with Opus more and I do like it. I found it immediately solved a coding problem that had Sonnet going around in circles (making a sheet cutting optimising tool - it identified there was a known algorithm that could be used and implemented it). There is also a truly epic system card - I'm not going to pretend to have read it all, but like Gemini 3 the model achieved very high cyber attack performance. I guess expect to see it in AI cyber attacks soon! It also seemingly has a 'soul' spec, which is interesting reading, though it's worth reading the comments to understand the nuance of what this is or isn't (and this post from an Anthropic employee).
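For the curious, the classic approach to this kind of 1D cutting problem is a first-fit decreasing heuristic - a toy sketch of the kind of thing, not Opus's actual output, looks something like this:

```python
# Toy first-fit decreasing heuristic for 1D cutting stock (not Opus's code).
# Lengths in mm to avoid floating point surprises.
def first_fit_decreasing(pieces, sheet_length):
    """Sort pieces longest-first, place each in the first sheet with room,
    open a new sheet otherwise. Not optimal, but a solid baseline."""
    remaining = []  # spare length per open sheet
    plan = []       # pieces assigned to each sheet
    for piece in sorted(pieces, reverse=True):
        for i, spare in enumerate(remaining):
            if piece <= spare:
                remaining[i] -= piece
                plan[i].append(piece)
                break
        else:
            remaining.append(sheet_length - piece)
            plan.append([piece])
    return plan

print(first_fit_decreasing([2100, 1500, 1500, 900, 900, 800], sheet_length=3000))
# [[2100, 900], [1500, 1500], [900, 800]]
```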
US export restrictions having a measurable impact
Deepseek have a new model out - V3.2. As this is a reasoning model, I assume they're going to drop the R1 naming convention. It seems to perform very well, but there is always the danger of benchmark hacking. If you use the model on the Deepseek website you'll note it is quite verbose - something mentioned in the paper's concluding discussion. It takes two or three times the number of thinking tokens to reach the same answer as another SOTA model such as Gemini 3. The paper pins this on compute density (I guess intelligence per token), which the authors suggest is a result of having fewer total FLOPs to train on - their solution is to have it think for longer, do self-verification loops, etc. To me this is the most interesting aspect of the paper / model release - it's a concrete example of the US AI chip restrictions having a direct (and acknowledged) impact on Chinese AI progress.
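The 'think longer and self-verify' idea is easy to picture as a loop - this is my caricature of the pattern rather than anything from the paper, with generate and verify standing in for model calls:

```python
# Caricature of 'spend more thinking tokens via self-verification'.
# `generate` and `verify` are hypothetical stand-ins for model calls.
def solve_with_verification(prompt, generate, verify, max_rounds=3):
    attempt = generate(prompt)
    for _ in range(max_rounds):
        ok, critique = verify(prompt, attempt)
        if ok:
            return attempt
        # Burn more tokens: retry with the verifier's critique appended.
        attempt = generate(f"{prompt}\n\nPrevious attempt failed: {critique}")
    return attempt  # best effort once the budget is exhausted
```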
Reward balance matters
This paper goes into a lot of detail on how to do RL training on multi-step, tool-using agents. It suggests the use of a mask where the (direct) response from the tool is masked out when calculating the loss function, so the model is instead rewarded for (a) believing the output of the tool (i.e. if the tool said answer A, did it relay answer A to the user) and (b) whether the tool was called correctly (i.e. syntax correct and no error returned).
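A minimal sketch of what that masking looks like in practice - my own illustration rather than the paper's code, with token source labels assumed to come from the trajectory structure:

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, token_ids, source_labels):
    """Next-token loss over the model's own tokens only. Tokens that came
    back from a tool (source_labels == 1) are masked out, so the model is
    never trained to 'predict' text the environment produced. Shown as a
    plain likelihood loss; the same mask applies to the RL objective."""
    logits = logits[:, :-1, :]                  # predict token t+1 from t
    targets = token_ids[:, 1:]
    mask = (source_labels[:, 1:] == 0).float()  # 0 = model, 1 = tool output

    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).view(targets.shape)

    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```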
As a paper I found it a bit disjointed, as there is a load of stuff about Markov Decision Processes before they actually get to their implementation details, but what really struck me were the thoughts it triggered about failure mechanisms. Clearly I'm never going to implement RL training on a model, but as someone who uses AI in anger it's useful to consider the importance of balancing the rewards for process against the rewards for getting the correct answer. If the reward weight given to the process (b) is too high compared with the reward for actually arriving at the correct answer (a), then you're effectively training the model to go off on side quests where it correctly calls lots of tools that don't get it any closer to the answer. I guess it's a bit like work in that respect - it's not uncommon to see following a bureaucratic process rewarded more than actually doing the thing!
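To make that failure mode concrete, here's a toy reward composition (entirely my own illustration, with made-up weights): crank w_process up and an agent making lots of well-formed but pointless tool calls out-scores one that just answers correctly.

```python
def trajectory_reward(n_valid_tool_calls, answer_correct,
                      w_process=0.1, w_answer=1.0):
    """Toy reward: a process reward per well-formed tool call plus a
    terminal reward for a correct final answer. Weights are made up."""
    return w_process * n_valid_tool_calls + w_answer * float(answer_correct)

# With w_process too high, side quests win:
print(trajectory_reward(10, False, w_process=0.2))  # 2.0 - ten pointless calls
print(trajectory_reward(1, True, w_process=0.2))    # 1.2 - actually answered
```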
HunyuanOCR
One of the things I track closely in the world of AI is the deeply dull subject of document ingestion - mainly because a lot of the use cases I see at work are quite dependent on it. Whilst I'd love to mandate that people make their documents readily ingestible as markdown, I'm very aware that in the real world people are making pretty slides with weird layouts and hiding important information in jpegs of a project lifecycle diagram.
The solution to this is essentially fancy OCR - using multimodal LLMs to ingest complexly formatted documents as images and understand them. HunyuanOCR from Tencent is a new model that stood out to me because it scores highly on the benchmarks, does some interesting things like natively outputting structured formats (converting flowcharts to Mermaid, formulas to LaTeX, tables to HTML, text to markdown, etc.), and is also very small (1 billion parameters). This means you could potentially run it locally on a MacBook and avoid the usual internal resistance to using a Chinese-originated model. Sounds great - unfortunately there is a huge caveat that has somewhat tempered my enthusiasm: if you look in the license document it explicitly excludes the EU, UK, and South Korea. So that's annoying.
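I haven't actually run it, but at 1B parameters the local setup should be roughly the standard transformers vision-language pattern - the model id and prompt below are guesses rather than tested code:

```python
# Untested sketch: assumes HunyuanOCR follows the standard Hugging Face
# vision-language interface; the model id here is a guess.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "tencent/HunyuanOCR"  # assumed - check the actual repo name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("slide_with_diagram.png")
inputs = processor(images=image, text="Convert this page to markdown.",
                   return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```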
Olmo 3 and INTELLECT-3
Couple of interesting releases from the world of open source. Olmo 3 is probably the most interesting, as its technical paper goes through the full end-to-end training of the (dense rather than MoE) model, including datasets, from pre-training to RL. INTELLECT-3 is built on an existing MoE model (GLM-4.5-Air) and only covers how they approached RL. Whilst the models are not really directly comparable - INTELLECT-3 is a much larger model (MoE differences notwithstanding) - it's interesting that they approach their RL differently, and between them they probably give an indication of how the frontier labs are performing RL (OpenAI senior researchers are on record saying that Deepseek R1's technical paper gave everyone a leg up with GRPO). If you only read one, read the Olmo 3 one.
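For reference, the GRPO trick from the R1 paper is simple enough to sketch - sample a group of responses per prompt and use the group's own reward statistics as the baseline, so no separate value model is needed (my sketch, not either paper's code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as in GRPO: normalise each response's
    reward against its own group's mean and std, so the group itself
    acts as the baseline. `rewards` is (num_prompts, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 sampled responses each, binary correctness rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```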
Long running agents
Not James Bond, but Anthropic giving useful information on how to use agents on long running tasks in the real world. In particular they tackle challenges such as the models' tendency to try to one-shot everything, how to pass context to an agent that's starting with a blank context window, and how to get the models to test the stuff they build. All very coding centric, but I recognise quite a few of the issues from my own experience and have bodged some of the same techniques for passing on context when messing around at home. No point in me trying to summarise it all, but it's certainly worth a read.
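For what it's worth, my bodged version of the context handoff is basically a structured progress note - nothing like Anthropic's actual tooling, just the shape of the idea: the outgoing agent writes it before running out of context, and the fresh one reads it first.

```python
import json
from pathlib import Path

HANDOFF = Path("handoff.json")  # hypothetical handoff file

def write_handoff(goal, done, next_steps, gotchas):
    """Written by the outgoing agent before its context window is retired."""
    HANDOFF.write_text(json.dumps({
        "goal": goal, "done": done,
        "next_steps": next_steps, "gotchas": gotchas,
    }, indent=2))

def read_handoff():
    """Prepended to the fresh agent's prompt so it doesn't start cold."""
    if not HANDOFF.exists():
        return "No prior context - starting fresh."
    state = json.loads(HANDOFF.read_text())
    return (f"Goal: {state['goal']}\nDone so far: {state['done']}\n"
            f"Next: {state['next_steps']}\nWatch out for: {state['gotchas']}")
```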
Random stuff
Google have a movie(?)/documentary out about Deepseek - not watched it yet. Cool demo of the new 3B model from Mistral running locally on your computer to provide real-time captions to your webcam (note: 3GB download). OpenAI have an interesting looking alignment blog - the entry about reviewing code at scale stands out.