Working Notes - 29/01/26
A mix this week - some thoughts on OpenAI’s value-based pricing model pitch (spoiler: I’m a sceptic for most scenarios), testing Gemini 3 Flash’s new agentic vision feature, some new chips and models, and another scary voice cloner.
Future business models
Given the scale of the investment that AI companies have seen and the expected impact on enterprises, changes in their business models are obviously quite interesting. The OpenAI CFO, Sarah Friar, has started making noises about how she sees OpenAI starting to claw back some of their outlay. Her basic premise is that AI firms will move beyond the standard SaaS model of bums-on-seats or usage-based charging and towards a more consulting-style model of value-based outcomes.
This is a pretty interesting hypothesis and bears some scrutiny (I’ll take the statement at face value rather than as a funding round talking point). If you look at where this model currently works, you’ve got things like litigation (no win no fee), tender bidding support, M&A sourcing, payment providers, bug bounties, and recruitment. The common factor with all of these is either that the value is explicit (e.g. a 2% transaction fee by Stripe, or I decided to hire this guy) or that there is a neutral third party that will judge success (e.g. we won the litigation, I tested the bug, or we won the contract).
A corollary of this is that where there is no neutral measure of value, the approach is much less common. The main example is consulting companies doing value-based deals. From my own experience (management consultant for 11 years!), I’ve seen how difficult these deals can be at the end of a contract, even where everyone thought bulletproof definitions had been agreed a year earlier.
The other major factor that needs to be considered is competition. At the moment at least, it does not seem likely that OpenAI (or any AI company) will get a comparative advantage so large that any enterprise would be crazy to use an alternative provider. If OpenAI is asking for 10% of value and a competitor is offering a flat per-token price (or some other utility pricing measure), then a CFO is likely to go with the competitor. After all, the vast majority of intelligence that companies purchase today (aka actual humans) is bought on a flat-rate basis, with the only exceptions being very senior employees.
The only place I can see this working in the 2026/27 timescale is in joint ventures. If OpenAI can identify some high-value use cases with a clear neutral reward signal (e.g. a new drug makes it to a certain FDA milestone) and offer a deal along the lines of ‘you provide the data, I provide the compute and salary Opex, we split the value of the outcome’, then it’s going to be an interesting proposition for a CFO. The assessment won’t be around ‘can I get a cheaper deal’; it’ll be more of an opportunity cost discussion (what else could I do with this data, where else could I deploy these high-value employees).
This is a fairly bold play by Sarah Friar, but at the moment OpenAI doesn’t have a sufficient moat compared to other frontier labs to make it work for most enterprises. I expect things to go in the direction of utility pricing, which means that for any hope of recovering their capex the AI companies are going to need to see huge adoption in enterprise. The JV exception, if one emerges, will be industries with verifiable rewards like litigation and drug discovery. As with many things in AI, finding the correct reward signal is the key.
Agents with vision
Google has now enabled Gemini 3 Flash to behave agentically when given an image input. Normally when you give a model an image, it’ll look once and then give you its answer. However, when code execution is enabled, the new version of Flash will look and then run a tool-based agentic loop before providing the result. What this means in practice is that it’ll use Python to slice up the image and effectively zoom into the parts that it identified as interesting in its first look.
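To give a feel for the ‘slice and zoom’ step, here is a minimal sketch in plain Python. It is not the code Gemini actually writes (that isn't something I can show here, and in practice the model would more likely reach for an imaging library); the toy pixel grid and the `crop` and `zoom` helpers are assumptions made purely for illustration.

```python
def crop(pixels, left, top, right, bottom):
    """Return the sub-grid covering columns [left, right) and rows [top, bottom)."""
    return [row[left:right] for row in pixels[top:bottom]]


def zoom(pixels, factor):
    """Nearest-neighbour upscale: repeat every pixel `factor` times in each direction."""
    out = []
    for row in pixels:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Toy 4x4 'image' (hypothetical data): the interesting detail is the 2x2 block in the middle.
image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]

# First look: decide the middle block deserves a closer inspection.
region = crop(image, left=1, top=1, right=3, bottom=3)

# Second look: enlarge just that region before reading it again.
detail = zoom(region, factor=2)
```

The agentic part is the loop around this: the model picks the box from its first look, inspects the enlarged crop, and repeats if it still can't read what it needs.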
This is potentially quite useful for one of the most important enterprise AI use cases - ingesting existing Word documents and PDFs. These often contain diagrams, Gantt charts, process flows, etc. that you might either want to add immediately to a prompt or convert into markdown versions for later retrieval as part of a RAG pipeline. So does it work? My experience is a big fat sort-of. When I asked it to extract detail from an example Gantt chart, it certainly used Python to zoom in and manipulate the image, but it didn’t get the correct answers for everything. Whilst it got the order and task titles correct, it incorrectly recorded the start dates, end dates and durations. This may have been a prompting issue on my part, but I did try several approaches with similar results. Alternatively, it might be due to the need to cross-reference the bars against the date scale at the top of the chart.
My second example was much more successful – I found a fairly detailed consulting framework diagram showing an ML model lifecycle and it provided me with a word-perfect transcription, complete with all the required information about how each part flowed to the next.
So - from a conceptual point of view, this does seem like ‘the way’ but at the moment at least it’s worth being circumspect with the types of images to which you apply it. I would expect this to improve over time and it’s already a useful tool on a model as cheap as Gemini Flash.
Kimi K2.5
After a brief Xmas pause we’re back to a new model release from Moonshot - this builds on the K2 Thinking release in November. It has landed to a generally good reaction, particularly with respect to its coding abilities. All the benchmarks are suitably impressive, although as ever, treat these with a pinch of salt.
Other than the model now being natively multimodal (a first for Moonshot), the most interesting part of the release blog relates to what they are calling agent swarms. This is essentially what it sounds like - an orchestrator agent controls up to 100 sub-agents to complete a task.
The key thing is that they’ve trained the model to decompose tasks itself without any special prompting. They achieved this with RL, rewarding parallel execution early in training, then slowly removing that reward so that only success was rewarded. Additionally, the reward function penalised the critical task duration (i.e. the critical path in project management terminology); anything that happened whilst that longest task was completing was essentially not penalised. This discourages sequential execution (which lengthens the critical path) and encourages lots of shorter tasks. Clever, and their metrics show the swarm approach reducing time to result by up to 80%.
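To make the critical-path idea concrete, here is a minimal sketch in Python. It is an illustration rather than Moonshot's actual reward function: the shape of `reward`, the `time_penalty` weight, and the toy task names and durations are all assumptions.

```python
from functools import lru_cache


def critical_path(durations, deps):
    """Length of the longest dependency chain, i.e. wall-clock time with unlimited parallelism."""

    @lru_cache(maxsize=None)
    def finish(task):
        # A task can only start once its slowest prerequisite has finished.
        start = max((finish(d) for d in deps.get(task, ())), default=0.0)
        return start + durations[task]

    return max(finish(task) for task in durations)


def reward(success, durations, deps, time_penalty=0.01):
    """Hypothetical reward: a success bonus minus a penalty on the critical path only."""
    return (1.0 if success else 0.0) - time_penalty * critical_path(durations, deps)


# Four sub-tasks of 10 seconds each (made-up numbers).
durations = {"research": 10.0, "draft": 10.0, "review": 10.0, "merge": 10.0}

# Plan A: every task waits for the previous one.
sequential_deps = {"draft": ("research",), "review": ("draft",), "merge": ("review",)}

# Plan B: three tasks run side by side, then one merge step.
parallel_deps = {"merge": ("research", "draft", "review")}

print(critical_path(durations, sequential_deps))  # 40.0
print(critical_path(durations, parallel_deps))    # 20.0
```

Both plans do the same 40 seconds of total work, but only the sequential one pays for all of it. Under this kind of signal, the extra sub-agents in plan B are effectively free, which is the behaviour the training was trying to encourage.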
The chips are down
Microsoft have launched a new custom chip, the Maia 200. Unlike the chip’s predecessor (the Maia 100), the new chip is explicitly inference only rather than both inference and training. Whilst this sounds limiting, in reality, current RL training approaches actually involve loads of inference runs (e.g. generating synthetic data) so this will still be useful to MS (and therefore OpenAI) for its training runs.
One of the more interesting points in the blog / press release is that the chip will be used to run GPT-5.2 - given the chip is best at FP4 (rather than FP8/16), this points to OpenAI’s latest models being natively FP4. The chip’s main competition is AWS Trainium and Google TPUv7, which it appears to best on everything except FP16 in the case of Google’s offering. The question will be whether Microsoft can scale production enough to start competing with them and reduce its dependency on Nvidia.
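For a sense of how coarse FP4 is, here is a small Python sketch that rounds numbers to the nearest 4-bit float. It assumes the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit) with an optional shared scale factor; neither the Maia 200 announcement nor OpenAI is quoted here as confirming that exact format, and `quantize_fp4` is a hypothetical helper.

```python
import math

# The eight non-negative magnitudes an E2M1 value can take; the sign bit doubles the set.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(x, scale=1.0):
    """Round x to the nearest representable FP4 value, after dividing by a shared scale."""
    target = abs(x) / scale
    magnitude = min(FP4_E2M1, key=lambda v: abs(target - v))
    # Restore the sign and undo the scaling.
    return math.copysign(magnitude, x) * scale


print(quantize_fp4(0.7))             # 0.5
print(quantize_fp4(-2.2))            # -2.0
print(quantize_fp4(10.0))            # 6.0 (anything beyond the largest value saturates)
print(quantize_fp4(9.0, scale=2.0))  # 8.0
```

With so few values to choose from, a model generally needs to be trained or carefully converted with this precision in mind, which is why a chip tuned for FP4 hints at models built natively for it.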
More text to speech models
Alibaba have provided yet another reason to not trust everything you hear with their new TTS model. This is an open source (or open weights?) model that is available in either 1.7B or 0.6B flavours - meaning it’ll essentially run on anything. I’d previously thought the Pocket TTS model was very impressive, but this one really does sound identical to my voice after providing it with 10 seconds of me talking and a transcription. It can even pronounce my name, which is a first for the models I’ve played around with. Have a play with it on Hugging Face, but only after you’ve turned off voice verification on anything you remotely care about!