Working Notes - 20/11/25

Well, it’s been one of those weeks where too much has happened to properly keep up with it all! Lots of interesting stuff - the new releases will keep me busy for a while!

It’s 0.1 louder

OpenAI have launched GPT-5.1 - an incremental release and a relatively small improvement on GPT-5 (which was already great). It lets you change the ‘style’ of the response (e.g. professional vs quirky) in the app, which sounds, and is, awful. They have also worked on ensuring the model spends the right number of tokens on thinking for more complex questions - which will obviously help reduce their inference costs! And they’ve (finally) made it stop using em dashes if you ask it not to - in some ways great, in many more ways aghhhhh, as it’ll now be harder to tell at a glance whether something was AI-generated. I’ve used it a fair bit this week, and it seems… fine. Still one of the best models.

A new best model

Google has done its usual thing of releasing several products in one go. The main ones are Gemini 3 Pro (a frontier LLM), Gemini Deepthink (a new mode for Gemini 3 that increases reasoning and multimodal capabilities), and Google Antigravity (an agentic IDE). No Flash variant of the model yet, but presumably one is on its way (for most things I do at work, Flash is what you want). Gemini 3 has only been out for a day, but it does pretty well at my usual ‘is my council tax correct’ test - probably the best of any of the models thus far, though that may be confirmation bias on my part. Benchmark-wise it’s a beast, topping pretty much all the well-known benchmarks (including Tau2-bench, which I looked at last week: 85% vs 55% for Gemini 2.5). I’ve not played with Deepthink or Antigravity yet; with the latter especially it’ll be interesting to see whether it’s a me-too product or something different. One final thing on Gemini 3: there are some punchy improvements in hacking ability - sufficient that Google’s internal alert was triggered, which they dealt with by… making the test harder so the alert was no longer triggered (obviously being flippant - there was more to it than that - but still!). Which leads on nicely to…

AI Hacking

A few people have pointed me to the Anthropic paper on an AI-orchestrated cyber espionage campaign - it’s certainly an interesting read (if slightly weird that Anthropic treat the fact that their model was used as a bit of a flex). The TL;DR is that a Chinese state-sponsored group used Claude and some MCP servers to salami-slice a cyber attack into small tasks that didn’t look too sketchy individually but collectively meant that 90% of the attack was conducted by Claude, with circa 10% human supervision. Around 30 organisations were targeted (government agencies and tech companies). Somewhat amusingly (?), the biggest issue the hackers had was that Claude did its standard reward hacking and reported false successes (hey - the NSA master password is SpyingOnYou1!). Anthropic are cagey about how they detected the attack (for obvious reasons, I guess), but it seems the sheer volume of requests triggered an API alarm, which meant someone looked into what they were doing and exclaimed ‘wat’. The shape of things to come, I guess (except with rate limiters in the future!).
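As a purely illustrative aside on the detection side: a volume alarm of the kind that seems to have fired here is easy to picture. Below is a minimal sketch, assuming a sliding-window count of requests per account - the class, threshold, and window size are all my own invention, not anything Anthropic has described.

```python
# Toy sketch of a volume-based alarm: a sliding-window count of API
# requests per account. The threshold and window are made-up numbers,
# purely to illustrate the idea of 'too many requests -> a human looks'.
import time
from collections import deque

WINDOW_SECONDS = 3600
ALERT_THRESHOLD = 10_000  # requests per account per hour (illustrative)

class VolumeAlarm:
    def __init__(self) -> None:
        self.requests: dict[str, deque] = {}

    def record(self, account_id: str, now: float = None) -> bool:
        """Log one request; return True if the account trips the alarm."""
        now = time.time() if now is None else now
        q = self.requests.setdefault(account_id, deque())
        q.append(now)
        while q and q[0] < now - WINDOW_SECONDS:  # drop stale entries
            q.popleft()
        return len(q) > ALERT_THRESHOLD
```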

Shadow IT inbound

Microsoft have launched vibe coding for Power Apps. However, it’s not what I initially expected - you do the obvious thing of describing the app in natural language in a chat interface, but what it spits out is a React front end and some Dataverse tables (which, given 90% of the Power Apps I see are ‘replace this Excel form with an app that populates a SharePoint list’, is probably fine?). What it doesn’t do is build any required flows etc. It’s preview-only and US-only at the moment, but it will certainly make companies’ IT departments stroke their beards and mutter about shadow IT.

It’s about how you fill in the form

Gartner has published its Magic Quadrant thing for genAI model providers. It’s got OpenAI as the emerging leader, which I’ll buy; however, several of the other companies in the top-right quadrant are surprising, to say the least. Apparently IBM is up there with WatsonX (who knew), and Mistral and Anthropic are roughly equivalent. OK then! I can’t read the methodology as I’m not a subscriber, but I’m guessing the score is directly related to how experienced the company is at filling out Gartner’s questionnaires… best to take this one with a bucket of salt.

A novel LLM discovery? Maybe?

Interesting paper from Yale and Google on training LLMs (an old version of Gemma in this case) to understand the ‘language’ of individual cells. I’ve previously heard of people trying to train foundation models on genetic information from scratch, but what they did here was extend the pre-training of an existing LLM and then fine-tune it. That raises the question of what a question/answer pair even looks like for this kind of training - it seems they used a bunch of different types for the pre-training (e.g. ‘give me a pancreatic cell’, where the answer is the gene expression of said cell), but most importantly they used the responses of cells to various drugs (i.e. cell + drug = cell response) for the fine-tuning. The end result is a cell model you can talk to like a normal LLM. The really interesting bit is that they then used it to predict a cancer treatment that was novel and relatively subtle (drug alone = no effect, drug + specific immune signal = desired effect), and when tested in the lab… it worked. LLMs making novel discoveries is here? Maybe? If you squint?
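To make that concrete, here’s a minimal sketch of what I imagine those training pairs look like. The ‘cell sentence’ representation (gene names listed in descending order of expression) comes from the Cell2Sentence line of work that, as far as I can tell, this builds on; the function name, genes, and placeholder drug below are mine, not the paper’s.

```python
# Illustrative sketch of the two kinds of training pairs described above.
# The 'cell sentence' representation (genes ranked by expression) is from
# the Cell2Sentence line of work; everything else here is made up.

def cell_to_sentence(expression: dict[str, float], top_k: int = 8) -> str:
    """Rank genes by expression level and emit a space-separated 'sentence'."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    return " ".join(ranked[:top_k])

# A fabricated pancreatic-islet-ish expression profile.
pancreatic_cell = {"INS": 9.1, "GCG": 7.4, "SST": 5.2, "KRT19": 2.1}

# Pre-training-style pair: cell type -> expression profile.
pretrain_pair = {
    "prompt": "Generate the gene expression profile of a pancreatic islet cell.",
    "response": cell_to_sentence(pancreatic_cell),
}

# Fine-tuning-style pair: cell + drug -> perturbed cell. This is the bit
# that lets you later ask 'what happens if I add drug X to cell Y?'.
finetune_pair = {
    "prompt": (
        f"Cell: {cell_to_sentence(pancreatic_cell)}\n"
        "Drug: drug_X\n"
        "Describe the cell after treatment."
    ),
    "response": "INS GCG B2M HLA-A SST",  # placeholder perturbed profile
}
```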

Proof that job hunting is hard at the moment

My general understanding of getting a job at the moment is that it’s ‘not great’, not least because, whilst you used to be able to set yourself apart by writing a customised, coherent CV/cover letter, the advent of genAI means that everyone’s CV is now perfect. This paper pours some fuel onto that fire by demonstrating it empirically. They used freelancer.com data, so they could see both sides of things - the effort spent by the applicant on their proposal (a proxy for a CV) vs who actually got hired. Essentially, the correlation between the effort put into a proposal/CV and being hired disappears - their reasoning being that employers are not stupid and are aware of ChatGPT, so they just ignore the effort signal and focus purely on price. Pretty grim - it moves us further from being a meritocracy. The only signals the employers were using were price, work history, and reputation score (which is in some ways just a proxy for work history, as it’s based on jobs completed on the platform). Extrapolating a little, I guess the takeaways are a) don’t get fired, b) which companies are on your CV matters, c) getting recommended by someone else is very important, and d) you can’t compensate for b and c by being good at cover letters and CVs.
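For what it’s worth, the ‘signal disappears’ claim is the kind of thing you’d test with an interaction term: regress hiring on effort, a post-genAI dummy, and their product. Here’s a toy sketch on simulated data - the variable names, numbers, and specification are mine, not the paper’s.

```python
# Toy sketch: did the effort signal vanish post-genAI? Simulate data where
# effort predicts hiring only before genAI, then fit a logit with an
# effort x post_genai interaction. All names/values are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "effort": rng.normal(size=n),              # proxy: work put into the proposal
    "post_genai": rng.integers(0, 2, size=n),  # 1 = after ChatGPT's release
    "price": rng.normal(size=n),               # bid price (standardised)
})
# Simulated 'truth': effort matters only pre-genAI; price matters throughout.
logit = 0.5 * df.effort * (1 - df.post_genai) - 0.8 * df.price
df["hired"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = smf.logit("hired ~ effort * post_genai + price", data=df).fit(disp=0)
print(model.params)
```

The tell-tale result is a positive effort coefficient plus an effort:post_genai interaction that roughly cancels it - i.e. the effort signal exists pre-genAI and is gone afterwards.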

Podcasts

That’s it for this week - a couple of things I’ve been listening to: 1) the MAD podcast, which had OpenAI’s VP of Research, Jerry Tworek, on, discussing his career history and a bit of the technical detail of how the various GPT models have been developed - slightly annoying that the host steers him back to the mainstream every time he goes off on a technical tangent! 2) the ever-excellent Dwarkesh podcast with Satya Nadella - the main takeaways being, first, that he wasn’t kidding when he said “we are below them, above them, around them” (MS essentially owns OpenAI), and second, that agent infrastructure architecture needs to become a conversation topic in corporates.