These models are so powerful.
It's totally possible to build entire software products in the fraction of the time it took before.
But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent.
It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.
Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.
I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtle different behavioral characteristics without notice. It'll be quicksand.
Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing
Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3
Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.
Gemini 3 is still in preview (limited rate limits) and 2.5 is deprecated (still live but won't be for long).[0]
Are Google planning to put any of their models into production any time soon?
Also somewhat funny that some models are deprecated without a suggested alternative(gemini-2.5-flash-lite). Do they suggest people switch to Claude?
Does well on SVGs outside of "pelican riding on a bicycle" test. Like this prompt:
"create a svg of a unicorn playing xbox"
https://www.svgviewer.dev/s/NeKACuHj
Still some tweaks to the final result, but I am guessing with the ARC-AGI benchmark jumping so much, the model's visual abilities are allowing it to do this well.
It got the car wash question perfectly:
You are definitely going to have to drive it there—unless you want to put it in neutral and push!
While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.
Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.
Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - took over 5 minutes though, but I think that's because they're having performance teething problems on launch day.
3.1 Pro is the first model to correctly count the number of legs on my "five legged dog" test image. 3.0 flash was the previous best, getting it after a few prompts of poking. 3.1 got it on the first prompt though, with the prompt being "How many legs does the dog have? Count Carefully".
However, it didn't get it on the first try with the original prompt (prompt: "How many legs does the dog have?"). It initially said 4, then with a follow up prompt got it to hesitantly say 5, with one limb must being obfuscated or hidden.
So maybe I'll give it a 90%?
This is without tools as well.
I really want to use google’s models but they have the classic Google product problem that we all like to complain about.
I am legit scared to login and use Gemini CLI because the last time I thought I was using my “free” account allowance via Google workspace. Ended up spending $10 before realizing it was API billing and the UI was so hard to figure out I gave up. I’m sure I can spend 20-40 more mins to sort this out, but ugh, I don’t want to.
With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?
blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...
edit: biggest benchmark changes from 3 pro:
arc-agi-2 score went from 31.1% -> 77.1%
apex-agents score went from 18.4% -> 33.5%
Why should I be excited?
Implementation and Sustainability Hardware: Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs). TPUs are specically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs. TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can lead to better model quality. TPU Pods (large clusters of TPUs) also provide a scalable solution for handling the growing complexity of large foundation models. Training can be distributed across multiple TPU devices for faster and more efficient processing.
So google doesn't use NVIDIA GPUs at all ?
I asked Gemini 3.1 Pro Preview to generate the modern artworks as SVG for my Pelican Art Gallery. I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern
Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim the pole position, yet making strides on benchmarks? This is what recursive self-improvement with human support looks like.
Gemini 3 is pretty good, even Flash is very smart for certain things, and fast!
BUT it is not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models (Codex 5.2/5.3, the last two versions of Anthropic models), and also fell behind a bit in reasoning.
I hope they manage to improve things on that front, because then Flash would be great for many tasks.
Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.
Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.
I hope the 3.1 release includes a much larger output limit.
In an attempt to get outside of benchmark gaming I had it make Platypus on a Tricycle. It's not as good as pelican on bicycle. https://www.svgviewer.dev/s/BiRht5hX
Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.
Apart from that, the usual predictable gains in coding. Still is a great sweet-spot for performance, speed and cost. Need to hack Claude Code to use their agentic logic+prompts but use Gemini models.
I wish Google also updated Flash-lite to 3.0+, would like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.
Writing style wise, 3.1 seems very verbose, but somehow less creative compared to 3.
I asked Gemini 3.1 Pro to generate some of the modern artworks in my "Pelican Art Gallery". I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern
77.1% on ARC-AGI-2 and still can't stop adding drive-by refactors. ARC-AGI-2 tests novel pattern induction, it's genuinely hard to fake and the improvement is real. But it doesn't measure task scoping, instruction adherence, or knowing when to stop. Those are the capabilities practitioners actually need from a coding agent. We have excellent benchmarks for reasoning. We have almost nothing that measures reliability in agentic loops. That gap explains this thread.
My enthusiasm is a bit muted this cycle because I've been burned by Gemini CLI. These models are very capable but Gemini CLI just doesn't seem to be able to work for one it never follows instructions strictly like its competitors do, and it hallucinates even which is a rarity.
More importantly feels like Google is stretched thin across different Gemini products and pricing reflects this, I still have no idea how to pay for Gemini CLI, in codex/claude its very simple $20/month for entry and $200/month for ton of weekly usage.
I hope whoever is reading this from Google they can redeem Gemini CLI by focusing on being competitive instead of making it look pretty (that seems to be the impression I got from the updates on X)
I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).
It's only February...
Google tends to trumpet preview models that aren't actually production-grade. For instance, both 3 Pro and Flash suffer from looping and tool-calling issues.
I would love for them to eliminate these issues because just touting benchmark scores isn't enough.
I am actually going to complain about this: that neither of the Gemini models are not preview ones.
Anthropic seems the best in this. Everything is in the API on day one. OpenAI tend to want to ask you for subscription, but the API gets there a week or a few later. Now, Gemini 3 is not for production use and this is already the previous iteration. So, does Google even intent to release this model?
Every time I've used Gemini models for anything besides code or agentic work they lean so far into the RLHF induced bold lettering and bullet point list barf that everything they output reads as if the model was talking _at_ me and not _with_ me. In my Openclaw experiment(s) and in the Gemini web UI, I've specifically added instructions to avoid this type of behavior, but it only seemed to obey those rules when I reminded the model of them.
For conversational contexts, I don't think the (in some cases significantly) better benchmark results compared to a model like Sonnet 4.6 can convince me to switch to Gemini 3.1. Has anyone else had a similar experience, or is this just a me issue?
Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...
It seems google is having a disjointed roll out, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in vertex at 2am or something equally odd.
Either way early user tests look promising.
The benchmark jumps are impressive but the real question is whether Gemini can stop being so aggressively helpful. Every time I use it for coding it refactors stuff I didn't ask it to touch. Claude has the opposite problem where it sometimes does too little. Feels like nobody has nailed the "do exactly what I asked, nothing more" sweet spot yet.
It's safe to assume they'll be releasing improved Gemini Flash soon? The current one is so good & fast I rarely switch to pro anymore
In my experience, while Gemini does really well in benchmarks I find it much worse when I actually use the model. It's too verbose / doesn't follow instructions very well. Let's see if that changes with this model.
I'm trying to find the information, is this available on the Gemini CLI script, or is this just the web front-end where I can use this new model?
This model says it accepts video inputs. I asked it to transcribe a 5 second video of a digital water curtain which spelled “Boo Happy Halloween”, and it came back with “Happy” which wasn’t the first frame, but also is incomplete.
This kind of test is good because it requires stitching together info from the whole video.
I have run into a surprising number of basic syntax errors on this one. At least in the few runs I have tried it's a swing and a miss. Wonder if the pressure of the Claude release is pushing these stop gap releases.
Someone needs to make an actual good benchmark for LLM's that matches real world expectations, theres more to benchmarks than accuracy against a dataset.
I had it make a simple HTML/JS canvas game (think flappy bird) and while it did some things mildly better (and others noticeably worse) it still fell into the exact same traps as earlier models. It also had a lot of issues generating valid JS at parts and asking it what the code should be just made it endlessly generate the same exact incorrect code.
I like to think that all these pelican riding a bicycle comments are unwittingly iteratively creating the optimal cyclist pelican as these comment threads are inevitably incorporated in every training set.
I speculated that 3 pro was 3.1... I guess I was wrong. Super impressive numbers here. Good job Google.
The speed of these 3.1 and Preview releases is starting to feel like the early days of web frameworks. It’s becoming less about the raw benchmarks and more about which model handles long-context 'hallucination' well enough to be actually used in a production pipeline without constant babysitting.
Seems like they actually fixed some of the problems with the model. Hallucinations rate seems to be much better. Seems like they also tuned the reasoning maybe that were they got most of the improvements from.
I'm using gemini.google.com/app with AI Pro subscription. "Something went wrong" in FF, works in Chrome.
Below is one of my test prompts that previous Gemini models were failing. 3.1 Pro did a decent job this time.
> use c++, sdl3. use SDL_AppInit, SDL_AppEvent, SDL_AppIterate callback functions. use SDL_main instead of the default main function. make a basic hello world app.
Does anyone know if this is in GA immediately or if it is in preview?
On our end, Gemini 3.0 Preview was very flakey (not model quality, but as in the API responses sometimes errored out), making it unreliable.
Does this mean that 3.0 is now GA at least?
We've gone from yearly releases to quarterly releases.
If the pace of releases continues to accelerate - by mid 2027 or 2028 we're headed to weekly releases.
The CLI needs work, or they should officially allow third-party harnesses. Right now, the CLI experience is noticeably behind other SOTA models. It actually works much better when paired with Opencode.
But with accounts reportedly being banned over ToS issues, similar to Claude Code, it feels risky to rely on it in a serious workflow.
> Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.
So this is same but not same as Gemini 3 Deep Think? Keeping track of these different releases is getting pretty ridiculous.
Google seems to really pull ahead in this AI race. For me personally they offer the best deal and although the software is not quiet there compared to openai or anthropic (in regards to 1. web GUI, 2. agent-cli). I hope they can fix that in the future and I think once Gemini 4 or whatever launches we will see a huge leap again
I hope this works better than 3.0 Pro
I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.
It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.
Within VS Code Copilot Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools. In Copilot it, won't stop and ask clarifying questions, though in Gemini CLI it will.
So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.
For as much as I hear Google's pulling ahead, Anthropic seems to be to me, from a practical POV. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.