CryptoMarketNews.club is a website that reports daily blockchain news and offers practical crypto guides.

Google’s DiffusionGemma AI Hits 1,000 Tokens Per Second—And It’s Free

by admin
11/06/2026

In brief

  • Google released DiffusionGemma, a free open-weight model that generates entire 256-token blocks simultaneously via text diffusion—hitting over 1,000 tokens per second on an NVIDIA H100, four times faster than standard autoregressive models.
  • The custom drafter module DiffusionGemma needs for local inference doesn’t exist in any public runtime yet—not in mlx-lm, not in LM Studio—making it effectively unrunnable on most consumer setups today.
  • On NVIDIA NIM, the model arrived preconfigured at 8,192 tokens of context—below the 64,000-token floor that agentic frameworks like Hermes Agent require—meaning autonomous workflows won’t run without manual reconfiguration.

Google released DiffusionGemma today, an open-weight AI model that generates text the way image generators create pictures: start with noise, then refine until it makes sense. It hits 1,000 tokens per second on an NVIDIA H100, which makes it four times faster than regular Gemma. (Tokens are the basic units of text that an AI model processes.) It's also free: the weights are on Hugging Face under an Apache 2.0 license.

The catch, as always, is in the fine print. On consumer hardware the speed drops: per Google's announcement, the model hits "700+ tokens per second on NVIDIA GeForce RTX 5090." It also trails standard Gemma 4 on output quality.

Google says so itself: this is a speed model, not a quality upgrade.

What this actually does

Every LLM you’ve used is a typewriter. One token at a time with each word dependent on the last. That’s how autoregressive architectures work.

DiffusionGemma doesn't do that. Instead of generating tokens sequentially, it refines chunks of garbled text in parallel. Per Google's developer guide, it "starts with a canvas of random placeholder tokens" and iteratively locks in confident tokens until the whole block snaps into focus. Two hundred fifty-six tokens per forward pass. The GPU stays busy.
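The "lock in confident tokens" loop can be sketched in a few lines of Python. This is a toy illustration, not Google's implementation: `toy_denoiser` is a hypothetical stand-in for the model that simply proposes the correct token with a random confidence, and `tokens_per_step` is an assumed parameter controlling how many slots are committed in each refinement round.

```python
import random

MASK = "<mask>"  # placeholder token for a slot that is still undecided


def toy_denoiser(canvas, target, rng):
    """Hypothetical model stand-in: for every masked slot, return
    (proposed_token, confidence). A real model would predict both."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def diffusion_decode(target, tokens_per_step=4, seed=0):
    """Fill a fully masked canvas by committing the most confident
    proposals in each round until no masked slots remain."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    steps = 0
    while MASK in canvas:
        proposals = toy_denoiser(canvas, target, rng)
        # Rank masked positions by confidence and commit the top few.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            canvas[i] = proposals[i][0]
        steps += 1
    return canvas, steps


sentence = "the quick brown fox jumps over the lazy dog".split()
decoded, rounds = diffusion_decode(sentence, tokens_per_step=4)
print(" ".join(decoded), "| refinement rounds:", rounds)
```

The point of the sketch is the shape of the loop: every round looks at the whole block at once, so the number of rounds depends on how many tokens are committed per round rather than on the block length alone.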

The side effect is bidirectional attention: every token can see every other token while the block is being generated. Autoregressive models cannot do this, because they only see tokens that have already been produced, never the ones still to come. That makes it unusually good at tasks where the end of the answer constrains the beginning, such as code infilling, structured output, and constraint-heavy problems. Google fine-tuned a version to solve Sudoku as a demo. The base model got roughly 0% of puzzles right.

The fine-tuned version hit 80%.
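The difference between causal and bidirectional attention comes down to which positions each token is allowed to look at. A minimal sketch in plain Python, using small 0/1 masks (the function names here are illustrative, not taken from any library):

```python
def causal_mask(n):
    """Autoregressive mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def visible_positions(mask, i):
    """List the positions that token i is allowed to see."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Token 1 in a 4-token block:
print(visible_positions(causal_mask(4), 1))         # [0, 1]
print(visible_positions(bidirectional_mask(4), 1))  # [0, 1, 2, 3]
```

Under the causal mask, token 1 has no access to tokens 2 and 3, which is why a constraint that only appears later in the sequence cannot influence it.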

Text diffusion has been a research project for years. MDLM, SEDD, LLaDA, Dream: academic models that proved the approach worked at small scales and mostly stayed as proofs of concept. Inception Labs shipped Mercury 2 in February 2026 as the first commercial diffusion reasoning model, claiming speeds five times faster than speed-optimized competitors.

But none of that was open-weight, and none of it came with day-zero support in vLLM, Hugging Face Transformers, and Unsloth. DiffusionGemma is the first major open release from a tier-one lab.

There’s also a historical irony worth noting. Image generators started as diffusion models (hence the name Stable Diffusion) and are now moving toward autoregressive architectures for better quality. Language models started as autoregressive and are now experimenting with diffusion for speed.

Why it’s a pain to run… for now

Running DiffusionGemma efficiently requires a drafter—a lightweight module that proposes token blocks in parallel, which the main model then verifies in one forward pass. This is called speculative decoding. DFlash is a framework published in early 2026 that uses a small diffusion model as the drafter, enabling over 6x speedup on some tasks. It’s the engine that makes this class of model practical.
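The draft-then-verify idea can be sketched as follows. This is generic greedy speculative decoding under simplifying assumptions, not DFlash's or DiffusionGemma's actual code: `verify` is a hypothetical callable standing in for one forward pass of the main model, and `make_toy_verifier` fakes that model with a fixed reference sequence.

```python
def speculative_step(prefix, draft_block, verify):
    """Accept drafted tokens up to the first disagreement with the main
    model, then take the main model's token at that position and stop."""
    # One (simulated) forward pass: the main model's choice per draft slot.
    target_tokens = verify(prefix, draft_block)
    accepted = []
    for drafted, wanted in zip(draft_block, target_tokens):
        if drafted == wanted:
            accepted.append(drafted)
        else:
            accepted.append(wanted)  # correct the first mismatch
            break                    # everything after it is discarded
    return prefix + accepted


def make_toy_verifier(reference):
    """Hypothetical main model that always continues a fixed sequence."""
    def verify(prefix, block):
        start = len(prefix)
        return reference[start:start + len(block)]
    return verify


verify = make_toy_verifier(["a", "b", "c", "d", "e"])
# The drafter gets the first two tokens right and the third wrong.
print(speculative_step([], ["a", "b", "x", "d"], verify))  # ['a', 'b', 'c']
```

When the drafter is usually right, several tokens are committed for the cost of a single pass through the large model, which is where the speedup comes from.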

The problem: DiffusionGemma needs a specific drafter to run locally via MLX—Apple’s machine learning framework for Apple Silicon. That module doesn’t exist in any public version of mlx-lm, in any open pull request, or in LM Studio’s bundled runtime.

We tried running DiffusionGemma with Hermes through NVIDIA NIM. The model loaded, but then: “agent init failed: Model google/diffusiongemma-26b-a4b-it has a context window of 8,192 tokens, which is below the minimum 64,000 required by Hermes Agent.”

To be precise: DiffusionGemma's actual context window is 256K tokens. The 8,192 figure came from NVIDIA NIM's default configuration, not from the model's architectural limit.
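The failure amounts to a threshold comparison between the served context length and the agent's minimum. A hypothetical preflight check, using only the numbers quoted above, might look like this. It is not Hermes Agent's or NIM's actual code, and `NATIVE_CONTEXT` treats "256K" as 256,000 purely for illustration.

```python
MIN_AGENT_CONTEXT = 64_000  # minimum quoted in the error message
SERVED_CONTEXT = 8_192      # default the endpoint reported
NATIVE_CONTEXT = 256_000    # assumed reading of "256K" (illustrative)


def check_context(served_tokens, required_tokens=MIN_AGENT_CONTEXT):
    """Return (ok, shortfall): whether the served window meets the
    requirement, and by how many tokens it falls short if not."""
    ok = served_tokens >= required_tokens
    shortfall = max(0, required_tokens - served_tokens)
    return ok, shortfall


print(check_context(SERVED_CONTEXT))  # (False, 55808)
print(check_context(NATIVE_CONTEXT))  # (True, 0)
```

The model's native window clears the bar comfortably; only the served default falls short, which is why the fix is a configuration change rather than a different model.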

In practice, getting it configured correctly for agentic use requires manual work that most everyday users haven’t figured out yet, and Hermes Agent simply won’t initialize without it. Parallel speed means nothing if the agent can’t boot.

Hopefully, in the next few days, the community will produce better resources to run these models.

Who this is actually for

Developers with NVIDIA RTX 4090 or 5090 hardware building real-time tools—inline editors, autocomplete, code infilling, structured generation. That’s the target. As Decrypt covered in May, Google has been on a steady push to make local inference faster without new hardware.

For researchers, bidirectional generation opens territory that autoregressive models simply can’t reach—protein sequences, mathematical graphs, anything where position N depends on position N+50. That’s not a small thing.

Google launched Gemma 4 under Apache 2.0 in April, and DiffusionGemma continues that strategy. There’s already a draft llama.cpp PR open as of today. When the toolchain catches up, this reaches a much wider audience.

On a machine with a capable discrete GPU, 1,000 tokens per second is real.

© 2025-2026 Cryptomarketnews.Club
