Nemotron on BLACKBOX: Open Weights, Encrypted Inference, and 420.2 tok/s

Blackbox is the orchestration layer for coding agents — a single, secure, cost-efficient platform that unifies the best open-source and closed-source models behind one interface. We were built for the teams that can't compromise on performance, price, or privacy: enterprises and governments deploying AI into the workflows that actually matter.

20–30× cheaper than closed-source

420.2 tok/s peak output (c=1)

E2E encrypted inference

550B params, ~55B active

What's unique about Blackbox

What sets us apart comes down to three engineering decisions we made from day one.

A structural cost advantage, engineered at the inference layer

Many platforms resell tokens; we generate them. For open-source models, we leverage direct access to model weights to self-host on our own infrastructure — powered by a proprietary inference engine that is among the most advanced in the industry. Through a combination of custom CUDA kernels, optimized attention mechanisms, advanced quantization, continuous batching, and Multi-Token Prediction (MTP) speculative decoding, we extract significantly more tokens per GPU than competing systems.

The result is a durable cost moat: 20–30× cheaper than closed-source alternatives, with the gap compounding on every token served. For enterprises running agents at scale — where a single workflow can consume millions of tokens per task — this cost gap determines whether AI stays a pilot or reaches production.

A new standard for AI security

Security at Blackbox is foundational to the stack.

For closed-source models, our state-of-the-art PII layer strips and substitutes sensitive enterprise data before any request leaves our perimeter. Customers get the capabilities of frontier closed models without exposing customer data or internal IP to third-party labs.

How Nemotron Ultra adds value to Blackbox

Nemotron has become a cornerstone of the Blackbox platform — and the reason is strategic, not incidental.

Across our customer base, we're seeing a decisive shift: enterprises, governments, regulated industries, and research institutions are no longer asking *whether* to adopt open-source AI. They're asking which open-source models they can trust. And increasingly, the answer must satisfy two non-negotiable criteria: open weights and American provenance.

For sovereign workloads, defense applications, regulated sectors, and frontier research, closed-source APIs introduce unacceptable exposure, and models of Chinese origin are off the table entirely for reasons of compliance, supply-chain integrity, and national security posture. The market has spoken clearly: customers want models they can inspect, self-host, audit, and govern — built by an ecosystem aligned with Western regulatory and security standards.

“Nemotron sits precisely at that intersection.”

Frontier capability, where it counts

Nemotron delivers exceptional performance on the workloads our customers care about most: advanced software engineering, agentic reasoning, and complex data analysis. These workloads map directly to the production use cases driving real enterprise value: autonomous coding agents, large-scale code modernization, financial and scientific analysis, and the long-horizon, tool-using agents that define the next era of enterprise AI.

When we run Nemotron on Blackbox's proprietary inference engine, the combination is uniquely powerful: a frontier-class American open model, served with industry-leading tokens-per-GPU economics, behind end-to-end encrypted inference. Capability, cost, and security — without compromise on any axis.

NVIDIA's open-source investment is unlocking the agentic enterprise

NVIDIA's commitment to open-weight model development is one of the most consequential moves in the AI industry today. By releasing models of Nemotron's caliber into the open ecosystem, NVIDIA is doing something the market urgently needs: collapsing the gap between frontier capability and deployable, governable AI.

That gap has been the single biggest barrier to enterprise-wide agent adoption. Closed-source APIs are too expensive to run at agentic scale, too opaque for regulated environments, and too dependent on third-party trust boundaries. Open weights solve all three problems at once — and Nemotron proves that open models can match or exceed closed-source quality on the tasks that matter.

The implication is profound: enterprises can now deploy AI agents across every workflow in the organization — not just the handful that survive the cost, security, and compliance gauntlet. Engineering, research, operations, analytics, customer-facing systems, sovereign infrastructure — all of it becomes addressable.

The Blackbox + NVIDIA stack

Together, Blackbox and NVIDIA deliver a complete answer to the agentic enterprise:

NVIDIA provides

The frontier open-weight models
The hardware they run on (8× B300)

Blackbox provides

Orchestration across open + closed models
Inference optimization and tokens-per-GPU economics
End-to-end encryption and PII security guarantees
Developer surfaces — API, CLI, IDE, agents

The result is the most compelling open, American, secure, and economically viable AI stack on the market — purpose-built for the institutions that will define the next decade of intelligent systems. Nemotron is the model our most discerning customers are asking for by name. NVIDIA's continued investment in open source is what makes the agentic enterprise — at full scope, across every workload — finally possible.

Architecture at a glance

550B total parameters, ~55B active per token (sparse MoE).

Hybrid backbone: majority Mamba-2 (linear-time), interleaved sparse attention, MoE feed-forward experts.

Native Multi-Token Prediction head, trained jointly with the base model.

Inference speeds

Big models are supposed to be slow. A half-trillion-parameter model serving one user at a time, token by token, is the worst case for throughput — there is no batch to amortize across, no parallel requests to hide latency behind. It is just you, one prompt, and a very large network.

So here is the result that surprised us: in FP4, Nemotron-3-Ultra-550B-A55B peaks at 420.2 tok/s and sustains an average of 414 tok/s of output at concurrency 1 (c=1), measured with the same methodology Artificial Analysis applies to every model on its public leaderboard. That is faster than most hosted models a fraction of its size.

420.2 tok/s peak output (c=1)

414 tok/s sustained average

<300 ms time to first token

8× B300 dedicated GPUs

This section explains what we measured, how we measured it, and *why* a model this large can decode this fast.

The headline (FP4)

Peak output speed: 420.2 tok/s (c=1, streaming)

Sustained average: 414 tok/s across many repeated runs (temperature 0 makes runs near-deterministic, so the spread is tiny)

Time to first token: under 300 ms

Decode speed stays flat as prompts get longer — long context is nearly free

Every tok/s figure here is counted with OpenAI's tiktoken o200k_base encoding — the same basis Artificial Analysis uses across its entire leaderboard — so these numbers are directly comparable to any model on the public chart.

What we measured

We mirrored the Artificial Analysis concurrency-1 (c=1) methodology end to end:

One request at a time (concurrency = 1) — what a real chat user actually experiences, and AA's headline metric.

temperature = 0, top_p = 1, streaming — identical to AA's published settings.

Output speed = output_tokens / (e2e_latency − TTFT) — the average decode rate after the first token streams.

Time to First Token (TTFT) — wall-clock from request send to the first content-bearing chunk.

Tokens counted in tiktoken o200k_base — the universal counting basis that normalizes away vendor-specific tokenizer differences.
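
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the Blackbox harness: `StreamResult`, `measure_stream`, and `o200k_count` are hypothetical names, and `chunks` stands in for whatever lazy iterator your streaming client returns (the request should be sent when iteration starts, so the timer covers it).

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamResult:
    text: str
    ttft_s: float        # request send -> first content-bearing chunk
    e2e_s: float         # request send -> end of stream
    output_tokens: int   # counted on the o200k_base basis

    @property
    def output_speed(self) -> float:
        """Decode rate after the first token: output_tokens / (e2e - TTFT)."""
        return self.output_tokens / (self.e2e_s - self.ttft_s)


def o200k_count(text: str) -> int:
    """Count tokens with tiktoken's o200k_base encoding (pip install tiktoken)."""
    import tiktoken  # lazy import so the rest of the sketch has no dependencies
    return len(tiktoken.get_encoding("o200k_base").encode(text))


def measure_stream(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int] = o200k_count,
    clock: Callable[[], float] = time.perf_counter,
) -> StreamResult:
    """Time one streamed response and count its output tokens."""
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if not chunk:            # ignore empty chunks (role headers, keep-alives)
            continue
        if ttft is None:         # first content-bearing chunk marks TTFT
            ttft = clock() - start
        parts.append(chunk)
    e2e = clock() - start
    if ttft is None:
        raise ValueError("stream produced no content")
    text = "".join(parts)
    return StreamResult(text, ttft, e2e, count_tokens(text))
```

Passing `count_tokens` and `clock` as parameters keeps the timing logic testable without a network call or a tokenizer download; in a real run the defaults apply.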

Results

On the same 8× B300 deployment and the same benchmark, FP4 leads BF16 on every axis — and both clear the field for a model this size.

Concurrency-1 output speed, FP4 vs BF16 (same hardware, same benchmark; higher is better)

Configuration	Output speed
FP4 peak	420.2 tok/s
FP4 average	414 tok/s
BF16 peak	365.7 tok/s
BF16 average	359.6 tok/s

Concurrency 1 (c=1), streaming, temperature 0. Tokens counted in tiktoken o200k_base.

FP4

Metric	Concurrency 1 (c=1)
Peak output speed ★	420.2 tok/s
Average output speed	414 tok/s
Time to first token	< 300 ms

BF16 — same bench, same hardware

Metric	Concurrency 1 (c=1)
Peak output speed	365.7 tok/s
Average output speed	359.6 tok/s
Time to first token	~500 ms
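
To put the two tables side by side, the short script below derives the FP4-over-BF16 speedup and the implied time per output token. It uses only the published figures from this section; the helper names `speedup` and `ms_per_token` are ours, for illustration.

```python
# Published concurrency-1 output speeds from the tables in this section (tok/s).
RESULTS = {
    "FP4": {"peak": 420.2, "average": 414.0},
    "BF16": {"peak": 365.7, "average": 359.6},
}


def speedup(metric: str) -> float:
    """Ratio of FP4 to BF16 output speed for the given metric."""
    return RESULTS["FP4"][metric] / RESULTS["BF16"][metric]


def ms_per_token(tok_s: float) -> float:
    """Average decode time per output token, in milliseconds."""
    return 1000.0 / tok_s


for metric in ("peak", "average"):
    print(f"{metric}: FP4 is {speedup(metric):.3f}x BF16")
for name, row in RESULTS.items():
    print(f"{name}: {ms_per_token(row['average']):.2f} ms per token on average")
```

On these figures FP4 comes out about 15% faster than BF16 on both peak and average, which is roughly 2.4 ms versus 2.8 ms per output token.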

Serving setup: hosted and deployed on dedicated Blackbox GPUs (8× NVIDIA B300) and powered by the Blackbox inference engine. Because the deployment runs on dedicated hardware rather than shared, multi-tenant capacity, the concurrency-1 (c=1) numbers reflect what an end user actually gets.

Why a 550B model decodes this fast

Three architectural choices compound, and that compounding is the whole story.

1. Sparse Mixture-of-Experts. The model has 550B total parameters but activates only ~55B per token. You pay for a 55B-class forward pass while getting the quality of a model many times larger. Roughly 90% of the weights sit idle on any given token.

2. Mamba-hybrid layers. Most layers are Mamba-2, which is linear-time in context length rather than quadratic like attention. A handful of attention layers are interleaved where global mixing matters. The practical consequence: long prompts barely cost anything on the decode side. A longer prompt shifts a little cost into time-to-first-token but leaves per-token output speed almost untouched.

3. Built-in Multi-Token Prediction. The model ships with a native MTP head trained jointly with the base — not a separately bolted-on draft model. It proposes several tokens per step, the base verifies them in parallel, and accepted tokens come essentially for free. On real workloads this delivers a healthy multiplier on raw decode speed.
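
Two of these effects are easy to put numbers on. The sketch below computes the active-parameter fraction from the published sizes, and a textbook estimate of how many tokens a draft-and-verify step emits. The estimate assumes each drafted token is accepted independently with a fixed probability; that is a simplification, and the acceptance rate and draft length in the example are hypothetical, not measured values for this model.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of the weights that participate in a single token's forward pass."""
    return active_params_b / total_params_b


def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verify step under independent acceptance.

    Every step yields at least one token from the base model, plus the
    accepted prefix of the draft: 1 + a + a^2 + ... + a^k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))


# 550B total, ~55B active: about 10% of the weights run per token.
print(f"active fraction: {active_fraction(550, 55):.0%}")

# Hypothetical: 3 drafted tokens, each accepted 80% of the time.
print(f"tokens per step: {expected_tokens_per_step(0.8, 3):.3f}")
```

The tokens-per-step figure is an upper bound on the decode speedup, since it ignores the cost of drafting and of verifying rejected tokens.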

Two non-obvious findings

How we benchmark (and why you can trust the number)

We follow Artificial Analysis's public methodology so our headline is directly comparable to their leaderboard, and we re-implemented their protocol ourselves so we control every variable.

The protocol, step by step:

1. Read a set of prompts, each with a target output length.

2. Run a couple of warmup prompts and discard them.

3. For each measured prompt, stream a chat completion at temperature = 0, top_p = 1.

4. Mark TTFT at the first content-bearing chunk; concatenate all content deltas into the full output; record end-to-end latency.

5. Count output tokens with tiktoken o200k_base — not the model's native tokenizer.

6. Compute per-prompt tok/s = output_tokens / (e2e − TTFT).

7. Report p10 / p50 / p90 / average across all prompts.
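
Steps 2 and 7 (discarding warmups and reporting percentiles) can be sketched as follows. This is an illustrative version rather than our production harness: the function name `summarize`, the default of two warmup runs, and the choice of the "inclusive" quantile method are assumptions.

```python
import statistics
from typing import Dict, Sequence


def summarize(speeds: Sequence[float], warmup: int = 2) -> Dict[str, float]:
    """Drop the first `warmup` runs, then report p10 / p50 / p90 / average tok/s."""
    measured = list(speeds[warmup:])
    if len(measured) < 2:
        raise ValueError("need at least two measured prompts")
    # quantiles(n=10) returns the nine cut points p10..p90; "inclusive" treats
    # the runs as the whole population rather than a sample of it.
    deciles = statistics.quantiles(measured, n=10, method="inclusive")
    return {
        "p10": deciles[0],
        "p50": deciles[4],
        "p90": deciles[8],
        "average": statistics.fmean(measured),
    }
```

Feed it the per-prompt tok/s values from step 6 in run order, warmups first, so the slice drops the right entries.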

Prompts are drawn from Artificial Analysis's own published performance-prompt set, covering their full task mix — summarization, Q&A, comparative analysis, translation, and structured generation — with input lengths verified in tiktoken o200k_base and an explicit task directive that forces long, decode-heavy output.

AA spec vs. what we ran

Topic	AA spec	What we did
Sampling	temperature = 0, top_p = 1	Same (model is non-reasoning)
Concurrency	Single (1) and Parallel (10)	Single (c=1) — the headline mode
Tasks	5 task types	All 5, inherited from AA's prompt set
Min output	≥1,000–1,500 tokens	Output caps set to AA's minimums
Tokenizer	tiktoken o200k_base	tiktoken o200k_base
Output-speed formula	output_tokens / (e2e − TTFT)	Same

Why our own harness rather than a vendor tool? Most LLM benchmarking tools report tok/s in the model's native tokenizer by default, which inflates this model's numbers by ~15% versus the AA standard, and many use random, meaningless synthetic prompts. Counting on the AA tiktoken basis with AA's real prompt mix is the strongest comparability lever available, so that is what every number here uses.
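
If a tool only reports tok/s in the model's native tokenizer, the figure can be rescaled to the o200k_base basis once both token counts are known for the same output text. A minimal sketch, with a hypothetical function name and hypothetical counts chosen to show a 15% inflation:

```python
def to_o200k_basis(native_tok_s: float, native_tokens: int, o200k_tokens: int) -> float:
    """Rescale a tok/s figure from a native tokenizer to the o200k_base basis.

    Both counts must come from tokenizing the same output text. Decode time is
    unchanged, so the speed scales with the ratio of the two token counts.
    """
    if native_tokens <= 0 or o200k_tokens < 0:
        raise ValueError("token counts must be positive")
    return native_tok_s * o200k_tokens / native_tokens


# Hypothetical response: 1,150 native tokens, 1,000 o200k_base tokens.
print(to_o200k_basis(460.0, native_tokens=1150, o200k_tokens=1000))
```

The ratio varies with the text, so it is safer to recount each response than to apply one blanket correction factor.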

How it's deployed

These numbers come from a production deployment on dedicated GPUs, served by the Blackbox inference engine — not a lab benchmark on borrowed capacity. Three things follow from that.

Dedicated. The model runs on reserved 8× B300 hardware, so concurrency-1 (c=1) latency and throughput are stable and repeatable — there is no contention from other tenants sharing the same GPUs.

Engine-level tuning. The Blackbox inference engine handles the quantized execution, tensor + expert parallelism, FP8 KV cache, and built-in Multi-Token Prediction speculative decoding that together produce the 420.2 tok/s peak.

What you measure is what you get. Because the benchmark runs against the same dedicated deployment that serves traffic, the headline figures translate directly to real concurrency-1 (c=1) experience.

Run the same benchmark on your workload

Get a free API key and measure concurrency-1 output speed against your own prompts.

Methodology references

Artificial Analysis performance methodology — artificialanalysis.ai/methodology/performance-benchmarking. Defines tiktoken o200k_base as the universal counting basis, temperature = 0, top_p = 1, single vs. parallel concurrency, and the output-speed formula.

AA performance prompts — the published prompt set we draw from, available from the AA methodology downloads.

Tiktoken o200k_base — github.com/openai/tiktoken via tiktoken.get_encoding("o200k_base"). The OpenAI GPT-4o encoding AA standardizes on and the basis for every figure here.

Bottom line

Nemotron-3-Ultra-550B-A55B (FP4) reaches 420.2 tok/s peak and 414 tok/s average output at concurrency 1 (c=1), counted on the same tiktoken basis as the public Artificial Analysis leaderboard. This performance comes from three architectural bets that compound: sparse MoE (only ~10% of parameters active per token), a Mamba-hybrid backbone (cheap long context), and a built-in MTP head (multiple tokens per decode step). That a model this large serves a single request this fast is the headline; the architecture is the reason.

Capability, cost, and security — without compromise on any axis. That is why Nemotron runs on Blackbox, and why it is the model our most discerning customers ask for by name.