Token Generation Speed Simulator


This simulator visualises how different token-per-second speeds affect the real-time feel of LLM responses. Select a model speed preset or enter a custom rate, configure your output, then press Run Simulation to watch tokens stream in real time — just like a live API call.
[Simulator controls: model speed preset or custom rate (0.5–2,000 tok/s; typical hosted APIs 30–200 tok/s), scenario preset, output length in tokens (~0.75 words per token), time to first token (50–5,000 ms; typical 200–1,500 ms), jitter level (real models vary in speed due to batch size, network, and KV cache effects), token grouping for the stream window, and an optional custom prompt. The output panel shows status (waiting / streaming / complete), current tok/s, tokens generated, elapsed time, time remaining, progress, and the live response stream.]
Simulation only. Speeds shown are representative approximations based on publicly reported benchmarks. Real-world throughput varies with model version, provider load, prompt length, network latency, and hardware. Use these figures for planning and UX design, not production SLA commitments.

What Is Token Generation Speed?

Token generation speed — measured in tokens per second (tok/s) — describes how quickly a large language model produces output text. A model generating 100 tok/s produces roughly 75 words per second, completing a 300-word paragraph in about 4 seconds. Speed varies significantly by model size, hardware, inference engine, and provider load.

When you send a request to an LLM API, the model first processes your entire prompt (the prefill phase), then generates tokens one at a time (the decode phase). The decode phase speed — measured in tokens per second — determines how quickly text appears in streaming responses. This is the metric the simulator above visualises.

Two separate latency metrics matter for user experience: Time to First Token (TTFT) — how long before any text appears — and generation throughput — how fast tokens arrive after that. A slow TTFT feels unresponsive even if generation is fast. Conversely, a fast TTFT followed by slow generation still feels sluggish on long responses.
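These two metrics combine into a single end-to-end figure. A minimal Python sketch of the trade-off (the TTFT and throughput numbers are illustrative, not tied to any particular model):

```python
def total_response_time(ttft_ms: float, output_tokens: int, tok_per_s: float) -> float:
    """Seconds from request sent to last token received: TTFT plus decode time."""
    return ttft_ms / 1000 + output_tokens / tok_per_s

# For a short 100-token reply, the fast-starting slower model finishes first:
fast_start = total_response_time(200, 100, 40)    # 0.2 s + 2.5 s = 2.7 s
slow_start = total_response_time(2000, 100, 100)  # 2.0 s + 1.0 s = 3.0 s
print(fast_start, slow_start)
```

For short responses TTFT dominates the total; for long responses the throughput term takes over.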

Reference: Hugging Face, "LLM Inference Performance Engineering," huggingface.co; Anyscale, "Continuous Batching," anyscale.com. Accessed July 2025.

Speed Benchmarks by Model

The figures below are approximate throughput benchmarks based on publicly reported data and independent evaluations. Real performance depends on concurrent load, prompt length, and provider infrastructure. Treat these as planning estimates, not guaranteed SLAs.

[Benchmark table columns: Model, Provider, Avg tok/s, Typical TTFT, Speed Tier.]

Sources: Artificial Analysis AI benchmarks (artificialanalysis.ai); community benchmarks from llm-bench and independent testing. Figures represent median throughput under typical load. Accessed July 2025.

Key Concepts: TTFT & Throughput

Understanding the two-phase nature of LLM inference helps you choose the right model and provider for your application's latency requirements.

  • TTFT (Time to First Token): latency from sending your request to receiving the first character. Driven by prompt length and prefill compute.
  • tok/s (generation throughput): how many tokens the model outputs per second during the decode phase. Driven by model size and hardware.
  • ≈ 0.75 (words per token): approximate conversion. 100 tok/s ≈ 75 words/second ≈ one 300-word response in ~4 seconds.
  • TBT (Time Between Tokens): interval between consecutive token arrivals. At 100 tok/s, TBT = 10 ms. Affects streaming smoothness.
  • Prefill (prompt processing phase): the model reads and encodes your entire input. Longer prompts increase TTFT proportionally.
  • Decode (output generation phase): the model generates one token at a time, autoregressively. This is the throughput you see in tok/s benchmarks.

How Token Generation Speed Is Measured

Benchmarking LLM throughput requires carefully separating prefill time from decode time, and measuring under controlled conditions to get reproducible results.

── CORE METRICS ─────────────────────────────────────────────────
Time to First Token (TTFT) =
  Time from request sent → first token received
  [milliseconds]

Generation Throughput (tok/s) =
  Total output tokens
  ──────────────────────────────────────────
  Total generation time (excluding TTFT)
  [tokens per second]

Time Between Tokens (TBT) =
  1000 ÷ throughput (tok/s)
  [milliseconds per token]

Total Response Time =
  TTFT + (Output Tokens ÷ Throughput)
  [seconds]

── HUMAN-READABLE CONVERSIONS ───────────────────────────────────
Words per second    ≈  tok/s × 0.75
Words per minute    ≈  tok/s × 45
Time for N words    ≈  N ÷ (tok/s × 0.75)  seconds

── SIMULATOR FORMULA ────────────────────────────────────────────
Effective interval between display events =
  1000 ÷ (tok/s × variabilityFactor)  ms
  where variabilityFactor = random in [0.75, 1.25] for medium jitter
Formulas used by this simulator's calculation engine.
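The TBT and jitter formulas above translate directly into Python. This is a sketch, not the simulator's actual source; the `[0.75, 1.25]` range follows the medium-jitter `variabilityFactor` stated above:

```python
import random

def time_between_tokens_ms(tok_per_s: float) -> float:
    """TBT = 1000 / throughput, in milliseconds per token."""
    return 1000 / tok_per_s

def jittered_interval_ms(tok_per_s: float, rng: random.Random) -> float:
    """Display interval = 1000 / (tok/s x variabilityFactor),
    with variabilityFactor drawn uniformly from [0.75, 1.25]."""
    return 1000 / (tok_per_s * rng.uniform(0.75, 1.25))

rng = random.Random(42)  # fixed seed so the jitter is reproducible
print(time_between_tokens_ms(100))  # 10.0 ms per token at 100 tok/s
intervals = [jittered_interval_ms(100, rng) for _ in range(5)]
# each interval lies between 8 ms (factor 1.25) and ~13.3 ms (factor 0.75)
```

At higher jitter levels the simulator presumably widens the factor range; the formula itself is unchanged.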

Token Speed & User Experience

Research on human perception of streaming text suggests that responses below 10 tok/s feel noticeably slow, 30–80 tok/s feels natural for reading along, and above 150 tok/s the text arrives faster than most users can comfortably read in real time. The sweet spot for streaming UX is typically 50–120 tok/s.
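Those perception bands can be encoded as a rough lookup. The thresholds are the approximate figures quoted above; the gaps (10–30 and 80–150 tok/s) are labelled "in between" since the text does not characterise them:

```python
def streaming_feel(tok_per_s: float) -> str:
    """Map a throughput to the rough UX bands described above (approximate)."""
    if tok_per_s < 10:
        return "noticeably slow"
    if tok_per_s > 150:
        return "faster than most users can read in real time"
    if 30 <= tok_per_s <= 80:
        return "natural for reading along"
    return "in between"

print(streaming_feel(8))    # noticeably slow
print(streaming_feel(60))   # natural for reading along
print(streaming_feel(200))  # faster than most users can read in real time
```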

The perceived responsiveness of an LLM application depends more on TTFT than on raw throughput for short responses. A model that starts streaming in 200 ms at 40 tok/s often feels faster than one that starts in 2,000 ms at 100 tok/s — especially for conversational interfaces where users expect near-instant acknowledgement.

For longer outputs — code files, reports, essays — throughput dominates. A 2,000-token report takes 20 seconds at 100 tok/s but 67 seconds at 30 tok/s. At these lengths, users often prefer a "streaming completed, click to view" pattern rather than watching every character appear.
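The arithmetic behind these figures, as two small helpers (the 0.75 words-per-token conversion is the approximation used throughout this page):

```python
WORDS_PER_TOKEN = 0.75  # rough conversion, per the key-concepts section

def generation_time_s(output_tokens: int, tok_per_s: float) -> float:
    """Decode time only, excluding TTFT."""
    return output_tokens / tok_per_s

def time_for_words_s(n_words: int, tok_per_s: float) -> float:
    """Approximate time to generate N words: N / (tok/s x 0.75)."""
    return n_words / (tok_per_s * WORDS_PER_TOKEN)

print(generation_time_s(2000, 100))        # 20.0 s for the report at 100 tok/s
print(round(generation_time_s(2000, 30)))  # 67 s at 30 tok/s
print(time_for_words_s(300, 100))          # 4.0 s for a 300-word paragraph
```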

The simulator above lets you experience these differences directly. Try running the same 500-token scenario at 8 tok/s (local CPU) versus 120 tok/s (GPT-4o mini) to feel the real-world difference before choosing a model for your application.

Tips for Faster LLM Responses

  • Shorten your prompt to reduce TTFT: TTFT scales roughly linearly with prompt token count because the model must process your entire input before generating the first output token. Removing unnecessary context, trimming conversation history, and using concise system prompts can cut TTFT by 30–60% on long-context requests without changing model throughput at all.
  • Set max_tokens to limit response length: if your application only needs short answers, cap output tokens explicitly. A 100-token response at 60 tok/s completes in 1.7 seconds; a 500-token response at the same speed takes 8.3 seconds. Setting tight token limits also prevents runaway generations that inflate latency and cost simultaneously.
  • Choose smaller models for latency-critical paths: GPT-4o mini, Llama 3.1 8B, and Mistral 7B typically generate 2–5× more tokens per second than their larger counterparts. For tasks where quality is acceptable at smaller scale — intent classification, short summaries, simple Q&A — routing to a fast small model can cut p95 response time from 8 seconds to under 2 seconds.
  • Pick providers with low TTFT, not just high throughput: different providers hosting the same model can have dramatically different TTFT characteristics depending on their infrastructure and load. For interactive applications, benchmark TTFT at your expected traffic times — a provider with 200 ms TTFT at 60 tok/s often beats one with 1,500 ms TTFT at 100 tok/s for perceived responsiveness.
  • Stream responses — never wait for the full completion: always use streaming mode (stream: true) for user-facing responses. Streaming lets users start reading after the TTFT delay rather than waiting for the full generation to complete. A 500-token response streamed at 60 tok/s starts displaying content after a ~300 ms TTFT; the same response returned as a single JSON blob takes 8+ seconds before anything appears.
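The streaming advantage in that last tip is easy to quantify: with streaming, first paint is just the TTFT; without it, nothing appears until the whole response is done. A minimal sketch, assuming the illustrative 300 ms TTFT / 500 tokens / 60 tok/s figures from the tip:

```python
def streaming_first_paint_s(ttft_ms: float) -> float:
    """With stream=True, content appears as soon as the first token lands."""
    return ttft_ms / 1000

def blocking_first_paint_s(ttft_ms: float, output_tokens: int, tok_per_s: float) -> float:
    """Without streaming, first paint waits for the full generation."""
    return ttft_ms / 1000 + output_tokens / tok_per_s

print(streaming_first_paint_s(300))          # 0.3 s until text starts appearing
print(blocking_first_paint_s(300, 500, 60))  # ~8.6 s before anything appears
```

The gap grows linearly with output length, which is why streaming matters most for long generations.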

Data Sources & Accuracy

Speed benchmarks in this tool are derived from publicly available third-party evaluations including Artificial Analysis AI, community benchmarks on GitHub (llm-bench), and aggregated provider documentation. Figures represent approximate median throughput under typical load conditions as of July 2025.

Actual performance varies based on: concurrent requests on the provider's infrastructure, prompt length (which affects KV cache behaviour), output token count, network proximity to the provider's data center, and model version updates. The simulator uses these benchmarks as starting points for visual demonstration — not as guaranteed performance commitments.

For production planning, we recommend running your own benchmark suite against your specific prompt templates and expected load patterns using a tool such as OpenAI Evals or OpenLLM.

Sources: Artificial Analysis AI (artificialanalysis.ai); Anyscale Research Blog; Hugging Face Optimum Benchmark documentation. Accessed July 2025.