AI GPU Simulator — LLM Performance, VRAM & Tokens/sec Calculator
AI GPU Simulator is a free web tool that estimates how fast large language models (LLMs) run on a given GPU. Simulate VRAM usage, tokens per second, time to first token (TTFT), and decode speed for models like Llama 3.1, Qwen 2.5, Mixtral, and DeepSeek on GPUs including the RTX 5090, RTX 4090, RTX 3090, H100, and A100.
This page requires JavaScript to run the interactive simulator. Enable JavaScript or visit the open-source GitHub repository for documentation and benchmark data.
What you can do
- Check if a model fits in a GPU's VRAM at FP16, INT8, or INT4 precision.
- Estimate tokens-per-second decode throughput for single-user or batched workloads.
- Compare hardware before spending $2,000+ on a GPU that can't run your model.
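The checks above follow a common rule of thumb: a model's weight footprint is roughly parameter count times bytes per parameter, and single-user decode is memory-bandwidth-bound, since every weight is streamed from VRAM once per generated token. A minimal sketch of that estimate (not the simulator's exact model; the overhead allowance and hardware numbers below are illustrative assumptions):

```python
# Rule-of-thumb sketch, NOT the simulator's exact formulas.
# Assumption: decode streams all weights once per token, so
# tokens/sec ~= memory bandwidth / model size in bytes.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_gb(params_b: float, precision: str) -> float:
    """Weight footprint in GB for a model with params_b billion parameters."""
    return params_b * BYTES_PER_PARAM[precision]

def fits(params_b: float, precision: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """Fit check with a flat KV-cache/activation allowance (assumed 2 GB)."""
    return model_gb(params_b, precision) + overhead_gb <= vram_gb

def decode_tps(params_b: float, precision: str, bandwidth_gbps: float) -> float:
    """Bandwidth-bound single-user decode estimate in tokens/sec."""
    return bandwidth_gbps / model_gb(params_b, precision)

# Example: an 8B model at FP16 on a 24 GB GPU with ~1000 GB/s bandwidth.
print(fits(8, "fp16", 24))                 # True: 16 GB weights + 2 GB overhead
print(round(decode_tps(8, "fp16", 1000)))  # ~62 tokens/sec
```

The same arithmetic shows why quantization matters: the same 8B model at INT4 needs about 4 GB for weights and roughly quadruples the bandwidth-bound decode estimate.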