Quantization Deep Dive: GGUF, AWQ, GPTQ, EXL2 Compared (2026 Guide)
TL;DR: Running large language models locally requires trading precision for efficiency. This guide compares the five dominant quantization formats in 2026—GGUF, AWQ, GPTQ, EXL2, and FP8—so you can choose the right balance of speed, quality, and hardware compatibility for your use case.
1. Introduction: The Memory Problem
As covered in our self-hosting guide for local LLMs, the biggest barrier to running modern AI models isn’t compute—it’s memory. A 70B parameter model at full FP16 precision requires 140GB of VRAM. Even the flagship consumer GPU, the RTX 4090 with 24GB, holds barely a sixth of that.
Enter quantization: the process of reducing the precision of model weights from 16-bit (or 32-bit) floating-point numbers to lower-bit representations—typically 8-bit, 4-bit, or even lower. This dramatically reduces memory requirements and often improves inference speed, at the cost of some model accuracy.
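The memory arithmetic behind this is simple: weight storage is roughly parameter count times bits per weight. A quick sketch (function name is ours; activations and KV cache are ignored here):

```python
def model_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB; ignores activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {model_memory_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```

Halving the bit width halves the footprint, which is why the jump from FP16 to 4-bit turns an impossible model into a loadable one.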
Quantization isn’t new. It’s been used in deep learning for years to deploy models on mobile devices and edge hardware. But for large language models, it’s become essential. Without it, local AI would be impossible for all but the most well-funded enterprises.
The challenge? Not all quantization methods are created equal. Some prioritize speed. Others prioritize quality. Some work everywhere; others require specific hardware. This guide breaks down the five formats you need to know in 2026.
2. How Quantization Works (The Simple Version)
Before diving into formats, let’s understand what quantization actually does to your model.
Weights and Precision
Neural networks are essentially massive matrices of numbers (weights) that transform input data into output predictions. During training, these weights are typically stored as 32-bit floating-point numbers (FP32) for maximum precision.
For inference, 16-bit floating-point (FP16 or BF16) is usually sufficient—and cuts memory usage in half. But we can go further.
The Quantization Process
Quantization maps high-precision values to a smaller set of discrete values:
- INT8: 256 possible values (-128 to 127)
- INT4: 16 possible values (-8 to 7)
- FP8: 256 possible values with floating-point distribution
The simplest approach is linear quantization: find the min and max values in a weight tensor, then evenly distribute the quantized values across that range.
But this is rarely optimal. Weight distributions in neural networks aren’t uniform—they’re often Gaussian or have outliers. Better quantization methods use non-linear scaling, grouping (processing chunks of weights separately), and outlier preservation (keeping extreme values at higher precision).
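As a concrete sketch of the naive baseline the better methods improve on, here is min-max linear quantization with a scale and zero point (function names are ours for illustration):

```python
def linear_quantize(weights, bits=8):
    """Min-max linear quantization: map floats onto 2**bits integer levels."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # e.g. -128..127 for INT8
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin) or 1.0              # guard constant tensors
    zero_point = round(qmin - lo / scale)
    quants = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return quants, scale, zero_point

def dequantize(quants, scale, zero_point):
    """Recover approximate floats; rounding error is bounded by ~scale/2."""
    return [(q - zero_point) * scale for q in quants]
```

Round-tripping a tensor through these two functions shows why outliers hurt: a single extreme value stretches the min-max range, which coarsens the step size for every other weight.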
Calibration Matters
Post-training quantization (PTQ) converts a pre-trained model without retraining. This is fast but can hurt accuracy. Calibration—running representative data through the model to observe activation ranges—helps preserve quality by ensuring the quantization scheme accounts for actual data distributions.
More advanced methods like AWQ and GPTQ use activation-aware techniques that consider not just the weights, but how those weights are actually used during inference.
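A minimal sketch of the calibration step, assuming a `model_fn` that exposes one layer's activations for a batch (both names are hypothetical):

```python
def calibrate_ranges(model_fn, calibration_batches):
    """Feed representative batches through the model and track the min/max
    activation observed, so quantization scales reflect real data."""
    lo, hi = float("inf"), float("-inf")
    for batch in calibration_batches:
        for activation in model_fn(batch):
            lo, hi = min(lo, activation), max(hi, activation)
    return lo, hi
```

The resulting (lo, hi) range feeds directly into the scale and zero-point computation, in place of the raw weight min/max.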
3. The Formats: 5 Compared in Detail
3.1 GGUF (llama.cpp)
GGUF (GPT-Generated Unified Format) is the successor to GGML and the native format for llama.cpp—the C++ inference engine that started the local LLM revolution.
Key Characteristics:
- Universal compatibility: Runs on CPU, GPU, or both
- Multiple quantization levels: Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and more
- Metadata-rich: Stores vocabulary, special tokens, and model architecture in a single file
- Cross-platform: Windows, macOS, Linux, even mobile devices
GGUF’s quantization schemes are sophisticated. The “K-quants” use block-wise mixed precision—keeping certain critical weight tensors at higher precision while compressing the rest more aggressively. Q4_K_M (4-bit medium) is the sweet spot for most users, offering ~4.5 bits per weight on average with minimal quality loss.
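The ~4.5 bits-per-weight figure is just a weighted average over tensor types. The mix below is a hypothetical illustration, not llama.cpp's actual allocation:

```python
def effective_bpw(tensor_mix):
    """Average bits per weight over (fraction_of_weights, bits) pairs."""
    return sum(frac * bits for frac, bits in tensor_mix)

# Hypothetical mix: bulk of weights at 4-bit, sensitive tensors kept higher.
mix = [(0.60, 4.0), (0.30, 5.0), (0.10, 6.5)]
print(effective_bpw(mix))  # ~4.55 bits per weight
```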
Best for: CPU inference, edge devices, universal deployment, and when you need a single file that just works everywhere.
Example usage:
# Download a GGUF model from Hugging Face
wget https://huggingface.co/TheBloke/Llama-2-70B-GGUF/resolve/main/llama-2-70b.Q4_K_M.gguf
# Run with llama.cpp
./main -m llama-2-70b.Q4_K_M.gguf -p "The future of AI is" -n 512
3.2 AWQ (Activation-aware Weight Quantization)
AWQ, introduced in the paper “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” takes a different approach. Instead of treating all weights equally, it recognizes that some weights are more important than others based on their activation patterns.
Key Characteristics:
- Activation-aware: Protects weights that correspond to large activations
- 4-bit default: Typically quantizes to 4-bit with minimal accuracy loss
- Hardware-optimized: Designed for efficient inference on NVIDIA GPUs
- vLLM integration: Native support in the popular vLLM inference server
AWQ’s insight is simple but powerful: not all weights contribute equally to the output. By identifying “salient” weights through activation analysis and keeping them at higher precision, AWQ achieves better quality than naive quantization at the same bit width.
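A toy sketch of the saliency-selection idea (helper names are ours; the real method rescales salient channels rather than storing them separately):

```python
def salient_channels(channel_act_means, keep_frac=0.01):
    """Rank weight channels by mean activation magnitude and return the
    indices of the top fraction -- the channels AWQ protects before
    quantizing everything to 4-bit."""
    k = max(1, round(len(channel_act_means) * keep_frac))
    order = sorted(range(len(channel_act_means)),
                   key=lambda i: channel_act_means[i], reverse=True)
    return set(order[:k])
```

In the actual AWQ method, these channels are multiplied by a per-channel scale before quantization (and the scale folded into the preceding operation), which protects them without requiring mixed-precision storage.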
Best for: NVIDIA GPU deployment, production APIs using vLLM, and when you need the best quality-to-speed ratio on modern hardware.
Example usage:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Load AWQ-quantized model
model = AutoAWQForCausalLM.from_quantized(
"TheBloke/Llama-2-7B-AWQ",
fuse_layers=True,
use_cache=True
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")
# Generate
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
3.3 GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ is one of the earliest and most widely supported quantization methods for LLMs. Building on the Optimal Brain Quantization framework (itself derived from Optimal Brain Surgeon), it quantizes weights layer by layer while minimizing the error introduced at each step.
Key Characteristics:
- Layer-wise quantization: Processes one layer at a time, correcting for errors
- Widely supported: Works with AutoGPTQ, transformers, text-generation-inference, and more
- Flexible bit widths: Supports 2-bit through 8-bit quantization
- Group size tuning: Configurable grouping for accuracy/speed trade-offs
GPTQ’s layer-wise approach means it can account for how quantization errors propagate through the network. By using Hessian information (second-order derivatives), it makes smarter decisions about which weights to round and how to compensate.
The standard configuration is 4-bit quantization with a group size of 128, which typically achieves near-FP16 quality for generative tasks.
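Group size also determines metadata overhead: each group stores its own scale (and zero point), so smaller groups cost extra bits per weight. A sketch, assuming an FP16 scale and 4-bit zero point per group (the exact storage layout varies by implementation):

```python
def effective_bits(w_bit, group_size, scale_bits=16, zero_bits=4):
    """Bits per weight including per-group scale/zero-point overhead.
    The metadata sizes here are an assumption for illustration."""
    return w_bit + (scale_bits + zero_bits) / group_size

print(effective_bits(4, 128))  # 4.15625
print(effective_bits(4, 64))   # 4.3125 -- finer groups: more accurate, more VRAM
```

This is the quality/VRAM trade-off behind the group_size knob: halving the group size roughly doubles the metadata cost while tracking the weight distribution more closely.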
Best for: Maximum compatibility across tools, research experimentation, and when you need fine-grained control over quantization parameters.
Example usage:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# Quantize a model to 4-bit GPTQ
quantization_config = GPTQConfig(
bits=4,
group_size=128,
dataset="c4",
desc_act=False,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
3.4 EXL2 (ExLlamaV2)
EXL2 is the native format for ExLlamaV2, an inference engine designed specifically for maximum performance on consumer GPUs. It represents the bleeding edge of quantization research.
Key Characteristics:
- Optimal bit allocation: Can mix 2-bit through 8-bit within the same model
- Per-layer tuning: Different layers can have different precision based on their sensitivity
- Extremely fast: Often 2-3x faster than other 4-bit implementations
- VRAM efficient: Smart memory management for large context windows
EXL2’s killer feature is adaptive quantization. Instead of applying the same bit width everywhere, it analyzes each layer’s importance and allocates bits accordingly. Attention layers might get 6-bit while feed-forward layers get 4-bit, for example.
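The idea can be sketched as a greedy budget allocator. This is our simplification: ExLlamaV2's real optimizer measures per-layer quantization error from a measurement pass and supports fractional widths.

```python
def allocate_bits(sensitivity, budget_bpw, choices=(2.0, 4.0, 6.0, 8.0)):
    """Greedy sketch: start every layer at the narrowest width, then keep
    upgrading the most sensitive layers while the average bits-per-weight
    budget allows. Assumes all layers hold equally many weights."""
    n = len(sensitivity)
    bits = [choices[0]] * n
    order = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    upgraded = True
    while upgraded:
        upgraded = False
        for i in order:
            idx = choices.index(bits[i])
            if idx + 1 < len(choices):
                step = choices[idx + 1] - bits[i]
                if sum(bits) + step <= budget_bpw * n:
                    bits[i] = choices[idx + 1]
                    upgraded = True
    return bits
```

With sensitivities [5, 1, 1] and a 3.4 bpw budget over widths (2, 4, 6), the most sensitive layer is upgraded first and the average stays under budget.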
As covered in our guide to optimizing inference performance, EXL2 can achieve speeds approaching FP16 while using a fraction of the memory.
Best for: Maximum performance on NVIDIA GPUs, long context windows, and when you’re willing to trade some compatibility for raw speed.
Example usage:
# Convert to EXL2 with specific bit width
python convert.py \
  -i /path/to/model \
  -o /path/to/output \
  -b 4.5 \
  -m /path/to/measurement.json
# Run inference
python test_inference.py -m /path/to/output -p "The future of AI is"
3.5 FP8 (8-bit Floating Point)
FP8 is an emerging standard supported by NVIDIA’s Hopper (H100) and Blackwell (B200) architectures. Unlike integer quantization, it maintains floating-point representation with reduced precision.
Key Characteristics:
- Hardware-native: Dedicated FP8 tensor cores on H100/B200
- Two formats: E4M3 (4 exponent bits, 3 mantissa) and E5M2 (5 exponent, 2 mantissa)
- Minimal accuracy loss: Often indistinguishable from FP16 for inference
- Future-proof: Becoming the standard for datacenter inference
FP8 represents a shift in the quantization landscape. Instead of fighting against hardware designed for FP16/FP32, it uses formats that modern AI accelerators can process natively. This means no dequantization overhead and no accuracy-warping integer conversions.
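The E4M3/E5M2 trade-off falls out of the bit layout alone. A sketch for IEEE-like layouts; note that the actual E4M3 spec reclaims the all-ones exponent for finite values, so its true maximum is 448 rather than the 240 this formula yields:

```python
def ieee_style_max(exp_bits, man_bits):
    """Largest finite value when the all-ones exponent code is reserved
    for inf/NaN, as in IEEE-style floating-point layouts."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias   # top exponent code reserved
    return (2 - 2 ** -man_bits) * 2 ** max_exp

print(ieee_style_max(5, 2))  # 57344.0 -- E5M2: wide range, coarse precision
print(ieee_style_max(4, 3))  # 240.0   -- E4M3 IEEE-style (spec extends to 448)
```

E5M2 trades mantissa bits for range (useful for gradients in training); E4M3 keeps an extra mantissa bit for precision, which is why it is the usual choice for inference weights and activations.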
The trade-off? You need very new, very expensive hardware. For most local deployments, FP8 remains aspirational.
Best for: Datacenter deployment on H100/B200, training quantization, and future-proofing your inference stack.
Example usage:
import torch
import transformers
# Cast weights to FP8 storage (illustrative; production FP8 inference typically
# runs through NVIDIA Transformer Engine, TensorRT-LLM, or vLLM kernels)
model = transformers.AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b",
torch_dtype=torch.float8_e4m3fn,
device_map="auto"
)
4. Comparison Matrix
| Format | Bits | Speed | Quality | VRAM | Hardware | Best For |
|---|---|---|---|---|---|---|
| GGUF (Q4_K_M) | ~4.5 | Medium | High | Low | CPU/GPU | Universal deployment, edge devices |
| AWQ | 4 | Fast | Very High | Low | NVIDIA GPU | Production APIs, vLLM |
| GPTQ | 4 | Medium | High | Low | Any GPU | Maximum compatibility |
| EXL2 | 2-8 | Very Fast | High | Low | NVIDIA GPU | Maximum performance |
| FP8 | 8 | Very Fast | Very High | Medium | H100/B200 | Datacenter inference |
Speed Benchmarks (Llama-2-70B, RTX 4090, 4096 context)
| Format | Tokens/Second | VRAM Used |
|---|---|---|
| FP16 | 25 | ~140 GB |
| GGUF Q4_K_M | 35 | ~40 GB |
| AWQ | 45 | ~40 GB |
| GPTQ | 40 | ~40 GB |
| EXL2 (4.0 bpw) | 60 | ~38 GB |
Note: Illustrative figures. Actual speeds vary with context length, batch size, and model architecture—and the FP16 and ~40GB configurations exceed a single RTX 4090’s 24GB, so they imply multi-GPU setups or CPU offloading.
5. When to Use Which: A Decision Tree
Starting Point: What’s Your Hardware?
CPU Only or Mixed CPU/GPU?
→ Use GGUF. It’s the only format that truly excels on CPU and supports partial offloading (running some layers on GPU, rest on CPU).
NVIDIA GPU (RTX 20-series or newer)?
→ Continue to next question.
AMD GPU or other accelerator?
→ Use GGUF or GPTQ. AWQ and EXL2 are NVIDIA-specific.
NVIDIA GPU Owners: What’s Your Priority?
Maximum compatibility across tools?
→ Use GPTQ. Supported by virtually every inference framework.
Running a production API with vLLM?
→ Use AWQ. Native vLLM integration and excellent throughput.
Maximum speed for personal use?
→ Use EXL2. Fastest inference for local experimentation.
H100/B200 datacenter deployment?
→ Use FP8. Native hardware support and minimal accuracy loss.
Memory-Constrained?
If you’re trying to squeeze a 70B model onto a 24GB GPU:
- EXL2 with 3.5 bpw (bits per weight) — aggressive but usable
- GGUF Q3_K_M — more conservative, widely compatible
- AWQ/GPTQ with group_size=64 — better quality, slightly more VRAM
As covered in our guide to running 70B models on consumer hardware, context length dramatically affects memory usage. At 4K context, you need ~8GB additional VRAM. At 32K context, you need ~20GB more.
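The context cost is dominated by the KV cache, which can be estimated directly from the model config. The numbers below are a sketch; Llama-2-70B uses grouped-query attention (8 KV heads), which is why its cache is far smaller than a full multi-head design of the same size:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each shaped
    [n_kv_heads, ctx_len, head_dim], FP16 elements by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# 70B-class config: 80 layers, head_dim 128, 4K context
print(kv_cache_gb(80, 8, 128, 4096))   # ~1.34 GB with GQA (8 KV heads)
print(kv_cache_gb(80, 64, 128, 4096))  # ~10.7 GB without GQA (64 KV heads)
```

Quantizing the KV cache itself (e.g. to 8-bit) halves these figures again, which is how long-context setups fit on consumer cards.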
6. Hands-On: Quantizing a Model
Let’s walk through quantizing a model to each format. We’ll use Meta’s Llama-2-7B as an example.
Prerequisites
# Create environment
conda create -n quantization python=3.10
conda activate quantization
# Install dependencies
pip install torch transformers accelerate
pip install llama-cpp-python
pip install auto-gptq
pip install autoawq
pip install exllamav2
Quantizing to GGUF
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Download the model (or use Hugging Face)
python convert_hf_to_gguf.py \
  /path/to/llama-2-7b-hf \
  --outfile llama-2-7b-f16.gguf \
  --outtype f16
# Quantize to Q4_K_M
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q4_K_M.gguf Q4_K_M
Quantizing to AWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Quantizing to GPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-gptq"
quantization_config = GPTQConfig(
bits=4,
group_size=128,
dataset="c4",
desc_act=False,
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=quantization_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Save
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
Quantizing to EXL2
# Clone ExLlamaV2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
# First, create measurement file
python convert.py \
  -i /path/to/llama-2-7b-hf \
  -o ./work \
  -om ./measurement.json
# Convert with target bits per weight
python convert.py \
  -i /path/to/llama-2-7b-hf \
  -o ./output \
  -b 4.0 \
  -m ./measurement.json
7. Quality vs Speed Trade-offs
Understanding Perplexity
Perplexity measures how well a model predicts text—lower is better. It’s the standard metric for evaluating quantization quality.
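Concretely, perplexity is the exponential of the average negative log-likelihood per token:

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over a token sequence."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns every token probability 1/2 scores perplexity 2:
print(perplexity([math.log(0.5)] * 10))  # 2.0
```

Intuitively, a perplexity of N means the model is as uncertain as if it were choosing uniformly among N tokens at each step, which is why small absolute increases after quantization matter.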
| Format | WikiText2 Perplexity (Llama-2-7B) | Relative to FP16 |
|---|---|---|
| FP16 | 5.12 | 100% |
| GGUF Q8_0 | 5.13 | 99.8% |
| AWQ | 5.18 | 98.8% |
| GGUF Q4_K_M | 5.25 | 97.3% |
| GPTQ | 5.28 | 96.9% |
| EXL2 (4.0 bpw) | 5.30 | 96.5% |
| GGUF Q3_K_M | 5.85 | 87.5% |
Real-World Impact
Perplexity differences don’t always translate to noticeable quality degradation. In practice:
- Q8 and AWQ: Virtually indistinguishable from FP16
- Q4_K_M and GPTQ: Minor degradation, rarely noticeable for creative tasks
- Q3 and aggressive EXL2: Noticeable quality loss, but still usable for many tasks
- Q2: Significant degradation, only suitable for experimentation
Task-Specific Considerations
Creative writing (stories, poetry): More tolerant of quantization. Q4_K_M is usually fine.
Code generation: Surprisingly sensitive. Attention to exact syntax matters. Recommend AWQ or Q5+.
Reasoning/math: Moderately sensitive. Chain-of-thought can amplify quantization errors.
Factual recall: Generally robust. Quantization affects reasoning more than knowledge.
8. Common Issues & Fixes
Out of Memory (OOM) Errors
Problem: Model loads but crashes during inference.
Solutions:
- Reduce context length: `--ctx-size 2048` instead of 4096
- Use lower quantization: Q3_K_M instead of Q4_K_M
- Tune memory residency: `--mlock` to pin the model in RAM, or `mmap=True` in llama-cpp-python
- For GGUF, use partial GPU offload: `-ngl 20` (20 layers on GPU)
Slow Inference
Problem: Quantized model is slower than expected.
Solutions:
- Ensure CUDA kernels are compiled: `CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python`
- Use the right format for your hardware: EXL2 for RTX 4090, GGUF for CPU
- Check batch size: Some formats are optimized for batch=1
- Update drivers: Older NVIDIA drivers lack optimizations
Accuracy Loss
Problem: Output quality is noticeably degraded.
Solutions:
- Use higher bit width: Q5_K_M or Q6_K instead of Q4
- Try AWQ: Often better quality than GPTQ at same bit width
- Check calibration data: GPTQ benefits from domain-specific calibration
- Enable act-order (`desc_act=True`): slower, but often more accurate
Model Won’t Load
Problem: “Architecture not supported” or similar errors.
Solutions:
- Update your tools: Quantization support improves constantly
- Check compatibility matrix: Not all formats support all architectures
- Use GGUF as fallback: Broadest model support
- Verify file integrity: Re-download if checksums don’t match
9. Conclusion
Quantization has democratized access to large language models. What once required $50,000 in GPU hardware can now run on a consumer gaming PC—or even a laptop CPU.
The five formats covered here each serve different needs:
- GGUF for universal compatibility and edge deployment
- AWQ for production NVIDIA GPU serving
- GPTQ for maximum tooling compatibility
- EXL2 for bleeding-edge performance
- FP8 for datacenter future-proofing
As covered in our self-hosting guide, the best format is the one that lets you run the model you need on the hardware you have. Start with GGUF for experimentation, move to AWQ or EXL2 for production NVIDIA deployment, and keep an eye on FP8 as hardware evolves.
The memory problem isn’t solved—models continue to grow. But quantization ensures that local AI remains accessible even as frontier models push past 400B parameters.
Continue your local AI journey:
- Complete Guide to Self-Hosting LLMs — Infrastructure and setup
- llama.cpp Deep Dive — Mastering CPU inference
- vLLM Production Deployment — Scaling quantized models
- Hardware Guide for Local AI — GPU selection and optimization
- Fine-Tuning Quantized Models — QLoRA and beyond
- Building AI Agents Locally — Putting it all together
Sources & Further Reading
- llama.cpp GitHub Repository — The reference implementation for GGUF
- AWQ Paper (Lin et al., 2023) — Original activation-aware quantization research
- GPTQ Paper (Frantar et al., 2022) — Post-training quantization framework
- ExLlamaV2 GitHub Repository — High-performance inference engine
- FP8 Format Specification (NVIDIA) — Hardware documentation
- TheBloke’s Hugging Face Models — Pre-converted quantized models
- AutoGPTQ Documentation — GPTQ implementation for transformers
- AutoAWQ Documentation — AWQ quantization library
- Quantization Best Practices (Hugging Face) — Official transformers guide
- vLLM Documentation — Production inference server
- LLM.int8() Paper (Dettmers et al., 2022) — 8-bit quantization foundations
- QLoRA Paper (Dettmers et al., 2023) — Fine-tuning quantized models
- SmoothQuant Paper (Xiao et al., 2022) — Alternative quantization approach
- SpQR Paper (Dettmers et al., 2023) — Sparse quantization research
- AQLM Paper (Egiazarian et al., 2024) — Extreme 2-bit quantization
Last updated: March 2026
Questions? Join the discussion on Discord or follow @tsnmedia for updates.
