Quantization Deep Dive: GGUF, AWQ, GPTQ, EXL2 Compared (2026 Guide)
TL;DR: Running large language models locally requires trading precision for efficiency. This guide compares the five dominant quantization formats in 2026—GGUF, AWQ, GPTQ, EXL2, and FP8—so you can choose the right balance of speed, quality, and hardware compatibility for your use case.
1. Introduction: The Memory Problem
As covered in our self-hosting guide for local LLMs, the biggest barrier to running modern AI models isn’t compute—it’s memory. A 70B parameter model at full FP16 precision requires 140GB of VRAM. Even the flagship consumer GPU, the RTX 4090 with 24GB, holds barely a sixth of that.
Enter quantization: the process of reducing the precision of model weights from 16-bit (or 32-bit) floating-point numbers to lower-bit representations—typically 8-bit, 4-bit, or even lower. This dramatically reduces memory requirements and often improves inference speed, at the cost of some model accuracy.
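The memory arithmetic behind this is simple: weight storage is roughly parameter count times bits per weight. A quick sketch (function name is ours; activations and KV cache are ignored here):

```python
def model_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB; ignores activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit: {model_memory_gb(70e9, bits):.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```

Halving the bit width halves the footprint, which is why the jump from FP16 to 4-bit turns an impossible model into a loadable one.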
Quantization isn’t new. It’s been used in deep learning for years to deploy models on mobile devices and edge hardware. But for large language models, it’s become essential. Without it, local AI would be impossible for all but the most well-funded enterprises.
The challenge? Not all quantization methods are created equal. Some prioritize speed. Others prioritize quality. Some work everywhere; others require specific hardware. This guide breaks down the five formats you need to know in 2026.
2. How Quantization Works (The Simple Version)
Before diving into formats, let’s understand what quantization actually does to your model.
Weights and Precision
Neural networks are essentially massive matrices of numbers (weights) that transform input data into output predictions. During training, these weights are typically stored as 32-bit floating-point numbers (FP32) for maximum precision.
For inference, 16-bit floating-point (FP16 or BF16) is usually sufficient—and cuts memory usage in half. But we can go further.
The Quantization Process
Quantization maps high-precision values to a smaller set of discrete values:
- INT8: 256 possible values (-128 to 127)
- INT4: 16 possible values (-8 to 7)
- FP8: 256 possible values with floating-point distribution
The simplest approach is linear quantization: find the min and max values in a weight tensor, then evenly distribute the quantized values across that range.
But this is rarely optimal. Weight distributions in neural networks aren’t uniform—they’re often Gaussian or have outliers. Better quantization methods use non-linear scaling, grouping (processing chunks of weights separately), and outlier preservation (keeping extreme values at higher precision).
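As a concrete sketch of the naive baseline the better methods improve on, here is min-max linear quantization with a scale and zero point (function names are ours for illustration):

```python
def linear_quantize(weights, bits=8):
    """Min-max linear quantization: map floats onto 2**bits integer levels."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # e.g. -128..127 for INT8
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin) or 1.0              # guard constant tensors
    zero_point = round(qmin - lo / scale)
    quants = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return quants, scale, zero_point

def dequantize(quants, scale, zero_point):
    """Recover approximate floats; rounding error is bounded by ~scale/2."""
    return [(q - zero_point) * scale for q in quants]
```

Round-tripping a tensor through these two functions shows why outliers hurt: a single extreme value stretches the min-max range, which coarsens the step size for every other weight.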
Calibration Matters
Post-training quantization (PTQ) converts a pre-trained model without retraining. This is fast but can hurt accuracy. Calibration—running representative data through the model to observe activation ranges—helps preserve quality by ensuring the quantization scheme accounts for actual data distributions.
More advanced methods like AWQ and GPTQ use activation-aware techniques that consider not just the weights, but how those weights are actually used during inference.
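A minimal sketch of the calibration step, assuming a `model_fn` that exposes one layer's activations for a batch (both names are hypothetical):

```python
def calibrate_ranges(model_fn, calibration_batches):
    """Feed representative batches through the model and track the min/max
    activation observed, so quantization scales reflect real data."""
    lo, hi = float("inf"), float("-inf")
    for batch in calibration_batches:
        for activation in model_fn(batch):
            lo, hi = min(lo, activation), max(hi, activation)
    return lo, hi
```

The resulting (lo, hi) range feeds directly into the scale and zero-point computation, in place of the raw weight min/max.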
3. The Formats: 5 Compared in Detail
3.1 GGUF (llama.cpp)
GGUF (GPT-Generated Unified Format) is the successor to GGML and the native format for llama.cpp—the C++ inference engine that started the local LLM revolution.
Key Characteristics:
- Universal compatibility: Runs on CPU, GPU, or both
- Multiple quantization levels: Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and more
- Metadata-rich: Stores vocabulary, special tokens, and model architecture in a single file
- Cross-platform: Windows, macOS, Linux, even mobile devices
GGUF’s quantization schemes are sophisticated. The “K-quants” use block-wise mixed precision—keeping certain critical weight tensors at higher precision while compressing the rest more aggressively. Q4_K_M (4-bit medium) is the sweet spot for most users, offering ~4.5 bits per weight on average with minimal quality loss.
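The ~4.5 bits-per-weight figure is just a weighted average over tensor types. The mix below is a hypothetical illustration, not llama.cpp's actual allocation:

```python
def effective_bpw(tensor_mix):
    """Average bits per weight over (fraction_of_weights, bits) pairs."""
    return sum(frac * bits for frac, bits in tensor_mix)

# Hypothetical mix: bulk of weights at 4-bit, sensitive tensors kept higher.
mix = [(0.60, 4.0), (0.30, 5.0), (0.10, 6.5)]
print(effective_bpw(mix))  # ~4.55 bits per weight
```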
Best for: CPU inference, edge devices, universal deployment, and when you need a single file that just works everywhere.
Example usage:
# Download a GGUF model from Hugging Face
wget https://huggingface.co/TheBloke/Llama-2-70B-GGUF/resolve/main/llama-2-70b.Q4_K_M.gguf
# Run with llama.cpp
./main -m llama-2-70b.Q4_K_M.gguf -p "The future of AI is" -n 512
3.2 AWQ (Activation-aware Weight Quantization)
AWQ, introduced in the paper “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” takes a different approach. Instead of treating all weights equally, it recognizes that some weights are more important than others based on their activation patterns.
Key Characteristics:
- Activation-aware: Protects weights that correspond to large activations
- 4-bit default: Typically quantizes to 4-bit with minimal accuracy loss
- Hardware-optimized: Designed for efficient inference on NVIDIA GPUs
- vLLM integration: Native support in the popular vLLM inference server
AWQ’s insight is simple but powerful: not all weights contribute equally to the output. By identifying “salient” weights through activation analysis and keeping them at higher precision, AWQ achieves better quality than naive quantization at the same bit width.
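A toy sketch of the saliency-selection idea (helper names are ours; the real method rescales salient channels rather than storing them separately):

```python
def salient_channels(channel_act_means, keep_frac=0.01):
    """Rank weight channels by mean activation magnitude and return the
    indices of the top fraction -- the channels AWQ protects before
    quantizing everything to 4-bit."""
    k = max(1, round(len(channel_act_means) * keep_frac))
    order = sorted(range(len(channel_act_means)),
                   key=lambda i: channel_act_means[i], reverse=True)
    return set(order[:k])
```

In the actual AWQ method, these channels are multiplied by a per-channel scale before quantization (and the scale folded into the preceding operation), which protects them without requiring mixed-precision storage.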
Best for: NVIDIA GPU deployment, production APIs using vLLM, and when you need the best quality-to-speed ratio on modern hardware.
Example usage:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Load AWQ-quantized model
model = AutoAWQForCausalLM.from_quantized(
"TheBloke/Llama-2-7B-AWQ",
fuse_layers=True,
use_cache=True
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7B-AWQ")
# Generate
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
3.3 GPTQ (Generative Pre-trained Transformer Quantization)
GPTQ is one of the earliest and most widely supported quantization methods for LLMs. Building on the Optimal Brain Quantization framework (itself derived from Optimal Brain Surgeon), it quantizes weights layer by layer while minimizing the error introduced at each step.
Key Characteristics:
- Layer-wise quantization: Processes one layer at a time, correcting for errors
- Widely supported: Works with AutoGPTQ, transformers, text-generation-inference, and more
- Flexible bit widths: Supports 2-bit through 8-bit quantization
- Group size tuning: Configurable grouping for accuracy/speed trade-offs
GPTQ’s layer-wise approach means it can account for how quantization errors propagate through the network. By using Hessian information (second-order derivatives), it makes smarter decisions about which weights to round and how to compensate.
The standard configuration is 4-bit quantization with a group size of 128, which typically achieves near-FP16 quality for generative tasks.
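Group size also determines metadata overhead: each group stores its own scale (and zero point), so smaller groups cost extra bits per weight. A sketch, assuming an FP16 scale and 4-bit zero point per group (the exact storage layout varies by implementation):

```python
def effective_bits(w_bit, group_size, scale_bits=16, zero_bits=4):
    """Bits per weight including per-group scale/zero-point overhead.
    The metadata sizes here are an assumption for illustration."""
    return w_bit + (scale_bits + zero_bits) / group_size

print(effective_bits(4, 128))  # 4.15625
print(effective_bits(4, 64))   # 4.3125 -- finer groups: more accurate, more VRAM
```

This is the quality/VRAM trade-off behind the group_size knob: halving the group size roughly doubles the metadata cost while tracking the weight distribution more closely.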
Best for: Maximum compatibility across tools, research experimentation, and when you need fine-grained control over quantization parameters.
Example usage:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# Quantize a model to 4-bit GPTQ
quantization_config = GPTQConfig(
bits=4,
group_size=128,
dataset="c4",
desc_act=False,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto"
)
3.4 EXL2 (ExLlamaV2)
EXL2 is the native format for ExLlamaV2, an inference engine designed specifically for maximum performance on consumer GPUs. It represents the bleeding edge of quantization research.
Key Characteristics:
- Optimal bit allocation: Can mix 2-bit through 8-bit within the same model
- Per-layer tuning: Different layers can have different precision based on their sensitivity
- Extremely fast: Often 2-3x faster than other 4-bit implementations
- VRAM efficient: Smart memory management for large context windows
EXL2’s killer feature is adaptive quantization. Instead of applying the same bit width everywhere, it analyzes each layer’s importance and allocates bits accordingly. Attention layers might get 6-bit while feed-forward layers get 4-bit, for example.
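The idea can be sketched as a greedy budget allocator. This is our simplification: ExLlamaV2's real optimizer measures per-layer quantization error from a measurement pass and supports fractional widths.

```python
def allocate_bits(sensitivity, budget_bpw, choices=(2.0, 4.0, 6.0, 8.0)):
    """Greedy sketch: start every layer at the narrowest width, then keep
    upgrading the most sensitive layers while the average bits-per-weight
    budget allows. Assumes all layers hold equally many weights."""
    n = len(sensitivity)
    bits = [choices[0]] * n
    order = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    upgraded = True
    while upgraded:
        upgraded = False
        for i in order:
            idx = choices.index(bits[i])
            if idx + 1 < len(choices):
                step = choices[idx + 1] - bits[i]
                if sum(bits) + step <= budget_bpw * n:
                    bits[i] = choices[idx + 1]
                    upgraded = True
    return bits
```

With sensitivities [5, 1, 1] and a 3.4 bpw budget over widths (2, 4, 6), the most sensitive layer is upgraded first and the average stays under budget.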
As covered in our guide to optimizing inference performance, EXL2 can achieve speeds approaching FP16 while using a fraction of the memory.
Best for: Maximum performance on NVIDIA GPUs, long context windows, and when you’re willing to trade some compatibility for raw speed.
Example usage:
# Convert to EXL2 with specific bit width
python convert.py \
  -i /path/to/model \
  -o /path/to/output \
  -b 4.5 \
  -m /path/to/measurement.json
# Run inference
python test_inference.py -m /path/to/output -p "The future of AI is"
3.5 FP8 (8-bit Floating Point)
FP8 is an emerging standard supported by NVIDIA’s Hopper (H100) and Blackwell (B200) architectures. Unlike integer quantization, it maintains floating-point representation with reduced precision.
Key Characteristics:
- Hardware-native: Dedicated FP8 tensor cores on H100/B200
- Two formats: E4M3 (4 exponent bits, 3 mantissa) and E5M2 (5 exponent, 2 mantissa)
- Minimal accuracy loss: Often indistinguishable from FP16 for inference
- Future-proof: Becoming the standard for datacenter inference
FP8 represents a shift in the quantization landscape. Instead of fighting against hardware designed for FP16/FP32, it uses formats that modern AI accelerators can process natively. This means no dequantization overhead and no accuracy-warping integer conversions.
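The E4M3/E5M2 trade-off falls out of the bit layout alone. A sketch for IEEE-like layouts; note that the actual E4M3 spec reclaims the all-ones exponent for finite values, so its true maximum is 448 rather than the 240 this formula yields:

```python
def ieee_style_max(exp_bits, man_bits):
    """Largest finite value when the all-ones exponent code is reserved
    for inf/NaN, as in IEEE-style floating-point layouts."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias   # top exponent code reserved
    return (2 - 2 ** -man_bits) * 2 ** max_exp

print(ieee_style_max(5, 2))  # 57344.0 -- E5M2: wide range, coarse precision
print(ieee_style_max(4, 3))  # 240.0   -- E4M3 IEEE-style (spec extends to 448)
```

E5M2 trades mantissa bits for range (useful for gradients in training); E4M3 keeps an extra mantissa bit for precision, which is why it is the usual choice for inference weights and activations.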
The trade-off? You need very new, very expensive hardware. For most local deployments, FP8 remains aspirational.
Best for: Datacenter deployment on H100/B200, training quantization, and future-proofing your inference stack.
Example usage:
import torch
import transformers
# Cast weights to FP8 storage (illustrative; production FP8 inference typically
# runs through NVIDIA Transformer Engine, TensorRT-LLM, or vLLM kernels)
model = transformers.AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b",
torch_dtype=torch.float8_e4m3fn,
device_map="auto"
)
4. Comparison Matrix
| Format | Bits | Speed | Quality | VRAM | Hardware | Best For |
|---|---|---|---|---|---|---|
| GGUF (Q4_K_M) | ~4.5 | Medium | High | Low | CPU/GPU | Universal deployment, edge devices |
| AWQ | 4 | Fast | Very High | Low | NVIDIA GPU | Production APIs, vLLM |
| GPTQ | 4 | Medium | High | Low | Any GPU | Maximum compatibility |
| EXL2 | 2-8 | Very Fast | High | Low | NVIDIA GPU | Maximum performance |
| FP8 | 8 | Very Fast | Very High | Medium | H100/B200 | Datacenter inference |
Speed Benchmarks (Llama-2-70B, RTX 4090, 4096 context)
| Format | Tokens/Second | VRAM Used |
|---|---|---|
| FP16 | 25 | ~140 GB |
| GGUF Q4_K_M | 35 | ~40 GB |
| AWQ | 45 | ~40 GB |
| GPTQ | 40 | ~40 GB |
| EXL2 (4.0 bpw) | 60 | ~38 GB |
Note: Illustrative figures. Actual speeds vary with context length, batch size, and model architecture—and the FP16 and ~40GB configurations exceed a single RTX 4090’s 24GB, so they imply multi-GPU setups or CPU offloading.
5. When to Use Which: A Decision Tree
Starting Point: What’s Your Hardware?
CPU Only or Mixed CPU/GPU?
→ Use GGUF. It’s the only format that truly excels on CPU and supports partial offloading (running some layers on GPU, rest on CPU).
NVIDIA GPU (RTX 20-series or newer)?
→ Continue to next question.
AMD GPU or other accelerator?
→ Use GGUF or GPTQ. AWQ and EXL2 are NVIDIA-specific.
NVIDIA GPU Owners: What’s Your Priority?
Maximum compatibility across tools?
→ Use GPTQ. Supported by virtually every inference framework.
Running a production API with vLLM?
→ Use AWQ. Native vLLM integration and excellent throughput.
Maximum speed for personal use?
→ Use EXL2. Fastest inference for local experimentation.
H100/B200 datacenter deployment?
→ Use FP8. Native hardware support and minimal accuracy loss.
Memory-Constrained?
If you’re trying to squeeze a 70B model onto a 24GB GPU:
- EXL2 with 3.5 bpw (bits per weight) — aggressive but usable
- GGUF Q3_K_M — more conservative, widely compatible
- AWQ/GPTQ with group_size=64 — better quality, slightly more VRAM
As covered in our guide to running 70B models on consumer hardware, context length dramatically affects memory usage. At 4K context, you need ~8GB additional VRAM. At 32K context, you need ~20GB more.
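The context cost is dominated by the KV cache, which can be estimated directly from the model config. The numbers below are a sketch; Llama-2-70B uses grouped-query attention (8 KV heads), which is why its cache is far smaller than a full multi-head design of the same size:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, each shaped
    [n_kv_heads, ctx_len, head_dim], FP16 elements by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# 70B-class config: 80 layers, head_dim 128, 4K context
print(kv_cache_gb(80, 8, 128, 4096))   # ~1.34 GB with GQA (8 KV heads)
print(kv_cache_gb(80, 64, 128, 4096))  # ~10.7 GB without GQA (64 KV heads)
```

Quantizing the KV cache itself (e.g. to 8-bit) halves these figures again, which is how long-context setups fit on consumer cards.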
6. Hands-On: Quantizing a Model
Let’s walk through quantizing a model to each format. We’ll use Meta’s Llama-2-7B as an example.
Prerequisites
# Create environment
conda create -n quantization python=3.10
conda activate quantization
# Install dependencies
pip install torch transformers accelerate
pip install llama-cpp-python
pip install auto-gptq
pip install autoawq
pip install exllamav2
Quantizing to GGUF
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Download the model (or use Hugging Face)
python convert_hf_to_gguf.py \
  /path/to/llama-2-7b-hf \
  --outfile llama-2-7b-f16.gguf \
  --outtype f16
# Quantize to Q4_K_M
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q4_K_M.gguf Q4_K_M
Quantizing to AWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Quantizing to GPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-gptq"
quantization_config = GPTQConfig(
bits=4,
group_size=128,
dataset="c4",
desc_act=False,
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=quantization_config,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Save
model.save_pretrained(quant_path)
tokenizer.save_pretrained(quant_path)
Quantizing to EXL2
# Clone ExLlamaV2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
# First, create measurement file
python convert.py \
  -i /path/to/llama-2-7b-hf \
  -o ./work \
  -om ./measurement.json
# Convert with target bits per weight
python convert.py \
  -i /path/to/llama-2-7b-hf \
  -o ./output \
  -b 4.0 \
  -m ./measurement.json
7. Quality vs Speed Trade-offs
Understanding Perplexity
Perplexity measures how well a model predicts text—lower is better. It’s the standard metric for evaluating quantization quality.
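Concretely, perplexity is the exponential of the average negative log-likelihood per token:

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over a token sequence."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns every token probability 1/2 scores perplexity 2:
print(perplexity([math.log(0.5)] * 10))  # 2.0
```

Intuitively, a perplexity of N means the model is as uncertain as if it were choosing uniformly among N tokens at each step, which is why small absolute increases after quantization matter.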
| Format | WikiText2 Perplexity (Llama-2-7B) | Relative to FP16 |
|---|---|---|
| FP16 | 5.12 | 100% |
| GGUF Q8_0 | 5.13 | 99.8% |
| AWQ | 5.18 | 98.8% |
| GGUF Q4_K_M | 5.25 | 97.3% |
| GPTQ | 5.28 | 96.9% |
| EXL2 (4.0 bpw) | 5.30 | 96.5% |
| GGUF Q3_K_M | 5.85 | 87.5% |
Real-World Impact
Perplexity differences don’t always translate to noticeable quality degradation. In practice:
- Q8 and AWQ: Virtually indistinguishable from FP16
- Q4_K_M and GPTQ: Minor degradation, rarely noticeable for creative tasks
- Q3 and aggressive EXL2: Noticeable quality loss, but still usable for many tasks
- Q2: Significant degradation, only suitable for experimentation
Task-Specific Considerations
Creative writing (stories, poetry): More tolerant of quantization. Q4_K_M is usually fine.
Code generation: Surprisingly sensitive. Attention to exact syntax matters. Recommend AWQ or Q5+.
Reasoning/math: Moderately sensitive. Chain-of-thought can amplify quantization errors.
Factual recall: Generally robust. Quantization affects reasoning more than knowledge.
8. Common Issues & Fixes
Out of Memory (OOM) Errors
Problem: Model loads but crashes during inference.
Solutions:
- Reduce context length: `--ctx-size 2048` instead of 4096
- Use lower quantization: Q3_K_M instead of Q4_K_M
- Tune memory residency: `--mlock` to pin the model in RAM, or `mmap=True` in llama-cpp-python
- For GGUF, use partial GPU offload: `-ngl 20` (20 layers on GPU)
Slow Inference
Problem: Quantized model is slower than expected.
Solutions:
- Ensure CUDA kernels are compiled: `CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python`
- Use the right format for your hardware: EXL2 for RTX 4090, GGUF for CPU
- Check batch size: Some formats are optimized for batch=1
- Update drivers: Older NVIDIA drivers lack optimizations
Accuracy Loss
Problem: Output quality is noticeably degraded.
Solutions:
- Use higher bit width: Q5_K_M or Q6_K instead of Q4
- Try AWQ: Often better quality than GPTQ at same bit width
- Check calibration data: GPTQ benefits from domain-specific calibration
- Enable act-order (`desc_act=True`): slower, but often more accurate
Model Won’t Load
Problem: “Architecture not supported” or similar errors.
Solutions:
- Update your tools: Quantization support improves constantly
- Check compatibility matrix: Not all formats support all architectures
- Use GGUF as fallback: Broadest model support
- Verify file integrity: Re-download if checksums don’t match
9. Conclusion
Quantization has democratized access to large language models. What once required $50,000 in GPU hardware can now run on a consumer gaming PC—or even a laptop CPU.
The five formats covered here each serve different needs:
- GGUF for universal compatibility and edge deployment
- AWQ for production NVIDIA GPU serving
- GPTQ for maximum tooling compatibility
- EXL2 for bleeding-edge performance
- FP8 for datacenter future-proofing
As covered in our self-hosting guide, the best format is the one that lets you run the model you need on the hardware you have. Start with GGUF for experimentation, move to AWQ or EXL2 for production NVIDIA deployment, and keep an eye on FP8 as hardware evolves.
The memory problem isn’t solved—models continue to grow. But quantization ensures that local AI remains accessible even as frontier models push past 400B parameters.
Continue your local AI journey:
- Complete Guide to Self-Hosting LLMs — Infrastructure and setup
- llama.cpp Deep Dive — Mastering CPU inference
- vLLM Production Deployment — Scaling quantized models
- Hardware Guide for Local AI — GPU selection and optimization
- Fine-Tuning Quantized Models — QLoRA and beyond
- Building AI Agents Locally — Putting it all together
Sources & Further Reading
- llama.cpp GitHub Repository — The reference implementation for GGUF
- AWQ Paper (Lin et al., 2023) — Original activation-aware quantization research
- GPTQ Paper (Frantar et al., 2022) — Post-training quantization framework
- ExLlamaV2 GitHub Repository — High-performance inference engine
- FP8 Format Specification (NVIDIA) — Hardware documentation
- TheBloke’s Hugging Face Models — Pre-converted quantized models
- AutoGPTQ Documentation — GPTQ implementation for transformers
- AutoAWQ Documentation — AWQ quantization library
- Quantization Best Practices (Hugging Face) — Official transformers guide
- vLLM Documentation — Production inference server
- LLM.int8() Paper (Dettmers et al., 2022) — 8-bit quantization foundations
- QLoRA Paper (Dettmers et al., 2023) — Fine-tuning quantized models
- SmoothQuant Paper (Xiao et al., 2022) — Alternative quantization approach
- SpQR Paper (Dettmers et al., 2023) — Sparse quantization research
- AQLM Paper (Egiazarian et al., 2024) — Extreme 2-bit quantization
Last updated: March 2026
Questions? Join the discussion on Discord or follow @tsnmedia for updates.
