Running Vision LLMs Locally: LLaVA, BakLLaVA & Beyond (2026 Guide)
Analyze images with AI—completely offline, completely private
Introduction: Why Vision LLMs Matter
In 2026, the ability to understand images with AI isn’t just a novelty—it’s becoming essential. From developers debugging code via screenshots to educators creating accessible materials, vision-capable language models are transforming how we interact with visual information.
But there’s a catch. Most people turn to cloud APIs like GPT-4 Vision or Claude 3 Opus, uploading their images to remote servers. Every screenshot, every document, every photo you analyze gets sent to a third party. For personal photos, sensitive documents, or proprietary designs, that’s a significant privacy concern.
Running vision LLMs locally changes everything.
Your images never leave your machine. No API keys to manage. No per-image costs that scale unpredictably. And perhaps most importantly: no internet required once the model is downloaded.
Cloud vision APIs can cost $0.005-$0.01 per image. Analyze 10,000 images and you’re looking at $50-$100. Run a local model, and that cost drops to zero after the initial download. For researchers processing thousands of images, developers building vision-powered applications, or privacy-conscious users, local vision LLMs aren’t just an alternative—they’re often the better choice.
Building on our RAG guide, this article will show you exactly how to run powerful vision language models on your own hardware. Whether you have a gaming GPU or just a laptop, there’s a vision LLM that will work for you.
What Are Vision LLMs? Understanding Multimodal AI
Vision LLMs (also called multimodal models or vision-language models) are AI systems that can process both images and text simultaneously. Unlike traditional language models that only understand words, these models can “see” and describe what’s in an image, answer questions about visual content, and even reason across both modalities.
How They Work (Simplified)
At a high level, vision LLMs combine two components:
- Vision Encoder: Converts images into a format the language model can understand (typically a series of “tokens” representing visual features)
- Language Model: Processes these visual tokens alongside text tokens to generate responses
Think of it like this: the vision encoder “describes” the image to the language model in a language both understand. The language model then reasons about this description along with any text prompts you provide.
For example, when you ask “What’s in this image?” and provide a photo of a cat:
- The vision encoder analyzes the image and extracts features (fur texture, ear shape, body posture)
- These features are converted to special tokens that represent “cat-ness”
- The language model receives both your text question and these visual tokens
- It generates: “I see a domestic cat with orange tabby markings sitting on a windowsill…”
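The flow above can be sketched in a few lines of Python. The "encoder" and prompt builder here are stand-ins invented for illustration, not real model code; the point is the shape of the data, not the weights:

```python
# Toy sketch of the vision-LLM token flow. encode_image and build_prompt are
# hypothetical stand-ins -- real encoders emit embedding vectors, not strings.

def encode_image(image_bytes: bytes, n_tokens: int = 4) -> list[str]:
    """Stand-in vision encoder: maps image bytes to placeholder visual tokens."""
    return [f"<img_tok_{i}>" for i in range(n_tokens)]

def build_prompt(visual_tokens: list[str], question: str) -> str:
    """The language model sees visual tokens spliced in alongside the text."""
    return "".join(visual_tokens) + "\n" + question

tokens = encode_image(b"\xff\xd8fake-jpeg-bytes")
print(build_prompt(tokens, "What's in this image?"))
```

In a real model the visual tokens are continuous embeddings projected into the language model's input space, but the concatenation step works the same way.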
Key Capabilities
Modern vision LLMs can:
- Describe images in detail (image captioning)
- Answer questions about visual content (visual Q&A)
- Read text in images (OCR – Optical Character Recognition)
- Identify objects and their relationships
- Understand charts, diagrams, and UI elements
- Compare multiple images (in some models)
The Players: Local Vision Models Compared
The open-source vision LLM landscape has exploded. Here are the top models you can run locally in 2026:
LLaVA 1.5 / 1.6 (Most Popular)
Base Architecture: Llama 2/3 + CLIP vision encoder
LLaVA (Large Language and Vision Assistant) is the most widely adopted open-source vision LLM. Version 1.5 brought significant improvements in reasoning and detail, while 1.6 added better multi-image support and higher resolution handling.
Pros:
- Excellent community support and documentation
- Works with Ollama (easiest setup)
- Strong general-purpose performance
- Multiple sizes available (7B to 34B parameters)
Cons:
- Larger models need significant VRAM
- Can be slower than specialized alternatives
Best for: General-purpose vision tasks, beginners getting started
BakLLaVA (Efficiency Champion)
Base Architecture: Mistral + CLIP
BakLLaVA swaps the Llama base for Mistral, resulting in faster inference and often better performance per parameter. It’s become the go-to for users who want good vision capabilities without massive hardware requirements.
Pros:
- Faster than LLaVA at similar sizes
- Mistral’s efficient architecture
- Excellent for edge deployment
- Strong OCR capabilities
Cons:
- Smaller ecosystem than LLaVA
- Slightly less mature tooling
Best for: Performance-conscious users, edge deployments, faster inference needs
Moondream 2 (Tiny but Mighty)
Base Architecture: Custom tiny architecture (~1.6B parameters)
Moondream 2 proves that bigger isn’t always better. This tiny model can run on CPUs, Raspberry Pis, and even some smartphones while still delivering impressive vision capabilities.
Pros:
- Runs on almost anything (including CPU-only systems)
- Extremely fast inference
- Perfect for embedded applications
- Surprisingly capable for its size
Cons:
- Less detailed than larger models
- Struggles with complex reasoning tasks
- Limited context window
Best for: Edge devices, CPU-only setups, resource-constrained environments
CogVLM (Quality Leader)
Base Architecture: Custom visual expert architecture
CogVLM takes a different approach, adding a “visual expert” module that processes visual features in parallel with text. This results in exceptional image understanding quality, particularly for detailed scenes and complex diagrams.
Pros:
- State-of-the-art open-source vision quality
- Excellent at reading text in images
- Strong performance on benchmarks
- Handles complex visual reasoning well
Cons:
- Requires more VRAM (13B+ parameters)
- Slower inference than smaller models
- More complex setup
Best for: Maximum quality when hardware permits, document analysis, detailed image understanding
InternVL (Multilingual Powerhouse)
Base Architecture: InternLM + custom vision encoder
InternVL brings strong multilingual capabilities alongside excellent vision performance. If you need vision AI that works across languages, this is your model.
Pros:
- Strong multilingual support (Chinese, Japanese, etc.)
- Excellent benchmark scores
- Good balance of size and performance
- Active development and updates
Cons:
- Smaller English-speaking community
- Documentation can be sparse in English
Best for: Multilingual applications, non-English vision tasks
Hardware Requirements: What You Actually Need
Let’s cut through the speculation. Here’s what you actually need to run these models:
| Model | Size | Minimum VRAM | Recommended VRAM | CPU-Only? |
|---|---|---|---|---|
| Moondream 2 | 1.6B | 2 GB | 4 GB | Yes (slow) |
| LLaVA 1.5 (7B) | 7B | 6 GB | 8 GB | No |
| BakLLaVA (7B) | 7B | 6 GB | 8 GB | No |
| LLaVA 1.6 (13B) | 13B | 10 GB | 16 GB | No |
| CogVLM | 17B | 12 GB | 24 GB | No |
| InternVL (8B) | 8B | 8 GB | 12 GB | No |
Understanding Quantization
Those numbers assume quantized models—compressed versions that trade some quality for massive size reductions. Here’s what the suffixes mean:
- Q4_K_M: 4-bit quantization, medium quality (most common)
- Q5_K_M: 5-bit quantization, better quality, larger size
- Q8_0: 8-bit quantization, near-original quality, largest size
For most users, Q4_K_M offers the best balance. If you have VRAM to spare, Q5_K_M provides noticeable quality improvements.
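You can estimate a quantized model's file size yourself: parameter count times bits per weight. The bits-per-weight figures below are rough effective values (K-quants mix precisions internally, so actual GGUF sizes vary a little by architecture):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameter count times bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common quantization levels
for quant, bpw in {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}.items():
    print(f"7B model at {quant}: ~{approx_size_gb(7e9, bpw):.1f} GB")
```

This is why a 7B model that needs ~14 GB at full FP16 precision fits comfortably in 6-8 GB of VRAM once quantized.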
What If You Don’t Have a GPU?
Moondream 2 is your best bet for CPU-only operation. Other models will run on CPU but expect 10-30x slower inference. For occasional use, it’s workable. For regular analysis, a GPU becomes essential.
Step-by-Step: Running LLaVA with Ollama
Ollama is the easiest way to run vision LLMs locally. Here’s the complete setup:
Step 1: Install Ollama
macOS/Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows:
Download the installer from ollama.com and follow the prompts.
Step 2: Pull a Vision Model
```bash
# LLaVA 1.5 (7B) - Good balance of quality and speed
ollama pull llava:7b

# LLaVA 1.6 (13B) - Better quality, needs more VRAM
ollama pull llava:13b

# BakLLaVA - Faster alternative
ollama pull bakllava

# Moondream 2 - Tiny, runs almost anywhere
ollama pull moondream
```
Step 3: Test with an Image
The Ollama CLI has no separate image flag; you include the image path directly in the prompt and Ollama detects it:

```bash
# Basic image description
ollama run llava:7b "Describe this image: /path/to/your/image.jpg"

# Ask specific questions
ollama run llava:7b "What text do you see in this image? /path/to/document.png"
```
Step 4: Python Integration
Create a simple Python script for programmatic access:
```python
import ollama
import base64

def analyze_image(image_path, prompt="Describe this image in detail"):
    # Read and encode the image
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    # Call the model
    response = ollama.chat(
        model='llava:7b',
        messages=[{
            'role': 'user',
            'content': prompt,
            'images': [image_data]
        }]
    )
    return response['message']['content']

# Example usage
result = analyze_image('screenshot.png', 'What code is shown in this screenshot?')
print(result)
```
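If you'd rather not depend on the `ollama` Python package, Ollama also exposes a local REST API at port 11434 that you can call with the standard library. This sketch assumes a default Ollama install; the payload shape matches the `/api/chat` endpoint:

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body the /api/chat endpoint expects for vision models."""
    return {
        "model": model,
        "stream": False,  # return one complete response instead of a stream
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("utf-8")],
        }],
    }

def chat(model: str, prompt: str, image_path: str) -> str:
    with open(image_path, "rb") as f:
        payload = build_chat_payload(model, prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

This makes it easy to call Ollama from environments where installing extra packages is awkward, such as minimal containers.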
Step 5: Batch Processing
For processing multiple images:
```python
import json
from pathlib import Path

def batch_analyze(image_folder, output_file='results.json'):
    results = []
    image_paths = (list(Path(image_folder).glob('*.jpg')) +
                   list(Path(image_folder).glob('*.png')))

    for img_path in image_paths:
        print(f"Processing: {img_path.name}")
        description = analyze_image(str(img_path))
        results.append({
            'file': img_path.name,
            'description': description
        })

    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)

    return results

# Process a folder of images (uses analyze_image from Step 4)
batch_analyze('./my_images/', 'image_descriptions.json')
```
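Long batches get interrupted. A small, optional addition: read the filenames already recorded in the results file so a rerun can skip them. The function name and file format here just follow the batch script above:

```python
import json
from pathlib import Path

def load_done(output_file):
    """Return filenames recorded by a previous run, so a rerun can skip them."""
    path = Path(output_file)
    if not path.exists():
        return set()
    return {entry['file'] for entry in json.loads(path.read_text())}
```

In the batch loop, filter with `if img_path.name in load_done(output_file): continue` (and append to the existing results rather than overwriting them).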
Advanced: Running with llama.cpp
For more control over quantization, context length, and inference parameters, llama.cpp is the power user’s choice.
Installation
```bash
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CUDA support (for NVIDIA GPUs)
make LLAMA_CUDA=1

# Or build for CPU only
make
```
Downloading Vision Models
Vision models for llama.cpp typically come in GGUF format. Download from Hugging Face:
```bash
# LLaVA 1.5 7B (Q4_K_M quantization)
wget https://huggingface.co/cjpais/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-Q4_K_M.gguf

# Download the corresponding mmproj file (vision encoder)
wget https://huggingface.co/cjpais/llava-v1.5-7B-GGUF/resolve/main/mmproj-model-f16.gguf
```
Running Inference
```bash
./llava-cli \
  -m llava-v1.5-7b-Q4_K_M.gguf \
  --mmproj mmproj-model-f16.gguf \
  --image input.jpg \
  -p "Describe this image in detail:"
```
Python Binding Example
```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize the chat handler with vision support
chat_handler = Llava15ChatHandler(
    clip_model_path="mmproj-model-f16.gguf"
)

# Load the model
llm = Llama(
    model_path="llava-v1.5-7b-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,       # Context window
    n_gpu_layers=-1   # Offload all layers to GPU
)

# Analyze an image
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file://image.jpg"}},
                {"type": "text", "text": "What's happening in this image?"}
            ]
        }
    ]
)

print(response['choices'][0]['message']['content'])
```
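`file://` URLs can be fragile across platforms and working directories. A more portable option in the same OpenAI-style message format is to inline the image as a base64 data URI:

```python
import base64

def image_to_data_uri(path, mime="image/jpeg"):
    """Inline an image as a base64 data URI, sidestepping file:// resolution."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"
```

Then pass `{"type": "image_url", "image_url": {"url": image_to_data_uri("image.jpg")}}` in the message content instead of the `file://` URL.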
Quantization Options
llama.cpp supports various quantization strategies:
| Quantization | Size Reduction | Quality Impact | Use Case |
|---|---|---|---|
| Q4_0 | ~75% | Noticeable | Maximum compression |
| Q4_K_M | ~70% | Minimal | Best balance |
| Q5_K_M | ~60% | Very slight | Quality priority |
| Q8_0 | ~50% | Negligible | Near-original quality |
| F16 | None | Original | Maximum quality |
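A simple rule of thumb for choosing from that table: the GGUF file must fit in VRAM with headroom left for the KV cache and the vision encoder. The file sizes below are approximate figures for a 7B model (my assumption, not exact), and the overhead figure is a rough placeholder:

```python
from typing import Optional

# Approximate 7B-model GGUF sizes per quantization level (rough assumptions;
# actual sizes vary by architecture). overhead_gb covers KV cache + mmproj.
QUANT_SIZES_GB = {"Q4_K_M": 4.2, "Q5_K_M": 4.8, "Q8_0": 7.2, "F16": 13.5}

def pick_quant(vram_gb: float, overhead_gb: float = 1.5) -> Optional[str]:
    """Largest quantization whose weights fit alongside the runtime overhead."""
    usable = vram_gb - overhead_gb
    fitting = [(size, q) for q, size in QUANT_SIZES_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None

print(pick_quant(8.0))  # with ~8 GB VRAM, Q5_K_M fits under these assumptions
```

If nothing fits, drop to a smaller model (or Moondream 2) rather than a more aggressive quantization of a big one; below Q4 quality degrades quickly.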
Building on our guide to local LLMs, quantization is the key to fitting larger models into limited VRAM.
Use Cases & Practical Examples
1. Document OCR and Analysis
Extract text from scanned documents, receipts, or screenshots:
```python
def extract_document_text(image_path):
    prompt = """Please extract all text from this document.
    Preserve the formatting as much as possible.
    If there are tables, describe their structure."""
    return analyze_image(image_path, prompt)

# Process a receipt
receipt_text = extract_document_text('receipt.jpg')
print(receipt_text)
```
Real-world application: Automate expense reporting by extracting data from receipt photos.
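For expense reporting you usually want structured fields, not free text. One approach (a sketch, not the only way) is to prompt the model for JSON and parse its reply defensively, since models often wrap JSON in prose or code fences:

```python
import json
import re

# Hypothetical prompt for structured receipt extraction
RECEIPT_PROMPT = """Extract the merchant name, date, and total from this receipt.
Reply with ONLY a JSON object like {"merchant": "...", "date": "...", "total": 0.0}."""

def parse_model_json(raw: str) -> dict:
    """Pull the first {...} block out of a model reply and parse it,
    rather than trusting the raw text to be clean JSON."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in model reply: {raw[:80]!r}")
    return json.loads(match.group(0))
```

Combine it with the earlier helper: `parse_model_json(analyze_image('receipt.jpg', RECEIPT_PROMPT))`. Expect occasional parse failures with 7B models and retry or fall back to free text.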
2. Image Captioning for Accessibility
Generate alt text for images to improve web accessibility:
```python
def generate_alt_text(image_path):
    prompt = """Write a concise alt text description for this image.
    Keep it under 125 characters if possible, but include essential details.
    Focus on the main subject and context."""
    return analyze_image(image_path, prompt)

# Generate alt text for website images
alt_text = generate_alt_text('product-photo.jpg')
print(f'<img src="product-photo.jpg" alt="{alt_text}">')
```
Real-world application: Batch-process image libraries to add accessibility descriptions.
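The prompt only asks for the 125-character guideline; models don't reliably honor it. A small post-processing helper (a simple sketch, word-boundary truncation only) enforces it:

```python
def trim_alt_text(text: str, limit: int = 125) -> str:
    """Enforce the alt-text length guideline: cut at a word boundary
    and mark the truncation, since models often overshoot the limit."""
    text = text.strip()
    if len(text) <= limit:
        return text
    cut = text[:limit].rsplit(" ", 1)[0]  # drop any trailing partial word
    return cut.rstrip(",;:") + "…"
```

Run model output through this before writing it into the `alt` attribute.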
3. Visual Q&A for Education
Create interactive learning materials:
```python
def educational_qa(image_path, question):
    prompt = f"""You are a helpful tutor. Look at this educational image and answer the question.
    Explain your reasoning clearly and simply.
    Question: {question}"""
    return analyze_image(image_path, prompt)

# Example: Analyze a historical photograph
answer = educational_qa('ww2-photo.jpg',
                        'What can you tell me about the people and setting in this photograph?')
print(answer)
```
Real-world application: Build study tools that let students ask questions about diagrams, historical photos, or scientific illustrations.
4. Code Screenshot to Text
Convert screenshots of code into copyable text:
```python
def screenshot_to_code(image_path):
    prompt = """Extract all code visible in this screenshot.
    Format it properly with correct indentation.
    Only return the code, no explanations."""
    return analyze_image(image_path, prompt)

# Extract code from a tutorial screenshot
code = screenshot_to_code('tutorial-code.png')
with open('extracted_code.py', 'w') as f:
    f.write(code)
```
Real-world application: Convert video tutorials or documentation screenshots into working code.
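Even when told to return only code, vision models frequently wrap their answer in markdown fences. A defensive helper (assumes the reply contains at most one fenced block worth keeping) strips them before writing to disk:

```python
import re

def strip_code_fences(reply: str) -> str:
    """Keep only the code inside the first ``` fenced block, if any;
    otherwise return the reply unchanged (minus surrounding whitespace)."""
    match = re.search(r"```[\w+-]*\n(.*?)```", reply, re.DOTALL)
    return match.group(1).rstrip() if match else reply.strip()
```

Wrap the extraction call as `strip_code_fences(screenshot_to_code('tutorial-code.png'))` so the saved file is runnable rather than fenced markdown.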
5. Security Camera Analysis (Privacy-Preserving)
Analyze security footage locally without sending sensitive video to the cloud:
```python
import os
import cv2  # pip install opencv-python

def analyze_security_frame(frame_path):
    prompt = """Analyze this security camera frame.
    Describe any people, vehicles, or unusual activity.
    Note the approximate number of people and their general activity."""
    return analyze_image(frame_path, prompt)

# Process video frames (requires OpenCV)
def process_video_frames(video_path, interval_seconds=5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = int(fps * interval_seconds)

    frame_count = 0
    results = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            # Save frame temporarily
            temp_path = f'frame_{frame_count}.jpg'
            cv2.imwrite(temp_path, frame)

            # Analyze
            analysis = analyze_security_frame(temp_path)
            results.append({
                'timestamp': frame_count / fps,
                'analysis': analysis
            })

            # Clean up
            os.remove(temp_path)

        frame_count += 1

    cap.release()
    return results
```
Real-world application: Generate activity summaries from security footage without privacy risks.
Limitations & When to Use Cloud
Local vision LLMs are powerful, but they’re not perfect. Here’s when to consider cloud alternatives:
Local Model Limitations
| Limitation | Details |
|---|---|
| Resolution constraints | Most models downsample images to 336×336 or 448×448 pixels |
| Smaller knowledge base | Less general world knowledge than GPT-4V |
| Reasoning gaps | Complex multi-step visual reasoning can fail |
| Language limitations | English works best; other languages vary |
| Fine detail loss | Tiny text or distant objects may be missed |
When to Use Cloud APIs (GPT-4V, Claude 3 Opus)
Consider cloud vision APIs when:
- Maximum accuracy is critical (medical imaging, legal documents)
- Processing very high-resolution images (detailed technical diagrams)
- Need advanced reasoning (complex visual puzzles, multi-image comparison)
- Multilingual requirements (GPT-4V handles 100+ languages well)
- Infrequent usage (occasional analysis doesn’t justify hardware costs)
The Hybrid Approach
Many users find a hybrid workflow works best:
- Use local models for: Bulk processing, sensitive images, development/testing, offline work
- Use cloud APIs for: Critical decisions, complex analysis, final verification
Building on our AI cost analysis, this hybrid approach often delivers the best cost-quality balance.
Conclusion: Your Images, Your Control
Running vision LLMs locally puts you in control. Your images stay private. Your costs are predictable. And you’re not dependent on internet connectivity or API uptime.
We’ve covered:
- What vision LLMs are and how they work
- The top models available in 2026 (LLaVA, BakLLaVA, Moondream, CogVLM, InternVL)
- Hardware requirements for each model
- Step-by-step setup with Ollama
- Advanced usage with llama.cpp
- Practical applications from OCR to security analysis
Start small. If you have limited hardware, begin with Moondream 2 or LLaVA 7B via Ollama. As you get comfortable, experiment with larger models and different quantization levels.
The ability to understand images with AI is no longer locked behind corporate APIs. It’s on your machine, ready when you are.
Continue your local AI journey:
- Building on our RAG implementation guide for adding document retrieval to your vision workflows
- Explore running larger models efficiently with advanced quantization
- Learn about AI cost optimization strategies
- Discover multimodal AI applications beyond vision
- Read our complete local AI setup guide for the full stack
Sources & Further Reading
- LLaVA Project Website – Official LLaVA documentation and papers
- Ollama Vision Models – Curated vision model collection
- llama.cpp GitHub Repository – Official implementation
- Moondream 2 Hugging Face – Tiny vision model
- CogVLM GitHub – High-quality vision model
- InternVL Documentation – Multilingual vision LLM
- BakLLaVA on Ollama – Mistral-based vision model
- GGUF Format Specification – Model quantization format
- Hugging Face Multimodal Models – Model repository
- OpenAI GPT-4V Documentation – Cloud API reference
- Claude 3 Vision Capabilities – Anthropic’s vision features
- LocalAI Project – Alternative local AI platform
- Vision LLM Benchmarks – Performance comparisons
- Quantization Techniques Explained – Academic paper on LLM quantization
- Privacy-Preserving AI Guide – Related article on data privacy
