Running Vision LLMs Locally: LLaVA, BakLLaVA & Beyond (2026 Guide)
Analyze images with AI—completely offline, completely private
Introduction: Why Vision LLMs Matter
In 2026, the ability to understand images with AI isn’t just a novelty—it’s becoming essential. From developers debugging code via screenshots to educators creating accessible materials, vision-capable language models are transforming how we interact with visual information.
But there’s a catch. Most people turn to cloud APIs like GPT-4 Vision or Claude 3 Opus, uploading their images to remote servers. Every screenshot, every document, every photo you analyze gets sent to a third party. For personal photos, sensitive documents, or proprietary designs, that’s a significant privacy concern.
Running vision LLMs locally changes everything.
Your images never leave your machine. No API keys to manage. No per-image costs that scale unpredictably. And perhaps most importantly: no internet required once the model is downloaded.
Cloud vision APIs can cost $0.005-$0.01 per image. Analyze 10,000 images and you’re looking at $50-$100. Run a local model, and that cost drops to zero after the initial download. For researchers processing thousands of images, developers building vision-powered applications, or privacy-conscious users, local vision LLMs aren’t just an alternative—they’re often the better choice.
Building on our RAG guide, this article will show you exactly how to run powerful vision language models on your own hardware. Whether you have a gaming GPU or just a laptop, there’s a vision LLM that will work for you.
What Are Vision LLMs? Understanding Multimodal AI
Vision LLMs (also called multimodal models or vision-language models) are AI systems that can process both images and text simultaneously. Unlike traditional language models that only understand words, these models can “see” and describe what’s in an image, answer questions about visual content, and even reason across both modalities.
How They Work (Simplified)
At a high level, vision LLMs combine two components:
- Vision Encoder: Converts images into a format the language model can understand (typically a series of “tokens” representing visual features)
- Language Model: Processes these visual tokens alongside text tokens to generate responses
Think of it like this: the vision encoder “describes” the image to the language model in a language both understand. The language model then reasons about this description along with any text prompts you provide.
For example, when you ask “What’s in this image?” and provide a photo of a cat:
- The vision encoder analyzes the image and extracts features (fur texture, ear shape, body posture)
- These features are converted to special tokens that represent “cat-ness”
- The language model receives both your text question and these visual tokens
- It generates: “I see a domestic cat with orange tabby markings sitting on a windowsill…”
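The flow above can be sketched in a few lines of Python. The "encoder" and prompt builder here are stand-ins invented for illustration, not real model code; the point is the shape of the data, not the weights:

```python
# Toy sketch of the vision-LLM token flow. encode_image and build_prompt are
# hypothetical stand-ins -- real encoders emit embedding vectors, not strings.

def encode_image(image_bytes: bytes, n_tokens: int = 4) -> list[str]:
    """Stand-in vision encoder: maps image bytes to placeholder visual tokens."""
    return [f"<img_tok_{i}>" for i in range(n_tokens)]

def build_prompt(visual_tokens: list[str], question: str) -> str:
    """The language model sees visual tokens spliced in alongside the text."""
    return "".join(visual_tokens) + "\n" + question

tokens = encode_image(b"\xff\xd8fake-jpeg-bytes")
print(build_prompt(tokens, "What's in this image?"))
```

In a real model the visual tokens are continuous embeddings projected into the language model's input space, but the concatenation step works the same way.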
Key Capabilities
Modern vision LLMs can:
- Describe images in detail (image captioning)
- Answer questions about visual content (visual Q&A)
- Read text in images (OCR – Optical Character Recognition)
- Identify objects and their relationships
- Understand charts, diagrams, and UI elements
- Compare multiple images (in some models)
The Players: Local Vision Models Compared
The open-source vision LLM landscape has exploded. Here are the top models you can run locally in 2026:
LLaVA 1.5 / 1.6 (Most Popular)
Base Architecture: Llama 2/3 + CLIP vision encoder
LLaVA (Large Language and Vision Assistant) is the most widely adopted open-source vision LLM. Version 1.5 brought significant improvements in reasoning and detail, while 1.6 added better multi-image support and higher resolution handling.
Pros:
- Excellent community support and documentation
- Works with Ollama (easiest setup)
- Strong general-purpose performance
- Multiple sizes available (7B to 34B parameters)
Cons:
- Larger models need significant VRAM
- Can be slower than specialized alternatives
Best for: General-purpose vision tasks, beginners getting started
BakLLaVA (Efficiency Champion)
Base Architecture: Mistral + CLIP
BakLLaVA swaps the Llama base for Mistral, resulting in faster inference and often better performance per parameter. It’s become the go-to for users who want good vision capabilities without massive hardware requirements.
Pros:
- Faster than LLaVA at similar sizes
- Mistral’s efficient architecture
- Excellent for edge deployment
- Strong OCR capabilities
Cons:
- Smaller ecosystem than LLaVA
- Slightly less mature tooling
Best for: Performance-conscious users, edge deployments, faster inference needs
Moondream 2 (Tiny but Mighty)
Base Architecture: Custom tiny architecture (~1.6B parameters)
Moondream 2 proves that bigger isn’t always better. This tiny model can run on CPUs, Raspberry Pis, and even some smartphones while still delivering impressive vision capabilities.
Pros:
- Runs on almost anything (including CPU-only systems)
- Extremely fast inference
- Perfect for embedded applications
- Surprisingly capable for its size
Cons:
- Less detailed than larger models
- Struggles with complex reasoning tasks
- Limited context window
Best for: Edge devices, CPU-only setups, resource-constrained environments
CogVLM (Quality Leader)
Base Architecture: Custom visual expert architecture
CogVLM takes a different approach, adding a “visual expert” module that processes visual features in parallel with text. This results in exceptional image understanding quality, particularly for detailed scenes and complex diagrams.
Pros:
- State-of-the-art open-source vision quality
- Excellent at reading text in images
- Strong performance on benchmarks
- Handles complex visual reasoning well
Cons:
- Requires more VRAM (13B+ parameters)
- Slower inference than smaller models
- More complex setup
Best for: Maximum quality when hardware permits, document analysis, detailed image understanding
InternVL (Multilingual Powerhouse)
Base Architecture: InternLM + custom vision encoder
InternVL brings strong multilingual capabilities alongside excellent vision performance. If you need vision AI that works across languages, this is your model.
Pros:
- Strong multilingual support (Chinese, Japanese, etc.)
- Excellent benchmark scores
- Good balance of size and performance
- Active development and updates
Cons:
- Smaller English-speaking community
- Documentation can be sparse in English
Best for: Multilingual applications, non-English vision tasks
Hardware Requirements: What You Actually Need
Let’s cut through the speculation. Here’s what you actually need to run these models:
| Model | Size | Minimum VRAM | Recommended VRAM | CPU-Only? |
|---|---|---|---|---|
| Moondream 2 | 1.6B | 2 GB | 4 GB | Yes (slow) |
| LLaVA 1.5 (7B) | 7B | 6 GB | 8 GB | No |
| BakLLaVA (7B) | 7B | 6 GB | 8 GB | No |
| LLaVA 1.6 (13B) | 13B | 10 GB | 16 GB | No |
| CogVLM | 17B | 12 GB | 24 GB | No |
| InternVL (8B) | 8B | 8 GB | 12 GB | No |
Understanding Quantization
Those numbers assume quantized models—compressed versions that trade some quality for massive size reductions. Here’s what the suffixes mean:
- Q4_K_M: 4-bit quantization, medium quality (most common)
- Q5_K_M: 5-bit quantization, better quality, larger size
- Q8_0: 8-bit quantization, near-original quality, largest size
For most users, Q4_K_M offers the best balance. If you have VRAM to spare, Q5_K_M provides noticeable quality improvements.
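You can estimate a quantized model's file size yourself: parameter count times bits per weight. The bits-per-weight figures below are rough effective values (K-quants mix precisions internally, so actual GGUF sizes vary a little by architecture):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameter count times bits per weight, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common quantization levels
for quant, bpw in {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}.items():
    print(f"7B model at {quant}: ~{approx_size_gb(7e9, bpw):.1f} GB")
```

This is why a 7B model that needs ~14 GB at full FP16 precision fits comfortably in 6-8 GB of VRAM once quantized.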
What If You Don’t Have a GPU?
Moondream 2 is your best bet for CPU-only operation. Other models will run on CPU but expect 10-30x slower inference. For occasional use, it’s workable. For regular analysis, a GPU becomes essential.
Step-by-Step: Running LLaVA with Ollama
Ollama is the easiest way to run vision LLMs locally. Here’s the complete setup:
Step 1: Install Ollama
macOS/Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows:
Download the installer from ollama.com and follow the prompts.
Step 2: Pull a Vision Model
```bash
# LLaVA 1.5 (7B) - Good balance of quality and speed
ollama pull llava:7b

# LLaVA 1.6 (13B) - Better quality, needs more VRAM
ollama pull llava:13b

# BakLLaVA - Faster alternative
ollama pull bakllava

# Moondream 2 - Tiny, runs almost anywhere
ollama pull moondream
```
Step 3: Test with an Image
The Ollama CLI has no separate image flag; you include the image path directly in the prompt and Ollama detects it:

```bash
# Basic image description
ollama run llava:7b "Describe this image: /path/to/your/image.jpg"

# Ask specific questions
ollama run llava:7b "What text do you see in this image? /path/to/document.png"
```
Step 4: Python Integration
Create a simple Python script for programmatic access:
```python
import ollama
import base64

def analyze_image(image_path, prompt="Describe this image in detail"):
    # Read and encode the image
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    # Call the model
    response = ollama.chat(
        model='llava:7b',
        messages=[{
            'role': 'user',
            'content': prompt,
            'images': [image_data]
        }]
    )
    return response['message']['content']

# Example usage
result = analyze_image('screenshot.png', 'What code is shown in this screenshot?')
print(result)
```
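If you'd rather not depend on the `ollama` Python package, Ollama also exposes a local REST API at port 11434 that you can call with the standard library. This sketch assumes a default Ollama install; the payload shape matches the `/api/chat` endpoint:

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body the /api/chat endpoint expects for vision models."""
    return {
        "model": model,
        "stream": False,  # return one complete response instead of a stream
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("utf-8")],
        }],
    }

def chat(model: str, prompt: str, image_path: str) -> str:
    with open(image_path, "rb") as f:
        payload = build_chat_payload(model, prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

This makes it easy to call Ollama from environments where installing extra packages is awkward, such as minimal containers.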
Step 5: Batch Processing
For processing multiple images:
```python
import json
from pathlib import Path

def batch_analyze(image_folder, output_file='results.json'):
    results = []
    image_paths = (list(Path(image_folder).glob('*.jpg')) +
                   list(Path(image_folder).glob('*.png')))

    for img_path in image_paths:
        print(f"Processing: {img_path.name}")
        description = analyze_image(str(img_path))
        results.append({
            'file': img_path.name,
            'description': description
        })

    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)

    return results

# Process a folder of images (uses analyze_image from Step 4)
batch_analyze('./my_images/', 'image_descriptions.json')
```
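Long batches get interrupted. A small, optional addition: read the filenames already recorded in the results file so a rerun can skip them. The function name and file format here just follow the batch script above:

```python
import json
from pathlib import Path

def load_done(output_file):
    """Return filenames recorded by a previous run, so a rerun can skip them."""
    path = Path(output_file)
    if not path.exists():
        return set()
    return {entry['file'] for entry in json.loads(path.read_text())}
```

In the batch loop, filter with `if img_path.name in load_done(output_file): continue` (and append to the existing results rather than overwriting them).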
Advanced: Running with llama.cpp
For more control over quantization, context length, and inference parameters, llama.cpp is the power user’s choice.
Installation
```bash
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CUDA support (for NVIDIA GPUs)
make LLAMA_CUDA=1

# Or build for CPU only
make
```
Downloading Vision Models
Vision models for llama.cpp typically come in GGUF format. Download from Hugging Face:
```bash
# LLaVA 1.5 7B (Q4_K_M quantization)
wget https://huggingface.co/cjpais/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-Q4_K_M.gguf

# Download the corresponding mmproj file (vision encoder)
wget https://huggingface.co/cjpais/llava-v1.5-7B-GGUF/resolve/main/mmproj-model-f16.gguf
```
Running Inference
```bash
./llava-cli \
  -m llava-v1.5-7b-Q4_K_M.gguf \
  --mmproj mmproj-model-f16.gguf \
  --image input.jpg \
  -p "Describe this image in detail:"
```
Python Binding Example
```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize the chat handler with vision support
chat_handler = Llava15ChatHandler(
    clip_model_path="mmproj-model-f16.gguf"
)

# Load the model
llm = Llama(
    model_path="llava-v1.5-7b-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,       # Context window
    n_gpu_layers=-1   # Offload all layers to GPU
)

# Analyze an image
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file://image.jpg"}},
                {"type": "text", "text": "What's happening in this image?"}
            ]
        }
    ]
)

print(response['choices'][0]['message']['content'])
```
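`file://` URLs can be fragile across platforms and working directories. A more portable option in the same OpenAI-style message format is to inline the image as a base64 data URI:

```python
import base64

def image_to_data_uri(path, mime="image/jpeg"):
    """Inline an image as a base64 data URI, sidestepping file:// resolution."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"
```

Then pass `{"type": "image_url", "image_url": {"url": image_to_data_uri("image.jpg")}}` in the message content instead of the `file://` URL.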
Quantization Options
llama.cpp supports various quantization strategies:
| Quantization | Size Reduction | Quality Impact | Use Case |
|---|---|---|---|
| Q4_0 | ~75% | Noticeable | Maximum compression |
| Q4_K_M | ~70% | Minimal | Best balance |
| Q5_K_M | ~60% | Very slight | Quality priority |
| Q8_0 | ~50% | Negligible | Near-original quality |
| F16 | None | Original | Maximum quality |
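A simple rule of thumb for choosing from that table: the GGUF file must fit in VRAM with headroom left for the KV cache and the vision encoder. The file sizes below are approximate figures for a 7B model (my assumption, not exact), and the overhead figure is a rough placeholder:

```python
from typing import Optional

# Approximate 7B-model GGUF sizes per quantization level (rough assumptions;
# actual sizes vary by architecture). overhead_gb covers KV cache + mmproj.
QUANT_SIZES_GB = {"Q4_K_M": 4.2, "Q5_K_M": 4.8, "Q8_0": 7.2, "F16": 13.5}

def pick_quant(vram_gb: float, overhead_gb: float = 1.5) -> Optional[str]:
    """Largest quantization whose weights fit alongside the runtime overhead."""
    usable = vram_gb - overhead_gb
    fitting = [(size, q) for q, size in QUANT_SIZES_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None

print(pick_quant(8.0))  # with ~8 GB VRAM, Q5_K_M fits under these assumptions
```

If nothing fits, drop to a smaller model (or Moondream 2) rather than a more aggressive quantization of a big one; below Q4 quality degrades quickly.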
Building on our guide to local LLMs, quantization is the key to fitting larger models into limited VRAM.
Use Cases & Practical Examples
1. Document OCR and Analysis
Extract text from scanned documents, receipts, or screenshots:
```python
def extract_document_text(image_path):
    prompt = """Please extract all text from this document.
    Preserve the formatting as much as possible.
    If there are tables, describe their structure."""
    return analyze_image(image_path, prompt)

# Process a receipt
receipt_text = extract_document_text('receipt.jpg')
print(receipt_text)
```
Real-world application: Automate expense reporting by extracting data from receipt photos.
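For expense reporting you usually want structured fields, not free text. One approach (a sketch, not the only way) is to prompt the model for JSON and parse its reply defensively, since models often wrap JSON in prose or code fences:

```python
import json
import re

# Hypothetical prompt for structured receipt extraction
RECEIPT_PROMPT = """Extract the merchant name, date, and total from this receipt.
Reply with ONLY a JSON object like {"merchant": "...", "date": "...", "total": 0.0}."""

def parse_model_json(raw: str) -> dict:
    """Pull the first {...} block out of a model reply and parse it,
    rather than trusting the raw text to be clean JSON."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in model reply: {raw[:80]!r}")
    return json.loads(match.group(0))
```

Combine it with the earlier helper: `parse_model_json(analyze_image('receipt.jpg', RECEIPT_PROMPT))`. Expect occasional parse failures with 7B models and retry or fall back to free text.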
2. Image Captioning for Accessibility
Generate alt text for images to improve web accessibility:
```python
def generate_alt_text(image_path):
    prompt = """Write a concise alt text description for this image.
    Keep it under 125 characters if possible, but include essential details.
    Focus on the main subject and context."""
    return analyze_image(image_path, prompt)

# Generate alt text for website images
alt_text = generate_alt_text('product-photo.jpg')
print(f'<img src="product-photo.jpg" alt="{alt_text}">')
```
Real-world application: Batch-process image libraries to add accessibility descriptions.
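The prompt only asks for the 125-character guideline; models don't reliably honor it. A small post-processing helper (a simple sketch, word-boundary truncation only) enforces it:

```python
def trim_alt_text(text: str, limit: int = 125) -> str:
    """Enforce the alt-text length guideline: cut at a word boundary
    and mark the truncation, since models often overshoot the limit."""
    text = text.strip()
    if len(text) <= limit:
        return text
    cut = text[:limit].rsplit(" ", 1)[0]  # drop any trailing partial word
    return cut.rstrip(",;:") + "…"
```

Run model output through this before writing it into the `alt` attribute.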
3. Visual Q&A for Education
Create interactive learning materials:
```python
def educational_qa(image_path, question):
    prompt = f"""You are a helpful tutor. Look at this educational image and answer the question.
    Explain your reasoning clearly and simply.
    Question: {question}"""
    return analyze_image(image_path, prompt)

# Example: Analyze a historical photograph
answer = educational_qa('ww2-photo.jpg',
                        'What can you tell me about the people and setting in this photograph?')
print(answer)
```
Real-world application: Build study tools that let students ask questions about diagrams, historical photos, or scientific illustrations.
4. Code Screenshot to Text
Convert screenshots of code into copyable text:
```python
def screenshot_to_code(image_path):
    prompt = """Extract all code visible in this screenshot.
    Format it properly with correct indentation.
    Only return the code, no explanations."""
    return analyze_image(image_path, prompt)

# Extract code from a tutorial screenshot
code = screenshot_to_code('tutorial-code.png')
with open('extracted_code.py', 'w') as f:
    f.write(code)
```
Real-world application: Convert video tutorials or documentation screenshots into working code.
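Even when told to return only code, vision models frequently wrap their answer in markdown fences. A defensive helper (assumes the reply contains at most one fenced block worth keeping) strips them before writing to disk:

```python
import re

def strip_code_fences(reply: str) -> str:
    """Keep only the code inside the first ``` fenced block, if any;
    otherwise return the reply unchanged (minus surrounding whitespace)."""
    match = re.search(r"```[\w+-]*\n(.*?)```", reply, re.DOTALL)
    return match.group(1).rstrip() if match else reply.strip()
```

Wrap the extraction call as `strip_code_fences(screenshot_to_code('tutorial-code.png'))` so the saved file is runnable rather than fenced markdown.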
5. Security Camera Analysis (Privacy-Preserving)
Analyze security footage locally without sending sensitive video to the cloud:
```python
import os
import cv2  # pip install opencv-python

def analyze_security_frame(frame_path):
    prompt = """Analyze this security camera frame.
    Describe any people, vehicles, or unusual activity.
    Note the approximate number of people and their general activity."""
    return analyze_image(frame_path, prompt)

# Process video frames (requires OpenCV)
def process_video_frames(video_path, interval_seconds=5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = int(fps * interval_seconds)

    frame_count = 0
    results = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            # Save frame temporarily
            temp_path = f'frame_{frame_count}.jpg'
            cv2.imwrite(temp_path, frame)

            # Analyze
            analysis = analyze_security_frame(temp_path)
            results.append({
                'timestamp': frame_count / fps,
                'analysis': analysis
            })

            # Clean up
            os.remove(temp_path)

        frame_count += 1

    cap.release()
    return results
```
Real-world application: Generate activity summaries from security footage without privacy risks.
Limitations & When to Use Cloud
Local vision LLMs are powerful, but they’re not perfect. Here’s when to consider cloud alternatives:
Local Model Limitations
| Limitation | Details |
|---|---|
| Resolution constraints | Most models downsample images to 336×336 or 448×448 pixels |
| Smaller knowledge base | Less general world knowledge than GPT-4V |
| Reasoning gaps | Complex multi-step visual reasoning can fail |
| Language limitations | English works best; other languages vary |
| Fine detail loss | Tiny text or distant objects may be missed |
When to Use Cloud APIs (GPT-4V, Claude 3 Opus)
Consider cloud vision APIs when:
- Maximum accuracy is critical (medical imaging, legal documents)
- Processing very high-resolution images (detailed technical diagrams)
- Need advanced reasoning (complex visual puzzles, multi-image comparison)
- Multilingual requirements (GPT-4V handles 100+ languages well)
- Infrequent usage (occasional analysis doesn’t justify hardware costs)
The Hybrid Approach
Many users find a hybrid workflow works best:
- Use local models for: Bulk processing, sensitive images, development/testing, offline work
- Use cloud APIs for: Critical decisions, complex analysis, final verification
Building on our AI cost analysis, this hybrid approach often delivers the best cost-quality balance.
Conclusion: Your Images, Your Control
Running vision LLMs locally puts you in control. Your images stay private. Your costs are predictable. And you’re not dependent on internet connectivity or API uptime.
We’ve covered:
- What vision LLMs are and how they work
- The top models available in 2026 (LLaVA, BakLLaVA, Moondream, CogVLM, InternVL)
- Hardware requirements for each model
- Step-by-step setup with Ollama
- Advanced usage with llama.cpp
- Practical applications from OCR to security analysis
Start small. If you have limited hardware, begin with Moondream 2 or LLaVA 7B via Ollama. As you get comfortable, experiment with larger models and different quantization levels.
The ability to understand images with AI is no longer locked behind corporate APIs. It’s on your machine, ready when you are.
Continue your local AI journey:
- Building on our RAG implementation guide for adding document retrieval to your vision workflows
- Explore running larger models efficiently with advanced quantization
- Learn about AI cost optimization strategies
- Discover multimodal AI applications beyond vision
- Read our complete local AI setup guide for the full stack
Sources & Further Reading
- LLaVA Project Website – Official LLaVA documentation and papers
- Ollama Vision Models – Curated vision model collection
- llama.cpp GitHub Repository – Official implementation
- Moondream 2 Hugging Face – Tiny vision model
- CogVLM GitHub – High-quality vision model
- InternVL Documentation – Multilingual vision LLM
- BakLLaVA on Ollama – Mistral-based vision model
- GGUF Format Specification – Model quantization format
- Hugging Face Multimodal Models – Model repository
- OpenAI GPT-4V Documentation – Cloud API reference
- Claude 3 Vision Capabilities – Anthropic’s vision features
- LocalAI Project – Alternative local AI platform
- Vision LLM Benchmarks – Performance comparisons
- Quantization Techniques Explained – Academic paper on LLM quantization
- Privacy-Preserving AI Guide – Related article on data privacy
