Running Vision LLMs Locally: LLaVA, BakLLaVA & Beyond (2026 Guide)


Analyze images with AI—completely offline, completely private


Introduction: Why Vision LLMs Matter

In 2026, the ability to understand images with AI isn’t just a novelty—it’s becoming essential. From developers debugging code via screenshots to educators creating accessible materials, vision-capable language models are transforming how we interact with visual information.

But there’s a catch. Most people turn to cloud APIs like GPT-4 Vision or Claude 3 Opus, uploading their images to remote servers. Every screenshot, every document, every photo you analyze gets sent to a third party. For personal photos, sensitive documents, or proprietary designs, that’s a significant privacy concern.

Running vision LLMs locally changes everything.

Your images never leave your machine. No API keys to manage. No per-image costs that scale unpredictably. And perhaps most importantly: no internet required once the model is downloaded.

Cloud vision APIs can cost $0.005-$0.01 per image. Analyze 1,000 images and you’re looking at $5-$10; push that into the tens of thousands and the bill grows linearly. Run a local model, and that cost drops to zero after the initial download. For researchers processing thousands of images, developers building vision-powered applications, or privacy-conscious users, local vision LLMs aren’t just an alternative—they’re often the better choice.
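The break-even arithmetic is easy to run yourself (the per-image prices are the rough figures above, not a live price sheet):

```python
def cloud_cost(num_images, per_image_usd):
    """Total metered cost of a cloud vision API, in USD."""
    return num_images * per_image_usd

# At $0.005-$0.01 per image, 1,000 images costs $5-$10 in API fees;
# a local model's marginal cost per image is zero after the download.
print(f"${cloud_cost(1_000, 0.005):.2f}-${cloud_cost(1_000, 0.01):.2f}")
```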

Building on our RAG guide, this article shows you exactly how to run powerful vision language models on your own hardware. Whether you have a gaming GPU or just a laptop, there’s a vision LLM that will work for you.


What Are Vision LLMs? Understanding Multimodal AI

Vision LLMs (also called multimodal models or vision-language models) are AI systems that can process both images and text simultaneously. Unlike traditional language models that only understand words, these models can “see” and describe what’s in an image, answer questions about visual content, and even reason across both modalities.

How They Work (Simplified)

At a high level, vision LLMs combine two components:

  1. Vision Encoder: Converts images into a format the language model can understand (typically a series of “tokens” representing visual features)
  2. Language Model: Processes these visual tokens alongside text tokens to generate responses

Think of it like this: the vision encoder “describes” the image to the language model in a language both understand. The language model then reasons about this description along with any text prompts you provide.

For example, when you ask “What’s in this image?” and provide a photo of a cat:

  1. The vision encoder analyzes the image and extracts features (fur texture, ear shape, body posture)
  2. These features are converted to special tokens that represent “cat-ness”
  3. The language model receives both your text question and these visual tokens
  4. It generates: “I see a domestic cat with orange tabby markings sitting on a windowsill…”
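Steps 1-4 can be sketched as a toy pipeline. The stub functions below only stand in for the real encoder and language model; they illustrate the data flow, not actual inference:

```python
def vision_encoder(image_pixels):
    # A real encoder (e.g. CLIP ViT) maps the image to a sequence of
    # embedding vectors; placeholder tokens stand in for them here.
    return ["<img_tok_1>", "<img_tok_2>", "<img_tok_3>"]

def language_model(tokens):
    # A real LLM generates text autoregressively, conditioned on the
    # combined sequence; here we just report what it would receive.
    return f"Generated answer conditioned on {len(tokens)} input tokens"

def vision_llm(image_pixels, text_prompt):
    visual_tokens = vision_encoder(image_pixels)        # steps 1-2
    text_tokens = text_prompt.split()                   # step 3 (simplified)
    return language_model(visual_tokens + text_tokens)  # step 4

print(vision_llm(None, "What's in this image?"))
```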

Key Capabilities

Modern vision LLMs can:

  • Describe images in detail (image captioning)
  • Answer questions about visual content (visual Q&A)
  • Read text in images (OCR – Optical Character Recognition)
  • Identify objects and their relationships
  • Understand charts, diagrams, and UI elements
  • Compare multiple images (in some models)

The Players: Local Vision Models Compared

The open-source vision LLM landscape has exploded. Here are the top models you can run locally in 2026:

LLaVA 1.5 / 1.6 (Most Popular)

Base Architecture: Vicuna (Llama 2-based) + CLIP ViT-L/14 vision encoder

LLaVA (Large Language and Vision Assistant) is the most widely adopted open-source vision LLM. Version 1.5 brought significant improvements in reasoning and detail, while 1.6 added better multi-image support and higher resolution handling.

Pros:

  • Excellent community support and documentation
  • Works with Ollama (easiest setup)
  • Strong general-purpose performance
  • Multiple sizes available (7B to 34B parameters)

Cons:

  • Larger models need significant VRAM
  • Can be slower than specialized alternatives

Best for: General-purpose vision tasks, beginners getting started

BakLLaVA (Efficiency Champion)

Base Architecture: Mistral + CLIP

BakLLaVA swaps the Llama base for Mistral, resulting in faster inference and often better performance per parameter. It’s become the go-to for users who want good vision capabilities without massive hardware requirements.

Pros:

  • Faster than LLaVA at similar sizes
  • Mistral’s efficient architecture
  • Excellent for edge deployment
  • Strong OCR capabilities

Cons:

  • Smaller ecosystem than LLaVA
  • Slightly less mature tooling

Best for: Performance-conscious users, edge deployments, faster inference needs

Moondream 2 (Tiny but Mighty)

Base Architecture: Phi-1.5 + SigLIP vision encoder (~1.6B parameters)

Moondream 2 proves that bigger isn’t always better. This tiny model can run on CPUs, Raspberry Pis, and even some smartphones while still delivering impressive vision capabilities.

Pros:

  • Runs on almost anything (including CPU-only systems)
  • Extremely fast inference
  • Perfect for embedded applications
  • Surprisingly capable for its size

Cons:

  • Less detailed than larger models
  • Struggles with complex reasoning tasks
  • Limited context window

Best for: Edge devices, CPU-only setups, resource-constrained environments

CogVLM (Quality Leader)

Base Architecture: Custom visual expert architecture

CogVLM takes a different approach, adding a “visual expert” module that processes visual features in parallel with text. This results in exceptional image understanding quality, particularly for detailed scenes and complex diagrams.

Pros:

  • State-of-the-art open-source vision quality
  • Excellent at reading text in images
  • Strong performance on benchmarks
  • Handles complex visual reasoning well

Cons:

  • Requires more VRAM (13B+ parameters)
  • Slower inference than smaller models
  • More complex setup

Best for: Maximum quality when hardware permits, document analysis, detailed image understanding

InternVL (Multilingual Powerhouse)

Base Architecture: InternLM + custom vision encoder

InternVL brings strong multilingual capabilities alongside excellent vision performance. If you need vision AI that works across languages, this is your model.

Pros:

  • Strong multilingual support (Chinese, Japanese, etc.)
  • Excellent benchmark scores
  • Good balance of size and performance
  • Active development and updates

Cons:

  • Smaller English-speaking community
  • Documentation can be sparse in English

Best for: Multilingual applications, non-English vision tasks


Hardware Requirements: What You Actually Need

Let’s cut through the speculation. Here’s what you actually need to run these models:

Model             Size   Minimum VRAM   Recommended VRAM   CPU-Only?
Moondream 2       1.6B   2 GB           4 GB               Yes (slow)
LLaVA 1.5 (7B)    7B     6 GB           8 GB               No
BakLLaVA (7B)     7B     6 GB           8 GB               No
LLaVA 1.6 (13B)   13B    10 GB          16 GB              No
CogVLM            17B    12 GB          24 GB              No
InternVL (8B)     8B     8 GB           12 GB              No

Understanding Quantization

Those numbers assume quantized models—compressed versions that trade some quality for massive size reductions. Here’s what the suffixes mean:

  • Q4_K_M: 4-bit quantization, medium quality (most common)
  • Q5_K_M: 5-bit quantization, better quality, larger size
  • Q8_0: 8-bit quantization, near-original quality, largest size

For most users, Q4_K_M offers the best balance. If you have VRAM to spare, Q5_K_M provides noticeable quality improvements.
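As a rule of thumb, weight memory is parameter count times bits per weight. A quick estimator (the bits-per-weight values are approximate averages for each GGUF scheme, and real usage adds KV-cache and vision-encoder overhead on top):

```python
# Approximate effective bits per weight for common GGUF quantizations.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def weight_memory_gb(params_billions, quant):
    """Rough GB needed for model weights alone (no KV cache, no encoder)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in ("Q4_K_M", "Q5_K_M", "Q8_0"):
    print(f"7B @ {quant}: ~{weight_memory_gb(7, quant):.1f} GB")
```

For a 7B model at Q4_K_M this lands around 4.2 GB of weights, which is why the table above lists 6 GB as a practical minimum once overhead is included.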

What If You Don’t Have a GPU?

Moondream 2 is your best bet for CPU-only operation. Other models will run on CPU but expect 10-30x slower inference. For occasional use, it’s workable. For regular analysis, a GPU becomes essential.


Step-by-Step: Running LLaVA with Ollama

Ollama is the easiest way to run vision LLMs locally. Here’s the complete setup:

Step 1: Install Ollama

macOS/Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:
Download the installer from ollama.com and follow the prompts.

Step 2: Pull a Vision Model

# LLaVA 1.5 (7B) - Good balance of quality and speed
ollama pull llava:7b

# LLaVA 1.6 (13B) - Better quality, needs more VRAM
ollama pull llava:13b

# BakLLaVA - Faster alternative
ollama pull bakllava

# Moondream 2 - Tiny, runs almost anywhere
ollama pull moondream

Step 3: Test with an Image

# Basic image description: include the image path in the prompt
# and Ollama will attach the image automatically
ollama run llava:7b "Describe this image: /path/to/your/image.jpg"

# Ask specific questions
ollama run llava:7b "What text do you see in this image? /path/to/document.png"

Step 4: Python Integration

Create a simple Python script for programmatic access:

import ollama
import base64

def analyze_image(image_path, prompt="Describe this image in detail"):
    # Read and encode the image
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')
    
    # Call the model
    response = ollama.chat(
        model='llava:7b',
        messages=[{
            'role': 'user',
            'content': prompt,
            'images': [image_data]
        }]
    )
    
    return response['message']['content']

# Example usage
result = analyze_image('screenshot.png', 'What code is shown in this screenshot?')
print(result)

Step 5: Batch Processing

For processing multiple images:

import os
import json
from pathlib import Path

def batch_analyze(image_folder, output_file='results.json'):
    # Reuses analyze_image() from Step 4
    results = []
    image_paths = (list(Path(image_folder).glob('*.jpg'))
                   + list(Path(image_folder).glob('*.png')))
    
    for img_path in image_paths:
        print(f"Processing: {img_path.name}")
        description = analyze_image(str(img_path))
        results.append({
            'file': img_path.name,
            'description': description
        })
    
    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)
    
    return results

# Process a folder of images
batch_analyze('./my_images/', 'image_descriptions.json')

Advanced: Running with llama.cpp

For more control over quantization, context length, and inference parameters, llama.cpp is the power user’s choice.

Installation

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with CUDA support (for NVIDIA GPUs)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Or build for CPU only
cmake -B build
cmake --build build --config Release

Downloading Vision Models

Vision models for llama.cpp typically come in GGUF format. Download from Hugging Face:

# LLaVA 1.5 7B (Q4_K_M quantization)
wget https://huggingface.co/cjpais/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-Q4_K_M.gguf

# Download the corresponding mmproj file (vision encoder)
wget https://huggingface.co/cjpais/llava-v1.5-7B-GGUF/resolve/main/mmproj-model-f16.gguf

Running Inference

# CMake places binaries in build/bin/
./build/bin/llama-mtmd-cli \
    -m llava-v1.5-7b-Q4_K_M.gguf \
    --mmproj mmproj-model-f16.gguf \
    --image input.jpg \
    -p "Describe this image in detail:"

Python Binding Example

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize the chat handler with vision support
chat_handler = Llava15ChatHandler(
    clip_model_path="mmproj-model-f16.gguf"
)

# Load the model
llm = Llama(
    model_path="llava-v1.5-7b-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # Context window
    n_gpu_layers=-1  # Offload all layers to GPU
)

# Analyze an image (pass local files as base64 data URIs)
import base64

def image_to_data_uri(path, mime="image/jpeg"):
    with open(path, "rb") as f:
        return f"data:{mime};base64," + base64.b64encode(f.read()).decode()

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": image_to_data_uri("image.jpg")}},
                {"type": "text", "text": "What's happening in this image?"}
            ]
        }
    ]
)

print(response['choices'][0]['message']['content'])

Quantization Options

llama.cpp supports various quantization strategies:

Quantization   Size Reduction   Quality Impact   Use Case
Q4_0           ~75%             Noticeable       Maximum compression
Q4_K_M         ~70%             Minimal          Best balance
Q5_K_M         ~60%             Very slight      Quality priority
Q8_0           ~50%             Negligible       Near-original quality
F16            None             Original         Maximum quality

Building on our guide to local LLMs, quantization is the key to fitting larger models into limited VRAM.


Use Cases & Practical Examples

1. Document OCR and Analysis

Extract text from scanned documents, receipts, or screenshots:

def extract_document_text(image_path):
    prompt = """Please extract all text from this document.
    Preserve the formatting as much as possible.
    If there are tables, describe their structure."""
    
    return analyze_image(image_path, prompt)

# Process a receipt
receipt_text = extract_document_text('receipt.jpg')
print(receipt_text)

Real-world application: Automate expense reporting by extracting data from receipt photos.

2. Image Captioning for Accessibility

Generate alt text for images to improve web accessibility:

def generate_alt_text(image_path):
    prompt = """Write a concise alt text description for this image.
    Keep it under 125 characters if possible, but include essential details.
    Focus on the main subject and context."""
    
    return analyze_image(image_path, prompt)

# Generate alt text for website images
alt_text = generate_alt_text('product-photo.jpg')
print(f'<img src="product-photo.jpg" alt="{alt_text}">')

Real-world application: Batch-process image libraries to add accessibility descriptions.

3. Visual Q&A for Education

Create interactive learning materials:

def educational_qa(image_path, question):
    prompt = f"""You are a helpful tutor. Look at this educational image and answer the question.
    Explain your reasoning clearly and simply.
    
    Question: {question}"""
    
    return analyze_image(image_path, prompt)

# Example: Analyze a historical photograph
answer = educational_qa('ww2-photo.jpg', 
    'What can you tell me about the people and setting in this photograph?')
print(answer)

Real-world application: Build study tools that let students ask questions about diagrams, historical photos, or scientific illustrations.

4. Code Screenshot to Text

Convert screenshots of code into copyable text:

def screenshot_to_code(image_path):
    prompt = """Extract all code visible in this screenshot.
    Format it properly with correct indentation.
    Only return the code, no explanations."""
    
    return analyze_image(image_path, prompt)

# Extract code from a tutorial screenshot
code = screenshot_to_code('tutorial-code.png')
with open('extracted_code.py', 'w') as f:
    f.write(code)

Real-world application: Convert video tutorials or documentation screenshots into working code.

5. Security Camera Analysis (Privacy-Preserving)

Analyze security footage locally without sending sensitive video to the cloud:

import cv2
import os

def analyze_security_frame(frame_path):
    prompt = """Analyze this security camera frame.
    Describe any people, vehicles, or unusual activity.
    Note the approximate number of people and their general activity."""
    
    return analyze_image(frame_path, prompt)

# Process video frames (requires OpenCV)
def process_video_frames(video_path, interval_seconds=5):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = int(fps * interval_seconds)
    
    frame_count = 0
    results = []
    
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        
        if frame_count % frame_interval == 0:
            # Save frame temporarily
            temp_path = f'frame_{frame_count}.jpg'
            cv2.imwrite(temp_path, frame)
            
            # Analyze
            analysis = analyze_security_frame(temp_path)
            results.append({
                'timestamp': frame_count / fps,
                'analysis': analysis
            })
            
            # Clean up
            os.remove(temp_path)
        
        frame_count += 1
    
    cap.release()
    return results

Real-world application: Generate activity summaries from security footage without privacy risks.


Limitations & When to Use Cloud

Local vision LLMs are powerful, but they’re not perfect. Here’s when to consider cloud alternatives:

Local Model Limitations

Limitation               Details
Resolution constraints   Most models downsample images to 336×336 or 448×448 pixels
Smaller knowledge base   Less general world knowledge than GPT-4V
Reasoning gaps           Complex multi-step visual reasoning can fail
Language limitations     English works best; other languages vary
Fine detail loss         Tiny text or distant objects may be missed
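One practical workaround for the resolution limit: crop the region of interest and upscale it before handing it to the model, so the downsampling step discards less detail. A sketch using Pillow (an extra dependency; the crop box and 672 px target here are illustrative choices, not model requirements):

```python
from PIL import Image

def crop_for_vlm(image_path, box, out_path, target=672):
    """Crop a region of interest and upscale it toward the model's
    working resolution, preserving fine detail like small text."""
    img = Image.open(image_path)
    region = img.crop(box)  # box = (left, upper, right, lower)
    scale = target / max(region.size)
    if scale > 1:  # only upscale small crops; leave large ones alone
        region = region.resize(
            (round(region.width * scale), round(region.height * scale)),
            Image.LANCZOS,
        )
    region.save(out_path)
    return out_path

# e.g. isolate a small text label before OCR-style prompting:
# crop_for_vlm("photo.jpg", (100, 200, 400, 320), "label.png")
```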

When to Use Cloud APIs (GPT-4V, Claude 3 Opus)

Consider cloud vision APIs when:

  • Maximum accuracy is critical (medical imaging, legal documents)
  • Processing very high-resolution images (detailed technical diagrams)
  • Need advanced reasoning (complex visual puzzles, multi-image comparison)
  • Multilingual requirements (GPT-4V handles 100+ languages well)
  • Infrequent usage (occasional analysis doesn’t justify hardware costs)

The Hybrid Approach

Many users find a hybrid workflow works best:

  1. Use local models for: Bulk processing, sensitive images, development/testing, offline work
  2. Use cloud APIs for: Critical decisions, complex analysis, final verification
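That split can be encoded as a small routing helper (the criteria and names here are illustrative, not a standard API):

```python
def choose_backend(is_sensitive, needs_max_accuracy, batch_size, online=True):
    """Pick 'local' or 'cloud' for a vision task, defaulting to local.

    Encodes the hybrid rules of thumb: sensitive images, offline work,
    and bulk jobs stay local; accuracy-critical one-offs may go to cloud.
    """
    if is_sensitive or not online:
        return "local"   # privacy and offline work always stay local
    if needs_max_accuracy and batch_size <= 10:
        return "cloud"   # critical decisions, final verification
    return "local"       # default: bulk processing, dev/testing

print(choose_backend(is_sensitive=True, needs_max_accuracy=True, batch_size=1))
print(choose_backend(is_sensitive=False, needs_max_accuracy=True, batch_size=1))
```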

Building on our AI cost analysis, this hybrid approach often delivers the best cost-quality balance.


Conclusion: Your Images, Your Control

Running vision LLMs locally puts you in control. Your images stay private. Your costs are predictable. And you’re not dependent on internet connectivity or API uptime.

We’ve covered:

  • What vision LLMs are and how they work
  • The top models available in 2026 (LLaVA, BakLLaVA, Moondream, CogVLM, InternVL)
  • Hardware requirements for each model
  • Step-by-step setup with Ollama
  • Advanced usage with llama.cpp
  • Practical applications from OCR to security analysis

Start small. If you have limited hardware, begin with Moondream 2 or LLaVA 7B via Ollama. As you get comfortable, experiment with larger models and different quantization levels.

The ability to understand images with AI is no longer locked behind corporate APIs. It’s on your machine, ready when you are.

