
Last month, a client in healthcare asked us to build a document Q&A system. Standard RAG pipeline stuff. Except there was a catch: nothing could leave their network. No OpenAI. No Anthropic. No cloud inference of any kind. Patient data, regulatory compliance, the whole nine yards. My first thought was "this is going to be a nightmare of CUDA drivers and Docker configs." My second thought, about six hours later, was "that was shockingly painless." Because Ollama exists now, and it's genuinely good.

This guide walks you through getting a fully functional, private, local LLM setup running on your own hardware. No cloud dependencies. No API costs. No data leaving your machine. From zero to a ChatGPT-quality interface in under 20 minutes — assuming your hardware can handle it.

Hardware Reality Check

Before we start, let's talk hardware, because this is where people waste the most time. They install Ollama, pull Llama 3.3 70B, and wonder why it's generating two tokens per second on their MacBook Air. Local LLMs are computationally expensive. The model needs to fit in memory — either system RAM (CPU inference) or VRAM (GPU inference). GPU is dramatically faster.

Minimum for usable performance: 16GB RAM, Apple M1/M2 or NVIDIA GPU with 8GB+ VRAM. This runs Llama 3.1 8B at 15-25 tokens/second. Good enough for development and personal use.

Comfortable for production work: 32GB+ RAM, NVIDIA RTX 3090/4090 with 24GB VRAM. Runs Llama 3.3 70B (quantized to Q4_K_M) at 10-15 tokens/second. Genuinely competitive with cloud inference for most tasks.

The Mac situation: Apple Silicon with unified memory is surprisingly capable. An M2 Max with 64GB handles 70B models at reasonable speeds because the same memory pool serves both CPU and model weights. If you're on a Mac, you might be pleasantly surprised.
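A quick way to sanity-check whether a model fits your machine: weights take roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime buffers. A back-of-the-envelope sketch (the 4.5 bits/weight figure approximates Q4_K_M, and the 20% overhead factor is an assumption, not an exact number):

```python
def estimated_memory_gb(params_billion: float, quant_bits: float = 4.5,
                        overhead: float = 1.2) -> float:
    """Rough memory footprint for a quantized model: weight bytes
    plus ~20% for KV cache and runtime buffers (an estimate)."""
    weight_bytes = params_billion * 1e9 * quant_bits / 8
    return round(weight_bytes * overhead / 1e9, 1)

# 8B at Q4_K_M: ~5GB, comfortable on 8GB VRAM or 16GB RAM
print(estimated_memory_gb(8))
# 70B at Q4_K_M: ~47GB, so a 24GB card splits layers between GPU and CPU
print(estimated_memory_gb(70))
```

If the estimate exceeds your VRAM, Ollama still runs the model by offloading the remaining layers to CPU, just slower.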


Step 1: Install Ollama

Ollama abstracts away everything painful about running LLMs locally — model downloading, quantization selection, GGUF format handling, memory management, CUDA configuration. It reduces the entire process to command-line one-liners.

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com. Double-click. Done. It installs as a system service that runs in the background.

Verify the installation:

ollama --version

If you see a version number, you're good. The entire install takes under a minute on a decent connection.

Step 2: Pull Your First Model

This is the moment where local LLMs went from "cool experiment" to "viable production tool":

ollama run llama3.3

That downloads Llama 3.3 70B (the Q4_K_M quantized version, ~40GB), and drops you into an interactive chat session. First run takes 10-20 minutes depending on your connection. Subsequent launches are instant.

For lower-spec hardware, start with the 8B parameter model (note that the llama3.2 tag pulls an even smaller 3B model):

ollama run llama3.1

That's just under 5GB. Downloads in a couple of minutes. Runs on basically anything modern. The quality difference between 8B and 70B is substantial (70B produces outputs that rival GPT-4 for most tasks), but 8B is shockingly capable for its size and perfect for getting started.
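Beyond the interactive session, anything you've pulled is scriptable through the server's native /api/generate endpoint. A minimal non-streaming sketch using only the standard library (assuming the default port; swap in whichever model tag you pulled):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming request body for Ollama's /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str,
             base_url: str = "http://localhost:11434") -> str:
    """Send one prompt and return the full completion as a string."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    print(generate("llama3.2", "Say hello in five words."))
```

With stream set to False, the server buffers the whole response into a single JSON object, which keeps quick scripts simple.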


Step 3: Install Open WebUI

Ollama's terminal interface is functional but spartan. Open WebUI gives you a full ChatGPT-style web interface that connects to your local Ollama instance. Multiple conversations. Model switching. System prompts. File uploads. Dark mode. The whole deal.

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. Create an admin account. Select your Ollama model from the dropdown. Start chatting. That's it. You now have a private, self-hosted AI assistant that looks and feels like ChatGPT but runs entirely on your hardware.

No Docker? Open WebUI also supports pip installation:

pip install open-webui
open-webui serve

Step 4: Model Selection Strategy

Ollama's model library has grown enormously. A rough mapping for common use cases: llama3.1 or llama3.3 for general chat and reasoning, qwen2.5-coder for code assistance, mistral for fast lightweight tasks, and llava for image understanding. Each is a single ollama pull away.

You can have multiple models installed simultaneously and switch between them in Open WebUI's dropdown. Each model's performance depends on your hardware's ability to fit it in memory.
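Ollama's local server also reports what's installed: the /api/tags endpoint on port 11434 returns the same list as the ollama list command. A small sketch:

```python
import json
import urllib.request

def model_names(tags_response: dict) -> list[str]:
    """Extract model names from Ollama's /api/tags response payload."""
    return [m["name"] for m in tags_response.get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=3) as resp:
        return model_names(json.load(resp))

if __name__ == "__main__":
    # e.g. ['llama3.3:latest', 'llama3.2:latest'] depending on what you pulled
    print(list_local_models())
```

Handy in scripts that should fail fast with a clear message when a required model hasn't been pulled yet.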


Step 5: The API (For Developers)

Here's where local LLMs become a serious development tool. Ollama exposes an OpenAI-compatible REST API on localhost:11434. This means any application, library, or framework that supports the OpenAI API can talk to your local model with a one-line configuration change:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but ignored
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)

That code is identical to what you'd write for GPT-4, except the base URL points to localhost. Your existing OpenAI-based applications work with local models by changing two lines. Libraries like LangChain, LlamaIndex, and the Vercel AI SDK all support this endpoint natively.
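The OpenAI client above waits for the full completion; for a chat-UI feel you usually want streaming. The OpenAI SDK supports stream=True against the same endpoint, and Ollama's native /api/generate streams newline-delimited JSON if you'd rather skip the dependency. A stdlib sketch of the latter (assuming the default port and whichever model tag you have installed):

```python
import json
import urllib.request

def parse_stream_line(line: bytes) -> tuple[str, bool]:
    """One NDJSON chunk from /api/generate: (text fragment, done flag)."""
    chunk = json.loads(line)
    return chunk.get("response", ""), chunk.get("done", False)

def stream_generate(model: str, prompt: str,
                    base_url: str = "http://localhost:11434"):
    """Yield the response fragment-by-fragment as the model produces it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the server sends one JSON object per line
            text, done = parse_stream_line(line)
            yield text
            if done:
                break

if __name__ == "__main__":
    for fragment in stream_generate("llama3.2", "Name three planets."):
        print(fragment, end="", flush=True)
```

Tokens appear as they're generated, which matters more for local models than cloud ones: even at 10 tokens/second, streaming feels responsive while a blocking call feels broken.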

Performance Tuning

Three quick wins that most guides skip:

1. Parallel requests: Set the OLLAMA_NUM_PARALLEL environment variable (e.g. OLLAMA_NUM_PARALLEL=4) to let the server handle that many requests concurrently. Each parallel slot reserves its own context memory, so don't set it higher than your hardware allows.

2. GPU layers: By default, Ollama auto-selects how many model layers to offload to GPU. If you have enough VRAM, you can force full offload by setting the num_gpu parameter to a high value, either with /set parameter num_gpu 99 inside an interactive session or with a PARAMETER num_gpu 99 line in a Modelfile.

3. Context window: The default context is small (2048 tokens on older Ollama releases). For document-heavy tasks, raise it via the num_ctx parameter, e.g. /set parameter num_ctx 8192 in an interactive session. This uses more memory but lets the model handle longer conversations.
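These settings can be made persistent with Ollama's Modelfile format, which bakes parameters into a named model variant instead of setting them every session (llama3.3-long below is just an example name):

```
FROM llama3.3
PARAMETER num_ctx 8192
PARAMETER num_gpu 99
```

Save that as Modelfile, then run ollama create llama3.3-long -f Modelfile. The resulting llama3.3-long model carries these parameters everywhere it's used, including through the API.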

When Local LLMs Beat Cloud APIs

Privacy-sensitive work. Offline environments. Unlimited usage without per-token costs. Testing and development iterations where latency to a cloud provider adds friction. Fine-tuning experiments. And — increasingly — just the satisfaction of owning the entire stack.

When they don't beat cloud: frontier model quality (GPT-4, Claude 3.5 Sonnet still outperform local models on complex reasoning). Ultra-low-latency production deployments where Groq's 85ms TTFT matters. And any scenario where maintaining hardware isn't worth the trade-off.

The sweet spot for most developers: local for development and testing, cloud for production. Write your prompts using our Prompt Builder, test them locally with Ollama, and deploy the proven prompts to your preferred cloud provider.
