What hardware do I need to run Ollama?

Ollama runs on CPU alone, but it is slow for anything beyond the smallest models. A model needs roughly its file size in RAM (or VRAM on a GPU), so a 7B model quantised to 4-bit wants about 5 GB free. For usable speed you want a GPU, an Apple Silicon Mac, or a server with plenty of RAM. Start with a 1B–3B model if your machine is modest.

Where does Ollama store the models it downloads?

On Linux they live in /usr/share/ollama/.ollama/models, and on macOS in ~/.ollama/models. Models are large (several gigabytes each), so keep an eye on disk space and remove ones you are not using with ollama rm .

How do I expose the Ollama API to other machines?

By default Ollama only listens on 127.0.0.1:11434. Set the OLLAMA_HOST environment variable to 0.0.0.0 so it binds to all interfaces, then restart the service. Never expose that port to the public internet without putting authentication and a reverse proxy in front of it.

Is Ollama free to use?

Yes. Ollama itself is open source and the models it runs are open-weight, so there are no API keys and no usage charges. Your only cost is the hardware you run it on, which is exactly why self-hosting can be cheaper than a metered API at volume.

Run a local LLM with Ollama

Install Ollama
Pull and run your first model
Talk to it over HTTP
Run it on a server
Try other models

Ollama packages an open large language model, its weights, and a small HTTP server into a single command. You download a model once and run it entirely on your own hardware, which means no API keys, no rate limits, and no data leaving your machine. It's the fastest path from "I want to try an LLM" to a working prompt.

Install Ollama

On Linux the one-line installer sets up the binary and a systemd service:

curl -fsSL https://ollama.com/install.sh | sh

On macOS download the app from the website, or use Homebrew:

brew install ollama

The installer registers Ollama as a background service. Confirm it's running:

ollama --version
systemctl status ollama   # Linux

Pull and run your first model

Pick a model and run it. Ollama downloads the weights on first use, then drops you into an interactive prompt:

ollama run llama3.2

The first run pulls a few gigabytes, so give it a moment. After that the model is cached and starts instantly. Type a question, get an answer, and press Ctrl+D to exit.

If your machine is modest, start small. A 1B model is fast even on a laptop:

ollama run llama3.2:1b

To see what you've downloaded and reclaim space:

ollama list
ollama rm llama3.2:1b

Talk to it over HTTP

The reason Ollama is useful in real applications is the built-in REST API on port 11434. Anything that speaks HTTP can use it:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain a reverse proxy in one sentence.",
  "stream": false
}'

It also exposes an OpenAI-compatible endpoint at /v1/chat/completions, so most libraries written for OpenAI work against Ollama by just changing the base URL. That makes it a drop-in local stand-in while you develop, before you decide whether to use a hosted API like OpenAI in production.

Run it on a server

Running Ollama on your laptop is fine for experiments, but for a shared model you'll want it on a server with a GPU. The principle is the same, with two changes: bind the API to all interfaces and put it behind a reverse proxy.

sudo systemctl edit ollama

Add the host override:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Do not open port 11434 to the internet directly, it has no authentication. Front it with Nginx and require a token, the same way you'd protect any internal service. For a full walkthrough of sizing a box and locking it down, see self-host an LLM on your own server.

Try other models

Ollama's library covers most popular open-weight models. A few worth knowing:

llama3.2 — a solid general-purpose default from Meta.
deepseek-r1 — a reasoning model, see run DeepSeek locally.
qwen2.5-coder — tuned for code generation.
nomic-embed-text — an embedding model for building search and retrieval.

Swap the name in ollama run or ollama pull to try any of them. Because everything runs locally, the only limit is your hardware.

Knowledge

Run a local LLM with Ollama

#AI

Install Ollama

Pull and run your first model

Talk to it over HTTP

Run it on a server

Try other models

Subscribe to our newsletter

Frequently asked questions

More in #AI

Knowledge

Run a local LLM with Ollama

#AI

#Install Ollama

#Pull and run your first model

#Talk to it over HTTP

#Run it on a server

#Try other models

Subscribe to our newsletter

Frequently asked questions

More in #AI

Install Ollama

Pull and run your first model

Talk to it over HTTP

Run it on a server

Try other models