Do I need a GPU to self-host an LLM?

Not strictly, but you almost certainly want one. CPU inference works for small models and low traffic, but it is slow. A single consumer or data-centre GPU with enough VRAM to hold the model typically gives you far more throughput, often by an order of magnitude. Match the VRAM to the model: roughly the model's file size, plus headroom for the context window.

How much VRAM do I need for a given model?

A rough rule is that a quantised model needs about its file size in VRAM. A 7B model at 4-bit is around 5 GB, a 13B around 9 GB, and a 70B model needs 40 GB or more. Add a couple of gigabytes for the key-value cache that grows with your context length and number of concurrent requests.
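
To see why the key-value cache scales with context length and concurrency, here is a rough sketch of the standard size calculation: one key and one value vector per layer, per token, per request. The function name and the example model shape are assumptions for illustration; read the real layer and head counts from your model's configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens,
                   concurrent_requests=1, bytes_per_value=2):
    """Approximate key-value cache size in bytes.

    bytes_per_value=2 assumes the cache is stored as 16-bit values.
    """
    # Factor 2: each token stores one key and one value vector per layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_requests


# Illustrative shape only (32 layers, 32 KV heads, head size 128).
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, context_tokens=4096)
print(f"{size / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

With that shape, a single 4,096-token request needs about 2 GiB, and the figure doubles with each doubling of context or concurrent requests. Models that use fewer key-value heads need proportionally less.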

Is self-hosting cheaper than using the OpenAI API?

It depends on volume. A hosted API has near-zero fixed cost and you pay per token, which is ideal for spiky or low usage. A GPU server has a fixed monthly cost regardless of traffic, so it wins once your token volume is high and steady. Run the numbers on your actual usage before committing.

How do I keep a self-hosted model secure?

Never expose the inference port directly. Bind it to localhost, put a reverse proxy in front, terminate TLS, and require an API token or mutual TLS. Treat it like any internal service: firewall the port, keep the host patched, and log who is calling it.

Self-host an LLM on your own server

Size the machine
Install a runtime
Serve it on all interfaces
Put it behind a reverse proxy
Add retrieval with a vector database
Decide if it's worth it

Sending every prompt to a hosted API is the easy path, but it has two costs: your data leaves your network, and the per-token bill grows with usage. Self-hosting flips both. The model runs on hardware you control, nothing leaves your perimeter, and your spend is a fixed monthly server cost instead of a meter that never stops. Here's how to do it properly.

Size the machine

The single most important number is VRAM (or system RAM if you run on CPU). A model has to fit in memory to run well, and a rough rule is that a quantised model needs about its file size:

1B–3B models: a few gigabytes, runs on almost anything.
7B–8B models: ~5–6 GB, a single modest GPU.
13B models: ~9 GB.
70B models: 40 GB+, which means a high-end data-centre GPU or several cards.

Add a couple of gigabytes of headroom for the context window and concurrent requests. If you're renting, a GPU instance from a provider like Hetzner or a cloud GPU host is the usual starting point. CPU-only works for small models and light traffic, but expect it to be slow.
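
The sizing rule above can be turned into a quick estimate. This is a minimal sketch, not a precise calculator: the function name is made up, the 4.5 bits per weight is an assumption (4-bit formats usually carry some extra scaling data), and the 2 GB headroom is the "couple of gigabytes" from the text.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.5, headroom_gb=2.0):
    """Rough VRAM estimate in GB for a quantised model.

    bits_per_weight and headroom_gb are assumptions; adjust them for
    your quantisation format, context length, and concurrency.
    """
    # 1 billion parameters at 8 bits per weight is roughly 1 GB.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + headroom_gb


for size in (7, 13, 70):
    print(f"{size}B: about {estimate_vram_gb(size):.1f} GB")
```

Under these assumptions a 7B model comes out just under 6 GB, a 13B model a little over 9 GB, and a 70B model above 41 GB, which lines up with the ranges listed above.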

Install a runtime

One of the quickest runtimes to get running is Ollama, which handles model downloads, quantisation, and an HTTP server in one tool. Install it and pull a model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2

That's the whole setup. For the full walkthrough of running and querying models, see run a local LLM with Ollama. If you need maximum throughput and are comfortable with more configuration, vLLM is the production-grade alternative that batches requests across a GPU.
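
To confirm the install from code, you can call Ollama's HTTP API on the same machine. The sketch below uses only the Python standard library and assumes Ollama's default local address (127.0.0.1, port 11434) and the llama3.2 model pulled above; the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # assumed: Ollama's default local address


def build_generate_request(prompt, model="llama3.2", base_url=OLLAMA_URL):
    """Build a POST request for Ollama's /api/generate endpoint."""
    # stream=False asks for one JSON object instead of a stream of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3.2", base_url=OLLAMA_URL, timeout=300):
    """Send the prompt and return the generated text."""
    request = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["response"]
```

Calling `generate("Say hello in five words.")` on the server should return a short string once the model has loaded. The generous timeout is deliberate, because the first request also pays for loading the model into memory.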

Serve it on all interfaces

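By default Ollama only listens on 127.0.0.1. If the reverse proxy described below runs on the same machine, that default is all you need. To let other machines on your network reach Ollama directly, override the host in the systemd unit: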

sudo systemctl edit ollama

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

The API now answers on port 11434 on every interface. Do not stop here: that port has no authentication of its own.
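
A simple TCP connect check, run from another machine, shows whether the port is actually reachable. This is a sketch using the standard library; the function name is made up, and 192.0.2.10 in the usage note is a placeholder address, not a real server.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("192.0.2.10", 11434)` should return True right after this step. Once the port is firewalled, the same call from outside should return False.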

Put it behind a reverse proxy

Never expose the raw inference port to the internet. Front it with Nginx, terminate TLS, and require a token. A minimal proxy that checks a shared secret looks like this:

server {
    listen 443 ssl;
    server_name ai.example.com;

    # ssl_certificate ... (see your certbot setup)

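    location / {
        # Reject any request that does not carry the exact shared secret.
        if ($http_authorization != "Bearer YOUR_SECRET_TOKEN") {
            return 401;
        }
        # Forward to Ollama on the loopback interface only.
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        # Give slow generations time to finish before Nginx gives up.
        proxy_read_timeout 300s;
    }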
}

Grab a free certificate with Certbot, then block outside access to the inference port with your firewall so nothing can bypass the proxy. Nginx still reaches the port over the loopback interface, which ufw allows by default:

sudo ufw deny 11434

The long proxy_read_timeout matters: model responses can take many seconds, and Nginx's default of 60 seconds will cut off slow generations, which usually shows up as a 504 Gateway Timeout.
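
The shared secret is only as good as its randomness, and clients must send it in exactly the form the proxy compares against. The sketch below generates a token with Python's `secrets` module and attaches it to a request; the helper names are made up, and ai.example.com in the usage note is the placeholder hostname from the config above.

```python
import secrets
import urllib.request


def new_api_token(num_bytes=32):
    """Return a URL-safe random token to use in place of YOUR_SECRET_TOKEN."""
    return secrets.token_urlsafe(num_bytes)


def authorized_request(url, token, data=None):
    """Build a request carrying the token in the Authorization header."""
    request = urllib.request.Request(url, data=data)
    # Must match the Nginx check exactly: "Bearer " followed by the token.
    request.add_header("Authorization", f"Bearer {token}")
    return request
```

Generate the token once, paste it into the Nginx config, and give the same value to your clients, for example `authorized_request("https://ai.example.com/api/tags", token)`. Because the proxy does a plain string comparison, any difference in case or spacing results in a 401.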

Add retrieval with a vector database

A bare model only knows what it was trained on. To answer questions about your own documents, you store embeddings of them in a vector database, look up the chunks most similar to each question, and feed those chunks into the prompt. This pattern is known as retrieval-augmented generation (RAG). PostgreSQL with the pgvector extension is a low-overhead way to do this on the same kind of server; see install PostgreSQL with pgvector on Ubuntu.
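
The retrieval step itself is small. The sketch below is an in-memory stand-in for a vector database, not pgvector: the three-dimensional vectors and document chunks are invented for illustration, and a real setup would get its embeddings from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query_vec, store, k=2):
    """Return the k chunk texts whose embeddings are closest to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy store: (chunk text, made-up embedding).
STORE = [
    ("Backups run nightly at 02:00.", [0.9, 0.1, 0.0]),
    ("The VPN uses WireGuard.", [0.1, 0.9, 0.0]),
    ("Invoices are due in 30 days.", [0.0, 0.1, 0.9]),
]
```

With a query vector such as `[1.0, 0.2, 0.0]`, `top_chunks` ranks the backup chunk first, and `build_prompt` produces the text you would send to the model. A vector database does the same ranking, but with an index so it stays fast across many documents.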

Decide if it's worth it

Self-hosting wins when your token volume is high and steady — a fixed server cost beats a meter that scales with traffic. It loses when usage is spiky or low, where a hosted API's pay-per-use model is cheaper and needs zero maintenance. Run the numbers on your real usage, and if you do go hosted, learn how to use the OpenAI API in Laravel and how to handle its rate limits.
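
Running the numbers can be as simple as finding the monthly token volume at which the fixed server cost equals the metered bill. This is a sketch with hypothetical prices: the function names and every figure in the usage note are assumptions, and it ignores maintenance time and whether one server can actually sustain the volume.

```python
def break_even_tokens_per_month(server_cost_per_month, api_price_per_million_tokens):
    """Monthly token volume at which the API bill equals the server cost."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return server_cost_per_month / api_price_per_million_tokens * 1_000_000


def cheaper_option(tokens_per_month, server_cost_per_month, api_price_per_million_tokens):
    """Compare the fixed server cost with the metered API cost."""
    api_cost = tokens_per_month / 1_000_000 * api_price_per_million_tokens
    return "self-host" if server_cost_per_month < api_cost else "hosted API"
```

With a hypothetical 300 per month server and an API priced at 2 per million tokens, the break-even point is 150 million tokens a month. Below that the hosted API is cheaper; above it the server is.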
