Roadmap: Becoming Good at Infrastructure

March 20, 2026, 5:51 p.m.

Philosophy: no certifications, no passive tutorials. You learn by breaking things on real machines. Each phase has a concrete project: the project validates your progress, not a multiple-choice quiz.


Phase 0 — Get a Real Machine to Break

Before anything else, you need a playground.

Recommended option: Hetzner VPS
- CX22: ~€4/month, 2 vCPU, 4 GB RAM, 40 GB SSD
- Go with Ubuntu 24.04 LTS
- You can destroy and recreate as many times as you want

Immediate goal: connect via SSH with a key pair, disable password authentication, and configure a basic UFW firewall. If you can't do that yet without docs, start here.
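
One way to confirm the hardening took effect is to inspect the effective SSH daemon configuration rather than the file you edited. The sketch below parses the output of `sshd -T`, which prints one lowercase keyword and its value per line. The function names and the sample text are illustrative assumptions, not part of any tool.

```python
def effective_sshd_options(sshd_t_output: str) -> dict[str, str]:
    """Parse the output of `sshd -T` into a {keyword: value} mapping.

    Some keywords repeat (for example `hostkey`); this sketch keeps the first.
    """
    options: dict[str, str] = {}
    for line in sshd_t_output.splitlines():
        keyword, _, value = line.strip().partition(" ")
        if keyword and keyword not in options:
            options[keyword] = value.strip()
    return options


def hardening_problems(options: dict[str, str]) -> list[str]:
    """Return human-readable findings; an empty list means the checks passed."""
    problems = []
    if options.get("passwordauthentication") != "no":
        problems.append("password authentication is still enabled")
    if options.get("pubkeyauthentication") != "yes":
        problems.append("public key authentication is disabled")
    return problems


# Made-up sample of what `sshd -T` might print before hardening.
sample = """\
port 22
pubkeyauthentication yes
passwordauthentication yes
"""
print(hardening_problems(effective_sshd_options(sample)))
```

On the server itself you would feed it the real output, for example from `subprocess.run(["sshd", "-T"], capture_output=True, text=True).stdout`, which typically needs root.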


Phase 1 — Linux in Depth (2–4 weeks)

You already code, so you know the basics. Here we go deeper.

What You Need to Master

Processes & scheduling
- ps, top, htop, pgrep, kill, nice, renice
- Understand process states (R, S, D, Z)
- strace to trace syscalls of a running process in real time
- /proc/[pid]/: read process state directly from the kernel
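
To make the /proc idea concrete, here is a minimal sketch that reads the same state letter ps shows, straight from /proc/[pid]/status. The helper names are assumptions made for illustration, and the description table covers only the four states listed above.

```python
from pathlib import Path

STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually waiting on I/O)",
    "Z": "zombie (exited, not yet reaped by its parent)",
}


def parse_status(text: str) -> dict[str, str]:
    """Turn the `Key:<tab>value` lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_state(fields: dict[str, str]) -> str:
    """Map the one-letter state code (as in 'S (sleeping)') to a description."""
    code = fields.get("State", "?")[:1]
    return STATE_NAMES.get(code, f"other state {code!r}")


status_file = Path("/proc/self/status")  # Linux only
if status_file.exists():
    me = parse_status(status_file.read_text())
    print(me["Name"], me["Pid"], describe_state(me))
```

Pointing `status_file` at another PID's directory shows that process instead, subject to the usual permission checks.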

Memory
- free, vmstat, /proc/[pid]/smaps
- Understand the difference between virtual and physical memory
- What swap is and when the kernel triggers it
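
The virtual versus physical distinction is easy to observe. The following sketch maps 64 MiB of anonymous memory and prints VmSize (virtual) and VmRSS (resident) before mapping, after mapping, and after touching every page. On Linux you would typically see VmSize jump at mapping time while VmRSS only grows once pages are written. The function names and the 64 MiB size are arbitrary choices for the demo.

```python
import mmap
from pathlib import Path

STATUS = Path("/proc/self/status")  # Linux only
PAGE = mmap.PAGESIZE


def memory_kb(status_text: str) -> tuple[int, int]:
    """Return (VmSize, VmRSS) in kB parsed from /proc/<pid>/status text."""
    values = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            values[key] = int(rest.split()[0])  # value looks like "123456 kB"
    return values["VmSize"], values["VmRSS"]


def demo(size: int = 64 * 1024 * 1024) -> None:
    before = memory_kb(STATUS.read_text())
    region = mmap.mmap(-1, size)          # anonymous mapping: address space only
    mapped = memory_kb(STATUS.read_text())
    for offset in range(0, size, PAGE):   # write one byte per page
        region[offset] = 1
    touched = memory_kb(STATUS.read_text())
    region.close()
    for label, (vsz, rss) in (("before", before), ("mapped", mapped), ("touched", touched)):
        print(f"{label:8} VmSize={vsz} kB  VmRSS={rss} kB")


if STATUS.exists():
    demo()
```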

Files & I/O
- lsof: see which files are open by which process
- inotifywait: observe filesystem events in real time
- Inodes, hard links vs symlinks
- /dev/null, /dev/zero, /dev/urandom: why they exist
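
The hard link versus symlink difference shows up as soon as you delete the original name. This sketch works in a temporary directory; the file names are placeholders.

```python
import os
import tempfile


def link_demo() -> dict[str, object]:
    """Create a file, a hard link and a symlink, then delete the original name."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "soft.txt")

        with open(original, "w") as f:
            f.write("hello")
        os.link(original, hard)      # second name for the same inode
        os.symlink(original, soft)   # separate inode that stores a path

        result = {
            "same_inode": os.stat(original).st_ino == os.stat(hard).st_ino,
            "link_count": os.stat(original).st_nlink,
            "symlink_has_own_inode": os.lstat(soft).st_ino != os.stat(original).st_ino,
        }

        os.remove(original)          # drops one name, not the data
        with open(hard) as f:
            result["hard_link_still_reads"] = f.read()
        result["symlink_dangling"] = os.path.islink(soft) and not os.path.exists(soft)
        return result


print(link_demo())
```

The data survives as long as at least one hard link points at the inode, while the symlink is left pointing at a path that no longer exists.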

System-level networking
- ss (replaces netstat), ip addr, ip route
- tcpdump -i eth0 port 80: capture live traffic
- Understand what curl https://example.com does at each network layer
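
A rough way to see the layers behind that curl call is to do each step by hand with the standard library. This is a simplified sketch, not what curl does internally: it tries only the first resolved address, speaks plain HTTP/1.1, and reads only the status line. The function names are illustrative.

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """HTTP layer: a minimal HTTP/1.1 request as bytes."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_status_line(host: str, port: int = 443, timeout: float = 5.0) -> str:
    # DNS: resolve the name to one or more socket addresses.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    # TCP: three-way handshake with the first address returned.
    with socket.socket(family, socktype, proto) as tcp:
        tcp.settimeout(timeout)
        tcp.connect(sockaddr)
        # TLS: handshake plus certificate verification against the system CA store.
        context = ssl.create_default_context()
        with context.wrap_socket(tcp, server_hostname=host) as tls:
            # HTTP: send the request, read the first line of the response.
            tls.sendall(build_request(host))
            response = tls.makefile("rb").readline()
    return response.decode("iso-8859-1").strip()
```

Calling `fetch_status_line("example.com")` while tcpdump runs in another terminal lets you match each comment to the packets on the wire.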

Serious bash scripting
- set -euo pipefail at the top of every script, always
- Signal handling with trap
- Heredocs, process substitution <(...)
- xargs, parallel, awk, sed for pipeline data processing
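
To see what those shell options change, the sketch below runs a few one-line bash snippets from Python and prints their exit codes. It assumes bash is on the PATH; the helper name and the variable `UNSET_VARIABLE_FOR_DEMO` are made up for the demo.

```python
import shutil
import subprocess


def exit_code(script: str) -> int:
    """Run a bash snippet and return its exit status."""
    return subprocess.run(["bash", "-c", script], capture_output=True).returncode


if shutil.which("bash"):
    # Without pipefail, a pipeline's status is that of its last command.
    print(exit_code("false | true"))
    # With pipefail, any failing stage fails the whole pipeline.
    print(exit_code("set -o pipefail; false | true"))
    # With -u, expanding an unset variable is an error instead of an empty string.
    print(exit_code('set -u; echo "$UNSET_VARIABLE_FOR_DEMO"'))
    # An EXIT trap still runs its cleanup when -e aborts the script early.
    print(subprocess.run(
        ["bash", "-c", "set -e; trap 'echo cleanup' EXIT; false; echo unreachable"],
        capture_output=True, text=True,
    ).stdout)
```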

Phase 1 Project

Deploy your Django app manually on your VPS.

No Docker, no magic. By hand:

  1. PostgreSQL installed and configured
  2. Gunicorn as the WSGI server
  3. Nginx as a reverse proxy
  4. HTTPS with Certbot (Let's Encrypt)
  5. Systemd service so Gunicorn restarts automatically
  6. UFW exposing only ports 22, 80, 443

Don't move to Phase 2 until your app is running in production and you understand why each piece is there.


Phase 2 — Containers (3–4 weeks)

Docker

What you need to understand, not just use
- Image layers: why the order of instructions in a Dockerfile affects cache and image size
- Linux namespaces (pid, net, mnt, uts): Docker is just syntactic sugar on top of these
- Cgroups: how Docker limits memory and CPU
- Multi-stage builds to produce lean images
- .dockerignore and why it exists
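
Why instruction order affects the cache can be shown with a toy model. This is not Docker's actual implementation: it simply assumes each layer's cache key depends on its parent's key plus the instruction, with a marker such as `src@v1` standing in for the contents of copied files. The two Dockerfiles are hypothetical.

```python
import hashlib


def layer_keys(instructions: list[str]) -> list[str]:
    """Toy model: a layer's key is a hash of its parent's key and its instruction."""
    keys, parent = [], ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()[:12]
        keys.append(parent)
    return keys


def cached_layers(old: list[str], new: list[str]) -> int:
    """Count how many leading layers of `new` can be reused from `old`."""
    reused = 0
    for old_key, new_key in zip(layer_keys(old), layer_keys(new)):
        if old_key != new_key:
            break
        reused += 1
    return reused


# Hypothetical Dockerfiles; "src@v1" / "src@v2" mark a change in the source tree.
code_first = [
    "FROM python:3.12-slim",
    "COPY . /app  # src@v1",
    "RUN pip install -r requirements.txt",
]
deps_first = [
    "FROM python:3.12-slim",
    "COPY requirements.txt .",
    "RUN pip install -r requirements.txt",
    "COPY . /app  # src@v1",
]


def after_code_change(dockerfile: list[str]) -> list[str]:
    return [line.replace("src@v1", "src@v2") for line in dockerfile]


print(cached_layers(code_first, after_code_change(code_first)))  # pip install reruns
print(cached_layers(deps_first, after_code_change(deps_first)))  # pip install is reused
```

Copying the dependency manifest before the source tree keeps the expensive install step cached across ordinary code edits.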

Commands to have in your fingers

docker build --no-cache -t myapp:v1 .
docker run -d --name myapp -p 8000:8000 --env-file .env myapp:v1
docker exec -it myapp bash
docker logs -f myapp
docker stats
docker inspect myapp

Docker networking
- Bridge, host, overlay: know which to use and why
- How two containers communicate on the same custom network

Docker Compose

For orchestrating your local and staging stack.

# Minimal example for a Django app
services:
  web:
    build: .
    command: gunicorn config.wsgi:application --bind 0.0.0.0:8000
    env_file: .env
    depends_on:
      db:
        condition: service_healthy

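  db:
    image: postgres:16
    # The postgres image refuses to initialize without POSTGRES_PASSWORD;
    # this assumes .env defines it.
    env_file: .env
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 5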

volumes:
  postgres_data:

Phase 2 Project

Containerize your Django app: one container for the app, one for Postgres, Nginx as a reverse proxy in front. Deploy the whole stack on your VPS with docker compose up -d. Understand why volumes exist and what happens if you delete the Postgres container without one.


Phase 3 — Infrastructure as Code (3–5 weeks)

The goal: your infrastructure must be versioned code in Git, not clicks in a web console.

Terraform

Provision cloud resources in a reproducible way.

Core concepts
- terraform init, plan, apply, destroy
- State: what it is, why it must not be in Git, how to store it in an S3/GCS bucket
- Modules to avoid duplication
- data sources to reference existing resources
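
The role of state is easier to grasp with a toy version of a plan: compare what the configuration wants with what the state file says exists. This is a deliberately simplified model. Real Terraform also refreshes state against the provider API, orders changes by dependency, and may replace a resource instead of updating it. The resource addresses and attributes below are hypothetical.

```python
def plan(desired: dict[str, dict], state: dict[str, dict]) -> dict[str, list[str]]:
    """Diff desired configuration against recorded state, keyed by resource address."""
    return {
        "create": sorted(addr for addr in desired if addr not in state),
        "update": sorted(
            addr for addr in desired if addr in state and desired[addr] != state[addr]
        ),
        "destroy": sorted(addr for addr in state if addr not in desired),
    }


# Hypothetical resource addresses and attributes, for illustration only.
state = {
    "hcloud_server.web": {"server_type": "cx22", "image": "ubuntu-24.04"},
    "cloudflare_record.old": {"name": "legacy", "type": "A"},
}
desired = {
    "hcloud_server.web": {"server_type": "cx32", "image": "ubuntu-24.04"},
    "cloudflare_record.www": {"name": "www", "type": "A"},
}

print(plan(desired, state))
```

Lose the state and the tool no longer knows what it created, which is why it belongs in a shared, durable backend rather than on one laptop.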

What you will provision
- Hetzner VPS (they have a Terraform provider)
- DNS records (Cloudflare has an excellent provider)
- Object storage bucket

Example workflow

# Create a Hetzner VPS with Terraform
terraform init
terraform plan -out=tfplan
terraform apply tfplan
# See the IP of the created server
terraform output server_ip
# Cleanly destroy everything
terraform destroy

Ansible

Configure machines once they exist.

Terraform creates the machine. Ansible configures it (install packages, deploy the app, manage config files).

Structure of a serious playbook

ansible/
├── inventory/
│   ├── production
│   └── staging
├── roles/
│   ├── common/          # basic security, users
│   ├── nginx/
│   ├── postgresql/
│   └── django/
└── deploy.yml

Concepts to master
- Idempotence: running a playbook 10 times must leave the machine in the same state as running it once, with no changes reported after the first run
- ansible-vault to encrypt secrets
- Handlers to restart a service only when its config has changed
- Tags to run only part of a playbook
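
Idempotence and handlers fit in a few lines once you see the pattern: every task checks the current state first, reports whether it changed anything, and the restart fires only if something did. This is a sketch of the idea in plain Python, not Ansible's implementation; the function names, config file and settings are invented for the example.

```python
import os
import tempfile


def ensure_line(path: str, line: str) -> bool:
    """Make sure `line` is present in the file. Return True only if it changed."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if line in existing:
        return False                      # already in the desired state
    with open(path, "a") as f:
        f.write(line + "\n")
    return True


def run_play(path: str, lines: list[str]) -> dict[str, object]:
    """Apply every task, then fire the 'handler' once if anything changed."""
    changed = [ensure_line(path, line) for line in lines]
    restarted = any(changed)              # handler: restart only on change
    return {"changed": sum(changed), "restarted": restarted}


with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "app.conf")  # hypothetical config file
    wanted = ["worker_processes 2;", "client_max_body_size 10m;"]
    first = run_play(conf, wanted)
    second = run_play(conf, wanted)
    print(first, second)
```

The first run reports two changes and a restart; the second run reports none, which is the behavior to expect from a well-written playbook.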

Phase 3 Project

Full reproducibility: starting from a terraform apply followed by ansible-playbook, you must be able to recreate your entire production environment in under 10 minutes, with zero manual intervention. If your VPS burns down tonight, you must be able to have a new working one by tomorrow morning.


Phase 4 — CI/CD (2–3 weeks)

Code that doesn't deploy automatically is code waiting to be forgotten.

GitHub Actions

Minimal pipeline for a Django app

name: Deploy

on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          pip install -r requirements.txt
          python manage.py test

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_IP }}
          username: deploy
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            cd /app
            git pull origin main
            docker compose up -d --build

What you need to understand
- Secret management in GitHub (never put keys in code)
- Artifacts and cache to speed up builds
- Environments (staging vs production) with manual approval gates
- When to use matrix builds

Phase 4 Project

Zero-touch deployment: a git push to main triggers the tests, and if everything passes, deploys to production automatically. You no longer SSH in to deploy.


Phase 5 — Kubernetes (1–2 months)

K8s is complex. Don't approach it before thoroughly digesting the previous phases — otherwise you'll memorize commands without understanding why.

Where to Start

k3s on two Hetzner VPS (~€8/month total)
- One master node, one worker node
- Much closer to real K8s than Minikube or Kind
- You actually manage the networking, storage, and failures yourself

# On the master
curl -sfL https://get.k3s.io | sh -

# On the worker
curl -sfL https://get.k3s.io | K3S_URL=https://<master-ip>:6443 \
  K3S_TOKEN=<node-token> sh -

Core Concepts (in order)

  1. Pod — the basic unit. One or more containers sharing network and storage
  2. Deployment — manages replicas, rolling updates, rollbacks
  3. Service — exposes a set of pods with a stable IP
  4. Ingress — cluster-level reverse proxy (replaces Nginx in your current stack)
  5. ConfigMap / Secret — configuration and secrets injected into pods
  6. PersistentVolume / PVC — storage that survives a pod's death
  7. Namespace — logical isolation within the cluster
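
The idea behind a Deployment keeping replicas alive is a reconciliation loop: compare the desired count with what is running and close the gap. The following is a toy model of that loop only. In a real cluster the work is split across controllers and the scheduler, and the pod names here are invented.

```python
import itertools

_ids = itertools.count(1)


def reconcile(desired_replicas: int, running: list[str]) -> list[str]:
    """Return the pod list after one reconciliation pass."""
    pods = list(running)
    while len(pods) < desired_replicas:
        pods.append(f"myapp-{next(_ids)}")   # start a replacement pod
    while len(pods) > desired_replicas:
        pods.pop()                           # scale down the surplus
    return pods


pods = reconcile(3, [])
killed = pods.pop(0)           # simulate deleting one pod
pods = reconcile(3, pods)      # the next pass notices and replaces it
print(killed, pods)
```

You declare the target state and the system converges toward it, instead of you issuing start and stop commands.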

What You Must Be Able to Do

# Diagnose a pod that won't start
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous

# Get into a pod to debug
kubectl exec -it <pod-name> -- bash

# See what's consuming resources
kubectl top pods --all-namespaces

# Roll back a deployment
kubectl rollout undo deployment/myapp

# Apply a manifest and watch the rollout
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp

Phase 5 Project

Migrate your Django app onto your k3s cluster: Deployment for the app, StatefulSet for Postgres, PVC for data, Ingress with cert-manager for automatic HTTPS. You must be able to kill a pod and watch K8s spin up a new one automatically.


Phase 6 — Observability (ongoing)

A system you can't observe is a system you don't control.

The Standard Stack

Prometheus + Grafana
- Prometheus scrapes metrics from your apps and infrastructure
- Grafana visualizes and lets you set up alerts
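
What Prometheus scrapes is plain text over HTTP. As a minimal sketch of that text exposition format, the function below renders one counter with labels. The metric name and label values are hypothetical, and label values are not escaped here, which a real client library would handle.

```python
def render_counter(name: str, help_text: str, samples: dict[tuple, float]) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and label values, for illustration only.
requests_total = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "POST"), ("status", "500")): 3,
}
print(render_counter("app_http_requests_total", "Total HTTP requests.", requests_total))
```

In practice a client library produces this output for you; seeing the raw format makes it easier to debug a scrape with curl.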

Loki (if you want to go further)
- Centralized log aggregation, same philosophy as Prometheus but for logs

What You Need to Set Up

In Your Django Code

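# settings.py (django-prometheus)
INSTALLED_APPS = ['django_prometheus', ...]
MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',  # must be first
    ...,
    'django_prometheus.middleware.PrometheusAfterMiddleware',   # must be last
]

# urls.py: path('', include('django_prometheus.urls')) exposes /metrics
# Prometheus scrapes this endpoint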

What You End Up With

At the end of these phases, you have:

This isn't a checklist. It's what you're capable of doing.


Resources

No passive tutorials. But a few useful references when you're stuck:


Last updated: March 2026