Roadmap: Becoming Good at Infrastructure
March 20, 2026, 5:51 p.m.
Philosophy: no certifications, no passive tutorials. You learn by breaking things on real machines. Each phase has a concrete project: the project validates your progress, not a multiple-choice quiz.
Phase 0 — Get a Real Machine to Break
Before anything else, you need a playground.
Recommended option: Hetzner VPS
- CX22: ~€4/month, 2 vCPU, 4 GB RAM, 40 GB SSD
- Go with Ubuntu 24.04 LTS
- You can destroy and recreate as many times as you want
Immediate goal: connect via SSH with a key pair, disable password authentication, and configure a basic UFW firewall. If you can't do that yet without docs, start here.
Phase 1 — Linux in Depth (2–4 weeks)
You already code, so you know the basics. Here we go deeper.
What You Need to Master
Processes & scheduling
- ps, top, htop, pgrep, kill, nice, renice
- Understand process states (R, S, D, Z)
- strace to trace syscalls of a running process in real time
- /proc/[pid]/ — read process state directly from the kernel
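Everything under /proc/[pid]/ is plain text, so you can read it like any other file. A minimal sketch, assuming a Linux host; the helper names parse_proc_status and process_state are illustrative, not a standard API:

```python
from __future__ import annotations

from pathlib import Path


def parse_proc_status(text: str) -> dict[str, str]:
    """Turn the 'Key:<tab>value' lines of /proc/[pid]/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def process_state(pid: int | str = "self") -> str:
    """Return the one-letter state (R, S, D, Z, ...) reported by the kernel."""
    status = Path(f"/proc/{pid}/status").read_text()
    # The State field looks like "S (sleeping)"; keep only the letter.
    return parse_proc_status(status)["State"].split()[0]


if __name__ == "__main__":
    # State of this Python process itself (Linux only).
    print(process_state())
```

Run it against the PID of a process you found with pgrep and compare the letter with what ps shows in its STAT column.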
Memory
- free, vmstat, /proc/[pid]/smaps
- Understand the difference between virtual and physical memory
- What swap is and when the kernel triggers it
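To see the virtual vs physical gap on a real process, compare VmSize (address space the process has mapped) with VmRSS (pages actually resident in RAM). A minimal sketch, assuming Linux; memory_kib is a hypothetical helper name:

```python
from __future__ import annotations

from pathlib import Path


def memory_kib(status_text: str) -> tuple[int, int]:
    """Return (virtual, resident) sizes in KiB from /proc/[pid]/status text."""
    sizes = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # The value looks like "   12345 kB"; keep the number.
            sizes[key] = int(value.split()[0])
    return sizes["VmSize"], sizes["VmRSS"]


if __name__ == "__main__":
    virtual, resident = memory_kib(Path("/proc/self/status").read_text())
    print(f"virtual: {virtual} KiB, resident: {resident} KiB")
```

The virtual figure is usually much larger than the resident one: mapped address space costs nothing until the pages are actually touched.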
Files & I/O
- lsof — see which files are open by which process
- inotifywait — observe filesystem events in real time
- Inodes, hard links vs symlinks
- /dev/null, /dev/zero, /dev/urandom — why they exist
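The hard link vs symlink difference is easiest to see by deleting the original file. A small sketch, assuming a filesystem that supports both link types (any normal Linux filesystem does); link_demo is an illustrative name:

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a file, a hard link and a symlink, then delete the original."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "soft.txt")
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, hard)      # second name for the same inode
        os.symlink(original, soft)   # separate file that stores a path

        same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
        link_count = os.stat(original).st_nlink

        os.remove(original)          # drops one name; the inode lives on
        with open(hard) as f:
            hard_link_survives = f.read() == "hello"
        # os.path.exists follows symlinks, so a dangling one reports False.
        symlink_dangling = os.path.islink(soft) and not os.path.exists(soft)

        return {
            "same_inode": same_inode,
            "link_count": link_count,
            "hard_link_survives": hard_link_survives,
            "symlink_dangling": symlink_dangling,
        }


if __name__ == "__main__":
    print(link_demo())
```

The data is only freed when the inode's link count drops to zero and no process holds it open, which is also why lsof can show deleted files still eating disk space.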
System-level networking
- ss (replaces netstat), ip addr, ip route
- tcpdump -i eth0 port 80 — capture live traffic
- Understand what curl https://example.com does at each network layer
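One way to internalize the layers behind that curl call is to do each step by hand. A rough sketch using only the standard library; build_request and fetch_status_line are illustrative names, and real curl does far more (HTTP/2 negotiation, redirects, proxies):

```python
import socket
import ssl


def build_request(host: str, path: str = "/") -> bytes:
    """Layer 7: the plain-text HTTP/1.1 request sent inside the TLS tunnel."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_status_line(host: str, port: int = 443) -> str:
    # 1. DNS: resolve the name to an address.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    # 2. TCP: the three-way handshake happens inside connect().
    with socket.socket(family, socktype, proto) as raw:
        raw.settimeout(10)
        raw.connect(sockaddr)
        # 3. TLS: certificate verification and key exchange.
        context = ssl.create_default_context()
        with context.wrap_socket(raw, server_hostname=host) as tls:
            # 4. HTTP: send the request, read until the first line is complete.
            tls.sendall(build_request(host))
            response = b""
            while b"\r\n" not in response:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                response += chunk
    return response.split(b"\r\n", 1)[0].decode("ascii", "replace")


if __name__ == "__main__":
    print(fetch_status_line("example.com"))
```

Run it while tcpdump is capturing on port 443 and match each numbered step to the packets you see.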
Serious bash scripting
- set -euo pipefail at the top of every script — always
- Signal handling with trap
- Heredocs, process substitution <(...)
- xargs, parallel, awk, sed for pipeline data processing
Phase 1 Project
Deploy your Django app manually on your VPS.
No Docker, no magic. By hand:
1. PostgreSQL installed and configured
2. Gunicorn as the WSGI server
3. Nginx as a reverse proxy
4. HTTPS with Certbot (Let's Encrypt)
5. Systemd service so Gunicorn restarts automatically
6. UFW exposing only ports 22, 80, 443
Don't move to Phase 2 until your app is running in production and you understand why each piece is there.
Phase 2 — Containers (3–4 weeks)
Docker
What you need to understand, not just use
- Image layers: why the order of instructions in a Dockerfile affects cache and image size
- Linux namespaces (pid, net, mnt, uts) — Docker is just syntactic sugar on top of these
- Cgroups — how Docker limits memory and CPU
- Multi-stage builds to produce lean images
- .dockerignore and why it exists
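Why instruction order matters for the cache can be shown with a toy model: each layer's cache key depends on its parent's key plus its own instruction and inputs, so one changed layer invalidates everything after it. This is a deliberately simplified sketch, not Docker's actual implementation; the bracketed tags stand in for the digest of the copied files:

```python
from __future__ import annotations

import hashlib


def layer_keys(instructions: list[str]) -> list[str]:
    """Chain hashes: every key depends on all instructions before it."""
    keys = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256(f"{parent}\n{instruction}".encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old: list[str], new: list[str]) -> int:
    """Count leading layers whose cache key is unchanged between two builds."""
    count = 0
    for before, after in zip(layer_keys(old), layer_keys(new)):
        if before != after:
            break
        count += 1
    return count


# Copying all source first: any code change invalidates the pip install layer.
code_first_v1 = ["FROM python:3.12-slim", "COPY . /app [source v1]",
                 "RUN pip install -r /app/requirements.txt"]
code_first_v2 = ["FROM python:3.12-slim", "COPY . /app [source v2]",
                 "RUN pip install -r /app/requirements.txt"]

# Copying requirements first: the expensive install layer stays cached.
deps_first_v1 = ["FROM python:3.12-slim", "COPY requirements.txt /app/ [reqs v1]",
                 "RUN pip install -r /app/requirements.txt", "COPY . /app [source v1]"]
deps_first_v2 = ["FROM python:3.12-slim", "COPY requirements.txt /app/ [reqs v1]",
                 "RUN pip install -r /app/requirements.txt", "COPY . /app [source v2]"]

if __name__ == "__main__":
    print(reused_layers(code_first_v1, code_first_v2))
    print(reused_layers(deps_first_v1, deps_first_v2))
```

In the first ordering only the base image is reused after a code change; in the second, the dependency install is reused too.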
Commands to have in your fingers
docker build --no-cache -t myapp:v1 .
docker run -d --name myapp -p 8000:8000 --env-file .env myapp:v1
docker exec -it myapp bash
docker logs -f myapp
docker stats
docker inspect myapp
Docker networking
- Bridge, host, overlay: know which to use and why
- How two containers communicate on the same custom network
Docker Compose
For orchestrating your local and staging stack.
# Minimal example for a Django app
services:
  web:
    build: .
    command: gunicorn config.wsgi:application --bind 0.0.0.0:8000
    env_file: .env
    depends_on:
      db:
        condition: service_healthy
  db:
    image: postgres:16
    # .env must define POSTGRES_PASSWORD, or the postgres image refuses to start
    env_file: .env
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 5
volumes:
  postgres_data:
Phase 2 Project
Containerize your Django app: one container for the app, one for Postgres, Nginx as a reverse proxy in front. Deploy the whole stack on your VPS with docker compose up -d. Understand why volumes exist and what happens if you delete the Postgres container without one.
Phase 3 — Infrastructure as Code (3–5 weeks)
The goal: your infrastructure must be versioned code in Git, not clicks in a web console.
Terraform
Provision cloud resources in a reproducible way.
Core concepts
- terraform init, plan, apply, destroy
- State — what it is, why it must not be in Git, how to store it in an S3/GCS bucket
- Modules to avoid duplication
- data sources to reference existing resources
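If state feels abstract, think of it as the recorded "what exists right now", which a plan compares against "what the code says should exist". The following is a toy model of that comparison, not Terraform's real algorithm; the resource names and attributes are made up for illustration:

```python
def plan(desired: dict, state: dict) -> dict:
    """Diff the desired config against the recorded state."""
    return {
        # In the code but not in state: must be created.
        "create": sorted(desired.keys() - state.keys()),
        # In both, but attributes differ: must be updated.
        "update": sorted(
            name for name in desired.keys() & state.keys()
            if desired[name] != state[name]
        ),
        # In state but no longer in the code: must be destroyed.
        "destroy": sorted(state.keys() - desired.keys()),
    }


if __name__ == "__main__":
    desired = {"server": {"type": "cx22"}, "dns_record": {"name": "app"}}
    state = {"server": {"type": "cx11"}, "old_bucket": {}}
    print(plan(desired, state))
```

Lose the state and the tool no longer knows what it created, which is why state lives in a shared bucket rather than on your laptop or in Git.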
What you will provision
- Hetzner VPS (they have a Terraform provider)
- DNS records (Cloudflare has an excellent provider)
- Object storage bucket
Example workflow
# Create a Hetzner VPS with Terraform
terraform init
terraform plan -out=tfplan
terraform apply tfplan
# See the IP of the created server
terraform output server_ip
# Cleanly destroy everything
terraform destroy
Ansible
Configure machines once they exist.
Terraform creates the machine. Ansible configures it (install packages, deploy the app, manage config files).
Structure of a serious playbook
ansible/
├── inventory/
│ ├── production
│ └── staging
├── roles/
│ ├── common/ # basic security, users
│ ├── nginx/
│ ├── postgresql/
│ └── django/
└── deploy.yml
Concepts to master
- Idempotence: running a playbook 10 times must leave the machine in the same state as running it once
- ansible-vault to encrypt secrets
- Handlers to restart a service only when its config has changed
- Tags to run only part of a playbook
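Idempotence and handlers fit together: a task reports whether it changed anything, and the handler fires only if it did. A minimal Python sketch of that idea, not how Ansible is implemented; ensure_line and run_play are hypothetical names, and the sshd line is just an example setting:

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is in the file. Return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


def run_play(config: Path) -> list[str]:
    """Apply the task and queue the handler only when the config changed."""
    handlers = []
    if ensure_line(config, "PasswordAuthentication no"):
        handlers.append("restart sshd")
    return handlers


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "sshd_config"
        print(run_play(config))  # first run: change made, handler queued
        print(run_play(config))  # second run: nothing to do
```

A task that appends blindly with a shell echo would grow the file on every run; describing the desired end state is what makes the second run a no-op.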
Phase 3 Project
Full reproducibility: starting from a terraform apply followed by ansible-playbook, you must be able to recreate your entire production environment in under 10 minutes, with zero manual intervention. If your VPS burns down tonight, you must be able to have a new working one by tomorrow morning.
Phase 4 — CI/CD (2–3 weeks)
Code that doesn't deploy automatically is code waiting to be forgotten.
GitHub Actions
Minimal pipeline for a Django app
name: Deploy
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          pip install -r requirements.txt
          python manage.py test
  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to production
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SERVER_IP }}
          username: deploy
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            cd /app
            git pull origin main
            docker compose up -d --build
What you need to understand
- Secret management in GitHub (never put keys in code)
- Artifacts and cache to speed up builds
- Environments (staging vs production) with manual approval gates
- When to use matrix builds
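A matrix build is a cartesian product: one job per combination of the listed values. The sketch below models only that expansion (it ignores include/exclude rules), and the version numbers are illustrative:

```python
from __future__ import annotations

from itertools import product


def expand_matrix(matrix: dict[str, list]) -> list[dict]:
    """Return one dict per job: every combination of the matrix values."""
    keys = list(matrix)
    return [
        dict(zip(keys, combo))
        for combo in product(*(matrix[key] for key in keys))
    ]


if __name__ == "__main__":
    jobs = expand_matrix({"python": ["3.11", "3.12"], "django": ["4.2", "5.0"]})
    for job in jobs:
        print(job)
```

Two values on each of two axes already means four jobs, so use a matrix when you genuinely support several versions, not by default.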
Phase 4 Project
Zero-touch deployment: a git push to main triggers the tests, and if everything passes, deploys to production automatically. You no longer SSH in to deploy.
Phase 5 — Kubernetes (1–2 months)
K8s is complex. Don't approach it before thoroughly digesting the previous phases — otherwise you'll memorize commands without understanding why.
Where to Start
k3s on two Hetzner VPSes (~€8/month total)
- One master node, one worker node
- Much closer to real K8s than Minikube or Kind
- You actually manage the networking, storage, and failures yourself
# On the master
curl -sfL https://get.k3s.io | sh -
# On the worker (the node token is in /var/lib/rancher/k3s/server/node-token on the master)
curl -sfL https://get.k3s.io | K3S_URL=https://<master-ip>:6443 \
K3S_TOKEN=<node-token> sh -
Core Concepts (in order)
- Pod — the basic unit. One or more containers sharing network and storage
- Deployment — manages replicas, rolling updates, rollbacks
- Service — exposes a set of pods with a stable IP
- Ingress — cluster-level reverse proxy (replaces Nginx in your current stack)
- ConfigMap / Secret — configuration and secrets injected into pods
- PersistentVolume / PVC — storage that survives a pod's death
- Namespace — logical isolation within the cluster
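Most of these objects share one mechanism: you declare a desired state and a controller keeps nudging reality toward it. A toy reconcile loop for a Deployment's replica count, as a mental model only (the real controller works through ReplicaSets and the API server); the pod names are made up:

```python
from __future__ import annotations


def reconcile(
    desired_replicas: int, running: list[str], next_id: int
) -> tuple[list[str], int]:
    """Start or stop pods until the running count matches the desired count."""
    pods = list(running)
    while len(pods) < desired_replicas:
        pods.append(f"myapp-{next_id}")  # "schedule" a replacement pod
        next_id += 1
    while len(pods) > desired_replicas:
        pods.pop()                       # scale down: remove the newest pod
    return pods, next_id


if __name__ == "__main__":
    pods, next_id = reconcile(3, [], 0)
    print(pods)
    pods.remove("myapp-1")               # simulate a pod crashing
    pods, next_id = reconcile(3, pods, next_id)
    print(pods)
```

Nobody tells the controller that a pod died; it only notices that two is not three and fixes the difference.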
What You Must Be Able to Do
# Diagnose a pod that won't start
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
# Get into a pod to debug
kubectl exec -it <pod-name> -- bash
# See what's consuming resources
kubectl top pods --all-namespaces
# Roll back a deployment
kubectl rollout undo deployment/myapp
# Apply a manifest and watch the rollout
kubectl apply -f deployment.yaml
kubectl rollout status deployment/myapp
Phase 5 Project
Migrate your Django app onto your k3s cluster: Deployment for the app, StatefulSet for Postgres, PVC for data, Ingress with cert-manager for automatic HTTPS. You must be able to kill a pod and watch K8s spin up a new one automatically.
Phase 6 — Observability (ongoing)
A system you can't observe is a system you don't control.
The Standard Stack
Prometheus + Grafana
- Prometheus scrapes metrics from your apps and infrastructure
- Grafana visualizes and lets you set up alerts
Loki (if you want to go further)
- Centralized log aggregation, same philosophy as Prometheus but for logs
What You Need to Set Up
- System metrics: CPU, memory, disk I/O, network
- Application metrics: HTTP response times, error rate, request count
- Alerts: get notified if disk is at 90%, if the app stops responding, if Postgres lags
- A Grafana dashboard that shows your infra health at a glance
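An alert is just a threshold evaluated on a metric. In the real stack you would express this as a Prometheus alerting rule; the Python below only sketches the logic of the "disk at 90%" example, with illustrative function names:

```python
import shutil


def used_percent(total: int, used: int) -> float:
    """Share of the disk in use, as a percentage."""
    return 100.0 * used / total


def should_alert(total: int, used: int, threshold: float = 90.0) -> bool:
    """Fire when usage reaches the threshold."""
    return used_percent(total, used) >= threshold


def check_disk(path: str = "/", threshold: float = 90.0) -> bool:
    """Evaluate the rule against the real filesystem at `path`."""
    usage = shutil.disk_usage(path)
    return should_alert(usage.total, usage.used, threshold)


if __name__ == "__main__":
    print("ALERT" if check_disk("/") else "ok")
```

The hard part is rarely the comparison itself; it is choosing thresholds that fire early enough to act on without waking you up for nothing.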
In Your Django Code
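# settings.py (django-prometheus)
INSTALLED_APPS = ['django_prometheus', ...]
MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',  # must be first
    ...,
    'django_prometheus.middleware.PrometheusAfterMiddleware',  # must be last
]
# urls.py: add path('', include('django_prometheus.urls')) to expose /metrics
# Prometheus scrapes this endpoint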
What You End Up With
At the end of these phases, you have:
- A VPS or cluster managed entirely as code (Terraform + Ansible)
- Your Django app (or any app) deployed in real production
- A CI/CD pipeline that deploys automatically on every push
- Monitoring that alerts you when something goes wrong
- The ability to rebuild your entire infrastructure in < 10 minutes from scratch
This isn't a checklist. It's what you're capable of doing.
Resources
No passive tutorials. But a few useful references when you're stuck:
- manpages: first, always
- Official docs: Terraform, Ansible, Kubernetes (better than 95% of blog posts)
- The Linux Command Line — William Shotts (free online)
- Site Reliability Engineering — Google (free online) — read after Phase 3
- DigitalOcean tutorials — often clearer than official docs for networking/Linux concepts
Last updated: March 2026