Using Ansible for my self-hosted apps

I self-host a variety of apps on my own infrastructure and make extensive use of Ansible playbooks to keep it all running, document everything as code, and make it easy to spin up new environments so I can test changes without breaking everything! This post summarises how the playbooks work, how I do testing, and how I try to help myself pick up context and dive in quickly when I have a few spare minutes.

Why, What and Where

Some things I self-host for practical reasons, like the much better transfer rates (and cheaper per GB storage cost) I get from my home WiFi to the server under my sofa compared to the cloud. Sometimes it's for principled reasons, like wanting to retain control over where my data lives and who has access to it. And sometimes it's just because I like doing it!

Amongst other things I'm running:

  • Nextcloud for files and notes

  • Home Assistant to monitor and control various sensors and smart devices (plus ESPHome to manage device firmware)

  • TimescaleDB and Grafana for storage and more powerful analysis of that sensor data

  • Convos IRC client for always-on IRC (I used to use The Lounge but Convos works better with SSO)

  • Firefly III for tracking my finances

  • Exim for outgoing email (incoming is handled by Cloudflare Email Routing so I don't have to worry about losing mail if my servers go down)

These are spread across a mixture of an old laptop running under my sofa with some storage attached (called sofaserver), and a Vultr VPS in London (called gate).

How it works

Most of the apps are deployed using Docker and orchestrated with docker-compose. Compose lets you write YAML files describing a set of one or more containers that make up an app, define some networks to connect those containers, and manage that app as a single unit. I've considered using a more "serious" orchestrator like Kubernetes in the past, but Compose is very easy to use, and Kubernetes is a lot of complexity for only two hosts in my spare time! Below is an example of a Compose file that will be rendered through Ansible templates.

---
version: "3"
services:
  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    volumes:
      - /opt/grafana/data/:/var/lib/grafana:rw
      - /opt/grafana/grafana.ini:/etc/grafana/grafana.ini:ro
    user: "{{ container_user.uid }}:{{ container_user.gid }}"
    labels:
      - traefik.enable=true
      - traefik.http.routers.grafana.rule=Host(`grafana.{{ services_domain }}`)
      - traefik.http.services.grafana.loadbalancer.server.port=3000
      - traefik.http.routers.grafana.entrypoints=websecure
      - traefik.http.routers.grafana.tls.certresolver=default
      - traefik.http.routers.grafana.middlewares=traefik-forward-auth
    restart: always
    networks:
      - traefik

networks:
  traefik:
    external: true

Notice the traefik... labels? They configure the Traefik proxy used to expose the container externally. Traefik reads the request (which is made to the standard HTTPS port 443 regardless of destination), presents the appropriate TLS certificate and routes the request to the backend. It also automatically issues and renews certificates using Let's Encrypt. I use the DNS-01 challenge via my Cloudflare DNS, which also means I can restrict some apps to only allow connections from my home network while still using publicly trusted certificates.
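The certresolver named default in those labels corresponds to a DNS-01 resolver in Traefik's static configuration, along these lines (a minimal sketch assuming Traefik v2; the email address and storage path are placeholders rather than my real config):

entryPoints:
  websecure:
    address: ":443"

certificatesResolvers:
  default:
    acme:
      email: admin@example.com
      storage: /letsencrypt/acme.json
      dnsChallenge:
        provider: cloudflare
        # The Cloudflare API token is passed to the container via the
        # CF_DNS_API_TOKEN environment variable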

A Traefik middleware, traefik-forward-auth, handles authorisation for access to my apps. It intercepts requests and performs a standard OIDC flow, only letting through users who log in successfully. The authentication service itself is not self-hosted; it would be possible to use something like Keycloak for this, but I've gone with a managed service instead to make securing the login process "somebody else's problem".
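The general shape of that middleware, assuming the widely used thomseddon/traefik-forward-auth image, is roughly the Compose service below; the environment variable names come from that image, while the Jinja variables and issuer URL are placeholders I've invented for the example.

traefik-forward-auth:
  image: thomseddon/traefik-forward-auth:2
  environment:
    - DEFAULT_PROVIDER=oidc
    - PROVIDERS_OIDC_ISSUER_URL=https://login.example.com/
    - PROVIDERS_OIDC_CLIENT_ID={{ oidc_client_id }}
    - PROVIDERS_OIDC_CLIENT_SECRET={{ oidc_client_secret }}
    - SECRET={{ forward_auth_cookie_secret }}  # signs the session cookie
  labels:
    # Declares the middleware that the other routers reference by name
    - traefik.enable=true
    - traefik.http.middlewares.traefik-forward-auth.forwardauth.address=http://traefik-forward-auth:4181
    - traefik.http.middlewares.traefik-forward-auth.forwardauth.authResponseHeaders=X-Forwarded-User
  restart: always
  networks:
    - traefik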

A previous iteration of this infrastructure used NGINX for the reverse proxying, and I built an Ansible role, nginx-gateway, to set that up. However, it didn't take care of issuing certificates: there was some Terraform that had to be rerun every few months, whereas Traefik handles that automatically, which is a real strength. That iteration also used client certificates (aka mTLS) for authentication, which was very secure but clunky: certs were issued with cobbled-together openssl commands, and several mobile apps and integrations wouldn't work because they couldn't present a client certificate, which is why I switched to OIDC.

Ansible Playbooks

The Ansible playbooks that manage all of this are stored in a private GitHub repository. The repo is laid out broadly along the lines of the layout in the Ansible documentation: there's a main playbook that imports various other playbooks to set everything up, along with host and group vars and an inventory file per environment.

site-automation/
  group_vars/
    production/
      cleartext.yml  # Non-sensitive vars for the production environment
      secret.yml  # Vault file containing encrypted secrets for production environment
    staging/
      cleartext.yml
      secret.yml
    vagrant_local/
      cleartext.yml
      secret.yml
  host_vars/
    gate/
      become.yml  # Password to become root on system
      gate.yml  # Vars that only apply to gate
    sofaserver/
      become.yml
      sofaserver.yml
  notes/
    setup-grafana.md  # Various Markdown notes about steps that have to be done manually
    setup-login.md
    ...
  templates/
    compose/
      esphome.yml  # Docker Compose file template for ESP Home
      firefly.yml
      grafana.yml
      ...
    gate/
      exim-cert-extract-cron.j2  # Exim certificate extraction cronjob template
      exim-cert-reload.sh  # Cert extraction script
      ...
    sofaserver/
      firefly-config.env.j2
      hdparm.conf.j2
      ...
  vagrant-testing/
    Vagrantfile  # Vagrantfile for describing local dev environment
    provision.sh  # Helper script to call `ansible-playbook` using Vagrant inventory
  .gitignore
  ansible.cfg
  inventory-production
  inventory-staging
  requirements.yml  # Roles used by the repo
  servers-00-all.yml  # Site playbook importing the others
  servers-15-networks.yml  # Configure networking and basic security hardening
  servers-20-gate.yml  # Server-specific plays for gate
  servers-20-sofaserver.yml  # Server-specific plays for sofaserver
  servers-30-containers.yml  # Set up containerised services
  servers-40-gate-post.yml  # Plays to run on gate after containers are up

The assumed starting point is a basic Ubuntu install, either using the installer (for the machines at home) or one of the cloud images on my VPS provider.

First off, the servers-15-networks.yml playbook does some security hardening and network configuration by applying the firewall and security roles from Jeff Geerling. It also sets up WireGuard between the machines to provide a secure internal network (using githubixx.ansible_role_wireguard) and creates records in Cloudflare DNS for the machines' IP addresses.
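Condensed down, that playbook amounts to something like the sketch below; the role names are the published ones, but the Cloudflare task and its variables are illustrative rather than the exact code.

- hosts: all
  become: true
  roles:
    - geerlingguy.security   # SSH hardening, automatic updates
    - geerlingguy.firewall   # firewall rules driven by firewall_allowed_tcp_ports
    - githubixx.ansible_role_wireguard  # WireGuard tunnel between the hosts
  tasks:
    - name: Ensure a DNS record exists for this host's public address
      community.general.cloudflare_dns:
        zone: "{{ services_domain }}"
        record: "{{ inventory_hostname }}"
        type: A
        value: "{{ ansible_default_ipv4.address }}"
        api_token: "{{ cloudflare_api_token }}"
        state: present
      delegate_to: localhost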

Next up is host-specific configuration. For sofaserver that means checking the health of the storage and RAID array and mounting it (creating the array is still a manual step), as well as setting up Samba file sharing and cloning a few Home Assistant add-ons. For gate, the Exim mail config is created, along with a script that extracts TLS certificates from Traefik so they can be reused for Exim.
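To give a flavour of the sofaserver play, the storage handling boils down to a pair of tasks roughly like these (the device, mount point and exact health check here are placeholders, not the real playbook):

- name: Check the RAID array reports a healthy state
  ansible.builtin.command: mdadm --detail /dev/md0
  register: raid_status
  changed_when: false
  failed_when: "'State : clean' not in raid_status.stdout"

- name: Mount the storage array
  ansible.posix.mount:
    path: /mnt/storage
    src: /dev/md0
    fstype: ext4
    state: mounted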

Setting up and launching containers comes next, in servers-30-containers.yml: it starts by installing Docker and applying the samdbmg.traefik-auth-proxy role, then sets up each service from a definition like the one below.

services:  
  grafana:
    compose_file: templates/compose/grafana.yml
    data_backups:
      - path: /opt/grafana/data
    templates:
      - src: templates/sofaserver/grafana.ini
        dest: /opt/grafana/grafana.ini
    dns_entry:
      record: "{{ ('grafana' + services_zone) | trim('.') }}"

Each container service has a host var entry like the one above, which is used by the servers-30-containers.yml playbook to set it up. Compose files are templated to the host along with any config file templates, and data directories are created for containers to mount. Where those data directories contain useful application state, Duplicity is used for periodic backups of the data (using samdbmg.schedule-duplicity).
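The plumbing that consumes those entries is essentially a loop over the services dict, something like the simplified sketch below; the /opt/compose layout, project naming and the community.docker.docker_compose_v2 module are assumptions for the example, and the real playbook also wires up the backups and DNS entries.

- name: Create data directories that containers will mount
  ansible.builtin.file:
    path: "{{ item.1.path }}"
    state: directory
    owner: "{{ container_user.uid }}"
    group: "{{ container_user.gid }}"
  loop: "{{ services | dict2items | subelements('value.data_backups', skip_missing=True) }}"

- name: Template each service's Compose file to the host
  ansible.builtin.template:
    src: "{{ item.value.compose_file }}"
    dest: "/opt/compose/{{ item.key }}.yml"
  loop: "{{ services | dict2items }}"

- name: Bring each service up
  community.docker.docker_compose_v2:
    project_src: /opt/compose
    project_name: "{{ item.key }}"
    files:
      - "{{ item.key }}.yml"
  loop: "{{ services | dict2items }}"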

Finally, any extra post-container configuration is done; for example, servers-40-gate-post.yml ensures the Traefik -> Exim certificate extraction process has run at least once.
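That last step is only a task or two; conceptually it's along these lines (the paths are invented for the example, and the script itself is the templated exim-cert-reload.sh from earlier):

- name: Run the certificate extraction once so Exim has a certificate immediately
  ansible.builtin.command: /opt/exim/exim-cert-reload.sh
  args:
    # 'creates' keeps this idempotent: skip once a cert has already been extracted
    creates: /opt/exim/certs/exim.crt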

Each deployed environment (production, staging and local testing) has its own set of group_vars, used to define the domain under which everything is hosted, versions of services, IP address ranges, etc.
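As an illustration, a production cleartext.yml contains values along these lines; only services_domain and services_zone appear elsewhere in this post, the other names are invented for the example.

# group_vars/production/cleartext.yml (illustrative)
services_domain: example.com      # domain the apps are served under
services_zone: ""                 # suffix for DNS records (e.g. ".staging" in staging, hence the trim('.'))
grafana_version: "10.4"           # pin image versions per environment
wireguard_subnet: 10.20.0.0/24    # internal network range for WireGuard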

Testing & Development

Since this is something I tinker with in my spare time (and that's spare time from both work and parenting), I have to be able to pick it back up and be productive quickly. To help with that I try to keep detailed notes of what I'm working on: my current approach is a GitHub Issue for each bug or feature, into which I summarise progress, write up my research and add links to commits as I work. I also maintain throwaway test environments so I can thoroughly test changes and am unlikely to break something and find I have to fix it immediately when I'm supposed to be on baby duty!

One of those test environments is a set of Vagrant VMs, one per host, which automates building reproducible, disposable infrastructure.

Vagrant sets up a VM for each host, connects it to an internal network, and also applies any required hardware customisation like attaching extra disks. The Vagrant Ansible provisioner sorts out creating an inventory file, and a little script like the following lets me run Ansible commands using that inventory file (e.g. to apply part of the playbook for faster iterations than running vagrant provision).

#!/bin/bash

# Provision a single Vagrant VM using Ansible - mostly as a shortcut to save some typing. First arg is VM, rest are
# passed through
VM_NAME=$1
shift

ANSIBLE_CONFIG=../ansible.cfg ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook \
    -i .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory \
    ../servers-00-all.yml \
    --limit "${VM_NAME}" "$@"

Once a change to the playbooks has been written, I test it in a "staging" environment: VMs on a workstation that closely mirror the real production hosts. I use this to practise complex changes that need to be rehearsed against something very like the real thing, such as OS upgrades.

ansible-playbook -i inventory-production servers-00-all.yml -v --check --diff 2>&1 | tee check-prod-update.log

Finally, once a change is ready to go, I run Ansible in check and diff mode using the command above, and review all the changes it's going to make to the real thing, which has saved me a few times!

The Future

I'm going to keep evolving this setup and there are quite a few things I'd like to improve upon. I would like to have more CI/CD on some of the testing, perhaps being able to automatically power up the staging environment and apply changes to it. I've also got a couple of bugs to iron out, and a very long list of apps to test out and play with!