Infrastructure as Code (IaC)

May 01, 2026 • 16 min read
Knowledge

Summary

Over the last eight months or so, I’ve been significantly restructuring the way I work in my homelab. It’s not that there was anything wrong with it, per se, but rather that I wanted to challenge myself to do it better.

I’d been putting off learning Ansible for years, and life just seemed to have a way of bringing it up every now and then. I’d see something in a blog post, on Reddit, on YouTube, or at work talking about how great it is. I believed it based on what little I already knew, but I hadn’t really gotten into the finer details yet.

Beyond knowing that I really needed to stop putting this off, I began to take stock of how much time I was putting into my homelab that I should’ve been putting into my family. I’d find myself playing with my kids, but my head would be thinking about different aspects of my homelab. All this did was make me sad and make me realize that something had to change.

Stack at a Glance

Here’s what the stack looks like:

Configuration with Ansible (roles, inventory, idempotent everything).
Cloud and DNS with OpenTofu against Cloudflare and Hetzner.
Automation via self-hosted GitHub Actions on a Raspberry Pi, with Tailscale and three concurrent runners.
Secrets in Bitwarden Secrets Manager, resolved at runtime by both Ansible and OpenTofu.
OpenTofu state in Cloudflare R2.
Caddy as the reverse proxy, with oauth2-proxy and Pocket ID for auth.
Homepage, Gatus, Beszel, Healthchecks.io, and Ntfy for monitoring and alerts.
Everything lives in a private homelab-iac repo.

Ansible

Something that I often have a hard time with is just starting. It rarely feels like a good time to do anything, and I know I’m not alone in this. I casually looked around YouTube for tutorials to get me going. I’d heard that Jeff Geerling’s Ansible 101 course was fantastic, so I started there. Personally, I couldn’t get into it for some reason. It’s not a dig against the course itself, but let’s just say it wasn’t for me.

I looked for a different course and found the Getting Started with Ansible course from Learn Linux TV and that was it for me. I finished the course in a few days and I already felt like I had a good foundation.

Finishing a course and doing the work are two very different things. The first weeks of homelab-iac were rough. I’d write a role, run it, watch it blow up, fix it, watch it blow up somewhere else, and repeat. The papercuts piled up fast:

Tasks that worked the first time and changed something the second, the opposite of idempotency.
Tripping over become and file ownership.
Getting the order of operations wrong and breaking a host mid-run.

None of it was hard in the way that hard problems are hard. It was just a thousand small papercuts before any of it started to feel natural.

The harder part wasn’t writing Ansible, it was unlearning years of muscle memory. I’d been SSHing into boxes and fixing things by hand for so long that the impulse to do it that way was reflexive. I’d catch myself five minutes into a manual fix, sigh, back out, and go write a role for it instead. That happened more times than I’d like to admit, and it kept happening for months. The tooling was the easy part. Trusting the tooling enough to stop reaching for the shell was the harder part.

I also had to be honest with myself about what was on those boxes. Importing existing state into Ansible meant looking at every config file on every host and asking what it was, why it was there, and whether I still needed it. The findings were humbling:

Cron jobs that hadn’t run successfully in over a year.
Half-configured services from experiments I never finished.
Two different ways of doing the same thing on two different hosts because I’d forgotten how I’d solved it the first time.

None of that was a surprise, exactly, but seeing it all laid out side by side was sobering.

A couple of weeks in, I started writing things down. I added an ADR (architecture decision record) directory to the repo and started capturing decisions as I made them. I’m not religious about ADRs in the corporate sense, but having a place to write down “I picked X over Y for these reasons” has saved me from re-litigating the same question with myself months later. There are 70+ ADRs in the repo at this point, and I refer back to them more than I expected to.

OpenTofu

With Ansible doing real work, the next thing to tackle was the stuff I’d been clicking through web UIs to manage for years:

DNS records across a handful of domains.
My Hetzner VPS.
Cloudflare Tunnels and R2 buckets.
Firewall rules.

None of it was in code, and if I lost any of it, I’d be reconstructing from memory and old screenshots.

I went with OpenTofu over Terraform. The fork happened for reasons I won’t relitigate here, but the practical effect for me was that I could keep using the ecosystem I already knew while supporting a project I actually wanted to support.

Importing existing infrastructure was, again, less about the tool and more about confronting how much cruft I’d accumulated. I had DNS records pointing at services I no longer ran. I had a domain I no longer owned that still had records configured. I had two zones that did almost the same thing for reasons I couldn’t reconstruct. The first OpenTofu plan against my real account was hundreds of lines long, and most of those lines were me figuring out what to delete.

The thing I’m most happy with on the OpenTofu side is how state is handled. State lives in Cloudflare R2, and credentials are pulled out of Bitwarden Secrets Manager at runtime via a small wrapper script:

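#!/usr/bin/env bash
# tofu-bw.sh: exports R2 creds for the backend, then exec's tofu
set -euo pipefail  # stop on a failed lookup rather than run tofu with empty creds
# Assign, then export: `export VAR=$(cmd)` would mask a failing bws or jq
AWS_ACCESS_KEY_ID=$(bws secret get "$R2_KEY_ID_UUID" --output json | jq -er '.value')
AWS_SECRET_ACCESS_KEY=$(bws secret get "$R2_SECRET_UUID" --output json | jq -er '.value')
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY

exec tofu "$@"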

The result:

Nothing on disk.
Nothing in shell history.
Nothing in environment files I’d have to remember to clean up.

It’s a small detail, but it removed an entire class of mistake I used to worry about.

GitOps

Once Ansible and OpenTofu were doing real work, the next obvious question was how to run them. The whole point of all this was to stop doing things by hand, and a workflow where I have to remember to run ansible-playbook from my laptop is just doing things by hand with extra steps.

I settled on self-hosted GitHub Actions. Rather than leaning on GitHub-hosted runners, I provisioned a Raspberry Pi as the runner host and stood up three runner instances on it so workflows can run concurrently instead of queueing up behind each other. The Pi sits inside my network, on Tailscale, and has line of sight to all of my hosts. That means a workflow can SSH directly into Xenlab or my Hetzner VPS without me exposing anything externally or punching holes through firewalls.

It’s a simple setup to describe, but it took a long time to actually get right. A push to main triggers the relevant workflow based on what changed:

Compose change → deploy-compose runs the matching Ansible role on the affected host.
Terraform change → OpenTofu plans and applies.
Weekly schedule → package-updates runs apt upgrades across every host and reports drift to a Healthchecks.io endpoint, which pushes a notification via Ntfy if anything looks off.

Each workflow has path filters so a doc-only PR doesn’t trigger a deploy. For example, this pull_request trigger only fires when a compose file changes:

on:
  pull_request:
    branches: [main]
    types: [opened, synchronize, reopened, ready_for_review]
    paths:
      - "docker/**/docker-compose.yml"

A non-exhaustive list of what had to be sorted out before I could walk away from the system:

Playbooks intermittently failing because Bitwarden Secrets Manager was being hit too many times in parallel. Fixed by serializing host execution so BWS lookups didn’t pile up.
Compose files on hosts slowly drifting out of sync with the repo when a deploy was interrupted partway through, with reconciliation not always noticing.
A multi-PR effort to detect drift, self-heal it, pre-resolve sudo passwords once per run instead of per-task, and fix a callback plugin that was reporting “success” for tasks that had actually skipped.

None of it was glamorous, but all of it had to happen.
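
For illustration, the comparison at the heart of drift detection can be sketched in a few lines of shell. This is a simplified stand-in, not the Ansible-based check in the repo: the function name compose_drift, its status strings, and its exit codes are all made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: report whether a deployed compose file still matches
# the copy in the repo. Both paths are supplied by the caller.
#   $1 = compose file in the repo checkout
#   $2 = compose file as deployed on the host
compose_drift() {
  if [ ! -f "$2" ]; then
    echo "missing"      # never deployed, or removed by hand
    return 2
  fi

  # cmp -s prints nothing and exits 0 only when the files are byte-identical.
  if cmp -s "$1" "$2"; then
    echo "in-sync"
    return 0
  fi

  echo "drifted"        # e.g. an interrupted deploy or a manual edit
  return 1
}
```

A scheduled job could call something like compose_drift docker/app/docker-compose.yml /opt/app/docker-compose.yml (both paths are placeholders) and re-run the deploy whenever the exit status is non-zero.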

The first time I watched a Renovate PR open, auto-merge after CI passed, and roll out via deploy-compose without me touching anything, it felt surreal. I’d been doing all of that by hand for years. It also broke a few times before it stopped breaking, and I had to fight the urge to disable the automation every time it did.

Secrets

Secrets management was something I’d always handled poorly:

A .env file here.
A Bitwarden entry there.
A handful of values living in my shell history because I was in a hurry one time and never came back to clean up.

It was bad, and I knew it was bad.

Bitwarden Secrets Manager ended up being the answer. It has a real Ansible integration via a lookup plugin, which means I can reference a secret by its ID directly inside a playbook and have it resolved at runtime:

- name: Pull a secret at runtime
  ansible.builtin.set_fact:
    api_token: "{{ lookup('bitwarden.secrets.lookup', 'a1b2c3d4-...') }}"
  no_log: true

The secret never gets written to disk, never gets committed, and never lives anywhere it shouldn’t. The same approach works for OpenTofu via the wrapper I mentioned earlier.

The migration itself was painful. Every secret had to be moved into the vault, every reference updated, every old copy tracked down and removed. There was no clean cutover, just weeks of grepping for anything that looked like a secret and following the trail. Worth it, but I wouldn’t pretend it was a quick afternoon of work.
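
The grepping can be sketched roughly as below. This is an assumed starting point rather than the exact commands used during the migration: the function name find_secret_candidates and the keyword list are hypothetical, and the pattern is a blunt heuristic that will over-report, so every hit still needs a human look.

```shell
#!/bin/sh
# Hypothetical sweep: list files under a directory that contain lines which
# look like hard-coded credentials (a keyword, then ":" or "=", then a value).
#   $1 = directory to scan
find_secret_candidates() {
  # -r recurse, -l print file names only, -i ignore case, -E extended regex
  grep -rliE \
    '(password|passwd|secret|token|api[_-]?key)[[:space:]]*[:=][[:space:]]*[^[:space:]]' \
    "$1" | sort
}
```

Running it against a checkout of the repo gives a list of files to review. Templated references that resolve at runtime will show up too, which is fine for a first pass.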

Monitoring

A homelab without monitoring is just a collection of things you’ll find out are broken when you try to use them. Before this rebuild, I was using Uptime Kuma for health checks and a patchwork of other things for everything else. Uptime Kuma was fine, but I wanted something configured in code rather than clicked into a UI.

I consolidated around four pieces:

Homepage: my dashboard, and the first thing I see when I open a browser tab.
Gatus: replaced Uptime Kuma for health checking. Configured in YAML, so the entire thing lives in the repo alongside everything else.
Beszel: host-level metrics.
Ntfy: the pipe through which everything reaches my phone.

Healthchecks.io sits alongside all of that as my dead-man’s-switch monitor for systemd timers and cron jobs. If a backup doesn’t check in on time, I know about it. If a package-updates run skips a host, I know about it. The mental load of “I hope that thing ran” is gone, and that alone has been worth it.
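
The dead-man’s-switch pattern is simple enough to sketch in shell. The following is a minimal example under stated assumptions, not the setup from the repo: run_with_checkin and HC_PING are hypothetical names, and the check URL is a placeholder. It relies on the Healthchecks.io convention that a request to the check URL signals success, while appending /start or /fail signals a start or a failure.

```shell
#!/bin/sh
# Hypothetical wrapper: run a job, then tell Healthchecks.io how it went.
# HC_PING is the transport; it defaults to curl but can be swapped for a stub.
HC_PING=${HC_PING:-curl -fsS -m 10 --retry 3 -o /dev/null}

run_with_checkin() {
  # $1 = check URL (e.g. https://hc-ping.com/<uuid>), the rest = the job to run
  hc_url=$1
  shift

  # $HC_PING is left unquoted on purpose so the default expands to curl plus flags.
  $HC_PING "$hc_url/start" || true     # a failed ping must never block the job

  if "$@"; then
    $HC_PING "$hc_url" || true         # plain ping means success
    return 0
  else
    rc=$?                              # exit status of the job itself
    $HC_PING "$hc_url/fail" || true    # explicit failure, no waiting for a timeout
    return "$rc"
  fi
}
```

A timer or cron entry would then wrap the real command, for example run_with_checkin "https://hc-ping.com/<uuid>" /usr/local/bin/backup.sh, where the script path is a placeholder. If the job never runs at all, no ping arrives and the check goes overdue on its own, which is the whole point of the pattern.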

The reverse proxy story is similar. I’d been on Nginx Proxy Manager for years, mostly out of inertia. Two swaps later:

NPM → Caddy so the configuration could live in the repo.
Authentik → oauth2-proxy + Pocket ID because Authentik was more service than I needed.

Each migration took a weekend with its own set of gotchas. The cumulative effect is that nothing in the path between the public internet and my services is configured by hand anymore.

Where Things Stand

Once the foundation was in place, the homelab stopped feeling like a project and started feeling like infrastructure. The pace of changes has actually picked up rather than slowed down, which I think is the real test of whether GitOps was worth it. A few of the things that have happened since I started drafting this post:

B2 to R2 migration. I’d been on Backblaze B2 for years for backups and CDN content. Cloudflare R2 made more sense long term, especially with zero egress fees and how much of my traffic already flows through Cloudflare. The CDN domains moved to R2 custom domains, the backend objects moved over, and an ADR captured the rationale.
An Astro Starlight docs site. The homelab documentation had been spread across READMEs and a Bookstack instance, neither of which was great. I set up a Starlight site that lives next to the code at wired.io. I tried to migrate the site from Cloudflare Pages to Workers about a week after standing it up, hit a problem I didn’t want to debug at 11pm on a Friday, and reverted it back to Pages. That revert is sitting in the git log staring at me, and I’ll come back to it eventually.
Dotfiles via chezmoi. I added an Ansible role that bootstraps chezmoi on any new machine and pulls down my dotfiles. Going from “format a laptop and spend an evening getting it back to the way I like it” to a single command is genuinely satisfying.
A multi-week Postgres detour for Journalistic. I run a small open-source project called Journalistic, and at one point I decided to move its storage from SQLite plus Litestream over to PostgreSQL with point-in-time recovery. The path went something like this:
1. restic for pg_dump snapshots and WAL-G for WAL archiving to R2. WAL-G didn’t fit.
2. Swapped to pgBackRest, which wasn’t the right shape either.
3. Tried barman-cloud via the CNPG image with gzip-compressed WAL and base backups. Got it working, but had to admit the operational surface area was way more than the project needed.
4. Ripped it out and reverted to SQLite plus Litestream.
The detour cost weeks. I learned a lot about WAL archiving, but mostly it was a lesson in knowing when to stop.
Retiring MinIO in favor of VersityGW. MinIO had been my on-prem S3 layer for years, but the upstream repo is now flagged as no longer maintained, the community edition is source-only with no binary releases, and everyone is being funneled toward AIStor, their commercial offering. I’d already been on a community fork (firstfinger, then pgsty) to keep up, and that started feeling like a treadmill ending somewhere I didn’t want to go. I piloted VersityGW alongside MinIO on Xenlab, and once it was stable I phased the decommission into five PRs so each step was independently revertable:
1. Remove the service.
2. Swap the Caddy block and DNS.
3. Replace the Gatus check.
4. Swap the Homepage tile.
5. Drop the SMB mount.

The pilot itself was not smooth: CORS had to be set explicitly for the WebGUI, the healthcheck had to be TCP instead of HTTP, the backend had to come off SMB onto local disk because of file locking weirdness, the external gateway URL had to be advertised correctly to the WebUI, and the Sharrr IAM credentials had to be resolved directly from Bitwarden when the indirect path didn’t work. Each fix was small. Collectively they were the kind of grind that doesn’t show up in a feature list.

None of these are huge headline features. They’re the kind of small, steady improvements that I never used to make because the cost of touching anything was too high. With everything in code, the cost is now low enough that I just do them. That doesn’t mean any individual change is easy. It means the friction is in the actual problem, not in the act of making the change.

What I’d Do Differently

If I were starting over, two things would change:

One thing at a time. There were stretches with Ansible refactors, OpenTofu imports, secrets migration, and a Caddy cutover all in flight at once. When something broke, figuring out which moving part was responsible was much harder than it needed to be. The repo has plenty of evidence of me untangling something I’d accidentally tangled up myself.
Be more skeptical of my own enthusiasm. The Postgres detour is the obvious example, but not the only one. Several times during this rebuild I followed a thread longer than I should have because I’d already invested time and didn’t want to admit the thing I was building wasn’t the thing I needed. The willingness to delete a week of work and go back to the simpler answer is a skill, and I’m still building it.

Where to Go From Here

The next stretch is more about polish than new construction. I want to get to the point where every host can be reprovisioned from scratch with no human in the loop beyond kicking off the first playbook. I’m close, but not there yet. There are still a handful of bootstrap steps that need me to do something by hand, and each one is a small failure of the larger goal.

Beyond that, I’d like to extend the GitOps approach to the corners of my setup I haven’t touched yet:

pfSense: still managed through its web UI.
UniFi controller: config managed by the controller itself, which is fine until the day it isn’t.

Both are candidates for the same treatment, though I haven’t decided whether the juice is worth the squeeze.

The thing I keep coming back to is that the time investment up front buys back time on the other end in a way that compounds. I spend less time fighting my homelab now than I did a year ago, even though it does more, because the friction of any individual change is so much lower. That doesn’t mean I never spend a Saturday chasing a sudo password race condition or untangling a bind mount that should have been straightforward. It means that when I do, the work is captured in code at the end of it, and I don’t have to do it again.

It also gave me back the thing I was actually after, which was being able to be present with my family without my head being somewhere else. The homelab is still here, it still does what I need it to do, and now it mostly takes care of itself. That’s the version of this hobby I wanted, and even with all the grit it took to get here, I’m glad I finally put in the work.