Homelab Observability Platform

Building a centralized monitoring layer for homelab infrastructure using Prometheus, Grafana, exporters, and service health checks.

Overview

This project builds a centralized observability layer for my homelab so infrastructure and services are easier to monitor, troubleshoot, and improve over time.

The goal is to move beyond “is the service up?” checks and start building operational visibility across the systems that support the lab: Proxmox hosts, Ubuntu VMs, Docker services, NAS-backed workflows, VPN access paths, and public-facing endpoints.

The current platform centers on Prometheus for metrics collection and Grafana for visualization, with exporters used to expose host and service health data.

Problem This Project Solves

As the homelab grew, services became harder to reason about from memory alone. A failed container, unreachable endpoint, full disk, or unhealthy VM meant guesswork until I manually checked each system.

The observability platform solves that by creating a central place to answer questions like:

Which systems are reachable?
Which hosts are under load?
Which services are failing health checks?
Are public endpoints responding correctly?
Are internal monitoring targets reporting metrics?
What changed before a service became unhealthy?
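
Several of these questions map fairly directly onto PromQL expressions. The sketch below is illustrative only. It assumes a Prometheus server at the hypothetical address http://prometheus.lab:9090, the built-in `up` series, Blackbox Exporter's `probe_success` metric, and Node Exporter's `node_load1` metric. The load threshold is an arbitrary example value, not a setting taken from this lab.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with the real one.
PROMETHEUS_URL = "http://prometheus.lab:9090"

# Example PromQL for some of the questions above.
# The load threshold (4) is illustrative, not a recommendation.
QUESTIONS = {
    "Are monitoring targets reporting metrics?": "up == 0",
    "Which services are failing health checks?": "probe_success == 0",
    "Which hosts are under load?": "node_load1 > 4",
}


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query HTTP API endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    for question, expr in QUESTIONS.items():
        print(question)
        print("  ", instant_query_url(expr))
```

Each expression returns the series that match the failure condition, so an empty result is the healthy case.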

This project is about improving operational awareness before small issues become larger outages.

Architecture

The high-level monitoring path is:

Homelab infrastructure and services
  ↓
Exporters and health probes
  ↓
Prometheus scrape jobs
  ↓
Metrics storage
  ↓
Grafana dashboards
  ↓
Troubleshooting and operational response

Current focus areas include:

Linux host metrics through Node Exporter
Service availability checks through Blackbox Exporter
Prometheus scrape configuration
Grafana dashboards for system and service visibility
Documentation for adding future monitoring targets
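
One possible way to keep target onboarding repeatable is Prometheus's file-based service discovery, where a scrape job reads its targets from a JSON file. The sketch below assumes that approach, which this write-up does not confirm is in use. The hostnames and the `role` label are hypothetical, and 9100 is Node Exporter's default port.

```python
import json


def file_sd_entry(targets, **labels):
    """Return one target group in the file-based service discovery format."""
    return {"targets": sorted(targets), "labels": labels}


def render_targets(groups):
    """Serialize target groups as the JSON document a file_sd job would read."""
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    # Hypothetical hosts, used only for illustration.
    groups = [
        file_sd_entry(["pve1.lab:9100"], role="hypervisor"),
        file_sd_entry(["docker1.lab:9100", "nas-vm.lab:9100"], role="vm"),
    ]
    print(render_targets(groups))
```

With this pattern, adding a new host becomes a one-line change to a data file rather than an edit to the main Prometheus configuration.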

Design Choices

This observability stack is intentionally built around common production-style monitoring patterns rather than a single all-in-one dashboard.

Prometheus is used as the core metrics collector because it is widely used, well-documented, and designed around scrape-based time-series monitoring. It gives the lab a practical way to collect host, service, and exporter metrics over time. If resource usage or long-term retention becomes an issue later, I may evaluate VictoriaMetrics as a Prometheus-compatible alternative because it is known for efficient storage and strong performance at scale.

Grafana is used for visualization because it separates dashboarding from metrics collection. This keeps the platform flexible: Prometheus can focus on collecting data, while Grafana turns that data into useful operational views. If the environment ever grows beyond a single-node metrics backend, I may evaluate Grafana Mimir or another scalable metrics backend for longer retention, clustering, or querying across multiple Prometheus-compatible data sources.

Datadog was considered because it is widely used in industry and provides metrics, logs, dashboards, and alerting in a polished SaaS platform. For this homelab, it is being deferred because SaaS observability can become costly, and self-hosting the stack provides better hands-on learning around the components that sit underneath managed observability platforms.

Node Exporter provides Linux host metrics such as CPU, memory, disk, filesystem, and network usage. This helps distinguish between a broken service and an unhealthy host. Alternatives such as Telegraf or Netdata could also provide host-level telemetry, but Node Exporter fits cleanly into the Prometheus ecosystem and keeps the first version of the stack simple and standard.
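
As a rough illustration of how host metrics answer a concrete question such as "is the disk nearly full?", the sketch below parses a few lines in the Prometheus text exposition format and computes filesystem usage. The metric names `node_filesystem_size_bytes` and `node_filesystem_avail_bytes` are Node Exporter's; the sample values and labels are made up, and the parser is deliberately simplified (it assumes no spaces inside label values and no timestamps).

```python
# Made-up sample in the text format that exporters serve on /metrics.
SAMPLE = """\
# HELP node_filesystem_size_bytes Filesystem size in bytes.
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 1e+11
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 2.5e+10
"""


def parse_samples(text):
    """Parse 'series value' lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def filesystem_used_percent(samples, labels):
    """Compute used space as a percentage for one filesystem's label set."""
    size = samples[f"node_filesystem_size_bytes{labels}"]
    avail = samples[f"node_filesystem_avail_bytes{labels}"]
    return 100.0 * (1 - avail / size)


if __name__ == "__main__":
    labels = '{device="/dev/sda1",fstype="ext4",mountpoint="/"}'
    print(filesystem_used_percent(parse_samples(SAMPLE), labels))
```

In practice Prometheus does this parsing itself and the calculation would be written in PromQL; the Python version only makes the arithmetic explicit.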

Blackbox Exporter is used for service probing because availability is not the same thing as resource usage. A server can be online while a web endpoint, DNS response, or TCP service is failing. Blackbox checks help validate the user-facing path. Tools like Uptime Kuma can provide a friendlier status dashboard, but Blackbox Exporter integrates directly into Prometheus and makes probe results available alongside the rest of the metrics stack.
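
Blackbox Exporter runs a probe when its `/probe` endpoint is requested with `target` and `module` parameters, and reports the outcome as a `probe_success` metric. The sketch below builds such a request URL and reads the result from the response text. The exporter address is hypothetical (9115 is the default port), and `http_2xx` is the module name from the exporter's example configuration, which may differ from the modules defined in this lab.

```python
from urllib.parse import urlencode

# Hypothetical exporter address; 9115 is Blackbox Exporter's default port.
BLACKBOX_URL = "http://blackbox.lab:9115"


def probe_url(target, module="http_2xx", base_url=BLACKBOX_URL):
    """Build the URL that asks Blackbox Exporter to probe `target`."""
    query = urlencode({"target": target, "module": module})
    return f"{base_url}/probe?{query}"


def probe_succeeded(metrics_text):
    """Return True if the probe response reports probe_success as 1."""
    for line in metrics_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    return False  # metric missing: treat as a failed probe


if __name__ == "__main__":
    print(probe_url("https://zachull.com"))
```

Requesting that URL manually, for example with curl, is a quick way to test a probe before wiring it into a Prometheus scrape job.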

The remaining components are planned additions, staged so that each one is introduced only once the layer it depends on is stable.

cAdvisor will add container-level visibility for Docker workloads. Host metrics can show that a VM is under pressure, but cAdvisor can help identify which container is consuming CPU, memory, network, or filesystem resources. Alternatives such as Docker’s built-in stats are useful for quick checks, but cAdvisor is better suited for Prometheus-based historical metrics.

Loki and Promtail will add log aggregation once the metrics layer is stable. Metrics can show that something is wrong, while logs often explain why. Larger logging platforms like Elastic Stack or OpenSearch are powerful, but they are heavier than this lab needs right now. Loki is a better fit because it integrates well with Grafana and can provide useful log search without introducing a much larger search cluster.

Alertmanager will add notification routing once dashboards and metrics are stable enough to avoid noisy alerts. The goal is not to alert on everything immediately. Alertmanager will be introduced when there are clear alert conditions tied to real operational actions. Simpler tools like direct Discord webhooks, email scripts, ntfy, or Gotify could work for basic notifications, but Alertmanager provides a more standard Prometheus-native path for grouping, routing, silencing, and managing alerts.
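
One common way to reduce alert noise is to require a condition to hold for some time before it fires, which is the intent of the `for:` clause in a Prometheus alerting rule. The function below is a conceptual Python sketch of that idea only. It is not Prometheus or Alertmanager configuration, and the threshold and sample values are invented.

```python
def should_fire(samples, threshold, hold_for):
    """Return True if the last `hold_for` samples all exceed `threshold`.

    A brief spike does not fire; a sustained condition does.
    """
    if hold_for <= 0 or len(samples) < hold_for:
        return False
    return all(value > threshold for value in samples[-hold_for:])


if __name__ == "__main__":
    spike = [10, 95, 10, 20]       # one bad sample, then recovery
    sustained = [50, 91, 95, 99]   # last three samples above threshold
    print(should_fire(spike, threshold=90, hold_for=3))
    print(should_fire(sustained, threshold=90, hold_for=3))
```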

The aim is not to run every monitoring tool at once but to build a layered observability platform in which each tool has a clear role: metrics, dashboards, probes, container visibility, logs, and alerts. Tools are added only when they solve a specific operational problem, not because they are common in homelab stacks.

What I Built

The initial platform includes:

A Prometheus-based metrics collection layer
Grafana dashboards for visualizing infrastructure and service health
Node Exporter integration for Linux host metrics
Blackbox Exporter checks for probing HTTP, DNS, and similar service endpoints
Docker Compose-based deployment patterns
Documentation for expanding future monitoring coverage

One notable troubleshooting issue involved endpoint probing. During Blackbox Exporter testing, a probe for zachull.com initially attempted to connect over IPv6, while the working path required IPv4. The issue was corrected by adjusting the probe to use IPv4, and the result was validated in Grafana.
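
For context, Blackbox Exporter's probers have a `preferred_ip_protocol` setting that defaults to `ip6`, which is one plausible reason a probe would try IPv6 first; this write-up does not say which setting was actually changed. When debugging this kind of issue, it can help to see which address families a hostname resolves to. The sketch below does that with the Python standard library and is a diagnostic aid under those assumptions, not the fix itself.

```python
import socket


def resolve_by_family(host):
    """Return the addresses `host` resolves to, split by IP family."""
    result = {}
    for name, family in (("ipv4", socket.AF_INET), ("ipv6", socket.AF_INET6)):
        try:
            infos = socket.getaddrinfo(
                host, None, family=family, type=socket.SOCK_STREAM
            )
        except socket.gaierror:
            infos = []  # no usable address for this family
        # Each entry's sockaddr tuple starts with the address string.
        result[name] = sorted({info[4][0] for info in infos})
    return result


if __name__ == "__main__":
    print(resolve_by_family("zachull.com"))
```

If a name resolves to IPv6 addresses but the network path only carries IPv4, a probe that prefers IPv6 can fail even though the site is reachable for IPv4 clients.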

Why This Matters

This project is one of the first steps toward treating the homelab less like a collection of individual machines and more like a small operational environment.

It helps build real service operations habits:

Monitoring before failure
Dashboards that answer useful questions
Documented troubleshooting paths
Repeatable exporter setup
Clearer understanding of dependencies
Better recovery and maintenance decisions

Current Status

The project is active and expanding.

Completed or in progress:

Prometheus deployment
Grafana deployment
Node Exporter integration
Blackbox Exporter service probing
Initial dashboard work
Monitoring documentation

Planned next steps:

Add container metrics with cAdvisor
Add alert routing through Alertmanager
Add centralized logs with Loki and Promtail
Add Uptime Kuma or equivalent service status views
Expand coverage to network devices, NAS services, and DNS
Create a standard runbook for onboarding new monitoring targets

Lessons Learned

The biggest lesson so far is that observability has to be designed around useful questions. A dashboard is only valuable if it helps explain what is healthy, what is degraded, and where to look next.

This project is helping me build the discipline to document not just whether a service works, but how I know it works and what evidence I have when it does not.