Owner / Builder
Homelab Observability Platform
Building a centralized monitoring layer for homelab infrastructure using Prometheus, Grafana, exporters, and service health checks.
Overview
This project builds a centralized observability layer for my homelab so infrastructure and services are easier to monitor, troubleshoot, and improve over time.
The goal is to move beyond “is the service up?” checks and start building operational visibility across the systems that support the lab: Proxmox hosts, Ubuntu VMs, Docker services, NAS-backed workflows, VPN access paths, and public-facing endpoints.
The current platform centers on Prometheus for metrics collection and Grafana for visualization, with exporters used to expose host and service health data.
Problem This Project Solves
As the homelab grew, services became harder to reason about from memory alone. A failed container, unreachable endpoint, full disk, or unhealthy VM meant guesswork until I had manually checked each system.
The observability platform solves that by creating a central place to answer questions like:
- Which systems are reachable?
- Which hosts are under load?
- Which services are failing health checks?
- Are public endpoints responding correctly?
- Are internal monitoring targets reporting metrics?
- What changed before a service became unhealthy?
This project is about improving operational awareness before small issues become larger outages.
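Several of the questions above map fairly directly onto PromQL. The following is a minimal sketch, assuming the standard `up` series, Blackbox Exporter's `probe_success`, and Node Exporter's `node_cpu_seconds_total`. The Prometheus address and the 80% CPU threshold are placeholders, not values from this lab.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; adjust for the real deployment.
PROMETHEUS_URL = "http://prometheus.lab.example:9090"

# Each question paired with a PromQL expression that could answer it.
# The 80% CPU threshold is an illustrative assumption.
QUESTIONS = {
    "Which targets are not reporting metrics?": "up == 0",
    "Which probes are failing?": "probe_success == 0",
    "Which hosts are under CPU load?": (
        "100 - avg by (instance) "
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80'
    ),
}


def instant_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    for question, expr in QUESTIONS.items():
        print(question)
        print("  ", instant_query_url(expr))
```

The question "What changed before a service became unhealthy?" is deliberately left out: it usually needs a time range and a human looking at a graph rather than a single expression.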
Architecture
The high-level monitoring path is:
Homelab infrastructure and services
↓
Exporters and health probes
↓
Prometheus scrape jobs
↓
Metrics storage
↓
Grafana dashboards
↓
Troubleshooting and operational response
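The last two steps of this path (stored metrics feeding troubleshooting) can also be exercised outside Grafana through the Prometheus HTTP API. This is a rough sketch that parses a response in the shape returned by an instant query for `up`. The job and instance names in the sample payload are invented for illustration.

```python
import json

# Sample payload in the shape returned by GET /api/v1/query?query=up.
# Job and instance names are made up for illustration.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node", "instance": "vm-docker:9100"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "node", "instance": "vm-media:9100"},
             "value": [1700000000.0, "0"]},
        ],
    },
})


def down_targets(response_body: str) -> list[str]:
    """Return 'job/instance' for every target whose last scrape failed."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # sample values arrive as strings
        if float(value) == 0.0:
            labels = series["metric"]
            down.append(f"{labels.get('job', '?')}/{labels.get('instance', '?')}")
    return sorted(down)


if __name__ == "__main__":
    print(down_targets(SAMPLE_RESPONSE))
```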
Current focus areas include:
- Linux host metrics through Node Exporter
- Service availability checks through Blackbox Exporter
- Prometheus scrape configuration
- Grafana dashboards for system and service visibility
- Documentation for adding future monitoring targets
Design Choices
This observability stack is intentionally built around common production-style monitoring patterns rather than a single all-in-one dashboard.
Prometheus is used as the core metrics collector because it is widely used, well-documented, and designed around scrape-based time-series monitoring. It gives the lab a practical way to collect host, service, and exporter metrics over time. If resource usage or long-term retention becomes an issue later, I may evaluate VictoriaMetrics as a Prometheus-compatible alternative because it is known for efficient storage and strong performance at scale.
Grafana is used for visualization because it separates dashboarding from metrics collection. This keeps the platform flexible: Prometheus can focus on collecting data, while Grafana turns that data into useful operational views. If the environment ever grows beyond a single-node metrics backend, I may evaluate Grafana Mimir or another scalable metrics backend for longer retention, clustering, or querying across multiple Prometheus-compatible data sources.
Datadog was considered because it is widely used in industry and provides metrics, logs, dashboards, and alerting in a polished SaaS platform. For this homelab, it is being deferred because SaaS observability can become costly, and self-hosting the stack provides better hands-on learning around the components that sit underneath managed observability platforms.
Node Exporter provides Linux host metrics such as CPU, memory, disk, filesystem, and network usage. This helps distinguish between a broken service and an unhealthy host. Alternatives such as Telegraf or Netdata could also provide host-level telemetry, but Node Exporter fits cleanly into the Prometheus ecosystem and keeps the first version of the stack simple and standard.
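To make the host-metrics idea concrete, here is a small sketch that reads Node Exporter output in the Prometheus text format and derives filesystem usage from `node_filesystem_size_bytes` and `node_filesystem_avail_bytes`. The sample values and devices are invented, and the parser is deliberately simplified (it does not handle escaped quotes in label values), so treat it as an illustration rather than a full parser.

```python
import re

# Invented sample in the Prometheus text exposition format.
SAMPLE_METRICS = """\
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 2.5e+10
node_filesystem_avail_bytes{device="/dev/sdb1",fstype="ext4",mountpoint="/data"} 4e+10
# TYPE node_filesystem_size_bytes gauge
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 1e+11
node_filesystem_size_bytes{device="/dev/sdb1",fstype="ext4",mountpoint="/data"} 5e+11
"""

LINE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def filesystem_usage(text: str) -> dict[str, float]:
    """Return percent used per mountpoint, rounded to one decimal place."""
    size: dict[str, float] = {}
    avail: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = LINE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        mount = dict(LABEL.findall(raw_labels)).get("mountpoint")
        if mount is None:
            continue
        if name == "node_filesystem_size_bytes":
            size[mount] = float(value)
        elif name == "node_filesystem_avail_bytes":
            avail[mount] = float(value)
    return {
        mount: round(100 * (1 - avail[mount] / size[mount]), 1)
        for mount in size
        if mount in avail and size[mount] > 0
    }


if __name__ == "__main__":
    print(filesystem_usage(SAMPLE_METRICS))
```

In practice the same calculation would normally live in PromQL, for example `100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)`, so that Grafana can graph it over time.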
Blackbox Exporter is used for service probing because availability is not the same thing as resource usage. A server can be online while a web endpoint, DNS response, or TCP service is failing. Blackbox checks help validate the user-facing path. Tools like Uptime Kuma can provide a friendlier status dashboard, but Blackbox Exporter integrates directly into Prometheus and makes probe results available alongside the rest of the metrics stack.
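As a rough illustration of how a probe works: Blackbox Exporter exposes a `/probe` endpoint that takes `target` and `module` parameters and returns metrics including `probe_success`. The sketch below builds such a URL and reads the result. The exporter hostname is a placeholder, and `http_2xx` is the module name used in the exporter's example configuration, which may differ from the modules defined in this lab.

```python
from urllib.parse import urlencode

# Assumed exporter address (9115 is Blackbox Exporter's default port).
BLACKBOX_URL = "http://blackbox.lab.example:9115"


def probe_url(target: str, module: str = "http_2xx",
              base_url: str = BLACKBOX_URL) -> str:
    """Build the URL Prometheus would scrape to probe a single target."""
    return f"{base_url}/probe?{urlencode({'target': target, 'module': module})}"


def probe_succeeded(metrics_text: str) -> bool:
    """Read probe_success (1 = success, 0 = failure) from exporter output."""
    for line in metrics_text.splitlines():
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    raise ValueError("probe_success not found in exporter output")


if __name__ == "__main__":
    print(probe_url("https://zachull.com"))
    print(probe_succeeded("probe_duration_seconds 0.12\nprobe_success 1\n"))
```

The point of the split is that the host serving the endpoint can look perfectly healthy in Node Exporter while `probe_success` is 0 for the path a user actually takes.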
The remaining components are planned rather than deployed, and they will be added in stages.
cAdvisor will add container-level visibility for Docker workloads. Host metrics can show that a VM is under pressure, but cAdvisor can help identify which container is consuming CPU, memory, network, or filesystem resources. Alternatives such as Docker’s built-in stats are useful for quick checks, but cAdvisor is better suited for Prometheus-based historical metrics.
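The per-container attribution described above comes down to taking the rate of a counter such as cAdvisor's `container_cpu_usage_seconds_total`. A minimal sketch of that arithmetic, using invented container names and sample values, and skipping counters that reset (PromQL's `rate()` compensates for resets instead):

```python
def cpu_cores_used(earlier: dict[str, float], later: dict[str, float],
                   interval_seconds: float) -> dict[str, float]:
    """Average CPU cores used per container between two counter samples.

    Results are ordered from busiest to least busy container.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates: dict[str, float] = {}
    for name, end in later.items():
        start = earlier.get(name)
        if start is None or end < start:
            continue  # new container or counter reset: skipped in this sketch
        rates[name] = round((end - start) / interval_seconds, 3)
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Cumulative CPU seconds per container, sampled 60 seconds apart (invented).
    first = {"grafana": 120.0, "prometheus": 300.0, "media": 1000.0}
    second = {"grafana": 121.5, "prometheus": 306.0, "media": 1045.0}
    print(cpu_cores_used(first, second, 60.0))
```

Once cAdvisor is scraped, a query along the lines of `sum by (name) (rate(container_cpu_usage_seconds_total[5m]))` would express the same idea directly in Prometheus.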
Loki and Promtail will add log aggregation once the metrics layer is stable. Metrics can show that something is wrong, while logs often explain why. Larger logging platforms like Elastic Stack or OpenSearch are powerful, but they are heavier than this lab needs right now. Loki is a better fit because it integrates well with Grafana and can provide useful log search without introducing a much larger search cluster.
Alertmanager will add notification routing once dashboards and metrics are stable enough to avoid noisy alerts. The goal is not to alert on everything immediately. Alertmanager will be introduced when there are clear alert conditions tied to real operational actions. Simpler tools like direct Discord webhooks, email scripts, ntfy, or Gotify could work for basic notifications, but Alertmanager provides a more standard Prometheus-native path for grouping, routing, silencing, and managing alerts.
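Grouping is the feature that most directly reduces noise, so it is worth a small illustration. The sketch below mimics the idea behind Alertmanager's `group_by` setting in plain Python. It is not Alertmanager code, and the alert names are hypothetical.

```python
from collections import defaultdict

# Hypothetical firing alerts; only the labels matter for grouping.
ALERTS = [
    {"labels": {"alertname": "DiskAlmostFull", "instance": "vm-docker:9100"}},
    {"labels": {"alertname": "DiskAlmostFull", "instance": "vm-media:9100"}},
    {"labels": {"alertname": "ProbeFailed", "instance": "https://zachull.com"}},
]


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Bucket alerts by the values of the chosen labels.

    Each bucket would become one notification instead of one per alert.
    """
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    for key, members in group_alerts(ALERTS, ["alertname"]).items():
        print(key, len(members))
```

With these three alerts grouped by `alertname`, two notifications would be sent rather than three, and the gap widens as more hosts share the same failure.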
The aim is not to run every monitoring tool at once, but to build a layered observability platform where each service has a clear role: metrics, dashboards, probes, container visibility, logs, and alerts. Tools are added only when they solve a specific operational problem, not because they are common in homelab stacks.
What I Built
The initial platform includes:
- A Prometheus-based metrics collection layer
- Grafana dashboards for visualizing infrastructure and service health
- Node Exporter integration for Linux host metrics
- Blackbox Exporter probes for HTTP and DNS service checks
- Docker Compose-based deployment patterns
- Documentation for expanding future monitoring coverage
One notable troubleshooting issue involved endpoint probing. During Blackbox Exporter testing, a probe for zachull.com initially attempted to use IPv6 when the working path required IPv4. I corrected this by adjusting the probe so that it used IPv4, then validated the result in Grafana.
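The write-up does not record which setting was changed, so the following is an assumption about the likely mechanism rather than a record of the fix. Blackbox Exporter modules have a `preferred_ip_protocol` option that defaults to `ip6`, plus an `ip_protocol_fallback` option, and the fallback applies when no address of the preferred family resolves, not when the connection over that family fails. The sketch models that selection with documentation-range addresses:

```python
import ipaddress
from typing import Optional


def choose_address(resolved: list[str], preferred: str = "ip6",
                   fallback: bool = True) -> Optional[str]:
    """Pick the address a probe would use, given resolved A/AAAA records.

    Simplified model: prefer one IP family, and fall back to the other
    only when the preferred family has no record at all.
    """
    if preferred not in ("ip4", "ip6"):
        raise ValueError("preferred must be 'ip4' or 'ip6'")
    wanted_version = 6 if preferred == "ip6" else 4
    parsed = [ipaddress.ip_address(item) for item in resolved]
    for address in parsed:
        if address.version == wanted_version:
            return str(address)
    if fallback and parsed:
        return str(parsed[0])
    return None


if __name__ == "__main__":
    # Documentation-range addresses, not the real records for any site.
    records = ["203.0.113.10", "2001:db8::10"]
    print(choose_address(records))                   # IPv6 preferred
    print(choose_address(records, preferred="ip4"))  # IPv4 preferred
```

Under this model, a site that publishes an AAAA record but is only reachable over IPv4 from the lab would fail its probe until the module prefers `ip4`, which matches the symptom described above.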
Why This Matters
This project is one of the first steps toward treating the homelab less like a collection of individual machines and more like a small operational environment.
It helps build real service operations habits:
- Monitoring before failure
- Dashboards that answer useful questions
- Documented troubleshooting paths
- Repeatable exporter setup
- Clearer understanding of dependencies
- Better recovery and maintenance decisions
Current Status
The project is active and expanding.
Completed or in progress:
- Prometheus deployment
- Grafana deployment
- Node Exporter integration
- Blackbox Exporter service probing
- Initial dashboard work
- Monitoring documentation
Planned next steps:
- Add container metrics with cAdvisor
- Add alert routing through Alertmanager
- Add centralized logs with Loki and Promtail
- Add Uptime Kuma or equivalent service status views
- Expand coverage to network devices, NAS services, and DNS
- Create a standard runbook for onboarding new monitoring targets
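One possible shape for the onboarding runbook, offered as a sketch rather than a decision: Prometheus supports file-based service discovery, where a scrape job reads target groups from JSON files made up of `targets` and `labels`. A small helper could generate those files so that adding a host becomes a data change. The hostnames and the `role` label below are hypothetical.

```python
import json


def file_sd_group(targets: list[str], **labels: str) -> dict:
    """Build one target group in the Prometheus file_sd JSON shape."""
    if not targets:
        raise ValueError("a target group needs at least one target")
    return {"targets": sorted(targets), "labels": dict(labels)}


def render_file_sd(groups: list[dict]) -> str:
    """Serialize target groups for a file referenced by file_sd_configs."""
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    groups = [
        # Hypothetical hosts running Node Exporter on its default port 9100.
        file_sd_group(["vm-media:9100", "vm-docker:9100"], role="docker-host"),
        file_sd_group(["pve1:9100"], role="hypervisor"),
    ]
    print(render_file_sd(groups))
```

The runbook step would then be "add the host to the list and regenerate the file", since Prometheus picks up changes to file_sd files without a restart.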
Lessons Learned
The biggest lesson so far is that observability has to be designed around useful questions. A dashboard is only valuable if it helps explain what is healthy, what is degraded, and where to look next.
This project is helping me build the discipline to document not just whether a service works, but how I know it works and what evidence I have when it does not.