System Logs & Milestones

A chronological record of architectural pivots, technical wins, and lessons learned while scaling a single-node laboratory. This is the paper trail of a platform in motion.

Chapter 10: The MCP Era

Evolving the Hub into an agent-native platform by exposing the LGTM stack via the Model Context Protocol (MCP), enabling high-fidelity correlation of system telemetry.

  • MCP Level 4: Incident Investigation

    • Completed the LGTM observability loop by delivering 'investigate_incident' — a macro-tool that orchestrates metrics, logs, and traces in parallel to produce a structured incident report, reducing root cause analysis from minutes to seconds.
  • MCP Level 2 & 3: Semantic Logging & Distributed Tracing

    • Reduced Mean-Time-To-Discovery (MTTD) by implementing semantic log filtering via Loki and distributed trace search via Tempo (TraceQL), allowing agents to correlate unstructured events with system failures.
    • Implemented PII masking to prevent data leakage in logs for sensitive keys (e.g. passwords, tokens).
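The masking step above can be sketched as a small Go filter. This is a minimal sketch, not the Hub's implementation; the key list beyond passwords and tokens is an assumption:

```go
package main

import (
	"fmt"
	"regexp"
)

// sensitiveKeys matches key=value or "key": "value" pairs whose values must
// never reach the log stream. Only passwords and tokens are named in the
// milestone; the other keys here are illustrative.
var sensitiveKeys = regexp.MustCompile(`(?i)\b(password|token|secret|api_key)\b(["']?\s*[:=]\s*["']?)[^\s"',}]+`)

// maskPII replaces the values of sensitive keys with a fixed marker
// before the line is shipped to Loki.
func maskPII(line string) string {
	return sensitiveKeys.ReplaceAllString(line, `$1$2***`)
}

func main() {
	fmt.Println(maskPII(`login ok user=alice password=hunter2 token=abc123`))
	// → login ok user=alice password=*** token=***
}
```

Masking at the pipeline edge, before ingestion, means the secret never persists anywhere downstream.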
  • MCP Level 1: Metrics Intelligence

    • Achieved automated service health analysis by exposing Prometheus metrics to AI agents via the official MCP Go SDK, enabling streamlined performance baselining and anomaly detection against live Thanos data.
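The baselining half of the milestone above reduces to simple statistics over a metric window. A minimal Go sketch of the idea (the MCP/SDK wiring is omitted; function names are illustrative, and the samples stand in for values returned by a Prometheus range query):

```go
package main

import (
	"fmt"
	"math"
)

// baseline holds the mean and standard deviation of a metric window.
type baseline struct{ mean, stddev float64 }

// newBaseline computes the window statistics a health check compares against.
func newBaseline(samples []float64) baseline {
	var sum float64
	for _, v := range samples {
		sum += v
	}
	mean := sum / float64(len(samples))
	var ss float64
	for _, v := range samples {
		ss += (v - mean) * (v - mean)
	}
	return baseline{mean: mean, stddev: math.Sqrt(ss / float64(len(samples)))}
}

// isAnomaly flags a sample more than k standard deviations from the mean.
func (b baseline) isAnomaly(v, k float64) bool {
	if b.stddev == 0 {
		return v != b.mean
	}
	return math.Abs(v-b.mean) > k*b.stddev
}

func main() {
	b := newBaseline([]float64{100, 102, 98, 101, 99})
	fmt.Println(b.isAnomaly(250, 3)) // → true: far outside the baseline window
}
```

An agent calling such a tool gets back a boolean judgment rather than raw samples, which is what makes automated health analysis tractable.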
  • Proposal: The MCP Era

    • Proposed the transition to an agent-native platform via the Model Context Protocol (MCP) to provide a high-fidelity bridge for host-to-cluster investigation.

Chapter 9: The Terraform Era

Adopting OpenTofu (Terraform-compatible) to manage Kubernetes (K3s) workloads as infrastructure-as-code, replacing manual Helm workflows.

  • Platform Infrastructure: Immutable Orchestration via OpenTofu

    • Achieved 100% declarative infrastructure-as-code orchestration via OpenTofu, enabling auditable drift detection and consistent state management across the Kubernetes (K3s) observability stack.
  • OpenTofu Migration: Thanos, Loki, & Tempo

    • Migrated Thanos, Loki, and Tempo to OpenTofu orchestration.
    • Verified persistent data integrity and cross-service telemetry discovery.
  • OpenTofu Migration: MinIO & Prometheus

    • Migrated MinIO and Prometheus to OpenTofu orchestration.
    • Audited persistent volume claims to ensure stateful datasets correctly reconnect to new managed pods.
  • OpenTofu Migration: OpenTelemetry & Grafana

    • Migrated the OpenTelemetry Collector and Grafana to OpenTofu orchestration.
    • Validated zero-downtime migration and resolved specific resource security constraints for init containers.
  • Proposal: OpenTofu for K3s Service Management

    • Proposed adopting OpenTofu to declaratively manage the K3s observability stack, replacing manual Helm workflows with infrastructure-as-code.
    • Defined a phased migration strategy using 'tofu import' for zero-downtime adoption of live services.
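The zero-downtime adoption pattern above can be sketched with a config-driven import. This is an illustrative fragment, not the Hub's actual configuration: the resource address, namespace, and release name are assumptions, and it presumes an OpenTofu version that supports `import` blocks (otherwise the equivalent CLI form is `tofu import helm_release.grafana observability/grafana`):

```hcl
# Adopt a live Helm release into state without touching the running pods.
# helm_release imports use the "namespace/name" ID form.
import {
  to = helm_release.grafana
  id = "observability/grafana"
}

resource "helm_release" "grafana" {
  name       = "grafana"
  namespace  = "observability"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "grafana"
}
```

Running `tofu plan` previews the import before `tofu apply` records it in state, which is what makes the adoption safe for live services.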

Chapter 8: Platform Maturity & Reusability

Evolving the Hub into a modular platform by organizing core logic into reusable 'building blocks' and executable services, culminating in a unified, OpenTelemetry-native architecture.

  • Unified Host Telemetry Collectors & Grafana Alloy Retirement

    • Deployed a resource-efficient Collectors service, centralizing host-level data collection and optimizing throughput with batch processing.
    • Retired Grafana Alloy, cutting idle resource consumption by 80% and freeing ~50% of reserved CPU and ~65% of reserved memory.
    • Established a modern, OpenTelemetry-native platform, delivering a robust and standardized solution for comprehensive system observability (logs, metrics, traces).
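A collector pipeline of the shape described above, with host-level collection feeding a batch processor, can be sketched in OpenTelemetry Collector configuration. Scrapers, endpoints, and batch sizes here are illustrative assumptions, not the deployed config:

```yaml
receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu: {}
      memory: {}
      filesystem: {}

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  otlp:
    endpoint: otel-collector.observability.svc:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [batch]
      exporters: [otlp]
```

The batch processor is what turns per-sample exports into amortized bulk sends, which is where most of the idle-resource savings come from.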
  • Pure Wrapper Architecture & Advanced OTel

    • Transitioned the Go fleet to a standardized 'Pure Wrapper' architecture, enabling native OpenTelemetry support and advanced operational visibility across all services.
  • Library-First Implementation

    • Executed the structural transition of the repository into the standard 'internal/' and 'cmd/' layout.
    • Decoupled core business logic into transport-agnostic modules to improve testability and reuse across platform interfaces.
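The decoupling above can be illustrated with a toy Go example. The names and the health-check logic are hypothetical; the point is the shape, where a pure function lives in 'internal/' and a 'cmd/' entry point is only a transport shell:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Snapshot is the transport-agnostic result of a check. Keeping it a plain
// struct lets HTTP handlers, CLI commands, and MCP tools reuse the same
// core logic unchanged.
type Snapshot struct {
	Service string `json:"service"`
	Healthy bool   `json:"healthy"`
}

// Check is pure business logic: no HTTP, no flags, no globals, so it is
// trivially unit-testable.
func Check(service string, errRate float64) Snapshot {
	return Snapshot{Service: service, Healthy: errRate < 0.05}
}

func main() {
	// A 'cmd/' entry point only handles transport: here, JSON on stdout.
	out, _ := json.Marshal(Check("proxy", 0.01))
	fmt.Println(string(out))
}
```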
  • Proposal: Modular Library Architecture

    • Proposed organizing the platform's core features into reusable 'building blocks', keeping the system reliable and consistent as it grows and allowing new tools to be added without duplicating effort.

Chapter 7: The SRE Era

The transition from 'functional' to 'standardized' observability. A phase dedicated to learning and understanding the OpenTelemetry ecosystem and SRE principles.

  • Storage Scalability & Security Hardening

    • Scaled telemetry persistence by migrating from restricted local disks to professional S3-compatible object storage (MinIO), ensuring long-term data reliability.
    • Established a 'Safe-by-Default' infrastructure baseline, achieving 100% automated security compliance across the platform's core services.
  • Phase 4 Complete: Proxy Instrumentation & Synthetic Traces

    • Completed high-fidelity OpenTelemetry instrumentation for the 'proxy' service, transitioning to dynamic span naming and deep-dive diagnostics.
    • Engineered a synthetic validation suite that simulates global traffic (Region, Timezone, Device) to stress-test Tempo storage and Grafana visualization.
    • Resolved a critical 'silent' failure in the service graph pipeline by enabling Prometheus Remote Write and adding the missing Tempo processors.
  • Phase 3 Complete: Prometheus

    • Deployed Prometheus for operational metrics storage with lean retention policies.
    • Integrated with the OpenTelemetry Collector to ingest internal metrics and established connectivity with Grafana as a provisioned data source.
  • Phase 2 Complete: Grafana Tempo

    • Deployed Grafana Tempo in single-binary mode for distributed trace storage.
    • Integrated the OpenTelemetry Collector with Grafana Tempo via OTLP/gRPC for end-to-end trace propagation.
  • Incident & Root Cause Analysis (RCA) Framework

    • Introduced a lightweight 'docs/incidents' template to capture what happened, how it was debugged, and the Root Cause Analysis (RCA) for future reference.
    • Resolved the first incident by migrating to a 'JSON-from-file' approach for Grafana dashboard provisioning.

Chapter 6: The Kubernetes Era

Evolving from container management to a Kubernetes (K3s) platform. Standardizing on cloud-native patterns for true operational resilience.

  • Phase 4 Complete: PostgreSQL on K3s

    • Migrated the core PostgreSQL database from standalone Docker to a K3s StatefulSet.
    • Implemented data persistence via PVC synchronization and verified extension compatibility for TimescaleDB and PostGIS, completing the four-stage migration strategy.
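The StatefulSet-plus-PVC pattern above can be sketched in manifest form. Names, the image, and the storage size are illustrative assumptions, not the Hub's manifest:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16   # illustrative; a TimescaleDB/PostGIS build in practice
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

The `volumeClaimTemplates` stanza is what binds the pod to a stable PVC, so a rescheduled pod reconnects to the same dataset.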
  • Phase 2 & 3 Complete: Grafana Loki & Grafana on K3s

    • Migrated Grafana Loki and Grafana from standalone Docker containers to K3s-native deployments.
    • Verified self-healing and data persistence for the visualization layer, completing Phase 3 of the platform orchestration strategy.

Chapter 5: Vault via OpenBao

Establishing a secure, centralized secret store to eliminate static environment variable management.

  • OpenBao Implementation

    • Replaced static environment variables with a secure OpenBao secret store and provider-agnostic secrets library.
    • Migrated 'system-metrics' and 'proxy' services to dynamic credential retrieval.
  • The Security Blueprint

    • Proposed transitioning to a centralized secret store to eliminate insecure static configuration files.
    • Evaluated HashiCorp Vault vs. OpenBao, selecting the latter for its truly open-source (MPL 2.0) and community-governed foundation.

Chapter 4: Orchestration & Event-Driven Pivot

Scaling beyond static host-management: pivoting from polling-based timers to event-driven webhooks and K3s orchestration.

  • Orchestration Spike: K3s

    • Validated K3s orchestration via a 'Shadow Deployment,' proving cross-platform networking.
    • Introduced 'Safe-Mode' feature toggles to decouple database dependencies during prototyping.

Chapter 3: GitOps & Host Observability

Scaling single-node infrastructure via templated GitOps automation and deep host-level visibility via systemd-journal integration.

  • GitOps Infrastructure: Phase 2

    • Scaled the GitOps engine to support multi-tenant synchronization across multiple repositories.
    • Optimized reconciliation intervals to balance data freshness with operational stability.
  • Host-Level Observability

    • Established a pipeline to ingest Systemd journal logs into Loki via Promtail.
    • Implemented parsing and labeling for key infrastructure units.
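A journal-to-Loki pipeline of this shape is configured in Promtail with a `journal` scrape and relabeling on the unit field. The endpoint and label names below are illustrative assumptions, not the deployed config:

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      # Promote the systemd unit name to a queryable Loki label.
      - source_labels: ["__journal__systemd_unit"]
        target_label: unit
```

Promoting the unit to a label is what makes per-service queries like `{unit="k3s.service"}` possible in Grafana.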
  • GitOps Infrastructure: Phase 1

    • Implemented a secure, templated GitOps reconciliation engine for automated sync.
    • Adopted 'Logfmt' and 'Allowlist' patterns for better observability and security.

Chapter 2: The Platform Core

Standardization, Visualization, and Process

  • Shared Database Module

    • Standardized connection configuration and DSN generation in 'internal/db', enforcing safe defaults (UTC, SSL) across services.
    • Eliminated configuration drift by refactoring services to use the shared configuration module.
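DSN generation with enforced safe defaults can be sketched with the standard library alone. A minimal sketch, not the 'internal/db' API; the function and parameter names are assumptions:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN assembles a PostgreSQL DSN and enforces the safe defaults
// centrally: sslmode=require and a UTC session timezone, so no caller
// can accidentally drift.
func buildDSN(host string, port int, user, pass, db string) string {
	u := url.URL{
		Scheme: "postgres",
		User:   url.UserPassword(user, pass),
		Host:   fmt.Sprintf("%s:%d", host, port),
		Path:   "/" + db,
	}
	q := url.Values{}
	q.Set("sslmode", "require")
	q.Set("timezone", "UTC")
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(buildDSN("db.internal", 5432, "hub", "secret", "telemetry"))
	// → postgres://hub:secret@db.internal:5432/telemetry?sslmode=require&timezone=UTC
}
```

Because every service calls the one builder, a policy change (say, stricter SSL) lands everywhere in a single commit.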
  • Standardization & Logging

    • Implemented a shared 'internal/logger' to enforce consistent structured observability across all services.
    • Formalized 'Documentation as Code' by treating architectural decisions (ADRs) as first-class artifacts.
  • Project Portal Launch

    • Released a self-hosted visualization portal to track system evolution and technical constraints.
    • Unified real-time system metrics and personal analytics via snapshots.

Chapter 1: The Genesis

Infrastructure Foundations & Telemetry Prototypes

  • Hybrid Cloud Architecture

    • Designed a secure "Store-and-Forward" bridge to ingest Azure Function telemetry without exposing local ports.
    • Decoupled stateless collectors from persistent storage to enable independent scaling.
  • Telemetry Engine Initialization

    • Built a lightweight Go proxy to emit structured telemetry for health and error tracking.
    • Unified logs and metrics into a single PostgreSQL JSONB schema, simplifying the data model.
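A unified JSONB schema of this shape can be sketched in a few lines of SQL. Table and column names are illustrative, not the actual schema:

```sql
-- One table for both signal types; the shape lives in the JSONB payload.
CREATE TABLE telemetry (
    id          BIGSERIAL   PRIMARY KEY,
    kind        TEXT        NOT NULL CHECK (kind IN ('log', 'metric')),
    recorded_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload     JSONB       NOT NULL
);

-- A GIN index keeps ad-hoc queries over the JSONB payload fast.
CREATE INDEX telemetry_payload_idx ON telemetry USING GIN (payload);
```

The trade-off is deliberate: schema flexibility and a single write path, at the cost of the query ergonomics a dedicated TSDB would provide.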
  • Reliable Local Lab

    • Established core Docker infrastructure with automated volume management and backups.
    • Selected PostgreSQL over specialized TSDBs for long-term flexibility and reduced maintenance.