System Logs & Milestones

Chapter 12: GitOps & Operational Maturity

The day-two operations chapter: hardening the platform with ArgoCD reconciliation, deterministic Kustomize rendering, Trivy-backed workload security, safer rollouts, and a clearer GitOps operating model.

2026-04-17
Trivy-Verified Workload Hardening
- Established a Trivy-verified workload hardening baseline for platform-managed containers and Kubernetes manifests.
- Strengthened the GitOps security posture by making container runtime controls part of the source of truth.
📄 ADR 024: Trivy-Verified Workload Hardening
2026-03-30
Kustomize & RBAC Render Hardening
- Resolved duplicate resource and namespace collision failures in the tiered Kustomize tree.
- Split cross-namespace RBAC from overlay-scoped resources so GitOps rendering stayed deterministic.
📄 RCA 005: Kustomize RBAC Resource Collision
2026-03-21
ArgoCD GitOps & Automated Fleet Promotion
- Transitioned the cluster to ArgoCD for continuous reconciliation instead of imperative scripts.
- Adopted the App-of-Apps pattern to manage the platform from a single source of truth.
- Validated the GitOps promotion loop by fixing image pull failures caused by SHA-length mismatches.
- Enabled ArgoCD to reconcile private GHCR images and roll updates across the simulation fleet.
📄 Screenshot: ArgoCD Fleet Update

Chapter 11: eBPF-Native Efficiency & Networking

The kernel-visibility chapter: using Cilium, Hubble, and Kepler to inspect network flows, protocol behavior, and infrastructure efficiency below the application layer.

2026-03-18
eBPF-Native Networking & MQTT Visibility
- Migrated the cluster to Cilium with kube-proxy replacement for an eBPF-native datapath.
- Added L7 MQTT visibility for EMQX without requiring application instrumentation.
- Resolved a critical SSH lockout during migration and formalized a recovery protocol.
📄 ADR 020: Cilium eBPF Foundation 📄 RCA 004: SSH Lockout Incident
2026-03-11
Kepler Integration & Energy Dashboards
- Deployed Kepler to establish an energy-efficiency baseline across the cluster.
- Added dashboards to correlate workload behavior with power consumption.
📄 Dashboard: Sustainability & Efficiency 📄 Dashboard: Infrastructure Health

Chapter 10: The MCP Era

The agent-native operations chapter: building an MCP gateway for live telemetry, Kubernetes, and host state; reducing noisy observability payloads with Rust; and proving the design with Go-vs-Rust benchmarks for faster, lower-cost investigations.

2026-04-17
Rust Telemetry Processing Benchmarks
- Benchmarked Go and Rust telemetry processors against the same 24-hour Loki and Prometheus payloads.
- Established Rust as the faster reduction path, especially for metric summarization where numeric parsing and statistical aggregation dominate.
📄 ADR 023: Benchmark-Validated Rust Obs Processor
2026-04-14
Structured Obs Processor Summaries
- Upgraded summaries from compressed strings to structured records with timestamps and investigation context.
- Improved log and metric investigation quality while benchmarks still showed major context reduction.
📄 ADR 022: Obs Processor Structured Summaries
2026-04-06
Rust Telemetry Summarization Processor
- Added `obs-processor` to turn large telemetry payloads into agent-friendly summaries before they enter the MCP context window.
- Reduced context waste while preserving enough signal for logs, metrics, and incident investigation.
📄 ADR 021: Rust Telemetry Summarization Processor 📄 RCA 007: Worker Ingestion Blocked from MongoDB Atlas
2026-03-19
mcp_obs_hub: Unified Agentic Gateway
- Consolidated telemetry, pods, and hub services into a single `mcp_obs_hub` binary to reduce management overhead.
2026-03-08
MCP Telemetry and Infrastructure Intelligence
- Exposed metrics, semantic logs, distributed traces, Kubernetes events, and host state through MCP.
- Split capabilities into domain-isolated services to reduce blast radius and enforce least privilege.
- Connected telemetry, pod state, and host-level signals for focused cross-layer debugging.
- Introduced `investigate_incident` for structured incident reports from live platform signals.
📄 ADR 017: Agentic Interface via MCP 📄 ADR 018: Domain-Isolated MCP Architecture 📄 ADR 019: Hybrid Host-MCP Intelligence

Chapter 9: The Terraform Era

The infrastructure-as-code chapter: moving Kubernetes platform services into OpenTofu so infrastructure state is declarative, reviewable, and easier to audit over time.

2026-03-01
Platform Infrastructure: Immutable Orchestration via OpenTofu
- Moved OpenTelemetry, Grafana, Prometheus, and the supporting K3s observability stack to OpenTofu.
- Brought MinIO, Thanos, Loki, and Tempo into the same declarative infrastructure workflow.
- Improved drift detection, persistent workload recovery, and cross-service telemetry discovery.
📄 ADR 016: OpenTofu K3s Migration 📄 RCA 006: Loki Gateway DNS Timeout

Chapter 8: Platform Maturity & Reusability

The software-architecture chapter: refactoring the Go codebase into reusable internal packages so services, workers, and interfaces share tested platform logic.

2026-02-22
Unified Host Telemetry Analytics & Grafana Alloy Retirement
- Deployed a resource-efficient Analytics service, centralizing host-level data collection and optimizing processing with batch processing.
- Retired Alloy and cut idle resource usage significantly while freeing reserved CPU and memory.
- Completed a more consistent OpenTelemetry-native observability stack for logs, metrics, and traces.
📄 ADR 015: Unified Host Telemetry Analytics 📄 RCA 003: Thanos Discovery and Retention Failure
2026-02-20
Pure Wrapper Architecture & Advanced OTel
- Transitioned the Go fleet to a standardized 'Pure Wrapper' architecture, enabling native OpenTelemetry support and advanced operational visibility across all services.
📄 Note: OpenTelemetry & Observability Guide
2026-02-18
Library-First Implementation
- Executed the structural transition of the repository into the standard 'internal/' and 'cmd/' layout.
- Decoupled core business logic into transport-agnostic modules to improve testability and reuse across platform interfaces.
📄 ADR 014: Library-First Service Architecture

Chapter 7: The SRE Era

The observability maturity chapter: standardizing on OpenTelemetry, connecting logs, metrics, and traces, and using RCAs to turn incidents into durable operational knowledge.

2026-02-13
Storage Scalability & Security Hardening
- Moved telemetry storage to S3-compatible object storage with MinIO for better durability.
- Established a safer default baseline across the core platform services.
2026-02-12
OpenTelemetry Logs, Metrics, and Traces
- Completed the OpenTelemetry path for traces, metrics, and proxy instrumentation across Tempo, Prometheus, and Grafana.
- Built synthetic traffic validation to confirm traces, dashboards, and service graph behavior under load.
- Fixed the service graph pipeline by enabling the missing Prometheus and Tempo pieces.
📄 ADR 013: Standardize on OpenTelemetry 📄 RCA 002: Service Graph Metrics Failure
2026-02-09
Incident & Root Cause Analysis (RCA) Framework
- Introduced a lightweight RCA workflow to document incidents and debugging steps.
- Resolved the first incident by switching Grafana dashboard provisioning to JSON-from-file.
📄 RCA 001: Grafana Provisioning Failure
2026-02-08
Phase 1 Complete: OpenTelemetry Collector
- Deployed the OpenTelemetry Collector as a standalone deployment in the K3s cluster.
📄 ADR 013: Standardize on OpenTelemetry

Chapter 6: The Kubernetes Era

The Kubernetes chapter: moving the observability stack, PostgreSQL, and persistent workloads from standalone containers into K3s for resilient, self-healing operations.

2026-02-05
Core Observability Stack on K3s
- Moved telemetry collection from Promtail to Grafana Alloy as the K3s-native agent.
- Migrated Grafana Loki and Grafana from standalone Docker containers to Kubernetes deployments.
- Moved PostgreSQL into a K3s StatefulSet with PVC-backed persistence.
- Established self-healing runtime behavior for the core observability stack.
📄 ADR 011: Phased K3s Migration Strategy 📄 ADR 012: Migrate Promtail to Grafana Alloy

Chapter 5: Vault via OpenBao

The security chapter: replacing scattered static environment variables with OpenBao-backed secrets, central policy, and safer credential access for services.

2026-01-26
OpenBao Implementation
- Replaced static environment variables with a secure OpenBao secret store and provider-agnostic secrets library.
- Migrated 'system-metrics' and 'proxy' services to dynamic credential retrieval.
2026-01-22
The Security Blueprint
- Proposed transitioning to a centralized secret store to eliminate insecure static configuration files.
- Evaluated HashiCorp Vault vs. OpenBao, selecting the latter for its truly open-source (MPL 2.0) and community-governed foundation.
📄 ADR 010: Integrate OpenBao

Chapter 4: Orchestration & Event-Driven Pivot

The orchestration chapter: moving from timer-driven background work to event-driven webhooks, reducing wasted wakeups while preparing the platform for Kubernetes.

2026-01-21
Standardization & Event-Driven Pivot
- Proposed pivoting from polling-based Systemd timers to event-driven webhooks to eliminate scaling bottlenecks.
- Proposed 'Paved Road' philosophy by centralizing database connection logic to streamline service development and enforce architectural consistency.
📄 ADR 008: GitOps via Proxy Webhook 📄 ADR 009: Standardized DB Connection
2026-01-19
Orchestration Spike: k3s
- Validated k3s orchestration via a 'Shadow Deployment,' proving cross-platform networking.
- Introduced 'Safe-Mode' feature toggles to decouple database dependencies during prototyping.
📄 ADR 007: k3s Shadow Deployment

Chapter 3: GitOps & Host Observability

The host-operations chapter: using GitOps-style synchronization and systemd journal ingestion to make host services repeatable, observable, and easier to recover.

2026-01-14
GitOps Infrastructure & Host Observability
- Implemented a secure, templated GitOps reconciliation engine for automated sync.
- Scaled the GitOps engine to support multi-tenant synchronization across multiple repositories.
- Established host-level Systemd journal ingestion into Loki via Promtail.
- Added log parsing, labels, and allowlist controls for safer operational visibility.
📄 ADR 005: GitOps Reconciliation

Chapter 2: The Platform Core

The platform-structure chapter: turning early services into a coherent system with shared libraries, structured logging, automated ingestion, and documented architecture decisions.

2026-01-09
Shared Database Module
- Standardized connection configuration and DSN generation in 'internal/db', enforcing safe defaults (UTC, SSL) across services.
- Eliminated configuration drift by refactoring services to use the shared configuration module.
📄 ADR 006: Shared Database Connection
2026-01-08
Automated Insights & GitOps
- Automated data ingestion with journal-backed Systemd timers.
- Validated GitOps strategy for automated repository synchronization.
📄 ADR 004: Spatial Telemetry Keyboard 📄 ADR 005: GitOps Reconciliation
2026-01-03
Standardization & Logging
- Implemented a shared 'internal/logger' to enforce consistent structured observability across all services.
- Formalized 'Documentation as Code' by treating architectural decisions (ADRs) as first-class artifacts.
📄 ADR 003: Shared Logging Library
2026-01-01
Project Portal Launch
- Released a self-hosted visualization portal to track system evolution and technical constraints.
- Unified real-time system metrics and personal analytics via snapshots.

Chapter 1: The Genesis

The foundation chapter: building a reliable local lab, choosing PostgreSQL for durable telemetry storage, and bridging cloud-generated events into a self-hosted observability path.

2025-12-31
Hybrid Cloud Architecture
- Designed a secure "Store-and-Forward" bridge to ingest Azure Function telemetry without exposing local ports.
- Decoupled stateless analytics from persistent storage to enable independent scaling.
📄 ADR 002: Cloud-to-Local Bridge
2025-12-25
Telemetry Engine Initialization
- Built a lightweight Go proxy to emit structured telemetry for health and error tracking.
- Unified logs and metrics into a single PostgreSQL JSONB schema, simplifying the data model.
📄 ADR 001: Postgres vs InfluxDB
2025-11-20
Reliable Local Lab
- Establish core Docker infrastructure with automated volume management and backups.
- Selected PostgreSQL over specialized TSDBs for long-term flexibility and reduced maintenance.

Chapter 12: GitOps & Operational Maturity

Trivy-Verified Workload Hardening

Kustomize & RBAC Render Hardening

ArgoCD GitOps & Automated Fleet Promotion

Chapter 11: eBPF-Native Efficiency & Networking

eBPF-Native Networking & MQTT Visibility

Kepler Integration & Energy Dashboards

Chapter 10: The MCP Era

Rust Telemetry Processing Benchmarks

Structured Obs Processor Summaries

Rust Telemetry Summarization Processor

mcp_obs_hub: Unified Agentic Gateway

MCP Telemetry and Infrastructure Intelligence

Chapter 9: The Terraform Era

Platform Infrastructure: Immutable Orchestration via OpenTofu

Chapter 8: Platform Maturity & Reusability

Unified Host Telemetry Analytics & Grafana Alloy Retirement

Pure Wrapper Architecture & Advanced OTel

Library-First Implementation

Chapter 7: The SRE Era

Storage Scalability & Security Hardening

OpenTelemetry Logs, Metrics, and Traces

Incident & Root Cause Analysis (RCA) Framework

Phase 1 Complete: OpenTelemetry Collector

Chapter 6: The Kubernetes Era

Core Observability Stack on K3s

Chapter 5: Vault via OpenBao

OpenBao Implementation

The Security Blueprint

Chapter 4: Orchestration & Event-Driven Pivot

Standardization & Event-Driven Pivot

Orchestration Spike: k3s

Chapter 3: GitOps & Host Observability

GitOps Infrastructure & Host Observability

Chapter 2: The Platform Core

Shared Database Module

Automated Insights & GitOps

Standardization & Logging

Project Portal Launch

Chapter 1: The Genesis

Hybrid Cloud Architecture

Telemetry Engine Initialization

Reliable Local Lab