Krateo ClickHouse Kubernetes Observability Stack

Replaces the eventrouter + eventsse + etcd stack with a ClickHouse-based observability pipeline that collects Kubernetes events, pod logs, traces, and metrics via OpenTelemetry.

Architecture

See the full architecture diagram: docs/architecture.md (Mermaid) or docs/architecture.html (interactive SVG).

Components Overview

| Layer | Component | Role |
|---|---|---|
| Krateo Platform | Frontend + Snowplow, Composition Dynamic Ctrl, Core Provider, AuthN/AuthZ, Providers (Helm, GitHub, …) | Platform services producing logs, events, traces, metrics |
| Collection | OTel DaemonSet (per-node) | Pod logs, node metrics, kubelet stats via filelog, hostmetrics, kubeletstats |
| Collection | OTel Deployment (cluster-level) | K8s events via k8sobjects, cluster metrics via k8s_cluster, enriches with `krateo.io/composition-id` via compositionresolver |
| Collection | OTel Gateway (ClickStack) | OTLP/HTTP :4318 traces from instrumented apps |
| Storage | ClickHouse | `otel_logs`, `otel_traces`, `otel_metrics` tables; `/events` predefined query handler |
| Frontend | krateo-sse-proxy | Polls ClickHouse every 3s, serves SSE `/notifications/` and REST `/events` |
| Alerting | HyperDX | Monitors `otel_logs`, fires alert/resolution webhooks to Slack `#krateo-troubleshooting` |
| AI Agents | Krateo Autopilot | Orchestrates closed-loop: diagnose → remediate → verify → report/escalate |
| AI Agents | Observability Agent | Diagnosis & verification via ClickHouse MCP |
| AI Agents | k8s-agent | Kubernetes remediation (patch, restart, scale, delete) |
| AI Agents | helm-agent | Helm operations (inspect, rollback, upgrade) |
| AI Agents | Composition Agent | Krateo CRD operations (compositions, blueprints, RESTActions) |
| AI Agents | Proactive Monitor | Trend detection: memory pressure, error rate, restart frequency |
| AI Agents | KAgent Slack Bot | Receives @mentions from Slack alerts, routes to Krateo Autopilot |
| MCP | ClickHouse MCP Server | :8000, tools: `list_databases`, `list_tables`, `run_select_query` |
| MCP | Krateo MCP Tools | :8001, pre-built diagnostic tools: `get_pod_errors`, `get_pod_timeline`, `check_pod_health`, etc. |
| Alert Routing | Autopilot Alert Proxy | Deduplicates & correlates HyperDX webhooks before forwarding to Slack/KAgent |
| HA | PDBs, NetworkPolicies | PodDisruptionBudgets, ClickHouse ingress restriction, MCP access control |
| Self-Monitoring | Heartbeat Canary | CronJob writing canary logs every minute; absence alert if pipeline breaks |
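
The Heartbeat Canary row describes an absence alert: when canary logs stop arriving, the pipeline is presumed broken. The check can be sketched roughly as below. The helper name `pipeline_is_stale` and the 3-minute tolerance are assumptions for illustration only; the actual alert is configured elsewhere and may use different values.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed tolerance: the canary writes every minute, so allow a few missed beats.
MAX_GAP = timedelta(minutes=3)


def pipeline_is_stale(last_canary: Optional[datetime], now: datetime,
                      max_gap: timedelta = MAX_GAP) -> bool:
    """Return True when no canary log has been seen within max_gap."""
    if last_canary is None:
        return True  # no heartbeat ever seen: treat the pipeline as broken
    return now - last_canary > max_gap
```

In practice `last_canary` would come from a `max(Timestamp)` query over the canary's log records in `otel_logs`.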

Directory Layout

krateo-observability-stack/
├── agents/                        # kagent Agent CRD definitions (v0.8.4+)
│   ├── krateo-autopilot.yaml      #   Orchestrator: diagnose → remediate → verify → report
│   ├── observability-agent.yaml   #   ClickHouse MCP diagnosis & verification
│   ├── k8s-agent.yaml             #   Kubernetes remediation
│   ├── helm-agent.yaml            #   Helm operations
│   ├── composition-agent.yaml     #   Krateo CRD operations
│   └── proactive-monitor-agent.yaml # Trend detection & anomaly alerts
├── clickstack/
│   └── values.yaml                # ClickStack Helm values
├── clickhouse-config/
│   ├── http-handlers.xml          # ClickHouse predefined_query_handler
│   ├── configmap.yaml             # ConfigMap wrapping the XML
│   ├── endpoint-secret.yaml       # Krateo endpointRef Secret
│   └── otel-credentials-secret.yaml # ClickHouse credentials for OTel + MCP
├── demo/
│   ├── scenario1-crashloop.yaml   # Pod crash demo
│   ├── scenario2-broken-blueprint/ # Broken blueprint Helm chart
│   └── tests/                     # E2E test framework (Playwright)
│       ├── framework/             #   Clients (clickhouse, k8s) + helpers (wait-for, test-id)
│       ├── scenarios/             #   5 test scenarios (full-loop, false-positive, etc.)
│       ├── playwright.config.ts
│       └── package.json
├── docs/
│   ├── architecture.md            # Architecture diagram (Mermaid)
│   ├── ALERT_RESOLUTION_DEEP_DIVE.md
│   └── IMPROVEMENT_PLAN.md        # 4-phase, 19-item improvement roadmap
├── ha/                            # High availability resources
│   ├── pod-disruption-budgets.yaml
│   ├── network-policies.yaml
│   └── canary-heartbeat.yaml      # Self-monitoring heartbeat CronJob
├── mcp-server/
│   ├── deployment.yaml            # ClickHouse MCP Server (raw SQL tools)
│   └── github-mcp-server.yaml     # GitHub MCP Server
├── otel-collectors/
│   ├── daemonset.yaml             # OTel DaemonSet (logs + metrics + composition-id enrichment)
│   └── deployment.yaml            # OTel Deployment (K8s events + cluster metrics)
├── otel-collector-custom/
│   └── compositionresolver/       # Custom OTel processor (Go)
├── pod-restart-alert/
│   ├── README.md                  # Alert setup guide
│   ├── bootstrap-alert.sh         # Single alert bootstrap
│   └── bootstrap-all-alerts.sh    # All 4 alerts bootstrap
├── runbooks/                      # Runbook-as-code YAML definitions
│   ├── oomkill-remediation.yaml
│   ├── helm-release-failure.yaml
│   ├── infra-self-healing.yaml
│   └── alert-storm-suppression.yaml
├── sse-proxy/                     # SSE proxy (Go, stdlib-only)
├── install.sh                     # End-to-end install (8 phases)
└── README.md

Closed-Loop Architecture

The Krateo Autopilot implements a closed loop for automated incident response:

Alert fires (HyperDX → Autopilot Alert Proxy → Slack → KAgent)
  │
  ▼
DIAGNOSE: Observability Agent queries ClickHouse via MCP
  │        (get_pod_errors, get_pod_timeline, get_warning_summary)
  ▼
DECIDE: Autopilot routes to the appropriate agent
  │      ├── k8s-agent (pod crash, OOM, resource issues)
  │      ├── helm-agent (release failure, rollback needed)
  │      └── composition-agent (Krateo CRD issues)
  ▼
REMEDIATE: Agent takes action (patch, restart, rollback)
  │
  ▼
VERIFY: Observability Agent re-queries ClickHouse after 60s
  │      "Are Warning events still appearing?"
  ▼
REPORT: ✅ Resolved → Slack summary
        ❌ Persists → Retry once, then ESCALATE to human
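
The control flow in the diagram above can be sketched as a small function. This is an illustration under stated assumptions, not the Autopilot's implementation: the diagnosis category names in `ROUTES` and the callback signatures are hypothetical, and the 60s wait before verification is left to the `verify` callback.

```python
from typing import Callable, Dict

# Routing table mirroring the DECIDE step; category names are hypothetical.
ROUTES: Dict[str, str] = {
    "pod_crash": "k8s-agent",
    "oom": "k8s-agent",
    "resource": "k8s-agent",
    "release_failure": "helm-agent",
    "crd_issue": "composition-agent",
}


def run_closed_loop(diagnose: Callable[[], str],
                    remediate: Callable[[str], None],
                    verify: Callable[[], bool],
                    max_attempts: int = 2) -> str:
    """Diagnose, route, remediate, verify; retry once, then escalate."""
    category = diagnose()
    agent = ROUTES.get(category)
    if agent is None:
        return "escalated"  # no agent is mapped to this diagnosis
    for _ in range(max_attempts):  # first attempt plus one retry
        remediate(agent)
        if verify():  # e.g. "are Warning events still appearing?" answered no
            return "resolved"
    return "escalated"
```

With `max_attempts=2` the loop matches the "Retry once, then ESCALATE" rule: a failed verification triggers exactly one more remediation before a human is involved.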

Key features:

Post-remediation verification — agents confirm fixes worked before reporting success
Conditional routing — only invokes relevant agents based on diagnosis
Alert deduplication — HyperDX native alert grouping by namespace + pod
Self-observability — agent traces flow to ClickHouse via kagent v0.8.4 tracing
Proactive monitoring — trend detection agent catches issues before alerts fire
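
Grouping alerts by namespace + pod, as the deduplication feature above describes, can be illustrated with a time-window filter. This sketch is neither HyperDX's grouping logic nor the Alert Proxy's code; the class name and the 300-second window are assumptions.

```python
from typing import Dict, Tuple


class AlertDeduplicator:
    """Suppress repeat alerts for the same (namespace, pod) within a window."""

    def __init__(self, window_seconds: float = 300.0) -> None:  # assumed window
        self.window_seconds = window_seconds
        self._last_forwarded: Dict[Tuple[str, str], float] = {}

    def should_forward(self, namespace: str, pod: str, now: float) -> bool:
        key = (namespace, pod)
        last = self._last_forwarded.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: drop it
        self._last_forwarded[key] = now
        return True
```

Only forwarded alerts move the window forward, so a pod that keeps failing is reported again once the window expires.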

Prerequisites

kubectl pointing at the target cluster
helm v3+
kagent v0.8.4+ (for agent orchestration)
Docker (for building custom images)
Kubernetes ≥ 1.24

Quick Start

# The SSE proxy image is built and pushed automatically via GitHub Actions
# (.github/workflows/sse-proxy.yaml) on every push to main.
# Image: ghcr.io/braghettos/krateo-sse-proxy:<git-sha>

# Run the full install (uses the latest image tag by default)
chmod +x install.sh
./install.sh

Agent Quick Start

After installing the observability stack, deploy the agent chain:

# 1. Upgrade kagent to v0.8.4
helm upgrade kagent kagent/kagent --version 0.8.4 -n kagent-system

# 2. Deploy agent CRDs
kubectl apply -f agents/

# 3. Bootstrap all HyperDX alerts
cd pod-restart-alert && cp .env.example .env
# Edit .env with your HyperDX credentials
./bootstrap-all-alerts.sh

# 4. Verify agent traces in ClickHouse
kubectl exec -it -n clickhouse-system svc/krateo-clickstack-clickhouse -- \
  clickhouse-client -q "SELECT ServiceName, count() FROM otel_traces WHERE ServiceName LIKE 'krateo-%' GROUP BY ServiceName"

Running E2E Tests

cd demo/tests
npm install
npx playwright install --with-deps chromium

# Quick validation (pipeline + false positive)
npm run test:quick

# Full suite (all 5 scenarios)
npm test

# Individual scenarios
npm run test:full-loop       # Scenario A: full closed-loop
npm run test:false-positive  # Scenario B: rolling update noise
npm run test:helm-rollback   # Scenario C: multi-agent Helm rollback
npm run test:mcp-down        # Scenario D: agent failure resilience
npm run test:concurrent      # Scenario E: parallel alerts

Step-by-Step Install

Phase 1 – ClickStack

helm repo add clickstack https://clickhouse.github.io/ClickStack-helm-charts
helm repo update
helm install krateo-clickstack clickstack/clickstack \
  --namespace clickhouse-system --create-namespace \
  -f clickstack/values.yaml

Phase 2 – ClickHouse HTTP Handler Config

The ConfigMap wraps http-handlers.xml, which is mounted into /etc/clickhouse-server/config.d/ inside the ClickHouse pod through the extraVolumeMounts in clickstack/values.yaml. Apply the ConfigMap before the ClickStack install; if ClickStack is already installed, apply it and then restart the ClickHouse pod:

kubectl apply -f clickhouse-config/configmap.yaml -n clickhouse-system
# restart ClickHouse to pick up the new config:
kubectl rollout restart statefulset -n clickhouse-system -l app.kubernetes.io/name=clickhouse

This exposes:

GET http://krateo-clickstack-clickhouse.clickhouse-system.svc:8123/events/{compositionId}
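
A client calling this endpoint needs to place the composition ID in the URL path. A minimal sketch of a URL builder follows; the function name `events_url` is hypothetical, and percent-encoding the ID is a defensive assumption rather than something the handler is known to require.

```python
from urllib.parse import quote

# In-cluster base URL taken from the endpoint shown above.
CLICKHOUSE_BASE = "http://krateo-clickstack-clickhouse.clickhouse-system.svc:8123"


def events_url(composition_id: str, base: str = CLICKHOUSE_BASE) -> str:
    """Build the /events/{compositionId} URL, percent-encoding the ID."""
    return f"{base.rstrip('/')}/events/{quote(composition_id, safe='')}"
```

When testing through a port-forward, pass `base="http://localhost:8123"` instead of the in-cluster address.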

Phase 3 – OTel Collectors

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts

# Node-level logs + metrics
helm install otel-daemonset open-telemetry/opentelemetry-collector \
  -f otel-collectors/daemonset.yaml -n clickhouse-system

# K8s events + cluster metrics
helm install otel-deployment open-telemetry/opentelemetry-collector \
  -f otel-collectors/deployment.yaml -n clickhouse-system

Label your Krateo compositions. The OTel kubernetesEvents receiver propagates the krateo.composition.id label from the involved object to the log record's ResourceAttributes['krateo.composition.id']. Ensure compositions add this label to the resources they create.
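
The propagation described above amounts to copying one label from the involved object into the record's resource attributes. The sketch below only illustrates that data flow; it is not the collector's code, and the function name is hypothetical. The key is the one named in the paragraph above.

```python
from typing import Dict, Mapping

LABEL_KEY = "krateo.composition.id"  # key as named in the paragraph above


def enrich_resource_attributes(resource_attributes: Mapping[str, str],
                               object_labels: Mapping[str, str]) -> Dict[str, str]:
    """Copy the composition label, if present, into a new attribute map."""
    enriched = dict(resource_attributes)
    composition_id = object_labels.get(LABEL_KEY)
    if composition_id:
        enriched[LABEL_KEY] = composition_id
    return enriched
```

Resources without the label pass through unchanged, which is why unlabeled objects never show up under a composition's events.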

Phase 4 – Krateo Endpoint Secret

kubectl apply -f clickhouse-config/endpoint-secret.yaml -n krateo-system

Phase 5 – SSE Proxy

kubectl apply -f sse-proxy/deploy/deployment.yaml

Update the Krateo frontend config.json:

{
  "api": {
    "EVENTS_API_BASE_URL":      "http://krateo-clickstack-clickhouse.clickhouse-system.svc:8123",
    "EVENTS_PUSH_API_BASE_URL": "http://krateo-sse-proxy.krateo-system.svc:8080"
  }
}
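
As noted in Components Overview, the proxy polls ClickHouse every 3s and pushes new events over SSE. One common way to avoid re-sending rows is a timestamp watermark, sketched below. The real proxy is written in Go and may work differently; `EventPoller` and the `fetch_since` callback are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]


class EventPoller:
    """Track a timestamp watermark so each poll only yields unseen events."""

    def __init__(self, fetch_since: Callable[[str], Iterable[Event]],
                 start: str = "") -> None:
        # Hypothetical callback: returns events with Timestamp newer than its argument.
        self._fetch_since = fetch_since
        self.watermark = start

    def poll_once(self) -> List[Event]:
        fresh = sorted(self._fetch_since(self.watermark), key=lambda e: e["Timestamp"])
        # Guard against a backend that returns rows at or before the watermark.
        fresh = [e for e in fresh if e["Timestamp"] > self.watermark]
        if fresh:
            self.watermark = fresh[-1]["Timestamp"]
        return fresh
```

A caller would invoke `poll_once()` in a loop with a 3-second sleep. The sketch compares timestamps as strings, so it assumes a fixed, sortable format, and it can miss rows that share the exact timestamp of the current watermark.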

Phase 6 – ClickHouse MCP Server

kubectl apply -f mcp-server/deployment.yaml

Phase 7 – Pod Restart Alert (optional)

Create a pod restart alert in the HyperDX UI. Alerts fire when pod restart events (Killing, BackOff, Unhealthy, Failed) exceed a threshold and post to Slack. Target channel: #krateo-troubleshooting in workspace aiagents-gruppo.
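
The firing condition can be pictured as a count over a time window. This is a sketch of the rule as described above, not HyperDX's evaluation logic; the function name is hypothetical and the threshold value is whatever you configure in the alert.

```python
from typing import Iterable

RESTART_REASONS = {"Killing", "BackOff", "Unhealthy", "Failed"}  # reasons listed above


def restart_alert_fires(event_reasons: Iterable[str], threshold: int) -> bool:
    """True when restart-related events in the window exceed the threshold."""
    count = sum(1 for reason in event_reasons if reason in RESTART_REASONS)
    return count > threshold
```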

See pod-restart-alert/README.md for full step-by-step instructions (create Slack webhook in HyperDX, saved search, alert).

To have the Krateo Observability Agent react to alerts, add the KAgent Slack bot to #krateo-troubleshooting. See the Krateo Autopilot repo: manifests/slack-integration/README.md.

For a deep dive into what happens when an alert fires vs. resolves (ClickHouse vs. HyperDX roles), see docs/ALERT_RESOLUTION_DEEP_DIVE.md.

Access the ClickHouse MCP Server (Phase 6) from Cursor (local):

kubectl port-forward svc/clickhouse-mcp-server 8000:8000 -n krateo-system

Add to .cursor/mcp.json:

{
  "mcpServers": {
    "clickhouse-k8s": {
      "url": "http://localhost:8000/mcp"
    }
  }
}

Blueprint Template Changes

Copy the updated templates into the portal-composition-page-generic chart:

| File | Change |
|---|---|
| `restaction.composition-events.yaml` | `endpointRef.name` → `clickhouse-internal-endpoint`; `filter` updated to reshape ClickHouse JSON output into `SSEK8sEvent` list |
| `eventlist.composition-events-panel-eventlist.yaml` | No changes (update `EVENTS_PUSH_API_BASE_URL` in frontend config instead) |

Validation

Verify events in ClickHouse

kubectl exec -it -n clickhouse-system \
  $(kubectl get pod -n clickhouse-system -l app.kubernetes.io/name=clickhouse -o name | head -1) \
  -- clickhouse-client -q \
  "SELECT count(), min(Timestamp), max(Timestamp)
   FROM otel_logs
   WHERE ResourceAttributes['k8s.event.reason'] != ''"

Test the REST endpoint

# Port-forward ClickHouse HTTP
kubectl port-forward svc/krateo-clickstack-clickhouse 8123:8123 -n clickhouse-system &

# Query events for a compositionId
curl -s "http://localhost:8123/events/my-composition-id" | jq .

Test the SSE proxy

kubectl port-forward svc/krateo-sse-proxy 8080:8080 -n krateo-system &
curl -N http://localhost:8080/notifications/
# Should see: ": connected" then periodic ": keepalive" comments,
# and "event: <compositionId>\ndata: {...}" when new events arrive.
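
The stream format described in the comments above (comment lines starting with ":", then "event:" and "data:" fields terminated by a blank line) can be consumed with a small parser. This is a simplified sketch of Server-Sent Events parsing that handles only the fields mentioned here; `parse_sse` is a hypothetical helper, not part of the repo.

```python
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {'event': ..., 'data': ...} dicts; skip ':' comment lines."""
    event, data = "", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment such as ": connected" or ": keepalive"
        if line == "":
            if data:  # a blank line terminates an event
                yield {"event": event, "data": "\n".join(data)}
            event, data = "", []
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
```

Each yielded `event` value is the composition ID, so a client can filter the stream for the compositions it cares about.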

Test the MCP Server

kubectl port-forward svc/clickhouse-mcp-server 8000:8000 -n krateo-system &
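# Streamable HTTP MCP clients must accept both JSON and SSE responses
curl -s http://localhost:8000/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","method":"tools/list","id":1}' | jq .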
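
The request body in the curl call above is a plain JSON-RPC 2.0 message. For scripted checks, it can be built programmatically; the helper name `jsonrpc_request` is hypothetical, and the sketch covers only the encoding step, not session handling that a given MCP server may require.

```python
import json
from typing import Any, Dict, Optional


def jsonrpc_request(method: str, request_id: int,
                    params: Optional[Dict[str, Any]] = None) -> bytes:
    """Encode a JSON-RPC 2.0 request body like the one sent by curl above."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "method": method, "id": request_id}
    if params is not None:
        message["params"] = params
    return json.dumps(message).encode("utf-8")
```

The resulting bytes can be passed as the `data` argument of `urllib.request.Request`, together with the same `Content-Type` header used in the curl command.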

Troubleshooting Agent Queries

Once the MCP server is connected, an AI agent can run:

-- Pods with the most errors in the last hour
SELECT ResourceAttributes['k8s.pod.name'] AS pod,
       ResourceAttributes['k8s.namespace.name'] AS ns,
       count() AS errors
FROM otel_logs
WHERE SeverityText IN ('ERROR','FATAL')
  AND Timestamp > now() - INTERVAL 1 HOUR
GROUP BY pod, ns ORDER BY errors DESC LIMIT 10;

-- Correlate K8s events with pod logs
SELECT Timestamp, Body, ResourceAttributes['k8s.event.reason'] AS reason
FROM otel_logs
WHERE ResourceAttributes['k8s.pod.name'] = 'my-failing-pod'
ORDER BY Timestamp DESC LIMIT 50;

-- Slow traces (spans longer than 1 s; Duration is stored in nanoseconds)
SELECT TraceId, SpanName, Duration/1e6 AS duration_ms
FROM otel_traces
WHERE ServiceName = 'my-service' AND Duration > 1000000000
ORDER BY Timestamp DESC LIMIT 20;
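
The unit handling in the slow-traces query is easy to get wrong, so here is the same filter applied to rows already fetched into Python. It assumes, as the query does, that `Duration` is in nanoseconds; `slow_spans` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List

SLOW_THRESHOLD_NS = 1_000_000_000  # 1 second, matching Duration > 1000000000


def slow_spans(spans: Iterable[Dict[str, Any]],
               threshold_ns: int = SLOW_THRESHOLD_NS) -> List[Dict[str, Any]]:
    """Keep spans slower than the threshold and add duration_ms."""
    return [
        {**span, "duration_ms": span["Duration"] / 1e6}  # ns -> ms
        for span in spans
        if span["Duration"] > threshold_ns
    ]
```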
