DEV Community

Site Reliability Engineering

Site Reliability Engineering principles, practices, and culture.

Posts

A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet

8 min read
The Actual Cost of Self-Hosting Your LLM (Nobody Does This Math First)

4 min read
Incident Communication: The Status Page That Builds Trust

3 min read
OCI Run Command Advanced Guide: Remote Execution, Object Storage Scripts, and Production Troubleshooting

4 min read
Load Testing in Production: How We Do It Safely

3 min read
Claude Code for the Outer Loop: An AI SRE Playbook to Reduce On-Call Toil

18 min read
DORA metrics are a CFO tool, not a dev tool

2 min read
Delete 40% of your dashboards

2 min read
Your Datadog bill is 60% DEBUG logs

2 min read
Effective On-Call Rotations: Lessons From Building Fair Schedules

3 min read
Agents, context, and guardrails on a unified platform

3 min read
We built a system that investigates production incidents automatically

1 min read
Prometheus at Scale: Surviving the Cardinality Cliff

2 min read
SLO Design for Agentic AI Systems — Why Traditional Reliability Metrics Break (and What to Use Instead)

4 min read
DORA metrics for the CFO: making engineering velocity legible

5 min read