I’ve been in too many meetings where someone says, “Wait, why did we build it that way?” only to discover the answer was buried in a 47-page RFC from 2019 that nobody ever opened. Sound familiar? The irony is that Request for Comments documents are supposed to prevent this chaos. Instead, many teams produce RFCs that get skimmed, misunderstood, or worse—completely ignored. But here’s the thing: a well-crafted RFC is like a good movie. It doesn’t have to be long to be impactful. It just needs to know its audience and respect their time. After working with teams that struggle with documentation and others that nail it, I’ve noticed a clear pattern: the RFCs people actually read are the ones that understand what their readers care about. This article walks you through exactly how to write those RFCs.

Why RFCs Matter (And Why Most People Skip Them)

Before we talk about structure, let me be blunt: nobody reads RFCs for fun. Your team reads them because they have to—or because you’ve made it so darn compelling that they want to. Here’s what a great RFC actually does:

  • Convinces the author that their thinking is thorough and covers edge cases
  • Convinces the team that the solution is sound and their feedback matters
  • Convinces stakeholders that the work is meaningful and the team owns it
  • Helps future developers understand why decisions were made (not just what was decided)

The problem is that most RFCs fail on count one. The author half-writes it, the team skims it, stakeholders get confused, and future developers find nothing. Everyone loses.

When Should You Actually Write an RFC?

Not every decision needs an RFC. If you’re changing variable names or fixing a UI button color, you probably don’t need a formal proposal. But here’s when an RFC becomes essential. You need one when:

  • The decision affects multiple systems or teams
  • You’re proposing a significant architectural change
  • There are multiple viable solutions and you need to align on one
  • The implementation will take weeks or more
  • It touches on company infrastructure, security, or compliance
  • You’re establishing a new pattern or standard your team will follow repeatedly

Think of it as the difference between a code review and a design review. Code reviews happen after you write code. RFCs happen before anyone touches the codebase. You want alignment when it’s still cheap to change.

The RFC Structure That Actually Works

Here’s where most teams go wrong: they start with solutions. They jump into technical details without establishing why they’re solving this problem. It’s like opening a movie at the climax—nobody cares yet.

The Foundation: Start with Metadata

Before you write a single paragraph of actual content, establish the basics:

Metadata:
  Date: YYYY-MM-DD
  Author: Your Name
  Status: Draft/Under Review/Accepted/Rejected
  RFC Number: RFC-001 (optional but useful)
  Tags: architecture, performance, storage
  Stakeholders: Team leads, affected service owners
  Timeline: Expected decision date

I love the numbering approach. When someone says “Remember that RFC about distributed tracing?” you can reply with “Oh, RFC-047?” instead of vague hand-waving. It makes RFCs feel like proper artifacts.

The Hook: Write Your Overview First

This should be 2-3 sentences. Maximum. This is your tl;dr. Think of it like a movie trailer. Does it make someone want to watch?

Here’s a bad version: “This RFC is about improving our caching strategy.”

Here’s a better one: “We’re losing 30% of revenue to cache misses during peak traffic. This RFC proposes a two-tier caching strategy that reduces cache misses by 85% with minimal code changes.”

See the difference? One is a statement. The other creates urgency.

The Context: Problem Statement (This Is Critical)

This is where most RFCs stumble, but it’s also where they matter most. I’ll say it again because it’s that important: a clear problem statement makes everything else easier. Here’s a structure that works:

  1. Start with a quick summary (1 sentence)
  2. Give background context (What’s the current situation? What metrics matter?)
  3. Clearly state the problem (Why is this broken? What’s the impact?)

A solid problem statement looks like this:

“We currently process payment callbacks synchronously, which blocks our request handler for 2-5 seconds. During high-traffic events, this cascades into connection timeouts, causing 3-5% of transactions to fail. This costs approximately $50K per hour during peak periods and creates a poor customer experience.”

Notice what we did here? We made the problem real. Not abstract. Money. Time. Customer impact. Measurable.

The Meat: Proposed Solution

Now you can talk about your solution. Include:

  • What are you actually proposing? (Be specific. “Better performance” is not a proposal.)
  • How does it solve the problem? (Reference your problem statement directly)
  • Why this solution over alternatives? (Acknowledge you considered other approaches)

For the payment example:

“We propose implementing an asynchronous callback queue using Redis and a background worker pool. Callbacks are written to Redis (< 100ms), the request returns immediately, and workers process callbacks in batches. This decouples the critical path from callback processing.”

Then explain: Why Redis? Why not use RabbitMQ? Why batching? Show your thinking.
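If it helps readers picture the shape of that proposal, here is a minimal sketch of the queueing step in Python. Everything in it (the key name, the signature check, the handler) is illustrative, not something the RFC itself would need to include:

import hmac
import hashlib
import json
import os
import redis

# Hypothetical names for illustration only; the RFC stays at the prose level.
QUEUE_KEY = "payments:callbacks"
SECRET = os.environ.get("CALLBACK_SECRET", "change-me")
client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def verify_signature(payload: dict, signature: str) -> bool:
    # Illustrative HMAC check; a real payment processor documents its own scheme.
    expected = hmac.new(
        SECRET.encode(),
        json.dumps(payload, sort_keys=True).encode(),
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(expected, signature)

def handle_callback(payload: dict, signature: str) -> int:
    """Validate, enqueue, return immediately: the critical path ends at the queue write."""
    if not verify_signature(payload, signature):
        return 400
    client.rpush(QUEUE_KEY, json.dumps(payload))  # fast Redis write (< 100ms)
    return 200  # workers handle the slow parts (DB, email, events) later

The sketch makes the same point the RFC argues in prose: the request handler’s only job is a fast queue write, and everything slow happens elsewhere.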

The Reality Check: Pros and Cons

Here’s where honesty builds credibility. Every solution has tradeoffs. If you pretend yours doesn’t, you lose the room immediately.

Pros:
- Eliminates blocking on critical path
- Horizontally scalable (add workers as needed)
- Simple implementation using existing Redis infra
- Easy to monitor and debug
Cons:
- Adds operational complexity (managing workers)
- Callbacks now async (6-10s delay typical)
- Requires careful error handling for failed callbacks
- Redis becomes more critical to system health

Readers expect cons. They’re looking for evidence that you’ve actually thought this through. When you acknowledge tradeoffs, it shows maturity. When you pretend there aren’t any, you look naive.

The Action Plan: Implementation Steps

This is where you put teeth into your proposal. Break it down into concrete, sequenced steps:

Phase 1: Infrastructure (Week 1-2)
  - Set up Redis cluster with high availability
  - Create monitoring dashboards for queue depth and processing lag
  - Load test queue handling capacity
Phase 2: Implementation (Week 3-4)
  - Implement callback queueing in payment service
  - Build background worker with retry logic
  - Add metrics for success/failure rates
Phase 3: Rollout (Week 5-6)
  - Deploy to staging, run load tests
  - Gradual rollout to production (5% → 25% → 50% → 100%)
  - Monitor for regressions
Phase 4: Cleanup (Week 7)
  - Remove old synchronous callback code
  - Documentation and team runbook

Notice the specificity? This isn’t wishful thinking. This is what actually needs to happen, in order, with rough time estimates. Now your team can actually commit to something.
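To make Phase 2 slightly more tangible, here is a minimal sketch of the background worker with retry logic and a dead-letter queue. The names and the retry count are assumptions for illustration; the real values are exactly the kind of thing the RFC’s open questions should settle:

import json
import redis

# Same hypothetical names as the earlier sketch; retry count is a placeholder.
QUEUE_KEY = "payments:callbacks"
DEAD_LETTER_KEY = "payments:callbacks:dead"
MAX_ATTEMPTS = 10
client = redis.Redis(host="localhost", port=6379, decode_responses=True)

def process_callback(event: dict) -> None:
    """Placeholder for the slow work: DB update, email, downstream events."""
    ...

def run_worker() -> None:
    while True:
        _, raw = client.blpop(QUEUE_KEY)  # block until a callback is available
        event = json.loads(raw)
        try:
            process_callback(event)
        except Exception:
            event["attempts"] = event.get("attempts", 0) + 1
            if event["attempts"] >= MAX_ATTEMPTS:
                # Park permanently failing callbacks for manual inspection.
                client.rpush(DEAD_LETTER_KEY, json.dumps(event))
            else:
                client.rpush(QUEUE_KEY, json.dumps(event))  # naive re-queue; real code would back off

A production worker would add backoff, batching, and the metrics called out in Phase 2; the sketch only shows the control flow.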

The Impact: What Happens When You Do This?

Be concrete about the benefits:

Expected Outcomes:
- Reduce payment failure rate from 3-5% to < 0.5%
- Improve API response time by 2-4 seconds average
- Recover ~$50K/hour during peak traffic periods
- Enable callback processing at 10x current volume
Risks if we don't do this:
- We lose revenue scalability (can't handle growth)
- Customer experience degrades during traffic spikes
- Technical debt compounds (more patches, more complexity)

The Gotchas: Considerations and Open Questions

This is where you address security, compliance, operational concerns:

Considerations:
- Security: Redis access restricted to service VPC only
- Compliance: Callbacks must be processed within SLA window
  (propose 24-hour max as acceptable)
- Operations: On-call team needs monitoring/alerting training
- Cost: Estimate $3K/month for Redis cluster + worker fleet
Open Questions:
- Should we implement dead-letter queue for failed callbacks?
  (Need PO input on long-term retry strategy)
- What's our acceptable callback delay? (Currently propose 6-10s)

Asking open questions shows intellectual honesty. It signals that you haven’t pretended to have all the answers. Your team fills in the gaps.
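On the operations point above, the monitoring itself can be tiny. A hedged sketch of a queue-depth check, reusing the hypothetical key from the earlier sketches and treating the threshold as a placeholder:

import redis

QUEUE_KEY = "payments:callbacks"  # hypothetical key from the earlier sketches
ALERT_THRESHOLD = 10_000          # assumption: past this, workers are falling behind
client = redis.Redis(host="localhost", port=6379)

def queue_is_backed_up() -> bool:
    """Return True when queue depth suggests workers can't keep up."""
    return client.llen(QUEUE_KEY) > ALERT_THRESHOLD

Wiring a check like this into alerting is left to whatever the on-call team already uses; the RFC only needs to name the concern.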

References: Back Your Thinking

Link to the research you did:

References:
- Similar async approach implemented by Stripe (2019 blog post)
- Redis Cluster operations manual
- Our existing Redis infrastructure documentation
- Payment SLA requirements (link to doc)

The RFC That Kills: A Complete Example

Here’s what a real, readable RFC looks like in practice:


Metadata:
  Date: 2026-01-15
  Author: Sarah Chen
  Status: Under Review
  RFC Number: RFC-042
  Tags: payments, async, infrastructure
  Stakeholders: Payment team lead, DevOps lead, PO
  Timeline: Decision by 2026-01-31

## Overview
Our synchronous payment callback processing blocks request handlers for 2-5 seconds, causing 3-5% transaction failure during high traffic. We propose an async queue system to decouple callback processing from the critical path, eliminating failures and enabling 10x throughput.

## Glossary
- Callback: HTTP request from payment processor confirming transaction status
- Queue: Persistent message store (Redis) for pending callbacks
- Worker: Background service processing callbacks from queue
- Critical path: Request handler timeline (must complete within seconds)

## Problem Statement
Currently, payment callbacks execute synchronously in the request handler:
1. Payment processor sends callback to our API
2. We fetch order details from DB
3. We update order status
4. We trigger downstream events (email, fulfillment, analytics)
5. We return 200 OK
This entire flow happens before the request completes. If any step is slow, the handler blocks. When we get traffic spikes (Black Friday, launches), callback processing slows down, handlers timeout, and the payment processor retries. We end up in a cascading failure.
Current metrics:
- Callback processing: 2-5 seconds per callback (mostly waiting on email service)
- During peak hours: 8-12% of callbacks timeout
- Customers experience failed payments, then duplicate charges when retries succeed
- Cost: ~$50K per hour in failed transactions + refund chargeback fees

## Proposed Solution
Implement async callback processing:
1. **Receive & Queue** (< 100ms): Payment callback arrives, we validate signature, write to Redis queue, return 200 OK immediately
2. **Process Asynchronously**: Background workers pull from queue, do all the work (DB updates, emails, events)
3. **Retry on Failure**: Failed callbacks go to dead-letter queue for manual inspection
Architecture:

Payment Processor → API Gateway → Redis Queue → Worker Pool → Services
                      (100ms)       (async)     (2-10s per callback)

Why this approach?
- Decouples critical path (return 200) from slow operations (emails, analytics)
- Horizontally scalable: add workers when queue depth grows
- Redis is battle-tested, we already use it
- Easy to monitor and debug
Why not alternatives?
- Message broker (RabbitMQ): Adds operational complexity, we already have Redis
- Database polling: Increases DB load, harder to scale
- Direct async/await in handlers: Doesn't solve timeout issues, customer still sees failures

## Pros
- Eliminates transaction failures from callback timeouts ✓
- Enables 10x throughput with same resources
- Improves API response time by 2-4 seconds for affected endpoints
- Recovers $50K/hour during peak traffic
- Simple to implement with existing stack
- Easy to monitor queue depth, processing lag, error rates

## Cons
- Adds operational complexity: workers need monitoring, alerting, deployment
- Callbacks now async: ~6-10 second delay typical (customer won't see result immediately)
- Failure handling more complex: what if email service is down? Need retry + dead-letter queue
- Redis becomes more critical: cluster failure stops callback processing
- On-call team needs training on new system
- Initial development: ~4 weeks of engineering time

## Implementation Plan
**Phase 1: Foundation (Week 1-2)**
- Provision Redis cluster with HA and backup
- Set up monitoring: queue depth, processing lag, worker health
- Load test: 1K callbacks/second, verify Redis handles it
- Acceptance: Redis stable at 2x expected peak load
**Phase 2: Implementation (Week 3-4)**
- Worker pool service: consumes from queue, retries on failure
- Update payment API: write to queue instead of sync processing
- Error handling: dead-letter queue for investigation
- Metrics: success/failure rates, processing latency, queue depth
- Unit + integration tests: >90% code coverage
**Phase 3: Staging (Week 5)**
- Deploy entire system to staging
- Load test: 5K callbacks/second for 2 hours
- Failure injection: simulate Redis downtime, worker crashes
- Runbook for on-call team
**Phase 4: Production Rollout (Week 6)**
- Canary: 5% of traffic to new system for 24 hours
- Monitor: error rates, latency, queue health
- Gradual: 5% → 25% → 50% → 100% over 3 days
- Rollback plan: revert to sync if failure rate > 1%
**Phase 5: Cleanup (Week 7)**
- Remove old sync callback code
- Team training on monitoring + on-call process
- Documentation: system design, runbook, troubleshooting guide

## Impact & Benefits
Expected outcomes:
- Payment failure rate: 3-5% → < 0.5%
- API latency improvement: -2 to -4 seconds for users
- Revenue impact: recover ~$50K/hour during peak periods
- Scalability: can handle 10x current load without architecture changes
What happens if we don't do this:
- Can't scale during peak traffic (limits business growth)
- Customer experience continues degrading
- Technical debt accumulates (more one-off patches)
- Refund/chargeback costs continue

## Considerations
**Security**
- Redis access: restricted to service VPC, no public access
- Queue data: callbacks contain sensitive info, Redis persists to encrypted disk
- Authentication: service-to-worker uses internal mTLS
**Operations**
- On-call: team needs training on queue monitoring + worker debugging
- Alerting: alert if queue depth > 10K (indicates workers are falling behind)
- Dashboard: real-time queue depth, latency percentiles, error rates
**Compliance**
- SLA: callbacks must process within 24 hours (currently propose 6-10s typical)
- Audit: log all callbacks with timestamps, processing results
- Legal: confirm payment processor allows async processing (check API contract)
**Cost**
- Redis cluster: ~$2K/month (HA + backups)
- Worker instances: ~$1K/month (3 workers at normal load)
- Total: ~$3K/month additional infrastructure

## Open Questions
These need input before we proceed:
1. **Callback delay tolerance**: I proposed 6-10s typical delay. Is this acceptable to Product/Finance? Or do we need < 5s?
2. **Dead-letter queue retention**: How long should we keep failed callbacks? (Propose 30 days)
3. **Retry policy**: Should we retry failed callbacks indefinitely, or give up after N attempts? (Propose: 10 retries over 24 hours)
4. **Notification**: Should we alert customers when callbacks take > 30 minutes? (Propose: yes, for transparency)

## References
- Payment callback architecture patterns: https://stripe.com/blog/payments-without-pci
- Redis cluster operations: https://redis.io/topics/cluster-tutorial
- Our existing infrastructure docs: [link to internal wiki]
- Payment SLA requirements: [link to contracts]
- Similar implementation case study: [link]

See how this reads? It’s not fancy. It’s just complete. Someone can read this in 10 minutes, understand exactly what’s being proposed, why, and whether they agree.

The Process: How to Actually Write This

Here’s my workflow, and I recommend you steal it:

graph TD
  A["1. Define the Problem"] --> B["2. Research & Validate"]
  B --> C["3. Rough Draft"]
  C --> D["4. Get Early Feedback"]
  D --> E{"Problem\nStatement\nLocked?"}
  E -->|No| F["Refine Problem"]
  F --> D
  E -->|Yes| G["Write Full RFC"]
  G --> H["Self-Edit"]
  H --> I["Share for Review"]
  I --> J{"Major\nFeedback?"}
  J -->|Yes| K["Revise"]
  K --> I
  J -->|No| L["Archive & Reference"]

Step 1: Lock the Problem Statement First

This is non-negotiable. Before you write anything long, spend time on the problem. Literally:

  • Write it in 3-5 sentences
  • Share with one trusted colleague
  • Let them poke holes in it
  • Refine until it’s airtight

This saves you from writing 5 pages of solution that solves the wrong problem. I’ve thrown away half-finished RFCs because we realized mid-write that we’d misdiagnosed the problem. Don’t be me.

Step 2: Use a Template (And Enforce It)

Create a template your team uses for every RFC. Put it in your wiki. No exceptions. Here’s why: consistency makes things easier to review. Your brain learns the structure after a few RFCs and can skim faster.

# RFC-XXX: [Title]
**Metadata**
- Date: 
- Author: 
- Status: Draft
- Stakeholders: 
- Timeline: 
## Overview
[2-3 sentences]
## Problem Statement
[1-2 paragraphs with metrics]
## Proposed Solution
[Explanation + why this over alternatives]
## Pros
[List]
## Cons
[List]
## Implementation Plan
[Phased breakdown with timelines]
## Impact
[Expected outcomes + risks]
## Considerations
[Security, ops, compliance, cost]
## Open Questions
[What do we need input on?]
## References
[Links + citations]

Use this for every RFC. Yes, every one. Consistency matters more than perfection.

Step 3: Keep It Ruthlessly Brief

Here’s my rule: never more than 2 pages. I don’t care if you think yours needs 5 pages. It doesn’t. If something can’t be explained in 2 pages, you don’t understand it well enough yet. This forces discipline. Every sentence has to earn its place. You’ll be amazed how much fluff disappears when you have a hard page limit.

Step 4: Collaborative Editing

Use Google Docs or similar. Comments and suggestions are essential. People are more willing to leave feedback when they can do it asynchronously and you see who said what. Pro tip: After you get feedback, read all the comments in one sitting. Don’t reply immediately. Let them percolate. Often there’s a pattern in the feedback that points to a real problem you missed.

Step 5: Edit, Edit, Edit

Treat RFC writing like code review: tight, concise, clear. Read your draft out loud. Yes, out loud. You’ll catch awkward phrasing you miss when reading silently. Replace:

  • “There may be some potential performance implications” → “Performance may degrade 15% during peak load”
  • “It could be argued that” → Delete it. If you’re not arguing it, why mention it?
  • “We should probably consider implementing” → “We will implement”

Every sentence should do work. Remove anything that doesn’t.

Common Mistakes (And How Not to Make Them)

Mistake 1: No Problem Statement

The RFC jumps straight to solutions. “We should use Kafka.” Why? “Because Kafka is scalable.” Okay, but what problem are we solving? Fix: Start with the problem. Always. Answer these first:

  • What’s broken right now?
  • How do you measure it?
  • What’s the business impact?

Only then talk about solutions.

Mistake 2: False Neutrality

The author presents three options as equally viable when they actually prefer one. Readers spend time evaluating options fairly, then discover you already decided. Fix: Be direct. “We considered Kafka, RabbitMQ, and Redis. We chose Redis because [concrete reasons]. We rejected Kafka because [specific tradeoffs].”

Mistake 3: Pink Yak Problem

The RFC solves a problem nobody cares about. Great technical proposal, wrong priority. Fix: Before writing, get one stakeholder to say “Yes, this is worth doing.” Not a meeting. Not a Slack message. A conversation where they confirm the problem exists and matters.

Mistake 4: Too Much Detail, Too Little Context

The RFC goes deep into implementation before explaining why the approach matters. Readers get lost in trees, miss the forest. Fix: Structure it as context → problem → solution → details. This order is not optional.

Mistake 5: Avoiding Cons

The RFC has a long “Pros” section and a short “Cons” section. Readers don’t trust it because every real solution has real tradeoffs. Fix: Cons should be substantive. “Adds complexity” is weak. “Adds two new services that need monitoring, on-call coverage, and SRE team support, increasing baseline costs by $10K/month” is honest.

The Surprising Outcome: RFCs as Team Alignment Tool

Here’s what I’ve noticed after watching teams do this right: RFCs aren’t just documents. They’re team alignment engines. When you write one, you have to think through your solution completely. Your team has to engage with the problem, not just react to code after it’s written. Stakeholders understand what’s happening before engineers are deep in implementation. This is why teams that use RFCs well tend to have fewer surprises, fewer “why did we build it this way?” conversations, and less rework. The catch: it only works if people actually read and engage with them. Which brings us back to the beginning. Write RFCs your team will read by respecting their time, being clear about what you want, and being honest about tradeoffs. Do that, and they’ll read every one.

Your Turn

Start here:

  1. Pick one decision coming up in your team
  2. Write a problem statement (3-5 sentences, share with one person)
  3. Draft the full RFC using the template (target: 2 pages)
  4. Get feedback before decisions are made
  5. Archive it so future you can reference it

That’s it. The first RFC is always awkward. The fifth one gets smooth. By the tenth, you’ll wonder how you ever made decisions without them. Now go forth and document. Your future team will thank you.