Building HTTP clients might seem straightforward until 3 AM when your service starts hammering a failing external API, burns through your rate limits, and cascades into total meltdown. We’ve all been there. Or maybe you haven’t yet—consider this your friendly warning from someone who has. The difference between a casual HTTP client and a production-grade one often comes down to two deceptively simple concepts: retries and circuit breakers. They’re not glamorous, but they’ll save your bacon when things inevitably go sideways.
Why Your Naive HTTP Client Will Fail You
Let’s be honest. Writing an HTTP client in Go is embarrassingly easy. The standard library practically hands you everything on a silver platter:
resp, err := http.Get("https://api.example.com/data")
Beautiful. Elegant. Completely insufficient for the real world. The moment an external service hiccups—network timeout, temporary overload, transient database connection pool exhaustion—your code fails hard. No retry, no grace period, just immediate failure. And if you’re hitting that service repeatedly in a loop? You’ve just become a denial-of-service attack. This is where resilience patterns come in. They’re the difference between a service that works 95% of the time and one that works 99.99% of the time.
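Here’s what that anti-pattern looks like in its purest form, as a minimal, purely illustrative sketch (the function and URL are placeholders; please don’t ship this):
// fetchWithNaiveRetry hammers the endpoint in a tight loop: no backoff,
// no jitter, no limit on attempts, and no way to stop while the service
// is down.
func fetchWithNaiveRetry(url string) *http.Response {
    for {
        resp, err := http.Get(url)
        if err != nil {
            continue // retry immediately, forever
        }
        return resp
    }
}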
Understanding the Architecture
Before we dive into code, let me paint a picture of what we’re building:
Open?"} Retry{"Retry
Count"} Request["Make HTTP Request"] Success{"Success?"} Backoff["Exponential Backoff"] FailedReturn["Return Error"] SuccessReturn["Return Response"] Client --> CB CB -->|Open| FailedReturn CB -->|Closed| Retry Retry -->|Max Retries| FailedReturn Retry -->|Retries Left| Request Request --> Success Success -->|No| Backoff Backoff --> Retry Success -->|Yes| SuccessReturn
The flow is elegant: check the circuit breaker’s status, attempt the request with retry logic, back off exponentially on failure, and eventually either succeed or fail gracefully. No resource exhaustion, no thundering herd problem.
Building the Foundation
Let’s start with a solid foundation. We’ll create a struct that encapsulates our HTTP client with resilience capabilities:
package httpclient
import (
"context"
"fmt"
"io"
"net/http"
"time"
)
type Config struct {
MaxRetries int
InitialBackoff time.Duration
MaxBackoff time.Duration
Timeout time.Duration
CircuitThreshold int
CircuitTimeout time.Duration
}
type ResilientClient struct {
client *http.Client
config Config
circuitBreaker *CircuitBreaker
}
func NewResilientClient(config Config) *ResilientClient {
if config.MaxRetries == 0 {
config.MaxRetries = 3
}
if config.InitialBackoff == 0 {
config.InitialBackoff = 100 * time.Millisecond
}
if config.MaxBackoff == 0 {
config.MaxBackoff = 30 * time.Second
}
if config.Timeout == 0 {
config.Timeout = 10 * time.Second
}
if config.CircuitThreshold == 0 {
config.CircuitThreshold = 5
}
if config.CircuitTimeout == 0 {
config.CircuitTimeout = 30 * time.Second
}
httpClient := &http.Client{
Timeout: config.Timeout,
}
return &ResilientClient{
client: httpClient,
config: config,
circuitBreaker: NewCircuitBreaker(config.CircuitThreshold, config.CircuitTimeout),
}
}
Notice how we’re providing sensible defaults. Nothing’s worse than realizing halfway through debugging that a zero-value config left you with no retries or no timeout at all.
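If the defaults suit you, construction stays trivial. A quick sketch of what callers might write (the overrides shown are arbitrary):
// Relying entirely on the defaults applied in NewResilientClient:
// 3 retries, 100ms initial backoff, 30s max backoff, 10s request timeout,
// a circuit threshold of 5 failures, and a 30s circuit reset timeout.
client := httpclient.NewResilientClient(httpclient.Config{})

// Or override only the knobs you care about.
client = httpclient.NewResilientClient(httpclient.Config{
    MaxRetries: 5,
    Timeout:    3 * time.Second,
})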
The Circuit Breaker Pattern
The circuit breaker is your safety valve. Think of it like the breaker box in your house—when current flows too heavily, it trips and prevents a fire. In our case, when an external service is melting down, the circuit breaker stops sending it requests.
package httpclient
import (
"fmt"
"sync"
"time"
)
type CircuitBreakerState int
const (
StateClosed CircuitBreakerState = iota
StateOpen
StateHalfOpen
)
type CircuitBreaker struct {
state CircuitBreakerState
failureCount int
lastFailureTime time.Time
threshold int
timeout time.Duration
mu sync.RWMutex
}
func NewCircuitBreaker(threshold int, timeout time.Duration) *CircuitBreaker {
return &CircuitBreaker{
state: StateClosed,
threshold: threshold,
timeout: timeout,
}
}
func (cb *CircuitBreaker) Call(fn func() error) error {
// The lock is held for the duration of fn, which keeps the state machine
// simple but means calls through this breaker are serialized.
cb.mu.Lock()
defer cb.mu.Unlock()
// If circuit is open, check if we should transition to half-open
if cb.state == StateOpen {
if time.Since(cb.lastFailureTime) > cb.timeout {
cb.state = StateHalfOpen
cb.failureCount = 0
} else {
return ErrCircuitOpen
}
}
// Execute the function
err := fn()
if err != nil {
cb.failureCount++
cb.lastFailureTime = time.Now()
if cb.failureCount >= cb.threshold {
cb.state = StateOpen
}
return err
}
// Success - reset the circuit
if cb.state == StateHalfOpen {
cb.state = StateClosed
}
cb.failureCount = 0
return nil
}
func (cb *CircuitBreaker) State() CircuitBreakerState {
cb.mu.RLock()
defer cb.mu.RUnlock()
return cb.state
}
var ErrCircuitOpen = fmt.Errorf("circuit breaker is open")
Here’s where thread safety becomes important. We’re using a mutex because multiple goroutines might be accessing this circuit breaker simultaneously. The state machine has three states:
- Closed: Normal operation, requests pass through
- Open: Service is failing, requests are rejected immediately
- Half-Open: We’re testing if the service has recovered

This prevents your service from continuously hammering a downed external API. When the circuit opens, you save bandwidth and give the other service time to recover.
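To watch the state machine in isolation, here’s a small sketch that drives the breaker with a function that always fails (the threshold and timeout values are arbitrary):
cb := NewCircuitBreaker(3, time.Second)
failing := func() error { return fmt.Errorf("boom") }

for i := 0; i < 4; i++ {
    err := cb.Call(failing)
    fmt.Println(i, cb.State(), err)
}
// The third failure trips the breaker, so the fourth Call returns
// ErrCircuitOpen without invoking failing at all. Once the one-second
// timeout elapses, the next Call moves the breaker to half-open and
// gives the function one more chance.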
Implementing Retry Logic with Exponential Backoff
Retry logic isn’t just “try again.” Hammering a service that’s recovering is like aggressively shaking a vending machine that’s stuck—it just makes things worse. Exponential backoff with jitter is the civilized approach:
package httpclient
import (
"context"
"fmt"
"io"
"math"
"math/rand"
"net/http"
"time"
)
func (rc *ResilientClient) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
var lastErr error
backoff := rc.config.InitialBackoff
for attempt := 0; attempt <= rc.config.MaxRetries; attempt++ {
// Check context cancellation
select {
case <-ctx.Done():
return nil, ctx.Err()
default:
}
// Rewind the request body on retries when possible; http.NewRequestWithContext
// sets GetBody for common in-memory body types like *bytes.Reader
if attempt > 0 && req.GetBody != nil {
body, bodyErr := req.GetBody()
if bodyErr != nil {
return nil, fmt.Errorf("failed to rewind request body: %w", bodyErr)
}
req.Body = body
}
// Attempt the request through the circuit breaker
var resp *http.Response
err := rc.circuitBreaker.Call(func() error {
r, err := rc.client.Do(req)
if err != nil {
lastErr = err
return err
}
// Treat 5xx errors as retriable: drain and close the body so the
// connection can be reused, then report the failure
if r.StatusCode >= 500 {
io.Copy(io.Discard, r.Body)
r.Body.Close()
lastErr = fmt.Errorf("server error: %d", r.StatusCode)
return lastErr
}
// Success: keep the response for the caller instead of issuing a
// second request
resp = r
return nil
})
if err == nil {
return resp, nil
}
if err == ErrCircuitOpen {
return nil, err
}
// Don't retry on last attempt
if attempt == rc.config.MaxRetries {
break
}
// Calculate backoff with jitter
jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
sleepDuration := backoff + jitter
select {
case <-time.After(sleepDuration):
case <-ctx.Done():
return nil, ctx.Err()
}
// Exponential backoff: double each time, capped at max
backoff = time.Duration(math.Min(
float64(backoff*2),
float64(rc.config.MaxBackoff),
))
}
return nil, fmt.Errorf("max retries exceeded: %w", lastErr)
}
Notice the jitter we’re adding? That’s crucial. If you retry with fixed intervals and multiple clients are affected simultaneously, they’ll all retry at the same time in lockstep—a thundering herd that hammers the recovering service. Random jitter spreads the load naturally. Also, we’re respecting the context’s deadline. If the caller has a timeout, we honor it and don’t keep retrying past it.
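If you want to sanity-check the schedule a given configuration produces before wiring it up, a quick sketch like this prints the approximate sleep per attempt without making any requests (the starting values are just examples):
backoff := 100 * time.Millisecond
maxBackoff := 5 * time.Second
for attempt := 0; attempt < 5; attempt++ {
    // Same jitter and doubling rules as Do above
    jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
    fmt.Printf("attempt %d: sleep ~%v\n", attempt, backoff+jitter)
    backoff = time.Duration(math.Min(float64(backoff*2), float64(maxBackoff)))
}
// Roughly 100ms, 200ms, 400ms, 800ms, 1.6s, each plus up to 50% jitter.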
The Complete Production-Ready Client
Let’s assemble everything into a client you’d actually use in production:
package httpclient
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
)
type Response struct {
Status int
Body []byte
Header http.Header
}
func (rc *ResilientClient) Get(ctx context.Context, url string) (*Response, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
return rc.doAndReadBody(ctx, req)
}
func (rc *ResilientClient) Post(ctx context.Context, url string, body []byte, contentType string) (*Response, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("Content-Type", contentType)
return rc.doAndReadBody(ctx, req)
}
func (rc *ResilientClient) PostJSON(ctx context.Context, url string, payload interface{}) (*Response, error) {
body, err := json.Marshal(payload)
if err != nil {
return nil, fmt.Errorf("failed to marshal JSON: %w", err)
}
return rc.Post(ctx, url, body, "application/json")
}
func (rc *ResilientClient) doAndReadBody(ctx context.Context, req *http.Request) (*Response, error) {
resp, err := rc.Do(ctx, req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("failed to read response body: %w", err)
}
return &Response{
Status: resp.StatusCode,
Body: body,
Header: resp.Header,
}, nil
}
This wraps everything neatly. You’ve got convenience methods for common operations, proper error handling, and a clean API.
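As a quick usage sketch, posting a JSON payload might look like this (the URL, payload, and expected output are placeholders):
payload := map[string]string{"name": "gopher"}
resp, err := client.PostJSON(ctx, "https://api.example.com/widgets", payload)
if err != nil {
    log.Fatalf("post failed: %v", err)
}
fmt.Printf("status: %d, body: %s\n", resp.Status, resp.Body)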
Putting It All Together: A Real Example
Let’s create a practical example that calls a public API with all our resilience features:
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"time"
"yourmodule/httpclient"
)
type GitHubUser struct {
Login string `json:"login"`
Name string `json:"name"`
Followers int `json:"followers"`
}
func main() {
// Configure our resilient client
config := httpclient.Config{
MaxRetries: 3,
InitialBackoff: 100 * time.Millisecond,
MaxBackoff: 5 * time.Second,
Timeout: 10 * time.Second,
CircuitThreshold: 5,
CircuitTimeout: 30 * time.Second,
}
client := httpclient.NewResilientClient(config)
// Create a context with a 15-second deadline
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
// Make the request
resp, err := client.Get(ctx, "https://api.github.com/users/golang")
if err != nil {
log.Fatalf("Failed to fetch user: %v", err)
}
if resp.Status != 200 {
log.Fatalf("Unexpected status code: %d", resp.Status)
}
// Parse the response
var user GitHubUser
if err := json.Unmarshal(resp.Body, &user); err != nil {
log.Fatalf("Failed to parse response: %v", err)
}
fmt.Printf("User: %s\n", user.Login)
fmt.Printf("Name: %s\n", user.Name)
fmt.Printf("Followers: %d\n", user.Followers)
}
Run this and you’ll see it handles network hiccups, temporary server errors, and rate limiting gracefully. The circuit breaker prevents cascading failures, and the exponential backoff keeps you from hammering the service.
Testing Your Resilient Client
Here’s where the rubber meets the road. Testing HTTP clients traditionally requires mocking, but with a resilient client, we can do something elegant:
package httpclient
import (
"context"
"net/http"
"net/http/httptest"
"testing"
"time"
)
func TestRetryLogic(t *testing.T) {
attemptCount := 0
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
attemptCount++
// Fail the first two attempts
if attemptCount < 3 {
w.WriteHeader(http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status": "ok"}`))
}))
defer server.Close()
config := Config{
MaxRetries: 3,
InitialBackoff: 10 * time.Millisecond,
MaxBackoff: 100 * time.Millisecond,
Timeout: 5 * time.Second,
}
client := NewResilientClient(config)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
resp, err := client.Get(ctx, server.URL)
if err != nil {
t.Fatalf("Expected success, got error: %v", err)
}
if resp.Status != http.StatusOK {
t.Fatalf("Expected status 200, got %d", resp.Status)
}
if attemptCount != 3 {
t.Fatalf("Expected 3 attempts, got %d", attemptCount)
}
}
func TestCircuitBreakerTrips(t *testing.T) {
failCount := 0
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
failCount++
w.WriteHeader(http.StatusInternalServerError)
}))
defer server.Close()
config := Config{
MaxRetries: 1,
InitialBackoff: 5 * time.Millisecond,
CircuitThreshold: 2,
CircuitTimeout: 100 * time.Millisecond,
}
client := NewResilientClient(config)
ctx := context.Background()
// First call - both attempts fail, tripping the breaker (threshold 2)
client.Get(ctx, server.URL)
// Second call - rejected by the now-open circuit
client.Get(ctx, server.URL)
// Third call - should still be rejected immediately
_, err := client.Get(ctx, server.URL)
if err != ErrCircuitOpen {
t.Fatalf("Expected circuit open error, got: %v", err)
}
}
These tests verify that retries actually happen and that the circuit breaker trips appropriately. Much better than wondering in production.
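One more case worth covering is the context deadline. A sketch along these lines, assuming a deliberately slow handler, checks that the client gives up once the caller’s deadline passes:
func TestContextDeadlineHonored(t *testing.T) {
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(200 * time.Millisecond) // slower than the caller's deadline
        w.WriteHeader(http.StatusOK)
    }))
    defer server.Close()

    client := NewResilientClient(Config{
        MaxRetries:     3,
        InitialBackoff: 10 * time.Millisecond,
    })

    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    if _, err := client.Get(ctx, server.URL); err == nil {
        t.Fatal("expected an error once the context deadline passed")
    }
}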
Configuration Best Practices
You’ve got knobs to turn, but turning them blindly is a recipe for disaster. Here’s what I’ve learned from production experience.

For typical API integrations:
Config{
MaxRetries: 3,
InitialBackoff: 100 * time.Millisecond,
MaxBackoff: 10 * time.Second,
Timeout: 5 * time.Second,
CircuitThreshold: 5,
CircuitTimeout: 30 * time.Second,
}
For critical paths you can’t afford to fail:
Config{
MaxRetries: 5,
InitialBackoff: 50 * time.Millisecond,
MaxBackoff: 30 * time.Second,
Timeout: 30 * time.Second,
CircuitThreshold: 10,
CircuitTimeout: 60 * time.Second,
}
For external services known to be unreliable:
Config{
MaxRetries: 2,
InitialBackoff: 500 * time.Millisecond,
MaxBackoff: 5 * time.Second,
Timeout: 3 * time.Second,
CircuitThreshold: 3,
CircuitTimeout: 15 * time.Second,
}
The key insight: a 5-second total timeout with a 500ms initial backoff means you’ve got room for about 3 retries before you hit the deadline. Don’t configure them independently—they’re interconnected.
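One way to keep them in sync is to compute the worst-case time spent sleeping for a given config and compare it against your callers’ deadlines. A sketch (the helper is hypothetical, not part of the client above):
// worstCaseBackoff sums the sleep between attempts for a given config,
// ignoring jitter and the time the requests themselves take.
func worstCaseBackoff(cfg Config) time.Duration {
    total := time.Duration(0)
    backoff := cfg.InitialBackoff
    for i := 0; i < cfg.MaxRetries; i++ {
        total += backoff
        backoff = time.Duration(math.Min(float64(backoff*2), float64(cfg.MaxBackoff)))
    }
    return total
}
// With MaxRetries 3 and a 500ms initial backoff this comes to
// 500ms + 1s + 2s = 3.5s of sleeping alone, leaving little of a 5-second
// deadline for the requests themselves.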
Advanced Techniques
Once you’ve mastered the basics, consider these additions.

Metrics and Observability:
type Metrics struct {
TotalRequests int64
SuccessfulRequests int64
FailedRequests int64
CircuitOpens int64
RetryCount int64
}
Track these to understand your service’s health. If retry counts spike, something upstream is degrading. If the circuit keeps opening, you might need to adjust thresholds.

Status Code Strategies: Not all non-2xx responses warrant retries. A 400 Bad Request won’t magically become valid if you retry it three times. Only retry on 429 (rate limit), 503 (service unavailable), 504 (gateway timeout), and timeout errors, and handle 400-level errors differently from 500-level ones (see the sketch after this section).

Hedged Requests: For latency-sensitive operations, send two requests in parallel and return whichever completes first. This is advanced territory, but it can reduce p99 latencies dramatically.
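That status-code strategy might boil down to a small predicate like this (a hypothetical helper, not part of the client above) standing in for the blanket >= 500 check in Do:
// shouldRetry reports whether a response status is worth retrying.
// Other 4xx responses indicate a problem with the request itself and
// won't improve on a second attempt.
func shouldRetry(status int) bool {
    switch status {
    case http.StatusTooManyRequests, // 429
        http.StatusServiceUnavailable, // 503
        http.StatusGatewayTimeout: // 504
        return true
    default:
        return false
    }
}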
Common Pitfalls to Avoid
- Memory Leaks from Unclosed Bodies: Every failed request must still read and close the response body so the connection can go back into the pool; forgetting this causes connection pool exhaustion (see the sketch after this list).
- Infinite Retries: Always respect context deadlines. A timeout is your emergency exit hatch.
- Backoff That Outlives the Deadline: If your MaxBackoff is 5 minutes but your callers give up after a few seconds, those later retries never get a chance to run and you’re just bleeding traffic. Test realistic failure scenarios.
- Circuit Breaker Threshold Too High: Setting it to 100 failures means you’re hammering a downed service for far too long. Start conservative: usually 5-10 is right.
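And here’s the drain-and-close idiom from that first pitfall pulled into a tiny helper (a hypothetical convenience, not something the client above defines):
// drainAndClose discards any unread bytes and closes the body so the
// underlying connection can go back into the pool for reuse.
func drainAndClose(body io.ReadCloser) {
    io.Copy(io.Discard, body)
    body.Close()
}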
Wrapping Up
Building a resilient HTTP client isn’t glamorous work. There’s no exciting machine learning or cutting-edge infrastructure involved. But it’s one of those unsexy, fundamental things that separates reliable systems from systems that wake you up at 3 AM. The patterns we’ve built here—retries with exponential backoff, jitter for thundering herd prevention, and the circuit breaker for cascading failure protection—are industry standard for good reason. They work. Your 3 AM self will thank you.
