Building HTTP clients might seem straightforward until 3 AM when your service starts hammering a failing external API, burns through your rate limits, and cascades into total meltdown. We’ve all been there. Or maybe you haven’t yet—consider this your friendly warning from someone who has. The difference between a casual HTTP client and a production-grade one often comes down to two deceptively simple concepts: retries and circuit breakers. They’re not glamorous, but they’ll save your bacon when things inevitably go sideways.
Why Your Naive HTTP Client Will Fail You
Let’s be honest. Writing an HTTP client in Go is embarrassingly easy. The standard library practically hands you everything on a silver platter:
resp, err := http.Get("https://api.example.com/data")
Beautiful. Elegant. Completely insufficient for the real world. The moment an external service hiccups—network timeout, temporary overload, transient database connection pool exhaustion—your code fails hard. No retry, no grace period, just immediate failure. And if you’re hitting that service repeatedly in a loop? You’ve just become a denial-of-service attack. This is where resilience patterns come in. They’re the difference between a service that works 95% of the time and one that works 99.99% of the time.
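Here’s what that anti-pattern looks like in its purest form, as a minimal, purely illustrative sketch (the function and URL are placeholders; please don’t ship this):
// fetchWithNaiveRetry hammers the endpoint in a tight loop: no backoff,
// no jitter, no limit on attempts, and no way to stop while the service
// is down.
func fetchWithNaiveRetry(url string) *http.Response {
    for {
        resp, err := http.Get(url)
        if err != nil {
            continue // retry immediately, forever
        }
        return resp
    }
}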
Understanding the Architecture
Before we dive into code, let me paint a picture of what we’re building:
Open?"} Retry{"Retry
Count"} Request["Make HTTP Request"] Success{"Success?"} Backoff["Exponential Backoff"] FailedReturn["Return Error"] SuccessReturn["Return Response"] Client --> CB CB -->|Open| FailedReturn CB -->|Closed| Retry Retry -->|Max Retries| FailedReturn Retry -->|Retries Left| Request Request --> Success Success -->|No| Backoff Backoff --> Retry Success -->|Yes| SuccessReturn
The flow is elegant: check the circuit breaker’s status, attempt the request with retry logic, back off exponentially on failure, and eventually either succeed or fail gracefully. No resource exhaustion, no thundering herd problem.
Building the Foundation
Let’s start with a solid foundation. We’ll create a struct that encapsulates our HTTP client with resilience capabilities:
package httpclient
import (
"context"
"fmt"
"io"
"net/http"
"time"
)
type Config struct {
MaxRetries int
InitialBackoff time.Duration
MaxBackoff time.Duration
Timeout time.Duration
CircuitThreshold int
CircuitTimeout time.Duration
}
type ResilientClient struct {
client *http.Client
config Config
circuitBreaker *CircuitBreaker
}
func NewResilientClient(config Config) *ResilientClient {
if config.MaxRetries == 0 {
config.MaxRetries = 3
}
if config.InitialBackoff == 0 {
config.InitialBackoff = 100 * time.Millisecond
}
if config.MaxBackoff == 0 {
config.MaxBackoff = 30 * time.Second
}
if config.Timeout == 0 {
config.Timeout = 10 * time.Second
}
if config.CircuitThreshold == 0 {
config.CircuitThreshold = 5
}
if config.CircuitTimeout == 0 {
config.CircuitTimeout = 30 * time.Second
}
httpClient := &http.Client{
Timeout: config.Timeout,
}
return &ResilientClient{
client: httpClient,
config: config,
circuitBreaker: NewCircuitBreaker(config.CircuitThreshold, config.CircuitTimeout),
}
}
Notice how we’re providing sensible defaults. Nothing’s worse than realizing halfway through debugging that a zero-value config left you with no retries or no timeout at all.
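If the defaults suit you, construction stays trivial. A quick sketch of what callers might write (the overrides shown are arbitrary):
// Relying entirely on the defaults applied in NewResilientClient:
// 3 retries, 100ms initial backoff, 30s max backoff, 10s request timeout,
// a circuit threshold of 5 failures, and a 30s circuit reset timeout.
client := httpclient.NewResilientClient(httpclient.Config{})

// Or override only the knobs you care about.
client = httpclient.NewResilientClient(httpclient.Config{
    MaxRetries: 5,
    Timeout:    3 * time.Second,
})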
The Circuit Breaker Pattern
The circuit breaker is your safety valve. Think of it like the breaker box in your house—when current flows too heavily, it trips and prevents a fire. In our case, when an external service is melting down, the circuit breaker stops sending it requests.
package httpclient
import (
"fmt"
"sync"
"time"
)
type CircuitBreakerState int
const (
StateClosed CircuitBreakerState = iota
StateOpen
StateHalfOpen
)
type CircuitBreaker struct {
state CircuitBreakerState
failureCount int
lastFailureTime time.Time
threshold int
timeout time.Duration
mu sync.RWMutex
}
func NewCircuitBreaker(threshold int, timeout time.Duration) *CircuitBreaker {
return &CircuitBreaker{
state: StateClosed,
threshold: threshold,
timeout: timeout,
}
}
func (cb *CircuitBreaker) Call(fn func() error) error {
// The lock is held for the duration of fn, which keeps the state machine
// simple but means calls through this breaker are serialized.
cb.mu.Lock()
defer cb.mu.Unlock()
// If circuit is open, check if we should transition to half-open
if cb.state == StateOpen {
if time.Since(cb.lastFailureTime) > cb.timeout {
cb.state = StateHalfOpen
cb.failureCount = 0
} else {
return ErrCircuitOpen
}
}
// Execute the function
err := fn()
if err != nil {
cb.failureCount++
cb.lastFailureTime = time.Now()
if cb.failureCount >= cb.threshold {
cb.state = StateOpen
}
return err
}
// Success - reset the circuit
if cb.state == StateHalfOpen {
cb.state = StateClosed
}
cb.failureCount = 0
return nil
}
func (cb *CircuitBreaker) State() CircuitBreakerState {
cb.mu.RLock()
defer cb.mu.RUnlock()
return cb.state
}
var ErrCircuitOpen = fmt.Errorf("circuit breaker is open")
Here’s where thread safety becomes important. We’re using a mutex because multiple goroutines might be accessing this circuit breaker simultaneously. The state machine has three states:
- Closed: Normal operation, requests pass through
- Open: Service is failing, requests are rejected immediately
- Half-Open: We’re testing if the service has recovered

This prevents your service from continuously hammering a downed external API. When the circuit opens, you save bandwidth and give the other service time to recover.
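To watch the state machine in isolation, here’s a small sketch that drives the breaker with a function that always fails (the threshold and timeout values are arbitrary):
cb := NewCircuitBreaker(3, time.Second)
failing := func() error { return fmt.Errorf("boom") }

for i := 0; i < 4; i++ {
    err := cb.Call(failing)
    fmt.Println(i, cb.State(), err)
}
// The third failure trips the breaker, so the fourth Call returns
// ErrCircuitOpen without invoking failing at all. Once the one-second
// timeout elapses, the next Call moves the breaker to half-open and
// gives the function one more chance.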
Implementing Retry Logic with Exponential Backoff
Retry logic isn’t just “try again.” Hammering a service that’s recovering is like aggressively shaking a vending machine that’s stuck—it just makes things worse. Exponential backoff with jitter is the civilized approach:
package httpclient
import (
"context"
"fmt"
"io"
"math"
"math/rand"
"net/http"
"time"
)
func (rc *ResilientClient) Do(ctx context.Context, req *http.Request) (*http.Response, error) {
var lastErr error
backoff := rc.config.InitialBackoff
for attempt := 0; attempt <= rc.config.MaxRetries; attempt++ {
// Check context cancellation
select {
case <-ctx.Done():
return nil, ctx.Err()
default:
}
// Rewind the request body on retries when possible; http.NewRequestWithContext
// sets GetBody for common in-memory body types like *bytes.Reader
if attempt > 0 && req.GetBody != nil {
body, bodyErr := req.GetBody()
if bodyErr != nil {
return nil, fmt.Errorf("failed to rewind request body: %w", bodyErr)
}
req.Body = body
}
// Attempt the request through the circuit breaker
var resp *http.Response
err := rc.circuitBreaker.Call(func() error {
r, err := rc.client.Do(req)
if err != nil {
lastErr = err
return err
}
// Treat 5xx errors as retriable: drain and close the body so the
// connection can be reused, then report the failure
if r.StatusCode >= 500 {
io.Copy(io.Discard, r.Body)
r.Body.Close()
lastErr = fmt.Errorf("server error: %d", r.StatusCode)
return lastErr
}
// Success: keep the response for the caller instead of issuing a
// second request
resp = r
return nil
})
if err == nil {
return resp, nil
}
if err == ErrCircuitOpen {
return nil, err
}
// Don't retry on last attempt
if attempt == rc.config.MaxRetries {
break
}
// Calculate backoff with jitter
jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
sleepDuration := backoff + jitter
select {
case <-time.After(sleepDuration):
case <-ctx.Done():
return nil, ctx.Err()
}
// Exponential backoff: double each time, capped at max
backoff = time.Duration(math.Min(
float64(backoff*2),
float64(rc.config.MaxBackoff),
))
}
return nil, fmt.Errorf("max retries exceeded: %w", lastErr)
}
Notice the jitter we’re adding? That’s crucial. If you retry with fixed intervals and multiple clients are affected simultaneously, they’ll all retry at the same time in lockstep—a thundering herd that hammers the recovering service. Random jitter spreads the load naturally. Also, we’re respecting the context’s deadline. If the caller has a timeout, we honor it and don’t keep retrying past it.
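If you want to sanity-check the schedule a given configuration produces before wiring it up, a quick sketch like this prints the approximate sleep per attempt without making any requests (the starting values are just examples):
backoff := 100 * time.Millisecond
maxBackoff := 5 * time.Second
for attempt := 0; attempt < 5; attempt++ {
    // Same jitter and doubling rules as Do above
    jitter := time.Duration(rand.Int63n(int64(backoff / 2)))
    fmt.Printf("attempt %d: sleep ~%v\n", attempt, backoff+jitter)
    backoff = time.Duration(math.Min(float64(backoff*2), float64(maxBackoff)))
}
// Roughly 100ms, 200ms, 400ms, 800ms, 1.6s, each plus up to 50% jitter.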
The Complete Production-Ready Client
Let’s assemble everything into a client you’d actually use in production:
package httpclient
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
)
type Response struct {
Status int
Body []byte
Header http.Header
}
func (rc *ResilientClient) Get(ctx context.Context, url string) (*Response, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
return rc.doAndReadBody(ctx, req)
}
func (rc *ResilientClient) Post(ctx context.Context, url string, body []byte, contentType string) (*Response, error) {
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("Content-Type", contentType)
return rc.doAndReadBody(ctx, req)
}
func (rc *ResilientClient) PostJSON(ctx context.Context, url string, payload interface{}) (*Response, error) {
body, err := json.Marshal(payload)
if err != nil {
return nil, fmt.Errorf("failed to marshal JSON: %w", err)
}
return rc.Post(ctx, url, body, "application/json")
}
func (rc *ResilientClient) doAndReadBody(ctx context.Context, req *http.Request) (*Response, error) {
resp, err := rc.Do(ctx, req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("failed to read response body: %w", err)
}
return &Response{
Status: resp.StatusCode,
Body: body,
Header: resp.Header,
}, nil
}
This wraps everything neatly. You’ve got convenience methods for common operations, proper error handling, and a clean API.
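As a quick usage sketch, posting a JSON payload might look like this (the URL, payload, and expected output are placeholders):
payload := map[string]string{"name": "gopher"}
resp, err := client.PostJSON(ctx, "https://api.example.com/widgets", payload)
if err != nil {
    log.Fatalf("post failed: %v", err)
}
fmt.Printf("status: %d, body: %s\n", resp.Status, resp.Body)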
Putting It All Together: A Real Example
Let’s create a practical example that calls a public API with all our resilience features:
package main
import (
"context"
"encoding/json"
"fmt"
"log"
"time"
"yourmodule/httpclient"
)
type GitHubUser struct {
Login string `json:"login"`
Name string `json:"name"`
Followers int `json:"followers"`
}
func main() {
// Configure our resilient client
config := httpclient.Config{
MaxRetries: 3,
InitialBackoff: 100 * time.Millisecond,
MaxBackoff: 5 * time.Second,
Timeout: 10 * time.Second,
CircuitThreshold: 5,
CircuitTimeout: 30 * time.Second,
}
client := httpclient.NewResilientClient(config)
// Create a context with a 15-second deadline
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()
// Make the request
resp, err := client.Get(ctx, "https://api.github.com/users/golang")
if err != nil {
log.Fatalf("Failed to fetch user: %v", err)
}
if resp.Status != 200 {
log.Fatalf("Unexpected status code: %d", resp.Status)
}
// Parse the response
var user GitHubUser
if err := json.Unmarshal(resp.Body, &user); err != nil {
log.Fatalf("Failed to parse response: %v", err)
}
fmt.Printf("User: %s\n", user.Login)
fmt.Printf("Name: %s\n", user.Name)
fmt.Printf("Followers: %d\n", user.Followers)
}
Run this and you’ll see it handles network hiccups, temporary server errors, and rate limiting gracefully. The circuit breaker prevents cascading failures, and the exponential backoff keeps you from hammering the service.
Testing Your Resilient Client
Here’s where the rubber meets the road. Testing HTTP clients traditionally requires mocking, but with a resilient client, we can do something elegant:
package httpclient
import (
"context"
"net/http"
"net/http/httptest"
"testing"
"time"
)
func TestRetryLogic(t *testing.T) {
attemptCount := 0
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
attemptCount++
// Fail the first two attempts
if attemptCount < 3 {
w.WriteHeader(http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status": "ok"}`))
}))
defer server.Close()
config := Config{
MaxRetries: 3,
InitialBackoff: 10 * time.Millisecond,
MaxBackoff: 100 * time.Millisecond,
Timeout: 5 * time.Second,
}
client := NewResilientClient(config)
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
resp, err := client.Get(ctx, server.URL)
if err != nil {
t.Fatalf("Expected success, got error: %v", err)
}
if resp.Status != http.StatusOK {
t.Fatalf("Expected status 200, got %d", resp.Status)
}
if attemptCount != 3 {
t.Fatalf("Expected 3 attempts, got %d", attemptCount)
}
}
func TestCircuitBreakerTrips(t *testing.T) {
failCount := 0
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
failCount++
w.WriteHeader(http.StatusInternalServerError)
}))
defer server.Close()
config := Config{
MaxRetries: 1,
InitialBackoff: 5 * time.Millisecond,
CircuitThreshold: 2,
CircuitTimeout: 100 * time.Millisecond,
}
client := NewResilientClient(config)
ctx := context.Background()
// First call - both attempts fail, tripping the breaker (threshold 2)
client.Get(ctx, server.URL)
// Second call - rejected by the now-open circuit
client.Get(ctx, server.URL)
// Third call - should still be rejected immediately
_, err := client.Get(ctx, server.URL)
if err != ErrCircuitOpen {
t.Fatalf("Expected circuit open error, got: %v", err)
}
}
These tests verify that retries actually happen and that the circuit breaker trips appropriately. Much better than wondering in production.
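One more case worth covering is the context deadline. A sketch along these lines, assuming a deliberately slow handler, checks that the client gives up once the caller’s deadline passes:
func TestContextDeadlineHonored(t *testing.T) {
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(200 * time.Millisecond) // slower than the caller's deadline
        w.WriteHeader(http.StatusOK)
    }))
    defer server.Close()

    client := NewResilientClient(Config{
        MaxRetries:     3,
        InitialBackoff: 10 * time.Millisecond,
    })

    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    if _, err := client.Get(ctx, server.URL); err == nil {
        t.Fatal("expected an error once the context deadline passed")
    }
}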
Configuration Best Practices
You’ve got knobs to turn, but turning them blindly is a recipe for disaster. Here’s what I’ve learned from production experience.

For typical API integrations:
Config{
MaxRetries: 3,
InitialBackoff: 100 * time.Millisecond,
MaxBackoff: 10 * time.Second,
Timeout: 5 * time.Second,
CircuitThreshold: 5,
CircuitTimeout: 30 * time.Second,
}
For critical paths you can’t afford to fail:
Config{
MaxRetries: 5,
InitialBackoff: 50 * time.Millisecond,
MaxBackoff: 30 * time.Second,
Timeout: 30 * time.Second,
CircuitThreshold: 10,
CircuitTimeout: 60 * time.Second,
}
For external services known to be unreliable:
Config{
MaxRetries: 2,
InitialBackoff: 500 * time.Millisecond,
MaxBackoff: 5 * time.Second,
Timeout: 3 * time.Second,
CircuitThreshold: 3,
CircuitTimeout: 15 * time.Second,
}
The key insight: a 5-second total timeout with a 500ms initial backoff means you’ve got room for about 3 retries before you hit the deadline. Don’t configure them independently—they’re interconnected.
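One way to keep them in sync is to compute the worst-case time spent sleeping for a given config and compare it against your callers’ deadlines. A sketch (the helper is hypothetical, not part of the client above):
// worstCaseBackoff sums the sleep between attempts for a given config,
// ignoring jitter and the time the requests themselves take.
func worstCaseBackoff(cfg Config) time.Duration {
    total := time.Duration(0)
    backoff := cfg.InitialBackoff
    for i := 0; i < cfg.MaxRetries; i++ {
        total += backoff
        backoff = time.Duration(math.Min(float64(backoff*2), float64(cfg.MaxBackoff)))
    }
    return total
}
// With MaxRetries 3 and a 500ms initial backoff this comes to
// 500ms + 1s + 2s = 3.5s of sleeping alone, leaving little of a 5-second
// deadline for the requests themselves.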
Advanced Techniques
Once you’ve mastered the basics, consider these additions.

Metrics and Observability:
type Metrics struct {
TotalRequests int64
SuccessfulRequests int64
FailedRequests int64
CircuitOpens int64
RetryCount int64
}
Track these to understand your service’s health. If retry counts spike, something upstream is degrading. If the circuit keeps opening, you might need to adjust thresholds.

Status Code Strategies: Not all non-2xx responses warrant retries. A 400 Bad Request won’t magically become valid if you retry it three times. Only retry on 429 (rate limit), 503 (service unavailable), 504 (gateway timeout), and timeout errors, and handle 400-level errors differently from 500-level ones (see the sketch after this section).

Hedged Requests: For latency-sensitive operations, send two requests in parallel and return whichever completes first. This is advanced territory, but it can reduce p99 latencies dramatically.
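That status-code strategy might boil down to a small predicate like this (a hypothetical helper, not part of the client above) standing in for the blanket >= 500 check in Do:
// shouldRetry reports whether a response status is worth retrying.
// Other 4xx responses indicate a problem with the request itself and
// won't improve on a second attempt.
func shouldRetry(status int) bool {
    switch status {
    case http.StatusTooManyRequests, // 429
        http.StatusServiceUnavailable, // 503
        http.StatusGatewayTimeout: // 504
        return true
    default:
        return false
    }
}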
Common Pitfalls to Avoid
- Memory Leaks from Unclosed Bodies: Every failed request must still read and close the response body so the connection can go back into the pool; forgetting this causes connection pool exhaustion (see the sketch after this list).
- Infinite Retries: Always respect context deadlines. A timeout is your emergency exit hatch.
- Backoff That Outlives the Deadline: If your MaxBackoff is 5 minutes but your callers give up after a few seconds, those later retries never get a chance to run and you’re just bleeding traffic. Test realistic failure scenarios.
- Circuit Breaker Threshold Too High: Setting it to 100 failures means you’re hammering a downed service for far too long. Start conservative: usually 5-10 is right.
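And here’s the drain-and-close idiom from that first pitfall pulled into a tiny helper (a hypothetical convenience, not something the client above defines):
// drainAndClose discards any unread bytes and closes the body so the
// underlying connection can go back into the pool for reuse.
func drainAndClose(body io.ReadCloser) {
    io.Copy(io.Discard, body)
    body.Close()
}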
Wrapping Up
Building a resilient HTTP client isn’t glamorous work. There’s no exciting machine learning or cutting-edge infrastructure involved. But it’s one of those unsexy, fundamental things that separates reliable systems from systems that wake you up at 3 AM. The patterns we’ve built here—retries with exponential backoff, jitter for thundering herd prevention, and the circuit breaker for cascading failure protection—are industry standard for good reason. They work. Your 3 AM self will thank you.
