Why Your Systems Keep Breaking (And How Erlang Actually Fixes It)

Let me start with something most developers experience at 2 AM: a production system failing because one small component crashed. You’ve probably added try-catch blocks everywhere, added retry logic that somehow made things worse, and created defensive code so convoluted that nobody dares touch it. Then you hear about Erlang, and someone casually mentions “letting it crash” as if that’s a feature, not a nightmare. Spoiler: they’re not insane. There’s actually brilliant philosophy hiding behind what sounds like career suicide.

Erlang was born in a telecommunications company where systems needed to run for years without stopping. Not “mostly without stopping.” Not “with scheduled maintenance windows.” Actually non-stop. That requirement forces you to think about failure differently—not as something to prevent, but as something to expect, isolate, and recover from.

The Mindset Shift: From Prevention to Resilience

Here’s the fundamental difference between traditional programming and Erlang: Most languages ask: “How do I prevent errors?” Erlang asks: “How do I handle errors when they inevitably occur?” This isn’t semantic hair-splitting. It changes architecture fundamentally. In traditional systems, you wrap everything in defensive shields. In Erlang, you design systems where failures in one part don’t cascade through everything else. The secret ingredient? Erlang was designed from the ground up for concurrency, distribution, and fault tolerance. It’s not something bolted on afterward. It’s in the DNA.

Meet BEAM: The Virtual Machine That Actually Gets Concurrency

Before diving into code, let’s talk about what makes Erlang tick: the Erlang Runtime System, specifically BEAM (Bogdan/Björn’s Erlang Abstract Machine). Unlike most virtual machines that treat concurrency like an afterthought, BEAM is obsessed with it:

  • Lightweight processes: You can spawn hundreds of thousands of them without breaking a sweat. These aren’t OS threads—they’re incredibly cheap abstractions that BEAM schedules across available CPU cores
  • Process isolation: When one process dies, it doesn’t corrupt others’ memory or state
  • Smart scheduling: BEAM tracks how much CPU time each process consumes (measured in “reductions”) and fairly distributes execution time
  • Built-in message passing: Processes communicate through messages, not shared memory, eliminating entire categories of concurrency bugs

Think of it this way: BEAM is like having an incredibly smart conductor managing thousands of musicians. When one musician plays a wrong note, the conductor doesn’t shut down the entire orchestra—they just note what happened and move on.
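To make “lightweight” concrete, here’s a minimal sketch (the module name spawn_demo is mine) that spawns 100,000 processes, each sending a single message back to its parent. On typical hardware this finishes in well under a second:

-module(spawn_demo).
-export([run/0]).
%% Spawn N processes; each sends one message to the parent and exits.
run() ->
    Parent = self(),
    N = 100000,
    [spawn(fun() -> Parent ! done end) || _ <- lists:seq(1, N)],
    collect(N).
%% Block until all N messages have arrived.
collect(0) -> ok;
collect(N) ->
    receive
        done -> collect(N - 1)
    end.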

The “Let it Crash” Philosophy: It’s Not Negligence, It’s a Feature

This is the concept that sounds insane until you understand it. The recommended way of programming in Erlang is to let failing processes crash and have other processes detect and fix them. The philosophy is:

Instead of trying to save a situation that may not be salvageable, Erlang follows “Let it Crash”—terminate cleanly, restart, and log everything for debugging. Why is this brilliant?

  1. Simplicity: Your individual process code doesn’t need defensive logic everywhere. Write it cleanly, assume happy paths, and let failures propagate upward
  2. Debuggability: You get full crash logs instead of half-recovered systems in weird states
  3. Reliability: A clean crash and restart is way more reliable than code desperately trying to patch things up
  4. Recovery: Other processes can catch the crash signal and handle recovery automatically

Here’s a real-world example: Imagine you’re building an IoT system with hundreds of thousands of sensors connected through gateways. Network failures happen constantly. Gateway firmware sometimes bugs out. Sensor data occasionally comes corrupted. Traditional approach: add error handling at every point. Erlang approach: let processes crash when they encounter bad data, have supervisors restart them, log the incident. The system keeps running. The problem is visible in logs. You fix it in the next deployment without the entire system entering a death spiral.
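In code, the Erlang approach is simply to write for the happy path. A sketch (the module name and packet format are made up for illustration):

-module(sensor_parser).
-export([parse/1]).
%% We assume a well-formed packet is exactly <<SensorId:16, Temp:16/signed>>.
%% Anything else fails the pattern match, the process crashes with a clear
%% function_clause reason, and its supervisor restarts it. No defensive
%% branches, no half-parsed state.
parse(<<SensorId:16, Temp:16/signed>>) ->
    {SensorId, Temp}.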

Processes: The Building Block of Erlang Systems

In Erlang, a process is a unit of failure tolerance, not just code organization. This distinction matters enormously. A process is a lightweight thread of execution. You spawn thousands of them. They communicate via message passing. When one dies, others live on. Here’s a trivial example:

-module(echo_server).
-export([start/0, loop/0]).
start() ->
    spawn(?MODULE, loop, []).
loop() ->
    receive
        {echo, From, Message} ->
            From ! {response, Message},
            loop();
        stop ->
            io:format("Echo server stopping~n", [])
    end.

This process waits for messages. When it gets an {echo, From, Message} tuple, it sends back the message. The process runs in its own isolated environment. If something goes wrong inside it, other processes don’t care. You can test this:

1> Server = echo_server:start().
<0.35.0>
2> Server ! {echo, self(), "Hello, Erlang!"}.
{echo,<0.33.0>,"Hello, Erlang!"}
3> receive Response -> Response end.
{response,"Hello, Erlang!"}

Each process has its own message queue. They don’t share memory. No mutexes, no race conditions, no cache coherency nightmares. Just send a message and move on.

Process Linking: Making Failures Visible

Now here’s where it gets powerful. When an Erlang process terminates due to an error, it generates an “exit signal” that is broadcast to all processes in its link set. You link processes like this:

link(Pid)

When the linked process crashes, the current process receives an exit signal. By default, this causes the receiver to exit as well—creating a cascade. Sounds bad? In a supervisory context, it’s actually perfect. A parent process can trap exit signals instead of dying:

-module(fault_tolerant_server).
-export([start/0, init/0]).
start() ->
    spawn(?MODULE, init, []).
init() ->
    % Trap exits: crashes of linked processes arrive as {'EXIT', Pid, Reason}
    % messages instead of killing this process too.
    process_flag(trap_exit, true),
    loop([]).
loop(Workers) ->
    receive
        {spawn_worker, Task} ->
            % spawn_link ties the worker's fate to ours; with trap_exit set,
            % we observe its death rather than share it.
            Pid = spawn_link(fun() -> do_task(Task) end),
            loop([Pid | Workers]);
        {'EXIT', Pid, normal} ->
            io:format("Worker ~w finished normally~n", [Pid]),
            loop(lists:delete(Pid, Workers));
        {'EXIT', Pid, Reason} ->
            io:format("Worker ~w crashed: ~w. Restarting...~n", [Pid, Reason]),
            % Could restart the worker here
            loop(lists:delete(Pid, Workers))
    end.
do_task(Task) ->
    % Some potentially failing operation
    io:format("Executing task: ~w~n", [Task]).

Setting trap_exit to true converts those deadly exit signals into regular messages. Now the parent process can observe failures, make decisions, and recover. This is the foundation of Erlang’s reliability story: failures are not silent. They’re surfaced to someone who can do something about it.
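You can see the mechanics directly in the shell, no module required (pids will differ):

1> process_flag(trap_exit, true).
false
2> spawn_link(fun() -> exit(boom) end).
<0.88.0>
3> flush().
Shell got {'EXIT',<0.88.0>,boom}
ok

The linked process died with reason boom, and because the shell traps exits, the signal arrived as an ordinary message instead of killing the shell.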

Supervision Trees: Building Reliable Hierarchies

Real systems need structure. Enter supervision trees—one of Erlang’s most elegant concepts. A supervisor is a special process that monitors other processes. If a child crashes, the supervisor catches the exit signal and decides what to do: restart it, let it fail permanently, or escalate the issue up the chain. Here’s the pattern:

-module(my_supervisor).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).
start_link() ->
    supervisor:start_link({local, my_sup}, ?MODULE, []).
init([]) ->
    ChildSpecs = [
        #{
            id => worker_1,
            start => {my_worker, start_link, []},
            restart => permanent,
            type => worker
        }
    ],
    {ok, {#{strategy => one_for_one}, ChildSpecs}}.

The strategy => one_for_one means: “If one child dies, restart only that child.” Other strategies include:

  • one_for_all: One child dies, restart all children
  • rest_for_one: One child dies, restart it and all started after it
  • simple_one_for_one: Dynamically spawn children with the same spec

Supervisors can supervise other supervisors, creating hierarchies. At the top sits your application supervisor. Everything below it has a clear recovery policy.
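For completeness, a minimal my_worker that the child spec above could start might look like this (the ping call is just a placeholder):

-module(my_worker).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2]).
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
init([]) ->
    {ok, #{}}.
%% A trivial synchronous call so there is something to talk to.
handle_call(ping, _From, State) ->
    {reply, pong, State}.
handle_cast(_Msg, State) ->
    {noreply, State}.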

Visual Architecture: How It All Fits Together

graph TB
    App["Application Supervisor<br/>(strategy: one_for_one)"]
    WS["WebSocket Handler Supervisor<br/>(strategy: simple_one_for_one)"]
    DB["Database Pool Supervisor<br/>(strategy: one_for_all)"]
    W1["Connection 1"]
    W2["Connection 2"]
    W3["Connection 3"]
    D1["DB Worker 1"]
    D2["DB Worker 2"]
    App --> WS
    App --> DB
    WS --> W1
    WS --> W2
    WS --> W3
    DB --> D1
    DB --> D2
    style App fill:#2E86AB
    style WS fill:#A23B72
    style DB fill:#A23B72
    style W1 fill:#F18F01
    style W2 fill:#F18F01
    style W3 fill:#F18F01
    style D1 fill:#C73E1D
    style D2 fill:#C73E1D

When a WebSocket connection crashes (W2), the WebSocket supervisor restarts just that connection. The database pool and other connections keep running. If the entire database pool fails catastrophically, it restarts all DB workers, but doesn’t touch the WebSocket connections.

A Practical Example: Building a Fault-Tolerant Service

Let’s build something more realistic—a simple rate limiter that survives crashes:

-module(rate_limiter).
-behaviour(gen_server).
-export([start_link/1, check_rate/1]).
-export([init/1, handle_call/3, handle_cast/2]).
% Client API
start_link(MaxRequests) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, MaxRequests, []).
check_rate(Key) ->
    gen_server:call(?MODULE, {check, Key}).
% Server callbacks
init(MaxRequests) ->
    {ok, #{max => MaxRequests, requests => #{}}}.
handle_call({check, Key}, _From, State = #{max := Max, requests := Reqs}) ->
    Current = maps:get(Key, Reqs, 0),
    case Current < Max of
        true ->
            NewReqs = Reqs#{Key => Current + 1},
            {reply, {ok, Current + 1}, State#{requests := NewReqs}};
        false ->
            {reply, {error, rate_limit_exceeded}, State}
    end.
handle_cast(_, State) ->
    {noreply, State}.
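With the module compiled, a quick shell session shows the behavior (the limit of 3 and the key are arbitrary; pids will differ):

1> rate_limiter:start_link(3).
{ok,<0.90.0>}
2> rate_limiter:check_rate(<<"user_42">>).
{ok,1}
3> rate_limiter:check_rate(<<"user_42">>).
{ok,2}
4> rate_limiter:check_rate(<<"user_42">>).
{ok,3}
5> rate_limiter:check_rate(<<"user_42">>).
{error,rate_limit_exceeded}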

Now the supervisor:

-module(rate_limiter_sup).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).
init([]) ->
    ChildSpecs = [
        #{
            id => rate_limiter,
            start => {rate_limiter, start_link, [100]},  % args for start_link/1; 100 is an example limit
            restart => permanent,
            type => worker
        }
    ],
    {ok, {#{strategy => one_for_one, intensity => 5, period => 60}, ChildSpecs}}.

The intensity => 5, period => 60 means: “If this restarts more than 5 times in 60 seconds, give up and escalate to the parent supervisor.” This prevents restart loops when something is fundamentally broken. What happens if the rate limiter process crashes?

  1. The supervisor detects the exit signal
  2. It checks the restart policy (permanent = restart it)
  3. It verifies restart hasn’t happened too many times recently
  4. It spawns a new rate_limiter process
  5. The system continues

The crash is visible in logs, but your application keeps running.
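You can watch this happen from the shell. Assuming the two modules above are compiled, a session might look like this (pids hypothetical):

1> rate_limiter_sup:start_link().
{ok,<0.95.0>}
2> whereis(rate_limiter).
<0.96.0>
3> exit(whereis(rate_limiter), kill).
true
4> whereis(rate_limiter).
<0.98.0>

The name rate_limiter now points at a fresh process: the supervisor restarted it, and callers never needed to know.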

Real-World Use Cases Where Erlang Shines

  • Chat Systems: Projects like ejabberd (an XMPP server written in Erlang, the basis for WhatsApp’s early backend) handle massive chat infrastructure. Thousands of simultaneous connections, any connection can crash without affecting others, and you can deploy fixes without disconnecting users.
  • IoT Gateways: Hundreds of thousands of sensor connections through unreliable cellular networks. Gateways frequently fail due to network issues, buggy firmware, corrupt sensor data. Erlang’s approach of expecting failures and recovering from them is perfect.
  • Telecom Infrastructure: Where this all started. Phone switches that need 99.9999999% uptime. Not coincidentally, where Erlang comes from.
  • Distributed Systems: Multiple nodes communicating, coordinating work, handling node failures. Erlang has distribution built in.

The common thread: systems where one component failing shouldn’t bring everything down.

OTP: The Framework That Makes It Production-Ready

While Erlang the language handles concurrency, OTP (Open Telecom Platform) provides the design principles and reusable components. It includes:

  • gen_server: Generic server behavior (what we used above)
  • gen_statem: State machine behavior
  • supervisor: Process supervision
  • application: Application lifecycle management
  • Standard libraries for logging, configuration, distributed communication

Using these behaviors means your code follows battle-tested patterns. Supervisors automatically restart workers. The framework handles message passing callbacks. You just write the business logic.
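As a sketch of the application behavior on that list, here is the callback module that would sit at the root of such a system (the name myapp_app is hypothetical; it just starts the top-level supervisor from the previous section):

-module(myapp_app).
-behaviour(application).
-export([start/2, stop/1]).
%% Called by OTP when the application starts; must return the pid of
%% the top-level supervisor.
start(_Type, _Args) ->
    rate_limiter_sup:start_link().
%% Called after the application has stopped.
stop(_State) ->
    ok.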

Getting Started: Your First Erlang Project

If you want to try this:

  1. Install Erlang: Most package managers have it. Or download from erlang.org
  2. Learn basic syntax: Functions, pattern matching, list comprehensions
  3. Understand processes: Spawn, message passing, linking
  4. Study supervisors: How to structure reliable systems
  5. Build something small: A simple service with supervision

The learning curve is real—Erlang is different from mainstream languages. But once it clicks, you see systems differently. Failure becomes manageable, not terrifying.

The Honest Truth

Erlang isn’t a silver bullet. It won’t fix bad algorithms. It won’t solve distributed system consensus problems. Some problems are legitimately hard. But here’s what it does do better than almost anything: it makes building systems that survive failures almost boring. You structure your code correctly, write your process logic cleanly, set up supervisors, and the framework handles the rest. Processes crash? Supervisors restart them. Network splits? Erlang’s distribution handles it. Need to deploy a fix? Hot-load code without stopping the system. The goal isn’t to prevent all failures. It’s to make failures survivable and visible. And that’s a goal worth building toward.

Want to build resilient systems? Start with the official Erlang documentation and work through a real project. Chat servers are the classic “hello world” of Erlang fault tolerance. By the time you’ve built one, you’ll understand why companies still rely on this 35-year-old technology for the systems that absolutely cannot fail.