The Lock Dilemma: When sync.Mutex Just Isn’t Enough
You know that feeling when you realize your precious in-process mutex won’t cut it anymore? Yeah, we’ve all been there. Your single-process assumptions worked fine until your system decided to grow up and become distributed. Suddenly, you’ve got multiple services running on different machines, all trying to access the same resource, and your sync.Mutex is sitting there looking confused, because it only locks things within a single process.
Welcome to the world of distributed locks, where coordinating access across multiple machines is less like a Swiss watch and more like herding cats that don’t speak the same language. But don’t worry—ZooKeeper is here to be your linguistic cat-herding specialist.
What Makes Distributed Locks So Special?
Before we dive into the code, let’s talk about what makes distributed locking fundamentally different from its single-machine cousin. When you have multiple services across different machines all wanting exclusive access to the same resource, you need a system that guarantees:
- Mutual Exclusion: Only one process holds the lock at any given time (no race conditions, period)
- Deadlock Prevention: If a process crashes while holding the lock, the lock automatically releases
- Fair Ordering: Services should acquire locks in the order they requested them (or at least have a predictable ordering)
- Fault Tolerance: The locking mechanism survives network partitions and service failures

A local mutex? It’ll break the moment your process dies or your network hiccups. ZooKeeper? It was literally built for this nightmare fuel.
The ZooKeeper Magic: Ephemeral Sequential Nodes
Here’s where things get interesting. ZooKeeper implements distributed locks using a deceptively simple but brilliant mechanism: ephemeral sequential nodes. Let me break this down:
Ephemeral nodes are temporary nodes that automatically disappear when the session of the client that created them ends. This solves the “crashed process holding a lock forever” problem elegantly. No need for timeouts or cleanup jobs: the lock just vanishes.
Sequential nodes are nodes whose names get an automatically incremented suffix. When you create /locks/lock-, ZooKeeper hands back something like /locks/lock-0000000000; the next client gets /locks/lock-0000000001, and so on (the suffix is a zero-padded, monotonically increasing counter). This creates a natural queue.
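To see this with your own eyes, here’s a tiny standalone sketch (it assumes a ZooKeeper server on localhost:2181 and uses the Go client we install in the next section) that creates one ephemeral sequential node and prints the name ZooKeeper hands back:
package main

import (
    "fmt"
    "log"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

func main() {
    // Assumes a ZooKeeper server is reachable on localhost:2181.
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
    if err != nil {
        log.Fatalf("connect: %v", err)
    }
    defer conn.Close()

    // Make sure the parent znode exists (a plain, persistent node).
    _, err = conn.Create("/locks", []byte{}, 0, zk.WorldACL(zk.PermAll))
    if err != nil && err != zk.ErrNodeExists {
        log.Fatalf("create parent: %v", err)
    }

    // Ephemeral + sequential: ZooKeeper appends the counter to the name and
    // deletes the node automatically when this session ends.
    path, err := conn.Create("/locks/lock-", []byte{},
        zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
    if err != nil {
        log.Fatalf("create lock node: %v", err)
    }
    fmt.Println(path) // e.g. /locks/lock-0000000000
}
Each new node gets a strictly larger suffix, even after older nodes are deleted, which is exactly what makes the queue ordering reliable.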
The protocol is deceptively elegant:
- Client A creates /locks/lock- → gets /locks/lock-0000000000 and acquires the lock
- Client B creates /locks/lock- → gets /locks/lock-0000000001 and… waits
- Client C creates /locks/lock- → gets /locks/lock-0000000002 and waits even more patiently
- When Client A releases the lock, Client B is automatically notified and acquires it
- When Client B finishes, Client C gets its turn

It’s like a perfectly orchestrated restaurant queue system, except your restaurant is distributed across multiple data centers.
Architecture Overview
Before we write actual code, let’s visualize how this whole system works:
Client A → ZK: create /locks/lock-        ZK → A: lock-0000000000   [A: ACQUIRED]
Client B → ZK: create /locks/lock-        ZK → B: lock-0000000001   [B: WAITING]
Client C → ZK: create /locks/lock-        ZK → C: lock-0000000002   [C: WAITING]
Client A → ZK: delete its node (release)  ZK → B: watch fires       [B: ACQUIRED]
Client B → ZK: delete its node (release)  ZK → C: watch fires       [C: ACQUIRED]
Setting Up Your Go Project
Let’s get our hands dirty. First, create a new Go project and grab the ZooKeeper client:
go get github.com/samuel/go-zookeeper/zk
This has long been the de facto standard Go client for ZooKeeper. It’s battle-tested and used in production systems worldwide. (The original samuel repository is no longer actively maintained; the community fork at github.com/go-zookeeper/zk exposes essentially the same API if you want something that still gets updates.) The alternatives? Let’s just say you’ll thank me for recommending this one.
Building Your First Distributed Locker
Here’s a solid implementation that handles the edge cases and error conditions that’ll inevitably bite you in production:
package lock
import (
    "fmt"
    "sort"
    "strings"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

type DistributedLocker struct {
    conn      *zk.Conn
    basePath  string
    lockPath  string
    timeout   time.Duration
    sessionID string
}

// NewDistributedLocker creates a new distributed locker instance
func NewDistributedLocker(hosts []string, basePath string,
    lockTimeout time.Duration) (*DistributedLocker, error) {
    // Connect to ZooKeeper ensemble
    conn, _, err := zk.Connect(hosts, 10*time.Second)
    if err != nil {
        return nil, fmt.Errorf("failed to connect to ZooKeeper: %w", err)
    }
    // Ensure the base path exists, creating each parent as a plain persistent
    // znode if needed - these nodes are just infrastructure
    nodePath := ""
    for _, part := range strings.Split(strings.Trim(basePath, "/"), "/") {
        nodePath += "/" + part
        _, err := conn.Create(nodePath, []byte{}, 0, zk.WorldACL(zk.PermAll))
        if err != nil && err != zk.ErrNodeExists {
            conn.Close()
            return nil, fmt.Errorf("failed to create lock path %q: %w", nodePath, err)
        }
    }
    return &DistributedLocker{
        conn:      conn,
        basePath:  basePath,
        timeout:   lockTimeout,
        sessionID: fmt.Sprintf("%d", time.Now().UnixNano()),
    }, nil
}

// Lock attempts to acquire the distributed lock
func (dl *DistributedLocker) Lock() error {
    // Create our ephemeral sequential node.
    // zk.FlagEphemeral | zk.FlagSequence makes ZooKeeper append a monotonically
    // increasing counter to the name and delete the node automatically when
    // our session ends.
    lockNode := fmt.Sprintf("%s/lock-", dl.basePath)
    createdPath, err := dl.conn.Create(lockNode, []byte(dl.sessionID),
        zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
    if err != nil {
        return fmt.Errorf("failed to create lock node: %w", err)
    }
    dl.lockPath = createdPath
    // Now we need to check if we have the lock or wait for it
    for {
        // Get all lock nodes and sort them
        children, _, err := dl.conn.Children(dl.basePath)
        if err != nil {
            return fmt.Errorf("failed to get children: %w", err)
        }
        sort.Strings(children)
        // Extract just the node name from our full path
        ourNodeName := createdPath[len(dl.basePath)+1:]
        // Find our position
        ourIndex := -1
        for i, child := range children {
            if child == ourNodeName {
                ourIndex = i
                break
            }
        }
        if ourIndex == -1 {
            return fmt.Errorf("our lock node disappeared mysteriously (bad sign)")
        }
        // If we're first, we have the lock!
        if ourIndex == 0 {
            return nil
        }
        // Otherwise, watch the node ahead of us
        // If it disappears, we'll be notified and try again
        nodeBefore := children[ourIndex-1]
        watchPath := fmt.Sprintf("%s/%s", dl.basePath, nodeBefore)
        exists, _, eventChan, err := dl.conn.ExistsW(watchPath)
        if err != nil {
            return fmt.Errorf("failed to set watch: %w", err)
        }
        if !exists {
            // The node we were watching is already gone, retry immediately
            continue
        }
        // Wait for either a deletion event or timeout
        select {
        case event := <-eventChan:
            if event.Type == zk.EventNodeDeleted {
                // The node ahead of us was deleted, try to acquire the lock
                continue
            }
        case <-time.After(dl.timeout):
            // Clean up our own node on timeout so we don't block others
            dl.conn.Delete(dl.lockPath, -1)
            dl.lockPath = ""
            return fmt.Errorf("lock acquisition timeout")
        }
    }
}

// Unlock releases the distributed lock
func (dl *DistributedLocker) Unlock() error {
    if dl.lockPath == "" {
        return fmt.Errorf("lock not acquired")
    }
    err := dl.conn.Delete(dl.lockPath, -1)
    if err != nil && err != zk.ErrNoNode {
        return fmt.Errorf("failed to delete lock node: %w", err)
    }
    dl.lockPath = ""
    return nil
}

// Close closes the ZooKeeper connection
func (dl *DistributedLocker) Close() error {
    if dl.lockPath != "" {
        dl.conn.Delete(dl.lockPath, -1)
    }
    dl.conn.Close()
    return nil
}
Real-World Usage: Making It All Work Together
Here’s how you’d actually use this in your application. Notice how it looks almost identical to using a regular sync.Mutex—except it works across the entire internet:
package main
import (
    "fmt"
    "log"
    "time"

    "yourmodule/lock"
)

func main() {
    // Configuration
    zkHosts := []string{"localhost:2181"}
    basePath := "/my-app/locks"
    lockTimeout := 30 * time.Second
    // Create locker
    locker, err := lock.NewDistributedLocker(zkHosts, basePath, lockTimeout)
    if err != nil {
        log.Fatalf("Failed to create locker: %v", err)
    }
    defer locker.Close()
    // Use it like a regular mutex
    if err := locker.Lock(); err != nil {
        log.Fatalf("Failed to acquire lock: %v", err)
    }
    defer locker.Unlock()
    // Critical section - only one service in here at a time
    fmt.Println("Lock acquired! Doing critical work...")
    criticalOperation()
    fmt.Println("Work complete, releasing lock")
}

func criticalOperation() {
    // Simulate some work that needs to be exclusively accessed
    time.Sleep(2 * time.Second)
}
The Gotchas and How to Avoid Them
After working with distributed locks in production (and trust me, production finds every edge case), here are the real gotchas:
1. Session Loss is Your Enemy
If your client loses connection to ZooKeeper for longer than the session timeout, it’s game over. Your ephemeral node gets deleted, and the lock is released, but your process might not know it yet. This is why you should always watch the session event channel the client gives you:
// zk.Connect returns a session event channel alongside the connection.
// Our constructor discards it with `_`; to monitor session health in a real
// deployment, keep that channel around (for example, store it on the locker)
// and watch it for disconnects and expirations:
conn, events, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
if err != nil {
    log.Fatalf("Failed to connect: %v", err)
}
defer conn.Close()
go func() {
    for event := range events {
        switch event.State {
        case zk.StateDisconnected:
            log.Println("Lost connection to ZooKeeper!")
            // Handle reconnection or graceful shutdown
        case zk.StateExpired:
            log.Println("Session expired - our ephemeral lock node is gone!")
        }
    }
}()
2. Lock Timeout Isn’t a Real Timeout
The lockTimeout parameter we used isn’t a time limit on how long you can hold the lock. ZooKeeper doesn’t care if you hold it forever. It’s the timeout for acquiring the lock. This is actually good—it prevents the lock-holder from being interrupted. But be disciplined about releasing locks:
// Good pattern: defer ensures release even on panic
if err := locker.Lock(); err != nil {
    return err
}
defer locker.Unlock()
// Your critical section here
3. Network Delays Add Up
Creating a sequential node, getting the child list, setting a watch, and waiting for events—all of this involves network round trips. On my laptop with ZooKeeper on localhost, acquiring a lock takes maybe 5-10ms. Across continents? Could be 500ms or more. Design with this in mind.
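If you want a feel for the cost in your own environment, a quick-and-dirty measurement like the sketch below (reusing the locker from the usage example, and entirely unscientific) will tell you more than any blog post:
// Rough latency check: time a handful of uncontended lock/unlock cycles.
// Assumes `locker` was built as in the usage example above.
const rounds = 20
var total time.Duration
for i := 0; i < rounds; i++ {
    start := time.Now()
    if err := locker.Lock(); err != nil {
        log.Fatalf("lock: %v", err)
    }
    total += time.Since(start)
    if err := locker.Unlock(); err != nil {
        log.Fatalf("unlock: %v", err)
    }
}
fmt.Printf("average acquisition: %v\n", total/rounds)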
4. Thundering Herd Problem
If you have 1000 services all waiting on a lock and they all watch the same znode, every release wakes up all 1000 of them; they check whether they now hold the lock (spoiler: 999 don’t) and go back to sleep. The recipe we implemented avoids this: each waiter watches only the node immediately ahead of it, so a release notifies exactly one client. Understanding this helps you design better systems.
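For contrast, here’s a sketch of the naive approach that does cause a herd: every waiter watches the whole lock directory with ChildrenW and re-checks on any change. The conn and ourNodeName variables stand in for what our locker already tracks; don’t ship this, it’s only here to show what the predecessor watch saves you from:
// Anti-pattern sketch: every waiter watches the parent node, so every child
// change (including each release) wakes every single waiter at once.
for {
    children, _, eventChan, err := conn.ChildrenW("/locks")
    if err != nil {
        return err
    }
    sort.Strings(children)
    if len(children) > 0 && children[0] == ourNodeName {
        return nil // lowest node: we hold the lock
    }
    <-eventChan // all 1000 waiters wake up here on every change
}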
Advanced: Shared Locks (Read-Write Locks)
Sometimes you want multiple readers but only one writer. ZooKeeper supports this with a slight variation:
// For read locks, create:  /locks/read-0000000001
// For write locks, create: /locks/write-0000000001
// Both kinds share the parent's sequence counter, so ordering is by that number:
// - a read lock may be acquired when no write node has a smaller sequence number
// - a write lock may be acquired when no node of any kind has a smaller sequence number

type LockType string

const (
    ReadLock  LockType = "read-"
    WriteLock LockType = "write-"
)

// Modify Lock() to accept a lock type; the acquisition loop stays the same,
// but the "can we acquire?" check must compare sequence numbers (plain
// lexicographic sorting no longer works once "read-" and "write-" prefixes
// are mixed) and consider the type of each node ahead of us.
func (dl *DistributedLocker) LockWithType(lockType LockType) error {
    lockNode := fmt.Sprintf("%s/%s", dl.basePath, lockType)
    // ... create lockNode as an ephemeral sequential node, then loop as in
    // Lock(), using the type-aware check sketched below instead of "index 0?"
}
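To make the type-aware check concrete, here’s a minimal sketch of what that compatibility test might look like. canAcquire and sequenceOf are hypothetical helpers (they are not part of the ZooKeeper client API); they build on the LockType constants above and lean on the fact that ZooKeeper’s sequence suffix is always the last ten digits of the node name:
// canAcquire reports whether the node we created (ourName) may take the lock,
// given the current children of the lock path. Hypothetical helper built on
// the LockType constants above; needs "strconv" and "strings" imported.
func canAcquire(ourName string, ourType LockType, children []string) bool {
    ourSeq := sequenceOf(ourName)
    for _, child := range children {
        if child == ourName || sequenceOf(child) >= ourSeq {
            continue // ignore ourselves and anyone queued after us
        }
        // A writer is blocked by any older node; a reader only by older writers.
        if ourType == WriteLock || strings.HasPrefix(child, string(WriteLock)) {
            return false
        }
    }
    return true
}

// sequenceOf extracts the zero-padded counter ZooKeeper appends to a
// sequential node's name (its last ten digits).
func sequenceOf(name string) int {
    seq, _ := strconv.Atoi(name[len(name)-10:])
    return seq
}
The watch target changes in the same spirit: a waiting reader watches the closest older write node, while a waiting writer watches the closest older node of any kind, so a release still wakes only clients that can actually make progress.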
Testing Your Distributed Locks
Testing distributed locks is tricky because you need multiple competing clients. Conveniently, each ZooKeeper connection is its own session, so two lockers inside a single test process behave like two separate services. Here’s a pragmatic approach (it assumes a ZooKeeper server on localhost:2181):
package lock
import (
    "testing"
    "time"
)

func TestDistributedLockSequence(t *testing.T) {
    zkHosts := []string{"localhost:2181"}
    basePath := "/test/locks"
    // Create two lockers
    locker1, err := NewDistributedLocker(zkHosts, basePath, 30*time.Second)
    if err != nil {
        t.Fatalf("Failed to create locker1: %v", err)
    }
    defer locker1.Close()
    locker2, err := NewDistributedLocker(zkHosts, basePath, 30*time.Second)
    if err != nil {
        t.Fatalf("Failed to create locker2: %v", err)
    }
    defer locker2.Close()
    // Acquire first lock
    if err := locker1.Lock(); err != nil {
        t.Fatalf("Failed to acquire lock: %v", err)
    }
    // Try to acquire the second lock in a goroutine so it doesn't block the test
    acquired := make(chan bool)
    go func() {
        err := locker2.Lock()
        acquired <- err == nil
    }()
    // Second locker should not acquire the lock immediately
    select {
    case <-acquired:
        t.Fatal("Second locker acquired lock while first still holds it")
    case <-time.After(100 * time.Millisecond):
        // Good, it's waiting
    }
    // Release first lock
    locker1.Unlock()
    // Now second locker should acquire it
    select {
    case ok := <-acquired:
        if !ok {
            t.Fatal("Second locker failed to acquire lock")
        }
    case <-time.After(5 * time.Second):
        t.Fatal("Second locker never acquired lock")
    }
    locker2.Unlock()
}
Deployment Considerations
A few things before you push this to production:

ZooKeeper Ensemble Setup: Don’t use a single ZooKeeper node in production. Use at least 3 nodes (or 5 for larger deployments). The quorum ensures that failures don’t make the entire lock system unreliable.

Network Configuration: ZooKeeper needs reliable network connections. If your network is flaky, your locks will be flaky. Consider putting ZooKeeper nodes in the same data center as your services, or use managed ZooKeeper services if available.

Monitoring: Monitor lock acquisition times. If they suddenly spike, it usually indicates ZooKeeper is struggling or your network is degraded:
start := time.Now()
if err := locker.Lock(); err != nil {
    log.Printf("Lock acquisition failed after %v", time.Since(start))
    return err
}
// Record acquisition time for monitoring; lockAcquisitionTime stands in for
// whatever histogram you use (e.g. a Prometheus histogram)
lockAcquisitionTime.Observe(time.Since(start).Seconds())
Graceful Shutdown: Always release locks before exiting:
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)
go func() {
    <-sigChan
    locker.Unlock()
    locker.Close()
    os.Exit(0)
}()
When NOT to Use Distributed Locks
I’d be remiss if I didn’t mention when this might be overkill. Distributed locks add latency and operational complexity. Use them when:
- ✅ You truly need mutual exclusion across multiple machines
- ✅ Data consistency is critical
- ✅ You can tolerate the latency overhead

Don’t use them when:
- ❌ You just need rate limiting (use token buckets)
- ❌ You need coordination but not strict mutual exclusion
- ❌ Your operations are sub-millisecond (the network latency alone will kill you)
Conclusion
Distributed locks are one of those problems that look simple until you realize all the ways they can break. ZooKeeper, with its ephemeral sequential nodes, provides an elegant solution that handles most edge cases automatically. The implementation we’ve built here is a solid foundation, but remember: the real complexity comes not from the lock implementation, but from designing your system to use locks efficiently.

The key insight is this: every second your process holds a lock, other processes are waiting. Design your critical sections to be small and fast. Use distributed locks for coordination, not for general-purpose mutual exclusion of complex operations.

Go forth and lock things responsibly.
