Event Bus with Redis Stream

A risky bet that paid off

In the world of microservices architecture, communication between services is a critical aspect that can make or break your system. When we started building our platform at Ingedata, we needed a reliable way for our services to communicate with each other. We had to make a choice: use an existing message broker or build our own solution. We chose the latter, implementing our own Event Bus using Redis Stream. This article explains why we made this decision and how it turned out to be a successful bet.

What is an Event Bus and Why Use It?

An Event Bus is a messaging system that allows different parts of an application to communicate with each other without knowing about each other. It follows the publish-subscribe pattern, where publishers send messages to specific channels, and subscribers receive messages from the channels they're interested in.

The advantages of using an Event Bus include:

Decoupling: Services don't need to know about each other directly, reducing dependencies.
Scalability: You can add new subscribers without affecting publishers.
Resilience: If a service goes down, messages can be stored and processed when it comes back online. This simplifies deployment and temporary shutdown by a lot.
Asynchronous Processing: Services can process messages at their own pace.
Event-Driven Architecture: Enables reactive programming patterns where services respond to events.

Comparing Message Broker Technologies

Before building our own solution, we evaluated several popular message brokers:

Kafka

Kafka is a distributed event streaming platform capable of handling trillions of events a day. It's highly scalable and provides strong durability guarantees.

Pros:

Extremely high throughput
Excellent scalability
Strong durability guarantees
Mature ecosystem

Cons:

Complex to set up and operate
Requires significant DevOps expertise
Zookeeper dependency (though this is changing)
Overkill for many use cases

RabbitMQ

RabbitMQ is a traditional message broker implementing the Advanced Message Queuing Protocol (AMQP).

Pros:

Mature and battle-tested
Supports multiple messaging patterns
Good performance for most use cases
Relatively easy to set up

Cons:

Consumer groups are not as flexible as we needed
Message ordering can be challenging to maintain
Less scalable than Kafka for very high throughput

NATS

NATS is a lightweight, high-performance messaging system.

Pros:

Extremely lightweight and fast
Simple to deploy and operate
Good for request-reply patterns

Cons:

Limited persistence options
Consumer groups are not as flexible as we needed
Message ordering guarantees are not as strong

Why We Chose Redis Stream

After evaluating these options, we decided not to use any of them for the following reasons:

Kafka: While powerful, it was too complex to set up and manage at the DevOps level. We didn't need the extremely high throughput or extensive scalability features it provides.
RabbitMQ and NATS: Their consumer groups weren't as flexible as we required. One of our major concerns was message ordering - we needed to ensure consistency in message ordering for any given resource while still allowing multiple consumers in the same group. NATS stream would ensure message ordering if only one consumer was consuming messages from a stream, but it didn't provide ordering guarantees for multiple consumers.

Instead, we turned to Redis Stream, a feature introduced in Redis 5.0 that provides a log-like data structure. Out of the box, Redis Stream doesn't provide multiple consumers with resource message ordering, but we realized we could implement this ourselves using Redis LUA Scripts.

Redis also offers different backup strategies (AOF, RDB) that ensure we don't lose any messages, providing the durability we needed.

Understanding Consumer Groups

Consumer groups are a way to distribute message processing among multiple consumers while ensuring each message is processed only once within the group.

In this model:

Each message published to the stream can be consumed by one consumer from each consumer group
Within a consumer group, each message is delivered to only one consumer
Consumers acknowledge messages after processing them
If a consumer fails, unacknowledged messages can be reassigned to other consumers in the group

However, this standard model doesn't guarantee message ordering when multiple consumers are processing messages from the same stream. This was a critical requirement for us - we needed to ensure that messages related to the same resource were processed in order, even with multiple consumers.

Our Implementation: Sharded Locks with LUA Scripts

We implemented a sharding mechanism using Redis LUA scripts to solve the message ordering problem while still allowing parallel consumption. Here's how it works:

We divide our streams into multiple shards based on hash over resource IDs
When a consumer wants to read messages, it first tries to acquire locks on as many shards as possible. If a shard is already locked by a concurrent consumer, the shard won’t be read for message.
It then tries to read N messages in the shards it has locked
Once the messages are retrieved, it unlocks shards where no message was found and acknowledges the messages but keeps the messages’ origin shards locked during processing
After processing messages, it releases the locks to those shards and goes back to try to acquire locks

This approach ensures that messages for a specific resource are always processed in order, as they'll always be processed by the same consumer (the one that has acquired the lock for that resource's shard). At the same time, it allows multiple consumers to process messages concurrently for different resources (up to the number of shards).

The locking mechanism is implemented using two LUA scripts:

lock_shards.lua: Attempts to acquire locks on shards for a consumer
unlock_shards.lua: Releases locks when the consumer is done processing messages

The LUA scripts ensure that the lock/unlock operations are atomic, preventing race conditions. Those locks live on one specific server of our Redis cluster.

The Critical Importance of Message Ordering

In our microservices architecture, resources often receive multiple updates in sequence. Maintaining the correct order of these updates is absolutely critical for data consistency.

Let's look at a simple example: a customer updating their shipping address twice.

Imagine this sequence of events:

Customer mistakenly updates their address to "789 Wrong St"
Customer realizes the mistake and corrects it to "456 Oak Ave"
Customer places an order expecting it to be delivered to "456 Oak Ave"

If these events are processed out of order, we could end up with packages being sent to the wrong address. Here's what could happen:

Without proper message ordering, the address correction (event #2) might be processed before the initial wrong update (event #1). Since the events are processed out of order, the final state would be the wrong address ("789 Wrong St"), and the package would be sent to the incorrect location. This creates a poor customer experience and requires costly manual intervention to correct.

Other critical scenarios where ordering matters include:

Financial transactions: Processing a refund before the original payment (woopsie!)
Document editing: Applying edits in the wrong order, resulting in incorrect content
Workflow state transitions: Skipping required states in a process

By ensuring that all messages for a specific resource are processed by the same consumer (through our sharding mechanism), we guarantee that events are processed in the exact order they were published. This maintains data consistency across our distributed system, even as we scale horizontally with multiple consumers.

Challenges We Faced

Concurrency and Deadlocks

One of the biggest challenges was ensuring our system worked correctly under high concurrency without deadlocks. We had to create extensive stress tests to verify that our locking mechanism and multi-threaded Ruby implementation were robust.

We implemented a liveness check system where consumers periodically update a key in Redis to indicate they're still alive. If a consumer dies unexpectedly (OOM, sigkill, or server crash), its locks will eventually expire, allowing other consumers to acquire them.

def liveness_run
  key = "{VERSE:STREAM:SHARDLOCK}:SERVICE_LIVENESS:#{@consumer_id}"

  until @stopped
    redis do |r|
      r.set(key, 1, ex: 30)
    end

    sleep 15
  end
rescue StandardError => e
  # Redis error? Log it and continue
  Verse.logger.error(e)
  retry
end

Handling Consumer Failures

Another challenge was dealing with consumer failures during message processing. We had to make a design decision: should we acknowledge messages before or after processing them?

We chose to acknowledge messages immediately upon receipt, even before processing them. This means that if a consumer fails during processing, the message won't be redelivered. This was a conscious trade-off - we accepted the possibility of data inconsistency in the rare case of consumer failure to avoid the complexity of implementing a robust retry mechanism. In case of error, we rely on monitoring and alerting to detect and fix the issue, and maintenance tasks to fix any inconsistencies.

This decision simplified our implementation but required careful consideration of failure scenarios in our business logic.

Redis Key Sharding

Redis Cluster distributes keys across multiple nodes using a hashing algorithm. To ensure that related keys are stored on the same shard (which is necessary as LUA scripts run on one shard only), we used the {...} notation in our key names.

For example, in our locking mechanism, we use keys like:

{VERSE:STREAM:SHARDLOCK}:stream_name:shard_id:consumer_group

The part inside the curly braces ({VERSE:STREAM:SHARDLOCK}) is used to determine which Redis node will store the key. All keys with the same hash tag will be stored on the same node, allowing our LUA scripts to operate correctly. In this case, only one of our Redis nodes will manage the locking mechanism.

Memory Management

An important aspect of our Redis Stream implementation is how we manage memory. Since Redis keeps everything in memory, we needed to be careful about how many events we store. Our events are designed to be short-lived and consumed as quickly as possible, so we maintain a backlog of only 100,000 messages maximum per stream before dropping the older events.

This approach helps us manage Redis server memory pressure while still providing enough buffer for temporary service outages or processing delays. When publishing messages to a stream, we use Redis's XADD command with the MAXLEN option to automatically trim the stream:

redis.xadd(
  stream,
  { msg: content },
  approximate: true,
  maxlen: 100_000
)

The approximate: true parameter (which translates to the ~ flag in Redis) allows Redis to optimize the trimming operation, making it more efficient at the cost of potentially keeping slightly more messages than specified. This trade-off is acceptable for our use case and significantly improves performance under high load.

Finally, all messages are serialized using msgpack and compressed using zlib.

Conclusion

Building our own Event Bus using Redis Stream was indeed a risky bet, but it paid off. We now have a system that:

Ensures message ordering for resources
Allows multiple consumers to process messages concurrently
Use existing infrastructure and is relatively simple to operate compared to Kafka
Provides the durability we need through Redis persistence
Scales well for our use case (~0..50 messages per second)

The implementation wasn't without challenges, but by leveraging Redis LUA scripts and carefully designing our locking mechanism, we created a solution that perfectly fits our requirements.

If you're facing similar challenges in your microservices architecture, consider whether existing solutions truly meet your needs or if a custom implementation might be better. Sometimes, taking a calculated risk can lead to a more tailored and efficient solution.

Have you built custom messaging solutions for your architecture? I'd love to hear about your experiences in the comments. And if you're interested in learning more about our implementation, check out our other articles on microservices architecture and Redis usage.

Remember, the best solution isn't always the most popular one - it's the one that best fits your specific requirements.

This article is part of our series on API design and microservice architecture. Stay tuned for more insights into how we've built a scalable, maintainable system.

Designing an Event Bus

A risky bet that paid off

What is an Event Bus and Why Use It?

Comparing Message Broker Technologies

Kafka

RabbitMQ

NATS

Why We Chose Redis Stream

Understanding Consumer Groups

Our Implementation: Sharded Locks with LUA Scripts

The Critical Importance of Message Ordering

Challenges We Faced

Concurrency and Deadlocks

Handling Consumer Failures

Redis Key Sharding

Memory Management

Conclusion

Comments

Exposition–Service–Effects: The Event-Driven Microservices Series

Secure and Scalable Authentication in Microservices

More from this blog

Sovereign Data: Why Bare Metal Won

Protecting Sensitive Data

Testing Our Application

Dive into JSON:API specification

Secure and Scalable Authentication in Microservices

Command Palette

A risky bet that paid off

What is an Event Bus and Why Use It?

Comparing Message Broker Technologies

Kafka

RabbitMQ

NATS

Why We Chose Redis Stream

Understanding Consumer Groups

Our Implementation: Sharded Locks with LUA Scripts

The Critical Importance of Message Ordering

Challenges We Faced

Concurrency and Deadlocks

Handling Consumer Failures

Redis Key Sharding

Memory Management

Conclusion

Comments

Exposition–Service–Effects: The Event-Driven Microservices Series

Secure and Scalable Authentication in Microservices

More from this blog