How Discord Stores TRILLIONS of Messages
A Colossal Database Migration: Discord’s Journey from Cassandra to ScyllaDB
Introduction:
This post walks through how Discord’s engineers migrated trillions of messages from Cassandra to ScyllaDB. If you’ve ever wondered what it takes to move data at that scale, this is a story of careful engineering and clever solutions.
The Predicament with Cassandra:
As Discord’s platform grew, its engineers ran into serious performance issues with Cassandra, its chosen database. Maintaining the main Cassandra cluster that holds messages became increasingly cumbersome, with unpredictable latency and frequent on-call incidents that strained the team’s resources.
By 2022, the Cassandra cluster housed trillions of messages across 177 nodes, making it clear that a change was necessary. Discord’s very foundation depended on the efficiency of the message cluster. Therefore, finding a robust solution was critical.
The Solution: Embracing ScyllaDB:
The solution to Discord’s database dilemma was ScyllaDB, a Cassandra-compatible database written in C++. Rather than starting with the largest cluster, the engineers took a gradual approach. They began with smaller database migrations to test the waters and resolve any issues before taking on the migration of trillions of messages.
The Intermediate Layer: Introducing Data Services:
To optimize its database architecture, Discord created an intermediate layer called Data Services between the API monolith and the database clusters. This layer, written in Rust, performs request coalescing: when multiple users request the same data at the same time, a single database query serves all of them. This reduces the load on hot partitions and improves performance.
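The coalescing idea described above can be sketched in a few lines. This is a minimal illustration in Python rather than Rust, and it is not Discord's implementation: the `Coalescer` class, the `fake_query` function, and the channel ID used as the key are all hypothetical names chosen for the example.

```python
import asyncio


class Coalescer:
    """Share one in-flight fetch among all concurrent callers asking for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch      # async function: key -> value (stands in for a DB query)
        self._in_flight = {}     # key -> asyncio.Task for the query currently running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the query and remember the task.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes, so later calls fetch fresh data.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # shield() keeps one cancelled caller from cancelling the shared query.
        return await asyncio.shield(task)


async def demo():
    calls = 0

    async def fake_query(channel_id):
        # Hypothetical stand-in for a slow database read.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"messages for channel {channel_id}"

    coalescer = Coalescer(fake_query)
    # 100 users ask for the same channel at the same time.
    results = await asyncio.gather(*(coalescer.get(42) for _ in range(100)))
    return calls, results


if __name__ == "__main__":
    calls, results = asyncio.run(demo())
    print(calls, len(results))  # prints: 1 100
```

All one hundred callers receive the same result, but the backing query runs only once. A popular channel therefore costs the database one read per burst of requests instead of one read per user.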
The Superdisk Innovation:
To tackle disk latency, Discord had to work around the limitations of local NVMe SSDs on virtual machines. The team wanted low-latency disk reads without giving up durability for critical data. The result was a “super disk,” a two-layered RAID setup: RAID0 stripes multiple local SSDs into one low-latency virtual disk, and RAID1 mirrors that array onto a persistent disk.
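To make the two layers concrete, here is a toy in-memory model of the layout described above. It is only a sketch under stated assumptions: the `SuperDisk` class is hypothetical, real RAID is configured at the block-device level rather than in application code, and the choice to serve reads from the local side whenever possible is an assumption made to match the stated goal of low-latency reads.

```python
class SuperDisk:
    """Toy model: RAID0 across local SSDs, mirrored (RAID1) onto a persistent disk."""

    def __init__(self, n_local):
        self.local = [dict() for _ in range(n_local)]  # RAID0 members: block -> data
        self.persistent = {}                           # RAID1 mirror: block -> data

    def _member(self, block):
        # RAID0 striping: consecutive blocks rotate across the local SSDs.
        return self.local[block % len(self.local)]

    def write(self, block, data):
        # RAID1 mirroring: every write lands on both sides.
        self._member(block)[block] = data
        self.persistent[block] = data

    def read(self, block):
        # Assumption: prefer the fast local side, fall back to the durable mirror.
        member = self._member(block)
        if block in member:
            return member[block], "local"
        return self.persistent[block], "persistent"

    def lose_local_disks(self):
        # Simulate losing the contents of every local SSD.
        for member in self.local:
            member.clear()

    def rebuild(self):
        # Restore the striped local array from the persistent mirror.
        for block, data in self.persistent.items():
            self._member(block)[block] = data


if __name__ == "__main__":
    disk = SuperDisk(n_local=4)
    for b in range(8):
        disk.write(b, f"block-{b}")
    print(disk.read(5))
    disk.lose_local_disks()
    print(disk.read(5))
```

In the model, reads come from the local array in normal operation, and the persistent mirror keeps the data available and rebuildable if the local disks are lost.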
The Migration Journey:
With all preparations in place, Discord embarked on the migration of its largest database, the ‘Cassandra-messages’ cluster, which held trillions of messages across 177 nodes. Using a newly developed data migrator written in Rust and some clever strategies, the team completed the migration in just nine days, without any downtime.
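This post does not describe how the migrator works internally. One common pattern for bulk copies of this kind is to split the key space into ranges, copy one range at a time, and record finished ranges so an interrupted run can resume. The sketch below illustrates that general pattern only. It uses Python dictionaries in place of real databases, and `split_ranges`, `migrate`, and the `done` checkpoint set are hypothetical names, not Discord's code.

```python
def split_ranges(lo, hi, n):
    """Split the half-open interval [lo, hi) into n contiguous ranges."""
    step, rem = divmod(hi - lo, n)
    ranges, start = [], lo
    for i in range(n):
        end = start + step + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges


def migrate(source, target, ranges, done):
    """Copy each unfinished range from source to target, checkpointing in `done`."""
    for key_range in ranges:
        if key_range in done:
            continue  # already copied in an earlier run
        lo, hi = key_range
        # A real migrator would issue a range query; scanning a dict is a stand-in.
        for key, value in source.items():
            if lo <= key < hi:
                # Assumption: never overwrite a row the target already has.
                target.setdefault(key, value)
        done.add(key_range)  # checkpoint: this range is complete


if __name__ == "__main__":
    source = {k: f"msg-{k}" for k in range(10)}
    target, done = {}, set()
    ranges = split_ranges(0, 10, 3)
    migrate(source, target, ranges[:1], done)  # first run stops after one range
    migrate(source, target, ranges, done)      # second run resumes where it left off
    print(target == source)
```

Because ranges are independent, they can also be copied in parallel, and a failure costs only the range that was in progress.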
The Reward:
The migration to ScyllaDB resulted in a significantly quieter and more efficient system. Discord went from 177 Cassandra nodes to just 72 ScyllaDB nodes, a reduction of about 59%, along with improved latencies and a better quality of life for the on-call staff. The team achieved this through careful risk mitigation and innovative solutions.
Conclusion:
Discord’s colossal database migration from Cassandra to ScyllaDB is a testament to the power of careful planning, gradual implementation, and out-of-the-box thinking. It showcases the prowess of a dynamic and innovative engineering culture that thrives on tackling challenges head-on. Migrating a production database at this scale is no small feat, and Discord’s success in accomplishing it is truly commendable.
Happy Coding ..!!! 💻