5 Tips for Running Cassandra on AWS
Cassandra is a high-performance NoSQL database (a distributed wide-column store rather than a relational database) that is used by some of the biggest companies, including Netflix and Reddit. Although it was originally designed to run on-premises, it can be used on cloud services as well. AWS EC2 instances, in particular, are one hosting option that can offer increased scalability, reliability, and security at decreased cost.
In this article, we’ll look at the differences between Cassandra and DynamoDB, the comparable AWS offering, and cover some tips for getting the most benefit if you decide to move Cassandra to the cloud.
Cassandra vs DynamoDB
Cassandra and AWS DynamoDB offer broadly similar functionality, since both are built on the same NoSQL premise. The table below gives a further breakdown, but the key differences are that DynamoDB is fully managed but proprietary and less customizable, while Cassandra is open source, has full active-active multi-region support, and can offer lower latency, but requires manual management.
| Feature | Cassandra | DynamoDB |
|---|---|---|
| Primary Key | Supports a large number of fields | Supports only 2 fields. Use of more attributes requires manual concatenation into a single field |
| Schema | Data is structured according to a predefined schema | Only the primary key requires a schema. Table data can have mixed attributes |
| Time-To-Live (TTL) | Supported by column, which allows you to expire individual fields of an item | Supported by item only |
| Consistency | Offers both strong and eventual consistency. Added functionality of being able to specify the number of nodes queried with eventual consistency | Offers both strong and eventual consistency |
| Language | Uses Cassandra Query Language (CQL) | Uses JSON |
| Protocol | Uses binary and allows simultaneous ongoing requests | Uses JSON/HTTP |
| Batching | Supports batching with an all pass / all fail guarantee | Supports batched operations with no all pass / all fail option |
| Scalability | Requires manual replication to scale but allows fine control | Autoscales but offers no control over replicas |
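The manual concatenation mentioned in the Primary Key row can be illustrated with a short Python sketch. The attribute names, the `#` delimiter, and the helper names are assumptions made up for illustration; they are not part of either database's API.

```python
from typing import List


def build_sort_key(*parts: str, delimiter: str = "#") -> str:
    """Join several attribute values into a single key string."""
    for part in parts:
        # A value containing the delimiter would make the key ambiguous.
        if delimiter in part:
            raise ValueError(f"attribute value {part!r} contains the delimiter")
    return delimiter.join(parts)


def split_sort_key(key: str, delimiter: str = "#") -> List[str]:
    """Recover the individual attribute values from a concatenated key."""
    return key.split(delimiter)


# Hypothetical item: three logical key attributes folded into one field.
item = {
    "customer_id": "c-42",
    "order_key": build_sort_key("2020", "03", "order-1001"),
}
```

In Cassandra the same three attributes could simply be declared as separate clustering columns, which is the difference the table row describes.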
Tips for Operation
If you have decided to move your Cassandra database to AWS, the following tips can help you make sure you’re getting the greatest benefit.
One of the benefits of deploying Cassandra on AWS is the ability to automate deployment tasks, such as describing and provisioning infrastructure resources, through CloudFormation. When doing so, use one CloudFormation template for each Cassandra ring you want to orchestrate. If you are deploying to multiple regions, manage your stacks with CloudFormation StackSets.
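As a rough sketch of the "one template per ring" idea, the Python snippet below generates a minimal CloudFormation template for a single ring. The ring name, AMI ID, instance type, and Availability Zone names are placeholders, and a real template would also need networking, storage, and security resources that are omitted here.

```python
import json
from typing import Dict, List


def ring_template(ring_name: str, instance_type: str, image_id: str,
                  availability_zones: List[str]) -> Dict:
    """Build a minimal CloudFormation template describing one Cassandra ring."""
    resources = {}
    for index, az in enumerate(availability_zones):
        # One EC2 instance per entry; logical IDs must be alphanumeric.
        resources[f"CassandraNode{index}"] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": instance_type,
                "ImageId": image_id,
                "AvailabilityZone": az,
                "Tags": [{"Key": "Ring", "Value": ring_name}],
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Cassandra ring {ring_name}",
        "Resources": resources,
    }


# Placeholder values; substitute your own AMI, instance type, and zones.
template = ring_template(
    ring_name="ring-a",
    instance_type="i3.large",
    image_id="ami-00000000",
    availability_zones=["us-east-1a", "us-east-1b", "us-east-1c"],
)
template_json = json.dumps(template, indent=2)
```

Generating a separate template per ring keeps each ring's stack independent, so one ring can be updated or rolled back without touching the others.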
On AWS, your deployment should follow one of these three architectures:
- Single-region with multiple Availability Zones (AZs): consists of one ring in one region, with nodes evenly distributed across at least three AZs. This configuration is useful when you’re required to stay within one region to comply with regulatory standards, but it leaves you exposed to a regional failure.
- Active-active, multi-region: consists of multiple rings in multiple regions. This configuration works best if the rings are identical; it is highly available and minimizes the risk of data loss during failover, but comes with a higher cost due to duplication.
- Active-standby, multi-region: works the same as active-active, except that one ring serves only as disaster recovery. This is useful if you need a low Recovery Point Objective (RPO) or Recovery Time Objective (RTO), but it has the additional downside of high latency for eventual-consistency writes.
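The "evenly distributed across at least three AZs" requirement from the first architecture can be sketched as a simple round-robin placement. The node names and zone names below are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def assign_nodes_to_azs(node_count: int, azs: List[str]) -> Dict[str, str]:
    """Spread nodes across Availability Zones in round-robin order."""
    if len(azs) < 3:
        raise ValueError("use at least three Availability Zones")
    return {f"node-{i}": azs[i % len(azs)] for i in range(node_count)}


placement = assign_nodes_to_azs(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
nodes_per_az = Counter(placement.values())
```

Choosing a node count that is a multiple of the number of zones keeps the per-zone counts identical.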
When running Cassandra on AWS, you have a choice of two storage options: ephemeral storage, which is good for general-purpose deployments, or EBS volumes, which are more flexible and good for read-heavy clusters. The instance type you use will depend on the storage option, but whichever you choose, avoid burstable (T2) instances, as they cannot give you acceptable performance.
With ephemeral storage, you will get the best performance from storage-optimized (I3) instances, which can deliver up to 3.3M I/O operations per second (IOPS). This storage option works best for larger clusters, because the smaller a cluster is, the more a single node failure affects performance. If you choose this option, remember that ephemeral storage only exists as long as the instance is active. If a node fails, the data stored on it is gone, so it’s important to back up your data frequently, which will require a custom solution using rsync or a third-party integration.
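To see why a node failure hurts small clusters more, here is a back-of-the-envelope calculation. It assumes load is spread evenly over the surviving nodes, which ignores replica placement and is only a simplification for illustration.

```python
def surviving_load_factor(node_count: int, failed_nodes: int = 1) -> float:
    """Relative load on each remaining node after some nodes fail."""
    if failed_nodes < 0 or failed_nodes >= node_count:
        raise ValueError("failed_nodes must be between 0 and node_count - 1")
    return node_count / (node_count - failed_nodes)


for size in (3, 6, 12, 24):
    extra = (surviving_load_factor(size) - 1) * 100
    print(f"{size:>2} nodes: each survivor carries about {extra:.0f}% more load")
```

Under this assumption, losing one node out of three adds half again to each survivor's load, while the same loss in a large ring is spread thinly.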
With EBS volume storage, you will get the best performance from compute-optimized (C5) instances, which can provide up to 80K IOPS per instance if a RAID configuration is used. This storage option works best for small clusters with large amounts of data, due to its higher resilience in comparison to ephemeral storage. With EBS volumes, data is easily backed up through snapshots, which can be used to quickly bring up new instances in the event of failure, since only the most recent changes need to be recopied.
For Cassandra deployments, your maintenance actions should be scripted with the AWS SDK. These scripts can be triggered through Lambda or run manually.
When the time comes to horizontally scale your database, scale by a consistent factor so that your data remains evenly distributed. For instance, double your instances when scaling up and halve them when scaling down.
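The reasoning behind doubling can be sketched with token arithmetic. This example assumes one token per node (no virtual nodes) and the signed 64-bit token range used by Cassandra's Murmur3 partitioner; the function names are hypothetical. Doubling places each new node at the midpoint of an existing range, so every range is split in half and the ring stays balanced.

```python
from typing import List

MIN_TOKEN = -2**63        # lowest token in the assumed signed 64-bit range
MAX_TOKEN = 2**63 - 1     # highest token
TOKEN_RANGE = 2**64       # total number of tokens on the ring


def even_tokens(node_count: int) -> List[int]:
    """Evenly spaced initial tokens for a ring of node_count nodes."""
    return [MIN_TOKEN + (TOKEN_RANGE * i) // node_count for i in range(node_count)]


def double_ring(tokens: List[int]) -> List[int]:
    """Add one new token at the midpoint of every existing range."""
    ordered = sorted(tokens)
    added = []
    for i, token in enumerate(ordered):
        next_token = ordered[(i + 1) % len(ordered)]
        # Size of the range from this token to the next, wrapping around the ring.
        gap = (next_token - token) % TOKEN_RANGE or TOKEN_RANGE
        midpoint = token + gap // 2
        if midpoint > MAX_TOKEN:
            midpoint -= TOKEN_RANGE  # wrap past the top of the range
        added.append(midpoint)
    return sorted(ordered + added)


four_nodes = even_tokens(4)
eight_nodes = double_ring(four_nodes)
```

Here `eight_nodes` matches what `even_tokens(8)` would produce, and none of the original four nodes had to change its token. Adding an arbitrary number of nodes would instead leave some ranges larger than others until tokens are reassigned.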
To modify your instances, such as when upgrading volumes or applying patches, you should make use of rolling upgrades, with one instance being swapped out at a time. This will help eliminate downtime and make it easier to fix your ring if an update goes wrong. When doing this, you can benefit from the use of a secondary elastic network interface, which allows you to assign the IP address of your replaced instance to your new one and eliminate the need to rebalance your ring.
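A rolling upgrade can be outlined as a loop that touches one node at a time and stops at the first unhealthy result. The `upgrade` and `is_healthy` callables below are hypothetical placeholders for whatever your own scripts do (for example, swapping an instance and checking that the node has rejoined the ring).

```python
from typing import Callable, Iterable, List


def rolling_upgrade(nodes: Iterable[str],
                    upgrade: Callable[[str], None],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade nodes one at a time, halting as soon as one fails its check."""
    upgraded = []
    for node in nodes:
        upgrade(node)
        if not is_healthy(node):
            # Stop here so that at most one node is in a bad state.
            raise RuntimeError(
                f"{node} is unhealthy after upgrade; "
                f"{len(upgraded)} node(s) were upgraded before it"
            )
        upgraded.append(node)
    return upgraded


# Usage with stand-in callables that only record what happened.
log = []
done = rolling_upgrade(
    ["node-0", "node-1", "node-2"],
    upgrade=lambda node: log.append(node),
    is_healthy=lambda node: True,
)
```

Because the loop halts on the first failure, the rest of the ring keeps serving traffic on the old version while the problem node is repaired.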
To manage the security of your database, make use of AWS’ built-in encryption features. Both at-rest and in-transit encryption are available, but note that at-rest encryption depends on your storage type. For EBS, you can simply encrypt the volume, whereas for ephemeral storage you will have to use an encrypted file system or a third-party solution.
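For the EBS case, the following sketch shows how an encrypted volume might be requested with boto3, the AWS SDK for Python. The volume size, volume type, and zone are placeholder assumptions, boto3 must be installed separately, and the call needs valid AWS credentials, so the request itself is kept in a function that is not invoked here.

```python
from typing import Dict, Optional


def encrypted_volume_params(availability_zone: str, size_gib: int,
                            kms_key_id: Optional[str] = None) -> Dict:
    """Build the arguments for an encrypted EBS volume request."""
    params = {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,          # size in GiB
        "VolumeType": "gp3",       # assumed volume type for this sketch
        "Encrypted": True,         # turn on at-rest encryption
    }
    if kms_key_id is not None:
        # Without this, the account's default EBS encryption key is used.
        params["KmsKeyId"] = kms_key_id
    return params


def create_encrypted_volume(availability_zone: str, size_gib: int,
                            kms_key_id: Optional[str] = None) -> Dict:
    """Send the request to EC2 (requires boto3 and AWS credentials)."""
    import boto3  # third-party dependency, imported only when actually used

    ec2 = boto3.client("ec2")
    return ec2.create_volume(
        **encrypted_volume_params(availability_zone, size_gib, kms_key_id)
    )


params = encrypted_volume_params("us-east-1a", 500)
```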
Cassandra is a powerful tool that continues to see use and support despite its age, and you may well be loyal to it. If you are already using both Cassandra and AWS services, it makes sense to consider combining the two. Hopefully, this article gave you a better idea of what such an integration looks like and some tips for integrating smoothly and efficiently, so that you get the benefits of both Cassandra and AWS services with maximum flexibility.