Thursday, December 3, 2015

On Cassandra Striping Across Availability Zones

We've had mixed advice given to us in regards to how Cassandra stripes its data across AWS Availability Zones (mapped to "Racks" via the EC2MultiRegionSnitch). Some said that striping is guaranteed to be across all 3 AZs our nodes are deployed across per Region. While others said the striping is simply a "best effort" operation. This has meant that we've had to take a conservative approach and assume equal distribution of partitions across AZs can not be relied on. The practical implication of this was that when an entire AWS AZ, or even 2 nodes within an AZ went down, we'd have to fail-over to the Disaster Recovery half of our cluster (an AWS Region mapped to a logical "Datacenter" via the EC2MultiRegionSnitch) to achieve the consistency level of LOCAL_QUORUM required by our applications in order to achieve "strong consistency". This was not cool... So we've done a few some tests to see what the real story is.

Our approach was to develop a little test app that iteratively selected all rows in one of the biggest tables in our schema, "log", where all our raw messages are stored. The app used a consistency level of ALL, which meant that all replicas of the partition being requested must reply for the select to be considered successful. As is standard for our deployments, we used a replication factor of 3 (every partition is replicated 3 times per logical datacenter, of which we have 2). This means that if one of 3 replicas for a partition is unavailable, the select will fail. The app will ignore these failures (as opposed to retrying the query) and just output the count of successfully selected partitions.

This app was then used in 4 different scenarios:
  1. After a large, heavy backload had been performed. Done simply to get the count of partitions with all of the replicas available.
  2. After an entire AZ was taken down. In this test we'd expect no selects at a consistency level of ALL to succeed if striping was perfectly equally distributing replicas across AZs.
  3. After yet more load had been applied while an entire AZ is still down. This was to determine that with an AZ completely unavailable, we don't regress to striping all 3 replicas across just 2 AZs. Again we expect (desire) all selects with a consistency level of ALL to fail.
  4. After the AZ has been brought up again. This is mainly done for the sake of completeness and we of course expect all selects with a consistency level of ALL to succeed.

What happened?

Test 1
The result of Test 1 was that 296,904,900 partitions were successfully selected. This isn't interesting in itself, but this number will come in handy later.

Test 2
The result of Test 2 was that 0 selects were successful; a good result indicating that data has been striped consistently across 3 AZs. Importantly, as we later found out, this testing was done with a "fetch size" (number of rows to attempt to fetch for purposes of pagination, etc) of 1. Out of curiosity we'd run this test again with just 1, 2, and 3 of our 4 nodes in the AZ down. We'd of course expected to see roughly 74,226,225 (one quarter of 296,904,900) of selects succeed at a consistency level of ALL per node in the AZ still up. Interestingly if a single partition in a fetch does not have the required consistency level available, the entire fetch appears to fail because of this. So the only way to get the precise number of successful selects with a consistency level of ALL and multiple nodes down in an AZ is to use a fetch size of 1, which is practically infeasible in terms of time due to the performance hit of doing so. Thankfully with an entire AZ down it seems that Cassandra is smart enough to know that all queries at consistency level ALL will fail and it does so very quickly, even with a fetch size of 1.

All Nodes in AZ Up1 Node in AZ Down2 Nodes in AZ Down3 Nodes in AZ DownAll Nodes in AZ Down
Fetch SizeSuccessful SelectsSuccessful SelectsSuccessful SelectsSuccessful SelectsSuccessful Selects
1296,904,900*~ 222,678,67*~ 148,452,450*~ 74,226,225*0
* Not an actual measurement, but what we'd expect to see.

Test 3
The result of Test 3 was that we once again found that no selects were successful, even with a fetch size of 1. This is good because it indicates that striping has not regressed to putting all 3 replicas across Cassandra nodes in the only 2 still available AZs. Instead it will be writing hinted handoffs that will persist for 3 hours, so that if the downed nodes were only temporally available, then the missed writes could be replicated to them within those 3 hours. As our nodes in the downed AZ are completely gone (losing the data on their ephemeral disks), we'll instead need to stream all lost data to them when they're eventually replaced.

Test 4
As we'd sent in more messages since Test 1, the total number of entries in the log table was now 301,373,773. As an aside, a handy way to monitor bootstrap streaming progress is with the following command-line fu:
watch -n 10 'nodetool netstats | grep Receiving | gawk {'"'"' print $11/$4*100"% Complete, "$11/1024/1024/1024" GB remaining" '"'"'}'

How does this manifest in our applications which use a consistency level of LOCAL_QUORUM for "strong consistency"?

All of our applications use a hard-coded consistency level of LOCAL_QUORUM for all reads and writes to Cassandra in order to ensure "strong consistency". As an outcome of our testing, we now know that with our standard configuration of a Replication Factor of 3, Ec2MultiRegionSnitch, NetworkTopologyStrategy, and Cassandra nodes distributed across 3 AWS Availability Zones per Region; we can afford to lose an entire AZ of nodes (e.g. lose all of AZ C)  and still not need to fail over to the Disaster Recovery region to consistently achieve our LOCAL_QUORUM consistency. However, if we lose one node in one AZ and another node in a different AZ (e.g. one node in AZ A and another in AZ B), we will have to fail over for some requests.

The following is a diagram of how all this manifests in terms of data striping in Cassandra with our configuration. Only writes have been shown as read behaviour is the same.