Thursday, December 3, 2015

On Cassandra Striping Across Availability Zones

We've had mixed advice given to us in regards to how Cassandra stripes its data across AWS Availability Zones (mapped to "Racks" via the EC2MultiRegionSnitch). Some said that striping is guaranteed to be across all 3 AZs our nodes are deployed across per Region. While others said the striping is simply a "best effort" operation. This has meant that we've had to take a conservative approach and assume equal distribution of partitions across AZs can not be relied on. The practical implication of this was that when an entire AWS AZ, or even 2 nodes within an AZ went down, we'd have to fail-over to the Disaster Recovery half of our cluster (an AWS Region mapped to a logical "Datacenter" via the EC2MultiRegionSnitch) to achieve the consistency level of LOCAL_QUORUM required by our applications in order to achieve "strong consistency". This was not cool... So we've done a few some tests to see what the real story is.

Our approach was to develop a little test app that iteratively selected all rows in one of the biggest tables in our schema, "log", where all our raw messages are stored. The app used a consistency level of ALL, which meant that all replicas of the partition being requested must reply for the select to be considered successful. As is standard for our deployments, we used a replication factor of 3 (every partition is replicated 3 times per logical datacenter, of which we have 2). This means that if one of 3 replicas for a partition is unavailable, the select will fail. The app will ignore these failures (as opposed to retrying the query) and just output the count of successfully selected partitions.

This app was then used in 4 different scenarios:
  1. After a large, heavy backload had been performed. Done simply to get the count of partitions with all of the replicas available.
  2. After an entire AZ was taken down. In this test we'd expect no selects at a consistency level of ALL to succeed if striping was perfectly equally distributing replicas across AZs.
  3. After yet more load had been applied while an entire AZ is still down. This was to determine that with an AZ completely unavailable, we don't regress to striping all 3 replicas across just 2 AZs. Again we expect (desire) all selects with a consistency level of ALL to fail.
  4. After the AZ has been brought up again. This is mainly done for the sake of completeness and we of course expect all selects with a consistency level of ALL to succeed.

What happened?

Test 1
The result of Test 1 was that 296,904,900 partitions were successfully selected. This isn't interesting in itself, but this number will come in handy later.

Test 2
The result of Test 2 was that 0 selects were successful; a good result indicating that data has been striped consistently across 3 AZs. Importantly, as we later found out, this testing was done with a "fetch size" (number of rows to attempt to fetch for purposes of pagination, etc) of 1. Out of curiosity we'd run this test again with just 1, 2, and 3 of our 4 nodes in the AZ down. We'd of course expected to see roughly 74,226,225 (one quarter of 296,904,900) of selects succeed at a consistency level of ALL per node in the AZ still up. Interestingly if a single partition in a fetch does not have the required consistency level available, the entire fetch appears to fail because of this. So the only way to get the precise number of successful selects with a consistency level of ALL and multiple nodes down in an AZ is to use a fetch size of 1, which is practically infeasible in terms of time due to the performance hit of doing so. Thankfully with an entire AZ down it seems that Cassandra is smart enough to know that all queries at consistency level ALL will fail and it does so very quickly, even with a fetch size of 1.

All Nodes in AZ Up1 Node in AZ Down2 Nodes in AZ Down3 Nodes in AZ DownAll Nodes in AZ Down
Fetch SizeSuccessful SelectsSuccessful SelectsSuccessful SelectsSuccessful SelectsSuccessful Selects
1296,904,900*~ 222,678,67*~ 148,452,450*~ 74,226,225*0
* Not an actual measurement, but what we'd expect to see.

Test 3
The result of Test 3 was that we once again found that no selects were successful, even with a fetch size of 1. This is good because it indicates that striping has not regressed to putting all 3 replicas across Cassandra nodes in the only 2 still available AZs. Instead it will be writing hinted handoffs that will persist for 3 hours, so that if the downed nodes were only temporally available, then the missed writes could be replicated to them within those 3 hours. As our nodes in the downed AZ are completely gone (losing the data on their ephemeral disks), we'll instead need to stream all lost data to them when they're eventually replaced.

Test 4
As we'd sent in more messages since Test 1, the total number of entries in the log table was now 301,373,773. As an aside, a handy way to monitor bootstrap streaming progress is with the following command-line fu:
watch -n 10 'nodetool netstats | grep Receiving | gawk {'"'"' print $11/$4*100"% Complete, "$11/1024/1024/1024" GB remaining" '"'"'}'

How does this manifest in our applications which use a consistency level of LOCAL_QUORUM for "strong consistency"?

All of our applications use a hard-coded consistency level of LOCAL_QUORUM for all reads and writes to Cassandra in order to ensure "strong consistency". As an outcome of our testing, we now know that with our standard configuration of a Replication Factor of 3, Ec2MultiRegionSnitch, NetworkTopologyStrategy, and Cassandra nodes distributed across 3 AWS Availability Zones per Region; we can afford to lose an entire AZ of nodes (e.g. lose all of AZ C)  and still not need to fail over to the Disaster Recovery region to consistently achieve our LOCAL_QUORUM consistency. However, if we lose one node in one AZ and another node in a different AZ (e.g. one node in AZ A and another in AZ B), we will have to fail over for some requests.

The following is a diagram of how all this manifests in terms of data striping in Cassandra with our configuration. Only writes have been shown as read behaviour is the same.

Wednesday, June 10, 2015

Why not to multi-tenant Cassandra

Cassandra and the Cloud go together like <insert hilarious comparison here>. So does multi-tenancy and the Cloud.

As well as being able to save costs by elastically horizontally scaling in response to load during a day, it’s an attractive idea to make further strides by sharing hardware between customers (multi-tenanting). 

I’m not against multi-tenancy per se… I think it’s a great idea. But I am against doing it with Cassandra and here’s why…

You basically have 3 real options for multi-tenanting Cassandra, each with pros and cons:

By Keyspace
  • Tenants may have different replication factors.
  • Tenants may have different schemas (enabling them to potentially run different applications).
  • Tenants don’t share SSTables and data is effectively isolated by Cassandra.
  • Your scalability is limited because holding all the extra keyspaces and their tables in memory does not play well with Cassandra.
By Table
  • Tenants may have different schemas (enabling them to potentially run different applications).
  • Tenants don’t share SSTables and data is effectively isolated by Cassandra.
  • Your scalability is limited because holding all the extra tables in memory does not play well with Cassandra.
  • More complex application logic is needed to deal with different tables per tenant.
By Row
  • A scalable approach as your memory requirements are the same as for one tenant.
  • Tenants must have compatible schemas.
  • Tenants share both SSTables and Memtables, so isolation has to be done in the application, but active tenants can hinder the effectiveness of the shared caches for other tenants and cause compaction issues that will impact every tenant.
Having multiple Logical Datacenters has also been suggested to isolate the workloads of different customers. But this is designed for different workloads (e.g. BAU and Analytics) on the same data. If you want to isolate by tenant (data) then you’ll lose the benefits of multi-tenancy as you’ll no longer be sharing the hardware.

You can not win.

If you really want to save wastage costs from unused hardware, I'd consider increasing the granularity of your instances so that the wastage is reduced.

Thursday, May 14, 2015

Cassandra Compaction Strategies Under Heavy Delete Workflows

There are many factors at play when we look at what compaction strategies to use for our tables in Cassandra. Bulk deleting large quantities of data is something that is usually avoided in Cassandra. However, we have requirements that necessitate it (e.g. a requirement to hard-delete patient data when they've opted out of our service) and there are certain advantages to being able to do it (e.g. freeing up space when data has become "orphaned", never to be read again).

The big concern with deleting large quantities of data is tombstones. We have a log-structured file system in Cassandra in that we always append writes, avoiding disk seeks (how we achieve such awesome write performance). Even deletes are writes in the form of tombstones which mark data as having been deleted. Having lots of tombstones is a bad thing as we need to pull back the tombstones along with the upserts (inserts and updates are fairly much the same thing) for our data when we read it back. This spreads reads across SSTables and means we're holding more information in memory in order to resolve the current state of the data we're pulling back.

Because of this, it's important to understand the behaviour of different compaction strategies in relation to tombstones when we have a heavy delete workload. These were my findings:

Leveled Compaction Strategy (LCS)

Compaction load during heavy deletes was greatest with LCS. The upside being that this enabled more data to be compacted away during the delete workload, reducing the size consumed by the table as we went.

Delete throughput was also less under LCS than the other strategies due to the compaction load, but improved over time as the data size reduced.

Until we get Cassandra 3.0, nodetool compact (triggering a "major" compaction) is a no-op under LCS. This means that we can not opt to compact away all of the deleted data, leaving only the tombstones (assuming gc_grace_seconds hasn't expired). This is a big blow to the space saving use case as well as potentially causing issues around whether the data has been "hard" deleted.

A selling point for LCS over STCS is that you don't need the space 50% overhead in order to ensure compaction can safely happen.

Size Tiered Compaction Strategy (STCS)

The default compaction strategy, STCS, had a very light compaction load during the delete workflow. As a result, the data volumes actually grew with the tombstones being written and not much of the original data being compacted away.

Delete throughput was about the same as that of DTCS, around 34% higher than that of LCS in this test. Like with DTCS, delete throughput also remained fairly constant over time.

Triggering a "major" compaction with nodetool compact on STCS is generally not recommended as when you compact all of your data into one giant SSTable, it is going to be a very long time (if ever) that future SSTables will get up to a size with which they'll compact with your monolithic table. However, this did reduce the table size considerably as the tombstones (gc_grace_seconds hadn't expired) are a lot smaller than the actual data.

A downside to STCS is that it requires a 50% space overhead in order to guarantee it has enough space in order to perform a compaction. This can be a significant cost in hardware that could otherwise be utilised when you're at large scale.

Date Tiered Compaction Strategy (DTCS)

As with STCS, the compaction load was very light (we were only compacting the last hour of data together) and our data volume actually grew as we wrote more tombstones (the writes occurred a long time before the deletes).

Delete throughput was about the same as that of STCS, around 34% higher than that of LCS in this test, and remained fairly constant over time.

Triggering a major compaction had the same result as with STCS in that we wound up with one monolithic table and saved a tonne of space. However, unlike STCS the monolithic nature of this SSTable isn't so much of a concern under DTCS as we're not interested in compacting new data together with old data anyway as we're usually using it for more time-series type data.

How does this feed back into what Compaction Strategy to use and where?

In order of descending preference...

DTCS - Use anywhere we're going to do many immutable or very, very infrequently updated/deleted writes and we're only fetching back either a single logical row or a slice of rows with very close timestamps. Preferred over LCS as it scales better with data volume and enables major compactions.

LCS - Use for large tables (size overhead) where DTCS isn't suitable and we're unlikely to be concerned with freeing up space from bulk delete operations. Preferred over STCS as it means we can push our disk space utilisation further. Note: Requires some grunt to power (decent disks, decent compaction settings).

STCS - Use for small tables and tables unsuited to DTCS but where we really need to free up space or have data hard-deleted on bulk.