Thursday, May 14, 2015

Cassandra Compaction Strategies Under Heavy Delete Workflows

There are many factors at play when we look at what compaction strategies to use for our tables in Cassandra. Bulk deleting large quantities of data is something that is usually avoided in Cassandra. However, we have requirements that necessitate it (e.g. a requirement to hard-delete patient data when they've opted out of our service) and there are certain advantages to being able to do it (e.g. freeing up space when data has become "orphaned", never to be read again).

The big concern with deleting large quantities of data is tombstones. We have a log-structured file system in Cassandra in that we always append writes, avoiding disk seeks (how we achieve such awesome write performance). Even deletes are writes in the form of tombstones which mark data as having been deleted. Having lots of tombstones is a bad thing as we need to pull back the tombstones along with the upserts (inserts and updates are fairly much the same thing) for our data when we read it back. This spreads reads across SSTables and means we're holding more information in memory in order to resolve the current state of the data we're pulling back.

Because of this, it's important to understand the behaviour of different compaction strategies in relation to tombstones when we have a heavy delete workload. These were my findings:

Leveled Compaction Strategy (LCS)

Compaction load during heavy deletes was greatest with LCS. The upside being that this enabled more data to be compacted away during the delete workload, reducing the size consumed by the table as we went.

Delete throughput was also less under LCS than the other strategies due to the compaction load, but improved over time as the data size reduced.

Until we get Cassandra 3.0, nodetool compact (triggering a "major" compaction) is a no-op under LCS. This means that we can not opt to compact away all of the deleted data, leaving only the tombstones (assuming gc_grace_seconds hasn't expired). This is a big blow to the space saving use case as well as potentially causing issues around whether the data has been "hard" deleted.

A selling point for LCS over STCS is that you don't need the space 50% overhead in order to ensure compaction can safely happen.

Size Tiered Compaction Strategy (STCS)

The default compaction strategy, STCS, had a very light compaction load during the delete workflow. As a result, the data volumes actually grew with the tombstones being written and not much of the original data being compacted away.

Delete throughput was about the same as that of DTCS, around 34% higher than that of LCS in this test. Like with DTCS, delete throughput also remained fairly constant over time.

Triggering a "major" compaction with nodetool compact on STCS is generally not recommended as when you compact all of your data into one giant SSTable, it is going to be a very long time (if ever) that future SSTables will get up to a size with which they'll compact with your monolithic table. However, this did reduce the table size considerably as the tombstones (gc_grace_seconds hadn't expired) are a lot smaller than the actual data.

A downside to STCS is that it requires a 50% space overhead in order to guarantee it has enough space in order to perform a compaction. This can be a significant cost in hardware that could otherwise be utilised when you're at large scale.

Date Tiered Compaction Strategy (DTCS)

As with STCS, the compaction load was very light (we were only compacting the last hour of data together) and our data volume actually grew as we wrote more tombstones (the writes occurred a long time before the deletes).

Delete throughput was about the same as that of STCS, around 34% higher than that of LCS in this test, and remained fairly constant over time.

Triggering a major compaction had the same result as with STCS in that we wound up with one monolithic table and saved a tonne of space. However, unlike STCS the monolithic nature of this SSTable isn't so much of a concern under DTCS as we're not interested in compacting new data together with old data anyway as we're usually using it for more time-series type data.

How does this feed back into what Compaction Strategy to use and where?

In order of descending preference...

DTCS - Use anywhere we're going to do many immutable or very, very infrequently updated/deleted writes and we're only fetching back either a single logical row or a slice of rows with very close timestamps. Preferred over LCS as it scales better with data volume and enables major compactions.

LCS - Use for large tables (size overhead) where DTCS isn't suitable and we're unlikely to be concerned with freeing up space from bulk delete operations. Preferred over STCS as it means we can push our disk space utilisation further. Note: Requires some grunt to power (decent disks, decent compaction settings).

STCS - Use for small tables and tables unsuited to DTCS but where we really need to free up space or have data hard-deleted on bulk.