Monday, April 8, 2019

Oracle VS SQL Server Query Tuning

Note: I'm not a certified relational database administrator of any sort. Although I am begrudgingly a certified Cassandra administrator...

Introduction

The relational database technologies in use by our clients tends to depend on what region they're in. In APAC and Europe they tend to use Microsoft SQL Server. In the USA it's typically Oracle.

In the performance team, we work on performance issues across all our clients. So we wind up optimising queries in both SQL Server and Oracle.

You could be forgiven for assuming that the challenges would be much the same. After all, it's all SQL, and SQL is a standard. In reality, they're very different beasts.

While not the topic of this post, the most immediately obvious difference is in the tooling.

With Oracle we tend to utilise Oracle's tooling to tune and optimise (AWR Reports, Enterprise Management Studio, etc).

With SQL Server, we tend to lean on third party and our own bespoke tooling because the tooling provided by Microsoft is fairly horrendous.

In this post I'd like to focus on the differences when it comes to the tuning of queries themselves, regardless of the tools available.

Query Tuning Prioritisation

When we're looking to optimise a poorly performing query, we like prioritise our avenues of investigation such that we're targeting the simplest fixes that have the least ongoing maintenance burden associated with them first.

In some cases we can simply ensure statistics are representative of the data (and that a representative view is maintained going forward), and that'll enable the optimiser to select a more appropriate execution plan without us needing to touch the query or the schema.

Failing that, we may need to add an index or improve the query itself. If the query to be changed is from product (as opposed to solution configuration), this will need to be driven (sometimes with urgency) through development, which can make for a relatively heavy-weight fix.

Then there are hints and the locking down of execution plans to effectively force (or at least strongly suggest) the optimiser does things the way you know to be better. This is generally avoided as it is very data dependent and may become less appropriate as data changes, and isn't really suitable to being productised for all customers. Additionally, we don't like to do this as it breaks the declarative model of SQL (more on this later). We typically don't wind up getting to this point with Oracle, because the optimiser is really good, so long as you are feeding in the right information.

However, once again SQL Server is a different story...

SQL Server Challenges

With SQL Server we are continually running into issues that can't be resolved by having up-to-date statistics, an optimally written query, and the appropriate indexes. We often find that SQL Server opts for execution plans that defy all logic.

It seems to be that when a query reaches a certain level of complexity, the optimiser just gives up. You add one more join and and go from a sane execution plan that takes seconds, to one that does a tonne of index spooling and takes minutes.

One example of this is were you have a (complex) sub query and some WHERE conditions outside of it.

SELECT * FROM
( SELECT [columna] ,
  [columnb] ,
  [columnc] ,
  ...
) SUBQUERY
WHERE SUBQUERY.columna = 'foo'
AND SUBQUERY.columnb IN ('foo','bar')
AND SUBQUERY.facilityCode IN ('1','2','3','4')

Running just the sub query might return a completely unfiltered result set of a few hundred rows in seconds.

However, including the WHERE clauses results in the query taking minutes, despite returning less data.

What seems logical to humans in this case would be running the sub query first, THEN applying the predicates.

However, SQL Server refuses to see this.

To make matters worse, Microsoft has acknowledged there is not a way to hint SQL server to do this...

So what can be done?

A dirty hack of course...

SELECT * FROM
(SELECT TOP 1000000000 [columna]
  [columnb] ,
  [columnc] ,
  ...
ORDER BY columna
) SUBQUERY
WHERE SUBQUERY.columna = 'foo'
AND SUBQUERY.columnb IN ('foo','bar')
AND SUBQUERY.facilityCode IN ('1','2','3','4')

What this does is it forces the database to evaluate the entire subquery first before applying the predicates as it needs to ensure that it gets the top 1,000,000,000 (a "safe" number because we know we'll only ever have as many as thousands of rows returned) rows ordered by columna first.

It's not quite as optimal as having the optimiser choose a sane execution plan itself, or using a hint would be, but it ain't bad.

The Declarative Paradigm

Aside the the obvious hackery in the workaround demonstrated above, the major concern we have with some of the optimisation work we wind up doing with SQL Server that we don't wind up needing to do with Oracle, is we tend to have to break the declarative paradigm of SQL.

SQL is meant to be a declarative language, in that you tell it what to do, not how to do it.

You say, get me some milk. You shouldn't have to tell it to get the milk from the fridge, not from the shop.

Sure, you have to let it know that there is milk in the fridge for it to get... But in the case of SQL Server, we've told it that there's milk in the fridge and we find it off looking for a cow to milk...

Sunday, May 8, 2016

On Cassandra Stream Monitoring

In an ideal world we’d have a well-configured centralised monitoring solution in place for every deployment. Unfortunately, some times we don’t and we need to do the best we can with what we have at our disposal.

Without OpsCenter, an ELK stack, or some other centralised monitoring in place, the Cassandra swiss-army knife, nodetool, is what one would use to monitor streaming (bootstrapping, repairing, rebuilding) progress. I personally find the output a bit too verbose and difficult to grok at a glance.

I like to use the following command to get an ongoing, easy to grok picture of streaming progress.

watch -n 10 'nodetool netstats | grep "Receiving\|Sending" | gawk {'"'"' print $1" - "$11/$4*100"% Complete, "($4-$11)/1024/1024/1024" GB remaining" '"'"'}'

Here is an example of the output from a node being rebuilt from 4 other nodes.

Here's how it looks on the sending end(s).

Again, this is far from the ideal of a centralised monitoring system. But my hope as this will make life easier (or at least prettier) for somebody working in an environment under constraints that do not make centralised monitoring possible.

Thursday, December 3, 2015

On Cassandra Striping Across Availability Zones

We've had mixed advice given to us in regards to how Cassandra stripes its data across AWS Availability Zones (mapped to "Racks" via the EC2MultiRegionSnitch). Some said that striping is guaranteed to be across all 3 AZs our nodes are deployed across per Region. While others said the striping is simply a "best effort" operation. This has meant that we've had to take a conservative approach and assume equal distribution of partitions across AZs can not be relied on. The practical implication of this was that when an entire AWS AZ, or even 2 nodes within an AZ went down, we'd have to fail-over to the Disaster Recovery half of our cluster (an AWS Region mapped to a logical "Datacenter" via the EC2MultiRegionSnitch) to achieve the consistency level of LOCAL_QUORUM required by our applications in order to achieve "strong consistency". This was not cool... So we've done a few some tests to see what the real story is.

Our approach was to develop a little test app that iteratively selected all rows in one of the biggest tables in our schema, "log", where all our raw messages are stored. The app used a consistency level of ALL, which meant that all replicas of the partition being requested must reply for the select to be considered successful. As is standard for our deployments, we used a replication factor of 3 (every partition is replicated 3 times per logical datacenter, of which we have 2). This means that if one of 3 replicas for a partition is unavailable, the select will fail. The app will ignore these failures (as opposed to retrying the query) and just output the count of successfully selected partitions.

This app was then used in 4 different scenarios:

After a large, heavy backload had been performed. Done simply to get the count of partitions with all of the replicas available.
After an entire AZ was taken down. In this test we'd expect no selects at a consistency level of ALL to succeed if striping was perfectly equally distributing replicas across AZs.
After yet more load had been applied while an entire AZ is still down. This was to determine that with an AZ completely unavailable, we don't regress to striping all 3 replicas across just 2 AZs. Again we expect (desire) all selects with a consistency level of ALL to fail.
After the AZ has been brought up again. This is mainly done for the sake of completeness and we of course expect all selects with a consistency level of ALL to succeed.

What happened?

Test 1

The result of Test 1 was that 296,904,900 partitions were successfully selected. This isn't interesting in itself, but this number will come in handy later.

Test 2

The result of Test 2 was that 0 selects were successful; a good result indicating that data has been striped consistently across 3 AZs. Importantly, as we later found out, this testing was done with a "fetch size" (number of rows to attempt to fetch for purposes of pagination, etc) of 1. Out of curiosity we'd run this test again with just 1, 2, and 3 of our 4 nodes in the AZ down. We'd of course expected to see roughly 74,226,225 (one quarter of 296,904,900) of selects succeed at a consistency level of ALL per node in the AZ still up. Interestingly if a single partition in a fetch does not have the required consistency level available, the entire fetch appears to fail because of this. So the only way to get the precise number of successful selects with a consistency level of ALL and multiple nodes down in an AZ is to use a fetch size of 1, which is practically infeasible in terms of time due to the performance hit of doing so. Thankfully with an entire AZ down it seems that Cassandra is smart enough to know that all queries at consistency level ALL will fail and it does so very quickly, even with a fetch size of 1.

All Nodes in AZ Up	1 Node in AZ Down	2 Nodes in AZ Down	3 Nodes in AZ Down	All Nodes in AZ Down
Fetch Size	Successful Selects	Successful Selects	Successful Selects	Successful Selects	Successful Selects
1	296,904,900*	~ 222,678,67*	~ 148,452,450*	~ 74,226,225*	0
10	296,904,900*	48,093,960	17,281,840	8,161,280	0
100	296,904,900	41,047,204	5,071,200	1,059,500	0
1,000	296,904,900*	5,644,000	3,213,000	605,000	0
10,000	296,904,900*	2,660,000	2,360,000	470,000	0

* Not an actual measurement, but what we'd expect to see.

Test 3

The result of Test 3 was that we once again found that no selects were successful, even with a fetch size of 1. This is good because it indicates that striping has not regressed to putting all 3 replicas across Cassandra nodes in the only 2 still available AZs. Instead it will be writing hinted handoffs that will persist for 3 hours, so that if the downed nodes were only temporally available, then the missed writes could be replicated to them within those 3 hours. As our nodes in the downed AZ are completely gone (losing the data on their ephemeral disks), we'll instead need to stream all lost data to them when they're eventually replaced.

Test 4

As we'd sent in more messages since Test 1, the total number of entries in the log table was now 301,373,773. As an aside, a handy way to monitor bootstrap streaming progress is with the following command-line fu:

watch -n 10 'nodetool netstats | grep Receiving | gawk {'"'"' print $11/$4*100"% Complete, "$11/1024/1024/1024" GB remaining" '"'"'}'

How does this manifest in our applications which use a consistency level of LOCAL_QUORUM for "strong consistency"?

All of our applications use a hard-coded consistency level of LOCAL_QUORUM for all reads and writes to Cassandra in order to ensure "strong consistency". As an outcome of our testing, we now know that with our standard configuration of a Replication Factor of 3, Ec2MultiRegionSnitch, NetworkTopologyStrategy, and Cassandra nodes distributed across 3 AWS Availability Zones per Region; we can afford to lose an entire AZ of nodes (e.g. lose all of AZ C) and still not need to fail over to the Disaster Recovery region to consistently achieve our LOCAL_QUORUM consistency. However, if we lose one node in one AZ and another node in a different AZ (e.g. one node in AZ A and another in AZ B), we will have to fail over for some requests.

The following is a diagram of how all this manifests in terms of data striping in Cassandra with our configuration. Only writes have been shown as read behaviour is the same.

Wednesday, June 10, 2015

Why not to multi-tenant Cassandra

Cassandra and the Cloud go together like <insert hilarious comparison here>. So does multi-tenancy and the Cloud.

As well as being able to save costs by elastically horizontally scaling in response to load during a day, it’s an attractive idea to make further strides by sharing hardware between customers (multi-tenanting).

I’m not against multi-tenancy per se… I think it’s a great idea. But I am against doing it with Cassandra and here’s why…

You basically have 3 real options for multi-tenanting Cassandra, each with pros and cons:

By Keyspace

Pros:

Tenants may have different replication factors.
Tenants may have different schemas (enabling them to potentially run different applications).
Tenants don’t share SSTables and data is effectively isolated by Cassandra.

Cons:

Your scalability is limited because holding all the extra keyspaces and their tables in memory does not play well with Cassandra.

By Table

Pros:

Tenants may have different schemas (enabling them to potentially run different applications).
Tenants don’t share SSTables and data is effectively isolated by Cassandra.

Cons:

Your scalability is limited because holding all the extra tables in memory does not play well with Cassandra.
More complex application logic is needed to deal with different tables per tenant.

By Row

Pros:

A scalable approach as your memory requirements are the same as for one tenant.

Cons:

Tenants must have compatible schemas.
Tenants share both SSTables and Memtables, so isolation has to be done in the application, but active tenants can hinder the effectiveness of the shared caches for other tenants and cause compaction issues that will impact every tenant.

Having multiple Logical Datacenters has also been suggested to isolate the workloads of different customers. But this is designed for different workloads (e.g. BAU and Analytics) on the same data. If you want to isolate by tenant (data) then you’ll lose the benefits of multi-tenancy as you’ll no longer be sharing the hardware.

You can not win.

If you really want to save wastage costs from unused hardware, I'd consider increasing the granularity of your instances so that the wastage is reduced.

Thursday, May 14, 2015

Cassandra Compaction Strategies Under Heavy Delete Workflows

There are many factors at play when we look at what compaction strategies to use for our tables in Cassandra. Bulk deleting large quantities of data is something that is usually avoided in Cassandra. However, we have requirements that necessitate it (e.g. a requirement to hard-delete patient data when they've opted out of our service) and there are certain advantages to being able to do it (e.g. freeing up space when data has become "orphaned", never to be read again).

The big concern with deleting large quantities of data is tombstones. We have a log-structured file system in Cassandra in that we always append writes, avoiding disk seeks (how we achieve such awesome write performance). Even deletes are writes in the form of tombstones which mark data as having been deleted. Having lots of tombstones is a bad thing as we need to pull back the tombstones along with the upserts (inserts and updates are fairly much the same thing) for our data when we read it back. This spreads reads across SSTables and means we're holding more information in memory in order to resolve the current state of the data we're pulling back.

Because of this, it's important to understand the behaviour of different compaction strategies in relation to tombstones when we have a heavy delete workload. These were my findings:

Leveled Compaction Strategy (LCS)

Compaction load during heavy deletes was greatest with LCS. The upside being that this enabled more data to be compacted away during the delete workload, reducing the size consumed by the table as we went.

Delete throughput was also less under LCS than the other strategies due to the compaction load, but improved over time as the data size reduced.

Until we get Cassandra 3.0, nodetool compact (triggering a "major" compaction) is a no-op under LCS. This means that we can not opt to compact away all of the deleted data, leaving only the tombstones (assuming gc_grace_seconds hasn't expired). This is a big blow to the space saving use case as well as potentially causing issues around whether the data has been "hard" deleted.

A selling point for LCS over STCS is that you don't need the space 50% overhead in order to ensure compaction can safely happen.

Size Tiered Compaction Strategy (STCS)

The default compaction strategy, STCS, had a very light compaction load during the delete workflow. As a result, the data volumes actually grew with the tombstones being written and not much of the original data being compacted away.

Delete throughput was about the same as that of DTCS, around 34% higher than that of LCS in this test. Like with DTCS, delete throughput also remained fairly constant over time.

Triggering a "major" compaction with nodetool compact on STCS is generally not recommended as when you compact all of your data into one giant SSTable, it is going to be a very long time (if ever) that future SSTables will get up to a size with which they'll compact with your monolithic table. However, this did reduce the table size considerably as the tombstones (gc_grace_seconds hadn't expired) are a lot smaller than the actual data.

A downside to STCS is that it requires a 50% space overhead in order to guarantee it has enough space in order to perform a compaction. This can be a significant cost in hardware that could otherwise be utilised when you're at large scale.

Date Tiered Compaction Strategy (DTCS)

As with STCS, the compaction load was very light (we were only compacting the last hour of data together) and our data volume actually grew as we wrote more tombstones (the writes occurred a long time before the deletes).

Delete throughput was about the same as that of STCS, around 34% higher than that of LCS in this test, and remained fairly constant over time.

Triggering a major compaction had the same result as with STCS in that we wound up with one monolithic table and saved a tonne of space. However, unlike STCS the monolithic nature of this SSTable isn't so much of a concern under DTCS as we're not interested in compacting new data together with old data anyway as we're usually using it for more time-series type data.

How does this feed back into what Compaction Strategy to use and where?

In order of descending preference...

DTCS - Use anywhere we're going to do many immutable or very, very infrequently updated/deleted writes and we're only fetching back either a single logical row or a slice of rows with very close timestamps. Preferred over LCS as it scales better with data volume and enables major compactions.

LCS - Use for large tables (size overhead) where DTCS isn't suitable and we're unlikely to be concerned with freeing up space from bulk delete operations. Preferred over STCS as it means we can push our disk space utilisation further. Note: Requires some grunt to power (decent disks, decent compaction settings).

STCS - Use for small tables and tables unsuited to DTCS but where we really need to free up space or have data hard-deleted on bulk.

Tuesday, September 23, 2014

On Naming Things

"New York, New York. So good they named it twice." - Gerard Kenny

Naming things is hard.

That being said, I haven't found any area that has struggled with naming things as much as Java Garbage Collection...

For instance, even official Oracle documentation uses the term "infant mortality" to describe objects that "die" in the Young generation. While I get the metaphor (joke?), it's not really appropriate to talk about the infant mortality rate associated with your application when you're working in the Health domain. Saying things like "...the minor collection can take advantage of the high infant mortality rate." doesn't endear you to your coworkers.

There's also a lot of inconsistency and repurposing of terms.

The terms "Young Generation" and "New Generation" are often used interchangeably (even within the same documents)...

"The NewSize and MaxNewSize parameters control the new generation’s minimum and maximum size. Regulate the new generation size by setting these parameters equal. The bigger the younger generation, the less often minor collections occur. The size of the young generation relative to the old generation is controlled by NewRatio. For example, setting -XX:NewRatio=3 means that the ratio between the old and young generation is 1:3, the combined size of eden and the survivor spaces will be fourth of the heap." - Sun Java System Application Server Enterprise Edition 8.2 Performance Tuning Guide

The same goes for "Old Generation" and "Tenured Generation"...

Watch out though! The term "Tenured" is also used synonymously with "Promoted" when it comes to objects moving through the generations.

It gets worse when you read people's blogs and see them try to provide their own interpretations of the terminology. I read one today where the guy was trying to say some major collections aren't really major collections in his opinion... It just adds to the confusion.

Monday, July 7, 2014

On JMeter Reporting To Database

For this post I thought I'd talk about a plug-in I developed for Apache JMeter.

It's called Aggregate Report To Database. If you're familiar with Apache JMeter, you're no doubt familiar with the Aggregate Report plug-in. My Aggregate Report To Database plug-in builds upon the functionality of the Aggregate Report to enable both automatic and manually triggered saving of results to a Database (currently only Oracle, but would be very easy to enable others).

What does this achieve?

The main reason I developed this plug-in was in order to make it simple for development teams to take the performance testing scripts I'd made for their products and re-purpose them for nightly performance frameworks. Now teams can leverage existing scripts and have them run against their nightly builds to have traceability of performance improvements and regressions to builds, to quantify performance improvements and regressions, and to analyse performance trends over time.

How does it achieve this?

The build system triggers a JMeter test to run, the results of which are automatically saved into a relational database at the end of the test (so that saving the results has no impact on the test itself). We use Atlassian Confluence as an internal wiki. A page is configured by teams in the wiki to automatically graph the results (by querying the results database) of the past 2 weeks of performance testing against nightly builds for all the different features in a product. This makes it easy for a team to see at a glance how performance of different functionality has trended over time by simply loading a page. It's also an interesting set of graphs to show at a sprint demo.

Where can I get this plug-in?

I haven't currently published this plug-in publicly, but this is something I am definitely interested in doing if there is significant interest in getting it. I'd just have to clean a few things up and get permission from my employer to make it open source.

I'd like to thank Viktoriia Kuznetcova for her help in developing this plug-in.=)