Tuesday, September 23, 2014

On Naming Things

"New York, New York. So good they named it twice." Gerard Kenny 
Naming things is hard.

That being said, I haven't found any area that has struggled with naming things as much as Java Garbage Collection...

For instance, even official Oracle documentation uses the term "infant mortality" to describe objects that "die" in the Young generation. While I get the metaphor (joke?), it's not really appropriate to talk about the infant mortality rate associated with your application when you're working in the Health domain. Saying things like "...the minor collection can take advantage of the high infant mortality rate." doesn't endear you to your coworkers.

There's also a lot of inconsistency and repurposing of terms.

The terms "Young Generation" and "New Generation" are often used interchangeably (even within the same documents)...

"The NewSize and MaxNewSize parameters control the new generation’s minimum and maximum size. Regulate the new generation size by setting these parameters equal. The bigger the younger generation, the less often minor collections occur. The size of the young generation relative to the old generation is controlled by NewRatio. For example, setting -XX:NewRatio=3 means that the ratio between the old and young generation is 1:3, the combined size of eden and the survivor spaces will be fourth of the heap." - Sun Java System Application Server Enterprise Edition 8.2 Performance Tuning Guide

The same goes for "Old Generation" and "Tenured Generation"... 

Watch out though! The term "Tenured" is also used synonymously with "Promoted" when it comes to objects moving through the generations.

It gets worse when you read people's blogs and see them try to provide their own interpretations of the terminology. I read one today where the author was arguing that, in his opinion, some major collections aren't really major collections... It just adds to the confusion.

Monday, July 7, 2014

On JMeter Reporting To Database

For this post I thought I'd talk about a plug-in I developed for Apache JMeter.

It's called Aggregate Report To Database. If you're familiar with Apache JMeter, you're no doubt familiar with the Aggregate Report plug-in. My Aggregate Report To Database plug-in builds upon the functionality of the Aggregate Report to enable both automatic and manually triggered saving of results to a database (currently only Oracle, but it would be very easy to add support for others).

What does this achieve?

The main reason I developed this plug-in was to make it simple for development teams to take the performance testing scripts I'd made for their products and re-purpose them for nightly performance frameworks. Now teams can leverage existing scripts and run them against their nightly builds, giving them traceability of performance improvements and regressions to specific builds, a way to quantify those improvements and regressions, and the ability to analyse performance trends over time.

How does it achieve this?

The build system triggers a JMeter test to run, and the results are automatically saved into a relational database at the end of the test (so that saving the results has no impact on the test itself). We use Atlassian Confluence as an internal wiki. Teams configure a wiki page that automatically graphs the results of the past two weeks of performance testing against nightly builds (by querying the results database) for all the different features in a product. This makes it easy for a team to see at a glance how the performance of different functionality has trended over time simply by loading a page. It's also an interesting set of graphs to show at a sprint demo.
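
For anyone curious about the plumbing on the database side, it's nothing exotic. The following is a minimal sketch (not the actual plug-in code) of how the aggregate rows could be written out over plain JDBC at the end of a test; the table name, columns, and connection details are hypothetical and would need to match your own schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Sketch: write one aggregate row per sampler at the end of a test run.
    // Table and column names are made up, not the plug-in's actual schema.
    public class ResultsWriter {

        public void save(String buildId, Iterable<AggregateRow> rows) throws Exception {
            String sql = "INSERT INTO JMETER_RESULTS "
                       + "(BUILD_ID, SAMPLER, SAMPLES, AVG_MS, PCT90_MS, ERROR_PCT, RUN_DATE) "
                       + "VALUES (?, ?, ?, ?, ?, ?, SYSDATE)";
            try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@dbhost:1521:ORCL", "jmeter", "secret");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                for (AggregateRow r : rows) {
                    ps.setString(1, buildId);
                    ps.setString(2, r.samplerName);
                    ps.setLong(3, r.samples);
                    ps.setDouble(4, r.averageMs);
                    ps.setDouble(5, r.ninetiethPercentileMs);
                    ps.setDouble(6, r.errorPercent);
                    ps.addBatch();
                }
                ps.executeBatch(); // a single batch insert keeps the write quick
            }
        }

        // Plain holder for the values the Aggregate Report already calculates.
        public static class AggregateRow {
            String samplerName;
            long samples;
            double averageMs, ninetiethPercentileMs, errorPercent;
        }
    }

With something like this in place, the Confluence page only needs a query along the lines of "average response time per sampler per build for the last two weeks" to draw its graphs.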

Where can I get this plug-in?

I haven't yet published this plug-in publicly, but it's something I'm definitely interested in doing if there is significant interest. I'd just have to clean a few things up and get permission from my employer to make it open source.

I'd like to thank Viktoriia Kuznetcova for her help in developing this plug-in. =)

Sunday, June 8, 2014

On Java Garbage Collection Analysis & Tuning

1. Introduction

The purpose of this post is to serve as an introductory guide (or a cheat-sheet for the experienced but out of practice) on how to perform basic GC analysis and tuning.
It is not intended to be an in-depth guide to the vast field of JVM tuning.

2. Collectors and their pros/cons

In most enterprise-level applications, you will only ever want to use UseParallelOldGC or UseConcMarkSweepGC.
Here are the main collectors and their pros and cons (an example command line follows the list).
  • -XX:+UseSerialGC
    • A single GC thread.
    • Only suited to single-processor machines and data sets of up to about 100MB.
    • This will likely never be the best option with modern web applications.
  • -XX:+UseParallelGC
    • Multiple GC threads for minor collections, single threaded for major collections (old generation).
    • I don't know of any reason to use this collector if you're on a JVM (Java 5u6+) that has UseParallelOldGC.
  • -XX:+UseParallelOldGC
    • Multiple GC threads for both minor and major collections (old generation).
    • This is what you want to use if you're interested in getting the highest throughput out of your application, and pause times are not as big of a concern.
    • Colloquially referred to as the throughput collector.
  • -XX:+UseConcMarkSweepGC
    • Performs most of its garbage collection activity concurrently, which results in less overall throughput than UseParallelOldGC, but also less time spent in "Stop The World" pauses.
    • This is what you want to use if response times are more important than overall throughput, for example on HP Itanium hardware, which has poor collection rates.
    • Has 2 stop-the-world pause events with each major collection (initial mark and remark), although these are relatively short compared to the Full GCs of UseParallelOldGC.
    • You won't have to perform any Full GCs unless you encounter a concurrent mode failure (objects are being tenured into the old generation faster than they can be reclaimed from it).
    • Colloquially referred to as the concurrent collector.
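
For reference, selecting a collector is just a matter of command-line flags. The following is purely illustrative (the heap sizes and jar name are made up):

    # Throughput collector: maximise overall throughput, tolerate longer pauses
    java -Xms4g -Xmx4g -XX:+UseParallelOldGC -jar myapp.jar

    # Concurrent collector: trade some throughput for shorter pauses
    java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -jar myapp.jar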

3. Some background terminology and process

Terminology around Java Garbage Collection can be very confusing with multiple names for the same thing and reused terms (objects are tenured, plus there is a tenured generation).
Hopefully this will help...

3.1. Generations

  • New / Young Generation
    • Spaces in this generation have minor collections.
    • Consists of:
      • Eden Space
      • Survivor Spaces
  • Old / Tenured Generation
    • There is one space in this generation and it has major collections (e.g. CMS or Full GC).
  • Permanent (Perm) Generation
    • Used by the JVM for storing classes, methods etc.
  • Code Cache
    • Used by the HotSpot compiler for compilation and storage of native code.
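
If you ever need to size these regions explicitly, these are the flags involved (the values below are only examples, and the permanent generation flags apply to pre-Java 8 JVMs):

    -Xmn512m                                  # explicit new/young generation size
    -XX:PermSize=256m -XX:MaxPermSize=256m    # permanent generation sizing
    -XX:ReservedCodeCacheSize=128m            # code cache for JIT-compiled native code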

3.2. Tenuring/Promotion

Objects are initially created in the eden space (in the new/young generation).
Minor collections move objects that are still referenced (pointers to them still exist) out of eden, through the survivor spaces, and eventually tenure/promote them into the old generation.
Major collections happen in the old generation. There is nowhere for objects to be promoted to from this generation, so objects will either be collected and their space freed up, or persist and potentially cause issues.
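
If you want to see or influence how quickly objects are promoted, two flags are worth knowing about (the value shown is just an example):

    -XX:MaxTenuringThreshold=15      # how many minor collections an object may survive before being promoted
    -XX:+PrintTenuringDistribution   # logs the age distribution of objects in the survivor spaces at each minor GC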

4. How to identify a typical memory leak

Note: The tool I prefer, and the one used in these screenshots, is HPJMeter.
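
HPJMeter (and similar tools) work off the JVM's GC logs, so the application needs to be started with GC logging enabled. On a HotSpot JVM that's along these lines (the log file location is up to you):

    -verbose:gc -Xloggc:/var/log/myapp/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps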

4.1. What does a healthy JVM look like?

We can see that this JVM is healthy from a number of different views.
The following is a graph of old generation size before GC from a healthy production server under peak load using the concurrent collector.

Here we can see that the size of the old generation isn't increasing with each successive CMS collection. The size of the old generation after a CMS collection goes up and down, but there is no overall upward trend.
We could also see this on a graph of reclaimed bytes.

Here we see that the number of bytes reclaimed by the CMS collections fluctuates with the load (more concurrent users means more objects persisted), but the trend is flat.

4.2. What does a memory leak look like?

The following is a graph of the old generation before GC from a very unhealthy development server using the throughput collector.

We can see that with every successive Full GC, the old generation gets bigger and bigger.
Also note that the interval between Full GCs decreases with time.
This continues until we see the purple line, at which time we are in what is commonly referred to as GC Hell. At this point the JVM is unable to clear any objects and is spending almost all of its time in garbage collection. The user will experience the system becoming completely unresponsive.
Once the JVM gets to the point where it is spending more than 98% of its time in garbage collection and recovering less than 2% of the heap, an OutOfMemoryError (OOM) will be thrown. This will cause the JVM to crash!
The following is a graph showing reclaimed bytes for the same period. We can see here that with each successive collection we are reclaiming fewer and fewer bytes. If this is allowed to continue, an OOM error will certainly be thrown.


5. What tuning is typically done

5.1. Sizing the heap


The maximum heap size should be set high enough that you have enough space for your retained heap (the level your heap drops to after a major collection) under peak load, plus enough head-room so that the interval between major collections isn't too short.
Typically the concurrent collector requires more head-room than the throughput collector.
Finding the ideal heap size will require load testing and analysis.
Once you have determined what maximum heap size works best for you, you may want to set the minimum heap size to the same value in order to avoid poor performance while the JVM gradually increases the heap size from the minimum to your maximum.
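
For example (the size itself is only illustrative and has to come from your own load testing and analysis):

    -Xms4g -Xmx4g    # minimum and maximum heap set to the same value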

5.2. Sizing the generations

Depending on the nature of your application (how transient the objects are), you'll want to size the generations differently.
Sizes of generations can be explicit or by ratio.
Ratios are somewhat preferable in my opinion, as this means that once we understand the "nature" of our application, we can adjust for the scale of the deployment by simply adjusting the heap size.
The main ratio you're likely to need to size is the New Ratio.
Setting the New Ratio can sometimes be confusing. If you have the New Ratio set to 2 (-XX:NewRatio=2), the ratio of the old generation to the young generation becomes 2:1. This means that 2/3 of the heap is occupied by the old generation and 1/3 is occupied by the new.
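
To make the arithmetic concrete (the heap size is illustrative): with a 3GB heap and -XX:NewRatio=2, the old generation gets 2GB and the new generation gets 1GB. If you'd rather size the new generation explicitly, -Xmn does that directly:

    -Xmx3g -XX:NewRatio=2    # old = 2GB, new = 1GB
    -Xmx3g -Xmn1g            # equivalent explicit sizing of the new generation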

5.3. CMS Initiating Occupancy Fraction

When using the concurrent collector, you'll want to set the CMS Initiating Occupancy Fraction.
This can be done by setting -XX:CMSInitiatingOccupancyFraction=<N> where <N> is the percentage of the tenured (old) generation size that needs to be full before a CMS collection is triggered.
This will of course have to be a percentage of the old generation that is higher than where our retained heap size sits at peak load.
It will have to be high enough so that we aren't constantly having CMS collections and their associated pauses.
It will also have to be low enough that the CMS collection isn't triggered too late, which would result in a concurrent mode failure and a Full GC.
Again, tuning this will require load testing and analysis.
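
A typical starting point looks like the following (75 is just a common starting value, not a recommendation). Note that without -XX:+UseCMSInitiatingOccupancyOnly the JVM treats your fraction as a hint for the first collection only and then adapts on its own:

    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly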

6. What other JVM parameters are handy

  • -XX:+DisableExplicitGC
    • Applications can programmatically trigger a Full GC by calling System.gc().
    • This is not desirable, and this flag disables such explicit calls.
  • -XX:+HeapDumpOnOutOfMemoryError 
    • This will cause the JVM to generate a heap dump when you run out of memory (once the JVM gets to the point where it is spending more than 98% of its time in garbage collection and recovering less than 2% of the heap).
    • Having a heap dump means you can analyse it in Eclipse Memory Analyzer (MAT) in order to determine which objects are being persisted and causing the OOM error.
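
Put together, the extra flags I'd typically add look something like this (the dump path is only an example; make sure the directory has enough free space for a full heap dump):

    -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/dumps/myapp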

Wednesday, June 4, 2014

On Performance Engineering Yourself

The Force Is Strong With This One


Lately I've been paired performance testing a product in what I like to think of as a sort of Emperor Palpatine and Darth Maul/Vader arrangement. My apprentice has been running load tests in various configurations whilst I've been working behind the scenes to generate 250 million insurance claims with distributions representative of real data to underpin future testing.


Cloud City

Due to the scale of the testing that we're doing, this would have been impossible to do on internal hardware. As such we chose to use Amazon Web Services (AWS) after previously trying a few other cloud providers and being disappointed with their performance and reliability. So far AWS has been brilliant, no complaints whatsoever.

The possibilities offered by doing our testing in the cloud are quite exciting and have made me realise that the key place where optimisation is required in order to get the most value out of it is actually ourselves. Previously, we were effectively bottlenecked on the available hardware. For example, if I was generating test data, the database would be hammered and of no use for running load tests. This meant we had to manage our time so that load tests were run during the day, while data generation and soak tests were run overnight and looked at the following morning.


A New Hope

Now we can quickly and easily spin up (using automated tooling) separate environments for each of us to work on tasks in parallel without worrying about interfering with each other. The coolest thing about this is that it makes sense from a cost perspective too. If I spin up 32 machines for 1 hour, it costs the same as 16 of those machines for 2 hours. But the most important thing is that the overall cost of spinning up more nodes is actually lower, because the person-hours spent are halved. And people are by far the biggest cost.

What this means, though, is that to get the most value out of the cloud we need to be able to make our work highly parallelised. What if my apprentice could run every one of his planned load tests at the very same time? What might previously have been a week's worth of testing could be finished in an hour, and importantly at a lower cost. Of course there is overhead with analysis of results and reporting, but some of that can be automated too.

What if my data generation could take half the time if I threw twice the hardware at it? The possibilities are awesome. But they require us to optimise ourselves as engineers just like we would normally optimise our software.

Wednesday, May 21, 2014

On Generating Performance Test Data

There comes a time in every Performance Tester's life when he or she has to generate some test data. You can't always get real data. Working in the health domain, this is often due to issues relating to data sovereignty and protected health information (PHI). Sometimes you can't even look at the data with your own eyes, and thus need to tell someone else what information you want from it, and likely how to find that information. In this post I've decided to outline some approaches I've used for getting the information you need when you can't actually touch the data.

What to ask?

If I'm not able to look at the data myself and thus need to utilise someone else's valuable eyes and time, then these are the types of questions I'd typically ask. I find it's a balance: ask for a few key stats and you'll get pretty good data; ask for a daunting amount of info and you may get nothing at all.

The Most Basic Questions:
  • How many of X are there?
  • How many distinct combinations of Y and Z are there?
  • How many distinct combinations of Y and Z are there per X? (median, min, max)
  • How many Z are there?
  • What is the size of each N?
  • How many N elements are there per X? (median, min, max bytes or characters)
  • Is there anything else that I'm neglecting that looks like it could be a concern?
Key:
X = whatever your payload is divided into, typically some form of "message" in the health domain
Y = identifiers
Z = namespaces
N = other elements that could be big, either because of their quantity, their individual size, or both

Note: I want to reiterate that this is not the ideal. For example, with this we're assuming everything is normally distributed, which is not always the case. If we had access to the data ourselves and plenty of time, I'd suggest generating histograms in order to see the actual distributions. Often data will be bi-modal which is a great opportunity to develop best-case and worst-case scenario data sets.

If you've done most of the work for people, they're more likely to help.

The data I'm using as a reference for the performance test data I'm generating is usually either stored in relational databases or in text files. If you know how the data is structured then you can make it more likely that you'll get the information you need by providing ready-made queries or commands (grep, gawk, sort, wc, etc) that somebody just needs to run in order to be the hero.
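
For example, if the reference data is in a relational database, the "how many distinct combinations of Y and Z are there per X" question can be handed over as a ready-to-run query. The table and column names below are obviously made up and would need to match the real schema:

    SELECT MIN(combos)    AS min_combos,
           MEDIAN(combos) AS median_combos,
           MAX(combos)    AS max_combos
    FROM (
        SELECT message_id,
               COUNT(DISTINCT identifier || '|' || namespace) AS combos
        FROM   messages
        GROUP  BY message_id
    );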

Actually generating the data.

I feel this is probably too specific to particular products to provide general tips, so I'm going to leave it out. Maybe something for a future post.

Friday, May 16, 2014

On Geographically Distributed Load Testing

Welcome to my first public blog post!

Occasionally I get asked to perform load tests from geographically distributed locations, as opposed to within the same local network that the application under test is running in. The idea is that we'll find out how well the application performs for clients in different locations and do "something" about the results. I've decided to write this blog post with my opinions on the matter so that I can refer people to it the next time it's brought up.

Pros:
  • If you're extremely short on hardware, companies will run your load testing scripts for you on their hardware.
  • You'll test the external interface of the hosting provider you're using. However, this could also be a con if your current production environments are using the same interface. Also, it's not infeasible to test this locally. Plus, surely you'd have enforceable SLAs with your hosting provider around this.
Cons:
  • If you find a problem outside the DMZ, you probably won't know where the problem was, as each packet can take its own route. Even if you did find an issue on a particular node, you couldn't do anything to fix it. (Are you going to ring them up and tell them to get their act together? Although I guess there are some companies that have the kind of power to make change here.)
  • Unless you're somehow crowd sourcing the load testing, you'll be using a very limited subset of ISPs and potential routes to the destination. Different upstream providers can mean very different routes.
  • The only thing you're going to be able to do in order to fix any problems you find with traffic from a given location is to look at optimising your application's network profile, which you could do without testing from another geographic location.
  • Your application has to be exposed to the internet. Security will be a major concern.
  • You'll likely have to work with another company to provide remote hardware or to run the scripts as well which means:
    • You shouldn't use real data in your tests as it will all be visible to the third party.
    • They'll have your scripts, and with them knowledge of your application and how it works, plus knowledge of any testability-facilitating methods you use. (These should never wind up in production, but I believe we'd be lying if we said we hadn't heard of this happening before.)
So in conclusion, I'm not in favour of going to the effort of running geographically distributed load tests, given the low value they offer in relation to running performance tests locally. I find it hard to believe an organisation could be so strapped for hardware resources as to make this their only option, given hardware's relative inexpensiveness in relation to other resources (e.g. engineers) and the option of running everything in a cloud environment (I've been using AWS a lot lately and am very impressed, BTW).

Thanks for reading what will hopefully be the first post of many!

Regards,

Anthony Fisk