Sunday, June 8, 2014

On Java Garbage Collection Analysis & Tuning

1. Introduction

The purpose of this page is to serve as an introductory guide (or a cheat-sheet for the experienced but out of practice) on how to perform basic GC analysis and tuning.
This is not intended to be an in-depth guide to the vast field of JVM tuning.

2. Collectors and their pros/cons

In most enterprise-level applications, you will only ever want to use UseParallelOldGC or UseConcMarkSweepGC.
Here are the main collectors and their pros and cons.
  • -XX:+UseSerialGC
    • A single GC thread.
    • Only suited to single-processor machines and up to 100 MB of data.
    • This will likely never be the best option with modern web applications.
  • -XX:+UseParallelGC
    • Multiple GC threads for minor collections, single threaded for major collections (old generation).
    • I don't know of any reason to use this collector if you're on a JVM (Java 5u6+) that has UseParallelOldGC.
  • -XX:+UseParallelOldGC
    • Multiple GC threads for both minor and major collections (old generation).
    • This is what you want to use if you're interested in getting the highest throughput out of your application and pause times are less of a concern.
    • Colloquially referred to as the throughput collector.
  • -XX:+UseConcMarkSweepGC
    • Performs most of its garbage collection activity concurrently, which results in less overall throughput than UseParallelOldGC but also less time spent in "Stop The World" pauses.
    • This is what you want to use if response times are more important than overall throughput, for example on HP Itanium hardware that has poor collection rates.
    • Has 2 stop-the-world pause events with each major collection (initial mark and remark), although these are relatively short compared to the Full GCs of UseParallelOldGC.
    • You won't have to perform any Full GCs unless you encounter a concurrent mode failure (objects are being tenured into the old generation faster than they can be reclaimed from it).
    • Colloquially referred to as the concurrent collector.
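
To confirm which collectors a running JVM actually ended up with, the standard java.lang.management API can list them. The following is a minimal sketch; the class name ActiveCollectors is made up for illustration, and the collector names printed depend on the JVM version and the flags in effect.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the garbage collectors active in this JVM.
class ActiveCollectors {

    // One line per collector: name, number of collections, accumulated time.
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe()) {
            System.out.println(line);
        }
    }
}
```

Running this with different -XX:+Use...GC flags is a quick way to check that the flag you set is the collector you got.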

3. Some background terminology and process

Terminology around Java Garbage Collection can be very confusing with multiple names for the same thing and reused terms (objects are tenured, plus there is a tenured generation).
Hopefully this will help...

3.1. Generations

  • New / Young Generation
    • Spaces in this generation have minor collections.
    • Consists of:
      • Eden Space
      • Survivor Spaces
  • Old / Tenured Generation
    • There is one space in this generation and it has major collections (e.g. CMS or Full GC).
  • Permanent (Perm) Generation
    • Used by the JVM for storing classes, methods etc.
  • Code Cache
    • Used by the HotSpot compiler for compilation and storage of native code.
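
The spaces listed above can be inspected from inside a running JVM through the standard memory pool MXBeans. This is a small sketch only; the class name MemoryPools is invented for the example, and the pool names reported (such as "PS Eden Space" or "CMS Old Gen") vary with the collector and JVM version.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prints each memory pool with its current usage.
class MemoryPools {

    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null) {
                continue;
            }
            // getMax() returns -1 when the maximum is undefined for the pool.
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / 1024) + "K";
            lines.add(pool.getName() + " [" + pool.getType() + "]"
                    + " used=" + (usage.getUsed() / 1024) + "K"
                    + " max=" + max);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe()) {
            System.out.println(line);
        }
    }
}
```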

3.2. Tenuring/Promotion

Objects are initially created in the eden space (in the new/young generation).
Minor collections tenure/promote objects that are still referenced (memory pointers to them still exist), moving them from eden through the survivor spaces and eventually into the old generation.
Major collections happen in the old generation. There is nowhere for objects to be promoted to from this generation, so objects will either be collected and their space freed up, or persist and potentially cause issues.

4. How to identify a typical memory leak

Note: The tool I prefer, and the one used for these screenshots, is HPJMeter.

4.1. What does a healthy JVM look like?

We can see that this JVM is healthy from a number of different views.
The following is a graph of old generation size before GC from a healthy production server under peak load using the concurrent collector.

Here we can see that the size of the old generation isn't increasing with each successive CMS collection. The size of the old generation after a CMS collection goes up and down, but there is no overall upward trend.
We could also see this on a graph of reclaimed bytes.

Here we see that the number of bytes reclaimed by the CMS collections fluctuates with the load (more concurrent users means more objects persisted), but the trend is flat.

4.2. What does a memory leak look like?

The following is a graph of the old generation before GC from a very unhealthy development server using the throughput collector.

We can see that with every successive Full GC, the old generation gets bigger and bigger.
Also note that the interval between Full GCs decreases with time.
This continues until we see the purple line, at which time we are in what is commonly referred to as GC Hell. At this point the JVM is unable to clear any objects and is spending almost all of its time in garbage collection. The user will experience the system becoming completely unresponsive.
Once the JVM gets to a point where it is spending more than 98% of its time in garbage collection and recovering less than 2% of the heap, an OutOfMemoryError (OOM) will be thrown. In practice this usually takes the application down!
The following is a graph showing reclaimed bytes for the same period. We can see here that with each successive collection we are reclaiming fewer and fewer bytes. If this is allowed to continue, an OOM error will certainly be thrown.
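
The "is the old generation trending upwards?" check described above can also be done numerically rather than by eye. Below is a rough sketch that fits a least-squares line through old generation occupancy samples taken after each major collection. The class name, method names and the threshold parameter are all assumptions made for this example, and a positive slope over a short window is only a hint of a leak, not proof.

```java
// Hypothetical helper: estimates the trend of old generation occupancy.
class OldGenTrend {

    // Least-squares slope of the samples against their index,
    // i.e. average growth in bytes per major collection.
    static double slope(long[] usedAfterGc) {
        int n = usedAfterGc.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (long y : usedAfterGc) {
            meanY += y;
        }
        meanY /= n;

        double num = 0;
        double den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (usedAfterGc[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    // Flags a possible leak when growth per collection exceeds the threshold.
    static boolean looksLikeLeak(long[] usedAfterGc, double thresholdBytesPerGc) {
        return slope(usedAfterGc) > thresholdBytesPerGc;
    }

    public static void main(String[] args) {
        long[] healthy = {300, 280, 310, 290};  // goes up and down, flat trend
        long[] leaking = {100, 200, 300, 400};  // grows with every collection
        System.out.println("healthy slope: " + slope(healthy));
        System.out.println("leaking slope: " + slope(leaking));
    }
}
```

The sample values would come from your GC logs or from whatever tool you use to graph them.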

5. What tuning is typically done

5.1. Sizing the heap

The maximum heap size should be set high enough so that you have enough space for your retained heap (where your heap size drops to after a major collection) under peak load and enough head-room so that the interval between major collections isn't too short.
Typically the concurrent collector requires more head-room than the throughput collector.
Finding the ideal heap size will require load testing and analysis.
Once you have determined what maximum heap size works best for you, you may want to set the minimum heap size to the same value in order to avoid poor performance while the JVM gradually increases the heap size from the minimum to your maximum.

5.2. Sizing the generations

Depending on the nature of your application (how transient the objects are), you'll want to size the generations differently.
Sizes of generations can be explicit or by ratio.
In my opinion ratios are somewhat preferable, as this should mean that once we understand the "nature" of our application, we can adjust for the scale of deployment by simply adjusting the heap size.
The main ratio you're likely to need to size is the New Ratio.
Setting the New Ratio can sometimes be confusing. If you have the New Ratio set to 2 (-XX:NewRatio=2), the ratio between the young and the old generation becomes 1:2. This means that 2/3 of the heap is occupied by the old generation and 1/3 is occupied by the new.
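
The arithmetic behind NewRatio can be written out as a small worked example. This is only a sketch of the ratio itself; the class and method names are invented here, and a real JVM aligns and rounds generation sizes, so actual values will differ slightly.

```java
// Hypothetical helper: approximate generation sizes for a given NewRatio.
class NewRatioSizing {

    // Young generation gets 1 part of (newRatio + 1) parts of the heap.
    static long youngBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    // Old generation gets the remaining newRatio parts.
    static long oldBytes(long heapBytes, int newRatio) {
        return heapBytes - youngBytes(heapBytes, newRatio);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long heap = 3072 * mb; // e.g. a 3 GB heap
        System.out.println("young=" + youngBytes(heap, 2) / mb + "MB"
                + " old=" + oldBytes(heap, 2) / mb + "MB");
    }
}
```

With a 3072 MB heap and NewRatio=2 this gives roughly 1024 MB young and 2048 MB old, which matches the 1/3 and 2/3 split.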

5.3. CMS Initiating Occupancy Fraction

When using the concurrent collector, you'll want to set the CMS Initiating Occupancy Fraction.
This can be done by setting -XX:CMSInitiatingOccupancyFraction=<N> where <N> is the percentage of the tenured (old) generation size that needs to be full before a CMS collection is triggered.
This will of course have to be a percentage of the old generation that is higher than where our retained heap size sits at peak load.
It will have to be high enough so that we aren't constantly having CMS collections and their associated pauses.
It will also have to be lower than the point at which triggering a CMS will happen too late and result in Concurrent Mode Failure and a Full GC.
Again, tuning this will require load testing and analysis.

6. What other JVM parameters are handy

  • -XX:+DisableExplicitGC
    • Applications can programmatically trigger a Full GC by calling System.gc().
    • This is not desirable and this flag will disable this feature.
  • -XX:+HeapDumpOnOutOfMemoryError
    • This will cause the JVM to generate a heap dump when an OutOfMemoryError is thrown (for example, once the JVM gets to a point where it is spending 98% of time in garbage collection and less than 2% of the heap is being recovered).
    • Having a heap dump will mean you can analyse it in Eclipse Memory Analyzer (MAT) in order to determine which objects are being persisted and causing the OOM error.

Wednesday, June 4, 2014

On Performance Engineering Yourself

The Force Is Strong With This One

Lately I've been paired performance testing a product in what I like to think of as a sort of Emperor Palpatine and Darth Maul/Vader arrangement. My apprentice has been running load tests in various configurations whilst I've been working behind the scenes to generate 250 million insurance claims with distributions representative of real data to underpin future testing.

Cloud City

Due to the scale of the testing that we're doing, this would have been impossible to do on internal hardware. As such we chose to use Amazon Web Services (AWS) after previously trying a few other cloud providers and being disappointed with their performance and reliability. So far AWS has been brilliant, no complaints whatsoever.

The possibilities offered by doing our testing in the cloud are quite exciting and have made me realise that the key place where optimisation is required in order to get the most value out of it is actually ourselves. Previously, we were effectively bottlenecked on the available hardware. For example, if I was generating test data, the database would be hammered and of no use for running load tests. This meant we had to manage our time so that load tests were run during the day and data generation and soak tests were run overnight and looked at the following morning.

A New Hope

Now we can quickly and easily spin up (using automated tooling) separate environments for each of us to work on tasks in parallel without worrying about interfering with each other. The coolest thing about this is it makes sense from a cost perspective too. If I spin up 32 machines for 1 hour, it costs the same as 16 of those machines for 2 hours. But the most important thing is that the overall cost is lower when spinning up more nodes, because the person-hours spent are halved. And people are by far the biggest cost.

What this means though is that to get the most value out of the cloud we need to be able to make our work highly parallelized. What if my apprentice could run every one of his planned load tests at the very same time? What might previously have been a week's worth of testing could be finished in an hour, and importantly at lower cost. Of course there is overhead with analysis of results and reporting, but some of that can be automated too.

What if my data generation could take half the time if I threw twice the hardware at it? The possibilities are awesome. But they require us to optimise ourselves as engineers just like we would normally optimise our software.