Wednesday, May 21, 2014

On Generating Performance Test Data

There comes a time in every Performance Tester's life when he or she has to generate some test data. You can't always get real data; working in the health domain, this is often due to issues relating to data sovereignty and protected health information (PHI). Sometimes you can't even look at the data with your own eyes, and so you need to tell someone else what information you want from it and, most likely, how to find it. In this post I've decided to outline some approaches I've used for getting the information you need when you can't actually touch the data.

What to ask?

If I'm not able to look at the data myself and thus need to utilise someone else's valuable eyes and time, then these are the types of questions I'd typically ask (a sketch of commands that answer some of them follows the list). I find it's a balance: ask for a few key stats and you'll probably get pretty good data; ask for a daunting amount of info and you may get nothing at all.

The Most Basic Questions:
  • How many of X are there?
  • How many distinct combinations of Y and Z are there?
  • How many distinct combinations of Y and Z are there per X? (median, min, max)
  • How many Z are there?
  • What is the size of each N?
  • How many N elements are there per X? (median, min, max bytes or characters)
  • Is there anything else that I'm neglecting that looks like it could be a concern?
Key:
X = whatever your payload is divided into, typically some form of "message" in the health domain
Y = identifiers
Z = namespaces
N = other elements that could be big, either because of their quantity, their individual size, or both
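
To make these questions concrete, here's a rough sketch of the kind of commands that would answer a few of them against a flat-file extract. The file name (messages.txt), the pipe-delimited layout, and the field positions are all assumptions and would need adjusting to the real structure.

    # Assumed layout: one row per identifier, pipe-delimited:
    #   message_id|namespace|identifier|payload
    # Adjust the field numbers below to match the real extract.

    # How many X (messages) are there?
    cut -d'|' -f1 messages.txt | sort -u | wc -l

    # How many distinct combinations of Y and Z (namespace + identifier pairs) are there?
    cut -d'|' -f2,3 messages.txt | sort -u | wc -l

    # How many Z (namespaces) are there?
    cut -d'|' -f2 messages.txt | sort -u | wc -l

    # Distinct Y/Z combinations per X: min, rough median, max
    cut -d'|' -f1,2,3 messages.txt | sort -u | cut -d'|' -f1 | uniq -c | gawk '{ print $1 }' | sort -n |
      gawk '{ a[NR] = $1 } END { print "min=" a[1], "median=" a[int((NR + 1) / 2)], "max=" a[NR] }'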

Note: I want to reiterate that this is not the ideal approach. For example, these questions implicitly assume everything is normally distributed, which is not always the case. If we had access to the data ourselves and plenty of time, I'd suggest generating histograms in order to see the actual distributions. Often data will be bi-modal, which is a great opportunity to develop best-case and worst-case scenario data sets.
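
If you do get hands-on access, a quick way to eyeball a distribution is to bucket sizes into a crude histogram. Here's a sketch with gawk, assuming one message payload per line of a payloads.txt extract; both the file and the 10 KB bucket width are made up for illustration.

    # Bucket payload sizes into 10 KB bins and count how many fall in each.
    gawk '{ bucket = int(length($0) / 10240) * 10; hist[bucket]++ }
          END { for (b in hist) print b " KB", hist[b] }' payloads.txt | sort -n

Two clear peaks in the output is the bi-modal case mentioned above: build one data set around each peak.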

If you've done most of the work for people, they're more likely to help.

The data I'm using as a reference for the performance test data I'm generating is usually stored either in relational databases or in text files. If you know how the data is structured, you can make it much more likely that you'll get the information you need by providing ready-made queries or commands (grep, gawk, sort, wc, etc.) that somebody just needs to run in order to be the hero.
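
One way to do that is to bundle the checks into a single script, so the request becomes "run this and send me back the output file". This is only a sketch: the input file, delimiter and field positions are the same assumptions as above, and the output it produces is counts only, no actual identifiers or PHI.

    #!/bin/sh
    # Profile a pipe-delimited extract and write the counts to a results file.
    # File name, delimiter and field positions are assumptions - adjust to suit.
    IN=messages.txt
    OUT=data_profile.txt

    {
      echo "Total rows:";               wc -l < "$IN"
      echo "Distinct messages (X):";    cut -d'|' -f1 "$IN" | sort -u | wc -l
      echo "Distinct Y/Z pairs:";       cut -d'|' -f2,3 "$IN" | sort -u | wc -l
      echo "Largest payload (chars):";  gawk -F'|' '{ if (length($4) > max) max = length($4) } END { print max + 0 }' "$IN"
    } > "$OUT"

    echo "Done - please send back $OUT (counts only, no PHI)."

The specific commands matter less than the fact that the person with access only has to run one thing and send back a small text file.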

Actually generating the data.

I feel this is probably too specific to particular products to provide general tips, so I'm going to leave it out. Maybe something for a future post.
