Wednesday, May 21, 2014

On Generating Performance Test Data

There comes a time in every Performance Tester's life when he or she has to generate some test data. You can't always get real data; working in the health domain, this is often due to issues relating to data sovereignty and protected health information (PHI). Sometimes you can't even look at the data with your own eyes, so you need to tell someone else what information you want from it, and probably how they can find it. In this post I've outlined some approaches I've used for getting the information you need when you can't actually touch the data.

What to ask?

If I'm not able to look at the data myself and need to use someone else's valuable eyes and time, these are the types of questions I'd typically ask. I find it's a balance: ask for a few key stats and you'll usually get pretty good data; ask for a daunting amount of information and you may get nothing at all. Below the list and key I've sketched some example queries that would answer these questions.

The Most Basic Questions:
  • How many of X are there?
  • How many distinct combinations of Y and Z are there?
  • How many distinct combinations of Y and Z are there per X? (median, min, max)
  • How many Z are there?
  • What is the size of each N?
  • How many N elements are there per X? (median, min, max bytes or characters)
  • Is there anything else that I'm neglecting that looks like it could be a concern?
Key:
X = whatever your payload is divided into, typically some form of "message" in the health domain
Y = identifiers
Z = namespaces
N = other elements that could be big, either because of their quantity, their individual size, or both
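
To make the list above concrete, here's a minimal sketch of queries that would answer some of these questions. It assumes a PostgreSQL-style database and purely hypothetical table and column names (a messages table for X, and an identifiers table holding identifier Y, namespace Z, and a payload column standing in for N); substitute whatever your actual schema looks like.

  -- Hypothetical schema (adjust to your own):
  --   messages(message_id)                                  -- one row per X
  --   identifiers(message_id, identifier, namespace, payload)

  -- How many X are there?
  SELECT COUNT(*) FROM messages;

  -- How many distinct combinations of Y and Z are there?
  SELECT COUNT(*)
  FROM (SELECT DISTINCT identifier, namespace FROM identifiers) AS combos;

  -- How many distinct Y/Z combinations are there per X? (median, min, max)
  SELECT MIN(n) AS min_combos,
         MAX(n) AS max_combos,
         PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY n) AS median_combos
  FROM (SELECT message_id, COUNT(DISTINCT (identifier, namespace)) AS n
        FROM identifiers
        GROUP BY message_id) AS per_message;

  -- How many characters of N are there per X? (median, min, max)
  SELECT MIN(len) AS min_chars,
         MAX(len) AS max_chars,
         PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY len) AS median_chars
  FROM (SELECT message_id, SUM(LENGTH(payload)) AS len
        FROM identifiers
        GROUP BY message_id) AS per_message;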

Note: I want to reiterate that this is not the ideal. For example, with this approach we're assuming everything is normally distributed, which is not always the case. If we had access to the data ourselves and plenty of time, I'd suggest generating histograms in order to see the actual distributions. Often data will be bi-modal, which is a great opportunity to develop best-case and worst-case scenario data sets.
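
If you do get hands-on access (or a very patient helper), a rough histogram is easy to produce in SQL as well. The sketch below uses the same hypothetical identifiers table and PostgreSQL's WIDTH_BUCKET, with an arbitrary 0-100 range split into ten buckets; two separate peaks in the output would be the tell-tale sign of a bi-modal distribution worth splitting into best-case and worst-case data sets.

  -- Histogram of identifiers-per-message: ten buckets over an assumed 0-100 range
  SELECT WIDTH_BUCKET(n, 0, 100, 10) AS bucket,
         COUNT(*)                    AS messages
  FROM (SELECT message_id, COUNT(*) AS n
        FROM identifiers
        GROUP BY message_id) AS per_message
  GROUP BY bucket
  ORDER BY bucket;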

If you've done most of the work for people, they're more likely to help.

The data I'm using as a reference for the performance test data I'm generating is usually stored either in relational databases or in text files. If you know how the data is structured, you can make it much more likely that you'll get the information you need by providing ready-made queries or commands (grep, gawk, sort, wc, etc.) that somebody just needs to run in order to be the hero.
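
As an example of doing the work for people, the sketch below rolls a few of the earlier questions into one labelled result set (again assuming the hypothetical messages/identifiers schema and PostgreSQL-style casts), so the person with access only has to paste it in, run it, and send back a single small table.

  -- Everything in one copy-paste query, returning labelled stats
  SELECT 'total messages' AS stat,
         COUNT(*)::text   AS value
  FROM messages
  UNION ALL
  SELECT 'distinct identifier/namespace combinations',
         COUNT(*)::text
  FROM (SELECT DISTINCT identifier, namespace FROM identifiers) AS combos
  UNION ALL
  SELECT 'distinct namespaces',
         COUNT(DISTINCT namespace)::text
  FROM identifiers;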

Actually generating the data.

I feel this is probably too specific to particular products to provide general tips, so I'm going to leave it out. Maybe something for a future post.

Friday, May 16, 2014

On Geographically Distributed Load Testing

Welcome to my first public blog post!

Occasionally I get asked to perform load tests from geographically distributed locations, as opposed to from within the same local network that the application under test is running in. The idea is that we'll find out how well the application will perform for clients in different locations and do "something" about the results. I've decided to write this blog post so that I have something to refer people to, with my opinions on the matter, the next time it's brought up.

Pros:
  • If you're extremely short on hardware, companies will run your load testing scripts for you on their hardware.
  • You'll test the external interface of the hosting provider you're using. However, this could also be a con if your current production environments are using the same interface. Also, it's not infeasible to test this locally. Plus, surely you'd have enforceable SLAs with your hosting provider around this.
Cons:
  • If you find a problem outside the DMZ, you probably won't know where the problem is, as each packet can take its own route. Even if you did pin an issue down to a particular node, you couldn't do anything to fix it. (Are you going to ring them up and tell them to get their act together? Although I guess there are some companies that have the kind of power to make change here.)
  • Unless you're somehow crowdsourcing the load testing, you'll be using a very limited subset of ISPs and potential routes to the destination. Different upstream providers can mean very different routes.
  • The only thing you can realistically do to fix any problems you find with traffic from a given location is to look at optimising your application's network profile, which you could do without testing from another geographic location.
  • Your application has to be exposed to the internet. Security will be a major concern.
  • You'll likely have to work with another company to provide remote hardware or to run the scripts as well, which means:
    • You shouldn't use real data in your tests, as it will all be visible to the third party.
    • They'll have your scripts and, with that, knowledge of your application and how it works, plus knowledge of any testability-facilitating methods you use. (Which should never wind up in production, but I believe we'd be lying if we said we hadn't heard of that happening before.)
So in conclusion, I'm not in favour of going to the effort of running geographically distributed load tests, given the low value they offer in relation to running performance tests locally. I find it hard to believe an organisation could be so strapped for hardware resources as to make this its only option, given hardware's relative inexpensiveness in relation to other resources (e.g. engineers) and the option of running everything in a cloud environment (I've been using AWS a lot lately and am very impressed, BTW).

Thanks for reading what will hopefully be the first post of many!

Regards,

Anthony Fisk