What to ask?
If I'm not able to look at the data myself and thus need to utilise someone else's valuable eyes and time, these are the types of questions I'd typically ask. I find it's a balance: ask for a few key stats and you'll get pretty good data; ask for a daunting amount of info and you'll get nothing at all.
The Most Basic Questions:
- How many of X are there?
- How many distinct combinations of Y and Z are there?
- How many distinct combinations of Y and Z are there per X? (median, min, max)
- How many Z are there?
- What is the size of each N?
- How many N elements are there per X? (median, min, max bytes or characters)
- Is there anything else that I'm neglecting that looks like it could be a concern?
X = whatever your payload is divided into, typically some form of "message" in the health domain
Y = identifiers
Z = namespaces
N = other elements that could be big, either because of their quantity, their individual size, or both
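To make the questions above concrete, here's a minimal sketch of how they might translate into one-liners, assuming (purely for illustration) that the data arrives as a pipe-delimited export with one row per identifier and hypothetical columns `message_id|namespace|identifier`:

```shell
# Sample export -- the file name, layout, and values are all invented here.
cat > export.psv <<'EOF'
msg1|ns-a|id-1
msg1|ns-b|id-2
msg2|ns-a|id-1
msg2|ns-a|id-3
msg3|ns-a|id-1
EOF

# How many X (messages) are there?
cut -d'|' -f1 export.psv | sort -u | wc -l

# How many distinct combinations of Y (identifier) and Z (namespace) are there?
cut -d'|' -f2,3 export.psv | sort -u | wc -l

# How many Y+Z combinations per X? (min, median, max)
cut -d'|' -f1 export.psv | sort | uniq -c | awk '{print $1}' | sort -n |
  awk '{a[NR]=$1} END {print "min="a[1], "median="a[int((NR+1)/2)], "max="a[NR]}'
```

The same shape of pipeline (count, count distinct, then per-group min/median/max) covers most of the questions in the list.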
Note: I want to reiterate that this is not ideal. For example, these questions assume everything is normally distributed, which is not always the case. If we had access to the data ourselves and plenty of time, I'd suggest generating histograms in order to see the actual distributions. Often data will be bimodal, which is a great opportunity to develop best-case and worst-case scenario data sets.
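Even a crude text histogram makes a bimodal distribution jump out. A minimal awk sketch, bucketing a made-up file of per-message sizes into bins of 100:

```shell
# sizes.txt: one number per line (e.g. bytes per message) -- sample values invented.
printf '%s\n' 120 130 125 900 880 910 140 895 > sizes.txt

# Bucket into bins of 100 and draw a bar of '#' per bin.
awk '{ b = int($1/100)*100; count[b]++ }
     END { for (b in count) {
             bar = ""; for (i = 0; i < count[b]; i++) bar = bar "#"
             printf "%5d-%5d %s\n", b, b+99, bar } }' sizes.txt | sort -n
```

With the sample values above, the output shows two clearly separated clusters (one around 100-199, one around 800-999): a candidate for separate best-case and worst-case data sets rather than a single "average" set.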
If you've done most of the work for people, they're more likely to help.
The data I'm using as a reference for the performance test data I'm generating is usually stored either in relational databases or in text files. If you know how the data is structured, you can make it much more likely you'll get the information you need by providing ready-made queries or commands (grep, gawk, sort, wc, etc.) that somebody just needs to run in order to be the hero.
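For the text-file case, the ready-made snippet you hand over might look something like this. The directory name, one-message-per-file layout, and the `ns-a` namespace string are all assumptions for the sketch; you'd substitute whatever you know about the real layout:

```shell
# Stand-in for the real data: one message per file (layout and contents invented).
mkdir -p messages
printf 'MSH|...\nPID|...|ns-a\n' > messages/m1.txt
printf 'MSH|...\nPID|...|ns-b\n' > messages/m2.txt

# How many messages are there?
ls messages/*.txt | wc -l

# Bytes per message, sorted, so min/median/max can be read straight off.
wc -c messages/*.txt | grep -v ' total$' | sort -n

# How many messages mention a given namespace?
grep -l 'ns-a' messages/*.txt | wc -l
```

The point isn't the specific commands; it's that the recipient can paste them in, run them, and reply with the numbers without having to design the analysis themselves.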
Actually generating the data.
I feel this is probably too specific to particular products to provide general tips, so I'm going to leave it out. Maybe something for a future post.