Load Testing is Easy, The Hard Part is Making The Tests Useful

With modern tooling, many engineering teams find it easy to automate performance testing. Understanding the results is a different story altogether.

Over the past decade, load testing, stress testing, soak testing and other performance testing variants have evolved from niche practices into essential development phases for making code ‘production ready’. Developers across verticals, company types and organization sizes use load testing as a common practice to ensure their applications can handle the expected workload when deployed in a production environment.

To support this paradigm, there is a plethora of well-established open source load testing tools (JMeter, Locust, Gatling, and more), as well as many commercial ones. Load testing communities are thriving, and online content and plugins for integrating load testing into CI/CD processes are constantly being developed.

And yet, despite the abundance of information about load testing tools, best practices and methodologies, these tests still present a significant, time-consuming challenge to any automation team, often because their results are unpredictable.

It Works… On Someone’s Machine

Load testing popularity continues to surge despite the fact that we all know its results are not always reliable. We often joke that testing in dev environments is practically useless. We rely on our development environments to mimic the production environment, but we know they are not actually identical.

There are differences in hardware and software components, the datasets being tested are not the same, and the traffic levels differ. This means that code tested in dev might work on your machine, but it will not always perform the same in production. In fact, it probably won’t.

It’s the Data, Stupid

The dev environment results gap is the symptom of the problem, not the cause. Running load tests on different environments and machines will produce inconsistent results by design. Differences across machines, locations and even testing tools will necessarily deliver differing results, since they are, well, different.

In a recent article, Itamar Turner-Trauring attempts to measure CPU performance and keep the measurements consistent. As it turns out, “consistent benchmarking is hard, and it’s even harder in the cloud”, because he gets a) “inconsistent results on a single machine”, and b) “inconsistent results across machines”.
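One way to see this noise for yourself is to time the same piece of code repeatedly and look at how much the samples spread. The following is a minimal Python sketch, not the method used in the article quoted above. The `workload` function is a hypothetical stand-in for the code under test, and the coefficient of variation (standard deviation divided by mean) is just one assumed way to express the spread.

```python
import statistics
import time


def measure(fn, runs=20):
    """Call fn `runs` times and return the wall-clock seconds of each call."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (median, coefficient of variation) for a list of timings."""
    mean = statistics.mean(samples)
    # stdev needs at least two samples; treat a single sample as "no spread".
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    cv = spread / mean if mean > 0 else 0.0
    return statistics.median(samples), cv


def workload():
    # Hypothetical CPU-bound stand-in for the code under test.
    sum(i * i for i in range(50_000))


if __name__ == "__main__":
    median, cv = summarize(measure(workload))
    print(f"median={median * 1000:.2f} ms, run-to-run variation={cv:.1%}")
```

Running this a few times on a laptop and then on a shared CI runner is a quick way to compare how noisy each environment is before trusting any single number from it.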

Why, then, can’t we just take those results and adapt them to where and how our code is tested? Mainly because load testing works perfectly only on paper. It’s getting easier and easier to run really thorough and sophisticated load testing scripts, but few of us have invested in properly reading the test results. The results, after all, are what make load testing useful. Without clear results we have ourselves a nice technical lab project, at best. At worst, we have noise.

Image: Reddit

To complicate things further, load testing itself has an almost quantum-like nature: the act of testing can interfere with the system being tested, and unrealistic testing scenarios can skew the results.

APMs: The Right Tools for the Wrong Need

To overcome these challenges and skip the need for analyzing load testing results by hand, developers are encouraged to use APM (Application Performance Monitoring) tools to clearly understand the results of their load tests. APMs gather the data from load testing tools and provide information about application availability, including alerts when it is down.

But while APM tools are designed to monitor the performance of an application over time, they may not be well-suited for monitoring the specific results of load testing.

Load testing introduces artificial traffic and conditions that are not representative of normal operating conditions. Since APMs were designed for production, this can cause them to misinterpret or misrepresent the results. In addition, APMs are often unable to capture the granular impact on application performance. They may provide useful data on overall performance trends, but they may not be able to pinpoint the specific causes of performance issues or identify the impact of specific code changes.

Finally, APMs often provide a wealth of metrics and dashboards that can be overwhelming and difficult to interpret. It can be challenging to know which metrics to focus on and what constitutes good or bad performance. Ultimately, APMs are monitoring tools for production environments, and applying them to CI and load tests can introduce a lot of confusion into the GitOps process.

Source: Merktoonist, license acquired by author

5 Principles for Rethinking Your Load Testing Results

There is no easy solution for analyzing performance testing results and introducing a stable load testing build, and indeed if there were one I would be very wary of its effectiveness. However, here are 5 principles that I have found effective. At the very least, these should be on your radar during your journey to a stable pipeline:

  1. Set clear objectives and requirements: Before conducting load testing, it is important to set clear objectives and define what success looks like. This step sounds deceptively simple, but it often becomes a blocker for the entire initiative. What constitutes ‘good’ performance? Your results should correlate to your business requirements, but what are those requirements? Gil Tene does an excellent job of defining performance in terms of ‘user misery’, but what constitutes misery? Should waiting 200ms be considered a critical bug? At what scale and at what frequency? Different stakeholders may have different opinions on what constitutes acceptable performance. From a technical perspective, the metrics and dashboards may also raise more questions than answers: Is 100% CPU good or bad? What does ‘slow’ really mean?
  2. Understand the flows within your application: It’s easy to get lost in performance details and lose sight of the big picture. Useful load testing results enable us to start from a flow and understand its behavior, or to look at a specific component and tie it to the flows it’s affected by. Measurements and metrics are meaningless without context: a) What are they affecting? Is it an asynchronous background operation or a user waiting for a response? b) What are they affected by? Is the slowness caused by waiting for yet another process? Distilling a discrete list of the flows in the application is a prerequisite for coming up with a system of requirements.
  3. Understand performance baselines but identify areas of instability: Some areas of the code have a very clear performance scale factor and range. Establishing this baseline is a must for understanding improvement and degradation over time, assessing the impact of code changes, and so on. Trend analysis relies on these baselines and on our ability to use them to remove outliers. Other areas of the code, on the other hand, may behave erratically or unpredictably. This typically means that one of several factors is in play. First, there may be another variable affecting performance that we also need to factor in. For example, think of a function that archives a file passed as a parameter. In such a case, the size of the file would be a critical performance predictor, which we need to measure and observe in order to assess changes in how the function performs. The second factor is an external, ‘system wide’ condition affecting performance. For example, there could be a shortage of CPU resources. In such cases we would expect to see many other areas of the code suffering from the same unexplained slowness, or to detect strange gaps in the metrics that are accounted for by CPU context switching. The third factor is a dependency on an external API that behaves erratically. In that case we need to highlight that API as the source of the issue. In any case, detecting and classifying these anomalies is a required step that can help eliminate noise from our load testing.
  4. Load testing in CI should focus on the code, as well as on usage: Validating code changes and guarding against bad indexes in queries, locking transitions and misuse of asynchronous code are only some of the benefits of introducing load testing into the CI. To reap those benefits, results should focus on identifying changes and correlating them with specific builds, PRs or change-sets. At the same time, some performance degradation may happen over time as usage accumulates (for example, a query becomes slower as the production db grows).
  5. Get to the root cause of issues: Load testing results only provide value if they can actually help improve performance. Therefore, it’s important to leverage them to identify the source of bottlenecks and to eliminate them. Traces and proportional traces are a great way to understand why issues are happening.
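To make the first and third principles more concrete, here is a minimal Python sketch under stated assumptions. The 95th percentile and the 200 ms limit are hypothetical objectives, not recommendations. The 10% tolerance is likewise an assumed value. The file-archiving example is modeled by dividing each duration by the file size, which assumes performance scales roughly linearly with size. All function names are illustrative.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_objective(latencies_ms, pct=95, limit_ms=200):
    """Principle 1: check an explicit objective (pct and limit are assumed)."""
    return percentile(latencies_ms, pct) <= limit_ms


def per_mb(durations_s, sizes_mb):
    """Principle 3: divide out the known predictor (file size) so that runs
    on files of different sizes become comparable."""
    return [d / s for d, s in zip(durations_s, sizes_mb)]


def is_regression(baseline, candidate, tolerance=0.10):
    """True if the candidate median exceeds the baseline median by more
    than `tolerance` (an assumed 10% by default)."""
    base = statistics.median(baseline)
    return statistics.median(candidate) > base * (1 + tolerance)


if __name__ == "__main__":
    # Hypothetical archive timings (seconds) and file sizes (MB).
    baseline = per_mb([1.0, 2.1, 3.9], [10, 20, 40])
    bigger_files = per_mb([6.0, 1.5], [60, 15])
    slower_build = per_mb([2.0, 4.0], [10, 20])

    print("bigger files flagged:", is_regression(baseline, bigger_files))
    print("slower build flagged:", is_regression(baseline, slower_build))
    print("objective met:", meets_objective([120, 150, 180, 200, 950]))
```

The point of the normalization step is that a 6 second run on a 60 MB file is not a regression by itself. Only the change in seconds per MB tells us something about the code. Medians are used instead of means so that a single outlier does not dominate the comparison.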

As with any unstable test holding back the CI, developers and QA teams may give up on or ignore results that require a significant amount of time to process, in favor of accelerating releases or meeting timelines.

Over the years, I’ve seen many engineering organizations take on the challenge of implementing effective performance tests as a part of their release cycle. In many instances, I’ve observed that eager developers were able to quickly generate substantial performance data, only to have the entire initiative fail because of the inherent flakiness that afflicts these tests and the inability to get to stable baselines and measurements.

In order, then, to reap the benefits of performance testing for release stability, organizations need to internalize that collecting performance data is secondary to knowing how to process and analyze it in a way that will not lead engineers on wild goose chases after phantom issues every time a production server suffers a hiccup. In following posts, we’ll dive deeper into the five principles detailed above to explore a few approaches for implementation, using different data analysis tools, as a part of the journey towards a stable performance testing pipeline.