With modern tooling, many engineering teams find it easy to automate performance testing. Understanding the results is a different story altogether.
Over the past decade, load testing, stress testing, soak testing, and other performance testing variants have evolved from niche practices into essential development phases for making code ‘production ready’. Developers across verticals, company types, and organization sizes implement load testing as a common practice to ensure their applications can handle the expected workload when deployed in a production environment.
To support this paradigm, there is a plethora of well-established open source load testing tools (JMeter, Locust, Gatling, and more) as well as many commercial ones. Load testing communities are thriving, and online content and plugins for integrating load testing into CI/CD processes are constantly being developed.
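To give a sense of how low the barrier to entry has become, here is a minimal sketch of a Locust test file; the simulated endpoints and task weights are hypothetical placeholders, not a recommendation for any particular application.

```python
# locustfile.py -- minimal load test sketch; the endpoint paths below are placeholders
from locust import HttpUser, task, between

class WebsiteUser(HttpUser):
    # Each simulated user waits 1-3 seconds between requests
    wait_time = between(1, 3)

    @task(3)
    def browse_catalog(self):
        # Weighted 3x: most of the simulated traffic hits the catalog endpoint
        self.client.get("/api/products")

    @task(1)
    def view_cart(self):
        self.client.get("/api/cart")
```

A script like this can run headlessly from a CI job with a single command (for example, `locust -f locustfile.py --headless -u 50 -r 5 --host https://staging.example.com`); getting the numbers is the easy part, which is exactly why interpreting them has become the real bottleneck.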
And yet, despite the overflow of information about load testing tools, best practices, and methodologies, these tests still present a significant, time-consuming challenge to any automation team, often because of unpredictable results.
Load testing popularity continues to surge despite the fact that we all know it is not a reliable practice. We often joke that testing in dev environments is practically useless. We rely on our development environments mimicking the production environment, but we know they are not actually identical.
There are differences in hardware and software components, the datasets under test are not the same, and the traffic levels differ. This means that code tested in dev might work on your machine, but it will not always perform the same way in production. In fact, it probably won’t.
The dev environment results gap is a symptom of the problem, not the cause. Running load tests on different environments and machines will yield unreliable results by design. Differences across machines, locations, and even testing tools will necessarily produce differing results, since they are, well, different.
In a recent article, Itamar Turner-Trauring attempts to make his CPU performance benchmarks consistent. As it turns out, “consistent benchmarking is hard, and it’s even harder in the cloud”, because he gets a) “inconsistent results on a single machine” and b) “inconsistent results across machines”.
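You can observe the same effect with a few lines of Python; this toy benchmark loop is not Turner-Trauring’s methodology, just a quick way to see how much run-to-run noise a single machine produces.

```python
# Toy illustration of run-to-run variance on a single machine.
import statistics
import time

def workload():
    # Stand-in CPU-bound task; swap in whatever you actually want to measure
    return sum(i * i for i in range(1_000_000))

def run_benchmark(repeats=20):
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # Coefficient of variation: how noisy the runs are relative to the mean
    print(f"mean={mean * 1000:.1f}ms  stdev={stdev * 1000:.1f}ms  cv={stdev / mean:.1%}")

if __name__ == "__main__":
    run_benchmark()
```

Run it a few times, or on two different machines, and the spread in the numbers makes the point better than any argument.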
Why, then, can’t we just take those results and adapt them to where and how our code is tested? Mainly because load testing works perfectly only on paper. It is becoming increasingly easy to run thorough and sophisticated load testing scripts, but few of us have invested in properly reading the test results. The results, after all, are what make load testing useful. Without clear results, we have a nice technical lab project at best. At worst, we have noise.
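To make ‘reading the results’ a little more concrete, here is a sketch that reduces a raw results file to the handful of numbers most teams actually argue about. The file name and the ‘elapsed’ and ‘success’ columns are assumptions based on JMeter’s CSV output; other tools use different formats.

```python
# Sketch: reduce raw load test output to a few numbers worth discussing.
# Assumes a JMeter-style CSV with 'elapsed' (ms) and 'success' columns.
import csv
import statistics

def summarize(results_path):
    latencies, errors, total = [], 0, 0
    with open(results_path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            latencies.append(int(row["elapsed"]))
            if row["success"].lower() != "true":
                errors += 1
    cut_points = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    p50, p95, p99 = cut_points[49], cut_points[94], cut_points[98]
    print(f"requests={total}  error_rate={errors / total:.2%}")
    print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")

if __name__ == "__main__":
    summarize("results.jtl")
```

Even this tiny summary raises the questions the rest of this post is about: which of these numbers matter, and compared to what baseline?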
To complicate things further, load testing itself can have an almost quantum-like nature, interfering with the system being tested and potentially skewing the results through unrealistic testing scenarios.
To overcome these challenges and avoid analyzing load testing results manually, developers are often encouraged to use APM (Application Performance Monitoring) tools to understand the results of their load tests. APMs gather the data from load testing tools and provide information about application availability, including alerts when the application is down.
But while APM tools are designed to monitor the performance of an application over time, they are not necessarily well-suited for interpreting the specific results of load testing.
Load testing introduces artificial traffic and conditions that are not representative of normal operation, and because APMs were designed for production, this can cause them to misinterpret or misrepresent the results. In addition, APMs often cannot capture the granular impact on application performance: they may provide useful data on overall performance trends, but they may not be able to pinpoint the specific causes of performance issues or identify the impact of specific code changes.
Finally, APMs often provide a wealth of metrics and dashboards that can be overwhelming and difficult to interpret. It can be challenging to know which metrics to focus on and what constitutes good or bad performance. Ultimately, APMs are monitoring tools for production environments, and applying them to CI and load tests can introduce a lot of confusion into the GitOps process.
There is no easy solution for analyzing performance testing results and introducing a stable load testing build, and indeed, if there were one, I would be very wary of its effectiveness. However, here are five principles that I have seen work. At the very least, these should be on your radar during your journey to a stable pipeline:
Set clear objectives and requirements: Before conducting load testing, it is important to set clear objectives and define what success looks like. This step sounds deceptively simple but often becomes a blocker for the entire initiative. What constitutes ‘good’ performance? Your results should correlate with your business requirements, but what are those requirements? Gil Tene does an excellent job of defining performance in terms of ‘user misery’, but what constitutes misery? Should waiting 200ms be considered a critical bug? At what scale and what frequency? Different stakeholders may have different opinions on what constitutes acceptable performance. From a technical perspective, the measurement metrics and dashboards may also introduce more questions than answers: Is 100% CPU good or bad? What does ‘slow’ really mean?
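One way to force that conversation is to write the objective down as an executable check rather than a dashboard to eyeball. The sketch below is a hypothetical CI gate; the 200ms p95 budget and 1% error budget are placeholders that should come from your own business requirements.

```python
# Hypothetical CI gate: fail the build when agreed performance budgets are exceeded.
# The thresholds are placeholders -- the point is that your team has to choose them.
import sys

P95_BUDGET_MS = 200       # agreed latency budget for the 95th percentile
ERROR_RATE_BUDGET = 0.01  # at most 1% failed requests

def check_budgets(p95_ms, error_rate):
    failures = []
    if p95_ms > P95_BUDGET_MS:
        failures.append(f"p95 {p95_ms:.0f}ms exceeds the {P95_BUDGET_MS}ms budget")
    if error_rate > ERROR_RATE_BUDGET:
        failures.append(f"error rate {error_rate:.2%} exceeds the {ERROR_RATE_BUDGET:.0%} budget")
    return failures

if __name__ == "__main__":
    # In practice these values would come from the test run's summarized results
    problems = check_budgets(p95_ms=240.0, error_rate=0.004)
    for p in problems:
        print(f"FAIL: {p}")
    sys.exit(1 if problems else 0)
```

The numbers themselves matter less than the fact that stakeholders had to agree on them before the test ran.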
As with any unstable test holding back the CI, developers and QA teams may give up on or ignore results that require a significant amount of time to process, in favor of accelerating releases and meeting timelines.
Over the years, I’ve seen many engineering organizations take on the challenge of implementing effective performance tests as part of their release cycle. In many instances, I’ve observed eager developers quickly generate substantial performance data, only to have the entire initiative fail because of the inherent flakiness that afflicts these tests and the inability to establish stable baselines and measurements.
To reap the benefits of performance testing for release stability, then, organizations need to internalize that collecting performance data is secondary to knowing how to process and analyze it in a way that will not send engineers on wild goose chases after phantom issues every time a production server suffers a hiccup. In the following posts, we’ll dive deeper into the five principles detailed above and explore a few approaches for implementing them, using different data analysis tools, as part of the journey toward a stable performance testing pipeline.
Find out how Digma can help you analyze and understand your automated load tests: Here