Hello folks! My name is Asaf Chen and I’m a Senior Software Engineer here at Digma. I wanted to share a short tale of how our team used our own Continuous Feedback (CF) tool to identify a code concurrency issue. Even as one of Digma’s developers, I was seriously impressed by the performance gains. Beyond Digma and its specific capabilities, I think this story can be meaningful to any backend developer who is facing similar issues while trying to optimize a complex system.
Let’s start with the obvious. As you probably know, concurrency issues emerge when multiple threads access shared resources concurrently, often leading to unexpected behavior. If you’ve been doing this for a while, then you probably also know first-hand how challenging it is to identify such issues as they emerge throughout the dev cycle. These issues will continue to haunt the application if the team lacks visibility into its real-world performance. Most developers building software in complex distributed systems today face exactly this lack of visibility, which makes it difficult to make informed design decisions or evaluate the consequences of code modifications.
This is where Continuous Feedback comes into play. Our vision for Digma is to create a pipeline automation solution that continuously detects diverse issues throughout the development cycle and provides contextual insights about them. But enough background! Let’s get into the specifics of what actually occurred.
The tale: Identifying Code Concurrency Issues
Recently, we observed unexpected behavior and encountered performance issues in our backend services. In response, we decided to dogfood our own Continuous Feedback system in an attempt to identify the issue and pinpoint the root cause of this behavior.
Simply looking at the statistics was no help. Multiple areas were slow regardless of concurrency, while other areas were blazing fast unless they were blocked by another concurrent execution. We knew the data was there, in the OpenTelemetry observability data we were collecting, but we also knew that a lot of processing would be needed to trace the root cause back to the specific piece of code responsible for the issue.
The benefit of Continuous Feedback is that it is, just as the name suggests, continuous. In practical terms, that means developers don’t need to go fishing for such issues: if they exist, they will be detected automatically, whether or not we know to look for them. All that remained was to actually see what Digma’s analysis of the code was telling us.
Indeed, when we looked at the code, the issue was right there in the open. This insight saved us hours that would otherwise have been spent determining whether an issue actually existed, investigating it, and attempting to unravel the intricacies of the problem. The entire investigation would have required a backlog item to be created, prioritized, assigned, and hopefully not pushed aside by more urgent matters. Luckily, using existing observability data, Digma was able to show a very specific analysis of the concurrency issue we were facing.
The specific problem identified: Code Concurrency issue
The graph above consists of two axes: the y-axis represents the average duration of the span in seconds, while the x-axis represents concurrency. The duration remains relatively constant until we hit a concurrency level of about 35 executions in parallel – then the average duration begins to increase (shown by the blue dots).
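To make the two axes concrete, here is a minimal sketch of how such a duration-versus-concurrency view could be derived from raw span timings. It is not Digma's actual analysis: the function name `duration_by_concurrency`, the definition of concurrency as "spans already running when this one starts", and the sample timings are all assumptions made for illustration.

```python
from collections import defaultdict


def duration_by_concurrency(spans):
    """Average span duration per concurrency level.

    spans: list of (start, end) timestamps in seconds for one operation.
    Assumption: a span's concurrency level is the number of spans
    (itself included) that are running at the moment it starts.
    """
    buckets = defaultdict(list)
    for start, end in spans:
        # Count every span whose interval covers this span's start time.
        level = sum(1 for s, e in spans if s <= start < e)
        buckets[level].append(end - start)
    # One (x, y) point per concurrency level: x = level, y = mean duration.
    return {level: sum(d) / len(d) for level, d in sorted(buckets.items())}


# Hypothetical timings: three overlapping executions and one isolated one.
spans = [(0.0, 1.0), (0.5, 1.5), (0.75, 2.75), (5.0, 6.0)]
result = duration_by_concurrency(spans)
```

Plotting the keys of `result` on the x-axis against its values on the y-axis gives the same kind of curve described above. The quadratic loop is fine for a sketch, though a sweep over sorted start and end events would be the better choice for real trace volumes.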
The Root Cause: DB query with a scaling issue
The analysis of the observability data went beyond identifying the issue: it also pinpointed the root cause of this scaling problem.
In the graph, the root cause is represented by the orange line, which we track in parallel with the overall execution time. It represents a query call triggered as part of the request handling in this endpoint. As you can observe, both lines follow a similar pattern: it is specifically that query that suffers from the scaling issue, which then propagates to the endpoint.
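One simple way to express "both lines follow a similar pattern" in code is to correlate the endpoint's duration with the duration of each child span across concurrency levels. The sketch below is my own illustration of that idea, not a description of how Digma ranks root causes. The helper names, the span names `db_query` and `cache_lookup`, and all the numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series cannot explain a trend in the other one.
        return 0.0
    return cov / sqrt(var_x * var_y)


def most_correlated_child(endpoint, children):
    """Return the child span whose duration tracks the endpoint most closely."""
    scores = {name: pearson(endpoint, durations)
              for name, durations in children.items()}
    return max(scores, key=scores.get), scores


# Made-up average durations (seconds) at four increasing concurrency levels.
endpoint = [0.25, 0.25, 0.5, 1.0]
children = {
    "db_query": [0.125, 0.125, 0.375, 0.875],             # grows with the endpoint
    "cache_lookup": [0.03125, 0.03125, 0.03125, 0.03125],  # stays flat
}
suspect, scores = most_correlated_child(endpoint, children)
```

In this toy data the query's duration rises in lockstep with the endpoint's, so it comes out as the suspect, while the flat child span scores zero. Correlation alone does not prove causation, so a result like this is a pointer to where to look rather than a verdict.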
After identifying the root cause of the issue, we analyzed the problematic span and observed an inefficient query execution plan. We proceeded to refactor the query and added missing indexes, addressing the underlying problem. The scaling issue was resolved quickly as we were able to focus on the right specific issue.
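For readers who have not done this kind of fix before, here is a small, self-contained sketch of checking a query plan before and after adding an index. It uses SQLite only because it ships with Python. The post does not describe our actual database, schema, or query, so the `orders` table, the `customer_id` column, and the index name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, used purely for illustration.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id TEXT, total REAL)"
)

query = "SELECT total FROM orders WHERE customer_id = ?"


def plan(connection, sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, ("c-1",)).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


# Without an index, filtering on customer_id requires a full table scan.
before = plan(conn, query)

# Add the missing index, then ask for the plan again.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
after = plan(conn, query)

print("before:", before)
print("after: ", after)
```

The first plan reports a scan of the whole table, while the second reports a search using the new index. Other databases expose the same idea through their own `EXPLAIN` variants, with different output formats.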
Why Developers Need CF
I think this story highlights just how important data can be to developers. Observability data, which is often thought to be relevant only to DevOps, SRE, and IT folks, can have a huge impact on developers building scalable and maintainable systems.
If developers can’t see how their code performs in the real world, they can’t make informed design decisions and assess the impact of their changes. By closing the loop between observability and code, Digma opens the way for a new method of development.