In this blog series, we invite prominent developers to share their terrifying coding horror stories and experiences when dealing with challenges related to codebase complexity.
Hey folks, welcome back to the fourth part of our ‘Coding Horrors’ blog series! This adventure all started with our own wild coding experiences as developers. We noticed that all developers, whether wizards or just starting out, can relate to what we call hair-raising ‘oops’ moments. The first three blogs got tons of engagement, so we decided to keep the adventure going.
Coding mistakes, constantly changing requirements, and wrong decisions pose a real challenge when building software in distributed environments. Having multiple developers contribute to the same codebase doesn’t help the situation either.
This time, Kristian Rekstad, a Tech Lead and software engineer, shared his wild coding stories with us. He works in a medium-sized, event-driven microservice architecture, where Kotlin is the language of choice. This tech environment has given him both headaches and valuable learning experiences in troubleshooting errors within a distributed system.
Kristian’s journey reflects the challenges faced in complex microservice architectures. Let’s check out the lessons learned:
Lessons learned: Codebase complexity
- XML is a horrible data language for simple tasks if you lack a schema, because of strict parsers and namespaces, ambiguous arrays, excessive complexity, and bad libraries for working with it.
- JAXB is not good; if you like it you might be suffering from Stockholm Syndrome.
- Get schemas for data when integrating with external systems/teams.
- HTTP is not really ideal for machine-to-machine APIs. The methods and statuses poorly map to anything useful outside of transferring HTML. (We moved an endpoint in one service to a new path, and a 404 ambiguity then caused bad outcomes.)
- Clear expectations and good communication with clients/customers help make it clear when delays and rework are the result of weaknesses in their planning and not your fault.
- Stay focused even on easy tasks when using admin credentials – ticking a checkbox on the wrong page can end in disaster. Prefer read-only credentials when possible.
- Sometimes you can’t salvage the code – take the learnings and start from scratch instead. (Spolsky might disagree).
- Avoid the embarrassing kind of mistake where your users have to tell you that things are broken. Catch mistakes in tests before deploying, and with metrics and alarms after deploying.
- A symptom of bad code and no tests: making any change feels scary and requires manual testing afterward, for example when dealing with a nondeterministic, untestable, and flaky beast spanning over 1000 lines of Java code.
- Don’t blame others for all the bad code. You’ve written bad code yourself too.
What’s the worst, most horrifying experience you have with codebase complexity?
An intern at the startup was hired for a few weeks to develop a frontend in Vue for an internal tool. A sane stack at the time: TypeScript, Webpack, Vue, Vuex (state management), and npm. The dev was an inexperienced student, only used to React, and things had to go fast to finish before the internship ended. (A perfect recipe for disaster?)
I had to take over the code afterward and try to understand it. It was a mess. Lots of indirection, where pieces of the logic were scattered here and there. Side effects, where the state would just change in random places. Callback functions were being set in the Vuex store, to later be invoked in completely different parts of the code. (That one felt like a violation of some unwritten law; shouldn’t everything in the state store be serializable to string/JSON? Some tools let you see the state, and travel in time. Must be horrible (and error-prone) with functions and captured scopes.) It was so bad, we started nicknaming all horrible code “XYZ-code” (after their name XYZ) after they left. That thing left a scar on me, I’m sure.
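The serializability point can be illustrated without Vuex itself. The following is a minimal TypeScript sketch, not the project's actual code: `BadState`, `GoodState`, `createStore`, and `uploadDone` are hypothetical names. It shows that a function stored in state is silently dropped by `JSON.stringify`, and one possible alternative, where the state holds only plain data and callbacks are registered outside of it.

```typescript
// Hypothetical state shapes, for illustration only.
interface BadState {
  uploading: boolean;
  // A callback stored in state: it cannot survive serialization.
  onUploadDone?: () => void;
}

interface GoodState {
  uploading: boolean;
  lastUploadId: string | null; // plain data only
}

// JSON.stringify omits function-valued properties, so a snapshot of
// BadState loses the callback without raising any error.
const bad: BadState = { uploading: true, onUploadDone: () => {} };
const badSnapshot: string = JSON.stringify(bad);

// Alternative: the state holds data; listeners live outside the state
// and react to state changes.
type Listener = (state: GoodState) => void;

function createStore(initial: GoodState) {
  let state: GoodState = initial;
  const listeners: Listener[] = [];
  return {
    getState: (): GoodState => state,
    subscribe: (listener: Listener): void => {
      listeners.push(listener);
    },
    // The only place where the state changes.
    uploadDone: (id: string): void => {
      state = { uploading: false, lastUploadId: id };
      listeners.forEach((listener) => listener(state));
    },
  };
}
```

With this split, a state snapshot is always valid JSON, which is what state inspection and time-travel tooling generally rely on.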
However, I’m not going to blame others for all the bad code. I’ve done it myself too. It was on the Android app: the dreaded CameraFragment—a beast of nondeterministic, untestable, and flaky nature. At over 1000 lines of Java code, this was responsible for all of the UI (lifecycle, components, and their rendering), the UI state management, the Camera lifecycle, navigation to other screens after capturing an image or video, saving the capture locally to disk or uploading to the API, directly sending to specific users, replying with picture-in-picture to specific users, and probably more.
The CameraFragment should have been split into separate layers and classes, numbering in the tens.
And because this was early in Android, testing the UI was not really possible at that point: you needed a phone or (slow) emulator for device tests, and our Jenkins could not run that. Because all the code resided in the CameraFragment – and Fragment is a UI component in the Android SDK – we couldn’t find a way to test any of it. Doing any kind of change was scary, and required manual testing afterwards. It was best to just not touch it.
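One way such a split could make part of the logic testable is to move the screen's state transitions into a pure function that knows nothing about the UI or the camera. The sketch below is written in TypeScript for brevity (the original was Java on Android), and every name in it (`CameraState`, `CameraEvent`, `nextState`) is hypothetical. It is an illustration of the idea, not a reconstruction of the real CameraFragment.

```typescript
// Hypothetical camera screen states and user/system events.
type CameraState =
  | { kind: "Preview" }
  | { kind: "Recording"; startedAtMs: number }
  | { kind: "Captured"; filePath: string };

type CameraEvent =
  | { kind: "RecordPressed"; nowMs: number }
  | { kind: "StopPressed"; filePath: string }
  | { kind: "Discarded" };

// Pure function: no UI, no camera hardware, no disk access. The UI
// component would only forward events here and render the result.
function nextState(state: CameraState, event: CameraEvent): CameraState {
  if (state.kind === "Preview" && event.kind === "RecordPressed") {
    return { kind: "Recording", startedAtMs: event.nowMs };
  }
  if (state.kind === "Recording" && event.kind === "StopPressed") {
    return { kind: "Captured", filePath: event.filePath };
  }
  if (state.kind === "Captured" && event.kind === "Discarded") {
    return { kind: "Preview" };
  }
  return state; // events that make no sense in this state are ignored
}
```

Because `nextState` has no dependency on a device or emulator, it can be exercised by ordinary unit tests on a build server.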
What was the scariest bug you encountered in a codebase?
We moved an endpoint in one service to a new path. Something like `/api/product/{id}` to `/api/v1/products/{id}`. Another service – the “adapter” – was using that API to feed an ecommerce site for a large retailer. No alarms went off anywhere, but we suddenly started deleting everything in their webshop. The adapter interpreted the 404 from `/api/product/123` as “product 123 doesn’t exist anymore, delete it”. So this was a “logic bug”, a fatal oversight in how HTTP status codes actually map to both web server logic and our domain logic. (I think HTTP is a pretty horrible protocol to use for machine-to-machine APIs. The methods and statuses poorly map to anything useful outside of transferring HTML.)
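One possible way to remove this ambiguity is to make the adapter delete only on an explicit domain-level signal, and to treat a bare 404 as something to raise an alarm about. The sketch below assumes a convention that is not from the original system: the product API answers a missing product with a 404 plus a JSON body containing the code `PRODUCT_NOT_FOUND`. The names `ApiResponse`, `Decision`, and `decide` are hypothetical as well.

```typescript
// Minimal shape of an HTTP response as the adapter sees it (hypothetical).
interface ApiResponse {
  status: number;
  body: string; // raw response body
}

type Decision =
  | { kind: "upsert" }
  | { kind: "delete" }
  | { kind: "alarm"; reason: string };

// Assumed convention: a missing product is reported as
// 404 with the JSON body {"code":"PRODUCT_NOT_FOUND"}.
function hasNotFoundCode(body: string): boolean {
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { code?: unknown }).code === "PRODUCT_NOT_FOUND"
    );
  } catch {
    return false; // not JSON, e.g. a web server's default 404 page
  }
}

function decide(response: ApiResponse): Decision {
  if (response.status === 200) return { kind: "upsert" };
  if (response.status === 404 && hasNotFoundCode(response.body)) {
    return { kind: "delete" };
  }
  // A bare 404 may mean a moved route, a proxy problem, or a wrong path.
  return { kind: "alarm", reason: `unexpected status ${response.status}` };
}
```

Under this assumption, moving the endpoint would have produced alarms instead of deletions, because the web server's generic 404 carries no domain-level code.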
Did you ever work on a project with constantly changing requirements?
That’s what I spent my time on from December 2022 to May 2023. We integrated two external systems, transforming XML from one structure to the other. This would have taken me a week at most with JSON and a clear specification. However, we had no schemas for either side. For JSON, that’s not too bad. But this is Microsoft AX 2012 wanting a SOAP message with very specific message contents mixing XML namespaces on its tags; it’s simply not going to validate and get past the XML parser if we get anything wrong. Also, the input format was undocumented, so we had no clue what the fields were named, or what they contained. And there were typos and inconsistencies in the input format. We also didn’t know if tags were single elements or part of an array because there is no distinction in XML (unlike the `[ ]` in JSON).
I couldn’t adapt much. I asked for schemas, full examples, or documentation weekly, but they either didn’t exist or would come “soon” (they never came). So my only choice was to communicate clearly with the customer that this would take a lot of time, and that the solution would be buggy and not work properly until we had seen all examples and permutations of the input data.
Describe a horrifying architectural decision you encountered in a project. How did you deal with it?
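The single-element-versus-array ambiguity mentioned above can be sketched as follows. This assumes a schema-less XML-to-object mapper that yields a single object for one child element, an array for several, and nothing for none, which is a common but not universal behaviour. `Line`, `ParsedOrder`, `asArray`, and `skus` are hypothetical names chosen for the illustration.

```typescript
// Hypothetical shape produced by a schema-less XML-to-object mapper:
// one <line> child becomes a single object, several become an array,
// and none leaves the field undefined.
interface Line {
  sku: string;
}

interface ParsedOrder {
  line?: Line | Line[];
}

// Normalize the three cases into one, so downstream code always
// works with an array.
function asArray<T>(value: T | T[] | undefined): T[] {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

function skus(order: ParsedOrder): string[] {
  return asArray(order.line).map((line) => line.sku);
}
```

Without a schema, a helper like this still requires guessing which tags can repeat, which is part of why missing documentation made the integration slow.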
A processor of events tried to merge incoming data into the state of an entity in the domain. The chosen architecture was massive switch-case statements on the event type, followed by lambdas and functions nested into each other, spread around, to the point where you had to open 7 files to see the changes being done for a single event. Also, the lambdas were not typed with interfaces, but only with bare signatures (`String -> String`), which makes it really hard to reason about or navigate the code. On top of this, it used `suspend` and Kotlin Coroutines, so any stack trace you got would be mangled to the point of being useless for understanding anything.
Normally, I would write tests and then refactor. But in this case, it was better to just “fork” the code path and write a new event processor from scratch. Then slowly migrate the existing cruft over to the new flow, using feature flags to switch back and forth. Sometimes you can’t salvage the code – take the learnings and start from scratch instead (Spolsky might disagree).
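The "fork and migrate behind a feature flag" approach could look roughly like the sketch below. It is written in TypeScript rather than Kotlin, and the domain types (`Product`, `ProductEvent`), the `EventProcessor` type, and the `useNewFlow` flag lookup are all hypothetical; the real processor is not shown in this post.

```typescript
// Hypothetical domain types, for illustration only.
interface Product {
  id: string;
  name: string;
  stock: number;
}

type ProductEvent =
  | { type: "Renamed"; id: string; name: string }
  | { type: "StockChanged"; id: string; delta: number };

// A named type for processors instead of a bare function signature,
// so navigation and code review have something to hold on to.
type EventProcessor = (state: Product, event: ProductEvent) => Product;

// New flow: every change made for an event type is visible in one place.
const applyEvent: EventProcessor = (state, event) => {
  switch (event.type) {
    case "Renamed":
      return { ...state, name: event.name };
    case "StockChanged":
      return { ...state, stock: state.stock + event.delta };
    default: {
      // Compile-time check: fails to type-check if an event type is unhandled.
      const unreachable: never = event;
      return unreachable;
    }
  }
};

// Migration switch: the flag is read per event, so traffic can be moved
// to the new flow and back again.
function makeProcessor(
  legacy: EventProcessor,
  useNewFlow: () => boolean,
): EventProcessor {
  return (state, event) =>
    useNewFlow() ? applyEvent(state, event) : legacy(state, event);
}
```

Keeping the legacy processor callable behind the same type means the old and new flows can be compared on the same events during the migration.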
Have you ever had a terrifying “oops” moment or made a coding mistake that sent shivers down your spine?
I was once doing a database snapshot/dump from phpMyAdmin with admin credentials. Easy task: just tick a checkbox for the tables you want and copy it as SQL to your machine. Except: I was on the wrong page and started a “copy database to another database”, essentially overwriting the database with itself. And during this rewrite, it did some table truncating. I clicked cancel before it got very far, but the database was broken already. We got to test our database restore procedure that day 🙂 (And lost about 6 hours of user activity from ~90k users).
Have you ever had a spine-chilling encounter with a third-party library or framework that caused unexpected issues or complications?
Maybe not spine-chilling, but I was patching an application in maintenance mode and caused an incident. It didn’t have good test coverage, nor any proper end-to-end or smoke test. But it was just a patch version of Jersey, what could go wrong? Well, apparently, authentication on the API could go wrong. So, a customer failed to fetch inventory statuses into SAP for their 300 pharmacies. It’s the embarrassing kind of mistake, where we don’t notice anything and they have to tell us that it is broken.
I never patched that application again… (But I did start making end-to-end tests on that project!)
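A post-deploy smoke test for this kind of failure can be quite small. The sketch below is an assumption about what such a check might look like, not the test that was actually written: `GetStatus` stands in for whatever HTTP client is available, and the expectation that an anonymous request returns 401 is an assumed API contract.

```typescript
// Hypothetical HTTP client signature: resolves to the response status code.
type GetStatus = (path: string, token?: string) => Promise<number>;

// Returns a list of failures; an empty list means the smoke test passed.
async function smokeTestAuth(
  getStatus: GetStatus,
  path: string,
  validToken: string,
): Promise<string[]> {
  const failures: string[] = [];

  // A known-good token must still be accepted after the upgrade.
  const withToken = await getStatus(path, validToken);
  if (withToken !== 200) {
    failures.push(`authenticated request returned ${withToken}, expected 200`);
  }

  // Assumed contract: requests without a token are rejected with 401.
  const withoutToken = await getStatus(path);
  if (withoutToken !== 401) {
    failures.push(`anonymous request returned ${withoutToken}, expected 401`);
  }

  return failures;
}
```

Run right after a deploy and wired to an alarm, a check like this lets the team learn about broken authentication before a customer does.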
Final Words: Java Code complexity
We can’t wait to share more of these spooky coding stories with you! If you missed the first part of the “Coding Horrors” series, you can catch up. If you want to share a story with us that could potentially be published, shoot us an email.