Hello there, and welcome (back) to my blog. Today I want to go through logging across distributed systems. Before we begin, though, I want to tell you a story about what happens when you have no logging in place. Much chaos could have been averted, and you will find out how later.
Picture this: a young developer starts working on a project, building features with his team. So far so good: they add features, nothing breaks, everyone is happy. One day, though, an issue occurs and all hell breaks loose as a customer cannot complete his purchase. It is obviously an all-hands-on-deck situation. One of the developers believes he has found the issue and shares his finding with the team. The team agrees this must be what went wrong, so they go ahead and put a fix in place. A few months later, the same issue occurs again, under slightly different circumstances. The fix didn’t catch it. Don’t get me wrong, the fix did exactly what it was meant to do; it was just in the wrong place. The team wouldn’t know that until later, when, after another similar fix, the issue resurfaced yet again.
Eventually, the team figured out that the issue happened upstream from where they had applied the fixes. Some data parsing went wrong and returned null values where it shouldn’t have, which made the downstream services crash, which in turn prevented purchases. If logging had been in place, they could have spotted this issue at its first occurrence. They would have spent their time putting one fix in the right place instead of spending developer time on fixes that missed the root cause. Fortunately, one of the team members had lived through something similar before, and his then-lead had decided to put an ELK stack in place to keep it from happening again.
What is an ELK stack? These days people call it the Elastic Stack. It is a suite of products from Elastic that together amount to a log management system. It is composed of Elasticsearch, Logstash and Kibana. You use Logstash to push logs to Elasticsearch, which stores and indexes them, and you query those logs through Kibana, the user interface. The Elastic Stack might be the most popular solution for logging across distributed systems. Sometimes there is an addition here and there, like Filebeat, but the ELK core remains.
Distributed logging lets you check that your system behaves as you expect. It gathers the information logged by each of your services so that you can find it in one place. Generally, the logs are timestamped so that you can follow, in the right order, the events that led to a given issue. For example, in the story I told you, if distributed logging had been available, the developers could have spotted the error that occurred in the upstream service and investigated that first. Even better, with some level of request logging, they could have replayed that scenario and the variations that led to the later crises to make sure the issue was fixed once and for all. Instead, with no visibility, they went from what they knew, which was a “Something went wrong” error message in their app.
Logging across distributed systems may be a given to some people, but it is still an important piece to have in place. Through the logs, you can track what went wrong and where, and put together a detailed bug ticket. The more details you have, the easier it gets for developers to track down and fix issues. You will save time and headaches by learning from the experiences others lived through for you. You don’t need to learn the hard way what working without any kind of logging feels like. People did that for you in the seventies. Learn from the elders and get some logs, dammit. I will probably write a simple implementation to show you how to set up some distributed logging in my Go Cloud series.
Thank you for reading my post and thank you to the new subscribers. I am always humbled to see people subscribing to my blog or following me on Twitter. This helps drive me to create quality content for you guys. I hope you have a great day! And remember, don’t navigate blind, put some logs in your system.