What is observability? If you look it up in the Merriam-Webster dictionary you do not get a definition; instead, you are redirected to the definition of observable.

That entry tells us a few things. Observability is defined relative to the observable: the visible or apparent. In software, this relates to a system running in an arbitrary environment. When I speak of observability, I mean the ability to observe a system running in a production environment.
While it is good to have an observable system, you first need to define why you need to observe that system, and then what about it needs to be observed. In other words: what is noteworthy enough that you should go and observe it?
Why would we want observability?
Nobody is particularly interested in being told that all is well with a production system. At least, that is not what I think about when it comes to observability. What I want to understand is what it means when a system is not working as well as expected. Defining what bad looks like is relative; here, it is relative to what good enough looks like.
How do we define “good enough”? Can we define “good enough” in such a way that it makes sense to both developers and non-developers? Even better, can we measure “good enough”? If we can measure “good enough”, then we can also measure what “not good enough” means, and that would feel noteworthy, would it not?
Now for a more practical example. Let’s imagine that I am making changes to my DummyCounter app. Instead of updating and holding the count of taps locally, I decided to build an API. It has three endpoints: one to register taps, and two more to retrieve and reset the taps count.
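To make this concrete, here is a minimal in-memory sketch of what such an API could look like. The class name, method names and route paths in the comments are illustrative assumptions, not a real specification:

```python
# A minimal in-memory sketch of the hypothetical DummyCounter API.
# The HTTP routes in the comments are assumed for illustration.

class DummyCounterAPI:
    def __init__(self):
        self._count = 0

    def register_tap(self):
        # POST /taps -> 201 (Created)
        self._count += 1
        return 201, {"count": self._count}

    def get_count(self):
        # GET /taps -> 200 (OK)
        return 200, {"count": self._count}

    def reset(self):
        # DELETE /taps -> 204 (No Content)
        self._count = 0
        return 204, None


api = DummyCounterAPI()
api.register_tap()
print(api.get_count())  # (200, {'count': 1})
```

Each method returns a status code alongside its payload, which will matter later when we start counting successful and failed requests.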
Now, in this context, I want the API to be very fast, as slow taps would block the UI. Here, “good enough” means that I can increment or reset my taps count and then retrieve the latest count within half a second. This “good enough” will be my service-level objective (SLO).
This “good enough” would make sense to both developers and non-developers. But can we measure it? Since the operation involves two calls, an update followed by a retrieval, a conservative measure would be that each API call returns successfully in under 0.25 seconds.
How can we measure our system’s health?
Depending on what information is available, we can build a set of observable metrics which will serve as service-level indicators, or SLIs. To keep things simple, I can create metrics that tell me the health of my API on a per-endpoint basis.
My first set of indicators would need to tell me whether each request returns “successfully”. In API terms, that means each request returns a response with a 2XX HTTP code: 201 (Created) when registering a tap, 200 (OK) when retrieving the taps count, and 204 (No Content) when resetting it. I would end up with a count of successful requests and another count of failed requests.
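A sketch of these counters, assuming we classify anything outside the 2XX range as a failure (endpoint names are illustrative):

```python
from collections import defaultdict

# Per-endpoint counters: a response with a 2XX code is a success,
# anything else counts as a failure.
success = defaultdict(int)
failure = defaultdict(int)

def record(endpoint, status_code):
    if 200 <= status_code < 300:
        success[endpoint] += 1
    else:
        failure[endpoint] += 1

record("POST /taps", 201)
record("GET /taps", 200)
record("GET /taps", 500)
record("DELETE /taps", 204)
print(dict(success), dict(failure))
```

A real metrics library would publish these counters somewhere queryable instead of keeping them in process memory, but the classification logic is the same.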
Now, to complete my SLIs, I need indicators showing that my requests return in under 0.25 seconds. In API terms, this is the latency. For each request, I can measure and publish a metric recording that latency.
Some metrics frameworks allow me to capture these metrics as single bundles of data. That bundle would contain a request’s HTTP method, endpoint, response code and latency. Now that I have my SLIs defined, I need to understand how I can evaluate my SLO.
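Here is a hedged sketch of that idea: each request is recorded as one bundle of method, endpoint, response code and latency, mirroring what a labelled metric in a framework like Prometheus would capture. The names and the helper that evaluates the SLI are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class RequestMetric:
    method: str
    endpoint: str
    status_code: int
    latency_seconds: float

metrics = []

def observe(method, endpoint, status_code, latency_seconds):
    # One "bundle" per request, as a labelled metric would record it.
    metrics.append(RequestMetric(method, endpoint, status_code, latency_seconds))

def fraction_within_target(target=0.25):
    # Crude SLI: share of requests that both succeeded (2XX) and
    # returned within the latency target.
    good = sum(
        1 for m in metrics
        if 200 <= m.status_code < 300 and m.latency_seconds <= target
    )
    return good / len(metrics)

observe("POST", "/taps", 201, 0.12)
observe("GET", "/taps", 200, 0.31)
observe("GET", "/taps", 200, 0.05)
print(fraction_within_target())  # 2 of 3 requests met the target
```

Evaluating the SLO then becomes a question of whether this fraction stays above whatever threshold you have agreed on.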
How will we monitor that our system is healthy?
An option would be to periodically look up my logs and metrics. While this would provide me with valuable information, it does not scale.
What if I need to ensure that my API is available 24/7? What if I have better things to do than sitting at my screen looking at a shiny dashboard? Maybe I can pay someone to do this for me? Can I afford to pay a person to do this 24/7 for my DummyCounter API? Can a single person even do this? Maybe I need three people? Should I pay three people full-time salaries when I could instead build something that provides the same value?
I don’t know about you, but I don’t have Elon Musk’s bank account, so I would rather spend that sort of money travelling or investing. What if it’s my company’s money? Surely it doesn’t hurt to spend it? Maybe. However, if you work somewhere, you tend to want that work to be as efficient and cost-effective as possible so that the company can grow, whether you own it or not. If you took the AWS Solutions Architect exam, as I did a few years back, you’ve probably read something similar during your course.
Ah yes, building something to reduce costs. Depending on what you’re working with, you will need to plug an alerting tool into wherever your metrics go. If you’re using AWS, CloudWatch allows you to record metrics and trigger alerts when your SLOs are breached. If you’re more of an open-source type, then you could use tools like Prometheus for recording metrics and triggering alerts.
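With Prometheus, such an alert is expressed as a rule. The sketch below assumes the API publishes a latency histogram named http_request_duration_seconds under a job called dummycounter; both names, and the alert itself, are illustrative:

```yaml
groups:
  - name: dummycounter-slo
    rules:
      - alert: DummyCounterTooSlow
        # Fires when the p99 request latency over the last 5 minutes
        # exceeds the 0.25 s target, sustained for 5 minutes.
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="dummycounter"}[5m])) > 0.25
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DummyCounter API p99 latency is above 0.25s"
```

The `for: 5m` clause keeps a single slow request from paging you; the SLO has to be breached for a sustained period before the alert fires.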
Once your alerting is in place, you probably want something to notify you that your SLOs were breached. Tools like PagerDuty, Healthchecks.io or Prometheus’ Alertmanager can provide you with at least e-mail and Slack notifications.
Conclusion
To recap, when you want to set up some level of observability, you first need to define why you want it by setting your service-level objectives (SLOs). Then you define how you can measure your SLOs through service-level indicators (SLIs). Finally, you determine how you get alerted when your SLIs indicate that your SLOs are breached.