Observability Best Practices


Here are some principles and best practices to follow in observability.

Best practices for creating dashboards

A dashboard should tell a story or answer a question

What narrative are you attempting to convey through your dashboard? Try to arrange the data in a logical order, such as from large to small or from general to specific. What is this dashboard’s objective? (Hint: if the dashboard does not have a goal, consider whether you really need it at all.)

Keep your graphs brief and focused on answering your question. For instance, if you ask, “Which servers are having problems?”, you might not need to display data for every server; show only the troubled ones.

Dashboards should reduce cognitive load, not add to it

Cognitive load is essentially how hard you have to think about something in order to figure it out. Make your dashboard easy to read. Other users will appreciate it, and so will you in the future when you are trying to figure out what broke at 2 a.m.

Think about it:

  1. Can I tell exactly what each graph means? Is it immediately clear, or do I have to think about it?
  2. If I show this to someone else, how long will it take them to figure it out? Will they wander off?

Have a monitoring strategy

It’s easy to make new dashboards. It’s harder to optimize dashboard creation and adhere to a plan, but it’s worth it. This strategy should govern both your overall dashboard scheme and enforce consistency in individual dashboard design.

Best practices to follow

  • When creating a new dashboard, make sure it has a meaningful name.
  • If you create many related dashboards, think about how to cross-reference them for easy navigation.
  • Avoid unnecessary dashboard refreshing to reduce the load on the network or backend. For example, if your data changes every hour, then you don’t need to set the dashboard refresh rate to 30 seconds.
  • Use the left and right Y-axes when displaying time series with different units or ranges.
  • Add documentation to dashboards and panels.
  • Reuse your dashboards and enforce consistency by using templates and variables.
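As a minimal sketch of the last point, one panel definition can serve every node through a Grafana-style template variable instead of being copied per node. The dict layout below only loosely mirrors real dashboard JSON, the field names are simplified, and the query assumes the standard node_exporter metric `node_cpu_seconds_total`:

```python
# Sketch: one reusable panel parameterized by Grafana-style template
# variables ($node for the instance, $ds for the data source).
# Field names are simplified; this is not the full dashboard JSON schema.

def make_cpu_panel(datasource: str) -> dict:
    """Build one CPU panel whose query is parameterized by the $node variable."""
    return {
        "title": "CPU usage: $node",
        "datasource": datasource,  # can itself be a variable, e.g. "$ds"
        "targets": [
            {
                # One query serves every value the $node variable can take,
                # so no per-node dashboard copies are needed.
                "expr": '100 - avg(rate(node_cpu_seconds_total{'
                        'mode="idle", instance="$node"}[5m])) * 100',
            }
        ],
    }

# Making the data source a variable too lets the same dashboard work
# across different clusters and monitoring backends.
panel = make_cpu_panel("$ds")
```

Changing the variable in the dashboard UI then re-targets every panel at once, which is the consistency and reuse the bullet points describe.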

Best practices for managing dashboards

Here are some principles and best practices to consider when managing Grafana dashboards.

Strategic observability

There are several common observability strategies. You should research them and decide whether one of them works for you or if you want to come up with your own. Either way, have a plan, write it down, and stick to it.

Adapt your strategy to changing needs as necessary.

Best practices to follow

  • Avoid dashboard sprawl, meaning the uncontrolled growth of dashboards. Dashboard sprawl negatively affects time to find the right dashboard.
  • Periodically review the dashboards and remove unnecessary ones.
  • If you create a temporary dashboard, perhaps to test something, prefix the name with TEST:. Delete the dashboard when you are finished.
  • Copying dashboards with no significant changes is not a good idea.
    • You miss out on updates to the original dashboard, such as documentation changes, bug fixes, or additions to metrics.
  • When you must copy a dashboard, clearly rename it and do not copy the dashboard tags. Tags are important metadata for dashboards that are used during search. Copying tags can result in false matches.

Common observability strategies

When you have a lot to monitor, like a server farm, you need a strategy to decide what is important enough to monitor. This section describes several common methods for choosing what to monitor.

A logical strategy allows you to make uniform dashboards and scale your observability platform more easily.

Guidelines for usage

  • The USE method tells you how happy your machines are; the RED method tells you how happy your users are.
  • USE reports on causes of issues.
  • RED reports on user experience and is more likely to report symptoms of problems.
  • The best practice of alerting is to alert on symptoms rather than causes, so alerting should be done on RED dashboards.
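As a toy illustration of alerting on symptoms, a check like the following fires on the user-visible error ratio (a RED signal) rather than on an underlying cause such as CPU load. The function name and the 5% threshold are invented for illustration, not standard values:

```python
# Illustrative symptom-based alert: fire on the user-visible error ratio
# (a RED signal) instead of a cause such as CPU saturation (a USE signal).
# The 5% threshold is an arbitrary example, not a recommended value.

def error_rate_alert(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Return True when the fraction of failing requests exceeds threshold."""
    if requests == 0:
        return False  # no traffic, nothing user-visible to alert on
    return errors / requests > threshold

# 30 failures in 400 requests is a 7.5% error rate, so this would fire.
firing = error_rate_alert(errors=30, requests=400)
```

A cause-based check (say, CPU above 90%) might fire when no user is affected; the symptom-based check only fires when requests are actually failing.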

USE method

USE stands for:

  • Utilization – Percent time the resource is busy, such as node CPU usage
  • Saturation – Amount of work a resource has to do, often queue length or node load
  • Errors – Count of error events

This method is best for hardware resources in infrastructure, such as CPU, memory, and network devices.
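As a small sketch of how the three USE numbers might be read together for one resource (the class, the priority order, and the thresholds below are illustrative assumptions, not standard values):

```python
# Sketch: summarizing one resource with its three USE numbers.
# Thresholds and status names are illustrative, not standard values.

from dataclasses import dataclass

@dataclass
class UseSnapshot:
    utilization: float  # fraction of time the resource was busy, 0.0-1.0
    saturation: float   # queued work, e.g. load average / CPU count
    errors: int         # count of error events in the window

def use_status(s: UseSnapshot) -> str:
    """Rough health call from a USE snapshot, checked in priority order."""
    if s.errors > 0:
        return "errors"      # something is actively failing
    if s.saturation > 1.0:
        return "saturated"   # more queued work than capacity
    if s.utilization > 0.8:
        return "hot"         # busy most of the time
    return "ok"

# A node at 35% CPU, load well under core count, and no errors.
status = use_status(UseSnapshot(utilization=0.35, saturation=0.5, errors=0))
```

Checking errors first, then saturation, then utilization mirrors the usual USE triage order: errors and saturation point at problems, while high utilization alone may just mean the resource is well used.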

RED method

RED stands for:

  • Rate – Requests per second
  • Errors – Number of requests that are failing
  • Duration – Amount of time these requests take, distribution of latency measurements

This method is most applicable to services, especially a microservices environment. For each of your services, instrument the code to expose these metrics for each component. RED dashboards are good for alerting and SLAs. A well-designed RED dashboard is a proxy for user experience.

The Four Golden Signals

According to the Google SRE handbook, if you can only measure four metrics of your user-facing system, focus on these four.

This method is similar to the RED method, but it includes saturation.

  • Latency – Time taken to serve a request
  • Traffic – How much demand is placed on your system
  • Errors – Rate of requests that are failing
  • Saturation – How “full” your system is
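The four signals for one scrape window can be sketched as a single computation; the inputs and field names below are illustrative assumptions (for example, measuring saturation as in-flight requests over capacity is just one possible definition of how "full" a system is):

```python
# Sketch: computing the four golden signals for one observation window.
# Inputs and the saturation definition are illustrative assumptions.

def golden_signals(durations_s: list[float], error_count: int,
                   window_s: float, in_flight: int, capacity: int) -> dict:
    """Summarize a window of requests as the four golden signals."""
    return {
        "latency_avg_s": sum(durations_s) / len(durations_s),  # Latency
        "traffic_rps": len(durations_s) / window_s,            # Traffic
        "error_rate": error_count / len(durations_s),          # Errors
        "saturation": in_flight / capacity,                    # Saturation
    }

# Two requests in a 10-second window, one failed, 5 of 10 slots in use.
signals = golden_signals([0.1, 0.3], error_count=1, window_s=10,
                         in_flight=5, capacity=10)
```

Compared with the RED sketch above, the only new ingredient is the saturation term, which is exactly the difference the text describes.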

Dashboard management maturity model

Dashboard management maturity refers to how well-designed and efficient your dashboard ecosystem is. We recommend periodically reviewing your dashboard setup to gauge where you are and how you can improve.

Broadly speaking, dashboard maturity can be defined as low, medium, or high.

Low – default state

At this stage, you have no coherent dashboard management strategy. Almost everyone starts here.

How can you tell you are here?

  • Everyone can modify your dashboards.
  • Lots of copied dashboards, little to no dashboard reuse.
  • One-off dashboards that hang around forever.
  • No version control; dashboard JSON is not kept in version control.
  • Lots of browsing for dashboards, searching for the right dashboard. This means lots of wasted time trying to find the dashboard you need.
  • No alerts to direct you to the right dashboard.

Medium – methodical dashboards

At this stage, you are starting to manage your dashboard use with methodical dashboards. You might have laid out a strategy, but there are some things you could improve.

How can you tell you are here?

  • Prevent sprawl by using template variables. For example, you don’t need a separate dashboard for each node; you can use query variables. Even better, you can make the data source a template variable too, so you can reuse the same dashboard across different clusters and monitoring backends.
  • Methodical dashboards according to an observability strategy.
  • Hierarchical dashboards with drill-downs to the next level.
  • Compare like to like: split service dashboards when the magnitude differs. Make sure aggregated metrics don’t drown out important information.
  • Expressive charts with meaningful use of color and normalizing axes where you can.
  • Directed browsing cuts down on “guessing.”

High – optimized use

At this stage, you have optimized your dashboard management use with a consistent and thoughtful strategy. It requires maintenance, but the results are worth it.

  • Actively reducing sprawl.
    • Regularly review existing dashboards to make sure they are still relevant.
    • Only approved dashboards added to master dashboard list.
    • Tracking dashboard use.
  • Consistency by design.
  • No editing in the browser. Dashboard viewers change views with variables.
  • Browsing for dashboards is the exception, not the rule.

Conclusion

We have covered the key best practices to follow in the observability world. I hope this helps in some way.
