As part of the DevOps and DevSecOps track during Sonatype's 9th All Day DevOps (ADDO) event, AWS Senior Developer Advocate Guillermo Ruiz presented his session titled "Building Observability to Increase Resiliency." Well-applied observability helps you find early signs of problems before they impact customers and makes it possible to react quickly to disruptions.
Observability and resiliency topics typically focus on logging and tracing system performance. However, Ruiz focused on things that might go wrong within a system as a way to discuss how to uncover and diagnose issues, as well as prevent future challenges.
There are four common types of failure: a bad dependency, a bad component, a bad deployment, or a traffic spike. But how would you know there was a problem to begin with? Ruiz used a hypothetical e-commerce website as an example, where each page and element, including navigation and search, operates independently with its own code. When one of these elements experiences an issue, it interferes with the user experience. Catching these issues before they frustrate users is why observability is essential.
By observing data for real-time insight into the system and setting alerts for anything out of the ordinary, it's possible to address potential problems before they become user complaints. Dimensionality helps with diagnosis: it lets developers break errors down by multiple factors, or dimensions, such as time, location (in code or system), user input, or environmental conditions. By analyzing these dimensions, developers can understand the context in which an error occurs and pinpoint its source more effectively.
Having more dimensions means seeing the problem from more perspectives, but having too many can be overwhelming, so it's important to find the right balance for dimensionality without getting lost in the data.
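To make the idea of dimensions concrete, here is a minimal Python sketch that counts the same error events along one dimension and then along two. The event records, the field names (`page`, `region`, `status`), and the `count_by` helper are hypothetical and chosen only for illustration; they are not taken from Ruiz's session or from any particular monitoring product.

```python
from collections import Counter

# Hypothetical error events from the e-commerce example.
# Field names and values are assumptions made for this sketch.
ERRORS = [
    {"page": "search", "region": "eu-west", "status": 500},
    {"page": "search", "region": "eu-west", "status": 500},
    {"page": "search", "region": "us-east", "status": 500},
    {"page": "navigation", "region": "eu-west", "status": 503},
]


def count_by(events, *dimensions):
    """Count events grouped by the given dimension names."""
    return Counter(tuple(event[d] for d in dimensions) for event in events)


# One dimension: which page is failing most often?
by_page = count_by(ERRORS, "page")

# Two dimensions: is the failure concentrated in one region?
by_page_region = count_by(ERRORS, "page", "region")

for key, count in by_page_region.most_common():
    print(key, count)
```

Grouping by page alone shows that search is the noisiest element; adding the region dimension narrows it further. Each extra dimension multiplies the number of groups to inspect, which is the trade-off described above.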
You can also use composite alarms, which combine several alarms into a single notification. In our website example, we can set a threshold so that when an issue is detected across multiple pages, just a single alarm is triggered. This is a way to minimize the number of alarms and focus on key issues.
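The logic behind a composite alarm can be sketched in a few lines of plain Python. This is an illustration of the concept only, not the API of any monitoring service: the `Alarm` class, the `composite_in_alarm` function, the error rates, and the `min_firing` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """A hypothetical per-page alarm on error rate."""
    name: str
    error_rate: float
    threshold: float

    def in_alarm(self) -> bool:
        return self.error_rate > self.threshold


def composite_in_alarm(alarms, min_firing=2):
    """Return (triggered, firing_names).

    The composite triggers only when at least `min_firing`
    of the underlying alarms are firing at the same time.
    """
    firing = [alarm.name for alarm in alarms if alarm.in_alarm()]
    return len(firing) >= min_firing, firing


# Assumed sample values: two of three pages exceed a 5% error rate.
alarms = [
    Alarm("search", error_rate=0.08, threshold=0.05),
    Alarm("navigation", error_rate=0.07, threshold=0.05),
    Alarm("checkout", error_rate=0.01, threshold=0.05),
]

triggered, firing = composite_in_alarm(alarms)
if triggered:
    # One notification instead of one per page.
    print(f"Composite alarm: {len(firing)} pages failing ({', '.join(firing)})")
```

With these sample values, two page-level alarms fire but only one notification is sent. Raising `min_firing` makes the composite stricter, at the cost of staying silent when a single page fails.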
During his session, Ruiz explained in detail how to identify and respond to each of the four common failure types, and how to spot issues that originate outside your own system.
All Day DevOps is the largest DevOps conference in the world, with more than 180,000 attendees each year.
You can catch Ruiz's session on demand here, as well as hundreds of sessions across a wide range of topics.