Using Process Oriented Design (POD) to Increase the Dependability of DevOps

July 23, 2019 By Derek Weeks

3 minute read time

For many users, software often isn't appreciated until something breaks. Constant availability is an expectation, but 100% availability isn't a reality. When high-profile systems, like Netflix or AWS, have outages, it makes national news.

Reducing System Outages

Where do problems arise that cause system outages? What can be done to improve processes to reduce system outages?

Researchers see that system outages often stem from problems during operations processes, such as upgrading software. Dr. Ingo Weber (@ingomweber) is one of those researchers. He is a principal research scientist and team leader at Data 61, part of CSIRO, Australia's government-funded research body. He and his fellow researchers developed an approach and tool framework, Process-Oriented Dependability (POD), to address this challenge in DevOps practices. POD enables fast error detection, root cause analysis, and recovery.

Ingo and I discussed his insights on POD during his talk, Increasing the Dependability of DevOps Processes, at last year's All Day DevOps conference. In it, he describes the approaches and tools. His key findings are still relevant. (See below.)

By addressing process issues, you can significantly reduce system outages.

Ingo quotes a Gartner study showing that "80% of outages impacting mission-critical services will be caused by people and process issues."

He also notes that significantly shorter release cycles (moving from months between releases to continuous delivery and releases delivered in hours or days), magnifies the potential issues. As an example, he notes that Etsy has an average of 25 full deployments/day and 10 commits per deployment. Because of this, baseline-based anomaly detection no longer works due to cloud uncertainty and continuous changes. These conditions include: multiple sporadic operations at all times, scaling in/out, snapshots, migrations, reconfigurations, rolling upgrades, and cron-jobs.

POD: Process Oriented Design

The POD approach at a high-level is:

Increase dependability during operation time through:

More accurate performance monitoring
Faster error detection
Fast or autonomous healing (quick fix)
Root cause diagnosis to figure out what the actual problem is
Guided or autonomous recovery

Incorporating change-related knowledge into system management

Build knowledge about sporadic operations in Process-Oriented Dependability model

Digging a little deeper into POD, Ingo talks about two approaches they use: Conformance Checking and Assertion Evaluation.

Conformance Checking

There are three levels of Conformance Checking:

Basic
Detecting numerical invariants
Detecting timing anomalies

When errors/anomalies are detected, an alert is raised, and all results are visualized through POD-Viz, the dashboard.

Errors visualized through at POD-Viz dashboard

Conformance Checking detects the following types of errors:

Unknown/error log line: a log line that corresponds to a known error, or is simply unknown.
Unfit: a log line corresponds to a known activity, but said activity should not happen in the current execution state of the process instance.

All other log lines are deemed fit. The goal is 100% fit. Otherwise, raise an alert and learn from false alerts to improve classification and/or the model.

Assertion Evaluation

Assertion Evaluation creates and checks against assertions. Assertions check if the actual state is the expected state at a given point. They are coded against cloud APIs, so they can find out the true state of resources directly. You also identify the main factors affecting a resource and identify the log events that have the most important influence on changing the state of a system resource.

Look at the metrics and choose whose are most relevant, then derive a formula that can be used to estimate the value of a variable associated with a system's resource, so that you can test it against a range of acceptable values. Then, you drive an assertion based on this.

Does Process-Oriented Dependability sound like something you might want to implement or consider further? Ingo's full talk is especially geared toward practitioners, diving into more detail and examples. It is available to view in its entirety here.

Ingo also suggested two papers for those interested in reading further: Process-Oriented Dependability and Software Performance Engineering in the DevOps World.

Written by Derek Weeks

Derek serves as vice president and DevOps advocate at Sonatype and is the co-founder of All Day DevOps -- an online community of 65,000 IT professionals.

Explore All Posts by Derek Weeks

Using Process Oriented Design (POD) to Increase the Dependability of DevOps

Reducing System Outages

POD: Process Oriented Design

Conformance Checking

Assertion Evaluation

Code 3x Faster with Less False Positives

Related Resources

Accelerate DevOps with Sonatype's Multi-Product AWS Offering

The Open Letter on Sustainable Stewardship: Why Responsible Open Source Matters

Policy Engine