Reliability in Software Engineering

Building Software and Processes for Unreliable Scenarios

Published in CodeX, Jun 4, 2022

Even though the software industry has seen many innovations, achievements, complex systems, and experienced veterans, it is still very young compared to other fields and practices.

Building software itself might be trivial in some circumstances, but in order to deliver true state-of-the-art quality, we first have to engineer it. That is not only about the services, but also about the teams and human factors behind them.

Without that, it would be questionable whether we can build quality systems.

Software Reliability

Reliability is one of the characteristics describing quality software and outlines the difference between the good and the bad.

Image Credit — afgprogrammer@unsplash.com

Reliable software is a system that is tolerant of failures, or even failure-free.

Reliability is the probability of failure-free operation of a component for a given period in a given environment.
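To make that definition concrete, here is a minimal sketch in Python. It assumes a constant failure rate (the exponential model), which is a simplification the definition itself does not require. The function names and the numbers are illustrative only.

```python
import math


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of failure-free operation for `mission_hours`.

    Assumption: failures arrive at a constant rate, so
    R(t) = exp(-t / MTBF). Real systems often deviate from this.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mission_hours < 0:
        raise ValueError("mission_hours must not be negative")
    return math.exp(-mission_hours / mtbf_hours)


def series_reliability(component_reliabilities: list[float]) -> float:
    """Reliability of a chain in which every component must work.

    Assumption: components fail independently of each other.
    """
    return math.prod(component_reliabilities)


# A service with a mean time between failures of 1000 hours,
# observed over one day, combined with two dependencies it needs.
service = reliability(mission_hours=24, mtbf_hours=1000)
print(f"service alone: {service:.4f}")
print(f"with dependencies: {series_reliability([service, 0.999, 0.995]):.4f}")
```

The second function hints at why heavy dependency chains hurt: every extra component that must work multiplies the overall figure by something below one.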

There’s one saying that still sticks with me when thinking about reliable systems:

Reliability is flipping the switch and knowing the lights will come on.

Reliability is not just another metric; it is how customers experience the system.

It’s not a static characteristic. It’s dynamic in the sense that a system can withstand failures during various circumstances, both expected and unexpected.

Resilience is the way and reliability is the outcome.

A system cannot be reliable without being resilient, and vice versa.

Depending on what part of the world you reside in, you might often hear these two words used interchangeably. Still, there is a difference between them.

The resilience of a software system is measured by how well it can withstand and mitigate threats, and how fast it can recover.

Labeling a software system as reliable requires us to verify both the design and the operation of its components.

Regardless of the industry, company size, priorities, and customers, everyone wants their software components and platforms to work at all times.

Design for Reliability

Designing and building reliable software is the responsibility of everyone involved, whether that is UI/UX designers, product managers, engineers, architects, or the underlying infrastructure and hardware provider.

For software to be reliable, it first has to be designed to:

Minimize failures
Minimize the impact of failures
Ensure uptime

Components have to stay functional during both predicted and unpredicted scenarios.
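One common way to keep a component functional when a dependency misbehaves is to retry with a growing delay and then fall back to a degraded answer. The sketch below is only an illustration of that idea, not a prescribed design; `call_with_retry`, its parameters, and the example callables are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `attempts` times, then use `fallback`.

    The delay doubles after each failed attempt (exponential backoff).
    Catching every Exception keeps the sketch short; a real system
    would catch only the errors it knows to be transient.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()


def fetch_live_price() -> str:
    # Stand-in for a dependency that is currently unavailable.
    raise ConnectionError("pricing service unreachable")


# The caller still gets an answer, just a degraded one.
print(call_with_retry(fetch_live_price, lambda: "cached price", base_delay=0.01))
```

The `sleep` parameter exists so the waiting can be replaced in tests; in production code the default is usually what you want.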

Reliable systems are built to be trusted.

Already during the design phase of a service, component, or product, we have to start striving for reliability.

Design imperfections often lead to confusion and unnecessary complexity, and create fertile ground for brittle solutions.

In the concept phase, there are two reliability models that you can make use of:

Software Reliability Goal Setting
Software Reliability Program Plan

Set reliability as a goal, and review and analyze decisions during conception, design, and development. The results will improve greatly before you have even started shipping a product.

Make use of other techniques used in designing for reliability, such as:

Team Design Reviews
Software Failure Modes and Effects Analysis
Software Fault Tree Analysis
Software Fault Tolerance Analysis

“Design for Reliability” itself is a model, often treated as part of Design for Excellence.

Common Threats

We have to recognize the common threats in order to improve a system and design it to be adaptable and ready to mitigate issues.

A common set of threats for every system can be the following:

Security threats
Latency
Heavy dependency on other services
Inter-service communication
Discovery
Unresponsiveness
Unhandled issues and edge cases
Environmental changes
Hardware failures
Inconsistency in both performance and behavior

The list is a rough one and could be extended considerably.

Reliability is often about everything right and wrong in a software system. All factors that can cause it to fail are threats in essence, from security, bad design, and limited hardware resources, to incorrectness and unpredicted issues.

From experience, failures are caused by:

Neglected threat analysis
Technical incorrectness
Overengineering and overdesign
Infrastructure errors
Lack of testing
Low resilience to development changes

There is a reason why great software is easily maintained, simple, effective, and robust. But it has to be designed and built that way.

An inability to recover from failures is a major factor as well, and it very often affects the system’s reliability.

Overcoming Obstacles

All of us have encountered, thousands of times, software that just doesn’t work.

It doesn’t break and it doesn’t raise a flag to notify you that there’s an error, but for some reason it just doesn’t do what it is clearly intended to do. This implies unfairness, bad design, and zero anticipation of its behavior, and screams “I’m bad at what I’m supposed to do”.

Reliability is designed into the product and processes as well.

Perspective is everything. First off, look at your software from the user's perspective. Use it, try it out, and put yourself in the consumer's shoes.

Improve your product in a technical sense, but then work on your processes and look into what can be improved there as well.

Always analyze your software, features, threats, bugs, incidents, and issues.

Technically, all the following practices and concepts need to be incorporated in order for us to improve our software:

Logging
Auditing
Infrastructure Metrics
Hardware Resources Metrics
Custom User Metrics
Real-Time Alerting
Analytics
Postmortem Analysis
Incident Reporting
Incident Management
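As a small illustration of the first items on that list, the sketch below combines structured logging with a custom user metric, using only the Python standard library. The `checkout` event, the `metrics` counter, and the function names are hypothetical; a real setup would ship both logs and metrics to whatever backend you already use.

```python
import json
import logging
from collections import Counter

# Hypothetical in-process metric store; stands in for a metrics backend.
metrics: Counter = Counter()


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def record_checkout(logger: logging.Logger, succeeded: bool) -> None:
    """Log the outcome and bump the matching custom metric."""
    outcome = "success" if succeeded else "failure"
    metrics[f"checkout.{outcome}"] += 1
    log = logger.info if succeeded else logger.error
    log("checkout %s", outcome)


shop_logger = build_logger("shop")
record_checkout(shop_logger, succeeded=True)
record_checkout(shop_logger, succeeded=False)
print(dict(metrics))
```

Machine-readable log lines and a counter per outcome are what later make alerting, analytics, and postmortems possible without guesswork.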

Reliable software is neither built nor proven in a day. Reliability has to be earned, and the way you design, build, and improve your systems is all part of it.

Failure Modes and Effects Analysis

There are many frameworks and models that help you overcome issues and analyze and improve your solutions.

One of these is Failure Modes and Effects Analysis. FMEA is a systematic and proactive framework to identify possible failures and their impact.

It’s a process for reviewing the components and parts that make up your system, and lets you identify failure causes and effects.

FMEA is a core task for reliability engineering.

Many companies and communities do not have such models in place, yet strive for high quality, stability, and reliability.

It simply lets you recognize:

What could go wrong
Why the failure would occur
What the impact and consequences of a failure are
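Those three questions map naturally onto a small data structure. The sketch below also ranks failure modes by a Risk Priority Number (severity × occurrence × detection, each scored 1 to 10), a scoring convention commonly used with FMEA but not something this article prescribes. The components and scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    what: str        # what could go wrong
    cause: str       # why the failure would occur
    effect: str      # impact and consequences
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always caught) .. 10 (practically undetectable)

    def __post_init__(self) -> None:
        for name in ("severity", "occurrence", "detection"):
            if not 1 <= getattr(self, name) <= 10:
                raise ValueError(f"{name} must be between 1 and 10")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: higher means look at it sooner."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Order failure modes from highest to lowest risk."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical entries from a design review.
review = [
    FailureMode("payment-api", "double charge", "retry without idempotency key",
                "customer billed twice", severity=9, occurrence=2, detection=3),
    FailureMode("search", "stale results", "index job silently skipped",
                "users miss new items", severity=4, occurrence=6, detection=7),
]

for mode in prioritize(review):
    print(mode.rpn, mode.component, "-", mode.what)
```

Note how the less severe failure ranks first here because it is both more frequent and harder to detect; that trade-off is exactly what the scoring is meant to surface.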

Incident Metrics

Glitches, outages, and errors all cause downtime and affect how reliable a system is.

Tracking metrics related to incidents is a must. Incident management, detection, diagnosis, resolution, and prevention should all be covered by KPIs in your engineering department.

As far as KPIs go, just a few of them are the following:

Number of alerts created in a given time period
Number of incidents that occur in a given time period
Mean Time Between Failures
Mean Time To Acknowledge
Mean Time To Detect
Mean Time To Resolve
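The four "mean time" KPIs can be derived from a handful of timestamps per incident. The sketch below shows one possible set of definitions; teams differ on details (for example, whether time to resolve starts at the failure or at detection), so treat the choices in the comments as assumptions. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime       # when the failure began
    detected: datetime      # when monitoring or a person noticed it
    acknowledged: datetime  # when someone took ownership
    resolved: datetime      # when service was restored


def _mean_delta(deltas) -> timedelta:
    # statistics.mean raises StatisticsError when there is no data.
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    return _mean_delta(i.detected - i.started for i in incidents)


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    # Assumption: measured from detection, not from the failure itself.
    return _mean_delta(i.acknowledged - i.detected for i in incidents)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    # Assumption: measured from the start of the failure.
    return _mean_delta(i.resolved - i.started for i in incidents)


def mean_time_between_failures(incidents: list[Incident]) -> timedelta:
    # Assumption: uptime between one resolution and the next failure.
    # Needs at least two incidents.
    ordered = sorted(incidents, key=lambda i: i.started)
    return _mean_delta(
        nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])
    )


day = datetime(2022, 6, 4)
history = [
    Incident(day.replace(hour=10), day.replace(hour=10, minute=5),
             day.replace(hour=10, minute=10), day.replace(hour=11)),
    Incident(day.replace(hour=15), day.replace(hour=15, minute=15),
             day.replace(hour=15, minute=20), day.replace(hour=15, minute=30)),
]
print("MTTD:", mean_time_to_detect(history))
print("MTTA:", mean_time_to_acknowledge(history))
print("MTTR:", mean_time_to_resolve(history))
print("MTBF:", mean_time_between_failures(history))
```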

Only once you have these in place can you look into which segments can be improved, whether per team, product, or service, or across your entire engineering department.

On-Call Rotation

Whether you have a dedicated support team or your developers are supporting internal components, you need an on-call rotation.

In practice, it tends to benefit both software quality and reliability.

Tracking the time spent handling incidents during on-call is also a useful way to gain more insight.

Still, all the KPIs, measurements, and metrics in the world can’t replace context. Incidents are unique, and statistics are sometimes traps. Context-aware analysis of an incident is a must and often requires only a pinch of common sense.

Alerting

Having metrics in order to analyze your components is great; having real-time alerting that fires before a customer notices an issue is even better.

Reliability isn’t baked into the software alone, but also into the teams behind it.

Alerting and responding to issues is required in order to provide reliability to consumers. Mitigation is key: detecting an issue before it blows up, and before your customers even notice it, paints a whole different picture and enables you to prevent incidents.

Alerting is a must-have. From experience, you’d be surprised how many benefits alerting brings. Try a few and see which channels work best for you:

Real-Time Status Pages
Email
SMS
Slack, Teams, or whatever you use

Many successful companies have all of these in place. Alerting deserves to be a company-wide initiative, together with monitoring.
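A minimal sketch of the idea, assuming an error-rate threshold over a sliding window of recent requests: the alert fires once when the rate crosses the threshold and re-arms when it recovers. The class name, the defaults, and the notifier callables (stand-ins for email, SMS, or chat integrations) are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable

# A notifier is any callable that delivers a message to a channel.
Notifier = Callable[[str], None]


class ErrorRateAlert:
    def __init__(
        self,
        notifiers: Iterable[Notifier],
        threshold: float = 0.05,
        window: int = 100,
        min_samples: int = 20,
    ) -> None:
        self._notifiers = list(notifiers)
        self._threshold = threshold
        self._min_samples = min_samples
        # True marks a failed request; old entries drop off automatically.
        self._outcomes: Deque[bool] = deque(maxlen=window)
        self._firing = False

    def record(self, failed: bool) -> None:
        """Register one request outcome and alert if needed."""
        self._outcomes.append(failed)
        if len(self._outcomes) < self._min_samples:
            return  # not enough data to judge yet
        rate = sum(self._outcomes) / len(self._outcomes)
        if rate > self._threshold and not self._firing:
            self._firing = True  # notify once per episode, not per request
            message = (
                f"error rate {rate:.1%} exceeds threshold {self._threshold:.1%}"
            )
            for notify in self._notifiers:
                notify(message)
        elif rate <= self._threshold:
            self._firing = False  # recovered; re-arm the alert


# Using print as a stand-in for a real channel.
alert = ErrorRateAlert([print], threshold=0.1, window=10, min_samples=5)
for failed in [False, False, False, False, True]:
    alert.record(failed)
```

Notifying once per episode rather than on every failing request is a deliberate choice here: noisy alerts get muted, and muted alerts protect nobody.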

Six Sigma

Design for Six Sigma, or DFSS, is a systematic approach and model in software engineering whose goal is to achieve reliability. It’s a process for producing high, continuously improved quality.

Six Sigma will give you:

Statistical Quality Control
Methodical Approach
Fact and Data-Based Approach
Project and Objective-Based Focus
Customer Focus

All are prerequisites for reliable software. This in turn allows you to incorporate a DMAIC loop, short for Define-Measure-Analyze-Improve-Control.
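For the Measure step, Six Sigma practice commonly expresses quality as defects per million opportunities (DPMO) and converts that to a sigma level. The sketch below follows that convention, including the customary 1.5 sigma shift; mapping "defects" to failed requests is an assumption made here for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist


def dpmo(defects: int, units: int, opportunities_per_unit: int) -> float:
    """Defects per million opportunities."""
    total = units * opportunities_per_unit
    if total <= 0:
        raise ValueError("there must be at least one opportunity")
    return defects * 1_000_000 / total


def sigma_level(dpmo_value: float, shift: float = 1.5) -> float:
    """Convert DPMO to a sigma level.

    The 1.5 shift is the conventional allowance for long-term drift;
    pass shift=0 for the unshifted value.
    """
    if not 0 < dpmo_value < 1_000_000:
        raise ValueError("dpmo_value must be between 0 and 1,000,000 (exclusive)")
    yield_fraction = 1 - dpmo_value / 1_000_000
    return NormalDist().inv_cdf(yield_fraction) + shift


# Hypothetical week: 1000 deployments, 5 checks each, 17 failed checks.
weekly = dpmo(defects=17, units=1000, opportunities_per_unit=5)
print(f"DPMO: {weekly:.0f}, sigma level: {sigma_level(weekly):.2f}")
```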

There are many concepts, factors, and aspects to Software Reliability, but hopefully the above summarizes a good part of it. The goal of this story is to shed some light on already well-established concepts, but also on some less-known ones. Software reliability is only one segment of engineering, but in essence, it’s a science on its own.

I hope you enjoyed the story and found it interesting, whether you are an engineer, architect, manager, or designer.

Thank you for reading! 🎉