Software Engineering

Incident Management

How to Overcome Obstacles and Improve Your Software

Eldar Jahijagic

Published in CodeX

4 min read · Jun 5, 2022

Thousands of times, all of us have encountered software that just doesn’t work.

It doesn’t crash, and it doesn’t raise a flag to notify you that there’s an error, but for some reason it just doesn’t do what it is clearly intended to do. That feels unfair to the user, points to bad design and a failure to anticipate how the software will behave, and screams “I’m bad at what I’m supposed to do”.

Incident Perspective

The following outlines what your customers see, and what you see.

Reliability is designed into the product as well as the processes behind it.

Perspective is everything. First off, look at your software from the user’s perspective. Use it, try it out, and put yourself in the consumer’s shoes.

Improve your product in a technical sense, but then work on your processes and look into what can be improved there as well.

Always analyze your software, features, threats, bugs, incidents, and issues.

On the technical side, improving our software means incorporating all of the following practices and concepts:

Logging
Auditing
Infrastructure Metrics
Hardware Resources Metrics
Custom User Metrics
Real-Time Alerting
Analytics
Postmortem Analysis
Incident Reporting
Incident Management
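To make the first two items on the list a little more concrete, here is a minimal sketch of structured logging with Python's standard `logging` module. The logger name `audit` and the fields `user_id` and `request_id` are assumptions chosen for illustration, not something this article prescribes. The point is only that one JSON object per line is far easier to search, audit, and alert on than free-form text.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields; pick whatever your system needs.
    CONTEXT_FIELDS = ("user_id", "request_id")

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


audit = build_logger("audit")
audit.info("password changed", extra={"user_id": 42, "request_id": "req-123"})
```

Because every entry carries the same keys, the same log stream can later feed custom user metrics and postmortem analysis without any text parsing.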

Reliable software is neither built nor proven in a day. Reliability has to be achieved, and the way you design, build, and improve your systems is all part of it.

Failure Modes and Effects Analysis

There are many frameworks and models that help you overcome issues and analyze and improve your solutions.

One of these is Failure Modes and Effects Analysis. FMEA is a systematic and proactive framework to identify possible failures and their impact.

It’s a process for reviewing the components and parts that make up your system so that you can identify the causes and effects of failures.

FMEA is a core task for reliability engineering.

Many companies and communities do not have such models in place, yet strive for high quality, stability, and reliability.

FMEA simply lets you recognize:

What could go wrong
Why the failure would occur
What the impact and consequences of a failure are
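The three questions above map nicely onto a small data structure. The sketch below assumes the Risk Priority Number scoring that is commonly used with FMEA (severity × occurrence × detection, each on a 1 to 10 scale). That scoring is not described in this article, and the failure modes are invented entries for an imaginary checkout service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str   # what could go wrong, and where
    cause: str       # why the failure would occur
    effect: str      # impact and consequences of the failure
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always detected) .. 10 (practically undetectable)

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


# Hypothetical entries, for illustration only.
modes = [
    FailureMode("search index", "stale replica", "outdated results", 4, 6, 3),
    FailureMode("payment API", "provider timeout", "orders fail silently", 9, 4, 7),
    FailureMode("session cache", "eviction under load", "users logged out", 6, 5, 2),
]

for mode in rank_by_risk(modes):
    print(f"{mode.rpn():4d}  {mode.component}: {mode.cause} -> {mode.effect}")
```

The ranking is only a conversation starter: a failure that is severe and hard to detect usually deserves attention even when its score is not the highest.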

Incident Metrics

Glitches, outages, and errors all cause downtime and affect how reliable a system is.

Tracking metrics related to incidents is a must. Incident management, detection, diagnosis, resolution, and prevention are all areas your engineering department should measure with KPIs.

A few of these KPIs are:

Number of alerts created in a given time period
Number of incidents that occur in a given time period
Mean Time Between Failures
Mean Time To Acknowledge
Mean Time To Detect
Mean Time To Resolve

Only once you have these in place can you actually look into which segments can be improved, whether per team, product, or service, or across your entire engineering department.
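As a rough illustration, the mean-time KPIs above can be computed from four timestamps per incident. Definitions vary between teams, so treat the ones below as assumptions: detection is measured from the start of the failure, acknowledgement from detection, resolution from the start of the failure, and time between failures from the end of one incident to the start of the next. The `Incident` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime       # when the failure actually began
    detected: datetime      # when monitoring or a person noticed it
    acknowledged: datetime  # when a responder took ownership
    resolved: datetime      # when service was restored


def _mean_delta(deltas):
    """Average an iterable of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def incident_kpis(incidents):
    """Compute incident KPIs under the definitions assumed above."""
    ordered = sorted(incidents, key=lambda i: i.started)
    if not ordered:
        raise ValueError("at least one incident is required")
    # Uptime gaps: end of one incident to the start of the next.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return {
        "incidents": len(ordered),
        "mttd": _mean_delta(i.detected - i.started for i in ordered),
        "mtta": _mean_delta(i.acknowledged - i.detected for i in ordered),
        "mttr": _mean_delta(i.resolved - i.started for i in ordered),
        "mtbf": _mean_delta(gaps) if gaps else None,  # needs two or more incidents
    }


# Hypothetical sample data.
incidents = [
    Incident(datetime(2022, 6, 1, 9, 0), datetime(2022, 6, 1, 9, 10),
             datetime(2022, 6, 1, 9, 15), datetime(2022, 6, 1, 10, 0)),
    Incident(datetime(2022, 6, 2, 10, 0), datetime(2022, 6, 2, 10, 20),
             datetime(2022, 6, 2, 10, 35), datetime(2022, 6, 2, 12, 0)),
]

kpis = incident_kpis(incidents)
for name, value in kpis.items():
    print(f"{name}: {value}")
```

Grouping the incidents by team, product, or service before calling `incident_kpis` gives you the per-segment view.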

On-Call Rotation

Whether you have a dedicated support team, or your developers are supporting internal components, you need on-call rotation.

It benefits both the quality and the reliability of your software.

A metric for the time spent handling incidents during on-call is also a very useful source of insight.
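A rotation does not have to be complicated. The sketch below assumes a plain round-robin with fixed-length shifts. The engineer names, the start date, and the one-week shift length are placeholders, and real schedules usually also need overrides for vacations and swaps.

```python
from datetime import date


def on_call_for(day, engineers, rotation_start, shift_days=7):
    """Return who is on call on `day` in a simple round-robin rotation."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Number of complete shifts that have passed since the rotation began.
    shift_index = (day - rotation_start).days // shift_days
    return engineers[shift_index % len(engineers)]


# Hypothetical team and start date.
team = ["ana", "ben", "chen"]
start = date(2022, 6, 6)

for day in (date(2022, 6, 6), date(2022, 6, 13), date(2022, 6, 20), date(2022, 6, 27)):
    print(day.isoformat(), on_call_for(day, team, start))
```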

Still, all the KPIs, measurements, and metrics in the world can’t replace context. Incidents are unique, and statistics are sometimes traps. Context-aware analysis of an incident is a must and often requires only a pinch of common sense.

Alerting

Having metrics to analyze your components is great; having real-time alerting that fires before a customer notices an issue is amazing.

Reliability isn’t baked into the software alone, but also into the teams behind it.

Alerting and responding to issues are required in order to provide reliability to consumers. Mitigation is key: detecting an issue before it blows up, and before your customers even notice it, paints a whole different picture and enables you to prevent incidents.

Alerting is a must-have. From experience, you’d be surprised how many benefits alerting brings. Try them out and see which channels work best for you:

Real-Time Status Pages
Email
SMS
Slack, Teams, or whatever you use

Many successful companies have all of these in place. Alerting deserves to be a company-wide initiative, together with monitoring.
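To show how the channels can fit together, here is a minimal sketch of threshold-based alerting with severity routing. Everything in it is an assumption made for illustration: the metric names, the thresholds, and the routing table are invented, and the "senders" simply append to in-memory lists where a real system would call an email, SMS, or chat integration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float
    severity: str  # "warning" or "critical"


def evaluate(rules, sample):
    """Return the rules whose metric exceeds its threshold in `sample`."""
    return [r for r in rules if sample.get(r.metric, 0.0) > r.threshold]


def dispatch(fired, routes, senders):
    """Send one message per fired rule to every channel routed for its severity."""
    for rule in fired:
        message = f"[{rule.severity.upper()}] {rule.metric} above {rule.threshold}"
        for channel in routes.get(rule.severity, []):
            senders[channel](message)


# Stand-in senders: each channel just collects messages in a list.
outbox = {"email": [], "sms": [], "chat": []}
senders = {name: box.append for name, box in outbox.items()}

# Hypothetical routing: warnings go to chat only, critical alerts go everywhere.
routes = {"warning": ["chat"], "critical": ["chat", "sms", "email"]}
rules = [
    Rule("error_rate", 0.05, "critical"),
    Rule("p95_latency_ms", 800, "warning"),
]

sample = {"error_rate": 0.12, "p95_latency_ms": 420}
dispatch(evaluate(rules, sample), routes, senders)
print(outbox)
```

Routing by severity keeps noisy warnings out of SMS, which helps prevent the alert fatigue that makes people ignore the alerts that matter.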

Six Sigma

Design for Six Sigma, or DFSS, is a systematic approach and model in Software Engineering whose goal is to achieve reliability. It’s a process for producing higher and continually improving quality.

Six Sigma will enable you to have:

Statistical Quality Control
Methodical Approach
Fact and Data-Based Approach
Project and Objective-Based Focus
Customer Focus

All are prerequisites for reliable software. This in turn allows you to incorporate a DMAIC loop, short for Define-Measure-Analyze-Improve-Control.
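Statistical Quality Control is the easiest item on that list to show in code. The sketch below is a deliberately simplified control chart: it derives limits of the mean plus or minus three sample standard deviations from a baseline period, then flags new measurements that fall outside them. The numbers are invented (think of them as daily error counts), and production control charts often estimate the spread differently, so read this as an assumption-laden illustration of the Measure and Control steps.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) control limits from a baseline period."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline measurements")
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, _, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical baseline and new measurements.
baseline = [20, 22, 19, 21, 20, 23, 18, 21]
limits = control_limits(baseline)
new_values = [21, 19, 31, 22]

print("limits:", tuple(round(x, 2) for x in limits))
print("out of control:", out_of_control(new_values, limits))
```

A point outside the limits is a signal to investigate, which is where the Analyze and Improve steps take over.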

There are many concepts, factors, and aspects to Incident Management in Software Engineering, but hopefully the above summarizes a good part of it.

I hope you enjoyed the story and found it interesting, whether you are an engineer, architect, manager, or designer.

Thank you for reading! 🎉