
DevOps Incident Management Portal

DevTool

2026

Overview

Designing for High-Stakes Operations

When systems go down, teams don’t fail because they’re inactive. They fail because too many things are happening at once. This project explores how engineers can stay aligned, make clear decisions, and avoid making incidents worse while trying to fix them.


The Problem

Teams struggle to quickly understand and respond when incidents occur.

Modern engineering teams face high-pressure situations when critical APIs or services fail. Outages often escalate not because engineers aren’t working, but because responsibility is unclear, data is incomplete, or multiple actors intervene in conflicting ways. Teams need tools to quickly detect anomalies, coordinate responses, and resolve incidents without chaos.

The Goal

Enable teams to detect, understand, and resolve incidents quickly.

The goal of this design is to support teams through that chaos.
From the moment an issue is detected to the point where it’s resolved, the system helps users understand:

what’s going wrong
who is handling it
what actions are being taken
and what should happen next

The Research

The User Flow

The experience is structured across five stages: 
Detection → Triage & Ownership → Intervention → Conflict Resolution → Outcome
Each stage is designed to reduce confusion, surface only what matters, and help teams act quickly without stepping on each other.
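The five stages above can be read as a simple ordered progression. The following is a minimal sketch of that reading, not the portal's actual implementation: the `Stage` enum and the `next_stage` helper are hypothetical names, and the strictly linear ordering is an assumption taken from the arrow sequence.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The five stages named in the user flow, in the order shown."""
    DETECTION = "Detection"
    TRIAGE_AND_OWNERSHIP = "Triage & Ownership"
    INTERVENTION = "Intervention"
    CONFLICT_RESOLUTION = "Conflict Resolution"
    OUTCOME = "Outcome"


# Enum members iterate in definition order, which matches the flow.
STAGE_ORDER = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last one.

    Assumption: the flow is strictly linear. A real incident may loop
    back (for example, from Conflict Resolution to Intervention).
    """
    index = STAGE_ORDER.index(current)
    if index + 1 < len(STAGE_ORDER):
        return STAGE_ORDER[index + 1]
    return None
```

Modeling the flow as an explicit sequence makes it easy to show responders where an incident currently sits and what comes next.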


Design Solution

Core Dashboard Screens

The Core Dashboard serves as the operational command center. It gives responders immediate awareness of system health and lets them detect issues quickly without deep navigation. The system is built to move smoothly from passive monitoring to active intervention: when an issue is detected, the dashboard becomes the first point of escalation, visually signaling changes in system state and guiding the user toward deeper investigation flows. Overall, the dashboard balances situational awareness, responsiveness, and clarity, so users can move quickly from observation to action in high-pressure scenarios without being overwhelmed by unnecessary detail.

Alert Entry Point, Detection

The Detection layer is designed to capture user attention the moment an issue is identified, so that no critical signal goes unnoticed. Instead of relying on users to actively monitor the dashboard, the system proactively surfaces incidents through real-time visual and behavioral cues, using three entry points:

System banner: appears at the top of the dashboard for critical incidents, making the issue impossible to miss. It gives immediate visibility into system state and directs you straight to the incident without needing to search.
Popup toast: interrupts the flow when immediate attention is required. It shows a quick summary of the issue with clear actions, so you can jump into investigation right away without navigating away or losing context.
Notifications panel: captures both new incidents and ongoing updates. It gives you a running feed of what's happening across the system, so you can track changes, see who is taking action, and stay aligned without constantly switching screens.
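One way to think about the three entry points is as a routing decision made for each alert. The sketch below is illustrative only: `Alert`, `Surface`, and `surfaces_for` are hypothetical names, and the two boolean flags are assumed stand-ins for however the portal decides that an incident is critical or needs immediate attention.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Surface(Enum):
    BANNER = "system banner"
    TOAST = "popup toast"
    PANEL = "notifications panel"


@dataclass(frozen=True)
class Alert:
    title: str
    is_critical: bool                # assumed flag: drives the banner
    needs_immediate_attention: bool  # assumed flag: drives the toast


def surfaces_for(alert: Alert) -> List[Surface]:
    """Pick which entry points should show this alert.

    The panel receives every alert because it acts as the running feed;
    the banner and toast are added only when their condition holds.
    """
    surfaces: List[Surface] = []
    if alert.is_critical:
        surfaces.append(Surface.BANNER)
    if alert.needs_immediate_attention:
        surfaces.append(Surface.TOAST)
    surfaces.append(Surface.PANEL)
    return surfaces
```

Keeping the routing in one function means the rules for interrupting a user stay visible and easy to adjust.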

Triage & Ownership

The Triage & Ownership layer is designed to bring immediate clarity to who is responsible for resolving an incident, eliminating ambiguity at the most critical moment in the flow. When users enter an incident, the interface prioritizes a concise situation summary, highlighting impact, affected systems, and system-generated hypotheses so responders can quickly understand the problem without scanning excessive detail. This ensures fast orientation under pressure.

Conflict / Concurrency

The Conflict / Concurrency layer is designed to manage situations where multiple responders attempt overlapping or conflicting actions during an active incident. In high-pressure environments, issues often worsen not because of inaction, but because actions are taken simultaneously without coordination. When the system detects conflicting actions, it surfaces this immediately through clear visual states, preventing silent failures. Early warnings allow users to reassess, while high-risk conflicts require resolution before any action can proceed.
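The two conflict levels described above, an early warning versus a block that must be resolved first, could be expressed roughly as follows. This is a sketch under stated assumptions: `PendingAction`, `ConflictLevel`, and `check_conflict` are hypothetical names, and the rule used here (same target, different responder, blocking when either action is high risk) is an illustrative guess rather than the portal's actual logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ConflictLevel(Enum):
    NONE = "none"          # no overlap, action can proceed
    WARNING = "warning"    # early warning, user can reassess
    BLOCKING = "blocking"  # must be resolved before proceeding


@dataclass(frozen=True)
class PendingAction:
    responder: str
    target: str      # e.g. the affected service (assumed field)
    high_risk: bool  # assumed risk flag


def check_conflict(new: PendingAction,
                   in_flight: Iterable[PendingAction]) -> ConflictLevel:
    """Compare a new action against actions already in progress."""
    level = ConflictLevel.NONE
    for other in in_flight:
        # Only actions by a different responder on the same target overlap.
        if other.target != new.target or other.responder == new.responder:
            continue
        if new.high_risk or other.high_risk:
            return ConflictLevel.BLOCKING
        level = ConflictLevel.WARNING
    return level
```

Returning an explicit level, rather than silently rejecting the action, matches the intent of surfacing conflicts instead of letting them fail quietly.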

Intervention

The Intervention layer is designed to help responders make fast, confident decisions under pressure. Once ownership is established, the interface shifts focus from understanding the problem to taking the right action. The screen prioritizes a small set of relevant, ranked actions instead of overwhelming users with too many options. Each action is paired with clear impact and risk indicators, allowing users to quickly evaluate trade-offs without deep analysis.
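As a rough illustration of "a small set of relevant, ranked actions," the sketch below scores candidate actions and keeps only the top few. Everything here is assumed for the example: the `CandidateAction` fields, the 0 to 1 scales, and the simple impact-minus-risk score are hypothetical, and the case study does not specify how ranking is actually computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateAction:
    name: str
    expected_impact: float  # assumed scale: 0.0 (none) to 1.0 (full recovery)
    risk: float             # assumed scale: 0.0 (safe) to 1.0 (dangerous)


def score(action: CandidateAction) -> float:
    """Illustrative score: reward impact, penalize risk."""
    return action.expected_impact - action.risk


def top_actions(candidates: List[CandidateAction],
                limit: int = 3) -> List[CandidateAction]:
    """Return at most `limit` actions, best score first."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:limit]


candidates = [
    CandidateAction("Restart service", expected_impact=0.6, risk=0.2),
    CandidateAction("Roll back deploy", expected_impact=0.9, risk=0.3),
    CandidateAction("Fail over region", expected_impact=0.8, risk=0.7),
    CandidateAction("Scale up replicas", expected_impact=0.3, risk=0.1),
]
shortlist = [action.name for action in top_actions(candidates)]
```

Capping the list keeps the screen focused, while showing impact and risk next to each action lets the responder see why it was ranked where it was.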

Successful Resolution

The Successful Resolution screen provides clear closure after an incident is resolved, shifting the user from active response to confirmation and review. The interface centers on a strong visual state change, replacing urgency with stability through a prominent “Incident Resolved” message and green success indicators, signaling that the system is back to normal. Key outcome metrics such as resolution time, recovery rate, and responders involved are surfaced immediately to give a quick understanding of impact and performance. This allows users to assess how effectively the incident was handled without digging into details.

Impacts and results

Conclusion

This platform combines API monitoring, observability, and incident management into a cohesive system. By designing around roles, scenarios, and uncertainty, it empowers engineering teams to act decisively under pressure while maintaining visibility, coordination, and operational control.

Katola Kehinde
