Introduction/Purpose

A solution is required that shows the status of critical system components so that System Administrators can quickly see whether there are any problems. The critical components to be monitored via the Health Monitoring Dashboard are...

Collector
Kubernetes
NATS
Event Store
MicroServices

Potential Actors/Roles

System Administrator: A user with the permissions required to view and manage the health monitoring system

System: The overall Red Box system that handles the recording, playback and management of audio and its associated metadata

Use Cases

View the current state of the Collector

View the current state of Kubernetes

View the current state of NATS

View the current state of the Event Store

View the current state of the MicroServices

Expected Behaviours (inc. flow diagrams etc)

The Status can be one of...

Running as it should be
Running when it should not be (e.g. a service continues running even when it shouldn’t)
Not Running as it should be
Recently restarted (if restarted within the last day)

The dashboard must also identify the specific component so that if there are multiple instances of a component running, the System Administrator can see which specific instance has failed.

The information should be presented using a "traffic light" system, with working components in green, components that have recently been restarted in amber, and components that have failed in some way in red. This will allow the System Administrator to see at a glance whether there are any failed or failing components.
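The four statuses and the traffic light colours described above could be modelled roughly as follows. This is a minimal sketch, not a committed design: the names Status, classify, COLOUR and RESTART_WINDOW are hypothetical, "the last day" is assumed to mean 24 hours, "Not Running as it should be" is assumed to mean a component that is expected to run but is not, and "Running when it should not be" is assumed to count as a failure and therefore shown in red. A component that is neither running nor expected to run has no status in this document, so the sketch returns None for it.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    RUNNING = "Running as it should be"
    RUNNING_UNEXPECTEDLY = "Running when it should not be"
    NOT_RUNNING = "Not running as it should be"
    RECENTLY_RESTARTED = "Recently restarted"


# Assumption: "within the last day" is read as the last 24 hours.
RESTART_WINDOW = timedelta(days=1)

# Traffic light colour per status. Mapping RUNNING_UNEXPECTEDLY to red
# is an assumption ("failed in some way").
COLOUR = {
    Status.RUNNING: "green",
    Status.RECENTLY_RESTARTED: "amber",
    Status.RUNNING_UNEXPECTEDLY: "red",
    Status.NOT_RUNNING: "red",
}


def classify(
    is_running: bool,
    should_run: bool,
    last_restart: Optional[datetime],
    now: datetime,
) -> Optional[Status]:
    """Derive a dashboard status from hypothetical probe results."""
    if is_running and not should_run:
        return Status.RUNNING_UNEXPECTEDLY
    if should_run and not is_running:
        return Status.NOT_RUNNING
    if not is_running:
        # Neither running nor expected to run: no status is defined for this.
        return None
    # Failures take precedence over a recent restart (assumption).
    if last_restart is not None and now - last_restart <= RESTART_WINDOW:
        return Status.RECENTLY_RESTARTED
    return Status.RUNNING
```

For example, a component that is running as expected but was restarted three hours ago would be classified as RECENTLY_RESTARTED and shown in amber.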

The status will be captured by regularly interrogating each component.
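One way to picture "regularly interrogating each component" is a simple polling loop. The sketch below is illustrative only: the probe functions, the component IDs and the interval are all hypothetical, since this document does not specify how each component is queried or how often. A probe that raises an error is treated as a failed check, which is also an assumption.

```python
import time
from typing import Callable, Dict, Optional

# A probe is any callable that returns True when the component responds healthily.
Probe = Callable[[], bool]


def poll_once(probes: Dict[str, Probe]) -> Dict[str, bool]:
    """Interrogate every component once and record whether each check passed."""
    results: Dict[str, bool] = {}
    for component_id, probe in probes.items():
        try:
            results[component_id] = bool(probe())
        except Exception:
            # Assumption: an unreachable or erroring component counts as failed.
            results[component_id] = False
    return results


def poll_repeatedly(
    probes: Dict[str, Probe],
    interval_seconds: float,
    on_result: Callable[[Dict[str, bool]], None],
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> None:
    """Poll at a fixed interval, handing each set of results to on_result."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        on_result(poll_once(probes))
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            sleep(interval_seconds)
```

The sleep and max_cycles parameters exist only so the loop can be exercised in tests without waiting; a real implementation would more likely use a scheduler or the monitoring features of the platform.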

The details shown on the dashboard for each component are...

Collector

Collector ID
Status
Last Failure Time
Last Restart Time

Kubernetes

Pod - All (if one part fails, the whole is shown as failed)
Status
Last Failure Time
Last Restart Time

NATS

NATS Cluster ID (if one part of the cluster fails or is failing, the entire cluster is shown as failed or failing)
Status
Last Failure Time
Last Restart Time

Event Store

Event Store ID
Status
Last Failure Time
Last Restart Time

MicroServices

Microservice Cluster Type (Individual Component) (list of cluster types) - https://redboxrecorders.atlassian.net/wiki/spaces/FPD/pages/193429514/Component+Registry
Status
Last Failure Time
Last Restart Time

If a component is shown as failed or failing, the System Administrator must be able to select the component and see which part of it has failed. For example, if a NATS cluster has failed, the System Administrator must be able to "drill down" to find out which element of the NATS cluster has failed.
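The roll-up rule for Kubernetes and NATS (one failed part marks the whole as failed) and the drill-down requirement could be sketched as below. The Part and Component classes and the example IDs are hypothetical, and the sketch assumes that the worst colour among the parts wins, so an amber part also turns an otherwise green component amber. It also assumes every component has at least one part.

```python
from dataclasses import dataclass, field
from typing import List

# Higher number means more severe; the worst part decides the component colour.
SEVERITY = {"green": 0, "amber": 1, "red": 2}


@dataclass
class Part:
    part_id: str
    colour: str  # "green", "amber" or "red"


@dataclass
class Component:
    component_id: str
    parts: List[Part] = field(default_factory=list)

    def colour(self) -> str:
        """Roll up: the component takes the colour of its worst part."""
        return max((p.colour for p in self.parts), key=SEVERITY.__getitem__)

    def drill_down(self) -> List[Part]:
        """Return only the parts that are not green, for the detail view."""
        return [p for p in self.parts if p.colour != "green"]


cluster = Component(
    "nats-cluster-a",
    [Part("nats-0", "green"), Part("nats-1", "red"), Part("nats-2", "amber")],
)
```

Here cluster.colour() gives "red" for the dashboard row, and cluster.drill_down() lists nats-1 and nats-2 as the elements that need attention.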

Simon Jolly to review and sign-off on requirements.
Devon Cockram (PO) to review and sign-off on requirements.
QA to review and sign-off on requirements.
BA to review and sign-off on requirements.
(Team Lead) to review and sign-off on requirements.

Future Platform Development

Health Monitoring Dashboard Requirements
