Health Monitoring Dashboard Requirements
Introduction/Purpose
A solution is required that shows the status of critical system components, so that System Administrators can see quickly whether or not there are any problems. The critical components to be monitored via the Health Monitoring Dashboard are...
- Collector
- Kubernetes
- NATS
- Event Store
- MicroServices
Potential Actors/Roles
System Administrator: A user with the relevant permissions to be able to view and manage the health monitoring system
System: The overall Red Box system that handles the recording, playback and management of audio and its associated metadata
Use Cases
View the current state of the Collector
View the current state of Kubernetes
View the current state of NATS
View the current state of the Event Store
View the current state of the MicroServices
Expected Behaviours (inc. flow diagrams etc)
The Status can be one of...
- Running as it should be
- Running when it should not be (e.g. a service continues running even when it shouldn’t)
- Not running when it should be
- Recently restarted (if restarted within the last day)
The dashboard must also identify the specific component so that if there are multiple instances of a component running, the System Administrator can see which specific instance has failed.
The information should be based on a "traffic light" system, with working components in green, components that are recently restarted in amber, and components that have failed in some way in red. This will allow the System Administrator to see at a glance if there are any failed or failing components.
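As an illustration of the status and traffic-light rules above, the following is a minimal sketch rather than a specified design. The names `Status`, `COLOUR` and `classify` are hypothetical. It assumes that both failure-type statuses map to red, that a failure takes precedence over a recent restart, and that a component which is correctly stopped is reported as OK (the requirements do not cover that case).

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    OK = "ok"                                      # running as it should be
    RUNNING_UNEXPECTEDLY = "running_unexpectedly"  # running when it should not be
    NOT_RUNNING = "not_running"                    # not running when it should be
    RECENTLY_RESTARTED = "recently_restarted"      # restarted within the last day


# Assumption: both failure-type statuses are shown in red.
COLOUR = {
    Status.OK: "green",
    Status.RECENTLY_RESTARTED: "amber",
    Status.RUNNING_UNEXPECTEDLY: "red",
    Status.NOT_RUNNING: "red",
}

# "Recently" is defined in the requirements as within the last day.
RECENT_RESTART_WINDOW = timedelta(days=1)


def classify(is_running: bool, should_run: bool,
             last_restart: Optional[datetime], now: datetime) -> Status:
    """Derive a dashboard status from observed and expected run state."""
    # Assumption: a failure outranks a recent restart.
    if is_running and not should_run:
        return Status.RUNNING_UNEXPECTEDLY
    if should_run and not is_running:
        return Status.NOT_RUNNING
    if last_restart is not None and now - last_restart <= RECENT_RESTART_WINDOW:
        return Status.RECENTLY_RESTARTED
    # Assumption: a correctly stopped component also lands here.
    return Status.OK
```

For example, `COLOUR[classify(True, True, now - timedelta(hours=3), now)]` evaluates to `"amber"`, because the restart falls inside the one-day window.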
The status will be captured by regularly interrogating each component.
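One possible shape for this regular interrogation is sketched below. It is an assumption, not a specified mechanism: a "probe" here is a hypothetical zero-argument callable per component that returns a status string, the polling interval is left to the caller because the requirements do not state one, and a probe that raises an exception is reported as `"failed"`.

```python
import time
from typing import Callable, Dict

# Hypothetical: each component is interrogated through a zero-argument probe.
Probe = Callable[[], str]


def poll_once(probes: Dict[str, Probe]) -> Dict[str, str]:
    """Interrogate every component once and return its reported status."""
    results: Dict[str, str] = {}
    for component_id, probe in probes.items():
        try:
            results[component_id] = probe()
        except Exception:
            # Assumption: a component that cannot be interrogated counts as failed.
            results[component_id] = "failed"
    return results


def poll_forever(probes: Dict[str, Probe], interval_seconds: float,
                 on_update: Callable[[Dict[str, str]], None]) -> None:
    """Poll all components repeatedly, passing each snapshot to the dashboard."""
    while True:
        on_update(poll_once(probes))
        time.sleep(interval_seconds)
```

Keeping `poll_once` separate from the loop means a single snapshot can be taken and checked without waiting on the timer.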
The details shown on the dashboard for each component are...
Collector
- Collector ID
- Status
- Last Failure Time
- Last Restart Time
Kubernetes
- Pod - All (if any one part fails, the whole is shown as failed)
- Status
- Last Failure Time
- Last Restart Time
NATS
- NATS Cluster ID (if one part of the cluster fails or is failing, the entire cluster is shown as failed or failing)
- Status
- Last Failure Time
- Last Restart Time
Event Store
- Event Store ID
- Status
- Last Failure Time
- Last Restart Time
MicroServices
- Microservice Cluster Type (Individual Component); for the list of cluster types, see https://redboxrecorders.atlassian.net/wiki/spaces/FPD/pages/193429514/Component+Registry
- Status
- Last Failure Time
- Last Restart Time
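The per-component details listed above share a common shape, and the Kubernetes and NATS entries roll several parts up into one row. The sketch below shows one way this could be modelled. `DashboardRow`, `roll_up` and the field names are hypothetical, the status is held as a traffic-light colour, and using the worst colour and the most recent timestamps for the rolled-up row is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional

# Ordering used to pick the worst status among the parts of a component.
SEVERITY = {"green": 0, "amber": 1, "red": 2}


@dataclass
class DashboardRow:
    component_type: str   # e.g. "Collector", "NATS"
    component_id: str     # e.g. Collector ID, NATS Cluster ID
    status: str           # traffic-light colour: "green", "amber" or "red"
    last_failure_time: Optional[datetime] = None
    last_restart_time: Optional[datetime] = None


def latest(times: Iterable[Optional[datetime]]) -> Optional[datetime]:
    """Return the most recent known timestamp, or None if there is none."""
    known = [t for t in times if t is not None]
    return max(known) if known else None


def roll_up(component_type: str, component_id: str,
            parts: List[DashboardRow]) -> DashboardRow:
    """Summarise several parts as one row: one failed part fails the whole."""
    worst = max((p.status for p in parts),
                key=SEVERITY.__getitem__, default="green")
    return DashboardRow(
        component_type=component_type,
        component_id=component_id,
        status=worst,
        last_failure_time=latest(p.last_failure_time for p in parts),
        last_restart_time=latest(p.last_restart_time for p in parts),
    )
```

With this rule, a NATS cluster with one red node and two green nodes is shown as a single red row.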
If a component is shown as failed or failing, the System Administrator must be able to select the component and see which part of the component has failed. For example, if a NATS cluster has failed, the System Administrator must be able to "drill down" to find out which element of the NATS cluster has failed.
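The drill-down described above could be supported by keeping each component as a small tree of elements. The following is a sketch under that assumption. `Node` and `drill_down` are hypothetical names, statuses are traffic-light colours, and only red elements are followed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    status: str  # traffic-light colour: "green", "amber" or "red"
    children: List["Node"] = field(default_factory=list)


def drill_down(node: Node, path: Tuple[str, ...] = ()) -> List[str]:
    """Return the paths to the deepest failed elements beneath a component."""
    here = path + (node.name,)
    if node.status != "red":
        return []
    failed_children = [c for c in node.children if c.status == "red"]
    if not failed_children:
        # No failed child: this element is itself the point of failure.
        return [" > ".join(here)]
    found: List[str] = []
    for child in failed_children:
        found.extend(drill_down(child, here))
    return found
```

Given a red cluster `nats-cluster-a` whose element `nats-1` is red and whose element `nats-0` is green, `drill_down` returns `["nats-cluster-a > nats-1"]`.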
- Simon Jolly to review and sign-off on requirements.
- Devon Cockram (PO) to review and sign-off on requirements.
- QA to review and sign-off on requirements.
- BA to review and sign-off on requirements.
- (Team Lead) to review and sign-off on requirements.