Health Monitoring / Insight / ELK++

Session with Yusuf Limalia and Simon Jolly

Current World "As-Is"

Insight is a health monitoring application used by:

  1. Internal RBR, via the Managed Service system (which pulls and aggregates information from N customer Insight installations)
  2. Resellers, either directly via the Insight UI, or via an instance of their own Managed Service (not rolled out yet)
  3. Customers 

The key functions of Insight are:

  • Health Monitoring
  • Configuration
  • Alerting

There are two types of elements providing logging and health information:

  1. Components: recorder, IVDS, plus others which won't exist in eXtended Platform.
  2. Pro-Active Tools: QoR (Quality of Recording) and DSC (Daily System Check).

Both components and pro-active tools send logging information to Insight, which is then surfaced via a UI.

QoR

Examines each call once it has been recorded, using algorithms to determine the quality of the call.

Most likely we will rewrite this as a microservice in .NET Core, lifting and shifting the algorithms from the existing solution.

DSC

Calls each device on a schedule (set by configuration via the Insight application) to determine its working status.

This will be more challenging to rewrite and may end up being an additional Windows-hosted service.
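
As a rough illustration of the DSC behaviour described above (call each device, record whether it is working), here is a minimal Python sketch. It is not the existing implementation: the `DeviceCheckResult` type, the `run_daily_system_check` function and the `probe` callable are hypothetical names, and scheduling is assumed to be handled by whatever hosts the check.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List


@dataclass
class DeviceCheckResult:
    # Hypothetical result shape; the real DSC output format is not described here.
    device_id: str
    working: bool
    checked_at: str  # ISO 8601 UTC timestamp


def run_daily_system_check(
    device_ids: Iterable[str],
    probe: Callable[[str], bool],
) -> List[DeviceCheckResult]:
    """Probe every device once and record whether it reported as working.

    `probe` is an assumed stand-in for the real device call; it returns
    True when the device is working.
    """
    results = []
    for device_id in device_ids:
        try:
            working = probe(device_id)
        except Exception:
            # A probe that fails outright is treated as "not working".
            working = False
        results.append(
            DeviceCheckResult(
                device_id=device_id,
                working=working,
                checked_at=datetime.now(timezone.utc).isoformat(),
            )
        )
    return results
```

Each result could then be written out as a DSC log entry for ingestion, which keeps the check itself free of any alerting logic.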


eXtended Platform (New World - To Be)

  • Insight is to be totally replaced by a new solution.
  • QoR to be rebuilt in the new system
  • DSC to be added to the system 'as-is'
    • Potential work to drive the devices from our configuration store
  • Logs to ingest
    • All micro-services
    • Collectors
      • Alerts/Alarms
      • Logs
    • Media API logs
    • Daily System Check (DSC) logs
  • Changes to 'Managed Service' to harvest information from ELK stack(s).
  • Creating and configuring ELK alerts/alarms
    • Email
    • SNMP
    • SMS

Quick Session 

This session was held with Kirill Zotkin, Dean Lawrence and Yusuf Limalia.

There are three types of logging information that need to be viewed:

  1. Application Logs - Logs produced by, and specific to, our applications, such as debug logs. These are covered in our Logging Strategy.
  2. System Logs - Metrics such as CPU / RAM usage, plus module-specific metrics such as those for Kubernetes / Elastic Stack. Metricbeat has been discussed previously as a possible candidate to handle this. For systems that are not supported by Metricbeat, such as Event-Store or Minio, additional work will be required to parse their logs using Logstash.
  3. Collector Health Logs - These are based on the requirements and views that Insight (Red Box Application) provides, such as the status (online/offline) of recorders and devices.

The business requires that all logs be viewable in a single location, for which Kibana seems a possible candidate. The logs we need to produce at the three levels mentioned above will be driven by UI requirements: example screens showing the information users want to view will need to be produced, and the logging requirements worked backwards from them.


Feedback from James Denning: The key is that a system component doesn't decide when to issue an alert. We just make sure its logs of a situation get to the required destination, and THEN something else knows when to generate an alert. This means that if you get the logging right, you can tune the alerting mechanism in one place.
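
The separation described in this feedback can be sketched as follows: components only emit log events, and a single evaluator holds the rules that decide when an alert fires. This is a minimal Python illustration, not a description of ELK alerting; `AlertRule`, `evaluate_rules` and the event fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Assumed event shape: a flat dictionary of fields parsed from a log line.
LogEvent = Dict[str, object]


@dataclass
class AlertRule:
    name: str
    matches: Callable[[LogEvent], bool]  # does this event count towards the rule?
    threshold: int  # fire when at least this many events match


def evaluate_rules(events: Iterable[LogEvent], rules: Iterable[AlertRule]) -> List[str]:
    """Return the names of the rules whose threshold is met by the events.

    Components never call this; they only emit events. Tuning an alert
    means changing a rule here, not changing a component.
    """
    events = list(events)
    fired = []
    for rule in rules:
        count = sum(1 for event in events if rule.matches(event))
        if count >= rule.threshold:
            fired.append(rule.name)
    return fired
```

Changing a threshold or adding a rule touches only the rule list, which is the "tune the alerting mechanism in one place" property the feedback asks for.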

Whiteboard Session  

Attendees: Dean Lawrence, Adam Holmes, Simon Jolly, Josh Hepworth, Yusuf Limalia

UI Requirements

Balsamiq UI mockups have been created:

Managed Client Dashboard.bmpr

Non-Managed Client Dashboard.bmpr

These can be found on SharePoint in Red Box Recorders\Nubis - Documents\General\839-Reporting\Analysis\Health Check Mockups


The following (historic) VSTS Feature, Insight, needs to be reviewed, and any useful data gleaned from it added to the requirements/stories for the replacement health monitoring feature.
