Event Store Destructive Testing Phase 2

Business Case


We need to perform destructive and stability testing of the event store against various scenarios in which its functionality could be compromised, and provide a report with proposed solutions.

Objectives

  1. Test the snapshot mechanism during an Event Store failure (time needed to restart a service after the failure/connection issue); a timing sketch follows this list

  2. Perform Event Store destructive testing with the clustered core services (services responsible for the main flows should be scaled up to 3 nodes)

  3. Capture call processing flow metrics for a stable and a partially failed Event Store

  4. Test each flow independently

    1. Call Capturing

    2. Call Processing

    3. Call Management

    4. Export Engine + Transcription

    5. Policy Engine (Configuration + Policy execution)

    6. Health Monitoring (Collector Health Monitoring, Collector Alerting)
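A minimal timing probe for objective 1 might look like the sketch below. It assumes the core service exposes an HTTP readiness endpoint; the URL is a hypothetical placeholder, and the Event Store failure itself is triggered externally (e.g. by killing a pod):

```python
# Hedged sketch: measure how long a core service stays unhealthy after an
# Event Store failure (objective 1). HEALTH_URL is a hypothetical placeholder
# for the real readiness endpoint of the service under test.
import time
import urllib.request

HEALTH_URL = "http://call-processing:8080/health"  # hypothetical endpoint

def is_healthy() -> bool:
    """Treat any HTTP 200 as healthy; errors and timeouts as unhealthy."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

# Trigger the Event Store failure externally (e.g. kubectl delete pod), then:
while is_healthy():              # wait until the service notices the outage
    time.sleep(0.5)
outage_start = time.monotonic()
while not is_healthy():          # wait for re-connect + snapshot restore
    time.sleep(0.5)
recovery = time.monotonic() - outage_start
print(f"service recovered {recovery:.1f}s after going unhealthy")
```

The measured recovery time can then be compared against the 10-15 second snapshot-restore target stated in the stability requirements below.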

Destructive Testing for Event Store

Name of requirement 

Event store destructive testing phase 2

Type 

Business requirement

Description 

We need to perform destructive testing of the event store against the scenarios below:

  • Problem Statement: The event store fails to commit all data to disk during a container restart and then fails to reload in the container.
    Test Case: Verify that surviving nodes repair a damaged node after it fails to initialise.
    Test that the Event Store is able to rebuild its state on instance failure (ES instance failure); a pod-kill recovery sketch follows this list.

  • Problem statement: When the connection from a microservice to the Event Store is broken, the microservice does not re-establish a valid connection. We need to identify the time needed to restart the service after a failure/connection issue.
    Test Case: Test the snapshot mechanism during an Event Store failure (time needed to restart the service after the failure/connection issue).
    Test that the Event Store is able to rebuild its state on instance failure (ES instance failure).

  • Problem statement: The event store is currently scaled to 3 instances and produces a lot of internal communication.
    Test case: Scale the event store service up until it no longer responds, to identify the current upper limit of instances within the Kubernetes cluster (perform Event Store destructive testing with the clustered core services; services responsible for the main flows should be scaled up to 3 nodes).
    Test that the Event Store is able to rebuild its state on instance failure (ES instance failure).
    Test Core Services stability on Event Store failure:

  1. Partial Failure (some instances are active, some have failed) - services should work correctly

  2. Complete Failure (all instances have failed) - services are not able to process commands, but are able to re-connect after the Event Store recovers
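One way to exercise the rebuild-on-instance-failure case is the sketch below: kill a single Event Store pod and time how long the cluster takes to report all pods Ready again. The namespace and label selector are hypothetical and must be adjusted to the actual deployment:

```python
# Hedged sketch: kill one Event Store pod and measure cluster recovery time.
# NAMESPACE and SELECTOR are hypothetical placeholders.
import subprocess
import time

NAMESPACE = "redbox"          # hypothetical namespace
SELECTOR = "app=eventstore"   # hypothetical label selector

def pod_names() -> list:
    """Names of all Event Store pods matching the selector."""
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR,
         "-o", "jsonpath={.items[*].metadata.name}"], text=True)
    return out.split()

def all_ready() -> bool:
    """True when every matching pod's containers report ready."""
    out = subprocess.check_output(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", SELECTOR,
         "-o", "jsonpath={.items[*].status.containerStatuses[*].ready}"],
        text=True)
    flags = out.split()
    return bool(flags) and all(f == "true" for f in flags)

victim = pod_names()[0]
subprocess.check_call(["kubectl", "delete", "pod", victim, "-n", NAMESPACE])

start = time.monotonic()
while not all_ready():
    time.sleep(1)
print(f"cluster Ready again {time.monotonic() - start:.1f}s after killing {victim}")
```

For the complete-failure case, the same approach can delete all pods matching the selector and additionally verify that core services resume processing commands once the cluster is back.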

Pre-conditions


  • Event Store is running in the clustered mode (3 pods).

  • Core services responsible for the call capturing/processing/managing flow are scaled up to 3 pods per service.

  • Snapshot mechanism is implemented for Configuration and Identity Storing services.

  • Event Store is upgraded to version 5.xx (a pre-condition check sketch follows this list).
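The clustering pre-conditions above can be checked automatically before each run; the sketch below assumes the components are Kubernetes Deployments with the hypothetical names shown (in practice the Event Store may be a StatefulSet, in which case the resource kind needs adjusting):

```python
# Hedged sketch: verify the clustering pre-conditions before a test run.
# The namespace and deployment names are hypothetical placeholders.
import subprocess

NAMESPACE = "redbox"  # hypothetical
EXPECTED = {          # deployment -> expected replica count (placeholders)
    "eventstore": 3,
    "call-capturing": 3,
    "call-processing": 3,
    "call-management": 3,
}

for deploy, want in EXPECTED.items():
    out = subprocess.check_output(
        ["kubectl", "get", "deployment", deploy, "-n", NAMESPACE,
         "-o", "jsonpath={.status.readyReplicas}"], text=True)
    got = int(out or 0)
    status = "OK" if got == want else "FAIL"
    print(f"{status}: {deploy} ready={got} expected={want}")
```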

Environments

  • Azure Kubernetes Environment (regular Dev/QA environment with SSD attached as a persistent storage).

  • On-prem regression-like environment with HDD and NFS attached as persistent storage (the on-prem environment should not be restarted overnight and should contain a large number of events).

Priority 

Must 

Tags

NFR; Resiliency


Stability Testing

Name of requirement 

Stability testing

Type 

Business requirement

Description 

We also need to perform stability testing for the event store against the scenarios below:

  • Problem statement: Flooding the event store causes mass disconnection of microservices when slow hard drives are used.
    Test Case: Use a standard hard drive for event store storage, flood the event store, and observe it disconnecting the microservices; this should not happen when using an SSD (test the Event Store with HDD and SSD disks under load). A flood probe sketch follows this list.
    Test the system with low heartbeat timeouts (emulate network latency).
    Capture performance metrics for the Call Processing flow during a partial Event Store failure.

  • Problem statement: Failure of Event Store instances shouldn’t affect overall system stability while at least one instance is available.
    Test Case: Test Core Services stability on Event Store failure:
    Partial Failure (some instances are active, some have failed) - services should work correctly.
    Complete Failure (all instances have failed) - services are not able to process commands, but are able to re-connect after the Event Store recovers.
    Capture performance metrics for the Call Processing flow during a partial Event Store failure.

  • Problem statement: Services should be able to re-connect to the Event Store and be ready to accept commands in less than 10-15 seconds (the maximum time specified for the snapshot restore).
    Test case: To be determined by QA
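A simple flood probe for the HDD-vs-SSD comparison could look like the sketch below. It targets the HTTP write endpoint exposed by Event Store 5.x; the host, port, and stream name are placeholders for the environment under test, and failed appends (timeouts, refused connections) are counted as the disconnection signal:

```python
# Hedged sketch: flood the Event Store write path and count failed appends.
# BASE and STREAM are placeholders; the endpoint format follows the ES 5.x
# HTTP (Atom) API.
import json
import time
import urllib.request
import uuid

BASE = "http://eventstore:2113"     # placeholder endpoint
STREAM = "stability-flood-test"     # hypothetical stream name
DURATION_S = 60

ok = failed = 0
deadline = time.monotonic() + DURATION_S
while time.monotonic() < deadline:
    body = json.dumps([{
        "eventId": str(uuid.uuid4()),
        "eventType": "FloodProbe",
        "data": {"ts": time.time()},
    }]).encode()
    req = urllib.request.Request(
        f"{BASE}/streams/{STREAM}", data=body, method="POST",
        headers={"Content-Type": "application/vnd.eventstore.events+json"})
    try:
        with urllib.request.urlopen(req, timeout=5):
            ok += 1
    except Exception:
        # Timeouts/disconnections are the failure signal we expect on HDD.
        failed += 1
print(f"appends ok={ok} failed={failed} over {DURATION_S}s")
```

Running the same probe against the SSD-backed Azure environment and the HDD/NFS on-prem environment gives directly comparable failure counts for the report.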


Pre-conditions

  • Event Store is running in the clustered mode (3 pods).

  • Core services responsible for the call capturing/processing/managing flow are scaled up to 3 pods per service.

  • Snapshot mechanism is implemented for Configuration and Identity Storing services.

  • Event Store is upgraded to version 5.xx.

Environments

  • Azure Kubernetes Environment (regular Dev/QA environment with SSD attached as a persistent storage).

  • On-prem regression-like environment with HDD and NFS attached as persistent storage (the on-prem environment should not be restarted overnight and should contain a large number of events).

Priority 

Must 

Tags

NFR; Resiliency

Phase 1 Testing

https://uniphore.atlassian.net/wiki/spaces/RedboxHome/pages/2562759119

https://uniphore.atlassian.net/wiki/spaces/RedboxHome/pages/2562753364

https://redboxdev.visualstudio.com/Nubis/_workitems/edit/19145

Reviewers 

@Simon Jolly (Unlicensed)
@Jayasri (Unlicensed)
@Sergey Shafiev (Unlicensed)