Resilience Fixes for Redis
Description:
Extended Architecture’s (EA) Process Manager relies on Redis to process workflows. As a result, when Redis fails or becomes temporarily unavailable, workflow processing is disrupted.
The purpose of this feature is to improve EA’s resilience by introducing a failover mechanism that ensures workflows continue to be processed in the event of a Redis failure.
Business Case
This feature will improve the reliability of EA by ensuring that a Redis failure no longer interrupts workflow processing.
Personas affected
- Hannah James - Senior Operations Analyst
- John Williams - Head of Digital Transformation / CTO / Strategy
User Stories
User Story 1: As a voice recording system administrator, I would like call workflows to continue to be processed if Redis fails, so that the reliability of the system is unaffected by a Redis failure.
Acceptance Criteria:
- Given that EA is running, when there is a Redis failure, then the system should write an error to the logs
- Given that EA is running, when there is a Redis failure, then the system should switch to Event Store as the source for processing call workflows
- Given that EA is running, when there is a Redis failure and the system switches to Event Store as the source for processing call workflows, then all call workflows should be processed without interruption
User Story 2: As a voice recording system administrator, if call workflows are being processed from Event Store due to a Redis failure, I would like the system to revert to Redis once Redis is available, so that the system is restored to its default state for call workflow processing.
Acceptance Criteria:
- Given that EA is currently processing call workflows from Event Store, when Redis becomes available, then the system should revert to using Redis to process all call workflows
Functionality
Use Case Title: | Failover to Event Store |
---|---|
Description (GOAL) | If there is a failure of Redis, the system should failover to Event Store |
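Trigger | Whichever happens sooner: Redis has not responded within 10 seconds, or 3 query retries have failed (see Non Functional Requirement 5) |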
Primary Actors (Personas) | System |
Secondary Actors | NA |
Stakeholders | |
Preconditions | |
Flow (Main success Scenario) | |
Alternative flows | None |
Post-conditions | Success End condition: Failure End condition: |
Frequency | NA |
Priority | Must |
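
The failover trigger can be illustrated with a minimal Python sketch. It assumes the limits stated in Non Functional Requirement 5 (10 seconds or 3 query retries, whichever comes first) and treats "3 retries" as three attempts in total, which is an interpretation rather than something this page confirms. The function name `read_with_failover`, the two reader callbacks, and the logger name are hypothetical and not part of EA.

```python
import logging
import time
from typing import Callable, Tuple

logger = logging.getLogger("ea.process_manager")  # hypothetical logger name

FAILOVER_TIMEOUT_SECONDS = 10.0  # taken from NFR 5
MAX_QUERY_ATTEMPTS = 3           # taken from NFR 5 (assumed to mean total attempts)


def read_with_failover(
    read_from_redis: Callable[[], str],
    read_from_event_store: Callable[[], str],
    timeout_seconds: float = FAILOVER_TIMEOUT_SECONDS,
    max_attempts: int = MAX_QUERY_ATTEMPTS,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, str]:
    """Return (source, value), falling back to Event Store if Redis fails.

    Failover happens when either budget is exhausted first: the number of
    attempts or the elapsed time. The deadline is only checked between
    attempts, so a single blocking call can still overrun it.
    """
    deadline = clock() + timeout_seconds
    attempts = 0
    while attempts < max_attempts and clock() < deadline:
        attempts += 1
        try:
            return "redis", read_from_redis()
        except ConnectionError as exc:
            # Acceptance criterion: a Redis failure is written to the logs.
            logger.error("Redis query failed (attempt %d of %d): %s",
                         attempts, max_attempts, exc)
    logger.error("Redis unavailable after %d attempt(s); failing over to Event Store",
                 attempts)
    return "event_store", read_from_event_store()
```

Injecting the clock keeps the time-based branch testable without real waiting; a production implementation would also need a per-call timeout on the Redis client itself.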
Use Case Title: | Revert to Redis |
---|---|
Description (GOAL) | Revert to Redis as the primary data source for workflow processing |
Trigger | System notification that Redis is available |
Primary Actors (Personas) | System |
Secondary Actors | NA |
Stakeholders | |
Preconditions | |
Flow (Main success Scenario) | 1. System receives notification that Redis is available 2. System switches to Redis |
Alternative flows | None |
Post-conditions | Success End condition: Failure End condition: |
Frequency | NA |
Priority | Must |
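
The two use cases together describe a small two-state switch: Redis is the default source, a failure moves the system to Event Store, and an availability notification moves it back. The following Python sketch shows one way to model that. The class and method names (`WorkflowSourceSwitch`, `on_redis_failure`, `on_redis_available`) are hypothetical, and how EA actually receives the availability notification is not specified here.

```python
import logging
from enum import Enum

logger = logging.getLogger("ea.process_manager")  # hypothetical logger name


class Source(Enum):
    REDIS = "redis"
    EVENT_STORE = "event_store"


class WorkflowSourceSwitch:
    """Tracks which store call workflows are currently processed from."""

    def __init__(self) -> None:
        # Redis is the default source for workflow processing.
        self.active = Source.REDIS

    def on_redis_failure(self) -> None:
        """Fail over to Event Store; repeated failure signals are ignored."""
        if self.active is Source.REDIS:
            logger.error("Redis failure detected; switching to Event Store")
            self.active = Source.EVENT_STORE

    def on_redis_available(self) -> None:
        """Revert to Redis; a notification while already on Redis is a no-op."""
        if self.active is Source.EVENT_STORE:
            logger.info("Redis is available again; reverting to Redis")
            self.active = Source.REDIS
```

Making both handlers idempotent means duplicate or out-of-order notifications cannot leave the switch in an unexpected state.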
Non Functional Requirements
Ref | Area | MoSCoW | Requirement | Comments |
---|---|---|---|---|
1 | Error-handling | M | Ease with which the system can degrade gracefully if errors occur, e.g. does the entire system go down and lose data if the internet goes down | Capture error in logs |
2 | Legal and Regulatory | | Specific legal and regulatory requirements associated with the feature | NA |
3 | Licensing | | New/amended licensing requirements associated with the feature or with introduced 3rd party components | NA |
4 | Localizability | | Need to include localised features, e.g. currency, date formats | NA |
5 | Performance | M | Ability to meet specific performance standards/requirements | Failover timeout for Redis should be 10 seconds or 3 query retries (whichever happens first) |
6 | Concurrency | | Specific concurrency requirements | NA |
7 | Resilience | M | Ability to handle failure of an individual component within the system | Failure of Redis should not affect the operation of the system |
8 | Scalability | | Requirements to support increasing numbers of users/concurrency without incurring significant cost | NA |
9 | Security | M | Adherence to defined/specified customer/industry security standards | All connections to Redis and Event Store must remain protected |
10 | Storage | | Specific storage requirements/considerations | NA |
11 | Supportability | S | Ease with which Support could/need to access logs etc. to diagnose a problem | Service configuration should support setting the number of retry attempts and the time interval for status checks |
12 | Test requirements | | Ease with which the functionality could/should be supported by automated testing | NA |
13 | Training | | Specific training/installation/configuration documentation associated with this feature that needs to be created/updated | NA |
14 | User Experience | | Specific user experience requirements that would ensure the functionality is acceptable to customers, e.g. can complete action within x clicks | NA |
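
The Supportability and Performance requirements imply a small set of service settings. As a rough Python sketch of what such configuration might look like: the key names (`retryAttempts`, `failoverTimeoutSeconds`, `statusCheckIntervalSeconds`) and the 5-second status check default are assumptions made for illustration, while the 3 retries and 10-second timeout come from requirement 5.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RedisFailoverSettings:
    retry_attempts: int = 3                     # from NFR 5
    failover_timeout_seconds: float = 10.0      # from NFR 5
    status_check_interval_seconds: float = 5.0  # assumed default, not specified

    def __post_init__(self) -> None:
        # Reject values that would disable or break the failover logic.
        if self.retry_attempts < 1:
            raise ValueError("retry_attempts must be at least 1")
        if self.failover_timeout_seconds <= 0:
            raise ValueError("failover_timeout_seconds must be positive")
        if self.status_check_interval_seconds <= 0:
            raise ValueError("status_check_interval_seconds must be positive")


def load_settings(raw: Mapping[str, Any]) -> RedisFailoverSettings:
    """Build settings from a parsed config mapping, using defaults for missing keys."""
    defaults = RedisFailoverSettings()
    return RedisFailoverSettings(
        retry_attempts=int(raw.get("retryAttempts", defaults.retry_attempts)),
        failover_timeout_seconds=float(
            raw.get("failoverTimeoutSeconds", defaults.failover_timeout_seconds)
        ),
        status_check_interval_seconds=float(
            raw.get("statusCheckIntervalSeconds", defaults.status_check_interval_seconds)
        ),
    )
```

Validating at load time means a misconfigured service fails at startup rather than during a Redis outage.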
- Simon Jolly (Technical Architect) reviewed and signed off
- Sergey Shafiev (Team Lead) to review and sign off (signed off by Kirill Zotkin on behalf of Sergey)
- Vikash Mahabir (QA Manager) to review and sign off