Resilience Fixes for Redis

Description:

Extended Architecture’s (EA) Process Manager relies on Redis for processing workflows. As a result, when Redis fails or becomes temporarily unavailable, workflow processing is disrupted.

The purpose of this feature is to improve EA’s resilience by introducing a failover mechanism that ensures workflows continue to be processed in the event of a Redis failure.

Business Case

  • This feature will improve the reliability of EA

Personas affected

  • Hannah James - Senior Operations Analyst
  • John Williams - Head of Digital Transformation/ CTO/ Strategy

User Stories

User Story 1: As a voice recording system administrator, I would like call workflows to continue to be processed if Redis fails, so that the reliability of the system is unaffected by a failure of Redis.

Acceptance Criteria: 

  1. Given that EA is running, when there is a Redis failure, then the system should write an error to the logs
  2. Given that EA is running, when there is a Redis failure, then the system should switch to Event Store as the source for processing call workflows
  3. Given that EA is running, when there is a Redis failure and the system switches to Event Store as the source for processing call workflows, then all call workflows should be processed without interruption

User Story 2: As a voice recording system administrator, if call workflows are being processed from Event Store due to a Redis failure, I would like the system to revert to Redis once Redis is available again, so that the system is restored to its default state for call workflow processing.

Acceptance Criteria:

  1. Given that EA is currently processing call workflows from Event Store, when Redis becomes available, then the system should revert to using Redis to process all call workflows

Functionality

Use Case Title:

Failover to Event Store

Description (GOAL)

If Redis fails, the system should fail over to Event Store

Trigger

Whichever happens sooner:

  • Redis is unavailable for 10 seconds

Or

  • 3 successive queries to Redis are unsuccessful
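The trigger above could be sketched as a small detector that is fed the outcome of each Redis query. This is only an illustration, not EA's implementation: the class name `RedisFailureDetector`, its fields, and the choice to pass timestamps in explicitly (for example from `time.monotonic()`) are assumptions. The two thresholds are the values stated in the trigger.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedisFailureDetector:
    """Hypothetical helper that decides when the failover trigger is met."""

    max_unavailable_seconds: float = 10.0  # "Redis is unavailable for 10 seconds"
    max_consecutive_failures: int = 3      # "3 successive queries ... unsuccessful"
    consecutive_failures: int = 0
    first_failure_at: Optional[float] = None

    def record_success(self) -> None:
        # Any successful query ends the current outage window.
        self.consecutive_failures = 0
        self.first_failure_at = None

    def record_failure(self, now: float) -> None:
        # Count the failure and remember when the outage started.
        self.consecutive_failures += 1
        if self.first_failure_at is None:
            self.first_failure_at = now

    def should_fail_over(self, now: float) -> bool:
        # Whichever condition is met first triggers failover.
        if self.consecutive_failures >= self.max_consecutive_failures:
            return True
        if self.first_failure_at is None:
            return False
        return now - self.first_failure_at >= self.max_unavailable_seconds
```

Taking `now` as a parameter keeps the detector free of hidden clock reads, which makes both thresholds easy to exercise in automated tests.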

Primary Actors (Personas)

System

Secondary Actors 

NA

Stakeholders 

  • Hannah James
  • John Williams

Preconditions 

  • EA system is running
  • Event Store is available

Flow (Main success Scenario)

  1. System receives notification of Redis failure
  2. System writes the error to the logs
  3. System fails over to Event Store
  4. System processes workflows from Event Store
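As a rough sketch, the four steps of this flow might map onto a single handler like the one below. The names `WorkflowSource`, `WorkflowSourceSelector`, `on_redis_failure`, and the logger name are hypothetical and chosen only for illustration; how EA actually receives the failure notification is not specified here.

```python
import logging
from enum import Enum

logger = logging.getLogger("ea.process_manager")  # hypothetical logger name


class WorkflowSource(Enum):
    REDIS = "redis"
    EVENT_STORE = "event_store"


class WorkflowSourceSelector:
    """Hypothetical holder for the currently active workflow source."""

    def __init__(self) -> None:
        # Redis is the default source for workflow processing.
        self.active = WorkflowSource.REDIS

    def on_redis_failure(self, reason: str) -> WorkflowSource:
        # Step 1: the failure notification arrives as this call.
        # Step 2: write the error to the logs.
        logger.error("Redis failure detected: %s", reason)
        # Step 3: fail over to Event Store.
        self.active = WorkflowSource.EVENT_STORE
        # Step 4: callers process workflows from whichever source is active.
        return self.active
```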

Alternative flows

None

Post-conditions 

Success End condition: 

  • Workflows are processed from Event Store

Failure End condition:

  • Failure condition logged

Frequency 

NA

Priority 

Must

Use Case Title:

Revert to Redis

Description (GOAL)

Revert to Redis as the primary data source for workflow processing

Trigger

System notification that Redis is available

Primary Actors (Personas)

System

Secondary Actors 

NA

Stakeholders 

  • Hannah James
  • John Williams

Preconditions 

  • EA system is running
  • Redis failure has occurred

Flow (Main success Scenario)

1. System receives notification that Redis is available

2. System switches to Redis

Alternative flows

None

Post-conditions 

Success End condition: 

  • Workflows are processed from Redis

Failure End condition:

  • Error condition logged
  • System continues processing workflows from Event Store
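The revert flow and its failure end condition could be expressed as one function, sketched below under assumptions. The function name `revert_to_redis`, the string source names, and the `redis_is_healthy` callback are hypothetical; the callback stands in for whatever check EA uses to confirm that Redis is really usable again.

```python
import logging
from typing import Callable

logger = logging.getLogger("ea.process_manager")  # hypothetical logger name


def revert_to_redis(current_source: str, redis_is_healthy: Callable[[], bool]) -> str:
    """Return the source to use after a 'Redis is available' notification."""
    if current_source != "event_store":
        # Nothing to revert: the system is not running on Event Store.
        return current_source
    try:
        healthy = redis_is_healthy()
    except Exception as exc:
        # Failure end condition: log the error, keep using Event Store.
        logger.error("Redis check raised during revert: %s", exc)
        return "event_store"
    if not healthy:
        logger.error("Redis reported available but check failed; staying on Event Store")
        return "event_store"
    # Success end condition: workflows are processed from Redis again.
    return "redis"
```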

Frequency 

NA

Priority 

Must

Non Functional Requirements

| Ref | Area | MoSCoW | Requirement | Comments |
|-----|------|--------|-------------|----------|
| 1 | Error-handling | M | Ease with which the system can degrade gracefully if errors occur, e.g. does the entire system go down and lose data if the internet goes down? | Capture error in logs |
| 2 | Legal and Regulatory | | Specific legal and regulatory requirements associated with the feature | NA |
| 3 | Licensing | | New/amended licensing requirements associated with the feature or with introduced 3rd party components | NA |
| 4 | Localizability | | Need to include localised features, e.g. currency, date formats | NA |
| 5 | Performance | M | Ability to meet specific performance standards/requirements | Failover timeout for Redis should be 10 seconds or 3 query retries (whichever happens first) |
| 6 | Concurrency | | Specific concurrency requirements | NA |
| 7 | Resilience | M | Ability to handle failure of an individual component within the system | Failure of Redis should not affect the operation of the system |
| 8 | Scalability | | Requirements to support increasing numbers of users/concurrency without incurring significant cost | NA |
| 9 | Security | M | Adherence to defined/specified customer/industry security standards | All connections to Redis and Event Store must remain protected |
| 10 | Storage | | Specific storage requirements/considerations | NA |
| 11 | Supportability | S | Ease with which Support could/need to access logs etc. to diagnose a problem | Service configuration should support setting the number of retry attempts and the time interval for status checks |
| 12 | Test requirements | | Ease with which the functionality could/should be supported by automated testing | NA |
| 13 | Training | | Specific training/installation/configuration documentation associated with this feature that needs to be created/updated | NA |
| 14 | User Experience | | Specific user experience requirements that would ensure the functionality is acceptable to customers, e.g. can complete action within x clicks | NA |
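The Supportability requirement (Ref 11) asks for the retry count and status-check interval to be configurable. One possible shape for that configuration is sketched below. The setting names, the `load_failover_settings` helper, and the 1-second default check interval are assumptions made for illustration; only the defaults of 3 retries and 10 seconds come from the Performance requirement (Ref 5).

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FailoverSettings:
    """Hypothetical settings object for the failover behaviour."""

    retry_attempts: int = 3                      # default from Ref 5
    failover_timeout_seconds: float = 10.0       # default from Ref 5
    status_check_interval_seconds: float = 1.0   # assumed default, not in the spec


def load_failover_settings(raw: Mapping[str, str]) -> FailoverSettings:
    """Build settings from string key/value pairs, falling back to defaults."""
    defaults = FailoverSettings()
    settings = FailoverSettings(
        retry_attempts=int(raw.get("retry_attempts", defaults.retry_attempts)),
        failover_timeout_seconds=float(
            raw.get("failover_timeout_seconds", defaults.failover_timeout_seconds)
        ),
        status_check_interval_seconds=float(
            raw.get("status_check_interval_seconds", defaults.status_check_interval_seconds)
        ),
    )
    # Reject values that would disable failure detection entirely.
    if settings.retry_attempts < 1:
        raise ValueError("retry_attempts must be at least 1")
    if settings.failover_timeout_seconds <= 0:
        raise ValueError("failover_timeout_seconds must be positive")
    if settings.status_check_interval_seconds <= 0:
        raise ValueError("status_check_interval_seconds must be positive")
    return settings
```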



