Disaster Recovery Testing Guidelines
Application
The University Disaster Recovery (DR) standard and its supporting guidelines apply to all ITS managed computer systems.
Purpose
These guidelines are designed to support the operation of the DR standard. The DR standard is designed to ensure that DR requirements have been formally considered for all new and existing ITS managed computer systems.
Disaster recovery test frequency
Recovery Priority | Recovery Time | Testing Frequency |
---|---|---|
1 | 1.0 hour | 6 monthly |
2 | 1.0 – 2.0 hours | 6 monthly |
3 | 2.0 – 4.0 hours | 6 monthly |
4 | 4.0 hours | 12 monthly |
Good practice stipulates that the frequency of disaster recovery tests is determined by the system’s recovery time objective (RTO). The table above recommends a DR testing frequency based on RTO.
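The table above can also be read as a simple lookup from recovery priority to testing interval. The following is a minimal sketch, assuming tests are scheduled in whole calendar months from the date of the last test; the names `TEST_INTERVAL_MONTHS`, `add_months` and `next_test_due` are hypothetical and are not part of the DR standard.

```python
import calendar
from datetime import date

# Testing interval in months for each recovery priority (from the table above).
TEST_INTERVAL_MONTHS = {1: 6, 2: 6, 3: 6, 4: 12}


def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day to the month's end."""
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def next_test_due(priority: int, last_test: date) -> date:
    """Return the date by which the next DR test is due (assumed scheduling rule)."""
    return add_months(last_test, TEST_INTERVAL_MONTHS[priority])
```

For example, `next_test_due(4, date(2013, 11, 15))` gives a due date twelve months after the last test, while a priority 1 system tested on the same day would be due six months later.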
Waiving disaster recovery tests
DR tests are performed to confirm a system or application can be resumed within the specified timeframe; they also serve as training for staff.
The dates for DR Tests are recorded in the DR testing calendar. To allow flexibility, minimise Production system downtime and ensure efficient use of Staff time, DR tests may be waived, or combined with maintenance (e.g. software patching) if:
- Maintenance on an active-active or active-stand-by system has been completed within the past 2 months
- Maintenance on an active-active or active-stand-by system is scheduled to be completed in the following 2 months
- The system’s DR procedures have been activated during a live DR incident within the past 4 months
To determine whether a scheduled DR test can be waived, the following points must be considered:
- If maintenance was performed, were servers disabled at one site and then the other (at separate times), so that the system operated solely from one site and then solely from the other?
- For scheduled maintenance, the same condition as the previous bullet point must apply
- DR procedures for that system/service have been activated within the past 4 months
- The failover includes 75% of the intended scope of the DR test
If it is decided the scheduled DR test is not required, the DR system owner must agree, and a DR test report (using the DR test report template) must be written retrospectively, including a note specifying that the DR test was waived and the reasons why.
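As an illustration only, the waiver criteria above could be expressed as a pre-check run before the DR system owner is asked to agree. This is a sketch under several assumptions that are not stated in these guidelines: 2 months is approximated as 61 days and 4 months as 122 days, the 75% figure is read as "at least 75%", and a waiver is considered when either the maintenance route or the live incident route is satisfied. The names `WaiverEvidence` and `may_waive` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: calendar months approximated as fixed day counts.
TWO_MONTHS = timedelta(days=61)
FOUR_MONTHS = timedelta(days=122)


@dataclass
class WaiverEvidence:
    # Completed or scheduled maintenance on an active-active/active-stand-by system.
    maintenance_date: Optional[date]
    # True if the system ran solely from one site, then solely from the other.
    ran_from_each_site_in_turn: bool
    # Date the DR procedures were last activated in a live DR incident.
    live_activation_date: Optional[date]
    # Fraction (0.0 to 1.0) of the intended DR test scope covered by the failover.
    scope_coverage: float


def may_waive(ev: WaiverEvidence, test_date: date) -> bool:
    """Return True if the scheduled test is a candidate for waiving.

    This is only a pre-check; the DR system owner must still agree.
    """
    maintenance_ok = (
        ev.maintenance_date is not None
        and abs((test_date - ev.maintenance_date).days) <= TWO_MONTHS.days
        and ev.ran_from_each_site_in_turn
    )
    incident_ok = (
        ev.live_activation_date is not None
        and timedelta(0) <= test_date - ev.live_activation_date <= FOUR_MONTHS
    )
    # Assumption: "75% of the intended scope" is treated as a minimum.
    return (maintenance_ok or incident_ok) and ev.scope_coverage >= 0.75
```

A `True` result does not waive the test by itself: the owner's agreement and the retrospective DR test report are still required.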
Disaster recovery test set-up
The types of DR test performed include:
- Technical/activation
- Specific component
- Full
The DR testing programme for each system will largely determine the type of DR test to be performed and the test scope.
Process:
- Set test date
- Develop and circulate DR test scope
- Assemble test team, confirm test scope and identify preparation tasks
- Review the DR test report from the previous DR test
- Progress the DR test through change management
- Complete preparation tasks
- Perform DR test
- Complete post DR test activities
Test team
Secure ITS and business personnel to assist in preparing for and performing the DR test.
The test team is the group who will prepare for and perform the DR test. Typically the test team includes the following:
- The DR Coordinator
- Representatives of the system owners (Technical Services or Infrastructure Services)
- Testers from business users
Agreement with Business owner
Contact the business owner to confirm planning for the DR test has started and discuss:
- Suitable dates for the DR test (based on the DR testing and maintenance calendar, processing cycles and staff availability/preference)
- The type of DR test to be performed (if applicable) and the DR test scope
- Assistance from the business – e.g. testers
- Appointing a business representative to assist in the planning of the DR test
Test scope
Review the previous DR test report to confirm all action items have been completed and identify any specific tests to be included in this DR test. Some considerations may be:
- To test the connectivity to other systems (DR or Production)
- User testing
- Recover to a set point or to the current point
Risks
DR tests come with some risks. Common risks are:
- There is no DR during the DR test
- There is no DR until the DR environment has been restored. This is particularly relevant if the stand-by database is taken out of stand-by and has to be restored.
- If connectivity tests are included, care must be given to ensure live/Production data does not get passed to the DR systems and test data does not get passed to Production systems
During the planning stage all risks associated with this DR test must be identified and included in the DR test scope. Procedures must also be developed to manage these risks.
DR test approach
Set the start time for the DR test. Where possible, schedule preparation tasks to be performed slightly ahead of time. The duration of the DR test will depend on the type and complexity of test being performed. Ensure sufficient time to restore the Production and DR environments after the DR test is complete.
DR test progress updates to owners
Offer to text business and system owners at strategic stages during the DR test. Usually they only want to know when the DR test is completed, or if issues arise.
Change management
Once the DR test scope, date, time and personnel are confirmed, progress the DR test through the change management process.
User testing
User testing verifies the functionality and connections of the DR system and also the integrity of the DR data. A simple user test may be for a single user to log on to the DR system and navigate screens, checking data. This would confirm the security, screen functionality and data of the DR system. A more complex user test may involve:
- Multiple users from multiple sites logging on
- Comparison of data between the production and DR systems via screen shots
- Entering data into the DR system
- Running reports
- Checking links/data flow to other systems
For more complex user tests, care must be given to ensuring live data is not entered into the DR system by mistake and test data does not find its way into Production systems.
The test scope will define how much user testing is required.
Testers are arranged by the business owner. Allow sufficient time for the testing and any regression/tidy up activities.
As part of the preparation for the DR test, testers need to develop a user test plan. User test plans from previous DR tests, or test plans from any system development projects, may be used.
Arrange a session to review the test routine to ensure all objectives are tested and data integrity is maintained.
DR Test Preparation
Review the DR plan and technical/supporting documentation to ensure they are current and complete, and review the DR system to ensure it is aligned with Production. If significant changes to the DR system are required, note this in the test scope and the DR test report.
Communications
Once the DR test has been approved by the CAB, ensure the business owner has advised all Stakeholders and Users of the impending DR test, especially the time and date. Care must be given to ensuring live data is not entered into the DR system by mistake and test data does not find its way into Production environments.
Perform DR test
Perform the DR test as documented in the DR test scope, DR plan and technical documentation. All issues and anomalies must be documented for review and action.
Post-Mortem review
All participants meet to review the DR test. The first item discussed is a comparison of test results (what went right and what went wrong) against the DR test scope, to determine and agree the success (or otherwise) of the DR test.
Next, the issues encountered are reviewed. Actions are agreed and assigned to staff for resolution. If possible, a resolution date is also agreed to ensure timely completion of action items. If the DR test was not successful, or only partially successful, it is likely a follow up (full or partial) DR test will be required once action items are completed.
DR test report
The DR test report is written and circulated to stakeholders. If successful, the DR test is signed off by the business owner. The DR test report must then be posted on the SharePoint site.
DR test setup checklist
Complete | Tasks | Comment |
---|---|---|
 | Agree test date | |
 | Confirm and document DR test scope | |
 | Change Management | |
 | Review and update DR plan and technical documentation | |
 | Scope user tests and arrange testers | |
 | Communications – advise stakeholders | |
 | Perform DR test | |
 | Hold test post-mortem review meeting | |
 | Confirm and assign action items | |
 | Write test report and distribute | |
 | Obtain DR test sign off from business owner | |
 | Co-ordinate completion of action items | |
 | Test tidy up and follow up tasks | |
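
Where the checklist is tracked electronically rather than on paper, it could be held as an ordered list so the next outstanding task is easy to find. The sketch below is illustrative only; `CHECKLIST` and `next_task` are hypothetical names, and the assumption that tasks are completed in the order listed is not stated in these guidelines.

```python
from typing import Iterable, Optional

# Tasks in the order they appear in the DR test setup checklist.
CHECKLIST = [
    "Agree test date",
    "Confirm and document DR test scope",
    "Change Management",
    "Review and update DR plan and technical documentation",
    "Scope user tests and arrange testers",
    "Communications – advise stakeholders",
    "Perform DR test",
    "Hold test post-mortem review meeting",
    "Confirm and assign action items",
    "Write test report and distribute",
    "Obtain DR test sign off from business owner",
    "Co-ordinate completion of action items",
    "Test tidy up and follow up tasks",
]


def next_task(completed: Iterable[str]) -> Optional[str]:
    """Return the first checklist task not yet completed, or None if all are done."""
    done = set(completed)
    for task in CHECKLIST:
        if task not in done:
            return task
    return None
```

Calling `next_task` with the tasks marked complete so far returns the next item to action, and returns `None` once every task has been completed.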
Definitions
The following definitions apply to this document:
Computer System describes a group of applications which, combined, provide an IT service. For example, a typical computer system may include a business application, database and web access application.
Failover refers to transferring processing from the Production (or primary) computer system to the disaster recovery computer system.
Incident is a disruptive event that causes an unacceptable interruption to business operations and/or IT services.
Recovery time objective (RTO) is the length of time, following the decision to invoke the DR Plan, within which an IT service must be restored.
University means the University of Auckland and includes all subsidiaries.
Key relevant documents
Include the following:
Document management and control
Prepared by: DR Programme Manager
Owned by: Chief Digital Officer (CDO)
Date approved: November 2013
Review date: November 2016