Disaster Recovery Planning Standard
Disaster recovery (DR) plans form part of the University’s overall business continuity management (BCM) process. DR plans protect the University’s computer systems in the event of an unacceptable disruption to IT services. Other components of the BCM process are incident management and business continuity.
This DR standard provides the framework that governs how the University performs all disaster recovery planning activities (DR plan development, maintenance and testing) and ensures the requirement for a disaster recovery solution is considered for all ITS projects. The standard sets out essential rules and defines the roles and responsibilities that apply, in different circumstances, to the development, maintenance and testing of the University’s DR plans.
Standards – summary
- Implementing a disaster recovery (DR) solution must be fully considered for every new and existing University computer system and a decision must be formally recorded
- All DR documents must use the DR templates available on the Disaster Recovery SharePoint Site
- DR plans must be tested and maintained according to the schedule set out in the DR testing and maintenance calendar, or whenever major system changes are made
- Roles and responsibilities must be defined for ownership of the DR requirement, DR solution and DR plan, and for ensuring the DR solution and DR plan are tested and maintained
Implementing a disaster recovery (DR) solution
The decision whether or not to implement a disaster recovery (DR) solution must be taken for every new and existing University computer system. If the decision is taken not to implement a DR solution, the risks and impact to the Unit/University and the system’s recovery requirements must be formally identified via a business impact analysis (BIA), performed by a suitably experienced person, in conjunction with ITS. The DR recovery requirements include the following:
- System recovery point objective (RPO)
- System recovery time objective (RTO)
- The relative capacity of the DR system compared to the production system
- All systems must have tape backup as a minimum
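As an illustration only, the recovery requirements listed above could be recorded and checked against the figures measured in a DR test, as in the Python sketch below. The class, field and function names and the sample values are hypothetical and are not part of this standard. The sketch also assumes the RPO is expressed as a maximum acceptable window of data loss rather than as a specific time and date.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryRequirements:
    """Recovery requirements from a BIA (hypothetical field names)."""
    rpo: timedelta            # assumed: maximum acceptable window of data loss
    rto: timedelta            # maximum time to restore the IT service
    relative_capacity: float  # DR capacity as a fraction of production capacity


def unmet_requirements(req, data_loss, time_to_restore):
    """Return the names of the recovery objectives that were not met."""
    unmet = []
    if data_loss > req.rpo:
        unmet.append("RPO")
    if time_to_restore > req.rto:
        unmet.append("RTO")
    return unmet


# Sample values are illustrative only.
req = RecoveryRequirements(
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
    relative_capacity=0.5,
)
print(unmet_requirements(req, timedelta(minutes=30), timedelta(hours=5)))
# prints ['RTO']
```

In this example the measured data loss is within the RPO, but the service took longer to restore than the RTO allows, so only the RTO is reported as unmet.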
The system’s business owner owns the DR Solution including the DR requirements, DR system and DR plan. They are responsible for stating whether a DR solution is to be implemented and for specifying the system’s recovery requirements.
If the business owner decides no DR solution is to be implemented, they accept the risk that in a DR situation (following a significant outage), their system could be unavailable indefinitely or never recovered.
For existing computer systems the BIA may be performed retrospectively and must be driven by the business owner.
If it is determined that a system requires a DR solution to be developed, it is the responsibility of the business owner to facilitate the design and costing of that DR solution. ITS are responsible for the design of the DR system. The costing must include the cost of implementing the DR solution as well as any modifications to the Production system to accommodate the DR system and the on-going cost of maintaining and testing the DR solution. Maintaining the DR solution involves software licenses, operating software upgrades, server health checks and future hardware upgrades.
Funding of the DR solution is also provided by the business owner. However, funding may be negotiated with other departments.
If the development of the DR solution is deferred, by default the business owner has accepted the risk that the system could potentially be unavailable indefinitely or never recovered following a significant outage.
For new computer systems the BIA must be performed during the system design phase of the project and the DR requirements included in the business case for the development of the system.
It is the responsibility of the project steering committee to determine whether a DR solution is to be developed or not and whether to implement the DR solution as part of the project, or at a later date. The project manager must raise the requirement for DR based on the business case.
If the project steering committee determines no DR solution is to be implemented, by default they have accepted the risk that the system could potentially be unavailable indefinitely or never recovered following a significant outage. The recovery process will be to restore from tape backup. If the DR solution is to be implemented at a later date, the same risk applies until the DR solution is implemented. This decision must be clearly documented in the system functional specifications and steering committee minutes. The following decisions must also be made and included in the system functional specifications:
- A date to review the decision whether to implement the DR solution or a date to begin implementing the DR solution
- Who will fund the DR solution when/if it is implemented.
- Whether ITS or the business owner will project manage the implementation of the DR solution
The business owner may choose to update their business continuity plan/s to cover the impact of that system being unavailable.
Development project DR
Critical stages of the development project may exist where the development and/or testing environments require DR systems to be set up. This could be due to a fixed go-live date that cannot be moved because of legislative requirements (e.g. a new GST rate, or changes to benefit or Ministry of Education funding) or University requirements (e.g. a new course starting). These critical stages must be identified and documented in the project scope or business case.
DR systems in the development and test environments must have a finite life span and the date the DR system can be removed must be specified in the server set up request.
Before a new computer system is accepted into production the following points must be considered, confirmed and included in the project completion report:
- If a DR solution is implemented, a DR solution that has been formally tested and documented (in a DR plan) to the University standards must be included in the system hand-over. The DR test approach and results must be signed off by the manager of the team who will be supporting the system in Production. The ITS Incident Management plan, Recovery Priority spreadsheet and other associated documentation must be updated.
- If a DR solution was not delivered as part of the project, the project steering committee must formally acknowledge and accept the risks of not implementing a DR solution and the Director of ITS must sign off this decision. The top four risks and the impact of those risks to the business and the University must also be documented.
All DR plans must use the DR plan template
All DR plans must use the DR plan template provided by ITS. The DR plan template ensures all DR plans are consistent and the DR plan includes all vital information such as:
- DR system overview
- Escalation and failover procedures and considerations
- Recovery personnel and roles and responsibilities
- Failover timeframes for different stages in the processing cycle
- System verification procedures
- Business contacts
Testing and maintenance of DR plans
DR tests are performed according to the schedule set out in the DR testing and maintenance calendar. The frequency and type/complexity of DR tests are dictated by the system recovery requirements and the complexity of the DR solution. Where it is required, a DR testing programme is defined with specific DR tests entered into the DR testing and maintenance calendar.
DR tests must be managed as a project and be set up using the ITS provided DR test planning guide. A range of templates are also provided to ensure DR tests are complete and performed to a consistent process.
DR test planning guide
The DR test planning guide documents the procedure to set up and perform the DR test.
DR test scope
The DR test scope details the scope, approach, objectives and criteria for success of the DR test.
DR test report
The DR test report records the outcomes of the DR test, including the success or otherwise of the DR test, action items arising from the DR test and any follow-up testing to be performed.
If mandatory criteria for success are not met, the DR test will be deemed to have failed. The issues must then be resolved and a re-test scheduled and performed.
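The pass/fail rule above can be sketched in Python as follows. The criterion structure, names and sample criteria are hypothetical. The only rule taken from this standard is that a DR test is deemed to have failed if any mandatory criterion for success is not met.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One success criterion from the DR test scope (hypothetical structure)."""
    description: str
    mandatory: bool
    met: bool


def dr_test_passed(criteria):
    """A DR test passes only if every mandatory criterion is met."""
    return all(c.met for c in criteria if c.mandatory)


def issues_to_resolve(criteria):
    """Mandatory criteria that were not met and must be resolved before the re-test."""
    return [c.description for c in criteria if c.mandatory and not c.met]


# Sample criteria are illustrative only.
criteria = [
    Criterion("Failover completed within RTO", mandatory=True, met=False),
    Criterion("Data verified by business owner", mandatory=True, met=True),
    Criterion("Failover timings recorded", mandatory=False, met=False),
]
print(dr_test_passed(criteria))     # prints False
print(issues_to_resolve(criteria))  # prints ['Failover completed within RTO']
```

The non-mandatory criterion that was not met does not affect the result. It would typically be carried into the DR test report as an action item instead.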
All DR plans are stored in the SharePoint DR site. The SharePoint site is backed up to the offsite dropbox on a daily basis. Read-only access is provided to the business owner and ITS departments. In a DR situation, the DR plan version in the dropbox is the official version.
DR testing and maintenance calendar
The DR testing and maintenance calendar is maintained by ITS and is a two year rolling schedule of all DR tests to be performed and the dates for DR plan reviews. It is updated on a monthly basis.
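A two year rolling schedule of this kind might be computed as in the Python sketch below. This is a minimal sketch under stated assumptions: the window is assumed to start on the first day of the current month and to cover 24 months, and calendar entries are assumed to be simple (date, description) pairs. The function names and sample entries are hypothetical.

```python
from datetime import date


def rolling_window(today, months=24):
    """Return (start, end) of the rolling window; end is exclusive."""
    start = date(today.year, today.month, 1)
    # Count months from year 0 so that year rollover is handled by division.
    total = start.year * 12 + (start.month - 1) + months
    end = date(total // 12, total % 12 + 1, 1)
    return start, end


def entries_in_window(entries, today):
    """Return the calendar entries that fall inside the rolling window, in date order."""
    start, end = rolling_window(today)
    return sorted((d, name) for d, name in entries if start <= d < end)


# Sample entries are illustrative only.
entries = [
    (date(2016, 1, 20), "DR plan review: system B"),
    (date(2014, 3, 10), "DR test: system A"),
    (date(2013, 10, 1), "DR test: system C"),
]
print(rolling_window(date(2013, 11, 15)))
# prints (datetime.date(2013, 11, 1), datetime.date(2015, 11, 1))
print(entries_in_window(entries, date(2013, 11, 15)))
# prints [(datetime.date(2014, 3, 10), 'DR test: system A')]
```

Re-running the calculation at each monthly update moves the window forward by one month, so past entries drop off and later entries come into view.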
Roles and responsibilities
The Associate Director Infrastructure and Technical Services is the custodian of the DR system and DR plan. This includes maintaining the operational aspects of the DR system and ensuring that:
- Software levels (O/S and applications) continue to be aligned with the Production system
- Relative processing and storage capacities of production and DR remain consistent
- DR processes (e.g. data synchronisation) continue to operate
- The technical sections of the DR plan are regularly reviewed and updated
- DR tests are performed according to agreed schedules and to the University standards
- DR test results are circulated to all stakeholders
- DR test follow-up actions are completed in a timely manner
- The DR plan meets the University standards
The infrastructure and technical services teams who maintain and support specific components of the production system are responsible for that same component on the DR system – e.g. the technical services team is responsible for the production and DR databases, applications and operating systems, while the infrastructure team is responsible for the production and DR servers and storage.
The DR solution owner (business owner) is responsible for the system’s DR recovery requirements and ensuring they continue to be aligned with business operating requirements. This means reviews of the system’s recovery requirements (business impact analyses) should be performed on a regular basis, or whenever significant changes to business operations occur.
It is recommended that BIAs be performed every 1-2 years, depending on the system’s recovery time objective (RTO).
All change requests must include the impact on DR and risks must be escalated to the Director ITS if DR is affected.
The DR maintenance check list is provided to assist in reviewing and updating DR Plans. DR plan maintenance is covered in two parts:
- Business contacts
- ITS contacts and technical procedures
The Associate Director Infrastructure and Technical Services is responsible for ensuring the ITS contacts and technical procedures remain current. The DR plan must be reviewed prior to each DR test and updated whenever changes to the production system or fail over procedures occur.
The business owner is responsible for advising the Associate Director Infrastructure and Technical Services of any changes to business contact personnel or contact details within the DR Plan. Business contact details are reviewed at the beginning of each semester and updates emailed to the Associate Director Infrastructure and Technical Services by the second Friday of the new semester.
The following definitions apply to this standard:
Business continuity plan (BCP) refers to the procedures detailing the recovery or resumption of business functions following any incident
Business continuity management (BCM) refers to procedures enabling the continuation or resumption of business operations following a disruptive event. The disciplines are Incident Management, Business Continuity Planning and Disaster Recovery Planning.
Business impact analysis (BIA) is a questionnaire used to determine recovery requirements (recovery time objective and recovery point objective) for a computer system.
CAB is the Change Administration Board. This is the group responsible for reviewing and approving changes to ensure they are planned properly and implemented at an appropriate time.
Computer system describes one or more applications and the associated hardware (servers and disk) which, combined, provide an IT service. E.g. a typical computer system may include a business application, database and web access application.
Disaster recovery (DR) plan refers to procedures detailing the recovery of IT services and systems following an adverse event
Disaster recovery solution refers to the overall solution, including DR Plan and DR system
Failover refers to transferring processing from the production (or primary) computer system to the disaster recovery computer system
Incident is a disruptive event that causes an unacceptable interruption to business operations and/or IT services
ITS is the IT Services department. ITS are responsible for supporting the University’s technology infrastructure.
Recovery point objective (RPO) is the point (time and date) to which a DR system will be restored
Recovery requirements are the requirements around which a BCP or DR Plan is developed
Recovery time objective (RTO) is the length of time, following the decision to invoke the DR Plan, within which an IT service must be restored
Unit(s) refers to an organisational grouping across the University and includes a faculty, research centre, service division or UniService
University means the University of Auckland and includes all subsidiaries
The following templates apply to this standard:
- DR plan: The main DR plan template is to be used for a single system or service (e.g. PeopleSoft CS). Variations of this template have been developed for specific platforms (vMSC) and systems using specific recovery tools (VMware Site Recovery Manager – SRM)
- Test scope: Defines the scope of each DR test
- Test report: Used for the DR test report
- BIA: The business impact analysis questionnaire
Document management and control
Prepared by: DR Programme Manager
Owned by: CIO
Approved by: The Vice Chancellor
Date approved: November 2013
Review date: November 2016