Disaster Recovery

Caching

When the Anvil! system's disaster recovery function is configured, you will be presented with the option to enable caching. This can provide different benefits depending on whether DR cache drives are available on the Anvil! nodes, on the DR target machine, or on both.

How Caching Works

A DR cache drive is any drive, found on either the Anvil! nodes or the DR targets, that has a '.dr_cache' file in the root of its file system (the signature file name can be changed via the 'tools::dr::cache_signature' variable).

Typically, USB drives are used for caching purposes. However, any drive with the signature can and will be used.
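
For example, a drive can be marked by creating the signature file in the root of its file system. A minimal sketch, assuming the drive is already formatted and mounted at '/mnt/usb_cache' (a hypothetical mount point):

  # Mark a mounted drive as a DR cache drive by creating the default
  # signature file ('.dr_cache') in the root of its file system.
  # '/mnt/usb_cache' is a made-up mount point used for illustration only.
  from pathlib import Path

  Path("/mnt/usb_cache/.dr_cache").touch()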

When multiple cache drives are found, the largest will be used first. When cache drives are the same size, they will be used in alphabetical order of their device path names (ie: '/dev/sdb' -> '/dev/sdc').
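
The selection order can be sketched as below; the device paths and sizes are made-up values, not output from a real system:

  # Candidate cache drives as (device path, size in bytes); values are illustrative.
  drives = [
      ("/dev/sdc", 500_000_000_000),
      ("/dev/sdb", 500_000_000_000),
      ("/dev/sdd", 2_000_000_000_000),
  ]

  # Largest drive first; equal sizes fall back to alphabetical device path order.
  ordered = sorted(drives, key=lambda d: (-d[1], d[0]))
  print([path for path, _size in ordered])    # ['/dev/sdd', '/dev/sdb', '/dev/sdc']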

NOTE: In all cases, the cache drive must have a supported file system on it. The 'ext4' file system is recommended, though not required. Images are written to the device as normal files and removed once the image is no longer needed. The drive does NOT need to be mounted for it to be used; Striker will mount cache devices as needed. Likewise, already mounted devices can be used regardless of the current mount point, so long as the 'root' user has read/write access.

Caching With Multi-Server DR Jobs

If a disaster recovery job has two or more servers in it, the Anvil! will attempt to cache all servers at the same time. Servers will be cached from largest to smallest. If the cache drive space is insufficient to host all servers in the DR job, then the servers will be processed one at a time.
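
A rough sketch of that decision, using made-up image sizes and cache space:

  # Server image sizes in GiB and total free cache space; all values are illustrative.
  servers = {"srv01": 200, "srv02": 80, "srv03": 120}
  cache_space_gib = 500

  # Servers are always processed from largest to smallest.
  by_size = sorted(servers, key=servers.get, reverse=True)

  if sum(servers.values()) <= cache_space_gib:
      print("enough cache space; image all servers at the same time:", by_size)
  else:
      print("insufficient cache space; image the servers one at a time:", by_size)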

Caching on Anvil! Nodes

The primary benefit of using cache drives on Anvil! nodes is to reduce the amount of time that servers have to be shut off. In many (but not all!) cases, locally attached storage can be written to faster than data can be transmitted over the IFN, particularly if bandwidth limiting is also used. In such cases, the server(s) in the backup job can be shut down, imaged to local cache, then booted back up prior to the images being transmitted to the DR target.
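
As a rough sketch of that ordering (the printed steps are stand-ins for the real imaging process, not Anvil! calls):

  # Illustrative only: downtime is limited to the shutdown/image/boot window, and
  # transmission to the DR target happens after the servers are back online.
  servers = ["srv01", "srv02"]

  for server in servers:
      print(f"shut down {server}")
      print(f"image {server} to the local cache drive")
      print(f"boot {server}")

  for server in servers:
      print(f"transmit the cached image of {server} to the DR target")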

Note that in multi-server DR jobs, if the total disk space used by the servers exceeds the available space on the local cache disks, the DR job will have to wait for transmission to the DR target before the servers can be booted back up. Multiple cache drives can be used to account for this.

Note also that the cache drives can be connected to either node. The Anvil! will check both nodes when a job starts and use the node with the cache drive(s) attached for the image process.

Caching on DR Targets

When a DR job retains only one copy of each server, the cache drives can be used to avoid overwriting the previous image until the new image has been transferred safely. Only once the image (or images, in multi-server jobs) has been received will it be copied from the target's cache to final storage. This ensures that, at all times, at least one good image of each server exists.
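
A minimal sketch of that ordering (the image names are illustrative and the steps only print what would happen):

  # Illustrative only: new images land on the DR target's cache drive first, and
  # the previous good copy is replaced only after the transfer has completed.
  images = ["srv01_disk0", "srv01_disk1"]

  for image in images:
      print(f"receive {image} onto the cache drive")

  for image in images:
      print(f"copy {image} from the cache drive over the previous image in final storage")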

Note that in multi-server DR jobs, if the source images exceed the size of the available cache drives, images will be processed sequentially.

DR Stores

Server images can be stored on the DR target either as flat files on a file system (in 'raw' format) or as LVM logical volumes.

When setting up a DR target, you will be asked to specify the DR store type and location. There are two options: 'File System' and 'LVM'.

LVM Stores

When the 'LVM' store is selected, the store location is the volume group name to use. If your target has two or more available volume groups, they can be specified using commas, like 'dr01_vg0,dr01_vg1,...,dr01_vgN'.

Logical volumes will be created and renamed as required. When multiple volume groups are defined and a new LV needs to be created, the volume group with the most free space will be selected first. If the amount of free space is the same, volume groups will be selected in alphabetical order.
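
The selection can be sketched as below; the volume group names and free space figures are made up:

  # Free space per defined volume group, in GiB; values are illustrative.
  volume_groups = {"dr01_vg0": 800, "dr01_vg1": 1200, "dr01_vg2": 1200}

  # Most free space first; equal free space falls back to alphabetical order.
  chosen = sorted(volume_groups, key=lambda vg: (-volume_groups[vg], vg))[0]
  print(chosen)    # 'dr01_vg1'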

File System Stores

When the 'File System' store is selected, the store location is the directory under which the server image files will be written.

Image Naming Convention

Regardless of whether the DR store is file system or LVM based, the naming convention is the same. The only difference is whether the name is used as the logical volume name or as the file name (with a '.raw' suffix).

The basic format is:

  • <server_name>_<source_lv_name>_<sequence_number>{.raw}

When an image is being written, the suffix '_incomplete' will be added. When the write is complete, this suffix will be removed.

The 'sequence number' reflects the number of copies being stored. When 'Copies' is set to '1', the sequence will always be '1'. If, as an example, 'Copies' is set to '3', then the oldest image will have the sequence number '_3', the second oldest will be '_2' and the most recent will be '_1'.

For example, a server named 'Web01' with two virtual disks named 'Web01_0' and 'Web01_1', backing up to a DR target where '3' copies are stored, will have the following names;

  • Oldest images;
    • 'Web01_Web01_0_3' + 'Web01_Web01_1_3'
  • Second oldest images;
    • 'Web01_Web01_0_2' + 'Web01_Web01_1_2'
  • Most recent images;
    • 'Web01_Web01_0_1' + 'Web01_Web01_1_1'
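
Those names can be reproduced with a short sketch (the helper function is illustrative, not part of the Anvil! tools):

  def image_name(server, source_lv, sequence, file_store=False):
      """Build a DR image name; '.raw' is only appended for file system stores."""
      suffix = ".raw" if file_store else ""
      return f"{server}_{source_lv}_{sequence}{suffix}"

  # Most recent ('_1') to oldest ('_3') names for both of Web01's virtual disks.
  for seq in (1, 2, 3):
      print(image_name("Web01", "Web01_0", seq), "+", image_name("Web01", "Web01_1", seq))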

Image Name Cycling

When a backup job starts and 2 or more copies are stored, the existing image names are cycled; the oldest image's sequence number is changed to '_incomplete', and the remaining images are renamed to increase their sequence numbers by one. When the new image has been written, '_incomplete' is renamed to '_1'.

When only '1' copy is stored, the existing image is simply changed from '_1' to '_incomplete'.

When imaging a server with two or more disks, *all* disks are renamed at the same time. This way, should an imaging run fail part way through, you can be sure that images with the same sequence number go together.

Continuing the example above, the following sequence will occur;

  • Oldest images are renamed to '_incomplete';
    • 'Web01_Web01_0_3' -> 'Web01_Web01_0_incomplete'
    • 'Web01_Web01_1_3' -> 'Web01_Web01_1_incomplete'
  • Other existing images have their sequence number incremented;
    • 'Web01_Web01_0_2' -> 'Web01_Web01_0_3'
    • 'Web01_Web01_1_2' -> 'Web01_Web01_1_3'
    • 'Web01_Web01_0_1' -> 'Web01_Web01_0_2'
    • 'Web01_Web01_1_1' -> 'Web01_Web01_1_2'
  • The new images are written to the '_incomplete' files/LVs.
  • Imaging completes and the newest images are renamed;
    • 'Web01_Web01_0_incomplete' -> 'Web01_Web01_0_1'
    • 'Web01_Web01_1_incomplete' -> 'Web01_Web01_1_1'
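
The same cycle can be sketched as below; the code only prints the renames for one of the disks rather than touching any storage:

  # Walk the rename cycle for one disk with 'Copies' set to 3 (illustrative only).
  copies = 3
  base = "Web01_Web01_0"

  renames = [(f"{base}_{copies}", f"{base}_incomplete")]       # oldest becomes the write target
  for seq in range(copies - 1, 0, -1):                         # shift the rest up, highest first
      renames.append((f"{base}_{seq}", f"{base}_{seq + 1}"))
  renames.append((f"{base}_incomplete", f"{base}_1"))          # only after the new image is written

  for old, new in renames:
      print(f"{old} -> {new}")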

Server Definition Files

When a DR job images a server, a check is made to see if there is a definition file for the server in the DR target's '/shared/definitions/' directory. If not, one is created based on the definition file on the source Anvil!, updated to reflect the DR store path.

The generated definition file will automatically have any optical disks ejected from their drives. This is done to improve the chances that the server will boot in a disaster, as the hypervisor will abort booting a server if an optical disk is defined in the server's definition file but does not actually exist on disk.

Note that if a definition file already exists, it will NOT be overwritten or edited. This is to allow the admin to adjust the definition file manually and have their changes persist. An example would be when the admin needs to reduce the amount of RAM allocated to a server on the DR target.
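
A rough sketch of that check; the directory path comes from above, but the helper and the generated content are simplified stand-ins for the real process:

  from pathlib import Path

  def ensure_dr_definition(server, definitions_dir="/shared/definitions"):
      """Create a definition file for a server only if one does not already exist (sketch only)."""
      target = Path(definitions_dir) / f"{server}.xml"
      if target.exists():
          # Never overwrite: any manual edits made by the admin must persist.
          return target
      # The real process copies the source Anvil!'s definition, updates the storage
      # paths for the DR store and ejects any optical disks before writing it out.
      target.write_text(f"<!-- generated DR definition for {server} -->\n")
      return target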

Note also that servers are *NOT* defined on the DR target! This is done intentionally to avoid a user accidentally booting a DR server while the source server is still operational. In a disaster scenario, the DR servers must be booted manually, as described below.

Note: Plans are in place to create a program called 'anvil-activate-dr' that will automatically discover and boot all DR servers. Until it is complete, however, the servers must be booted manually with the following command:
  • 'virsh create /shared/definitions/<server>.xml'
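
For example, to boot the 'Web01' server used in the naming examples above;
  • 'virsh create /shared/definitions/Web01.xml'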

 
