Build an m2 Anvil!

Warning: This guide is now a complete draft, but is being edited before final release. This warning will be removed when the guide is completed. Feedback is very much appreciated!

This guide will walk you through all the steps needed to build an Anvil! platform.


A What?

An Anvil! platform is, fundamentally, a system designed to keep your (virtual) servers running for as long as possible, regardless of what internal or external problems arise.

It does this by a combination of simplicity, consistency and an absolute focus on availability.

The Anvil! is a combination of three layers, acting as one;

  1. Total hardware and storage redundancy; a field-tested architecture blueprint that has remained unchanged for years and has survived many real-world faults.
  2. Traditional High-Availability; Building on over ten years of proven, open source high availability software used in large enterprises all over the world.
  3. A new ScanCore "Intelligent Availability" layer that augments the HA stack with autonomous, proactive risk mitigation and recovery.

An Anvil! Platform contains three software layers, acting as one;

  • Traditional enterprise OS-level configuration designed for maximum fault tolerance.
  • An easy to use web interface, called Striker.
  • An autonomous "decision engine", called ScanCore.

Collectively, these parts form the Anvil! platform, the first "Intelligent Availability™" platform and the next generation of availability.

Practical Examples of IA

The best way to understand the power of the Anvil! is to walk through some real-world fault scenarios and see how the Anvil! reacts. These, and other examples, are covered in detail later in the tutorial as part of the "pre-flight" checks, so you will be able to see the power of "Intelligent Availability" first hand.

Health-based migration.

  • A fan fails on the active server host, "node 1". Nothing overheats, but the peer, "node 2", is now seen as "more" healthy, so the servers are migrated.
  • Next, a hard drive fails, degrading the RAID array on "node 2" and rendering both nodes "sick". ScanCore deems the fan failure a lower risk, so the servers are migrated back to "node 1".
  • A hot-spare rebuilds the array, restoring optimal storage on "node 2". The fan is still failed on "node 1", so ScanCore again migrates the servers back to "node 2".
  • The fan is replaced, making both nodes perfectly healthy again. ScanCore does not migrate back to "node 1", as "node 2" is perfectly healthy.

Load shedding in response to mains power loss;

  • Input power is lost to one UPS; ScanCore triggers an alert but takes no action. The redundancy of the second UPS is sufficient to power the whole system.
  • The second UPS loses power. ScanCore waits a short while to see if the failure is transient. The outage remains and ScanCore determines the outage is persistent. Now the greatest threat to server availability is depletion of the energy in the UPSes.
  • ScanCore invokes load shedding; several criteria are checked to determine which node will be shut down. If necessary, servers are first migrated off the target node, then the node withdraws from the Anvil! and powers off.
  • At this point, one of two scenarios is possible;
  1. The power returns before the UPSes deplete
  2. Despite the load shedding, the batteries drop below critical runtime levels.
    • In the first case, where power is restored, ScanCore watches the UPSes and once they've recharged to a safe level, the shed node is restarted and full redundancy is restored.
    • In the second case, where the outage is extended and the UPSes are close to depletion, a graceful shutdown of all hosted servers is initiated. Once stopped, the remaining node powers off.
    • When power is eventually restored, ScanCore (running on the dashboards) monitors the charge in the UPSes. Once they're up to a safe charge level, both nodes are restarted and the servers are booted.

Load shedding in response to over-heating from cooling loss;

  • Load shedding can also be triggered by a loss of environmental cooling, where both nodes start to overheat. ScanCore reacts in a very similar manner to power loss, save that in over-heating, load is shed to slow the rate of heating and lower the maximum temperature reached, hopefully staying under critical shutdown temperatures.
  • If the temperatures return to nominal levels before going critical, ScanCore will restore the shed node once the available thermal sensors read nominal.
  • If the temperature does hit critical levels, all servers will be gracefully shut down, then the remaining node will power off to avoid damage. ScanCore will monitor the temperatures (via the nodes' IPMI sensors) and, once they are nominal, restart both nodes and restore normal operation.

This "big picture" approach to availability allows the Anvil! system to be highly autonomous and require minimal input from the administrator. We're always extending ScanCore’s intelligence and capabilities as feedback from deployed systems come in. Being a fully open source platform, these ever-growing capabilities are always available to you.

What Do I Need To Make An Anvil! System?

Note: The Anvil! architecture is a vendor-neutral platform. However, some hardware may not yet have a ScanCore agent. If you have as-yet unsupported hardware, please let us know.

An Anvil! system requires zero single points of failure. This strict requirement means there is a non-trivial "minimum requirements" list. This includes;

Note: Some vendors claim that you can not build a proper availability platform on just two nodes. We disagree and argue our point here: "The 2-Node Myth".
  • Two Anvil! nodes with;
    • IPMI support.
    • Redundant power supplies.
    • Six network interfaces across three separate dual-port network cards.
  • Two Striker dashboard machines.
  • Two ethernet network switches that support VLANs, 24 ports per switch.
  • Two network-switched PDUs (network-connected power bars).
  • Two network-connected UPSes.

The Anvil! software wants to make sure that, once deployed, your system will provide all the resiliency and durability we promise here. This approach requires strict checking by the Anvil! install tool. If you want to cross-check your planned hardware, we will be more than happy to help. Please just ask.

Knowing what hardware you need to support your anticipated workload is a bigger topic. So we wrote a guide to help you select the best hardware for your Anvil! system;

The Striker Web Interface

The Striker web interface is designed to be as easy to use as possible. This is not for your benefit, but instead for the protection of the Anvil! itself. Even the best of us make mistakes, and Striker is designed to minimize that risk.

This is in line with the "remove the human" part of "Intelligent Availability".

The consistent Anvil! architecture allows Striker to make a lot of assumptions, dramatically assisting with the "simplicity" goal.

The ScanCore Decision Engine

ScanCore is the heart of the Anvil! platform, providing the proactive component of Intelligent Availability.

ScanCore runs on all Anvil! nodes and Striker dashboards. On each machine, it invokes all available scan agents and then analyses the results to make "big picture" decisions on what, if anything, to do to best protect the hosted servers.

Fundamentally, this results in one of three actions;

  • Preventative Live-Migration
  • Load Shedding
  • Self-healing

We discussed the first two above. The last, "self-healing", is where ScanCore’s capabilities are evolving the most. The modular approach to scan agents, and the freedom they're given to take actions they deem necessary, means that all the ways the system can "heal" improve and extend over time.

This also means that you can build your own agents, adding entirely new capabilities to solve challenges specific to your environment.
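
To make this concrete, below is a minimal, purely hypothetical sketch of the kind of check a custom agent might perform. It is not the real ScanCore scan agent API (real agents record results in the ScanCore database and raise alerts through ScanCore itself; the name, threshold and sensor path here are illustration only); it simply shows the general "read a sensor, judge its health, report" pattern.

#!/bin/bash
# Hypothetical sketch only; a real ScanCore scan agent stores its results in
# the ScanCore database and raises alerts through ScanCore itself.
# This example reads the first CPU thermal zone and reports on it.
temp_file="/sys/class/thermal/thermal_zone0/temp"
warn_at=80000   # millidegrees Celsius; threshold chosen purely for illustration

if [ ! -r "${temp_file}" ]; then
    echo "scan-example: no thermal sensor found, nothing to do."
    exit 0
fi

temp=$(cat "${temp_file}")
if [ "${temp}" -ge "${warn_at}" ]; then
    echo "scan-example: WARNING - CPU temperature is $((temp / 1000)) degrees C."
    exit 1
else
    echo "scan-example: OK - CPU temperature is $((temp / 1000)) degrees C."
    exit 0
fi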

A couple examples of self-healing;

Advanced UPS management;

  • We have found that in some cases, after a total power loss, an APC-brand UPS may fail to re-energize an outlet group. ScanCore's 'scan-apc-ups' was adapted to check for this state and re-energize the outlet groups, automatically restoring full power.

Server-failures;

  • A server fails and the built-in availability stack fails to recover it. The scan-server scan agent has several diagnostic and recovery options available to try and recover from the fault, rather than simply leaving the server in a 'failed' state until an administrator is available.

One of the best use-cases of the Anvil! is in remote or inaccessible deployments. In these cases, a component failure might not be repairable for some time. Here are a couple examples of why the Anvil! is particularly powerful in these deployments;

Single-node start;

  • If the failure is the total loss of a node, this can present challenges after an emergency shut down. Normally, a node won't join a traditional high availability cluster on boot until it can talk to its peer, to avoid split-brains or fence loops. In remote deployments, however, an optional "single-node boot" mechanism allows the sole surviving node to, carefully, shut down and restart on its own and restore hosted servers.

Rack-wide "watchdog" capability;

  • An optional capability of the Anvil! platform is to use the network-managed UPSes as a "rack-wide watchdog timer". When enabled, the UPSes will be commanded to power off in some number of minutes, sleep for a short period and then restore power. During normal operation, this countdown will be cancelled and a new timer started every so often. So long as at least one node is able to reach at least one UPS, power will never be lost. However, if something causes total network communication loss, like a bad switch configuration, the UPSes will timeout and power cycle. This will cause everything in the Anvil! to hard reset, often recovering from the initial loss of access.

OK, I'm Convinced! Building an Anvil!

Note: This may seem like a daunting task, but you will find that most of it is quite simple and not too different from adding any new server to your environment. Most of the work involves ensuring that there won't be network conflicts, along with other pedestrian tasks. Of course, we're happy to help, commercially or as part of the open source community.

Building an Anvil! requires a few steps;

  1. Planning your network.
  2. Selecting and assembling hardware.
  3. Creating the install media and building the first Striker dashboard.
  4. Using the first dashboard to build the second dashboard.
  5. Creating an "Install Manifest" and building an Anvil! node pair.
  6. Installing Servers on the Anvil! platform.
  7. "Pre-flight checks" - Making sure everything is working as advertised.

Planning Your Network

Warning: The Anvil! uses bonded network interfaces for redundancy, and this precludes directly connecting the nodes to one another. Please don't try to connect the nodes' interfaces directly to each other. Doing so is unsupported by Red Hat and could cause failures to go undetected, causing an entire bond to fail. This is a hardware limitation, not a software issue. For more information, please see this Red Hat knowledge-base article (Red Hat account required).

Before we start assembling hardware, we need a plan for the three networks we will use: what their subnets will be and what IP address(es) we will assign to each device. We also need to be sure that we don't conflict with any existing networks you have.

The Anvil! will use three separate /16 (255.255.0.0) networks isolated on three VLANs:

  • BCN - Back-Channel Network
  • SN - Storage Network
  • IFN - Internet/Intranet-Facing Network

We strongly recommend using different colour cables, with labels, to help quickly identify which cables do what. Should an issue arise months or years from now, this will help speed up recovery.

The actual mapping of interfaces to bonds to networks will be:

Subnet  Cable Colour  IP
BCN     White         10.20.x.y/16
SN      Green         10.10.x.y/16
IFN     Black         10.255.x.y/16
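
For reference, on each node the two links of a given network are joined into a bond (bcn_bond1, sn_bond1 and ifn_bond1, as shown in the logical map later in this guide). The installer generates these configuration files for you, but a rough sketch of what a node's BCN bond looks like in standard RHEL 6 ifcfg syntax, with illustrative values, is:

# /etc/sysconfig/network-scripts/ifcfg-bcn_bond1 (sketch only; the installer
# writes the real files and the exact BONDING_OPTS may differ)
DEVICE=bcn_bond1
BONDING_OPTS="mode=1 miimon=100"
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.20.10.1
NETMASK=255.255.0.0

# /etc/sysconfig/network-scripts/ifcfg-bcn_link1 (first of the two slaves)
DEVICE=bcn_link1
MASTER=bcn_bond1
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes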

In almost all cases, the BCN and SN should stay the same across all Anvil! installs. The IFN will likely vary on every install, as it depends entirely on what your existing network uses. The only time you should change the BCN or SN is if they collide with your existing network; a quick way to check is shown below the list. In those rare cases, we avoid the collision by dropping the '0' from the second octet.

  • If your IFN collides with the BCN, switch to 10.2.0.0/16.
  • If your IFN collides with the SN, switch to 10.1.0.0/16.
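
If you aren't sure whether the default subnets are already in use, a quick (though not exhaustive) sanity check from a machine on your existing network is to look for routes in those ranges. This only inspects the local routing table, so run it somewhere that can see your full network; for example:

# List current routes and flag anything already in the default BCN or SN ranges.
ip route show | grep -E '10\.(20|10)\.' \
  || echo "No existing routes in 10.20.0.0/16 or 10.10.0.0/16."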

Now that we've defined our subnets, we need to plan the actual IP addresses to assign to the foundation pack devices, Anvil! nodes and Striker dashboards. Most of these will be on the BCN. We will use the third octet to indicate the type of device and the fourth octet to identify the specific device.

Note: So long as you know what subnets you will use, and what IPs you will assign to dashboards and nodes on the IFN, Striker will pre-configure the rest of the network IPs automatically when we create the "Install Manifest", which we will talk more about later. The list below is largely for reference.

For Anvil! nodes, we will use the third octet to identify the Anvil!’s sequence. We're building our third Anvil! in this tutorial, so the third octet will be 'x.x.10.x' for the node and 'x.x.11.x' for the IPMI interface.

IP address allocation plan:

Foundation Pack:

Device             Host Name                Subnet  IP Address  Note
Ethernet Switch 1  an-switch01.alteeve.com  BCN     10.20.1.1
Ethernet Switch 2  an-switch02.alteeve.com  BCN     10.20.1.2   Not used when stacked.
Switched PDU 1     an-pdu01.alteeve.com     BCN     10.20.2.1
Switched PDU 2     an-pdu02.alteeve.com     BCN     10.20.2.2
UPS 1              an-ups01.alteeve.com     BCN     10.20.3.1
UPS 2              an-ups02.alteeve.com     BCN     10.20.3.2

Striker dashboards;

Device           Host Name                                   Subnet  IP Address  Note
Striker 1, BCN   an-striker01.alteeve.com, an-striker01.bcn  BCN     10.20.4.1
Striker 1, IFN   an-striker01.ifn                            IFN     10.255.4.1
Striker 1, IPMI  an-striker01.ipmi                           BCN     10.20.5.1   Only used on dashboards with IPMI BMCs.
Striker 2, BCN   an-striker02.alteeve.com, an-striker02.bcn  BCN     10.20.4.2
Striker 2, IFN   an-striker02.ifn                            IFN     10.255.4.2
Striker 2, IPMI  an-striker02.ipmi                           BCN     10.20.5.2   Only used on dashboards with IPMI BMCs.

Anvil! nodes;

Device                  Host Name                             Subnet  IP Address   Note
Anvil! 3, Node 1, BCN   an-a01n01.alteeve.com, an-a01n01.bcn  BCN     10.20.10.1   The '10' reflects the sequence number.
Anvil! 3, Node 1, IPMI  an-a01n01.ipmi                        BCN     10.20.11.1   The '11' reflects the sequence number, plus one for the IPMI.
Anvil! 3, Node 1, SN    an-a01n01.sn                          SN      10.10.10.1
Anvil! 3, Node 1, IFN   an-a01n01.ifn                         IFN     10.255.10.1
Anvil! 3, Node 2, BCN   an-a01n02.alteeve.com, an-a01n02.bcn  BCN     10.20.10.2   The '10' reflects the sequence number.
Anvil! 3, Node 2, IPMI  an-a01n02.ipmi                        BCN     10.20.11.2   The '11' reflects the sequence number, plus one for the IPMI.
Anvil! 3, Node 2, SN    an-a01n02.sn                          SN      10.10.10.2
Anvil! 3, Node 2, IFN   an-a01n02.ifn                         IFN     10.255.10.2

With this planned out, we'll be ready to assign IP addresses as we go!

Assembling Hardware

The Anvil! can be thought of as having three distinct part groups;

  • The 'Foundation Pack'; UPSes, PDUs and ethernet switches. A single foundation pack can host two Anvil! node pairs.
  • A pair of Anvil! nodes, onto which your servers are hosted and protected.
  • A pair of Striker dashboards, which can host N-number of node pairs.

Node pairs act as a single logical unit to host and protect your servers, which can migrate between the two nodes without interruption. Servers can be moved from one Anvil! pair to another through a cold migration process, which is useful when you eventually need to upgrade an Anvil! node pair. The Striker dashboards monitor and manage any number of node pairs, limited only by their ability to store the ScanCore data coming in from the various node pairs.

Once you have the hardware, the next step is to configure the foundation pack equipment, nodes (RAID and BIOS) and dashboards. We've written a dedicated guide showing how to do this with hardware we've had experience with.

Logical Map; Hardware And Plumbing

Note: We're often asked if certain parts of the design can be trimmed, like cutting out PDUs or using basic UPSes. The answer is "no". We've spent years trimming the Anvil! to make it as simple as possible, without compromising resiliency.

Below is a full block-diagram map of the Anvil! platform. It depicts a single node pair on a standard foundation pack, including which ports on the foundation packs are used. Inside the nodes, the virtual networking and a set of example servers are shown.

                                                                                            \_  {   }  _/                                                                                           
  Striker Dashboards                                                                          \_  ^  _/                                                                                             
                  __________________________                                                    \___/                                                    __________________________                 
                 | an-striker01.alteeve.com |                                                     |                                                     | an-striker02.alteeve.com |                
                 |        __________________|                                                     |                                                     |__________________        |                
                 |       | ifn_link1        =---------------------------------\ /-----------------------------------------------------------------------=        ifn_likn1 |       |                
                 |       |      10.255.4.1 ||                                 | |                 |                                                     || 10.255.4.2      |       |                
                 |       |_________________||                                 | |                 |                                                     ||_________________|       |                
                 |        __________________|                                 | |                 |                                                     |__________________        |                
                 |       | bcn_link1        =---------------------------------|-|-----------------------------------\ /---------------------------------=        bcn_link1 |       |                
                 |       |       10.20.4.1 ||                                 | |                 |                 | |                                 || 10.20.4.2       |       |                
                 |       |_________________||                                 | |                 |                 | |                                 ||_________________|       |                
                 |                          |                                 | |                 |                 | |                                 |                          |                
                 |                   _______|                                 | |                 |                 | |                                 |_______                   |                
                 |                  | PSU 1 |    ____________________         | |                 |                 | |         ____________________    | PSU 2 |                  |                
                 |__________________|_______|==={_to_an-pdu01_port-8_}        | |                 |                 | |        {_to_an-pdu02_port-8_}===|_______|__________________|                
                                                                              | |                 |                 | |                                                                             
  Nodes                                                                       | |                 |                 | |                                                                             
  _________________________________________________________________________   | |            _____|____             | |   _________________________________________________________________________ 
 | an-a01n01.alteeve.com                                                   |  | | /---------{_Internet_}----------\ | |  |                                                   an-a01n02.alteeve.com |
 |                                 Network:               _________________|  | | |                               | | |  |_________________               Network:                                 |
 |                                 _________________     | ifn_bond1       |  | | |   _________________________   | | |  |       ifn_bond1 |     _________________                                 |
 |      Servers:                  |   ifn_bridge1   |----| <ip on bridge>  |  | | |  | an-switch01             |  | | |  |  <ip on bridge> |----|   ifn_bridge1   |                  Servers:      |
 |      _______________________   |   10.255.10.1   |    |     ____________|  | | |  |____ Internet-Facing ____|  | | |  |____________     |    |   10.255.10.2   |  .........................     |
 |     | [ srv01-rhel7 ]       |  |_________________|    |    | ifn_link1  =---------=_01_]    Network    [_02_=---------=  ifn_link1 |    |    |_________________|  :       [ srv01-rhel7 ] :     |
 |     |     __________________|    | | | | | |          |    |___________||  \------=_09_]_______________[_24_=--/ | |  ||___________|    |          : : : : : :    :__________________     :     |
 |     |    | eth0             =----/ | | | | |          |                 |    | |  | an-switch02             |    | |  |                 |          : : : : : -----=             eth0 |    :     |
 |     |    |      10.255.1.1 ||      | | | | |          |     ____________|    | |  |____                 ____|    | |  |____________     |          : : : : :      :|      10.255.1.1 |    :     |
 |     |    |_________________||      | | | | |          |    | ifn_link2  =---------=_01_]  VLAN ID 300  [_02_=---------=  ifn_link2 |    |          : : : : :      :|_________________|    :     |
 |     |                       |      | | | | |          |    |___________||    \-|--=_09_]_______________[_24_=--\ | |  ||___________|    |          : : : : :      :                       :     |
 |     |   _____               |      | | | | |          |_________________|      |                               | | |  |_________________|          : : : : :      :               _____   :     |
 |  /--=--[_vda_]              |      | | | | |                            |      \-------------------------------/ | |  |                            : : : : :      :              [_vda_]--=--\  |
 |  |  |_______________________|      | | | | |           _________________|                                        | |  |_________________           : : : : :      :.......................:  |  |
 |  |                                 | | | | |          | sn_bond1        |          _________________________     | |  |        sn_bond1 |          : : : : :                                 |  |
 |  |    _______________________      | | | | |          |      10.10.10.1 |         | an-switch01             |    | |  | 10.10.10.2      |          : : : : :     .........................   |  |
 |  |   | [ srv02-win2012 ]     |     | | | | |          |     ____________|         |____     Storage     ____|    | |  |____________     |          : : : : :     :     [ srv02-win2012 ] :   |  |
 |  |   |     __________________|     | | | | |          |    |  sn_link1  =---------=_09_]    Network    [_10_=---------=   sn_link1 |    |          : : : : :     :__________________     :   |  |
 |  |   |    | NIC 1            =-----/ | | | |          |    |___________||         |_________________________|    | |  ||___________|    |          : : : : ------=            NIC 1 |    :   |  |
 |  |   |    |      10.255.1.2 ||       | | | |          |                 |         | an-switch02    Switch 2 |    | |  |                 |          : : : :       :| 10.255.1.2      |    :   |  |
 |  |   |    |_________________||       | | | |          |     ____________|         |____                 ____|    | |  |____________     |          : : : :       :|_________________|    :   |  |
 |  |   |                       |       | | | |          |    |  sn_link2  =---------=_09_]  VLAN ID 200  [_10_=---------=   sn_link2 |    |          : : : :       :                       :   |  |
 |  |   |   ____                |       | | | |       /--|    |___________||         |_________________________|    | |  ||___________|    |--\       : : : :       :                ____   :   |  |
 |  +---=--[_c:_]               |       | | | |       |  |_________________|                                        | |  |_________________|  |       : : : :       :               [_c:_]--=---+  |
 |  |   |_______________________|       | | | |       |                    |                                        | |  |                    |       : : : :       :.......................:   |  |
 |  |                                   | | | |       |   _________________|                                        | |  |_________________   |       : : : :                                   |  |
 |  |    _______________________        | | | |       |  | bcn_bond1       |          _________________________     | |  |       bcn_bond1 |  |       : : : :       .........................   |  |
 |  |   | [ srv03-sles12 ]      |       | | | |       |  |      10.20.10.1 |         | an-switch01             |    | |  | 10.20.10.2      |  |       : : : :       :      [ srv03-sles12 ] :   |  |
 |  |   |     __________________|       | | | |       |  |     ____________|         |____  Back-Channel   ____|    | |  |____________     |  |       : : : :       :__________________     :   |  |
 |  |   |    | eth0             =-------/ | | |       |  |    | bcn_link1  =---------=_13_]    Network    [_14_=---------=  bcn_link1 |    |  |       : : : --------=             eth0 |    :   |  |
 |  |   |    |      10.255.1.3 ||         | | |       |  |    |___________||         |____________________[_19_=----/ |  ||___________|    |  |       : : :         :| 10.255.1.3      |    :   |  |
 |  |   |    |_________________||         | | |       |  |                 |         | an-switch02             |      |  |                 |  |       : : :         :|_________________|    :   |  |
 |  |   |                       |         | | |       |  |     ____________|         |____                 ____|      |  |____________     |  |       : : :         :                       :   |  |
 |  |   |  _____                |         | | |       |  |    | bcn link2  =---------=_13_]  VLAN ID 100  [_14_=---------=  bcn_link2 |    |  |       : : :         :               _____   :   |  |
 |  +--=--[_vda_]               |         | | |       |  |    |___________||         |____________________[_19_=------/  ||___________|    |  |       : : :         :              [_vda_]--=---+  |
 |  |   |_______________________|         | | |       |  |_________________|                                             |_________________|  |       : : :         :.......................:   |  |
 |  |                                     | | |       |                    |                                             |                    |       : : :                                     |  |
 |  |    _______________________          | | |       |                    |                                             |                    |       : : :         .........................   |  |
 |  |   | [ srv04-freebsd11 ]   |         | | |       |                    |                                             |                    |       : : :         :   [ srv04-freebsd11 ] :   |  |
 |  |   |     __________________|         | | |       |                    |                                             |                    |       : : :         :__________________     :   |  |
 |  |   |    | em0              =---------/ | |       |                    |                                             |                    |       : : ----------=            NIC 1 |    :   |  |
 |  |   |    |      10.255.1.4 ||           | |       |                    |                                             |                    |       : :           :| 10.255.1.4      |    :   |  |
 |  |   |    |_________________||           | |       |                    |                                             |                    |       : :           :|_________________|    :   |  |
 |  |   |                       |           | |       |                    |                                             |                    |       : :           :                       :   |  |
 |  |   |  ______               |           | |       |                    |                                             |                    |       : :           :              ______   :   |  |
 |  +--=--[_ada0_]              |           | |       |                    |                                             |                    |       : :           :             [_ada0_]--=---+  |
 |  |   |_______________________|           | |       |                    |                                             |                    |       : :           :.......................:   |  |
 |  |                                       | |       |                    |                                             |                    |       : :                                       |  |
 |  |    _______________________            | |       |                    |                                             |                    |       : :           .........................   |  |
 |  |   | [ srv05-win2016 ]     |           | |       |                    |                                             |                    |       : :           :     [ srv05-win2016 ] :   |  |
 |  |   |     __________________|           | |       |                    |                                             |                    |       : :           :__________________     :   |  |
 |  |   |    | NIC 1            =-----------/ |       |                    |                                             |                    |       : ------------=            NIC 1 |    :   |  |
 |  |   |    |      10.255.1.5 ||             |       |                    |                                             |                    |       :             :| 10.255.1.5      |    :   |  |
 |  |   |    |_________________||             |       |                    |                                             |                    |       :             :|_________________|    :   |  |
 |  |   |                       |             |       |                    |                                             |                    |       :             :                       :   |  |
 |  |   |   ____                |             |       |                    |                                             |                    |       :             :                ____   :   |  |
 |  +---=--[_c:_]               |             |       |                    |                                             |                    |       :             :               [_c:_]--=---+  |
 |  |   |_______________________|             |       |                    |                                             |                    |       :             :.......................:   |  |
 |  |                                         |       |                    |                                             |                    |       :                                         |  |
 |  |    _______________________              |       |                    |                                             |                    |       :             .........................   |  |
 |  |   | [ srv06-centos4 ]     |             |       |                    |                                             |                    |       :             :     [ srv06-centos4 ] :   |  |
 |  |   |     __________________|             |       |                    |                                             |                    |       :             :__________________     :   |  |
 |  |   |    | eth0             =-------------/       |                    |                                             |                    |       --------------=             eth0 |    :   |  |
 |  |   |    |      10.255.1.6 ||                     |                    |                                             |                    |                     :| 10.255.1.6      |    :   |  |
 |  |   |    |_________________||                     |                    |                                             |                    |                     :|_________________|    :   |  |
 |  |   |                       |                     |                    |                                             |                    |                     :                       :   |  |
 |  |   |   _____               |                     |                    |                                             |                    |                     :               _____   :   |  |
 |  +---=--[_vda_]              |                     |                    |                                             |                    |                     :              [_vda_]--=---+  |
 |  |   |_______________________|                     |                    |                                             |                    |                     :.......................:   |  |
 |  |                                                 |                    |                                             |                    |                                                 |  |
 |  |                                                 |                    |                                             |                    |                                                 |  |
 |  |                                                 |                    |                                             |                    |                                                 |  |
 |  |     Storage:                                    |                    |                                             |                    |                                    Storage:     |  |
 |  |     __________                                  |                    |                                             |                    |                                  __________     |  |
 |  |    [_/dev/sda_]                                 |                    |                                             |                    |                                 [_/dev/sda_]    |  |
 |  |      |   ___________    _______                 |                    |                                             |                    |                 _______    ___________   |      |  |
 |  |      +--[_/dev/sda1_]--[_/boot_]                |                    |                                             |                    |                [_/boot_]--[_/dev/sda1_]--+      |  |
 |  |      |   ___________    ________                |                    |                                             |                    |                ________    ___________   |      |  |
 |  |      +--[_/dev/sda2_]--[_<swap>_]               |                    |                                             |                    |               [_<swap>_]--[_/dev/sda2_]--+      |  |
 |  |      |   ___________    ___                     |                    |                                             |                    |                     ___    ___________   |      |  |
 |  |      +--[_/dev/sda3_]--[_/_]                    |                    |                                             |                    |                    [_/_]--[_/dev/sda3_]--+      |  |
 |  |      |   ___________    ____    ____________    |                    |                                             |                    |    ____________    ____    ___________   |      |  |
 |  |      \--[_/dev/sda5_]--[_r0_]--[_/dev/drbd0_]---/                    |                                             |                    \---[_/dev/drbd0_]--[_r0_]--[_/dev/sda5_]--+      |  |
 |  |                                          |                           |                                             |                           |                                   |      |  |
 |  |                                          |                           |                                             |                           |                                          |  |
 |  |    Clustered LVM:                        \---\                       |                                             |                       /---/                     Clustered LVM:       |  |
 |  |        ___________________________           |                       |                                             |                       |           ___________________________        |  |
 |  |   /---[_/dev/an-a01n01_vg0/shared_]----------+                       |                                             |                       +----------[_/dev/an-a01n01_vg0/shared_]---\   |  |
 |  |   |    _________                             |                       |                                             |                       |                             _________    |   |  |
 |  |   \---[_/shared_]                            |                       |                                             |                       |                            [_/shared_]---/   |  |
 |  |                                              |                       |                                             |                       |                                              |  |
 |  |    __________________________________        |                       |                                             |                       |        __________________________________    |  |
 |  +---[_/dev/an-a01n01_vg0/srv01-rhel7_0_]-------+                       |                                             |                       +-------[_/dev/an-a01n01_vg0/srv01-rhel7_0_]---+  |
 |  |    ____________________________________      |                       |                                             |                       |      ____________________________________    |  |
 |  +---[_/dev/an-a01n01_vg0/srv02-win2012_0_]-----+                       |                                             |                       +-----[_/dev/an-a01n01_vg0/srv02-win2012_0_]---+  |
 |  |    ___________________________________       |                       |                                             |                       |       ___________________________________    |  |
 |  +---[_/dev/an-a01n01_vg0/srv03-sles12_0_]------+                       |                                             |                       +------[_/dev/an-a01n01_vg0/srv03-sles12_0_]---+  |
 |  |    ______________________________________    |                       |                                             |                       |    ______________________________________    |  |
 |  +---[_/dev/an-a01n01_vg0/srv04-freebsd11_0_]---+                       |                                             |                       +---[_/dev/an-a01n01_vg0/srv04-freebsd11_0_]---+  |
 |  |    ____________________________________      |                       |                                             |                       |      ____________________________________    |  |
 |  +---[_/dev/an-a01n01_vg0/srv05-win2016_0_]-----+                       |                                             |                       +-----[_/dev/an-a01n01_vg0/srv05-win2016_0_]---+  |
 |  |    ____________________________________      |                       |                                             |                       |      ____________________________________    |  |
 |  +---[_/dev/an-a01n01_vg0/srv06-centos4_0_]-----+                       |          _________________________          |                       +-----[_/dev/an-a01n01_vg0/srv06-centos4_0_]---+  |
 |                                                                         |         | an-switch01             |         |                                                                         |
 |                                                       __________________|         |___        BCN       ____|         |__________________                                                       |
 |                                                      | IPMI            =----------=_03_]    VID 100    [_04_=---------=             IPMI |                                                      |
 |                                 _________    _____   |      10.20.11.1 ||         |_________________________|         || 10.255.11.2     |   _____    _________                                 |
 |                                {_sensors_}--[_BMC_]--|_________________||         | an-switch02             |         ||_________________|--[_BMC_]--{_sensors_}                                |
 |                                                                         |         |           BCN           |         |                                                                         |
 |                                                          ______ ______  |         |         VID 100         |         |  ______ ______                                                          |
 |                                                         | PSU1 | PSU2 | |         |____   ____   ____   ____|         | | PSU1 | PSU2 |                                                         |
 |_________________________________________________________|______|______|_|         |_03_]_[_07_]_[_08_]_[_04_|         |_|______|______|_________________________________________________________|
                                                                 || ||                  |      |      |       |                  || ||                                                              
                                     /---------------------------||-||------------------|------/      \-------|------------------||-||---------------------------\                                  
                                     |                           || ||                  |                     |                  || ||                           |                                  
                      _______________|___                        || ||        __________|________     ________|__________        || ||                        ___|_______________                   
           _______   | an-ups01          |                       || ||       | an-pdu01          |   |          an-pdu02 |       || ||                       |          an-ups02 |   _______        
          {_Mains_}==|         10.20.3.1 |=======================||=||=======|         10.20.2.1 |   | 10.20.2.1         |=======||=||=======================| 10.20.3.1         |=={_Mains_}       
                     |___________________|                       || ||       |___________________|   |___________________|       || ||                       |___________________|                  
                                                                 || ||                      || ||     || ||                      || ||                                                              
                                                                 || \\========[ Port 1 ]====// ||     || \\====[ Port 2 ]========// ||                                                              
                                                                 \\===========[ Port 1 ]=======||=====//                            ||                                                              
                                                                                               \\==============[ Port 2 ]===========//

Creating Install Media

Creating install media is a three step process;

  1. Download the install ISO for either CentOS 6 (disc 1 and 2) or RHEL 6 (one disc).
  2. Download 'anvil-generate-iso' to your Linux workstation or server.
  3. Run 'anvil-generate-iso', pointed at the CentOS or RHEL source disc(s).

This will generate a custom install ISO for building your Anvil! system. Optionally, the ISO can be converted to a USB-based install media if you prefer.
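
As an example, one common way to write an ISO to a USB stick on a Linux workstation is a raw copy with 'dd'. The device name '/dev/sdX' and the ISO file name below are placeholders only; double-check the target device first, as this destroys everything on it. The install media guide linked below covers the supported conversion method in detail.

# Identify the USB stick first; writing to the wrong device destroys its data.
lsblk
# Write the ISO produced by anvil-generate-iso to the stick.
# ('anvil.iso' and '/dev/sdX' are placeholders; substitute your own values.)
dd if=./anvil.iso of=/dev/sdX bs=4M
sync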

Wait, Why Do I Have To Build My Own ISO?

The Anvil! platform itself works like an appliance. That is to say, it builds and manages the operating systems under the Striker dashboards and Anvil! nodes.

We had to choose early on if we were going to create our own operating system or not, and we chose not to. The additional burden of maintaining an OS would distract us from our core goal of building the most resilient, intelligent availability platform possible.

So how to resolve this?

For completely understandable trademark reasons, we can't simply repackage and distribute a RHEL or CentOS based Anvil! install ISO. Red Hat and CentOS work very hard to deliver their ISOs and they want to ensure that, if their name is on something, they've tested it to their satisfaction.

So given that we don't want to create a distro, and out of respect for Red Hat and CentOS's trademarks, the best option available was for us to create an ISO generation tool, and that's what we've done.

The full instructions on building the install ISO, and optional install USB disk, are here;

Build the Anvil! m2 Install Media

Once the media is ready, you can proceed!

Building The First Striker Dashboard

The first dashboard will be booted off of either the DVD or USB drive. How exactly you do this will depend on your hardware, so please consult your machine's service manual for instructions on how to choose a temporary boot device.

Hardware Requirements

The Striker dashboard has fairly modest system requirements. The only hard requirement is that the machine can run RHEL or CentOS 6 and that it has two network interfaces (wireless is NOT supported).

The recommended minimum configuration is:

  • Intel Core i5 v5 (or AMD equivalent) or newer CPU
  • 8 GiB of RAM
  • 128 GiB SSD
  • 4x 1 Gbps NICs

The above specs will provide plenty of performance for hosting the ScanCore database as well as provide network redundancy on both the Back-Channel Network and the Intranet-Facing Network.

Stage-1 Striker Install

Note: You will need to select a temporary boot device when you first power on your Striker dashboard. Exactly how you do this will depend on the manufacturer of your system. If you need help sorting this out, please consult the BIOS (or UEFI) documentation for your hardware. If you still need help, feel free to reach out to us and we will try to help.
Note: At the end of the stage-1 install, if your install media is slow, it might hold at "Performing Post Install Tasks" for a while. Please be patient! It could take several minutes to finish.

Boot from the USB drive or DVD.

Select boot device. In this case, we're booting from an ISO. This screen will look different, depending on your hardware.

Once you select the boot device, you will see the Striker boot menu.

The Striker installer boot menu.
Warning: Once selected, the rest of the stage-1 install is automated. Any existing data on the machine will be erased!

We're building "Striker 01", so that is what we'll choose.

Selecting "New Striker Dashboard 01".

The rest of the install is automated.

Stage-1 install under way.

When the stage-1 install is finished, you will be presented with a standard RHEL login.

Note: The user name is 'root' and the password is 'Initial1'. This is the case for all stage-1 installed dashboards and nodes.
Stage-1 install complete; At the login prompt.

That's all for stage-1!

Stage-2 Striker Install

There are several ways to customise Striker for your environment. Below are a couple of examples, with a link to the complete list.

striker-installer Switches

Note: There is an example install call on all Striker dashboards at '/root/striker-installer.example'. You can edit this file and then run it directly with 'sh /root/striker-installer.example' to further simplify the stage-2 install.

Example striker-installer Invocations

Note: This is a standard bash call, so please be sure to quote anything with spaces and to escape special characters like !.
Warning: The --host-uuid <uuid> MUST be unique across dashboards. Please don't directly copy and paste without changing the UUID (use uuidgen to create one for you).
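
Generating a suitable UUID takes a single command; run it separately on each dashboard and paste the result into the '--host-uuid' switch:

# Each run prints a new, unique UUID.
uuidgen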

The most common install will look like this:

./striker-installer \
 -c "Alteeve's Niche! Inc." \
 -n "an-striker01.alteeve.com" \
 -u "admin:super secret password" \
 -i 10.255.4.1/16,dg=10.255.255.254,dns1=8.8.8.8,dns2=8.8.4.4 \
 -b 10.20.4.1/16 \
 -p 10.20.10.200:10.20.10.209 \
 --peer-dashboard hostname=an-striker02.alteeve.com,bcn_ip=10.20.4.2 \
 --host-uuid 32e47561-5840-4c6a-9732-2627efcd0af2 \
 --router-mode \
 --rhn "rhn_admin:rhn_secret"

This will configure the dashboard as the first Striker, using the standard BCN address '10.20.4.1' with a '255.255.0.0' subnet mask. In this example, the internet-facing network is '10.255.0.0/16', so we set the IFN address to '10.255.4.1/16' with the '10.255.255.254' gateway and Google's open DNS servers. The host name is set to 'an-striker01.alteeve.com', and the administrative user (and Striker web user) is 'admin' with the nice long password 'super secret password'.

We want this Striker to stay in sync with its peer, 'an-striker02.alteeve.com', so we tell this machine where to find it. Don't worry that the peer doesn't exist yet; once it is built, the two will sync the first time they find each other.

We want to use this as an "Install Target", so we set a few extra options. We tell it to route BCN traffic to the internet (only while "Install Target" is activated) and we tell this machine to offer up IP addresses from '10.20.10.200' to '10.20.10.209'. These are only needed for a short time, when a machine (node or dashboard) boots without an operating system (or when you are about to reload the OS).

This is a RHEL install proper, so the Red Hat user name 'rhn_admin' and password are set so that the machine is registered and the needed subscriptions added. Obviously, there is no need to use --rhn when you're installing on CentOS.

Later, when we're ready to do stage-2 on the second dashboard, we would use:

./striker-installer \
 -c "Alteeve's Niche! Inc." \
 -n "an-striker01.alteeve.com" \
 -u "admin:super secret password" \
 -i 10.255.4.2/16,dg=10.255.255.254,dns1=8.8.8.8,dns2=8.8.4.4 \
 -b 10.20.4.2/16 \
 -p 10.20.10.210:10.20.10.219 \
 --peer-dashboard hostname=an-striker01.alteeve.com,bcn_ip=10.20.4.1 \
 --host-uuid c63dda20-accb-4cc5-a2ff-8fc5be8d953b \
 --router-mode \
 --rhn "rhn_admin:rhn_secret"

The main differences are the final octet of the BCN and IFN addresses are '2', the hostname is 'an-striker02', the peer information points to Striker 1 and the lease range is from '10.20.10.210' to '10.20.10.219'.

Running striker-installer

For this document, we are using KVM virtual servers (the same virtualisation foundation used in the Anvil! itself). The default KVM network on a workstation is '192.168.122.0/24', so we'll adapt our 'striker-installer' call to use it as the IFN. We'll do this by editing the provided 'striker-installer.example'.

vim striker-installer.example
./striker-installer \
 -c "Alteeve's Niche! Inc." \
 -n "an-striker01.alteeve.com" \
 -u "admin:super secret password" \
 -i 192.168.122.251/24,dg=192.168.122.1,dns1=8.8.8.8,dns2=8.8.4.4 \
 -b 10.20.4.1/16 \
 -p 10.20.10.200:10.20.10.209 \
 --peer-dashboard hostname=an-striker02.alteeve.com,bcn_ip=10.20.4.2 \
 --host-uuid 32e47561-5840-4c6a-9732-2627efcd0af2 \
 --router-mode \
 --rhn "rhn_admin:rhn_secret"
Note: For this tutorial, the admin password is 'super secret password'. Please use a different password in your environment!

With that edited, we can start the install!

sh /root/striker-installer.example

If your Striker dashboard has a Siig network adapter, and if you accepted the ASIX user license when you created the Anvil! ISO, the drivers will be compiled and loaded, the interface will be started and, if available, an IP address will be acquired. Many smaller Striker dashboards are built on Intel NUCs with these Siig adapters as their IFN interface, so this behaviour makes sure everything is up and running before the install starts. If your machine does not have a Siig adapter, this step is skipped.

The install starts by checking for an Internet connection. If a connection is found and if the base OS is RHEL and if --rhn was used, the machine will register with Red Hat.

The beginning of the install will look something like this:

 ##############################################################################
 #   ___ _       _ _                                    The Anvil! Dashboard  #
 #  / __| |_ _ _(_) |_____ _ _                                 -=] Installer  #
 #  \__ \  _| '_| | / / -_) '_|                                               #
 #  |___/\__|_| |_|_\_\___|_|                                                 #
 #                                               https://alteeve.com/w/Striker #
 ##############################################################################
 
Sanity-checking command line switches:
Done.
 
Checking the operating system to ensure it is compatible.
- We're on a RHEL (based) OS, good. Checking version.
- Looks good! You're on: [6.8]
- This OS is RHEL proper.
Done.
 
Checking for an Internet connection...
- Internet access detected.
Done.
 
RHN credentials given. Attempting to register now.
- [ Note ] Please be patient, this might take a minute...
- Registration was successful.
- Adding 'Optional' channel...
- Output: [Repository 'rhel-6-server-optional-rpms' is enabled for this system.]
- 'Optional' channel added successfully.
Done.
 
Backing up some network related system files.
- The backup directory: [/root/anvil] doesn't exist, creting it.
- Backup directory successfully created.
- Backing up: [/etc/udev/rules.d/70-persistent-net.rules]
- It exists, backing it up.
- Copying: [/etc/udev/rules.d/70-persistent-net.rules] to: [/root/anvil/]
- Backing up: [/etc/sysconfig/network-scripts]
- Copying: [/etc/sysconfig/network-scripts] to: [/root/anvil/]
- Backing up: [/etc/rc.local]
- It exists, backing it up.
- Copying: [/etc/rc.local] to: [/root/anvil/]
Done.
 
Checking if we need to freeze NetworkManager on the active interface.
- NetworkManager isn't running, freeze not needed.
Done
 
Making sure all network interfaces are up.
- The network interface: [eth1] is down. It must be started for the next stage.
- Checking if: [/etc/sysconfig/network-scripts/ifcfg-eth1] exists.
- Config file exists, changing BOOTPROTO to 'none'.
- Attempting to bring up: [eth1]...
- Checking to see if it is up now.
- The interface: [eth1] is now up!
- The network interface: [eth2] is down. It must be started for the next stage.
- Checking if: [/etc/sysconfig/network-scripts/ifcfg-eth2] exists.
- Config file exists, changing BOOTPROTO to 'none'.
- Attempting to bring up: [eth2]...
- Checking to see if it is up now.
- The interface: [eth2] is now up!
- The network interface: [eth3] is down. It must be started for the next stage.
- Checking if: [/etc/sysconfig/network-scripts/ifcfg-eth3] exists.
- Config file exists, changing BOOTPROTO to 'none'.
- Attempting to bring up: [eth3]...
- Checking to see if it is up now.
- The interface: [eth3] is now up!
Done.

Mapping the Network

There is no way for a program to know that a given network interface is going to be used for a particular task. To handle this in a user-friendly way, the installer will ask you to unplug each cable, wait a moment, and then plug it back in. The installer can see which interface loses a connection and determine which interface you manipulated.

This mapping is the last step before the automated portion takes over.

Template note icon.png
Note: Striker dashboards with four interfaces will have redundant connections to the BCN and IFN. Larger installs using proper server-grade equipment should have four interfaces (and redundant power) for additional resiliency. This is not required, however, as the two Striker dashboards are inherently redundant with each other.

When two interfaces are found, the mapping order will be:

  1. "Back-Channel Network - Link 1"
  2. "Internet-Facing Network - Link 1"

When four interfaces are found, the mapping order will be:

  1. "Back-Channel Network - Link 1"
  2. "Back-Channel Network - Link 2"
  3. "Internet-Facing Network - Link 1"
  4. "Internet-Facing Network - Link 2"

Let's see what this looks like with four interfaces:

-=] Configuring network to enable access to Anvil! systems.
- Beginning NIC identification...
- Please unplug the interface you want to make:
  [Back-Channel Network, Link 1]

Unplug the interface you want to make BCN link 1.

- NIC with MAC: [52:54:00:71:20:fa] will become: [bcn_link1]
  (it is currently: [eth2])
- Please plug in all network cables to proceed.

Plug it back in and unplug the interface you want to make BCN link 2.

- Please unplug the interface you want to make:
  [Back-Channel Network, Link 2]

Plug it back in and unplug the interface you want to make IFN link 1.

- NIC with MAC: [52:54:00:3d:bc:57] will become: [bcn_link2]
  (it is currently: [eth3])
- Please plug in all network cables to proceed.
- Please unplug the interface you want to make:
  [Internet-Facing Network, Link 1]

Plug it back in and unplug the interface you want to make IFN link 2.

- NIC with MAC: [52:54:00:d3:a9:0f] will become: [ifn_link1]
  (it is currently: [eth0])
- Please plug in all network cables to proceed.
- Please unplug the interface you want to make:
  [Internet-Facing Network, Link 2]

Finally, plug it back in.

You will now see a summary of what will be done.

- NIC with MAC: [52:54:00:55:c0:6e] will become: [ifn_link2]
  (it is currently: [eth1])
- Please plug in all network cables to proceed.
Done.
 
Here is what you selected:
- Interface: [52:54:00:71:20:FA], currently named: [eth2],
  will be renamed to: [bcn_link1]
- Interface: [52:54:00:3D:BC:57], currently named: [eth3],
  will be renamed to: [bcn_link2]
- Interface: [52:54:00:D3:A9:0F], currently named: [eth0],
  will be renamed to: [ifn_link1]
- Interface: [52:54:00:55:C0:6E], currently named: [eth1],
  will be renamed to: [ifn_link2]
 
The Back-Channel Network interface will be set to:
- IP:      [10.20.4.1]
- Netmask: [255.255.0.0]
 
The Internet-Facing Network interface will be set to:
- IP:      [192.168.122.251]
- Netmask: [255.255.255.0]
- Gateway: [192.168.122.1]
- DNS1:    [8.8.8.8]
- DNS2:    [8.8.4.4]
 
Shall I proceed? [Y/n]

If you are happy, press '<enter>' to start the install. If you made a mistake, type 'n' + '<enter>' and the install will exit. Simply restart the install and try again until you are happy.

Stage-2 Install Completes

From here on out, the rest of the install is automated. It might take a while, so this is a good time to go grab a $drink.

When it is done, it will reboot and you will see the graphical login screen!

Stage-2 install complete.

You should now be able to access the Striker web interface from any device on the BCN or IFN with a web browser.

Enter the dashboard's IP address into your browser and you should see the login prompt.

Enter the administrator user name and password you set via '-u "<user>:<password>"'.

Login using the user name and password you specified during the stage-2 install. In our case, we stuck with the default user 'admin'.

The configuration and control menu.

Voila! You have your first Striker dashboard built. If all went well, you no longer need the install media. Future installs can be done over the network from this Striker.

Enabling 'Install Target'

Now that we have the first dashboard built, we can build the second using it as a PXE server.

From the Striker web interface, select 'Enable Install Target'.

Click on 'Enable Install Target'.
Template warning icon.png
Warning: Enabling 'Install Target' causes a DHCP server to start running on Striker’s BCN. If you haven't isolated the BCN from the IFN, this could cause normal clients to get their IP from Striker, which wouldn't work in most cases.

Read and then confirm:

Confirm.

Now you can boot bare-iron machines using their network interface!

'Install Target' now running.

Done!

Building The Second Striker From The First

As before, building a Striker dashboard is a two-stage process. The difference this time is that we don't use the ISO (or USB drive) we made before. Now we're going to install off of the first dashboard directly.

Second Striker Stage-1 Install

Boot your second dashboard and manually select the boot device. How exactly you do this will depend on the vendor of your machine.

Template note icon.png
Note: If your machine has multiple network-bootable devices, you may need to experiment to find which one is connected to the BCN.

Select "New Striker 02 dashboard", and the rest of the stage-1 install will complete automatically.

Stage-1 'Install Target' boot menu.

When done, you will see the same default login screen as you did for the first Striker dashboard install.

Stage-1 install complete.

Second Striker Stage-2 Install

As before, the login user for stage-1 is 'root' and the password is 'Initial1'.

Once you log in, edit '/root/striker-installer.example' to suit your needs and then run it. The rest of the install proceeds exactly as it did for the first Striker install above.

Here is our '/root/striker-installer.example'.

Template warning icon.png
Warning: The --host-uuid <uuid> MUST be unique across dashboards. Please don't copy and paste it directly without changing the UUID (use 'uuidgen' to create one for you).
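
If you need a fresh UUID for this dashboard, 'uuidgen' will print one for you. A minimal sketch (the output shown is purely illustrative; yours will differ):

uuidgen
# Prints something like: 0b2f9c64-41aa-44c6-9bf0-2a1d6a7c3e55

Copy the printed value over the '--host-uuid' value in the example file.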

The most common install will look like this:

./striker-installer \
 -c "Alteeve's Niche! Inc." \
 -n "an-striker02.alteeve.com" \
 -u "admin:super secret password" \
 -i 192.168.122.252/24,dg=192.168.122.1,dns1=8.8.8.8,dns2=8.8.4.4 \
 -b 10.20.4.2/16 \
 -p 10.20.10.210:10.20.10.219 \
 --peer-dashboard hostname=an-striker01.alteeve.com,bcn_ip=10.20.4.1 \
 --host-uuid a9a9cdc5-0c10-4ee1-9102-ecba16f75e09 \
 --router-mode \
 --rhn "rhn_admin:rhn_Initial1"

With that edited, we can run the file as a script and perform our stage-2 install.

sh /root/striker-installer.example

Map the network interfaces when prompted and the rest of the stage-2 install will be completed automatically. When it is done, it will reboot and present you with the same login screen as before.

Stage-2 install complete.

Done!

Building The First Anvil! Node Pair

Building an Anvil! node pair is also a 2-stage process. The main difference is that stage-2 is controlled by an 'Install Manifest' that we run on a Striker dashboard. Striker then uses that manifest as a reference to control how the nodes are partitioned, named, and configured.

The benefit of this approach is that the manifest is recorded and can be used in the future to rebuild a node that was damaged or destroyed.

Node Stage-1 Install

Running the stage-1 install for nodes is identical to how it was run for the second Striker.

Boot the nodes and select a temporary boot device.

Template note icon.png
Note: Nodes will have six network bootable interfaces, so you might need to experiment to find which ones are on the BCN. Once you find them, it is a good idea to make note of the adapter's ID. It will save you having to hunt around if you need to rebuild or replace the node in the future.

Select "New Anvil! Node 01", and the rest of the stage-1 install will complete automatically.

an-a01n01 an-a01n02
Node 1, Stage-1 'Install Target' boot menu.
Node 2, Stage-1 'Install Target' boot menu.
Template note icon.png
Note: The node install starts by "zero'ing out" the first 100 GiB of the disk. This causes the stage-1 install to appear to hang for a while. Please be patient.

When done, you will see the same default login screen as you did for the Striker dashboards.

an-a01n01 an-a01n02
Node 1, Stage-1 complete.
Node 2, Stage-1 complete.

There is no need to log in this time, however. Make a note of the IP addresses that each node got. We want to use the BCN IP addresses, which are the '10.20.0.0/16' IPs (unless you changed the BCN subnet when you built your Striker dashboards).

In the example above, the IPs we care about are:

an-a01n01 an-a01n02
10.20.10.204 10.20.10.205
Template note icon.png
Note: The 'Install Target' feature is no longer needed. You can now disable it by clicking on 'Disable Install Target'.

Stage-1 is done!

Basic Striker Navigation

Before we start using Striker, let's look at the top navigation options. These remain consistent throughout all of Striker's UI.

Striker navigation options.

If you click on the top-centre logo, you will be taken to the main Striker configuration menu with the list of configured Anvil! systems. In day to day use, this will be mainly used for switching between the Anvil! systems you want to view or work on.

The 'back' button.

Avoid using your browser's "back" button when navigating Striker; the back button on most browsers will not reload the page. Using Striker's back button instead tells Striker to refresh the page when appropriate.

The 'reload' button.

When you want to refresh your view of the Anvil!, media library or other pages, press this button. Generally speaking, the browser's "reload" button will do the right thing, so it is less important to always use this one. However, there are times when a browser reload won't work, so it is a good idea to get into the habit of refreshing the page this way.

Template note icon.png
Note: By default, an Anvil! main page will NOT update on its own. You can change this by setting 'sys::reload_page_timer = X' in Striker’s configuration file '/etc/striker/striker.conf'.
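
For example, adding the following line to '/etc/striker/striker.conf' would have the Anvil! page refresh itself every 60 seconds (a minimal sketch; we're assuming the value is in seconds, so pick whatever interval suits you):

# Automatically reload the Anvil! page every 60 seconds.
sys::reload_page_timer		=	60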

Creating An 'Install Manifest'

To create an 'Install Manifest', click on the 'Install Manifest' button on Striker.

Click 'Install Manifest'.

The 'Install Manifest' form is long, but most of it will be auto-filled for you.

The 'Install Manifest' form, top section.

The main part to be aware of are the eight fields at the top.

Field What it does What it should be set to Restrictions
Anvil! Prefix This prefix is used by people with multiple Anvil! systems and can help group Anvil! pairs by client, department or location. This is pre-set using the Dashboard's prefix and likely doesn't need to change. Only letters or numbers are allowed. Generally this is 2 or 3 characters, but there is no hard limit on its length.
Sequence Number Provides a differentiation between Anvil! systems with the same prefix. This sequence number will be used to preset IP addresses and PDU outlet numbers. A simple digit. There is no hard limit, but numbers over 24 will likely require manual setting of IPs.
Domain Name Sets the domain name portion of the nodes' hostnames and foundation pack equipment. Generally, this should be set to your organisation's domain name (or division, client, etc). This must be a standard domain name comprising letters, numbers and hyphens.
Anvil! Password Sets the nodes' 'root' and 'ricci' passwords. Something kept private! Note: The password set is echoed back to you and must be stored in plain text in the ScanCore database in order to interact with the nodes. Don't use the same password you use elsewhere and restrict access to the Striker dashboards!
BCN Subnet The private management network used for inter-node, dashboard to node and node to foundation pack communication. 10.20.0.0/16, except when that would conflict with existing networks. A valid IPv4 subnet.
SN Subnet The private storage network used for inter-node storage replication. 10.10.0.0/16, except when that would conflict with existing networks. A valid IPv4 subnet.
IFN Subnet The public network used to communicate with your network equipment and clients. Free IPv4 addresses available on your network. The form will try to select sane default IPs, though you will want to verify the selected IPs are available. A valid IPv4 subnet.
Media Library Size How much space to allocate for ISOs (images of CD and DVD discs) used to install servers or "put into" a server's (virtual) optical drive. Between 10 and 40 GiB is sufficient for most users. Start small; the size can be extended later if needed. It cannot be shrunk.

In this tutorial, we will be creating our third Anvil! system, and we'll stick with the default 'an' prefix and 'alteeve.com' domain name. Our IFN subnet is '192.168.122.0/24' and we'll start with a modest 10 GiB media library size.

The 'Install Manifest' form, top section filled out.

With that set, click on 'Set Values Below' and the form will do its best to set the rest of the form's details for you.

The 'Install Manifest' form, all filled out.

Please review all the fields the first time you create a manifest. Each field is explained, and each has a link on the right that will take you to our wiki for a much more detailed explanation, if needed.

The main fields to check are:

  • Are the IFN IP addresses of the nodes right?
  • Are the IP addresses of the foundation pack devices and Striker dashboards right?
  • Are the PDU outlet port numbers correct?
Template warning icon.png
Warning: It is critical that the PDUs and their outlet ports map properly to the nodes and their PSUs. Double and triple check these values!

Once you are happy with the fields, click on 'Generate' to view a summary of the manifest data.

The 'Install Manifest' summary.

Careful! Be sure that the summary data is as you expect it to be. Once generated, this manifest will control the Anvil! node pair construction and regeneration should a node fail. It is very important that the data is correct.

If you are happy, click 'Generate' on the summary and the 'Install Manifest' will be created and saved.

The 'Install Manifest' generated and saved.

The 'Install Manifest' is saved in the ScanCore database. As such, it is immediately sync'ed to other dashboards and will remain in Striker until you delete it.

All done!

Anvil! Node Pair Stage-2 Install

Now that the manifest has been created and both nodes have completed their stage-1 installs, we're ready to build the Anvil! node pair!

Choosing the 'Install Manifest' to run.

At this stage, there is only one manifest to choose. So click on 'Run'.

Telling Striker how to connect to the nodes.

The manifest always defaults to each node's BCN IP address and the recorded password.

In our case, both nodes just finished their stage-1 install, so they will have the temporary IPs we noted earlier and both will use the default stage-1 password 'Initial1'.

Recall that the temporary IPs of the two newly built nodes were:

an-a01n01 an-a01n02
10.20.10.204 10.20.10.205

So we will update the login fields.

Updated initial login and temporary IPs set.
Template warning icon.png
Warning: Be sure to review the manifest data! This is particularly true when you have multiple manifests. Don't run the wrong one by accident.

When you're ready, click on 'Begin Install'.

Anvil! Node Pair Stage-2 Network Mapping

Once the run begins, it will verify that it can talk to both nodes, do some sanity checks and then begin the network mapping process. This is virtually identical to what you did when you mapped the dashboard's network, except this time the prompts are in the browser.

Connected to the nodes, sanity checks run and beginning the network mapping.
Template note icon.png
Note: When you pull the network cable that Striker is using to talk to the node, you will not see the prompt to plug the cable back in. This is normal. Keep the cable unplugged for about five seconds, plug it back in and then wait. Striker will catch up and move on normally a few seconds later.

If you can't see the screen when you unplug the cables, don't worry. The order of the prompts is always the same. If you wait five seconds between each unplug and another five seconds after plugging the cables back in, you will be able to reliably map without seeing the screen. Do wait a minute between node 1 and 2 though so that node 2 has time to prepare for the remap.

The map order is always;

  1. Back-Channel Network, Link 1
  2. Back-Channel Network, Link 2
  3. Storage Network, Link 1
  4. Storage Network, Link 2
  5. Internet-Facing Network, Link 1
  6. Internet-Facing Network, Link 2
Network mapping done.

Once you've mapped both nodes and are happy with the results, you're ready to proceed.

If your nodes are built on RHEL proper, you will see the fields for registering your nodes with Red Hat, as you can see above. If you built on CentOS, those fields will be absent.

Template note icon.png
Note: If you don't register with Red Hat and you are using RHEL, the nodes will not be updated during the install.

If you made a mistake mapping the network, just keep going until all the interfaces have been unplugged and plugged back in. Then you can start over by clicking on 'Start Over'.

Anvil! Node Pair Stage-2 Under Way

Template warning icon.png
Warning: Do not browse away once the install has started! If you do for some reason, reboot both nodes, see what IPs they have and re-run the install manifest.
Template note icon.png
Note: Once you click on 'Install', the network mapping is recorded. If you run the manifest again against the same hardware nodes, you won't have to map the network again, unless you specifically choose to.

If you are happy with the mapping, then click on 'Install' to begin.

Once the install starts, you will have about 20 to 30 minutes to wait.

Install has begun, go grab a coffee.
Template note icon.png
Note: If the install fails for any reason, it tries to tell you why it failed. That may not be useful enough, however. Details can be viewed by reading '/var/log/striker.log' on the Striker dashboard that was running the manifest. Once you think you've fixed the problem, you can restart by clicking on the 'Restart' button. Please use this button! If the error was hit after the IP addresses changed, using your browser's 'reload' button might tell the browser to connect to the old address.
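
If you want to watch the install progress (or an error) in real time, you can follow that log from a terminal on the dashboard running the manifest (a minimal sketch):

# Watch the Striker log as the install manifest runs.
tail -f /var/log/striker.log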

After some time, possibly quite some time depending on your nodes, the install will finish.

Install completed!

Click on 'add it' at the bottom to record the new Anvil! in the database.

The 'add it' button.

Adding The New Anvil! To The Database

The Anvil! is not automatically added because there are a few things you need to tell Striker.

Add the new Anvil! to Striker.

There are two sections. In the top section, you say who owns the new Anvil! and give a short description that will be displayed in the Anvil! selection list. The owner name and description are for your benefit and you can set them to whatever makes sense to you. The 'owner' might well be a department, site name and so on.

The second section is the information needed to send email alerts. If you don't plan to use email-based alerts, simply clear the 'Login Name' field. If you do want to use email alerts, however, please fill out the information needed to connect to your SMTP server.

Template note icon.png
Note: Anvil! nodes run a local copy of postfix that queues and sends emails. If your outgoing email server supports encrypted connections, then alerts from the nodes to the mail server will be securely delivered.

If you have added an Anvil! system before, then you will be able to select existing owners and outgoing mail servers, saving the need to re-enter the information.

New Anvil! form filled out.

With the form filled out, click on 'Save'.

New Anvil! form saved successfully.

Once saved, you will see a 'success' message along with two buttons; One for managing alert recipients and one to start working with the new Anvil! right away.

When you saved the new Anvil!, you told the nodes how to send email alerts, but not where to send them. When you click on the button for alert recipients, you get a chance to define who will receive alerts, and what level of alerts they want to receive. We'll cover that menu in a moment.

As this is a new Anvil! system, click on 'Click here to manage alert recipients'. This will open a new tab in your browser with the alert recipient menu. Then click on 'Click here to start using it' to view your newly built Anvil! system.

Setting Up Alert Recipients

Alert recipients can be one of two types: alerts sent by email, or alerts written to a local file.

The alert recipient's menu.

The first field, 'Target', determines what the recipient type is. If it is an email address, then the recipient will be sent alerts over email. Otherwise, it is assumed that the notification target is a file. In that case, alerts will be written to '/var/log/$filename' on the machine generating the alert.

Generally speaking, most users will want to send alerts via email. The main users who will use the file-type alert target are users whose Anvil! system is not connected to the public Internet. If this is your case, then you can periodically pick up the file for analysis. Be sure to delete or empty the file when you pick it up, if you want to keep its size small.
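
As a sketch of that pick-up-and-empty cycle, assuming the alert target was named 'anvil-alerts.log' and the machine is reachable over SSH as 'an-a01n01' (both names are hypothetical examples):

# Copy the alert file off the machine, then empty it so it stays small.
scp root@an-a01n01:/var/log/anvil-alerts.log .
ssh root@an-a01n01 'truncate -s 0 /var/log/anvil-alerts.log'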

The second field, 'Recipient Name', is mainly used for email notification targets; it is used in the email's 'To' field.

If you have multiple languages installed, then you can select the recipient's language.

The 'Alert Level' selection box determines what alert level, if any, this recipient will receive when new Anvil! systems are added.

The 'Measurements' field lets you choose whether measurements sent to the recipient use metric or imperial units.

The 'Notes' field is purely for your benefit to store information about the alert recipient. Use it however you want, if at all.

Lastly, there is a section to select the alert level for the recipient on each existing Anvil! system. We just added the first one, so only it is shown. Select the desired alert level, and you're done!

The alert recipient's menu filled out.

When you're happy with your entry, click on 'Save'.

The new alert recipient saved.

Voila! You'll see the drop-down box at the top. If you want to change anything, go for it and then click 'Save' again.

You can close the tab now.

Using Your New Anvil! System

Template note icon.png
Note: For the sake of screenshots and documenting the install process, a virtual server-based Anvil! system, 'an-anvil-03', was used. From this point in the document, we will switch to a hardware-based Anvil! system called 'an-anvil-07'.

When you load an Anvil! system, there will be a delay while the current state of the Anvil! is examined.

Gathering current state information.

After a moment, the main Anvil! menu page will be displayed.

The main page for the new Anvil!.
Template note icon.png
Note: You will notice that the 'Disk State' of the second node is 'Inconsistent'. This is normal and expected. When an Anvil! is first built, or when a node boots after being offline for a while, the data on one node needs to be written to the peer to bring it up to date. The rate of the sync varies depending on how busy the storage is, and it slows down when your servers use the disk, to avoid impacting storage performance.
Template warning icon.png
Warning: You can start using your Anvil! right away, you do NOT need to wait for the resync to finish. NOTE HOWEVER; The Anvil! is considered to be in a degraded state until the resync is complete. If the 'UpToDate' node goes offline, the 'Inconsistent' node will shut down storage to prevent corruption. Some functionality, like migration, will be disabled until the resync is complete.
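
If you would like to watch the resync progress yourself, the replicated storage under the Anvil! is provided by DRBD, so its state can be read from '/proc/drbd' on either node (a minimal sketch):

# Show the replication state; during a resync you will see a progress line
# similar to "sync'ed: 42.1%".
cat /proc/drbd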

That's it, you're done!

Adding An Existing Anvil! To Striker

We just saw how to add an Anvil! system to Striker after a successful install manifest run. Now, let's look at how to add an existing Anvil! manually.

The Striker configuration menu and Anvil! selection screen.

From any page, click on the logo at the top of Striker to access the main menu.

Click on the "Anvil! Systems" button.

Click on the top button, "Anvil! Systems".

The new Anvil! form.

Last time, much of the form was filled out for you by Striker using data from the install manifest. This time, we're adding an Anvil! manually, so we will need to specify things manually.

Selecting an existing owner and outgoing mail server.

If you plan to use the same owner or outgoing mail server, select them from the drop down box and then click on "Load". If you want to use a new owner or outgoing mail server, then leave the selection box as 'New' and enter the details in the main form. When you save the Anvil!, the new owner and/or outgoing mail server will be created at the same time.

In our case, we're using the same owner and mail server, so we loaded them from the selection boxes.

The details of the new Anvil! added.

In our case, we're configuring access to a remote Anvil!. To handle this, both nodes will share the same public IP address but be reachable on unique ports via standard TCP port forwards. To do this, we enter the public IP address and port as 'a.b.c.d:x'.

Template note icon.png
Note: If you want to, you can edit the '/etc/hosts' and '/etc/ssh/ssh_config' files on the dashboards before adding the Anvil! so that you can enter host names that map to the proper IP and port. It's up to you and your comfort with somewhat advanced Linux networking.
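
As a rough sketch of that approach, using a made-up public IP and forwarded SSH ports (the node names, IP and ports below are all hypothetical examples to adapt to your own setup):

# /etc/hosts on the dashboard; both node names resolve to the forwarded public IP.
1.2.3.4    an-a07n01.alteeve.com an-a07n01
1.2.3.4    an-a07n02.alteeve.com an-a07n02

# /etc/ssh/ssh_config on the dashboard; pick the forwarded port for each node.
Host an-a07n01 an-a07n01.alteeve.com
    Port 2201
Host an-a07n02 an-a07n02.alteeve.com
    Port 2202
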
The new Anvil! has been added.

When you're ready, click on 'Save' to finish.

Template note icon.png
Note: If you want to add new alert recipients for this Anvil!, you can click on the link to do so at the top. Otherwise, click on the "Click here to start using it!" link.
The new Anvil!’s main menu page.

Voila!

Adding A New ScanCore Database To Nodes

In order for Striker to manage an Anvil!, the nodes need to write to Striker’s ScanCore database.

To do this, edit each node's /etc/striker/striker.conf file and add the Striker database connection information:

Template note icon.png
Note: In the example below, the nodes are remote, and so from the perspective of the nodes, the dashboards are also remote. In the same way that we configured the new Anvil! to use the same IP but different ports, the same applies in reverse on the way back. If the Anvil! and dashboards are on the same subnet, then use the BCN IP addresses.
Template warning icon.png
Warning: When connecting nodes to dashboards by directly forwarding TCP ports, please read this link on SSL connections and this one on SSH tunnels for securing the PostgreSQL connection.

It should look like this;

scancore::db::3::host			=	1.2.3.4
scancore::db::3::port			=	15432
scancore::db::3::name			=	scancore
scancore::db::3::user			=	secret
scancore::db::3::password		=	super secret password
 
scancore::db::4::host			=	1.2.3.4
scancore::db::4::port			=	25432
scancore::db::4::name			=	scancore
scancore::db::4::user			=	admin
scancore::db::4::password		=	secret

Save and exit. The next time ScanCore loops, it will connect to and sync with the new ScanCore databases. As soon as that happens, you're ready to go.

If you get the error:

-=] Error - Error Code 176 - Fatal Error - Striker.pm at 386 [=-
 
The 'AN::Tools::Striker' module's 'load_anvil()' method was called before ScanCore has run on both nodes. Please make sure that ScanCore is enabled and has run at least once.
 
Exiting.

Please wait about a minute and the error should go away.

Managing Two Anvil! Systems

The new Anvil! should now be available!

The previous and new Anvil! are now in the selection list.

At the bottom of the menu will be all configured Anvil! systems. In this case, we can see 'an-anvil-01' and 'an-anvil-07'.

Template note icon.png
Note: Now that two Anvil! systems exist, Striker will load the selection menu by default from now on instead of automatically loading the previously lone Anvil!.

Click on the name of the Anvil! on the left that you want to manage. If you want to edit (or delete) an Anvil!, click on 'Edit' to the right of the given system.

Media Library

Template note icon.png
Note: If only one Anvil! exists, then connecting to Striker will automatically load that Anvil!. If two or more exist, the first screen you see will be the configuration menu with the available Anvil! systems listed at the bottom. If you are in an Anvil! main page, click on the title logo at the top of the page to access the Striker menu.
Striker configuration menu and Anvil! selection screen.

Building a server on an Anvil! is pretty close to installing an operating system on traditional hardware. The only difference, of course, is that there is no drive to put a physical disc into. So before we can build the first server, we need an operating system disc image (called an ISO).

Striker uses a library to upload and store install media, driver discs or any other "CD" or "DVD" you might want to use. It is called the 'Media Library', and the library is specific to each Anvil! pair. So to access the library, first load the Anvil! main page.

Add Disc Images To The Media Library

Main menu for the new Anvil!.

Look for the 'Media Library' button at the bottom right.

The 'Media Library' button.

The media library starts with a summary of the used and available free space. Below that is a list of files already in the library, if any.

The 'Media Library' page.

The three remaining sections provide three ways to upload disc images to the library.

  1. Image a physical disc.
  2. Pass a URL to an image for direct download to the Anvil!.
  3. Upload an ISO directly from your computer.

We'll try all three to show how each works.

Imaging A Physical Disc

Template note icon.png
Note: This process uses the optical drive on the Striker dashboard itself. The drive can be internal or plugged in over USB.

The first method we will show takes a physical CD or DVD in a physical drive, converts it to an ISO and then uploads it to the Anvil!. This is also very useful if you have software, drivers or updates sent to you by a vendor on disc. Disc images can also be "inserted" into an existing server's optical drive.

The disc imaging menu.

The disc in the drive is the disc used to build the dashboard. We don't want that one, so let's eject it and then reload the page.

No disc in the drive.

We will use Windows 2012 R2 as our first server, so we'll insert the physical disc into the drive.

Windows 2012 R2 disc in the drive.

Click on 'Upload'.

Image disc menu.

It won't start the imaging process right away. You have the option to choose the ISO's file name. In this case, the default file name "GRMSXVOL_EN_DVD.iso" isn't very helpful. So we'll give it a more verbose name.

A better file name.

The name 'Windows_2012_R2_x86_64.iso' will be a lot more informative. Now we're ready, just click 'Confirm' to begin!

Image process under way.

The first step is to convert the physical disc into an ISO image file on the dashboard.

Image of the disc complete, upload started.

Next, the image file is uploaded to the Anvil!'s shared storage.

Upload complete!

Click on the back button and you will see the image now in the library. It will stay in the library until you decide to delete it, so you can safely store away the original disc; you won't need it again.

The disc image in the library.

Done!

Directly Download An ISO

The most common way to add a disc image to the media library is to pass a URL to Striker. It will connect to one of the Anvil! nodes and directly download the target URL to the library. You can use ftp, http or https addresses.

Direct download menu.

For this example, we will download a CentOS 7 disc image. We'll use a mirror close to us, which is likely not the best mirror for you. If you are following along, please check the mirror list for the mirror closest to you.

The closest mirror to us is "http://mirror.csclub.uwaterloo.ca/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1611.iso" hosted by the University of Waterloo Computer Science Club (thanks!).

URL to the CentOS 7 install disc image entered into the address field.
Template note icon.png
Note: Please ignore the 'Script?' check box for now. We will explore its use and purpose later on in this tutorial.

Paste the URL into the 'Address' field and then click on the 'Download' button.

Download confirmation dialogue.
Template note icon.png
Note: Some websites use referral links which may not work. If the file you download doesn't work, download the image to your computer directly and follow the instructions on uploading images in the next section.

You will be asked to confirm the download. When ready, click on 'Confirm'.

Download requested.

The download will be sent to one of the nodes (usually node 1).

Once requested, the page will reload into the download monitor screen.

Download in queue.

Depending on timing, you may see that the download is in queue for less than a minute before the download actually starts.

Download started.

When the download starts, you can either leave the page to refresh periodically so that you can monitor the download, or return to the main page and do other work. The download will proceed in the background.

Download well underway.

With each refresh, the current and average download rates will be shown. This can be useful for tracking downloads that slow down because of congestion, for example.

If you want to cancel a download, you have two options;

  • Abort and Preserve.
  • Abort and Delete.

The 'Preserve' option will cancel the download, but leave the partially downloaded file in the library. There is no visual indication that the download is incomplete, save for the reduced size. You should only do this if you plan to restart the download; for example, if you want to switch to a faster mirror. If the download is restarted with the same output file (even if it is from a different URL), the download will pick up where it left off.

The 'Delete' option will cancel the download and delete the partial file. This is what you will want to do most times.

Download complete!

When the download completes, a message saying so will be displayed for a short time.

Nothing downloading in the download monitor.

If the download monitor page is left open, eventually no download will show on either node.

Download ready for use.

Done!

Upload An ISO From Your Computer

The final option for loading a disc image is to upload an ISO image from your computer to the media library. The ISO is uploaded to the Striker dashboard first, and then Striker pushes it to the Anvil! directly.

Direct download menu.

If you have an ISO image on your computer, then you can upload it to Striker. This is often needed if the image was made available through an authenticated service, had to be renamed or similar reasons.

Template note icon.png
Note: When provisioning Windows servers, the Anvil! will use an emulated network card and storage controller that are optimised for higher performance. These are called "virtio". They work like any other brand of network or storage hardware in Windows, but the drivers need to be available during the operating system installation.

We will cover this in detail when we provision a Windows server later in this tutorial. For now, we will download the latest signed virtio drivers from the Fedora Project. Follow the link in the previous sentence and then click on the "Stable virtio-win iso" link. Save it anywhere on your computer.

We're using this example because, you will note, the download URL is this:

When downloaded, though, the file name changes to reflect the specific version. At the time of writing this, the resulting file was saved as 'virtio-win-0.1.126.iso'.

File to upload selected.

When you click on the 'Browse' button, you will be presented with an OS-specific file browser. Browse to where you downloaded the file and select it.

Template warning icon.png
Warning: When you start the upload, your browser may or may not show the upload progress. If you don't see the upload running directly, please just wait a couple of minutes. We know this upload feature is pretty crappy and plan to replace it with a version that shows an upload progress bar later. (Patches welcomed!)
File to uploaded to Striker, copying to the Anvil!.

With the file selected, click on the 'Upload' button to start the upload. After a minute, the upload to Striker will complete and Striker will begin copying the ISO to the Anvil!.

File copied to the Anvil!.

Once finished, the disc image is on the Anvil!.

File uploaded!

With this, we have everything we need to start provisioning servers!

Provisioning Servers

This tutorial is not meant to teach you how to install different operating systems. Just the same, we will walk through the installation of several different kinds of operating systems. As we install each, any Anvil!-specific (really, KVM/qemu-specific) install steps will be covered.

We'll install the following server operating systems, in no particular order;

One of the benefits of virtualization, in general, is the ability to extend the life of legacy servers when migrating is not an option. So lets toss in a legacy OS;

Template warning icon.png
Warning: If you don't plan to build a RHEL 7 server, please read through that section just the same. We go into more detail on that install, then mainly focus on operating-system-specific differences in subsequent guest installs.
Template note icon.png
Note: All the install media needed to provision these servers was downloaded to the Anvil! system outside this tutorial. How and where you get install media depends strongly on what operating system you want to install.

Guest OS Notes

It would appear that Solaris 11 has trouble running on the latest Anvil!. We will investigate this further at a later time. If you want to use Solaris on an Anvil! system, contact us.

Installing Windows XP or Windows 2003 requires emulating a floppy drive in order to load the storage drivers. This is possible, but not via the Striker web interface. If you want to run any copy of Windows that requires a floppy drive, please contact us and we'll be happy to help you with the manual install process.

Preamble - Log Into Striker

When a server is provisioned, it will be just like a traditional, "bare iron" server. That is to say, the (virtual) hardware is assembled, powered on and the install disc is booted.

So the install process requires "plugging in a monitor, keyboard and mouse".

We do this using the "Virtual Machine Manager" program. Log into Striker and look for the 'VM' icon on the desktop.

The "Virtual Machine Manager" desktop icon.
Template note icon.png
Note: If you haven't logged out since you added the Anvil!, please do so now and then log back in. Striker runs a script when you log in that automatically configures new Anvil! systems so that they automatically appear in Virtual Machine Manager.
Template warning icon.png
Warning: If you are familiar with Virtual Machine Manager, you will notice that you can use it to modify and create servers. Do NOT do this! VMM is not cluster aware, and the Anvil! stores the server data in a cluster-aware file system. Any changes you make will not be preserved.
The VMM desktop icon desktop location.

Double-click on the icon and VMM will start and automatically connect to the Anvil! nodes. At first, you won't see anything and that is fine. Once you create the server, you will see it appear and then you can double-click on it to gain access.

VMM running and connected to the nodes, but no servers are shown.

Once connected to the server, you will effectively be using a monitor, mouse and keyboard plugged into a traditional bare iron server. You then do the rest of the operating system installation exactly as you would have on real hardware.

Now, on to provisioning the server!

Red Hat Enterprise Linux 7

We will start by provisioning a RHEL 7 based guest as the operating system install is simple, allowing us to focus on the Striker and Anvil! portion of the process.

The 'Build a New Server' main menu button.

Off of the main page, click on 'Build a New Server' at the bottom-left of the page.

The new server form.

The Seven Provision Form Fields

The new server form has seven sections;

Entry Description
Server Name This is a free-form field used to name the server. Keeping it short and unique among all your Anvil! systems is ideal. This way, you can migrate the server to other Anvil! systems at a later date and not worry about name collisions.
Optimize for This is a drop-down list of common operating system types. It alters how the emulated hardware is configured to help get better compatibility and performance with various systems.
Template note icon.png
Note: If you don't find your exact operating system, don't worry, just pick something close. In a pinch, you can leave it at the default and you will be OK in most all cases.
Install From This will be the disc image that the new server will boot off of. This is a drop-down list of all '*.iso' files in the media library.
Driver Disc When needed, this is an optional disc image, like a second DVD drive with a second disc. This is usually used for drivers, as we will see later when we build a Windows server. It could also be a "Disc 2", if your operating system spans install media.
Memory This is where you tell the Anvil! how much RAM to allocate to the new server. This can be changed later as needed. The maximum available is shown just to the left of the field.
CPUs This is how many CPU cores to allocate to this server. It is a select box ranging based on how many cores are on the node. You can change this later, if you need.
Template note icon.png
Note: It is usually good to start with two cores, and increase the number later if performance dictates it.
Storage This is how much (replicated) disk space to allocate to the new server. The amount free is listed on the left of the input form. On the right are the units to use; MiB, GiB, TiB or a percentage.
Template note icon.png
Note: It is fairly easy to grow space in modern operating systems, but it is very hard to take it away. So it is always recommended to allocate a lower amount at first and add space later as needed.

Creating The RHEL 7 Server

  • Server Name

We're going to name the server, creatively, 'srv01-rhel7'. Purely out of our own convention, we like to prefix all names with a sequence number that grows for every server on any Anvil! system we manage. We do this so that we can be sure that, wherever we might move servers in the future, there won't be a name collision. You are more than free to do whatever makes sense to you. The second part is a simple descriptor and it can also be whatever makes sense to you. This server has no purpose, so we're going to just name it after the OS version.

  • Optimize For

Choose 'Red Hat Enterprise Linux 7'.

  • Install From

We're going to select 'rhel-server-7.3-x86_64-dvd.iso', which we downloaded outside of the tutorial and uploaded to the media library.

  • Driver Disc

RHEL 7 does not need any drivers or second install disk, so we'll leave this blank.

  • Memory

We will allocate 4 GiB of RAM to this server.

  • CPUs

As a rule of thumb, we recommend allocating two cores to new servers.

Post install, do performance and load testing to see if you really need more cores before allocating them. This helps ensure that no one server can unduly load the host. Of course, if you plan to host only one server on your Anvil!, start with N-2 cores; so if you have 8 cores, use 6. We'll stick with '2' for this example server.

  • Storage

RHEL 7 is pretty modest in its space needs, so we'll provision a modest 40 GiB.

The provision form filled out for the new server.
Note: The 'Add a New Disc' field was removed in v2.0.5.

Click on "Create"!

The new server has been created and added to the Anvil!.

The server is created and then two things now happen;

  1. The server is added to the Anvil!. Once complete, there is nothing more to do; it will already be under the Anvil! system's protection. You can return to the main menu.
  2. The server will appear in Virtual Machine Manager. You can double-click on it and proceed with the operating system install just as you would on a bare-iron server.
The new server is now on the Anvil! main page.

Going back to the Anvil! menu, you can now see the new server is up and running! We'll come back to managing the server in Striker later. Let's look at the operating system install now.

Installing The RHEL 7 Operating System

The new server is now visible in Virtual Machine Manager.

You can now see the new 'srv01-rhel7' server in VMM. Double-click on it.

Connected to 'srv01-rhel7'!

From here, you proceed as you would doing a normal OS install on normal hardware.

The screen isn't big enough...

The window won't auto resize, so if you find that the full screen isn't visible, you can resize it easily.

Resize to VM.

Click on 'View' -> 'Resize to VM'.

There, now you can see the full screen.

That's better!

Proceed with the install.

Now you can proceed with the OS install normally.

Install is done.

When the install is done, the installer will ask you to reboot. On some OSes, it will reboot on its own.

The server disappears on reboot.

When a server shuts off (or reboots), it will vanish from virtual machine manager. This is by design. The standard virtualization tools are not cluster-aware, so the Anvil! hides them to prevent accidental manipulation and to ensure split-brains can't happen.

The server is back.

Within a few seconds, the server will start back up again.

The server is ready for use!

Done!

At this point, you can switch to connecting to the server using SSH, RDP or any other remote access technology that you prefer. You should only ever have to use Virtual Machine Manager again if you lose remote access for some reason. These rare cases would be similar to having to go sit at a bare iron server after losing access for some reason.

With Virtual Machine Manager, because it is effectively a monitor plugged into a machine, you can monitor the full boot process, shutdown, boot off of rescue media and so on.

Post Install Notes

There are a few things to know about running servers on Anvil! systems.

Powering Down Gracefully

The Anvil! makes no attempt to "reach inside" your servers. No agents are needed and you can even use full disk encryption without any issue. To the Anvil!, network and storage are simply bit streams passed to and from the hardware.

The down-side to this, however, is that the system has limited ability to control your servers. Specifically with regard to emergency or planned shut downs.

To shut down a server, we send an ACPI "power button" event to your server. For this to be useful, you need to ensure that your operating system will react to this by gracefully shutting down applications and the operating system itself.

Template note icon.png
Note: How exactly you configure your server to handle power button events will depend entirely on your operating system and version. Please consult your operating system documentation, or search the web for 'ACPI <os name>'.
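
As an example only (this is guest OS configuration, not something the Anvil! does for you): on a systemd-based Linux guest such as RHEL 7 or SLES 12, the power button behaviour is controlled by logind. A minimal sketch:

# /etc/systemd/logind.conf inside the guest; shut down cleanly on an ACPI power button event.
[Login]
HandlePowerKey=poweroff

# Then apply the change inside the guest:
systemctl restart systemd-logind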

Please be sure to test graceful shutdowns using the Striker interface BEFORE going into production.

When Will Servers Be Powered Off?

There are only two times when the Anvil! will shut down your servers;

  1. When you ask it to, by clicking on 'Graceful Shut Down' in Striker.
  2. When ScanCore determines that a power off is imminent. This is the case if UPSes are almost depleted or if both nodes are approaching critical temperatures.
Template note icon.png
Note: You can disable hardware-related graceful shutdowns in '/etc/striker/striker.conf' by setting 'scancore::disable::power_shutdown' and/or 'scancore::disable::thermal_shutdown' to '1' on each node.
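
For example, to disable both kinds of hardware-triggered graceful shutdown on a node, the entries in '/etc/striker/striker.conf' would look like this (a minimal sketch based on the note above):

# Do not gracefully shut servers down when UPS power is nearly depleted.
scancore::disable::power_shutdown	=	1
# Do not gracefully shut servers down when both nodes approach critical temperatures.
scancore::disable::thermal_shutdown	=	1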

What If I Want To Shut Down The Server?

The Anvil! platform treats any unexpected shutdown of a server as a failure and turns the server back on. So trying to shut down a server using your operating system's normal power off command will result in a reboot.

To shut down a server and leave it off, you must first click on 'Graceful Shut Down' in Striker. This tells the Anvil! that you want the server to be off, and sends the ACPI power button event to your server to start the shutdown process for you.

A Warning On Automatic Operating System Updates and Windows

Many Microsoft Windows versions will install pending operating system updates while the operating system is shutting down. In an emergency shut down, this can significantly delay the shutdown process and could result in a hard power off in the middle of an update being installed. This could happen as UPS batteries are draining during a power outage, for example.

Also, because the Anvil! can't look inside your server, if the server hasn't powered off two minutes after being asked to, it has to assume that the server has crashed or hung, so the server will be forced off.

To deal with this;

  1. Disable automatic updates, install operating system updates manually on a schedule, and reboot afterwards so the updates are fully installed.
  2. Check to see if any updates are pending before gracefully shutting down a server. If there are, use your operating system's reboot command to install the updates before clicking on 'Graceful Shut Down' in Striker.

With this, you should never have a problem.

Provision A Windows 2012 Server

Template note icon.png
Note: Please review the "Provisioning A RHEL 7 Server" section, particularly the "Post Install Notes" section. Steps and information covered there will not be covered here.

The large majority of servers running on Anvil! systems are one flavour of Microsoft Windows or another, so the second server we will build will be a Microsoft Windows Server 2012 R2 guest.

The new server form has seven sections;

Entry Description
Server Name srv02-win2012
Optimize for Microsoft Windows Server 2012 (R2)
Install From Windows_2012_R2_x86_64.iso
Driver Disc virtio-win-0.1.126.iso
Memory 4 GiB
CPUs 2
Storage 80 GiB
The new server form.
Note: The 'Add a New Disc' field was removed in v2.0.5.
Template warning icon.png
Warning: Do NOT use the older 'virtio-win-0.1.102.iso'. It has a bug that causes live migrations to fail on some systems.

The main note here is that we're using a driver disk for this install. To get the best performance, the virtual server will use the 'virtio' network and storage devices. These look and act just like normal network cards and storage devices, but they are optimized for running on the KVM/qemu hypervisor.

Windows ships with a certain set of drivers baked into the install disc. For devices without drivers on the disc, you need to provide the drivers during or after the install. In our case, Windows won't see the virtual hard drive until we provide the drivers, which are on the 'virtio-win-0.1.126.iso' disc. Once the install is finished, we will need to add the network drivers.

The new 'srv02-win2012' server has been added to the Anvil!.

Click on 'Create' and the server will be added to the Anvil!.

Connect to 'srv02-win2012' in Virtual Machine Manager to do the OS install.

Use Virtual Machine Manager to connect to the new server. From there, the installation process is typical for Windows 2012.

Detecting Storage in srv02-win2012

The storage won't be found, so we will cover how to load the drivers to find it.

Storage wasn't found in 'srv02-win2012'.

Click on 'Load Driver' on the bottom left.

Browse to find the storage drivers.

Click on 'Browse' on the pop-up menu.

Expand the 'E:' drive.
Template note icon.png
Note: We're installing Windows 2012 R2, 64-bit ("amd64"). If you are installing a different version, please navigate to the appropriate subdirectory.

The driver disc will usually come up as the 'E:' drive. Expand it and navigate to 'E:\viostor\2k12R2\amd64' and click on 'OK'.

The 'Red Hat VirtIO SCSI controller' driver is in the list.

The 'Red Hat VirtIO SCSI controller' driver is displayed and highlighted. Click on 'Next'.

The storage drive is now available!

Now the server's storage drive is available!

Install is done!

Finish the operating system installation as you normally would.

First Login - How to send 'ctrl + alt + delete'

When using Virtual Machine Manager, you can't press 'ctrl + alt + del' directly, because Striker's OS will respond to it instead of the guest.

Send special key combinations with 'Send Key'.

To send special key combinations directly to the guest server, click on 'Send Key' in the menu bar above the screen. There you will see a selection of key combinations, including 'ctrl+alt+delete'.

Click that and you will get the log in prompt.

Post-Install Drivers

We're almost done. The last step is to install the network card and serial port drivers.

Missing drivers post-install.

In Device Manager, you will see two devices without drivers; 'Ethernet Controller' and 'PCI Simple Communication Controller'. The drivers for both are on the same driver disc which is still inserted into the new server's second optical drive.

Browsing the local system for drivers.

Right click on 'Ethernet Controller' and select the option to search for drivers.

Navigate to the driver disc.

Browse to the 'virtio-win-0.1.1' disc, usually in 'E:'.

Search the entire 'E:' drive.

You don't need to find the specific directory; just select the root of the disc, be sure that 'Include subfolders' is checked and then click on 'Next'.

The 'Red Hat VirtIO Ethernet Adapter' driver was found.

The 'Red Hat VirtIO Ethernet Adapter' driver will be found. You can, if you wish, check 'Always trust software from "Red Hat, Inc."' to trust future drivers. Click on 'Install'.

The 'Red Hat VirtIO Ethernet Adapter' driver has been installed.

The network driver is installed and you can now assign an IP address to your new server.

The 'VirtIO Serial Adapter' driver has been installed.

Repeat the process to install the 'VirtIO Serial Adapter'.

The basic graphics drivers.

By default, Windows 2012 will use generic 'Microsoft Basic Display Adapter' drivers. This is fine, but limits the display to 1024x768.

The Red Hat graphics drivers.

You can scan for new drivers and it will find the 'Red Hat, Inc. Display Adapters' driver.

The Red Hat QXL controller driver.

Once loaded, the new graphics adapter will show as a 'Red Hat QXL controller'. Now you can set much higher resolutions!

Done! All drivers are installed now.

With that, all drivers have been installed!

Proceed with all your normal server setup steps just as if the server was on bare iron.

Disable Power Management

By default, most versions of Microsoft Windows appear to ignore ACPI power button events once the display has been put to sleep.

This is a problem for servers on Anvil! systems because, if we need to shut down the server, we "tap the power button" (from the perspective of the virtual hardware). Normally, this causes the hosted server's operating system to begin shutting down. When the event is ignored, though, the guest never shuts off.

This is a particular concern when the Anvil! is performing an emergency shut down, as can happen in prolonged power outages or over temperature events.

The solution seems to be to configure Windows to never "power off the monitor".

  • Click on the 'Control Panel' icon.
The 'Control Panel' icon.
  • Click on 'Hardware'.
The 'Hardware' section.
  • Click on 'Change when the computer sleeps'.
The 'Change when the computer sleeps' option.
  • Change 'Turn off the display' from '10 Minutes' to 'Never'.
Change 'Turn off the display' to 'Never'.
  • Save the settings.
Click 'Save changes'.

Done!

Access The Second Disk (Optional)

Template note icon.png
Note: By default, Windows gives the two optical drives the letters 'd:' and 'e:'. This means that the new drive will be given the drive letter 'f:' or above. If you want to make the new disk 'd:', please be sure to change the drive letters assigned to the 'd:' drive before beginning.

If you told the Anvil! to create two disks for your server by setting 'Storage' to be '80:3550' (for example), Windows 2012 won't initially activate the second disk. So after the install completes, you will only see the '80 GiB' partition.

To resolve this, we need to bring the second disk online inside windows.

  • Click on the 'Start' icon on the far left of the bottom task bar.
  • Click on the 'Administrative Tools' icon.
'Administrative Tools' icon.
  • Double-click on 'Computer Management'.
'Computer Management' menu.
  • Click on 'Disk Management' under the 'Storage' menu. This will show all disks on the system.
  • Notice that 'Disk 1' is an 'Unknown' disk and that it is 'Offline'.
'Disk 1' is 'Offline'.
  • Right-click on the 'Disk 1' box to the left of the disk display and click on 'Online'.
Put 'Disk 1' 'Online'.
  • Disk 1 is now online. Next, we need to initialize it. Right-click again on the 'Disk 1' box to the left of the disk display and click on 'Initialize'.
Initializing 'Disk 1'.
  • In almost all cases, you will want to initialize the disk as a 'GPT' disk. When ready, click 'OK'.
Initializing 'Disk 1' as a GPT disk.
  • After a moment, the disk will be listed as being 'Unallocated'. Right-click somewhere on the disk display and click on 'New Simple Volume'.
Make 'Disk 1' a new simple volume.
  • This will open the 'New Simple Volume Wizard'. Click on 'Next' to begin.
The 'New Simple Volume Wizard', step 1.
  • It will ask you how big to make the volume. The default is to use all space, which is generally what you will want. Click 'Next' when ready.
The 'New Simple Volume Wizard', step 2.
  • Next, you will be asked to assign a drive letter to the new disk. On the Anvil!, disk 'd:' and 'e:' are used by the two optical drives, so you will be offered 'f:'. Click 'Next' to proceed.
The 'New Simple Volume Wizard', step 3.
  • Finally, you will be asked to format the new disk. In almost all cases, go with the defaults. Click 'Next' to finish.
The 'New Simple Volume Wizard', step 4.
  • Verify on the final screen that everything is as you want it, then click on 'Finish'.
Verify the new volume.
  • The new volume will be formatted. This can take a little bit.
The new volume is formatting.
  • When the format is finished, the new drive will be online and ready to go!
The new volume is ready!

Done!

SUSE Linux Enterprise Server

Template note icon.png
Note: Please review the "Provisioning A RHEL 7 Server" section, particularly the "Post Install Notes" section. Steps and information covered there will not be covered here.

The large majority of servers running on Anvil! systems are one flavour of Microsoft Windows or another, so the next server we will build will use SUSE Linux Enterprise Server 12.

The new server form has seven sections;

Entry         Description
Server Name   srv03-sles12
Optimize for  Suse Linux Enterprise Server 11
Install From  SLE-12-SP2-Server-DVD-x86_64-GM-DVD1.iso
Driver Disc   SLE-12-SP2-Server-DVD-x86_64-GM-DVD2.iso
Memory        4 GiB
CPUs          2
Storage       40 GiB

Note two things;

  1. The closest "optimize for" option was for SLES 11.
  2. We've specified a "driver" disc, which isn't a driver disc at all. It's simply the second disc needed for the install, and this is fine.
The new server form.
Note: The 'Add a New Disc' field was removed in v2.0.5.

Fill out the form as normal.

Install initializing...

Being a modern Linux distribution, the installation of SLES 12 is quite painless...

All hardware is detected without issue.

There is no need to load any drivers; everything is detected and loaded properly.

The installation process is no different from a hardware install.

The install proceeds perfectly normally.

Install completed!

The install finishes and automatically reboots.

Login screen.

Ready for use! That was easy.

FreeBSD 11

Template note icon.png
Note: Please review the "Provisioning A RHEL 7 Server" section, particularly the "Post Install Notes" section. Steps and information covered there will not be covered here.

FreeBSD is a direct descendant of UNIX and is still quite popular.

The new server form has seven sections;

Entry         Description
Server Name   srv04-freebsd11
Optimize for  FreeBSD 11.x
Install From  FreeBSD-11.0-RELEASE-amd64-dvd1.iso
Driver Disc   
Memory        4 GiB
CPUs          2
Storage       40 GiB

FreeBSD works without any special drivers.

The new server form.
Note: The 'Add a New Disc' field was removed in v2.0.5.

Fill out the form as usual.

The install media boot screen.

Readers familiar with BSD will have no trouble installing FreeBSD. Just press '<enter>' (or wait) to start the install.

Confirm the install.

Confirm that you want to install.

Partitioning plan.

The drive appears without any special drivers or issue.

Install under way.

The install proceeds as normal.

Intel e1000 network card.

The Anvil! emulates the network card as an Intel e1000 model, which FreeBSD natively supports.

Install is done.

The install finishes up normally. All done!

Microsoft Windows Server 2016

Template note icon.png
Note: Please review the "Provisioning A RHEL 7 Server" section, particularly the "Post Install Notes" section. Steps and information covered there will not be covered here.
Template note icon.png
Note: Please be sure to disable turning off the monitor, as was done in the Windows 2012 install.

The new server form has seven sections;

Entry         Description
Server Name   srv05-win2016
Optimize for  Microsoft Windows Server 2016
Install From  Windows_2016_64-bit_eval.ISO
Driver Disc   virtio-win-0.1.126.iso
Memory        8 GiB
CPUs          2
Storage       80 GiB

Windows 2016 has a nearly identical install process to 2012. We'll need to provide the virtio driver disc during the install to load the storage drivers. Post-install, we'll use "Device Manager" to install the network, serial and video drivers.

Given this similarity, we'll just show the special driver steps.

The new server form.
Note: The 'Add a New Disc' field was removed in v2.0.5.

Provision normally, taking care to add the virtio driver disc.

The virtio storage controller isn't found.

As before, Windows doesn't have the virtio drivers built in, so we'll add them from the driver disc.

Choosing the drivers.

Browsing to the drivers; d:\viostor\2k16\amd64.

C: drive is found, install can proceed.

Now the drive is found and the install can proceed as normal.

Device manager showing the network and serial devices without drivers.

As expected, the network card and serial port drivers are not loaded. We'll install them from the virtio driver disc as we did before.

All drivers loaded.

Update the drivers for the network card, serial adaptor and video card. Simply tell Windows to search from the root of the driver disc and it will find the drivers. When done, your server is ready to use!

CentOS 4

Template note icon.png
Note: Please review the "Provisioning A RHEL 7 Server" section, particularly the "Post Install Notes" section. Steps and information covered there will not be covered here.

CentOS and Red Hat Enterprise Linux 4 are both end-of-life and not supported.

Entry         Description
Server Name   srv06-centos4
Optimize for  Red Hat Enterprise Linux 4
Install From  CentOS-4.8-x86_64-binDVD.iso
Driver Disc   
Memory        2 GiB
CPUs          2
Storage       80 GiB
Template warning icon.png
Warning: Using unsupported operating systems is never recommended. Security vulnerabilities and bugs exist on almost all EOL'ed operating systems. When their use is required, be sure to secure them behind supported firewalls when possible.

CentOS 4, like many older operating systems, finds a lot of use in labs and factories running legacy applications. Older operating systems can be difficult to find new hardware for, and so are often virtualized. This is the scenario we imagine in this install.

The new server form.
Note: The 'Add a New Disc' field was removed in v2.0.5.

Fill out the provision form as before.

CentOS 4 install.

There is nothing special needed for CentOS 4.

CentOS 4 install complete.

It will install without any issue and you can immediately use it as you would had it been installed on hardware.

That was easy.

Pre-Flight Checks - Test Thoroughly or Crash

Before going into production, verifying that everything is wired up and working properly is critical.

This section will cover how to test, as best as possible, if you are ready for production.

This is a long process

The Anvil! has been designed to make it as easy as possible to build an extremely resilient platform, but it is all for nothing without proper testing. Nothing is perfect; you may have hardware that exposes a bug in our software. You may have a piece of hardware that is not operating perfectly and will fail when it takes on the full load in a degraded state. You may have made a simple mistake when you wired up your equipment.

Stuff happens.

The only way to make your Anvil! properly ready for production is extensive testing. So please, grab a coffee and settle in. This won't be exciting, but it is one of the most critical stages of a successful deployment.

Template warning icon.png
Warning: Throughout the testing procedures, have one or more servers running and monitor their availability. The ultimate determination of "success" is that the hosted servers are not interrupted. If, at any point during a test, a server becomes unavailable, the overall testing has failed. Fix the problem and restart the test. Remember; if the services are important enough to run on an Anvil!, they are important enough to test thoroughly!
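
A simple way to monitor availability is a continuous ping against one of the hosted servers, run from a machine outside the Anvil!. The address below is hypothetical; substitute one of your own servers' IPs;

# Run from a workstation outside the Anvil!; watch for dropped replies or pauses.
ping 10.255.0.100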

Monitoring Anvil! Status With 'anvil-cli-state'

Template note icon.png
Note: This is an optional step. Do this only if you are comfortable working in the shell of the Anvil! nodes.

If you are comfortable using a Linux terminal, you can log into the nodes and use the anvil-cli-state tool to monitor the status of the overall Anvil! system.

When called alone, it reports a "snapshot" view of the system and exits. Optionally, it can be invoked with the -i X switch, which tells it to stay running and refresh the status every X seconds.

/sbin/striker/anvil-cli-state -i 5
+---[ 'ctrl + c' to quit ]--------------------------------------------------------------------[ 22:23:14 ]---+
| Anvil!: an-anvil-01          Local: an-a01n01.alteeve.com             Peer: an-a01n02.alteeve.com          |
+------------------------------------------------------------------------------------------------------------+
|      Node      |      Membership      |   Resource Manager   |   Storage Services   |      Hypervisor      |
|   an-a01n01    |        Online        |        Online        |        Online        |        Online        |
|   an-a01n02    |        Online        |        Online        |        Online        |        Online        |
+------------------------------------------------------------------------------------------------------------+
| Server Name                              |    State     | Host                                             |
| srv01-rhel7                              |    Online    | an-a01n01.alteeve.com                            |
| srv02-win2012                            |    Online    | an-a01n01.alteeve.com                            |
| srv03-sles12                             |    Online    | an-a01n01.alteeve.com                            |
| srv04-freebsd11                          |    Online    | an-a01n01.alteeve.com                            |
| srv05-win2016                            |    Online    | an-a01n01.alteeve.com                            |
| srv06-centos4                            |    Online    | an-a01n01.alteeve.com                            |
+------------------------------------------------------------------------------------------------------------+
| DRBD ver.  8.4.9-1   -    Module Loaded                                                Timeout:   100 ms   |
|                                              Disk State                      Role                          |
|   Resource     |    Connection   |  an-a01n01   .  an-a01n02   |  an-a01n01   .  an-a01n02   | To Resync.  |
| r0:/dev/drbd0  |    Connected    |  Up To Date  |  Up To Date  |   Primary    |   Primary    |     --      |
+------------------------------------------------------------------------------------------------------------+
|       Bond        |   Active   |            Link 1            |            Link 2            |  Up Delay   |
| bcn_bond1  -  Up  | bcn_link1  | bcn_link1  -  Up  - 1000  FD | bcn_link2  -  Up  - 1000  FD | 120 seconds |
| ifn_bond1  -  Up  | ifn_link1  | ifn_link1  -  Up  - 1000  FD | ifn_link2  -  Up  - 1000  FD | 120 seconds |
| sn_bond1   -  Up  | sn_link1   | sn_link1   -  Up  - 1000  FD | sn_link2   -  Up  - 1000  FD | 120 seconds |
+------------------------------------------------------------------------------------------------------------+
[ Caution ] - This can place a high load on the node. Don't leave it running longer than necessary!

The output is coloured to make it a bit easier to parse the data:

Coloured output from anvil-cli-state -i 3

Letting this run on your nodes during testing can be a useful way to verify that the servers are operational.


If you run into trouble during the testing phase, feel free to contact us and we will be happy to help you get your Anvil! system production-ready.

Template note icon.png
Note: Be sure to have alerts sent to an email address you are monitoring at the 'warning' level. We will verify that tests succeeded both by visually confirming the Anvil!'s operation and by confirming that the appropriate alert messages were received.
Template note icon.png
Note: Some alerts will be split across a couple of separate emails. This is part of the reason you will want to wait a few minutes between the failure and recovery stages. By waiting, you give the alerts time to arrive and you increase confidence that the Anvil! is running properly in a degraded state.

All of these tests need to pass for the Anvil! to be considered ready for production.

Power Tests

These tests ensure that full power redundancy exists. They will also verify that, in the case of full input power loss, automatic load shedding, emergency shutdown and recovery all function properly.

Power Redundancy

In this test, we will shut down both UPSes, one at a time, simulating a total failure of a power circuit, UPS or PDU. We will leave each UPS in the failed state for at least five minutes to verify that ScanCore does not trigger load shedding.

This test will also cause one of the Striker dashboards to lose power and go offline. You will need to switch between the Striker dashboards when the corresponding power rail loses power.

Template note icon.png
Note: For this test, you can use either the UPS's front panel or its web interface to shut the UPS down.
  • Shut down UPS #1;

An alert should arrive within a minute that looks like this:

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  Is the server running and does the firewall allow connections on TCP port: [5432]?
 
Warning:
  Connection to the PDU: [an-pdu01.alteeve.com] has been lost!
 
Warning:
  The UPS: [an-ups01.alteeve.com] health status has changed:
- [The UPS is operating normally.] -> [The UPS is off. No power is being provided to down-stream equipment.]
 
Warning:
  The sensor: [FAN PSU1] has changed.
- [3600.000 rpm] -> [0.000 rpm]
- [ok] -> [cr]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [400.000rpm] -> [400.000rpm]

Here we see the ScanCore database on the lost Striker dashboard go offline, as well as the fan for PSU1 stopping.

  • Wait for five minutes to make sure normal operation proceeds.

If all servers and both nodes remain operational, then you are ready to restore power to UPS #1. Wait a minute or two for the PDU to energize and the lost Striker dashboard to boot back up.

Template warning icon.png
Warning: Part of this test is verifying that the Striker dashboards boot on their own. If the dashboard stays off when power is restored, the test has failed. Update the BIOS to boot on AC restore and then repeat the test.
When the power is restored, an alert similar to this should be sent;

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
 
Warning Cleared:
  The PDU model: [AP7900] with the serial number: [ZA1104027208] at the IP address: [10.20.2.1] has returned.
 
Important:
  The PDU: [an-pdu01.alteeve.com] has rebooted! The uptime changed; [1 h 43 m 47.9 s] -> [44 s]
 
Warning Cleared:
  Connection to the PDU: [an-pdu01.alteeve.com] has been restored.
 
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] health status has changed:
- [The UPS is off. No power is being provided to down-stream equipment.] -> [The UPS is operating normally.]
 
Warning:
  The temperature sensor: [PSU1] has jumped a large amount in a short period of time!
- [47.000 °C] -> [54.000 °C]
 
Warning:
  The sensor: [FAN PSU1] has changed.
- [0.000 rpm] -> [3520.000 rpm]
- [cr] -> [ok]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [400.000rpm] -> [400.000rpm]

Wait another five minutes or so to ensure that the recovery causes no interruption in server availability.

  • Now we will shut down UPS #2.

As before, alerts will be generated that look like this;

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  The sensor: [FAN PSU2] has changed.
- [3600.000 rpm] -> [0.000 rpm]
- [ok] -> [cr]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [400.000rpm] -> [400.000rpm]
 
Warning:
  The UPS: [an-ups02.alteeve.com] health status has changed:
- [The UPS is operating normally.] -> [The UPS is off. No power is being provided to down-stream equipment.]
 
Warning:
  Connection to the PDU: [an-pdu02.alteeve.com] has been lost!
  • Wait at least five minutes to again ensure that everything remains operating normally and that load shedding does not occur.

Restore power and wait again to ensure that the second Striker comes back online and that power redundancy is restored.

Alerts should arrive indicating that power has been restored.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Important:
  The PDU: [an-pdu02.alteeve.com] has rebooted! The uptime changed; [1 h 58 m 35.6 s] -> [30.2 s]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] health status has changed:
- [The UPS is off. No power is being provided to down-stream equipment.] -> [The UPS is operating normally.]
 
Warning:
  The sensor: [FAN PSU2] has changed.
- [0.000 rpm] -> [3600.000 rpm]
- [cr] -> [ok]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [400.000rpm] -> [400.000rpm]
 
Warning:
  The temperature sensor: [PSU2] has jumped a large amount in a short period of time!
- [46.000 °C] -> [54.000 °C]
  • Wait another five minutes to ensure that the second UPS recovers and does not cause any problems.

Once power has been restored, Striker 2 has come back online, and everything looks good, you can declare this portion of the tests a success!

Load Shedding with Recovery

In this test, we will unplug both UPSes from the mains power. This will cause both UPSes to begin running on battery power. Within a few minutes, ScanCore will consider the power outage to be non-transient and initiate a load shed to conserve power.

The goal of this test is to verify that load shedding happens. The node that will be shut down depends on which node is healthier and where the servers are running. In most cases, both nodes are equally healthy and the servers will be running on node 1, so node 2 will be selected for shut down.

Template note icon.png
Note: If you wish, you can test the shut down logic by migrating some servers so that both nodes are hosting servers, which invokes consolidation logic.
  • Unplug both UPSes and then wait.

Initially, you will get an alert indicating that power has been lost.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The UPS: [an-ups01.alteeve.com] has lost input power!:
- [119.5 vAC] -> [0.0 vAC]
 
Warning:
  The UPS: [an-ups02.alteeve.com] has lost input power!:
- [120.1 vAC] -> [0.0 vAC]
 
Warning:
  The UPS: [an-ups02.alteeve.com] health status has changed:
- [The UPS is operating normally.] -> [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.]
 
Warning:
  The UPS: [an-ups01.alteeve.com] health status has changed:
- [The UPS is operating normally.] -> [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.]
 
Warning:
  The node: [an-a01n01.alteeve.com] has entered a 'warning' state. 
Warning: If the peer goes critical, it will not migrate to this node and instead gracefully shut down its server.

Initially, nothing more will happen. So we will wait for a couple of minutes, then we should see one of the nodes power down.
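
While you wait, the cluster membership can be watched from the node that will stay up; a quick check, assuming the cman/rgmanager stack referenced in the alerts later in this test;

# Shows cluster member status and which node is hosting each server (service).
clustat

# Or refresh automatically every few seconds.
clustat -i 5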

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The UPSes feeding this node have been running on batteries for more than: [300] seconds.
To extend battery runtime, one of the nodes will now be withdrawn and shutdown.
Which node will shutdown will be determined momentarily.

Sure enough, a node shut down.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The node: [an-a01n02.alteeve.com] has gone offline. This can be caused by someone is doing work on your Anvil!, or if a node was withdrawn to shed load, or if the node's health went critical. If this was not expected, the node may have crashed and been fenced.

As seen in Striker;

Striker showing node 2 is offline after load shed.
  • Excellent!

Now, restore power and wait a few minutes. The Anvil! should see that the power is restored and turn the second node back on automatically.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] has input power again:
- [0.0 vAC] -> [118.8 vAC]
 
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] health status has changed:
- [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.] -> [The UPS is operating normally.]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] has input power again:
- [0.0 vAC] -> [120.1 vAC]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] health status has changed:
- [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.] -> [The UPS is operating normally.]
 
Warning Cleared:
  The node: [an-a01n01.alteeve.com] has returned to being healthy.
  • Within a few minutes, we should see an-a01n02.alteeve.com come back online and rejoin the Anvil! system.
Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] has input power again:
- [0.0 vAC] -> [118.8 vAC]
 
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] health status has changed:
- [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.] -> [The UPS is operating normally.]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] health status has changed:
- [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.] -> [The UPS is operating normally.]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] has input power again:
- [0.0 vAC] -> [120.1 vAC]
 
Warning:
  The link state of the network interface: [sn_link2] in the bond: [sn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The active interface in the bond: [bcn_bond1] has fallen back into link #2.
- [bcn_link1] -> [bcn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The link state of the network interface: [ifn_link2] in the bond: [ifn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The node has left the cluster: [an-anvil-01].
 
Warning Cleared:
  The node: [an-a01n02.alteeve.com] has returned to being healthy.

Note that some of the alerts above were generated during the shut down stage. In this example, the node shut down faster than the alerts could be dispatched. This is not a concern. Recall that ScanCore is not designed to be a monitoring solution, but to be a decision engine. Its priority is to protect your system, and to notify you whenever possible. This also shows how alerts that were generated but not sent are cached until the system is back online.

A few minutes later, the node will rejoin the Anvil! and full redundancy will be restored.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning Cleared:
  The active interface in the bond: [bcn_bond1] has returned to be the link #1.
- [bcn_link2] -> [bcn_link1]
 
Warning Cleared:
  The link state of the network interface: [ifn_link2] in the bond: [ifn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The link state of the network interface: [sn_link2] in the bond: [sn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning:
  The temperature sensor: [RAID Controller] has jumped a large amount in a short period of time!
- [74.000 °C] -> [80.000 °C]

As seen in Striker;

Striker showing node 2 back online.

Perfect! Test passed.

Load Shedding and Emergency Shut Down with Recovery

The last power test is going to take some time.

In this case, we're going to again pull the power to both UPSes. The difference this time is that we're going to leave the power unplugged until both nodes shut down. This will verify that the system can gracefully shut down the hosted servers and power off the remaining node before the UPSes fully deplete.

Once both nodes have gone offline, we will restore power and wait. ScanCore will wait for the UPSes to have a minimum charge (45% by default) and then boot up both nodes. The Anvil! should automatically restore full redundancy and boot hosted servers back up.
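
For reference, the kind of out-of-band power-on the Striker dashboards perform can also be done manually with IPMI. This is only a hedged sketch; the BMC address and credentials below are hypothetical examples;

# Check, and if needed restore, power to a node via its BMC (address and credentials are examples only).
ipmitool -I lanplus -H an-a01n01.ipmi -U admin -P secret chassis power status
ipmitool -I lanplus -H an-a01n01.ipmi -U admin -P secret chassis power on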

With power removed, we get the same alerts as before;

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The UPS: [an-ups01.alteeve.com] has lost input power!:
- [119.5 vAC] -> [0.0 vAC]
 
Warning:
  The UPS: [an-ups02.alteeve.com] health status has changed:
- [The UPS is operating normally.] -> [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.]
 
Warning:
  The UPS: [an-ups02.alteeve.com] has lost input power!:
- [120.1 vAC] -> [0.0 vAC]
 
Warning:
  The UPS: [an-ups01.alteeve.com] health status has changed:
- [The UPS is operating normally.] -> [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.]
 
Warning:
  The node: [an-a01n01.alteeve.com] has entered a 'warning' state. 
Warning: If the peer goes critical, it will not migrate to this node and instead gracefully shut down its server.

Initially, nothing more happens. After a few minutes, one of the nodes will be shut down to conserve battery power.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  The UPSes feeding this node have been running on batteries for more than: [300] seconds.
To extend battery runtime, one of the nodes will now be withdrawn and shutdown.
Which node will shutdown will be determined momentarily.

There goes node 2 again...

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The node: [an-a01n02.alteeve.com] has gone offline. This can be caused by someone is doing work on your Anvil!, or if a node was withdrawn to shed load, or if the node's health went critical. If this was not expected, the node may have crashed and been fenced.
  • Now we wait...
Template warning icon.png
Warning: It is important that we are patient. This testing needs to be as close to real world as possible, so now would be a good time to go take a walk or have a bite to eat. If your nodes draw little power and/or your UPSes are large, it could take some time for them to drain to the point where the estimated run time is low enough to trigger a complete shut down.

Eventually;

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The UPS: [an-ups02.alteeve.com] has discharged below the warning threshold!:
- [22.0%] -> [19.0%]
  Warning: If power is not restored soon, this UPS will be marked as unusable. If the other UPS(es) also drop below this level, your Anvil! will automatically shut down, if configured to do so.
 
Warning:
  The UPS: [an-ups01.alteeve.com] has discharged below the warning threshold!:
- [23.0%] -> [18.0%]
  Warning: If power is not restored soon, this UPS will be marked as unusable. If the other UPS(es) also drop below this level, your Anvil! will automatically shut down, if configured to do so.

Both UPSes are now very low on power, but have not yet dropped far enough to trigger a shut down. That happens when the stronger of the two UPSes has less than ten minutes of estimated hold-up time.

Subject: [ ScanCore ] - Critical - There is a message from an-a01n01.alteeve.com
Critical:
  The node: [an-a01n01.alteeve.com] has gone into "emergency stop" and is shutting down!

There it goes!

Servers should begin powering off gracefully, then the node will power itself off. At this point, the load on the UPSes will drop and the estimated hold up time will jump up a fair bit, given how low the draw is.

Template note icon.png
Note: If the power stays out long enough, the last bit of charge will be lost and the UPSes will shut down, taking the Striker dashboards with them. In this case, when power is restored, the Striker dashboards will power back up immediately, as we saw in the power loss tests earlier.

When power is restored, ScanCore on the Striker dashboards will begin monitoring the charge in the UPSes. Once the first UPS reaches 45% charge, the nodes will be powered back on.

Template warning icon.png
Warning: If possible, monitor the charge rate either by using the web interface for the UPSes or monitoring their LCD display. If the nodes turn on before at least one UPS reaches 45%, the test has failed.

Restore power and monitor the charge. Verify that both nodes boot once the UPSes reach a safe minimum charge.
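
If neither the web interface nor the LCD is convenient, the charge can also be polled over SNMP. This is a hedged sketch, assuming APC UPSes with SNMP enabled and the 'public' community string; the OID shown (PowerNet's advanced battery capacity) should be verified against your own UPS before relying on it;

# Poll the battery charge percentage from each UPS.
snmpget -v1 -c public an-ups01.alteeve.com .1.3.6.1.4.1.318.1.1.1.2.2.1.0
snmpget -v1 -c public an-ups02.alteeve.com .1.3.6.1.4.1.318.1.1.1.2.2.1.0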

... Some time later ...

The stronger UPS charges above 45% and both nodes power on!

This being a complete cold start, the boot process will take a few more minutes. The Anvil! is quite careful on cold start and performs several sanity checks to make sure everything is operating properly.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] has input power again:
- [0.0 vAC] -> [119.5 vAC]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] has charged to its "minimum good" charge percentage:
- [13.0%] -> [49.0%]
 
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] has charged to its "minimum good" charge percentage:
- [11.0%] -> [45.0%]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] health status has changed:
- [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.] -> [The UPS is operating normally.]
 
Warning Cleared:
  The UPS: [an-ups02.alteeve.com] has input power again:
- [0.0 vAC] -> [119.5 vAC]
 
Warning Cleared:
  The UPS: [an-ups01.alteeve.com] health status has changed:
- [The UPS is running on its batteries. It is likely that the input power feeding the UPS has failed.] -> [The UPS is operating normally.]
 
Warning:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The active interface in the bond: [bcn_bond1] has fallen back into link #2.
- [bcn_link1] -> [bcn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The active interface in the bond: [sn_bond1] has fallen back into link #2.
- [sn_link1] -> [sn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The active interface in the bond: [ifn_bond1] has fallen back into link #2.
- [ifn_link1] -> [ifn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The link state of the network interface: [ifn_link1] in the bond: [ifn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  It looks like 'rgmanager' (the cluster's resource manager) was stopped on the node: [an-a01n01.alteeve.com]. It is still an Anvil! member, but it can no longer recover lost servers.
 
Warning Cleared:
  The node: [an-a01n02.alteeve.com] has rejoined the Anvil!, but it is not yet able to take over lost servers. It should be ready in the next minute or so.
 
Warning Cleared:
  The node: [an-a01n01.alteeve.com] has returned to being healthy.

It begins!

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The active interface in the bond: [bcn_bond1] has returned to be the link #1.
- [bcn_link2] -> [bcn_link1]
 
Warning Cleared:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The active interface in the bond: [ifn_bond1] has returned to be the link #1.
- [ifn_link2] -> [ifn_link1]
 
Warning Cleared:
  The link state of the network interface: [ifn_link1] in the bond: [ifn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The active interface in the bond: [sn_bond1] has returned to be the link #1.
- [sn_link2] -> [sn_link1]
 
Warning Cleared:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The node: [an-a01n02.alteeve.com] is now a full member of Anvil!. As soon as its storage is 'UpToDate', it will be ready to take over servers.
 
Warning Cleared:
  The node: [an-a01n01.alteeve.com] is now a full member of Anvil!. As soon as its storage is 'UpToDate', it will be ready to take over servers.
 
Warning:
  The temperature sensor: [RAID Controller] has jumped a large amount in a short period of time!
- [73.000 °C] -> [79.000 °C]
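
The alerts note that each node is ready to take over servers once its storage is 'UpToDate'. If you want to confirm the resync state directly on a node, the DRBD 8.4 series shown earlier exposes it through /proc;

# Shows the connection state, disk states and, during a resync, the progress.
cat /proc/drbd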

Verify with Striker;

Striker showing node 2 back online.

Test passed!

Network Testing

This pair of tests is pretty straightforward.

Each switch will have its power removed, starting with switch #1.

This will cause Striker #1 to become unavailable, but Striker #2 will be working. At the same time, all three networks on both nodes will fail over to their backup links.

For this test to pass, all three "link1" interfaces must drop and no interruption in the servers can occur. Both nodes must remain in the Anvil! and Striker #2 must be accessible.

For this test, you can watch the incoming alerts. If you are comfortable, though, the anvil-cli-state tool will be of particular use as it reports the link states of the nodes.
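
If you would rather not leave anvil-cli-state running, the kernel's bonding driver reports the same link information directly; a quick check using the bond names from this tutorial;

# Show the active slave and the state of each link for the BCN bond.
cat /proc/net/bonding/bcn_bond1

# Or check all three bonds at once.
grep -E 'Currently Active Slave|Slave Interface|MII Status' /proc/net/bonding/*_bond1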

Run '/sbin/striker/anvil-cli-state -i 3' on both nodes;

Node #1 (an-a01n01), pre-test.
Node #2 (an-a01n02), pre-test.

Cutting Power to Switch #1

Cut the power to switch #1 to simulate a total failure of the switch. Leave the power out for a few minutes to ensure that everything keeps running properly.

Node #1 (an-a01n01), switch #1 down.
Node #2 (an-a01n02), switch #1 down.

As expected, all '*_link1' interfaces are down, all '*_link2' interfaces are now active and the Anvil! is continuing to operate properly. Alerts like these will be dispatched;

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  The active interface in the bond: [ifn_bond1] has fallen back into link #2.
- [ifn_link1] -> [ifn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The duplex state of the network interface: [bcn_link1] in the bond: [bcn_bond1] is unknown!
- [Full Duplex] -> [Unknown]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
  Note: It is normal for the duplex to be unknown when the link itself is down.
 
Warning:
  The active interface in the bond: [bcn_bond1] has fallen back into link #2.
- [bcn_link1] -> [bcn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The link speed of the network interface: [sn_link1] in the bond: [sn_bond1] is no longer known.
- [1 Gbps] -> [Unknown]
  Note: When the link state is down, the speed is marked "Unknown".
 
Warning:
  The duplex state of the network interface: [sn_link1] in the bond: [sn_bond1] is unknown!
- [Full Duplex] -> [Unknown]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
  Note: It is normal for the duplex to be unknown when the link itself is down.
 
Warning:
  The duplex state of the network interface: [ifn_link1] in the bond: [ifn_bond1] is unknown!
- [Full Duplex] -> [Unknown]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
  Note: It is normal for the duplex to be unknown when the link itself is down.
 
Warning:
  The failure count of the network interface: [bcn_link1] in the bond: [bcn_bond1] has increased.
- [0] -> [1]
  Note: If the cable is unplugged or the switch is reset, this failure count will go up. It is not, itself, a sign of a hardware failure.
 
Warning:
  The link speed of the network interface: [bcn_link1] in the bond: [bcn_bond1] is no longer known.
- [1 Gbps] -> [Unknown]
  Note: When the link state is down, the speed is marked "Unknown".
 
Warning:
  The failure count of the network interface: [ifn_link1] in the bond: [ifn_bond1] has increased.
- [0] -> [1]
  Note: If the cable is unplugged or the switch is reset, this failure count will go up. It is not, itself, a sign of a hardware failure.
 
Warning:
  The link speed of the network interface: [ifn_link1] in the bond: [ifn_bond1] is no longer known.
- [1 Gbps] -> [Unknown]
  Note: When the link state is down, the speed is marked "Unknown".
 
Warning:
  The link state of the network interface: [ifn_link1] in the bond: [ifn_bond1] has failed!
- [Up] -> [Down]
  Warning: The parent bond is now degraded. If the other link fails, the bond itself will fail and traffic will stop flowing.
 
Warning:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] has failed!
- [Up] -> [Down]
  Warning: The parent bond is now degraded. If the other link fails, the bond itself will fail and traffic will stop flowing.
 
Warning:
  The failure count of the network interface: [sn_link1] in the bond: [sn_bond1] has increased.
- [0] -> [1]
  Note: If the cable is unplugged or the switch is reset, this failure count will go up. It is not, itself, a sign of a hardware failure.
 
Warning:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] has failed!
- [Up] -> [Down]
  Warning: The parent bond is now degraded. If the other link fails, the bond itself will fail and traffic will stop flowing.
 
Warning:
  The active interface in the bond: [sn_bond1] has fallen back into link #2.
- [sn_link1] -> [sn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.

This is a good start.

Restore Power to Switch #1

It is not enough to survive failure; we must also survive recovery. So let's restore power to the switch and wait a few minutes.

Template note icon.png
Note: Most switches, when coming online, will show a link go up and down a few times. This is normal. The Anvil! will not switch back to a link until it has been up without interruption for 120 seconds.
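
If you want to confirm the configured delay on a node, the bonding driver exposes it (in milliseconds) through sysfs;

# 120000 ms corresponds to the 120 second 'up delay' mentioned above.
cat /sys/class/net/bcn_bond1/bonding/updelay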
Node #1 (an-a01n01), switch #1 coming back up.
Node #2 (an-a01n02), switch #1 coming back up.

The alerts start rolling in...

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] is now recovering.
- [Down] -> [Going Back Up]
  Note: This bond's primary link is now in the 'up delay' stage of recovering. It will not yet be used unless the backup network interface fails. Once the delay expires, the link will be deemed healthy and, if it comes up cleanly, the bond will switch to using it.
 
Warning Cleared:
  The network interface: [ifn_link1] in the bond: [ifn_bond1] is back to full duplex operating.
- [Unknown] -> [Full Duplex]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
 
Warning Cleared:
  The link speed of the network interface: [ifn_link1] in the bond: [ifn_bond1] is faster than it was before.
- [Unknown] -> [1 Gbps]
  Note: Normally, the link will recover at the same speed it used to be at. In this case, it is faster so it is likely the network interface was upgraded.
 
Warning:
  The link state of the network interface: [ifn_link1] in the bond: [ifn_bond1] is now recovering.
- [Down] -> [Going Back Up]
  Note: This bond's primary link is now in the 'up delay' stage of recovering. It will not yet be used unless the backup network interface fails. Once the delay expires, the link will be deemed healthy and, if it comes up cleanly, the bond will switch to using it.
 
Warning Cleared:
  The network interface: [sn_link1] in the bond: [sn_bond1] is back to full duplex operating.
- [Unknown] -> [Full Duplex]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
 
Warning Cleared:
  The link speed of the network interface: [bcn_link1] in the bond: [bcn_bond1] is faster than it was before.
- [Unknown] -> [1 Gbps]
  Note: Normally, the link will recover at the same speed it used to be at. In this case, it is faster so it is likely the network interface was upgraded.
 
Warning Cleared:
  The link speed of the network interface: [sn_link1] in the bond: [sn_bond1] is faster than it was before.
- [Unknown] -> [1 Gbps]
  Note: Normally, the link will recover at the same speed it used to be at. In this case, it is faster so it is likely the network interface was upgraded.
 
Warning:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] is now recovering.
- [Down] -> [Going Back Up]
  Note: This bond's primary link is now in the 'up delay' stage of recovering. It will not yet be used unless the backup network interface fails. Once the delay expires, the link will be deemed healthy and, if it comes up cleanly, the bond will switch to using it.
 
Warning Cleared:
  The network interface: [bcn_link1] in the bond: [bcn_bond1] is back to full duplex operating.
- [Unknown] -> [Full Duplex]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.

Above, we see the links coming up but waiting, as expected.

Two minutes after the switch finishes booting, everything goes back to normal.

Node #1 (an-a01n01), switch #1 is back!
Node #2 (an-a01n02), switch #1 is back!

Some more alerts confirming the links are up.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning Cleared:
  The active interface in the bond: [ifn_bond1] has returned to be the link #1.
- [ifn_link2] -> [ifn_link1]
 
Warning Cleared:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The link state of the network interface: [ifn_link1] in the bond: [ifn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] is now recovering.
- [Down] -> [Going Back Up]
  Note: This bond's primary link is now in the 'up delay' stage of recovering. It will not yet be used unless the backup network interface fails. Once the delay expires, the link will be deemed healthy and, if it comes up cleanly, the bond will switch to using it.
 
Warning Cleared:
  The link speed of the network interface: [sn_link1] in the bond: [sn_bond1] is faster than it was before.
- [Unknown] -> [1 Gbps]
  Note: Normally, the link will recover at the same speed it used to be at. In this case, it is faster so it is likely the network interface was upgraded.
 
Warning Cleared:
  The active interface in the bond: [sn_bond1] has returned to be the link #1.
- [sn_link2] -> [sn_link1]
 
Warning Cleared:
  The active interface in the bond: [bcn_bond1] has returned to be the link #1.
- [bcn_link2] -> [bcn_link1]
 
Warning Cleared:
  The network interface: [sn_link1] in the bond: [sn_bond1] is back to full duplex operating.
- [Unknown] -> [Full Duplex]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
 
Warning Cleared:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.

Test passed!

Cutting Power to Switch #2

The next test is to kill the second switch. In this case, no failover will happen in the nodes because the backup links will drop. We will lose Striker #2 though, which is fine. Our main test is to verify that all of the backup links actually drop, and that the switch loss and recovery causes no problems.

Node #1 (an-a01n01), switch #2 down.
Node #2 (an-a01n02), switch #2 down.

The alerts roll in...

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The duplex state of the network interface: [ifn_link2] in the bond: [ifn_bond1] is unknown!
- [Full Duplex] -> [Unknown]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
  Note: It is normal for the duplex to be unknown when the link itself is down.
 
Warning:
  The link speed of the network interface: [sn_link2] in the bond: [sn_bond1] is no longer known.
- [1 Gbps] -> [Unknown]
  Note: When the link state is down, the speed is marked "Unknown".
 
Warning:
  The duplex state of the network interface: [sn_link2] in the bond: [sn_bond1] is unknown!
- [Full Duplex] -> [Unknown]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
  Note: It is normal for the duplex to be unknown when the link itself is down.
 
Warning:
  The link state of the network interface: [sn_link2] in the bond: [sn_bond1] has failed!
- [Up] -> [Down]
  Warning: The parent bond is now degraded. If the other link fails, the bond itself will fail and traffic will stop flowing.
 
Warning:
  The failure count of the network interface: [ifn_link2] in the bond: [ifn_bond1] has increased.
- [12] -> [13]
  Note: If the cable is unplugged or the switch is reset, this failure count will go up. It is not, itself, a sign of a hardware failure.
 
Warning:
  The link speed of the network interface: [ifn_link2] in the bond: [ifn_bond1] is no longer known.
- [1 Gbps] -> [Unknown]
  Note: When the link state is down, the speed is marked "Unknown".
 
Warning:
  The link speed of the network interface: [bcn_link2] in the bond: [bcn_bond1] is no longer known.
- [1 Gbps] -> [Unknown]
  Note: When the link state is down, the speed is marked "Unknown".
 
Warning:
  The failure count of the network interface: [sn_link2] in the bond: [sn_bond1] has increased.
- [12] -> [13]
  Note: If the cable is unplugged or the switch is reset, this failure count will go up. It is not, itself, a sign of a hardware failure.
 
Warning:
  The duplex state of the network interface: [bcn_link2] in the bond: [bcn_bond1] is unknown!
- [Full Duplex] -> [Unknown]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
  Note: It is normal for the duplex to be unknown when the link itself is down.
 
Warning:
  The link state of the network interface: [ifn_link2] in the bond: [ifn_bond1] has failed!
- [Up] -> [Down]
  Warning: The parent bond is now degraded. If the other link fails, the bond itself will fail and traffic will stop flowing.
 
Warning:
  The link state of the network interface: [bcn_link2] in the bond: [bcn_bond1] has failed!
- [Up] -> [Down]
  Warning: The parent bond is now degraded. If the other link fails, the bond itself will fail and traffic will stop flowing.
 
Warning:
  The failure count of the network interface: [bcn_link2] in the bond: [bcn_bond1] has increased.
- [12] -> [13]
  Note: If the cable is unplugged or the switch is reset, this failure count will go up. It is not, itself, a sign of a hardware failure.

A short time later, you will see alerts that the PDUs and UPSes were lost.

Subject: Warning - There is a message from an-a01n02.alteeve.com
Warning:
  Connection to the PDU: [an-pdu02.alteeve.com] has been lost!
 
Warning:
  Connection to the PDU: [an-pdu01.alteeve.com] has been lost!
 
Warning:
  Communication with the UPS: [an-ups01.alteeve.com] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored. 
         Graceful shutdown in the case of a power loss will not function until 
         communication is restored!
 
Warning:
  Communication with the UPS: [an-ups02.alteeve.com] has been lost!
Warning: The UPS power state and hold up time can no longer be monitored. 
         Graceful shutdown in the case of a power loss will not function until 
         communication is restored!

This is expected because those devices were connected to switch #2.

So far, so good!

Restore Power to Switch #2

The final test for network resiliency is to restore the power to switch #2 and wait for it to rejoin the stack and all bonds to bring their backup links back online.

Node #1 (an-a01n01), switch #2 coming back up.
Node #2 (an-a01n02), switch #2 coming back up.

As before, the links will flutter as the switch comes online. This is expected and not a concern.

The alerts start to roll in...

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The network interface: [ifn_link2] in the bond: [ifn_bond1] is back to full duplex operating.
- [Unknown] -> [Full Duplex]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
 
Warning Cleared:
  The link speed of the network interface: [sn_link2] in the bond: [sn_bond1] is faster than it was before.
- [Unknown] -> [1 Gbps]
  Note: Normally, the link will recover at the same speed it used to be at. In this case, it is faster so it is likely the network interface was upgraded.
 
Warning Cleared:
  The link speed of the network interface: [ifn_link2] in the bond: [ifn_bond1] is faster than it was before.
- [Unknown] -> [1 Gbps]
  Note: Normally, the link will recover at the same speed it used to be at. In this case, it is faster so it is likely the network interface was upgraded.
 
Warning:
  The link state of the network interface: [bcn_link2] in the bond: [bcn_bond1] is now recovering.
- [Down] -> [Going Back Up]
  Note: This bond's backup link is now in the 'up delay' stage of recovering. Once the delay expires, the link will be deemed healthy. If the primary interface fails, it will be put into use immediately.
 
Warning Cleared:
  The link speed of the network interface: [bcn_link2] in the bond: [bcn_bond1] is faster than it was before.
- [Unknown] -> [1 Gbps]
  Note: Normally, the link will recover at the same speed it used to be at. In this case, it is faster so it is likely the network interface was upgraded.
 
Warning Cleared:
  The network interface: [bcn_link2] in the bond: [bcn_bond1] is back to full duplex operating.
- [Unknown] -> [Full Duplex]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
 
Warning Cleared:
  The network interface: [sn_link2] in the bond: [sn_bond1] is back to full duplex operating.
- [Unknown] -> [Full Duplex]
  Note: 'Full duplex' indicates that the interface can send and receive data at the same time. 'Half duplex' indicates that it can only send or receive at any given time.
 
Warning:
  The link state of the network interface: [sn_link2] in the bond: [sn_bond1] is now recovering.
- [Down] -> [Going Back Up]
  Note: This bond's backup link is now in the 'up delay' stage of recovering. Once the delay expires, the link will be deemed healthy. If the primary interface fails, it will be put into use immediately.
 
Warning:
  The link state of the network interface: [ifn_link2] in the bond: [ifn_bond1] is now recovering.
- [Down] -> [Going Back Up]
  Note: This bond's backup link is now in the 'up delay' stage of recovering. Once the delay expires, the link will be deemed healthy. If the primary interface fails, it will be put into use immediately.

and finally things stabilize.

Node #1 (an-a01n01), switch #2 back online!
Node #2 (an-a01n02), switch #2 back online!

The alerts will confirm that all is again well.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  Communication with the UPS: [an-ups01.alteeve.com] has been restored.
 
Warning Cleared:
  Communication with the UPS: [an-ups02.alteeve.com] has been restored.
 
Warning Cleared:
  The link state of the network interface: [sn_link2] in the bond: [sn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The link state of the network interface: [ifn_link2] in the bond: [ifn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The link state of the network interface: [bcn_link2] in the bond: [bcn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The PDU model: [AP7900] with the serial number: [5A1121E01686] at the IP address: [10.20.2.2] has returned.
 
Warning Cleared:
  The PDU model: [AP7900] with the serial number: [ZA1104027208] at the IP address: [10.20.2.1] has returned.
 
Warning Cleared:
  Connection to the PDU: [an-pdu01.alteeve.com] has been restored.
 
Warning Cleared:
  Connection to the PDU: [an-pdu02.alteeve.com] has been restored.

Success!

Proactive Live Migration Tests

Template warning icon.png
Warning: If you are using a Dell-based node with a PERC RAID controller, please install perccli on the nodes. The default storcli RPM won't find the PERC controller (despite using the same LSI controller), so disk-based monitoring will not work without perccli being installed.
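
Before starting, you can confirm that the controller is visible to the monitoring tool from the node's command line. This is only a sketch; the binary is typically installed as 'storcli64' (or 'perccli64' on PERC hardware), and the exact name depends on the RPM you installed.

storcli64 show

If the controller is listed, disk-based monitoring should work.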

This next set of tests will verify that ScanCore will properly weigh different node failure types and proactively live migrate servers accordingly.

This test will use the following failure scenario;

  1. Servers are running on node #1.
  2. A fan will be removed from node #1 to simulate the loss of a cooling fan.
  3. After five minutes, the servers will migrate to node #2.
  4. A hard drive will be removed from node #2, degrading its array.
  5. ScanCore will determine that the lost fan is less dangerous and live migrate the servers back to node #1.
  6. Node #2 has a hot-spare that will be drawn in to rebuild the array.
  7. Once the rebuild completes, node #2 will again be healthier than node #1, which still has a "failed" fan. So servers will again migrate to node #2.
  8. The fan will be re-installed into node #1, returning it to a healthy state.
  9. Both nodes are of equal health, so the servers will remain on node #2.

If all this works properly, this test will pass.

Simulating a Fan Failure

Before we start, confirm that all servers are indeed on node #1.

Starting point; All servers are on an-a01n01.
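
If you prefer to check from the command line, the cluster's own status tool can be run on either node. This is a simple sketch; the 'vm:' service names will match the servers you created earlier in this tutorial.

clustat

All 'vm:' services should be shown as 'started' on an-a01n01.alteeve.com.
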
Template warning icon.png
Warning: This step involves opening the chassis of a running server and ejecting a fan that is still in motion. This must be done carefully and according to the hardware manufacturer's instructions for hot-swapping a fan. If you aren't careful, you could hurt yourself or damage your machine!

Now carefully remove one of the fans from node #1.

A moment later, an alert indicating the loss of the fan will arrive.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The sensor: [FAN4 SYS] has changed.
- [7800.000 rpm] -> [0.000 rpm]
- [ok] -> [cr]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [1920.000rpm] -> [1920.000rpm]
 
Warning:
  The sensor: [FAN3 SYS] has changed.
- [7200.000 rpm] -> [0.000 rpm]
- [ok] -> [cr]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [1920.000rpm] -> [1920.000rpm]
Template note icon.png
Note: In this case, our nodes use stacked fans (two fans mounted back to back), so removing one fan unit causes two fan sensors to go missing.

At first, nothing more happens. This is by design; ScanCore doesn't want to overreact, so it waits to see if the problem is resolved shortly after the event. If it is, as would be the case when replacing the fan, then nothing will happen. If the fan is in the failed state for a few minutes, though, ScanCore will determine that the problem is persistent and live migration will begin.

Wait a few minutes, and an alert will arrive indicating that the preventative live migration has begun.

Template note icon.png
Note: Preventative migrations always "pull" servers. That is to say, the healthier node always initiates the migration of the peer's servers over to itself.
Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  This node is healthier that the peer. Initiating proactive live migration.
- All servers will now migrate from [an-a01n01.alteeve.com] to: [an-a01n02.alteeve.com]

Once the migration completes, another alert goes out.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  Preventative migration complete.

Sure enough, all the servers are now on node #2.

All servers are now on an-a01n02.

Excellent! First part of the test has passed.
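
As an aside, the servers on an m2 Anvil! are managed as rgmanager 'vm:' services, so if you ever need to move one by hand, you can request a live migration with clusvcadm. This is only a sketch with a hypothetical server name; 'srv01-test' would be one of the 'vm:' services shown by clustat.

clusvcadm -M vm:srv01-test -m an-a01n02.alteeve.com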

Ejecting a Drive from the RAID Array

Next, we will physically eject a drive from node #2's array, causing the disk to be listed as failed and degrading the RAID array.

Before we start, though, let's verify that node #2's storage is currently "Optimal". To do this, under "Anvil! Nodes - Controls", click on node #2's hostname to open the disk manager.

Clickable hostname for an-a01n02.

This will show the storage array state. In our case, the node has a RAID level 6 array with a hot spare.

Template note icon.png
Note: Clicking the host name opens the storage manager in a new tab or window.
Storage state and information for an-a01n02.
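
The same information can be read from node #2's command line, if you prefer. A minimal sketch, assuming the first (and only) controller is '/c0'; use the perccli binary instead on PERC hardware.

storcli64 /c0 show

The virtual drive should report an 'Optl' (optimal) state and the physical drives should be online, with the hot-spare listed as a spare.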

Everything looks good, so we're ready to pull a drive out.

Template warning icon.png
Warning: Hard drives with spinning platters are sensitive to being rotated out of the plane of platter rotation while the platters are still spinning. When you eject the drive, pull it out just a few centimetres and then wait about 15 seconds to let the platters spin down. When in doubt, please consult your hardware manufacturer's documentation on how to safely eject a spinning drive. Failure to exercise due caution could damage your drive.
Template note icon.png
Note: In our system, we have a hot-spare, so the controller will begin rebuilding the array as soon as the drive is removed. If you do not have a hot spare, don't worry. We will document the process of making the drive good again and re-adding it to your array.

As before, an alert will go out indicating that the drive was lost and the array is now (partially) degraded, and nothing more will happen at first.

Subject: [ ScanCore ] - Critical - There is a message from an-a01n02.alteeve.com
Warning:
  - The RAID controller: [0000000043113628]: 'Controller Status' changed: [Optimal] -> [Needs Attention]
  The severity of this alert depends on the new state. if the new state is 'Optimal', then there is no cause for concern.
  Most other states are likely worth investigating as soon as possible as the controller state should never change under normal conditions.
 
The RAID controller's Drive Group (DG) properties have changed:
- ID String: ................. [0000000043113628-vd0-dg0] -> [0000000043113628-vd0-dg0]
- Access: .................... [Read Write] -> [Read Write]
- RAID Level: ................ [RAID6] -> [RAID6]
- Array Size: ................ [681.09 GiB] -> [681.09 GiB]
- Array State: ............... [Optimal] -> [Partially Degraded]
- Consistent: ................ [Yes] -> [Yes]
- Write Cache: ............... [Write-Back] -> [Write-Back]
- Read Cache: ................ [No Read-Ahead] -> [No Read-Ahead]
- Disk Cache: ................ [Direct IO] -> [Direct IO]
- CacheCade: ................. [No] -> [No]
- Scheduled Consistency Check: [Off] -> [Off]
- Raw Cache String: .......... [NRWBD] -> [NRWBD]
The Physical Drive (PD) properties have changed:
- On Controller: ....... [0000000043113628] -> [0000000043113628]
- Virtual Drive: ....... [9999] -> [0]
- Drive Group: ......... [9999] -> [0]
- Enclosure ID: ........ [252] -> [252]
- Slot Number: ......... [7] -> [7]
- Vendor: .............. [TOSHIBA] -> [TOSHIBA]
- Model: ............... [MK1401GRRB] -> [MK1401GRRB]
- Serial Number: ....... [Y3E0A00GFSR2] -> [Y3E0A00GFSR2]
- Capacity: ............ [136.218 GB] -> [136.218 GB]
- Sector Size: ......... [512 B] -> [512 B]
- Self-Encrypting Drive? [N] -> [N]
Warning:
  - The Physical Drive: [Y3D0A0ECFSR2] on the RAID controller: [0000000043113628] has vanished!

We can confirm that the drive was lost by refreshing the drive management window.

Template note icon.png
Note: Notice how disk 252:7 is now a member of the array and that the drive is rebuilding.
Confirming array degradation in an-a01n02's storage manager.

Sure enough, a few minutes later, node #1 starts pulling the servers back.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  This node is healthier that the peer. Initiating proactive live migration.
- All servers will now migrate from [an-a01n02.alteeve.com] to: [an-a01n01.alteeve.com]

Once the last server migrates, we will get another notification saying that it is done.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  Preventative migration complete.

We can confirm that the servers have returned to node #1 in Striker.

Confirming servers are back on an-a01n01.

Excellent!

Waiting For the RAID Array To Rebuild

The next test is to just sit back and do nothing. Once the replacement drive has finished rebuilding and the RAID array on node #2 returns to an "Optimal" state, ScanCore will determine that it is again healthier than node #1, which still has a "failed" fan, and again migrate the servers over to node #2.

Template note icon.png
Note: If you don't have a hot-spare, please read: Managing Drive Failures with Striker and follow the instructions on how to make a drive good again. Once good, add the drive back to your array to start the rebuild process.
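
If you would like to watch the rebuild progress from node #2's command line, the controller can report it directly. This is a sketch only; '/c0/e252/s7' matches the controller, enclosure and slot of the rebuilding drive in this example system, so adjust the path to match your own hardware (and use perccli64 on PERC controllers).

storcli64 /c0/e252/s7 show rebuild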

Once the array finishes rebuilding, you will get an alert informing you of this.

Subject: [ ScanCore ] - Critical - There is a message from an-a01n02.alteeve.com
Warning:
  - The RAID controller: [0000000043113628]: 'Controller Status' changed: [Needs Attention] -> [Optimal]
  The severity of this alert depends on the new state. if the new state is 'Optimal', then there is no cause for concern.
  Most other states are likely worth investigating as soon as possible as the controller state should never change under normal conditions.
 
The RAID controller's Drive Group (DG) properties have changed:
- ID String: ................. [0000000043113628-vd0-dg0] -> [0000000043113628-vd0-dg0]
- Access: .................... [Read Write] -> [Read Write]
- RAID Level: ................ [RAID6] -> [RAID6]
- Array Size: ................ [681.09 GiB] -> [681.09 GiB]
- Array State: ............... [Partially Degraded] -> [Optimal]
- Consistent: ................ [Yes] -> [Yes]
- Write Cache: ............... [Write-Back] -> [Write-Back]
- Read Cache: ................ [No Read-Ahead] -> [No Read-Ahead]
- Disk Cache: ................ [Direct IO] -> [Direct IO]
- CacheCade: ................. [No] -> [No]
- Scheduled Consistency Check: [Off] -> [Off]
- Raw Cache String: .......... [NRWBD] -> [NRWBD]

As before, nothing further will happen at first. ScanCore will wait a few minutes to make sure things are settled and then migrate the servers once again.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  This node is healthier that the peer. Initiating proactive live migration.
- All servers will now migrate from [an-a01n01.alteeve.com] to: [an-a01n02.alteeve.com]

The migrations have begun!

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  Preventative migration complete.

Once again, all servers are on node #2, given that it is the healthier one. We can confirm this in Striker;

Confirming all servers are once again on an-a01n02.

Excellent, almost done!

Restoring the Fan in Node #1

The final test is to restore the fan in node #1, then wait and verify that the servers do not migrate.

The goal of this test is to show that ScanCore won't migrate servers without a good reason. There is nothing special about node #1 over node #2, so there is no reason to migrate servers again if node #2 is already perfectly healthy.

So, plug the fan back in...


Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The sensor: [FAN3 SYS] has changed.
- [0.000 rpm] -> [16440.000 rpm]
- [cr] -> [ok]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [1920.000rpm] -> [1920.000rpm]
 
Warning:
  The sensor: [FAN4 SYS] has changed.
- [0.000 rpm] -> [15480.000 rpm]
- [cr] -> [ok]
- Thresholds:
  - High critical: [--] -> [--]
  - High warning:  [--] -> [--]
  - Low warning:   [--] -> [--]
  - Low critical:  [1920.000rpm] -> [1920.000rpm]

Now wait about ten minutes and make sure nothing further happens.
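
If you would like to confirm locally that the BMC sees the fans spinning again, ipmitool can read the sensor data directly on node #1. A simple sketch; the sensor names ('FAN3 SYS', 'FAN4 SYS') are specific to our hardware.

ipmitool sdr type Fan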

After waiting, verify on Striker that the servers are still on node #2.

Confirming servers stayed on an-a01n02.

Hoozah!! All tests passed!

Fence Testing

Perhaps the most important of all testing is the fence testing.

Fencing is the part of the Anvil! system that is responsible for ejecting a node from the cluster when it enters an unknown state. Generally, this is done by forcing the target node to power off.

In the Anvil!, a node can be fenced by two different mechanisms;

  1. Forcing it off using the node's IPMI interface.
  2. Cutting its power by opening the circuits feeding the node on the two PDUs.

When possible, IPMI-based fencing is preferred. If the target node's IPMI interface tells us that the node is dead, we can be sure that it is dead. However, the IPMI BMC is inside the target node itself and so can die along with the node in certain failure conditions; examples are total power loss, internal power shorts, failed voltage regulators and so on.

In the Anvil!, failing to talk to the node and failing to talk to the IPMI interface is NOT confirmation that the node is dead. What if the problem is that all networking failed? All we know for sure is that we don't know what state the node is in.

So as a backup, the Anvil! will log into both PDUs and ask them to cut power to the ports feeding the target node. Once both PDUs have opened the requested circuits, we can be certain that the target node is dead. The downside to PDU fencing, and the reason we use it as a backup method, is that it reports "success" as soon as the circuits have been opened. The PDUs can't tell us for sure that the node is truly dead; if the power cables were moved around, the node could still be alive.
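
Both fence paths can be checked by hand with their fence agents before running the tests below. This is a sketch only; the addresses, user names, passwords and PDU port number are examples and must match what you entered in the Install Manifest.

fence_ipmilan -a 10.20.11.1 -l admin -p 'secret' -o status
fence_apc -a 10.20.2.1 -l apc -p 'secret' -n 1 -o status

Both commands should report "on"; note that the PDU agent reports the state of the given plug, not the node itself, which is exactly the limitation described above.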

Template warning icon.png
Warning: This is why it is so important to specify which ports are used for each node in the Install Manifests. It is also why it is critical that no one moves power cables around after the deployment. Power cables should be locked to their nodes and strapped to the PDU's cable management trays before these tests are performed.

These tests will involve four steps;

  1. Causing node #1 to kernel panic and verifying it is fenced via IPMI.
  2. Unplugging node #1's IPMI network cable and panicking it again, verifying it is fenced via the two PDUs.
  3. Causing node #2 to kernel panic and verifying it is fenced via IPMI.
  4. Unplugging node #2's IPMI network cable and panicking it again, verifying it is fenced via the two PDUs.

During this test, when we panic the host running the servers, all hosted servers will be lost. This test also verifies that all servers can boot at the same time successfully.

Template note icon.png
Note: We will be starting with the servers all running on node #2. If this isn't the case in your system, don't worry. What matters is that, when the host under the servers is crashed, the servers reboot on the peer.
Fence test starting point.

Panic'ing Node #1

The Magic SysRq trigger 'c' tells the Linux kernel to immediately panic (crash). When this happens, the system instantly becomes unresponsive and all processing and operation stops.

Doing this to a running node will cause it to be declared lost by node #2. It will then be fenced, first by IPMI, which should succeed because the crash was in the operating system, leaving the hardware alone.

So for this test to succeed, node #1 should be rebooted automatically. If this happens, the test passes.

Secondary to this test, the node should rejoin the Anvil! and restore full redundancy automatically.
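
If you would like to watch the fence action as it happens, open a second terminal on node #2 before triggering the panic and follow the system log; the 'fenced' daemon will log the fence request and its result there.

tail -f /var/log/messages | grep -i fence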

Template warning icon.png
Warning: This test will use a command line call to trigger the kernel panic. If you don't feel comfortable doing this, you can instead press and hold the power button on node #1 until it powers down.

Log into an-a01n01.alteeve.com directly or by using the terminal program from either Striker dashboard.

ssh root@10.20.10.1
[root@an-a01n01 ~]#
Template warning icon.png
Warning: The next command will crash the node (that is the point, but still...).
echo c > /proc/sysrq-trigger
<there is no output, the system is now dead>

Within a short time, alerts will start to come in.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  The node: [an-a01n01.alteeve.com] has gone offline. This can be caused by someone is doing work on your Anvil!, or if a node was withdrawn to shed load, or if the node's health went critical. If this was not expected, the node may have crashed and been fenced.

Checking with Striker;

Node #1 is gone, all servers are running on node #2.

Log into your servers and verify that they are still working properly. If they are, the fence worked!

Template note icon.png
Note: Strictly speaking, what matters in a fence is that the node is ejected from the Anvil!. We hope that it returns, but that is not what matters.

In this test, we know that nothing is actually wrong with node #1 so it should boot back up and rejoin the Anvil! automatically. We'll see this as new alerts roll in showing it coming back online.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning Cleared:
  The node: [an-a01n01.alteeve.com] has rejoined the Anvil!, but it is not yet able to take over lost servers. It should be ready in the next minute or so.

When a node boots up, it runs a program called 'anvil-safe-start' that carefully checks numerous things, like network connections, before connecting to the peer to rejoin the Anvil!. Once connected, it walks through, step by step, starting the various services to make sure everything is healthy before fully rejoining the Anvil! peer.

This process takes a little while, and if you check Striker while it is running, you will see a message like this;

Node #1 running anvil-safe-start.

Please be patient!
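
If you want to follow the progress from the command line, you can watch the cluster's status from node #2 while anvil-safe-start does its work. A simple sketch:

watch -n 5 clustat

You will see node #1 go from 'Offline' to 'Online' and, eventually, its rgmanager will show as running again.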

"Slow and steady" is one of the Anvil! platform's mottos. As you wait, you will see more alerts roll in from both nodes showing the node coming back online.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The link state of the network interface: [bcn_link2] in the bond: [bcn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The active interface in the bond: [sn_bond1] has fallen back into link #2.
- [sn_link1] -> [sn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The link state of the network interface: [ifn_link1] in the bond: [ifn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The active interface in the bond: [ifn_bond1] has fallen back into link #2.
- [ifn_link1] -> [ifn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning Cleared:
  The link state of the network interface: [bcn_link2] in the bond: [bcn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning:
  It looks like 'rgmanager' (the cluster's resource manager) was stopped on the node: [an-a01n01.alteeve.com]. 
It is still an Anvil! member, but it can no longer recover lost servers.
- NOTE: If this node's rgmanager has stopped, it's view of the peer's state may be inaccurate.
 
Warning:
  It looks like 'rgmanager' (the cluster's resource manager) was stopped on the node: [an-a01n02.alteeve.com]. 
It is still an Anvil! member, but it can no longer recover lost servers.
- NOTE: If this node's rgmanager has stopped, it's view of the peer's state may be inaccurate.

Take note of the last two messages. From the perspective of node #1, it looks as though both nodes rejoined the Anvil!. As we've seen, node #2 has been a member the entire time. This is an example of how alerts are written from the perspective of the sender only. We saw no messages about node #2 going offline and we confirmed ourselves that node #2 was fine, so we can heed the note about the peer's state possibly being inaccurate and ignore that message.

Once safe-start finishes, the node will be fully back into the Anvil! and recovery will be complete.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The active interface in the bond: [ifn_bond1] has returned to be the link #1.
- [ifn_link2] -> [ifn_link1]
 
Warning Cleared:
  The link state of the network interface: [ifn_link1] in the bond: [ifn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] is back to full operational state.
- [Going Back Up] -> [Up]
  Note: The 'up delay' timer has expired and the interface is again deemed to be healthy and ready for use.
 
Warning Cleared:
  The active interface in the bond: [sn_bond1] has returned to be the link #1.
- [sn_link2] -> [sn_link1]
 
Warning Cleared:
  The node: [an-a01n01.alteeve.com] is now a full member of Anvil!. 
As soon as its storage is 'UpToDate', it will be ready to take over servers.
- NOTE: If this node is the one that joined, it's view of the peer's state may be inaccurate.
 
Warning Cleared:
  The node: [an-a01n02.alteeve.com] is now a full member of Anvil!. 
As soon as its storage is 'UpToDate', it will be ready to take over servers.
- NOTE: If this node is the one that joined, it's view of the peer's state may be inaccurate.

Confirm with Striker;

Node #1 is back in the Anvil! system.

First fence test (and recovery) passed!

Cutting Power to Node #1

For this test, we're going to pull the IPMI network cable out of the node and then trigger a panic again. This time, though, IPMI fencing will fail because the network cable is unplugged. This will force node #2 to fall back to the PDUs to cut the power to the node.

For this test to pass, the node will have to lose all power and then have its power restored. This will leave the node powered off until ScanCore on one of the Striker dashboards recovers it.

Template note icon.png
Note: ScanCore will not be able to boot the node until it can reach its IPMI BMC again. We will plug the network cable back in once we verify the power cycle occurred.

As before, the core test is that the node powers off and then power is restored. This is all that is needed for node #2 to know the state of the peer and resume normal operation. The rebooting of node #1 and it rejoining the Anvil! is a secondary test.

Template note icon.png
Note: Unplug the IPMI network cable from node #1 now.
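
If you want to confirm that the BMC really is unreachable before crashing the node, you can query it from node #2. This is a sketch only; the address, user and password are examples and should match node #1's IPMI settings from the Install Manifest.

ipmitool -I lanplus -H 10.20.11.1 -U admin -P 'secret' chassis power status

With the cable unplugged, this should time out or report an error rather than returning the power state.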

Next, log into node #1 again using the terminal from either dashboard.

ssh root@10.20.10.1
[root@an-a01n01 ~]#

Next, cause node #1 to panic again.

Template warning icon.png
Warning: The next command will crash the node (that is the point, but still...).
Template note icon.png
Note: It isn't required, but you may want to have eyes on node #1 to visually confirm that it loses power. Even if you don't watch it, though, if the node later recovers you will know the power cycle worked, because power cycling is the only way to recover the node in this state.
echo c > /proc/sysrq-trigger
<there is no output, the system is now dead>
Template note icon.png
Note: For advanced users; If you check the system logs on node #2, you will see the following message;
May 19 12:41:19 an-a01n02 python: Failed: Unable to obtain correct plug status or plug is not available#012
May 19 12:41:22 an-a01n02 fenced[3899]: fence an-a01n01.alteeve.com dev 0.0 agent fence_ipmilan result: error from agent
May 19 12:41:22 an-a01n02 fenced[3899]: fence an-a01n01.alteeve.com success
This shows that the IPMI fence method failed, as expected, then the PDU fence method succeeded.

Within a short time, alerts will start to come in.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n02.alteeve.com
Warning:
  The node: [an-a01n01.alteeve.com] has gone offline. This can be caused by someone is doing work on your Anvil!, or if a node was withdrawn to shed load, or if the node's health went critical. If this was not expected, the node may have crashed and been fenced.

Excellent! Log into your servers to confirm they are working properly, and you can use Striker again to verify that node #1 has been removed from the Anvil!.

Template note icon.png
Note: Plug node #1's IPMI network cable back in again. ScanCore should then reboot the node within a minute or so.

Now we wait for the node to rejoin the Anvil! to verify that ScanCore on the Striker dashboards successfully rebooted it.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The link state of the network interface: [sn_link1] in the bond: [sn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The active interface in the bond: [sn_bond1] has fallen back into link #2.
- [sn_link1] -> [sn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The active interface in the bond: [bcn_bond1] has fallen back into link #2.
- [bcn_link1] -> [bcn_link2]
  Warning: This usually is a sign of a failure in the primary link. It can be caused by the cable failing or being removed, the plug in the node or switch failing or the primary switch failing.
 
Warning:
  The link state of the network interface: [ifn_link2] in the bond: [ifn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  The link state of the network interface: [bcn_link1] in the bond: [bcn_bond1] has went down and is coming back up!
- [Up] -> [Going Back Up]
  Note: If someone is working on the Anvil!'s network, this might be expected. If not, this might be the sign of a flaky network cable or plug.
 
Warning:
  It looks like 'rgmanager' (the cluster's resource manager) was stopped on the node: [an-a01n01.alteeve.com]. 
It is still an Anvil! member, but it can no longer recover lost servers.
- NOTE: If this node's rgmanager has stopped, it's view of the peer's state may be inaccurate.
 
Warning:
  It looks like 'rgmanager' (the cluster's resource manager) was stopped on the node: [an-a01n02.alteeve.com]. 
It is still an Anvil! member, but it can no longer recover lost servers.
- NOTE: If this node's rgmanager has stopped, it's view of the peer's state may be inaccurate.

Here it comes!

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The node: [an-a01n02.alteeve.com] is now a full member of Anvil!. 
As soon as its storage is 'UpToDate', it will be ready to take over servers.
- NOTE: If this node is the one that joined, it's view of the peer's state may be inaccurate.
 
Warning Cleared:
  The node: [an-a01n01.alteeve.com] is now a full member of Anvil!. 
As soon as its storage is 'UpToDate', it will be ready to take over servers.
- NOTE: If this node is the one that joined, it's view of the peer's state may be inaccurate.

Success! The node was power fenced and, once IPMI was back, automatically recovered.

Panic'ing Node #2

This test is going to be exactly the same as the panic test on node #1. The key difference is that, this time, the servers are running on node #2.

When we panic node #2, all hosted servers will die as well. This simulates a total and unexpected failure of node #2 where ScanCore was not able to predict the fault and proactively migrate the servers.

Once node #1 ejects node #2, it will reboot all of the servers on node #1. So for this test, be sure to monitor your servers and see how long it takes for them to come back online.
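
A rough way to time the recovery is to ping one of the servers from your workstation and note when it starts answering again. A simple sketch; replace the address with one of your servers' IFN IP addresses.

while ! ping -c 1 -W 1 10.255.0.100 >/dev/null 2>&1; do sleep 1; done; echo "Server is answering again: $(date)"

Expect a gap while node #2 is fenced and the server boots fresh on node #1.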

Verify once more that the servers are, indeed, on node #2.

Node #2 is hosting all servers.

Log into node #2.

ssh root@10.20.10.2
[root@an-a01n02 ~]#

And cause it to crash.

Template warning icon.png
Warning: The next command will crash the node (that is the point, but still...).
echo c > /proc/sysrq-trigger
<there is no output, the system is now dead>
Template note icon.png
Note: The node hosting the servers will have a delay when fencing it. This is to ensure the node hosting the servers wins if both nodes should ever try to fence each other at the same time. As such, in this case, there will be a 15 second delay before the node is rebooted.

All servers will go offline and a few moments later, node #2 will be ejected from the Anvil!.

The alerts start rolling in...

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The node: [an-a01n02.alteeve.com] has gone offline. This can be caused by someone is doing work on your Anvil!, or if a node was withdrawn to shed load, or if the node's health went critical. If this was not expected, the node may have crashed and been fenced.

All the servers will boot back up on node #1 and services should resume normal operation. Be sure to log into your servers and verify that they rebooted properly.

All servers now running on node #1.

Success!

Now we wait for node #2 to return to the Anvil!.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The node: [an-a01n02.alteeve.com] is now a full member of Anvil!. 
As soon as its storage is 'UpToDate', it will be ready to take over servers.
- NOTE: If this node is the one that joined, it's view of the peer's state may be inaccurate.

It's back!

Node #2 is back in the Anvil!.

Fence and recovery tests passed!

Cutting Power to Node #2

The final test will verify that node #2 can be fenced using the PDUs.

First, unplug the IPMI network cable for node #2. Once unplugged, we'll again log into node #2 and cause it to panic. For this test to pass, node #2 will need to be power cycled to recover. Once you confirm that node #2 has been rebooted, plug the IPMI cable back in and wait to verify that ScanCore reboots it.

Log into node #2;

ssh root@10.20.10.2
[root@an-a01n02 ~]#

With all the servers now on node #1, we won't need to recover the servers again.

Now panic node #2.

Template warning icon.png
Warning: The next command will crash the node (that is the point, but still...).
echo c > /proc/sysrq-trigger
<there is no output, the system is now dead>

Now that the servers are on node #1, the fence delay we saw in the last test will not apply, as ScanCore has adjusted it to favour node #1.

Shortly, the alerts will start to come in.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning:
  The node: [an-a01n02.alteeve.com] has gone offline. This can be caused by someone is doing work on your Anvil!, or if a node was withdrawn to shed load, or if the node's health went critical. If this was not expected, the node may have crashed and been fenced.
Template note icon.png
Note: Plug node #2's IPMI network cable back in.

Within a minute, node #2 should be powered back on by ScanCore and we'll start seeing messages arrive that it is coming back online.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The node: [an-a01n02.alteeve.com] has rejoined the Anvil!, but it is not yet able to take over lost servers. It should be ready in the next minute or so.

Once anvil-safe-start finishes, the node will be back in the Anvil! and the recovery portion of the test will be complete.

Subject: [ ScanCore ] - Warning - There is a message from an-a01n01.alteeve.com
Warning Cleared:
  The node: [an-a01n02.alteeve.com] is now a full member of Anvil!. 
As soon as its storage is 'UpToDate', it will be ready to take over servers.
- NOTE: If this node is the one that joined, it's view of the peer's state may be inaccurate.

All fencing tests have passed.

Congratulations! You are ready for production!

Administrative Tasks

Have fun with your Anvil! platform!

We're always here to help.

 
