Scancore

From Alteeve Wiki
Revision as of 03:28, 13 November 2023 by Digimer (talk | contribs) (Created page with "{{howto_header}} {{warning|1=This is little more that raw notes, do not consider anything here to be valid or accurate at this time.}} = Scancore - The Decision Engine = Scancore is, at its core, a "decision engine". It was created as a way for Anvil! systems to make intelligent decisions based on data coming in from any number of places. It generates alerts for admins, so in this regard it is an alert and monitoring solution, but that is almost a secondary benef...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

 Alteeve Wiki :: How To :: Scancore

Warning: This is little more that raw notes, do not consider anything here to be valid or accurate at this time.

Scancore - The Decision Engine

Scancore is, at its core, a "decision engine".

It was created as a way for Anvil! systems to make intelligent decisions based on data coming in from any number of places. It generates alerts for admins, so in this regard it is an alert and monitoring solution, but that is almost a secondary benefit.

The core of Scancore has no way of gathering data and it doesn't care how data is collected. It walks through a special agents directory and any agent it finds in there, it runs. Each agent connects to any number of Scancore databases, checks whatever it knows how to scan, compares the current data with static limits and compares against historic values (as it deems fit) and records data (new or changed values) into the database.

An agent may decide to take independent action, like sending an alert or attempting a recovery of the devices or software it monitors, and then exits. If an agent doesn't find any hardware or software it knows about, it immediately exits without doing anything further.

After all agents run, Scancore runs through post-scan tasks, depending on whether the machine it is running on is an Anvil! node or a Scancore database. This is where the "decision engine" comes into play.

Lets look at a couple of examples;

Example 1; Overheating

Scancore can tell the difference between a local node overheating and the room it is in overheating.

If the node itself has overheated, it will migrate servers over to the healthy peer. If the enough temperature sensors go critical, the node will power off.

If, however, both nodes are overheating then Scancore can deduce that the room is overheating. In this case, it can automatically shed load to reduce the amount of heat being pumped into the room and slow down the rate of heating. Later, when the room cools, it will automatically reboot the shedded node and reform the Anvil! pair, restoring redundancy without ever requiring a human's input.

How does it do this?

Multiple scan agents record thermal data. The scan-ipmitool tool checks the host's IPMI sensor data which includes many thermal sensors and their upper and lower warning and critical thresholds. The scan-storcli agent scan LSI-based RAID controllers and the attached hard drives and solid state drives. These also have thermal data. This is true also for many UPSes, ethernet switches and so forth.

As each agent checks its thermal sensors, any within nominal ranges are recorded by the agent in its database tables. Any that are in a warning state though, that is, overly warm or cool but not yet a problem, get pushed into a special 'temperature' database table. Alone, Scancore does nothing more than mark the node's health as 'warning' and no further action is taken.

If a given agent finds a given sensor reaching a 'critical' state, that is hot enough or cold enough to be a real concern, it it also pushed into the 'temperature' table. At the end of the scan, Scancore will "add up" the number of sensors that are critical.

If the sum of the sensors exceed a limit, and if the host is a node, Scancore will take action by shutting down. Each sensor has a default weight of '1' and by default, the shutdown threshold is "greater than five". So by default, a node will shut down when 6 or more sensors go critical. This is entirely configurable on a per-sensor basis as well as the shutdown threshold.

Later, when the still-accessible temperature sensors return to an acceptable level, Scancore running on any one of the dashboards will power the node back up. Note that Scancore will check how many times a node has overheated recently and extend a "cool-down" period before rebooting a node. This way, a node with a chronic overheating condition will be rebooted less often. Once repaired though, the reboots will eventually be "forgotten" and the cool-down delay will reset.

What about thermal load shedding?

The example above spoke to a single node overheating. If you recall, Scancore does "post-scan calculations". When on a node, this includes a check to see if the peer's temperature has entered a "warning" state when it has as well. Using a similar heuristic, when both nodes have enough temperature sensors in 'warning' or 'critical' state for more than a set period of time, one of the nodes will be withdrawn and shut down.

Unlike the example above, which shutdown the host node after a critical heuristic is passed, the load-shedding kicks in only when both nodes are registering a thermal event at the same time for more than a set (and configurable) period of time.

Example 2; Loss of input power

In all Anvil! systems, at least two network-monitored UPSes are powering the nodes' redundant power supplies. Thus, the loss of one UPS does not pose a risk to the system and can be ignored. Traditionally, most UPS monitoring software would assume it was the sole power provider for a machine and would initiate a shutdown if it reached critically low power levels.

With Scancore, it understands that each node has two (or more) power sources. If one UPS loses mains power, an alert will be registered but nothing more will be done. Should the one UPS deplete entirely, the power will be lost and additional alerts will be registered when input power is lost to one of the redundant power supplies, but otherwise nothing more will happen.

Thus, Scancore is redundancy-aware.

Consider another power scenario; Power is lost the both UPSes feeding a node. In this case, Scancore does two things;

  1. It begins monitoring the estimated hold-up time of the strongest UPS. If the strongest UPS drops below a minimum hold-up time, a graceful shutdown of hosted servers is initiated followed by the node(s) withdrawing and powering off. Note that if different UPSes power the nodes, Scancore will know that the peer is healthy and will migrate servers to the node with power long before the node needs to shutdown.

In a typical install, the same pair of UPSes power both nodes in the Anvil!. In the case where power is lost to both UPSes, a timer is checked. Once both nodes have been running on UPS batteries for more than two minutes, load shedding will occur. If needed, servers will migrate to consolidate on one node, then the sacrificial node will withdraw and power off to extend the runtime of the remaining node.

If, after load shedding, power stays out for too long and minimum hold-up times are crossed, the remaining node will gracefully shut down the servers and then power itself off.

Later, power is restored.

At this point, the Striker dashboards will boot (if all power was lost). Once up, they will note that both nodes are off and check the UPSes. If both UPSes are depleted (or minimally charged), they will take no action. Instead, they will monitor the charge rate of the UPSes. Once one of the UPSes hits a minimum charge percentage, it will boot the nodes and restore full Anvil! services, including booting all servers.

The logic behind the delay is to ensure that, if mains power is lost immediately after powering the nodes back on, there is sufficient charge for the nodes to power back up, detect the loss and shut back down safely.

Example 3; Node Health

The final example will show how Scancore can react to a localized node issue.

Consider the scenario where Node 1 is the active host. The RAID controller on the host reports that a hard drive is potentially failing. An alert is generated but no further action is taken.

Later, a drive fails entirely and the node enters a degraded state.

At this point, Scancore would note that Node 1 is now in a 'warning' state and the peer node is 'ok' and a timer is started. Recall that Scancore can't determine the nature of a warning, so it pauses a little bit to avoid taking action on a transient issue. Two minutes after the failure, with the 'warning' state still present, Scancore will migrate all hosted servers over to Node 2.

It will remain in the Anvil! and no further action will be taken. However, now, if a second drive were to fail (assuming RAID level 5), Node 1 would be lost and fenced, but no interruption would occur because the servers were already moved as a precaution.

If the drive is replaced before any further issues arise, Node 1 would return to an 'ok' state but nothing else would happen. Servers would be left on Node 2 because there is no benefit or concern around which node is hosting the servers at any given time.

Scan Agents

When an agent runs and connects to the database layer, a timestamp is created and that time stamp is then used for all databases changes made in that given pass. This means that the modification timestamps will be the same for a given pass, regardless of the actual time when the record was changed. This makes resynchronization far more sane, at the cost of some resolution.

If your agent needs accurate record change timestamps, please make a note to record that current time as a separate database column.


 

Any questions, feedback, advice, complaints or meanderings are welcome.
Alteeve's Niche! Enterprise Support:
Alteeve Support
Community Support
© Alteeve's Niche! Inc. 1997-2024   Anvil! "Intelligent Availability®" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.