ScanCore

{{warning|1=This is little more than raw notes, do not consider anything here to be valid or accurate at this time.}}


= Installing =


== PostgreSQL Setup ==


<syntaxhighlight lang="bash">
yum install -y postgresql postgresql-server postgresql-plperl postgresql-contrib postgresql-libs Scanner
</syntaxhighlight>
<syntaxhighlight lang="text">
...
Complete!
</syntaxhighlight>


Initialize the database:


<syntaxhighlight lang="bash">
/etc/init.d/postgresql initdb
</syntaxhighlight>
<syntaxhighlight lang="text">
Initializing database:                                    [  OK  ]
</syntaxhighlight>


Start the database server and set it to start on boot:


<syntaxhighlight lang="bash">
chkconfig postgresql on
/etc/init.d/postgresql start
</syntaxhighlight>
<syntaxhighlight lang="text">
Starting postgresql service:                              [  OK  ]
</syntaxhighlight>


Create the striker user.


<syntaxhighlight lang="bash">
su - postgres -c "createuser --no-superuser --createdb --no-createrole striker"
</syntaxhighlight>
<syntaxhighlight lang="bash">
# no output expected
</syntaxhighlight>


Set 'postgres' and 'striker' user passwords:


<syntaxhighlight lang="bash">
su - postgres -c "psql -U postgres"
</syntaxhighlight>
<syntaxhighlight lang="text">
psql (8.4.20)
Type "help" for help.
</syntaxhighlight>


<syntaxhighlight lang="bash">
postgres=# \password
</syntaxhighlight>
<syntaxhighlight lang="text">
Enter new password:
Enter it again:
</syntaxhighlight>


<syntaxhighlight lang="bash">
postgres=# \password striker
</syntaxhighlight>
<syntaxhighlight lang="text">
Enter new password:
Enter it again:
</syntaxhighlight>


Exit.


<syntaxhighlight lang="bash">
postgres=# \q
</syntaxhighlight>


{{warning|1=In the below example, the [[BCN]] is <span class="code">10.20.0.0/16</span> and the IFN is <span class="code">192.168.199.0/24</span>. If you have different networks, be sure to adjust your values accordingly!}}


Configure access:


<syntaxhighlight lang="bash">
cp /var/lib/pgsql/data/pg_hba.conf /var/lib/pgsql/data/pg_hba.conf.striker
vim /var/lib/pgsql/data/pg_hba.conf
diff -u /var/lib/pgsql/data/pg_hba.conf.striker /var/lib/pgsql/data/pg_hba.conf
</syntaxhighlight>
<syntaxhighlight lang="diff">
--- /var/lib/pgsql/data/pg_hba.conf.striker 2015-03-05 14:33:40.902733374 +0000
+++ /var/lib/pgsql/data/pg_hba.conf 2015-03-05 14:34:44.861733318 +0000
@@ -65,9 +65,13 @@
# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
+# dashboards
+host    all        all        192.168.199.0/24      md5
+# node servers
+host    all        all        10.20.0.0/16          md5
# "local" is for Unix domain socket connections only
-local  all        all                              ident
+local  all        all                              md5
# IPv4 local connections:
host    all        all        127.0.0.1/32          ident
# IPv6 local connections:
</syntaxhighlight>


<syntaxhighlight lang="bash">
cp /var/lib/pgsql/data/postgresql.conf /var/lib/pgsql/data/postgresql.conf.striker
vim /var/lib/pgsql/data/postgresql.conf
diff -u /var/lib/pgsql/data/postgresql.conf.striker /var/lib/pgsql/data/postgresql.conf
</syntaxhighlight>
<syntaxhighlight lang="diff">
--- /var/lib/pgsql/data/postgresql.conf.striker 2015-03-05 14:35:35.388733307 +0000
+++ /var/lib/pgsql/data/postgresql.conf 2015-03-05 14:36:07.111733159 +0000
@@ -56,7 +56,7 @@
# - Connection Settings -
-#listen_addresses = 'localhost' # what IP address(es) to listen on;
+listen_addresses = '*' # what IP address(es) to listen on;
# comma-separated list of addresses;
# defaults to 'localhost', '*' = all
# (change requires restart)
</syntaxhighlight>
 
<syntaxhighlight lang="bash">
/etc/init.d/postgresql restart
</syntaxhighlight>
<syntaxhighlight lang="text">
Stopping postgresql service:                              [  OK  ]
Starting postgresql service:                              [  OK  ]
</syntaxhighlight>
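
Before moving on, it can be worth confirming that a remote machine can actually log in over the network with the new rules. Scanner's agents connect using Perl's <span class="code">DBD::Pg</span> driver, so a quick check can be done the same way. This is only an illustrative sketch (using this page's example dashboard IP and the password set above), not part of Scanner:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative connectivity check; not part of Scanner. The host IP,
# user and password are this page's example values. Run it from a
# machine on the BCN or IFN to confirm pg_hba.conf and
# listen_addresses are sane.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
	"DBI:Pg:dbname=template1;host=10.20.4.1;port=5432",
	"striker",
	"secret",
	{ RaiseError => 1, AutoCommit => 1 });

# If we got this far, the network, authentication and password all work.
my ($version) = $dbh->selectrow_array("SELECT version()");
print "Connected: $version\n";
$dbh->disconnect;
</syntaxhighlight>

If the connect fails, re-check the <span class="code">pg_hba.conf</span> and <span class="code">listen_addresses</span> changes above.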


== Striker Database Setup ==


Create DB:


<syntaxhighlight lang="bash">
su - postgres -c "createdb --owner striker scanner"
</syntaxhighlight>
<syntaxhighlight lang="text">
Password:
</syntaxhighlight>


The SQL files we need to load are found in the <span class="code">/etc/striker/SQL</span> directory.


The core SQL file is <span class="code"></span>


<syntaxhighlight lang="bash">
ls -lah /etc/striker/SQL/
</syntaxhighlight>
<syntaxhighlight lang="text">
total 64K
drwxr-xr-x. 2 root root 4.0K Mar  4 23:50 .
drwxr-xr-x. 5 root root 4.0K Mar  4 23:50 ..
-rw-r--r--. 1 root root  397 Mar  4 23:41 00_drop_db.sql
-rw-r--r--. 1 root root 2.5K Mar  4 23:41 01_create_node.sql
-rw-r--r--. 1 root root 3.2K Mar  4 23:41 02_create_alerts.sql
-rw-r--r--. 1 root root 1.9K Mar  4 23:41 03_create_alert_listeners.sql
-rw-r--r--. 1 root root 1.3K Mar  4 23:41 04_load_alert_listeners.sql
-rw-r--r--. 1 root root 3.2K Mar  4 23:41 05_create_random_agent.sql
-rw-r--r--. 1 root root 3.4K Mar  4 23:41 06a_create_snm_apc_pdu.sql
-rw-r--r--. 1 root root 3.6K Mar  4 23:41 06b_create_snmp_brocade_switch.sql
-rw-r--r--. 1 root root 3.4K Mar  4 23:41 06_create_snm_apc_ups.sql
-rw-r--r--. 1 root root 3.5K Mar  4 23:41 07_create_ipmi.sql
-rw-r--r--. 1 root root 5.9K Mar  4 23:41 08_create_raid.sql
-rw-r--r--. 1 root root 3.8K Mar  4 23:41 09_create_bonding.sql
-rw-r--r--. 1 root root 1.2K Mar  4 23:41 Makefile
</syntaxhighlight>


{{note|1=The default is that the database owner name is <span class="code">striker</span>. If you used a different database owner name, please update the <span class="code">.sql</span> files with the command <span class="code">sed -i 's/striker/yourname/' *.sql</span>.}}


Load the SQL tables into the database.


<syntaxhighlight lang="bash">
cat /etc/striker/SQL/*.sql > /tmp/all.sql
psql scanner -U striker -f /tmp/all.sql
</syntaxhighlight>
<syntaxhighlight lang="text">
Password for user striker:
</syntaxhighlight>
<syntaxhighlight lang="text">
<sql load messages>
</syntaxhighlight>


Test:


<syntaxhighlight lang="bash">
psql -U striker -d scanner -c "SELECT * FROM alert_listeners"
</syntaxhighlight>
<syntaxhighlight lang="text">
Password for user striker:
</syntaxhighlight>
<syntaxhighlight lang="text">
id |      name      |    mode      |  level  |  contact_info  | language | added_by |            updated           
----+----------------+---------------+---------+----------------+----------+----------+-------------------------------
  1 | screen        | Screen        | DEBUG  | screen        | en_CA    |        0 | 2014-12-11 14:42:13.273057-05
  2 | Tom Legrady    | Email        | DEBUG  | tom@striker.ca | en_CA    |        0 | 2014-12-11 16:54:25.477321-05
  3 | Health Monitor | HealthMonitor | WARNING |                | en_CA    |        0 | 2015-01-14 14:08:15-05
(3 rows)
</syntaxhighlight>


Done!


== Configure Scan Core on a Node ==


Install dependencies:


<syntaxhighlight lang="bash">
yum install Scanner postgresql perl-DBD-Pg
</syntaxhighlight>
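
The <span class="code">perl-DBD-Pg</span> package provides the database driver the agents use. If you want to confirm the Perl stack is actually in place before going further, a trivial sanity check (illustrative only):

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Confirm the Perl database stack loads and report its versions.
use strict;
use warnings;
use DBI;
use DBD::Pg;

print "DBI version:     $DBI::VERSION\n";
print "DBD::Pg version: $DBD::Pg::VERSION\n";
</syntaxhighlight>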


On the clients, you need to be sure your configuration files are set the way you want.  


Most important is that the connection details to the databases on the dashboards are configured properly. Most installs have two dashboards, and Scanner will record its data to both for resiliency.


The configuration files are found in <span class="code">/etc/striker/Config/</span>.


<syntaxhighlight lang="bash">
ls -lah /etc/striker/Config/
</syntaxhighlight>
<syntaxhighlight lang="text">
total 68K
drwxr-xr-x. 2 root root 4.0K Mar  5 15:06 .
drwxr-xr-x. 5 root root 4.0K Mar  5 15:06 ..
-rw-r--r--. 1 root root  741 Mar  4 23:41 bonding.conf
-rw-r--r--. 1 root root 1.1K Mar  4 23:41 dashboard.conf
-rw-r--r--. 1 root root  379 Mar  4 23:41 db.conf
-rw-r--r--. 1 root root 5.1K Mar  4 23:41 ipmi.conf
-rw-r--r--. 1 root root  939 Mar  4 23:41 nodemonitor.conf
-rw-r--r--. 1 root root 1.2K Mar  4 23:41 raid.conf
-rw-r--r--. 1 root root  961 Mar  4 23:41 scanner.conf
-rw-r--r--. 1 root root 1.7K Mar  4 23:41 snmp_apc_pdu.conf
-rw-r--r--. 1 root root 8.9K Mar  4 23:41 snmp_apc_ups.conf
-rw-r--r--. 1 root root 4.7K Mar  4 23:41 snmp_brocade_switch.conf
-rw-r--r--. 1 root root 1.4K Mar  4 23:41 system_check.conf
</syntaxhighlight>


{{note|1=We're showing two databases, but in theory, there is no set limit on the number of database servers that the nodes can use. Simply copy the configuration section for each additional server you wish to use. Just be sure to increment the id number for each section (ie: <span class="code">db::X::name</span> where <span class="code">X</span> is a unique integer for the additional server).}}


In this example, the two [[Striker]] dashboards with our databases have the [[BCN]] IPs <span class="code">10.20.4.1</span> and <span class="code">10.20.4.2</span>. Both use the database name <span class="code">scanner</span> owned by the database user <span class="code">striker</span> with the password <span class="code">secret</span>. So their configurations will be nearly identical.


<syntaxhighlight lang="bash">
cp /etc/striker/Config/db.conf /etc/striker/Config/db.conf.original
vim /etc/striker/Config/db.conf
</syntaxhighlight>
<syntaxhighlight lang="text">
db::1::name      = scanner
db::1::db_type  = Pg
db::1::host      = 10.20.4.1
db::1::port      = 5432
db::1::user      = striker
db::1::password  = secret


db::2::name      = scanner
db::2::db_type  = Pg
db::2::host      = 10.20.4.2
db::2::port      = 5432
db::2::user      = striker
db::2::password  = secret
</syntaxhighlight>
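
The format is a simple set of <span class="code">db::X::key = value</span> pairs. As an illustration of how these entries map onto database connections, here is a small, stand-alone sketch that parses the file and tries each configured database in turn. This mimics, but is not, Scanner's own code:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative sketch only; parse db.conf and try each database.
use strict;
use warnings;
use DBI;

my %db;
open(my $fh, "<", "/etc/striker/Config/db.conf") or die "Failed to read db.conf: $!\n";
while (my $line = <$fh>)
{
	chomp $line;
	next if $line =~ /^\s*(#|$)/;
	if ($line =~ /^db::(\d+)::(\w+)\s*=\s*(.*?)\s*$/)
	{
		# ie: $db{1}{host} = "10.20.4.1"
		$db{$1}{$2} = $3;
	}
}
close $fh;

foreach my $id (sort {$a <=> $b} keys %db)
{
	my $dsn = "DBI:$db{$id}{db_type}:dbname=$db{$id}{name};host=$db{$id}{host};port=$db{$id}{port}";
	my $dbh = DBI->connect($dsn, $db{$id}{user}, $db{$id}{password}, { RaiseError => 1 });
	print "Database $id on $db{$id}{host} is reachable.\n";
	$dbh->disconnect;
}
</syntaxhighlight>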


Now the node should be able to reach the databases. Let's test though, to be sure. The nodes have IPMI, so we will test by manually calling the <span class="code">ipmi</span> agent.


<syntaxhighlight lang="bash">
/usr/share/striker/agents/ipmi --verbose --verbose
</syntaxhighlight>
<syntaxhighlight lang="text">
Program ipmi writing to DB '10.20.4.1'.
Program ipmi writing to DB '10.20.4.2'.
ipmi loop 1 at 01:16:08 ->  960.295 ms elapsed;  29039.705 ms pending.


----------------------------------------------------------------------


ipmi loop 2 at 01:16:38 -> 1005.016 ms elapsed; 28994.984 ms pending.


----------------------------------------------------------------------
</syntaxhighlight>


If all is well, it should record its values once every 30 seconds or so. Let it run a couple of loops, and then press <span class="code"><ctrl></span> + <span class="code">c</span> to stop the scan.


Now we can verify the data was written to both dashboards' databases:


<syntaxhighlight lang="bash">
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures;"
</syntaxhighlight>
<syntaxhighlight lang="text">
id | node_id |      target      |      field      | value |  units  | status  |  message_tag  | message_arguments |          timestamp         
----+---------+------------------+-----------------+-------+-----------+---------+---------------+-------------------+-------------------------------
  1 |      1 | node1.alteeve.ca | Ambient        | 25    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.390891+00
  2 |      1 | node1.alteeve.ca | Systemboard 1  | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.415248+00
  3 |      1 | node1.alteeve.ca | Systemboard 2  | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.429477+00
  4 |      1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.4434+00
  5 |      1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.455114+00
  6 |      1 | node1.alteeve.ca | MEM A           | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.466447+00
  7 |      1 | node1.alteeve.ca | MEM B          | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.47765+00
  8 |      1 | node1.alteeve.ca | MEM C          | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.489131+00
  9 |      1 | node1.alteeve.ca | MEM D          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.500622+00
10 |      1 | node1.alteeve.ca | MEM E          | 37    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.51189+00
11 |      1 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.523267+00
12 |      1 | node1.alteeve.ca | MEM G          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.534761+00
13 |      1 | node1.alteeve.ca | MEM H          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.54614+00
14 |      1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.557422+00
15 |      1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.569362+00
16 |      1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.580696+00
17 |      1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.591993+00
18 |      1 | node1.alteeve.ca | BBU            | 30    | degrees C | OK      |              |                  | 2015-03-06 01:16:02.603261+00
19 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:02.614824+00
20 |      1 | node1.alteeve.ca | summary        | 1    |          | WARNING | Value warning | value=1          | 2015-03-06 01:16:02.64331+00
21 |      1 | node1.alteeve.ca | Ambient        | 25    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.400365+00
22 |      1 | node1.alteeve.ca | Systemboard 1  | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.425598+00
23 |      1 | node1.alteeve.ca | Systemboard 2  | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.439627+00
24 |      1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.453921+00
25 |      1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.468253+00
26 |      1 | node1.alteeve.ca | MEM A          | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.482567+00
27 |      1 | node1.alteeve.ca | MEM B          | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.496698+00
28 |      1 | node1.alteeve.ca | MEM C          | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.508425+00
29 |      1 | node1.alteeve.ca | MEM D          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.522475+00
30 |      1 | node1.alteeve.ca | MEM E          | 37    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.536592+00
31 |      1 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.548096+00
32 |      1 | node1.alteeve.ca | MEM G          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.559742+00
33 |      1 | node1.alteeve.ca | MEM H          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.573795+00
34 |      1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.585372+00
35 |      1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.599816+00
36 |      1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.613983+00
37 |      1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.628238+00
38 |      1 | node1.alteeve.ca | BBU            | 30    | degrees C | OK      |              |                  | 2015-03-06 01:16:32.642372+00
39 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:32.653909+00
40 |      1 | node1.alteeve.ca | summary        | 1    |          | WARNING | Value warning | value=1          | 2015-03-06 01:16:32.682502+00
(40 rows)
</syntaxhighlight>


We'll address the warnings in a moment. For now, this tells us that we are recording to dashboard 1 properly. Let's check dashboard 2:


<syntaxhighlight lang="bash">
psql -h 10.20.4.2 -U striker scanner -c "SELECT * FROM ipmi_temperatures;"
</syntaxhighlight>
<syntaxhighlight lang="text">
id | node_id |      target      |      field      | value |  units  | status  |  message_tag  | message_arguments |          timestamp         
----+---------+------------------+-----------------+-------+-----------+---------+---------------+-------------------+-------------------------------
  1 |      1 | node1.alteeve.ca | Ambient        | 25    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.689144+00
  2 |      1 | node1.alteeve.ca | Systemboard 1  | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.708423+00
  3 |      1 | node1.alteeve.ca | Systemboard 2  | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.722751+00
  4 |      1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.733944+00
  5 |      1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.74567+00
  6 |      1 | node1.alteeve.ca | MEM A          | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.756925+00
  7 |      1 | node1.alteeve.ca | MEM B          | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.768102+00
  8 |      1 | node1.alteeve.ca | MEM C          | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.779549+00
  9 |      1 | node1.alteeve.ca | MEM D          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.791011+00
10 |      1 | node1.alteeve.ca | MEM E          | 37    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.802332+00
11 |      1 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.813697+00
12 |      1 | node1.alteeve.ca | MEM G          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.825063+00
13 |      1 | node1.alteeve.ca | MEM H          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.836604+00
14 |      1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.848219+00
15 |      1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.859965+00
16 |      1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.870959+00
17 |      1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.88233+00
18 |      1 | node1.alteeve.ca | BBU            | 30    | degrees C | OK      |              |                  | 2015-03-06 01:16:08.893657+00
19 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:08.905299+00
20 |      1 | node1.alteeve.ca | summary        | 1    |          | WARNING | Value warning | value=1          | 2015-03-06 01:16:08.93407+00
21 |      1 | node1.alteeve.ca | Ambient        | 25    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.699395+00
22 |      1 | node1.alteeve.ca | Systemboard 1  | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.718864+00
23 |      1 | node1.alteeve.ca | Systemboard 2  | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.73341+00
24 |      1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.747455+00
25 |      1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.762113+00
26 |      1 | node1.alteeve.ca | MEM A          | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.776163+00
27 |      1 | node1.alteeve.ca | MEM B          | 32    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.787508+00
28 |      1 | node1.alteeve.ca | MEM C          | 35    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.802058+00
29 |      1 | node1.alteeve.ca | MEM D          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.816296+00
30 |      1 | node1.alteeve.ca | MEM E          | 37    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.827444+00
31 |      1 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.83877+00
32 |      1 | node1.alteeve.ca | MEM G          | 34    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.853383+00
33 |      1 | node1.alteeve.ca | MEM H          | 36    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.864927+00
34 |      1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.879143+00
35 |      1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.893541+00
36 |      1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.907655+00
37 |      1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.922028+00
38 |      1 | node1.alteeve.ca | BBU            | 30    | degrees C | OK      |              |                  | 2015-03-06 01:16:38.933201+00
39 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:38.947347+00
40 |      1 | node1.alteeve.ca | summary        | 1    |          | WARNING | Value warning | value=1          | 2015-03-06 01:16:38.976188+00
(40 rows)
</syntaxhighlight>


Excellent!
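If you ever want to script this kind of spot check rather than eyeballing psql output, a small sketch along these lines works. It is illustrative only and reuses this page's example credentials:

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative sketch: confirm both dashboards hold the same number of
# temperature records.
use strict;
use warnings;
use DBI;

my %count;
foreach my $host ("10.20.4.1", "10.20.4.2")
{
	my $dbh = DBI->connect("DBI:Pg:dbname=scanner;host=$host", "striker", "secret", { RaiseError => 1 });
	($count{$host}) = $dbh->selectrow_array("SELECT COUNT(*) FROM ipmi_temperatures");
	$dbh->disconnect;
}

if ($count{"10.20.4.1"} == $count{"10.20.4.2"})
{
	print "Both dashboards report the same row count.\n";
}
else
{
	print "Row counts differ; the dashboards are out of sync.\n";
}
</syntaxhighlight>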


Now, note the lines:


<syntaxhighlight lang="bash">
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' ORDER BY timestamp ASC;"
</syntaxhighlight>
<syntaxhighlight lang="text">
id | node_id |      target      |      field      | value |  units  | status | message_tag  | message_arguments |          timestamp         
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
19 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:02.614824+00
39 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:32.653909+00
(2 rows)
</syntaxhighlight>


This tells us that the <span class="code">RAID Controller</span> is running at 76°C, which scanner thinks is dangerously hot. We know that, according to the manufacturer, the controller is rated for up to 95°C, so this is fine. To account for this, we'll update the <span class="code">/etc/striker/Config/ipmi.conf</span> file from:


<syntaxhighlight lang="text">
ipmi::RAID Controller::ok        = 60
ipmi::RAID Controller::warn      = 70
ipmi::RAID Controller::hysteresis =  1
ipmi::RAID Controller::units      = degrees C
</syntaxhighlight>


To:


<syntaxhighlight lang="text">
ipmi::RAID Controller::ok        = 80
ipmi::RAID Controller::warn      = 90
ipmi::RAID Controller::hysteresis = 1
ipmi::RAID Controller::units      = degrees C
</syntaxhighlight>


Now, over 80°C will cause a warning and over 90°C will cause a critical alert. Let's test by running the <span class="code">ipmi</span> scan agent for one pass.


<syntaxhighlight lang="bash">
/usr/share/striker/agents/ipmi --verbose --verbose
</syntaxhighlight>
<syntaxhighlight lang="text">
Program ipmi writing to DB '10.20.4.1'.
Program ipmi writing to DB '10.20.4.2'.
ipmi loop 1 at 01:32:39 ->  937.201 ms elapsed; 29062.799 ms pending.


----------------------------------------------------------------------


^C
</syntaxhighlight>


Now let's look at the database again:

<syntaxhighlight lang="bash">
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' ORDER BY timestamp ASC;"
</syntaxhighlight>
<syntaxhighlight lang="text">
 id | node_id |      target      |      field      | value |   units   | status | message_tag  | message_arguments |           timestamp           
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
 19 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:02.614824+00
 39 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:32.653909+00
 59 |       2 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:31:17.370253+00
 79 |       3 | node1.alteeve.ca | RAID Controller | 76    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.447756+00
(4 rows)
</syntaxhighlight>

Notice the last entry is '<span class="code">OK</span>' now? That tells us we're doing fine.

{{note|1=Be sure to update the configuration values on both nodes!}}

Start scanner every five minutes (from cron). If another copy is already running, the new one simply exits. If no other copy was running (due to an OS boot, a scanner crash, etc), it will start up.

= ScanCore - The Decision Engine =

ScanCore is, at its core, a "decision engine".

It was created as a way for [[Anvil!]] systems to make intelligent decisions based on data coming in from any number of places. It generates alerts for admins, so in this regard it is an alert and monitoring solution, but that is almost a secondary benefit.

The core of ScanCore has no way of gathering data and it doesn't care how data is collected. It walks through a special <span class="code">agents</span> directory and runs any agent it finds there. Each agent connects to any number of ScanCore databases, checks whatever it knows how to scan, compares the current data with static limits and against historic values (as it deems fit) and records data (new or changed values) into the database.

An agent may decide to take independent action, like sending an alert or attempting a recovery of the devices or software it monitors, and then exit. If an agent doesn't find any hardware or software it knows about, it immediately exits without doing anything further.

After all agents run, ScanCore runs through post-scan tasks, depending on whether the machine it is running on is an Anvil! node or a ScanCore database. This is where the "decision engine" comes into play.

Let's look at a couple of examples.

== Example 1; Overheating ==

ScanCore can tell the difference between a local node overheating and the room it is in overheating.

If the node itself has overheated, it will migrate servers over to the healthy peer. If enough temperature sensors go critical, the node will power off.

If, however, both nodes are overheating, then ScanCore can deduce that the room is overheating. In this case, it can automatically shed load to reduce the amount of heat being pumped into the room and slow down the rate of heating. Later, when the room cools, it will automatically reboot the shedded node and reform the Anvil! pair, restoring redundancy without ever requiring a human's input.

How does it do this?

Multiple scan agents record thermal data. The <span class="code">scan-ipmitool</span> agent checks the host's [[IPMI]] sensor data, which includes many thermal sensors and their upper and lower warning and critical thresholds. The <span class="code">scan-storcli</span> agent scans [[AVAGO]]-based [[RAID]] controllers and the attached hard drives and solid state drives. These also have thermal data. The same is true for many UPSes, ethernet switches and so forth.

As each agent checks its thermal sensors, any within nominal ranges are recorded by the agent in its own database tables. Any that are in a <span class="code">warning</span> state though, that is, overly warm or cool but not yet a problem, get pushed into a special '<span class="code">temperature</span>' database table. Alone, ScanCore does nothing more than mark the node's health as 'warning' and no further action is taken.

If a given agent finds a given sensor reaching a '<span class="code">critical</span>' state, that is, hot enough or cold enough to be a real concern, it is also pushed into the '<span class="code">temperature</span>' table. At the end of the scan, ScanCore will "add up" the number of sensors that are critical.

If the sum of the sensors exceeds a limit, and if the host is a <span class="code">node</span>, ScanCore will take action by shutting down. Each sensor has a default weight of '<span class="code">1</span>' and, by default, the shutdown threshold is "greater than five". So by default, a node will shut down when 6 or more sensors go critical. This is entirely configurable on a per-sensor basis, as is the shutdown threshold.

Later, when the still-accessible temperature sensors return to an acceptable level, ScanCore running on any one of the dashboards will power the node back up. Note that ScanCore will check how many times a node has overheated recently and extend a "cool-down" period before rebooting a node. This way, a node with a chronic overheating condition will be rebooted less often. Once repaired though, the reboots will eventually be "forgotten" and the cool-down delay will reset.

What about thermal load shedding?

The example above spoke to a single node overheating. If you recall, ScanCore does "post-scan calculations". When on a node, this includes a check to see whether the peer's temperature has entered a "warning" state at the same time as its own. Using a similar heuristic, when both nodes have enough temperature sensors in 'warning' or 'critical' state for more than a set period of time, one of the nodes will be withdrawn and shut down.

Unlike the example above, which shut down the host node after a critical heuristic was passed, load shedding kicks in only when both nodes are registering a thermal event at the same time for more than a set (and configurable) period of time.

== Example 2; Loss of input power ==

In all Anvil! systems, at least two network-monitored UPSes are powering the nodes' redundant power supplies. Thus, the loss of one UPS does not pose a risk to the system and can be ignored. Traditionally, most UPS monitoring software would assume it was the sole power provider for a machine and would initiate a shutdown if it reached critically low power levels.

ScanCore, by contrast, understands that each node has two (or more) power sources. If one UPS loses mains power, an alert will be registered but nothing more will be done. Should that one UPS deplete entirely, its output will be lost and additional alerts will be registered when input power is lost to one of the redundant power supplies, but otherwise nothing more will happen.

Thus, ScanCore is redundancy-aware.

Consider another power scenario; Power is lost to both UPSes feeding a node. In this case, ScanCore does two things;

# It begins monitoring the estimated hold-up time of the ''strongest'' UPS. If the strongest UPS drops below a minimum hold-up time, a graceful shutdown of hosted servers is initiated, followed by the node(s) withdrawing and powering off. Note that if different UPSes power the nodes, ScanCore will know that the peer is healthy and will migrate servers to the node with power long before the node needs to shut down.

In a typical install, the same pair of UPSes power both nodes in the Anvil!. In the case where power is lost to both UPSes, a timer is checked. Once both nodes have been running on UPS batteries for more than two minutes, load shedding will occur. If needed, servers will migrate to consolidate on one node, then the sacrificial node will withdraw and power off to extend the runtime of the remaining node.

If, after load shedding, power stays out for too long and minimum hold-up times are crossed, the remaining node will gracefully shut down the servers and then power itself off.

Later, power is restored.

At this point, the Striker dashboards will boot (if all power was lost). Once up, they will note that both nodes are off and check the UPSes. If both UPSes are depleted (or minimally charged), they will take no action. Instead, they will monitor the charge rate of the UPSes. Once one of the UPSes hits a minimum charge percentage, they will boot the nodes and restore full Anvil! services, including booting all servers.

The logic behind the delay is to ensure that, if mains power is lost immediately after powering the nodes back on, there is sufficient charge for the nodes to power back up, detect the loss and shut back down safely.

== Example 3; Node Health ==

The final example will show how ScanCore can react to a localized node issue.

Consider the scenario where Node 1 is the active host. The RAID controller on the host reports that a hard drive is potentially failing. An alert is generated but no further action is taken.

Later, a drive fails entirely and the node enters a degraded state.

At this point, ScanCore would note that Node 1 is now in a 'warning' state while the peer node is 'ok', and a timer is started. Recall that ScanCore can't determine the nature of a warning, so it pauses a little to avoid taking action on a transient issue. Two minutes after the failure, with the 'warning' state still present, ScanCore will migrate all hosted servers over to Node 2.

Node 1 will remain in the Anvil! and no further action will be taken. However, now, if a second drive were to fail (assuming RAID level 5), Node 1 would be lost and fenced, but no interruption would occur because the servers were already moved as a precaution.

If the drive is replaced before any further issues arise, Node 1 would return to an 'ok' state but nothing else would happen. Servers would be left on Node 2 because there is no benefit or concern around which node is hosting the servers at any given time.

= Scan Agents =

When an agent runs and connects to the database layer, a timestamp is created and that timestamp is then used for all database changes made in that given pass. This means that the modification timestamps will be the same for a given pass, regardless of the actual time when the record was changed. This makes resynchronization far more sane, at the cost of some resolution.

If your agent needs accurate record change timestamps, please make a note to record the current time as a separate database column.

* [[List of ScanCore Agents]]

== DB Resync ==

Part of the difference between ScanCore and various other tools is that ScanCore is designed from its core as a resilient project. The data collected by agents needs to, from the user's perspective, sync N-way between ScanCore databases without the user needing to worry about backups, recoveries and whatnot.

How does this work?

In essence, the data agents collect can be categorized in one of two ways;

* Data that is global (like data on servers on the [[Anvil!]] platform)
* Data that is target-bound (like a host's sensor data from IPMI interfaces, or a given machine's view of the UPSes it cares about)

As an agent author, you need to consider that data may exist in some databases and not others.

Consider;

A site has two Striker dashboards acting as ScanCore databases. This is a satellite office, so your data replicates to a third Striker at head office. Meanwhile, head office is collecting data from many different sites, and the two dashboards at your site don't care about the data on the head-office site from those other locations.

{{warning|1=Isolating data onto a limited number of databases is an efficiency effort, '''not''' a security effort! If you don't trust a ScanCore database machine, don't connect to it, period. Similarly, if you don't trust a machine with access to your database, don't give the owner access.}}

You also need to plan for N-directional resynchronization.

Also consider;

Power is lost to both/all UPSes and load shedding takes "Striker 2" offline. Now data is being recorded to "Striker 1" that will need to be copied to Striker 2 later. Time passes and all power is lost. Power is restored, but for some reason Striker 2 boots up first and starts collecting data. Eventually, Striker 1 comes back online.

Now, Striker 1 has data that 2 doesn't, and Striker 2 has data that 1 doesn't.

ScanCore has already solved this problem using the following schemes, depending on which type of data your agent collects.

{{note|1=Yes, this is expensive in terms of memory and processing power, relatively speaking. However, a lot of effort is made to never <span class="code">UPDATE</span> the database unless something actually changes, keeping the <span class="code">history</span> schema as small and efficient as possible. For this reason, even data collected from many nodes over a long period of time should not add up to too much. If you are concerned, be sure to run periodic archiving of the data.}}

{{warning|1=As this is written, automatic archiving has not been implemented, though it is planned to be implemented shortly.}}

=== Resync Global Data ===

This is the simplest data to resync because it will go to all databases, no matter what. This is rare in practice but provides a good starting point.

The process;

The agent starts and connects to the databases. As part of the connection process, a check is made to see if any databases are behind (see <span class="code">[https://github.com/ClusterLabs/striker/blob/master/AN/Tools/DB.pm AN::Tools::DB.pm->find_behind_databases()]</span>). If so, the agent will act on this by initiating a resync.

The resync process is fundamentally simple; All records are read in from the <span class="code">history</span> schema of all connected databases into a common hash, keyed on the time a given record was recorded and the unique ID of the record. The same data is loaded into a database-specific hash for later comparison. We also note, for each unique record, that we've seen at least one copy of it, for a later step. An example "record" would be a server's UUID, which uniquely identifies it regardless of the host node or Anvil!.

Here is an example of how the data is read in:

<syntaxhighlight lang="perl">
my $query = "
SELECT
    server_uuid,
    server_name,
    server_stop_reason,
    server_start_after,
    server_start_delay,
    server_note,
    server_definition,
    server_host,
    server_state,
    server_migration_type,
    server_pre_migration_script,
    server_pre_migration_arguments,
    server_post_migration_script,
    server_post_migration_arguments,
    modified_date
FROM
    history.servers
;";
</syntaxhighlight>
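
The query itself is only half the story; each returned row is then walked into the hashes described below. A minimal sketch of that read loop, assuming a connected <span class="code">DBI</span> handle in <span class="code">$dbh</span>, with a hypothetical helper standing in for the hash assignments:

<syntaxhighlight lang="perl">
# Illustrative sketch; the real loop lives in AN::Tools::DB.pm.
# 'record_in_hashes' is a hypothetical stand-in for the assignments below.
my $sth = $dbh->prepare($query);
$sth->execute();
while (my $row = $sth->fetchrow_hashref())
{
	my $server_uuid   = $row->{server_uuid};
	my $modified_date = $row->{modified_date};
	record_in_hashes($modified_date, $server_uuid, $row);
}
$sth->finish();
</syntaxhighlight>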
Without constraints, ''all'' data in the table will be read in. This data is recorded in the 'unified' hash, using the modification time and the unique identifier as keys.

<syntaxhighlight lang="perl">
# Record this in the unified and local hashes.
$an->data->{db_data}{unified}{servers}{modified_date}{$modified_date}{server_uuid}{$server_uuid} = {
	server_name                     => $server_name,
	server_stop_reason              => $server_stop_reason,
	server_start_after              => $server_start_after,
	server_start_delay              => $server_start_delay,
	server_note                     => $server_note,
	server_definition               => $server_definition,
	server_host                     => $server_host,
	server_state                    => $server_state,
	server_migration_type           => $server_migration_type,
	server_pre_migration_script     => $server_pre_migration_script,
	server_pre_migration_arguments  => $server_pre_migration_arguments,
	server_post_migration_script    => $server_post_migration_script,
	server_post_migration_arguments => $server_post_migration_arguments,
};
</syntaxhighlight>

== Testing Automatic Shutdown ==

One of the features of Scanner is that it can safely shut down a node if it starts to get too hot, or if input power has been lost and the batteries in the strongest UPS drop below a minimum hold-up time. To test this, you have two choices;

# Pull the power on the UPSes and watch their hold-up time. If all goes well, both nodes will power off when the minimum threshold is passed.
# Artificially set five or more thermal sensor limits to be too low, so that normal thermal levels trigger a shutdown.

{{warning|1=If you're testing option 2, '''do not''' configure scanner to run on boot or via cron! Your node will shut down within five minutes otherwise, requiring a boot to single-user mode to correct.}}

For time's sake, we'll drop the sensor limits.

First, we need to know what values would be "too high", so let's see what our RAM and RAID controller are sitting at:

<syntaxhighlight lang="bash">
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' or field LIKE 'MEM %' ORDER BY field ASC, timestamp ASC;"
</syntaxhighlight>
<syntaxhighlight lang="text">
id | node_id |      target      |      field      | value |  units  | status | message_tag  | message_arguments |          timestamp         
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
  6 |      1 | node1.alteeve.ca | MEM A          | 32    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.466447+00
26 |      1 | node1.alteeve.ca | MEM A          | 32    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.482567+00
46 |      2 | node1.alteeve.ca | MEM A          | 33    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.222054+00
66 |      3 | node1.alteeve.ca | MEM A          | 33    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.299146+00
  7 |      1 | node1.alteeve.ca | MEM B          | 32    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.47765+00
27 |      1 | node1.alteeve.ca | MEM B          | 32    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.496698+00
47 |      2 | node1.alteeve.ca | MEM B          | 33    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.233122+00
67 |      3 | node1.alteeve.ca | MEM B          | 33    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.310512+00
  8 |      1 | node1.alteeve.ca | MEM C          | 35    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.489131+00
28 |      1 | node1.alteeve.ca | MEM C          | 35    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.508425+00
48 |      2 | node1.alteeve.ca | MEM C          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.244798+00
68 |      3 | node1.alteeve.ca | MEM C          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.321981+00
  9 |      1 | node1.alteeve.ca | MEM D          | 34    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.500622+00
29 |      1 | node1.alteeve.ca | MEM D          | 34    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.522475+00
49 |      2 | node1.alteeve.ca | MEM D          | 35    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.256127+00
69 |      3 | node1.alteeve.ca | MEM D          | 35    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.333338+00
10 |      1 | node1.alteeve.ca | MEM E          | 37    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.51189+00
30 |      1 | node1.alteeve.ca | MEM E          | 37    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.536592+00
50 |      2 | node1.alteeve.ca | MEM E          | 38    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.26758+00
70 |      3 | node1.alteeve.ca | MEM E          | 38    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.34476+00
11 |      1 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.523267+00
31 |      1 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.548096+00
51 |      2 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.278884+00
71 |      3 | node1.alteeve.ca | MEM F          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.356109+00
12 |      1 | node1.alteeve.ca | MEM G          | 34    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.534761+00
32 |      1 | node1.alteeve.ca | MEM G          | 34    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.559742+00
52 |      2 | node1.alteeve.ca | MEM G          | 35    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.290446+00
72 |      3 | node1.alteeve.ca | MEM G          | 35    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.367751+00
13 |      1 | node1.alteeve.ca | MEM H          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:16:02.54614+00
33 |      1 | node1.alteeve.ca | MEM H          | 36    | degrees C | OK    |              |                  | 2015-03-06 01:16:32.573795+00
53 |      2 | node1.alteeve.ca | MEM H          | 37    | degrees C | OK    |              |                  | 2015-03-06 01:31:17.301801+00
73 |      3 | node1.alteeve.ca | MEM H          | 37    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.378846+00
19 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:02.614824+00
39 |      1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:32.653909+00
59 |      2 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:31:17.370253+00
79 |      3 | node1.alteeve.ca | RAID Controller | 76    | degrees C | OK    |              |                  | 2015-03-06 01:32:34.447756+00
(36 rows)
</syntaxhighlight>
So the RAM is sitting around 35°C and the RAID controller is sitting around 75°C. So to trigger a CRITICAL shutdown, we'll need five values or more. In my case, I had eight RAM modules, so that is enough to trigger a shut down. We'll modify those.
To save time restoring after the test is done, let's copy our properly configured <span class="code">ipmi.conf</span> out of the way. We'll copy it back when the testing is done.


<syntaxhighlight lang="bash">
cp /etc/striker/Config/ipmi.conf /etc/striker/Config/ipmi.conf.good
</syntaxhighlight>

Next, for the current database ID that we're reading from, we note that the server with the given ID exists in the <span class="code">public</span> database schema. We'll also set 'seen' to '0' for now. We'll see why in a moment.

<syntaxhighlight lang="perl">
$an->data->{db_data}{$id}{servers}{server_uuid}{$server_uuid}{'exists'} = 1;
$an->data->{db_data}{$id}{servers}{server_uuid}{$server_uuid}{seen}     = 0;
</syntaxhighlight>


Finally, record the same data again, this time in a hash identified by the currently active database ID.

<syntaxhighlight lang="perl">
$an->data->{db_data}{$id}{servers}{modified_date}{$modified_date}{server_uuid}{$server_uuid} = {
	server_name                     => $server_name,
	server_stop_reason              => $server_stop_reason,
	server_start_after              => $server_start_after,
	server_start_delay              => $server_start_delay,
	server_note                     => $server_note,
	server_definition               => $server_definition,
	server_host                     => $server_host,
	server_state                    => $server_state,
	server_migration_type           => $server_migration_type,
	server_pre_migration_script     => $server_pre_migration_script,
	server_pre_migration_arguments  => $server_pre_migration_arguments,
	server_post_migration_script    => $server_post_migration_script,
	server_post_migration_arguments => $server_post_migration_arguments,
};
</syntaxhighlight>

So, once the read is done from all accessible databases, we'll have a set of hashes; one being the unified collection of all data from both/all sources, plus a hash for each database.

{{note|1=This looks a little complicated, but it is worth the mental effort. With this in place, users will never need to worry about data recovery or synchronization so long as even one copy of the database exists somewhere. ScanCore database servers can come and go or be destroyed and replaced trivially. So please bear with it... The logic seems complex, but it is fundamentally quite simple.}}

With this, here is the sync process:

# Walk through the unified records for each given modification timestamp, newest records first, oldest records last.
## Walk through each unique record for the given timestamp (continuing the example, this would be each server's UUID).
### Loop through each connected database ID.
#### Check to see if the unique record ID '''has been seen''' in the resync process yet. (''Note'': This will always be 'not seen' the first time, because the first instance of a record at the most recent timestamp will go into the <span class="code">public</span> schema, where all other records will go into the <span class="code">history</span> schema.)
##### '''IF NOT seen''':
###### Mark the record as now having been seen.
###### Check to see if the unique record ID '''exists at all''' on this database.
####### '''IF exists''': Does the record at the current timestamp exist?
######## '''IF NOT at this timestamp''': <span class="code">UPDATE</span> the <span class="code">public</span> schema (the record was already in the <span class="code">public</span> schema, but it was old).
####### '''IF NOT exists''': <span class="code">INSERT</span> it into the <span class="code">public</span> schema, as the record didn't exist yet.
##### '''IF seen''':
###### Does it exist at this timestamp?
####### '''If not at this timestamp''': <span class="code">INSERT</span> it into the <span class="code">history</span> schema at the current timestamp.

All of these <span class="code">UPDATE</span> and <span class="code">INSERT</span> calls go into an array per database. When all the unified records have been processed, each database array with one or more records is then sent to the given database, to be processed in one transaction.

Lastly, the hashes that stored all the unified and per-DB records are deleted to free up memory.

Voila! Your data is now synchronized on all databases!

Now we'll edit <span class="code">/etc/striker/Config/ipmi.conf</span>:

<syntaxhighlight lang="bash">
vim /etc/striker/Config/ipmi.conf
</syntaxhighlight>

The memory entries should look like this, normally:

<syntaxhighlight lang="text">
ipmi::MEM A::ok         = 45
ipmi::MEM A::warn       = 55
ipmi::MEM A::hysteresis =  1
ipmi::MEM A::units      = degrees C
</syntaxhighlight>

We'll change them all to:

<syntaxhighlight lang="text">
ipmi::MEM A::ok         = 20
ipmi::MEM A::warn       = 30
ipmi::MEM A::hysteresis =  1
ipmi::MEM A::units      = degrees C
</syntaxhighlight>

Once you've edited five or more values down, save the file.

Before we run the test, we need to tell Scanner how to shut down the [[Anvil!]]. In [[Striker]], there is a script called <span class="code">[https://github.com/digimer/striker/blob/master/tools/safe_anvil_shutdown safe_anvil_shutdown]</span> which can be found on the dashboards at <span class="code">/var/www/tools/safe_anvil_shutdown</span>. We need to copy this onto the nodes:

<syntaxhighlight lang="bash">
rsync -av /var/www/tools/safe_anvil_shutdown root@an-a05n01:/root/
rsync -av /var/www/tools/safe_anvil_shutdown root@an-a05n02:/root/
</syntaxhighlight>
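
With the script on both nodes, Scanner only needs to know where to find it. Conceptually, the hook is small; something along these lines (an illustrative sketch only, not Scanner's actual code, and the health file format here is assumed for illustration):

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative sketch of the idea only. The two paths come straight
# from the scanner.conf entries described below.
use strict;
use warnings;

my $healthfile = "/shared/status/.an-a05n01";
my $shutdown   = "/root/safe_anvil_shutdown";

# Write this node's health state where the peer can read it.
sub record_health
{
	my ($state) = @_;
	open(my $fh, ">", $healthfile) or die "Failed to write $healthfile: $!\n";
	print $fh "health = $state\n";    # format assumed for illustration
	close $fh;
}

my $state = "ok";    # in reality, derived from the agents' scan results
record_health($state);

# On a CRITICAL result, hand control to the shutdown script, which
# decides between migrate-then-withdraw and a full shutdown.
system($shutdown) if $state eq "critical";
</syntaxhighlight>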


Now we need to configure Scanner to call it when a <span class="code">CRITICAL</span> state is reached. We do this by editing the <span class="code">scanner.conf</span> file.

<syntaxhighlight lang="bash">
vim /etc/striker/Config/scanner.conf
</syntaxhighlight>

There are two key entries to set:

<syntaxhighlight lang="text">
scanner::healthfile = /shared/status/.an-a05n01
scanner::shutdown   = /root/safe_anvil_shutdown
</syntaxhighlight>

The <span class="code">scanner::healthfile</span> '''MUST''' match the short host name of the node with a preceding '<span class="code">.</span>'. To determine the name to use, you can run:

<syntaxhighlight lang="bash">
clustat | grep Local | awk '{print $1}' | awk -F '.' '{print $1}'
</syntaxhighlight>
<syntaxhighlight lang="text">
an-a05n01
</syntaxhighlight>

If the cluster isn't running on the node, and provided you built the cluster using [[AN!Cluster_Tutorial_2#Node_Host_Names|proper host names]], you can get the name to use with this:

<syntaxhighlight lang="bash">
uname -n | awk -F '.' '{print $1}'
</syntaxhighlight>
<syntaxhighlight lang="text">
an-a05n01
</syntaxhighlight>

This is important because <span class="code">safe_anvil_shutdown</span> will look for the file <span class="code">/shared/status/.<peer's name></span>. If it finds that file, it can determine the health of the peer. If the peer is healthy, <span class="code">safe_anvil_shutdown</span> will assume the CRITICAL state is localized, so it will migrate servers to the peer before shutting down. If the peer is sick, however, it will gracefully shut down the servers before powering off.

So setting <span class="code">scanner::healthfile = /shared/status/.an-a05n01</span> allows our peer to check our state if we go critical, enabling this intelligence to work reliably.
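
As an illustration, the peer-health check that <span class="code">safe_anvil_shutdown</span> performs can be sketched like this (a minimal sketch only; the real script does much more):

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Read the peer's status file and decide how to react to a CRITICAL state.
use strict;
use warnings;

my $peer_file = "/shared/status/.an-a05n02";
my $healthy   = 0;

if (-e $peer_file)
{
    open my $fh, "<", $peer_file or die "Failed to read $peer_file: $!\n";
    while (<$fh>)
    {
        $healthy = 1 if /^health\s*=\s*ok/;
    }
    close $fh;
}

if ($healthy)
{
    print "Peer is healthy; migrate servers to the peer, then shut down.\n";
}
else
{
    print "Peer is sick or unknown; gracefully shut down servers, then power off.\n";
}
</syntaxhighlight>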


The second value, <span class="code">scanner::shutdown</span>, is the program that Scanner will execute when it goes critical. This should always be <span class="code">/root/safe_anvil_shutdown</span> (or the path to the program, if you saved it elsewhere).

Save the changes and exit.


=== Testing one node going critical ===


For the first test, we're going to run a server on <span class="code">an-a05n01</span> and lower its sensor limits enough to trigger an immediate crisis. We'll leave the configuration on the second node as normal. This way, if all goes well, starting Scanner on the first node will cause the hosted server to be migrated, and then the node will withdraw from the cluster and shut down.


Edit <span class="code">an-a05n01</span>'s <span class="code">ipmi.conf</span> as discussed, start the cluster and run a test server on the node.


{{note|1=TODO: Show example output.}}


Start Scanner on <span class="code">an-a05n02</span> and verify that it wrote its status file and that we can read it from <span class="code">an-a05n01</span>.


On '''<span class="code">an-a05n02</span>''':


<syntaxhighlight lang="bash">
{|class="wikitable sortable"
/usr/share/striker/bin/scanner
!Suffix
</syntaxhighlight>
!String Key
<syntaxhighlight lang="text">
!Note
Replacing defective previous scanner: OLD_PROCESS_RECENT_CRASH
|-
Starting /usr/share/striker/bin/scanner at Fri Mar  6 03:26:30 2015.
|class="code"|%
Program scanner reading from DB '10.20.4.1'.
|class="code"|tools_suffix_0016
scan 1425612390.18275 [bonding,ipmi,raid], [].
|Percentage
id na | 2015-03-06 03:26:30+0000: node2.ccrs.bcn->scanner (22338); DEBUG: Old process crashed recently.; (0 : pidfile check)
|-
</syntaxhighlight>
|class="code"|W
|class="code"|tools_suffix_0017
|[https://en.wikipedia.org/wiki/Watt Watts]
|-
|class="code"|vDC
|class="code"|tools_suffix_0018
|[https://en.wikipedia.org/wiki/Volt Volts] [https://en.wikipedia.org/wiki/Direct_current DC]
|-
|class="code"|vAC
|class="code"|tools_suffix_0019
|[https://en.wikipedia.org/wiki/Volt Volts] [https://en.wikipedia.org/wiki/Alternating_current AC]
|-
|class="code"|A
|class="code"|tools_suffix_0020
|[https://en.wikipedia.org/wiki/Ampere Amperes]
|-
|class="code"|RPM
|class="code"|tools_suffix_0021
|[https://en.wikipedia.org/wiki/Revolutions_per_minute Rotations Per Minute]
|-
|class="code"|Bps
|class="code"|tools_suffix_0022
|[https://en.wikipedia.org/wiki/Data_rate_units Bits per second]
|-
|class="code"|Kbps
|class="code"|tools_suffix_0023
|[https://en.wikipedia.org/wiki/Data_rate_units#Kilobit_per_second Kilobits per second]
|-
|class="code"|Mbps
|class="code"|tools_suffix_0024
|[https://en.wikipedia.org/wiki/Data_rate_units#Megabit_per_second Megabits per second]
|-
|class="code"|Gbps
|class="code"|tools_suffix_0025
|[https://en.wikipedia.org/wiki/Data_rate_units#Gigabit_per_second Gigabits per second]
|-
|class="code"|Tbps
|class="code"|tools_suffix_0026
|[https://en.wikipedia.org/wiki/Data_rate_units#Terabit_per_second Terabits per second]
|-
|class="code"|Bytes
|class="code"|--
|{{note|1=Only whole byte values are supported. Fractional byte values will not be converted.}}
These will be translated to the [[Base-2]] human readable size via the '<span class="code">[https://github.com/ClusterLabs/striker/blob/master/AN/Tools/Readable.pm AN::Tools::Readable->bytes_to_hr()]</span>' method. The suffix returned are those accepted by the [https://en.wikipedia.org/wiki/International_System_of_Quantities ISQ] for base-2 short forms. The [[IEC and SI Size Notations|sizes returned]] are; <span class="code">[[KiB]]</span>, <span class="code">[[MiB]]</span>, <span class="code">[[GiB]]</span>, [[TiB]], <span class="code">[[PiB]]</span>, <span class="code">[[EiB]]</span>, <span class="code">[[ZiB]]</span> and <span class="code">[[YiB]]</span>. <span class="code">KiB</span> is rounded to one decimal place, <span class="code">MiB</span> through <span class="code">TiB</span> are rounded to two decimal places and <span class="code">PiB</span> through <span class="code">YiB</span> are rounded to three decimal places.
|-
|class="code"|sec
|class="code" style="text-align: center;"|tools_suffix_0027
~<br />
tools_suffix_0031
|{{note|1=Only whole seconds are supported. Fractional values will not be converted.}}
The number of seconds given will be returned as a human-readable period of time in the short format '<span class="code">#w, #d, #h, #m, #s</span>' via the '<span class="code">[https://github.com/ClusterLabs/striker/blob/master/AN/Tools/Readable.pm AN::Tools::Readable->time()]</span>' method. If the number of seconds is too short for a number of minutes, hours, days or weeks, those units will be omitted.
|-
|class="code"|seconds
|class="code" style="text-align: center;"|tools_suffix_0032
~<br />
tools_suffix_0036
|{{note|1=Only whole seconds are supported. Fractional values will not be converted.}}
The number of seconds given will be returned as a human-readable period of time in the long format '<span class="code"># Weeks, # Days, # Hours, # Minutes, # Seconds</span>' via the '<span class="code">[https://github.com/ClusterLabs/striker/blob/master/AN/Tools/Readable.pm AN::Tools::Readable->time()]</span>' method. If the number of seconds is too short for a number of minutes, hours, days or weeks, those units will be omitted.
|-
|class="code"|Second
|class="code"|tools_suffix_0037
|Singular "Second".
|-
|class="code"|Seconds
|class="code"|tools_suffix_0038
|{{note|1=Note that this has a capitalised 'S'.}}
Plural "Seconds".
|-
|class="code"|Minute
|class="code"|tools_suffix_0039
|Singular "Minute"
|-
|class="code"|Minutes
|class="code"|tools_suffix_0040
|Plural "Minutes"
|-
|class="code"|Hour
|class="code"|tools_suffix_0041
|Singular "Hour".
|-
|class="code"|Hours
|class="code"|tools_suffix_0042
|Plural "Hours".
|-
|class="code"|Day
|class="code"|tools_suffix_0043
|Singular "Day".
|-
|class="code"|Days
|class="code"|tools_suffix_0044
|Plural "Days".
|-
|class="code"|Week
|class="code"|tools_suffix_0045
|Singular "Week".
|-
|class="code"|Weeks
|class="code"|tools_suffix_0046
|Plural "Weeks".
|-
|class="code"|C
|class="code" style="text-align: center;"|tools_suffix_0010
or<br />
tools_suffix_0012
|The value is in celsius. Which string is returned will depend on the notification target's preference for metric or imperial units of measurement. If metric (the default), <span class="code">tools_suffix_0010</span> is appended to the value and returned. If imperial, the value is converted to fahrenheit and the suffix <span class="code">tools_suffix_0012</span> will be appended.
|}


Wait a minute, and then check the status file:
 
From '''<span class="code">an-a05n01</span>''':
 
<syntaxhighlight lang="bash">
cat /shared/status/.an-a05n02
</syntaxhighlight>
<syntaxhighlight lang="text">
health = ok
</syntaxhighlight>
 
Make sure <span class="code">an-a05n01</span> is in the cluster and is hosting a server:
 
<syntaxhighlight lang="bash">
clustat
</syntaxhighlight>
<syntaxhighlight lang="text">
Cluster Status for an-anvil-05 @ Fri Mar  6 02:30:31 2015
Member Status: Quorate
 
Member Name                            ID  Status
------ ----                            ---- ------
an-a05n01.alteeve.ca                      1 Online, Local, rgmanager
an-a05n02.alteeve.ca                      2 Online, rgmanager
 
Service Name                  Owner (Last)                   State       
------- ----                  ----- ------                  -----       
service:libvirtd_n01          an-a05n01.alteeve.ca          started     
service:libvirtd_n02          an-a05n02.alteeve.ca          started     
service:storage_n01            an-a05n01.alteeve.ca          started     
service:storage_n02            an-a05n02.alteeve.ca          started     
vm:vm01-rhel6                  an-a05n01.alteeve.ca          started     
</syntaxhighlight>
 
OK, now start Scanner on <span class="code">an-a05n01</span>!
 
<syntaxhighlight lang="bash">
</syntaxhighlight>
<syntaxhighlight lang="text">
</syntaxhighlight>




{|class="wikitable sortable"
!Suffix
!String Key
!Note
|-
|class="code"|Yes
|class="code"|tools_suffix_0047
|The affirmative string "Yes".
|-
|class="code"|No
|class="code"|tools_suffix_0048
|The negative string "No".
|-
|class="code"|Enabled
|class="code"|tools_suffix_0049
|The string "Enabled".
|-
|class="code"|Disabled
|class="code"|tools_suffix_0050
|The string "Enabled".
|-
|class="code"|On
|class="code"|tools_suffix_0051
|The string "On".
|-
|class="code"|Off
|class="code"|tools_suffix_0052
|The string "Off".
|-
|class="code"|
|class="code"|
|
|}


<span class="code"></span>
<span class="code"></span>
<syntaxhighlight lang="bash">
</syntaxhighlight>
<syntaxhighlight lang="text">
</syntaxhighlight>
== Enabling Scanner ==
<syntaxhighlight lang="text">
# Crontab
*/5 * * * * /usr/share/striker/bin/scanner
</syntaxhighlight>
<span class="code"></span>
<syntaxhighlight lang="bash">
</syntaxhighlight>
<syntaxhighlight lang="text">
</syntaxhighlight>
Test an agent directly, with verbose output:


<syntaxhighlight lang="bash">
<syntaxhighlight lang="perl">
Agents/ipmi --verbose --verbose
</syntaxhighlight>
<syntaxhighlight lang="text">
ipmi loop 1 at 1421444884.53996 2378.437:27621.563 mSec.
^C
</syntaxhighlight>
</syntaxhighlight>
Yay!


== Example 1; Overheating ==


How does ScanCore make these decisions?

Multiple scan agents record thermal data. The <span class="code">scan-ipmitool</span> agent checks the host's IPMI sensor data, which includes many thermal sensors and their upper and lower warning and critical thresholds. The <span class="code">scan-storcli</span> agent scans AVAGO-based RAID controllers and the attached hard drives and solid state drives, which also have thermal data. The same is true of many UPSes, ethernet switches and so forth.

As each agent checks its thermal sensors, any within nominal ranges are recorded by the agent in its own database tables. Any that are in a warning state (that is, overly warm or cool, but not yet a problem) also get pushed into a special '<span class="code">temperature</span>' database table. On its own, this does nothing more than mark the node's health as 'warning'; no further action is taken.

If an agent finds a sensor reaching a 'critical' state, that is, hot enough or cold enough to be a real concern, it is also pushed into the '<span class="code">temperature</span>' table. At the end of the scan, ScanCore will "add up" the number of sensors that are critical.

If the sum of the critical sensors exceeds a limit, and if the host is a node, ScanCore will take action by shutting down. Each sensor has a default weight of '1' and, by default, the shutdown threshold is "greater than five", so a node will shut down when six or more sensors go critical. Both the per-sensor weights and the shutdown threshold are configurable.
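
As a minimal sketch of that summing (the sensor data, weights and the printed action are illustrative, not ScanCore's actual internals):

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative sketch of the post-scan thermal heuristic described above.
use strict;
use warnings;

# Sensors that agents pushed into the 'temperature' table this pass.
my %temperature = (
    'CPU1 Temp'    => { state => 'critical', weight => 1 },
    'CPU2 Temp'    => { state => 'critical', weight => 1 },
    'Ambient Temp' => { state => 'warning',  weight => 1 },
);

# Sum the weights of all critical sensors.
my $sum = 0;
for my $sensor (keys %temperature)
{
    next if $temperature{$sensor}{state} ne 'critical';
    $sum += $temperature{$sensor}{weight};
}

# With the default weight of '1' and the default threshold of "greater
# than five", six or more critical sensors trigger a shutdown.
print "Thermal shutdown!\n" if $sum > 5;
</syntaxhighlight>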

Later, when the still-accessible temperature sensors return to an acceptable level, ScanCore running on any one of the dashboards will power the node back up. Note that ScanCore will check how many times a node has overheated recently and extend a "cool-down" period before rebooting a node. This way, a node with a chronic overheating condition will be rebooted less often. Once repaired though, the reboots will eventually be "forgotten" and the cool-down delay will reset.

What about thermal load shedding?

The example above spoke to a single node overheating. If you recall, ScanCore does "post-scan calculations". When running on a node, these include a check of whether the peer has also entered a thermal "warning" state. Using a similar heuristic, when both nodes have had enough temperature sensors in a 'warning' or 'critical' state for more than a set period of time, one of the nodes will be withdrawn and shut down.

Unlike the example above, where the host node shut down once the critical threshold was passed, load shedding kicks in only when both nodes are registering a thermal event at the same time for more than a set (and configurable) period of time.

== Example 2; Loss of input power ==

In all Anvil! systems, at least two network-monitored UPSes are powering the nodes' redundant power supplies. Thus, the loss of one UPS does not pose a risk to the system and can be ignored. Traditionally, most UPS monitoring software would assume it was the sole power provider for a machine and would initiate a shutdown if it reached critically low power levels.

ScanCore, however, understands that each node has two (or more) power sources. If one UPS loses mains power, an alert will be registered but nothing more will be done. Should that UPS deplete entirely, additional alerts will be registered when input power is lost to one of the redundant power supplies, but otherwise nothing more will happen.

Thus, ScanCore is redundancy-aware.

Consider another power scenario; power is lost to both UPSes feeding a node. In this case, ScanCore does two things:

# It begins monitoring the estimated hold-up time of the strongest UPS. If the strongest UPS drops below a minimum hold-up time, a graceful shutdown of hosted servers is initiated, followed by the node(s) withdrawing and powering off. Note that if different UPSes power the nodes, ScanCore will know that the peer is healthy and will migrate servers to the node with power long before the node needs to shut down.
# In a typical install, the same pair of UPSes power both nodes in the Anvil!. In the case where power is lost to both UPSes, a timer is checked. Once both nodes have been running on UPS batteries for more than two minutes, load shedding will occur. If needed, servers will migrate to consolidate on one node, then the sacrificial node will withdraw and power off to extend the runtime of the remaining node.

If, after load shedding, power stays out for too long and minimum hold-up times are crossed, the remaining node will gracefully shut down the servers and then power itself off.
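
A minimal sketch of this node-side decision (the UPS data layout, thresholds and printed actions are illustrative assumptions):

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative sketch of the hold-up / load-shed logic described above.
use strict;
use warnings;

my %ups = (
    ups1 => { on_battery => 1, time_on_battery => 180, hold_up_minutes => 9 },
    ups2 => { on_battery => 1, time_on_battery => 180, hold_up_minutes => 7 },
);

my $minimum_hold_up = 5;     # minutes; below this, shut down gracefully
my $load_shed_after = 120;   # seconds on batteries before shedding a node

my $all_on_battery = 1;
my $strongest      = 0;
my $shortest_time  = 999999;
for my $name (keys %ups)
{
    $all_on_battery = 0 if not $ups{$name}{on_battery};
    $strongest      = $ups{$name}{hold_up_minutes} if $ups{$name}{hold_up_minutes} > $strongest;
    $shortest_time  = $ups{$name}{time_on_battery} if $ups{$name}{time_on_battery} < $shortest_time;
}

# Only act when all input power is gone.
if ($all_on_battery)
{
    if ($strongest < $minimum_hold_up)
    {
        print "Hold-up time critical; gracefully shut down servers, then power off.\n";
    }
    elsif ($shortest_time > $load_shed_after)
    {
        print "On batteries for over two minutes; shed load by withdrawing one node.\n";
    }
}
</syntaxhighlight>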

Later, power is restored.

At this point, the Striker dashboards will boot (if all power was lost). Once up, they will note that both nodes are off and check the UPSes. If both UPSes are depleted (or minimally charged), the dashboards will take no action. Instead, they will monitor the charge rate of the UPSes. Once one of the UPSes hits a minimum charge percentage, the dashboards will boot the nodes and restore full Anvil! services, including booting all servers.

The logic behind the delay is to ensure that, if mains power is lost immediately after powering the nodes back on, there is sufficient charge for the nodes to power back up, detect the loss and shut back down safely.
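
The dashboard-side recovery logic can be sketched the same way (node names and the minimum charge are, again, illustrative):

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative sketch of the post-outage boot decision described above.
use strict;
use warnings;

my %node_is_off    = ( 'an-a05n01' => 1, 'an-a05n02' => 1 );
my %ups_charge     = ( ups1 => 62, ups2 => 58 );   # percent
my $minimum_charge = 45;   # enough to survive an immediate second outage

my $nodes_off     = grep { $_ } values %node_is_off;
my ($best_charge) = sort { $b <=> $a } values %ups_charge;

# Wait for a UPS to recharge before booting the nodes, so that a second
# mains loss right after boot still leaves time to shut down safely.
if ($nodes_off == scalar(keys %node_is_off) and $best_charge >= $minimum_charge)
{
    print "Booting nodes to restore the Anvil! pair.\n";
}
</syntaxhighlight>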

== Example 3; Node Health ==

The final example will show how ScanCore can react to a localized node issue.

Consider the scenario where Node 1 is the active host. The RAID controller on the host reports that a hard drive is potentially failing. An alert is generated but no further action is taken.

Later, a drive fails entirely and the node enters a degraded state.

At this point, ScanCore notes that Node 1 is now in a 'warning' state while the peer node is 'ok', and a timer is started. Recall that ScanCore can't determine the nature of a warning, so it pauses a little to avoid acting on a transient issue. Two minutes after the failure, with the 'warning' state still present, ScanCore will migrate all hosted servers over to Node 2.
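
A minimal sketch of this 'degraded node' rule (the health states and delay are wired in as illustrative values):

<syntaxhighlight lang="perl">
#!/usr/bin/perl
# Illustrative sketch of the pre-emptive migration logic described above.
use strict;
use warnings;

my %health          = ( local => 'warning', peer => 'ok' );
my $warning_started = time - 150;   # when the local 'warning' was first seen
my $migrate_delay   = 120;          # seconds to wait out transient issues

# Only act when we are degraded, the peer is healthy, and the warning has
# persisted past the delay; then move the servers pre-emptively.
if ($health{local} eq 'warning' and $health{peer} eq 'ok'
    and (time - $warning_started) > $migrate_delay)
{
    print "Migrating hosted servers to the healthy peer.\n";
}
</syntaxhighlight>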

Node 1 will remain in the Anvil! and no further action will be taken. However, if a second drive were now to fail (assuming RAID level 5), Node 1 would be lost and fenced, but no interruption would occur because the servers were already moved as a precaution.

If the drive is replaced before any further issues arise, Node 1 would return to an 'ok' state but nothing else would happen. Servers would be left on Node 2 because there is no benefit or concern around which node is hosting the servers at any given time.

== Scan Agents ==

When an agent runs and connects to the database layer, a timestamp is created, and that timestamp is then used for all database changes made in that given pass. This means that the modification timestamps will be the same across a given pass, regardless of the actual time each record was changed. This makes resynchronization far more sane, at the cost of some resolution.

If your agent needs accurate record-change timestamps, record the actual change time in a separate database column.
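
As a sketch of the idea (the table and column names here are hypothetical; only the pattern matters):

<syntaxhighlight lang="perl">
# The shared per-pass timestamp still goes into 'modified_date', while a
# dedicated 'foo_changed_at' column records when the agent actually saw
# the change. All 'foo_*' names and '$pass_timestamp' are illustrative
# stand-ins.
my $query = "
INSERT INTO
    foo
(
    foo_uuid,
    foo_value,
    foo_changed_at,
    modified_date
)
VALUES
(
    ".$an->data->{sys}{use_db_fh}->quote($foo_uuid).",
    ".$an->data->{sys}{use_db_fh}->quote($foo_value).",
    ".$an->data->{sys}{use_db_fh}->quote($foo_changed_at).",
    ".$an->data->{sys}{use_db_fh}->quote($pass_timestamp)."
);";
</syntaxhighlight>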

== DB Resync ==

Part of the difference between ScanCore and various other tools is that ScanCore is designed from its core as a resilient project. The data collected by agents needs to, from the user's perspective, sync N-way between ScanCore databases without the user needing to worry about backups, recoveries and whatnot.

How does this work?

In essence, the data agents collect can be categorized in one of two ways;

* Data that is global (like data on servers on the Anvil! platform)
* Data that is target-bound (like a host's sensor data from IPMI interfaces, or a given machine's view of the UPSes it cares about)

As an agent author, you need to consider that data may exist in some databases and not others.

Consider;

A site has two Striker dashboards acting as ScanCore databases. This is a satellite office, so your data also replicates to a third Striker at head office. Meanwhile, head office is collecting data from many different sites, and the two dashboards at your site don't care about the data the head-office site holds from those other locations.

{{warning|1=Isolating data onto a limited number of databases is an efficiency effort, not a security effort! If you don't trust a ScanCore database machine, don't connect to it, period. Similarly, if you don't trust a machine with access to your database, don't give the owner access.}}

You also need to plan for N-directional resynchronization.

Also consider;

Power is lost to both/all UPSes and load-shedding takes "Striker 2" offline. Now data is being recorded to "Striker 1" that will need to be copied to Striker 2 later. Time passes and all power is lost. Power is restored, but for some reason Striker 2 boots up first and starts collecting data. Eventually, Striker 1 comes back online.

Now, Striker 1 has data that 2 doesn't, and Striker 2 has data that 1 doesn't.

ScanCore has already solved this problem using the following schemes, depending on which type of data your agent collects.

{{note|1=Yes, this is expensive in terms of memory and processing power, relatively speaking. However, a lot of effort is made to never UPDATE the database unless something actually changes, keeping the history schema as small and efficient as possible. For this reason, even data collected from many nodes over a long period of time should not add up to too much. If you are concerned, be sure to run periodic archiving of the data.}}
{{warning|1=As this is written, automatic archiving has not been implemented, though it is planned to be implemented shortly.}}

=== Resync Global Data ===

This is the simplest data to resync because it will go to all databases, no matter what. This is rare in practice but provides a good starting point.

The process;

The agent starts and connects to the databases. As part of the connection process, a check is made to see if any databases are behind (see <span class="code">AN::Tools::DB.pm->find_behind_databases()</span>). If so, the agent will act on this by initiating a resync.
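
One plausible shape for that check, sketched here for illustration only (this is not the actual <span class="code">find_behind_databases()</span> implementation, and the <span class="code">dbh</span> handle hash and <span class="code">behind</span> flag are hypothetical names):

<syntaxhighlight lang="perl">
# Compare the newest 'modified_date' each connected database holds for a
# table. Any database whose newest record is older than the newest seen
# anywhere needs a resync.
my $newest = "";
my %last   = ();
foreach my $id (sort keys %{$an->data->{dbh}})
{
    my $query = "SELECT max(modified_date) FROM history.servers;";
    ($last{$id}) = $an->data->{dbh}{$id}->selectrow_array($query);
    $newest      = $last{$id} if $last{$id} gt $newest;
}
foreach my $id (sort keys %{$an->data->{dbh}})
{
    $an->data->{sys}{database}{$id}{behind} = ($last{$id} lt $newest) ? 1 : 0;
}
</syntaxhighlight>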

The resync process is fundamentally simple; all records are read in from the history schema of every connected database into a common hash, keyed on the time a given record was recorded and the unique ID of the record. The same data is also loaded into a database-specific hash for later comparison, and for each unique record we note that we've seen at least one copy of it, for use in a later step. An example "record" would be a server's UUID, which uniquely identifies it regardless of the host node or Anvil!.

Here is an example of how the data is read in:

		my $query = "
SELECT 
    server_uuid,
    server_name, 
    server_stop_reason, 
    server_start_after, 
    server_start_delay, 
    server_note, 
    server_definition, 
    server_host, 
    server_state, 
    server_migration_type, 
    server_pre_migration_script, 
    server_pre_migration_arguments, 
    server_post_migration_script, 
    server_post_migration_arguments, 
    modified_date 
FROM 
    history.servers 
;";

Without constraints, all data in the table will be read in. This data is recorded in the 'unified' hash using the modification time and the unique identifier as keys.

<syntaxhighlight lang="perl">
# Record this in the unified and local hashes.
$an->data->{db_data}{unified}{servers}{modified_date}{$modified_date}{server_uuid}{$server_uuid} = {
    server_name                     => $server_name,
    server_stop_reason              => $server_stop_reason,
    server_start_after              => $server_start_after,
    server_start_delay              => $server_start_delay,
    server_note                     => $server_note,
    server_definition               => $server_definition,
    server_host                     => $server_host,
    server_state                    => $server_state,
    server_migration_type           => $server_migration_type,
    server_pre_migration_script     => $server_pre_migration_script,
    server_pre_migration_arguments  => $server_pre_migration_arguments,
    server_post_migration_script    => $server_post_migration_script,
    server_post_migration_arguments => $server_post_migration_arguments,
};
</syntaxhighlight>

Next, for the current database ID that we're reading from, note that the server with the given UUID exists in that database's public schema. We also set its 'seen' flag to '0' for now; we'll see why in a moment.

<syntaxhighlight lang="perl">
$an->data->{db_data}{$id}{servers}{server_uuid}{$server_uuid}{'exists'} = 1;
$an->data->{db_data}{$id}{servers}{server_uuid}{$server_uuid}{seen}     = 0;
</syntaxhighlight>

Finally, record the same data in a second hash, this one keyed by the currently active database ID.

<syntaxhighlight lang="perl">
$an->data->{db_data}{$id}{servers}{modified_date}{$modified_date}{server_uuid}{$server_uuid} = {
    server_name                     => $server_name,
    server_stop_reason              => $server_stop_reason,
    server_start_after              => $server_start_after,
    server_start_delay              => $server_start_delay,
    server_note                     => $server_note,
    server_definition               => $server_definition,
    server_host                     => $server_host,
    server_state                    => $server_state,
    server_migration_type           => $server_migration_type,
    server_pre_migration_script     => $server_pre_migration_script,
    server_pre_migration_arguments  => $server_pre_migration_arguments,
    server_post_migration_script    => $server_post_migration_script,
    server_post_migration_arguments => $server_post_migration_arguments,
};
</syntaxhighlight>

So, once the read is done from all accessible databases, we'll have a set of hashes; one being the unified collection of all data from both/all sources, plus a hash for each database.

{{note|1=This looks a little complicated, but it is worth the mental effort. With this in place, users will never need to worry about data recovery or synchronization so long as even one copy of the database exists somewhere. ScanCore database servers can come and go or be destroyed and replaced trivially. So please bear with it... The logic seems complex, but it is fundamentally quite simple.}}

With this, here is the sync process:

# Walk through the unified records for each given modification timestamp, newest records first, oldest records last.
## Walk through each unique record for the given timestamp (continuing the example, this would be each server's UUID).
### Loop through each connected database ID.
#### Check to see if the unique record ID has been seen in the resync process yet. (Note: a record is always 'not seen' the first time, because the first instance of a record, at the most recent time stamp, will go into the public schema, where all older records will go into the history schema.)
##### IF NOT seen:
###### Mark the record as now having been seen.
###### Check to see if the unique record ID exists at all on this database.
####### IF exists: Does the record at the current time stamp exist?
######## IF NOT at this timestamp: UPDATE the public schema (the record was already in the public schema, but it was old).
####### IF NOT exists: INSERT it into the public schema, as the record didn't exist yet.
##### IF seen:
###### Does it exist at this timestamp?
####### If not at this timestamp: INSERT it into the history schema at the current timestamp.
All of these UPDATE and INSERT calls go into an array per database. When all the unified records have been processed, each database array with one or more records is then sent to the given database to be processed in one transaction.
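
A condensed sketch of that walk (the hash layout matches the 'unified' and per-DB hashes built earlier; <span class="code">@db_ids</span>, <span class="code">%queries</span> and the elided query bodies are illustrative placeholders):

<syntaxhighlight lang="perl">
foreach my $modified_date (sort { $b cmp $a } keys %{$an->data->{db_data}{unified}{servers}{modified_date}})
{
    foreach my $server_uuid (keys %{$an->data->{db_data}{unified}{servers}{modified_date}{$modified_date}{server_uuid}})
    {
        foreach my $id (@db_ids)
        {
            my $db   = $an->data->{db_data}{$id}{servers};
            my $here = $db->{modified_date}{$modified_date}{server_uuid}{$server_uuid} ? 1 : 0;
            if (not $db->{server_uuid}{$server_uuid}{seen})
            {
                # Newest instance of this record; it belongs in the public schema.
                $db->{server_uuid}{$server_uuid}{seen} = 1;
                if (not $db->{server_uuid}{$server_uuid}{'exists'})
                {
                    push @{$queries{$id}}, "INSERT INTO servers (...) VALUES (...);";
                }
                elsif (not $here)
                {
                    push @{$queries{$id}}, "UPDATE servers SET (...) WHERE server_uuid = ".$an->data->{sys}{use_db_fh}->quote($server_uuid).";";
                }
            }
            elsif (not $here)
            {
                # An older revision this database never saw; add it to history.
                push @{$queries{$id}}, "INSERT INTO history.servers (...) VALUES (...);";
            }
        }
    }
}
</syntaxhighlight>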

Lastly, the hashes that stored the unified and per-DB records are deleted to free memory.

Voila! Your data is now synchronized on all databases!

=== Resync Target-Bound Data ===

The only difference between resync'ing global data and resync'ing target-bound records is that a constraint is used on the initial reading of data from the connected databases.

We will use the <span class="code">[https://github.com/ClusterLabs/striker/blob/master/ScanCore/agents/scan-bond/scan-bond scan-bond]</span> agent as the example here. It monitors bonded network interfaces on each node or Striker dashboard. In all cases, the state of the bonds only matters to the one host with the actual bonds. The other nodes and dashboards don't care about it.

In this example, then, the bond records will be bound to the <span class="code">hosts</span> -> <span class="code">host_uuid</span>, which is stored on each machine in <span class="code">/etc/striker/host.uuid</span> and is presented in ScanCore as the <span class="code">sys::host_uuid</span> variable.

The read, then, looks like this;

<syntaxhighlight lang="perl">
my $query = "
SELECT
    bond_uuid,
    bond_name,
    bond_mode,
    bond_primary_slave,
    bond_primary_reselect,
    bond_active_slave,
    bond_mii_status,
    bond_mii_polling_interval,
    bond_up_delay,
    bond_down_delay,
    modified_date
FROM
    history.bond
WHERE
    bond_host_uuid = ".$an->data->{sys}{use_db_fh}->quote($an->data->{sys}{host_uuid})."
;";
</syntaxhighlight>

With the <span class="code">WHERE bond_host_uuid = ".$an->data->{sys}{use_db_fh}->quote($an->data->{sys}{host_uuid})."</span> constraint, all of the data read in from the database will come from the current host machine. Bond records for other nodes and dashboard systems will be ignored.

In this way, our data will sync between the ScanCore databases we use, but we ''won't'' sync bond records for other hosts (which may sync between an entirely different set of ScanCore databases).

The rest of the synchronization process is exactly the same as above. The unified and per-DB hashes will be processed exactly the same way (just with a subset of the data).

Easy peasy!

== Unit Parsing ==

One of the trickier bits of magic that ScanCore pulls off is the ability to simultaneously deliver alerts to different recipients in different languages. This is tricky because the agents setting alerts don't process the messages, so we need a standard way to pass values in an alert to ScanCore in a translatable format.

This is done via the special '<span class="code">alerts</span>' table.

{{note|1=Explain this...}}

When setting a string to be later translated using double-bang variables like '<span class="code">!!$variable!$value!!</span>', the '<span class="code">$value</span>' will be analysed for certain suffixes. Those suffixes, when found, are translated into the language-, unit- or human-readable-appropriate values. For example, '<span class="code">!!size!1024 bytes!!</span>' will be translated to the language-appropriate base-2 human readable size, '<span class="code">1 KiB</span>'.

Similarly, temperatures can also be unit-converted for the notification target. So a value like '<span class="code">!!core_temperature!30 C!!</span>' can be translated to '<span class="code">30°C</span>' or, for users preferring imperial measurements, '<span class="code">86°F</span>'.

The full list of translated special suffixes is:

{{note|1=The 'suffix' strings are case sensitive! If you want your agent's alerts to use these translations, please mind the case and spelling. This is strict to minimise the chance of accidentally formatting a string not meant to be translated by this feature.}}
Suffix String Key Note
% tools_suffix_0016 Percentage
W tools_suffix_0017 Watts
vDC tools_suffix_0018 Volts DC
vAC tools_suffix_0019 Volts AC
A tools_suffix_0020 Amperes
RPM tools_suffix_0021 Rotations Per Minute
Bps tools_suffix_0022 Bits per second
Kbps tools_suffix_0023 Kilobits per second
Mbps tools_suffix_0024 Megabits per second
Gbps tools_suffix_0025 Gigabits per second
Tbps tools_suffix_0026 Terabits per second
Bytes --
Note: Only whole byte values are supported. Fractional byte values will not be converted.

These will be translated to the Base-2 human readable size via the 'AN::Tools::Readable->bytes_to_hr()' method. The suffix returned are those accepted by the ISQ for base-2 short forms. The sizes returned are; KiB, MiB, GiB, TiB, PiB, EiB, ZiB and YiB. KiB is rounded to one decimal place, MiB through TiB are rounded to two decimal places and PiB through YiB are rounded to three decimal places.

sec tools_suffix_0027

~
tools_suffix_0031

Note: Only whole seconds are supported. Fractional values will not be converted.

The number of seconds given will be returned as a human-readable period of time in the short format '#w, #d, #h, #m, #s' via the 'AN::Tools::Readable->time()' method. If the number of seconds is too short for a number of minutes, hours, days or weeks, those units will be omitted.

seconds tools_suffix_0032

~
tools_suffix_0036

Note: Only whole seconds are supported. Fractional values will not be converted.

The number of seconds given will be returned as a human-readable period of time in the long format '# Weeks, # Days, # Hours, # Minutes, # Seconds' via the 'AN::Tools::Readable->time()' method. If the number of seconds is too short for a number of minutes, hours, days or weeks, those units will be omitted.

Second tools_suffix_0037 Singular "Second".
Seconds tools_suffix_0038
Note: Note that this has a capitalised 'S'.

Plural "Seconds".

Minute tools_suffix_0039 Singular "Minute"
Minutes tools_suffix_0040 Plural "Minutes"
Hour tools_suffix_0041 Singular "Hour".
Hours tools_suffix_0042 Plural "Hours".
Day tools_suffix_0043 Singular "Day".
Days tools_suffix_0044 Plural "Days".
Week tools_suffix_0045 Singular "Week".
Weeks tools_suffix_0046 Plural "Weeks".
C tools_suffix_0010

or
tools_suffix_0012

The value is in celsius. Which string is returned will depend on the notification target's preference for metric or imperial units of measurement. If metric (the default), tools_suffix_0010 is appended to the value and returned. If imperial, the value is converted to fahrenheit and the suffix tools_suffix_0012 will be appended.

In some cases, the value returned by a string is a simple string in a given language (usually English). To translate this, certain values will be translated based on the table below.

Note: Unlike 'value unit' pairs above, these are evaluated without case sensitivity.
Suffix String Key Note
Yes tools_suffix_0047 The affirmative string "Yes".
No tools_suffix_0048 The negative string "No".
Enabled tools_suffix_0049 The string "Enabled".
Disabled tools_suffix_0050 The string "Enabled".
On tools_suffix_0051 The string "On".
Off tools_suffix_0052 The string "Off".
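
To tie this together, an agent never formats these strings itself; it records 'value unit' pairs as alert variables and lets ScanCore translate them per recipient. A minimal sketch (the <span class="code">set_alert()</span> helper and its arguments are hypothetical stand-ins for an agent's alert-recording code):

<syntaxhighlight lang="perl">
# What matters is the 'value unit' formatting of each variable; ScanCore
# translates the suffixes per recipient when the message is generated.
set_alert($an, {
    message_key       => "scan_foo_message_0001",
    message_variables => {
        temperature => "30 C",               # '30°C', or '86°F' for imperial recipients
        ram         => "1073741824 Bytes",   # rendered as '1 GiB'
        uptime      => "93784 seconds",      # '1 Day, 2 Hours, 3 Minutes, 4 Seconds'
        healthy     => "Yes",                # translated per language
    },
});
</syntaxhighlight>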


 
