ScanCore
Warning: This is little more than raw notes; do not consider anything here to be valid or accurate at this time.
Installing
PostgreSQL Setup
yum install -y postgresql postgresql-server postgresql-plperl postgresql-contrib postgresql-libs Scanner
...
Complete!
DB config:
/etc/init.d/postgresql initdb
Initializing database: [ OK ]
Start
chkconfig postgresql on
/etc/init.d/postgresql start
Starting postgresql service: [ OK ]
Create the striker user.
su - postgres -c "createuser --no-superuser --createdb --no-createrole striker"
# no output expected
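If you'd like to confirm the role was created (an optional check), you can list the database roles:
su - postgres -c "psql -c '\du'"
The striker user should be listed with the 'Create DB' attribute.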
Set 'postgres' and 'striker' user passwords:
su - postgres -c "psql -U postgres"
psql (8.4.20)
Type "help" for help.
postgres=# \password
Enter new password:
Enter it again:
postgres=# \password striker
Enter new password:
Enter it again:
Exit.
postgres=# \q
Warning: In the below example, the BCN is 10.20.0.0/16 and the IFN is 192.168.199.0/24. If you have different networks, be sure to adjust your values accordingly!
Configure access:
cp /var/lib/pgsql/data/pg_hba.conf /var/lib/pgsql/data/pg_hba.conf.striker
vim /var/lib/pgsql/data/pg_hba.conf
diff -u /var/lib/pgsql/data/pg_hba.conf.striker /var/lib/pgsql/data/pg_hba.conf
--- /var/lib/pgsql/data/pg_hba.conf.striker 2015-03-05 14:33:40.902733374 +0000
+++ /var/lib/pgsql/data/pg_hba.conf 2015-03-05 14:34:44.861733318 +0000
@@ -65,9 +65,13 @@
# TYPE DATABASE USER CIDR-ADDRESS METHOD
+# dashboards
+host all all 192.168.199.0/24 md5
+# node servers
+host all all 10.20.0.0/16 md5
# "local" is for Unix domain socket connections only
-local all all ident
+local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 ident
# IPv6 local connections:
cp /var/lib/pgsql/data/postgresql.conf /var/lib/pgsql/data/postgresql.conf.striker
vim /var/lib/pgsql/data/postgresql.conf
diff -u /var/lib/pgsql/data/postgresql.conf.striker /var/lib/pgsql/data/postgresql.conf
--- /var/lib/pgsql/data/postgresql.conf.striker 2015-03-05 14:35:35.388733307 +0000
+++ /var/lib/pgsql/data/postgresql.conf 2015-03-05 14:36:07.111733159 +0000
@@ -56,7 +56,7 @@
# - Connection Settings -
-#listen_addresses = 'localhost' # what IP address(es) to listen on;
+listen_addresses = '*' # what IP address(es) to listen on;
# comma-separated list of addresses;
# defaults to 'localhost', '*' = all
# (change requires restart)
/etc/init.d/postgresql restart
Stopping postgresql service: [ OK ]
Starting postgresql service: [ OK ]
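If you'd like to confirm that PostgreSQL is now listening on all interfaces (an optional check; this assumes netstat from the net-tools package is installed):
netstat -tlnp | grep 5432
You should see the postmaster process bound to 0.0.0.0:5432.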
Striker Database Setup
Create DB:
su - postgres -c "createdb --owner striker scanner"
Password:
The SQL files we need to load are found in the /etc/striker/SQL directory:
ls -lah /etc/striker/SQL/
total 64K
drwxr-xr-x. 2 root root 4.0K Mar 4 23:50 .
drwxr-xr-x. 5 root root 4.0K Mar 4 23:50 ..
-rw-r--r--. 1 root root 397 Mar 4 23:41 00_drop_db.sql
-rw-r--r--. 1 root root 2.5K Mar 4 23:41 01_create_node.sql
-rw-r--r--. 1 root root 3.2K Mar 4 23:41 02_create_alerts.sql
-rw-r--r--. 1 root root 1.9K Mar 4 23:41 03_create_alert_listeners.sql
-rw-r--r--. 1 root root 1.3K Mar 4 23:41 04_load_alert_listeners.sql
-rw-r--r--. 1 root root 3.2K Mar 4 23:41 05_create_random_agent.sql
-rw-r--r--. 1 root root 3.4K Mar 4 23:41 06a_create_snm_apc_pdu.sql
-rw-r--r--. 1 root root 3.6K Mar 4 23:41 06b_create_snmp_brocade_switch.sql
-rw-r--r--. 1 root root 3.4K Mar 4 23:41 06_create_snm_apc_ups.sql
-rw-r--r--. 1 root root 3.5K Mar 4 23:41 07_create_ipmi.sql
-rw-r--r--. 1 root root 5.9K Mar 4 23:41 08_create_raid.sql
-rw-r--r--. 1 root root 3.8K Mar 4 23:41 09_create_bonding.sql
-rw-r--r--. 1 root root 1.2K Mar 4 23:41 Makefile
Note: The default database owner name is striker. If you used a different database owner name, please update the .sql files with the command sed -i 's/striker/yourname/' *.sql.
Load the SQL tables into the database.
cat /etc/striker/SQL/*.sql > /tmp/all.sql
psql scanner -U striker -f /tmp/all.sql
Password for user striker:
<sql load messages>
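If you want to see which tables were created (optional):
psql -U striker -d scanner -c "\dt"
This should list the tables created by the SQL files above, including alert_listeners, which we query next.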
Test:
psql -U striker -d scanner -c "SELECT * FROM alert_listeners"
Password for user striker:
id | name | mode | level | contact_info | language | added_by | updated
----+----------------+---------------+---------+----------------+----------+----------+-------------------------------
1 | screen | Screen | DEBUG | screen | en_CA | 0 | 2014-12-11 14:42:13.273057-05
2 | Tom Legrady | Email | DEBUG | tom@striker.ca | en_CA | 0 | 2014-12-11 16:54:25.477321-05
3 | Health Monitor | HealthMonitor | WARNING | | en_CA | 0 | 2015-01-14 14:08:15-05
(3 rows)
Done!
Configure ScanCore on a Node
Install dependencies:
yum install Scanner postgresql perl-DBD-Pg
On the clients, you need to be sure your configuration files are set the way you want.
Most important is that the connection details for the databases on the dashboards are configured properly. Most installs have two dashboards, and Scanner will record its data to both for resiliency.
The configuration files are found in /etc/striker/Config/.
ls -lah /etc/striker/Config/
total 68K
drwxr-xr-x. 2 root root 4.0K Mar 5 15:06 .
drwxr-xr-x. 5 root root 4.0K Mar 5 15:06 ..
-rw-r--r--. 1 root root 741 Mar 4 23:41 bonding.conf
-rw-r--r--. 1 root root 1.1K Mar 4 23:41 dashboard.conf
-rw-r--r--. 1 root root 379 Mar 4 23:41 db.conf
-rw-r--r--. 1 root root 5.1K Mar 4 23:41 ipmi.conf
-rw-r--r--. 1 root root 939 Mar 4 23:41 nodemonitor.conf
-rw-r--r--. 1 root root 1.2K Mar 4 23:41 raid.conf
-rw-r--r--. 1 root root 961 Mar 4 23:41 scanner.conf
-rw-r--r--. 1 root root 1.7K Mar 4 23:41 snmp_apc_pdu.conf
-rw-r--r--. 1 root root 8.9K Mar 4 23:41 snmp_apc_ups.conf
-rw-r--r--. 1 root root 4.7K Mar 4 23:41 snmp_brocade_switch.conf
-rw-r--r--. 1 root root 1.4K Mar 4 23:41 system_check.conf
In this example, the two Striker dashboards with our databases have the BCN IPs 10.20.4.1 and 10.20.4.2. Both use the database name scanner owned by the database user striker with the password secret. So their configurations will be nearly identical.
cp /etc/striker/Config/db.conf /etc/striker/Config/db.conf.original
vim /etc/striker/Config/db.conf
db::1::name = scanner
db::1::db_type = Pg
db::1::host = 10.20.4.1
db::1::port = 5432
db::1::user = striker
db::1::password = secret
db::2::name = scanner
db::2::db_type = Pg
db::2::host = 10.20.4.2
db::2::port = 5432
db::2::user = striker
db::2::password = secret
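Before calling any agents, it can help to confirm that both databases are reachable from the node with a direct psql call (an optional check, using the credentials configured above):
psql -h 10.20.4.1 -U striker -d scanner -c "SELECT 1;"
psql -h 10.20.4.2 -U striker -d scanner -c "SELECT 1;"
Each should prompt for the striker password and return a single row.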
With connectivity confirmed, let's test the full stack by manually calling a scan agent. The nodes have IPMI, so we will use the ipmi agent.
/usr/share/striker/agents/ipmi --verbose --verbose
Program ipmi writing to DB '10.20.4.1'.
Program ipmi writing to DB '10.20.4.2'.
ipmi loop 1 at 01:16:08 -> 960.295 ms elapsed; 29039.705 ms pending.
----------------------------------------------------------------------
ipmi loop 2 at 01:16:38 -> 1005.016 ms elapsed; 28994.984 ms pending.
----------------------------------------------------------------------
If all is well, it should record its values once every 30 seconds or so. Let it run a couple of loops, then press <ctrl> + c to stop the scan.
Now we can verify the data was written to both dashboards' databases:
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures;"
id | node_id | target | field | value | units | status | message_tag | message_arguments | timestamp
----+---------+------------------+-----------------+-------+-----------+---------+---------------+-------------------+-------------------------------
1 | 1 | node1.alteeve.ca | Ambient | 25 | degrees C | OK | | | 2015-03-06 01:16:02.390891+00
2 | 1 | node1.alteeve.ca | Systemboard 1 | 29 | degrees C | OK | | | 2015-03-06 01:16:02.415248+00
3 | 1 | node1.alteeve.ca | Systemboard 2 | 39 | degrees C | OK | | | 2015-03-06 01:16:02.429477+00
4 | 1 | node1.alteeve.ca | CPU1 | 35 | degrees C | OK | | | 2015-03-06 01:16:02.4434+00
5 | 1 | node1.alteeve.ca | CPU2 | 39 | degrees C | OK | | | 2015-03-06 01:16:02.455114+00
6 | 1 | node1.alteeve.ca | MEM A | 32 | degrees C | OK | | | 2015-03-06 01:16:02.466447+00
7 | 1 | node1.alteeve.ca | MEM B | 32 | degrees C | OK | | | 2015-03-06 01:16:02.47765+00
8 | 1 | node1.alteeve.ca | MEM C | 35 | degrees C | OK | | | 2015-03-06 01:16:02.489131+00
9 | 1 | node1.alteeve.ca | MEM D | 34 | degrees C | OK | | | 2015-03-06 01:16:02.500622+00
10 | 1 | node1.alteeve.ca | MEM E | 37 | degrees C | OK | | | 2015-03-06 01:16:02.51189+00
11 | 1 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:16:02.523267+00
12 | 1 | node1.alteeve.ca | MEM G | 34 | degrees C | OK | | | 2015-03-06 01:16:02.534761+00
13 | 1 | node1.alteeve.ca | MEM H | 36 | degrees C | OK | | | 2015-03-06 01:16:02.54614+00
14 | 1 | node1.alteeve.ca | PSU1 Inlet | 29 | degrees C | OK | | | 2015-03-06 01:16:02.557422+00
15 | 1 | node1.alteeve.ca | PSU2 Inlet | 28 | degrees C | OK | | | 2015-03-06 01:16:02.569362+00
16 | 1 | node1.alteeve.ca | PSU1 | 53 | degrees C | OK | | | 2015-03-06 01:16:02.580696+00
17 | 1 | node1.alteeve.ca | PSU2 | 56 | degrees C | OK | | | 2015-03-06 01:16:02.591993+00
18 | 1 | node1.alteeve.ca | BBU | 30 | degrees C | OK | | | 2015-03-06 01:16:02.603261+00
19 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:02.614824+00
20 | 1 | node1.alteeve.ca | summary | 1 | | WARNING | Value warning | value=1 | 2015-03-06 01:16:02.64331+00
21 | 1 | node1.alteeve.ca | Ambient | 25 | degrees C | OK | | | 2015-03-06 01:16:32.400365+00
22 | 1 | node1.alteeve.ca | Systemboard 1 | 29 | degrees C | OK | | | 2015-03-06 01:16:32.425598+00
23 | 1 | node1.alteeve.ca | Systemboard 2 | 39 | degrees C | OK | | | 2015-03-06 01:16:32.439627+00
24 | 1 | node1.alteeve.ca | CPU1 | 35 | degrees C | OK | | | 2015-03-06 01:16:32.453921+00
25 | 1 | node1.alteeve.ca | CPU2 | 39 | degrees C | OK | | | 2015-03-06 01:16:32.468253+00
26 | 1 | node1.alteeve.ca | MEM A | 32 | degrees C | OK | | | 2015-03-06 01:16:32.482567+00
27 | 1 | node1.alteeve.ca | MEM B | 32 | degrees C | OK | | | 2015-03-06 01:16:32.496698+00
28 | 1 | node1.alteeve.ca | MEM C | 35 | degrees C | OK | | | 2015-03-06 01:16:32.508425+00
29 | 1 | node1.alteeve.ca | MEM D | 34 | degrees C | OK | | | 2015-03-06 01:16:32.522475+00
30 | 1 | node1.alteeve.ca | MEM E | 37 | degrees C | OK | | | 2015-03-06 01:16:32.536592+00
31 | 1 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:16:32.548096+00
32 | 1 | node1.alteeve.ca | MEM G | 34 | degrees C | OK | | | 2015-03-06 01:16:32.559742+00
33 | 1 | node1.alteeve.ca | MEM H | 36 | degrees C | OK | | | 2015-03-06 01:16:32.573795+00
34 | 1 | node1.alteeve.ca | PSU1 Inlet | 29 | degrees C | OK | | | 2015-03-06 01:16:32.585372+00
35 | 1 | node1.alteeve.ca | PSU2 Inlet | 28 | degrees C | OK | | | 2015-03-06 01:16:32.599816+00
36 | 1 | node1.alteeve.ca | PSU1 | 53 | degrees C | OK | | | 2015-03-06 01:16:32.613983+00
37 | 1 | node1.alteeve.ca | PSU2 | 56 | degrees C | OK | | | 2015-03-06 01:16:32.628238+00
38 | 1 | node1.alteeve.ca | BBU | 30 | degrees C | OK | | | 2015-03-06 01:16:32.642372+00
39 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:32.653909+00
40 | 1 | node1.alteeve.ca | summary | 1 | | WARNING | Value warning | value=1 | 2015-03-06 01:16:32.682502+00
(40 rows)
We'll address the warnings in a moment. For now, this tells us that we are recording to dashboard 1 properly. Let's check dashboard 2:
psql -h 10.20.4.2 -U striker scanner -c "SELECT * FROM ipmi_temperatures;"
id | node_id | target | field | value | units | status | message_tag | message_arguments | timestamp
----+---------+------------------+-----------------+-------+-----------+---------+---------------+-------------------+-------------------------------
1 | 1 | node1.alteeve.ca | Ambient | 25 | degrees C | OK | | | 2015-03-06 01:16:08.689144+00
2 | 1 | node1.alteeve.ca | Systemboard 1 | 29 | degrees C | OK | | | 2015-03-06 01:16:08.708423+00
3 | 1 | node1.alteeve.ca | Systemboard 2 | 39 | degrees C | OK | | | 2015-03-06 01:16:08.722751+00
4 | 1 | node1.alteeve.ca | CPU1 | 35 | degrees C | OK | | | 2015-03-06 01:16:08.733944+00
5 | 1 | node1.alteeve.ca | CPU2 | 39 | degrees C | OK | | | 2015-03-06 01:16:08.74567+00
6 | 1 | node1.alteeve.ca | MEM A | 32 | degrees C | OK | | | 2015-03-06 01:16:08.756925+00
7 | 1 | node1.alteeve.ca | MEM B | 32 | degrees C | OK | | | 2015-03-06 01:16:08.768102+00
8 | 1 | node1.alteeve.ca | MEM C | 35 | degrees C | OK | | | 2015-03-06 01:16:08.779549+00
9 | 1 | node1.alteeve.ca | MEM D | 34 | degrees C | OK | | | 2015-03-06 01:16:08.791011+00
10 | 1 | node1.alteeve.ca | MEM E | 37 | degrees C | OK | | | 2015-03-06 01:16:08.802332+00
11 | 1 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:16:08.813697+00
12 | 1 | node1.alteeve.ca | MEM G | 34 | degrees C | OK | | | 2015-03-06 01:16:08.825063+00
13 | 1 | node1.alteeve.ca | MEM H | 36 | degrees C | OK | | | 2015-03-06 01:16:08.836604+00
14 | 1 | node1.alteeve.ca | PSU1 Inlet | 29 | degrees C | OK | | | 2015-03-06 01:16:08.848219+00
15 | 1 | node1.alteeve.ca | PSU2 Inlet | 28 | degrees C | OK | | | 2015-03-06 01:16:08.859965+00
16 | 1 | node1.alteeve.ca | PSU1 | 53 | degrees C | OK | | | 2015-03-06 01:16:08.870959+00
17 | 1 | node1.alteeve.ca | PSU2 | 56 | degrees C | OK | | | 2015-03-06 01:16:08.88233+00
18 | 1 | node1.alteeve.ca | BBU | 30 | degrees C | OK | | | 2015-03-06 01:16:08.893657+00
19 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:08.905299+00
20 | 1 | node1.alteeve.ca | summary | 1 | | WARNING | Value warning | value=1 | 2015-03-06 01:16:08.93407+00
21 | 1 | node1.alteeve.ca | Ambient | 25 | degrees C | OK | | | 2015-03-06 01:16:38.699395+00
22 | 1 | node1.alteeve.ca | Systemboard 1 | 29 | degrees C | OK | | | 2015-03-06 01:16:38.718864+00
23 | 1 | node1.alteeve.ca | Systemboard 2 | 39 | degrees C | OK | | | 2015-03-06 01:16:38.73341+00
24 | 1 | node1.alteeve.ca | CPU1 | 35 | degrees C | OK | | | 2015-03-06 01:16:38.747455+00
25 | 1 | node1.alteeve.ca | CPU2 | 39 | degrees C | OK | | | 2015-03-06 01:16:38.762113+00
26 | 1 | node1.alteeve.ca | MEM A | 32 | degrees C | OK | | | 2015-03-06 01:16:38.776163+00
27 | 1 | node1.alteeve.ca | MEM B | 32 | degrees C | OK | | | 2015-03-06 01:16:38.787508+00
28 | 1 | node1.alteeve.ca | MEM C | 35 | degrees C | OK | | | 2015-03-06 01:16:38.802058+00
29 | 1 | node1.alteeve.ca | MEM D | 34 | degrees C | OK | | | 2015-03-06 01:16:38.816296+00
30 | 1 | node1.alteeve.ca | MEM E | 37 | degrees C | OK | | | 2015-03-06 01:16:38.827444+00
31 | 1 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:16:38.83877+00
32 | 1 | node1.alteeve.ca | MEM G | 34 | degrees C | OK | | | 2015-03-06 01:16:38.853383+00
33 | 1 | node1.alteeve.ca | MEM H | 36 | degrees C | OK | | | 2015-03-06 01:16:38.864927+00
34 | 1 | node1.alteeve.ca | PSU1 Inlet | 29 | degrees C | OK | | | 2015-03-06 01:16:38.879143+00
35 | 1 | node1.alteeve.ca | PSU2 Inlet | 28 | degrees C | OK | | | 2015-03-06 01:16:38.893541+00
36 | 1 | node1.alteeve.ca | PSU1 | 53 | degrees C | OK | | | 2015-03-06 01:16:38.907655+00
37 | 1 | node1.alteeve.ca | PSU2 | 56 | degrees C | OK | | | 2015-03-06 01:16:38.922028+00
38 | 1 | node1.alteeve.ca | BBU | 30 | degrees C | OK | | | 2015-03-06 01:16:38.933201+00
39 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:38.947347+00
40 | 1 | node1.alteeve.ca | summary | 1 | | WARNING | Value warning | value=1 | 2015-03-06 01:16:38.976188+00
(40 rows)
Excellent!
Now, take note of the RAID Controller entries:
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' ORDER BY timestamp ASC;"
id | node_id | target | field | value | units | status | message_tag | message_arguments | timestamp
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
19 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:02.614824+00
39 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:32.653909+00
(2 rows)
This tells us that the RAID Controller is running at 76°C, which scanner thinks is dangerously hot. We know that, according to the manufacturer, the controller is rated for up to 95°C, so this is fine. To account for this, we'll update the /etc/striker/Config/ipmi.conf file from:
ipmi::RAID Controller::ok = 60
ipmi::RAID Controller::warn = 70
ipmi::RAID Controller::hysteresis = 1
ipmi::RAID Controller::units = degrees C
To:
ipmi::RAID Controller::ok = 80
ipmi::RAID Controller::warn = 90
ipmi::RAID Controller::hysteresis = 1
ipmi::RAID Controller::units = degrees C
Now, over 80°C will cause a warning and over 90°C will cause a critical alert. Let's test by running the ipmi scan agent for one pass.
/usr/share/striker/agents/ipmi --verbose --verbose
Program ipmi writing to DB '10.20.4.1'.
Program ipmi writing to DB '10.20.4.2'.
ipmi loop 1 at 01:32:39 -> 937.201 ms elapsed; 29062.799 ms pending.
----------------------------------------------------------------------
^C
Now let's look at the database again:
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' ORDER BY timestamp ASC;"
id | node_id | target | field | value | units | status | message_tag | message_arguments | timestamp
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
19 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:02.614824+00
39 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:32.653909+00
59 | 2 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:31:17.370253+00
79 | 3 | node1.alteeve.ca | RAID Controller | 76 | degrees C | OK | | | 2015-03-06 01:32:34.447756+00
(4 rows)
Notice the last entry is 'OK' now? That tells us we're doing fine.
Note: Be sure to update the configuration values on both nodes!
Scanner is started by cron every five minutes (see the crontab entry under Enabling Scanner, below). If another copy is already running, the new one simply exits. If no other copy was running (due to an OS boot, a scanner crash, etc.), it will start up.
Testing Automatic Shutdown
One of the features of Scanner is that it can safely shut down a node if it starts to get too hot, or if the UPSes have lost power and the hold-up time of the strongest UPS drops below a minimum. To test this, you have two choices:
- Pull the power on the UPSes and watch their hold-up time. If all goes well, both nodes will power off when the minimum threshold is passed.
- Artificially set five or more thermal sensor thresholds so low that normal temperatures trigger a shutdown.
Warning: If you're testing option 2, do not configure scanner to run on boot or via cron! Your node will shut down within five minutes otherwise, requiring a boot to single-user mode to correct.
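A quick way to confirm scanner is not currently in root's crontab before testing (optional):
crontab -l 2>/dev/null | grep scanner
No output means there is no scanner entry.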
For time's sake, we'll drop the sensor thresholds.
First, we need to know what values would be "too high", so let's see what our RAM and RAID controller are sitting at:
psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' or field LIKE 'MEM %' ORDER BY field ASC, timestamp ASC;"
id | node_id | target | field | value | units | status | message_tag | message_arguments | timestamp
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
6 | 1 | node1.alteeve.ca | MEM A | 32 | degrees C | OK | | | 2015-03-06 01:16:02.466447+00
26 | 1 | node1.alteeve.ca | MEM A | 32 | degrees C | OK | | | 2015-03-06 01:16:32.482567+00
46 | 2 | node1.alteeve.ca | MEM A | 33 | degrees C | OK | | | 2015-03-06 01:31:17.222054+00
66 | 3 | node1.alteeve.ca | MEM A | 33 | degrees C | OK | | | 2015-03-06 01:32:34.299146+00
7 | 1 | node1.alteeve.ca | MEM B | 32 | degrees C | OK | | | 2015-03-06 01:16:02.47765+00
27 | 1 | node1.alteeve.ca | MEM B | 32 | degrees C | OK | | | 2015-03-06 01:16:32.496698+00
47 | 2 | node1.alteeve.ca | MEM B | 33 | degrees C | OK | | | 2015-03-06 01:31:17.233122+00
67 | 3 | node1.alteeve.ca | MEM B | 33 | degrees C | OK | | | 2015-03-06 01:32:34.310512+00
8 | 1 | node1.alteeve.ca | MEM C | 35 | degrees C | OK | | | 2015-03-06 01:16:02.489131+00
28 | 1 | node1.alteeve.ca | MEM C | 35 | degrees C | OK | | | 2015-03-06 01:16:32.508425+00
48 | 2 | node1.alteeve.ca | MEM C | 36 | degrees C | OK | | | 2015-03-06 01:31:17.244798+00
68 | 3 | node1.alteeve.ca | MEM C | 36 | degrees C | OK | | | 2015-03-06 01:32:34.321981+00
9 | 1 | node1.alteeve.ca | MEM D | 34 | degrees C | OK | | | 2015-03-06 01:16:02.500622+00
29 | 1 | node1.alteeve.ca | MEM D | 34 | degrees C | OK | | | 2015-03-06 01:16:32.522475+00
49 | 2 | node1.alteeve.ca | MEM D | 35 | degrees C | OK | | | 2015-03-06 01:31:17.256127+00
69 | 3 | node1.alteeve.ca | MEM D | 35 | degrees C | OK | | | 2015-03-06 01:32:34.333338+00
10 | 1 | node1.alteeve.ca | MEM E | 37 | degrees C | OK | | | 2015-03-06 01:16:02.51189+00
30 | 1 | node1.alteeve.ca | MEM E | 37 | degrees C | OK | | | 2015-03-06 01:16:32.536592+00
50 | 2 | node1.alteeve.ca | MEM E | 38 | degrees C | OK | | | 2015-03-06 01:31:17.26758+00
70 | 3 | node1.alteeve.ca | MEM E | 38 | degrees C | OK | | | 2015-03-06 01:32:34.34476+00
11 | 1 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:16:02.523267+00
31 | 1 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:16:32.548096+00
51 | 2 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:31:17.278884+00
71 | 3 | node1.alteeve.ca | MEM F | 36 | degrees C | OK | | | 2015-03-06 01:32:34.356109+00
12 | 1 | node1.alteeve.ca | MEM G | 34 | degrees C | OK | | | 2015-03-06 01:16:02.534761+00
32 | 1 | node1.alteeve.ca | MEM G | 34 | degrees C | OK | | | 2015-03-06 01:16:32.559742+00
52 | 2 | node1.alteeve.ca | MEM G | 35 | degrees C | OK | | | 2015-03-06 01:31:17.290446+00
72 | 3 | node1.alteeve.ca | MEM G | 35 | degrees C | OK | | | 2015-03-06 01:32:34.367751+00
13 | 1 | node1.alteeve.ca | MEM H | 36 | degrees C | OK | | | 2015-03-06 01:16:02.54614+00
33 | 1 | node1.alteeve.ca | MEM H | 36 | degrees C | OK | | | 2015-03-06 01:16:32.573795+00
53 | 2 | node1.alteeve.ca | MEM H | 37 | degrees C | OK | | | 2015-03-06 01:31:17.301801+00
73 | 3 | node1.alteeve.ca | MEM H | 37 | degrees C | OK | | | 2015-03-06 01:32:34.378846+00
19 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:02.614824+00
39 | 1 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:16:32.653909+00
59 | 2 | node1.alteeve.ca | RAID Controller | 76 | degrees C | CRISIS | Value crisis | value=76 | 2015-03-06 01:31:17.370253+00
79 | 3 | node1.alteeve.ca | RAID Controller | 76 | degrees C | OK | | | 2015-03-06 01:32:34.447756+00
(36 rows)
So the RAM is sitting around 35°C and the RAID controller is at 76°C. To trigger a CRITICAL shutdown, we'll need five or more values in a critical state. In my case, I have eight RAM modules, which is more than enough, so we'll modify those.
To save time restoring after the test is done, let's copy our properly configured ipmi.conf out of the way. We'll copy it back when the testing is done.
cp /etc/striker/Config/ipmi.conf /etc/striker/Config/ipmi.conf.good
Now we'll edit /etc/striker/Config/ipmi.conf:
vim /etc/striker/Config/ipmi.conf
The memory entries should look like this, normally:
ipmi::MEM A::ok = 45
ipmi::MEM A::warn = 55
ipmi::MEM A::hysteresis = 1
ipmi::MEM A::units = degrees C
We'll change them all to:
ipmi::MEM A::ok = 20
ipmi::MEM A::warn = 30
ipmi::MEM A::hysteresis = 1
ipmi::MEM A::units = degrees C
Once you've dropped five or more of the thresholds, save the file.
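If you would rather not edit each of the eight MEM entries by hand, a single sed call can drop them all at once (a sketch, assuming the default 45/55 values shown above):
sed -i '/^ipmi::MEM /{s/ok = 45/ok = 20/; s/warn = 55/warn = 30/;}' /etc/striker/Config/ipmi.conf
Double-check the result with grep 'MEM' /etc/striker/Config/ipmi.conf before moving on.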
Before we run the test, we need to tell Scanner how to shut down the Anvil!. In Striker, there is a script called safe_anvil_shutdown, which can be found on the dashboards at /var/www/tools/safe_anvil_shutdown. We need to copy this onto the nodes:
rsync -av /var/www/tools/safe_anvil_shutdown root@an-a05n01:/root/
rsync -av /var/www/tools/safe_anvil_shutdown root@an-a05n02:/root/
Now we need to configure Scanner to call it when a CRITICAL state is reached. We do this by editing the scanner.conf file.
vim /etc/striker/Config/scanner.conf
There are two key entries to set:
scanner::healthfile = /shared/status/.an-a05n01
scanner::shutdown = /root/safe_anvil_shutdown
The scanner::healthfile MUST match the short host name of the node with a preceding '.'. To determine the name to use, you can run:
clustat |grep Local |awk '{print $1}' | awk -F '.' '{print $1}'
an-a05n01
If the cluster isn't running on the node, and provided you built the cluster using proper host names, you can get the name to use with this:
uname -n | awk -F '.' '{print $1}'
an-a05n01
This is important because safe_anvil_shutdown will look for the file /shared/status/.<peer's name>. If it finds that file, it will be able to determine the health of the peer. Assuming the peer is healthy, safe_anvil_shutdown will assume the CRITICAL state is localized and so it will migrate servers to the peer before shutting down. However, if the peer is sick, it will gracefully shut down the servers before powering off.
So setting scanner::healthfile = /shared/status/.an-a05n01 allows our peer to check our state if it goes critical, enabling this intelligence to work reliably.
The second value is the program that Scanner will execute when it goes critical. This should always be /root/safe_anvil_shutdown (or the path to the program, if you saved it elsewhere).
Save the changes and exit.
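A quick check that both values were saved (optional):
grep -E 'healthfile|shutdown' /etc/striker/Config/scanner.conf
You should see the two lines set above.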
Testing one node going critical
For the first test, we're going to run a server on an-a05n01 and set its sensor limits low enough to trigger an immediate crisis. We'll leave the configuration on the second node as normal. This way, if all goes well, starting Scanner on the first node should cause the hosted server to be migrated, and then the node will withdraw from the cluster and shut down.
Edit an-a05n01's ipmi.conf as discussed, start the cluster and run a test server on the node.
Note: TODO: Show example output.
Start Scanner on an-a05n02 and verify it wrote its status file and that we can read it from an-a05n01.
On an-a05n02:
/usr/share/striker/bin/scanner
Replacing defective previous scanner: OLD_PROCESS_RECENT_CRASH
Starting /usr/share/striker/bin/scanner at Fri Mar 6 03:26:30 2015.
Program scanner reading from DB '10.20.4.1'.
scan 1425612390.18275 [bonding,ipmi,raid], [].
id na | 2015-03-06 03:26:30+0000: node2.ccrs.bcn->scanner (22338); DEBUG: Old process crashed recently.; (0 : pidfile check)
Wait a minute, and then check the status file:
From an-a05n01:
cat /shared/status/.an-a05n02
health = ok
Make sure an-a05n01 is in the cluster and is hosting a server. Once confirmed, we'll start scanner on it.
clustat
Cluster Status for an-anvil-05 @ Fri Mar 6 02:30:31 2015
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
an-a05n01.alteeve.ca 1 Online, Local, rgmanager
an-a05n02.alteeve.ca 2 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:libvirtd_n01 an-a05n01.alteeve.ca started
service:libvirtd_n02 an-a05n02.alteeve.ca started
service:storage_n01 an-a05n01.alteeve.ca started
service:storage_n02 an-a05n02.alteeve.ca started
vm:vm01-rhel6 an-a05n01.alteeve.ca started
OK, start scanner on node 1!
Enabling Scanner
Add the following entry to root's crontab:
*/5 * * * * /usr/share/striker/bin/scanner
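One way to add that entry from the shell, if you prefer not to edit the crontab interactively (a sketch; adapt it to however you manage cron on your nodes):
(crontab -l 2>/dev/null; echo '*/5 * * * * /usr/share/striker/bin/scanner') | crontab -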
Test:
/usr/share/striker/agents/ipmi --verbose --verbose
ipmi loop 1 at 1421444884.53996 2378.437:27621.563 mSec.
^C
Yay!