ScanCore


Warning: This is little more than raw notes; do not consider anything here to be valid or accurate at this time.

Installing

PostgreSQL Setup

yum install -y postgresql postgresql-server postgresql-plperl postgresql-contrib postgresql-libs Scanner
...
Complete!

Initialize the database:

/etc/init.d/postgresql initdb
Initializing database:                                     [  OK  ]

Enable and start PostgreSQL:

chkconfig postgresql on
/etc/init.d/postgresql start
Starting postgresql service:                               [  OK  ]

Create the striker user.

su - postgres -c "createuser --no-superuser --createdb --no-createrole striker"
# no output expected
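
If you'd like to confirm the role was created (an optional check), psql's \du meta-command will list it; you should see striker with the "Create DB" attribute:

su - postgres -c "psql -c '\du striker'"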

Set 'postgres' and 'striker' user passwords:

su - postgres -c "psql -U postgres"
psql (8.4.20)
Type "help" for help.
postgres=# \password
Enter new password: 
Enter it again:
postgres=# \password striker
Enter new password: 
Enter it again:

Exit.

postgres=# \q
Warning: In the below example, the BCN is 10.20.0.0/16 and the IFN is 192.168.199.0/24. If you have different networks, be sure to adjust your values accordingly!

Configure access:

cp /var/lib/pgsql/data/pg_hba.conf /var/lib/pgsql/data/pg_hba.conf.striker
vim /var/lib/pgsql/data/pg_hba.conf
diff -u /var/lib/pgsql/data/pg_hba.conf.striker /var/lib/pgsql/data/pg_hba.conf
--- /var/lib/pgsql/data/pg_hba.conf.striker	2015-03-05 14:33:40.902733374 +0000
+++ /var/lib/pgsql/data/pg_hba.conf	2015-03-05 14:34:44.861733318 +0000
@@ -65,9 +65,13 @@
 
 
 # TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
+# dashboards
+host    all         all         192.168.199.0/24      md5
+# node servers
+host    all         all         10.20.0.0/16          md5
 
 # "local" is for Unix domain socket connections only
-local   all         all                               ident
+local   all         all                               md5
 # IPv4 local connections:
 host    all         all         127.0.0.1/32          ident
 # IPv6 local connections:
cp /var/lib/pgsql/data/postgresql.conf /var/lib/pgsql/data/postgresql.conf.striker
vim /var/lib/pgsql/data/postgresql.conf
diff -u /var/lib/pgsql/data/postgresql.conf.striker /var/lib/pgsql/data/postgresql.conf
--- /var/lib/pgsql/data/postgresql.conf.striker	2015-03-05 14:35:35.388733307 +0000
+++ /var/lib/pgsql/data/postgresql.conf	2015-03-05 14:36:07.111733159 +0000
@@ -56,7 +56,7 @@
 
 # - Connection Settings -
 
-#listen_addresses = 'localhost'		# what IP address(es) to listen on;
+listen_addresses = '*'			# what IP address(es) to listen on;
 					# comma-separated list of addresses;
 					# defaults to 'localhost', '*' = all
 					# (change requires restart)
/etc/init.d/postgresql restart
Stopping postgresql service:                               [  OK  ]
Starting postgresql service:                               [  OK  ]
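
To confirm PostgreSQL is now listening on all interfaces rather than just localhost, a quick optional check is:

netstat -tlnp | grep 5432

You should see the postmaster bound to 0.0.0.0:5432.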

Striker Database Setup

Create DB:

su - postgres -c "createdb --owner striker scanner"
Password:
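
If you want to verify that the new database exists and is owned by striker, you can list the databases (optional; this will prompt for the postgres password set earlier):

su - postgres -c "psql -l" | grep scanner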

The SQL files we need to load are found in the /etc/striker/SQL directory:

ls -lah /etc/striker/SQL/
total 64K
drwxr-xr-x. 2 root root 4.0K Mar  4 23:50 .
drwxr-xr-x. 5 root root 4.0K Mar  4 23:50 ..
-rw-r--r--. 1 root root  397 Mar  4 23:41 00_drop_db.sql
-rw-r--r--. 1 root root 2.5K Mar  4 23:41 01_create_node.sql
-rw-r--r--. 1 root root 3.2K Mar  4 23:41 02_create_alerts.sql
-rw-r--r--. 1 root root 1.9K Mar  4 23:41 03_create_alert_listeners.sql
-rw-r--r--. 1 root root 1.3K Mar  4 23:41 04_load_alert_listeners.sql
-rw-r--r--. 1 root root 3.2K Mar  4 23:41 05_create_random_agent.sql
-rw-r--r--. 1 root root 3.4K Mar  4 23:41 06a_create_snm_apc_pdu.sql
-rw-r--r--. 1 root root 3.6K Mar  4 23:41 06b_create_snmp_brocade_switch.sql
-rw-r--r--. 1 root root 3.4K Mar  4 23:41 06_create_snm_apc_ups.sql
-rw-r--r--. 1 root root 3.5K Mar  4 23:41 07_create_ipmi.sql
-rw-r--r--. 1 root root 5.9K Mar  4 23:41 08_create_raid.sql
-rw-r--r--. 1 root root 3.8K Mar  4 23:41 09_create_bonding.sql
-rw-r--r--. 1 root root 1.2K Mar  4 23:41 Makefile
Note: The default database owner name is striker. If you used a different owner name, please update the .sql files with the command sed -i 's/striker/yourname/' *.sql.

Load the SQL tables into the database.

cat /etc/striker/SQL/*.sql > /tmp/all.sql
psql scanner -U striker -f /tmp/all.sql
Password for user striker:
<sql load messages>

Test:

psql -U striker -d scanner -c "SELECT * FROM alert_listeners"
Password for user striker:
 id |      name      |     mode      |  level  |  contact_info  | language | added_by |            updated            
----+----------------+---------------+---------+----------------+----------+----------+-------------------------------
  1 | screen         | Screen        | DEBUG   | screen         | en_CA    |        0 | 2014-12-11 14:42:13.273057-05
  2 | Tom Legrady    | Email         | DEBUG   | tom@striker.ca | en_CA    |        0 | 2014-12-11 16:54:25.477321-05
  3 | Health Monitor | HealthMonitor | WARNING |                | en_CA    |        0 | 2015-01-14 14:08:15-05
(3 rows)
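
If you would also like to confirm that all of the tables were created, you can list them (optional; you will be prompted for the striker password again):

psql -U striker -d scanner -c "\dt"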

Done!

Configure ScanCore on a Node

Install dependencies:

yum install Scanner postgresql perl-DBD-Pg

On the clients, you need to be sure your configuration files are set the way you want.

Most important is that the connection details for the databases on the dashboards are configured properly. Most installs have two dashboards, and Scanner will record its data to both for resiliency.

The configuration files are found in /etc/striker/Config/.

ls -lah /etc/striker/Config/
total 68K
drwxr-xr-x. 2 root root 4.0K Mar  5 15:06 .
drwxr-xr-x. 5 root root 4.0K Mar  5 15:06 ..
-rw-r--r--. 1 root root  741 Mar  4 23:41 bonding.conf
-rw-r--r--. 1 root root 1.1K Mar  4 23:41 dashboard.conf
-rw-r--r--. 1 root root  379 Mar  4 23:41 db.conf
-rw-r--r--. 1 root root 5.1K Mar  4 23:41 ipmi.conf
-rw-r--r--. 1 root root  939 Mar  4 23:41 nodemonitor.conf
-rw-r--r--. 1 root root 1.2K Mar  4 23:41 raid.conf
-rw-r--r--. 1 root root  961 Mar  4 23:41 scanner.conf
-rw-r--r--. 1 root root 1.7K Mar  4 23:41 snmp_apc_pdu.conf
-rw-r--r--. 1 root root 8.9K Mar  4 23:41 snmp_apc_ups.conf
-rw-r--r--. 1 root root 4.7K Mar  4 23:41 snmp_brocade_switch.conf
-rw-r--r--. 1 root root 1.4K Mar  4 23:41 system_check.conf
Note: We're showing two databases, but in theory, there is no set limit on the number of database servers that the nodes can use. Simply copy the configuration section for each additional server you wish to use, being sure to increment the id number for each section (i.e. db::X::name, where X is a unique integer for the additional server). An example is shown after the db.conf listing below.

In this example, the two Striker dashboards with our databases have the BCN IPs 10.20.4.1 and 10.20.4.2. Both use the database name scanner owned by the database user striker with the password secret. So their configurations will be nearly identical.

cp /etc/striker/Config/db.conf /etc/striker/Config/db.conf.original
vim /etc/striker/Config/db.conf
db::1::name      = scanner
db::1::db_type   = Pg
db::1::host      = 10.20.4.1
db::1::port      = 5432
db::1::user      = striker
db::1::password  = secret

db::2::name      = scanner
db::2::db_type   = Pg
db::2::host      = 10.20.4.2
db::2::port      = 5432
db::2::user      = striker
db::2::password  = secret
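
As mentioned in the note above, adding further database servers is simply a matter of appending another block with the next free id. For example, a hypothetical third dashboard at 10.20.4.3 would look like:

db::3::name      = scanner
db::3::db_type   = Pg
db::3::host      = 10.20.4.3
db::3::port      = 5432
db::3::user      = striker
db::3::password  = secret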

Now the node should be able to reach the databases. Let's test, though, to be sure. The nodes have IPMI, so we will test by manually calling the ipmi agent.

/usr/share/striker/agents/ipmi --verbose --verbose
Program ipmi writing to DB '10.20.4.1'.
Program ipmi writing to DB '10.20.4.2'.
ipmi loop 1 at 01:16:08 ->  960.295 ms elapsed;  29039.705 ms pending.

----------------------------------------------------------------------

ipmi loop 2 at 01:16:38 -> 1005.016 ms elapsed;  28994.984 ms pending.

----------------------------------------------------------------------

If all is well, it should record its values once every 30 seconds or so. Let it run a couple of loops, and then press <ctrl> + c to stop the scan.

Now we can verify the data was written to both dashboards' databases:

psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures;"
 id | node_id |      target      |      field      | value |   units   | status  |  message_tag  | message_arguments |           timestamp           
----+---------+------------------+-----------------+-------+-----------+---------+---------------+-------------------+-------------------------------
  1 |       1 | node1.alteeve.ca | Ambient         | 25    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.390891+00
  2 |       1 | node1.alteeve.ca | Systemboard 1   | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.415248+00
  3 |       1 | node1.alteeve.ca | Systemboard 2   | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.429477+00
  4 |       1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.4434+00
  5 |       1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.455114+00
  6 |       1 | node1.alteeve.ca | MEM A           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.466447+00
  7 |       1 | node1.alteeve.ca | MEM B           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.47765+00
  8 |       1 | node1.alteeve.ca | MEM C           | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.489131+00
  9 |       1 | node1.alteeve.ca | MEM D           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.500622+00
 10 |       1 | node1.alteeve.ca | MEM E           | 37    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.51189+00
 11 |       1 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.523267+00
 12 |       1 | node1.alteeve.ca | MEM G           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.534761+00
 13 |       1 | node1.alteeve.ca | MEM H           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.54614+00
 14 |       1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.557422+00
 15 |       1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.569362+00
 16 |       1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.580696+00
 17 |       1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.591993+00
 18 |       1 | node1.alteeve.ca | BBU             | 30    | degrees C | OK      |               |                   | 2015-03-06 01:16:02.603261+00
 19 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:02.614824+00
 20 |       1 | node1.alteeve.ca | summary         | 1     |           | WARNING | Value warning | value=1           | 2015-03-06 01:16:02.64331+00
 21 |       1 | node1.alteeve.ca | Ambient         | 25    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.400365+00
 22 |       1 | node1.alteeve.ca | Systemboard 1   | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.425598+00
 23 |       1 | node1.alteeve.ca | Systemboard 2   | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.439627+00
 24 |       1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.453921+00
 25 |       1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.468253+00
 26 |       1 | node1.alteeve.ca | MEM A           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.482567+00
 27 |       1 | node1.alteeve.ca | MEM B           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.496698+00
 28 |       1 | node1.alteeve.ca | MEM C           | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.508425+00
 29 |       1 | node1.alteeve.ca | MEM D           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.522475+00
 30 |       1 | node1.alteeve.ca | MEM E           | 37    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.536592+00
 31 |       1 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.548096+00
 32 |       1 | node1.alteeve.ca | MEM G           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.559742+00
 33 |       1 | node1.alteeve.ca | MEM H           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.573795+00
 34 |       1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.585372+00
 35 |       1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.599816+00
 36 |       1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.613983+00
 37 |       1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.628238+00
 38 |       1 | node1.alteeve.ca | BBU             | 30    | degrees C | OK      |               |                   | 2015-03-06 01:16:32.642372+00
 39 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:32.653909+00
 40 |       1 | node1.alteeve.ca | summary         | 1     |           | WARNING | Value warning | value=1           | 2015-03-06 01:16:32.682502+00
(40 rows)

We'll address the warnings in a moment. For now, this tells us that we are recording to dashboard 1 properly. Let's check dashboard 2:

psql -h 10.20.4.2 -U striker scanner -c "SELECT * FROM ipmi_temperatures;"
 id | node_id |      target      |      field      | value |   units   | status  |  message_tag  | message_arguments |           timestamp           
----+---------+------------------+-----------------+-------+-----------+---------+---------------+-------------------+-------------------------------
  1 |       1 | node1.alteeve.ca | Ambient         | 25    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.689144+00
  2 |       1 | node1.alteeve.ca | Systemboard 1   | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.708423+00
  3 |       1 | node1.alteeve.ca | Systemboard 2   | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.722751+00
  4 |       1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.733944+00
  5 |       1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.74567+00
  6 |       1 | node1.alteeve.ca | MEM A           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.756925+00
  7 |       1 | node1.alteeve.ca | MEM B           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.768102+00
  8 |       1 | node1.alteeve.ca | MEM C           | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.779549+00
  9 |       1 | node1.alteeve.ca | MEM D           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.791011+00
 10 |       1 | node1.alteeve.ca | MEM E           | 37    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.802332+00
 11 |       1 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.813697+00
 12 |       1 | node1.alteeve.ca | MEM G           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.825063+00
 13 |       1 | node1.alteeve.ca | MEM H           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.836604+00
 14 |       1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.848219+00
 15 |       1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.859965+00
 16 |       1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.870959+00
 17 |       1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.88233+00
 18 |       1 | node1.alteeve.ca | BBU             | 30    | degrees C | OK      |               |                   | 2015-03-06 01:16:08.893657+00
 19 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:08.905299+00
 20 |       1 | node1.alteeve.ca | summary         | 1     |           | WARNING | Value warning | value=1           | 2015-03-06 01:16:08.93407+00
 21 |       1 | node1.alteeve.ca | Ambient         | 25    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.699395+00
 22 |       1 | node1.alteeve.ca | Systemboard 1   | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.718864+00
 23 |       1 | node1.alteeve.ca | Systemboard 2   | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.73341+00
 24 |       1 | node1.alteeve.ca | CPU1            | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.747455+00
 25 |       1 | node1.alteeve.ca | CPU2            | 39    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.762113+00
 26 |       1 | node1.alteeve.ca | MEM A           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.776163+00
 27 |       1 | node1.alteeve.ca | MEM B           | 32    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.787508+00
 28 |       1 | node1.alteeve.ca | MEM C           | 35    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.802058+00
 29 |       1 | node1.alteeve.ca | MEM D           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.816296+00
 30 |       1 | node1.alteeve.ca | MEM E           | 37    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.827444+00
 31 |       1 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.83877+00
 32 |       1 | node1.alteeve.ca | MEM G           | 34    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.853383+00
 33 |       1 | node1.alteeve.ca | MEM H           | 36    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.864927+00
 34 |       1 | node1.alteeve.ca | PSU1 Inlet      | 29    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.879143+00
 35 |       1 | node1.alteeve.ca | PSU2 Inlet      | 28    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.893541+00
 36 |       1 | node1.alteeve.ca | PSU1            | 53    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.907655+00
 37 |       1 | node1.alteeve.ca | PSU2            | 56    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.922028+00
 38 |       1 | node1.alteeve.ca | BBU             | 30    | degrees C | OK      |               |                   | 2015-03-06 01:16:38.933201+00
 39 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS  | Value crisis  | value=76          | 2015-03-06 01:16:38.947347+00
 40 |       1 | node1.alteeve.ca | summary         | 1     |           | WARNING | Value warning | value=1           | 2015-03-06 01:16:38.976188+00
(40 rows)

Excellent!

Now, note the lines:

psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' ORDER BY timestamp ASC;"
 id | node_id |      target      |      field      | value |   units   | status | message_tag  | message_arguments |           timestamp           
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
 19 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:02.614824+00
 39 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:32.653909+00
(2 rows)

This tells us that the RAID Controller is running at 76°C, which scanner thinks is dangerously hot. We know that, according to the manufacturer, the controller is rated for up to 95°C, so this is fine. To account for this, we'll update the /etc/striker/Config/ipmi.conf file from:

ipmi::RAID Controller::ok         = 60
ipmi::RAID Controller::warn       = 70
ipmi::RAID Controller::hysteresis =  1
ipmi::RAID Controller::units      = degrees C

To:

ipmi::RAID Controller::ok         = 80
ipmi::RAID Controller::warn       = 90
ipmi::RAID Controller::hysteresis =  1
ipmi::RAID Controller::units      = degrees C

Now, over 80°C will cause a warning and over 90°C will cause a critical alert. (The hysteresis value gives these thresholds a little slack, so a reading hovering right at a boundary does not flap between states.) Let's test by running the ipmi scan agent for one pass.

/usr/share/striker/agents/ipmi --verbose --verbose
Program ipmi writing to DB '10.20.4.1'.
Program ipmi writing to DB '10.20.4.2'.
ipmi loop 1 at 01:32:39 ->  937.201 ms elapsed;  29062.799 ms pending.

----------------------------------------------------------------------

^C

Now let's look at the database again:

psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' ORDER BY timestamp ASC;"
 id | node_id |      target      |      field      | value |   units   | status | message_tag  | message_arguments |           timestamp           
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
 19 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:02.614824+00
 39 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:32.653909+00
 59 |       2 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:31:17.370253+00
 79 |       3 | node1.alteeve.ca | RAID Controller | 76    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.447756+00
(4 rows)

Notice the last entry is 'OK' now? That tells us we're doing fine.

Note: Be sure to update the configuration values on both nodes!
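
One simple way to keep the two nodes in sync is to push the edited file to the peer over ssh (assuming the peer's host name is an-a05n02, as in the examples below):

rsync -av /etc/striker/Config/ipmi.conf root@an-a05n02:/etc/striker/Config/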

Scanner is started from cron every five minutes. If another copy is already running, the new invocation simply exits. If no other copy was running (due to an OS boot, a scanner crash, etc.), it will start up.
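
The cron entry itself is shown in the 'Enabling Scanner' section below; for reference, it is simply:

*/5 * * * * /usr/share/striker/bin/scanner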

Testing Automatic Shutdown

One of the features of Scanner is that it can safely shut down a node if it starts to get too hot, or if mains power has been lost and the remaining hold-up time of the strongest UPS drops below a minimum. To test this, you have two choices:

  1. Pull the power on the UPSes and watch their hold-up time. If all goes well, both nodes will power off when the minimum threshold is passed.
  2. Artificially set the thresholds of five or more thermal sensors low enough that normal temperatures trigger a shutdown.
Warning: If you're testing option 2, do not configure scanner to run on boot or via cron! Your node will shut down within five minutes otherwise, requiring a boot to single-user mode to correct.

For time's sake, we'll drop the sensor thresholds.

First, we need to know what values would be "too high", so let's see what our RAM and RAID controller are sitting at:

psql -h 10.20.4.1 -U striker scanner -c "SELECT * FROM ipmi_temperatures WHERE field='RAID Controller' or field LIKE 'MEM %' ORDER BY field ASC, timestamp ASC;"
 id | node_id |      target      |      field      | value |   units   | status | message_tag  | message_arguments |           timestamp           
----+---------+------------------+-----------------+-------+-----------+--------+--------------+-------------------+-------------------------------
  6 |       1 | node1.alteeve.ca | MEM A           | 32    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.466447+00
 26 |       1 | node1.alteeve.ca | MEM A           | 32    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.482567+00
 46 |       2 | node1.alteeve.ca | MEM A           | 33    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.222054+00
 66 |       3 | node1.alteeve.ca | MEM A           | 33    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.299146+00
  7 |       1 | node1.alteeve.ca | MEM B           | 32    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.47765+00
 27 |       1 | node1.alteeve.ca | MEM B           | 32    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.496698+00
 47 |       2 | node1.alteeve.ca | MEM B           | 33    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.233122+00
 67 |       3 | node1.alteeve.ca | MEM B           | 33    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.310512+00
  8 |       1 | node1.alteeve.ca | MEM C           | 35    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.489131+00
 28 |       1 | node1.alteeve.ca | MEM C           | 35    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.508425+00
 48 |       2 | node1.alteeve.ca | MEM C           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.244798+00
 68 |       3 | node1.alteeve.ca | MEM C           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.321981+00
  9 |       1 | node1.alteeve.ca | MEM D           | 34    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.500622+00
 29 |       1 | node1.alteeve.ca | MEM D           | 34    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.522475+00
 49 |       2 | node1.alteeve.ca | MEM D           | 35    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.256127+00
 69 |       3 | node1.alteeve.ca | MEM D           | 35    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.333338+00
 10 |       1 | node1.alteeve.ca | MEM E           | 37    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.51189+00
 30 |       1 | node1.alteeve.ca | MEM E           | 37    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.536592+00
 50 |       2 | node1.alteeve.ca | MEM E           | 38    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.26758+00
 70 |       3 | node1.alteeve.ca | MEM E           | 38    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.34476+00
 11 |       1 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.523267+00
 31 |       1 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.548096+00
 51 |       2 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.278884+00
 71 |       3 | node1.alteeve.ca | MEM F           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.356109+00
 12 |       1 | node1.alteeve.ca | MEM G           | 34    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.534761+00
 32 |       1 | node1.alteeve.ca | MEM G           | 34    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.559742+00
 52 |       2 | node1.alteeve.ca | MEM G           | 35    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.290446+00
 72 |       3 | node1.alteeve.ca | MEM G           | 35    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.367751+00
 13 |       1 | node1.alteeve.ca | MEM H           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:16:02.54614+00
 33 |       1 | node1.alteeve.ca | MEM H           | 36    | degrees C | OK     |              |                   | 2015-03-06 01:16:32.573795+00
 53 |       2 | node1.alteeve.ca | MEM H           | 37    | degrees C | OK     |              |                   | 2015-03-06 01:31:17.301801+00
 73 |       3 | node1.alteeve.ca | MEM H           | 37    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.378846+00
 19 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:02.614824+00
 39 |       1 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:16:32.653909+00
 59 |       2 | node1.alteeve.ca | RAID Controller | 76    | degrees C | CRISIS | Value crisis | value=76          | 2015-03-06 01:31:17.370253+00
 79 |       3 | node1.alteeve.ca | RAID Controller | 76    | degrees C | OK     |              |                   | 2015-03-06 01:32:34.447756+00
(36 rows)

So the RAM is sitting around 35°C and the RAID controller is sitting around 75°C. To trigger a CRITICAL shutdown, we'll need five or more sensors in a critical state. In my case, I have eight RAM modules, so that is enough to trigger a shutdown. We'll modify those.

To save time restoring after the test is done, let's copy our properly configured ipmi.conf out of the way. We'll copy it back when the testing is done.

cp /etc/striker/Config/ipmi.conf /etc/striker/Config/ipmi.conf.good

Now we'll edit /etc/striker/Config/ipmi.conf:

vim /etc/striker/Config/ipmi.conf

The memory entries should look like this, normally:

ipmi::MEM A::ok         = 45
ipmi::MEM A::warn       = 55
ipmi::MEM A::hysteresis =  1
ipmi::MEM A::units      = degrees C

We'll change them all to:

ipmi::MEM A::ok         = 20
ipmi::MEM A::warn       = 30
ipmi::MEM A::hysteresis =  1
ipmi::MEM A::units      = degrees C

Once you've edited five or more values down, save the file.
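
If you would rather not edit each block by hand, a sed pass like this (assuming the entries are formatted exactly as shown above) will drop the ok/warn values for all eight 'MEM X' sensors at once:

sed -i -e '/^ipmi::MEM [A-H]::ok/   s/=.*/= 20/' \
       -e '/^ipmi::MEM [A-H]::warn/ s/=.*/= 30/' /etc/striker/Config/ipmi.conf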

Before we run the test, we need to tell Scanner how to shut down the Anvil!. In Striker, there is a script called safe_anvil_shutdown which can be found on the dashboards at /var/www/tools/safe_anvil_shutdown. We need to copy this onto the nodes:

rsync -av /var/www/tools/safe_anvil_shutdown root@an-a05n01:/root/
rsync -av /var/www/tools/safe_anvil_shutdown root@an-a05n02:/root/
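
On each node, you can confirm the script arrived and is executable (rsync -a preserves the file mode) with:

ls -lah /root/safe_anvil_shutdown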

Now we need to configure Scanner to call it when a CRITICAL state is reached. We do this by editing the scanner.conf file.

vim /etc/striker/Config/scanner.conf

There are two key entries to set:

scanner::healthfile = /shared/status/.an-a05n01
scanner::shutdown   = /root/safe_anvil_shutdown

The scanner::healthfile MUST match the short host name of the node with a preceding '.'. To determine the name to use, you can run:

clustat |grep Local |awk '{print $1}' | awk -F '.' '{print $1}'
an-a05n01

If the cluster isn't running on the node, and provided you built the cluster using proper host names, you can get the name to use with this:

uname -n | awk -F '.' '{print $1}'
an-a05n01

This is important because safe_anvil_shutdown will look for the file /shared/status/.<peer's name>. If it finds that file, it will be able to determine the health of the peer. Assuming the peer is healthy, safe_anvil_shutdown will assume the CRITICAL state is localized and so it will migrate servers to the peer before shutting down. However, if the peer is sick, it will gracefully shut down the servers before powering off.

So setting scanner::healthfile = /shared/status/.an-a05n01 allows our peer to check our state if it goes critical, enabling this intelligence to work reliably.

The second value is the program that Scanner will execute when it goes critical. This should always be /root/safe_anvil_shutdown (or the path to the program, if you saved it elsewhere).

Save the changes and exit.

Testing One Node Going Critical

For the first test, we're going to run a server on an-a05n01 and change its sensor limits low enough to trigger an immediate crisis. We'll leave the configuration on the second node as normal. This way, if all goes well, starting Scanner on the first node should cause the hosted server to be migrated, and then the node will withdraw from the cluster and shut down.

Edit an-a05n01's ipmi.conf as discussed, start the cluster and run a test server on the node.

Note: TODO: Show example output.

Start Scanner on an-a05n02 and verify it wrote its status file and that we can read it from an-a05n01.

On an-a05n02:

/usr/share/striker/bin/scanner
Replacing defective previous scanner: OLD_PROCESS_RECENT_CRASH
Starting /usr/share/striker/bin/scanner at Fri Mar  6 03:26:30 2015.
Program scanner reading from DB '10.20.4.1'.
scan 1425612390.18275 [bonding,ipmi,raid], [].
id na | 2015-03-06 03:26:30+0000: node2.ccrs.bcn->scanner (22338); DEBUG: Old process crashed recently.; (0 : pidfile check)

Wait a minute, and then check the status file:

From an-a05n01:

cat /shared/status/.an-a05n02
health = ok

Make sure an-a05n01 is in the cluster and is hosting a server.

Before starting scanner on an-a05n01, check the cluster state:

clustat
Cluster Status for an-anvil-05 @ Fri Mar  6 02:30:31 2015
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 an-a05n01.alteeve.ca                       1 Online, Local, rgmanager
 an-a05n02.alteeve.ca                       2 Online, rgmanager

 Service Name                   Owner (Last)                   State         
 ------- ----                   ----- ------                   -----         
 service:libvirtd_n01           an-a05n01.alteeve.ca           started       
 service:libvirtd_n02           an-a05n02.alteeve.ca           started       
 service:storage_n01            an-a05n01.alteeve.ca           started       
 service:storage_n02            an-a05n02.alteeve.ca           started       
 vm:vm01-rhel6                  an-a05n01.alteeve.ca           started

OK, start scanner on node 1!


Enabling Scanner

Add the following entry to root's crontab so that Scanner runs every five minutes:
*/5 * * * * /usr/share/striker/bin/scanner
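
One way to append this entry to root's crontab non-interactively (any method of editing the crontab works) is:

(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/share/striker/bin/scanner") | crontab -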



Test:

Agents/ipmi --verbose --verbose
ipmi loop 1 at 1421444884.53996 2378.437:27621.563 mSec.
^C

Yay!






 
