Fence na: Difference between revisions

Revision as of 06:24, 8 March 2010

This is the core fence agent that exists in /sbin/.

#!/usr/bin/perl
#
# Node Assassin - Fence Agent
# Digimer; digimer@alteeve.com
# Mar. 05, 2010.
# Version: 0.1.004
#
# Bugs;
# - None known, many expected
# 

=pod

Changes:

v0.1.004
 - Fixed the command line argument bug.
 - Updated the 'help' message to be more accurate.

Given the following:
<cluster name="an_san" config_version="1">
	<clusternodes>
		<clusternode name="an_san01.alteeve.com" nodeid="1">
			<fence>
				<method name="node_assassin">
					<device name="ariel" port="01" action="off"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="an_san02.alteeve.com" nodeid="2">
			<fence>
				<method name="node_assassin">
					<device name="ariel" port="02" action="off"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="ariel" agent="fence_na" ipaddr="ariel.alteeve.com" passwd="gr0tt0"></fencedevice>
	</fencedevices>
</cluster>

Questions:
- Is there a correlation between 'clusternode -> name', 'device -> name' and 
 'fencedevice -> name'? Which is used when sending 'name' to the fence agent?
 'fencedevice'?
  

When 'fenced' decides to fence "an_san01.alteeve.com", it will:
- call '/sbin/fence_na' because of the 'fencedevices -> agent' value.
- It will pass the following arguments to the fence agent, one pair per line:
    agent=fence_na		# From 'fencedevices -> agent'
    name=ariel			# From 'fencedevices -> name'
    ipaddr=ariel.alteeve.com	# From 'fencedevices -> ipaddr'
    passwd=gr0tt0		# From 'fencedevices -> passwd'
    port=01			# From 'clusternode "an_san01.alteeve.com" -> port'
    action=off			# From 'clusternode "an_san01.alteeve.com" -> action'
    				# This must be 'on', 'off', 'reboot', 'status'
    				# or 'monitor'. See below for how these terms
    				# are interpreted by this agent.
    				# NOTE: If 'option' is passed, its value will
    				# be stored in 'action'. That is, 'action' and
    				# 'option' are synonymous.
    				# 
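
A minimal sketch of how the 'key=value' pairs above could be read. The helper
name 'parse_fence_args' is hypothetical and is not part of 'fence_na.lib'. It
assumes lines starting with '#' are ignored and that only the first '=' splits
the pair, as the Fence Agent API notes describe.

```perl
use strict;
use warnings;

# Hypothetical helper: turn 'key=value' lines, as sent by 'fenced' on STDIN,
# into a hash reference.
sub parse_fence_args
{
	my @lines=@_;
	my $args={};
	foreach my $line (@lines)
	{
		chomp($line);
		# Skip blank lines and comment lines.
		next if $line=~/^\s*$/;
		next if $line=~/^#/;
		# Split on the first '=' only; a value may itself contain '='.
		my ($key, $value)=split(/=/, $line, 2);
		next if not defined $value;
		# 'option' is treated as a synonym for 'action'.
		$key="action" if $key eq "option";
		$args->{$key}=$value;
	}
	return $args;
}

# Example use with literal lines; a real agent would pass in <STDIN>.
my $args=parse_fence_args("agent=fence_na\n", "#ignored\n", "port=01\n", "option=off\n");
print "$_=$args->{$_}\n" foreach sort keys %$args;
```

Arguments the agent does not recognize are stored here and can simply be
ignored by the caller.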

- Node Assassin's implementation of options.
  - 'off'	This sets the node to state '0' on the reset port followed by
  		state '3' to the power port. State 0 is maintained to prevent
  		a reboot.
  - 'on'	This sets the node to state '1' on the reset port followed by
  		state '2' on the power port to boot the node.
  - 'reboot'	This sets the node to state '2' on the reset port to quickly
  		kill the node, then switches to state '3' on the power port,
  		checks the return value (later, will check the probe pin),
  		sets state '1' on the reset port, pauses 1 second, and then
  		sets state '2' on the power port to boot the node.
  - 'status'	This calls '00:0' and returns the state of the port. Later,
  		this will return the value from the voltage sensing pin.
  - 'monitor'	Being a multi-port fence device, this should call 'list'.
  		MADI: Confirm that this is what is meant in "Issues" here:
  		http://sources.redhat.com/cluster/wiki/FenceAgentAPI
  - 'list'	No info on this
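
A rough sketch of how an action might be turned into the comma-separated order
string that the main loop walks through. The helper name 'build_call_order' is
hypothetical. The '<port>:<state>' command format is an assumption based on the
'00:0' status call, and the states used are the ones listed above, copied as
written. Entries such as '<port>:off' are the checks the main loop performs
itself instead of sending them to the Node Assassin.

```perl
use strict;
use warnings;

# Hypothetical helper: build the comma-separated order string for an action.
# ASSUMPTION: device commands look like '<port>:<state>', e.g. '01:3'.
sub build_call_order
{
	my ($port, $action)=@_;
	my $orders={
		# Set the states, then verify the node is off.
		off	=>	["${port}:0", "${port}:3", "${port}:off"],
		# Set the states, then verify the node is on.
		on	=>	["${port}:1", "${port}:2", "${port}:on"],
		# Kill, power off, verify, release, pause, boot, verify.
		reboot	=>	["${port}:2", "${port}:3", "${port}:off", "${port}:1", "sleep 1", "${port}:2", "${port}:reboot"],
		# Ask for the state of all ports, then check this one.
		status	=>	["00:0", "${port}:check"],
	};
	# Unknown actions produce no orders.
	return if not exists $orders->{$action};
	return join(",", @{$orders->{$action}});
}

print build_call_order("01", "reboot"), "\n";
```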

Command Line Arguments:
- Any command line arguments used by this fence agent are not dictated by the
  Fence Agent API. By convention only, the following command line options are
  used:
  -a <ip>	# Maps the value to 'ipaddr'.
  -h		# Prints the help message and then exits.
  -l <name>	# Maps the value to 'name'.
  -n <num>	# Maps the value to 'port'.
  -o <string>	# Maps the value to 'action'.
  -p <string>	# Maps the value to 'passwd'.
  -S <path>	# Maps the value to 'passwd_script'. This is not used by Node
  		# Assassin yet and is simply ignored.
  -q		# Sets quiet mode. Only errors will be printed. Logging
  		# proceeds as normal
  -V		# Prints the 'fence_na' version and the version of any attached
  		# Node Assassin(s) and exits.
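
A minimal sketch of reading these switches with the core 'Getopt::Std' module.
The helper name 'parse_cla' and the 'help', 'quiet' and 'version' key names are
hypothetical; the real 'read_cla' function lives in 'fence_na.lib' and may work
differently.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: map the conventional switches onto the same names that
# are used for the STDIN arguments.
sub parse_cla
{
	# 'getopts' reads from @ARGV, so work on a local copy of the arguments.
	local @ARGV=@_;
	my %opts;
	getopts('a:hl:n:o:p:S:qV', \%opts) or return;
	
	# Switches that carry a value.
	my %map=(
		a	=>	"ipaddr",
		l	=>	"name",
		n	=>	"port",
		o	=>	"action",
		p	=>	"passwd",
		S	=>	"passwd_script",
	);
	my $args={};
	foreach my $switch (keys %map)
	{
		$args->{$map{$switch}}=$opts{$switch} if defined $opts{$switch};
	}
	
	# Switches that are simple flags.
	$args->{help}=$opts{h} ? 1 : 0;
	$args->{quiet}=$opts{q} ? 1 : 0;
	$args->{version}=$opts{V} ? 1 : 0;
	return $args;
}

my $cla=parse_cla("-a", "ariel.alteeve.com", "-n", "01", "-o", "off");
print "$_=$cla->{$_}\n" foreach sort keys %$cla;
```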

Note:
- For now, I will return '0' if the command succeeded, but will later add a
  detection line that checks if there is voltage from the node's PSU.
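
A small sketch of how a state check could be turned into an exit code,
following the intent stated in the comments of the main loop. The helper name
'exit_code_for' is hypothetical; the agent itself does this inline.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a check type and the node's reported state
# into an exit code. '$previous' is the exit code from any earlier check.
sub exit_code_for
{
	my ($check, $node_state, $previous)=@_;
	my $is_off=$node_state eq "off" ? 1 : 0;
	
	# 'status' style check; '2' means off and '0' means on.
	return ($is_off ? 2 : 0) if $check eq "check";
	# After 'off', success means the node is off.
	return ($is_off ? 0 : 1) if $check eq "off";
	# After 'on', success means the node is not off.
	return ($is_off ? 1 : 0) if $check eq "on";
	# After 'reboot', only the earlier power off result matters. A node
	# that fails to boot again has still been fenced.
	return ($previous == 0 ? 0 : 1) if $check eq "reboot";
	# Unknown check; keep the agent's default of '9'.
	return 9;
}

print exit_code_for("off", "off", 9), "\n";
```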
=cut

# Play safe!
use strict;
use warnings;
# Load my library.
require '/etc/na/fence_na.lib';
# This is how I talk.
use IO::Handle;
use Net::Telnet;

# This will be read in from a config file later.
my $conf={
	'system'	=>	{
		max_valid_state	=>	3,
		conf_file	=>	"/etc/na/fence_na.conf",
		quiet		=>	"",
		version		=>	"",
		list_state	=>	"",
		list		=>	"",
		monitor		=>	"",
		node_assassin_id=>	0,
		got_cla		=>	0,	# This is set if command line arguments are read.
	},
	node		=>	{
		ipaddr		=>	"",
		tcp_port	=>	"",
		login		=>	"",
		passwd		=>	"",
		port		=>	"",
		passwd_script	=>	"",
		action		=>	"",
		agent		=>	"",	# This is only used by 'fenced'
		handle		=>	"",
		set_state	=>	[],	# This anon array will store the states to set based on the action passed for the proper ports.
	}
};
# This method can't pass in the '$log' handle, obviously, as it does not yet
# exist.
read_conf($conf);

# Log file for output.
my $log=IO::Handle->new();
open($log, ">", $conf->{'system'}{'log'}) or die "Failed to open: [$conf->{'system'}{'log'}] for writing; Error: $!\n";

# If this gets set in the next two functions, the agent will exit.
my $bad=0;

# Read in arguments from the command line.
($bad)=read_cla($conf, $log, $bad);

# Now read in arguments from STDIN, which is how 'fenced' passes arguments.
($bad)=read_stdin($conf, $log, $bad);

# This makes sure the node ID is zero-padded or '00'.
$conf->{node}{port}=$conf->{node}{port} ? sprintf("%02d", $conf->{node}{port}) : "00";

# Find the TCP port from the config file.
foreach my $i (1..$conf->{'system'}{na_num})
{
	if ((lc($conf->{node}{$i}{ipaddr}) eq lc($conf->{node}{ipaddr})))
	{
		$conf->{'system'}{node_assassin_id}=$i;
		$conf->{node}{tcp_port}=$conf->{node}{$i}{tcp_port};
		last;
	}
}

die "Exiting on errors.\n" if $bad;
record($conf, $log, "Node Assassin: [$conf->{node}{ipaddr}].\n");
record($conf, $log, "TCP Port: .... [$conf->{node}{tcp_port}].\n");
record($conf, $log, "Port: ........ [$conf->{node}{port}].\n");
record($conf, $log, "Login: ....... [$conf->{node}{login}].\n");
record($conf, $log, "Password: .... [$conf->{node}{passwd}].\n");
record($conf, $log, "Action: ...... [$conf->{node}{action}].\n");
record($conf, $log, "Done reading args.\n");

# Connect to the Node Assassin.
$conf->{node}{handle}=Net::Telnet->new(
	Timeout	=>	10,
	Port	=>	$conf->{node}{tcp_port},
	Prompt	=>	'/EOM$/',
	Errmode	=>	'return'
) or do_exit($conf, $log, 1);
# print "Handle: [$conf->{node}{handle}]\n";
$conf->{node}{handle}->open($conf->{node}{ipaddr});

# Validate credentials.
# NOTE: Checking before the telnet fails on the exit. Also, this will be moved
# into the Node Assassin soon anyway.
if (($conf->{node}{login} ne $conf->{'system'}{username}) or ($conf->{node}{passwd} ne $conf->{'system'}{password}))
{
	record($conf, $log, "Username and/or password failed.\n");
	do_exit($conf, $log, 8);
}

###############################################################################
# What do?                                                                    #
###############################################################################

# If I've been asked to show the version information, do so and then exit.
record($conf, $log, "Version: ..... [$conf->{'system'}{version}].\n");
if ($conf->{'system'}{version})
{
	version($conf, $log);
	do_exit($conf, $log, 0);
}

# If I've been asked to show the info on the given node assassin, do so and
# then exit.
record($conf, $log, "List State: .. [$conf->{'system'}{list_state}].\n");
if ($conf->{'system'}{list_state})
{
	show_state($conf, $log);
	do_exit($conf, $log, 0);
}

# When asked to 'monitor' or 'list', do this... whatever 'this' is. All I know
# is that it should not generate output.
record($conf, $log, "Monitor: ..... [$conf->{'system'}{monitor}].\n");
record($conf, $log, "List: ........ [$conf->{'system'}{list}].\n");
if (($conf->{'system'}{monitor}) or ($conf->{'system'}{list}))
{
	show_list($conf, $log);
	do_exit($conf, $log, 0);
}

# If I made it this far, I am setting a state. Sort out what state from the
# values in my conf->{node} hash.
record($conf, $log, "Setting node: [$conf->{node}{port}] to action: [$conf->{node}{action}] using the Node Assassin: [$conf->{node}{ipaddr}] using the login: [$conf->{node}{login}/$conf->{node}{passwd}]\n");

# Convert the action into Node Assassin protocol arguments.
process_action($conf, $log);

# In the next step, when a 'check' is seen, the port is analyzed and an exit
# status is stored here. Exits 0, 1 and 2 have special meaning, so I default to
# 9.
my $exit_code=9;

# Process the orders.
foreach my $order (split/,/, $conf->{'system'}{call_order})
{
	if ($order=~/^sleep/)
	{
		my $time=$order=~/sleep (\d+)/ ? $1 : 1;
		record ($conf, $log, "Sleeping for: [$time]...\n");
		sleep $time;
		next;
	}
	record ($conf, $log, "Calling: [$order]\n");
	if ($order=~/(\d\d):(\D+)/)
	{
		my $node=$1;
		my $check=$2;
		# Verify the state of the port.
		record($conf, $log, "Status check on node: [$node] -> [$check]\n");
		
		# Get the state.
		my $states=get_states($conf, $log);
		if ($states == 1)
		{
			# I had a connection problem.
			do_exit($conf, $log, 1);
		}
		my $node_state=$states->{$node};
		record($conf, $log, "Node: [$node] state: [$node_state]\n");
		
		if ($check eq "check")
		{
			# Return '2' if the node is off and '0' if it is on.
			$exit_code=$node_state eq "off" ? 2 : 0;
		}
		elsif ($check eq "off")
		{
			# 'off' was called, make sure the node is now off. This
			# may be called by 'reboot' in which case 'exit_code'
			# will simply be over-written when the final 'reboot'
			# state check is called.
			$exit_code=$node_state eq "off" ? 0 : 1;
		}
		elsif ($check eq "on")
		{
			# 'on' was called, make sure the node is now on.
			$exit_code=$node_state eq "off" ? 1 : 0;
		}
		elsif ($check eq "reboot")
		{
			# Make sure that 'exit_code' was set to '0' by the
			# earlier call. We checked again to make sure the node
			# came back up, and will log an error if it didn't, but
			# we return '0' just the same, as per the API.
			if ($exit_code eq "0")
			{
				# The power off portion worked. Check if the
				# node booted properly and record an error if
				# not.
				if ($node_state eq "off")
				{
					record($conf, $log, "Node: [$node] failed to boot after a successful power off during a reboot action.\n");
					record($conf, $log, "This is a non-critical error as the node was fenced successfully but may\n");
					record($conf, $log, "indicate a hardware failure with the node or with Node Assassin itself.\n");
				}
			}
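			else
			{
				# The power off portion failed, exit with '1'.
				$exit_code=1;
			}
			# 'exit_code' is not recalculated from the node's
			# state here; a node that fails to boot has still been
			# fenced, so the result of the power off check stands.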
		}
		next;
	}
	my @set_state=$conf->{node}{handle}->cmd("$order");
	foreach my $line (@set_state)
	{
		record($conf, $log, $line);
	}
	record($conf, $log, "Call complete.\n");
}
record($conf, $log, "All calls complete, exiting.\n");

# Now confirm that the requested node is in its requested state and exit with
# the appropriate exit code. This function should not return.
# show_state($conf, $log);

# Cleanup and exit.
do_exit($conf, $log, $exit_code);

`Input, advice, complaints and meanderings all welcome!`
`Digimer`	`digimer@alteeve.ca`	`https://alteeve.ca/w`	`legal stuff:`
`All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions. © 1997-2013`
Naming credits go to Christopher Olah!
In memory of Kettle, Tonia, Josh, Leah and Harvey. In special memory of Hannah, Jack and Riley.

@@ Line 1: / Line 1: @@
 {{na_header}}
-'''NOTE''': The comments in this file need to be update, please don't trust them.
 This is the core fence agent that exists in <span class="code">/sbin/</span>.
@@ Line 11: / Line 9: @@
 # Digimer; digimer@alteeve.com
 # Mar. 05, 2010.
-# Version: 0.1.003
+# Version: 0.1.004
 #
 # Bugs;
@@ Line 18: / Line 16: @@
 =pod
-NAOS notes:
-Assign two ports per node. Have states '0' and '1' act on port 1 (odd-numbered
-ports) and have states '2' and '3' act on port 2 (even-numbered ports). For
-example, have:
-:0	fence port 1
-:1	release fence on port 1
-:2	pulse port 2 for one second.
-:3	pulse port 2 for five seconds.
-:0	fence port 3
-:1	release fence on port 3
-:2	pulse port 4 for one second.
-:3	pulse port 4 for five seconds.
-...
-:0	fence port 15
-:1	release fence on port 15
-:2	pulse port 16 for one second.
-:3	pulse port 16 for five seconds.
+Changes:
-Fence Agent notes:
+v0.1.004
-- 'fenced' issues the call to a fence agent.
+ - Fixed the command line argument bug.
-- Commands come in from STDIN when called by 'fenced' or 'fence_node'.
+ - Updated the 'help' message to be more accurate.
-  - Commands are passed in as 'var=val' pairs with lines starting with '#' to
-    be ignored. The example they give is:
-    - argument=value
-      #this line is ignored
-      argument2=value2
-  - Each argument is on a new line.
-  - The argument and value may contain spaces or other characters invalid in
-    XML.
-  - No spaces are allowed on either side of the '='.
-  - Arguments not recognized by the fence agent should simply be ignored.
-  - Technically, the argument's value can have an '=' sign, so be sure to split
-    on the first equal sign only. However, this is highly discouraged.
 Given the following:
@@ Line 159: / Line 129: @@
 		monitor		=>	"",
 		node_assassin_id=>	0,
+		got_cla		=>	0,	# This is set if command line arguments are read.
 	},
 	node		=>	{
@@ Line 196: / Line 167: @@
 # Find the TCP port from the config file.
-foreach my $i (1..$conf->{'system'}{nodes})
+foreach my $i (1..$conf->{'system'}{na_num})
 {
 	if ((lc($conf->{node}{$i}{ipaddr}) eq lc($conf->{node}{ipaddr})))
@@ Line 221: / Line 192: @@
 	Port	=>	$conf->{node}{tcp_port},
 	Prompt	=>	'/EOM$/',
-);
+	Errmode	=>	'return'
+) or do_exit($conf, $log, 1);
 # print "Handle: [$conf->{node}{handle}]\n";
 $conf->{node}{handle}->open($conf->{node}{ipaddr});
@@ Line 267: / Line 239: @@
 # If I made it this far, I am setting a state. Sort out what state from the
 # values in my conf->{node} hash.
-record ($conf, $log, "Setting node: [$conf->{node}{port}] to action: [$conf->{node}{action}] using the Node Assassin: [$conf->{node}{ipaddr}] using the login: [$conf->{node}{login}/$conf->{node}{passwd}]\n");
+record($conf, $log, "Setting node: [$conf->{node}{port}] to action: [$conf->{node}{action}] using the Node Assassin: [$conf->{node}{ipaddr}] using the login: [$conf->{node}{login}/$conf->{node}{passwd}]\n");
 # Convert the action into Node Assassin protocol arguments.
 process_action($conf, $log);
+# In the next step, when a 'check' is seen, the port is analyzed and an exit
+# status is stored here. Exits 0, 1 and 2 have special meaning, so I default to
+# 9.
+my $exit_code=9;
 # Process the orders.
@@ Line 283: / Line 260: @@
 	}
 	record ($conf, $log, "Calling: [$order]\n");
+	if ($order=~/(\d\d):(\D+)/)
+	{
+		my $node=$1;
+		my $check=$2;
+		# Verify the state of the port.
+		record($conf, $log, "Status check on node: [$node] -> [$check]\n");
+		# Get the state.
+		my $states=get_states($conf, $log);
+		if ($states == 1)
+		{
+			# I had a connection problem.
+			do_exit($conf, $log, 1);
+		}
+		my $node_state=$states->{$node};
+		record($conf, $log, "Node: [$node] state: [$node_state]\n");
+		if ($check eq "check")
+		{
+			# Return '2' if the node is off and '0' if it is on.
+			$exit_code=$node_state eq "off" ? 2 : 0;
+		}
+		elsif ($check eq "off")
+		{
+			# 'off' was called, make sure the node is now off. This
+			# may be called by 'reboot' in which case 'exit_code'
+			# will simply be over-written when the final 'reboot'
+			# state check is called.
+			$exit_code=$node_state eq "off" ? 0 : 1;
+		}
+		elsif ($check eq "on")
+		{
+			# 'on' was called, make sure the node is now off.
+			$exit_code=$node_state eq "off" ? 1 : 0;
+		}
+		elsif ($check eq "reboot")
+		{
+			# Make sure that 'exit_code' was set to '0' by the
+			# earlier call. We checked again to make sure the node
+			# came back up, and will log an error if it didn't, but
+			# we return '0' just the same, as per the API.
+			if ($exit_code eq "0")
+			{
+				# The power off portion worked. Check if the
+				# node booted properly and record an error if
+				# not.
+				if ($node_state eq "off")
+				{
+					record($conf, $log, "Node: [$node] failed to boot after a successful power off during a reboot action.\n");
+					record($conf, $log, "This is a non-critical error as the node was fenced successfully but may\n");
+					record($conf, $log, "indicate a hardware failure with the node or with Node Assassin itself.\n");
+				}
+			}
+			else
+			{
+				# The power off portion failed, exit with '1'.
+				$exit_code=1;
+			}
+			$exit_code=$node_state eq "off" ? 1 : 0;
+		}
+		next;
+	}
 	my @set_state=$conf->{node}{handle}->cmd("$order");
 	foreach my $line (@set_state)
@@ Line 291: / Line 330: @@
 }
 record($conf, $log, "All calls complete, exiting.\n");
+# Now confirm that the requested node is in it's requested state and exit with
+# the appropriate exit code. This function should not return.
+# show_state($conf, $log);
 # Cleanup and exit.
-do_exit($conf, $log, 0);
+do_exit($conf, $log, $exit_code);
 </source>
 {{na_footer}}
