Node Assassin - Original: Difference between revisions

From Alteeve Wiki
(91 intermediate revisions by the same user not shown)
Line 1: Line 1:
{{na_header}}
{{na_header}}
'''Note''': This is an archival page. Please check the current [[Node Assassin]] page for relevant and current information.


-=] '''''Paradise by the node assassin light''''' [=-
-=] '''''Paradise by the node assassin light''''' [=-
Line 6: Line 8:


Of course, you must proceed at your own risk. :)
Of course, you must proceed at your own risk. :)
[[Image:current_na.jpg|thumb|400px|right|'''Apr. 03, 2010''': The current Node Assassin; v1.1.4 Prototype A.]]
[[Image:an_cluster_logo_01.jpg|thumb|300px|left|The AN!Cluster unnamed assassin.]]
= Current Status =
'''Apr. 07, 2010''': ''DONE''! The fence agent is fully up to date. At this point, v1.1.4 of the [[Node_Assassin#Hardware|hardware]] and [[Node_Assassin#Software|software]] is complete! All that is left is to clean up the documentation and the project will be completely complete. '''HAPPY DANCE!'''.
'''Apr. 06, 2010''': Released [[NAOS]] v1.1.4.1 to fix a node ID number bug I found while working on the fence agent.
'''Apr. 05, 2010''': Finished [[NAOS]] v1.1.4! Last step needed is to bring the fence agent up to date and the current set of revisions will be done. I am hoping to finish that in the next couple of days.
'''Apr. 03, 2010''': The hardware for [[NA_v1.1.4|v1.1.4]] (full size) is done. I will now be working on the updated [[NAOS]] version to support reading the node power feed inputs and to properly output the states of the nodes. This will result in the fence agent needing to be updated, as well.
'''Mar. 29, 2010''': The full-sized version of [[NA v1.1.4|v1.1.4]] is coming along nicely. I've added pull-down resistors after a fellow hacklab'er caught them missing from the input side. I'll need to update the block diagram too, as I also moved the feed resistors to not resist the line going to the inputs. I should get that done tomorrow. '''Note''': [[NA_v1.1.4#Circuit|updated now]].
'''Mar. 26, 2010''': Some work stress has been slowing down [[NA v1.1.4 Protoshield Variant|v1.1.4]], but it is progressing (see link for updates). I've come to one conclusion though: you must be quite patient and steady-handed to make this circuit on the protoshield. As soon as it's done, and possibly sooner depending on how many bugs I find in this board, I will be making a full-size version. I suspect building it on a properly sized protoboard will be *much* faster.
'''Mar. 19, 2010''': Finished the circuit diagram for [[NA v1.1.4]] and the circuit diagram and board layout for the [[NA v1.1.4 Protoshield Variant|Arduino variant]]. Once built I will be able to finish the next iteration of [[NAOS]].
'''Mar. 14, 2010''': Added support for Node Assassin to the <span class="code">cluster.conf</span> XML validation file <span class="code">cluster.ng</span>.
'''Mar. 10, 2010''': Progress will be slow here while I divert back to working on the [[2-Node CentOS5 Cluster]] paper that this device was built for. Updates will resume here once the next version of the hardware is built. That is, the new version with independent power feed sensing to determine more reliably the state of a node.
'''Mar. 08, 2010''': Updated the [[#Fence Agent|fence agent]] to v0.1.005.
'''Mar. 07, 2010''': ''HUGE SUCCESS!'' For the first time ever, I was able to use the <span class="code">fence_tool</span> to successfully fence a node using Node Assassin! Granted, I had to lie about checking the node's state directly, but hey, the fence agent works!
'''Mar. 06, 2010''': Now working on adding a sense port for each node and to increase the number of I/O ports to a sufficiently high number to support 8 nodes (w/ 2 in + 1 out each).
'''Mar. 05, 2010''': The Fence Agent should now fully support calls from <span class="code">fenced</span>! It hasn't been tested yet; that is the next step. However, it does follow the [http://sources.redhat.com/cluster/wiki/FenceAgentAPI Fence Agent API].
'''Mar. 01, 2010''': The Fence Agent is now able to fully control Node Assassin. Next step is to set it up for control by [[CMAN]].
'''Feb. 28, 2010''': Released [[na_v1.1.3|NAOS ver. 1.0.3]]. Now working on the fence agent.
'''Feb. 27, 2010''': With the [[na_v1.1.3]] prototype complete, I will now be turning my attention to this fence agent. Progress should be noticeable from here on in.


= First Version =
= First Version =


The first version of Node Assassin is operational!
The first version of Node Assassin is operational! Also, the initial version of the '<span class="code">fence_na</span>' fencing agent and associated files are done.


It still needs to be tied into [[Red Hat]]'s 'ricci' and 'luci' programs to function as a fence device though. This fence device resides on the cluster's private network channel (or some other common intranet).
This fence device resides on the cluster's private network channel (or some other common intranet).


= Software =
= Software =
Line 22: Line 61:
== Version Control ==
== Version Control ==


All software related to this product is hosted on [http://github.com/digimer/Node-Assassin GitHub].
'''Note''': The code there is seriously out of date. I'll update it later. For now, grab code from this website directly.
 
All software related to this product is hosted on '''''OUT OF DATE'''''[http://github.com/digimer/Node-Assassin GitHub]'''''OUT OF DATE'''''.


== Naos v1.x Protocol ==
== NAOS Protocol v1.2 ==


It works by listening for a connection on TCP port 238 on IP 192.168.1.66 (both port and IP are configurable). Once connected, it uses a very simple protocol.
It works by listening for a connection on TCP port 238 on IP 192.168.1.66 (both port and IP are configurable). Once connected, it uses a very simple protocol.


Commands are:
=== Queries ===
 
Any message starting with <span class="code">00:x</span> is a '''query command'''. That is, the integer '''x''' selects the type of query sent to the node assassin.


Get the state of all nodes:
Get the state of all nodes:
<source lang="text">
  00:0
  00:0
</source>


This will generate a message like:
This will generate a message like:
Node states:  
<source lang="text">
- Max Node: 05
Node states:
- Node 01: Running
- Node Count: 04
- Node 02: Running
- Node 01: P0 R0 F1
- Node 03: Running
- Node 02: P0 R0 F1
- Node 04: Running
- Node 03: P1 R1 F0
- Node 05: Running
- Node 04: P0 R0 F0
End Message.
End Message.
</source>


Any node that is currently fenced will read '<span class="code">Fenced!</span>'.
The node status is broken down into three values: Power: <span class="code">P[0|1]</span>, Reset: <span class="code">R[0|1]</span> and Feed: <span class="code">F[0|1]</span>. For the power and reset switches, a value of <span class="code">0</span> indicates that the related switch is open. If it is set to <span class="code">1</span>, the switch is closed (the indicated button is "pressed"). For the Feed value, <span class="code">0</span> indicates that there is no power coming from the node's feed and thus the node is "Off" (or disconnected). If the feed is <span class="code">1</span>, then power is detected from the node and the node is known to be "On".
 
Quick reference table:
{|style="width: 600px; border-top: 1px dotted #7f7f7f; border-left: 1px dotted #7f7f7f;"
|-
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|'''Type'''
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|'''Code'''
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|<span class="code">0</span>
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|<span class="code">1</span>
|-
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Power Switch
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|<span class="code">P</span>
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Open
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Closed
|-
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Reset Switch
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|<span class="code">R</span>
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Open
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Closed
|-
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Power Feed
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|<span class="code">F</span>
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Node is Off/disconnected
|style="border-bottom: 1px dotted #7f7f7f; border-right: 1px dotted #7f7f7f; text-align: center"|Node is On
|}
 
Get information on the node assassin device:
<source lang="text">
00:1
</source>


Fence node '02' (Supports nodes from 01 to 05):
02:0
This will generate a message like:
This will generate a message like:
Node 02:0: Now Fenced!
<source lang="text">
Node info:
- Node Name: ..... Motoko
- Port Count: .... 04
- NAOS Version: .. v1.1.4
- Serial Number: . PR0002
- Build Date: .... 2010-04-03
- MAC address: ... 02:00:00:FF:F0:AA
- IP address: .... 192.168.1.66
- Subnet Mask: ... 255.255.255.0
- Default Gateway: 192.168.1.1
EOM
</source>
 
=== Commands ===
 
Commands are always given in the form: <span class="code">XX:Y</span> where:
* <span class="code">XX</span> is the two-digit node number, from <span class="code">01</span> up to the maximum node number.
* <span class="code">Y</span> is the single-digit command being issued.
 
'''Core commands are''':
 
Release:
* <span class="code">XX:0</span>: Release any existing fences on the node. Specifically, both power and reset switches are opened.
 
Fence:
* <span class="code">XX:1</span>: Specified node is fenced. Specifically;
** Reset switch is closed for one second to immediately disable the node.
** Reset switch is opened for one second.
** Power switch is closed. After five seconds the node's power feed is checked. If the feed is still high, NA waits another 25 seconds and checks again. If the feed is still high, an error is generated.
** Reset switch is closed. At this point, both power and reset are closed, which disables the node's front-panel switches and prevents accidental booting of the node before the fence is released.
 
Boot or Initiate ACPI Power Off:
* <span class="code">XX:2</span>: This will close the power switch for one second and then re-open it. If the node was off, this should initiate boot. If the node was on, this should initiate shutdown via ACPI. The power feed is checked prior to the one second fence and the returned message reflects whether the node is being booted or shut down.
 
Force Power Off:
* <span class="code">XX:3</span>: This will close the power switch. After five seconds the node's power feed is checked. If the feed is still high, NA waits another 25 seconds and checks again. If the feed is still high, an error is generated. Regardless of success or failure, the fence is removed.
 
== Fence Agent ==
 
This is the [[fenced]] fence agent for Node Assassin.
 
The Node Assassin fence agent '''v1.1.4''' is split up into three files:
* Source: [[fence_na]] - [http://nodeassassin.org/files/sbin/fence_na Download]
** This is the core fence agent that exists in <span class="code">/sbin/</span>.
* Source: [[fence_na.lib]] - [http://nodeassassin.org/files/etc/na/fence_na.lib Download]
** This is the fence agent's function library that exists in <span class="code">/etc/na/</span>.
* Source: [[fence_na.conf]] - [http://nodeassassin.org/files/etc/na/fence_na.conf Download]
** This is the common Node Assassin configuration file that exists in <span class="code">/etc/na/</span>.
 
The reason for the three files is that, later, there will be a fourth executable that will program the Node Assassin devices. When this program is created, it will consult the common configuration file and will use some of the functions in the library.
 
To test the agent, copy the following into a file (e.g. <span class="code">args.txt</span>):
<source lang="bash">
# Test file used as input for the NA fence agent.
ipaddr=motoko.alteeve.com
port=1
login=motoko
passwd=secret
action=on
</source>
 
And cat it into the fence agent via a pipe:
<source lang="bash">
clear; cat args.txt | ./fence_na
</source>


Release '02':
== XML Validation Support ==
02:1
 
This will generate a message like:
Until Node Assassin is natively supported, you will need to update the <span class="code">cluster.ng</span> validation file for <span class="code">cluster.conf</span> to successfully validate.
Node 02:1: Now running.


Fencing or releasing other nodes is as simple as replacing '02' above with the node number. The design allows for the expansion of ports up to 99 devices. It should be trivial to go beyond that, but it's a sufficiently high number for now.
To accomplish this, you will need to modify or replace the default <span class="code">/usr/share/system-config-cluster/misc/cluster.ng</span> to match this one:
* [[cluster.ng]]


Once a node is fenced, the caller can release and the state will be held until either another call releases the fence or the Node Assassin is reset.
The diff between the default version and the version above should be:
<source lang="bash">
diff /usr/share/system-config-cluster/misc/cluster.ng /root/backups/cluster.ng
</source>
<source lang="diff">
47,62d46
<        <!-- Node Assassin -->
<        <group>
<        <attribute name="ipaddr"/>
<        <optional>
<        <attribute name="login"/>
<        </optional>
<        <optional>
<        <attribute name="passwd"/>
<        </optional>
<        <optional>
<        <attribute name="passwd_script"/>
<        </optional>
<        <optional>
<          <attribute name="quiet"/>
<        </optional>
<        </group>
1062,1066d1045
<        <!-- Node Assassin -->
<        <group>
<          <attribute name="port"/>
<          <attribute name="action"/>
<        </group>
</source>


= Hardware =
= Hardware =


* [[NA Hardware v1.0]]
== v1.1.4 ==


This is <span class="code">v1.1</span> of the Node Assassin hardware
'''Apr. 03, 2010''': Current version


Mark Loit suggested improving the circuit design by swapping out the [[ULN2003A]] and the [[HLS-4078-DC5V]] with something like the [[TLP281-4]] or [[4N35]] opto-isolators. This will allow for a much reduced power consumption and size allowing for the circuit to potentially exist on an Arduino [[Protoshield]].
* [[NA v1.1.4]] - Hardware complete, [[NAOS]] and fence agent now being updated.


= Pictures =
This version adds read pins connected to each node's power LED to check a node's power state independent of the node's port states. This is required to meet the [http://sources.redhat.com/cluster/wiki/FenceAgentAPI FenceAgentAPI] requirements.


This will be updated as the build comes along. Once I am done I will post a diagram. No sense sharing my bugs at this point.
== Cables ==


[[Image:na_build_v1.0.1_01.jpg|800px|thumb|left|The relay-version of the board.]]
The article linked below contains recommended cabling and pinouts for Node Assassin devices. By following these standards, you will ensure compatibility with other Node Assassin devices and cables.


[[Image:na_build_v1.0.1_02.jpg|800px|thumb|left|The relay-version of the board, with example output in the background.]]
* [[Node Assassin Cabling Standards]]


[[Image:na_design_v1.0.1_01.jpg|800px|thumb|left|A poorly hand-drawn, photographed image of the relay-version of the board. Note that Node 5's flyback diode and the 220 ohm resistors feeding the status LED side of the relays are not there...]]
== Old Versions ==


* [[NA_v1.1.3|NA Hardware v1.1.3]]
* [[NA Hardware v1.1.2]] (never built)
* [[NA Hardware v1.0]]


{{na_footer}}
{{na_footer}}

Latest revision as of 16:15, 16 April 2010


Note: This is an archival page. Please check the current Node Assassin page for relevant and current information.

-=] Paradise by the node assassin light [=-

Node Assassin is an open-source, open-hardware project to create a network-attached cluster fence device.

Of course, you must proceed at your own risk. :)

Apr. 03, 2010: The current Node Assassin; v1.1.4 Prototype A.
The AN!Cluster unnamed assassin.

Current Status

Apr. 07, 2010: DONE! The fence agent is fully up to date. At this point, v1.1.4 of the hardware and software is complete! All that is left is to clean up the documentation and the project will be completely complete. HAPPY DANCE!.

Apr. 06, 2010: Released NAOS v1.1.4.1 to fix a node ID number bug I found while working on the fence agent.

Apr. 05, 2010: Finished NAOS v1.1.4! Last step needed is to bring the fence agent up to date and the current set of revisions will be done. I am hoping to finish that in the next couple of days.

Apr. 03, 2010: The hardware for v1.1.4 (full size) is done. I will now be working on the updated NAOS version to support reading the node power feed inputs and to properly output the states of the nodes. This will result in the fence agent needing to be updated, as well.

Mar. 29, 2010: The full-sized version of v1.1.4 is coming along nicely. I've added pull-down resistors after a fellow hacklab'er caught them missing from the input side. I'll need to update the block diagram too, as I also moved the feed resistors to not resist the line going to the inputs. I should get that done tomorrow. Note: updated now.

Mar. 26, 2010: Some work stress has been slowing down v1.1.4, but it is progressing (see link for updates). I've come to one conclusion though: you must be quite patient and steady-handed to make this circuit on the protoshield. As soon as it's done, and possibly sooner depending on how many bugs I find in this board, I will be making a full-size version. I suspect building it on a properly sized protoboard will be *much* faster.

Mar. 19, 2010: Finished the circuit diagram for NA v1.1.4 and the circuit diagram and board layout for the Arduino variant. Once built I will be able to finish the next iteration of NAOS.

Mar. 14, 2010: Added support for Node Assassin to the cluster.conf XML validation file cluster.ng.

Mar. 10, 2010: Progress will be slow here while I divert back to working on the 2-Node CentOS5 Cluster paper that this device was built for. Updates will resume here once the next version of the hardware is built. That is, the new version with independent power feed sensing to determine more reliably the state of a node.

Mar. 08, 2010: Updated the fence agent to v0.1.005.

Mar. 07, 2010: HUGE SUCCESS! For the first time ever, I was able to use the fence_tool to successfully fence a node using Node Assassin! Granted, I had to lie about checking the node's state directly, but hey, the fence agent works!

Mar. 06, 2010: Now working on adding a sense port for each node and to increase the number of I/O ports to a sufficiently high number to support 8 nodes (w/ 2 in + 1 out each).

Mar. 05, 2010: The Fence Agent should now fully support calls from fenced! It hasn't been tested yet; that is the next step. However, it does follow the Fence Agent API.

Mar. 01, 2010: The Fence Agent is now able to fully control Node Assassin. Next step is to set it up for control by CMAN.

Feb. 28, 2010: Released NAOS ver. 1.0.3. Now working on the fence agent.

Feb. 27, 2010: With the na_v1.1.3 prototype complete, I will now be turning my attention to this fence agent. Progress should be noticeable from here on in.

First Version

The first version of Node Assassin is operational! Also, the initial version of the 'fence_na' fencing agent and associated files are done.

This fence device resides on the cluster's private network channel (or some other common intranet).

Software

This is the initial release of the fence control software; Node Assassin Operating System.

Source Code and notes:

Version Control

Note: The code there is seriously out of date. I'll update it later. For now, grab code from this website directly.

All software related to this product is hosted on GitHub (OUT OF DATE).

NAOS Protocol v1.2

It works by listening for a connection on TCP port 238 on IP 192.168.1.66 (both port and IP are configurable). Once connected, it uses a very simple protocol.
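
The exchange described above can be sketched from the client side in Python. This is only a sketch under stated assumptions: the newline terminator on the outgoing message, the function name `na_request` and the 40 second timeout are all hypothetical and not taken from the NAOS source. The reply is read until one of the terminator lines seen in the example replies on this page (`End Message.` or `EOM`) arrives, or until the device closes the connection.

```python
import socket

# Terminator lines seen in the example replies on this page.
TERMINATORS = ("End Message.", "EOM")


def na_request(message, host="192.168.1.66", port=238, timeout=40.0):
    """Send one protocol message (e.g. "00:0") and return the reply lines.

    Assumption: the message is newline-terminated and the reply ends with
    one of TERMINATORS, or with the device closing the connection.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((message + "\n").encode("ascii"))
        buffer = b""
        while True:
            chunk = sock.recv(1024)
            if not chunk:
                break  # connection closed by the device
            buffer += chunk
            lines = buffer.decode("ascii", errors="replace").splitlines()
            if lines and lines[-1].strip() in TERMINATORS:
                break  # full reply received
    text = buffer.decode("ascii", errors="replace")
    return [line.rstrip() for line in text.splitlines()]
```

The default host and port are the ones quoted above; both would need to match whatever the device is actually configured to use.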

Queries

Any message starting with 00:x is a query command. That is, the integer x selects the type of query sent to the node assassin.

Get the state of all nodes:

 00:0

This will generate a message like:

Node states:
- Node Count: 04
- Node 01: P0 R0 F1
- Node 02: P0 R0 F1
- Node 03: P1 R1 F0
- Node 04: P0 R0 F0
End Message.

The node status is broken down into three values: Power: P[0|1], Reset: R[0|1] and Feed: F[0|1]. For the power and reset switches, a value of 0 indicates that the related switch is open. If it is set to 1, the switch is closed (the indicated button is "pressed"). For the Feed value, 0 indicates that there is no power coming from the node's feed and thus the node is "Off" (or disconnected). If the feed is 1, then power is detected from the node and the node is known to be "On".

Quick reference table:

Type          Code  0                         1
Power Switch  P     Open                      Closed
Reset Switch  R     Open                      Closed
Power Feed    F     Node is Off/disconnected  Node is On
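
As an illustration of how a caller might read the state reply, here is a small Python sketch that turns the example message into a dictionary. The function name `parse_node_states` and the dictionary layout are hypothetical. The flag meanings follow the prose description above: `1` means a closed switch for P and R, and power detected for F.

```python
import re

# Matches lines such as "- Node 03: P1 R1 F0".
STATE_LINE = re.compile(r"^- Node (\d{2}): P([01]) R([01]) F([01])$")


def parse_node_states(reply):
    """Return {node_number: {"power": bool, "reset": bool, "feed": bool}}."""
    states = {}
    for line in reply.splitlines():
        match = STATE_LINE.match(line.strip())
        if match:
            node, power, reset, feed = match.groups()
            states[int(node)] = {
                "power": power == "1",  # power switch closed ("pressed")
                "reset": reset == "1",  # reset switch closed ("pressed")
                "feed": feed == "1",    # power detected, node is "On"
            }
    return states


example = """Node states:
- Node Count: 04
- Node 01: P0 R0 F1
- Node 02: P0 R0 F1
- Node 03: P1 R1 F0
- Node 04: P0 R0 F0
End Message."""

states = parse_node_states(example)
print(states[3])  # {'power': True, 'reset': True, 'feed': False}
```

In the example, node 03 has both switches closed and no power on its feed, which is what a fenced node is expected to look like.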

Get information on the node assassin device:

 00:1

This will generate a message like:

Node info: 
- Node Name: ..... Motoko
- Port Count: .... 04
- NAOS Version: .. v1.1.4
- Serial Number: . PR0002
- Build Date: .... 2010-04-03
- MAC address: ... 02:00:00:FF:F0:AA
- IP address: .... 192.168.1.66
- Subnet Mask: ... 255.255.255.0
- Default Gateway: 192.168.1.1
EOM
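
The info reply uses dot leaders to line up its values, so a reader has to strip them. A minimal Python sketch, assuming every field line has the shape `- Name: ... value` as in the example above (the function name `parse_node_info` is hypothetical):

```python
import re

# "- <name>: <optional dot leader> <value>"; the name never contains a colon.
INFO_LINE = re.compile(r"^- ([^:]+): [. ]*(.*)$")


def parse_node_info(reply):
    """Return the fields of a "Node info" reply as a {name: value} dict."""
    info = {}
    for line in reply.splitlines():
        match = INFO_LINE.match(line.strip())
        if match:
            info[match.group(1)] = match.group(2)
    return info


example = """Node info:
- Node Name: ..... Motoko
- Port Count: .... 04
- NAOS Version: .. v1.1.4
- Serial Number: . PR0002
- Build Date: .... 2010-04-03
- MAC address: ... 02:00:00:FF:F0:AA
- IP address: .... 192.168.1.66
- Subnet Mask: ... 255.255.255.0
- Default Gateway: 192.168.1.1
EOM"""

info = parse_node_info(example)
print(info["MAC address"])  # 02:00:00:FF:F0:AA
```

Note that this approach would mangle a value that itself begins with a dot; none of the fields in the example do.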

Commands

Commands are always given in the form: XX:Y where:

  • XX is the two-digit node number, from 01 up to the maximum node number.
  • Y is the single-digit command being issued.
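
A small Python helper can make the `XX:Y` format concrete. This is a sketch only: the name `build_command` is hypothetical, and the default upper bound of 99 is an assumption based on the two-digit field width. In practice the limit would be the port count reported by the device.

```python
def build_command(node, command, max_node=99):
    """Format a node command as "XX:Y", e.g. build_command(2, 1) -> "02:1".

    Node 00 is rejected here because messages starting with 00 are queries.
    """
    if not 1 <= node <= max_node:
        raise ValueError(f"node must be between 1 and {max_node}, got {node}")
    if not 0 <= command <= 9:
        raise ValueError(f"command must be a single digit, got {command}")
    return f"{node:02d}:{command}"


print(build_command(2, 1))  # 02:1
```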

Core commands are:

Release:

  • XX:0: Release any existing fences on the node. Specifically, both power and reset switches are opened.

Fence:

  • XX:1: Specified node is fenced. Specifically;
    • Reset switch is closed for one second to immediately disable the node.
    • Reset switch is opened for one second.
    • Power switch is closed. After five seconds the node's power feed is checked. If the feed is still high, NA waits another 25 seconds and checks again. If the feed is still high, an error is generated.
    • Reset switch is closed. At this point, both power and reset are closed, which disables the node's front-panel switches and prevents accidental booting of the node before the fence is released.
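
The fence steps above can be sketched as a short Python routine. This is not the NAOS firmware: `set_reset`, `set_power` and `read_feed` are hypothetical callbacks standing in for the hardware, and the sleep function is injectable so the timing can be exercised without waiting. The description does not say whether the final reset-close step still runs after an error, so this sketch assumes the sequence stops there.

```python
import time


def fence_node(set_reset, set_power, read_feed, sleep=time.sleep):
    """Run the fence sequence; return True if the node's feed went low.

    set_reset and set_power take True (closed) or False (open).
    read_feed returns True while the node's power feed is high.
    """
    set_reset(True)    # hold reset for one second to disable the node
    sleep(1)
    set_reset(False)   # release reset for one second
    sleep(1)
    set_power(True)    # close the power switch and keep it closed
    sleep(5)
    if read_feed():    # still powered after five seconds
        sleep(25)
        if read_feed():
            return False  # the device would generate an error here
    set_reset(True)    # both switches closed: front panel is disabled
    return True


# Dry run with a node whose feed drops right away and no real waiting.
log = []
ok = fence_node(
    set_reset=lambda closed: log.append(("reset", closed)),
    set_power=lambda closed: log.append(("power", closed)),
    read_feed=lambda: False,
    sleep=lambda seconds: None,
)
print(ok, log)
```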

Boot or Initiate ACPI Power Off:

  • XX:2: This will close the power switch for one second and then re-open it. If the node was off, this should initiate boot. If the node was on, this should initiate shutdown via ACPI. The power feed is checked prior to the one second fence and the returned message reflects whether the node is being booted or shut down.

Force Power Off:

  • XX:3: This will close the power switch. After five seconds the node's power feed is checked. If the feed is still high, NA waits another 25 seconds and checks again. If the feed is still high, an error is generated. Regardless of success or failure, the fence is removed.

Fence Agent

This is the fenced fence agent for Node Assassin.

The Node Assassin fence agent v1.1.4 is split up into three files:

  • Source: fence_na - Download
    • This is the core fence agent that exists in /sbin/.
  • Source: fence_na.lib - Download
    • This is the fence agent's function library that exists in /etc/na/.
  • Source: fence_na.conf - Download
    • This is the common Node Assassin configuration file that exists in /etc/na/.

The reason for the three files is that, later, there will be a fourth executable that will program the Node Assassin devices. When this program is created, it will consult the common configuration file and will use some of the functions in the library.

To test the agent, copy the following into a file (e.g. args.txt):

# Test file used as input for the NA fence agent.
ipaddr=motoko.alteeve.com
port=1
login=motoko
passwd=secret
action=on

And cat it into the fence agent via a pipe:

clear; cat args.txt | ./fence_na
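
To illustrate the input format, the following Python sketch reads the same name=value lines the test file contains. It is not the fence_na source. The function name `read_agent_args` is hypothetical, and skipping blank lines and `#` comments is an assumption based on the sample file above.

```python
import io


def read_agent_args(stream):
    """Parse name=value lines from a file-like object into a dict."""
    args = {}
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep:  # ignore lines without an "="
            args[name.strip()] = value.strip()
    return args


sample = io.StringIO(
    "# Test file used as input for the NA fence agent.\n"
    "ipaddr=motoko.alteeve.com\n"
    "port=1\n"
    "login=motoko\n"
    "passwd=secret\n"
    "action=on\n"
)
args = read_agent_args(sample)
print(args["ipaddr"], args["port"], args["action"])  # motoko.alteeve.com 1 on
```

A real agent would read from standard input rather than a string buffer; passing `sys.stdin` to the same function would do that.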

XML Validation Support

Until Node Assassin is natively supported, you will need to update the cluster.ng validation file for cluster.conf to successfully validate.

To accomplish this, you will need to modify or replace the default /usr/share/system-config-cluster/misc/cluster.ng to match this one:

The diff between the default version and the version above should be:

diff /usr/share/system-config-cluster/misc/cluster.ng /root/backups/cluster.ng
47,62d46
<        <!-- Node Assassin -->
<        <group>
<         <attribute name="ipaddr"/>
<         <optional>
<         <attribute name="login"/>
<         </optional>
<         <optional>
<         <attribute name="passwd"/>
<         </optional>
<         <optional>
<         <attribute name="passwd_script"/>
<         </optional>
<         <optional>
<          <attribute name="quiet"/>
<         </optional>
<        </group>
1062,1066d1045
<         <!-- Node Assassin -->
<         <group>
<          <attribute name="port"/>
<          <attribute name="action"/>
<         </group>

Hardware

v1.1.4

Apr. 03, 2010: Current version

  • NA v1.1.4 - Hardware complete, NAOS and fence agent now being updated.

This version adds read pins connected to each node's power LED to check a node's power state independent of the node's port states. This is required to meet the FenceAgentAPI requirements.

Cables

The article linked below contains recommended cabling and pinouts for Node Assassin devices. By following these standards, you will ensure compatibility with other Node Assassin devices and cables.

Old Versions

 

Input, advice, complaints and meanderings all welcome!
Digimer digimer@alteeve.ca https://alteeve.ca/w legal stuff:  
All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions. © 1997-2013
Naming credits go to Christopher Olah!
In memory of Kettle, Tonia, Josh, Leah and Harvey. In special memory of Hannah, Jack and Riley.