Hybrid simulation protocol for NEES: a proposal

Paul Hubbard
May 11, 2006

Introduction

This document covers the details of the communications protocol. We are initially specifying it for use over TCP, but hope that it will be implemented on other networks as well. The main page covers the rest of the design and its architecture.

Revision history

  1. Initial posting May 25 2006
  2. Revised August 10 2006: Updates from Denver meeting, split into separate document
  3. August 21 2006, Change token description, added body type as uint16_t, flags as uint32_t, removed destination field

Communications protocol

In defining the protocol, researchers have requested that it not constrain the types, quantity or structure of data. For example, NTCP has a syntax based around verb/control point/parameter that limits itself to one float per parameter. Based on this requirement, the communications protocol will look like this:
  1. Message header:
    1. Message length, in bytes, uint32_t
    2. Message type, from an enumerated list, uint32_t
    3. Message flags (error, critical, optional, etc), uint32_t
    4. Security token/capability, e.g. GridAuth session ID, 64 bytes of uint8_t
    5. Sequence id, uint64_t
    6. Message body type, uint16_t
    7. Message contents, non-padded network-order IEEE 754 data types.
For this to work, the setup phase of the communications needs to convey the contents of the messages to all listeners. For example, the coordinator will say that "My commands will contain parameters for Element X, which are the following C structure...". The elements (whether simulated or real) will define what they expect and return. All of this will be done via the setup phase, using an XML-formatted set of messages that define the message formats. That way, we can use friendly XML to define data structures, and fast binary transmissions during the experiment to convey data. The key to this working is the fact that message types do not change during an experiment. Therefore, we can use fixed-format binary, minimizing the surprisingly large overhead in ASCII/binary conversion.
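As a rough illustration of the header layout listed above, here is a minimal Python sketch using the standard struct module. The field order and sizes come from the list. The helper names, the MSG_COMMAND value, the choice to count the header in the length field, and the null-padding of short tokens are assumptions made for the example, not part of the proposal.

```python
import struct

# Header fields in network (big-endian) order with no padding:
# length (uint32), type (uint32), flags (uint32), token (64 bytes),
# sequence id (uint64), body type (uint16).
HEADER = struct.Struct("!III64sQH")

MSG_COMMAND = 12  # hypothetical numeric value, for illustration only


def pack_message(msg_type, flags, token, sequence, body_type, body=b""):
    """Return header + body. Assumes the length field covers the whole message."""
    if len(token) > 64:
        raise ValueError("token must fit in 64 bytes")
    length = HEADER.size + len(body)
    # Short tokens are padded with null bytes (an assumption of this sketch).
    header = HEADER.pack(length, msg_type, flags, token.ljust(64, b"\0"),
                         sequence, body_type)
    return header + body


def unpack_message(data):
    """Split received bytes into a header dict and the raw body."""
    length, msg_type, flags, token, sequence, body_type = HEADER.unpack_from(data)
    header = {"length": length, "type": msg_type, "flags": flags,
              "token": token.rstrip(b"\0"), "sequence": sequence,
              "body_type": body_type}
    return header, data[HEADER.size:length]


# A body of three doubles (for example, displacements) in network order.
body = struct.pack("!3d", 0.5, -1.25, 2.0)
wire = pack_message(MSG_COMMAND, 0, b"session-id", 42, 1, body)
header, raw = unpack_message(wire)
values = struct.unpack("!3d", raw)
```

With this layout the fixed header occupies 86 bytes, and the receiver can decode the body with a single unpack call because the body format was agreed on during setup.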

The schema to be used is still under investigation; I am talking with MTS to see if we can leverage the effort they've done in this area. The requirements are pretty basic - we simply need to use XML to describe the format of messages that can contain integers, doubles and strings.
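To make the idea concrete, here is a sketch of how an XML definition could be turned into a fixed binary layout once, during setup. Since the schema is undecided, every element name, attribute name and type name below is invented for illustration, as is the compile_definition helper.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical definition; these element and attribute names are NOT the
# schema under discussion, just placeholders for the sketch.
DEFINITION = """
<message id="1" name="element_x_command">
  <field name="step" type="int32"/>
  <field name="displacement" type="double" count="3"/>
  <field name="label" type="string" length="8"/>
</message>
"""

# Assumed type names mapped to struct format codes.
TYPE_CODES = {"int32": "i", "double": "d"}


def compile_definition(xml_text):
    """Return (message id, struct.Struct) for one <message> definition."""
    root = ET.fromstring(xml_text)
    fmt = "!"  # network order, no padding
    for field in root.findall("field"):
        ftype = field.get("type")
        if ftype == "string":
            # Fixed-length byte string.
            fmt += field.get("length") + "s"
        else:
            fmt += field.get("count", "1") + TYPE_CODES[ftype]
    return int(root.get("id")), struct.Struct(fmt)


msg_id, layout = compile_definition(DEFINITION)
packed = layout.pack(7, 0.1, 0.2, 0.3, b"elem-x")
fields = layout.unpack(packed)
```

The XML is parsed only once; every message of that type afterwards is a single fixed-size pack or unpack, which is where the speed advantage over ASCII conversion would come from.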

Message types:
  1. MSG_DEFINE (param integer X), XML
    1. 'Define message X as the following contents'. This allows an experiment to predefine more than one message format, in case they need more flexible messaging. We could probably implement this in the second round, and flag it as unfinished in version 1.
  2. MSG_LOGIN (username, token)
    1. Send username and token to server so that it can verify permissions. All plugins receive this message as well and may optionally do their own verification. Any plugin can reject the login, which halts the experiment.
  3. MSG_OPEN (name, flags)
    1. Similar to the 'open a file' system call, attempt to connect to a back end. The flags define exclusive or shared access, to allow for shared hardware. If the hardware/backend/simulation is able to handle multiple controllers, it must handle multiple opens. If it supports at most one controller, it must return an error on the second open.
  4. MSG_CLOSE (name, flags)
    1. Release a back end, optionally waiting for any pending moves to complete.
  5. MSG_SET (sequence ID set)
    1. For a specific back end, send a message of a type previously defined with MSG_DEFINE.
    2. Used for initial stiffness matrices and other setup/shutdown needs.
    3. Can also be a broadcast if destination is empty.
  6. MSG_GET (sequence ID set)
    1. The counterpart to SET: query a destination for state/settings/readings.
    2. Possibly broadcast, as with SET.
  7. MSG_OK (sequence ID set)
    1. For LOGIN and (other messages?), report success. Only required for synchronous messages such as LOGIN.
  8. MSG_ERROR (sequence ID set, message body is error code, description and source, similar to SOAP)
    1. If LOGIN or other synchronous command fails, this is used to report the error.
    2. May also be sent on other error conditions. If flagged CRITICAL, the experiment must halt. If flagged as PAUSE, the experiment is paused while the error is cleared.
  9. MSG_PAUSE/MSG_RESUME (new sequence ID, body contains source of command and description)
    1. Provide the capability for any subcomponent to pause the experiment. Useful for fixing problems, placing the system into a safe mode for manual inspection, etc. If possible, hardware should be stopped; if this is impossible then an ERROR packet should be generated.
  10. MSG_INFO (new sequence ID, text in body)
    1. General otherwise unclassifiable information broadcast. Could also be used to emulate the ISEE/NSEP 'Discuss' or 'EXPINFO' messages.
  11. MSG_COMMAND (sequence ID, message body type set)
    1. Contains the computed next displacement/velocity/motion, with a message ID as defined by MSG_DEFINE
    2. No explicit acknowledgement is sent on success
  12. MSG_REACTION (sequence ID, message body type set)
    1. From the hardware/simulation, sent once the move is complete. Contains measured/computed reaction forces. This implies a successful completion of the MSG_COMMAND
    2. Sent asynchronously, depending on how fast the components complete their work.
  13. MSG_RESEND (sequence ID)
    1. This is sent upon reconnect, where either end can request the re-sending of a message. This may be sent several times in a row to catch up on lost messages. Each sender has a sequence number counter for each connection, and is responsible for buffering messages for re-sending. Note that the October code will probably only have the stubs of this code in place.
  14. MSG_RECONNECT (username, token, last sequence number)
    1. Re-establish connection, responding with an OK message containing the last unsent sequence number
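The message types and flags above could be captured as enumerations along the following lines. The names follow the list, but the numeric values and flag bit positions are placeholders of my own; the proposal does not assign them. The helper functions are likewise only a sketch of the halt/pause rule described under MSG_ERROR.

```python
from enum import IntEnum, IntFlag


class MsgType(IntEnum):
    # Values are placeholders that simply follow the order of the list.
    DEFINE = 1
    LOGIN = 2
    OPEN = 3
    CLOSE = 4
    SET = 5
    GET = 6
    OK = 7
    ERROR = 8
    PAUSE = 9
    RESUME = 10
    INFO = 11
    COMMAND = 12
    REACTION = 13
    RESEND = 14
    RECONNECT = 15


class MsgFlag(IntFlag):
    # Bit assignments are assumptions; the proposal only names the flags.
    ERROR = 0x1
    CRITICAL = 0x2
    OPTIONAL = 0x4
    PAUSE = 0x8


def must_halt(msg_type, flags):
    """An ERROR message flagged CRITICAL means the experiment must halt."""
    return msg_type == MsgType.ERROR and bool(flags & MsgFlag.CRITICAL)


def must_pause(msg_type, flags):
    """An ERROR message flagged PAUSE pauses the experiment; so does MSG_PAUSE."""
    if msg_type == MsgType.PAUSE:
        return True
    return msg_type == MsgType.ERROR and bool(flags & MsgFlag.PAUSE)
```

Because both enumerations are integer-based, their values can be written directly into the uint32_t type and flags fields of the header.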

Error handling

Error handling is somewhat speculative and based on the design of NTCP. Every message has a serial/transaction number in it, which is set by the sender of the message. In other words, the client maintains a counter for each connection and stamps each message with a serial number. The NEES communications library will track, increment and report this number to the calling code, which can then handle disconnect/reconnect.
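The per-connection bookkeeping described here might look something like the sketch below. The class name, the starting value of 1, and the gap-detection helper are assumptions for illustration; the proposal only says that the library tracks, increments and reports the number.

```python
class SequenceTracker:
    """Per-connection counters: one for outgoing and one for incoming messages."""

    def __init__(self):
        self.next_to_send = 1   # starting value is an assumption
        self.last_received = 0  # 0 means nothing received yet

    def stamp(self):
        """Return the sequence number for the next outgoing message."""
        seq = self.next_to_send
        self.next_to_send += 1
        return seq

    def note_received(self, seq):
        """Record an incoming sequence number and return any numbers skipped."""
        missing = list(range(self.last_received + 1, seq))
        self.last_received = max(self.last_received, seq)
        return missing


conn = SequenceTracker()
sent = [conn.stamp() for _ in range(3)]
gap_after_first = conn.note_received(1)
gap_after_jump = conn.note_received(4)
```

The value in last_received is what the calling code would hand to a reconnect attempt.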

Implications of sequence-based reconnection mechanism

Envision a scenario like the following: An experiment is under way when the connection between the coordinator and a server is lost. The coordinator catches the exception, and reinitiates the connection. The coordinator then sends a MSG_RECONNECT message to the server, with the username, token and last sequence number that it got from the server. The server responds with a MSG_OK packet, which has in its payload the last sequence number K it has queued for the coordinator. The coordinator can then request all messages that were lost in the interim.

Hmm. We could also have a command that sent 'all messages since sequence number N'. Or 'all messages since time t'.
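A sender-side buffer covering both the per-message resend and the 'all messages since N' idea could be sketched as follows. The class name, the capacity limit and the method names are hypothetical; nothing here is settled design.

```python
from collections import OrderedDict


class ResendBuffer:
    """Keeps recently sent messages, keyed by sequence number, for re-sending."""

    def __init__(self, capacity=1024):
        self.capacity = capacity  # buffer size limit is an assumption
        self.messages = OrderedDict()

    def record(self, seq, payload):
        """Store a sent message, discarding the oldest one when full."""
        self.messages[seq] = payload
        while len(self.messages) > self.capacity:
            self.messages.popitem(last=False)

    def last_sequence(self):
        """The value K returned in the MSG_OK reply to MSG_RECONNECT."""
        return next(reversed(self.messages), 0)

    def get(self, seq):
        """Answer a single MSG_RESEND; None if the message is no longer held."""
        return self.messages.get(seq)

    def since(self, n):
        """Answer a hypothetical 'all messages since sequence number N' request."""
        return [(s, p) for s, p in self.messages.items() if s > n]


# The server sent messages 1..5, but the coordinator only saw up to 3.
server = ResendBuffer()
for seq in range(1, 6):
    server.record(seq, f"reaction-{seq}".encode())

last_seen = 3
k = server.last_sequence()
resent = [server.get(seq) for seq in range(last_seen + 1, k + 1)]
```

The 'since N' form replaces the loop of individual requests with one round trip, at the cost of one more message type.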

Time-sequence example of communications

  1. Server is started. It reads a configuration file that defines what elements are to be used. Based on this, it can use dlopen to load a plugin/module to communicate with that element. For example, it will know that an actuator named 'foo' will have a corresponding library named foo.dll, or libfoo.so. This reduces the amount of configuration and allows for runtime reconfiguration. This follows the 'Convention over configuration' design pattern from Ruby on Rails. Alternately, the server may postpone the module load until requested by the coordinator.
  2. The hardware is started. Depending on the system, it may initiate communications with the server, or just open a listener.
  3. The coordinator is started. During the startup sequence:
    1. The coordinator gets an (optional) username and password from the user, or otherwise retrieves a token (Kerberos, GridAuth, PAM, etc). (MSG_LOGIN)
    2. The server validates the token, and confirms with the backends that the user has permissions.
    3. The coordinator defines the message formats for commands and reaction forces, for all real and simulated hardware. (MSG_DEFINE)
    4. Initial stiffness matrices are sent from the sim and hardware to the coordinator. (MSG_SET)
  4. The experiment enters the main loop. At each step
    1. The coordinator computes the next set of displacements/forces/motions
    2. The coordinator sends same to the server, which routes them to the correct plugins/backends. (MSG_COMMAND)
      1. If one or more backends report errors, these are relayed to the coordinator via a high-priority message tagged with the corresponding sequence ID. (MSG_ERROR/Critical)
    3. The backends do the actual move or other action, real or simulated.
    4. Reaction forces from the backends are reported to the coordinator. Since each plugin runs in a separate thread, these arrive as unordered messages. (MSG_REACTION)
  5. Shutdown phase - the coordinator indicates completion, which is forwarded to each plugin. Other messages may be exchanged at this point (final stiffness matrices, for example) as well. (MSG_SET/GET)
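The main loop in step 4 can be mimicked in a single process to show how unordered reactions are matched to a command by sequence ID. This is a toy model: in-process queues stand in for the network and the server's routing, each backend is a thread that returns stiffness times displacement, and all of the names are made up for the sketch.

```python
import queue
import threading


def backend(name, stiffness, commands, reactions):
    """Simulated backend: reaction force = stiffness * displacement."""
    while True:
        item = commands.get()
        if item is None:  # shutdown marker from the coordinator
            break
        seq, displacement = item
        reactions.put((seq, name, stiffness * displacement))


def run_experiment(stiffnesses, displacements):
    """Run one command/reaction exchange per displacement; return forces per step."""
    reactions = queue.Queue()
    inboxes = {name: queue.Queue() for name in stiffnesses}
    threads = [threading.Thread(target=backend,
                                args=(name, k, inboxes[name], reactions))
               for name, k in stiffnesses.items()]
    for t in threads:
        t.start()

    history = []
    for seq, displacement in enumerate(displacements, start=1):
        for inbox in inboxes.values():
            inbox.put((seq, displacement))      # MSG_COMMAND to every backend
        step = {}
        while len(step) < len(inboxes):         # MSG_REACTION, in any order
            r_seq, name, force = reactions.get()
            if r_seq != seq:
                raise RuntimeError("reaction does not match the current step")
            step[name] = force
        history.append(step)

    for inbox in inboxes.values():              # shutdown phase
        inbox.put(None)
    for t in threads:
        t.join()
    return history


history = run_experiment({"actuator": 2.0, "sim": 0.5}, [1.0, 2.0])
```

The coordinator does not advance to the next step until every backend has reported, so a reaction's arrival doubles as the acknowledgement of its command.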

Conclusions, feedback and path forward

This is an initial posting; I am quite certain that I've made errors and omissions. Please send feedback and corrections to the hybrid sim mailing list so that I can fix them.