Hybrid simulation protocol for NEES: a proposal
Paul Hubbard
May 11, 2006
Introduction
This covers the details of the communications protocol. We are
initially specifying it for use over TCP, but hope that it will be
implemented on other networks as well. The main
page covers the rest of the design and its architecture.
Revision history
- Initial posting May 25 2006
- Revised August 10 2006: Updates from Denver meeting, split into
separate document
- August 21 2006, Change token description, added body type as
uint16_t, flags as uint32_t, removed destination field
Communications protocol
In defining the protocol, researchers have requested that it not
constrain the types, quantity or structure of data. For example, NTCP
has a syntax based around verb/control point/parameter that limits
itself to one float per parameter. Based on this requirement, the communications
protocol will look like this:
- Message header:
- Message length, in bytes, uint32_t
- Message type, from an enumerated list, uint32_t
- Message flags (error, critical, optional, etc), uint32_t
- Security token/capability, e.g. GridAuth session ID, 64 bytes
of uint8_t
- Sequence id, uint64_t
- Message body type, uint16_t
- Message contents, non-padded network-order IEEE 754 data
types.
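As a rough illustration of the header layout listed above, the following Python sketch packs and unpacks one message with the standard struct module. The field order and widths come from the list; the choice that the length field counts header plus body, the stripping of trailing NUL bytes from the token, and all function names are assumptions made for this example only.

```python
import struct

# Field order follows the header list: length, type, flags, 64-byte token,
# sequence id, body type. "!" selects network byte order with no padding.
HEADER = struct.Struct("!III64sQH")


def pack_message(msg_type, flags, token, sequence_id, body_type, body=b""):
    """Build one message. Assumption: length covers header plus body."""
    length = HEADER.size + len(body)
    # struct pads a short token with NUL bytes up to 64 bytes.
    header = HEADER.pack(length, msg_type, flags, token, sequence_id, body_type)
    return header + body


def unpack_message(data):
    """Split a received buffer into header fields and the raw body bytes."""
    length, msg_type, flags, token, sequence_id, body_type = HEADER.unpack_from(data)
    return {
        "length": length,
        "type": msg_type,
        "flags": flags,
        "token": token.rstrip(b"\0"),  # assumption: tokens are NUL-padded
        "sequence_id": sequence_id,
        "body_type": body_type,
        "body": data[HEADER.size:length],
    }


# Example body: two network-order IEEE 754 doubles.
example_body = struct.pack("!2d", 1.5, -0.25)
example = pack_message(1, 0, b"session-abc", 42, 7, example_body)
```

With this layout the header is 86 bytes, so the per-message overhead is fixed and small compared to a text encoding.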
For this to work, the setup phase of the communications needs to convey
the contents of the messages to all listeners. For example, the
coordinator will say that "My commands will contain parameters for
Element X, which are the following C structure...". The elements
(whether simulated or real) will define what they expect and return.
All of this will be done via the setup phase, using an XML-formatted
set of messages that define the message formats. That way, we can use
friendly XML to define data structures, and fast binary transmissions
during the experiment to convey data. The key to this working is the
fact that message types do not change during an experiment. Therefore,
we can use fixed-format binary, minimizing the surprisingly large
overhead in ASCII/binary conversion.
The schema to be used is still under investigation; I am talking with
MTS to see if we can leverage the effort they've done in this area. The
requirements are pretty basic - we simply need to use XML to describe the
format of messages that can contain integers, doubles and strings.
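Since the schema is not yet chosen, the following is only a sketch of the idea: an XML description is read once during setup and turned into a fixed binary layout that is reused for every message. The element and attribute names (message, field, type, length) and the type names are invented for this example and are not a proposed schema.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical definition; the tag and attribute names are assumptions.
DEFINITION = """
<message id="1" name="actuator_command">
  <field name="displacement" type="double"/>
  <field name="velocity" type="double"/>
  <field name="step" type="int32"/>
  <field name="label" type="string" length="8"/>
</message>
"""

# Map the assumed XML type names onto struct format codes.
TYPE_CODES = {"double": "d", "int32": "i"}


def compile_definition(xml_text):
    """Return (message id, field names, struct layout) for one definition."""
    root = ET.fromstring(xml_text)
    fmt = "!"  # network byte order, no padding
    names = []
    for field in root.findall("field"):
        kind = field.get("type")
        if kind == "string":
            # Fixed-length strings keep the binary format fixed-size.
            fmt += field.get("length") + "s"
        else:
            fmt += TYPE_CODES[kind]
        names.append(field.get("name"))
    return int(root.get("id")), names, struct.Struct(fmt)


msg_id, field_names, layout = compile_definition(DEFINITION)
packed = layout.pack(0.5, 0.0, 3, b"foo")
```

The XML is parsed only once per definition; during the experiment each message costs a single pack or unpack call.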
Message types:
- MSG_DEFINE (param integer X), XML
- 'Define message X as the following contents'. This allows an
experiment to predefine more than one message format, in case they need
more flexible messaging. We could probably implement this in the second
round, and flag it as unfinished in version 1.
- MSG_LOGIN (username, token)
- Send username and token to server so that it can verify
permissions. All plugins receive this message as well and may
optionally do their own verification. Any plugin can reject the login,
which halts the experiment.
- MSG_OPEN (name, flags)
- Similar to the 'open a file' system call, attempt to connect to
a back end. The flags define exclusive or shared access, to allow
for shared hardware. If the hardware/backend/simulation is able to
handle multiple controllers, it must handle multiple opens. If
at-most-one controller, then it needs to return an error on the second
open.
- MSG_CLOSE (name, flags)
- Release a back end, optionally waiting for any pending moves to
complete.
- MSG_SET (sequence ID set)
- For a specific back end, send a message of type previously
defined with MSG_DEFINE
- Used for initial stiffness matrices and other setup/shutdown
needs.
- Can also be a broadcast if destination is empty.
- MSG_GET (sequence ID set)
- Corresponding to SET, query a destination for
state/settings/readings
- Possibly broadcast, as with SET.
- MSG_OK (sequence ID set)
- For LOGIN and (other messages?), report success. Only required
for synchronous messages such as LOGIN.
- MSG_ERROR (sequence ID set, message body is error code,
description and source, similar to SOAP)
- If LOGIN or other synchronous command fails, this is used to
report the error.
- May also be sent on other error conditions. If flagged
CRITICAL, the experiment must halt. If flagged as PAUSE, the experiment
is paused while the error is cleared.
- MSG_PAUSE/MSG_RESUME (new sequence ID, body contains source of
command and description)
- Provide the capability for any subcomponent to pause the
experiment. Useful for fixing problems, placing the system into a safe
mode for manual inspection, etc. If possible, hardware should be
stopped; if this is impossible, then an ERROR packet should be generated.
- MSG_INFO (new sequence ID, text in body)
- General otherwise unclassifiable information broadcast. Could
also be used to emulate the ISEE/NSEP 'Discuss' or 'EXPINFO' messages.
- MSG_COMMAND (sequence ID, message body type set)
- Contains the computed next displacement/velocity/motion, with a
message ID as defined by MSG_DEFINE
- No explicit acknowledgement is sent on success
- MSG_REACTION (sequence ID, message body type set)
- From the hardware/simulation, sent once the move is complete.
Contains measured/computed reaction forces. This implies a successful
completion of the MSG_COMMAND
- Sent asynchronously, depending on how fast the components
complete their work.
- MSG_RESEND (sequence ID)
- This is sent upon reconnect, where either end can request the
re-sending of a message. This may be sent several times in a row to
catch up on lost messages. Each sender has a sequence number counter
for each connection, and is responsible for buffering messages for
re-sending. Note that the October code will probably only have the
stubs of this code in place.
- MSG_RECONNECT (username, token, last sequence number)
- Re-establish connection, responding with an OK message
containing the last unsent sequence number
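The message types and flags above could be represented as enumerations along these lines. The numeric values are placeholders chosen for this sketch, since the proposal does not assign any, and the helper function names are hypothetical.

```python
from enum import IntEnum, IntFlag


class MsgType(IntEnum):
    """Message types from the list above; the numbers are assumptions."""
    DEFINE = 1
    LOGIN = 2
    OPEN = 3
    CLOSE = 4
    SET = 5
    GET = 6
    OK = 7
    ERROR = 8
    PAUSE = 9
    RESUME = 10
    INFO = 11
    COMMAND = 12
    REACTION = 13
    RESEND = 14
    RECONNECT = 15


class MsgFlag(IntFlag):
    """Header flag bits (uint32_t); the bit positions are assumptions."""
    ERROR = 0x1
    CRITICAL = 0x2
    OPTIONAL = 0x4
    PAUSE = 0x8


def must_halt(msg_type, flags):
    """An ERROR message flagged CRITICAL halts the experiment."""
    return msg_type == MsgType.ERROR and bool(flags & MsgFlag.CRITICAL)


def must_pause(msg_type, flags):
    """An ERROR message flagged PAUSE pauses until the error is cleared."""
    return msg_type == MsgType.ERROR and bool(flags & MsgFlag.PAUSE)
```

Because both classes derive from int, their values can be written directly into the uint32_t type and flags fields of the header.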
Error handling
Error handling is somewhat speculative and based on the design of NTCP.
Every message has a serial/transaction number in it, which is set by the
sender of the message. In other words, the client maintains a counter
for each connection and stamps each message with a serial number. The
NEES communications library will track, increment and report this
number to the calling code, which can then handle disconnect/reconnect.
Implications of sequence-based reconnection mechanism
Envision a scenario like the following: An experiment is under way when
the connection between the coordinator and a server is lost. The
coordinator catches the exception, and reinitiates the connection. The
coordinator then sends a MSG_RECONNECT message to the server, with the
username, token and last sequence number that it got from the server.
The server responds with a MSG_OK packet, which has in its payload the
last sequence number K it has queued for the coordinator. The
coordinator can then request all messages that were lost in the interim.
Hmm. We could also have a command that sent 'all messages since
sequence number N'. Or 'all messages since time t'.
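The reconnect scenario can be sketched as follows, assuming sequence numbers start at 1 and increase by one per message on each connection. The class and function names are hypothetical, and the buffer here never discards anything, which a real implementation would need to bound.

```python
class SendBuffer:
    """Per-connection sequence counter plus a buffer of sent messages."""

    def __init__(self):
        self.next_sequence = 1  # assumption: numbering starts at 1
        self.sent = {}          # sequence number -> payload, kept for resends

    def stamp(self, payload):
        """Assign the next sequence number and remember the payload."""
        seq = self.next_sequence
        self.next_sequence += 1
        self.sent[seq] = payload
        return seq

    def last_sequence(self):
        """Value the server would report in its MSG_OK reply (K above)."""
        return self.next_sequence - 1

    def resend(self, seq):
        """Answer a MSG_RESEND request for one sequence number."""
        return self.sent[seq]


def missing_after_reconnect(last_received, last_queued):
    """Sequence numbers the coordinator still has to request."""
    return list(range(last_received + 1, last_queued + 1))


# The server queued four messages; the coordinator only saw the first two.
server = SendBuffer()
for payload in (b"a", b"b", b"c", b"d"):
    server.stamp(payload)
to_request = missing_after_reconnect(2, server.last_sequence())
recovered = [server.resend(seq) for seq in to_request]
```

A 'send everything since N' command would collapse the loop of MSG_RESEND requests into a single round trip.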
Time-sequence example of communications
- Server is started. It reads a configuration file that defines
what elements are to be used. Based on this, it can use dlopen
to load a plugin/module to communicate with that element. For example,
it will know that an actuator named 'foo' will have a corresponding
library named foo.dll, or libfoo.so. This reduces the configuration
and allows for runtime reconfiguration. Alternately, it may
postpone the module load until requested by the coordinator. This
follows the 'Convention over configuration' design pattern popularized by Ruby on Rails.
- The hardware is started. Depending on the system, it may initiate
communications with the server, or just open a listener.
- The coordinator is started. During the startup sequence, the
following has to happen:
- It has to get (optional) username and password from the user,
or otherwise retrieve a token. (Kerberos, GridAuth, PAM, etc).
(MSG_LOGIN)
- Server validates token, and confirms with backends that user
has permissions.
- Message format for commands and reaction forces, for all real
and simulated hardware. (MSG_DEFINE)
- Initial stiffness matrices have to be sent from the sim and
hardware to the coordinator. (MSG_SET)
- The experiment enters the main loop. At each step:
- The coordinator computes the next set of
displacements/forces/motions
- The coordinator sends same to the server, which routes them to
the correct plugins/backends. (MSG_COMMAND)
- If one or more backends report errors, this is reported to the
coordinator via a high-priority message tagged with the corresponding
sequence ID. (MSG_ERROR/Critical)
- The backends do the actual move or other action, real or
simulated.
- Reaction forces from the backends are reported to the
coordinator. Since each plugin runs in a separate thread, these arrive
as unordered messages. (MSG_REACTION)
- Shutdown phase - the coordinator indicates completion, which is
forwarded to each plugin. Other messages may be exchanged at this point
(final stiffness matrices, for example) as well. (MSG_SET/GET)
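One step of the main loop described above might look like the sketch below, where each backend runs in its own thread and reactions are collected in whatever order they finish, matched back to the command by sequence ID. The backends here are stand-in linear springs, and run_step and spring are names invented for this example; no networking is shown.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_step(sequence_id, commands, backends):
    """Send one command per backend and gather reactions as they arrive."""
    reactions = {}
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        # One task per backend, standing in for one MSG_COMMAND each.
        futures = {
            pool.submit(backends[name], sequence_id, commands[name]): name
            for name in backends
        }
        # Completion order is not defined, as with MSG_REACTION messages.
        for future in as_completed(futures):
            name = futures[future]
            seq, force = future.result()
            if seq != sequence_id:
                raise RuntimeError("reaction for unexpected sequence ID")
            reactions[name] = force
    return reactions


def spring(stiffness):
    """Stand-in backend: a linear spring returning force = k * displacement."""
    def move(sequence_id, displacement):
        return sequence_id, stiffness * displacement
    return move


backends = {"foo": spring(2.0), "bar": spring(4.0)}
reactions = run_step(1, {"foo": 0.5, "bar": 0.25}, backends)
```

The coordinator only proceeds to compute the next step once every backend has reported, which is why the reactions are keyed by backend name rather than by arrival order.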
Conclusions, feedback and path forward
This is an initial posting; I am quite certain that I've made errors and
omissions. Please send feedback and corrections to the hybrid sim mailing
list so that I can fix them.