Hybrid simulation architecture for NEES: an updated proposal

Paul Hubbard
May 11, 2006

Introduction

Since before the first workshop, I've been working on hybrid simulation for NEES. While I am familiar with NTCP, I had a great deal to learn about the state of the art in other systems and research efforts. My time since the workshop has been devoted mainly to exploring middleware (databases, Globus Toolkit 3/4) and learning about current research. We are now at a point where I can describe the initial architecture, explain some of the reasoning behind it, and ask for a critique from the community.

Names

Based on the informal poll at the Denver meeting, this software is now dubbed NHCP, for NEES hybrid communications protocol/program.

Revision history

  1. Initial posting May 25 2006
  2. Revised August 10 2006: Updates from Denver meeting, split protocol into a separate page

Goals, limitations and technology choices

As we dive into the design, it's worthwhile to talk about what we're trying to solve and what we can't address. There are several types of hybrid testing, so let's start with a definition:
A hybrid test is one where part of the structure under test is simulated numerically on a computer. This enables experiments of scale or type that were previously not possible.
One can also separate hybrid tests into two types based on speed. Fast hybrid simulation, used for rate-dependent effects and the like, attempts to keep the actuators in motion. (There are other cases where you need this sort of speed; I am simplifying a bit.) Slow tests do not attempt continuous motion and are, in general, slower than realtime. Fast hybrid tests are most often local within a laboratory, where one can guarantee transmission times and thus deterministic update rates.

Recent research (Mosqueda et al., among others) has demonstrated that a local DSP can compensate for packet delays, interpolating and extrapolating for up to 1.2 seconds. This raises the possibility that fast distributed tests are feasible. Indeed, work such as (Mosqueda et al.) indicates the level of interest and effort.
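To make the delay-compensation idea concrete, here is a minimal sketch, assuming a simple second-order extrapolation (the function names and the polynomial order are my own illustration, not Mosqueda's actual algorithm):

    // Sketch of delay compensation by extrapolation. Given the last
    // three (time, displacement) command samples, predict the command
    // at a slightly later time so the actuator can keep moving while
    // the next network update is in flight. A real DSP implementation
    // would also filter and limit the prediction.
    #include <cstdio>

    struct Sample { double t; double d; };  // time (s), displacement

    // Second-order Lagrange extrapolation through three samples.
    double extrapolate(const Sample s[3], double t) {
        double l0 = ((t - s[1].t) * (t - s[2].t)) /
                    ((s[0].t - s[1].t) * (s[0].t - s[2].t));
        double l1 = ((t - s[0].t) * (t - s[2].t)) /
                    ((s[1].t - s[0].t) * (s[1].t - s[2].t));
        double l2 = ((t - s[0].t) * (t - s[1].t)) /
                    ((s[2].t - s[0].t) * (s[2].t - s[1].t));
        return s[0].d * l0 + s[1].d * l1 + s[2].d * l2;
    }

    int main() {
        Sample history[3] = { {0.00, 0.0}, {0.01, 0.8}, {0.02, 1.5} };
        // Next update is late; predict 5 ms past the last sample.
        printf("predicted d(0.025) = %f\n", extrapolate(history, 0.025));
        return 0;
    }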

NTCP is designed for slow tests over unreliable networks spanning one or more laboratories. Indeed, it has been used successfully in several such tests, and researchers continue to use it. We would like to rework NTCP so that it can be used for fast distributed tests. In particular, one proposed experiment we're using as a goal is 75 Hz over a 500 km link.
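A quick back-of-the-envelope check of that goal (my own numbers, assuming signal propagation at roughly two-thirds the speed of light in fiber):

    step period at 75 Hz:        1/75 s                ~= 13.3 ms
    one-way latency, 500 km:     500 km / 200,000 km/s ~= 2.5 ms
    round trip:                                        ~= 5.0 ms
    left per step for compute/control/protocol:        ~= 8.3 ms

Tight, but not obviously impossible, which is why delay compensation of the kind described above matters.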

We are not going to try to address fast hybrid tests that run in hard realtime. The canonical example is SCRAMnet with xPC Target, which runs at 1024 Hz. At this point in time, writing code that will run both on such a system and on a distributed system is not feasible.

We will focus on distributed tests, both slow and 'fast', attempting to run as fast as possible.

One area where NEESit can contribute is streaming data. We have an existing system based on the Creare Data Turbine that works well for data, video and audio. We plan on implementing an interface to the turbine for streaming commands, responses, events and other experimental data for observers and participants.
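As a sketch of what might flow over such an interface (the fields below are hypothetical; the actual channel layout is still to be designed against the Data Turbine's binary interface), each command, response or event could be flattened into a timestamped record:

    // Hypothetical record for streaming commands/responses to the
    // turbine. Field names and layout are illustrative only; the real
    // structure will depend on the Data Turbine binary interface.
    #include <stdint.h>

    struct StreamRecord {
        double   timestamp;   // experiment clock, seconds
        uint32_t source_id;   // which server/plugin produced this
        uint32_t kind;        // 0 = command, 1 = response, 2 = event
        double   values[6];   // e.g. displacements or forces per DOF
    };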

Security is another area where we can contribute. Our framework will have authentication, authorization, and hooks or tools to administer permissions. GridAuth, Kerberos, and JAAS are all intriguing here. Based on initial results and feedback from the Denver meeting, we are using OpenSSL for TCP/IP, both plaintext and encrypted. OpenSSL allows us to leverage the considerable investments in code, infrastructure and hardware acceleration that others have made, with confidence that the underlying code is reliable and well-tested.
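To give a flavor of what this looks like in practice, here is a minimal sketch of wrapping an already-connected TCP socket in TLS with OpenSSL (error handling and certificate verification are trimmed; a real implementation must verify the peer certificate):

    // Minimal OpenSSL client sketch: wrap an existing TCP socket in
    // TLS. 'sock' is assumed to be a connected TCP socket.
    #include <cstdio>
    #include <openssl/ssl.h>
    #include <openssl/err.h>

    SSL* open_tls(int sock) {
        SSL_library_init();
        SSL_load_error_strings();
        SSL_CTX* ctx = SSL_CTX_new(SSLv23_client_method());
        SSL* ssl = SSL_new(ctx);
        SSL_set_fd(ssl, sock);        // attach the TCP socket
        if (SSL_connect(ssl) != 1) {  // TLS handshake
            ERR_print_errors_fp(stderr);
            return NULL;
        }
        return ssl;  // use SSL_read()/SSL_write() from here on
    }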

Abstractions and paradigms

When trying to create a design applicable to multiple sites, the main method is to abstract away the differences and implement the commonalities. From the initial workshop and subsequent research, one key finding is that different systems have only a limited number of things in common. For example, the NTCP concept of a 'control point' doesn't exist elsewhere, the ISEE software has a critical/open data dichotomy, and OpenSees has constructs like ExperimentalElement and Actor/Shadow. Given these differences, the common ground is that all systems to date have a structure something like this:

Mosqueda's flow chart

(The diagram is from Mosqueda et al.)
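In rough code terms (my own paraphrase of that common structure, not code from any of these systems), the shared loop is something like:

    // Paraphrase of the loop common to hybrid simulation systems:
    // integrate one step numerically, impose the result on the
    // physical substructure, measure its reaction, feed it back.
    // All of the types here are illustrative stubs.
    #include <vector>

    typedef std::vector<double> Vec;  // displacements or forces per DOF

    struct Integrator { Vec step(const Vec& f) { return Vec(f.size(), 0.0); } };
    struct Controller { void impose(const Vec&) {} };
    struct Daq        { Vec read() { return Vec(1, 0.0); } };
    struct Stream     { void publish(const Vec&, const Vec&) {} };

    int main() {
        Integrator integrator; Controller controller; Daq daq; Stream stream;
        Vec measured(1, 0.0);
        bool done = false;
        while (!done) {
            Vec d = integrator.step(measured);  // numerical substructure
            controller.impose(d);               // command the actuators
            measured = daq.read();              // measure restoring forces
            stream.publish(d, measured);        // stream for observers
            done = true;                        // single step in this stub
        }
        return 0;
    }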

The various systems differ in how they split up the work. Here's a pretty common setup from a physical viewpoint:
System diagram
The functional view looks like this:
Functional base diagram

Nomenclature

Each system has different names for the various pieces. Here are the naming conventions and definitions used in this document:
  1. The coordinator holds the central concept of the experiment. It knows what is to be tested, the units, the geometry, and the username/password, and is generally where the experimenter monitors the test.
  2. The server runs a process that centralizes communications. It has plugins/code to handle different control systems, communications channels, logging, and hooks into authentication/authorization and streaming data. When the coordinator sends out a new set of commands, the server is responsible for directing messages to the correct destination, routing responses back to the coordinator, streaming both on a best-effort basis, and propagating errors.
  3. The simulation and hardware look alike to the coordinator; in other words, the coordinator's interface to both is the same. This allows the simulation to replace hardware, both for experiments and for algorithmic validation. The server will have dynamically loadable drivers/plugins to allow for different communications channels, protocols and interfaces (see the sketch after this list). I label non-simulated systems as generic 'hardware', even though there is no requirement for physicality.
(Note that it's very common to have more than one item for simulation and hardware.)
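As a sketch of what that driver/plugin abstraction might look like (the names are placeholders, not a committed API):

    // Hypothetical plugin interface for the server. Each back end
    // (hardware controller, simulation, NTCP bridge, ...) implements
    // this; the server only ever sees the abstract interface.
    struct Message { /* opcode, payload, etc. */ };

    class Driver {
    public:
        virtual ~Driver() {}
        virtual bool connect(const char* endpoint) = 0;  // open the channel
        virtual bool send(const Message& command) = 0;   // deliver a command
        virtual bool receive(Message& response) = 0;     // wait for a response
        virtual void disconnect() = 0;
    };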

Changes since the initial proposed design

Let's start with an updated functional diagram. Blocks in green represent TCP/IP server processes, where the program in question is responsible for accepting a TCP/IP connection:

Functional diagram with colors
(Light brown represents the corresponding TCP/IP client)

So how does this differ from the first proposed design? In this version, suggested by Greg Fenves, the server processes are simplified by removing the need to handle multiple back ends. In other words, they no longer need to
  1. Route messages to different plugins based on the contents
  2. Understand more than one set of messages
  3. Contain as much code.
Looking at a more detailed picture, let's add the data turbine and authorization/authentication bits:
Functional with more details

Data turbine interface

Now we can see how data is streamed to RDV. Note that each server will need to use the binary data interface for this, which is a separate piece of work. The turbine has interfaces for HTTP/DAV and Java, so our servers will need to use the binary interface that Creare has developed. This section is light on detail because we have not yet investigated it.

Security

There are several things to consider here:
  1. Message integrity - was the data corrupted, either accidentally or deliberately?
  2. Authenticity - are you talking to the process you expected, or to a man in the middle?
  3. Authorization - is user J allowed to control hardware X?
  4. Encryption - if you have proprietary (e.g. commercial) data, you might want to protect against interception.
We propose a tiered solution:
  1. OpenSSL for communications
    1. Plain socket mode (default, fastest, simple)
    2. Message digests, plaintext - this addresses #1 (see the sketch after this list)
    3. Encrypted with digests
  2. External username/password mechanism TBD. There are many possibilities here (Kerberos, PAM, JAAS, GridAuth, Globus, etc.).
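As an illustration of the digest tier, here is a sketch using OpenSSL's one-shot HMAC call (SHA-1 and the function name are my choices for the example; the actual algorithm is an open question). Note that a bare digest only catches accidental corruption; keying it with a shared secret also defends against deliberate tampering:

    // Keyed digest (HMAC) over a message buffer with OpenSSL. The MAC
    // travels alongside the plaintext message so the receiver can
    // detect corruption or tampering (consideration #1 above).
    #include <cstddef>
    #include <openssl/hmac.h>

    unsigned mac_message(const void* key, int key_len,
                         const unsigned char* msg, size_t msg_len,
                         unsigned char out[EVP_MAX_MD_SIZE]) {
        unsigned out_len = 0;
        HMAC(EVP_sha1(), key, key_len, msg, msg_len, out, &out_len);
        return out_len;  // number of MAC bytes written
    }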
The use of OpenSSL does complicate the development and compilation a bit, but there are many offsetting factors. It's a well-established standard, cross-platform, stable and can be accelerated with crypto cards if desired. If we employ server and client certificates, we can ensure secure communications with minimal overhead, both runtime and administrative.

Communications protocol in detail

I've split the protocol into a separate page, as this was growing too large.

Compatibility with existing systems

There are two primary issues in working with existing code: whether or not the protocol is sufficient, and the detailed issues of invoking our communications library from their code. One of the lessons learned from NTCP was that a too-rigid protocol invited peculiar workarounds, in particular the misuse of message types with undefined payloads. Overloading the SET/GET pair produced a more flexible protocol at the cost of compatibility and opacity.

Any system that works with NHCP will need to
  1. Have OpenSSL available. Since OpenSSL runs on nearly every OS with an IP stack, this is not a big deal.
  2. Be able to invoke C/C++ code.
We will commit to
  1. Writing a plugin for existing NTCP-controlled hardware that uses the ASCII protocol (PDF).
  2. Writing an OpenSEES object to invoke our code.
We are investigating the SIMCOR/MATLAB interface, which is complicated by MATLAB's single-threaded nature; this requires that the plugin be a bit smarter about buffering. With the addition of the RESEND and RECONNECT messages, it might be easier to implement, but this is not yet certain. A sketch of the buffering idea follows.
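This is hypothetical (the real plugin may look quite different), but the plugin could queue incoming responses so the single-threaded MATLAB side can poll for them whenever it regains control:

    // Hypothetical response buffer for a single-threaded caller such
    // as MATLAB: a network I/O thread fills the queue, and the MATLAB
    // side polls when it gets control. If a poll finds the expected
    // response missing, RESEND can be issued. Locking is omitted for
    // brevity; a real implementation needs a mutex around the queue.
    #include <deque>
    #include <string>

    class ResponseBuffer {
    public:
        void push(const std::string& response) {  // called by the I/O thread
            queue_.push_back(response);
        }
        bool poll(std::string& out) {  // called from MATLAB, non-blocking
            if (queue_.empty()) return false;
            out = queue_.front();
            queue_.pop_front();
            return true;
        }
    private:
        std::deque<std::string> queue_;
    };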