Introduction
noun: resiliency
- the capacity to withstand or to recover quickly from difficulties; toughness.
- ability of an application to react to problems in one of its components and still provide the best possible service.
Applications, especially those that consume real-time feeds and process market data, have to be resilient in some manner. The amount of resiliency required depends on the end purpose of the data. For example, if the data feed is interrupted, an application that feeds front-end displays will have to switch over immediately to a backup data feed or a slower conflated feed, whereas a back-office number-crunching application might be content with retrying the connection after a while.
The net resiliency of an application depends on software best practices, compute hardware, horizontal scaling, geographic redundancy and the redundancy of the data source. In this article I will talk about the last part, i.e. the resiliency options available from market data vendors like Refinitiv.
Redundant software on redundant hardware makes a redundant connection to redundant servers.
Resiliency options
For a real-time market data application, the following options are available. Each of these presents a different level of service degradation, and an application can use the one that best fits its needs.
- Using multiple discrete services with a Quality of Service (QoS) advertisement
- Refinitiv managed resiliency in the Refinitiv Real Time Optimized (RTO) Cloud
- Use multiple distribution endpoints using ChannelSet option in RTSDK
- Implement a hot standby resiliency
- Use warm standby feature of the real time SDK, RTDS
Using multiple discrete services with a Quality of Service (QoS) advertisement
The market data distribution endpoint, ADS, advertises multiple concrete services (sources of data) which have been configured by the administrator. When subscribing to real-time data for an item, the application specifies the service name and the item name in the request message. Typically, these services are configured to provide different sources of data; e.g. there might be one service for delayed quotes, a different one for in-house published content, yet another one for FX rates, etc. For redundancy purposes, the infrastructure can be configured to provide the same data from a high-throughput primary service and a low-throughput, conflated backup service. Both of these services advertise this capability in their Quality of Service (QoS) information. An app can take advantage of this setup by requesting data from the high-QoS service and, upon interruption (i.e. the service becomes stale), switching to the low-QoS service.
The QoS information is available in the Source Directory response message, and can be requested by subscribing to directory information. See this EMA example of directory message subscription.
Here is the Directory message from the RTO cloud, showing the two services (ELEKTRON_DD and ERT_FD3_LF1) and their QoS (real-time timeliness, but a conflated rate):
<REFRESH domainType="SOURCE" streamId="2" containerType="MAP" flags="0x168 (HAS_MSG_KEY|SOLICITED|REFRESH_COMPLETE|CLEAR_CACHE)" groupId="0" State: Open Ok None - text:"" dataSize="878">
<key flags="0x08 (HAS_FILTER)" filter="63"/>
<dataBody>
<map flags="0x00" countHint="0" keyPrimitiveType="UINT" containerType="FILTER_LIST">
<mapEntry flags="0x00" action="ADD" key="257">
<filterList containerType="ELEMENT_LIST" countHint="0" flags="0x00">
<filterEntry id="1" action="SET" flags="0x00" containerType="ELEMENT_LIST">
<elementList flags="0x08 (HAS_STANDARD_DATA)">
<elementEntry name="Name" dataType="ASCII_STRING" data="ELEKTRON_DD"/>
<elementEntry name="SupportsQoSRange" dataType="UINT" data="0"/>
<elementEntry name="QoS" dataType="ARRAY">
<array itemLength="0" primitiveType="QOS">
<arrayEntry Qos: Realtime JustInTimeConflated Static - timeInfo: 0 - rateInfo: 0/>
</array>
</elementEntry>
<elementEntry name="DictionariesProvided" dataType="ARRAY">
<array itemLength="0" primitiveType="ASCII_STRING">
<arrayEntry data="RWFFld"/>
<arrayEntry data="RWFEnum"/>
</array>
...
</mapEntry>
<mapEntry flags="0x00" action="ADD" key="253">
<filterList containerType="ELEMENT_LIST" countHint="0" flags="0x00">
<filterEntry id="1" action="SET" flags="0x00" containerType="ELEMENT_LIST">
<elementList flags="0x08 (HAS_STANDARD_DATA)">
<elementEntry name="Name" dataType="ASCII_STRING" data="ERT_FD3_LF1"/>
<elementEntry name="SupportsQoSRange" dataType="UINT" data="0"/>
<elementEntry name="QoS" dataType="ARRAY">
<array itemLength="0" primitiveType="QOS">
<arrayEntry Qos: Realtime ConflatedByRateInfo Static - timeInfo: 0 - rateInfo: 3000/>
</array>
</elementEntry>
...
</mapEntry>
</map>
</dataBody>
</REFRESH>
Please refer to RDM usage guide and SDK samples to understand and use these Directory messages. The application logic will have to handle the failover from one service to another. This article on Service Resiliency describes the code in detail.
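The failover decision described in this section can be pictured with a small Python sketch. It is not SDK code: the `ServiceInfo` class, the `up` flag and the ranking in `RATE_RANK` are hypothetical names and assumptions made for illustration, and a real application would populate them from the Source Directory refresh and update messages.

```python
from dataclasses import dataclass


@dataclass
class ServiceInfo:
    """Hypothetical, simplified view of one Source Directory entry."""
    name: str
    timeliness: str   # e.g. "Realtime"
    rate: str         # e.g. "JustInTimeConflated", "ConflatedByRateInfo"
    rate_info: int    # rateInfo value reported in the QoS entry
    up: bool = True   # assumed to be maintained from directory updates


# Lower rank = preferred. This ordering is an assumption made for the sketch.
RATE_RANK = {"TickByTick": 0, "JustInTimeConflated": 1, "ConflatedByRateInfo": 2}


def pick_service(services):
    """Return the best available service, or None if every service is down."""
    candidates = [s for s in services if s.up]
    if not candidates:
        return None
    return min(candidates, key=lambda s: (RATE_RANK.get(s.rate, 99), s.rate_info))


# The two services shown in the Directory message above.
services = [
    ServiceInfo("ELEKTRON_DD", "Realtime", "JustInTimeConflated", 0),
    ServiceInfo("ERT_FD3_LF1", "Realtime", "ConflatedByRateInfo", 3000),
]

primary = pick_service(services)
print("subscribe using:", primary.name)

# Simulate the preferred service going stale, then choose again.
services[0].up = False
fallback = pick_service(services)
print("fail over to:", fallback.name)
```

After switching, the application would still need to re-issue its item requests against the newly selected service name.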
Refinitiv managed resiliency in the RTO cloud
For an application sourcing data from the RTO cloud, Refinitiv provides Standard Resiliency as one of the options. There are two Availability Zones (AZ) in each geographic region, backed by three DNS addresses. Two of these DNS addresses are each directly connected to a cluster of load-balanced ADS, while the third routes the connection to one of the two Availability Zones. In the event of an outage in one Availability Zone, any re-connection attempts will be routed to the unaffected AZ, thereby providing a level of redundancy. This setup is suitable for apps which can handle being disconnected and have adequate logic to reconnect and re-subscribe to the items of interest. The Real-Time SDK (EMA) provides this functionality and automatically initiates the re-connection to the same channel upon interruption.
An application can use Service Discovery to get a list of all the Regions and AZs. The Service Discovery request returns the DNS addresses which are permissioned for the user's credentials and capacity levels. Here is an example of the Service Discovery response for the US East region, showing that aws-3 is a front for both us-east-1a and us-east-1b:
{
"services": [{
"dataFormat": [
"tr_json2"
],
"endpoint": "us-east-1-aws-1-lrg.optimized-pricing-api.refinitiv.net",
"location": [
"us-east-1a"
],
"port": 443,
"provider": "aws",
"transport": "websocket"
},
{
"dataFormat": [
"tr_json2"
],
"endpoint": "us-east-1-aws-3-lrg.optimized-pricing-api.refinitiv.net",
"location": [
"us-east-1a",
"us-east-1b"
],
"port": 443,
"provider": "aws",
"transport": "websocket"
},
{
"dataFormat": [
"tr_json2"
],
"endpoint": "us-east-1-aws-2-lrg.optimized-pricing-api.refinitiv.net",
"location": [
"us-east-1b"
],
"port": 443,
"provider": "aws",
"transport": "websocket"
},
...
]
}
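As a rough illustration of how an application might use this response, the Python sketch below separates the endpoints that front a single AZ from those that span both. The field names come from the response above; the trimmed-down sample payload and the `split_endpoints` helper are hypothetical and exist only for this example.

```python
import json

# Trimmed copy of the Service Discovery response shown above.
response_text = """
{
  "services": [
    {"endpoint": "us-east-1-aws-1-lrg.optimized-pricing-api.refinitiv.net",
     "location": ["us-east-1a"], "port": 443, "transport": "websocket"},
    {"endpoint": "us-east-1-aws-3-lrg.optimized-pricing-api.refinitiv.net",
     "location": ["us-east-1a", "us-east-1b"], "port": 443, "transport": "websocket"},
    {"endpoint": "us-east-1-aws-2-lrg.optimized-pricing-api.refinitiv.net",
     "location": ["us-east-1b"], "port": 443, "transport": "websocket"}
  ]
}
"""


def split_endpoints(discovery):
    """Split endpoints into single-AZ and multi-AZ (host, port) lists."""
    single_az, multi_az = [], []
    for svc in discovery["services"]:
        target = multi_az if len(svc["location"]) > 1 else single_az
        target.append((svc["endpoint"], svc["port"]))
    return single_az, multi_az


single_az, multi_az = split_endpoints(json.loads(response_text))
print("single-AZ endpoints:", single_az)
print("multi-AZ endpoints:", multi_az)
```

An application that relies on the managed resiliency described here would connect to a multi-AZ endpoint, whereas one that opens two connections itself would pick two single-AZ endpoints in different zones.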
Use multiple distribution endpoints using ChannelSet option in RTSDK
The Refinitiv Real-Time SDK 3.0.3 introduced the option to specify multiple distribution endpoints in the configuration. Previously, an application would specify a single Channel in the configuration. A Channel represents a single network connection to a single ADS. If this ADS were to fail, the connection would terminate and the SDK would keep retrying the same ADS until successful. With the introduction of ChannelSet, an application specifies multiple Channels in the configuration, and if the connection to the first one is lost, the SDK takes care of connecting to the next one. It keeps trying all the channels in the list in a round-robin manner. Whenever a connection succeeds, the SDK also takes care of re-subscribing to all the items in the watch list.
Operationally, this option is quite similar to the one discussed previously, except that the failover and re-subscribe logic is built into the SDK. It should be noted that there is only one connection at any given time. The Warm Standby mechanism discussed later is quite similar to the ChannelSet mechanism and offers even better failover mitigation, at the expense of higher resource usage.
Here is an example showing how an EMA Consumer can specify ChannelSet in the EMAConfig.XML file:
<Consumer>
<Name value="Consumer_1"/>
<!-- Channel represents a single connection to single ADS. No failover available -->
<Channel value="Channel_1"/>
<Dictionary value="Dictionary_1"/>
<XmlTraceToStdout value="0"/>
</Consumer>
<Consumer>
<Name value="Consumer_2"/>
<!-- ChannelSet specifies an ordered list of Channels to which OmmConsumer will attempt to -->
<!-- connect, one at a time, if the previous one fails to connect -->
<ChannelSet value="Channel_1, Channel_2, Channel_3, Channel_4"/>
<Dictionary value="Dictionary_2"/>
<XmlTraceToStdout value="0"/>
</Consumer>
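The round-robin behaviour that ChannelSet provides can be pictured with the following Python sketch. It only simulates the idea and is not how the SDK is implemented: `connect_with_channel_set`, `fake_connect`, the `max_attempts` limit and the `reachable` set are all hypothetical names invented for this illustration.

```python
from itertools import cycle


def connect_with_channel_set(channels, try_connect, max_attempts):
    """Try each channel in order, wrapping around, until one connects.

    Returns the name of the channel that connected, or None if
    max_attempts is reached (or the list is empty) without success.
    """
    for attempt, channel in enumerate(cycle(channels), start=1):
        if try_connect(channel):
            return channel
        if attempt >= max_attempts:
            return None
    return None  # only reached when channels is empty


# Channel names taken from the Consumer_2 configuration above.
channels = ["Channel_1", "Channel_2", "Channel_3", "Channel_4"]
reachable = {"Channel_3"}   # pretend only this endpoint is currently up
tried = []


def fake_connect(channel):
    """Stand-in for a real connection attempt; records what was tried."""
    tried.append(channel)
    return channel in reachable


active = connect_with_channel_set(channels, fake_connect, max_attempts=8)
print("attempted:", tried)
print("connected to:", active)
```

In the SDK itself none of this logic needs to be written by the application; listing the channels in ChannelSet is enough.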
Implement a hot standby resiliency
Hot Standby is a redundancy pattern where an application makes two concurrent connections to the market data system. Each of these connections is to a different ADS, or to a different Availability Zone (AZ) if connecting to the RTO Cloud. The application also subscribes to the same watch list on the second connection. In the event of an outage in one AZ, the second connection would not be impacted and the application would keep on receiving data updates.
This redundancy mechanism offers the least disruption to the data flow, but comes at the expense of high network and resource usage. The application logic is also more complicated, since it has to keep track of every update received on one channel while discarding the duplicates received on the backup channel.
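A minimal Python sketch of the duplicate-discarding step is shown below. It rests on a strong assumption that is not stated in this article: that every update carries a per-item sequence number which is identical on both connections and always increases. The `HotStandbyMerger` class and the sample events are hypothetical.

```python
class HotStandbyMerger:
    """Forward each update once, whichever connection delivers it first.

    Assumes (for this sketch only) that updates carry a per-item,
    monotonically increasing sequence number shared by both connections.
    """

    def __init__(self):
        self._last_seq = {}  # item name -> highest sequence number forwarded

    def on_update(self, channel, item, seq, payload):
        """Return payload if the update is new for this item, else None."""
        if seq <= self._last_seq.get(item, -1):
            return None  # already delivered via the other connection
        self._last_seq[item] = seq
        return payload


merger = HotStandbyMerger()

# (channel, item, sequence number, price) as seen on connections "A" and "B".
events = [
    ("A", "EUR=", 1, 1.10),
    ("B", "EUR=", 1, 1.10),  # duplicate of the first update
    ("A", "EUR=", 2, 1.11),
    ("B", "EUR=", 2, 1.11),  # duplicate
    ("B", "EUR=", 3, 1.12),  # connection "A" has dropped; "B" keeps the flow going
]

delivered = []
for event in events:
    result = merger.on_update(*event)
    if result is not None:
        delivered.append(result)

print(delivered)
```

If the feed does not provide a usable sequence number, the application needs some other way to recognise duplicates, which is part of what makes this pattern harder to implement.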
Use warm standby feature of the real time SDK, RTDS
Warm Standby is a redundancy mechanism within RTDS which allows a consumer to fail over to a standby connection when the primary connection fails. Because the standby connection has already been established and the server is already aware of the items the application has subscribed to, the application doesn't need to re-subscribe to the items during a failover. Therefore, Warm Standby reduces the overall recovery time and also reduces network traffic by avoiding a burst of re-request messages. This process is transparent to applications built using RTSDK. In terms of resiliency, Warm Standby sits halfway between Hot Standby and the ChannelSet option described earlier. It is less resource intensive than Hot Standby and still provides a reasonable speed of recovery.
Two types of warm standby modes are supported: Login-based Warm Standby, and Service-based Warm Standby. These modes of operation are described in detail in this article on Warm Standby. The SDK also provides sample configuration settings for Warm Standby - the use of these configurations and how to run these samples is described in this article.
Warm Standby functionality is built into the RTSDK itself; hence an application built directly on the WebSocket API, without RTSDK, will have to implement the associated warm standby and failover logic itself. A sample Python app, which is built using WebSockets and connects to the RTO Cloud with Warm Standby, is provided as an example.