# Characterization and Mitigation of the MGT-Based Aurora Protocol in a Radiation Environment<sup>\*</sup>

Alex Harding, Kevin Ellsworth, Brent Nelson, and Michael Wirthlin

NSF Center for High-Performance Reconfigurable Computing (CHREC)

Department of Electrical and Computer Engineering

Brigham Young University, Provo, UT 84602

zaren171@gmail.com, k.ellsworth.m@gmail.com, nelson@ee.byu.edu, wirthlin@ee.byu.edu

August 6, 2013

## Abstract

The radiation test results of the Aurora protocol operating on an FPGA with Multi-Gigabit Transceivers are reported. An FPGA mitigation circuit was also developed and tested to repair SEU-induced faults seen in radiation testing.

## 1 Introduction

Field programmable gate arrays (FPGAs) provide a number of benefits for space-based electronic systems due to their flexibility, reprogrammability, and low development cost. In addition, modern FPGAs provide a large number of high-speed serial links to facilitate the high-bandwidth connectivity required by many space-based applications. The availability of a number of Multi-Gigabit Transceivers (MGTs) on the space-grade Xilinx V5QV FPGA provides both radiation tolerance and high speed communications links necessary to meet the demands of many current and future spacecraft applications.

The Xilinx Virtex-5QV FPGA is a reprogrammable FPGA based on the Virtex-5 FPGA family that incorporates radiation hardening by design (RHBD) [1]. The RHBD techniques provide hardness against single-event upsets (SEUs), immunity to single-event latchup (SEL), and a high total ionizing dose tolerance. This FPGA uses RHBD latches to protect the internal configuration memory, user flip-flops, and the configuration logic. The device thus provides the user with both radiation hardness and in-field reconfigurability.

Although this device provides SEU protection for the configuration latches and user flip-flops, some of the fixed-function blocks are not protected with radiation hardening by design. In particular, the MGTs are an important fixed-function block within the device, but they are susceptible to SEUs in a space environment. Preliminary research on MGT failure modes indicates that most failures are completely recoverable with proper stimulus, but less is known about the interaction of MGTs with a protocol layer in a space environment [2, 3]. The goal of this work is to investigate the failure modes and failure rates of an MGT when used with the Xilinx Aurora protocol layer. In addition, we seek to evaluate some low-cost recovery methods for failures that are not self-recovering. The test results show that more than 98% of the observed events either require no recovery or are recovered automatically by the Aurora protocol layer.

The Aurora protocol is implemented as an FPGA circuit core that provides a high-speed serial protocol with support for multi-lane bonding and clock synchronization [4]. The use of Aurora simplifies the development of complex multi-gigabit systems on FPGAs. Aurora uses a special bit encoding to provide control characters and to ensure sufficient signal transitions for clock recovery. It also defines a simple, customizable frame structure that can be adapted to application-specific needs. The Aurora IP core implements all of this functionality to simplify the use of an MGT for point-to-point communication.

Previous work has estimated MGT failure rates. Earlier tests focused on MGTs in Xilinx Virtex-II Pro FPGAs [5]. More recent tests have focused on Virtex-5 FPGAs. Monreal et al. demonstrated that the failure rate for Virtex-5 MGTs was small and that nearly all failures were recoverable with proper stimulus [2]. Morgan et al. performed testing on a commercial (non-radiation-hardened) Virtex-5 FPGA that incorporates an Aurora protocol circuit. That work provides insights into the behavior of an MGT and protocol architecture [6].

<sup>\*</sup>This work was supported by the I/UCRC Program of the National Science Foundation under Grant No. 0801876.

Figure 1: Test Architecture

Our work builds upon these results by testing the combination of an MGT and Aurora protocol block on a radiation-hardened Virtex-5 FPGA. Previous tests used a physical mask to limit the beam exposure to only the section of the FPGA containing the MGT circuits. In this work, the entire radiation-hardened FPGA running the Aurora protocol in conjunction with MGTs was exposed. The goal was to provide a test environment closer to an end-user environment in order to better evaluate overall system reliability.

## 2 Characterization Test Architecture and Results

A test architecture was created with the goal of identifying all faults of the Aurora protocol and determining which recovery mechanisms restored the system. The basic test architecture is shown in Figure 1, where an Aurora protocol IP block is attached to each lane in an MGT tile. The Aurora protocol simplifies high-speed data I/O by taking data from the user circuitry and transmitting it through the MGT blocks. Each lane also has an external frame generator and checker block, which encapsulates data packets with a frame number and a CRC. In this test, error signals are monitored at three levels: the MGT tile level, the Aurora protocol level, and the data/frame level. A cycle-accurate time stamp is recorded for each signal to capture details on error events and durations.
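The data/frame-level check described above can be illustrated with a small software model. This is only a sketch: the paper does not give the frame layout or CRC polynomial, so the 32-bit frame-number header, the 32-bit trailer, and the use of `zlib.crc32` are assumptions, and `build_frame` and `check_frame` are hypothetical names.

```python
import struct
import zlib

# Assumed layout (not from the paper): 32-bit frame number, payload, 32-bit CRC.
HEADER = struct.Struct(">I")
TRAILER = struct.Struct(">I")


def build_frame(frame_number: int, payload: bytes) -> bytes:
    """Prefix the payload with a frame number and append a CRC over both."""
    body = HEADER.pack(frame_number & 0xFFFFFFFF) + payload
    return body + TRAILER.pack(zlib.crc32(body))


def check_frame(frame: bytes, expected_number: int) -> list:
    """Return the data/frame-level errors detected in one received frame."""
    if len(frame) < HEADER.size + TRAILER.size:
        return ["truncated"]
    errors = []
    body = frame[:-TRAILER.size]
    (crc,) = TRAILER.unpack(frame[-TRAILER.size:])
    if zlib.crc32(body) != crc:
        errors.append("crc_mismatch")          # payload or header corrupted
    (number,) = HEADER.unpack(body[:HEADER.size])
    if number != (expected_number & 0xFFFFFFFF):
        errors.append("frame_number_mismatch")  # frame lost or duplicated
    return errors
```

A checker built this way can tell corrupted data (CRC mismatch) apart from dropped or repeated frames (frame-number mismatch), which is the kind of distinction the data/frame-level monitoring relies on.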

A number of stimulus signals at the various levels were also used to recover the system from events when necessary. These were used primarily to reset the Aurora protocol, to reset the MGT tiles, and to reset the PLL contained in the Aurora block. These stimuli were provided through an external FPGA that monitored the functionality of the Aurora design. This functional monitoring (funcmon) FPGA logged error signals, provided recovery stimulus, and provided a communication link to a remote PC for user interaction and system control.

A configuration monitoring (cfgmon) FPGA monitored the configuration logic of the device under test (DUT), that is, the FPGA in the radiation beam. A communication link between the cfgmon and funcmon FPGAs allowed selective scrubbing of the DUT configuration logic. The cfgmon was also user-controllable through a parallel connection with a PC.

The test employed two Xilinx Virtex-5 FPGAs. One FPGA (the DUT) was in the radiation beam while the other (the service FPGA) was not. The DUT was an XQR5VFX130 radiation-hardened FPGA, while the service FPGA was a commercial FX130T. The first half of our radiation testing took place in July 2011 at Texas A&M University's Cyclotron. Testing was done at four linear energy transfer (LET) values: 22.9, 46.1, 10.2, and 3.1 MeV-cm<sup>2</sup>/mg.

The vast majority of events recorded in the test either needed no recovery or were detected and recovered automatically by the Aurora protocol block. These are the events in Table 1 labelled *Self*, to denote that the system recovered from them on its own and was able to continue to operate. These events include data corruption (as evidenced by bit errors in the received data), hard errors (buffer overflow or underflow), and soft errors (received data upset in buffers). A small percentage of SEU-induced events, about 1.65% of them, brought the communications link down indefinitely and thus required that additional recovery steps be initiated. These are the events denoted as *External* in Table 1 and are categorized by the manual recovery step that was required to repair the link. For example, in 73.9% of these cases the link was recovered by simply resetting the DUT Aurora core. If that did not repair the link, the service Aurora core was reset, which recovered the system 5.7% of the time, and so on. The distribution of successful recovery steps listed in Table 1 suggests that an external recovery circuit could be used to mitigate this 1.65% of events.

| Type     | Recovery         | Count | % Total | % External |
|----------|------------------|-------|---------|------------|
| Self     | Data Corruption  | 26450 | 60.34%  | -          |
| Self     | Aurora Recovered | 16661 | 38.01%  | -          |
| External | DUT Aurora Reset | 533   | 1.22%   | 73.9%      |
| External | SRV Aurora Reset | 41    | 0.09%   | 5.7%       |
| External | DUT CDR Reset    | 17    | 0.04%   | 2.4%       |
| External | SRV CDR Reset    | 14    | 0.03%   | 1.9%       |
| External | DUT GTX Reset    | 44    | 0.10%   | 6.1%       |
| External | SRV GTX Reset    | 13    | 0.03%   | 1.8%       |
| External | DRP Scrub        | 14    | 0.03%   | 1.9%       |
| External | Scrub            | 43    | 0.10%   | 6.0%       |
| External | GLUT Scrub       | 2     | 0.00%   | 0.3%       |

Table 1: Events Categorized by Recovery Method.
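The percentages quoted from Table 1 can be re-derived from the raw event counts. The short script below is a worked check rather than part of the test setup. The counts are copied from Table 1, and the cumulative column is an added illustration of how far each successive recovery step gets.

```python
# Event counts copied from Table 1.
SELF = {"Data Corruption": 26450, "Aurora Recovered": 16661}
EXTERNAL = {
    "DUT Aurora Reset": 533, "SRV Aurora Reset": 41,
    "DUT CDR Reset": 17, "SRV CDR Reset": 14,
    "DUT GTX Reset": 44, "SRV GTX Reset": 13,
    "DRP Scrub": 14, "Scrub": 43, "GLUT Scrub": 2,
}

total = sum(SELF.values()) + sum(EXTERNAL.values())
external_total = sum(EXTERNAL.values())

# Share of all events that needed some external recovery step.
external_share = 100 * external_total / total
print(f"external recovery needed: {external_share:.2f}% of {total} events")

# Share of external events fixed by each step, in the order the steps were tried.
cumulative = 0
for step, count in EXTERNAL.items():
    cumulative += count
    print(f"{step:18s} {100 * count / external_total:5.1f}%  "
          f"cumulative {100 * cumulative / external_total:5.1f}%")
```

The external share works out to roughly 1.65% of all events, and the first step alone (resetting the DUT Aurora core) accounts for about 74% of the external recoveries.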

One of the metrics we measured in this test was the duration of faults. Fault duration affects overall system availability, and duration coupled with error frequency determines the bit error rate (BER) of the transmission channel. The BER is the fraction of received bits that were in error (as detected by an error correction code circuit). The typical error duration was 3.34 µs or less, as shown in Figure 2. The BER of the system due to radiation effects was calculated to be 1.31E-14.

## 3 System Mitigation Approach and Results Comparison

Based on our initial testing, we found that the Aurora protocol provides good error detection and recovery and can be used in some space-based designs. The Aurora core, however, is not able to recover the system from all possible upsets, so an external detection and recovery circuit is needed to provide correction stimulus for the 1.65% of events from which the system cannot recover on its own.

This detection and recovery circuitry is a very simple hardware-based finite state machine that monitors the status signals from the MGT tile and the Aurora core. When this circuit detects an anomalous situation, it waits to see whether the Aurora core will self-recover and, if not, resets the system to reestablish the serial link. In the initial characterization testing, most events recovered by the Aurora protocol were recovered within 5 µs. The recovery circuit therefore needs to wait only a very short time before attempting to correct the Aurora system when errors persist.
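The wait-then-reset behavior described above can be sketched as a cycle-stepped software model. This is a minimal illustration under stated assumptions, not the actual hardware design: the state names, the single `link_ok` input (standing in for the combined MGT tile and Aurora status signals), and the `wait_cycles` and `reset_cycles` parameters are all hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    LINK_UP = auto()
    WAIT_SELF_RECOVERY = auto()
    RESETTING = auto()


class RecoveryFsm:
    """Cycle-stepped model: call step() once per clock with the link status."""

    def __init__(self, wait_cycles: int, reset_cycles: int):
        self.wait_cycles = wait_cycles    # grace period for Aurora self-recovery
        self.reset_cycles = reset_cycles  # how long the reset output stays asserted
        self.state = State.LINK_UP
        self.counter = 0

    def step(self, link_ok: bool) -> bool:
        """Advance one clock; return True while the reset output is asserted."""
        if self.state is State.LINK_UP:
            if not link_ok:
                self.state, self.counter = State.WAIT_SELF_RECOVERY, 0
        elif self.state is State.WAIT_SELF_RECOVERY:
            if link_ok:
                self.state = State.LINK_UP  # Aurora recovered on its own
            else:
                self.counter += 1
                if self.counter >= self.wait_cycles:
                    self.state, self.counter = State.RESETTING, 0
        elif self.state is State.RESETTING:
            self.counter += 1
            if self.counter >= self.reset_cycles:
                # Release reset, then give the link a fresh grace period.
                self.state, self.counter = State.WAIT_SELF_RECOVERY, 0
        return self.state is State.RESETTING
```

In hardware, `wait_cycles` would be chosen so that the grace period covers the self-recovery time observed in characterization, and the reset output would drive the reset inputs of the MGT tile and Aurora core.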

One approach to resetting the system would be to use the event frequencies of Table 1 and assert reset signals in succession, monitoring the system after each one until it is fully recovered. The alternative (which was chosen) was the more "brute force" approach of sending a combination of resets at once (tile/GTX and Aurora resets). Analysis based on the recovery times of Table 2 shows that this allows the system to recover from the largest number of errors in the shortest time possible. Since higher-level resets cause equivalent time penalties to the system, the highest-level resets can be used to fix the largest number of potential system issues.

Figure 2: Error Durations.

| Type     | Recovery         | Recovery Time (µs) |
|----------|------------------|--------------------|
| Self     | Data Corruption  | 5                  |
| Self     | Aurora Recovered | 150                |
| External | Aurora Reset     | 150                |
| External | CDR Reset        | 150                |
| External | GTX Reset        | 150                |
| External | DRP Scrub        | 100                |
| External | Scrub            | $10^{6}$           |
| External | GLUT Scrub       | $10^{6}$           |

Table 2: Duration of Recovery Mechanisms.
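The trade-off between the two reset strategies can be made concrete with a rough calculation from Tables 1 and 2. The sketch below rests on simplifying assumptions that are not taken from the paper. It considers only the events fixed by an Aurora, CDR, or GTX reset, charges the sequential strategy one 150 µs reset per step tried (ignoring any monitoring time between steps), and assumes the combined reset fixes every such event for a single 150 µs penalty.

```python
# Counts from Table 1 for events fixed by an Aurora, CDR, or GTX reset,
# in the order the manual recovery steps were tried.
RESET_STEPS = [
    ("DUT Aurora Reset", 533), ("SRV Aurora Reset", 41),
    ("DUT CDR Reset", 17), ("SRV CDR Reset", 14),
    ("DUT GTX Reset", 44), ("SRV GTX Reset", 13),
]
RESET_TIME_US = 150  # Table 2: each of these resets costs about 150 us

events = sum(count for _, count in RESET_STEPS)

# Sequential escalation: an event fixed by the k-th step pays for k resets.
sequential_total = sum(
    count * k * RESET_TIME_US
    for k, (_, count) in enumerate(RESET_STEPS, start=1)
)
sequential_mean = sequential_total / events
sequential_worst = len(RESET_STEPS) * RESET_TIME_US

# Combined reset (assumption): all resets asserted together, one penalty.
combined_mean = RESET_TIME_US

print(f"sequential: mean {sequential_mean:.0f} us, worst case {sequential_worst} us")
print(f"combined:   {combined_mean} us for every event")
```

Under these assumptions the combined reset is never slower than sequential escalation, because each individual reset already costs the same 150 µs, which is the argument the text makes for issuing the highest-level resets immediately.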

A second phase of testing, incorporating this detection and recovery circuitry, was conducted in August 2012, also at Texas A&M University's Cyclotron. Testing was performed at two LET values: 15 and 36.5 MeV-cm<sup>2</sup>/mg. The detection and recovery circuit performed well in the radiation tests, recovering from the vast majority of the *External* events from Table 1. These new recovery numbers are summarized in Table 3. There is a noticeable increase in errors corrected by external recovery in the mitigated system. This is because the recovery circuit waits only a very short time before issuing high-level resets, whereas in the unmitigated system testing we waited a fairly long time to ensure that high-level resets were needed before applying them. The table shows that unrecoverable errors are at 0.13%, an order of magnitude lower than the 1.65% of the unmitigated circuit.

The expected unrecoverable failure rate of this mitigated system is estimated at 5.9E-7 failures per lane per day in a geosynchronous orbit. This is an improvement of more than an order of magnitude (roughly a factor of 20) over the unmitigated system's 1.15E-5 failures per lane per day. This low expected failure rate is slightly higher than, and of the same order as, the documented single-event functional interrupt (SEFI) rate of the V5QV FPGA, which is 2.76E-7 events per device per day [1]. Since the expected failure rate of this mitigated Aurora link is similar to the FPGA SEFI rate, there is limited benefit in further improving the reliability of this high-speed serial link.

| Recovery               | Count | % Total |
|------------------------|-------|---------|
| Data Corruption        | 21126 | 49.77%  |
| Aurora Recovered       | 13992 | 32.96%  |
| Correction Circuit     | 8330  | 19.62%  |
| External/Unrecoverable | 56    | 0.13%   |

Table 3: Mitigated Events by Recovery Method.
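The relationship between the quoted rates can be checked with a few lines of arithmetic. The three rates below are taken from the text. Converting a rate into a mean time between events assumes a constant event rate, which is a simplification added here for illustration.

```python
MITIGATED = 5.9e-7      # failures per lane per day (mitigated link, from the text)
UNMITIGATED = 1.15e-5   # failures per lane per day (unmitigated link, from the text)
SEFI = 2.76e-7          # events per device per day (V5QV SEFI rate, from the text)
DAYS_PER_YEAR = 365.25

# How much the recovery circuit improves the unrecoverable failure rate.
improvement = UNMITIGATED / MITIGATED

# How the mitigated link compares with the device-level SEFI rate.
ratio_to_sefi = MITIGATED / SEFI

# Mean time between events, assuming a constant rate (illustrative only).
years_between_link_failures = 1 / (MITIGATED * DAYS_PER_YEAR)
years_between_sefis = 1 / (SEFI * DAYS_PER_YEAR)

print(f"improvement over unmitigated link: {improvement:.1f}x")
print(f"mitigated link rate / SEFI rate:   {ratio_to_sefi:.1f}x")
print(f"mean years between link failures:  {years_between_link_failures:.0f}")
print(f"mean years between SEFIs:          {years_between_sefis:.0f}")
```

With the link's residual failure rate within a small factor of the device SEFI rate, further effort on the link alone would not change overall system reliability by much, which is the conclusion drawn in the text.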

## 4 Conclusion

The results of testing MGTs with a protocol in a radiation environment suggest that existing protocols, such as the Xilinx Aurora protocol, provide a high level of reliability to MGTs by allowing them to recover from single-event upsets. Only 1.65% of the test events required additional recovery stimulus that the protocol did not provide, and the vast majority of these events were easily resolved by adding a very simple automated recovery mechanism to the existing protocol.

## References

- [1] Radiation-Hardened, Space-Grade Virtex-5QV Family Overview. Technical report, Xilinx, Inc., March 2012. DS192 (v1.3).
- [2] R. Monreal, G. Swift, C. Khuc, C. Carmichael, C. Tseng, S. A. Anderson, M. Coe, and J. Price. Investigation of the Single Event Effects and Subsequent Recovery Mechanism Induced By Multi Giga-Bit Transceivers (MGT). In *NSREC*, April 2010.
- [3] R. Monreal, C. Carmichael, and G. Swift. Single-Event Characterization of Multi-Gigabit Transceivers (MGT) in Space-Grade Virtex-5QV Field Programmable Gate Arrays (FPGA). In *Radiation Effects Data Workshop (REDW), 2011 IEEE*, pages 1–8, July 2011.
- [4] Aurora Protocol Specification. Technical report, Xilinx, Inc., 2007. SP002 (v2.0).
- [5] R. Monreal and G. Swift. Initial Heavy Ion Single Event Effect (SEE) Testing of the Xilinx Virtex-II Pro Multi-Gigabit Transceivers (MGT). In *Proceedings of MAPLD*, 2006.
- [6] K. Morgan, M. Caffrey, M. Dunham, P. Graham, H. Quinn, C. Carmichael, T. Duong, A. Lesea, G. Miller, G. Swift, C. W. Tseng, Y. Wu, R. Monreal, and G. Allen. Upset-Induced Failure Signatures, Recovery Methods, and Mitigation Techniques in a High-Speed Serial Data Link For Space Applications. In *NSREC*, April 2008.
- [7] K. Ellsworth, T. Haroldsen, B. Nelson, and M. Wirthlin. Dual Channel Architecture for Reliable FPGA High Speed Serial Links. In *Aerospace Conference, 2011 IEEE*, pages 1–7, March 2011.