# Radiation Testing of FPGA-Based High-Speed Serial Communication

Kevin Ellsworth, Alex Harding, Colby Ballew, Travis Haroldsen, Michael Wirthlin, and Brent Nelson NSF Center for High-Performance Reconfigurable Computing (CHREC) Department of Electrical and Computer Engineering Brigham Young University, Provo, UT. 84602 Email: k.ellsworth.m@gmail.com, wirthlin@ee.byu.edu, nelson@ee.byu.edu

*Abstract*—FPGAs with high-speed serial transceivers provide an effective platform for space-based computing systems. This paper tests the reliability of the Aurora serial protocol operating on an FPGA using high-speed MGT links.

# I. INTRODUCTION

Many space-based applications require high-bandwidth point-to-point connectivity. Field Programmable Gate Arrays (FPGAs) provide an effective platform for space-based applications due to their flexibility, reprogrammability, and low development cost. The increased availability of high-speed transceivers on FPGAs is providing serial communication links capable of meeting the demands of many of these high-bandwidth applications.

Concerns arise, however, over the susceptibility of FPGAs to single event upsets (SEUs) in a space environment. Xilinx Inc. (San Jose, CA) recently introduced a radiation hardened FPGA, the Xilinx V5QV, to help mitigate radiation effects in FPGAs [1]. Sometimes referred to as a *Single-Event Immune Reconfigurable FPGA (SIRF)*, this new FPGA design utilizes a special hardware layout to provide redundancy at the cell layer, resulting in configuration logic that is more resistant to SEUs than previous FPGAs. SIRF FPGAs have greatly reduced the susceptibility of the reconfigurable portion of the FPGA; however, some portions of the FPGA are still not radiation hardened [2].

One component which is not radiation hardened on SIRF chips is the high-speed transceiver. These hard cores (referred to most commonly as Multi-Gigabit Transceivers or MGTs for Xilinx devices) form the basis of high-speed serial communication on FPGAs. They are generally used in conjunction with a protocol layer which is implemented in the radiation hardened, reconfigurable portion of the FPGA. Preliminary research has been performed on radiation related effects for the MGTs themselves, but less is known about how radiation can affect MGTs when used with a protocol layer [3].

This work utilizes a radiation hardened FPGA and a simple protocol from Xilinx named Aurora to 1) identify radiation failure modes unique to a Xilinx V5QV transceiver and protocol system, 2) estimate the failure rates of an Aurora protocol system in a space environment, and 3) identify recovery techniques that can be added to the Aurora protocol logic to make a more reliable system.

Across the three radiation tests performed, more than 98% of observed events either required no recovery or were recovered automatically by the Aurora protocol layer. However, the events that did require additional recovery could substantially affect the system and required some external action in order to resume normal operation. This work identifies these additional actions and shows that they can be built on top of the Aurora protocol logic with minimal effort.

# II. RELATED WORK

A great deal of research has been performed on the effects of radiation on FPGA logic generally [1], [2]. Less research has focused specifically on MGTs, although some significant investigations have been performed. Earlier radiation testing focused on the characterization of MGTs on Xilinx Virtex 2 Pro FPGAs [4], while more recent testing has been performed by Monreal on Virtex 5 MGTs [3].

Monreal's testing used radiation hardened FPGAs and shielding to expose only the MGTs to radiation in order to better isolate upsets. The results from this investigation suggest the robustness of the Virtex 5 MGTs and their ability to recover from upsets. Specifically, Monreal showed that the MGTs provide sufficient control signals (resets) to repair all MGT-level faults. The isolation of the MGTs in this testing allowed for more accurate characterization of the MGT components alone and provided a solid foundation for investigations into characterizing the MGTs as part of a larger system without shielding.

This work was supported in part by the I/UCRC Program of the National Science Foundation within the NSF Center for High-Performance Reconfigurable Computing (CHREC), Grant No. 0801876.

Morgan et al. performed some initial testing of Virtex 5 MGTs with the Aurora protocol using commercial (not radiation hardened) Virtex 5 FPGAs [5]. Their research suggests that additional logic is needed for the Aurora protocol to be used in space environments, and also indicates that more research is needed.

This work builds upon these earlier efforts to provide greater insight into Virtex 5 MGTs as part of a larger system. We utilize a radiation hardened Virtex 5 FPGA with the Aurora protocol, exposing the entire system to radiation. This architecture allows for characterizing radiation effects on the system as a whole and evaluating what additions may be necessary to form a more robust space-based system.

# III. AURORA

The Aurora protocol is a lightweight, link-layer protocol used to connect two MGT end points [6]. It provides a mechanism for the streaming or framing of data across a serial link. 8B/10B encoding is used on the transmitted data for proper clock recovery as well as basic error checking. No additional data checking is provided beyond the 8B/10B encoding checks, but the protocol does check for properly framed data packets. The protocol implements no error correction.

The Aurora protocol is implemented in the reconfigurable portion of the FPGA, or *soft logic*. This is the radiation hardened portion of the V5QV SIRF part, which greatly reduces the probability of upsets to this portion of the system. The protocol encapsulates one or more MGTs as shown in Figure 1. The MGTs are implemented as static blocks, or *hard logic*, which cannot be reconfigured. These static blocks are not radiation hardened.



Fig. 1: Aurora Protocol Implementation

# IV. TEST ARCHITECTURE

The basic test architecture used for radiation testing is shown in Figure 2. Two FPGAs are connected via RX/TX high-speed serial MGT links with a line rate of 3.125 Gb/s. One of the FPGAs (a Xilinx Virtex 5 XQR5VFX130 radiation hardened FPGA) is exposed to radiation during testing, while the second (a non-radiation-hardened Xilinx Virtex 5 FX130T FPGA) acts as a service FPGA to provide the other end of the RX and TX links. A separate Aurora protocol block is attached to each MGT in the architecture. There is also an external frame generation and check block associated with each lane, which encapsulates data packets with a frame number and a CRC. Error and status signals are monitored at three levels: the MGT level, the Aurora protocol level, and the data/frame level. More than 20 error and status signals are monitored for each MGT link. A cycle-accurate time stamp is used to record information on event durations and recovery times.
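As a rough illustration of what a frame generation and check block does, the following Python sketch wraps a payload with a frame number and a CRC and then classifies received frames. The frame layout (a 32-bit big-endian frame number followed by the payload and a CRC-32 trailer), the function names, and the result strings are assumptions made for this example; the paper does not specify the actual frame format or CRC polynomial used in the test design, which is implemented in FPGA logic rather than software.

```python
import struct
import zlib

# Hypothetical layout: 32-bit frame number, payload, 32-bit CRC trailer.
HEADER = struct.Struct(">I")
TRAILER = struct.Struct(">I")


def build_frame(frame_number: int, payload: bytes) -> bytes:
    """Prefix the payload with a frame number and append a CRC-32 over both."""
    body = HEADER.pack(frame_number & 0xFFFFFFFF) + payload
    return body + TRAILER.pack(zlib.crc32(body))


def check_frame(frame: bytes, expected_number: int) -> str:
    """Classify a received frame as ok, malformed, crc_error, or sequence_error."""
    if len(frame) < HEADER.size + TRAILER.size:
        return "malformed"
    body, trailer = frame[:-TRAILER.size], frame[-TRAILER.size:]
    (received_crc,) = TRAILER.unpack(trailer)
    if zlib.crc32(body) != received_crc:
        return "crc_error"  # data corruption somewhere in header or payload
    (number,) = HEADER.unpack(body[:HEADER.size])
    if number != expected_number:
        return "sequence_error"  # intact frame, but one was lost or repeated
    return "ok"


frame = build_frame(7, b"test pattern")
print(check_frame(frame, 7))
```

Checking the frame number separately from the CRC lets a checker distinguish corrupted data from dropped or duplicated frames, which is one plausible reason to carry both fields.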

There are also a number of stimulus signals at the various levels which are employed to recover from events when necessary. These consist primarily of resets to the Aurora protocol or the MGTs. The test architecture systematically applies these stimulus signals as recovery steps when attempting to recover the system following an upset event. The level of recovery effort needed to return the system to normal operation is then used to classify the severity of the upset. The hierarchical manner in which the test architecture applies the recovery steps is also specifically designed to aid in understanding which recovery steps are most useful in repairing the system.

The three main categories of recovery stimulus are 1) the Aurora logic reset, 2) MGT level resets, and 3) reconfiguration of the FPGA. The Aurora logic reset is used to reset the state of the Aurora protocol. The MGT level resets take more time to accomplish and are used to reset various transceiver components in the hard logic. Finally, the entire FPGA can be reconfigured if needed. This is rarely required, and it takes the entire design down for an extended period of time.
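The escalation described above can be sketched in Python as a simple recovery ladder: apply the least disruptive step first, check the link, and escalate only if the link is still down. The step names follow the three categories in the text, but the function, the `SimulatedLink` class, and the return values are hypothetical stand-ins for illustration; the actual test architecture is hardware logic whose details are not given here.

```python
from typing import Callable, List, Optional, Sequence

# Recovery steps ordered from least to most disruptive, as in the text.
RECOVERY_LADDER = ("aurora_logic_reset", "mgt_level_resets", "reconfigure_fpga")


def recover(link_ok: Callable[[], bool],
            apply_step: Callable[[str], None],
            ladder: Sequence[str] = RECOVERY_LADDER) -> Optional[str]:
    """Escalate through the ladder until the link is healthy.

    Returns "no_recovery_needed", the name of the step that restored the
    link, or None if every step failed.
    """
    if link_ok():
        return "no_recovery_needed"
    for step in ladder:
        apply_step(step)
        if link_ok():
            return step  # the severity class of this event
    return None


class SimulatedLink:
    """Toy stand-in for the hardware: healthy only after one specific step."""

    def __init__(self, fixed_by: Optional[str]) -> None:
        self.fixed_by = fixed_by
        self.healthy = fixed_by is None
        self.applied: List[str] = []

    def ok(self) -> bool:
        return self.healthy

    def apply(self, step: str) -> None:
        self.applied.append(step)
        if step == self.fixed_by:
            self.healthy = True


link = SimulatedLink(fixed_by="mgt_level_resets")
print(recover(link.ok, link.apply), link.applied)
```

Because the returned step name is the first one that worked, it doubles as the severity classification used to bin events in the results.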

# V. TESTING SUMMARY

Three radiation tests were performed with this test architecture. The first took place between March 22nd and March 26th, 2011, at Texas A&M University's Cyclotron. Testing was done at four energy levels: 22.9, 46.1, 10.2, and 3.1 MeV-cm<sup>2</sup>/mg. The second test was conducted July 7th through the 13th, 2011, also at Texas A&M University's Cyclotron. This was the most extensive test, using 6 different heavy ions and 8 energy levels over 59 runs, and resulted in over 43,000 observed events. Proton testing was also conducted on January 25th and 26th of 2012 at the facilities at UC Davis with energies of 9.7, 18, and 64 MeV-cm<sup>2</sup>/mg. All testing was accomplished in conjunction with, and with much help and support from, the Xilinx Radiation Test Consortium (XRTC).

Fig. 2: Test Architecture

# VI. TEST RESULTS

The vast majority of events observed in testing either needed no recovery or were recovered automatically by the Aurora protocol. Figure 3 shows the primary results from the July 2011 test, with all observed events classified by the recovery effort that was necessary to restore the system to normal operation. Data corruption events are those events which result in data corruption (such as bit errors) but have no other system impact. This type of event was by far the most common, with 26,450 events observed and an expected Mean Time Between Failure (MTBF) of 2.2 years in geosynchronous orbit, a common orbit for satellites. Such errors are typically detected and corrected using appropriate error coding methods, with a Cyclic Redundancy Check (CRC) being a common choice for detection.

The Aurora protocol detected and properly recovered from 38% of the events. A typical example of such an event would be a soft error in the MGT receive or transmit buffers. A part of the Aurora protocol monitors MGT-level signals which indicate such errors and resets the MGT tile as needed.

Fewer than 2% of observed events required additional recovery effort beyond the built-in Aurora error recovery logic. Without any additional logic, this 2% of the errors would cause the system to fail. Detecting and recovering from these errors was accomplished by external logic detecting that the system was in a fault condition and taking appropriate reset actions. In the most severe case the entire FPGA device had to be reconfigured.

Fig. 3: Classification of Observed Radiation Events

The distribution of these events is given in Table I. The Aurora Logic Reset was the most common recovery and accounted for 80% of the externally recovered events. Asserting MGT Level Resets accounted for 14% of these event recoveries. Finally, 6% of the time the FPGA had to be reconfigured to restore the system to a functioning state. The table also estimates the mean time between events for a geosynchronous orbit. Our work suggests that these external recovery mechanisms require very little additional logic on top of the Aurora protocol.

| Recovery Method    | Events | %    | MTBF (Years) |
|--------------------|--------|------|--------------|
| Aurora Logic Reset | 574    | 80%  | 240          |
| MGT Level Resets   | 102    | 14%  | 1864         |
| Reconfigure FPGA   | 45     | 6%   | 22831        |
| Total              | 721    | 100% | 211          |

TABLE I: Distribution of External Recovery Events. MTBF is reported for a geosynchronous orbit.
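The Total row of Table I can be cross-checked with a short calculation. The Python sketch below assumes the three event classes are independent and occur at constant rates, so that their rates (the reciprocals of the MTBFs) add; the paper does not state how its total was computed, but this assumption reproduces the 211-year figure and the rounded percentages. The variable and function names are illustrative only.

```python
# Event counts and per-class MTBF (years, geosynchronous orbit) from Table I.
table = {
    "Aurora Logic Reset": {"events": 574, "mtbf_years": 240},
    "MGT Level Resets": {"events": 102, "mtbf_years": 1864},
    "Reconfigure FPGA": {"events": 45, "mtbf_years": 22831},
}


def combined_mtbf(mtbfs):
    """MTBF of the union of independent event classes: rates (1/MTBF) add."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


total_events = sum(row["events"] for row in table.values())
shares = {name: round(100 * row["events"] / total_events)
          for name, row in table.items()}
total_mtbf = combined_mtbf(row["mtbf_years"] for row in table.values())

print(total_events)       # 721
print(shares)             # 80, 14, and 6 percent
print(round(total_mtbf))  # 211
```

The combined MTBF is dominated by the most frequent class, which is why the total (211 years) sits only slightly below the Aurora Logic Reset figure (240 years).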

# VII. CONCLUSION

Radiation testing with the described architecture demonstrates that FPGA transceiver systems are susceptible to a variety of radiation-induced upset events. However, 60% of these events result only in corruption of data and do not otherwise affect the system. The Xilinx Aurora protocol provides support for recovering the system from another 38% of events that do affect the system beyond data corruption. The remaining 2% of events can have a significant impact on the system and require additional recovery effort. Thus, the Aurora protocol block provides a good foundation for a space-based FPGA transceiver system, but some minimal additional logic is needed in order to make a truly robust system.

# REFERENCES

- [1] G. R. Allen, G. Madias, E. Miller, and G. Swift, "Recent Single Event Effects Results in Advanced Reconfigurable Field Programmable Gate Arrays," in *Radiation Effects Data Workshop (REDW)*, July 2011, pp. 1–6.
- [2] G. Swift, C. Carmichael, G. Allen, G. Madias, E. Miller, and R. Monreal, "Compendium of XRTC Radiation Results on All Single-Event Effects Observed in the Virtex-5QV," in *MAPLD Conference Proceedings*, 2011.
- [3] R. Monreal, G. Swift, C. Khuc, C. Carmichael, C. Tseng, S. A. Anderson, M. Coe, and J. Price, "Investigation of the Single Event Effects and Subsequent Recovery Mechanism Induced by Multi Giga-bit Transceivers (MGT)," in *NSREC*, Apr 2010.
- [4] R. Monreal and G. Swift, "Initial Heavy Ion Single Event Effect (SEE) Testing of the Xilinx Virtex-II Pro Multi-Gigabit Transceivers (MGT)," in *MAPLD Conference Proceedings*, 2006.
- [5] K. Morgan, M. Caffrey, M. Dunham, P. Graham, H. Quinn, C. Carmichael, T. Duong, A. Lesea, G. Miller, G. Swift, C. W. Tseng, Y. Wu, R. Monreal, and G. Allen, "Upset-Induced Failure Signatures, Recovery Methods, and Mitigation Techniques in a High-Speed Serial Data Link for Space Applications," in *NSREC*, 2008.
- [6] Xilinx Corporation, "LogiCORE IP Aurora 8B/10B v5.2 (UG353 v5.2)," July 2010.