PCIe is one of the most latency-sensitive forms of serial communication because its address-based semantics mean that processor threads are often waiting for the results of a transaction. The advent of PCIe 4.0, and especially of PCIe 5.0, has driven the need to use retimers in many longer-reach PCIe applications.
This article explores the general latency environment for PCIe at six different layers and discusses ideas for how to optimize each of those layers, including with retimers.
Application latency arises from many different sources. The following table describes six different layers, the typical latency added at each layer, the source of the latency in each layer, and methods that can be used to minimize the latency experienced in each layer. Latency should be optimized from the top of the table to the bottom.
Figure 1 Latency affects PCIe application performance at six different layers. Source: Kandou
At the top of the table, there are opportunities to save milliseconds to seconds. In the middle of the table, it is possible to save microseconds to milliseconds. At the bottom of the table, it is possible to save tens of nanoseconds. That said, every element of latency contributes to the overall system performance, so opportunities for improvement should be taken at each level where feasible and economic.
A significant opportunity for improvement in many systems is found at the data link layer. This is because the retransmit (“retry”) mechanism can be triggered relatively often, even in normal operation. A 16-lane link running at 32 Gbps per lane (PCIe 5.0) carries 512 Gbps, so with a bit error ratio (BER) of 1E-12, a bit error, and therefore a retransmit, will occur about every two seconds.
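The two-second figure follows directly from the BER and the aggregate line rate. The short Python sketch below reproduces that arithmetic. It assumes that bit errors are independent, that each bit error causes exactly one retransmit, and that the raw line rate is used without accounting for encoding overhead. The function name is illustrative only.

```python
def mean_seconds_between_bit_errors(ber: float, lanes: int, gbps_per_lane: float) -> float:
    """Average time between bit errors on a multi-lane link.

    Assumes independent bit errors and uses the raw line rate
    (no allowance for encoding or protocol overhead).
    """
    bits_per_second = lanes * gbps_per_lane * 1e9
    errors_per_second = ber * bits_per_second
    return 1.0 / errors_per_second


# PCIe 5.0 x16 link from the text: 16 lanes at 32 Gbps, BER of 1E-12.
interval = mean_seconds_between_bit_errors(ber=1e-12, lanes=16, gbps_per_lane=32.0)
print(f"About {interval:.2f} s between bit errors")
```

Under the same assumptions, improving the BER by one order of magnitude stretches the interval by a factor of ten, which is why a cleaner eye translates so directly into fewer retry events.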
The retransmitted traffic is treated as 5th priority per the PCI-SIG Implementation Guideline Section 3.6.2.1. A major latency hit is experienced whenever this happens. This latency source also degrades the latency consistency of the system and can cause an application to have a stall or stutter in its performance.
A more serious latency hit is taken when the link training and status state machine (LTSSM) is forced to enter the recovery state due to poor signal integrity, perhaps at a corner condition such as high or changing operating temperature. The recovery state machine can take hundreds of milliseconds or more to find a new setting for the transmit equalizers. In the worst case, if the eye is not open enough, PCIe links can be forced to run at a lower speed, downshifting from, say, 32 GT/s to 16 GT/s.
Using retimers to improve BER
A method of reducing latency by avoiding retry and recovery events is to use retimer devices in the high BER paths, typically the longest paths in the system. Retimers improve both the eye height (EH) and the eye width (EW) seen by the next receiver in the path. A retimer recovers a clean digital copy of the signal, generates a clean transmit clock, and sends out a buffered copy of the data.
Figure 2 Retimers improve both the EH and EW in many situations. Source: Kandou
Retimers improve the EW by improving all sources of jitter. That includes data-dependent jitter (DDJ), often due to inter-symbol interference (ISI); random jitter (RJ), often due to thermal noise and clock jitter; and bounded uncorrelated jitter (BUJ), often due to crosstalk. Retimers “reset” all these jitter sources.
Before discussing retimer latency, it is important to introduce the three PCIe clock modes. These are separate reference clocks with independent spread-spectrum clocking (SRIS), separate reference clocks with no spread-spectrum clocking (SRNS), and common clock. Retimers must generate a new transmit clock, so the choice between these modes makes a large difference in the added latency. The following diagram illustrates the common clock mode at the top and the SRIS mode at the bottom.
Figure 3 The top shows common clock mode, the bottom shows SRIS mode. Source: Kandou
PCIe clock modes and retimers
In retimer applications, it’s best to use the common clock mode. In this mode, the CPU, retimer and end-point all share the same reference clock. All three points have the same parts per million (PPM) offset for their clock and the same spread-spectrum profile. As a result, the retimer elastic buffer can be made dramatically smaller.
The SRIS and SRNS clock modes can be used with retimers. That said, with these modes, the CPU and retimer are in one clock domain and the end-point is in another. The retimer then has the job of accounting for this clocking difference for its ingress traffic from the end-point. It does this by being aware of the packet boundaries and adjusting the gap between the packets.
The retimer must account for +/- 300 PPM of clock rate offset plus -5000 PPM for the spread-spectrum clock difference in SRIS mode. The retimer elastic buffer must be configured to account for this difference. The latency addition gets higher for larger packet sizes. There is one opportunity here: if the CPU supports it, its ingress can be set to common clock mode even though the clocks are different, because the retimer has already done the buffering required by the SRIS or SRNS modes.
In the case where the system has not accounted for the additional lane-to-lane skew caused by the retimer in its budget, the retimer must add additional latency to reset that skew. The PCIe specification limits the Tx skew to 1.5 ns and the Rx skew to 5 ns.
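To see why the clock mode matters so much for buffer depth, it helps to estimate how far two clock domains drift apart between opportunities to adjust the gap between packets. The Python sketch below is a rough illustration, not a specification calculation. It assumes the worst case is two references at opposite ends of the +/- 300 PPM tolerance with one side at full 5000 PPM down-spread, giving 5600 PPM in SRIS mode and 600 PPM in SRNS mode. The 10,000 UI interval is a hypothetical value chosen only for the example.

```python
def worst_case_ppm(ref_tolerance_ppm: float = 300.0, ssc_downspread_ppm: float = 0.0) -> float:
    """Worst-case rate difference between two independent clock domains.

    Assumption: the two references sit at opposite ends of the +/- tolerance,
    and any spread-spectrum down-spread applies fully to one side only.
    """
    return 2 * ref_tolerance_ppm + ssc_downspread_ppm


def drift_in_ui(ppm: float, interval_ui: int) -> float:
    """Unit intervals of drift accumulated over interval_ui at a given PPM offset."""
    return ppm * 1e-6 * interval_ui


INTERVAL_UI = 10_000  # hypothetical gap between buffer adjustments

modes = {
    "common clock": 0.0,  # shared reference, so no rate difference
    "SRNS": worst_case_ppm(),
    "SRIS": worst_case_ppm(ssc_downspread_ppm=5000.0),
}
for name, ppm in modes.items():
    print(f"{name}: {ppm:.0f} PPM, {drift_in_ui(ppm, INTERVAL_UI):.1f} UI of drift")
```

The drift grows linearly with the interval, which is consistent with the added latency rising for larger packet sizes: the longer the retimer must wait for a packet boundary, the more slack the elastic buffer has to hold.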
The elastic buffer forms the central integrating store within a retimer. It allows the device to recreate the transmit clocks while not losing the user’s data packet information. The following diagram shows the architecture of a typical PCIe retimer.
Figure 4 A typical PCIe retimer architecture includes an elastic buffer as the central integrating store. Source: Kandou
The elastic buffer is where the latency from a retimer can add up and where the widely differing clocks found in SRIS clocking mode must be reconciled. It is also where the lane-to-lane skew must sometimes be cleaned up.
The PCIe specification sets limits for retimer-added latency. In non-SRIS clocking modes, this limit is 64 ns for the data rates from 5 GT/s to 32 GT/s and 128 ns for 2.5 GT/s. In SRIS modes, the limit also depends on the maximum packet size, and a large table of the limits is provided in the specification. An advanced retimer architecture allows a system to meet these specifications in all cases and to significantly beat them in certain cases.
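The non-SRIS limits quoted above can be captured in a small lookup, which also makes it easy to express each limit in unit intervals (UI). This Python sketch uses only the figures given in this article. The function names are illustrative, and the SRIS limits are deliberately left out because they depend on the maximum packet size table in the specification.

```python
def non_sris_latency_limit_ns(rate_gt_s: float) -> float:
    """Retimer-added latency limit for non-SRIS modes, per the figures in this article."""
    if rate_gt_s == 2.5:
        return 128.0
    if rate_gt_s in (5.0, 8.0, 16.0, 32.0):
        return 64.0
    raise ValueError(f"Unsupported data rate: {rate_gt_s} GT/s")


def latency_in_ui(latency_ns: float, rate_gt_s: float) -> float:
    """Convert nanoseconds to unit intervals (one UI is 1 / rate_gt_s ns)."""
    return latency_ns * rate_gt_s


for rate in (2.5, 5.0, 8.0, 16.0, 32.0):
    limit = non_sris_latency_limit_ns(rate)
    print(f"{rate:>4} GT/s: {limit:.0f} ns = {latency_in_ui(limit, rate):.0f} UI")
```

Because the limit stays at 64 ns while the data rate climbs, the same time budget covers many more unit intervals at 32 GT/s than at 5 GT/s.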
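It’s possible to employ a small, low-latency elastic buffer, often called a bypass buffer, if four conditions are met: the link uses common clock mode, lane-to-lane skew is accounted for at the system level, TS1/TS2 communication is maintained, and rate adaption is disabled. This bypass buffer can have a latency on the order of 10 ns and is typically implemented as a range-restricted region of the larger elastic buffer.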
The following diagram illustrates a bypass buffer in operation.
Figure 5 It’s possible to use a low-latency bypass buffer if certain conditions are met. Source: Kandou
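The conditions for bypass operation (common clock mode, lane-to-lane skew handled at the system level, TS1/TS2 communication maintained, and rate adaption disabled) lend themselves to a simple configuration check. The Python sketch below is purely illustrative: the `RetimerLinkConfig` class and its field names are hypothetical and do not correspond to any real retimer's register map or driver API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetimerLinkConfig:
    """Hypothetical description of a retimer link; field names are assumptions."""
    common_clock: bool
    skew_budgeted_at_system_level: bool
    ts1_ts2_maintained: bool
    rate_adaption_disabled: bool


def bypass_blockers(cfg: RetimerLinkConfig) -> list:
    """Return the unmet conditions; an empty list means bypass mode is possible."""
    checks = [
        (cfg.common_clock, "common clock mode not in use"),
        (cfg.skew_budgeted_at_system_level, "lane-to-lane skew not budgeted at system level"),
        (cfg.ts1_ts2_maintained, "TS1/TS2 communication not maintained"),
        (cfg.rate_adaption_disabled, "rate adaption still enabled"),
    ]
    return [reason for ok, reason in checks if not ok]


sris_link = RetimerLinkConfig(
    common_clock=False,
    skew_budgeted_at_system_level=True,
    ts1_ts2_maintained=True,
    rate_adaption_disabled=True,
)
print(bypass_blockers(sris_link))
```

Returning the list of unmet conditions, rather than a single true or false, makes it clear which system-level decision is keeping the link on the larger elastic buffer.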
Retimer’s role in latency optimization
Application latency should be optimized from the top layer down to the bottom layer, and every element matters to performance. For workloads that require ultra-low latency, an alternative protocol such as CXL should be considered when feasible.
A good way to improve both latency and latency consistency is to add retimers to the longest links with the worst BERs. Retimers can improve both the EH and the EW. Signal integrity margins should be validated at the operating corners, especially high and changing temperature. Retimers can improve latency by keeping systems out of retry events, which can be routine at high data rates, and they can also help avoid costly recovery events.
Retimers can be used in a latency-optimized bypass buffer mode by using common clock mode, accounting for lane-to-lane skew at the system level, maintaining TS1/TS2 communication, and disabling rate adaption.
Editor’s Note: This article is based on a presentation by Jay Li of Kandou during the 2022 PCI-SIG Development Conference.
Jay Li is product marketing director at Kandou.
Brian Holden is responsible for the standards strategy at Kandou.