Goohyung Chung1
Kyoungub Cho1
Taehyoun Oh*
(Department of Electronic Engineering, Kwangwoon University, 615, Bima, 20, Gwangun-ro,
Nowon-gu, Seoul 139-701, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Index Terms
CMOS, IO transceiver, scalable, delay compensation, pre-emphasis, FIR driver
I. INTRODUCTION
Recently, demand for high-resolution displays has increased the per-pin data rate required for high-throughput chip-to-chip links. The data transmission speed for a display varies, even during real-time operation, with the image content, and the interface circuits must cover various standards within a restricted power and noise budget. For the interface scheme to operate at various speeds, the high-speed digital logic must be scalable.
In [1], multiple correlated clock phases are used to solve the hold-time violation (HTV) problem, and scalable operation is achieved. However, a multi-phase clock must be available, and keeping the phases equally spaced at high speed is an issue. A scheme that adaptively selects the clock polarity after detecting an HTV has been suggested in [2]. Since cascaded digital logic may require multiple selections along the path, multiple adaptive loops must be implemented to remove all HTVs in that scheme.
In this paper, we theoretically analyze the mechanism of HTV events at multiple data speeds and propose an efficient design methodology that avoids HTV over scalable data speeds. All high-speed digital paths in our transceiver are made scalable by a delay matching technique. The transceiver covers a speed range of 2.65 Gb/s-6.4 Gb/s, which meets various standards such as DP1.4 (5.4 Gb/s), LPDDR5 (6.4 Gb/s), SATA3 (6 Gb/s) and XAUI (3.125 Gb/s). The measured performance is compared with that of similar designs [3,4]. In addition, the half-rate design of the drivers and samplers in the front-end reduces the power significantly.
II. ARCHITECTURE
Fig. 1 presents the proposed 2-channel transceiver, which operates at scalable data speeds. The pseudo-random bit sequence (PRBS) generator produces 18 lanes of 147-356 Mb/s parallel signals with pattern lengths of $2^{23}-1$ and $2^{31}-1$. The following 18:2 serializers assemble them into 2 lanes of 1.325-3.2 Gb/s EVEN/ODD data.
As shown in Fig. 1(b), the tap signal generator delays the half-rate D$_{\mathrm{ODD}}$/D$_{\mathrm{EVEN}}$ signals to generate the PRE/MAIN/POST signals. It consists of consecutive latches, and each latch stage delays the data by 0.5 UI of the data speed. As shown in the timing diagram of Fig. 1(b), the PRE/MAIN/POST signals are taken from the nodes at which the three tap signals are aligned and are provided to the drivers.
As shown in Fig. 1(c), our voltage-mode driver draws 1/4 of the current of a current-mode driver. The driver consists of three taps (PRE/MAIN/POST) with 3, 15, and 7 segments, respectively. The number of ON segments of each tap is adjusted through the PU SEG, so that the amplitude of pre-emphasis can be adjusted for various channel losses.
Fig. 1. (a) Block diagram of the 2-channel TRx; (b) tap signal generator; (c) FIR driver; (d) sampler.
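To illustrate how the number of ON segments per tap could set the pre-emphasis amplitude, the following Python sketch models a generic textbook 3-tap FIR. It is not the circuit of Fig. 1(c): the function name `fir_preemphasis`, the negative sign of the PRE and POST taps, the normalization by the total number of ON segments, and the handling of the sequence boundaries are all assumptions made for illustration.

```python
def fir_preemphasis(bits, pre_on=0, main_on=15, post_on=0):
    """Return normalized output levels of a generic 3-tap FIR driver model.

    pre_on / main_on / post_on are the numbers of ON segments per tap
    (at most 3 / 15 / 7 according to the text). Assumed model:
        y[n] = (main*d[n] - pre*d[n+1] - post*d[n-1]) / (pre + main + post)
    with d[n] in {-1, +1}.
    """
    total = pre_on + main_on + post_on
    if total == 0:
        raise ValueError("at least one segment must be ON")
    symbols = [1 if b else -1 for b in bits]
    out = []
    for n, cur in enumerate(symbols):
        # Assumption: repeat the first/last symbol beyond the sequence ends.
        prev = symbols[n - 1] if n > 0 else cur
        nxt = symbols[n + 1] if n + 1 < len(symbols) else cur
        out.append((main_on * cur - pre_on * nxt - post_on * prev) / total)
    return out


if __name__ == "__main__":
    pattern = [0, 0, 1, 1, 1, 0]
    print(fir_preemphasis(pattern))                         # main tap only
    print(fir_preemphasis(pattern, main_on=15, post_on=7))  # with POST tap
```

In this model a bit that follows a transition is driven at full amplitude, while repeated bits are attenuated, which is the behavior expected from pre-emphasis; turning more POST or PRE segments ON increases the attenuation of the repeated bits.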
The equalizing drivers generate 2.65-6.4 Gb/s differential non-return-to-zero (NRZ) signals. Along the path, all high-speed logic is scalable because HTV is avoided by matching the delays for various speeds. In our receivers, 1-stage continuous-time linear equalizers (CTLE) mitigate the channel inter-symbol interference and improve the bit-error rate (BER). As shown in Fig. 1(d), each sampler consists of a strong-arm latch followed by an SR latch, which converts the return-to-zero (RZ) output of the strong-arm latch into an NRZ signal. The strong-arm latch compares the voltage levels of the differential inputs (V$_{\mathrm{in,p}}$, V$_{\mathrm{in,n}}$), which carry 2.65-6.4 Gb/s data, at the rising edge of the recovered clock from the CDR. If V$_{\mathrm{in,p}}$ is larger than V$_{\mathrm{in,n}}$, OUTP is resolved as 1, and if V$_{\mathrm{in,p}}$ is smaller than V$_{\mathrm{in,n}}$, OUTP is resolved as 0.
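The RZ-to-NRZ conversion described above can be sketched behaviorally in Python. This is a simplified logical model, not a circuit simulation: the class name `Sampler` and its methods are hypothetical, the comparison is treated as level-sensitive while the clock is high, and metastability (equal inputs) is ignored.

```python
class Sampler:
    """Behavioral model: strong-arm latch (RZ output) followed by an SR latch."""

    def __init__(self):
        self.outp = 0  # SR-latch state, held while the strong-arm latch is reset

    def strong_arm(self, clk, vin_p, vin_n):
        """Return the (set, reset) pair produced by the strong-arm latch."""
        if not clk:
            return (0, 0)  # reset phase: both outputs return to zero
        return (1, 0) if vin_p > vin_n else (0, 1)

    def step(self, clk, vin_p, vin_n):
        """Advance one clock phase and return the NRZ output OUTP."""
        s, r = self.strong_arm(clk, vin_p, vin_n)
        if s:
            self.outp = 1
        elif r:
            self.outp = 0
        # (0, 0): the SR latch holds the previous decision
        return self.outp


if __name__ == "__main__":
    sampler = Sampler()
    for clk, vp, vn in [(1, 0.6, 0.4), (0, 0.4, 0.6), (1, 0.4, 0.6), (0, 0.6, 0.4)]:
        print(clk, vp, vn, sampler.step(clk, vp, vn))
```

The SR latch keeps the last decision during the reset phase of the strong-arm latch, so OUTP changes only when a new comparison is made.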
The following 2:18 deserializers parallelize the EVEN/ODD data into 18 lanes $\times$ 147-356 Mb/s signals, and the PRBS checkers detect errors in the received signals. The BER counters can count up to $2^{40}$ errors, and the error count is monitored in real time via the serial-to-parallel interface (SPI). The scalable logic allows data operation at any speed up to the maximum set by the clock-speed constraint.
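The lane rates quoted in this section follow directly from the serialization ratios. The short Python sketch below reproduces them; the function name `lane_rates` and its dictionary keys are hypothetical, and only the factor of 18 and the half-rate EVEN/ODD split are taken from the text.

```python
SERIALIZATION_FACTOR = 18  # parallel lanes per channel (from the text)


def lane_rates(line_rate_bps):
    """Return the parallel-lane rate, the EVEN/ODD half rate, and the UI."""
    return {
        "parallel_bps": line_rate_bps / SERIALIZATION_FACTOR,
        "half_rate_bps": line_rate_bps / 2,
        "ui_ps": 1e12 / line_rate_bps,
    }


if __name__ == "__main__":
    for rate in (2.65e9, 6.4e9):
        r = lane_rates(rate)
        print(f"{rate / 1e9:.2f} Gb/s: "
              f"{r['parallel_bps'] / 1e6:.1f} Mb/s per lane, "
              f"{r['half_rate_bps'] / 1e9:.3f} Gb/s EVEN/ODD, "
              f"UI = {r['ui_ps']:.2f} ps")
```

At the two ends of the supported range this gives roughly 147 Mb/s and 356 Mb/s per parallel lane, and 1.325 Gb/s and 3.2 Gb/s for the EVEN/ODD data, in agreement with the figures above.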
Fig. 2 illustrates the delay matching technique for scalable-speed operation of the high-speed logic in our architecture. Fig. 2(a) shows a typical case of consecutive positive-edge-triggered flip-flops (FF) that share a single clock source. The input clock is inverted for FF2 because PVT variation and line-delay mismatch can cause a timing mismatch between data and clock at FF2, which may lead to setup- and hold-time violations. The data delay $t_{d}$ and the clock delay $t_{c}$ arise from the propagation delay of the combinational logic that implements logical functions (i.e., muxing, demuxing, clock dividing) or from the clock-to-Q delay. In all cases, $t_{d}$ and $t_{c}$ depend not on the data speed or clock speed but on the propagation delay of the logic circuits.
Fig. 2(b) is a simple implementation of the logic blocks as inverter chains, used to find the changes in $t_{d}$ and $t_{c}$ under PVT variation. The values of $t_{d}$ and $t_{c}$ are 468.9 ps and 157.8 ps at 3.2 Gb/s, typical corner, 27℃, and Table 1 summarizes the data and clock delays across corners and temperatures. In Table 1, $t_{d,corner}$ and $t_{c,corner}$ are the data delay and clock delay at each corner and temperature, and $t_{d,var}$ and $t_{c,var}$ are defined as $t_{d,var}= t_{d,corner}- t_{d}$ and $t_{c,var}= t_{c,corner}- t_{c}$. When Clock B is located at the optimal point of Data B at the typical corner, 27℃, the deviation of Clock B from the optimal point of Data B is defined as $\left| t_{d,var}- t_{c,var}\right|$. As stated in Table 1, the maximum value of $\left| t_{d,var}- t_{c,var}\right|$ at the maximum data rate (3.2 Gb/s) of our circuit is 95 ps, or 0.3 UI, at the ss corner, 120℃. Usually, an eye opening of over 0.8 UI is secured in a digital circuit. Since the delay difference due to PVT variation does not vary with data speed, the lower the data speed, the smaller the fraction of 1 UI that $\left| t_{d,var}- t_{c,var}\right|$ occupies. As a result, this circuit reduces the HTV due to PVT variation.
Table 1. Data and Clock Delay with PVT variations
| Corner | Temperature [℃] | $t_{d,corner}$ [ps] | $t_{c,corner}$ [ps] | $t_{d,var}$ [ps] | $t_{c,var}$ [ps] | $\lvert t_{d,var}- t_{c,var}\rvert$ [ps] | $\lvert t_{d,var}- t_{c,var}\rvert$ [UI @ 3.2 Gb/s] |
|---|---|---|---|---|---|---|---|
| tt | -40 | 460.6 | 156.9 | -8.3 | -0.9 | 7.4 | 0.02 |
| tt | 120 | 485.3 | 161.7 | 16.4 | 3.9 | 12.5 | 0.04 |
| ss | -40 | 590.5 | 200 | 121.6 | 42.2 | 79.4 | 0.25 |
| ss | 120 | 607.5 | 201.4 | 138.6 | 43.6 | 95 | 0.3 |
| ff | -40 | 368.4 | 124.7 | -100.5 | -33.1 | 67.4 | 0.22 |
| ff | 120 | 398.2 | 132.9 | -70.7 | -24.9 | 45.8 | 0.15 |
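The derived columns of Table 1 can be recomputed from the corner delays with a few lines of Python. The sketch below uses only the values stated in the text and in Table 1; the names `deviation_ps` and `deviation_ui` are hypothetical helpers, and the optional `data_rate_bps` argument is added to show how the same deviation scales at a lower data speed.

```python
T_D_TYP, T_C_TYP = 468.9, 157.8  # ps at typical corner, 27 C (from the text)

# (corner, temperature in C) -> (t_d_corner, t_c_corner) in ps, from Table 1
CORNERS = {
    ("tt", -40): (460.6, 156.9),
    ("tt", 120): (485.3, 161.7),
    ("ss", -40): (590.5, 200.0),
    ("ss", 120): (607.5, 201.4),
    ("ff", -40): (368.4, 124.7),
    ("ff", 120): (398.2, 132.9),
}


def deviation_ps(t_d_corner, t_c_corner):
    """|t_d,var - t_c,var| in ps, with the variations taken from typical."""
    t_d_var = t_d_corner - T_D_TYP
    t_c_var = t_c_corner - T_C_TYP
    return abs(t_d_var - t_c_var)


def deviation_ui(t_d_corner, t_c_corner, data_rate_bps=3.2e9):
    """The same deviation expressed as a fraction of one UI."""
    return deviation_ps(t_d_corner, t_c_corner) * data_rate_bps / 1e12


if __name__ == "__main__":
    for (corner, temp), delays in CORNERS.items():
        print(corner, temp,
              round(deviation_ps(*delays), 1),
              round(deviation_ui(*delays), 2))
    worst = max(CORNERS, key=lambda key: deviation_ps(*CORNERS[key]))
    print("worst case:", worst)
    print("worst case at 1.325 Gb/s:",
          round(deviation_ui(*CORNERS[worst], data_rate_bps=1.325e9), 2), "UI")
```

The worst case is the ss corner at 120℃, where the 95 ps deviation is about 0.30 UI at 3.2 Gb/s but only about 0.13 UI at 1.325 Gb/s, which illustrates why the PVT-induced deviation becomes less critical as the data speed decreases.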
Fig. 2. Simulation testbench for the delay matching technique: (a) typical flip-flop logic in which timing issues occur at various speeds; (b) logic blocks made up of inverter chains; (c) timing illustration of Data B and Clock B.
Fig. 3(a) shows a timing diagram for the case in which $t_{d}$ is $3\alpha $, where $\alpha $ is assumed to be 0.5 UI at the maximum speed for illustration purposes. To avoid HTV, $t_{c}$ can be delayed by $\alpha $ (method1) or by $3\alpha $ (method2). Fig. 3(b) shows that both method1 and method2 operate without HTV at 2/3 of the maximum data speed. If both the data and clock speeds are halved, as shown in Fig. 3(c), method1 results in HTV. Method2 avoids HTV over the continuous, wide range of data rates between the maximum data speed and 0.5 ${\times}$ the maximum data speed.
Fig. 3. Illustration of delay matching techniques for scalable speed: (a) Timing diagram for maximum speed ( $t_{d}$ > $t_{c}$ case); (b) Timing diagram for 2/3 speed of the maximum ( $t_{d}$ > $t_{c}$ case); (c) Timing diagram for 0.5 speed of the maximum ( $t_{d}$ > $t_{c}$ case); (d) Timing diagram for maximum speed ( $t_{d}$ < $t_{c}$ case); (e) Timing diagram for 2/3 speed of the maximum ( $t_{d}$ < $t_{c}$ case); (f) Timing diagram for 0.5 speed of the maximum ( $t_{d}$ < $t_{c}$ case).
Because the clock trigger timing remains at the point of optimal BER under method2, the consecutive-FF scheme operates without HTV regardless of the data speed. Fig. 3(d)-(f) show the case in which $t_{c}$ is $3\alpha $ and the delay compensation is applied to $t_{d}$, by $\alpha $ and $3\alpha $. Similarly, method2 ($t_{d}$=$3\alpha $) avoids HTV at various data speeds.
In Fig. 2(a), the speed of the input data is the same as that of the input clock, and the value of $\alpha $ at the maximum data rate (3.2 Gb/s) is set to 156.25 ps. For method1, $t_{d}=3\alpha $ and $t_{c} = \alpha $; for method2, $t_{d}=3\alpha $ and $t_{c}=3\alpha $. In the simulation, the input clock is swept over 0.1-3.2 GHz, and the pattern checker detects errors and calculates the BER at each frequency. Fig. 4 shows the resulting BERs of method1 and method2 over this wide range of data rates. With method2, the BER is close to 0 across 0.1-3.2 GHz, whereas with method1 the BER increases near 0.5 ${\times}$ the maximum data rate.
Fig. 4. Simulated BER versus frequency for method1 and method2.
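The trend of Fig. 4 can be reproduced qualitatively with a simple timing model. The Python sketch below is an idealized illustration under stated assumptions, not the simulation testbench used for Fig. 4: Data B is assumed to toggle at $t_{d}+kT$, FF2 is assumed to sample on the inverted clock at $T/2+t_{c}+kT$, with $T$ one clock period (equal to 1 UI), and setup/hold windows, jitter, and PVT variation are ignored. The function name `sampling_margin_ui` is hypothetical.

```python
ALPHA_PS = 156.25  # 0.5 UI at the maximum rate of 3.2 Gb/s (from the text)

# (t_d, t_c) in ps for the two compensation methods described in the text
METHODS = {
    "method1": (3 * ALPHA_PS, 1 * ALPHA_PS),
    "method2": (3 * ALPHA_PS, 3 * ALPHA_PS),
}


def sampling_margin_ui(clock_ghz, t_d_ps, t_c_ps):
    """Distance, in UI, from the FF2 sampling edge to the nearest Data B edge.

    0.5 UI means the clock samples at the center of the data eye;
    0 UI means it samples exactly on a data transition.
    """
    period_ps = 1e3 / clock_ghz
    offset = (period_ps / 2 + t_c_ps - t_d_ps) % period_ps
    return min(offset, period_ps - offset) / period_ps


if __name__ == "__main__":
    for f_ghz in (3.2, 3.2 * 2 / 3, 1.6, 0.8, 0.1):
        margins = ", ".join(
            f"{name}: {sampling_margin_ui(f_ghz, *delays):.3f} UI"
            for name, delays in METHODS.items()
        )
        print(f"{f_ghz:5.3f} GHz  {margins}")
```

Under these assumptions the method1 margin is 0.5 UI at 3.2 GHz, drops to about 0.17 UI at 2/3 of the maximum speed, and vanishes at 1.6 GHz, whereas the method2 margin stays at 0.5 UI at every frequency because $t_{d}$ and $t_{c}$ cancel.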
Fig. 5(a) and (b) show the circuits of the 3:1 serializer in the transmitter and the 1:3 deserializer in the receiver, in which the timing issues occur at the 2nd FF of the consecutive FFs that share a single clock source. In the serializer, as shown in Fig. 5(a), the mux uses a divide-by-3 clock and $t_{d}$ is larger than $t_{c}$. For scalable operation, the delay compensation is applied to $t_{c}$ by adding a chain of buffers. In the deserializer, on the other hand, $t_{d}$ is smaller than $t_{c}$, so the compensation is applied to $t_{d}$ in the same manner, as shown in Fig. 5(b). The delay compensation buffers for the deserializer can be placed at A or B. Placing them at A would affect the timing of FF1, so B is the better choice.
Fig. 5. Delay matching techniques used in our transceiver IP: (a) 3:1 Serializer; (b) 1:3 Deserializer.
III. MEASUREMENT
Fig. 6 presents the measurement results of our transceiver at 3.2 Gb/s and 6.4 Gb/s. A Tektronix TDS6154C oscilloscope was used to measure the Tx eye performance, and the built-in BER counter in the Rx measures the BER while the sampler clock phase is swept horizontally. The estimated parasitic loading of the Tx output, PAD and channel is 7.5 pF, which results in a 17.8 dB channel loss at the Nyquist rate. Without pre-emphasis, the measured vertical eye openings (eye opening/output swing) at 6.4 Gb/s are 94.8 mV/993 mV for channel1 and 59.5 mV/997 mV for channel2, as shown in Fig. 6(a) and (b). With pre-emphasis on, the vertical eye openings improve to 221.6 mV/577.8 mV and 185.1 mV/534.3 mV. Fig. 6(c) shows the Rx horizontal bathtub curves measured with the built-in BER counter in our IP at 3.2 Gb/s and 6.4 Gb/s, with and without pre-emphasis. The horizontal eye opening is improved by 0.23 UI and 0.25 UI at $10^{-9}$ BER.
Fig. 6. Measurement results of our transceiver at 3.2 Gb/s and 6.4 Gb/s: (a) Tx output eye opening w/ and w/o FIR at channel1 (6.4 Gb/s); (b) Tx output eye opening w/ and w/o FIR at channel2 (6.4 Gb/s); (c) Rx bathtub curves w/ and w/o FIR for 3.2 and 6.4 Gb/s (channel1).
Our transceiver has been fabricated in a 65 nm CMOS process and occupies a die area of 1.02 $\mathrm{mm}^{2}$. Fig. 7 shows the layout of our IP and the measurement setup. Table 2 summarizes the measured performance of our transceiver and compares it with prior art. The proposed transceiver shows successful data transmission in measurement over the whole speed range of 2.65 Gb/s - 6.4 Gb/s, owing to the scalable design technique. Our transceiver consumes 72 mW/ch from a 1.2 V power supply.
Fig. 7. Layout of the 2-ch transceivers (1.02 $\mathrm{mm}^{2}$) and measurement setup.
Table 2. Comparison Table
| | [3] | [4] | This work |
|---|---|---|---|
| Technology | 28 nm CMOS | 90 nm CMOS | 65 nm CMOS |
| Data rate (bit/s) | 0.5 - 6.6 G | 4 G | 2.65 - 6.4 G |
| Supply (V) | 1 | - | 1.2 |
| Power (mW/ch) | 129 | 56 | 72 |
| Channel Loss (dB) | 22 | 18.2 | 17.8 |
| Tx Vertical eye opening (mV) | 180 | - | 221.6 (FR4) |
| Rx Horizontal eye opening (UI) | 0.25 (@$10^{-9}$) | 0.2 (@$10^{-9}$) | 0.25 (@$10^{-9}$) |
| Swing (mV) | - | 250 - 1000 | 577.8 |
| Single Tx/Rx Area ($\mathrm{mm}^{2}$/ch) | 0.64 | 1.11 | 0.51 |
IV. CONCLUSIONS
A design methodology for high-speed clock-triggered logic that supports scalable-speed operation has been proposed and used to implement complete 2-channel IO transceivers. The HTV timing issue at various data speeds has been analyzed theoretically. The IP shows error-free data transmission over the speed range of 2.65 Gb/s-6.4 Gb/s.
ACKNOWLEDGMENTS
This work was supported in part by the ATC+ (Advanced Technology Center plus)
Program through the Korea Evaluation Institute of Industrial Technology under Grant
20017980 and was supported by the Research Grant of Kwangwoon University in 2022.
The EDA tool was supported by the IC Design Education Center (IDEC), Korea.
References
[1] Frans, Y., Carey, D., Erett, M. et al.: ‘A 0.5-16.3 Gb/s Fully Adaptive Flexible-Reach Transceiver for FPGA in 20 nm CMOS’, IEEE Journal of Solid-State Circuits, 2015, 50, 8, pp. 1932-1944, doi:10.1109/JSSC.2015.2413849.
[2] Abdollahi, R., Hadidi, K. and Khoei, A.: ‘A Simple and Reliable System to Detect and Correct Setup/Hold Time Violations in Digital Circuits’, IEEE Transactions on Circuits and Systems I: Regular Papers, 2016, 63, 10, pp. 1682-1689, doi:10.1109/TCSI.2016.2582239.
[3] Savoj, J., Hsieh, K.C.H., An, F.T. et al.: ‘A Low-Power 0.5-6.6 Gb/s Wireline Transceiver Embedded in Low-Cost 28 nm FPGAs’, IEEE Journal of Solid-State Circuits, 2013, 48, 11, pp. 2582-2594, doi:10.1109/JSSC.2013.2274824.
[4] Faust, A.C., Narasimha, R.L., Bhatia, K. et al.: ‘FEC-based 4 Gb/s backplane transceiver in 90 nm CMOS’, Proceedings of the IEEE 2012 Custom Integrated Circuits Conference, San Jose, CA, USA, 9-12 Sept. 2012, doi:10.1109/CICC.2012.6330665.
Goohyung Chung received the Bachelor of Science (B.S.) degree from the Department of Electronic Engineering, Kwangwoon University, Korea, in 2022. He is currently pursuing the Master of Science (M.S.) degree at Kwangwoon University, Korea. His current research field is the design of clock and data recovery (CDR) circuits, including high-speed IO circuits.
Kyoungub Cho received the Bachelor of Science (B.S.) degree from the Department of Electronic Engineering, Kwangwoon University, Korea, in 2022. He is currently pursuing the Master of Science (M.S.) degree at Kwangwoon University, Korea. His current research field is the design of clock generation circuits, including phase-locked loops (PLL), and high-speed IO circuits.
Taehyoun Oh (S’05) received the Bachelor of Science (B.S.) and Master of Science (M.S.) degrees in Electrical Engineering from Seoul National University in 2005 and 2007, respectively. He received his Ph.D. degree in Electrical Engineering from the University of Minnesota, Minneapolis, under the supervision of Dr. Ramesh Harjani. His doctoral research focused on high-speed I/O circuits and architectures. During the summer of 2010, he worked on I/O channel modeling at AMD Boston Design Center, MA. In the fall semester of 2011, he researched I/O architecture and jitter budgeting of the link at Intel Corp., CA. In the fall of 2012, he joined the IBM System Technology Group, NY, and worked on performance verification of high-speed decision feedback equalizers for server processors. Since the spring of 2013, he has been with the Department of Electronic Engineering, Kwangwoon University, Seoul, Korea, as an assistant professor. His current research interest is focused on clock generation IC design.