# High-Speed NRZ/PAM4 Wireline Receiver System-on-a-Chip Design

by

Guang Zhu

A Thesis Submitted to

The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
in the Department of Electronic and Computer Engineering

August 2018, Hong Kong

#### **Authorization**

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

Guang ZHU

August 2018

## High-Speed NRZ/PAM4 Wireline Receiver System-on-a-Chip Design

by

#### **Guang ZHU**

This is to certify that I have examined the above PhD thesis and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the thesis examination committee have been made.

Prof. C. Patrick YUE, ECE Department (Thesis Supervisor)

Prof. Bertram SHI (Head of ECE Department)

#### Thesis Examination Committee:

- 1. Prof. C. Patrick YUE (Supervisor), Department of Electronic and Computer Engineering
- 2. Prof. Ross D. MURCH, Department of Electronic and Computer Engineering
- 3. Prof. Andrew Wing-On POON, Department of Electronic and Computer Engineering
- 4. Prof. Jinglei YANG, Department of Mechanical and Aerospace Engineering
- 5. Prof. Hao YU (External Examiner), Department of Electrical and Electronic Engineering, Southern University of Science and Technology

Department of Electronic and Computer Engineering,
The Hong Kong University of Science and Technology
August 2018

To my parents

#### Acknowledgement

Firstly, I would like to thank my supervisor, Prof. Patrick Yue. With his instruction and guidance, I changed my mindset and learned how to do a PhD. His encouragement was very helpful in relieving pressure. He is always full of energy and enthusiasm, which encourages the whole group to exercise hard and then do research more efficiently. His expertise, spanning semiconductors to integrated circuits and academia to industry, not only played a key role during my PhD study but will also exert a far-reaching influence on my career and life.

I would also like to thank all thesis examination committee members: Prof. Richard SO, Prof. Ross Murch, Prof. Andrew W. Poon, and Prof. Jinglei Yang. They were dedicated and generous in sparing time to review my thesis and attend my defense. I would also like to thank the external committee member, Prof. Hao YU from Southern University of Science and Technology, for his willingness to offer suggestions and guidance, and Prof. Philip K. Mok for serving on my thesis proposal committee and giving very insightful comments.

I would also like to take the opportunity to thank my Master's degree supervisors, Prof. Fujiang LIN and Prof. Shengxi DIAO from the University of Science and Technology of China, who guided me into the area of integrated circuit design and taught me much of the basic knowledge and techniques that made my PhD research more efficient.

Besides these professors, I would also like to thank all YUE Group members. Without these nice guys, my PhD research and life would not have been so meaningful. I would like to thank Dr. Yipeng Wang, Dr. Duona Luo, Dr. Zhao Zhang, Dr. Qasim Maqbool and Zhixin Li for their help with tapeouts. I would also like to thank Dr. Quan Pan, Dr. Liang Wu, Zhengxiong Hou, Dr. Haikun Jia and Dr. Xiangyu Meng for the meaningful discussions. I would also like to thank Fengyu Che, Dr. Xianbo Li, Babar Hussain, Liusheng Sun, Can Wang, Li Wang, Milad Kalantari Mahmoudabadi, Jian Kang, Xuan Wu, Kanghui Zhao, Liwen Jing, Dr. Salahuddin Raju and Clarissa Prawoto for the enjoyable group activities. I would also like to thank Prof. Andrew Poon for generously lending equipment to us. I would also like to thank some very helpful technicians: Mr. S.F. Luk for his repeated help with chip bonding, chip dicing, digital circuit synthesis and PR, and PDK installation, and Mr. K.W. Chan, Mr. W. T. Cheng and Mr. Fred Kwok for their assistance with equipment. I would also like to thank Miss Tian Lu, who was very helpful with group meeting organization, paper grammar correction and reimbursement. I would also like to thank my roommate Yasu Lu, who came from the same undergraduate university as I did and whose modesty and magnanimity made our living together very harmonious.

Last but certainly not least, I would like to thank my family: my parents and my wife Jin Zhu. Without their understanding and unreserved support, my PhD life would have been much tougher. My wife works alone in Shanghai, and I cannot imagine how much she needed my companionship, especially when she got sick. I feel grateful to her for her dedication. I feel grateful to life for getting through the difficult time.

## **Table of Contents**

- Authorization Page
- Acknowledgement
- Table of Contents
- List of Figures
- List of Tables
- Abstract
- Chapter I Introduction
  - 1.1 Research Background
  - 1.2 Thesis Organization
- Chapter II 24-Gb/s PAM4 Receiver Design
  - 2.1 Introduction
  - 2.2 Receiver Decoder Architecture
  - 2.3 Building Blocks
  - 2.4 Measurement Results
  - 2.5 Conclusion
- Chapter III 26-Gb/s NRZ Receiver Design
  - 3.1 Introduction
  - 3.2 System Design
  - 3.3 Building Blocks
  - 3.4 Experimental Results
  - 3.5 Clock Recovery Loop Analysis
  - 3.6 Conclusion
- Chapter IV 56-Gb/s PAM4 Receiver Design
  - 4.1 Introduction
  - 4.2 System Design
  - 4.3 Building Blocks
  - 4.4 System Simulation Results
  - 4.5 Conclusion
- Chapter V Equalizer Adaptation Modeling
  - 5.1 CTLE Adaptation
  - 5.2 DFE Adaptation
  - 5.3 Conclusion
- Chapter VI Summary and Future Work
  - 6.1 Summary
  - 6.2 Future Work
- Bibliography

## **List of Figures**

- Fig. 1.1 Global IP traffic forecast of Cisco [1].
- Fig. 1.2 Diagram of a typical (a) optical link and (b) electrical link.
- Fig. 1.3 Electrical link standards.
- Fig. 1.4 Ethernet development history.
- Fig. 1.5 Power efficiency of wireline links in ISSCC [4].
- Fig. 1.6 Cross section of twinaxial cables in full duplex 100 GbE [6].
- Fig. 1.7 Eye diagram with ISI.
- Fig. 2.1 (a) Coding of PAM4 levels, (b) 1/4-rate PAM4 receiver with a VGA and 3-comparator based decoder, and (c) proposed receiver utilizing an adaptive variable-gain rectifier for the decoder.
- Fig. 2.2 Limited bandwidth caused eye opening degradation to (a) NRZ and PAM4, (b) top eye, (c) middle eye and (d) bottom eye.
- Fig. 2.3 (a) CTLE schematic diagram, and (b) NRZ and (c) PAM4 eye diagrams when CTLE has over peaked by 2 dB.
- Fig. 2.4 Schematic diagram of the VGR: (a) variable-gain SA and (b) rectifier.
- Fig. 2.5 Timing diagram of VGR.
- Fig. 2.6 Simulated transfer curves of a VGR. (a) SA gain set for a maximum input of 80 mV<sub>pp</sub> and (b) SA gain set for a maximum input of 260 mV<sub>pp</sub>.
- Fig. 2.7 Block diagram of gain adaptive control of the SA.
- Fig. 2.8 The adaptation process when a 10-fF loading mismatch at the SA output exists.
- Fig. 2.9 SA and rectifier outputs under (a) 80 mV<sub>pp</sub> and (b) 260 mV<sub>pp</sub> PAM4 input after gain adaptation.
- Fig. 2.10 Binary controlled (a) PMOS capacitor and (b) NMOS capacitor, and (c) simulated equivalent capacitance at different bias voltages.
- Fig. 2.11 (a) Schematic of the ring oscillator with four stages of delay cells, and (b) delay cell stage with clock injection part.
- Fig. 2.12 Receiver chip photograph.
- Fig. 2.13 Photograph and diagram of the measurement setup.
- Fig. 2.14 24-Gb/s PAM4 signal input for measurement.
- Fig. 2.15 Power consumption breakdown.
- Fig. 2.16 Received 3-Gb/s (a) MSB and (b) LSB eye diagrams.
- Fig. 2.17 Measured BER bathtub curve under a 24-Gb/s 190-mV<sub>pp</sub> PAM4 input.
- Fig. 2.18 Measured BER under different input amplitudes.
- Fig. 3.1 Proposed source-synchronous 1/4-rate receiver topology.
- Fig. 3.2 Data and edge sampling with clock-data phase error.
- Fig. 3.3 Timing diagram of the LSPD.
- Fig. 3.4 (a) Pulse response of a channel with limited bandwidth, and (b) data path DP<sub>0,1</sub> and edge path EP<sub>0</sub> with FFE and DFE.
- Fig. 3.5 (a) Inter symbol interference induced jitter, and (b) valid edge samples S<sub>1</sub> w/ and w/o edge equalization of a behavior model.
- Fig. 3.6 Sampling and holding stage.
- Fig. 3.7 (a) Charge-steering EQS and (b) EQS outputs w/ and w/o equalization.
- Fig. 3.8 (a) Voltage-current converter and (b) simulated transfer curve of edge path and VIC.
- Fig. 3.9 (a) Chip photograph, (b) used channels with pulse responses, and (c) measurement setup.
- Fig. 3.10 Recovered eye diagrams for the input of (a) 25 Gb/s and (b) 30 Gb/s, and (c) measured BER bathtub curves of receiver outputs with channel #1.
- Fig. 3.11 (a) 26-Gb/s input eye diagram, (b) recovered eye diagram, (c) measured bathtub curves and (d) achieved BER vs. data rate with channel #2.
- Fig. 3.12 Linear model of the 1<sup>st</sup> order clock recovery loop.
- Fig. 3.13 (a) Jitter transfer curve and (b) jitter tolerance curve.
- Fig. 4.1 OIF CEI-56G-VSR-PAM4 application [45].
- Fig. 4.2 PAM4 receiver topology.
- Fig. 4.3 (a) Schematic of two CTLEs and (b) the simulated frequency response.
- Fig. 4.4 CTLE DC output control, (a) feedback OPA, (b) digitally controlled resistor ladder for reference generation, (c) test bench of the open loop and (d) the simulation results.
- Fig. 4.5 Simplified diagram of front end with DC offset cancellation.
- Fig. 4.6 DC offset cancellation network (a) OPA with miller capacitor and (b) schematic diagram of OPA.
- Fig. 4.8 Summer including FFE tap and DFE tap.
- Fig. 4.9 Timing diagram of the equalizers.
- Fig. 4.10 FFE effect.
- Fig. 4.11 Schematic diagram of PAM4 decoder.
- Fig. 4.12 (a) Strong-arm latch with offset calibration pair, and (b) offset calibration loop.
- Fig. 4.13 (a) Transient simulation of calibration process and (b) calibration results for different offset inputs.
- Fig. 4.14 Timing diagram of demux by 2.
- Fig. 4.15 Demux and synchronization clock generation.
- Fig. 4.16 NRZ and PAM4 eye diagrams.
- Fig. 4.17 Bang-bang PD with transition selection.
- Fig. 4.18 Charge pump.
- Fig. 4.19 Simulation of PC and CP for the cases of Early and Late.
- Fig. 4.20 VCDL and its delay cell.
- Fig. 4.21 Simulated VCDL delay under corners.
- Fig. 4.22 Layout of 56-Gb/s PAM4 receiver.
- Fig. 4.23 Frequency response of the used channel.
- Fig. 4.24 Timing for key blocks in the receiver.
- Fig. 4.25 Loop filter output of the CDR.
- Fig. 4.26 Equalizer summer output (a) before and (b) after CTLE adaptation.
- Fig. 4.27 Simulation of adaptation process.
- Fig. 4.28 Three input signals of adaptation.
- Fig. 5.1 PAM4 eye diagram with noise.
- Fig. 5.2 Considered VD[n] in (a) peak detection based pattern selection, and (b) initial peak search.
- Fig. 5.3 CTLE adaptation model in PAM4 receiver.
- Fig. 5.4 CTLE adaptation model in NRZ receiver.
- Fig. 5.5 CTLE adaptation process of DEQ and <i>vref</i> in PAM4 receivers.
- Fig. 5.6 CTLE output eye diagrams (a) before and (b) after CTLE adaptation in PAM4 receivers.
- Fig. 5.7 The calculated BER bathtub curves under different CTLE setting in PAM4 receivers.
- Fig. 5.8 CTLE adaptation process of DEQ and <i>vref</i> in NRZ receivers.
- Fig. 5.9 CTLE output eye diagrams (a) before and (b) after CTLE adaptation in NRZ receivers.
- Fig. 5.10 1st-order RC channel simulation (a) pulse response and (b) NRZ eye diagrams before and after the ISI is cancelled by a 3-tap DFE.
- Fig. 5.11 NRZ DFE model and adaptation algorithm.
- Fig. 5.12 PAM4 DFE model and adaptation algorithm.
- Fig. 5.13 Pulse response of channel #1 (11-cm Rogers PCB trace).
- Fig. 5.14 Simulated eye diagrams before (left) & after (right) adaptation: (a) NRZ input and (b) PAM4 input.
- Fig. 5.15 Adaptation process of vref and DFE tap coefficients (a) NRZ input and (b) PAM4 input.

## **List of Tables**

- Table I. Performance summary of 24-Gb/s PAM4 receiver and comparison with similar works.
- Table II. Performance summary of 26-Gb/s NRZ receiver and comparison with similar works.
- Table III. Performance summary of 56-Gb/s PAM4 receiver and comparison with similar works.
- Table IV. Adaptation results of 3-tap NRZ/PAM4 DFEs with channels

#### **Abstract**

Wireline communication, featuring wide bandwidth and good channel isolation, has been extensively employed in applications such as massive data centers and cloud computing. In wireline links, the I/O transceiver operates at the highest data rate and determines the communication quality. Benefiting from the advancement of process technology, I/O bandwidth and power efficiency have improved dramatically over the past decades, and the trend must continue to meet the coming growth in data traffic. As Moore's Law comes to an end, mainstream non-return-to-zero (NRZ) transceivers face more stringent challenges, and four-level pulse amplitude modulation (PAM4) transceivers have become popular for their doubled bandwidth efficiency, although they bring new design challenges. In this thesis, three receivers working at ~25 Gb/s or 56 Gb/s are presented to address these issues.

The first work is a source-synchronous low-power 1/4-rate PAM4 receiver with an adaptive variable-gain rectifier (AVGR) based decoder in 28-nm CMOS technology. The proposed AVGR-based PAM4-to-NRZ decoder performs gain adaptation and amplitude rectification simultaneously to decode the least significant bit (LSB) of the PAM4 input. The linear sense amplifier (SA) in the AVGR is modified from a latch to achieve both high gain and low power. Experimental results demonstrate that the receiver chip achieves a BER of $10^{-11}$ and a bit efficiency of 1.38 pJ/bit while receiving and decoding a 24-Gb/s 190-mV<sub>pp</sub> PAM4 signal.

The second work is a power-efficient source-synchronous NRZ receiver employing a 1/4-rate linear sampling phase detector (LSPD) with an embedded feed-forward equalizer (FFE) and decision feedback equalizer (DFE). The 1/4-rate LSPD is proposed to save power and avoid the dithering jitter of a nonlinear bang-bang PD. To relax the timing constraint and improve the jitter performance of the recovered clock, a 1-tap FFE and a 1-tap DFE are applied to both the data path and the edge path to cancel the first and second post-cursors by reusing the linear samples. The receiver IC is fabricated in a 28-nm CMOS process and achieves error-free operation up to 26 Gb/s with a superior bit efficiency of 0.31 pJ/bit while tolerating a 14-dB channel loss at 13 GHz.

The third work is a source-synchronous 56-Gb/s 1/4-rate PAM4 receiver based on the two previous works. Besides the 1-tap FFE and 1-tap DFE, continuous-time linear equalizers (CTLE) are also included to improve the equalization ability. As the PAM4 signal is more sensitive to bandwidth, an adaptation algorithm for the CTLE is proposed based on a data pattern selection scheme. For simplicity and jitter performance, a bang-bang PD with data transition selection is proposed. To alleviate the free-running frequency shift of the injection-locked ring oscillator (ILRO) used in the first two works without degrading noise performance, a wide-bandwidth phase-locked loop (PLL) is employed. The simulation results demonstrate that the receiver achieves a bit efficiency of 0.65 pJ/bit while compensating a 9.5-dB channel loss at 14 GHz.

Besides the CTLE adaptation used in the third work, an LMS-based adaptation method for the DFE is also introduced with design details that have rarely been reported before. Behavior-level simulation results confirm the accuracy of the proposed equalization adaptation algorithms.

### **Chapter I Introduction**

#### 1.1 Research Background



Fig. 1.1 Global IP traffic forecast of Cisco [1].

Widespread data centers, cloud computing and other data services have become the main drivers of the data traffic boom. According to the Cisco forecast shown in Fig. 1.1, global IP traffic was about 120 EB/month in 2017 with a compound annual growth rate of about 24%, meaning that data traffic roughly triples every 5 years [1]. To support this tremendous data traffic, both wireless links and wireline links play critical roles.
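As a quick check of the growth figure quoted above (simple compounding of the stated numbers, assuming the growth rate stays constant, and not an additional forecast):

$$
(1 + 0.24)^{5} \approx 2.93, \qquad 120~\text{EB/month} \times 2.93 \approx 350~\text{EB/month}.
$$

A 24% annual rate therefore corresponds to slightly less than a threefold increase over five years.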

#### 1.1.1 Data Communication Links

For wireless links, the signal modulated on carriers propagates through the air, so cross coupling, path attenuation and signal-to-noise ratio (SNR) must be carefully considered. Because spectrum resources are limited, each wireless standard is allocated its own frequency bands, and the transmitted signal spectrum must be strictly confined within those bands to avoid interfering with adjacent bands. During transmission, the air attenuates the signal and thus degrades the SNR. The distance a wireless link can cover is largely determined by the path attenuation and the sensitivity of the wireless receiver.

The congested wireless standards and limited spectrum resources make it hard to allocate enough bandwidth to a single standard, so high-order modulations are usually employed to achieve higher data rates [2]. For instance, the coming 5G is very likely to adopt 512-QAM or even 1024-QAM to achieve data rates at the Gb/s level and above [3].



Fig. 1.2 Diagrams of a typical (a) optical link and (b) electrical link.

Wireline links can be categorized into two groups: optical links and electrical links. Fig. 1.2(a) shows a simplified diagram of an optical link, where electrical-to-optical converters (EO) such as vertical-cavity surface-emitting lasers (VCSEL) convert the electrical signal to an optical signal and couple it into fibers. The received optical signal is converted back to an electrical signal by optical-to-electrical converters (OE) such as photodetectors. The EO and OE are usually implemented with III-V materials and packaged as discrete components alongside the CMOS or BiCMOS transceiver ICs. Optical links are widely deployed in the backbone of the Ethernet since fibers have very wide bandwidth and the loss can be as low as 0.2 dB/km. Similar to wireless links, sensitivity is one of the most significant factors in optical links. The fiber can be tens of kilometers long, and the optical signal needs to be repeated for transmission over a longer distance. For (very) short-reach links (e.g. < 20 m), signal attenuation is less critical, and electrical links become more popular and cost-efficient since the expensive optical components are eliminated. Fig. 1.2(b) shows a simplified diagram of an electrical link. Electrical links are more straightforward: the signal remains electrical throughout physical channels such as cables and PCBs. Since the electrical signal is confined within the channel, cross coupling among channels is significantly reduced. The signal can occupy as much of the channel bandwidth as possible to achieve high-speed (e.g. over 20 Gb/s) data communication without complicated modulation, so electrical links are more power-efficient than wireless links in terms of energy consumed per bit.

#### 1.1.2 Electrical Links



Fig. 1.3 Electrical link standards.

Electrical links are very popular and have been adopted in many applications. Fig. 1.3 summarizes some electrical-link standards. As can be seen, the data rate has kept increasing over the past years. For the common electrical interface (CEI) standards, the data rate has increased from 11 Gb/s in 2005 to 56 Gb/s in 2017 to meet the bandwidth demand [4]. Besides the standards above, electrical links are also used in the Ethernet for link ranges of less than 20 m. Fig. 1.4 shows the more than 30-year history of Ethernet development: the data rate has increased from 10 Mb/s to 100 Gb/s and will reach 400 Gb/s soon. Fig. 1.5 shows the power efficiency of wireline links reported at the International Solid-State Circuits Conference (ISSCC) [5]. Even though the data rate keeps increasing, the power efficiency improves by 30% every two years, benefiting from the advancement of process technologies and the advent of new circuit techniques.



Fig. 1.4 Ethernet development history.



Fig. 1.5 Power efficiency of wireline links in ISSCC [4].

Data centers are one of the most important applications of the 40 GbE, which uses 4 lanes of 10-Gb/s paths. Fig. 1.1 shows the data center with dense cables for interconnects. To improve the I/O density and power efficiency, the 100 GbE is gradually superseding the 40 GbE. Fig. 1.6 shows the cross section of the twinaxial cables in full duplex 100 GbE [6]. To be compatible with 10/40 GbE, the cable for the first-generation 100 GbE, shown in the middle of Fig. 1.6, consists of 20 pairs of 10-Gb/s differential cables: 10 pairs for transmission and the other 10 pairs for receiving. Considering I/O density, power efficiency, and CMOS process advancement, the second-generation 100 GbE consists of 4 lanes of 25-Gb/s paths, and its cable, shown on the left-hand side of Fig. 1.6 with 8 differential pairs, has a smaller form factor. For the next-generation 200 GbE and 400 GbE, the data rate per lane should be doubled or quadrupled to maintain the I/O density. The same situation applies to other 200G/400G systems. The topic of this thesis is high-speed wireline non-return-to-zero (NRZ) and four-level pulse amplitude modulation (PAM4) receiver SoC design. In this thesis, power-efficient NRZ and PAM4 electrical receivers with data rates of ~25 Gb/s and 56 Gb/s are reported.



Fig. 1.6 Cross section of twinaxial cables in full duplex 100 GbE [6].

#### A. Equalization

Unlike wireless and optical links, which are more concerned with transmission attenuation, electrical links must above all compensate the inter-symbol interference (ISI) caused by the limited bandwidth of the channel. ISI consists of pre-cursors and post-cursors, and both degrade signal integrity. Fig. 1.7 shows an NRZ eye diagram with ISI generated by an RC channel; the eye opening deteriorates in both the horizontal and vertical directions. If the ISI is more severe, the eye closes completely and bit errors occur during logic-level decisions. Equalization is the technique that reopens the closed eye diagram by eliminating ISI. The continuous-time linear equalizer (CTLE), feed-forward equalizer (FFE) and decision feedback equalizer (DFE) are the most popular equalizers and have been adopted by almost all electrical transceivers.

A receiver with CTLEs shows an analog high-pass characteristic, which compensates the low-pass channel to produce a flat overall response. A CTLE can compensate not only the main post-cursors but also pre-cursors and long-tail post-cursors. It can also provide positive DC gain, and it requires neither clocks nor delay cells. The CTLE is therefore a very efficient equalizer. However, it amplifies high-frequency noise and degrades the SNR. In addition, it is not easy to tune the frequency response of a CTLE to accommodate different channels, because the design usually has only one degree of freedom whereas a channel usually has a very complicated frequency response. On the receiver side, the CTLE usually provides moderate equalization to open the eye diagram, and the discrete-time FFE and DFE perform further equalization.

An FFE can cancel both pre-cursors and post-cursors using 1-UI delay cells, and different channels can be compensated by using multiple taps with independent tap coefficients. Because the cancellation is performed in the analog domain, the FFE also degrades the SNR, and the degradation becomes more severe as the number of taps increases. Therefore, the FFE is often placed on the transmitter side and cancels only pre-cursors on the receiver side. Like the FFE, the DFE has enough degrees of design freedom to compensate different channels by employing multiple taps with different tap coefficients. The DFE does not deteriorate the SNR because the cancellation is implemented with noiseless digital feedback taps. However, the DFE cancels only post-cursors, and its first tap must meet a very stringent timing constraint, especially in high-speed receivers.
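To make the ISI discussion concrete, the following minimal Python sketch samples the pulse response of a first-order RC channel and estimates the worst-case vertical NRZ eye opening with and without an idealized DFE. It is a behavioral illustration only, not the model used later in this thesis: the function names, the 25-Gb/s data rate, the 2-GHz channel bandwidth and the assumption that each DFE tap cancels its post-cursor exactly are all hypothetical choices.

```python
import math


def rc_pulse_cursors(ui, tau, n_post):
    """Sample the response of a first-order RC low-pass to a 1-UI pulse.

    Returns [h0, h1, ..., h_n_post]: the main cursor h0 (sampled at the
    pulse peak, t = 1 UI) followed by the post-cursors at 1-UI spacing.
    """
    def step(t):
        # Unit step response of the RC low-pass.
        return 1.0 - math.exp(-t / tau) if t > 0 else 0.0

    # Pulse response = step(t) - step(t - UI), sampled at t = (k + 1) * UI.
    return [step((k + 1) * ui) - step(k * ui) for k in range(n_post + 1)]


def worst_case_eye(cursors, dfe_taps=0):
    """Worst-case vertical NRZ eye opening for +/-1 symbols.

    An ideal DFE with `dfe_taps` taps is assumed to remove the first
    `dfe_taps` post-cursors exactly; the remaining ones all add against
    the main cursor in the worst-case data pattern.
    A negative result means the eye is closed.
    """
    main = cursors[0]
    residual = sum(abs(c) for c in cursors[1 + dfe_taps:])
    return 2.0 * (main - residual)


if __name__ == "__main__":
    ui = 1.0 / 25e9                     # assumed 25-Gb/s NRZ
    tau = 1.0 / (2.0 * math.pi * 2e9)   # assumed 2-GHz channel bandwidth
    cursors = rc_pulse_cursors(ui, tau, n_post=20)
    for taps in (0, 1, 3):
        print(f"{taps}-tap DFE: eye = {worst_case_eye(cursors, taps):+.3f}")
```

With these assumed numbers the post-cursors sum to more than the main cursor, so the unequalized eye is closed, and removing the first few post-cursors reopens it. An RC channel sampled at the pulse peak has no pre-cursors, which is why only the DFE is needed in this simplified case.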



Fig. 1.7 Eye diagram with ISI.

#### B. Clock data recovery

Besides equalization, clock and data recovery is another critical function of electrical links. In the clock-data recovery loop, a phase detector (PD) detects the phase error between the clock and the data, and the loop then adjusts the clock phase. Both the bang-bang PD and the linear PD are very popular. The simple bang-bang PD reports only the polarity of the clock-data phase error, so an ideal bang-bang PD has infinite gain. For this reason, a bang-bang PD usually causes large jitter on the recovered clock, because the clock phase is always adjusted by a constant step no matter how small the clock-data phase error is. A linear PD detects the clock-data phase error quantitatively, and its output is proportional to the phase error. Once the clock and data are aligned, the output of the linear PD disturbs the recovered clock very little, and good jitter performance is realized.
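The difference between the two phase detectors can be illustrated with a noise-free, first-order behavioral loop. The sketch below is a toy model under stated assumptions, not the loop analyzed later in this thesis: phases are expressed in UI, the input phase is static, and the names `track`, `gain` and `residual_pp` as well as the numeric values are hypothetical.

```python
def bang_bang_pd(err):
    """Report only the polarity of the phase error."""
    return 1.0 if err >= 0 else -1.0


def linear_pd(err):
    """Report an output proportional to the phase error."""
    return err


def track(pd, phase_in=0.32, gain=0.05, n_steps=400):
    """First-order loop: the clock phase is corrected by gain * pd(err).

    Phases are in UI. Returns the phase error after every update.
    """
    phase_clk = 0.0
    errors = []
    for _ in range(n_steps):
        err = phase_in - phase_clk
        phase_clk += gain * pd(err)
        errors.append(phase_in - phase_clk)
    return errors


def residual_pp(errors, tail=50):
    """Peak-to-peak phase error over the last `tail` updates."""
    settled = errors[-tail:]
    return max(settled) - min(settled)


if __name__ == "__main__":
    print("bang-bang residual (UI):", residual_pp(track(bang_bang_pd)))
    print("linear residual (UI):   ", residual_pp(track(linear_pd)))
```

In this toy model the bang-bang loop keeps stepping back and forth by one full correction step after lock, while the linear loop's correction shrinks with the error and the residual decays toward zero. Real loops also contain noise, latency and a loop filter, which this sketch deliberately omits.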

#### C. Design challenges

As the data rate of electrical links increases, transceiver design becomes increasingly challenging. In many applications, the quality of the physical channels does not improve as fast as the required data rate increases. Therefore, the channels have more loss at the Nyquist frequency (half of the data rate), and the equalizers must work at a higher frequency and provide more equalization ability. For a CTLE, extending the working frequency further suppresses its DC gain, and multiple stages are required to achieve enough boosting ability. More power is consumed, and the SNR is degraded. The FFE also suffers from power and SNR penalties. For a small delay, the 1-UI delay cell consumes more power and is more sensitive to process, voltage, and temperature (PVT) variations. As mentioned above, the DFE tap-1 loop must meet the stringent timing constraint of 1 UI. When the data rate goes higher, the delay of all blocks in the loop must be minimized at the cost of large power consumption. The slicer contributes most of the loop delay, especially for small inputs. Variable gain amplifiers (VGAs) are often employed to amplify the signal before slicing [7]. However, the tradeoff among the bandwidth, gain and power of the VGA requires careful attention. Clock and data recovery also faces design challenges. The linear PD is preferred for good jitter performance. Unlike the bang-bang PD, which only detects the polarity of the phase error, the linear PD must also detect the phase error quantitatively. The linear PD is usually more complicated and more power consuming, and its linear range decreases as the data rate increases [8]. The tradeoff between power consumption and linear range should be considered carefully in a linear PD.

According to the above discussion, considerable effort must be put into circuit optimization to achieve a higher data rate and better power efficiency simultaneously [9]. Novel system topologies or circuit techniques can be proposed to break the above tradeoffs by taking advantage of new phenomena that occur in high-speed data links [10].

To further break the tradeoff, signal modulation can be employed. Compared with 2-level NRZ signaling, 4-level PAM4 signaling carries twice as many bits per symbol. In other words, the working frequency of a PAM4 transceiver is half that of an NRZ transceiver at the same data rate. PAM4 signaling was proposed around 20 years ago [11, 12] and has attracted great attention over the past five years [13-15], because the cost and performance of advanced CMOS processes make NRZ signaling unsustainable in the long run. PAM4 transceivers have become the R&D focus in both industry and academia. Their compatibility with NRZ transceivers makes them even more attractive. So far, many reported 56-Gb/s transceivers have adopted PAM4 signaling [15-17]. PAM4 signaling has become the most promising solution for the next-generation 200 GbE and even 400 GbE [18].

Even though PAM4 signaling doubles the bandwidth efficiency, it also brings new challenges that must be dealt with. Because of the amplitude modulation, the front end should be linear to avoid compressing the top and bottom eyes. The DFE should also be modified to have different feedback coefficients for different input levels [17]. The PAM4 signal should be decoded back to NRZ streams so that it can be processed further by the subsequent digital blocks, and how to perform PAM4 decoding efficiently is worth studying. Since the PAM4 signal has three eyes, it is more sensitive to the bandwidth effect, and adaptive equalizers are required. In addition, the PAM4 signal has many more kinds of data transitions than the NRZ signal, so an efficient PD implementation is also very important to the design of a PAM4 clock and data recovery circuit.

PAM4 signaling introduces higher-order modulation to electrical links for the first time, and its design challenges have been introduced above. How about migrating from PAM4 to PAM8? PAM8 can further lower the required bandwidth of channels and circuits, but it faces more stringent design challenges than PAM4 signaling. Because of the limited voltage headroom, a PAM8 signal must have a much smaller eye opening and signal-to-noise ratio (SNR), which imposes a low-noise requirement on the circuit design; a PAM8 signal also suffers from linearity issues, and the nonlinearity may significantly compress the top and bottom eyes, which further degrades the SNR; because of the small eye opening, decoding PAM8 to NRZ requires accurate reference voltages; and more accurate equalization is needed to avoid the eye-opening degradation caused by over peaking and under peaking. In general, PAM8 design has stricter requirements.
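The bandwidth and eye-height trade-off discussed above can be put into numbers with a short calculation. The sketch below assumes ideal PAM-M signaling with equally spaced levels and the same peak-to-peak swing for every modulation order; the function names are illustrative and the 56-Gb/s rate is simply the rate quoted in the text.

```python
import math

# Ideal PAM-M bookkeeping: Nyquist frequency and vertical eye-height penalty
# relative to NRZ, assuming equally spaced levels and equal peak-to-peak swing.

def nyquist_ghz(data_rate_gbps, levels):
    """Nyquist frequency = half of the symbol rate."""
    bits_per_symbol = math.log2(levels)
    return data_rate_gbps / (2.0 * bits_per_symbol)


def eye_height_penalty_db(levels):
    """PAM-M stacks (M - 1) eyes into the swing that NRZ uses for one eye."""
    return 20.0 * math.log10(levels - 1)


for m in (2, 4, 8):
    print(f"PAM{m}: Nyquist at 56 Gb/s = {nyquist_ghz(56.0, m):.2f} GHz, "
          f"eye-height penalty = {eye_height_penalty_db(m):.2f} dB")
```

Going from PAM4 to PAM8 lowers the Nyquist frequency by only one third while the ideal eye height shrinks from 1/3 to 1/7 of the NRZ eye, which is consistent with the stricter noise, linearity and reference requirements listed above.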

#### **1.2 Thesis Organization**

This thesis introduces receiver designs for high-speed electrical links. Three receivers are reported in Chapters II to IV. Chapter II introduces a 24-Gb/s PAM4 receiver with a decoder based on an adaptive variable-gain rectifier. The proposed rectifier-based decoder performs PAM4-to-NRZ decoding adaptively and efficiently. Chapter III introduces a 26-Gb/s NRZ receiver with embedded equalization, which achieves medium equalization ability efficiently. The embedded equalizers, including an FFE and a DFE, perform both data equalization and edge equalization. The proposed linear PD is also presented. Chapter IV introduces a 56-Gb/s PAM4 receiver. The equalization techniques of Chapters II and III are modified and then used in this receiver. CTLEs with adaptation are employed to increase the equalization ability. A bang-bang PD with data-transition selection is implemented to recover the clock and data. Chapter V introduces the proposed adaptation algorithms for both the CTLE and the DFE. Behavior-level model simulation demonstrates the effectiveness of the algorithms for different channels. Chapter VI summarizes the works introduced in Chapters II to IV and discusses future work.

#### Chapter II 24-Gb/s PAM4 Receiver Design

#### 2.1 Introduction

As mentioned in Chapter I, the global IP traffic triples every five years and is projected to be over 200 exabytes per month by 2020 [1]. In the past decades, to support the data boom, electrical links have adopted non-return-to-zero (NRZ) signaling due to its simplicity and the advancement of CMOS technology. But now, CMOS technology scaling is meeting challenges such as unsustainable cost and the ending of Moore's Law. NRZ signaling is losing its attraction because future electrical links require higher data rates. Four-level pulse amplitude modulation (PAM4) signaling with doubled bandwidth (BW) efficiency has become the most likely solution for the next-generation Ethernet. Therefore, power-efficient PAM4 transceivers are highly desired to save the cost of hyperscale data centers. Besides equalization and clock data recovery, PAM4 receivers also require a PAM4-to-NRZ decoder for further digital processing. Both analog-to-digital converter (ADC) based receivers [16, 19] and mixed-signal receivers [15, 20] have been reported. In ADC based receivers, most of the PAM4 signal processing is done in the digital domain, which facilitates the decoder design and the implementation of advanced equalization. In addition, ADC based receivers have good design flexibility and process portability, but an inferior bit efficiency (the energy consumed to transmit and receive one bit) of ~10 pJ/bit [19]. In contrast, mixed-signal receivers can achieve a better bit efficiency of <4 pJ/bit by employing power-efficient analog circuit techniques [15]; therefore, they are more attractive for low-power designs (with a bit efficiency lower than 5 pJ/bit). Decoders in mixed-signal receivers are usually implemented with three comparators and the subsequent thermometer-to-binary (T2B) logic. One comparator without extra references decodes the most significant bit (MSB) from the PAM4 signal, while the other two comparators with amplitude-proportional references $\pm V_{ref}$ decode the least significant bit (LSB).
To accommodate different input amplitudes, V<sub>ref</sub> should be generated adaptively, or it can be a constant value with the help of a variable gain amplifier (VGA). In [17], the adaptively generated V<sub>ref</sub> is equal to 2/3 of the detected peak-to-peak amplitude. In [21], the adaptively generated V<sub>ref</sub> is equal to the vertical opening of the PAM4 middle eye, but the introduced analog adaptation path consumes extra power. For a decoder with a constant V<sub>ref</sub>, the VGA amplifies the input signal amplitude to (3/2)V<sub>ref</sub>. In all the methods above, nonlinearity degrades the performance of LSB decoding because of the small PAM4 eye opening. In a full-rate PAM4 receiver, current-mode logic (CML) circuits are usually adopted to achieve high-speed operation; therefore, the VGA and the decoder have to sacrifice more power in the tradeoff between power and speed. Sub-rate topologies are preferred, especially when the data rate is close to the limit of the process. For a 1/4-rate topology, although the number of 1/4-rate blocks is quadrupled, the total power consumption does not have to increase, since the 1/4-rate blocks can adopt more power-efficient voltage-mode logic (VML) instead of CML. For example, the power consumption of a strong-arm latch based comparator is proportional to its speed and far less than that of a CML configuration [22]. To further save power, merging functions into one block is effective, for example merging a VGA into a decoder.

In this chapter, a power-efficient PAM4 receiver employing a 1/4-rate topology and an adaptive variable-gain rectifier (AVGR) based decoder is presented [23]. VML circuits are used in the 1/4-rate blocks for power saving. The VGA function and the decoder comparators are merged into an AVGR, which performs PAM4-to-NRZ decoding by taking advantage of the PAM4 amplitude information. With amplification and rectification, the AVGR converts the two kinds of amplitudes of a PAM4 signal to two voltage levels with amplified level spacing. The V<sub>ref</sub> that differentiates these two voltage levels for decoding therefore has a larger margin, which alleviates the degradation of LSB decoding that a 3-comparator decoder suffers because of the small eyes and the nonlinearity of a PAM4 signal. Therefore, the AVGR can be viewed as a special amplifier which only amplifies the opening of the PAM4 top and bottom eyes.

For PAM4 inputs with different amplitudes, the AVGR adaptively adjusts its gain to produce outputs with the same swing, thereby leading to a constant  $V_{ref}$  for LSB decoding.



Fig. 2.1 (a) Coding of PAM4 levels, (b) 1/4-rate PAM4 receiver with a VGA and 3-comparator based decoder, and (c) proposed receiver utilizing an adaptive variable-gain rectifier for the decoder.

#### 2.2 Receiver Decoder Architecture

PAM4 levels are usually coded in binary, as shown in the left-hand codes in Fig. 2.1(a). The binary MSB (M<sub>B</sub>) is determined by the level polarity and can be decoded through a comparator without extra reference. The binary LSB (L<sub>B</sub>) decoding should be performed by two comparators with opposite references  $\pm V_{ref}$ . The outputs of the three comparators are thermometer codes and should be further converted to MSB and LSB by T2B logic. Fig. 2.1(b) shows a 1/4-rate receiver topology with a VGA and a 3-comparator based decoder. A continuous-time linear equalizer (CTLE) pre-conditions the PAM4 signal. As per the discussion about 1/4-rate receivers in the previous section, voltage-mode comparators consisting of a strong-arm latch (SAL) and an SR latch are preferred. The middle comparator for the MSB decoding in Fig. 2.1(b) has a large gain and does not require a VGA function if its offset is calibrated. Therefore, the VGA can merge with the top and bottom comparators to achieve better power efficiency.

Besides binary codes, PAM4 levels can also be treated as Gray codes, as shown on the right-hand side of Fig. 2.1(a). The Gray code MSB ($M_G$) is the same as $M_B$, while the Gray code LSB ($L_G$) is determined by the signal amplitude. In addition, the amplitude of the PAM4 levels with $L_G = 1$ is three times that of the levels with $L_G = 0$. Based on this observation, a rectifier is feasible for LSB decoding. As shown in Fig. 2.1(a), levels 11 and 01 are rectified to a level corresponding to $L_G = 1$, while levels 10 and 00 are rectified to a level corresponding to $L_G = 0$. By differentiating the two rectified levels with a reference voltage $V_{ref}$, the LSB can be decoded. Fig. 2.1(c) shows the proposed 1/4-rate receiver, where the VGA and the two comparators for LSB decoding in Fig. 2.1(b) are replaced with an AVGR and a comparator CMP2. The AVGR consists of three parts: a variable-gain sense amplifier (SA), a rectifier, and a gain adaptation block. The SA amplifies the sampled PAM4 signal adaptively, and its output is rectified by the following rectifier to two voltage levels corresponding to LSB = 1 and 0. The subsequent CMP2 with $V_{ref}$ differentiates these two voltage levels and then produces the LSB. The SA is a linear SAL; therefore, the power consumption of the SA and the rectifier is similar to that of a comparator. For this reason, the power consumption of the AVGR based decoder is similar to that of a 3-comparator based decoder, and the power consumption of the receiver in Fig. 2.1(c) is lower than that of the receiver in Fig. 2.1(b) since the VGA is eliminated. In addition, no extra logic is required after CMP2, while T2B conversion is necessary in Fig. 2.1(b). For binary coded PAM4 levels, the proposed receiver requires an XNOR gate according to the binary and Gray code conversion equations in Fig. 2.1(a). Furthermore, the AVGR based decoder presents a smaller load to its preceding stage than its 3-comparator based counterpart. In Fig. 2.1(c), the 1/4-rate clocks are generated by an on-chip four-stage injection locked ring oscillator (ILRO) with a voltage-controlled delay line (VCDL) and a pulse generator (PG) in the injection path. The injected source-synchronous clock improves the jitter performance of the ILRO output clocks.
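The decoding idea can be summarized in a few lines of behavioral code. This is a minimal sketch, not the circuit: levels are normalized to a peak amplitude of 1 (so the four ideal levels are ±1 and ±1/3), the threshold of 2/3 is applied directly to the magnitude of the input rather than to the amplified rectifier output, and the level-to-code mapping is inferred from the description above because Fig. 2.1(a) is not reproduced here. The function names are hypothetical.

```python
# Behavioral sketch of rectifier-based PAM4 decoding.
# Gray code (M_G, L_G): the sign gives the MSB, the magnitude gives the LSB.
# A binary coded LSB is obtained with one XNOR gate.

def rectifier_decode(level, v_ref=2.0 / 3.0):
    """Return (M_G, L_G) for a PAM4 sample normalized to a peak amplitude of 1."""
    msb = 1 if level > 0 else 0                # polarity comparator (MSB path)
    lsb_gray = 1 if abs(level) > v_ref else 0  # rectify, then compare with V_ref
    return msb, lsb_gray


def gray_to_binary(msb, lsb_gray):
    """Binary LSB = XNOR(MSB, Gray LSB); the MSB is unchanged."""
    lsb_binary = 1 - (msb ^ lsb_gray)
    return msb, lsb_binary


for level in (1.0, 1.0 / 3.0, -1.0 / 3.0, -1.0):
    gray = rectifier_decode(level)
    binary = gray_to_binary(*gray)
    print(f"level {level:+.2f}: Gray {gray}, binary {binary}")
```

A single magnitude comparison replaces the two comparators with references $\pm V_{ref}$, which is why no thermometer-to-binary logic is needed for Gray coded data.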

#### 2.3 Building Blocks

In this section, the design details of several key building blocks are introduced.

#### 2.3.1 CTLE



Fig. 2.2 Eye opening degradation caused by limited bandwidth for (a) NRZ, and for the PAM4 (b) top eye, (c) middle eye and (d) bottom eye.

Considering the tradeoff between BW-induced inter-symbol interference (ISI) and noise BW, NRZ analog front ends (AFEs) usually choose a BW of 0.7R<sub>b</sub>, where R<sub>b</sub> is the baud rate. A first-order RC low-pass filter (LPF) with a BW of 0.7R<sub>b</sub> is used to emulate the NRZ AFE. Fig. 2.2(a) shows how the minimum eye opening is generated by two kinds of data sequences: one '1' among several '0's, and one '0' among several '1's. Here, the vertical eye opening degradation caused by the limited BW, VEOD<sub>BW</sub>, is defined as the ratio between the error amplitude and the ideal eye opening, and VEOD<sub>BW</sub> = $2\varepsilon$ in Fig. 2.2(a). Let us consider the case of using the above RC LPF to process a PAM4 signal at the same baud rate. Fig. 2.2(b)-(d) show how the minimum eye opening is generated for the three PAM4 eyes. In Fig. 2.2(b), the top VEOD<sub>BW</sub> = $(\varepsilon + \varepsilon/3)/(1/3) = 4\varepsilon$. Using the same calculation method, VEOD<sub>BW</sub> = $4\varepsilon$ for both the middle and bottom eyes. Therefore, the effect of the limited BW on the VEOD is doubled for PAM4. Besides the vertical degradation, the limited BW also causes a horizontal eye opening degradation, HEOD<sub>BW</sub>. Compared with the NRZ eye in Fig. 2.2(a), the width of the middle PAM4 eye in Fig. 2.2(c) is smaller. The dashed line in Fig. 2.2(c) is the optimal reference for both the vertical and horizontal openings, while the dashed lines in Fig. 2.2(b) and (d) are optimal only for the vertical opening of the top and bottom eyes. The middle eye determines the MSB of the PAM4 signal, while the LSB is decided by the top and bottom eyes. The vertical asymmetry of the top and bottom eyes means that the BER performance of a PAM4 receiver is largely determined by the LSB part.
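The $2\varepsilon$ and $4\varepsilon$ results can be checked numerically. The sketch below is a minimal model under stated assumptions: $\varepsilon$ is taken to be the fraction of a step that a first-order RC filter has not yet settled after one UI (the thesis defines $\varepsilon$ graphically in Fig. 2.2, so this interpretation is an assumption), the worst case for each eye edge is a transition from the farthest fully settled rail, and levels are normalized to the range 0 to 1. The function names are illustrative.

```python
import math

# Worst-case vertical eye opening degradation (VEOD_BW) for a first-order RC
# low-pass filter, normalized to the ideal eye opening of each eye.

def residual_after_one_ui(bw_over_baud):
    """Unsettled fraction of a step after 1 UI: exp(-2*pi*BW*T)."""
    return math.exp(-2.0 * math.pi * bw_over_baud)


def veod(levels, eps):
    """Return the VEOD of every eye between adjacent levels, bottom to top."""
    ordered = sorted(levels)
    low_rail, high_rail = ordered[0], ordered[-1]
    result = []
    for lo, hi in zip(ordered, ordered[1:]):
        # Upper level reached from the lowest rail: it falls short by eps*(hi - low_rail).
        lower_edge_of_hi = hi - eps * (hi - low_rail)
        # Lower level reached from the highest rail: it stays high by eps*(high_rail - lo).
        upper_edge_of_lo = lo + eps * (high_rail - lo)
        ideal = hi - lo
        result.append((ideal - (lower_edge_of_hi - upper_edge_of_lo)) / ideal)
    return result


eps = residual_after_one_ui(0.7)               # BW = 0.7 x baud rate
nrz = veod([0.0, 1.0], eps)
pam4 = veod([0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0], eps)
print(f"eps = {eps:.4f}")
print("NRZ  VEOD / eps:", [round(v / eps, 3) for v in nrz])
print("PAM4 VEOD / eps:", [round(v / eps, 3) for v in pam4])
```

With a BW of 0.7R<sub>b</sub> the residual is on the order of one percent, so the absolute degradation is small for both formats, but the ratio between PAM4 and NRZ remains a factor of two for every eye.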

For an analog equalizer, frequency peaking comes from either a zero or a low damping factor. Over peaking in the frequency domain may result in overshoot in the time domain. The overshoot caused by a zero attenuates exponentially after reaching its peak. However, the overshoot resulting from a low damping factor generates ringing after reaching its peak, and the ringing envelope decays exponentially. Since the attenuation is very quick in both cases, only the main lobe deteriorates the eye diagram. For the NRZ eye diagram, the overshoot lies outside the eye, and the deterioration of the eye opening is negligible. However, the PAM4 eye diagram has three eyes, and the eye opening deteriorates noticeably because of the overshoot on the two middle levels.



Fig. 2.3 (a) CTLE schematic diagram, (b) simulated frequency response, and (c) NRZ and (d) PAM4 eye diagrams when the CTLE over peaks by 2 dB.

According to the analysis above, the PAM4 signal is much more bandwidth-sensitive (to both limited BW and over peaking) than the NRZ signal, and the well-known 9.5-dB degradation from NRZ to PAM4 is an underestimate once the BW effect is taken into consideration. The CTLE plays a key role in adjusting the BW of the AFE. In this work, the AFE is a one-stage CTLE, and Fig. 2.3(a) shows the schematic. The source degeneration RC network generates a zero-pole pair, which produces frequency peaking by suppressing the DC gain. A customized 1.5-nH inductor with a compact area of $40 \times 40\ \mu\text{m}^2$ is adopted for shunt peaking to further enhance the BW. The simulated frequency response in Fig. 2.3(b) shows that a peaking of 5.2 dB at 10 GHz is achieved. As discussed in the previous paragraph, over peaking affects PAM4 and NRZ signals differently, and Fig. 2.3(c) and (d) show the NRZ and PAM4 eye diagrams in response to a 2-dB over peaking from the CTLE. The NRZ eye opening does not deteriorate since the overshoot only appears outside the eye. The four PAM4 levels respond differently to over peaking. The top and bottom levels only show overshoot on one side, while the middle levels have unequal overshoot on both sides. The overshoot squeezes the eye opening, and the time instant corresponding to the optimal BER shifts away from 0.5 UI. Therefore, slicing at the 0.5-UI instant is still optimal for NRZ data but is no longer optimal for PAM4 data when over peaking occurs. Equalization of the PAM4 signal requires additional care.
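The 9.5-dB figure quoted above follows from the eye height alone: with the same peak-to-peak swing, each ideal PAM4 eye is one third of the NRZ eye. Folding in the VEOD<sub>BW</sub> values of the first-order RC model ($2\varepsilon$ for NRZ and $4\varepsilon$ for PAM4) gives a rough estimate of how the penalty grows. This combination is a back-of-the-envelope step added for illustration rather than a result stated in this work, and it ignores over peaking and the horizontal eye closure.

$$
P_{\text{ideal}} = 20\log_{10}\frac{1}{1/3} = 20\log_{10}3 \approx 9.54\ \text{dB}
$$

$$
P_{\text{BW}} = 20\log_{10}\frac{1-2\varepsilon}{(1-4\varepsilon)/3} = 9.54\ \text{dB} + 20\log_{10}\frac{1-2\varepsilon}{1-4\varepsilon} > 9.54\ \text{dB}, \qquad 0 < \varepsilon < \tfrac{1}{4}
$$

The second term is positive for any nonzero $\varepsilon$, which is the sense in which the ideal 9.5-dB number underestimates the real degradation.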

#### **2.3.2 AVGR**

Fig. 2.4(a) and (b) are the schematic diagrams of the VGR, which includes a variable-gain SA and a rectifier. The SA is modified from an SAL. If M3-M6 are not included in Fig. 2.4(a), the SA is a dynamic amplifier with a small gain and a large output common-mode (CM) drop, which adds a large undesired DC component to the transfer curve of the VGR. M3-M4 isolate OUT<sub>p/n</sub> from nodes X/Y to solve the output CM drop issue [24]. M5-M6 alleviate the CM drop by restoring one output to VDD when the output CM drop is significant.

M3-M6 form a latch, which increases the gain of the SA. Compared with a differential amplifier, a linear SAL based SA has the following characteristics: 1) the gain is higher and the output swing is larger owing to the inherent latch M3-M6; and 2) there is no static current in the SA, so the power consumption is smaller. To amplify the PAM4 signal, the SA should be linear. The SAL based SA is kept in its linear range by adjusting the gain to limit the output swing, thereby preventing the inherent latch from entering a deeply nonlinear state. As Fig. 2.4(a) shows, the variable gain of the SA is implemented with an adjustable input transistor size for coarse adjustment and adjustable output loading capacitors for fine adjustment [25]. The rectifier operation is based on a charging and discharging process. The SA drives ML/MR to generate a current that charges C<sub>rec</sub>, and C<sub>rec</sub> is discharged to ground before the next bit is processed. The rectifier can work quickly because there is no RC limitation. Good power efficiency is also achieved by adopting a small C<sub>rec</sub> (< 20 fF).



Fig. 2.4 Schematic diagram of the VGR: (a) variable-gain SA and (b) rectifier.

Fig. 2.5 shows the timing diagram of one branch VGR for four Gray coded PAM4 input levels. After sampling, the SA outputs  $OUT_{p/n}$  make transistors ML/MR in the rectifier conduct for part of the holding time  $T_{HLD}$ , and the current is integrated on the rectifier capacitor  $C_{rec}$  [26]. Once  $T_{HLD}$  is over, the SA outputs  $OUT_{p/n}$  are reset to VDD and  $C_{rec}$  holds the voltage  $V_{rec}$  on it for 1 UI for the following comparison with  $V_{ref}$  in the comparator CMP2. Before the next  $T_{HLD}$  starts, the rectifier  $C_{rec}$  discharges to ground preparing itself for the next bit. As Fig. 2.5 shows, for the PAM4 level x1, the SA output  $OUT_p$  or  $OUT_n$  has a large swing and drops to  $V_{min1}$ , and the rectifier produces a high pulse with a swing of  $V_{max1}$ ; and for the PAM4 level x0, the SA output has a smaller swing  $(V_{min0})$ , and the swing of the output pulse of the rectifier decreases to  $V_{max0}$ . Therefore,  $V_{rec}$  is a series of pulses with two voltage levels  $V_{max0}$  and  $V_{max1}$ , which correspond to LSB = 0 and LSB = 1, respectively. A reference  $V_{ref}$  between  $V_{max1}$  and  $V_{max0}$  will decode the LSB of the PAM4 input.



Fig. 2.5 Timing diagram of VGR.



Fig. 2.6 Simulated transfer curves of a VGR. (a) SA gain set for a maximum input of 80 mV<sub>pp</sub> and (b) SA gain set for a maximum input of 260 mV<sub>pp</sub>.

In a comparator, the SAL is nonlinear, and both $V_{min0}$ and $V_{min1}$ are zero. For the SAL based SA in the AVGR, simulation shows that $V_{min1}$ should be over 150 mV for good linearity, and it can be set by adjusting the SA gain. Once the SA output has been determined, the swing of $V_{rec}$ can be further adjusted by controlling the size of ML/MR of the rectifier. To characterize the VGR, the following simulation is performed. For an 80-mV<sub>pp</sub> periodic square-wave input, the gain of the SA is adjusted so that $V_{min}$ is about 260 mV, and the gain of the rectifier is also adjusted so that $V_{max}$ is around 850 mV. While maintaining the gain of the SA and the setting of the rectifier, the amplitude of the input square wave is swept from 80 mV<sub>pp</sub> down to 20 mV<sub>pp</sub>, and the $V_{max}$ of the rectifier is monitored. As shown in Fig. 2.6(a), the simulated transfer curve is quite linear, and the VGR gain is up to 19 dB. For an 80-mV<sub>pp</sub> PAM4 input, the voltage difference between the two levels of $V_{rec}$ is more than 400 mV, which means a large voltage margin for $V_{ref}$. To accommodate larger inputs, the gain of the SA should be set smaller. Fig. 2.6(b) shows the VGR transfer curve for a maximum input amplitude of 260 mV<sub>pp</sub>; it is also quite linear, with a gain of 12 dB.



Fig. 2.7 Block diagram of gain adaptive control of the SA.



Fig. 2.8 The adaptation process when a 10-fF loading mismatch at the SA output exists.



Fig. 2.9 SA and rectifier outputs under (a)  $80~\text{mV}_{pp}$  and (b)  $260~\text{mV}_{pp}$  PAM4 input after gain adaptation.

Gain adaptation is highly desirable for PAM4 receivers. For different input amplitudes, the gain adaptation adjusts the gain of the SA to guarantee that the SA works linearly and that the swing of its output ($V_{min1}$, refer to Fig. 2.5) is almost the same. The adjustable size of ML/MR of the rectifier is used to overcome process variation, and the size setting is fixed after an initial calibration. Therefore, the same $V_{min1}$ of the SA outputs OUT<sub>p/n</sub> leads to the same $V_{max1}$ of the rectifier output $V_{rec}$, and hence to a constant $V_{ref}$ for LSB decoding. As shown in Fig. 2.7, the gain adaptation block includes two parts: a self-resetting SR latch and control logic. The SR latch, working as an interface from analog to digital, monitors the SA outputs OUT<sub>p</sub> and OUT<sub>n</sub> alternately, and its output drives the logic, which digitally controls the gain of the SA through the input transistor size and the output loading capacitance. The SR latch is triggered if $V_{min1}$ of the SA outputs is lower than its trigger voltage $V_{tg}$. At first, $V_{min1}$ is small, and the triggered SR latch produces a pulse that drives the counters (CNT1,2) in the digital logic, after which the latch is reset. An increase of CNT1,2 means a larger capacitive loading at the SA output and thus a smaller SA gain. Once CNT1 or CNT2 reaches its maximum, both are reset, and CNT3 decreases by 1 to reduce the input transistor size and hence the gain of the SA. $V_{min1}$ increases as the gain of the SA decreases. When $V_{min1}$ reaches $V_{tg}$, CNT1,2 stop changing; after they have held their values for a certain time, MNT produces a signal to lock CNT1,2,3, and the gain adaptation process ends. $V_{tg}$ of the SR latch determines $V_{min1}$ (the swing of the SA output). To cover process variation, $V_{tg}$ has a design margin of about 200 mV.
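The coarse/fine search performed by the adaptation loop can be sketched behaviorally as follows. This is a simplified model, not the implemented logic: the two capacitor counters CNT1 and CNT2 are merged into a single `cap_code`, the lock timer (MNT) is omitted, and the SA is replaced by a hypothetical linear model in which the output drops from VDD by `gain * vin`. The constants `GAIN_PER_UNIT`, `CAP_STEP`, the code ranges and the value of `v_tg` are illustrative assumptions.

```python
# Behavioral sketch of the SA gain adaptation: fine steps increase the output
# capacitor code; when it saturates, it is reset and the input transistor size
# code (coarse step) is reduced by one. All constants are assumptions.

VDD = 1.0            # supply voltage in volts
GAIN_PER_UNIT = 2.0  # hypothetical gain contributed by one input-transistor unit
CAP_STEP = 0.03      # hypothetical fractional gain reduction per capacitor code


def sa_vmin1(vin, size_code, cap_code):
    """Hypothetical SA model: lowest output voltage for the outer PAM4 level."""
    gain = GAIN_PER_UNIT * size_code / (1.0 + CAP_STEP * cap_code)
    return max(0.0, VDD - gain * vin)


def adapt_gain(vin, v_tg, size_max=7, cap_max=15):
    """Lower the gain until V_min1 is no longer below the trigger voltage."""
    size_code, cap_code = size_max, 0          # start from the maximum gain
    while sa_vmin1(vin, size_code, cap_code) < v_tg:  # SR latch triggered
        if cap_code < cap_max:
            cap_code += 1                      # fine step: more load capacitance
        elif size_code > 1:
            size_code -= 1                     # coarse step: smaller input device
            cap_code = 0
        else:
            break                              # out of tuning range
    return size_code, cap_code


for vin in (0.08, 0.26):
    size_code, cap_code = adapt_gain(vin, v_tg=0.25)
    v = sa_vmin1(vin, size_code, cap_code)
    print(f"vin = {vin:.2f} V: size code {size_code}, cap code {cap_code}, V_min1 = {v:.3f} V")
```

In this model both input amplitudes settle to nearly the same $V_{min1}$ just above the trigger voltage, which is the property that allows a constant $V_{ref}$ downstream.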

As discussed above, monitoring the SA output OUT<sub>p</sub> and OUT<sub>n</sub> alternately means that some mismatch of the SA can also be calibrated during the gain adaptation. Fig. 2.8 shows the adaptation process when there is a 10-fF loading mismatch at the SA output. Initially,  $V_{min1}$  of OUT<sub>p</sub> and OUT<sub>n</sub> are different due to the loading mismatch, and still trigger the SR latch leading to the decrease of the gain of the SA. Finally, both the  $V_{min1}$  of OUT<sub>p</sub> and OUT<sub>n</sub> converge to the  $V_{tg}$  of the self-resetting SR latch in Fig. 2.7. The 10-fF mismatch is calibrated by the difference of the final converged control words of the capacitor array at the SA output. Fig. 2.9 shows the outputs of the SA and the rectifier after gain adaptation with 80-mV<sub>pp</sub> and 260-mV<sub>pp</sub> PAM4 inputs. The top two figures show the adaptation block makes the SA output the same  $V_{min1}$  for different input amplitudes. The bottom two figures show the same  $V_{min1}$  means the same voltage level of LSB = 1 ( $V_{max1}$ ) because of the fixed setting of the rectifier. A  $V_{ref}$  between the two voltage levels differentiates them to decode LSB. For the input amplitude range from 80 mV<sub>pp</sub> to 260 mV<sub>pp</sub>,  $V_{ref}$  can be a constant value and the margin is always larger than 400 mV.

To achieve fine tuning of the SA gain, the unit capacitance of the adjustable LCA in Fig. 2.4(a) must be less than 1 fF, so MOS capacitors (MOSCAPs) are preferred. The binary-controlled PMOSCAPs shown in Fig. 2.10(a) are utilized for the offset calibration of the SA. An ideal binary-controlled MOSCAP should have a high ratio between its capacitance with the switch (SW) on and with the switch off. Unlike a MOSCAP used for supply decoupling, the MOSCAP in this design sees a voltage swing of several hundred mV, so it cannot always stay in the strong inversion region, which decreases the effective capacitance. Because the MOSCAP behavior is process-dependent, the NMOSCAP in Fig. 2.10(b) is also considered. The simulated small-signal capacitance of the two kinds of MOSCAP at different bias voltages is shown in Fig. 2.10(c). The capacitance difference at $V_{A\_DC} = 1$ V between the SW on and off states is about 0.3 fF for both the NMOSCAP and the PMOSCAP, but the PMOSCAP introduces more capacitance when the SW is off. On the other hand, the PMOS has a higher threshold voltage than the NMOS in this technology, so the PMOSCAP leaves the strong inversion region earlier than the NMOSCAP as $V_{A\_DC}$ decreases. Therefore, it is easier to achieve a larger tuning range and a smaller fixed capacitive loading with the NMOSCAP in this design.



Fig. 2.10 Binary controlled (a) PMOS capacitor and (b) NMOS capacitor, and (c) simulated equivalent capacitance at different bias voltages.

#### 2.3.3 ILRO



Fig. 2.11 (a) Schematic of the ring oscillator with four stages of delay cells, and (b) delay cell stage with clock injection part.

Using an ILRO to generate multiple clock phases is a very effective method and has been adopted extensively [27]. In this design, a four-stage RO is implemented to generate eight clock phases; four of them are used for data sampling, and the remaining four are loaded with dummies for matching. Fig. 2.11 shows the schematic of the RO and the delay cell. The oscillation frequency of the RO is controlled by adjusting the biasing current, and an external variable resistor R<sub>EX</sub> controls the biasing current after mirroring. The RO is biased at both the top and the bottom sides for better symmetry of the output rising and falling edges. The delay cell in Fig. 2.11(b) includes main delay inverters INV<sub>M</sub>, feedback inverters INV<sub>F</sub> and an injection stage. The delay of the INV<sub>M</sub> inverters decides the RO oscillation frequency, and the INV<sub>F</sub> inverters speed up the output transitions and make the outputs more differential. The single-ended 1/4-rate input clock performs fundamental injection after passing through the VCDL and the PG, which generates a pair of narrow differential pulses. In Fig. 2.11(b), M<sub>N3</sub> and M<sub>P4</sub> perform the injection into the two output nodes P<sub>1</sub> and Q<sub>1</sub> [28]. To achieve better load matching when there is no injection, M<sub>N4</sub> and M<sub>P3</sub> are added. M<sub>N1,2</sub> and M<sub>P1,2</sub> are introduced to construct a configuration with four stacked transistors like the RO. During the injection, Q<sub>1</sub> and P<sub>1</sub> are forced low and high, respectively, which cleans the noise on the output clock. For an ILRO, the free-running frequency of the RO and the injected frequency should be as close as possible to achieve ideal output clock phases. However, the free-running frequency drifts during a long measurement because of changes in the working temperature, and the output clock phases then deviate from their ideal values even though the fundamental injection guarantees frequency locking. Therefore, in a practical design, frequency calibration of the RO is required [29]. In this work, no calibration is included because of limited design resources and a tight schedule.
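The clocking numbers implied by the 1/4-rate topology can be worked out directly. The sketch below assumes an ideal differential four-stage ring with equally spaced phases; the choice of every other phase as a sampling phase is an assumption made for illustration, since the text only states that four of the eight phases are used for sampling. The function and key names are hypothetical.

```python
# Clock bookkeeping for a 1/4-rate PAM4 receiver with a differential ring
# oscillator. Ideal, equally spaced phases are assumed.

def quarter_rate_clocking(data_rate_gbps, bits_per_symbol, rate_divider, ring_stages):
    baud_gbd = data_rate_gbps / bits_per_symbol       # symbol rate
    clock_ghz = baud_gbd / rate_divider               # sub-rate clock frequency
    ui_ps = 1000.0 / baud_gbd                         # one unit interval
    n_phases = 2 * ring_stages                        # differential ring: 2 per stage
    phase_step_ps = (1000.0 / clock_ghz) / n_phases   # spacing between adjacent phases
    all_phases_ps = [i * phase_step_ps for i in range(n_phases)]
    sampling_phases_ps = all_phases_ps[::2]           # assumed: every other phase samples
    return {
        "baud_gbd": baud_gbd,
        "clock_ghz": clock_ghz,
        "ui_ps": ui_ps,
        "phase_step_ps": phase_step_ps,
        "sampling_phases_ps": sampling_phases_ps,
    }


clk = quarter_rate_clocking(data_rate_gbps=24.0, bits_per_symbol=2,
                            rate_divider=4, ring_stages=4)
print(f"clock = {clk['clock_ghz']:.1f} GHz, UI = {clk['ui_ps']:.2f} ps, "
      f"phase step = {clk['phase_step_ps']:.2f} ps")
print("sampling phases (ps):", [round(p, 2) for p in clk["sampling_phases_ps"]])
```

Under these assumptions adjacent ring phases are half a UI apart and the sampling phases are one UI apart, so any drift of the phase spacing translates directly into a sampling-time error.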

#### 2.4 Measurement Results

The proposed receiver is fabricated in a 28-nm high-k CMOS process. Fig. 2.12 shows the chip photograph, and the core area is $0.15 \times 0.16$ mm<sup>2</sup>. Fig. 2.13 shows the photograph and diagram of the measurement setup. The PAM4 input signal is generated by using a power combiner to combine the two-channel outputs of a programmable pattern generator (PPG). The PAM4 signal is attenuated by a 10-dB attenuator and given a new DC bias through a bias-T, and then it is fed into the receiver chip through an SGS probe. With the exception of the input pads, the pads are wire bonded, as shown in Fig. 2.14. The output NRZ eye diagrams are captured by a sampling scope, and the BERs are measured with an error detector (ED). The whole setup is synchronized through the 1/4-rate clock from the PPG.



Fig. 2.12 Receiver chip photograph.



Fig. 2.13 Photograph and diagram of the measurement setup.



Fig. 2.14 24-Gb/s PAM4 signal input for measurement.



Fig. 2.15 Power consumption breakdown.

Before measuring the receiver performance, the free-running frequency of the RO is manually tuned to 3 GHz. The 24-Gb/s 190-mV<sub>pp</sub> input PAM4 eye diagram is shown in Fig. 2.14, and the ripple on it may result from the imperfect frequency response of the transmitter in the PPG. The chip is powered from a 1-V supply, and Fig. 2.15 shows the power consumption breakdown. The total consumed power is 33 mW, and more than half is consumed by the clock-related parts because of the more complex clock distribution and larger loading. Fig. 2.16 shows the received 3-Gb/s eye diagrams of the MSB and LSB. Since they have been retimed by clocks, the eye diagrams show good opening in both the vertical and horizontal directions. The measured rms jitters for the MSB and the LSB are 1.5 ps and 2.0 ps, respectively. The BER of the NRZ outputs is measured for the 24-Gb/s PAM4 input shown in Fig. 2.14, and the BER bathtub curve shown in Fig. 2.17 is obtained by manually tuning the input clock delay. The optimal BER is better than 10<sup>-11</sup>, and the timing margin at a BER of 10<sup>-9</sup> is 0.17 UI. A spike on the BER bathtub curve is observed at 0.5 UI. As per the discussion of the bandwidth effect in Section 2.3.1, the spike is due to the over peaking effect, which causes time-domain overshoot and thereby degrades the opening of the three PAM4 eyes and the BER. In Fig. 2.17, the optimal BER occurs only in the right half because of the left-side ripple on the input signal.





Fig. 2.16 Received 3-Gb/s (a) MSB and (b) LSB eye diagrams.

The BER performance of the receiver chip under different input amplitudes is also measured. Once the setting of the rectifier in Fig. 2.4(b) is determined, the SA of the AVGR adaptively amplifies the input. The curve in Fig. 2.18 shows the measured optimal BER versus input amplitude. As the input amplitude increases, the BER improves and reaches $10^{-11}$ at an input amplitude of 190 mV<sub>pp</sub>. During the sweep of the input amplitude, the reference voltage for LSB decoding is kept fixed, which is possible because of the gain adaptation in the AVGR.



Fig. 2.17 Measured BER bathtub curve under a 24-Gb/s  $190\text{-mV}_{pp}$  PAM4 input.



Fig. 2.18 Measured BER under different input amplitudes.

The receiver performance is summarized and compared with similar works in Table I.

Bit efficiency is a useful metric for comparing receivers since it reflects the energy cost of transferring one bit under the same conditions. A bit efficiency of 1.38 pJ/bit is achieved in this work. Table I shows a trend that the 1/4-rate topology is more power-efficient than its full-rate and 1/2-rate counterparts. The bit efficiency of [30] is better than that of this design because [30] has no on-chip clock generator and uses low-gain but power-efficient latches.
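As a quick check of the figure quoted above, bit efficiency can be computed as total power divided by data rate. The sketch below uses the 33-mW power and 24-Gb/s data rate reported in this chapter; the function name is an illustrative choice, not something defined in the thesis.

```python
def bit_efficiency_pj_per_bit(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per bit in pJ/bit (1 mW divided by 1 Gb/s equals 1 pJ/bit)."""
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    return power_mw / data_rate_gbps


# Values reported in this chapter: 33 mW total power at 24 Gb/s.
this_work = bit_efficiency_pj_per_bit(33.0, 24.0)
print(f"{this_work:.3f} pJ/bit")
```

The result, 1.375 pJ/bit, rounds to the 1.38 pJ/bit listed in Table I.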

Table I. Performance summary of 24-Gb/s PAM4 receiver and comparison with similar works.

|                                     | [20]                 | [15]                       | [30]                         | [21]                       | This work             |
|-------------------------------------|----------------------|----------------------------|------------------------------|----------------------------|-----------------------|
| Function                            | CTLE + Decoder + CDR | CTLE + DFE + Decoder + CDR | DFE + Decoder + Clock buffer | CTLE + DFE + Decoder + CDR | CTLE + Decoder + ILRO |
| Clocking                            | Full rate            | 1/2-rate                   | 1/4-rate                     | 1/4-rate                   | 1/4-rate              |
| Decoding Adaptation                 | No                   | Yes                        | No                           | Yes                        | Yes                   |
| Data Rate (Gb/s)                    | 56                   | 40-56                      | 32                           | 32                         | 24                    |
| Bit Efficiency (pJ/bit)             | 7.5                  | 4.11                       | 0.55                         | 2.5                        | 1.38                  |
| Eye Width at BER 10<sup>-9</sup>    | NA                   | 0.13 UI                    | 0.17 UI                      | 0.18 UI                    | 0.17 UI               |
| @ Input Amplitude (mV<sub>pp</sub>) | NA                   | @ 300                      | @ 300                        | @ 350                      | @ 190                 |
| CMOS Process                        | 40 nm                | 16 nm                      | 65 nm                        | 65 nm                      | 28 nm                 |
| Chip Area (mm²)                     | 1.6*                 | 0.364                      | 0.014                        | 0.16                       | 0.024                 |

<sup>\*</sup> This is the whole chip area; the other entries are core areas.

## 2.5 Conclusion

This chapter presents a 24-Gb/s 1.38-pJ/bit PAM4 receiver in a 28-nm CMOS process. The analysis shows that PAM4 signaling is more BW-sensitive than its NRZ counterpart. The good bit efficiency of the proposed receiver is achieved by adopting the 1/4-rate topology and the AVGR-based decoder, which adaptively and power-efficiently decodes PAM4 signals with different amplitudes. The circuit implementations of the AVGR and other key building blocks are introduced with simulation results. Experimental results show that the receiver achieves power efficiency comparable to that of similar state-of-the-art works.

# Chapter III 26-Gb/s NRZ Receiver Design

## 3.1 Introduction

Power-efficient electrical links have a very wide range of applications. Even though PAM4 signaling has become the research focus, NRZ links are still the mainstream in real applications such as Ethernet, so research on power-efficient NRZ links remains meaningful. In supercomputing, for instance, although the speed has increased 200 times over the last decade, the consumed power has increased only 7 times, benefiting from advanced process technologies and computing architectures [31]. To maintain this pace, high-speed and power-efficient I/Os for a substantial number of processors and memories must be developed. For a compact assembly, the electrical channels for chip-to-chip and module-to-module communication usually have a medium loss at the Nyquist frequency [32]. To meet the requirements above, source-synchronous I/Os with forwarded clocking are widely employed for their low complexity [32-34]. Among [32-34], the highest data rate and the best power efficiency are 16 Gb/s and 0.56 pJ/bit, respectively. Further increasing the bandwidth and improving the power efficiency poses significant design challenges. A PD is necessary for clock-data phase recovery. Bang-bang PDs (BBPDs) are popular for their high gain, but their nonlinearity causes dithering jitter and loop latency. By contrast, a linear PD can eliminate these issues at the cost of high-bandwidth analog circuits, which are more power-hungry and more susceptible to process, voltage and temperature (PVT) variations. Compared with a CTLE, a DFE does not amplify high-frequency noise and offers more design freedom through the use of multiple taps. However, the first DFE tap faces a very stringent timing constraint and has to consume more power to reduce the feedback loop delay. An unrolled first tap alleviates the timing constraint but doubles the number of power-hungry slicers [35]. [36] reports a record bit efficiency of 0.35 pJ/bit at 40 Gb/s by sharing building blocks among different functions and extensively utilizing charge-steering techniques.

In this chapter, a power-efficient source-synchronous receiver based on the linear sampling technique is presented [37]. Both the data and the edges of the differential input signal are sampled by 1/4-rate clocks. The differential edge samples are proportional to the clock-data phase error when a data transition occurs and the error is small. A linear sampling PD (LSPD) is proposed in which the data samples give the data transition direction and the edge samples give the magnitude of the phase error. The bandwidth requirement of the sub-rate LSPD is relaxed, so its power consumption is lower. Equalizers are embedded into the LSPD by reusing the linear samples. A feed-forward equalizer (FFE) on the receiver side is usually used to cancel pre-cursors [38]. In this receiver, the FFE instead of the DFE cancels the first post-cursor in order to eliminate the power-consuming DFE tap1 loop. The second post-cursor is cancelled by the DFE, which can be easily implemented in the 1/4-rate topology. The FFE and the DFE are often used to improve the voltage margin, but they can also perform edge equalization [39]. In this chapter, the FFE and the DFE are also applied to the edge path to cancel the post-cursors at 0.5 UI and 1.5 UI so as to suppress the inter-symbol interference (ISI) induced jitter of the recovered clock. The 8-phase 1/4-rate clocks are generated and de-skewed by a ring oscillator injected with a forwarded 1/4-rate clock. The equalizer summer (EQS) adopts the charge-steering technique to further save power.

## 3.2 System Design

Fig. 3.1 shows the proposed receiver topology, which includes three parts: 4 data paths (DPs), 4 edge paths (EPs) and an 8-phase injection-locked ring oscillator (ILRO) with a voltage-controlled delay line (VCDL) in its injection path. In the sample/hold (S/H) stages, the input signal $D_{in}$ is sampled to data sample $S_{2n}$ by $CLK_{2n}$ in $DP_n$ and to edge sample $S_{2n+1}$ by $CLK_{2n+1}$ in $EP_n$. The data sample $S_{2n}$ becomes $D_n$ after the EQS and the regenerative slicer. After the EQS and the subsequent sampling, the edge sample $S_{2n+1}$ becomes $E_n$, which is proportional to the clock-data phase error and is applied to a voltage-current converter (VIC) to generate the corresponding error current $I_{VI}$ under the logic control. The subsequent loop filter converts $I_{VI}$ to a voltage, which then tunes the VCDL to adjust the output clock phases of the ILRO. The clock phase recovery loop is a 1<sup>st</sup>-order low-pass system without stability issues. The edge path is sensitive to noise and offset. The main noise comes from the EQS and the VIC, and it is filtered by the low-pass characteristic of the loop. The EQS may introduce an offset, which can be calibrated manually or automatically by an extra calibration branch.



Fig. 3.1 Proposed source-synchronous 1/4-rate receiver topology.

Fig. 3.2 shows sampling examples in $DP_0$, $EP_0$ and $DP_1$ when there is a clock-data phase error. For clock phase recovery, only transitions T2 and T3 are considered. The voltage $V_E$ of the edge sample $S_1$ is proportional to the clock-data phase error $P_E$ when the error is small. For T2, the clock is late when $V_E < 0$ and early when $V_E > 0$. The opposite holds for T3.

Based on the observation above, an LSPD with a finite linear range can be easily implemented by utilizing  $D_0$  and  $D_1$  to detect data transitions and determine the input polarities of the edge samples to the VIC.
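The decision rule described above can be sketched as a small behavioral function. This is only an illustrative model: the names are hypothetical, and the mapping of T2 and T3 onto rising and falling transitions is an assumption made here, since this section does not state it.

```python
def lspd_error(d_prev: int, d_next: int, v_edge: float) -> float:
    """Signed phase-error estimate from one data/edge/data sample triple.

    d_prev, d_next: sliced data bits (0 or 1) on either side of the edge sample.
    v_edge: differential edge-sample voltage in volts.
    Returns 0.0 when no transition occurs; otherwise the edge voltage with a
    polarity selected by the transition direction.
    """
    if d_prev == d_next:
        return 0.0  # no transition: the edge sample carries no phase information
    # ASSUMPTION: a rising transition keeps the sign and a falling one flips it,
    # so that a late clock gives a positive output for both directions.
    return v_edge if (d_prev, d_next) == (0, 1) else -v_edge


# A late clock samples the edge after the crossing: positive on a rising edge,
# negative on a falling edge. Both map to the same output polarity.
print(lspd_error(0, 1, 0.02), lspd_error(1, 0, -0.02), lspd_error(1, 1, 0.3))
```

Within the linear range the output magnitude scales with the phase error, which is what distinguishes this detector from a bang-bang PD that only reports the sign.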



Fig. 3.2 Data and edge sampling with clock-data phase error.



Fig. 3.3 Timing diagram of the LSPD

Fig. 3.3 shows the timing diagram of one branch of the LSPD. In the data paths, the delay from $S_{0,2}$ to $D_{0,1}$ is 1 UI, including the settling time of the EQS and the delay of the slicer. For the edge path, 1.5 UI is allocated for the settling time of the EQS, and the equalized edge sample is then sampled and held for another 2 UI. The gray area in Fig. 3.3 shows a 2-UI window during which the VIC is enabled to deliver the corresponding current to the loop filter for phase tuning when a data transition occurs.





Fig. 3.4 (a) Pulse response of a channel with limited bandwidth, and (b) data path  $DP_{0,1}$  and edge path  $EP_0$  with FFE and DFE.

 $DP_0$ ,  $EP_0$  and  $DP_1$  in Fig. 3.1 are redrawn in Fig. 3.4 to explicitly introduce the operation of the FFE and the DFE in the receiver. A pulse response with significant ISI is shown in Fig. 3.4(a). Post-cursors  $a_{1,2}$  degrade the voltage margin of the subsequent bits

severely and should be removed. In Fig. 3.4(b), the post-cursor a<sub>1</sub> in data sample S<sub>2</sub> can be removed by subtracting its weighted former data sample  $S_0$ . If  $D_0$  is used for  $a_1$  cancellation, the EQS and the slicer must have a very small delay (< 1 UI) at the cost of much larger power. The weighted D<sub>3</sub> cancels the post cursor a<sub>2</sub> in S<sub>2</sub> since the timing constraint has been relaxed to 2 UI. In order to achieve a low-power equalizer, a 1-tap FFE and a 1-tap DFE are employed for the first and second post-cursors cancellation, respectively. The ISI not only degrades the voltage margin, but also causes significant jitter, as shown in Fig. 3.5(a). For the edge samples at CLK<sub>1</sub>, there are 4 groups of non-zero crossing points C<sub>0</sub>-C<sub>3</sub> which will lead to ISI-induced jitter in the recovered clock; therefore, edge equalization is also very meaningful in highspeed receivers. The pulse response in Fig. 3.4(a) reveals that  $a_{0.5} = a_0/2$  and  $a_{1.5} = 0$  for ideal edge equalization, and  $a_{1.5}$  may contribute more ISI than  $a_{0.5}$ . In Fig. 3.5(a), the edge sample  $S_1$  is positive when the data sample  $S_6$  is positive (red solid and green dashed lines) and vice versa. Furthermore,  $S_1$  is directly related to  $S_6$ . To perform edge equalization, the EQS output  $E_0$  in  $EP_0$  should be  $S_1 - b_1 * S_0 - b_2 * S_6$  where  $b_{1,2}$  are corresponding tap weights. In this design, S<sub>6</sub> is replaced by DP<sub>3</sub> output D<sub>3</sub> considering that edge sampling is more sensitive to noise than data sampling is. Fig. 3.5(b) shows edge equalization effect on edge samples S<sub>1</sub> based on a behavior model at the sampling phase in Fig. 3.5(a). The distribution of the valid edge samples in the black circle are counted. When there is no edge equalization, the edge samples distribute among -240 mV ~ 240 mV and the standard deviation is 140 mV. 
The distribution range shrinks to -60 mV ~ 60 mV and the standard deviation decreases to 50 mV when edge equalization is applied. The clock recovery loop has improved input jitter performance, so the tradeoff between jitter transfer bandwidth and jitter tolerance bandwidth is relaxed providing another degree of design freedom for the loop. A BBPD does not have this advantage even though it is in linear range.
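The data-path cancellation described above can be illustrated with a toy model. The sketch below is not the thesis's behavioral model: it assumes an ideal pulse response with only a main cursor and two post-cursors (hypothetical values), assumes the DFE decisions are always correct, and picks the tap weights from a simple derivation rather than from the measured channels. The FFE subtracts the weighted previous linear sample, and the DFE subtracts the weighted decision from two bits earlier.

```python
from itertools import product


def margins(a0: float, a1: float, a2: float):
    """Worst-case vertical margin before and after a 1-tap FFE plus 1-tap DFE."""
    # Bit stream that contains every 4-bit pattern, mapped to +/-1 symbols.
    d = [1.0 if b else -1.0 for p in product((0, 1), repeat=4) for b in p]
    w_ffe = a1 / a0              # FFE weight on the previous linear sample
    w_dfe = a2 - a1 * a1 / a0    # DFE weight on the decision two bits back
    raw, eq = [], []
    for n in range(3, len(d)):
        x_n = a0 * d[n] + a1 * d[n - 1] + a2 * d[n - 2]
        x_prev = a0 * d[n - 1] + a1 * d[n - 2] + a2 * d[n - 3]
        # ASSUMPTION: the fed-back decision d[n - 2] is error-free.
        y_n = x_n - w_ffe * x_prev - w_dfe * d[n - 2]
        raw.append(d[n] * x_n)   # positive value = distance from the threshold
        eq.append(d[n] * y_n)
    return min(raw), min(eq)


# Hypothetical cursor values, not taken from the measured channels.
raw_margin, eq_margin = margins(a0=1.0, a1=0.4, a2=0.2)
print(f"worst-case margin: raw {raw_margin:.2f}, equalized {eq_margin:.2f}")
```

Because the FFE operates on a linear sample that itself contains ISI, a small residual term of magnitude a1*a2/a0 remains at the third post-cursor in this model, which is why the equalized margin does not quite reach a0.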



Fig. 3.5 (a) Inter-symbol-interference-induced jitter, and (b) valid edge samples $S_1$ with and without edge equalization in a behavioral model.

## 3.3 Building Blocks

In this section, key building blocks of the proposed receiver including S/H stage, charge-steering EQS, and VIC stage are introduced.

### 3.3.1 S/H



Fig. 3.6 Sampling and holding stage

Fig. 3.6 shows the differential S/H stage, where $M_{1,2}$ are the clock-controlled switches. $M_{3,4}$, with half the size of $M_{1,2}$, are controlled by the inverted clock to cancel the charge injection and clock feedthrough, while $M_{5,6}$ are always off to cancel the data feedthrough [40].

### 3.3.2 Equalizer summer

The concept of charge steering was proposed several years ago for its superior power efficiency and speed, and the technique has been used in a wide range of circuits [36, 41]. Fig. 3.7(a) shows the charge-steering EQS, which consists of the main amplifier and the FFE and DFE branches. The charge-steering amplifier is dynamic with a return-to-zero (RZ) output, and its approximate gain at a small input is the ratio of the output loading capacitance over the tail capacitance. The main amplifier has a tunable gain implemented by 6-bit tail capacitor units. The FFE and the DFE are also based on the charge-steering technique; the difference is that the FFE input is the linear data sample while the DFE input is the output of the slicers. The EQS is designed with a negative FFE coefficient, while the DFE coefficient can be positive or negative by switching the input polarity of the DFE branch. For the data path, the small eye opening of the input signal leads to a large variation in the response time of the EQS and large jitter in the output data eye diagram. For the edge path, the variation of the transition crossing points also leads to level spread in the output edge eye diagram. The equalization effect is shown in Fig. 3.7(b). Without the equalization, the left-hand figure shows that the output data eye diagram has a large amount of jitter and the output edge eye diagram shows a 210-mV<sub>ppd</sub> level variation of the valid edge samples. With the equalization enabled, the jitter in the data eye diagram is reduced and the level variation in the edge eye diagram is suppressed to 130 mV, as shown in the right-hand figure. The edge equalization effect in Fig. 3.5(b) is better than that in Fig. 3.7(b) because a channel with heavier loss is used here and two taps are no longer sufficient to cancel the post-cursors.





Fig. 3.7 (a) Charge-steering EQS and (b) EQS outputs with and without equalization.

### 3.3.3 VIC






Fig. 3.8 (a) Voltage-current converter and (b) simulated transfer curve of edge path and VIC.

As previously mentioned, the recovered clock achieves better jitter performance by employing the LSPD. To implement the LSPD, not only should the clock-data phase error be linearly sampled, but the linear edge samples should also be converted linearly to the control voltage of the VCDL. The linear sampling is guaranteed by the S/H stages in Fig. 3.1, and the VIC is the other key linear converter. Fig. 3.8(a) shows the schematic diagram of the proposed VIC with four input slices. In each slice, M<sub>N3,4</sub>/M<sub>N5,6</sub>, working in the linear range, convert the EP<sub>n</sub> output $E_n$+/- to a proportional current with a gain of $g_{m3,4}/g_{m5,6}$. The converted currents of the four slices are summed in M<sub>P1,4</sub> and then transferred to the loop filter after mirroring. The loop filter, a 10-pF capacitor, is first-order and converts the VIC output current to the control voltage of the VCDL very smoothly. Therefore, the VIC gain is the product of g<sub>m3,4</sub> and the current mirror ratio. M<sub>N1,2</sub>, controlled by VIEN<sub>n</sub>, respond to data transitions and determine the conversion polarity of the VIC. When no data transition occurs, both VIENn and VIENnB are low, so this slice does not generate current. The output current is negative for transition T2 and positive for transition T3 of Fig. 3.2. Fig. 3.8(b) shows the simulated transfer curve of the LSPD and the VIC, and a linear range of at least 10 ps is realized. When the phase error is far from the linear range, the VIC tail current may be steered to one side by the large edge samples, and the resulting constant current causes slewing during the locking process. Although the system is 1<sup>st</sup> order, no static phase error occurs since any phase error causes the VIC to continuously charge or discharge the loop filter and thereby adjust the VCDL. After the clock and data are aligned, the current generated by the VIC is almost zero, which means a small ripple on the VCDL control voltage and good jitter performance of the recovered clock.

## 3.4 Experimental Results

The 1/4-rate receiver chip is implemented in a 28-nm CMOS process and occupies a core area of 0.01 mm<sup>2</sup>, as shown in Fig. 3.9. The measurement setup is also shown in Fig. 3.9. A PRBS-15 pattern from the pattern generator is attenuated by the channel and then fed into the receiver chip. The chip outputs are measured by an error detector and a sampling scope. The whole setup is synchronized by a 1/4-rate clock from the pattern generator. The measured tuning range of the VCDL is 80 ps (2 UI), which guarantees that the loop does not lock with the control voltage stuck at the supply or ground. The RO frequency calibration [42] and equalizer adaptation are not included due to limited design resources and a tight schedule. Two channels are used to characterize the receiver chip, and their measured pulse responses in Fig. 3.9 guide the setting of the FFE and DFE coefficients.

For channel #1 with a 6-dB loss at 15 GHz, the chip is measured at 25 Gb/s and 30 Gb/s. The captured 6.25-Gb/s and 7.5-Gb/s output eye diagrams in Fig. 3.10 have rms jitters of 2.1 ps and 3.0 ps, respectively. The measured BER bathtub curves in Fig. 3.10(c) show that a timing margin of over 0.6 UI is achieved at a BER of 10<sup>-12</sup>. Here the BER bathtub curve is used to evaluate the output eye opening. In real applications, the eye opening is represented by an eye mask whose edges have the same SNR or BER. To obtain the red BER bathtub curve in Fig. 3.10(c), the BER of the output signal in Fig. 3.10(b) is measured at different phases by comparing the signal with the DC (middle) level. The red BER bathtub curve therefore indicates that the SNR is larger than 7 within a 0.6-UI width and degrades as the measurement phase approaches the crossing points on either side. The BER bathtub curve can thus represent the SNR and, in turn, part of the eye opening.
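The link between a BER of 10<sup>-12</sup> and an SNR of about 7 can be checked numerically. The sketch below assumes Gaussian noise and interprets the SNR in the text as the Q-factor, that is, the distance from the decision threshold divided by the rms noise; both are assumptions of this example rather than statements from the measurement.

```python
import math


def ber_from_q(q: float) -> float:
    """BER of a binary decision with Gaussian noise: 0.5 * erfc(Q / sqrt(2))."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))


for q in (6.0, 7.0):
    print(f"Q = {q:.0f}: BER = {ber_from_q(q):.2e}")
```

Under these assumptions, Q = 7 corresponds to a BER of roughly 1.3e-12, and Q = 6 to roughly 1e-9.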



Fig. 3.9 (a) Chip photograph, (b) channels used with their pulse responses, and (c) measurement setup.



Fig. 3.10 Recovered eye diagrams for the input of (a) 25 Gb/s and (b) 30 Gb/s, and (c) measured BER bathtub curves of receiver outputs with channel #1.

Channel #2 with a 14-dB loss at 13 GHz is used to further test the equalization ability of the receiver. The 26-Gb/s input eye in Fig. 3.11(a) is almost closed after passing through channel #2. With the equalization set to its maximum strength, the 6.5-Gb/s output eye diagram, shown in Fig. 3.11(b), has an rms jitter of 2.8 ps and a peak-to-peak jitter of 14.8 ps. Fig. 3.11(c) shows the measured bathtub curves. With the equalization enabled, the measured output BER improves from $4 \times 10^{-4}$ to $10^{-12}$ with a timing margin of over 0.46 UI, and a bit efficiency of 0.31 pJ/bit is achieved. Fig. 3.11(d) shows the BER as the data rate is swept from 25 Gb/s to 30 Gb/s. Above 26 Gb/s, the BER degrades quickly due to the limited sensitivity of the EQS. For an input with a small eye opening, the EQS may respond so slowly that the correct logic level cannot be restored within the available time, and bit errors occur. A CTLE and a variable-gain amplifier can be introduced to extend the data rate limit. Table II summarizes the receiver performance and compares it with other state-of-the-art designs. The superior bit efficiency demonstrates the effectiveness of the proposed 1/4-rate LSPD with an embedded FFE and DFE.



Fig. 3.11 (a) 26-Gb/s input eye diagram, (b) recovered eye diagram, (c) measured bathtub curves and (d) achieved BER vs. data rate with channel #2.

Table II: Performance summary of 26-Gb/s NRZ receiver and comparison with similar works.

|                                  | [33] VLSI14 | [32] JSSC15 | [36] ISSCC16      | This Work       |
|----------------------------------|-------------|-------------|-------------------|-----------------|
| PD Arch.                         | 1/4 Rate BB | N/A         | 1/2 Rate Linear   | 1/4 Rate Linear |
| Clock Arch.                      | Forwarded   | Forwarded   | Embedded          | Forwarded       |
| Equalization                     | CTLE        | CTLE + DFE  | CTLE + DTLE + DFE | FFE + DFE       |
| Ch. Loss (dB)                    | N/A         | 14 @ 6 GHz  | 18.6 @ 20 GHz     | 14 @ 13 GHz     |
| DR (Gb/s)                        | 14          | 12          | 40                | 26              |
| Power (mW)                       | 7.8         | 14          | 14                | 8.1             |
| Eff. (pJ/bit)                    | 0.56        | 1.04        | 0.35              | 0.31            |
| Eye Width at BER 10<sup>-9</sup> | 0.37 UI     | 0.16 UI     | 0.3 UI            | 0.62 UI         |
| @ Input Amplitude (mV)           | 200         | 100         | NA                | 75              |
| Technology                       | 65 nm       | 32 nm SOI   | 45 nm             | 28 nm           |
| Area (mm²)                       | 0.36        | 0.01        | 0.019             | 0.01            |

## 3.5 Clock Recovery Loop Analysis

Since the equipment for measuring the CDR jitter transfer and tolerance characteristics is unavailable, a theoretical analysis of the clock recovery loop is given in this section. For simplicity, a full-rate linear model is built to show the clock recovery characteristics. The linear model shown in Fig. 3.12 consists of the PD, VIC, VCDL, and ILRO.



Fig. 3.12 Linear model of the 1st order clock recovery loop.

When the ILRO output clock samples the input data in the linear range as shown at the bottom of Fig. 3.12, the PD output is

$$V_{PD} = \frac{A_i}{T_{LM}} * \frac{\phi_e}{2\pi} * T * A_{SUM}$$
(3.1)

where $A_i$ is the input data amplitude, $T_{LM}$ is the maximum linear range, $\phi_e$ is the clock-data phase error, T is the unit interval, and $A_{SUM}$ is the gain of the EQ summer. The EQ summer is absorbed into the linear PD model. Therefore, the PD gain is

$$k_{PD} = \frac{A_i}{T_{LM}} * \frac{1}{2\pi} * T * A_{SUM}$$
(3.2)

In the same way, the output voltage of the VIC and the output phases of the VCDL and the ILRO can be obtained as:

$$V_{LP}(s) = \frac{1}{2} \frac{g_m V_{PD}}{sC}$$
(3.3)

$$\phi_{DL} = k_{DL} * V_{LP}$$
(3.4)

$$\phi_O = \phi_{DL} \frac{1}{1 + s/\omega_L}$$
(3.5)

where $g_m$ is the gain of the VIC, C is the capacitance of the 1<sup>st</sup>-order loop filter, $k_{DL}$ is the gain of the VCDL from voltage to phase delay, and $\omega_L$ is the locking range of the ILRO. Here the ILRO is modeled as a low-pass filter with a 3-dB bandwidth of $\omega_L$ because the loop bandwidth is much smaller than the locking range of a fundamental ILRO [43]. The open-loop transfer function is

$$H_{open}(s) = \frac{A_i}{T_{LM}} * \frac{1}{2\pi} * T * A_{SUM} * \frac{1}{2} \frac{g_m}{sC} * k_{DL} * \frac{1}{1 + s/\omega_L} = \frac{1}{s/\omega_B} * \frac{1}{1 + s/\omega_L}$$
(3.6)

where

$$\omega_B = \frac{A_i}{T_{LM}} * \frac{1}{2\pi} * T * A_{SUM} * \frac{1}{2} * \frac{g_m}{C} * k_{DL}$$
(3.7)

Since  $\omega_B \ll \omega_L$ , the  $H_{open}(s)$  can be simplified to

$$H_{open}(s) \approx \frac{1}{s/\omega_B}$$
 (3.8)

Since $H_{open}(s)$ is 1<sup>st</sup> order, the loop is unconditionally stable. The closed-loop transfer function is

$$H_{close}(s) = \frac{1}{1 + s/\omega_B}$$
(3.9)

Therefore, $\omega_B$ is the loop bandwidth.

The above analysis shows that the loop BW depends on the input signal (data transitions). For an input with a 0.5-V differential amplitude and a transition time of 50% of the UI, the PD gain in the linear range is

$$k_{PD} = \frac{1}{2\pi}$$
 when  $A_{SUM} = 1$ 

When $g_m = 2\text{ mS}$, $C = 20\text{ pF}$, and the linearized delay-line gain $k_{DL} \approx 80\text{p} * 26\text{G} * 2\pi/0.8\text{V} = 2.6 * 2\pi$, the loop BW is

$$\omega_B = \frac{1}{2\pi} * \frac{1}{2} * \frac{g_m}{C} * k_{DL} = 130\ \text{Mrad/s} \approx 2\pi * 20\ \text{MHz}$$
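The numerical example above can be reproduced in a few lines. The sketch below plugs the values quoted in this section into Eqs. (3.2) and (3.7); the variable names are illustrative, and the linear range is taken as half of the unit interval, which is how the 50% transition time is interpreted here.

```python
import math

A_i = 0.5            # V, differential input amplitude
T_LM_over_T = 0.5    # linear range as a fraction of the UI (assumed from the 50% transition time)
A_SUM = 1.0          # EQ summer gain
g_m = 2e-3           # S, VIC gain
C = 20e-12           # F, loop-filter capacitance used in this example
k_DL = 80e-12 * 26e9 * 2 * math.pi / 0.8   # rad/V, linearized VCDL gain

k_PD = A_i / T_LM_over_T / (2 * math.pi) * A_SUM   # V/rad, Eq. (3.2)
omega_B = k_PD * 0.5 * g_m / C * k_DL              # rad/s, Eq. (3.7)
f_B = omega_B / (2 * math.pi)                      # Hz

print(f"k_PD = {k_PD:.4f} V/rad")
print(f"omega_B = {omega_B / 1e6:.1f} Mrad/s, f_B = {f_B / 1e6:.1f} MHz")
```

The script gives a loop bandwidth of 130 Mrad/s, or about 20.7 MHz, in line with the value quoted above.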

Because the jitter transfer characteristic is given by the closed-loop transfer function (3.9), the input jitter is filtered by a low-pass filter with a BW of $\omega_B$ before it is transferred to the recovered clock. The jitter tolerance can be calculated as [44]

$$\phi_{in} = \frac{0.5 * UI}{1 - H_{close}(s)} = \frac{1}{2} \frac{1 + s/\omega_B}{s/\omega_B} UI$$
(3.10)

Fig. 3.13 shows the jitter transfer and tolerance curves of the clock recovery loop.
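Curves of this kind can be generated directly from Eqs. (3.9) and (3.10). The following sketch evaluates both expressions at a few jitter frequencies, using the 130-Mrad/s loop bandwidth from the numerical example; the function names and the chosen frequency points are illustrative only.

```python
import math

OMEGA_B = 1.3e8  # rad/s, loop bandwidth from the numerical example in this section


def jitter_transfer(f_hz: float, omega_b: float = OMEGA_B) -> float:
    """Magnitude of H_close(s) = 1 / (1 + s/omega_B) at s = j*2*pi*f, Eq. (3.9)."""
    s = 1j * 2 * math.pi * f_hz
    return abs(1 / (1 + s / omega_b))


def jitter_tolerance_ui(f_hz: float, omega_b: float = OMEGA_B) -> float:
    """Magnitude of 0.5 * (1 + s/omega_B) / (s/omega_B) in UI, Eq. (3.10); f must be > 0."""
    s = 1j * 2 * math.pi * f_hz
    return abs(0.5 * (1 + s / omega_b) / (s / omega_b))


for f in (1e5, 1e6, 1e7, 1e8, 1e9):
    jt_db = 20 * math.log10(jitter_transfer(f))
    print(f"{f:8.0e} Hz: transfer {jt_db:7.2f} dB, tolerance {jitter_tolerance_ui(f):8.2f} UI")
```

The transfer is flat below the loop bandwidth and rolls off at 20 dB/decade above it, while the tolerance falls at 20 dB/decade at low frequency and flattens to 0.5 UI above the loop bandwidth.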



Fig. 3.13 (a) jitter transfer curve and (b) jitter tolerance curve

## 3.6 Conclusion

In this chapter, a 26-Gb/s 0.31-pJ/bit NRZ receiver is presented with the proposed LSPD and an embedded 1-tap FFE and 1-tap DFE for equalization. The LSPD uses the edge samples at data transitions to represent clock-data phase errors. Data and edge equalization is achieved by reusing the corresponding samples, and the corresponding analysis is also given. The clock and data recovery loop with the LSPD is analyzed, and the derivation shows that the loop is 1<sup>st</sup> order. The experimental results demonstrate that the proposed receiver is superior in terms of bit efficiency.

# Chapter IV 56-Gb/s PAM4 Receiver Design

## 4.1 Introduction

The data boom has been emphasized many times in this thesis. 25~28-Gb/s NRZ links are still the mainstream in industrial applications, which is why a power-efficient 26-Gb/s NRZ receiver is designed in chapter III. To keep up with the data boom, the data rate of the next-generation I/O will exceed 50 Gb/s. However, the design of power-efficient 50-Gb/s NRZ I/Os is quite challenging even with the advanced 16-nm CMOS FinFET technology [15, 16]. The advent of PAM4 signaling relaxes the design challenges by halving the working frequency of the NRZ transceivers. In chapter II, a low-power 24-Gb/s PAM4 receiver has been introduced. 56-Gb/s PAM4 I/O is becoming the research focus and will be the mainstream in the near future. Recently, the Optical Internetworking Forum (OIF) has released the 56-Gb/s PAM4 I/O standards, OIF CEI-56G-PAM4, among which OIF-CEI-56G-VSR-PAM4 is for chip-to-module communication with a channel loss of 10 dB at the Nyquist frequency (14 GHz), as shown in Fig. 4.1 [45]. ADC-based PAM4 receivers and mixed-signal PAM4 receivers are two popular topologies that have been implemented extensively [15-17, 19-22]. For middle-reach and long-reach applications, the channel usually has a loss of > 30 dB at the Nyquist frequency. To compensate for such channels, mixed-signal receivers have to employ a DFE with many taps, which increases the circuit complexity and power consumption, so the ADC-based topology is more suitable since it is convenient to implement more advanced equalization and PAM4-to-NRZ decoding in the digital domain. For very-short-reach and short-reach applications with medium-loss channels (< 20 dB), the mixed-signal topology is preferred since analog circuit techniques can be fully exploited to achieve good power efficiency. In this chapter, a 56-Gb/s mixed-signal PAM4 receiver targeting VSR applications is presented. As mentioned in chapter II, a PAM4 receiver faces new design challenges compared with its NRZ counterpart:

1) A PAM4 signal has four voltage levels and is very sensitive to bandwidth effects, including limited bandwidth and over-peaking, so adaptation is highly desirable; 2) the small eye opening of a PAM4 signal causes a larger slicing delay, so the timing constraint of the first DFE tap is more stringent; 3) PAM4 has 16 kinds of transitions between neighboring symbols, so the PD in the CDR should be designed carefully. The receiver in this chapter addresses these issues.
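The count of 16 transitions in item 3 can be enumerated explicitly. The sketch below uses normalized levels of -3, -1, +1 and +3 and assumes straight-line transitions between levels, which is a simplification introduced here for illustration; under that assumption it also reports where each transition crosses zero within the unit interval.

```python
from itertools import product

LEVELS = (-3, -1, 1, 3)   # normalized PAM4 levels


def classify_transitions():
    pairs = list(product(LEVELS, repeat=2))            # 16 ordered level pairs
    changing = [(a, b) for a, b in pairs if a != b]    # 12 pairs that change level
    # ASSUMPTION: with linear edges, a -> b crosses zero at mid-UI only if a == -b.
    symmetric = [(a, b) for a, b in changing if a + b == 0]
    return pairs, changing, symmetric


def zero_crossing_fraction(a: int, b: int):
    """Fraction of the UI at which a linear a -> b transition crosses zero."""
    if a * b >= 0:
        return None          # both levels on the same side: no zero crossing
    return a / (a - b)


pairs, changing, symmetric = classify_transitions()
print(len(pairs), len(changing), len(symmetric))
for a, b in changing:
    print(f"{a:+d} -> {b:+d}: zero crossing at {zero_crossing_fraction(a, b)}")
```

Only the four transitions between opposite levels cross zero at the middle of the unit interval in this model; the others cross early, late or not at all, which illustrates why the phase detector has to treat PAM4 transitions selectively.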



Fig. 4.1 OIF CEI-56G-VSR-PAM4 application [45].

## 4.2 System Design



Fig. 4.2 PAM4 receiver topology.

Fig. 4.2 shows the proposed 56-Gb/s 1/4-rate receiver, which evolves from that in chapter III by introducing more functions required by PAM4 signaling. The input PAM4 data is first equalized by two CTLE stages and then sampled by the subsequent sample-and-hold (S/H) stages. Data samples SDn and edge samples SEn are equalized through a 1-tap FFE and a 1-tap DFE in the following summers. The equalized data samples are decoded back to the NRZ MSB and LSB (RDn) in the decoder with an adaptive reference. The equalized edge samples are sliced by the following slicers to REn. The recovered data and edges (RDn and REn) are demultiplexed and synchronized to Mn/Ln and En, which are the inputs to the bang-bang PDs in the clock and data recovery loop. Mn/Ln detect data transitions, and En together with Mn/Ln gives the polarity of the clock-data phase error. The PD output Early/Late drives the CPs to control the delay of the VCDL in the reference clock path of the PLL to recover the clock phase. Unlike chapters II and III, where ILROs are used to produce the 1/4-rate clocks, a wide-bandwidth PLL (WBW-PLL) is designed here to avoid the interference from the injected clock to the output phases of the recovered clock, and the free-running frequency drift of the RO mentioned in chapters II and III is also resolved. Besides decoding, the decoder also performs amplitude detection and provides the DFE feedback signals.

To make the CTLE accurately equalize the input PAM4 signal, an adaptation algorithm is proposed, and the details will be introduced in chapter V [46]. To avoid the stringent timing constraint of the first DFE tap, an FFE is implemented by taking advantage of the 1/4-rate topology, in the same way as in chapter III. In the bang-bang PD, only symmetrical data transitions are used, which excludes the crossing points that deviate from the ideal position. For PAM4-to-NRZ decoding, the required reference voltage, which is 2/3 of the amplitude, is adaptively generated since the CTLE adaptation also performs amplitude detection. In the following parts, the details of the circuit implementation will be given. In addition, to reduce the loading on the CTLEs, edges are sampled only in paths 0 and 2 (P0 and P2), so there is only one pair of PDCPs for clock and data recovery.
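The role of the 2/3-amplitude reference in PAM4-to-NRZ decoding can be shown with a small behavioral sketch. It assumes ideal levels of -A, -A/3, +A/3 and +A and a bit mapping in which the LSB marks the outer levels; both the mapping and the function name are assumptions for illustration, not the decoder circuit of this chapter.

```python
def pam4_decode(v: float, amplitude: float) -> tuple:
    """Return (MSB, LSB) for a differential PAM4 sample.

    amplitude: outer-level magnitude A, so the ideal levels are
    -A, -A/3, +A/3 and +A.
    """
    v_ref = 2.0 * amplitude / 3.0      # midway between the A/3 and A levels
    msb = 1 if v > 0 else 0            # sign decision at the zero threshold
    lsb = 1 if abs(v) > v_ref else 0   # ASSUMPTION: LSB = 1 marks an outer level
    return msb, lsb


# The reference tracks the detected amplitude, so the same code decodes
# inputs of different swings.
for amp in (0.3, 0.15):
    levels = (-amp, -amp / 3, amp / 3, amp)
    print(amp, [pam4_decode(v, amp) for v in levels])
```

Because the reference scales with the detected amplitude, the decision thresholds stay centered in the upper and lower eyes when the input swing changes.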

# 4.3 Building Blocks

## 4.3.1 CTLE



Fig. 4.3 (a) Schematic of two CTLEs, and (b) the simulated frequency response.

Fig. 4.3(a) shows the schematic diagram of the two CTLE stages. The source degeneration $R_S$ and $C_S$ generate a zero-pole pair in which the zero is lower in frequency than the pole, so the pair boosts the high-frequency gain. Shunt peaking, implemented by placing an inductor L in series with the loading resistor R<sub>D</sub>, further extends the bandwidth. To cover PVT variation, the loading resistor R<sub>D</sub> is controlled by an operational amplifier (OPA) whose positive input is the virtual ground node of the CTLE output and whose negative input is a reference voltage generated by a digitally controlled resistor ladder. With this negative feedback loop, the CTLE output DC voltage is set by the reference V<sub>REF</sub>. Physical channels usually start attenuating below 1 GHz due to the skin effect, while the zero of the CTLE zero-pole pair is several times higher in frequency, so the skin-effect loss left uncompensated is non-negligible. To generate the corresponding low-frequency peaking, the output of the second CTLE is filtered by an LPF and then subtracted from the output of the first CTLE through a transconductor g<sub>m</sub>. A DC offset may arise from the signal input and from the mismatch of the CTLEs. To cancel it, part (1/6) of the bias current of the first CTLE is controlled by the output of the second CTLE after low-pass filtering and amplification by an OPA with Miller capacitors for a small bandwidth. Fig. 4.3(b) shows the simulated AC response. A peaking of 8 dB at 19 GHz is achieved. The -2-dB DC gain results from the gain-bandwidth tradeoff. Frequency components below 30 kHz are suppressed by 29 dB. The low-frequency peaking starts from several hundred MHz, as determined by the corner frequency of the LPF in Fig. 4.3(a).

The popular Miller OPA shown in Fig. 4.4(a) is suitable for single-ended control.  $V_{REF1,2}$  in Fig. 4.3(a) are generated by the 3-bit resistor ladders in Fig. 4.4(b). The stability is simulated with the test bench in Fig. 4.4(c), where the role of the CTLE in the feedback loop is emulated. The RC LPF with a very low corner frequency passes only the DC component. Each loading resistor of the CTLEs consists of two series resistors and a transistor in parallel with one of them, which tunes the effective loading resistance. The open-loop simulation in Fig. 4.4(d) shows a DC gain of 38 dB and a phase margin of 90°.



Fig. 4.4 CTLE DC output control, (a) feedback OPA, (b) digitally controlled resistor ladder for reference generation, (c) test bench of the open loop and (d) the simulation results.



Fig. 4.5 Simplified diagram of front end with DC offset cancellation.

$$H(s) = \frac{g_{m1}R_{D1}A_2}{1 + b(s)g_{m\_TC}R_{D1}A_2}$$
(4.1)

The numerator is the DC gain of the two CTLE stages. Fig. 4.6(b) shows the schematic diagram of the differential OPA. The cascode topology gives the OPA a high gain, and the complementary architecture of the common-mode feedback (CMFB) network supports a large differential output swing. The output DC level is set by VCM, which is the gate bias of the tail current of the CTLEs. The bias voltages Vb1 to Vb4 are generated by current mirrors. The GBW of the configuration in Fig. 4.6(b) is  $1/(R_M C_M)$  because the effective  $C_M$  is amplified by the OPA. Therefore, b(s) can be written as:

$$b(s) = \frac{b(0)}{1 + s/(GBW/b(0))}$$
(4.2)

Substituting (4.2) into (4.1) and assuming  $b(0)g_{m\_TC}R_{D1}A_2 \gg 1$, H(s) can be rewritten as

$$H(s) \approx \frac{g_{m1}R_{D1}A_2}{b(0)g_{m\_TC}R_{D1}A_2} \cdot \frac{1 + s/(GBW/b(0))}{1 + s/(GBW \cdot g_{m\_TC}R_{D1}A_2)}$$
(4.3)

So, the gain of CTLEs is suppressed by  $b(0)g_{m\_TC}R_{D1}A_2$  from DC to the frequency of GBW/b(0).



Fig. 4.6 DC offset cancellation network (a) OPA with miller capacitor and (b) schematic diagram of OPA.

Fig. 4.7(a) shows the simulated frequency response of the DC offset cancellation network. The DC gain b(0) is 50 dB with a phase margin of 88°. The GBW is 230 kHz which is approximately equal to  $1/R_{\rm M}C_{\rm M}$ . With b(0) = 50 dB, GBW = 230 kHz,  $g_{m1}R_{D1}A_2 = -2$  dB and  $g_{m\_TC}R_{D1}A_2 = -20$  dB, (4.3) matches very well with the frequency response in Fig. 4.3(b). To guarantee the stability, CMFB also needs to be verified. Fig. 4.7(b) shows the simulated CM frequency response, and it has a DC gain of 56 dB and a phase margin of 81°.

The 61-MHz GBW of the CM loop is larger than that of the differential-mode (DM) frequency response. When the DC offset cancellation network is simulated together with the CTLEs in Fig. 4.3, the phase margin improves since the feedback gain is smaller by a factor of  $g_{m\_TC}R_{D1}$  (~-20 dB) for both CM and DM.
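As a rough numerical check of (4.1) to (4.3), the sketch below plugs in the values quoted above (b(0) = 50 dB, GBW = 230 kHz, a forward gain of -2 dB and a feedback factor of -20 dB). It assumes a single-pole b(s) with all frequencies expressed in Hz; the function and variable names are illustrative and are not taken from the design.

```python
import math

def db_to_lin(db):
    """Convert a voltage gain in dB to a linear ratio."""
    return 10 ** (db / 20)

B0 = db_to_lin(50)       # b(0), DC gain of the offset-cancellation OPA
GBW_HZ = 230e3           # gain-bandwidth product of the OPA
FWD = db_to_lin(-2)      # g_m1 * R_D1 * A_2, DC gain of the two CTLEs
LOOP = db_to_lin(-20)    # g_m_TC * R_D1 * A_2, feedback factor

def b(f_hz):
    """Single-pole model of b(s) from (4.2), evaluated at s = j*f."""
    return B0 / (1 + 1j * f_hz / (GBW_HZ / B0))

def h_exact(f_hz):
    """Closed-loop gain from (4.1)."""
    return FWD / (1 + b(f_hz) * LOOP)

def h_approx(f_hz):
    """Zero-pole form from (4.3), valid when b(0) * LOOP >> 1."""
    zero = GBW_HZ / B0
    pole = GBW_HZ * LOOP
    return FWD / (B0 * LOOP) * (1 + 1j * f_hz / zero) / (1 + 1j * f_hz / pole)

# Suppression of the CTLE gain at DC, and the corner frequencies of (4.3)
suppression_db = 20 * math.log10(FWD / abs(h_exact(0.0)))
zero_hz = GBW_HZ / B0
pole_hz = GBW_HZ * LOOP

if __name__ == "__main__":
    print(f"DC suppression: {suppression_db:.1f} dB")
    print(f"zero at {zero_hz:.0f} Hz, pole at {pole_hz / 1e3:.0f} kHz")
```

Under these assumptions the model gives a DC suppression of about 30 dB that is released between roughly 0.7 kHz and 23 kHz, which is in line with the 29-dB suppression below 30 kHz read from Fig. 4.3(b).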



Fig. 4.7 Simulation of DC offset cancellation network (a) AC and (b) CMFB.

## 4.3.2 Equalizer summer

Fig. 4.8 shows the equalizer summer including 1-tap FFE and 1-tap DFE of path2 in Fig. 4.2. The source degeneration resistor of the main amplifier is to increase the linearity. The FFE input is the data sample of path0. The DFE tap consists of three slices with the same weight because the input is the thermometer code outputs of the decoder in path0. The tap coefficients of the FFE and the DFE are controlled by adjustable tail currents.



Fig. 4.8 Summer including FFE tap and DFE tap.



Fig. 4.9 Timing diagram of the equalizers.

Fig. 4.9 gives the timing diagram of the equalizers. The 1/4-rate clock CKDn samples Dn to SDn. With the holding process, the duration of Dn increases from 1 UI in Din to 2.5 UI in SDn. TDn[2:0] holds for 4 UI since it changes with CKDn. The ISI on SD2 comes from the 1<sup>st</sup> post cursor of D1 and the 2<sup>nd</sup> post cursor of D0. To avoid the stringent timing of the 1<sup>st</sup> DFE tap, a 1-tap FFE is employed. The FFE uses the sampled SD1 to cancel its 1<sup>st</sup> post cursor on SD2, and the sliced outputs TD0[2:0] are used to cancel the 2<sup>nd</sup> post cursor of D0 on SD2. The equalized SD2 is in the dashed gray window with a width of 1.5 UI, and the solid gray line marks the slicing instant of SD2.
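The cancellation described above can be illustrated with a symbol-rate sketch. It is an idealized model and not the circuit: the three cursor values, the assumption of error-free decisions, and the coefficient choices c = a1/a0 for the FFE and alpha = a2 - a1^2/a0 for the DFE are assumptions that follow from this simplified model only. Because the FFE input SD1 itself carries ISI from D0, the DFE coefficient in this model differs from the raw second post cursor.

```python
import random

def sample_hold(symbols, cursors):
    """Sampled values SD[k] = sum_i cursors[i] * symbols[k - i]."""
    return [
        sum(a * symbols[k - i] for i, a in enumerate(cursors) if k - i >= 0)
        for k in range(len(symbols))
    ]

def summer(sd, decisions, c_ffe, alpha):
    """FFE subtracts the scaled previous sample (analog, no decision needed);
    DFE subtracts the scaled decision made two symbols earlier."""
    out = []
    for k in range(len(sd)):
        ffe = c_ffe * sd[k - 1] if k >= 1 else 0.0
        dfe = alpha * decisions[k - 2] if k >= 2 else 0.0
        out.append(sd[k] - ffe - dfe)
    return out

random.seed(0)
A0, A1, A2 = 1.0, 0.3, 0.1                 # hypothetical main and post cursors
symbols = [random.choice((-3, -1, 1, 3)) for _ in range(2000)]
sd = sample_hold(symbols, (A0, A1, A2))

c_ffe = A1 / A0                            # removes the A1 * d[k-1] term
alpha = A2 - A1 * A1 / A0                  # removes what is left on d[k-2]
ssd = summer(sd, symbols, c_ffe, alpha)    # ideal decisions assumed

# Worst-case deviation from the ideal level A0 * d[k]
isi_before = max(abs(v - A0 * d) for v, d in zip(sd, symbols))
isi_after = max(abs(v - A0 * d) for v, d in zip(ssd, symbols))

if __name__ == "__main__":
    print(f"peak ISI before: {isi_before:.3f}, after: {isi_after:.3f}")
```

In this model the only residual is a small term on d[k-3], of size A1*A2/A0 per unit symbol, which the 1-tap FFE introduces itself.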

Fig. 4.10 shows the simulation results of the 1-tap FFE. Due to the limited bandwidth, the eye opening of the sampled SD2 is degraded. With the FFE enabled, the eye opening of the summer output SSD2 improves significantly. The width of the improved eye is 1.5 UI, which matches the timing in Fig. 4.9. During the following 1 UI, SD1 is no longer held, so the FFE no longer works correctly and the eye opening degrades. The ripple on SSD2 is caused by the clock sampling.



Fig. 4.10 FFE effect.

## 4.3.3 Decoder



Fig. 4.11 Schematic diagram of PAM4 decoder.

Fig. 4.11 shows the decoder architecture. There are four comparators: three are for decoding, and the one in the dashed rectangle is for peak detection. Two DACs convert the digital outputs of the adaptation block to the analog peak voltage VPK and the reference voltage VREF. VREF is 2/3 of VPK for the PAM4-to-NRZ decoding. The comparator outputs TD[2:0] are thermometer codes and can be used for DFE feedback with the same coefficient  $\alpha$ . For instance, if SSDn is level 3, TD[2:0] are all '1' and the DFE feedback is  $3\alpha$ ; if SSDn is level 1, TD[2] is '-1' and TD[1:0] are '1', so the feedback is  $1\alpha$ . TD[2:0] are converted to the MSB and LSB by the thermometer-to-binary logic. To perform the peak detection, the signal value determined by low-frequency components should be detected. In the proposed adaptation algorithm, this is the case when the detected signal value and its two preceding bits are all level 3, so SSDn is compared with VPK when three consecutive level-3 symbols occur. VPK starts from an initial value and equals the signal amplitude when the peak detection reaches steady state.
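The decoding path described above can be sketched in a few lines. The sketch assumes ideal comparators, thermometer outputs coded as +1/-1 as in the feedback example, and a natural binary mapping (level index = 2·MSB + LSB), which is consistent with the 2:1 weighting of the two PRBS7 streams used in the simulation; the function names are illustrative.

```python
def slice_pam4(v, vpk):
    """Three decoding comparators at +VREF, 0 and -VREF, with VREF = 2/3 VPK.
    Returns the thermometer code (TD2, TD1, TD0), each element +1 or -1."""
    vref = 2.0 * vpk / 3.0
    return tuple(1 if v > th else -1 for th in (vref, 0.0, -vref))

def dfe_feedback(td, alpha):
    """Three equally weighted DFE slices: the feedback is alpha * (sum of TD)."""
    return alpha * sum(td)

def thermometer_to_binary(td):
    """Count the comparators that fired to get the level index 0..3,
    then split it into (MSB, LSB)."""
    level = sum(1 for t in td if t > 0)
    return level >> 1, level & 1

if __name__ == "__main__":
    vpk = 0.3  # hypothetical peak amplitude in volts
    for v in (0.3, 0.1, -0.1, -0.3):
        td = slice_pam4(v, vpk)
        print(v, td, dfe_feedback(td, 0.05), thermometer_to_binary(td))
```

With equal slice weights, the four PAM4 levels map to feedback values of 3α, α, -α and -3α without any additional decoding in the feedback path.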

## 4.3.4 Comparator with offset calibration



Fig. 4.12 (a) Strong-arm latch with offset calibration pair, and (b) offset calibration loop.

One of the most popular comparator architectures is the combination of a strong-arm latch and an SR latch. Fig. 4.12(a) shows the strong-arm stage with an offset calibration pair. The comparator offset should be calibrated since errors may occur when slicing a PAM4 signal with a small eye opening. Fig. 4.12(b) shows the calibration loop, which includes a calibration logic and a 6-bit DAC whose MSB is the sign bit. During the calibration, the input is set to zero. At first, the DAC input is 0 and its analog output is at its minimum (a negative value), so the comparator output must be '1' if the offset is within the coverage of the calibration logic. The calibration logic monitors the comparator output OUT and counts. If the comparator does not produce a falling edge within 4 clock cycles, the calibration digital output increases by one. The process continues until a falling edge occurs, and then the calibration ends. When the calibration completes, the DAC output is just higher than the inherent offset, and the excess is less than 1 LSB. Fig. 4.13(a) shows the transient output of the calibration logic when the input is a constant value of -20 mV: the output goes up from 0 and stops at 25 when the falling edge occurs. Fig. 4.13(b) shows the calibration results for input offsets from -60 mV to 60 mV. The calibration logic output increases linearly with the introduced offset.
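The calibration loop can be sketched as a simple ramp search. The LSB size (2 mV), the mapping of code 0 to the most negative DAC output, and the ideal noiseless comparator are assumptions made for illustration, so the resulting codes are not expected to reproduce the values in Fig. 4.13.

```python
def calibrate_offset(offset_uv, lsb_uv=2000, bits=6, cycles_per_step=4):
    """Ramp the DAC code from 0 until the comparator output falls to 0.

    offset_uv: comparator offset to be cancelled, in microvolts.
    Returns (final_code, dac_output_uv, elapsed_clock_cycles).
    """
    v_min_uv = -(1 << (bits - 1)) * lsb_uv   # code 0 gives the most negative output
    max_code = (1 << bits) - 1
    code, cycles = 0, 0
    while True:
        dac_uv = v_min_uv + code * lsb_uv
        out = 1 if offset_uv > dac_uv else 0  # ideal comparator decision
        if out == 0 or code == max_code:
            # Falling edge seen (or range exhausted): calibration ends
            return code, dac_uv, cycles
        cycles += cycles_per_step             # wait 4 cycles, then step the code
        code += 1

if __name__ == "__main__":
    for offset in range(-60000, 60001, 20000):
        print(offset, calibrate_offset(offset))
```

Working in integer microvolts avoids floating-point ties at the comparator threshold; in this model the final DAC output exceeds the offset by less than one LSB, and the final code grows linearly with the offset.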



Fig. 4.13 (a) Transient simulation of calibration process and (b) calibration results for different offset inputs.

## 4.3.5 Demux



Fig. 4.14 Timing diagram of demux by 2.

To generate the inputs for both the PDs and the adaptation block, the decoder outputs should be demuxed to a lower speed by the Demux & Synchronization block in Fig. 4.2. Since the bang-bang PD logic is simple, the decoder outputs are only demuxed by 2 and then synchronized before entering the PDs, while the inputs to the digital adaptation algorithm are further slowed down by 8. The adaptation details are introduced in section 5.1 of chapter V. Fig. 4.14 shows the timing diagram of the demux by 2; there are four data outputs M<sub>0-3</sub>/L<sub>0-3</sub> and two edge outputs E<sub>0,2</sub>. The outputs of all four paths P0-P3 have a clock-to-Q delay t<sub>ck2q</sub>. The P0 and P1 outputs are demuxed and synchronized to M0/L0, E0 and M1/L1 by a synchronization clock CKSyn01, and the corresponding synchronization clock for P2 and P3 is CKSyn23. Because the adaptation block requires an input that indicates the occurrence of three consecutive top levels, D0 - D2 should remain sequential, so CKSyn01 and CKSyn23 should have a deterministic phase relation. Fig. 4.15 shows the diagram of the clock divider. CKSyn01 is triggered by CKE2 and is sampled to CKSyn23 by CKE0 after half a clock period. Therefore, CKSyn01 and CKSyn23 are in quadrature, and CKSyn01 always leads. As shown in Fig. 4.14, CKSyn01 lags CKE2 by t<sub>ck2q</sub>, which makes it suitable for synchronizing RD0, RE0, and RD1. The same holds for CKSyn23. The negative output of DFF1 is further divided to Sub\_CKSyn, which is used to demux and synchronize D0, D1 and D2. In the same way, further demuxing can be implemented with the clocks divided from Sub\_CKSyn.



Fig. 4.15 Demux and synchronization clock generation.

## **4.3.6 PD and CP**



Fig. 4.16 NRZ and PAM4 eye diagrams.

In chapter III, a linear PD with a limited range is proposed based on the observation that the NRZ edge sample is proportional to the clock-data phase error, as shown in Fig. 4.16(a). However, the linear PD cannot be used directly in PAM4 systems because the one-to-one correspondence between the edge samples and the clock-data phase error no longer exists. For the solid transitions in Fig. 4.16, the edge samples have different values for a given phase error. Additional logic would be needed to re-establish the one-to-one correspondence. For simplicity, a bang-bang PD is adopted in this design.



Fig. 4.17 Bang-bang PD with transition selection.

Fig. 4.17 shows the proposed bang-bang PD with transition selection. In XOR1-XOR3, only the MSBs (M) are combined with the edges (E). For the PAM4 eye diagram in Fig. 4.16, without XOR4 all solid transitions would be translated to early/late information by the PD. However, the crossing points are not unique, so the clock phase would wander within a range instead of settling at a point, leading to poor jitter performance. This phenomenon does not exist in NRZ signaling. In the proposed PD, the control word LSB\_EN enables the transition selection. When LSB\_EN = 0, XOR4 detects the LSB difference between  $L_0$  and  $L_1$ . The PD output is valid only when both  $M_0$  XOR  $M_1$  and  $M_2$  XOR  $M_3$  are true. Therefore, only the red thin solid transitions are selected for phase detection, and the clock phase converges at the middle crossing point in Fig. 4.16.
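A behavioral sketch of the transition selection for one data/edge/data triplet is given here. It rests on two assumptions that are not stated in the text: a natural binary mapping (level = 2M + L), under which the transitions that are symmetric about the middle level (0↔3 and 1↔2) are exactly those where both the MSB and the LSB toggle, and the usual Alexander-style polarity in which an edge sample equal to the earlier bit means the clock is early. The function and argument names are hypothetical.

```python
def bang_bang_pd(m0, l0, e, m1, l1, select_symmetric=True):
    """Return (early, late) for two consecutive symbols and the edge between them.

    m0, l0 / m1, l1: MSB and LSB (0 or 1) of the earlier / later symbol.
    e: sliced edge sample (0 or 1) taken between the two symbols.
    select_symmetric: when True, only transitions where the LSB also
    toggles are used (assumed equivalent of the XOR4 gating).
    """
    msb_toggles = m0 ^ m1
    lsb_toggles = l0 ^ l1
    valid = msb_toggles and (lsb_toggles or not select_symmetric)
    if not valid:
        return 0, 0
    early = e ^ m1   # edge sample still equals the earlier MSB
    late = e ^ m0    # edge sample already equals the later MSB
    return early, late

def to_bits(level):
    """Natural binary mapping of a level index 0..3 to (MSB, LSB)."""
    return level >> 1, level & 1

# Enumerate which of the 12 PAM4 transitions produce a PD output
selected = [
    (a, b)
    for a in range(4)
    for b in range(4)
    if a != b and bang_bang_pd(*to_bits(a), 0, *to_bits(b)) != (0, 0)
]

if __name__ == "__main__":
    print(selected)
```

Under these assumptions only 4 of the 12 possible level transitions drive the loop, which trades transition density for a unique zero-crossing position.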



Fig. 4.18 Charge pump.



Fig. 4.19 Simulation of PD and CP for the cases of Early and Late.

Fig. 4.18 shows the simplified circuit diagram of the charge pump. When Early = 1 and Late = 0, the bias current  $I_B$  flows into the left-half circuit and is mirrored to the CP output, which decreases the CP output  $V_{LP}$ . If the clock is later than the data,  $I_B$  flows into the right-half circuit and  $V_{LP}$  increases. When there are no transitions (both Early and Late are 0),  $I_B$  flows into the middle current branch and  $V_{LP}$  holds.  $I_B$  is therefore always on for all phase relations. Without the middle branch,  $I_B$  would be switched on and off, and the switching time would affect the effective output current delivered to the loop capacitor  $C_{LP}$ .

The PD and the CP are simulated together with a PRBS input in the presence of a phase error. Fig. 4.19 shows the simulation results. When the clock is Early,  $V_{LP}$  keeps decreasing; when the clock is Late,  $V_{LP}$  keeps increasing.
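The three-state behavior of the charge pump can be mimicked by an ideal integrator on the loop capacitor. The bias current, capacitance, time step and starting voltage in this sketch are hypothetical values chosen only to make the trend visible; they are not the design values.

```python
def charge_pump_trace(decisions, v_start=0.5, i_b=50e-6, c_lp=10e-12, t_step=100e-12):
    """Integrate I_B on C_LP for a sequence of (early, late) PD decisions.

    Early discharges C_LP, Late charges it, and (0, 0) steers I_B to the
    middle branch so that V_LP holds.
    """
    dv = i_b * t_step / c_lp          # voltage step per decision
    v = v_start
    trace = [v]
    for early, late in decisions:
        if early and not late:
            v -= dv
        elif late and not early:
            v += dv
        trace.append(v)
    return trace

if __name__ == "__main__":
    print(charge_pump_trace([(1, 0)] * 3 + [(0, 0)] * 2 + [(0, 1)] * 3))
```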

## 4.3.7 VCDL

In Fig. 4.2, the frequency-synchronous external clock  $CK_{REF}$  works as the reference clock of the ring oscillator based PLL after passing a VCDL. Since one of the output clocks of the PLL aligns with the delayed  $CK_{REF}$ , the delay adjustment of the VCDL means the phase adjustment of the output clocks of the PLL.

Fig. 4.20 shows the schematic diagram of the VCDL. The delay cell adopts CMOS logic. The delay is controlled by the loading resistance: the tuning control voltage VC changes the effective loading resistance, which is the parallel combination of a positive resistance and a negative resistance. The tunable delay range of the VCDL should be at least 0.5 UI (1 UI = 37 ps) to guarantee correct clock phase recovery, and the designed range should have enough margin to cover the PVT variation and the accumulated phase shift caused by the input data. A chain of delay cells is employed, and Fig. 4.21 shows the simulated delay under different VC values and process corners. The minimum delay range occurs at the ff, -40 °C corner and is 58 ps, which is more than 1.5 UI.



Fig. 4.20 VCDL and its delay cell.



Fig. 4.21 Simulated VCDL delay under corners.

## 4.3.8 PLL

In chapters II and III, the multiple clock phases are generated by the ILRO, in which the drift of the free-running frequency of the RO makes the output phases inaccurate. In addition, the output phases are not equally spaced when there is a frequency error between the injected frequency and the free-running frequency of the ring oscillator. To solve these issues, a PLL is adopted in this design. As shown in Fig. 4.2, the PLL consists of a PFD, a CP, an LPF and an RO. No frequency divider is used because the reference frequency is the same as the output frequency of the PLL. The RO design has been introduced in chapter II and is not repeated here. An ILRO achieves good phase noise performance by using a clean injection clock to correct the output phase periodically. In this PLL, a wide loop bandwidth suppresses the inferior phase noise of the RO over a wide frequency range to improve the jitter performance of the output clocks. The circuit implementation of the building blocks in the PLL is omitted in this thesis.

## **4.4 System Simulation Results**



Fig. 4.22 Layout of 56-Gb/s PAM4 receiver.



Fig. 4.23 Frequency response of the used channel.

The 56-Gb/s PAM4 receiver is designed in a 40-nm CMOS process. Fig. 4.22 shows the full layout, which occupies an area of 0.72 mm<sup>2</sup> (1.1 mm × 0.65 mm). In the simulation, the PAM4 signal is generated by combining two PRBS7 streams with a weight ratio of 2:1. The whole receiver system is simulated under a 1-V supply voltage, and the core parts, excluding the output buffers, consume 36 mW. Fig. 4.23 shows the measured frequency response of the channel used for characterizing the receiver; the loss at the Nyquist frequency of 14 GHz is 9.5 dB. Fig. 4.24 shows the control timing of the key blocks in this receiver. Firstly, the PLL starts and the other blocks are reset to clear the states of all logic circuits. Secondly, all comparators are calibrated with the clocks generated in the first step while the input is zero. Thirdly, the PAM4 input is enabled and the CDR starts to adjust the clock phase so as to align the edge-sampling clock with the input data transitions and sample the data at the middle (optimal) time instant. Even though the input data is only equalized with a preset equalizer setting, the CDR can still lock at the correct phase as long as the data transitions can be detected. Without accurate equalization, the inferior jitter of the input data transfers to the recovered clocks. Finally, the CTLE adaptation starts. During the CTLE adaptation, the clock phase is also adjusted since the equalization changes the speed or delay of the data transitions.



Fig. 4.24 Timing for key blocks in the receiver.



Fig. 4.25 Loop filter output of the CDR.

The calibration process of the comparators has been shown in Fig. 4.13(a). Fig. 4.25 shows the transient waveform of the CDR loop filter output  $V_{LP}$ , which controls the VCDL in the reference clock path of the WBW-PLL. Before the CDR is activated,  $V_{LP}$  is set to half of the supply voltage. In this source-synchronous CDR, the clock frequency is synchronous to the data rate and only the clock phase is to be recovered. As introduced in chapter III, the ILRO-based CDR loop is first order, and the CDR with a WBW-PLL is still first order. In Fig. 4.25,  $V_{LP}$  converges monotonically. During the adaptation, the CTLEs equalize both the data and the edges, so  $V_{LP}$  also changes accordingly.



Fig. 4.26 Equalizer summer output (a) before and (b) after CTLE adaptation.

Fig. 4.26 shows the effect of the CTLE adaptation on the output eye diagrams of the equalizer summers. The time span of the eye diagrams is 8 UI, and the details are explained in Fig. 4.9. Before the adaptation, the equalization ability of the CTLEs is small, and the eye diagram is almost closed, as shown in Fig. 4.26(a). After the adaptation, the CTLE peaking ability has been adjusted by the adaptation block, and the eye diagram in Fig. 4.26(b) is reopened. Fig. 4.27 shows the adaptation process; the details of the adaptation algorithm are introduced in chapter V and [46]. In step 1, the adaptation algorithm performs the initial search of VPK.

In step 2, the CTLE control word starts to increase from an initial value. 8 is chosen as the initial value considering the adaptation time and easier CDR phase locking. At the same time, VPK is adjusted according to the pattern-selection based peak detection. The glitches on VPK and VREF are caused by timing errors among the counter outputs. The adaptation process ends when the CTLE control word wanders between N and N+1, where N is the optimal value of the CTLE control word (13 in this simulation setup). The number of bits is then counted and the bit error rate is calculated by a Verilog-A BERT module. Since the simulation cannot run long enough to count gigabits, the counted bit number reaches 16.5k, and no error bit is detected in this simulation.



Fig. 4.27 Simulation of adaptation process.

Three input signals required by the adaptation algorithm are shown in Fig. 4.28 to reveal the details of the adaptation process more clearly. The three input signals are CK, CID3 and ERR. CID3 = 1 means that three consecutive top levels occur, and ERR = 1 means that the voltage of the current bit is higher than VPK (refer to section 5.1). The rising edges of CK always occur in the middle of the pulses of CID3 and ERR. In step 2 of the algorithm, CID3 = 1 should occur periodically because the PAM4 input is periodic and the obtained VPK is stable. As the simulation results in Fig. 4.27 and Fig. 4.28 show, VPK and CID3 match well with the analysis. In the enlarged part of Fig. 4.28, ERR = 1 only occurs when CID3 = 1 because the signal is still under-equalized and VPK is determined by low-frequency components. As the peaking ability increases, ERR = 1 occurs more often since data transitions become faster and can reach their target level within 1 UI. Finally, the adaptation completes when P(ERR = 1) = 1/8.



Fig. 4.28 Three input signals of adaptation.

The performance of this 56-Gb/s PAM4 receiver is tabulated in Table III, which also includes a comparison with similar works. The proposed receiver achieves the best energy efficiency for the following reasons: 1) a 1/4-rate topology is employed to save power by using the power-efficient circuit blocks mentioned in chapter II; 2) [15, 17] have FLLs, more adaptation functions and other blocks, such as the eye diagram monitor in [17]; 3) in [15, 17], a multi-tap DFE also consumes significant power to meet the stringent timing constraint.
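The efficiency figures in Table III follow from dividing the power by the data rate, since 1 mW per Gb/s equals 1 pJ/bit. The short sketch below recomputes them from the tabulated power values, taking 56 Gb/s as the data rate for all three designs (for [15] this is the upper end of its 40-56 Gb/s range, an assumption about how its figure was obtained).

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy efficiency in pJ/bit: mW divided by Gb/s."""
    return power_mw / data_rate_gbps

# (power in mW, data rate in Gb/s) as listed in Table III
designs = {
    "[15]": (230, 56),
    "[17]": (382, 56),
    "This work": (36, 56),
}
efficiency = {name: energy_per_bit_pj(p, r) for name, (p, r) in designs.items()}

if __name__ == "__main__":
    for name, value in efficiency.items():
        print(f"{name}: {value:.2f} pJ/bit")
```

For this work the ratio is slightly above 0.64 pJ/bit, which the table reports as 0.65 pJ/bit.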

Table III. Performance summary of 56-Gb/s PAM4 receiver and comparison with similar works.

| Reference           | [15] J. Im et al. ISSCC'17                          | [17] P. Peng et al. ISSCC'17                                         | This work (Simulation)                                           |
|---------------------|-----------------------------------------------------|----------------------------------------------------------------------|------------------------------------------------------------------|
| Data Rate (Gb/s)    | 40-56                                               | 56                                                                   | 56                                                               |
| Modulation          | PAM4                                                | PAM4                                                                 | PAM4                                                             |
| Clocking            | 1/2-rate                                            | 1/2-rate                                                             | 1/4-rate                                                         |
| Function            | CTLE<br>10-tap DFE<br>All adaptation<br>CDR FLL+PLL | CTLE<br>3-tap DFE<br>DFE adaptation<br>CDR FLL+PLL<br>Eye monitoring | CTLE<br>1-tap FFE<br>1-tap DFE<br>CTLE adaptation<br>CDR PLL     |
| Power (mW)          | 230                                                 | 382                                                                  | 36                                                               |
| Efficiency (pJ/bit) | 4.1                                                 | 6.8                                                                  | 0.65                                                             |
| Channel Loss (dB)   | 10 @ 14 GHz                                         | 24 @ 14 GHz                                                          | 9.5 @ 14 GHz                                                     |
| Process             | 16 nm FinFET                                        | 40 nm                                                                | 40 nm                                                            |
| Chip Area (mm²)     | 0.364*                                              | 1.26                                                                 | 0.72                                                             |

<sup>\*</sup> This is the chip core area and others are total chip area.

## 4.5 Conclusion

In this chapter, the proposed 56-Gb/s PAM4 receiver with CTLE adaptation is introduced. The design details of several key building blocks are reported. A pattern-selection based bang-bang PD is also proposed to eliminate the issue of non-unique crossing points of PAM4 level transitions. The simulation results show that the adaptive PAM4 receiver achieves error-free operation over the 16.5k simulated bits while compensating a 9.5-dB channel loss at 14 GHz with an energy efficiency of 0.65 pJ/bit.

## **Chapter V Equalizer Adaptation Modeling**

With the rapid growth of wireline communications, wireline receivers must tolerate more and more channel loss since the improvement of channel quality lags far behind. To provide more equalization ability, wireline receivers usually employ both linear equalizers, such as the CTLE, and the DFE to cancel all ISI including pre- and post-cursors. In practical applications, these equalizers have to be adaptive to accommodate different channels and variable working environments. For instance, the channel length can differ, leading to variable channel loss. Even for a fixed channel, the I/Os may experience changes in the working environment. So, equalizer adaptation is in high demand.

To achieve adaptation, two tasks are usually required. The first is to find a reference that indicates the desired data format when optimal equalization is reached. The second is to use the obtained reference to guide the equalizer in adjusting its equalization ability. For linear equalizers, there is often one adjustable degree of freedom, so the adaptation algorithm adjusts the equalizer monotonically and is relatively simple. To achieve a better equalization effect, a DFE has to employ multiple taps, and the corresponding algorithm that generates all tap coefficients is necessarily more complicated. In addition, linear equalizers usually amplify high-frequency components while maintaining the low-frequency ones, whereas a DFE has a negative gain in dB. Therefore, the adaptation for the two kinds of equalizer is quite different. In this chapter, two adaptation algorithms for the CTLE and the DFE in NRZ/PAM4 systems are presented.

## **5.1 CTLE Adaptation**

Several adaptation schemes have been reported for NRZ signaling [47–51]. The scheme based on spectrum balancing in [47] has limitations in robustness, speed scalability, and data pattern requirements. Another adaptation scheme is based on the observation that the signal peak value is not attenuated by lossy channels under under-equalization, since it is determined by the longest consecutive identical data (CID), while the peak value increases when over-equalization occurs [48–50]. For a PAM4 signal generated by combining two PRBS7 streams, however, the probability that the longest CID length is shorter than or equal to 4 is 91.4%, while the longest CID length of PRBS7 is 7. Therefore, the peak value of an under-equalized PAM4 signal may be inaccurate, particularly with heavily lossy channels. Moreover, since the PAM4 signal contains more signal levels and transitions, the optimal eye diagram should be redefined. In this section, a two-step adaptation scheme is proposed for the CTLE in the PAM4 receiver of chapter IV. The peak detection based on probability theory in step 1 and the pattern selection in step 2 continuously monitor and adjust the peak value, while the PAM4 top-level distribution around the peak value is monitored to achieve the best vertical eye opening.

### **5.1.1 Adaptation algorithm**



Fig. 5.1 PAM4 eye diagram with noise.

The optimal PAM4 eye diagram is defined before the adaptation algorithm is introduced. Fig. 5.1 shows the PAM4 eye diagram with the noise in the gray area. vref is the peak amplitude determined by low-frequency components. The optimal eye diagram occurs when the top level L3 is distributed equally on the two sides of vref: P(L3 > vref) = 1/2, or equivalently P(L > vref) = 1/8 since the top level occurs with a probability of 1/4 when the four levels are equally likely. Note that the peak value of the PAM4 signal is denoted as VPK in Fig. 4.11 and as vref in Fig. 5.1. Likewise, the voltage for distinguishing between L2 and L3 is denoted as VREF in Fig. 4.11 and as 2vref/3 in Fig. 5.1.



Fig. 5.2 Considered VD[n] in (a) peak detection based pattern selection, and (b) initial peak search.

To obtain vref, 3 consecutive bits are evaluated as shown in Fig. 5.2(a). If the values of all three bits are larger than 2vref/3, like the drawn transitions, the value of D[n], VD[n], is not determined by the equalization strength and can be used to adjust vref. If VD[n] > vref, vref is too low and is increased; if VD[n] < vref, vref is too high and is decreased. At first, vref is 0, and top levels cannot be recognized. Therefore, an initial search of vref should be conducted before the above pattern-selection based vref search, and a criterion should be set to end this initial search. In the pattern-selection based vref search, VD[n] is considered when at least three consecutive top levels occur. The probability of at least three consecutive top levels is 1/64, or  $(1/4)^3$ . In the initial search, all VD[n] circled in red in Fig. 5.2(b) are considered, and the probability P(L > vref) = 1/128 is the ending criterion of the initial search, chosen as a tradeoff between accuracy and adaptation time. The initial vref search is step 1. The pattern-selection based vref search and the peaking determination of the CTLE run concurrently in step 2. For the peaking determination of the CTLE, all VD[n] are considered, and the probability P(L > vref) increases from 1/128 to 1/8 when the optimal peaking is achieved.



Fig. 5.3 CTLE adaptation model in PAM4 receiver.

As shown in Fig. 5.3, a behavior-level model is built to verify the function of the proposed adaptation algorithm. The CTLE peaking ability increases with its control word DEQ[n]. The CTLE output y[n] is sliced into thermometer codes by three slicers with offsets of 2*vref*/3, 0, and -2*vref*/3, respectively. T2D[n] = 1 means a top PAM4 level, and two delay units are employed to save the two preceding bits T2D[n-1] and T2D[n-2]. A three-input AND gate with output CID tells whether there are three consecutive top levels. The algorithm mainly includes two parts: the left part is an accumulator which uses ERR[n] under CID[n] = 1 to generate DREF and then *vref* through a DAC; the right part adjusts the equalization ability by calculating P(ERR[n] = 1). The first step for the initial *vref* search is not shown in Fig. 5.3. The generated *vref* is used not only for the CTLE adaptation but also, after multiplication by a factor of 2/3, for the PAM4-to-NRZ decoding. Therefore, generating an accurate *vref* is critical.
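The two concurrent loops of step 2 can be sketched as a block-wise update. This is a minimal illustration and not the model of Fig. 5.3: the block-based voting, the vref step size, the DEQ range and the function name are assumptions, and the initial vref search of step 1 is left out.

```python
def adapt_block(y, vref, deq, target=1 / 8, vref_step=0.005, deq_max=15):
    """One step-2 update of vref and DEQ from a block of CTLE output samples.

    y: list of sampled CTLE output values y[n].
    target: desired P(ERR = 1); 1/8 for PAM4 (1/4 would apply to NRZ).
    Returns (new_vref, new_deq, measured P(ERR = 1)).
    """
    top = [v > 2.0 * vref / 3.0 for v in y]   # T2D[n]: top PAM4 level
    err = [v > vref for v in y]               # ERR[n]: sample above vref

    # Left part: vref accumulator, driven only when CID[n] = 1
    votes = 0
    for n in range(2, len(y)):
        cid = top[n] and top[n - 1] and top[n - 2]
        if cid:
            votes += 1 if err[n] else -1
    if votes > 0:
        vref += vref_step                     # peak is above vref: raise it
    elif votes < 0:
        vref -= vref_step                     # peak is below vref: lower it

    # Right part: peaking control from the probability of ERR = 1
    p_err = sum(err) / len(y)
    if p_err < target:
        deq = min(deq + 1, deq_max)           # under-equalized: more peaking
    elif p_err > target:
        deq = max(deq - 1, 0)                 # over-equalized: less peaking
    return vref, deq, p_err

if __name__ == "__main__":
    block = [1.0, 1.0, 1.0] + [0.0] * 29      # toy data, not a channel simulation
    print(adapt_block(block, vref=0.9, deq=5))
```

In a closed loop, the function would be called on successive blocks of samples produced with the updated DEQ, and DEQ would eventually toggle between two adjacent values around the target probability.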

The above algorithm can also be used in NRZ receivers. Fig. 5.4 shows a CTLE adaptation model for NRZ receivers. The vref generation is the same as in Fig. 5.3, and the only difference is that the target P(ERR[n] = 1) is changed from 1/8 to 1/4 since there are only two levels in an NRZ signal.



Fig. 5.4 CTLE adaptation model in NRZ receiver.

### **5.1.2 Simulation Results**

### A. PAM4 Adaptation

Fig. 5.5 shows the CTLE adaptation process of DEQ and *vref*. The CTLE equalization strength is initially set to its minimum. The initial search quickly reaches an approximate *vref* value; the pattern-selection-based search of *vref* then starts together with the adjustment of the equalization strength. As DEQ increases, the PAM4 eye opening improves, and the optimal equalization is reached when DEQ dithers between 10 and 11. Fig. 5.6 shows the CTLE outputs before and after the CTLE adaptation. At minimum equalization, the eye diagram in Fig. 5.6(a) is almost closed. The eye diagram in Fig. 5.6(b), obtained when the adaptation completes, is optimal. To further verify that the eye diagram at DEQ = 10 is optimal, the CTLE setting is swept from 0 to 14 in steps of 2, and the corresponding BER bathtub curves are estimated using a Gaussian model. The estimated bathtub curves are shown in Fig. 5.7. It is clear that the optimal equalization occurs at DEQ = 10. The BER degradation for DEQ > 10 also reveals that the PAM4 signal is very sensitive to over-equalization.
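A Gaussian-model BER estimate of the kind mentioned above can be sketched as follows. The exact procedure behind Fig. 5.7 is not given in the text, so this is only a generic illustration: each eye is assumed to be bounded by two Gaussian-distributed levels, the symbols are assumed equiprobable, and all function and parameter names are hypothetical. For PAM4, the same estimate would be applied to each of the three eyes.

```python
import math
from typing import List, Sequence, Tuple

# (mean of upper level, sigma of upper level, mean of lower level, sigma of lower level)
Stats = Tuple[float, float, float, float]


def q_function(x: float) -> float:
    """Tail probability of a standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def gaussian_eye_ber(mu_hi: float, sigma_hi: float,
                     mu_lo: float, sigma_lo: float,
                     threshold: float) -> float:
    """BER of one eye: average of the two tail probabilities across the threshold."""
    p_hi_err = q_function((mu_hi - threshold) / sigma_hi)
    p_lo_err = q_function((threshold - mu_lo) / sigma_lo)
    return 0.5 * (p_hi_err + p_lo_err)


def bathtub(stats_per_phase: Sequence[Stats], threshold: float) -> List[float]:
    """One BER point per sampling phase, given the level statistics at each phase."""
    return [gaussian_eye_ber(*s, threshold) for s in stats_per_phase]
```

Sweeping the sampling phase across one unit interval and plotting the returned list on a logarithmic axis gives the bathtub shape; repeating this for each CTLE setting would allow a comparison like the one in Fig. 5.7.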



Fig. 5.5 CTLE adaptation process of DEQ and *vref* in PAM4 receivers.



Fig. 5.6 CTLE output eye diagrams (a) before and (b) after CTLE adaptation in PAM4 receivers.



Fig. 5.7 The calculated BER bathtub curves under different CTLE setting in PAM4 receivers.

### B. NRZ Adaptation



Fig. 5.8 CTLE adaptation process of DEQ and vref in NRZ receivers.

Fig. 5.8 shows the corresponding CTLE adaptation simulation for an NRZ receiver that experiences the same channel loss and works at the same baud rate as in the PAM4 receiver simulation above. The equalization converges to the optimum at DEQ = 10. That both simulations converge to the same optimal DEQ indicates the good accuracy of the adaptation algorithm. Fig. 5.9 shows the NRZ eye diagrams before and after the adaptation. With adaptation, the NRZ eye diagram improves dramatically in both the vertical and horizontal directions.



Fig. 5.9 CTLE output eye diagrams (a) before and (b) after CTLE adaptation in NRZ receivers.

## 5.2 DFE Adaptation

Many DFE adaptation algorithms have been proposed. Among them, the LMS-based algorithm is the most popular because of its stability, simplicity and efficiency. LMS-based adaptive DFEs have been extensively reported, but the details of the adaptation design are rarely disclosed [51]. In addition, a PAM4 DFE is quite different from its NRZ counterpart, and its adaptation algorithm is worth studying [17]. In this section, an LMS-based algorithm compatible with both NRZ and PAM4 DFEs is proposed, and its design details are presented.

### **5.2.1 DFE Adaptation Algorithm**

A first-order RC low-pass channel is employed to analyze the DFE adaptation algorithm. Fig. 5.10(a) shows the channel pulse response (-1 to 1). The gray eye diagram in Fig. 5.10(b) shows the degradation caused by all post-cursors. With all pre-cursors being zero and the post-cursors cancelled by a DFE, the eye opening improves to $2a_0$; the black eye diagram in Fig. 5.10(b) shows the effect of a 3-tap DFE with $a_1 = 0.24$, $a_2 = 0.09$ and $a_3 = 0.03$. The level difference between -1 and the open-eye ceiling of the gray eye diagram is also $2a_0$, so the ordinate of the open-eye ceiling (the minimum level of logic one) is $2a_0 - 1$. In Fig. 5.10(b), with all ISI cancelled by a multi-tap DFE, the levels of logic one, spread between L1 and L3, converge to their middle level L2 ($a_0$).
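The relation between $a_0$ and the logic-one levels can be checked numerically. The sketch below assumes an ideal first-order RC channel driven by a unit pulse and sampled at the end of each unit interval, so that the cursors form a geometric series summing to 1; the ratio T/τ and the function names are illustrative assumptions, not values taken from Fig. 5.10.

```python
import math
from typing import List, Tuple


def rc_cursors(t_over_tau: float, n_post: int) -> List[float]:
    """Main cursor a0 and n_post post-cursors of a first-order RC channel.

    Assumes a unit pulse of one UI sampled at the end of each UI,
    so a_k = (1 - r) * r**k with r = exp(-T/tau).
    """
    r = math.exp(-t_over_tau)  # decay over one unit interval
    a0 = 1.0 - r
    return [a0 * r ** k for k in range(n_post + 1)]


def logic_one_levels(cursors: List[float]) -> Tuple[float, float]:
    """Return (minimum, maximum) sampled level of logic one for -1/+1 NRZ data."""
    a0 = cursors[0]
    isi = sum(cursors[1:])     # worst-case ISI: all post-cursors add with the same sign
    return a0 - isi, a0 + isi


cursors = rc_cursors(1.0, 200)
lowest, highest = logic_one_levels(cursors)
```

Because the cursors sum to 1, the worst-case ISI is $1 - a_0$, which gives a minimum logic-one level of $a_0 - (1 - a_0) = 2a_0 - 1$, a maximum of 1, and a middle level of $a_0$. With T/τ = 1 the first three post-cursors come out at roughly 0.23, 0.09 and 0.03, close to the values quoted above.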



Fig. 5.10 1<sup>st</sup>-order RC channel simulation (a) pulse response and (b) NRZ eye diagrams before and after the ISI is cancelled by a 3-tap DFE.



Fig. 5.11 NRZ DFE model and adaptation algorithm.

Fig. 5.11 shows an NRZ DFE model. The input x[n] is equalized to y[n] to remove the input ISI, and y[n] is then digitized to $\hat{y}[n]$ by a clocked slicer. To adaptively generate the DFE tap coefficients $a_k[n]$, the LMS algorithm gives the following equations [52].

$$\varepsilon[n] = y[n] - \hat{y}[n] \tag{5.1}$$

$$a_k[n+1] = a_k[n] + \mu \varepsilon[n] y[n-k] \tag{5.2}$$

where $\mu$ is the update step of $a_k[n]$. In (5.1), the error $\varepsilon[n]$ between the equalized signal y[n] and the desired signal $\hat{y}[n]$ guides (5.2) in updating $a_k[n]$. As mentioned above, $\hat{y}[n]$ has been sliced to a logic level and cannot be used directly to generate $\varepsilon[n]$. The information required from $\hat{y}[n]$ in (5.1) is vref, the amplitude of the optimally equalized y[n]. Since the DC level of y[n] is set to zero, vref should equal $a_0$. It is well known that a DFE increases the eye opening at the cost of a reduced amplitude, so vref should lie among the y[n] levels of logic one. For the RC channel discussed above, vref is the middle level of logic one. Therefore, vref can be chosen as the average of the positive y[n]:

$$\sum_{n=0}^{\infty} (y[n] - vref) = 0 \quad \text{for } y[n] > 0 \tag{5.3}$$

At the beginning, an error may exist between $a_0$ and the *vref* obtained from (5.3), but this error shrinks because the y[n] levels of logic one converge as the DFE adaptation proceeds. As shown in Fig. 5.11, the *vref* generator consists of a slicer, an accumulator and a DAC. Only positive y[n] (D[n] > 0) is compared with the generated *vref*, and the slicer output ERR[n] is accumulated into DREF[n], which is then converted to *vref* by the DAC. For stability, DREF[n] is usually taken as only part of the accumulator output. Digitizing (5.2) gives the following equations.

$$A_k[n+1] = A_k[n] + U \cdot \mathrm{sign}(\varepsilon[n]) \cdot \mathrm{sign}(y[n-k]) \tag{5.4}$$

$$= A_k[n] + U \cdot \mathrm{sign}(ERR[n]) \cdot \mathrm{sign}(D[n-k]) \tag{5.5}$$

where U is the digital update step. Equation (5.5) can be understood intuitively: if y[n] > vref[n], the $k^{th}$ post-cursor of a preceding positive bit D[n-k] still contributes to the positive ERR[n], so $A_k[n]$ should increase, and vice versa. As shown in Fig. 5.11, (5.5) is easy to implement, and $A_k[n]$ updates only when D[n] > 0 because vref is positive.
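The *vref* search of (5.3) and the sign-sign update of (5.5) can be exercised together in a short behavior-level simulation. The sketch below is an illustration under stated assumptions, not the Verilog/Verilog-A model of this work: the channel has no pre-cursors, the main cursor 0.64 is chosen so that the cursors sum to 1, the step sizes and the initial *vref* are arbitrary, and the accumulator and DAC are idealized as a floating-point variable.

```python
import random
from typing import List, Tuple


def adapt_nrz_dfe(cursors: List[float], n_samples: int = 40000,
                  u_tap: float = 0.001, u_vref: float = 0.002,
                  vref_init: float = 1.0, seed: int = 1) -> Tuple[float, List[float]]:
    """Sign-sign LMS adaptation of an NRZ DFE with a tracked vref.

    cursors[0] is the main cursor a0, cursors[1:] are the post-cursors.
    Returns the final vref and the final tap coefficients A_k.
    """
    rng = random.Random(seed)
    n_taps = len(cursors) - 1
    taps = [0.0] * n_taps
    vref = vref_init
    tx_hist = [1] * n_taps  # transmitted bits d[n-1] ... d[n-K]
    d_hist = [1] * n_taps   # receiver decisions D[n-1] ... D[n-K]
    for _ in range(n_samples):
        d = rng.choice((-1, 1))
        # Channel output: main cursor plus post-cursor ISI.
        x = cursors[0] * d + sum(a * b for a, b in zip(cursors[1:], tx_hist))
        # DFE: subtract the feedback built from past decisions.
        y = x - sum(a * b for a, b in zip(taps, d_hist))
        dec = 1 if y > 0 else -1
        if dec > 0:  # update only on logic one, as vref is positive
            err = 1 if y > vref else -1
            vref += u_vref * err                      # accumulator + ideal DAC
            for k in range(n_taps):
                taps[k] += u_tap * err * d_hist[k]    # sign-sign update, as in (5.5)
        tx_hist = [d] + tx_hist[:-1]
        d_hist = [dec] + d_hist[:-1]
    return vref, taps


vref_final, taps_final = adapt_nrz_dfe([0.64, 0.24, 0.09, 0.03])
```

With these assumptions, `vref_final` should settle near $a_0$ and `taps_final` near the three post-cursors, dithering by a few update steps around those values.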



Fig. 5.12 PAM4 DFE model and adaptation algorithm.

The adaptation algorithm can also be used in a PAM4 DFE. Fig. 5.12 shows a typical PAM4 DFE topology [54]. y[n] is sliced into thermometer codes T2D[n], T1D[n] and T0D[n] by three slicers with offsets of 2vref/3, 0, and -2vref/3, where vref is the average of the PAM4 top levels (T2D[n] > 0). In the PAM4 DFE, $\hat{y}[n]$ is the output of the slicer with the offset of 2vref/3, and the D[n] input of the adaptation block is T2D[n]. In (5.5), the polarity of D[n-k] used for updating $A_k[n]$ is replaced by that of T1D[n-k]. With these substitutions, the adaptation algorithm applies to both NRZ and PAM4 DFEs without further modification.
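The two substitutions described above can be captured in a single update step. This is a hedged sketch rather than the model of Fig. 5.12: the function name, the representation of the T1D polarity as +1/-1, and the idealized floating-point *vref* are assumptions made for illustration.

```python
from typing import List, Tuple


def pam4_adapt_step(y: float, vref: float, taps: List[float],
                    t1_hist: List[int], u_tap: float, u_vref: float
                    ) -> Tuple[float, List[float]]:
    """One adaptation step for a PAM4 DFE.

    t1_hist[k-1] holds the polarity of T1D[n-k] as +1 or -1.
    Returns the updated vref and tap coefficients.
    """
    top_level = y > 2.0 * vref / 3.0      # T2D[n]: slicer with offset 2*vref/3
    if not top_level:
        return vref, list(taps)           # update only when a top level is detected
    err = 1 if y > vref else -1           # ERR[n] from comparing y[n] with vref
    new_vref = vref + u_vref * err
    # (5.5) with the polarity of D[n-k] replaced by that of T1D[n-k]
    new_taps = [a + u_tap * err * p for a, p in zip(taps, t1_hist)]
    return new_vref, new_taps
```

For example, a sample y = 1.0 with vref = 0.9 raises vref and moves each tap in the direction of the corresponding T1D polarity, while a sample below 2vref/3 leaves everything unchanged.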

### **5.2.2 Simulation Results**

The adaptive NRZ and PAM4 DFE models in Fig. 5.11 and Fig. 5.12 are built in Cadence using Verilog and Verilog-A. Because the purpose is to verify the algorithm, the parasitics of the blocks are not considered. Two measured PCB traces are employed as the channels to test the adaptive DFE models, and the pulse response of channel #1 is shown in Fig. 5.13. A practical channel usually has non-zero pre-cursors and many post-cursors, so the 3-tap DFE models cannot cancel all ISI. Fig. 5.14 shows the NRZ and PAM4 eye diagrams at the same baud rate before and after the DFE adaptation. The dramatic improvement from the left eye diagrams to the right ones shows that the adaptation is effective for both NRZ and PAM4 DFEs. Fig. 5.15 shows the adaptation process of vref and the DFE tap coefficients for NRZ and PAM4 inputs. At first, vref starts from a large initial value and decreases until y[n] > vref occurs; vref and the DFE tap coefficients then adapt concurrently in steps of 0.016 and 0.005, respectively. Note that the PAM4 tap coefficients should be 1/3 of the post-cursors in Fig. 5.13. The channel #1 pulse response and the NRZ/PAM4 adaptation results are summarized in Table IV. The small error between the converged vref values ($a_{0\_NRZ}$ and $a_{0\_PAM4}$) and $a_0$ indicates the good accuracy of (5.3). $a_1$ and $a_2$, which contribute most of the ISI, are also cancelled precisely. The converged $a_{3\_NRZ}$ and $a_{3\_PAM4}$ are less accurate because $a_3$ is as small as the update step and the DFE has a limited number of taps. The DFE adaptation results for channel #2, which has more loss than channel #1, are also listed in Table IV and further demonstrate the adaptation accuracy.



Fig. 5.13 Pulse response of channel #1 (11-cm Rogers PCB trace).



Fig. 5.14 Simulated eye diagrams before (left) & after (right) adaptation: (a) NRZ input and (b) PAM4 input.



Fig. 5.15 Adaptation process of vref and DFE tap coefficients (a) NRZ input and (b) PAM4 input.

Table IV: Adaptation results of 3-tap NRZ/PAM4 DFEs with channels #1 and #2

|            | k                              | 0     | 1     | 2     | 3     |
|------------|--------------------------------|-------|-------|-------|-------|
| Channel #1 | Pulse response: $a_k$          | 0.710 | 0.143 | 0.043 | 0.008 |
|            | NRZ adaptation: $a_{k\_NRZ}$   | 0.709 | 0.139 | 0.045 | 0.013 |
|            | PAM4 adaptation: $a_{k\_PAM4}$ | 0.703 | 0.045 | 0.014 | 0.001 |
| Channel #2 | Pulse response: $a_k$          | 0.591 | 0.169 | 0.066 | 0.038 |
|            | NRZ adaptation: $a_{k\_NRZ}$   | 0.603 | 0.178 | 0.066 | 0.044 |
|            | PAM4 adaptation: $a_{k\_PAM4}$ | 0.589 | 0.060 | 0.020 | 0.012 |

For the optimal 3-tap DFE, $a_0 \approx a_{0\_NRZ} \approx a_{0\_PAM4}$ and $a_k \approx a_{k\_NRZ} \approx 3a_{k\_PAM4}$ (k = 1, 2, 3).
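The relations in the table note can be checked directly against the Table IV entries. The short script below uses only the tabulated values; the helper names are illustrative.

```python
from typing import Dict, List

# Values copied from Table IV (k = 0, 1, 2, 3).
pulse: Dict[str, List[float]] = {
    "ch1": [0.710, 0.143, 0.043, 0.008],
    "ch2": [0.591, 0.169, 0.066, 0.038],
}
nrz: Dict[str, List[float]] = {
    "ch1": [0.709, 0.139, 0.045, 0.013],
    "ch2": [0.603, 0.178, 0.066, 0.044],
}
pam4: Dict[str, List[float]] = {
    "ch1": [0.703, 0.045, 0.014, 0.001],
    "ch2": [0.589, 0.060, 0.020, 0.012],
}


def scaled_pam4(taps: List[float]) -> List[float]:
    """Keep a0 as is and multiply the PAM4 taps (k >= 1) by 3 for comparison with a_k."""
    return [taps[0]] + [3.0 * a for a in taps[1:]]


def max_abs_error(ref: List[float], est: List[float]) -> float:
    """Largest absolute deviation between the pulse response and the adapted values."""
    return max(abs(r - e) for r, e in zip(ref, est))


errors = {
    ch: (max_abs_error(pulse[ch], nrz[ch]),
         max_abs_error(pulse[ch], scaled_pam4(pam4[ch])))
    for ch in pulse
}
```

For both channels the largest deviation stays at about 0.012 or below (the NRZ $a_0$ of channel #2), which is smaller than the vref step of 0.016 quoted above.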

## 5.3 Conclusion

This chapter has introduced two proposed adaptation algorithms, one for the CTLE and one for the DFE. For each algorithm, behavior models for both NRZ and PAM4 receivers have been built. The simulation results demonstrate the good adaptation accuracy of both algorithms.

# **Chapter VI Summary and Future Work**

## **6.1 Summary**

This thesis focuses on high-speed NRZ/PAM4 electrical receiver system-on-a-chip design, and three receivers have been reported in Chapters II–IV. The 24-Gb/s PAM4 receiver in Chapter II and the 26-Gb/s NRZ receiver in Chapter III have been presented in full with measurement results. For the 56-Gb/s PAM4 receiver in Chapter IV, simulation results are given to show the function and performance. In addition, two proposed adaptation algorithms for CTLE and DFE are introduced in Chapter V.

For the 24-Gb/s PAM4 receiver, the design focus is the PAM4-to-NRZ decoder. Compared with a decoder using three comparators, the proposed AVGR-based decoder achieves better power efficiency by taking advantage of the Gray-coded PAM4 levels and a function-merging technique. Adaptive decoding is also realized with the proposed decoder. With the 1/4-rate topology, the voltage-mode comparators make the receiver more power-efficient than the current-mode comparators of a full-rate topology. The receiver is fabricated in a 28-nm CMOS process and achieves a good bit efficiency of 1.38 pJ/bit.

For the 26-Gb/s NRZ receiver, the design focus is to achieve superior power efficiency. As in the 24-Gb/s PAM4 receiver, the 1/4-rate topology is adopted to reduce the working frequency to 1/4. An LSPD is proposed based on the observation that the edge sample value of an NRZ signal is proportional to the clock-data phase error within a finite range. Both data and edge equalization are embedded in the LSPD by reusing the linear data and edge samples, with good power efficiency. To cancel the first post-cursor, the employed FFE has timing and power advantages over a DFE in the 1/4-rate topology. The receiver is fabricated in a 28-nm CMOS process and achieves a superior bit efficiency of 0.31 pJ/bit while compensating a 14-dB channel loss at 13 GHz.

For the 56-Gb/s PAM4 receiver, the design focus is to explore PAM4 circuit techniques and lay the foundation for next-generation electrical links. Some design experience and circuit techniques from Chapters II–III are reused in this receiver, including the 1/4-rate topology and equalizers such as the CTLE, FFE and DFE. To deal with the bandwidth-sensitive PAM4 signaling, the proposed CTLE algorithm helps the receiver choose the optimal peaking adaptively. To compensate the offset of the comparators, a self-calibration technique is proposed. Using the WBW-PLL in this receiver instead of the ILROs in Chapters II–III has the following advantages: 1) the noise performance is not degraded; 2) the disturbance of the injected clock on the output phases is eliminated; and 3) accurate phases of the multiple output clocks are achieved. The bang-bang PD with transition selection helps to improve the output jitter performance. The receiver is designed in a 40-nm CMOS process, and the simulation results show a bit efficiency of 0.65 pJ/bit while compensating a 9.5-dB channel loss at 14 GHz.

Chapter V introduces two proposed adaptation algorithms. The adaptation algorithm for CTLEs employs pattern selection and statistics to adjust the equalization strength adaptively. The algorithm is also verified in the 56-Gb/s PAM4 receiver. An LMS-based DFE adaptation algorithm is introduced, and its design details are given explicitly. The good adaptation accuracy of both algorithms is verified by behavior models.

The proposed high-speed, power-efficient, source-synchronous receivers are suited to low-power applications and to applications with heavy data traffic. For instance, in a supercomputing system with a huge number of cores and memories, the cores and memories that are on the same board and close to each other can share a global clock generator and use high-speed, power-efficient, source-synchronous interfaces to lower the total electricity cost.

#### **Summary of main contributions:**

### 1. 24-Gb/s PAM4 receiver design:

- An AVGR-based PAM4-to-NRZ decoder is proposed to achieve power-efficient receiver design

### 2. 26-Gb/s NRZ receiver design:

- A mixed equalizer is proposed for both data and edge equalization, and edge equalization is analyzed in detail
- Source-synchronous CDR system loop is analyzed

### 3. 56-Gb/s PAM4 receiver design:

- PAM4 bang-bang PD with data transition selection is proposed
- Self-calibrated comparators are proposed and implemented
- Mixed equalizers including adaptive CTLE, FFE and DFE are employed
- A wide-bandwidth PLL is employed to replace the ILRO for multi-phase clock generation to solve the free-running frequency drift of the ring oscillator

### 4. Equalizer adaptation modeling:

- Adaptive algorithm for CTLEs in NRZ/PAM4 receivers is proposed and verified by behavior-level models
- Adaptive algorithm for DFE in NRZ/PAM4 receivers is proposed and verified by behavior-level models

### **6.2 Future work**

In this thesis, three receivers are reported as a series of works. Many techniques, from the system level to the block level, are reused across them, for instance the 1/4-rate topology and the equalizers. For this very reason, much supplementary work remains to be done.

All three receivers use the source-synchronous topology. Although the ILROs in the first and second receivers have been replaced by a WBW-PLL in the third receiver to make it more automatic and to overcome the issues of the ILROs, a forwarded synchronous clock is still required. Source-synchronous receivers are power-efficient since no frequency-locked loop is needed and only the clock phase has to be recovered, but the data and clock have to be transmitted simultaneously, which requires one more channel and makes this topology less attractive in many applications. Therefore, a frequency-locked loop (FLL) is highly desirable. In general, CDRs fall into two categories, with or without a reference clock. Besides being provided as a source-synchronous clock, the reference clock can also be generated by a crystal whose frequency should be so close to an integer division of the data rate that the difference between the FLL output frequency and the data rate can be tracked by the CDR loop. For the most cost-efficient reference-less CDR, the clock is extracted from the data stream, and [53-55] present several frequency detectors (FDs) of this kind. More literature on FLLs should be reviewed so that introducing the FLL function does not significantly degrade the power efficiency of the receiver. In addition, the PLL of the CDR should be modified accordingly to work with the FLL more efficiently.

In all three receivers, even though multiple equalizers have been adopted for channel compensation, the maximum equalization capability is still less than 15 dB at the Nyquist frequency. For more demanding channels, a 1-tap FFE or 1-tap DFE can hardly provide good equalization, which degrades the eye opening. Therefore, multi-tap equalizers such as FFEs and DFEs should be studied to achieve more accurate equalization, especially in PAM4 systems. However, multi-tap equalizers inevitably increase the power consumption. To maintain a power-efficient design, low-power equalizer techniques should also be studied.

Chapter V introduces two proposed equalization adaptation algorithms. The algorithm for CTLEs is verified by the behavior model and the 56-Gb/s PAM4 receiver, but no experimental results have been obtained so far. The algorithm for DFEs has also only been simulated with a behavior model. During the measurement of the 56-Gb/s PAM4 receiver, the algorithm will be tested with different channels. However, the verification of the DFE algorithm is far from sufficient. A DFE circuit should be designed and fabricated, and the algorithm can be implemented either on-chip or off-chip via an FPGA.

## **Bibliography**

- [1] Cisco Visual Networking Index: Forecast and Methodology, 2016-2021
- [2] https://www.ntia.doc.gov/page/2011/united-states-frequency-allocation-chart
- [3] http://www.qualcomm.cn/invention/5g
- [4] http://www.fiber-optic-tutorial.com/40g-vs-100g.html
- [5] H.Tamura, IEEE SSCS Distinguished Lecture in HKUST, Aug. 2017.
- [6] http://literature.cdn.keysight.com/litweb/pdf/5992-0657EN.pdf
- [7] B. Zhang *et al.*, "A 28 Gb/s multi-standard serial link transceiver for backplane applications in 28 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 50, no. 12, pp. 3089-3100, Dec. 2015.
- [8] J. Lee *et al.*, "A 20-Gb/s full-rate linear clock and data recovery circuit with automatic frequency acquisition," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3590-3602, Dec. 2009.
- [9] Y. Lu and E. Alon, "Design techniques for a 66 Gb/s 46 mW 3-tap decision feedback equalizer in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3243-3257, Dec. 2013.
- [10] J. W. Jung and B. Razavi, "A 25-Gb/s 5-mW CMOS CDR/Deserializer," *IEEE J. Solid-State Circuits*, vol. 48, no. 3, pp. 684-697, Dec. 2013.
- [11] R. Farjad-rad *et al.*, "An equalization for 4-PAM signaling over long cables," in *Proc. IEEE CAS Mixed Signal Conf.* July 1997, pp. 19-22.
- [12] R. Farjad-rad *et al.*, "A 0.3-um CMOS 8-Gb/s 4-PAM serial link transceiver," *IEEE J. Solid-State Circuits*, vol. 35, no. 5, pp. 757-764, May 2000.
- [13] M. Kossel *et al.*, "A 10 Gb/s 8-tap 6b 2-PAM/4-PAM Tomlinson-Harashima precoding transmitter for future memory-link applications in 22-nm SOI CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3268-3284, Dec. 2013.
- [14] P. Chiang et al., "60Gb/s NRZ and PAM4 transmitter for 400GbE in 65nm CMOS," IEEE Int. Solid-State Circuit Conf. Dig. Tech. Papers, Feb. 2014, pp. 42-43.

- [15] J. Im *et al.*, "A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3486-3502, Dec. 2017.
- [16] Y. Frans *et al.*, "A 56-Gb/s PAM4 wireline transceiver using a 32-way time-interleaved SAR ADC in 16-nm FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 4, pp. 1101-1110, Apr. 2017.
- [17] P. Peng *et al.*, "A 56Gb/s PAM-4/NRZ transceiver in 40nm CMOS," *IEEE Int. Solid-State Circuit Conf. Dig. Tech. Papers*, Feb. 2017, pp. 110-111.
- [18] http://www.oiforum.com/wp-content/uploads/50317-FOE-Architecture-Presentation.pdf
- [19] D. Cui *et al.*, "A 320mW 32Gb/s 8b ADC-based PAM-4 analog front-end with programmable gain control and analog peaking in 28nm CMOS," in *IEEE Int. Solid-State Circuit Conf. Dig. Tech. Papers*, Feb. 2016, pp. 58–59.
- [20] J. Lee *et al.*, "56Gb/s PAM4 and NRZ SerDes transceiver in 40nm CMOS," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2015, pp. 118–119.
- [21] L. Tang *et al.*, "A 32Gb/s 133mW PAM-4 transceiver with DFE based on adaptive clock phase and threshold voltage in 65nm CMOS," in *IEEE Int. Solid-State Circuit Conf. Dig. Tech. Papers*, Feb. 2018, pp. 114–115.
- [22] T. Toifl *et al.*, "A 22-Gb/s PAM4 receiver in 90-nm CMOS SOI technology," *IEEE J. Solid-State Circuits*, vol. 41, no. 4, pp. 954–965, Apr. 2006.
- [23] G. Zhu *et al.*, "A low-power PAM4 receiver using 1/4-rate sampling decoder with adaptive variable-gain rectification," in *Proc. IEEE Asian Solid-State Circuits Conf.* Nov. 2017, pp. 81–84.
- [24] A. Manian and B. Razavi, "A 40-Gb/s 9.2-mW CMOS equalizer," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2016, pp. 226–227.
- [25] P. Nuzzo et al., "Noise analysis of regenerative comparators for reconfigurable ADC architectures," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 55, no. 6, pp. 1441–1454, Jul. 2008.
- [26] S. Sidiropoulos and M. Horowitz, "A 700-Mb/s/pin CMOS signaling interface using current integrating receivers," *IEEE J. Solid-State Circuits*, vol. 32, no. 5, pp. 681–690, May 1997.

- [27] M. Raj, S. Saeedi, and A. Emami, "A 4-to-11 GHz injection-locked 1/4-rate clocking for an adaptive 153fJ/b optical receiver in 28nm FDSOI CMOS," in *IEEE Int. Solid-State Circuit Conf. Dig. Tech. Papers*, Feb. 2015, pp. 404–405.
- [28] M. M. Izad and C. H. Heng, "A 17pj/bit 915MHz 8PSK/O-QPSK transmitter for high data rate biomedical applications," in *Proc. IEEE Custom Integrated Circuits Conf.*, Sep. 2012, pp. 1–4.
- [29] E. Monaco *et al.*, "A 2-11 GHz 7-bit high-linearity phase rotator based on wideband injection-locking multiphase generation for high-speed serial links in 28-nm CMOS FDSOI," *IEEE J. Solid-State Circuits*, vol. 52, no. 7, pp. 1739–1752, Jul. 2017.
- [30] O. Elhadidy et al., "A 32 Gb/s 0.55 mW/Gbps PAM4 1-FIR 2-IIR tap DFE receiver in 65-nm CMOS," in *IEEE Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2015, pp. 224–224.
- [31] Top500 Lists Release, 2017, Available: http://www.top500.org/lists
- [32] T. O. Dickson *et al.*, "A 1.4 pJ/bit, power-scalable 16 × 12 Gb/s source-synchronous I/O with DFE receiver in 32 nm SOI CMOS technology," *IEEE J. Solid-State Circuits*, vol. 50, no. 8, pp. 1917-1931, Aug. 2015.
- [33] H. Li *et al.*, "A 0.8V, 560fJ/bit, 14Gb/s injection-locked receiver with input duty-cycle distortion tolerable edge-rotating 5/4X sub-rate CDR in 65nm CMOS," in *IEEE Symp. VLSI Circuits Dig.*, Jun. 2014, pp. 1-2.
- [34] M. Mansuri et al., "A scalable 0.128-to-1Tb/s 0.8-to-2.6pJ/b 64-lane parallel I/O in 32nm CMOS," in *IEEE Int. Solid-State Circuits Conf. Dig.*, Feb. 2013, pp. 402–403.
- [35] S. Kasturia and J. Winters, "Techniques for high-speed implementation of nonlinear cancellation," *IEEE J. Sel. Areas Commun.*, vol. 9, no. 5, pp. 711-717, Jun. 1991.
- [36] A. Manian and B. Razavi, "A 40Gb/s 14 mW wireline receiver," in *IEEE Int. Solid-State Circuits Conf. Dig.*, Feb. 2016, pp. 412–413.
- [37] G. Zhu, Y. Wang, and C. P. Yue, "A 26-Gb/s 0.31-pJ/bit receiver with linear sampling phase detector for data and edge equalization," *IEEE Solid-State Circuits Letters*, vol. 1, pp. 46–49, Feb. 2018.
- [38] A. Agrawal *et al.*, "A 19-Gb/s serial link receiver with both 4-tap FFE and 5-tap DFE functions in 45-nm SOI CMOS," *IEEE J. Solid-State Circuits*, vol. 47, no. 12, pp. 3220-3231, Dec. 2012.
- [39] K.-L. J. Wong, E.-H. Chen, and C.-K. K. Yang, "Edge and data adaptive equalization of serial-link transceivers," *IEEE J. Solid-State Circuits*, vol. 43, no. 9, pp. 2157-2169, Aug. 2008.
- [40] M. Sanduleanu, S. Reynolds, and J.-O. Plouchart, "A 4 GS/s, 8.45 ENOB and 5.7 fJ/conversion, digital assisted, sampling system in 45nm CMOS SOI," in *Proc. IEEE Custom Integr. Circuits Conf.*, Sep. 2011, pp. 1–4.
- [41] J. W. Jung and B. Razavi, "A 25 Gb/s 5.8 mW CMOS equalizer," *IEEE J. Solid-State Circuits*, vol. 50, no. 2, pp. 515-526, Feb. 2015.
- [42] E. Monaco *et al.*, "A 2-11 GHz 7-bit high-linearity phase rotator based on wideband injection-locking multiphase generation for high-speed serial links in 28-nm CMOS FDSOI," *IEEE J. Solid-State Circuits*, vol. 52, no. 7, pp. 1739-1752, Jul. 2017.
- [43] http://ieeexplore.ieee.org/ielx5/4497158/4523032/4523252/V25 01nn.pdf?arnumber=4523252
- [44] B. Razavi, Design of Integrated Circuits for Optical Communications. New York: McGraw-Hill, 2002.
- [45] http://www.oiforum.com/wp-content/uploads/OIF-CEI-04.0.pdf
- [46] G. Zhu, D. Luo, J. Zhuang, C. Zhi, and P. Yue, "A fully adaptive continuous-time linear equalizer for PAM4 signaling based on a statistical algorithm," *IEEE Int. Conf. on Electron Devices and Solid-State Circuits* (EDSSC), Oct. 2017, pp. 1-2.
- [47] Jri Lee, "A 20-Gb/s adaptive equalizer in 0.13-μm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 41, pp. 2058–2066, Sep. 2006.
- [48] H. Uchiki, et al., "A 6Gb/s Rx equalizer adapted using direct measurement of the equalizer output amplitude," *IEEE Int. Solid-State Circuit Conf. Dig. Tech. Papers*, Feb. 2008, pp. 104–105.
- [49] Y. M. Ying, et al., "A 20Gb/s digitally adaptive equalizer/DFE with blindly sampling," *IEEE Int. Solid-State Circuit Conf. Dig. Tech. Papers*, Feb. 2011, pp. 444–445.
- [50] K. Yu *et al.*, "A 25 Gb/s hybrid-integrated silicon photonic source-synchronous receiver with microring wavelength stabilization," *IEEE J. Solid-State Circuits*, vol. 51, pp. 2129–2140, Sep. 2016.
- [51] Kimura, H., Aziz, P.M., Jing, T., et al., "A 28 Gb/s 560 mW multi-standard SerDes with single-stage analog front-end and 14-tap decision feedback equalizer in 28 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 49, pp. 3091–3103, Dec. 2014.

- [52] Widrow, B., Mccool, J.M., Larimore, M.G., *et al.*, "Stationary and nonstationary learning characteristics of the LMS adaptive filter," *Proc. IEEE*, vol. 64, no. 8, pp. 1151–1162, 1976.
- [53] A. Pottbacker, U. Langmann, and H.-U. Schreiber, "A Si bipolar phase and frequency detector IC for clock extraction up to 8 Gb/s," *IEEE J. Solid-State Circuits*, vol. 27, pp. 1747–1751, Dec. 1992.
- [54] R. Inti *et al.*, "A 0.5-to-2.5 Gb/s reference-less half-rate digital CDR with unlimited frequency acquisition range and improved input duty-cycle error tolerance," *IEEE J. Solid-State Circuits*, vol. 46, pp. 3150–3162, Dec. 2014.
- [55] J. Lee and K. Wu, "A 20-Gb/s full-rate linear clock and data recovery circuit with automatic frequency acquisition," *IEEE J. Solid-State Circuits*, vol. 44, pp. 3590–3602, Dec. 2009.