REAL-TIME IMPLEMENTATION OF HIGH-SPEED DIGITAL COHERENT TRANSCEIVERS
This is a continuation from the previous tutorial - Carrier recovery in coherent optical communication systems
1. ALGORITHM CONSTRAINTS
Coherent optical transmission systems at 100 Gb/s use baud rates of 28 Gbaud or beyond, with the receiver \(\text{ADCs}\) usually applying two-times oversampling, that is, 56 Gs/s and beyond, to convert the received signals from the analog to the digital domain. Standard cell logic in modern complementary metal–oxide–semiconductor \(\text{(CMOS)}\) technology cannot operate at this high sampling rate of the \(\text{ADCs}\).
The preferred clock speed for standard logic is usually in the range of a few hundred megahertz, which means that the digital samples from the \(\text{ADCs}\) cannot be processed one by one in a serial implementation, but need to be demultiplexed and processed in parallel.
But even then, in order to achieve a clocking speed of several hundred megahertz, the logic is only capable of processing a small fraction of the overall receiver algorithm before intermediate results need to be pipelined or buffered in flip-flops or memory.
This processing architecture, which is depicted in Figure 1., needs to be considered already during the design phase of the processing algorithms, as there might otherwise be no possibility to bring a certain algorithm into a real-time processing engine. Among other requirements, the algorithms need to be implementable within the power constraints of the \(\text{DSP}\) engine, they must be parallelizable, and significant feedback latencies need to be assumed.
The following sections analyze these constraints in more detail.
Power Constraint and Hardware Optimization
The way algorithms are mathematically described in scientific publications or implemented in software simulation tools can differ significantly from the way these algorithms are most efficiently implemented in hardware.
In this section, the Viterbi & Viterbi \(\text{(V&V)}\) carrier recovery algorithm serves as an example. Figure 2. shows a block diagram of a textbook description of the algorithm. The incoming signal is raised to the fourth power and filtered in a sliding window averaging filter.
The phase angle from the filter output is negated and divided by 4, which provides the actual carrier phase estimate. This estimate is converted to a complex phasor and used to derotate the received symbol.
This description makes it easy to understand the functionality of the algorithm, but a straightforward hardware implementation would be highly inefficient.
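In software, the textbook description maps almost line by line onto code. The following is a minimal NumPy sketch of this form, assuming QPSK symbols on the axes and a rectangular averaging window; the function name and defaults are illustrative:

```python
import numpy as np

def vv_carrier_recovery(x, window=16):
    """Textbook Viterbi & Viterbi carrier recovery for QPSK.

    x: complex received symbols; window: sliding-average filter length.
    Returns the derotated symbols (phase unwrapping and cycle-slip
    handling are omitted in this sketch).
    """
    x4 = x ** 4                                   # strip QPSK modulation
    avg = np.convolve(x4, np.ones(window) / window, mode="same")
    phi = -np.angle(avg) / 4                      # negate and divide by 4
    return x * np.exp(1j * phi)                   # derotate
```

Note that the fourth power and the final derotation are full complex multiplications, exactly the operations the hardware-optimized version eliminates.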
The fourth power calculation requires six multiplications, and the derotation of the received symbol requires three multiplications.
As multiplications are the most costly functions in hardware, avoiding them generally reduces the hardware effort. Figure 3. shows a block diagram of the same algorithm that potentially reduces the hardware effort significantly.
At first glance, the processing in Figure 3. appears more complicated than the one in Figure 2, as it contains several conversion blocks from Cartesian to polar coordinates and vice versa. However, these conversions can be implemented very efficiently using the \(\text{CORDIC}\) (coordinate rotation digital computer) algorithm, which only uses simple shift and add operations.
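To illustrate, here is a hedged sketch of a vectoring-mode CORDIC: each iteration uses only a sign test, a shift by \(2^{-i}\), and additions, with the \(\arctan(2^{-i})\) values hard-wired as constants. Floating-point is used for readability; a hardware version would use fixed-point shift-and-add.

```python
import math

# arctan(2^-i) lookup table; in hardware these are hard-wired constants
ATAN = [math.atan(2.0 ** -i) for i in range(16)]
GAIN = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(16))

def cordic_to_polar(x, y, iterations=16):
    """Vectoring-mode CORDIC: Cartesian (x, y) -> (magnitude, angle)."""
    angle = 0.0
    # pre-rotate into the right half-plane so the iterations converge
    if x < 0:
        x, y, angle = (y, -x, math.pi / 2) if y >= 0 else (-y, x, -math.pi / 2)
    for i in range(iterations):
        d = 1.0 if y >= 0 else -1.0            # drive y toward zero
        x, y = x + d * y * 2.0 ** -i, y - d * x * 2.0 ** -i
        angle += d * ATAN[i]                   # accumulate rotated angle
    return x / GAIN, angle                     # magnitude, phase
```

The constant processing gain of the iterations is divided out at the end; in fixed-point hardware it is typically folded into a downstream scaling stage.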
In addition, the first conversion most likely already happens before the carrier recovery, as the constant modulus algorithm \(\text{(CMA)}\) update of the channel equalizer requires knowledge of the signal amplitude, and the intermediate frequency compensation is more efficiently implemented in polar coordinates as well.
Also note that the calculation of \(|X_k|\) is plotted with dotted lines. This calculation can be omitted, which has been shown to actually improve the performance of the V&V algorithm. Hence, the number of required multipliers can be reduced to zero.
Optimizing all parts of the \(\text{DSP}\) chain is critical, because coherent receivers have stringent constraints on the power they can burn.
This is most apparent in the development of coherent pluggable modules, which have maximum power consumptions specified as part of the form factor (e.g., 24 W for the C form-factor pluggable \(\text{(CFP)}\) module). But even coherent receiver line cards have this power restriction, which is a direct consequence of the cooling abilities of the line card. So the more the algorithms are optimized in terms of power consumption, the more features can be packed into the \(\text{DSP}\).
Parallel Processing Constraint
Algorithms that support multi-Gb/s operation must allow parallel processing, as was shown in Figure 1. As explained in the introduction, the \(\text{DSP}\) logic cannot operate directly at the sampling clock frequency of the \(\text{ADC}\), but requires demultiplexing to process the data in parallel modules.
This approach is already used inside the \(\text{ADCs}\) themselves, which use an architecture of a high-speed sample-and-hold buffer fanning out into several (e.g., 64) interleaved successive approximation register \(\text{(SAR)}\) \(\text{ADCs}\). Hence, each \(\text{SAR}\)-\(\text{ADC}\) only needs a conversion rate of \(<1\,\text{GHz}\).
For the \(\text{DSP}\) processing, this low clock frequency allows automated generation of the layout, which is indispensable due to the complexity of the system.
Algorithms for real-time applications must, therefore, allow parallel processing with a large number of demultiplexed channels.
This translates into the requirement that (intermediate) results within one module cannot depend on results calculated at the same time in other parallel modules.
A good example to explain the feasibility of parallel processing is the comparison of two filter structures: finite impulse response \(\text{(FIR)}\) and infinite impulse response \(\text{(IIR)}\) filters. Figure 4. depicts their structures in both serial and parallel systems.
It can be seen that it is easily possible to parallelize an \(\text{FIR}\) filter. Although the output signal depends on information provided by several parallel modules, it does not depend on the results of the same calculations performed in these modules. In contrast, it is a big challenge to realize the parallel structure shown for the \(\text{IIR}\) filter, because the result depends on results calculated at the same time in other parallel modules. Similar to a carry bit in a digital adder, information has to traverse the entire parallel processing bus until the output becomes stable.
A very low clock frequency or a low number of parallel channels would be needed to allow all calculations to be executed within one clock cycle. Neither of these requirements is fulfilled in coherent digital receivers for optical transmission systems.
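The feasibility argument for the FIR case can be checked in a few lines: a block-wise (demultiplexed) evaluation needs only input samples carried over from the neighbouring block, never its outputs, and therefore reproduces the serial result exactly. This is an illustrative software model, not a hardware description:

```python
import numpy as np

def parallel_fir(x, taps, P=8):
    """Evaluate an FIR filter block-by-block on a bus of width P.

    Each block needs only the last len(taps)-1 input samples of the
    previous block, never its outputs, so in hardware all P lanes can
    compute concurrently.
    """
    K = len(taps)
    y = np.zeros(len(x), dtype=complex)
    history = np.zeros(K - 1, dtype=complex)     # neighbour inputs only
    for start in range(0, len(x), P):
        block = x[start:start + P]
        ext = np.concatenate([history, block])
        y[start:start + len(block)] = np.convolve(ext, taps, mode="valid")
        history = ext[len(block):]               # tail for the next block
    return y
```

Here `np.convolve(..., mode="valid")` stands in for the per-lane multiply-and-add tree; the loop over blocks models what the parallel lanes compute simultaneously.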
The only possible way to use algorithms that do not lend themselves to parallelization is to use parallel-to-serial conversion, as shown in Figure 5. In this technique, the incoming data are written into a large memory column by column, and then read out for processing row by row.
This allows frames of data to be processed in a truly serial fashion. However, this approach comes with large drawbacks, because a large and complicated memory structure is required, and discontinuities have to be resolved at the start of each serial-processing block. Solving this problem requires either having an overlap between the processing blocks, which further increases complexity, or using a frame-based data structure with training symbols at the beginning of each processing block.
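The write/read reordering behind this technique can be modeled in a few lines. This is an illustrative Python sketch under simplifying assumptions (sample count a multiple of the frame length, frame length a multiple of the bus width); it does not capture the ping-pong buffering a real implementation would need:

```python
import numpy as np

def parallel_to_serial_frames(samples, bus_width, frame_len):
    """Corner-turn memory model: the demultiplexed bus writes bus_width
    consecutive samples per clock cycle; a serial engine then reads
    complete rows, each holding one contiguous frame."""
    n_frames = len(samples) // frame_len
    mem = np.empty((n_frames, frame_len), dtype=samples.dtype)
    words = samples.reshape(-1, bus_width)       # one bus word per cycle
    words_per_frame = frame_len // bus_width
    for cycle, word in enumerate(words):         # parallel-side writes
        row, slot = divmod(cycle, words_per_frame)
        mem[row, slot * bus_width:(slot + 1) * bus_width] = word
    return [mem[r] for r in range(n_frames)]     # serial-side reads
```

Each returned row is a contiguous frame that a slow serial engine can process one sample at a time; the discontinuities mentioned above appear at the row boundaries.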
Due to these drawbacks, this approach is generally avoided, and algorithms are chosen that map efficiently into a parallel processing structure.
Feedback Latency Constraint
In simulations or offline data processing, feedback loops with a latency of only one symbol are often used. While this implementation achieves the highest performance, it does not represent a system that can be realized in an \(\text{ASIC}\) or \(\text{FPGA}\).
Hence, these results can only be considered as an upper bound for the performance that can be achieved.
Due to the massive parallelization and pipelining required in coherent real-time \(\text{DSP}\) processing, loop latencies can easily reach hundreds or thousands of symbols.
Unfortunately, very few publications that analyze \(\text{DSP}\) algorithms actually study the effects of these large loop latencies. One of the reasons is that it is very difficult to estimate these latencies without detailed knowledge of the actual technology used for implementing the real-time processing.
Table 1. provides a rough overview of the clock cycles required to implement the most common digital building blocks. The assumed clock frequencies are \(500\,\text{MHz}\) for \(\text{ASIC}\) and \(200\,\text{MHz}\) for \(\text{FPGA}\) implementations.
Of course in reality, these numbers depend on a wide range of parameters (e.g., \(\text{CMOS}\) process node, resolution, digital implementation). But this table can be used by researchers and engineers who are not directly involved in hardware development to get a feel for the expected latency of their algorithms.
It can be seen that the processing latency inside an \(\text{FPGA}\) is roughly three times larger than inside an \(\text{ASIC}\), even though \(\text{FPGAs}\) are clocked at lower frequencies.
This is due to the overhead in \(\text{FPGAs}\) caused by mapping the algorithms into a general purpose processing structure. This difference has to be kept in mind when \(\text{FPGAs}\) are used for the prototyping of coherent receivers.
Let us apply the latency numbers to an actual problem system designers face when selecting algorithms for a coherent receiver development. In several publications, it has been shown that the least mean square \(\text{(LMS)}\) equalizer update algorithm has faster convergence and better performance than the \(\text{CMA}\) algorithm.
However, these studies in general do not consider the different loop latencies of these two algorithms. So how large, then, is the actual difference in loop latency between a \(\text{CMA}\)-based equalizer coefficient update and an \(\text{LMS}\)-based update? To answer this, we require the latencies of the following blocks: the butterfly filter, the frequency offset compensation, the carrier recovery, the decision circuit or decoder, and the equalizer coefficient update block (Figure 6.). For simplicity, 8-bit resolutions are assumed in all processing steps.

TABLE 1. Real-time signal processing latencies for basic DSP functions
The butterfly filter consists of complex multipliers followed by a summation of the multiplier outputs.
Let us assume a 16-tap filter. The latency through the equalizer, therefore, is
\[L_\text{EQ}=2+\log_2(16)=6\]
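As a functional (not latency-accurate) reference, the 2×2 butterfly structure itself can be sketched as follows; the function name, tap-vector arguments, and the zero-padded startup are illustrative assumptions:

```python
import numpy as np

def butterfly_filter(x_pol, y_pol, hxx, hxy, hyx, hyy):
    """2x2 butterfly equalizer: each output polarization is the sum of
    two FIR filters applied to the two received polarizations."""
    def fir(sig, taps):
        # causal FIR with zero initial state
        return np.convolve(sig, taps)[:len(sig)]
    x_out = fir(x_pol, hxx) + fir(y_pol, hxy)
    y_out = fir(x_pol, hyx) + fir(y_pol, hyy)
    return x_out, y_out
```

In hardware, the complex multiplications account for the first two pipeline stages of the latency figure above, and summing the 16 tap products takes \(\log_2(16)=4\) adder-tree stages.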
For the frequency offset compensation, the signal is converted from Cartesian to polar coordinates.
This simplifies the subsequent processing stages. If we assume a feedback loop, then compensating the \(\text{IF}\) is a simple subtraction, that is, the latency becomes
\[L_\text{IF}=\left\lceil 8/3\right\rceil+1=4\]
The carrier recovery uses the feedforward \(\text{V&V}\) algorithm depicted in Figure 3. Calculating the fourth power of the input symbols in polar coordinates is a simple left shift operation, which comes for free. The next part is the filter, which consists of two \(\text{CORDIC}\) blocks and one summation block.
Finally, the estimated phase is subtracted from the received symbol, which is then converted back into Cartesian coordinates before it is fed into the decoder. If we assume a 50-tap filter length in the carrier recovery, the latency becomes
\[L_\text{CR}=\left\lceil 8/3\right\rceil+\left\lceil \log_2(50)/4\right\rceil+\left\lceil 8/3\right\rceil+1+\left\lceil 8/3\right\rceil=12\]
The decoder basically consists of comparators, so its latency can be assumed to be \(L_D=1\).
The two equalizer updates have almost identical processing. The only difference is the calculation of the error signal.
While the \(\text{LMS}\) algorithm requires the distance vector between the decoded and the received sample (subtraction), the \(\text{CMA}\) algorithm weighs the input vector with the amplitude difference of the equalizer output from unity \(\text{(CORDIC}+\text{subtraction}+\text{multiplication})\).
Then, both algorithms multiply the error signal with the equalizer output, apply a control gain (can be implemented as right shift, which comes for free), and add it to the previous channel estimate.
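The two updates can be contrasted in a short sketch. It follows the description above (amplitude difference from unity for the CMA error) and uses the standard stochastic-gradient form in which the error is multiplied with the conjugated filter input; the names and the gain value are illustrative:

```python
import numpy as np

def lms_update(w, x_in, y_out, decision, mu=1e-3):
    """LMS tap update: the error is the distance between the decided
    symbol and the equalizer output (a single subtraction)."""
    e = decision - y_out
    return w + mu * e * np.conj(x_in)

def cma_update(w, x_in, y_out, mu=1e-3):
    """CMA tap update: the output is weighted by its amplitude deviation
    from unity (in hardware: CORDIC for |y|, subtract, multiply).
    The squared-modulus variant e = y*(1 - |y|**2) is also common."""
    e = y_out * (1.0 - np.abs(y_out))
    return w + mu * e * np.conj(x_in)
```

The extra CORDIC, subtraction, and multiplication in the CMA error path are exactly the terms that appear in its latency estimate below.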
Hence, their latencies are
\[L_\text{CMA}=\left(\left\lceil 8/3\right\rceil+1+2\right)+2+1=9\]
\[L_\text{LMS}=1+2+1=4\]
So in total, the latencies for the two different equalizer update implementations become
\[L_\text{CMA,total}=L_\text{EQ}+L_\text{CMA}=15\]
\[L_\text{LMS,total}=L_\text{EQ}+L_\text{IF}+L_\text{CR}+L_D+L_\text{LMS}=27\]
The \(\text{LMS}\) equalizer update has an \(80\%\) higher loop latency than the \(\text{CMA}\) update. If we assume a demultiplexing factor of 64, the loop latency for the \(\text{CMA}\) update is 960 symbols, while for the \(\text{LMS}\) update it is 1728 symbols.
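These cycle counts can be tallied in a few lines. The sketch below encodes the assumptions used in the text: an 8-bit CORDIC resolving roughly 3 bits per clock cycle, with fractional cycle counts rounded up:

```python
import math

CORDIC = math.ceil(8 / 3)      # 8-bit CORDIC, ~3 bits per clock cycle

L_EQ = 2 + int(math.log2(16))                      # complex mult + adder tree
L_IF = CORDIC + 1                                  # Cartesian->polar + subtract
L_CR = CORDIC + math.ceil(math.log2(50) / 4) + CORDIC + 1 + CORDIC
L_D = 1                                            # comparator-based decoder
L_CMA = (CORDIC + 1 + 2) + 2 + 1                   # error calculation + update
L_LMS = 1 + 2 + 1

total_cma = L_EQ + L_CMA                           # clock cycles
total_lms = L_EQ + L_IF + L_CR + L_D + L_LMS

DEMUX = 64                                         # parallel bus width
symbols_cma = total_cma * DEMUX                    # loop latency in symbols
symbols_lms = total_lms * DEMUX
```

Changing a single assumption, such as the CORDIC resolution or the demultiplexing factor, immediately shows its effect on the loop latency in symbols.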
And replacing some of the processing with higher performance algorithms (e.g., feedforward frequency estimation, or using blind phase search \(\text{(BPS)}\) instead of \(\text{V&V}\)) can easily push the latency for the \(\text{LMS}\) update to \(>2000\) samples.
This latency difference has a large impact on the convergence and tracking capabilities of the algorithms, and the choice of which equalizer to use has to be reviewed under this constraint.
This is a crucial example of why feedback latency needs to be considered as early in the development of a next generation coherent receiver as possible.
2. HARDWARE IMPLEMENTATION OF DIGITAL COHERENT RECEIVERS
The different requirements for research or product development also cause different strategies in the real-time architectures for the digital transmitter and receiver. In research, the main objective is to have a highly flexible setup, which allows implementing a variety of different processing algorithms for different modulation formats.
In addition, the design cycle needs to be fast, as the research has to be conducted before the start of the product development.
Therefore, research prototypes are implemented using field-programmable gate arrays \(\text{(FPGAs)}\). These devices are made up of up to a million so-called slices (Figure 7.), which contain look-up tables \(\text{(LUTs)}\), multiplexers, dedicated adder logic, and registers.
The slices are connected through wide busses and switching nodes.
By changing the content of the \(\text{LUTs}\) and the wiring of the interconnects between the slices, virtually any logic function can be implemented inside the \(\text{FPGA}\) fabric.
Modern \(\text{FPGAs}\) offer a huge amount of processing capacity with hundreds of thousands of slices, thousands of dedicated multiplier blocks, clock managers, and a large number of gigabit transceivers, supporting aggregated \(\text{IO}\) bandwidths of \(>1\text{Tb/s}\).

But as impressive as these numbers may be, comparing them with the requirements of a \(\text{100G}\) digital coherent receiver reveals that these devices still imply large restrictions on possible prototype implementations.
One major limitation is that no high-speed \(\text{ADCs}\) are available integrated into \(\text{FPGAs}\). Hence, external high-speed \(\text{ADCs}\) need to be used, which require enormous interface bandwidths.
A coherent receiver requires four \(\text{ADCs}\) running at \(\thicksim 56\; \text{Gs/s}\).
If the resolution of the \(\text{ADCs}\) is 8 bit, then the required interface bandwidth is \(\thicksim1.8\;\text{Tb/s}\).
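The arithmetic behind this number is straightforward:

```python
n_adcs = 4             # one ADC per polarization and quadrature
sample_rate_gsps = 56  # Gs/s per ADC (2x oversampling at 28 Gbaud)
resolution_bits = 8
bandwidth_gbps = n_adcs * sample_rate_gsps * resolution_bits
print(bandwidth_gbps / 1e3, "Tb/s")   # 1.792 Tb/s
```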
Hence, \(\text{FPGA}\) prototypes often use \(\text{ADCs}\) with a lower sampling rate and/or lower resolution compared with a product receiver.
Due to the internal architecture of \(\text{FPGAs}\) with slices and large reconfigurable interconnect networks, often more than 50% of the processing delay is caused by routing the signals inside the device.
This increases the number of pipeline stages required in the design and reduces the optimal clock speed for the \(\text{FPGA}\) (see Table 1.).
The latter causes the design to grow in size, as a higher degree of parallelism is required to achieve the desired throughput.
In addition, due to the massive parallel processing required inside the \(\text{FPGA}\) (bus widths of 256 or larger are possible), only a few thousand slices and a few multipliers are available per parallel lane.
Hence, only a fraction of the overall processing can be implemented in a single \(\text{FPGA}\), and the algorithms need to be partitioned into multiple \(\text{FPGAs}\).
This further increases the interconnect restrictions mentioned earlier.
Figure 8. shows an example of an \(\text{FPGA}\) board used for real-time prototyping of a coherent receiver.
It hosts four 6-bit \(32\)-\(\text{Gs/s}\) \(\text{ADCs}\), an \(\text{FPGA}\) with an \(\text{IO}\) bandwidth of \(96\times13.1\)-\(\text{Gb/s}\), and \(\text{VCSEL}\)-based parallel optical interconnects for daisy-chaining of multiple boards.
Four of these boards were needed to build a real-time \(\text{100G}\) coherent \(\text{OFDM}\) receiver, which was successfully employed in a first field trial for real-time \(\text{CO}\)-\(\text{OFDM}\).
Inside commercial transceiver modules with much tighter space, power, and cost restrictions than research prototypes, \(\text{ASIC}\) implementations are the only possible solution.
In an \(\text{ASIC}\), all functions are implemented exactly as they are needed, which makes the design significantly faster and more power efficient than an \(\text{FPGA}\) implementation.
In addition, it is possible to monolithically integrate high-speed \(\text{ADCs}\) and \(\text{DACs}\) with huge amounts of \(\text{DSP}\) logic into a single chip. Figure 9. shows the layout of the very first commercial coherent \(\text{ASIC}\) taking advantage of this integration.
The two parts of the chip are very distinct: the \(\text{ADCs}\) on the left, and the very dense \(\text{DSP}\) section on the right.
However, this tight integration leads to new design challenges.
One major challenge is the proper isolation between the \(\text{ADC}\) cores and the standard \(\text{DSP}\) logic.
In order to achieve a high effective number of bits for the \(\text{ADCs}\), the added noise and distortions that the \(\text{ADCs}\) inject into the analog input signal have to be minimal.
This is not an easy task, because a massive \(\text{DSP}\) engine sits right next to the converters, producing a lot of switching noise from toggling \(\text{CMOS}\) logic and causing current spikes on the order of \(100\;\text{A}\).
This isolation requires very careful design in order to ensure that the \(\text{ADCs}\) do not degrade in performance when integrated with the \(\text{DSP}\).
Figure 10. shows an example of a commercial coherent transceiver line card. In the future, different \(\text{ASIC}\) designs will offer interesting differentiation options for various systems.
One chip may include only the bare essential algorithms required for a coherent receiver, to achieve low power consumption and enable a small form-factor pluggable module, while a different chip can be designed with the most sophisticated algorithms, such as digital backpropagation or advanced soft-decision \(\text{FEC}\), for high-performance transoceanic submarine systems.
With the industry moving toward higher-order modulation formats, super-channel transmission systems, and maybe even spatial division multiplexing in multimode fibers, and at the same time coherent transmission also becoming attractive for metro and shorter reach links, the diversification of coherent \(\text{DSP}\) chips optimized for different transmission scenarios has only just begun.

