# Carrier Recovery in Coherent Optical Communication Systems

This is a continuation from the previous tutorial - Timing synchronization in coherent optical transmission systems.

## 1. INTRODUCTION

Carrier recovery is another important building block for modern coherent optical communication systems. Since both amplitude and phase of an optical carrier are used for carrying the data, phase and frequency synchronization between the signal source and the local oscillator (LO) is required in order to correctly demodulate a coherently modulated optical signal. Early coherent optical communication experiments (1980s to early 1990s) employed optical phase-locked loops (PLLs) for phase synchronization; however, this type of optical method was too complex for practical implementation.

With recent advances in high-speed electrical processing, digital signal processing (DSP)-based carrier recovery methods, long used in lower-speed wireless systems, have been introduced into high-speed coherent optical systems. In these methods, phase and frequency deviations are estimated and removed in the digital domain.

Digital carrier recovery in a high-speed coherent optical system is more challenging than in a lower-speed wireless system, especially when higher-order modulation formats are employed.

Unlike in a wireless system, where the frequency and phase offsets change at similar and relatively slow rates, the frequency and phase offsets in a high-speed optical system behave very differently: the frequency changes relatively slowly (on a millisecond timescale for a high-quality laser) but its range can be large (up to 5 GHz for a free-running signal source and LO), while the phase noise varies much faster than in a wireless system (on a nanosecond timescale).

Furthermore, due to limited complementary metal–oxide–semiconductor (CMOS) speed, a high-speed optical system usually demands a high degree of parallelism, which makes feedback-based digital carrier phase recovery algorithms (widely used in wireless communication systems) much less effective for high-speed optical systems. The same CMOS capability constraint also means that more hardware-efficient algorithms are required for high-speed optical systems.

The basic concept of digital carrier recovery for a high-speed coherent optical transmission system is illustrated in Figure 1, in which square 64QAM is used as an example.

Due to the differing characteristics of carrier frequency and phase offset, carrier frequency recovery should be performed before phase recovery. Furthermore, since the frequency of a free-running laser can drift by up to \(\pm2.5\) GHz at end of life, up to 5 GHz of extra electrical bandwidth is needed if the LO frequency is not controlled, in order to preserve all the signal components needed for the subsequent processing.

To avoid such an increased receiver bandwidth requirement, a feedback-based coarse auto-frequency control (AFC) circuit is usually introduced before the nominal frequency recovery. Using a coarse AFC to control the LO could reduce the frequency offset from several gigahertz to the tens of megahertz range. Coarse AFC should be performed before timing recovery because a large frequency offset may cause a problem for timing synchronization.

The nominal (fine) frequency and phase recovery are realized after AFC, timing recovery, and signal equalization. The fine frequency recovery unit estimates the residual frequency offset \((\Delta\omega)\) down to the megahertz range (the required frequency-estimate accuracy depends on the modulation format used), and then removes this offset from the signal.

The phase recovery unit estimates the combined phase noise \((\Delta\theta)\) from the LO and the signal source and then removes it from the signal. The phase-recovered signal is then sent to the decision and decoding unit.

## 2. OPTIMAL CARRIER RECOVERY

### MAP-Based Frequency and Phase Estimator

The optical signal electric field envelope arriving at the receiver is the sum of the modulated transmitter laser and additive noise. Assuming that the transmit pulse shape and receiver impulse response are chosen so there is no intersymbol interference (ISI) and the signal polarization is aligned with the LO, the received electrical signal after coherent detection can be written as

\[\tag{1}y_k=x_ke^{-j\Delta\omega_kt_k-j\theta_k}+n_k\]

where \(y_k\) and \(x_k\) denote the received and transmitted signal at the \(k\)th time instant, while \(\Delta\omega_k\) and \(\theta_k\) represent the frequency and phase offset, respectively.

\(n_k\) is the additive white Gaussian noise (AWGN), a circular Gaussian noise with zero mean and variance of \(N_0/2\) per dimension. The conditional probability density function (pdf) of the received signal \(\text y_k\) is given by

\[\tag{2}P(y_k|x_k)=\frac{1}{\pi N_0}\text{exp}\left(-\frac{|y_k-x_ke^{-j\Delta\omega _kt_k-j\theta_k}|^2}{N_0}\right)\]

The best possible estimate of the carrier frequency and phase that can be made, given the observed received values \(\text y_k\), is the maximum a posteriori (MAP) estimate.

Since \(x_k\), \(\Delta\omega_k\), and \(\theta_k\) are statistically independent, the MAP estimate is the sequence of values \(\hat{x}_k\), \(\Delta\hat{\omega}_k\), and \(\hat\theta_k\) that maximize the probability function

\[\tag{3}(\hat x_k,\Delta\hat\omega_k,\hat\theta_k)=\underset{x_k,\Delta\omega_k,\theta_k}{\text{arg max}}\left\{\prod_k\frac{1}{\pi N_0}\text{exp}\left(-\frac{|y_k-x_ke^{-j\Delta\omega_kt_k-j\theta_k}|^2}{N_0}\right)\right\}P(x_k)P(\Delta\omega_k)P(\theta_k)\]

Since the pdf \(P(x_k)\) is known a priori and \(\theta_k\) is a Wiener process, taking the logarithm of Equation 3 yields

\[\tag{4}(\hat x_k,\Delta\hat\omega_k,\hat\theta_k)=\underset{x_k,\Delta\omega_k,\theta_k}{\text{arg max}}\left\{\sum_k\left[-\frac{|y_k-x_ke^{-j\Delta\omega_kt_k-j\theta_k}|^2}{N_0}-\frac{(\theta_k-\theta_{k-1})^2}{2\pi\tau\Delta\nu}\right]+c\,\ln(P(\Delta\omega_k))\right\}\]

where \(c\) is a constant, \(\tau\) denotes the time interval between two consecutive samples, and \(\Delta\nu\) denotes the laser linewidth.
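The signal model that appears inside Equations 2 to 4 can be simulated in a few lines, which is useful for testing any of the estimators discussed later. The following is a minimal sketch, not a reference implementation: the function name `simulate_received`, its parameters, the choice of unit-energy QPSK symbols, and the use of a constant frequency offset are all assumptions made for illustration. The Wiener phase noise is generated with independent Gaussian increments of variance \(2\pi\Delta\nu\tau\).

```python
import numpy as np

def simulate_received(num_symbols, freq_offset_hz, linewidth_hz, symbol_rate, n0, seed=0):
    """Return (x, y, theta): transmitted symbols, received samples, phase-noise trace.

    Hypothetical helper: QPSK data, constant frequency offset, Wiener phase noise,
    circular AWGN with variance n0/2 per dimension.
    """
    rng = np.random.default_rng(seed)
    tau = 1.0 / symbol_rate                       # time between consecutive samples
    t = np.arange(num_symbols) * tau              # sampling instants t_k
    # Unit-energy QPSK symbols x_k
    x = np.exp(1j * (np.pi / 2) * rng.integers(0, 4, num_symbols))
    # Wiener process: cumulative sum of independent Gaussian phase increments
    steps = rng.normal(0.0, np.sqrt(2 * np.pi * linewidth_hz * tau), num_symbols)
    theta = np.cumsum(steps)
    # Additive noise n_k
    n = np.sqrt(n0 / 2) * (rng.normal(size=num_symbols) + 1j * rng.normal(size=num_symbols))
    delta_omega = 2 * np.pi * freq_offset_hz      # angular frequency offset
    y = x * np.exp(-1j * (delta_omega * t + theta)) + n
    return x, y, theta
```

For example, `simulate_received(100000, 50e6, 100e3, 28e9, 0.1)` would model a 28 Gbaud signal with a 50 MHz offset and a 100 kHz beat linewidth; these numbers are illustrative only.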

Equation 4 is a joint estimation of the carrier frequency, phase, and data. In general, this joint maximization problem does not yield a closed-form solution, and computationally intense numerical methods have to be used to solve for \(\hat{x}_k\), \(\Delta\hat{\omega}_k\), and \(\hat{\theta}_k\). Such methods are nearly impossible to implement in any practical high-speed communication system.

However, the maximization in Equation 4 can be performed offline, and the results can be used as a baseline to evaluate the effectiveness of different carrier-recovery algorithms.

For actual real-time implementation, it is necessary to estimate the carrier frequency and phase independently from data recovery, in order to reduce the computational load.

Much progress has been made toward developing hardware-efficient suboptimal carrier recovery algorithms, which are described in detail in Sections 3 and 4.

### Cramér–Rao Lower Bound

Before diving into the detail of hardware-efficient carrier recovery algorithms, here we give a brief introduction to the Cramér–Rao lower bound \(\text{(CRLB)}\). \(\text{CRLB}\) provides us with a more fundamental way to evaluate the effectiveness of various carrier recovery algorithms.

In estimation theory and statistics, the \(\text{CRLB}\) expresses a lower bound on the variance of estimators of a deterministic parameter. The bound is also known as the Cramér–Rao inequality or the information inequality.

In its simplest form, the bound states that the variance of any unbiased estimator is at least as high as the inverse of the Fisher information. The Fisher information is a way of measuring the amount of information that an observable random variable \(X\) carries about an unknown parameter \(\theta\), upon which the probability of \(X\) depends.

The probability function for \(X\), which is also the likelihood function for \(\theta\), is a function \(f(X|\theta)\); it is the probability density of the random variable \(X\) conditional on the value of \(\theta\). The Fisher information is then defined as

\[\tag{5}I(\theta)=E\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right]\]

where \(E\) denotes expected value (over \(X\)) and log denotes natural logarithm.

An unbiased estimator that achieves this lower bound is said to be (fully) efficient. Such a solution achieves the lowest possible mean-squared error among all unbiased methods, and is therefore the minimum variance unbiased \(\text{(MVU)}\) estimator.

Suppose \(\theta\) is an unknown deterministic parameter that is to be estimated from measurements \(X\), which are distributed according to some pdf \(f(X|\theta)\).

The variance of any unbiased estimator \(\hat{\theta}\) of \(\theta\) is then bounded by the reciprocal of the Fisher information \(I(\theta)\), that is the \(\text{CRLB}\),

\[\tag{6}\text{Var}(\hat{\theta})\geq\frac{1}{I(\theta)}=\text{CRLB}(\hat{\theta})\]

The efficiency of an unbiased estimator \(\hat{\theta}\) measures how close this estimator’s variance comes to this lower bound, and estimator efficiency is defined as

\[\tag{7}e(\hat{\theta})=\frac{\text{CRLB}(\hat{\theta})}{\text{Var}(\hat{\theta})}\]

For a general \(\text{QAM}\)-modulated carrier, there is no closed-form expression for \(\text{CRLB}\). But closed-form expressions do exist for an unmodulated carrier.

The frequency \(\text{CRLB}\) for an unmodulated carrier (assuming known carrier phase or joint frequency and phase estimation with unknown carrier phase) is given by Rife and Boorstyn as

\[\tag{8}\text{CRLB}(\hat{\varpi})=\frac{6}{N(N^2-1)\frac{E_s}{N_0}}\]

The phase \(\text{CRLB}\) for an unmodulated carrier (assuming known frequency) is given by Rife and Boorstyn as

\[\tag{9}\text{CRLB}(\hat{\theta})=\frac{1}{2N\frac{E_s}{N_0}}\]

In Equations 8 and 9, \(N\) denotes the number of samples used for carrier estimation, while \(E_s/N_0\) denotes the classic signal-to-noise ratio (SNR), with \(E_s\) representing the average signal energy over a sample period.

Compared with the \(\text{MAP}\)-based method shown in Equation 4, Equations 8 and 9 provide a simpler benchmark for evaluating the effectiveness of various carrier recovery algorithms.
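As a quick illustration of how Equations 7 to 9 are used as a benchmark, the sketch below evaluates both bounds and the estimator efficiency. The function names are hypothetical, and the frequency bound is assumed to refer to a normalized angular frequency (radians per sample), since Equation 8 contains no sample period.

```python
def crlb_frequency(n_samples, es_n0):
    """Equation 8: lower bound on the variance of the frequency estimate.

    es_n0 is the linear (not dB) signal-to-noise ratio.
    """
    return 6.0 / (n_samples * (n_samples ** 2 - 1) * es_n0)

def crlb_phase(n_samples, es_n0):
    """Equation 9: lower bound on the variance of the phase estimate (rad^2)."""
    return 1.0 / (2.0 * n_samples * es_n0)

def efficiency(crlb, estimator_variance):
    """Equation 7: ratio of the bound to the measured estimator variance."""
    return crlb / estimator_variance

# Example: 64 samples at Es/N0 = 10 dB
es_n0 = 10 ** (10 / 10)
phase_bound = crlb_phase(64, es_n0)
freq_bound = crlb_frequency(64, es_n0)
```

Note that the phase bound falls as \(1/N\), while the frequency bound falls roughly as \(1/N^3\), so longer observation windows help frequency estimation far more than phase estimation.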

## 3. HARDWARE-EFFICIENT PHASE RECOVERY ALGORITHMS

As mentioned above, to reduce the computational load, it is necessary to estimate the carrier frequency and phase independently from the data recovery.

Before proceeding to discuss various frequency offset estimation schemes, this section focuses on several hardware-efficient phase estimation algorithms, assuming that the frequency offset has been largely compensated before phase recovery.

### Decision-Directed Phase-Locked Loop \(\textbf{(PLL)}\)

When very narrow linewidth lasers are used for a coherent optical communication system, a decision-directed (DD) \(\text{PLL}\) may be used to track the carrier phase in a feedback manner without employing any training sequence (i.e., fully blind recovery).

In particular, a second-order \(\text{DD}\)-\(\text{PLL}\) allows us to track not only the phase change, but also to some degree the frequency change.

Figure 2 shows a typical decision-directed second-order \(\text{PLL}\) for single-carrier-modulated systems. Here, the feedback phase error \(\phi_{error}\) is calculated as follows:

\[\tag{10}\phi_{error}(k)=\frac{\text{Im}\{\hat a^*_k\cdot y_ke^{-j\Delta\hat{\phi}(k)}\}}{|\hat{a}^*_k\cdot y_k|}\]

where \(k\) is the time index, \(y_{k}\) is the \(k\)th received sample (assuming one sample per symbol, after equalization is performed), \(\hat{a}^*_k\) is the conjugate of the \(k\)th decided symbol, and \(\Delta\hat{\phi}(k)\) is the \(k\)th estimated phase offset.

Here, \(\text y_ke^{-j\Delta\hat{\phi}(k)}\) is the received sample with phase correction based on an estimated phase offset.

By multiplying it with \(\hat{a}^*_k\), any phase information encoded in the sample is removed, and the symbol is rotated to the \(x\)-axis (real axis). Any deviation of this result from the \(x\)-axis is, therefore, representative of the error in phase estimation.

For small angles, \(\phi\approx\sin(\phi)\), so to simplify calculations, the angle is estimated by measuring the imaginary component, and normalized for magnitude scaling.

This normalization is not present in some literature; however, it has recently been found that the use of amplitude normalization could improve \(\text{PLL}\) performance in low signal-to-noise ratio \(\text{(SNR)}\) region. The remaining \(\text{PLL}\) equations are as follows:

\[\tag{11}\phi_i(k)=\phi_i(k-1)+\text g_i\phi_{error}(k)\]

\[\tag{12}\phi_\delta(k+1)=\text g_p\phi_\text{error}(k)+\phi_i(k)\]

\[\tag{13}\Delta\hat{\phi}(k+1)=\Delta\hat{\phi}(k)+\phi_\delta(k+1)\]
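Equations 10 to 13 can be sketched as a short sample-by-sample loop. This is a minimal illustration under stated assumptions, not the implementation studied in the literature: the names `dd_pll` and `decide`, the QPSK constellation, and the loop gains `g_p = 0.1` and `g_i = 0.005` are all choices made here for the example, and the feedback delay is the ideal single symbol.

```python
import numpy as np

QPSK = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)

def decide(sample, constellation=QPSK):
    """Hard decision: return the constellation point nearest to the sample."""
    return constellation[np.argmin(np.abs(constellation - sample))]

def dd_pll(y, g_p=0.1, g_i=0.005, constellation=QPSK):
    """Second-order decision-directed PLL (Equations 10 to 13).

    Returns the phase-corrected samples and the phase estimate applied to each one.
    """
    phase_est = 0.0                       # estimated phase offset for sample k
    phi_i = 0.0                           # integrator state
    corrected = np.empty_like(y)
    estimates = np.empty(len(y))
    for k, sample in enumerate(y):
        estimates[k] = phase_est
        corrected[k] = sample * np.exp(-1j * phase_est)
        a_hat = decide(corrected[k], constellation)
        # Equation 10: imaginary part of the data-stripped sample, amplitude-normalized
        err = (np.conj(a_hat) * corrected[k]).imag / abs(np.conj(a_hat) * sample)
        phi_i += g_i * err                # Equation 11
        phi_delta = g_p * err + phi_i     # Equation 12
        phase_est += phi_delta            # Equation 13
    return corrected, estimates
```

With noise-free QPSK symbols rotated by a constant 0.3 rad, the estimate returned by this loop settles at 0.3 rad. Modelling a parallel or pipelined implementation would require delaying the update by several symbols, which is where the tolerance loss discussed next comes from.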

The performance of a second-order \(\text{DD}\)-\(\text{PLL}\) in terms of both phase noise and frequency offset tolerance under ideal conditions (the feedback delay is assumed to be a single symbol clock cycle) has been studied numerically in the literature.

The results show that a second-order \(\text{DD}\)-\(\text{PLL}\) could recover both carrier frequency and phase for a 14 Gbaud 16QAM system when the signal-\(\text{LO}\) beat linewidth is smaller than 1 \(\text{MHz}\) and the signal-\(\text{LO}\) frequency offset is smaller than 140 \(\text{MHz}\).

For a realistic high-speed optical system, however, a high degree of parallelism has to be utilized, since the optimal \(\text{CMOS}\) clock rate is only about 500 \(\text{MHz}\).

Furthermore, there are operations within the feedback loop that cannot be completed in a single clock cycle, so pipeline architectures usually have to be employed. The use of parallel and pipeline architectures (assuming the conventional time-interleaving architecture) greatly increases the feedback delay, which results in much reduced frequency and phase noise tolerance compared with the ideal case of a single-symbol feedback delay.

Even with the use of external cavity-based narrow linewidth lasers (linewidth \(\sim\)100 \(\text{kHz}\)), the classic \(\text{DD}\)-\(\text{PLL}\) may still be useful only for certain lower-order modulation formats such as \(\text{BPSK}\) or \(\text{QPSK}\), which have a relatively large phase error tolerance as compared with higher-order \(\text{QAMs}\).

Recently, several variants of \(\text{DD}\)-\(\text{PLL}\)-based phase recovery schemes have been proposed by employing different types of loop filters and/or phase detection schemes.

Some of these modified \(\text{DD}\)-\(\text{PLLs}\) may have the potential to achieve slightly better phase estimation performance at the cost of increased implementation complexity.

But all of these feedback-based algorithms suffer from the same problem of poor linewidth tolerance due to the extended feedback delay. The extended feedback delay inherent in symbol-by-symbol time-interleaving could be avoided by using a block-by-block time-interleaved architecture (i.e., the so-called superscalar parallelization).

For this architecture, however, not only does extra overhead have to be introduced within each data block to start the \(\text{PLL}\) operation on a block-by-block basis, but substantial memory/buffer units are also needed to realize block-by-block parallelization. This not only complicates the circuit design but also introduces considerable latency.

### Mth-Power-Based Feedforward Algorithms

**Principle** One of the key problems for carrier phase recovery is how to remove the effect of data modulation. For \(\text{DD}\)-\(\text{PLL}\), the effect of data modulation is removed by making a decision-directed (decision feedback) phase estimate.

For equal-phase encoded communication systems, the effect of data modulation can also be removed by raising the \(M\)-ary \(\text{PSK}\) (phase-shift keying) signal to the \(M\)th power. To illustrate this principle, a quadrature phase-shift keying (QPSK) signal is used, in which the signal can be represented as

\[\tag{14}\text y_k=A\;\text{exp}\{j[\theta_{d,k}+\theta_{c,k}]\}\]

in which the optical carrier phase \(\theta_{c,k}\) is the phase of the transmitter laser referenced to the \(\text{LO}\), and the data phase takes on four values, \(\theta_{d,k}=0,\pm\pi/2,\pi.\)

In Equation 14, \(k\) denotes the \(k\)th symbol. When the received signal is raised to the fourth power as is shown in Figure 3, one obtains

\[\tag{15}A^4\;\text{exp}\{j[4(\theta_{d,k}+\theta_{c,k})]\}=A^4\;\text{exp}\{j[4\theta_{c,k}]\}\]

because \(\text{exp}\{j[4\theta_{d,k}]\}=1\), that is, the fourth-power operation strips off the data phase. The carrier phase can then be computed and subtracted from the phase of the received signal to recover the data phase as shown in Figure 3. Such a feedforward phase estimation scheme lends itself well to real-time digital implementation.

Figure 3, however, is an idealization in which no additive noise is present in the received signal.

In realistic systems, the received signal will contain noise dominated by either \(\text{ASE}\)–\(\text{LO}\) beat noise (where \(\text{ASE}\) is amplified spontaneous emission) or shot noise of the \(\text{LO}\). The impact of additive noise on the \(M\)th-power algorithm can be analyzed using a small-signal approximation-based method, as briefly described below.

With additive noise included, the received \(\text{QPSK}\) signal can be written as

\[\tag{16}\text y_k=A\text{exp}\{j[\theta_{d,k}+\theta_{c,k}]\}+n_k\]

where the amplitude of the received signal is normalized (\(A=1\)) in what follows, and \(n_k\) is the additive noise, which is assumed to be a complex zero-mean Gaussian random variable with variance \(\sigma^2_n\). Raising the received \(\text{QPSK}\) signal to the fourth power yields

\[\tag{17}\text y_k^4=\text{exp}\{j4\theta_{c,k}\}+4\;\text{exp}\{j[3\theta_{d,k}+3\theta_{c,k}]\}n_k+O(n_k^2)\]

and, in the small-angle approximation,

\[\tag{18}\text{arg}(y^4_k)=4\theta_{c,k}+\delta(\theta_{c,k})n_k+O(n_k^2)\]

where \(\delta(\theta_{c,k})\) is a small quantity. It is apparent that the phase estimate is no longer accurate in the presence of additive noise.

The phase estimation error will be of the order of \(\delta(\theta_{c,k})n_k/4\), so its variance is inversely proportional to the \(\text{SNR}\).

A typical method used to reduce the effect of additive noise on phase estimation error is to average the estimated phase over a sequence of symbols by filtering the per-symbol phase estimate through an equal-tap-weight transversal filter

\[\tag{19}\theta_{c,\text{est}}=\frac{1}{4}\text{arg}\left(\sum^{N_b}_{k=1}\text y^4_k\right)\]

This implementation is referred to as block-window filtering as sequences of data are processed in blocks. Assuming the carrier phase is constant over the sequence of symbols, the variance of the phase estimation error due to additive noise is then reduced by a factor equal to the symbol sequence length \(N_b\).
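The block-window estimator of Equation 19 maps directly onto array operations. The sketch below is a minimal example, assuming the QPSK data phases of Equation 14, a carrier phase that is constant over the record, and a hypothetical function name `mth_power_block_phase`; the noise level and block length are illustrative values only.

```python
import numpy as np

def mth_power_block_phase(y, m=4, block_len=32):
    """Equation 19 applied block by block: one phase estimate per block."""
    n_blocks = len(y) // block_len
    blocks = y[: n_blocks * block_len].reshape(n_blocks, block_len)
    # The m-th power strips the data phase; the sum over the block averages the noise
    return np.angle(np.sum(blocks ** m, axis=1)) / m

rng = np.random.default_rng(0)
data_phase = (np.pi / 2) * rng.integers(0, 4, 1024)   # data phases spaced by pi/2
carrier_phase = 0.2                                    # constant carrier phase (assumption)
noise = 0.05 * (rng.normal(size=1024) + 1j * rng.normal(size=1024))
y = np.exp(1j * (data_phase + carrier_phase)) + noise
theta_est = mth_power_block_phase(y, m=4, block_len=32)   # 32 estimates near 0.2 rad
```

Because the returned angle is divided by \(M\), every estimate lies within \(\pm\pi/M\), which is the origin of the phase ambiguity discussed below.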

The filtering process itself introduces an error in phase estimation, as the carrier phase is actually not constant owing to the finite beat linewidth of the transmitter and \(\text{LO}\) lasers.

This error thus increases with \(N_b\). A trade-off between these two effects dictates that the tap number for the additive noise filter must be optimized. In the case of block-window filtering, the carrier-phase estimate is the same for the entire sequence of symbols.

It was shown through a series of approximations that the phase estimation error for \(\text{QPSK}\) using a block window can be modeled as a zero-mean Gaussian random variable with a variance depending on the beat linewidth, electrical \(\text{SNR}\), and block size. The equal-tap-weight filter can also be implemented by using gliding window filtering.

Since the phase estimate is forced in value to the range \(-\pi/4\leq\theta_{c,\text{est}}(k)\leq\pi/4\), there is a fourfold phase ambiguity.

Fundamentally, the observed fourfold phase ambiguity originates from the rotational symmetry of the \(\text{QPSK}\) constellation as is discussed below.

Thus, the phase estimator cannot discern if the recovered phase is accidentally rotated by \(\pi/2\). This problem is widely referred to as the “cyclic slip problem” in the literature.

If left unaddressed, catastrophic error propagation may occur, because a single cyclic slip may cause errors for all following symbols until another cyclic slip in the reverse direction cancels it.

### Differential Bit Coding

Quadrant-based differential coding/decoding can resolve the error propagation problem caused by cyclic phase slips. Using \(\text{QPSK}\) as an example, this technique can be described by the following formulae:

\[\tag{20}\begin{array}{l}q_{o,k}=(q_{r,k}-q_{r,k-1}+q_{j,k})\;\text{mod}\;4\\q_{o,k},q_{r,k},q_{j,k}\in\{0,1,2,3\}\end{array}\]

where \(q_{o,k}\) denotes the differentially decoded quadrant number (see Figure 4), \(q_{r,k}\) is the received quadrant number, and \(q_{j,k}\) is the quadrant jump number, which can be detected by using the following criterion:

\[\tag{21}\left|\theta_{c,\text{est}}(k)-\theta_{c,\text{est}}(k-1)-\frac{\pi}{2}q_{j,k}\right|<\frac{\pi}{4}\]

In Equation 21, \(k\) denotes the \(k\)th block for the block-by-block phase estimation. Quadrant-based differential encoding at the transmitter side is given by

\[\tag{22}q_{c,k}=(q_{d,k}+q_{c,k-1})\;\text{mod}\;4\]

where \(q_{c,k}\) denotes the differentially encoded quadrant number while \(q_{d,k}\) denotes the original quadrant number (before differential encoding).

For a more general \(M\)-\(\text{PSK}\), the \(\pi/4\) in Equation 21 should be replaced by \(\pi/M\), the \(\pi/2\) step by \(2\pi/M\), and the mod 4 operation by mod \(M\).
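The effect of differential coding on a cyclic slip can be checked with a short sketch of Equations 20 and 22. The function names are hypothetical, and the quadrant numbers before the first symbol are assumed to be 0. The example applies an undetected slip of one quadrant from index 20 onward and shows that only a single decoded symbol is affected.

```python
import numpy as np

def diff_encode(q_d, m=4):
    """Equation 22: q_c[k] = (q_d[k] + q_c[k-1]) mod m, with the initial state taken as 0."""
    return np.cumsum(q_d) % m

def diff_decode(q_r, q_j=None, m=4):
    """Equation 20: q_o[k] = (q_r[k] - q_r[k-1] + q_j[k]) mod m, initial state taken as 0."""
    q_r = np.asarray(q_r)
    q_j = np.zeros_like(q_r) if q_j is None else np.asarray(q_j)
    prev = np.concatenate(([0], q_r[:-1]))
    return (q_r - prev + q_j) % m

rng = np.random.default_rng(0)
q_d = rng.integers(0, 4, 50)          # original quadrant numbers
q_c = diff_encode(q_d)                # transmitted (differentially encoded) quadrants

# Undetected cyclic slip: every symbol from index 20 onward is rotated by one quadrant
q_r = q_c.copy()
q_r[20:] = (q_r[20:] + 1) % 4
errors = np.flatnonzero(diff_decode(q_r) != q_d)   # only index 20 is in error
```

If the slip is detected and the jump number is supplied to the decoder (here `q_j[20] = 3`, which cancels the added quadrant modulo 4), even that single error is removed.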

### Wiener Filtering

Because the laser phase noise is a Wiener process and the additive noise is Gaussian, the Wiener filter that is applied to the unwrapped phase estimate provides the best phase estimation according to estimation theory.

The Wiener filter can make an estimate \(\theta_{c,\text{est}}(k)\) based on all samples up to and including \(y_k\), in which case the filter is referred to as the zero-lag Wiener filter.

Alternatively, the Wiener filter can make an estimate \(\theta_{c,\text{est}}(k)\) based on all samples up to and including \(y_{k+D}\), where \(D\) is a positive integer, in which case the filter is referred to as the finite-lag Wiener filter.

The finite-lag Wiener filter has been shown to perform better because it estimates the phase both forward and backward in time. The \(z\)-domain transfer functions of the zero-lag and finite-lag Wiener filters are given by Taylor

\[\tag{23}H_{ZL}(z)=\frac{1-a}{1-az^{-1}}\]

\[\tag{24}H_{FL}(z)=\frac{(1-a)a^D+(1-a)^2\sum^D_{k=1 }a^{D-k}z^{-k}}{1-az^{-1}}\]

from which the coefficients of the finite impulse response \(\text{(FIR)}\) filters can be obtained accordingly. The parameter \(a\) depends on the variance of phase noise and additive noise and is given by

\[\tag{25}a=\frac{M^2\sigma^2_w+2\sigma^2_q-M\sigma_w\sqrt{M^2\sigma^2_w+4\sigma^2_q}}{2\sigma^2_q}\]

for an \(M\)-ary \(\text{PSK}\) signal. In Equation 25, \(\sigma^2_q\) is the variance of the noise term \(\delta(\theta_{c,k})n_k\) in Equation 18 and \(\sigma^2_w=2\pi\Delta\nu T\), where \(\Delta\nu\) is the beat linewidth of the transmitter and \(\text{LO}\) lasers, and \(T\) is the symbol period.

While the power law combined with finite-lag Wiener filters produces an optimal phase estimate, its real-time implementation in a parallel-architecture digital processor may require complicated look-ahead computation algorithms, since a finite-lag Wiener filter needs feedback of the immediately preceding result in several places.

As an alternative suboptimal solution, a gliding window filter might be more compatible with real-time implementation.
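To make Equations 23 to 25 concrete, the sketch below computes the parameter \(a\), applies the zero-lag filter as a first-order recursion, and builds the numerator taps of the finite-lag filter. It is a minimal illustration: the function names, the input name `psi` (standing for the unwrapped per-symbol phase estimate), and the choice to initialize the recursion with the first sample are assumptions made here.

```python
import numpy as np

def wiener_a(m, sigma_w2, sigma_q2):
    """Equation 25: filter parameter a from phase-noise and additive-noise variances."""
    root = np.sqrt(m ** 2 * sigma_w2 + 4 * sigma_q2)
    return (m ** 2 * sigma_w2 + 2 * sigma_q2 - m * np.sqrt(sigma_w2) * root) / (2 * sigma_q2)

def zero_lag_wiener(psi, a):
    """Equation 23 as a recursion: out[k] = a * out[k-1] + (1 - a) * psi[k]."""
    out = np.empty(len(psi))
    state = psi[0]                      # start at the first sample (assumption)
    for k, p in enumerate(psi):
        state = a * state + (1 - a) * p
        out[k] = state
    return out

def finite_lag_numerator(a, d):
    """Numerator taps b_0..b_D of Equation 24; the denominator is 1 - a z^-1."""
    b = np.empty(d + 1)
    b[0] = (1 - a) * a ** d
    k = np.arange(1, d + 1)
    b[1:] = (1 - a) ** 2 * a ** (d - k)
    return b

# Illustrative numbers: QPSK, 100 kHz beat linewidth at 28 Gbaud, sigma_q^2 = 0.01
a = wiener_a(4, 2 * np.pi * 100e3 / 28e9, 0.01)
taps = finite_lag_numerator(a, 5)
```

A useful sanity check is that both filters have unit gain at zero frequency, so a constant phase passes through unchanged: the numerator taps of Equation 24 sum to \(1-a\). A smaller \(a\) (stronger phase noise relative to additive noise) shortens the effective averaging window.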

### Discussion

To reduce the implementation complexity, the \(M\)th-power operation used for an \(M\)-ary \(\text{PSK}\) signal, which requires \(M\) complex multipliers, can be replaced by a much simpler angle-based Mod operation as

\[\tag{26}\theta_{c,\text{est}}(k)=\text{arg}\{r_k\}\;\text{mod}\;\frac{2\pi}{M}\]

because Equation 26 shares the same property as

\[\tag{27}\theta_{c,\text{est}}(k)=\frac{1}{M}\left[\text{arg}\{r^M_k\}\;\text{mod}\;2\pi\right]\]

The \(M\)th-power-based algorithm allows the use of a feedforward configuration (see Figure 3), so its phase noise tolerance performance will not be limited by the use of parallel and pipeline processing.

Although this algorithm was originally proposed for \(M\)-\(\text{PSK}\) systems, it can be extended to \(\text{QAM}\)-modulated systems by using constellation partition-based methods.

The basic idea is to classify \(\text{QAM}\) symbols and select only those symbols having equal phase spacing for phase estimation.

For a high-order \(\text{QAM}\), however, only a small portion of \(\text{QAM}\) symbols can be used for phase estimation. This inevitably results in reduced phase noise tolerance.

## 4. Blind Phase Search (BPS) Feedforward Algorithms

### Principle

For arbitrary \(\text{QAM}\) modulation formats, robust phase recovery can be achieved by using a minimum distance-based blind phase search \(\text{(BPS)}\) algorithm.

The \(\text{BPS}\) algorithm was first introduced for coherent optical transmission systems.

For this algorithm, the carrier phase is scanned over a limited phase range (\([0,\;\pi/2]\) for a square \(\text{QAM}\)) at fixed or variable phase increments, and the decisions made following each trial phase are treated as the correct/reference signal for the mean square distance error \(\text{(MSDE)}\) calculation. The optimal phase is the one that gives the minimum \(\text{MSDE}\).

The principle of this algorithm is illustrated in Figure 5. For convenience, we denote the digitized signal (one sample per symbol) entering into the carrier phase recovery module as \(\text y_k\).

To recover the carrier phase in a pure feedforward approach, \(\text{BPS}\) requires \(\text y_k\) to be rotated by multiple test carrier phase angles \(\phi_m\). If the constellation is rotationally symmetric by \(\gamma\), the trial phase angle can be selected by

\[\tag{28}\phi_m=\frac{m-1}{B}\cdot\gamma,\quad m\in\{1,2,3,\cdots,B\}\]

where \(B\) denotes the total number of selected trial phase angles. For square \(\text{QAM}\) constellations, \(\gamma=\pi/2\) holds. Without rotational symmetry, \(\gamma=2\pi\) must be used.

Then, all rotated symbols are fed into a decision circuit and the squared distance \(|d_{k,m}|^2\) to the closest constellation point is calculated.

In order to remove distortions from additive noise, the distances of \(2N\) consecutive test symbols rotated by the same carrier phase angle \(\phi_m\) are summed

\[\tag{29}e_{k,m}=\sum^N_{n=-N+1}|d_{k-n,m}|^2\]

and the “optimum” phase angle is determined by searching for the minimum sum of the distance values.

As the decoding was already executed in each phase test unit, the decoded output symbol \(\hat Y_k\) can be selected from the \(\hat Y_{k,m}\) by a switch controlled by the index \(m_{k,min}\) of the minimum distance sum.
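The search described above can be written compactly with array broadcasting. The following is a minimal sketch rather than a hardware-oriented design: the function names, the sign convention of the test rotation (each sample is multiplied by \(e^{-j\phi_m}\), so the selected \(\phi_m\) is an estimate of the carrier phase modulo \(\gamma\)), and the zero-padded sliding window at the record edges are assumptions made for illustration.

```python
import numpy as np

def square_qam(order):
    """Square QAM constellation on an odd-integer grid (not power-normalized)."""
    side = int(round(np.sqrt(order)))
    levels = np.arange(-(side - 1), side, 2)
    return (levels[:, None] + 1j * levels[None, :]).ravel()

def blind_phase_search(y, constellation, b=32, n=8, gamma=np.pi / 2):
    """Return (phase estimate per symbol, decided symbols) for received samples y."""
    phis = np.arange(b) * gamma / b                         # Equation 28 test phases
    rotated = y[None, :] * np.exp(-1j * phis)[:, None]      # shape (B, K)
    # Hard decision for every test phase: index of the closest constellation point
    idx = np.argmin(np.abs(rotated[:, :, None] - constellation[None, None, :]), axis=2)
    decided = constellation[idx]
    d2 = np.abs(rotated - decided) ** 2                     # squared distances
    # Equation 29: sum the distances over 2N neighbouring symbols per test phase
    window = np.ones(2 * n)
    e = np.array([np.convolve(row, window, mode="same") for row in d2])
    m_min = np.argmin(e, axis=0)                            # best test phase per symbol
    return phis[m_min], decided[m_min, np.arange(len(y))]
```

For a noise-free 16QAM signal rotated by one of the test phases, for example `5 * (np.pi / 2) / 32`, the function returns that phase for every symbol together with the transmitted symbols. With noise, the result is quantized to the test-phase grid, which is why the required \(B\) grows with the modulation order.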

### Generalized Differential Bit Coding

Due to the fourfold ambiguity of the recovered phase in the square \(\text{QAM}\) constellation, the receiver cannot uniquely assign the \(\text{log}_2(M)\) bits to the recovered symbol.

This problem can be resolved by applying a generalized differential coding and decoding technique. For square \(M\)-\(\text{QAM}\) constellations, the differential encoding and decoding process is the same as for \(\text{QPSK}\) since \(M\)-\(\text{QAM}\) also exhibits fourfold phase ambiguity.

Thus, it is sufficient to differentially Gray-encode the two bits that determine the quadrant of the complex plane. The only required modification of the decoding process compared with Equations 20 and 21 is that quadrant jumps are detected according to the following formula:

\[\tag{30}q_{j,k}=\begin{cases}1,&\text{if}\;m_{k,\text{min}}-m_{k-1,\text{min}}>B/2\\3,&\text{if}\;m_{k,\text{min}}-m_{k-1,\text{min}}<-B/2\\0,&\text{otherwise}\end{cases}\]

All other bits that determine the symbol within the quadrant of the complex plane are Gray-encoded without any differential encoding or decoding. Figure 6 exemplifies the bit-to-symbol assignment including differential encoding/decoding for square \(\text{16QAM}\).
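The jump rule of Equation 30 amounts to detecting a wrap-around of the best test-phase index between consecutive symbols. A minimal sketch, with a hypothetical function name, is:

```python
def quadrant_jump(m_curr, m_prev, b):
    """Equation 30: quadrant jump number from consecutive best test-phase indices."""
    diff = m_curr - m_prev
    if diff > b / 2:
        return 1      # index wrapped from the bottom of the search range to the top
    if diff < -b / 2:
        return 3      # index wrapped from the top of the search range to the bottom
    return 0          # no wrap: the phase moved within the search range
```

For \(B=32\), a change of the best index from 1 to 30 gives a jump number of 1, a change from 30 to 1 gives 3, and a change from 12 to 10 gives 0.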

For arbitrary \(\text{QAM}\) constellations with \(k\)-fold phase ambiguity \((k=2\pi/\gamma)\), \(\lceil\log_2 k\rceil\) bits should be differentially encoded/decoded, where \(\lceil u\rceil\) is the smallest integer larger than or equal to \(u\). The differential decoding formulae given in Equations 20 and 21 should then be modified as

\[\tag{31}\begin{array}{l}q_{o,k}=(q_{r,k}-q_{r,k-1}+q_{j,k})\;\text{mod}\;k\\q_{o,k},q_{r,k},q_{j,k}\in\{0,\cdots,k-1\}\end{array}\]

Although the differential-bit-encoded \(\text{QAM}\) resolves the phase ambiguity and the associated error propagation problem, this capability comes with an intrinsic bit error rate \(\text{(BER)}\) penalty.

For example, for \(\text{QPSK}\), an isolated single symbol error will result in two consecutive symbol errors after differential decoding. But recent progress in coded modulation techniques reveals that this \(\text{BER}\) penalty can be minimized by using an iterative differential decoding technique, as described in another tutorial.

### Performance Discussions

As a feedforward phase estimation algorithm, the \(\text{BPS}\) allows all constellation points to be used for phase estimation with arbitrary \(\text{QAMs}\).

As a result, for higher-order \(\text{QAMs}\), this method can achieve much better linewidth tolerance than constellation-partition-based \(M\)th-power algorithms.

With this algorithm, the carrier phase estimator efficiency, which is defined as the ratio between the \(\text{CRLB}\) and the mean squared error of the phase estimator output, can reach 80% for square \(\text{16QAM}\) by using 6-bit phase quantization \((B=64)\), which is close to an optimal phase estimator. The linewidth tolerance achieved by using the \(\text{BPS}\) algorithm with different square \(\text{QAM}\) constellations is summarized in Table 1, where \(\Delta f\) denotes the signal-\(\text{LO}\) beat linewidth.

**Table 1.** Achievable linewidth tolerance using the BPS algorithm with differing square QAM constellations

### Hardware Efficiency Discussions

The \(\text{BPS}\) algorithm requires many vector rotation operations. The rotation of a symbol in the complex plane normally requires a complex multiplication, consisting of four real-valued multiplications with subsequent summation.

This would lead to a large number of multiplications having to be executed in order to achieve a sufficient resolution for the carrier phase values \(\phi_m\). The hardware effort would, therefore, become prohibitive.

Applying the \(\text{CORDIC}\) (coordinate rotation digital computer) algorithm greatly reduces the hardware effort needed for calculating the \(B\) rotated test symbols, since this algorithm computes vector rotations by simply using summation and shift operations.
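The shift-and-add structure of a CORDIC rotation can be illustrated with the floating-point emulation below. This is a generic textbook-style sketch, not the specific circuit used in any BPS implementation: the function name and iteration count are assumptions, and in hardware the multiplications by \(2^{-i}\) become bit shifts, the arctangent values come from a small table, and the constant gain is applied once or absorbed elsewhere.

```python
import math

def cordic_rotate(x, y, angle, n_iter=24):
    """Rotate the vector (x, y) by `angle` radians using CORDIC micro-rotations.

    Converges for |angle| up to about 1.74 rad, which covers a [0, pi/2) search range.
    """
    z = angle                                  # residual angle still to be rotated
    gain = 1.0
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0            # rotate toward zero residual angle
        shift = 2.0 ** -i                      # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)              # table lookup in hardware
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain                  # remove the constant CORDIC gain
```

Each iteration uses only additions, subtractions, and shifts, and adds roughly one bit of angular resolution, so the number of iterations can be matched to the required trial-phase resolution.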

The hardware efficiency can be further improved by using a lookup table-based mean square distance calculation method. By using the \(\text{CORDIC}\) and the lookup table-based distance calculation method, the required overall hardware effort for the \(\text{BPS}\) is roughly \(B\) times higher than \(\text{QPSK}\) phase recovery using the \(M\)th-power-based algorithm; however, the required \(B\) increases with the modulation order.

For example, in order to achieve a close-to-optimal performance, the required number of trial phases needs to be greater than 16 for \(\text{16QAM}\), and greater than 64 for \(\text{64QAM}\). So the implementation complexity can still be high for high-order \(\text{QAMs}\).

## 5. Multistage Carrier Phase Recovery Algorithms

To address the performance and implementation complexity challenges facing higher-order coherent \(\text{QAM}\) systems, several multistage phase recovery algorithms have recently been proposed.

The core idea of these algorithms is to use a hardware-efficient but less-accurate phase estimator at the first stage to perform coarse-phase estimation, and then refine the estimated phase with a more accurate fine-phase estimator.

To further improve performance, more than one fine-phase estimation stage may be applied. The coarse-phase estimation stage could be a \(\text{BPS}\) estimator with coarse trial-phase resolution, a decision-directed \(\text{PLL}\), or a constellation partition-based \(M\text{th}\)-power algorithm.

If training symbols are allowed, the coarse phase can also be estimated from sparsely and periodically inserted training symbols. Fine-phase estimation can be realized by using the \(\text{BPS}\) estimator with finer phase resolution over a narrower phase-varying range, the constellation-assisted maximum likelihood \(\text{(ML)}\) phase estimator, or some constellation transformation-based algorithms, in which the regular \(\text{QAM}\) constellation after coarse-phase recovery is first transformed into an \(M\)-ary PSK-like constellation and then the \(M\)th-power algorithm is applied to the transformed constellation for a more accurate phase estimation.

In the following, we describe in more detail three hardware-efficient multistage phase recovery algorithms that have been demonstrated for high-\(\text{SE}\) \(100\;\text{Gb/s}\) and beyond transmission experiments.

These are \(\text{(i)}\) the multistage hybrid \(\text{BPS/ML}\) phase recovery algorithm, \(\text{(ii)}\) the hybrid \(\text{PLL/ML}\) algorithm, and \(\text{(iii)}\) the training-assisted two-stage \(\text{BPS/ML}\) algorithm.

### Multistage Hybrid BPS and ML Algorithm

The principle of the multistage hybrid \(\text{BPS/ML}\) algorithm is illustrated in Figure 7.

For this method, the \(\text{BPS}\) algorithm with coarse trial-phase resolution is used in the first stage to find a rough location of the optimal phase angle.

The decoded/decided signal \(\hat Y^{(1)}_n\) based on this rough phase estimation (along with the original signal \(\text y_n\)) is then fed into the second stage, where an \(\text{ML}\) phase estimator is employed to find a more accurate phase estimate \(\varphi^{ML}_k\), following Proakis:

\[\tag{32}H_k=\sum^{k}_{n=k-N+1}\text y_n[\hat Y^{(1)}_n]^*\]

\[\tag{33}\varphi_k^{ML}=\tan^{-1}\{\text{Im}[H_k]/\text{Re}[H_k]\}\]

where \(\hat Y^{(1)}_n\) serves as the (approximate) reference signal and \(N\) denotes the block length used for phase averaging. The decoded signal \(\hat Y^{(2)}_n\) based on this \(\text{ML}\) phase estimate, along with the original signal \(\text y_n\), may be passed into another \(\text{ML}\) phase estimation stage to further refine the phase estimation.
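The \(\text{ML}\) refinement of Equations 32 and 33 amounts to one complex correlation per block followed by an angle computation. The sketch below assumes the block of received symbols and the first-stage decisions are already available as Python lists; the function name `ml_phase_estimate` is chosen here for illustration only. A four-quadrant arctangent is used in place of \(\tan^{-1}\{\text{Im}/\text{Re}\}\) so that the sign of \(\text{Re}[H_k]\) is not lost.

```python
import cmath


def ml_phase_estimate(received, decisions):
    """Phase estimate for one block from received symbols and stage-1 decisions.

    received  : list of complex received symbols y_n in the block
    decisions : list of complex decided symbols from the coarse stage
    """
    # H_k: correlate the received block with the conjugated reference symbols.
    h = sum(y * d.conjugate() for y, d in zip(received, decisions))
    # Four-quadrant angle of H_k, i.e. atan2(Im[H_k], Re[H_k]).
    return cmath.phase(h)
```

If the coarse-stage decisions are mostly correct, the data modulation cancels in each product and only the common carrier phase (plus averaged noise) remains in the sum.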

The effectiveness of this method has been demonstrated through both simulation and experiments. Figure 8 shows the simulated results for a \(38\;\text{Gbaud}\) square \(\text{PDM}\)-\(\text{64QAM}\) system, where the required equivalent number of test phase angles \(\text{(ENTPA)}\) is used to measure the relative implementation complexity for three different phase recovery algorithms – the single-stage \(\text{BPS}\), and the two- and three-stage hybrid \(\text{BPS/ML}\).

For this study, the laser linewidth for both the signal source and the \(\text{LO}\) is assumed to be \(100\;\text{kHz}\). The received \(\text{OSNR}\) in 0.1 nm noise bandwidth and for a single polarization is \(28\;\text{dB}\).

Thus, it can be seen that, in order to achieve a performance that is close to optimum, the single-stage BPS method needs to test approximately 64 different phase angles, while the three-stage hybrid \(\text{BPS/ML}\) algorithm only needs to equivalently test 18 different phase angles (14 test phase angles used in the first coarse \(\text{BPS}\) stage plus two cascaded \(\text{ML}\) phase estimation stages).

This results in a reduction of computational effort by more than a factor of 3.

The simulated \(\text{BER}\) performance versus \(\text{OSNR}\) is given in Figure 9, in which we compare two different phase recovery methods: the single-stage \(\text{BPS}\) using \(64\) test phase angles and the three-stage hybrid \(\text{BPS/ML}\) using an equivalent \(18\) test phase angles.

Two laser linewidths, \(100\;\text{kHz}\) and \(1\;\text{MHz}\), are investigated here, with corresponding phase block lengths of \(28\) and \(16\), respectively.

The results with \(0\;\text{kHz}\) laser linewidth using ideal phase recovery are also displayed as a reference. For the \(100\;\text{kHz}\) laser linewidth, the multistage method (using equivalent \(18\) test phase angles) can achieve almost identical performance to the single-stage \(\text{BPS}\) method using three times more test phase angles for a wide range of \(\text{OSNR}\) values.

The impact of \(\text{ASE}\) noise on the performance of the \(\text{ML}\) phase estimate introduced in the new multistage method is quite small even for \(\text{OSNR}\) down to \(23\;\text{dB}\) (corresponding to a \(\text{BER}\) of \(7.6\times10^{-2}\)). For a \(1\;\text{MHz}\) laser linewidth, however, the multistage method exhibits a slightly worse performance compared with the single-stage \(\text{BPS}\).

This may indicate that the \(\text{ML}\) phase estimate is more sensitive to the residual phase error than the \(\text{BPS}\) method because a larger linewidth implies a faster-changing symbol phase, resulting in larger residual phase errors after block-by-block-based phase recovery.

### Multistage Hybrid PLL/ML Algorithm

To further reduce the implementation complexity, a different multistage strategy has been studied in detail, in which a first-order decision-directed \(\text{PLL}\) is used as a coarse phase estimator and the \(\text{ML}\) phase estimator is used for fine phase recovery.

This design is illustrated in Figure 10. It is shown that such a multistage configuration can reduce the implementation complexity by more than one order of magnitude as compared with the single-stage \(\text{BPS}\) for a \(\text{64QAM}\) system, but at the cost of reduced laser phase noise tolerance when the degree of parallelism is high. Compared with the traditional single-stage \(\text{DD}\)-\(\text{PLL}\), however, such a multistage algorithm can improve the linewidth tolerance by more than two orders of magnitude.

The effectiveness of this algorithm has been tested in a \(9.4\;\text{Gbaud}\) \(64\text{QAM}\) (single polarization) back-to-back experiment as is shown in Figure 11(a) and (b). Figure 11(a) shows the impact of parallel processing on the proposed algorithms for a constant \(23\;\text{dB}\) \(\text{OSNR}\).

One can observe that one \(\text{PLL}\) followed by two \(\text{ML}\) stages can achieve the same \(\text{BER}\) performance as the single-stage \(\text{BPS}\) for up to \(P=20\) symbol-by-symbol interleaved parallel paths.

The \(\text{BER}\) performance versus \(\text{OSNR}\) level for \(P=16\) is given in Figure 11(b). This multistage algorithm can achieve a performance similar to the BPS for a wide range of \(\text{OSNR}\) levels with \(\text{BER}\) ranging from \(2\times10^{-2}\) to close to \(10^{-4}\).

Figure 12 shows the simulated linewidth tolerance of different phase recovery algorithms for a 38 Gbaud square \(\text{64QAM}\) system (laser \(\text{linewidth}=100\;\text{kHz}\), \(\text{OSNR}=25\;\text{dB}\), and \(\text{PLL}\) pipeline delay \(D=5\)).

One can see that the linewidth tolerance of the multistage algorithm is more than two orders of magnitude better than that of the \(\text{PLL}\)-only method.

### Training-Assisted Two-Stage Phase Recovery Algorithm

As mentioned earlier, if training symbols are allowed, the coarse phase can be estimated from the sparsely and periodically inserted training symbols.

Figure 13 shows a functional block diagram of a training-assisted two-stage phase recovery algorithm that has been proposed and demonstrated. For this method, training symbols (known at the receiver) are periodically inserted into the data stream to assist in the phase recovery.

These training symbols may also be used for other purposes, such as frame synchronization. To reduce overhead, training symbols are only sparsely inserted at the transmitter.

At the receiver, the received data are processed block by block, where each block consists of at least two training symbols. For each block of data,

• First, the average phase over this block is estimated by using the inserted training symbols through an \(\text{ML}\) phase estimator and

• second, each block is divided into multiple groups, and then the phase of each group is refined by using a \(\text{BPS}\)-based phase estimator over a small phase-varying range that is centered at the average phase estimated through the training symbols.
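The two steps above can be sketched as follows. This is a simplified illustration rather than the demonstrated implementation: it assumes \(\text{QPSK}\) symbols so that the decision rule stays short, and the names `coarse_phase`, `refine_phase`, `two_stage_phase`, the search span of 0.2 rad, and the nine trial phases are hypothetical choices for the example.

```python
import cmath


def decide_qpsk(s):
    """Nearest point of the (unnormalized) QPSK constellation {+-1 +- 1j}."""
    return complex(1.0 if s.real >= 0 else -1.0, 1.0 if s.imag >= 0 else -1.0)


def coarse_phase(block, train_idx, train_sym):
    """Stage 1: ML phase estimate from the known training symbols only."""
    h = sum(block[i] * t.conjugate() for i, t in zip(train_idx, train_sym))
    return cmath.phase(h)


def refine_phase(group, center, span=0.2, n_trials=9):
    """Stage 2: BPS over a narrow range of trial phases centered at `center`."""
    best_phase, best_cost = center, float("inf")
    for m in range(n_trials):
        trial = center - span / 2 + span * m / (n_trials - 1)
        cost = 0.0
        for s in group:
            r = s * cmath.exp(-1j * trial)       # derotate by the trial phase
            cost += abs(r - decide_qpsk(r)) ** 2  # squared distance to decision
        if cost < best_cost:
            best_phase, best_cost = trial, cost
    return best_phase


def two_stage_phase(block, train_idx, train_sym, n_groups=4):
    """Return one refined phase estimate per group of the block."""
    center = coarse_phase(block, train_idx, train_sym)
    size = len(block) // n_groups
    return [refine_phase(block[g * size:(g + 1) * size], center)
            for g in range(n_groups)]
```

Because the search in the second stage is anchored to the training-derived phase, the estimate is an absolute phase rather than one that is only known modulo \(\pi/2\).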

One significant advantage of this training-assisted algorithm is its robustness against cyclic phase slips, which may remove the need for differential encoding/decoding.

Since the baseline phase is recovered from the training symbols, there is inherently no phase ambiguity problem for this algorithm.

There is still a chance that the recovered phase may deviate from the true phase by as much as \(\pi/2\) due to the impact of large linear or nonlinear noise, but such a large phase error or phase jump will only impact a single data block, because the baseline phases of different blocks are estimated independently. This effectively prevents errors from propagating from one block to another.

The validity of this new method has been verified by a \(400\;\text{Gb/s}\) experiment using a time-domain hybrid \(32\)–\(\text{64QAM}\). Its robustness against cyclic phase slips is shown in Figure 14, which displays the carrier phases recovered by two different algorithms, the training-assisted two-stage algorithm and the conventional \(\text{BPS}\), for a back-to-back measurement with \(\text{OSNR}=24.2\;\text{dB}\) (corresponding to a bit error ratio of \(2\times10^{-2}\) when using this training-assisted algorithm).

One can see that there was no phase jump with the training-assisted two-stage algorithm, whereas the phase-jump problem was severe (due to low \(\text{OSNR}\), nonideal equalization, and signal constellation in this experiment) when using the conventional single-stage \(\text{BPS}\) algorithm, which mandates use of differential coding.

This new two-stage algorithm can achieve phase noise tolerance comparable to or even better than (in the low \(\text{OSNR}\) region) that of the single-stage \(\text{BPS}\) method with much lower implementation complexity. The required (approximate) 2% training overhead can be further reduced by exploring joint phase recovery over two orthogonal polarization states for current polarization-multiplexed transmission systems or joint phase recovery over multiple spatial channels for future space division multiplexed systems.

**TABLE 2. A comparison for several recently demonstrated carrier phase recovery algorithms**

Joint phase recovery over multiple spatial channels can also be explored to improve the linewidth tolerance.

As a summary, Table 2 gives a brief comparison of several phase recovery algorithms discussed in this section in terms of the achievable hardware efficiency, linewidth tolerance, and several other metrics.

## 6. HARDWARE-EFFICIENT FREQUENCY RECOVERY ALGORITHMS

### Coarse Auto-Frequency Control (AFC)

As described in Section 1, before the nominal (fine) frequency and phase recovery, a feedback-based coarse \(\text{AFC}\) is usually required in order to lock the \(\text{LO}\) frequency close to the incoming signal source frequency (typically within tens or a few hundreds of megahertz range).

The key component for the coarse \(\text{AFC}\) is the frequency error detector \(\text{(FED)}\). Since a nonzero frequency offset between the \(\text{LO}\) and the signal source will cause signal spectrum asymmetry, two types of \(\text{FEDs}\) based on this spectrum asymmetry have been developed. The first one is a balanced quadricorrelator \(\text{(BQ)}\), and the other one is based on differential power measurement \(\text{(DPM)}\).

### BQ-Based FED

Figure 15 shows a typical balanced quadricorrelator \(\text{(BQ)}\). Assume that the signal entering into the BQ can be expressed by

\[\tag{34}\text y_k=(a_k+jb_k)e^{j2\pi\Delta ft_k}\]

where \(\Delta f\) denotes the frequency offset, \(a_k\) and \(b_k\) denote the real and imaginary components of the transmitted (complex) signal. For conceptual simplicity, here we ignore the additive noise as well as carrier phase noise.

A straightforward analysis of the balanced quadricorrelator of Figure 15 yields the following for its output signal:

\[\tag{35}u_k=[a_ka_{k-1}+b_kb_{k-1}]\sin(2\pi\Delta fT_s)+[b_ka_{k-1}-a_kb_{k-1}]\cos(2\pi\Delta fT_s)\]

where \(T_s\) is the sampling period. The first term of Equation 35 is the desired frequency error signal as long as the sampling period is shorter than the signal correlation period such that \(E\{a_ka_{k-1}+b_kb_{k-1}\}\) is nonzero.

The second term is undesired because it only produces pattern jitter if random data are transmitted in an \(\text{MQAM}\) or \(\text{MPSK}\) transmission system in which \(M>2\). But for real-value modulated signals, such as \(\text{PAM}\) or \(\text{BPSK}\) where \(b_k=0\), the second term will be zero.

Even with \(\text{MQAM}\) or \(\text{MPSK}\), where the second term is nonzero, the resulting pattern jitter can still be suppressed by using a low-bandwidth loop filter: \(\pm10\;\text{GHz}\) frequency offset error detection has been demonstrated by using a \(\text{BQ}\)-based \(\text{FED}\) for a \(43\;\text{Gb/s}\) \(\text{QPSK}\) coherent receiver with a loop filter bandwidth of \(625\;\text{kHz}\).
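In discrete time, the output in Equation 35 is equal to \(\text{Im}\{\text y_k\text y^*_{k-1}\}\), which gives a compact way to model the detector numerically. The sketch below is an illustration under simple assumptions: a noiseless \(\text{QPSK}\) signal with rectangular pulses at two samples per symbol (so that consecutive samples are correlated), with a plain average standing in for the low-bandwidth loop filter. The names `bq_error` and `qpsk_samples` are hypothetical.

```python
import cmath
import math
import random


def qpsk_samples(n_symbols, df_ts, sps=2, seed=0):
    """Random QPSK with rectangular pulses; df_ts is the offset times T_s."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_symbols):
        sym = complex(rng.choice((-1, 1)), rng.choice((-1, 1)))
        for _ in range(sps):
            k = len(out)  # running sample index
            out.append(sym * cmath.exp(2j * math.pi * df_ts * k))
    return out


def bq_error(samples):
    """Averaged quadricorrelator output: mean of Im{y_k * conj(y_{k-1})}."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur * prev.conjugate()).imag
    return total / (len(samples) - 1)
```

With enough averaging, the sign of `bq_error` follows the sign of the frequency offset, which is the property a feedback loop needs. The pattern-dependent second term of Equation 35 shows up as fluctuation around that mean.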

### DPM-Based FED

Figure 16 shows the principle of the \(\text{DPM}\)-based \(\text{FED}\). The incoming signal \(\text y(t)\) is fed into two bandpass filters \(H_p(f)\) and \(H_n(f)\).

The output signal \(u(t)\) of the \(\text{DPM}\)-\(\text{FED}\) is then the difference between the instantaneous powers coming out of the two bandpass filters.

In order to provide a useful error signal that can be fed into a feedback network, the signal \(u(t)\) must have the following properties:

1. If the incoming signal \(\text y(t)\) is centered at zero frequency, it is necessary that the mean value \(E\{u(t)\}\) be zero.

2. If there is a frequency offset \(\Delta f\) then \(E\{u(t)\}\) must be a measure of this frequency offset.

For example, the first condition is fulfilled if the power spectrum of the incoming signal is an even function of the frequency and if the bandpass filters observe the relationship \(|H_p(f)|=|H_n(-f)|\), as shown in Figure 16(b).

In order to observe the second condition, a reasonable possibility would be to have the passbands of the bandpass filters in the range around the slopes of the incoming signal (if centered at zero frequency).

In the case of a frequency offset of \(\Delta f\), the differential power is then a measure of this frequency offset (Figure 16(c)).

However, a pure \(\text{DPM}\)-\(\text{FED}\) only yields on average an output signal \(u(t)\), which is related to the frequency offset.

The desired error signal is disturbed by a pattern-dependent term. This produces a pattern-dependent frequency jitter in the \(\text{AFC}\) loop. A low-bandwidth loop filter can suppress such a pattern-dependent jitter; it can also be suppressed by properly designing the transfer functions of \(H_p(f)\) and \(H_n(f)\).

When \(\Delta f=0\), the pattern-dependent jitter can be completely suppressed by using optimal design of \(H_p(f)\) and \(H_n(f)\), realizing jitter-free stable operation.

For a real-value modulated signal, in order to realize jitter-free operation, \(H_p(f)\) and \(H_n(f)\) should satisfy the following condition:

\[\tag{36}H_p(f)=H^*_n(-f)\]

For a complex-value modulated signal such as \(\text{QAM}\), however, the optimal \(H_p(f)\) and \(H_n(f)\) depend on the channel transfer function (including both transmitter and receiver filters).

Let \(G(f)\) denote the channel transfer function, then \(H_p(f)\) and \(H_n(f)\) should satisfy the following criteria to ensure jitter-free operation:

\[\tag{37}G(f-f_s)H_p(f-f_s)=G(f-f_s)H_n(f-f_s)\]

where \(f_s=1/(2T)\) and \(T\) denotes the symbol period.

For high-speed long-haul coherent optical systems that employ frequency domain-based equalization for chromatic dispersion \(\text{(CD)}\) compensation, the \(\text{DPM}\)-based \(\text{FED}\) can be easily implemented in the frequency domain by employing a hardware-efficient fast Fourier transform \(\text{(FFT)}\), as has been demonstrated.

Since only coarse \(\text{AFC}\) control is required, there is no need for very large \(\text{FFT}\) size, implying the \(\text{FFT}\) used for \(\text{CD}\) compensation can be reused for \(\text{FED}\).
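A frequency-domain version of the \(\text{DPM}\)-\(\text{FED}\) can be sketched by comparing the power in two mirror-image groups of transform bins. The example below is a toy model under stated assumptions: ideal brick-wall bin selection stands in for \(H_p(f)\) and \(H_n(f)\), a direct DFT over the selected bins replaces a real \(\text{FFT}\), and the test signal is noiseless \(\text{QPSK}\) with rectangular pulses at two samples per symbol. The function names, the block length of 64, and the bin range 12 to 20 are hypothetical.

```python
import cmath
import math
import random


def qpsk_samples(n_symbols, fo, sps=2, seed=1):
    """Random QPSK, rectangular pulses; fo is in cycles per sample."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_symbols):
        sym = complex(rng.choice((-1, 1)), rng.choice((-1, 1)))
        for _ in range(sps):
            k = len(out)
            out.append(sym * cmath.exp(2j * math.pi * fo * k))
    return out


def band_power(block, bins):
    """Power collected in the given DFT bins (negative bins allowed)."""
    n = len(block)
    total = 0.0
    for b in bins:
        acc = sum(x * cmath.exp(-2j * math.pi * b * k / n)
                  for k, x in enumerate(block))
        total += abs(acc) ** 2
    return total / n


def dpm_error(samples, block_len=64, lo_bin=12, hi_bin=20):
    """Average (positive-band power minus negative-band power) over blocks."""
    pos_bins = range(lo_bin, hi_bin + 1)    # passband on the positive slope
    neg_bins = range(-hi_bin, -lo_bin + 1)  # mirror image on the negative side
    n_blocks = len(samples) // block_len
    u = 0.0
    for i in range(n_blocks):
        block = samples[i * block_len:(i + 1) * block_len]
        u += band_power(block, pos_bins) - band_power(block, neg_bins)
    return u / n_blocks
```

The two bin groups are placed around the slopes of the signal spectrum and mirror each other, so the averaged difference changes sign with the direction of the spectral shift.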

### Mth-Power-Based Fine FO Estimation Algorithms

As described in our tutorials, the \(M\)th-power algorithm can be used to erase the data modulation for \(M\)-ary \(\text{PSK}\) signals.

After erasing the data modulation, the frequency offset \(\text{(FO)}\) can be estimated using either a time-domain-based differential phase method or some \(\text{FFT}\)-based algorithms.

### Time-Domain Differential Phase-Based Algorithm

The principle of the time-domain differential phase-based \(\text{FO}\) estimation method is illustrated in Figure 17, in which \(\text{QPSK}\) with the fourth-power algorithm is used as an example.

For this method, the \(\text{FO}\) is extracted from the average phase increment between two consecutive data-erased symbols. In order to get a reliable \(\text{FO}\) estimate at the low \(\text{OSNR}\) region, a long averaging window of thousands of symbols is typically required even for \(\text{QPSK}\).

Many more symbols are required for higher-order modulation formats.
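As a minimal sketch of the differential-phase idea, the function below erases the modulation with the \(M\)th power, correlates consecutive data-erased symbols, and converts the resulting angle into a frequency. Averaging the complex products before taking the angle is one common variant and is an assumption here, as are the name `fo_estimate_diff_phase` and the one-sample-per-symbol input. The estimate is only unambiguous for \(|\text{FO}|\) below the symbol rate divided by \(2M\).

```python
import cmath
import math


def fo_estimate_diff_phase(symbols, symbol_rate, m=4):
    """Frequency offset (Hz) from the mean phase step of M-th power symbols."""
    acc = 0j
    for prev, cur in zip(symbols, symbols[1:]):
        # The product rotates by M * 2*pi*FO / symbol_rate per symbol.
        acc += (cur ** m) * (prev ** m).conjugate()
    return cmath.phase(acc) * symbol_rate / (2 * math.pi * m)
```

In the presence of noise, the length of `symbols` plays the role of the averaging window discussed above.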

### FFT-Based FO Estimation Algorithm

The \(\text{FFT}\)-based method can also be used to extract the \(\text{FO}\) from the data-erased signals, because the phase angle of a data-erased signal will exhibit an \(\text{FFT}\) peak at \(M\) times the \(\text{FO}\).

\(\text{FFT}\)-based methods can achieve better \(\text{FO}\) estimation accuracy than the time-domain-based method (by using the same number of symbols) but the implementation complexity is much higher, especially for higher-order \(\text{QAMs}\), where tens of thousands of symbols may have to be used for a reliable and accurate \(\text{FO}\) estimate.

Furthermore, a single \(\text{FFT}\) operation can only determine the magnitude of the \(\text{FO}\). In order to get the sign of the \(\text{FO}\), an additional \(\text{FFT}\) may have to be used.
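The magnitude-only behavior can be reproduced with a small numerical sketch. The phase angle of the \(M\)th-power signal is a real-valued sawtooth whose fundamental sits at \(M\) times the \(\text{FO}\), so its spectrum is symmetric and a positive and a negative offset give the same peak. The code below uses a direct DFT over the positive-frequency bins instead of an optimized \(\text{FFT}\), and the name `fo_magnitude_fft` is hypothetical.

```python
import cmath
import math


def fo_magnitude_fft(symbols, symbol_rate, m=4):
    """|FO| in Hz from the spectral peak of the M-th power phase angle."""
    n = len(symbols)
    angles = [cmath.phase(s ** m) for s in symbols]  # wrapped to (-pi, pi]
    mean = sum(angles) / n                           # remove the DC component
    best_bin, best_mag = 0, -1.0
    for b in range(1, n // 2 + 1):                   # positive frequencies only
        acc = sum((a - mean) * cmath.exp(-2j * math.pi * b * k / n)
                  for k, a in enumerate(angles))
        if abs(acc) > best_mag:
            best_bin, best_mag = b, abs(acc)
    # The peak sits at M times the offset; bin spacing is symbol_rate / n.
    return best_bin * symbol_rate / (n * m)
```

The resolution of this estimate is limited by the bin spacing divided by \(M\), which is one reason many symbols are needed for an accurate result.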

Some efforts have been made to simplify the \(\text{FFT}\)-based \(\text{FO}\) estimator for high-order \(\text{QAMs}\).

For a square \(\text{QAM}\), it is shown that, by using only the outermost four constellation points combined with the use of linear interpolation and down-sampling-based methods, the implementation complexity can be greatly reduced. Figure 18 illustrates the schematics of this method.

First, the received one-sample-per-symbol signal is preprocessed with constellation classification, previous neighbor interpolation, and down-sampling. The \(M\)th-power algorithm is then used to erase the data modulation.

Finally, the frequency magnitude and sign are detected by using two concurrent \(\text{FFTs}\) with a modified \(\text{FFT}\).

As an example, Figure 19 shows the classification applied to a square \(\text{64QAM}\), in which a ring-based classification method is employed.

Symbols are classified as Class \(\text I\) if their magnitude is closest to a Class \(\text I\) ring, and Class \(\text{II}\) if otherwise.

The Class \(\text I\) points are identified by four rings, which exactly intersect the transmitted symbols that lie on a perfect diagonal; these symbols can be derotated using the \(M\)th power algorithm.

Figure 20 shows the \(\text{FFT}\) of rotated symbol angles for each viable ring. Although all show a peak at four times the frequency offset, the outermost ring shows the best peak-to-noise ratio, and is most robust to high noise.

## 7. Blind Frequency Search (BFS)-Based Fine FO Estimation Algorithm

The \(\text{BFS}\) algorithm was proposed as a universal carrier \(\text{FO}\) estimation method, where the minimum mean square distance error (\(\text{MSDE}\), in terms of phase or Euclidean distance) is used as the frequency-selection criterion.

For this method, the frequency offset is first scanned at a coarse step size (\(\sim10\;\text{MHz}\)) and then at a fine step size (\(\sim1\;\text{MHz}\)), and the optimal frequency offset is the one that gives the minimum \(\text{MSDE}\) (see Figure 21).

For each trial frequency, the carrier phase is first recovered (with best efforts) by using the \(\text{BPS}\) algorithm, and decisions made following this phase estimation are then approximated as the reference/correct signals for \(\text{MSDE}\) calculation.
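The coarse-then-fine scan can be sketched as below. This is a simplified serial model under several assumptions: \(\text{QPSK}\) is used so that the decision rule is short, a single \(\text{BPS}\) phase is estimated per block, and the names `msde_after_bps`, `scan`, and `blind_frequency_search` as well as the default search span of \(\pm100\;\text{MHz}\) and the 64 trial phases are hypothetical.

```python
import cmath
import math


def decide_qpsk(s):
    """Nearest point of the (unnormalized) QPSK constellation {+-1 +- 1j}."""
    return complex(1.0 if s.real >= 0 else -1.0, 1.0 if s.imag >= 0 else -1.0)


def msde_after_bps(symbols, n_phases=64):
    """Best-effort BPS over the block; return the smallest mean squared distance."""
    best = float("inf")
    for m in range(n_phases):
        rot = cmath.exp(-1j * (m / n_phases) * (math.pi / 2))  # trial phase
        cost = 0.0
        for s in symbols:
            r = s * rot
            cost += abs(r - decide_qpsk(r)) ** 2
        best = min(best, cost / len(symbols))
    return best


def scan(symbols, symbol_rate, center, half_width, step):
    """Try FO candidates around `center`; keep the one with minimum MSDE."""
    n_steps = int(round(half_width / step))
    best_fo, best_cost = center, float("inf")
    for i in range(-n_steps, n_steps + 1):
        fo = center + i * step
        derot = [s * cmath.exp(-2j * math.pi * fo * k / symbol_rate)
                 for k, s in enumerate(symbols)]
        cost = msde_after_bps(derot)
        if cost < best_cost:
            best_fo, best_cost = fo, cost
    return best_fo


def blind_frequency_search(symbols, symbol_rate, span=100e6,
                           coarse_step=10e6, fine_step=1e6):
    """Coarse scan over +-span, then a fine scan around the coarse winner."""
    coarse = scan(symbols, symbol_rate, 0.0, span, coarse_step)
    return scan(symbols, symbol_rate, coarse, coarse_step, fine_step)
```

Each trial frequency is independent of the others, so the loop in `scan` is the part that a parallel hardware architecture would unroll.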

Figure 22 shows the simulated results for a \(38\;\text{Gbaud}\) \(\text{64QAM}\) system with laser \(\text{linewidth}=100\;\text{kHz}\) and \(\text{OSNR}=25\;\text{dB}\).

Figure 22(a) shows how the normalized \(\text{MSDE}\) varies with the frequency deviation (the difference between the trial \(\text{FO}\) and the actual \(\text{FO}\)) and the number of symbols used for \(\text{FO}\) estimation, while Figure 22(b) shows the frequency error distributions by using several different data block lengths.

Here, the frequency error is defined as the difference between the estimated \(\text{FO}\) and the actual \(\text{FO}\). \(\text{BFS}\) can reliably estimate \(\text{FO}\) to be within \(20\;\text{MHz}\) accuracy by using only \(32\) symbols.

Increasing the number of symbols to \(128\) can improve the \(\text{FO}\) estimation accuracy to within \(3\;\text{MHz}\), which is good enough even for \(\text{64QAM}\).

As compared with the previous \(M\)th-power-based algorithms, \(\text{BFS}\) requires far fewer symbols for a reliable \(\text{FO}\) estimate.

Thus, this algorithm can achieve very fast carrier frequency recovery if it is implemented with a parallel processing architecture as is shown in Figure 21.

For a typical coherent receiver where the carrier frequency varies much more slowly than the symbol rate, \(\text{BFS}\) can also be implemented with a sequential or partially sequential processing architecture (such as the \(\text{CPU}\) in a computer) to reduce the implementation complexity.

The robustness of this algorithm has been verified by multiple \(100\) and \(400\;\text{Gb/s}\) transmission experiments employing high-order \(\text{QAMs}\).

Just recently it has been shown that, in addition to \(\text{MSDE}\), phase entropy can also be used as a reliable frequency-selection criterion for the \(\text{BFS}\) algorithm.

But using the phase entropy as the frequency selection criterion requires more samples for each \(\text{FO}\) estimation than the \(\text{MSDE}\)-based solution.

## 8. Training-Initiated Fine FO Estimation Algorithm

The training-initiated fine \(\text{FO}\) estimation algorithm also works for arbitrary \(\text{QAMs}\).

For this method, the initial \(\text{FO}\) is estimated by using a starting training sequence in a feedforward manner and then the \(\text{FO}\) variation is tracked through a feedback configuration using the recovered carrier phases from the following phase recovery stage as is shown in Figure 23.

Note that, unlike the fast-changing phase noise that cannot tolerate extended feedback delay in high-speed optical systems, carrier frequency typically varies much more slowly and thus it could be tracked by using a feedback-based architecture. The advantages of this method are as follows: \(\text{(i)}\) it is applicable to arbitrary \(\text{QAM}\), \(\text{(ii)}\) its implementation complexity is very low because it requires significantly fewer complex multiplications than algorithms described previously, and \(\text{(iii)}\) the tolerable frequency offset can be very large (up to the symbol rate).

As compared with the frequency recovery method based on the second-order \(\text{PLL}\), where both the frequency and phase recovery rely on a feedback mechanism, here only the \(\text{FO}\) tracking uses a feedback configuration while the phase recovery is achieved in a feedforward manner.
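One possible reading of this split between feedforward acquisition and feedback tracking is sketched below. The details are assumptions made for illustration and are not taken from the demonstrated design: the initial estimate uses the phase step between consecutive modulation-stripped training symbols, and the tracker nudges the estimate using the change in recovered carrier phase between two consecutive blocks. The names `initial_fo`, `track_fo`, and the loop gain of 0.1 are hypothetical.

```python
import cmath
import math


def initial_fo(received, training, symbol_rate):
    """Feedforward FO estimate (Hz) from a known starting training sequence."""
    # Strip the known modulation so only the carrier rotation remains.
    z = [r * t.conjugate() for r, t in zip(received, training)]
    acc = sum(cur * prev.conjugate() for prev, cur in zip(z, z[1:]))
    return cmath.phase(acc) * symbol_rate / (2 * math.pi)


def track_fo(fo, phase_prev, phase_cur, block_len, symbol_rate, gain=0.1):
    """Feedback update from the recovered phases of two consecutive blocks."""
    # Wrap the phase difference into (-pi, pi].
    dphi = cmath.phase(cmath.exp(1j * (phase_cur - phase_prev)))
    # A residual offset makes the recovered phase drift from block to block.
    residual = dphi * symbol_rate / (2 * math.pi * block_len)
    return fo + gain * residual
```

Since no \(M\)th-power operation is involved, the same two functions apply to any constellation, which reflects advantage (i) above. The small loop gain reflects the slow variation of the carrier frequency.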

In summary, a coarse \(\text{AFC}\) is typically required before the nominal fine frequency recovery. For coarse \(\text{AFC}\), the key component is the \(\text{FED}\), which can be either a balanced quadricorrelator or an \(\text{FFT}\)-based differential power monitor \(\text{(DPM)}\).

For fine frequency recovery, the \(M\)th-power-based time-domain differential phase algorithm is a hardware-efficient blind-\(\text{FO}\) estimation method for \(M\)-ary \(\text{PSK}\) signals operating at high \(\text{OSNR}\) with relatively small phase noise.

For more general \(\text{QAM}\)-modulated signals, however, the training-initiated feedback-based method presented above can achieve much more reliable performance with even lower implementation complexity.

The \(\text{BFS}\) method presented in this tutorial has the potential to achieve much faster \(\text{FO}\) estimation at the expense of higher implementation complexity.

## 9. EQUALIZER-PHASE NOISE INTERACTION AND ITS MITIGATION

The carrier recovery algorithms described so far are mostly optimized based on the additive Gaussian noise assumption.

However, in a realistic long-haul coherent transmission system without using inline optical dispersion compensation, a long-memory equalizer usually has to be employed before the carrier recovery to compensate for the accumulated fiber chromatic dispersion \(\text{(CD)}\).

For future few-mode-fiber-based space-division multiplexing \(\text{(SDM)}\) systems, long-memory multi-input multi-output \(\text{(MIMO)}\) equalizers are needed for intermodal dispersion compensation.

As described later, the interaction between the long-memory equalizers and the laser phase noise will not only enhance the phase noise, but will also cause additional amplitude distortions.

In particular, it is found that impairments due to equalizer and phase noise interaction \(\text{(EPNI)}\) increase with the signal baud rate and can be a significant problem for future \(400\;\text{Gb/s}\) and beyond systems operating at very high symbol rates. Designing a carrier recovery scheme capable of mitigating impairments due to \(\text{EPNI}\) is therefore becoming more and more important.

For the case with additive Gaussian noise, phase noises from the signal source and the \(\text{LO}\) have essentially similar system impact, which can be minimized by using a fast single-tap phase rotation equalizer (i.e., the nominal phase recovery circuit), as is shown in Figure 24(a).

For a typical coherent system using a long-memory receiver-side equalizer for \(\text{CD}\) compensation, however, the impacts of signal and \(\text{LO}\) phase noises become quite different, because the signal source experiences both positive \(\text{CD}\) from the fiber and negative \(\text{CD}\) from the digital equalizer, but the \(\text{LO}\) only sees negative \(\text{CD}\) from the digital equalizer.

Due to the fact that \(\text{CD}\) can convert phase noise into amplitude noise (and may also enhance phase noise), the impact of \(\text{LO}\) phase noise becomes more severe than the signal source phase noise, and the additional amplitude distortion caused by a long-memory equalizer cannot be mitigated by using conventional phase recovery algorithms.

To address this problem, a hardware-based laser phase noise compensation method has recently been proposed. However, this method is very complex and costly, because it requires an additional coherent receiver to measure the laser phase noise.

A \(\text{DSP}\)-based solution is proposed to address this challenge. The basic idea is based on the following observations: if the laser linewidth is small and the signal symbol rate is high, the amplitude and phase distortions caused by \(\text{EPNI}\) will be highly correlated over quite a few symbols (tens to hundreds of symbols depending on the laser linewidth and the symbol rate). Moreover, such distortions can be modeled as the result of a time-varying multitap linear filtering effect in which, although the filter coefficients vary over time, the coefficients can be well approximated as constants over every limited time block (consisting of tens to hundreds of symbols).

Thus, by replacing the commonly used fast-tracking single-tap phase rotation equalizer with a fast-tracking multitap linear equalizer, as is shown in Figure 24(b), both the amplitude and phase distortion caused by \(\text{EPNI}\) can be mitigated.

Because laser phase noise typically varies 2–4 orders of magnitude faster than the state of polarization change, the adaptation rate for the proposed EPNI mitigation equalizer should be much faster than the regular \(2\times2\) polarization-compensating equalizer.

Very fast adaptation rate can be realized by using a block-by-block feedforward adaptation algorithm as is illustrated in Figure 25. For this algorithm, the frequency-recovered signal is first divided into blocks (with a few overlap symbols included between blocks), and the regular phase recovery method is applied to each block. After that, one can make an initial decision from the phase-recovered signal.

The decided-upon signal is then approximated as the correct data for \(\text{EPNI}\) filter coefficients estimation through the classic least square \(\text{(LS)}\) algorithm. To reduce the impact of imperfect decision accuracy, multiple iterations may be applied to each data block for filter coefficients update. Note that the initial decision can be made based on performing pure phase recovery over the current data block.
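The per-block least-squares step can be illustrated with a small solver. The sketch below assumes \(T\)-spaced samples, treats the decided symbols as the target, and solves the normal equations with a plain Gaussian elimination; the names `solve_linear`, `ls_taps`, and `apply_taps` and the centered tap indexing are choices made for this example rather than details of the proposed algorithm.

```python
def solve_linear(a, b):
    """Solve a*x = b for complex a (n x n) by elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def ls_taps(x, d, n_taps=5):
    """Least-squares taps w minimizing sum |d[n] - sum_j w[j]*x[n+half-j]|^2.

    x : received (phase-recovered) samples of one block
    d : decided symbols of the same block, used as the reference
    """
    half = n_taps // 2
    valid = range(half, len(x) - half)
    rows = [[x[n + half - j] for j in range(n_taps)] for n in valid]
    targets = [d[n] for n in valid]
    # Normal equations: (A^H A) w = A^H d
    r = [[sum(row[i].conjugate() * row[j] for row in rows)
          for j in range(n_taps)] for i in range(n_taps)]
    p = [sum(row[i].conjugate() * t for row, t in zip(rows, targets))
         for i in range(n_taps)]
    return solve_linear(r, p)


def apply_taps(x, w):
    """Filter a block with the estimated taps (edge samples are dropped)."""
    half = len(w) // 2
    return [sum(w[j] * x[n + half - j] for j in range(len(w)))
            for n in range(half, len(x) - half)]
```

Iterating, as described above, would mean deciding again on the output of `apply_taps` and calling `ls_taps` once more with the updated decisions.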

But the initial decision may also be made by applying the recovered phase of the prior data block to the current data block or by directly applying the \(\text{EPNI}\) filter coefficients acquired from the prior data block to the current data block (the starting phase or \(\text{EPNI}\) filter coefficients can be obtained using a starting training sequence). Because the block length cannot be too large due to the need for rapid adaptation and the time-varying nature of \(\text{EPNI}\), the accumulated amplifier noise may degrade the performance of the proposed \(\text{EPNI}\) equalizer.

This problem may be alleviated by joint optimization of the proposed \(\text{EPNI}\) filter over both polarizations, because the phase noises in the \(X\)- and \(Y\)-polarizations are usually correlated (if they are from the same source).

This new impairment mitigation method has been numerically verified in a 7-channel \(50\;\text{GHz}\)-spaced \(49\;\text{Gbaud}\) \(\text{PDM}\)-\(\text{16QAM}\) system.

The transmission link consists of 20 total spans with each span composed of \(100\;\text{km}\) of large area fiber (dispersion coefficient and fiber loss are assumed to be \(21\;\text{ps/nm/km}\) and \(0.18\;\text{dB/km}\), respectively) and \(\text{EDFA}\)-only amplification (noise figure is assumed to be \(5\;\text{dB}\)).

No inline optical dispersion compensation is used for this simulation. For simplicity, polarization-mode dispersion \(\text{(PMD)}\) and polarization-dependent loss are not considered. For the laser sources, we assume that the signal source and the \(\text{LO}\) have identical linewidth.

Figure 26 shows the \(\text{BER}\) performance of the middle channel (channel 4) versus the laser linewidth at the optimal signal launch power of \(3\;\text{dBm/channel}\).

The diamond-shaped symbols give the results using a conventional coherent receiver, where the phase is tracked by using the previously described training-assisted two-stage phase recovery algorithm. (The phase is first estimated by using three training symbols that are uniformly distributed over every 128 symbols, and then the 128 symbols are divided into four groups and the phase over each group is refined by using the \(\text{BPS}\) algorithm over a small phase-varying region.)
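The two-stage procedure in the parenthetical note can be sketched as follows. This is only an illustration under stated assumptions: the training positions (21, 64, 107), the search span of ±0.3 rad, the nine test phases, and all function names are hypothetical, and the test signal is noise-free with a constant phase.

```python
import cmath
import math
import random

LEVELS = (-3, -1, 1, 3)  # 16QAM levels per quadrature

def decide_16qam(z):
    """Hard decision: nearest 16QAM point."""
    nearest = lambda v: min(LEVELS, key=lambda level: abs(v - level))
    return complex(nearest(z.real), nearest(z.imag))

def training_phase(block, training):
    """Stage 1: coarse phase from known symbols (training maps index -> symbol)."""
    acc = sum(block[i] * known.conjugate() for i, known in training.items())
    return cmath.phase(acc)

def bps_refine(group, center, span=0.3, tests=9):
    """Stage 2: blind phase search restricted to [center - span, center + span]."""
    best_phase, best_cost = center, math.inf
    for b in range(tests):
        phase = center - span + 2 * span * b / (tests - 1)
        derot = cmath.exp(-1j * phase)
        # Cost: summed squared distance to the nearest constellation point.
        cost = sum(abs(r * derot - decide_16qam(r * derot)) ** 2 for r in group)
        if cost < best_cost:
            best_phase, best_cost = phase, cost
    return best_phase

def two_stage_phase(block, training, groups=4):
    """Return one refined phase per group of the block."""
    coarse = training_phase(block, training)
    size = len(block) // groups
    return [bps_refine(block[g * size:(g + 1) * size], coarse)
            for g in range(groups)]

random.seed(2)
sent = [complex(random.choice(LEVELS), random.choice(LEVELS)) for _ in range(128)]
true_phase = 0.5
block = [s * cmath.exp(1j * true_phase) for s in sent]
training = {i: sent[i] for i in (21, 64, 107)}  # assumed training positions
phases = two_stage_phase(block, training)
```

Because the search is confined to a small region around the training-based estimate, the fourfold rotational ambiguity of the square \(\text{QAM}\) constellation does not enter the refinement stage.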

The square-shaped symbols illustrate the result using the proposed \(\text{EPNI}\) mitigation method, in which the \(\text{EPNI}\) filter length is chosen to be five \(T\)-spaced taps.

For this study, the block length is chosen to be 95 symbols (including 5 overlap symbols), and two iterations are applied for each data block, in which the initial decision for each data block is made based on the same phase recovery algorithm used for the conventional coherent receiver (i.e., for the diamond-shaped symbols).

**TABLE 3. Simulated \(\textbf{BER}\) versus number of iterations (laser linewidth = 0.8 \(\textbf{MHz}\)) when using the proposed \(\textbf{EPNI}\) mitigation algorithm**

The effectiveness of this new algorithm is evident from Figure 26. For a laser linewidth of 0.8 \(\text{MHz}\) (a typical linewidth for widely used \(\text{DFB}\) lasers), the new method improves the \(Q\) performance by 1.35 \(\text{dB}\) while using only two iterations.

As we further increase the number of iterations, the performance gain is small, as can be seen from Table 3. It is interesting to note that, even without laser phase noise, the introduction of a five-tap \(\text{EPNI}\) equalizer improves the performance by 0.22 \(\text{dB}\), indicating that the proposed method also helps in mitigating fiber nonlinear effects.

To confirm this, we also simulated the results by switching off the fiber nonlinearity and found that use of the \(\text{EPNI}\) mitigation equalizer does not improve the performance. In fact, when the fiber nonlinearity is switched off, the use of the \(\text{EPNI}\) mitigation equalizer slightly degrades the performance (\(\text{BER}=2\times10^{-5}\) versus \(\text{BER}=1.8\times10^{-5}\) when using the normal phase recovery algorithm).
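To relate the two \(\text{BER}\) values quoted above to the \(Q\) figures used elsewhere in this section, the following sketch applies the conventional Gaussian-noise conversion \(\text{BER}=\tfrac{1}{2}\,\text{erfc}(Q/\sqrt{2})\) and expresses \(Q\) in dB as \(20\log_{10}Q\). That the tutorial uses exactly this definition is an assumption, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def ber_to_q_db(ber):
    """Q factor in dB for a given BER, assuming BER = 0.5 * erfc(Q / sqrt(2))."""
    q = -NormalDist().inv_cdf(ber)  # BER = Phi(-Q) for Gaussian noise
    return 20 * math.log10(q)

# Penalty of the EPNI equalizer with fiber nonlinearity switched off
# (BER 2e-5 with the equalizer versus 1.8e-5 with normal phase recovery).
penalty_db = ber_to_q_db(1.8e-5) - ber_to_q_db(2e-5)
```

Under this definition the difference between the two \(\text{BER}\) values corresponds to a penalty well below 0.1 \(\text{dB}\), consistent with the description of the degradation as slight.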

A similar result is observed in a back-to-back simulation, as shown in Figure 27. One can see that \(\text{(i)}\) when using the conventional one-tap filter-based phase recovery, increasing the laser linewidth from 0 to 0.8 \(\text{MHz}\) results in only a small performance degradation when there is no \(\text{EPNI}\) effect; and \(\text{(ii)}\) the use of the \(\text{EPNI}\) mitigation filter does not improve the performance when there are no \(\text{EPNI}\) effects.

The small performance degradation caused by the \(\text{EPNI}\) mitigation filter when there are no \(\text{EPNI}\) effects probably occurs because the block length of 95 symbols is not long enough for optimal estimates of the five complex \(\text{EPNI}\) filter coefficients. (Note that the normal phase recovery only needs to estimate one real filter coefficient, the phase angle.)

## 10. CARRIER RECOVERY IN COHERENT OFDM SYSTEMS

Unlike single-carrier modulated systems in which the carrier phase noise only causes the constellation rotation of \(\text{QAM}\) symbols (assuming linear systems with negligible channel memory), carrier phase noise not only changes the common phase of \(\text{OFDM}\) subcarriers but also causes interference between subcarriers.
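A short derivation makes the two effects explicit. The notation is introduced here for illustration only: \(X_k\) denotes the symbol on subcarrier \(k\), \(N\) the number of subcarriers, and \(\phi_n\) the carrier phase at time sample \(n\). Noise and channel distortion are ignored. With \(x_n=\frac{1}{N}\sum_{m}X_m e^{j2\pi mn/N}\) transmitted and \(y_n=x_n e^{j\phi_n}\) received, the demodulated subcarrier is

\[
R_k=\sum_{n=0}^{N-1} y_n e^{-j2\pi kn/N}
   = I_0\,X_k+\sum_{m\neq k} I_{k-m}\,X_m,
\qquad
I_p=\frac{1}{N}\sum_{n=0}^{N-1} e^{j\phi_n}e^{-j2\pi pn/N}.
\]

The factor \(I_0\) is the same for every subcarrier and represents the common phase change, while the sum over \(m\neq k\) is the interference between subcarriers. If \(\phi_n\) is constant over the frame, then \(I_0=e^{j\phi}\) and \(I_p=0\) for \(p\neq 0\), so only the common phase rotation remains.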

The phase-noise-induced common phase change of \(\text{OFDM}\) subcarriers can be estimated by using phase estimation methods similar to those described earlier, which were originally developed for single-carrier modulated systems.

For example, in single-carrier modulated systems, the \(M\)th-power-based algorithms, the \(\text{BPS}\) algorithm as well as the \(\text{ML}\)-based algorithm are applied to the symbols in the time domain.

In coherent \(\text{OFDM}\) systems, these algorithms can be applied to subcarriers in the frequency domain within each \(\text{OFDM}\) frame as is illustrated in Figure 28, in which the \(\text{ML}\) algorithm is used as an example.
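As a rough sketch of the frequency-domain application, the following code estimates the common phase of one \(\text{OFDM}\) frame from its subcarriers. It assumes the decision-directed form of the \(\text{ML}\) estimate, \(\hat{\phi}=\arg\sum_k R_k D_k^{*}\), with QPSK subcarriers and an initial phase guess used only to make the decisions \(D_k\). The function names, the QPSK choice, and the 64-subcarrier frame are assumptions for illustration.

```python
import cmath
import random

def decide_qpsk(z):
    """Hard decision onto the QPSK points (+-1 +-1j)."""
    return complex(1 if z.real >= 0 else -1, 1 if z.imag >= 0 else -1)

def frame_common_phase(subcarriers, initial_phase=0.0):
    """Common phase of one OFDM frame, estimated across its subcarriers."""
    derot = cmath.exp(-1j * initial_phase)
    decisions = [decide_qpsk(r * derot) for r in subcarriers]
    acc = sum(r * d.conjugate() for r, d in zip(subcarriers, decisions))
    return cmath.phase(acc)

random.seed(5)
frame = [complex(random.choice((-1, 1)), random.choice((-1, 1))) for _ in range(64)]
common = 0.3  # common phase rotation in radians (test value)
rx_frame = [x * cmath.exp(1j * common) for x in frame]
estimate = frame_common_phase(rx_frame)
```

The only change from the single-carrier case is that the sum runs over subcarriers of one frame instead of over consecutive time-domain symbols.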

\(\text{OFDM}\) systems are very sensitive to carrier frequency offset: the residual frequency offset after estimation must be much smaller than the subcarrier spacing. Training- or pilot-tone-based frequency estimation methods are therefore typically used for fine frequency offset estimation.
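One widely used training-based approach, shown here only as an illustration and not necessarily the method the tutorial has in mind, sends a training block whose two halves are identical and reads the frequency offset from the phase of the correlation between the halves. The function name, the 64-sample half length, and the test offset are hypothetical.

```python
import cmath
import math
import random

def repeated_training_fo(samples, half_len):
    """Frequency offset (cycles per sample) from a training block with two identical halves."""
    acc = sum(samples[n].conjugate() * samples[n + half_len] for n in range(half_len))
    return cmath.phase(acc) / (2 * math.pi * half_len)

random.seed(3)
half = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(64)]
training = half + half
offset = 0.003  # cycles per sample (assumed test value)
rx = [s * cmath.exp(2j * math.pi * offset * n) for n, s in enumerate(training)]
estimate = repeated_training_fo(rx, 64)

subcarrier_spacing = 1 / 64  # cycles per sample for a 64-point FFT (assumed)
relative_offset = estimate / subcarrier_spacing
```

With the half length equal to the FFT size, the unambiguous range of this estimator is half a subcarrier spacing on either side of zero, so a coarse frequency lock is still needed beforehand.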

The blind \(\text{AFC}\) algorithms described earlier are also applicable to \(\text{OFDM}\) systems.

Phase-noise-induced intercarrier interference \(\text{(ICI)}\) is a significant problem for coherent \(\text{OFDM}\) systems.

Effective mitigation of this problem essentially requires us to estimate the carrier phase in the time domain on a sample-by-sample basis, which is more difficult than in a single-carrier modulated system because, in the time domain, an \(\text{OFDM}\) signal behaves like Gaussian noise.

A common method used to address this problem is to employ ultra-low phase noise lasers such that the carrier phase remains approximately constant over the whole \(\text{OFDM}\) frame, requiring only the subcarrier common phase to be estimated. Alternatively, several recent studies have shown that an \(\text{RF}\)-pilot tone-based method is effective in mitigating phase noise-induced \(\text{ICI}\).

## 11. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

In this tutorial, we first introduce the concept of carrier recovery and the challenges faced. We show that, due to differing carrier frequency and phase noise characteristics, an independent and slower carrier frequency recovery unit is typically required before the much faster carrier phase recovery unit.

Two important theoretical concepts regarding optimal frequency and phase estimation are introduced: \(\text{MAP}\) estimator and the Cramér–Rao lower bound.

Although the \(\text{MAP}\) estimator is too complex to be implemented in an actual system, it can be implemented with offline processing to establish a baseline for evaluating the efficiency of suboptimal estimators.

The Cramér–Rao lower bound provides us with another and more fundamental way to evaluate the efficiency of any estimator.

The HARDWARE-EFFICIENT PHASE RECOVERY ALGORITHMS section is devoted to several hardware-efficient phase estimation algorithms recently demonstrated for high-speed coherent optical systems.

These include the decision-directed \(\text{PLL}\) \(\text{(DD-PLL)}\), the \(M\)th-power-based algorithms, the \(\text{BPS}\) algorithm, as well as several multistage hybrid phase estimation algorithms. Among these algorithms, the \(\text{DD}\)-\(\text{PLL}\) is the most hardware-efficient, but its linewidth tolerance is fundamentally limited by the inherent feedback delay.

The \(M\)th-power algorithm works well for \(M\)-ary PSKs but is much less efficient for higher-order \(\text{QAMs}\). The \(\text{BPS}\) algorithm can achieve close-to-optimal phase noise tolerance for arbitrary \(\text{QAM}\), but implementation complexity increases with modulation order.
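For reference, the \(M\)th-power idea for QPSK (\(M=4\)) can be sketched in a few lines. The function name and the noise-free test signal are assumptions for illustration. Raising each symbol to the fourth power removes the QPSK modulation, because every point of the form \(\pm1\pm j\) satisfies \(z^4=-4\).

```python
import cmath
import random

def fourth_power_phase(symbols):
    """Carrier phase for QPSK points (+-1 +-1j); valid for |phase| < pi/4."""
    acc = sum(r ** 4 for r in symbols)
    # (1+1j)**4 == -4, so flip the sign before taking the angle.
    return cmath.phase(-acc) / 4

random.seed(4)
qpsk = [complex(random.choice((-1, 1)), random.choice((-1, 1))) for _ in range(64)]
true_phase = 0.2
rx = [s * cmath.exp(1j * true_phase) for s in qpsk]
estimate = fourth_power_phase(rx)
```

For 16QAM the same operation does not collapse all points to one phase: for example, \((3+j)^4=28+96j\) while \((1+j)^4=-4\). The leftover modulation acts as pattern-dependent noise, which is one way to see why the method is less efficient for higher-order \(\text{QAM}\).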

The recently proposed multistage algorithms such as the hybrid \(\text{BPS/ML}\) and the training-assisted two-stage \(\text{ML/BPS}\) can achieve a linewidth tolerance performance similar to the single-stage BPS but with significantly reduced implementation complexity.

Furthermore, the training-assisted two-stage \(\text{ML/BPS}\) algorithm is very robust against detrimental cycle slips, and may remove the need for differential encoding/decoding.

The HARDWARE-EFFICIENT FREQUENCY RECOVERY ALGORITHMS section is devoted to frequency recovery algorithms. We first introduce two auto-\(\text{AFC}\) techniques developed for coarse-locking of the \(\text{LO}\) frequency to the vicinity of the transmitter source laser.

Then, we describe three types of fine \(\text{FO}\) estimation algorithms.

• The first employs the \(M\)th-power algorithm to remove data modulation, and the frequency offset is estimated by using either a time-domain-based differential phase method or \(\text{FFT}\)-based methods. The time-domain method only works for \(M\)-\(\text{PSK}\), while \(\text{FFT}\)-based methods can be effective for higher-order \(\text{QAMs}\), but the implementation complexity is very high.

• The second is a constellation-assisted blind frequency search method that works for arbitrary \(\text{QAM}\). The significant advantage of this method lies in the fact that it can achieve very fast frequency recovery, but the downside is its high implementation complexity.

• The third is a training-initiated feedback method, in which the acquisition training sequence is utilized to do an initial \(\text{FO}\) estimation, and then \(\text{FO}\) variations are tracked by using the recovered phases from the following phase recovery stage in a feedback configuration.
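The time-domain variant of the first approach in the list above can be sketched as follows for QPSK. This is a minimal illustration: the function name, the symbol count, and the test offset are assumptions. The product of each symbol with the conjugate of its predecessor carries the per-symbol phase increment, and the fourth power removes the remaining data-dependent multiples of \(\pi/2\).

```python
import cmath
import math
import random

def mth_power_fo(symbols, m=4):
    """Frequency offset in cycles per symbol; valid for |offset| < 1 / (2 * m)."""
    acc = sum((cur * prev.conjugate()) ** m
              for prev, cur in zip(symbols, symbols[1:]))
    return cmath.phase(acc) / (2 * math.pi * m)

random.seed(6)
qpsk = [complex(random.choice((-1, 1)), random.choice((-1, 1))) for _ in range(256)]
offset = 0.01  # cycles per symbol (assumed test value)
rx = [s * cmath.exp(2j * math.pi * offset * n) for n, s in enumerate(qpsk)]
fo_estimate = mth_power_fo(rx)
```

The estimate is in cycles per symbol, so multiplying by the baud rate gives the offset in hertz. For \(M=4\) the unambiguous range is one eighth of the baud rate on either side of zero.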