# Pipelined FFT Architectures for Wireless Communication Applications

C. Padma<sup>1</sup>, Dr. P. Jagadamba<sup>2</sup>, Dr. P. Ramana Reddy<sup>3</sup>

Research Scholar at JNTUA, Ananthapuramu, Andhra Pradesh, India<sup>1</sup> Assistant Professor (Sr), Department of ECE, SKIT, Srikalahasti, Andhra Pradesh, India<sup>2</sup> Professor, Department of ECE, JNTUA, Ananthapuramu, Andhra Pradesh, India<sup>3</sup>

# ABSTRACT

Today's emerging technology requires fast processing and efficient use of resources. The resources include power, area and speed. This research paper aims at a study of various FFT algorithms and architectures for design of efficient Low Power FFT processor for Wireless Communication systems. Orthogonal frequency division multiplexing (OFDM) is one of the efficient system for transmission of data at high rate. The main functional block in OFDM system is the FFT processor. An efficient implementation of FFT processor is possible through a selection of suitable pipeline architecture and radix-2<sup>n</sup> algorithms. This paper describes the recent progress on the implementation of low power FFT processor.

KEYWORDS: Fast Fourier Transform (FFT), Single Path Delay Feedback Architecture (SDF), Multi Path Delay Commutator (MDC).

## **1. INTRODUCTION**

Orthogonal frequency division multiplexing (OFDM) plays a vital role in communication systems for transmission of audio and video signals in a high speed, high performance and efficient manner [9]. Fast Fourier transform is a main important block in OFDM System, The Fast Fourier Transform is an efficient algorithm for computing the Discrete Fourier Transform (DFT) and requires less number of computations than that of direct evaluation DFT. FFT has various applications like communications, signal processing, bio-medical engineering, instrumentation and applied mechanics.

Pipeline architectures are most widely used due to its high throughput and lesser clock rates. The memory based architecture provides lesser area but it provides lesser throughput. But compared to memory based architecture the pipelined architectures suffering with its hardware complexity. It consists of memory components called buffers between the intermediate stages during the calculations [3]. The function of these buffers is managing the flow of data and scheduling. The most widely used pipelined architectures are feedback and feed forward architectures.

Now a day's there is increasing demand on semiconductor technology in terms of performance, area and power. But increasing of power is a continuous problem in the processing communication technology [12]. The requirement of low power FFT processor in a wireless communication system is to maximize the life system and fulfill the need of consumer demand by giving extend battery life at lower cost. This paper explained as follows. Section II provides literature survey on FFT processor. In section III explains the various FFT algorithms. In section IV explains the various FFT architectures in detail and section V summarized the total work.

## 2. REVIEW ON FFT PROCESSORS

The author Yu-Wei Lin et al. [2] proposed new dynamic scaling approach for 8K point FFT processor used in digital video broad casting (DVB-T) applications. In this method proposed scaling approach and pre-fetched buffering saved the 64K bit memory in 8K point FFT processor. Bock floating point (BFP) is one of the dynamic scaling approaches is useful for reducing the quantization error and increasing the SONR ratio. The proposed design provides a core area of 4.84mm<sup>2</sup>, power dissipation 2.5mw at 20MHz for 8192-point FFT processor. The author Shen-Jui Huang et al. [3] proposed a high throughput 512 point FFT processor for IEEE 802.15.3c (WPAN) by using radix-16 FFT algorithm and memory based pipelined architecture for reducing number of butterfly units and producing high speed of 2.59GS/s. He proposed conflict free multibank memory addressing scheme support 16 data path type for parallel and normal order input/output data. In this implementation is done through EDA synthesis with area 0.93mm<sup>2</sup> and power consumption 42mW with 90nm technology with operating frequency of 324MHz and SQNR is 57dB with 12 bit word length. The author T. Cho et al. [4] has proposed high speed low complexity modified radix- $2^5$  for high rate WPAN application is used for producing high speed and lesser area by using pipeline multipath delay commutator. In this for producing high throughput greater than 2 GS/s use large data paths that is 8 or 16 used. The author M. Garrido et al. [5] proposed radix- $2^{k}$  feed forward architecture provides high throughput and it can be used for any number of parallel samples which is power of two and it is used for both DIT and DIF decomposition. The author Jian Wang et al. [7] presented fixed point analysis using radix- $2^{k}$ algorithm for optimizing the parameters. For hardware evaluation pipeline single path delay feedback (SDF) and multipath delay commutator (MDC) is used and also they preferred complex multiplier and one memory less CORDIC unit for reducing computational complexity and hardware. The proposed system is mainly preferred to improving signal to quantization ratio (SQNR) to meeting the signal processing applications. But concentrate on through put the system is poor. The author Chao Wang et al. [8] presented a normal input/output order 512 point FFT/IFFT processor for high throughput and low complexity for WPAN application. He proposed mixed radix  $-2^4 - 2^2 - 2^3$  and multipath delay feedback (MDF) architecture for reducing the twiddle factor multiplications. By using single RAM group it support the parallel normal-order output data flow continuously. The proposed system reaches the throughput of 1.76GS/s with area of 1.6mm<sup>2</sup> with a clock rate of 220MHz. The author Eun Ji Kim et al. [9] proposed a novel shared multiplier scheduling scheme using mixed radix algorithm for area efficient FFT/IFFT processor. The proposed mixed radix multipath delay commutator support 128,256 and 512 points and the design is implemented in 90nm CMOS technology. The proposed method is implemented by using 8 parallel data paths with a high throughput of 27.5GS/s at 430MHz operating frequency. The design is not only applicable to WPAN applications and also applicable to optical OFDM applications due to large throughput. This design is also extending to any FFT size by using additional stages but it leads to circuit complexity. The author Zhuo Qian et al. [10] proposed radix-2 low power shared memory based split radix SDF FFT processor for reducing dynamic power consumption. The proposed modified butterfly unit is used the multiplier gating technique to save the dynamic power. And also proposed two algorithms for both trivial and non trivial twiddle factor developed. This technique is more optimal in the sense of floating point operations. The author Kai-Feng Xia et al. [11] proposed design and implementation of Memory-Based FFT with Generalized Efficient Conflict-Free Address Schemes. He proposed address schemes for different FFT lengths are integrated to support FFT processing for various systems. High radix algorithm and parallel processing technique can be used to increase the throughput. The author Ngoc le ba et al. [12] proposed radix-2<sup>2</sup> MDC 1024 point feed forward FFT processor. This method proposed new input scheduling algorithm based upon memory to eliminate energy required to shift the data along the delay lines. The proposed 2 parallel path design is well suited for low power and high throughput applications. The proposed design occupies area 3.6mm<sup>2</sup>, power 60.3mW at operating frequency 600MHz.

### **3. FFT ALGORITHMS**

#### 3.1. Radix-2 Cooley Tookey Algorithms

The discrete Fourier transform can be written as

$$X(k) = \sum_{n=0}^{N-1} x(n) \cdot W_N^{nk}, \quad k = 0, 1, \dots, N-1 \quad (1)$$
$$W_N^{nk} = e^{-j(2\pi nk/N)} = \cos(2\pi nk/N) - j\sin(2\pi nk/N). \quad (2)$$

Where X(n) is the input sequence, X(k) is the output sequence and N is the transform length. Where  $W_N = e^{-j2\pi/N}$  which is called as the twiddle or phase factor. Fast Fourier transform algorithms are mathematical simplifications of the discrete Fourier transform (DFT). They exploit symmetries and periodicity in the transform in order to reduce the number of mathematical computations [9]. This is done by a divide and conquers method first invented by Carl fried rich gauss around 1805. It was later rediscovered and published by j.w. Cooley and john Tukey in 1965 and has become known as the cooley-tukey algorithm. This method effectively reduces the computational complexity of the DFT from order  $O(N^2)$  to  $O(Nlog_r^N)$ , where 'r' denotes the radix-r FFT. These families of fast algorithms for computing the DFT are commonly known as FFT algorithms. FFT computations can also be classified as decimation in time (DIT) or decimation in frequency (DIF) where the required computations are identical, but the order is essentially reversed.

A radix-2 simplified butterfly flow diagram is shown Figure 1.



Figure 1. Butterfly diagram of radix-2

# 3.2 Radix-2<sup>2</sup>, Radix-2<sup>3</sup>, Radix-2<sup>4</sup> and Radix-2<sup>5</sup> Algorithms

After improvement of radix-2 and radix-4 the proposed algorithm is  $radix2^{2}$  [8]. Further reduction of complexity in complex multipliers introduce algorithms are  $radix-2^{3}$ ,  $radix-2^{4}$  and  $radix-2^{5}$  [4]. All algorithms are similar to radix-2 DIF or radix-2 DIT only, but the generation of twiddle factors at each stage is different and this approach reduces complexity of butterfly structure. Present research these are the most widely algorithms for reduction of complexity. For example N=512 points how the twiddle factors at each stage is different is shown in below Table 1

Table 1: Sequence of the 512-Point FFT Twiddle factor computations for radix  $2^k$  algorithms

|                      |    | 80               |                  |                  |       |                 |    |     |
|----------------------|----|------------------|------------------|------------------|-------|-----------------|----|-----|
| Algorithm            | 1  | 2                | 3                | 4                | 5     | 6               | 7  | 8   |
| Radix-2 <sup>2</sup> | -j | W <sub>512</sub> | -j               | W <sub>128</sub> | -j    | W <sub>32</sub> | -j | W8  |
| Radix-2 <sup>3</sup> | -j | $W_8$            | W <sub>512</sub> | -j               | $W_8$ | W <sub>64</sub> | -j | W8  |
| Radix-2 <sup>4</sup> | -j | W <sub>16</sub>  | -j               | W <sub>512</sub> | -j    | W16             | -j | W32 |

#### **4. FFT ARCHITECTURES**

#### 4.1. Radix-2 Multipath Delay Commutator (R2MDC)

R2MDC is a one of the feed forward pipeline architecture. The following Figure 2. shows the example of 8-point radix-2 DIF FFT architecture [5]. It consists of two parallel paths with delay element one switch (commutator). In this input data is divided into two four input data streams, the first four input data stream is multiplexed to the top delay elements as

shown in figure and another four input data stream is directly given to the butterfly. This MDC architecture suffers with data dependency problem and utilization butterflies and multipliers also 50% only. The total delay of 8 point DIF R2MDC is 4+2+2+1+1=10. For an N-point FFT the total delay elements is equal to (3N/2)-2.



Figure 2. A 8 Point DIF R2MDC

#### 4.2. Radix-2 Single Path Delay Feedback (R2SDF)

This pipeline Single Path Delay Feedback architecture was introduced by Herbert L and George with minimum number of delay elements [6]. In this one half of the outputs from each stage are fed back to the input data buffer when the input data are directly given to the butterfly. Due to feed back mechanism this architecture reduces the total number of delay elements to N-1 (4+2+1=7) as shown in Figure 3. The utilization of multipliers and butterflies are same as to the R2MDC architecture.



Figure 3. A 8 Point DIF R2SDF FFT

# 4.3. Radix-4 Multipath Delay Commutator (R4MDC)

This architecture is similar to R2MDC. But it consists of 4 path delay commutator between the stages and the input data are separated by a 4-1 multiplexer and 3N/2 delay elements at first stage [11]. The following Figure 4. Shows the 64 point DIF R4MDC it consists of 3 multipliers at each stage except last stage. For an N point FFT the required number of multipliers is  $3(\log_4N-1)$  which is greater than R2MDC and R2SDF architectures. In this architecture the total delay elements for N point FFT are (5N/2)-4 and utilization of butterflies and multipliers in this architecture is 25% only. This architecture is suffered in terms of hardware and utilization.



Figure 4. A 64-Point DIF R4MDC FFT

#### 4.4. Radix-4 Single Path Delay Commutator (R4SDC)

This simplified radix-4 butterfly architecture is proposed by G. Bi and V. Jones. In this architecture only one output is proposed in comparison with conventional R4MDC and to

produce same four outputs the butterfly works four times instead of just one time. Due to this the utilization of butterflies and the number of delay elements will increases in this architecture [12]. For N point FFT the total number of multipliers is  $log_4N-1$  which is less than R2MDC FFT architecture. This architecture is benefit with the utilization of butterflies is 100% and utilization of multipliers is 75% but the cost of this architecture is increases due large number of delay elements. 16 point R4SDC FFT architecture as shown in Figure 5.



Figure 5. A 16-Point DIF R4SDC FFT

#### 4.5. Radix-4 Single Path Delay Feedback (R4SDF)

This radix-4 SDF architecture is a radix-4 version of R2SDF. Proposed radix-4 algorithm reduces the number of multiplier to  $\log_4^{N}$ -1 compared R2SDF architecture [1]. In this architecture the utilization of multipliers is 75% and utilization of butterflies is only 25%. The 64 point R4SDF FFT as shown in Figure 6. This R4SDF architecture butterflies is more complicated as compared R2SDF butterflies.



Figure 6. A 64-Point DIF R4SDF FFT

## 4.6. Radix-2<sup>2</sup> Single Path Delay Commutator (R2<sup>2</sup>SDF)

This radix-2<sup>2</sup>SDF is well suited pipeline FFT architecture for VLSI implementation due to less number of multipliers, memory requirement and better adder usage [7]. The radix-2<sup>2</sup>SDF uses a modified radix-4 DIF algorithm but it leads hardware complexity. By using two kinds radix-2 SDF butterflies to achieve the same output as compared to radix-4 butterfly as shown in Figure 7. This architecture uses registers efficiently to implement FFT function and reduces the circuit complexity.



Figure 7. A 256-Point DIF R2<sup>2</sup>SDF FFT

The computational complexity and memory requirement for above discussed pipeline architectures as shown in Table 2.

| Pipeline<br>Architectur<br>es | Complex multipliers     | Complex adders       | Memory<br>Size | Control |
|-------------------------------|-------------------------|----------------------|----------------|---------|
| R2MDC                         | 2(log <sub>4</sub> N-1) | $4 \log_4 N$         | 3N/2-2         | Simple  |
| R2SDF                         | 2(log <sub>4</sub> N-1) | $4 \log_4 N$         | N-1            | Simple  |
| R4MDC                         | 3(log <sub>4</sub> N-1) | 8 log <sub>4</sub> N | 5N/2-4         | Medium  |
| R4SDC                         | log <sub>4</sub> N-1    | 3 log <sub>4</sub> N | 2N-2           | Simple  |
| R4SDF                         | log <sub>4</sub> N-1    | 8 log <sub>4</sub> N | N-1            | Complex |
| R2 <sup>2</sup> SDF           | log <sub>4</sub> N-1    | 4 log <sub>4</sub> N | N-1            | Simple  |
| R2 <sup>2</sup> MDC           | log <sub>4</sub> N-1    | 4 log <sub>4</sub> N | 3N/2-2         | Simple  |

Table 2: Comparison of Pipelined Architectures

### **5. CONCLUSION:**

In this paper we focused on various radix- $2^n$  algorithms and pipeline architectures design and importance of low power reconfigurable FFT processor for wireless communication system. The major computations in FFT processor is complex multipliers, adders and memory. The radix- $2^2$ single path delay feedback architecture is most efficient for low power due to less number of multipliers, adders for design of FFT processor. The radix- $2^2$ Multipath Delay feedback commutator is well suited for large throughput and low power applications. The aim of research is to reduce the number of multiplication operation, increase the efficiency and reduce the power and area by proper design of twiddle factor multiplication and Memory.

#### REFERENCES

- Wein-Chang Yeh and Chein-Wei Jen"High Speed and Low Power Split-Radix FFT ", IEEE Transactions on Signal Processing, vol.51, issue: 3, March 2003.
- [2] Yu-Wei, Hsuan-Yu Liu and Chen-Yi Lee " A Dynamic Scaling FFT Processor for DVB-T Applications", IEEE Journal of Solid State Circuits, vol.39 no.11, November 2004.
- [3] Shen-Jui Huang, Student Member, IEEE, and Sau-Gee Chen, Member, IEEE "A High-Throughput Radix-16 FFT Processor with Parallel and Normal Input / Output Ordering for IEEE 802.15.3c Systems", IEEE Transaction On Circuits and Systems-I, vol. 59, no. 8, August 2012.
- [4] T. Cho and H. Lee "A High-Speed Low-Complexity Modified Radix-2<sup>5</sup> FFT processor for High Rate WPAN Applications", IEEE Transactions On Very Large Scale Integration (VLSI) Systems, vol.21, no.1, January 2013.
- [5] M. Garrido, J. Grajal, M. A. Sánchez, and O. Gustafsson, "Pipelined radix-2<sup>k</sup> feedforward FFT architectures," IEEE Transactions On Very Large Scale Integration (VLSI) Systems, vol. 21, no. 1, pp. 23– 32, Jan. 2013.
- [6] Jienan Chen, Jianhao Hu, Shuyang Lee and Gerald E. Sobelman "Hardware Efficient Mixed Radix-25/16/9 FFT for LTE Systems" IEEE Transactions On Very Large Scale Integration (VLSI) Systems, vol. 23, no. 2, November 2015.
- [7] Jian Wang, Chunlin Xiong, Kangli Zhang, and Jibo Wei, Member, IEEE "Fixed point analysis and parameter optimization of the radix-2<sup>k</sup> pipelined FFT processor" IEEE Transaction on signal processing, Vol.63, No.18, Sept. 15, 2015.
- [8] Chao Wang, Yuwei Yan, and Xiaoyu Fu "A High-Throughput Low-Complexity Radix-2<sup>4</sup>-2<sup>2</sup>-2<sup>3</sup> FFT/IFFT Processor With Parallel and Normal Input/Output Order for IEEE 802.11ad Systems" IEEE Transactions On Very Large Scale Integration (VLSI) Systems" in process, 2015.

- [9] Eun Ji Kim, Jee Hack Lee and Myung Hoon Sunwoo "Novel Shared Multiplier Scheduling Scheme for Area-Efficient FFT-IFFT Processors", IEEE Transaction on very large scale integration (VLSI) Systems, vol.23, no.9, September 2015.
- [10] Zhuo Qian and Martin Margala "Low-Power Split-Radix FFT Processors Using Radix-2 Butterfly Units" IEEE Transactions On Very Large Scale Integration (VLSI) Systems, vol.24, Issue: 9, September 2016.
- [11] Kai-Feng Xia, Bin Wu, Tao Xiong, and Tian-Chun "A Memory-Based FFT Processor with Generalized Efficient Conflict-Free Address Schemes", IEEE Transactions On Very Large Scale Integration (VLSI) Systems, 2017.
- [12] Ngoc Le Ba and Tony Tae-Hyoung Kim "An Area Efficient 1024-Point Low Power Radix-2<sup>2</sup> FFT Processor with Feed-Forward Multiple Delay Commutators" IEEE Transactions On Circuits and Systems, April, 2018.