I. INTRODUCTION
Sorting is a fundamental operation in most data-centric applications, such
as big data [1], image processing [2], database query operations [3], wireless local area networks [4], ATM switching [5], etc. Because of the great impact of sorting operations, many software algorithms
and hardware implementations have been developed. In many cases, especially for real-time
sorting operations [6,7], software-oriented designs struggle to meet timing constraints. Since hardware
implementation has many advantages over software approaches, developing an efficient
hardware sorting unit has been a significant challenge in overcoming the computational
bottleneck of sorting problems.
Most prior hardware sorting designs are based on modifications of traditional
sorting algorithms or on altered switching-network structures with
parallel processing and pipelining stages. Most of these approaches rely on the compare-and-swap
(CAS) operation, which compares the magnitudes of two inputs and switches their positions
according to the result. To improve the performance of sorting architectures, parallel
data-independent CAS blocks are arranged to form a sorting network [8-10]. CAS-based sorting architectures recursively move inputs from memory to memory,
so they require large memory space and considerable processing time for moving inputs.
Alternative sorting engines, called comparison-free sorting engines, use special
sorting methods instead of the CAS operation [11-14]. These architectures use a non-comparative sorting algorithm that does not compare
inputs against each other. Because they avoid recursive moving and swapping of input elements,
comparison-free sorting engines save swapping memories and data load/store time.
The number of comparison-free sorting engines, however, is quite limited compared
to CAS-based sorting engines, because existing non-comparative sorting
algorithms are much harder to implement than CAS-based designs. Comparison-free sorting engines
adopt a non-comparative algorithm either by modifying an existing algorithm or by developing
special architectures and operations.
Despite the many potential advantages of comparison-free sorting, most of the implementations
proposed so far are not adequate as general-purpose sorting engines. The main reason
is that their architectures use special operations requiring unusual hardware,
so they are not easy to synthesize with standard ASIC design tools.
In this paper, a comparison-free sorting engine called the Sorting Grid is presented.
The Sorting Grid (SG) is intended to be used as a general hardware module for ASIC
design. The SG is composed of three major modular blocks and a control block, and each
modular block is built by copying a basic unit so that it can be easily synthesized
with ASIC tools. The SG supports stable sorting for inputs with the same key, and even
multi-key sorting. Any ASIC chip requiring a hardware sorting IP can use this module.
The relevant hardware sorting implementations are reviewed in Section II. The architecture
and operation of the proposed sorting engine are described in Section III. The performance
of the new architecture is evaluated by simulations, and the results are presented
in Section IV. Section V concludes this paper.
II. RELATED WORK
Most hardware sorting architectures are based on comparative sorting algorithms,
while only a few employ a non-comparative algorithm.
All hardware sorting engines adopting a comparative sorting algorithm use CAS elements
[1-10]; their differences lie in how those elements are arranged in the architecture.
Efficient parallel processing and pipelining are the keys to high-speed operation.
Although some CAS-based engines are implemented as ASIC chips, most are
implemented with FPGAs. Since comparators and local memories are included in most
FPGAs, it is relatively easy to implement a CAS-based engine with an FPGA. However, CAS-based
architectures need large processing time and memories for comparing and swapping operations,
especially when sorting a large number of inputs.
Some sorting engines employ a non-comparative sorting algorithm to avoid the disadvantages
of the CAS operation; however, only a few comparison-free sorting engines have been developed
so far. A binary number sorting engine based on a rank-order filter and an efficiently implemented multi-input majority
function is presented in [11]. Alternatively, the design in [12] uses a rank matrix that assigns relative ranks to the inputs. The rank matrix is
updated according to the value of a particular bit in each input. However,
this design cannot be used when the number of elements is less than the bit-size
of the key. A one-hot weight matrix is employed for the sorting operation in [13]: each binary input is assigned a unique one-hot weight, and the one-hot weight matrix
is built from the weights of the inputs; sorting is performed using this matrix. Due
to the inefficiency of the one-hot weight matrix, however, this engine is competitive
only when the key size ($\textit{k}$) is small and the number of inputs is about
2$^{k}$. A largest-element detection scheme is used for sorting in [14]. By bit-serial scanning of the inputs, it finds the largest element in the
first scan and returns the address of that element. The second largest element is sorted
in the next scan, and so on. One element is sorted per scan, so sorting is completed
by repeating the scan $\textit{N}$ times.
Most comparison-free sorting engines are implemented as ASIC chips. Generally,
a comparison-free sorting engine uses a special sorting scheme that is not employed
in any other ASIC chip. Therefore, most of its components are not supported in
commercial FPGA chips, so FPGA implementation is either impossible or very
inefficient. The uniqueness and unfamiliarity of the architecture are the main disadvantages
of comparison-free sorting engines as generally usable sorting modules.
III. THE SORTING GRID
A sorting-engine module called Sorting Grid (SG) is developed to be used as a hardware
IP for ASIC design. It can be used as an easily synthesizable hardware module for
any ASIC chip which requires high-speed sorting operation. The structure and operation
of SG are described in this section.
1. Block Diagram of The Sorting Grid
Operation of the SG is based on the radix sort algorithm. Radix sort is a well-known
non-comparative sorting algorithm that avoids comparison by distributing data into
buckets according to their radix [15]. Starting from the most significant digit (MSD), this bucketing process is repeated for each
digit, while preserving the ordering of the prior step, until all digits have been
considered.
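For reference, the MSD bucketing process described above can be sketched in software. This is an illustrative sketch of the generic radix sort of [15], not the SG hardware itself; the function name and the choice of radix 10 are ours:

```python
def msd_radix_sort(values, digits, radix=10):
    """Generic MSD radix sort: bucket by the current digit, then recurse
    on each bucket with the next-lower digit, preserving bucket order."""
    def digit(v, d):
        return (v // radix ** d) % radix

    def rec(items, d):
        if d < 0 or len(items) <= 1:
            return items
        buckets = [[] for _ in range(radix)]
        for v in items:
            buckets[digit(v, d)].append(v)   # stable: input order kept
        return [v for b in buckets for v in rec(b, d - 1)]

    return rec(list(values), digits - 1)
```

Because each bucketing pass preserves the order established by earlier passes, the result is a stable sort without any element-to-element comparison.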
The SG uses a novel architecture to implement the radix-2 sort algorithm. Fig. 1 shows the structure of the SG. It consists of four main blocks: the I/O register block, the grid
block, the bin management block, and the control block. The I/O register block is composed
of conventional shift registers. To sort $\textit{N}$ $\textit{m}$-bit binary inputs,
$\textit{N}$ ($\textit{m}$+1)-bit shift registers are used to store both the
inputs and the sorted data. One additional bit is automatically inserted when each input
element is loaded into the registers. Before sorting, the input elements are loaded into the
registers, and they are sorted in ($\textit{m}$+1) cycles. The most significant bits
(MSB) of the $\textit{N}$ registers are used for the sorting operation in each cycle and
discarded by shifting at the end of the cycle. The least significant bits (LSB) of
the registers are filled with the outputs of the bin management block, as in Fig. 1. By repeating this operation ($\textit{m}$+1) times, the original input elements
are expelled from the registers as sorting completes, and the registers end up filled
with the sorted elements.
Fig. 1. Block diagram of the Sorting Grid.
The grid block is an array of $\textit{N}$${\times}$$\textit{N}$ grid units. The principal
function of a grid unit is a 1${\times}$2 demultiplexer, and the array is used to
count the number of 1s in a bin. The bin management (BM) block creates bins, deactivates
useless grid units, and controls the progress of a cycle. The flow of the entire sorting process
is governed by the control block.
2. Operation of The Sorting Grid
The operation of the SG is explained using a simple example. The SG supports two
types of inputs. The first type is key-only inputs, where the keys themselves are the objects
of sorting. In many applications, the raw data are the keys to be sorted; the value of
the key is all that is needed, so no other information is necessary. For this type
of input, stability of sorting is not required because there is no way to distinguish
two identical data items.
In the second type of input, the key is only a part of the input element. Typically, an
input element is composed of a key and an object, where the key represents the weight of the
object. Sorting is performed to determine an order of the objects, although it is
done using the keys. Since various kinds of objects are possible in sorting problems, a pointer
is used to retrieve the object. For this type of input, stability is an important
property of the sorting operation.
For $\textit{m}$-bit ($\textit{m}$ = $\textit{k}$+$\textit{p}$) inputs with a $\textit{k}$-bit
key and a $\textit{p}$-bit pointer, the I/O block requires ($\textit{m}$+1)-bit shift
registers, which are divided into three fields (key, TB, and pointer). Keys must have
a fixed size. One bit, called the tie-break (TB) bit, is inserted (with value 0) between the key field
and the pointer field. This bit is used to achieve a stable sort when multiple inputs have
the same key. It is inserted/removed by the control block when data are transmitted
between the I/O block and external systems. For key-only inputs, the TB bit is not necessary,
so an $\textit{m}$(=$\textit{k}$)-bit register is sufficient for sorting. The values
of $\textit{k}$ and $\textit{p}$ must be given to the control block before input
transmission.
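The register layout described above can be illustrated with a small sketch. The helper names are ours, and the SG performs this insertion in hardware via the control block; the sketch only shows how a key, the TB bit, and a pointer share one (k+1+p)-bit word:

```python
def pack_element(key, pointer, k, p):
    """Pack a k-bit key, a tie-break (TB) bit of 0, and a p-bit pointer
    into one (k+1+p)-bit register word, with the key in the MSBs."""
    assert 0 <= key < (1 << k) and 0 <= pointer < (1 << p)
    return (key << (1 + p)) | (0 << p) | pointer

def unpack_element(word, k, p):
    """Split a packed register word back into (key, TB bit, pointer)."""
    key = word >> (1 + p)
    tb = (word >> p) & 1
    pointer = word & ((1 << p) - 1)
    return key, tb, pointer
```

Placing the key in the most significant bits is what lets the bit-serial cycles below process the key field first, before the TB bit and the pointer.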
Fig. 2 presents the sorting process of the SG. As an example, assume that there are 8 objects
($\textit{D}$$_{1}$, $\textit{D}$$_{2}$, ... $\textit{D}$$_{8}$) whose weights
are given by the 4-bit binary keys (8, 5, 15, 10, 5, 7, 5, 11). Among the 8 objects, three
have the same key. As shown in Fig. 2(a), 8-bit registers are used to store the input elements: in addition to the 4-bit key field,
1 bit for stability and a 3-bit index field are required. Instead of a pointer, the index
of the object is used to save area and operation time.
Before sorting begins, the input elements are loaded into the I/O registers, all grid units
are activated, and the entire array belongs to one bin. In the first cycle, only the right-bottom
input of the grid array is 0. Let us call this 0-value a ball. According to the
values of the MSBs of the I/O registers, the grid units pass the ball to the top of the array
as in Fig. 2(a).
Fig. 2. Example for operation of the sorting grid: (a) the first cycle; (b) the second cycle; (c) the third cycle; (d) the fourth cycle; (e) the fifth cycle (tie-break cycle); (f) the 6~8-th cycles (transmission cycle).
The behavior of a grid unit is controlled mainly by the MSB values of the I/O registers.
A grid unit passes the ball to its upper neighbor when MSB=1, but to its upper-left neighbor when
MSB=0. Since the ball moves left only when MSB=0, the ball arrives at the top of the $\textit{j}$-th
column such that $\textit{N}$-$\textit{j}$ equals the number of 0s in the
MSBs of the registers.
As soon as the $\textit{j}$-th BM-unit receives the ball, it divides the array into
two bins: the right columns become the 0s bin and the left columns the 1s bin. According to the bin decision, the
memory cell in each BM-unit is written with the proper value as in Fig. 2(a). After the bin decision, useless grid units in each bin are deactivated with the kill signals
K-0 and K-1. The K-0 (kill 0) signal, activated in the 1s bin, disables all grid units with
MSB=0, while the K-1 (kill 1) signal in the 0s bin disables grid units with MSB=1. A deactivated grid
unit simply passes a ball to the upper unit regardless of the value of its
MSB. As a result, $\textit{j}$${\times}$$\textit{j}$ active grid units are left in
the 1s bin and ($\textit{N}$-$\textit{j}$)${\times}$($\textit{N}$-$\textit{j}$) active
units form a new array in the 0s bin, as in Fig. 2(b). At the end of each cycle, the contents of the shift registers are shifted left so that
the MSBs are discarded and the bits in the next digit move into the MSB position, while the
contents of the memories in the BM block are shifted into the LSBs of the registers, as in Fig. 2(b).
The operations in cycles 2, 3, and 4 are the same as in the first cycle,
except that there are multiple bins and the first-cycle operations
are applied to every bin simultaneously. At the right-bottom of each bin, a new ball
is placed, and it propagates to the top of that bin. On receiving one of the balls, the BM block
divides the corresponding bin and kills the unnecessary grid units in the new bins. Then the registers
are shifted with the contents of the BM-block memories. The first 4 cycles ($\textit{k}$
cycles in general) are performed on the keys, so these cycles are called sorting
cycles.
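The sorting cycles described above can be modeled behaviorally in software. The sketch below is our own model, with each bin represented as an ordered list; it reproduces the bin-splitting of Fig. 2, including the stability that the tie-break cycle provides in hardware, and it assumes the 1s bin precedes the 0s bin (i.e. descending key order):

```python
def sg_sort(elements, k):
    """Behavioral model of the SG sorting cycles.

    elements: list of (key, payload) pairs; keys are k-bit integers.
    Each cycle splits every bin into a 1s bin and a 0s bin according to
    the current key bit, processing all bins in one pass and preserving
    order inside each bin (the role of the tie-break cycle in hardware).
    """
    bins = [list(elements)]
    for bit in range(k - 1, -1, -1):          # MSB first: k sorting cycles
        next_bins = []
        for b in bins:
            ones = [e for e in b if (e[0] >> bit) & 1]        # 1s bin
            zeros = [e for e in b if not ((e[0] >> bit) & 1)]  # 0s bin
            next_bins += [nb for nb in (ones, zeros) if nb]
        bins = next_bins
    return [e for b in bins for e in b]       # descending key order

# Keys from the Fig. 2 example, tagged with their input indices:
data = [(8, 0), (5, 1), (15, 2), (10, 3), (5, 4), (7, 5), (5, 6), (11, 7)]
result = sg_sort(data, 4)
```

On the Fig. 2 keys this yields the keys in order 15, 11, 10, 8, 7, 5, 5, 5, with the three tied keys keeping their input order (indices 1, 4, 6), which is exactly the stable behavior the tie-break cycle enforces.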
When no identical keys exist among the input elements, only one active grid unit is
left in every column of the grid array. However, multiple active units remain in a
column when some elements have the same key, as in Fig. 2(e). One additional cycle, called the tie-break cycle, is required to set an order among
those elements. By making their order the same as their input order, the entire
sorting operation becomes a stable sort. The tie-break cycle operates the same way as a
sorting cycle, except that the $\textit{T}$ signal is generated instead of the KILL signal.
The $\textit{T}$ signal deactivates the unnecessary grid units.
After the tie-break cycle, only one active grid unit is left in every column of the array,
as in Fig. 2(f). The remaining cycles are called transmission cycles; they simply pass the bits in the MSBs
of the registers to the LSBs via the active grid units and the memories in the BM block. A transmission
cycle operates the same way as a sorting cycle, but the deactivation process with
KILL signals is not necessary.
As shown in Fig. 2, the I/O registers are initially filled with the input elements but are gradually replaced
by outputs and finally filled with the sorted elements. Therefore, the SG requires minimal
memory space for data, which greatly reduces hardware size.
3. Implementation of The Sorting Grid
Each block of the SG in Fig. 1 is implemented as a VLSI circuit. The circuit of a grid unit is shown in Fig. 3. The grid
unit is essentially a 1${\times}$2 demultiplexer with some additional features, such as deactivating
logic and a dummy path. The behavior of a grid unit is controlled mainly by the values of $\textit{D}$$_{i}$,
$\textit{I}$$_{j}$, and its internal latch. $\textit{D}$$_{i}$ is a bit of an input element,
which comes from the MSB of the $\textit{i}$-th I/O register. All grid units in the
$\textit{i}$-th row are controlled by $\textit{D}$$_{i}$. If $\textit{D}$$_{i}$=1,
the ($\textit{i, j}$)-th grid unit ($\textit{G}$$_{ij}$) passes the received ball signal
($\overline{B_{ij}}$) to $\textit{G}$$_{i-1,j}$, while it passes $\overline{B_{ij}}$
to $\textit{G}$$_{i-1,j-1}$ if $\textit{D}$$_{i}$=0. A grid unit can
be disabled by its internal latch. If the latch is set, the grid unit is disabled,
so it simply passes the ball to the upper unit ($\textit{G}$$_{i-1,j}$) regardless
of the value of $\textit{D}$$_{i}$. At the beginning of the sorting operation, the internal
latches of all grid units are reset by the RESET signal. During the sorting process, grid
units are selectively disabled by setting the latch through one of the kill signals (K-0
or K-1). Once a unit is disabled, it cannot be reactivated until sorting is finished.
The isolation signal $\textit{I}$$_{j}$ is used to isolate two adjacent columns in
the grid array. A column of the grid array, called a ladder, corresponds to an output
element in a bin. An activated isolation signal ($\textit{I}$$_{j}$=1) becomes
a border of a bin, so a bin is defined by two activated isolation signals, and
the number of elements in the bin is the number of columns between them. The ball
signal ($\overline{B_{j}}$) cannot cross any border of a bin. The isolation signal
$\textit{I}$$_{j}$ activates an auxiliary ball path $\overline{AB_{j}}$ to provide
a path for ball signals instead of crossing the border.
Fig. 3. Circuit of the grid unit.
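The routing rule of the grid unit can be summarized with a small sketch. The function is hypothetical (our own abstraction); signal timing, the isolation borders, and the auxiliary ball path are omitted, so it only traces which column a ball reaches within one bin:

```python
def ball_arrival_column(msb_bits, active, start_col):
    """Trace a ball from the bottom of column start_col to the top.

    An active unit in row i passes the ball straight up when D_i = 1 and
    to the upper-left column when D_i = 0; a disabled unit (latch set)
    always passes the ball straight up.
    """
    col = start_col
    # The ball enters at the bottom row, so traverse rows bottom-up.
    for d, is_active in zip(reversed(msb_bits), reversed(active)):
        if is_active and d == 0:
            col -= 1                      # move one column left on a 0 bit
    return col
```

With all units active, the arrival column j satisfies N - j = (number of 0s in the MSBs), matching the bin-division rule described in Section III.2.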
The bin management (BM) block creates bins, deactivates useless grid units, and controls
the progress of a cycle. It consists of a series of $\textit{N}$ BM-units, as presented
in Fig. 4. Each BM-unit controls a ladder. If a ball (value 0) arrives at the top of the $\textit{j}$-th
column (i.e. $\textit{B}$$_{0j}$=1), the $\textit{j}$-th BM-unit divides its bin into
two bins by activating $\textit{I}$$_{j+1}$. The ladders to the right of the $\textit{j}$-th
ladder become the 0s bin and the other ladders become the 1s bin. After creating the new bins,
the memory cells in the 1s (0s) bin are written with 1 (0). The signal $\overline{F_{j}}$
is used to indicate that ladder $\textit{j}$ is ready for the next operation. The
control block can check the states of the BM-units with the $\overline{F_{j}}$ signals.
Fig. 4. Block diagram of the bin-management unit.
The flow of the entire sorting process is governed by the control block. The control block
generates various asynchronous signals as soon as possible for high-speed operation.
The entire sorting operation is performed in $\textit{m}$+1 cycles, but each cycle
has a different latency: a small bin can be processed more quickly than
a large bin. The number of bins increases while the sizes of the bins decrease over the
cycles. Because all bins are sorted in parallel, the processing time for a cycle keeps
decreasing. For this reason, the SG avoids using a fixed-period clock. Instead,
it uses a self-timed asynchronous clock which flips as soon as possible by checking
the states of the BM-units. However, it uses a fixed external clock when the I/O block
sends/receives data to/from external systems.
Note that the three main blocks, the grid array, the I/O block, and the BM block, have
a modular structure: each block is made by copying a basic unit. Therefore, the engine
can be easily synthesized if hardware IPs for the grid unit and the BM-unit are available,
since the shift register is already included as a standard cell in most ASIC tools.
The control block can also be easily synthesized using conventional control-logic
design methods.
4. Performance Enhancement Scheme
Two speed-enhancement schemes are developed to improve the performance of the SG.
A self-timed asynchronous clock is used to avoid the inefficiency of a fixed-period
clock, and a bypassing scheme is employed to decrease the ball-propagation
delay in the grid block.
$\textit{N}$ $\textit{m}$-bit inputs with a $\textit{k}$-bit key and a $\textit{p}$-bit
pointer are sorted in ($\textit{k}$+$\textit{p}$+1) cycles with the SG. If a fixed-period
clock is used, the clock period $\textit{T}$$_{clk}$ must be set to the longest cycle
among the ($\textit{k}$+$\textit{p}$+1) cycles,
and the total sorting delay ($\textit{T}$$_{d}$) becomes ($\textit{m}$+1)$\textit{T}$$_{clk}$.
Since the longest cycle time is proportional to $\textit{N}$, the time complexity of the
proposed engine would be $\textit{O}$($\textit{mN}$) with a fixed-period clock.
However, the execution time of the longest cycle can be tens of times longer
than that of the shortest cycle, so a fixed-period clock is quite inefficient in
performance.
The execution time of a cycle ($\textit{T}$$_{cycle}$) can be analyzed with 4 components:

$$T_{cycle} = T_{col} + T_{BM} + T_{C} + T_{R} \quad (1)$$
where $\textit{T}$$_{col}$ is the time for a ball to propagate from the bottom to the
top of the array, and $\textit{T}$$_{BM}$ is the time for setting the BM-units from the
ball position to the borders of the bin. $\textit{T}$$_{C}$ is the time spent in
the control block and $\textit{T}$$_{R}$ is the restoring time for the next cycle. The
contributions of $\textit{T}$$_{C}$ and $\textit{T}$$_{R}$ to the total sorting time are
small, and their variations over the cycles are negligible compared to the other
two terms, so they can be regarded as constants. $\textit{T}$$_{col}$ is proportional
to the number of grid units in a column, and $\textit{T}$$_{BM}$ is proportional to
the bin size, i.e. the number of columns in a bin. The total sorting delay $\textit{T}$$_{d}$
is

$$T_{d} = \sum_{j=1}^{m+1} T_{cycle,j} = \sum_{j=1}^{m+1} \left( N t_{g} + b_{max,j}\, t_{B} + \alpha \right) \quad (2)$$
where $\textit{t}$$_{g}$ is the delay of a grid unit, $\textit{b}$$_{max,j}$
is the largest bin size in cycle $\textit{j}$, $\textit{t}$$_{B}$ is the delay of
a BM-unit, and ${\alpha}$ = $\textit{T}$$_{C}$ + $\textit{T}$$_{R}$ is a constant.
The longest $\textit{T}$$_{cycle}$ occurs in the first cycle because the bin size
is largest then. Although the number of bins increases over the cycles,
the size of each bin decreases. Since all bins are processed simultaneously, $\textit{T}$$_{BM}$
is determined by the biggest bin in each cycle. Because $\textit{b}$$_{max,j}$
decreases monotonically, $\textit{T}$$_{cycle}$ decreases over the cycles, and all
bins have only one element after the tie-break (($\textit{k}$+1)-th) cycle. Therefore,
the worst-case cycle time, at the first cycle, is

$$T_{cycle,max} = N t_{g} + N t_{B} + \alpha \quad (3)$$

while the shortest cycle time, in the transmission cycles, is

$$T_{cycle,min} = N t_{g} + t_{B} + \alpha \quad (4)$$
Eqs. (3) and (4) show a large difference between the longest and shortest cycle times. Reflecting
this result, the new sorting engine uses a self-timed asynchronous clock to increase
operation speed. The clock flips by detecting the end of the critical processes so that
each cycle starts as soon as possible. By employing this variable-period clock, the
time complexity of $\textit{T}$$_{BM}$ can be reduced from $\textit{O}$($\textit{Nm}$)
to $\textit{O}$($\textit{N+m}$) in the best case. Assume that the inputs are well balanced
so that the balls bisect the bins in every cycle; then $\textit{b}$$_{max,j}$ would
be $\textit{N}$/2$^{j}$. The $\textit{b}$$_{max,j}$ decreases until it becomes
1, and the rest of the cycles work with the same $\textit{T}$$_{BM}$. If $\textit{N}$
< 2$^{k}$, approximately,

$$\sum_{j=1}^{m+1} T_{BM,j} \approx \left( \sum_{j=0}^{\log_{2} N} \frac{N}{2^{j}} + m - \log_{2} N \right) t_{B} \approx \left( 2N + m - \log_{2} N \right) t_{B} \quad (5)$$

which has $\textit{O}$($\textit{N+m}$) time complexity. The distribution of pseudo-random
inputs shows patterns similar to the best case.
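Under the balanced-input assumption above, the per-cycle delay model can be evaluated numerically. This is an illustrative sketch of the sum over cycles (delay per cycle is N·t_g + b_max·t_B + α, with the largest bin halving each cycle); the parameter values are arbitrary, not measured:

```python
def total_delay_balanced(N, m, t_g, t_B, alpha):
    """Sum the per-cycle delays N*t_g + b_max*t_B + alpha over m+1 cycles,
    halving the largest bin size each cycle until it reaches 1
    (the balanced-input assumption: every ball bisects its bin)."""
    total = 0.0
    b_max = N                      # first cycle operates on one bin of size N
    for _ in range(m + 1):
        total += N * t_g + b_max * t_B + alpha
        b_max = max(1, b_max // 2)
    return total
```

Setting t_g = 0 isolates the BM term: the sum of bin sizes is then roughly 2N plus one t_B per remaining cycle, which is the O(N+m) behavior claimed for T_BM.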
Unlike $\textit{T}$$_{BM}$, $\textit{T}$$_{col}$ does not change even though the number
of active grid units in a column keeps decreasing. The columns in a bin are localized
in the array, but the active grid units are scattered over the entire column. Even if
most of the units in a column are disabled, a ball must still pass through those units, so a ball
passes $\textit{N}$ grid units regardless of the number of inactive units. Although
$\textit{T}$$_{BM}$ is much larger than $\textit{T}$$_{col}$ in the early cycles, $\textit{T}$$_{col}$
becomes the dominant component after a few cycles, causing $\textit{T}$$_{d}$ to
remain $\textit{O}$($\textit{Nm}$).
To resolve this problem, a bypass scheme is developed as in Fig. 5. The purpose of the bypass scheme is to skip consecutive inactive grid units during ball propagation.
A column is divided into several groups such that every group consists of $\textit{q}$
consecutive grid units and has a bypassing line. If all grid units in a group are
disabled, a ball does not pass through the grid units in the group; instead, the ball is sent to
the next group through the bypassing line. The bypassing scheme is very effective
in later cycles, when many grid units are disabled.
Fig. 5. Basic grid block with the first level bypassing circuit.
Fig. 5 shows the basic block of grid units and the implementation of the first-level bypassing
scheme. Four grid units with a repeater form a basic grid block, which is used as a
basic module of the ASIC library. With the first-level bypassing scheme, the maximum
achievable speed-up is a factor of 4. For greater speed-up, a multi-level bypassing scheme
can be employed. The number of levels and the number of grid units in a group should
be decided considering the application and the hardware overhead.
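The effect of the grouping can be sketched as follows. This is our own model of a single bypass level with group size q; t_bypass stands for the delay of one bypass hop, and both delay values are illustrative:

```python
def column_delay(active, q, t_g, t_bypass):
    """Ball-propagation delay along one column with first-level bypassing.

    The column is split into groups of q consecutive units; a fully
    disabled group is skipped via its bypass line, otherwise the ball
    passes through every unit in the group.
    """
    delay = 0.0
    for start in range(0, len(active), q):
        group = active[start:start + q]
        if any(group):
            delay += len(group) * t_g   # must traverse every unit in the group
        else:
            delay += t_bypass           # whole group skipped via bypass line
    return delay
```

With every group fully disabled, the delay drops from N·t_g to (N/q)·t_bypass, which is where the factor-q (here 4) speed-up bound comes from.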
As with $\textit{T}$$_{BM}$, the time complexity of $\textit{T}$$_{col}$ can be
$\textit{O}$($\textit{N+m}$) in the best case, with an optimal input distribution and
the maximal-level bypassing scheme. Therefore, the overall latency ($\textit{T}$$_{d}$)
of the sorting engine can achieve $\textit{O}$($\textit{N+m}$) time complexity, but
remains $\textit{O}$($\textit{Nm}$) for the worst-case distribution.
IV. PERFORMANCE EVALUATION
The performance of the SG is evaluated through simulations with pseudo-random inputs.
The simulation is performed with HSPICE using 1.2 V, 0.13 ${\mu}$m process model parameters.
Simulations are performed for several input sizes and key sizes. For each type of
input, 10 samples are simulated, and their average values are reported.
Two groups of inputs are used: one group uses an 8-bit key and the
other a 16-bit key. Inputs with 16, 64, 256, and 1024 elements are used. Key-only
inputs are used for the simulations because the transmission cycles merely move data with
a constant cycle time. A multi-level bypassing scheme is used in which every 4 blocks
are grouped as a bypassing unit; for example, a 4-level (maximum-level) bypassing scheme
is used when sorting 1024 inputs.
Table 1 shows the simulation results. The proposed sorting engine sorts 1024 8-bit inputs
in 1.042 ${\mu}$s, a rate of about one element per nanosecond. Even when the key
size is doubled, the sorting time increases by 55% for 16 inputs, by only 29% for 256
inputs, and by only 6% for 1024 inputs. One reason for this remarkable result is that 1024 inputs with an 8-bit key
include many identical keys since $\textit{N}$ > 2$^{k}$, while $\textit{N}$
< 2$^{k}$ for the 16-bit-key inputs. The efficiency of the bypassing scheme drops when the inputs
include many same-key elements.
As in Table 1, the gap in cycle times between the first cycle and the last cycle is very large,
and the difference grows with the number of inputs. Fig. 6 shows how the cycle time ($\textit{T}$$_{cycle}$) changes over the cycles. $\textit{T}$$_{cycle}$
decreases rapidly and then saturates, because $\textit{T}$$_{col}$
and $\textit{T}$$_{BM}$ keep decreasing but $\textit{T}$$_{C}$+$\textit{T}$$_{R}$
does not. Beyond that point, $\textit{T}$$_{C}$+$\textit{T}$$_{R}$ becomes the dominant factor
of $\textit{T}$$_{cycle}$.
Table 1. Simulation Results

| # of Inputs | 8-bit key: $T_{d}$ (ns) | 8-bit key: first $T_{cycle}$ (ns) | 8-bit key: last $T_{cycle}$ (ns) | 16-bit key: $T_{d}$ (ns) | 16-bit key: first $T_{cycle}$ (ns) | 16-bit key: last $T_{cycle}$ (ns) |
|---|---|---|---|---|---|---|
| 16 | 25.3 | 5.7 | 2.0 | 39.3 | 5.4 | 1.9 |
| 64 | 69.3 | 19.9 | 3.6 | 100.0 | 20.3 | 3.2 |
| 256 | 253.3 | 79.6 | 6.8 | 325.6 | 80.8 | 4.9 |
| 1024 | 1042.2 | 318.4 | 26.2 | 1105.1 | 317.0 | 7.1 |
If $\textit{N}$ > 2$^{k}$, there must be multiple inputs with the same key, as in the case
of 1024 inputs with 8-bit keys: on average, four elements share each key value, so
the efficiency of the bypassing scheme is reduced. It can be seen in Fig. 6(a) that $\textit{T}$$_{cycle}$ for 1024 inputs decreases more slowly than that for 256 inputs,
where only a few inputs can be identical. This implies that the bypassing scheme is more
efficient when $\textit{N}$ < 2$^{k}$.
Fig. 6. Change of cycle time: (a) inputs with 8-bit key; (b) inputs with 16-bit key.
The shortest cycle is about 44 times faster than the longest cycle for 1024 inputs
with a 16-bit key. If a fixed-period clock were used, the sorting delay ($\textit{T}$$_{d}$)
would be 2.25 ${\mu}$s (16${\times}$317 ns), double the delay of the SG (1.105
${\mu}$s). The largest difference occurs when sorting 256 inputs with a 16-bit key, where
the delay is quadrupled. This shows that employing the variable-period clock makes
the engine more than two times faster than using a fixed-period clock.
The speed of the SG is compared to that of other comparison-free sorting engines; the
comparison is summarized in Table 2. Only the sorting engine in [14] is implemented on an FPGA while the others are VLSI designs, which makes
direct comparison difficult. The number of cycles required
to complete the sorting operation is $\textit{k}$+1 for the SG, which is fixed regardless of
the number of inputs ($\textit{N}$), while it is proportional to $\textit{N}$ for
the other sorting engines. The total sorting time ($\textit{T}$$_{D}$) of the SG is then
determined only by the clock cycle time ($\textit{T}$$_{cycle}$), which is why the SG uses
the self-timed asynchronous clock to increase the sorting speed.
Table 2. Comparison of SG with other sorting engines

| | SG | in [11] | in [12] | in [13] | in [14] |
|---|---|---|---|---|---|
| Time complexity | O(N+k) | O(N+k) | O(N+2k) | O(N) | O(N) |
| Restriction | None | None | N > k | N ${\approx}$ 2$^{k}$ | None |
| Technology | 130 nm | 350 nm | 65 nm | 90 nm | FPGA (65 nm) |
| N | 1024 | 63 | 16 | 1024 | 1024 |
| k | 16 | 16 | 16 | 10 | 16 |
| No. of cycles | 17 (k+1) | 78 (N+k-1) | 48 (N+2k) | 2k~3k (2N~3N) | 1024 (N~2N) |
| $T_{cycle}$ | variable | 5 ns | 1.6 ns | 2 ns | 7.8 ns |
| Sorting time ($T_{D}$) | 1105.1 ns | 390 ns | 76.8 ns | 4~6 ${\mu}$s | 8 ${\mu}$s |
| $T_{D}$ / N | 1.08 ns | 6.19 ns | 4.8 ns | 4~6 ns | 7.8 ns |
|
Because the sorting engines use different key sizes and data sizes in the speed simulations,
the sorting time per element ($\textit{T}$$_{D}$/$\textit{N}$) is used to compare
their speeds. As shown in Table 2, the SG is the fastest among the comparison-free sorting engines: its sorting rate is
about 1 element per ns, while those of the other engines are 0.13 ~ 0.25 elements
per ns, making the SG more than 4 times faster.