|
|
| |
Porting BDT BenchmarkTM to VIRAM
Haiyun Tang and Ning Zhang
CS 252 Project Report, May 1998
1. Introduction
Custom ASIC, DSP processor, and general-purpose processor are three alternatives to implement DSP functions where the trade-offs involve performance, hardware development complexity, software development complexity, energy efficiency, and cost. Generally, custom ASIC has best performance and energy efficiency, while general-purpose processor has the best software support. DSP processor has the advantage of the both and is the most popular choice for DSP applications.
An innovative general-purpose architecture, the Berkeley VIRAM[1][2][3], embodies two advantages -- high memory bandwidth as provided by its on-chip DRAM and parallel processing capability as enabled by its vector architecture -- that are intrinsic to virtually all DSP applications. In this project we evaluate the performance of Berkeley VIRAM as a potential DSP candidate through a DSP benchmark suite from Berkeley Design Technology Inc.
The report is organized as following. In Section 2, we discuss the background of BDT's DSP benchmark approach and general specification for the benchmark. In Section3, we port individual DSP kernels selected in the benchmark to VIRAM instruction set, verify the functional correctness. In Section 4, we analyze the performance of VIRAM on the benchmark comparing to that of general purpose processors and DSP processors. In Section 5, we summarize the results.
2. BDT BenchmarkTM
BDT benchmark consists of a collection of application kernels. These are small fragments of DSP code that are commonly used in many DSP applications. Example kernels are FIR or IIR filters. The use of these small and common application kernels to benchmark processors has the following advantages. First, due to their moderate size, they can be hand-coded in assembly and the implementation can almost guarantee optimal for different processors so that the performance comparison is fair. Second, since those are key pieces of code that are commonly used in most DSP applications and where most of processing time is spent, they can be considered representative of most DSP applications.
The BDT benchmark suite consists of 11 application kernels including (1) vector addition, (2) vector dot product, (3) real single-sample FIR, (4) real block FIR, (5) complex block FIR, (6) LMS adaptive FIR, (7) IIR, (8) vector maximum, (9) convolutional encoder, (10) FSM, and (11) FFT. Due to the lack of specifications of the last two kernels, we are only be able to port the first nine of the kernels to VIRAM. In the following section we address the porting and VIRAM performance evaluation for the individual kernels.
BDT follows certain guidelines when porting their benchmark to different processors to ensure fair comparison. The primary goal of the implementation is maximum speed and the secondary goal is minimum memory storage. Only reasonable optimizations are allowed. Extreme optimizations that aren't likely to be practiced in real world development are prohibited. Examples of such extreme optimizations includes unlimited loop unrolling that results in unacceptable code size and benchmark-driven table look-up. The implementation should accommodate variable input vector length as if the input length is not known until run time. And finally, all the private state information must be loaded from and stored back to memory. We follow the same guideline during our porting process so that the performance comparison on the kernels between VIRAM and the other processors ported by BDT is meaningful.
3. Porting and Results for Individual Kernels
The porting assumes 16-bit fixed-point fractional arithmetic where the internal precision is 32-bit. The kernels are coded in assembly according to the VIRAM ISA. Each kernel is coded as a global function which can be called by a C main program for testing. Strip-mining is used to support variable vector length. Local optimizations like filling branch delay slot are performed whenever possible. The correctness of the code is checked using VIRAM ISA simulator. The following subsections address individual kernels.
3.1 Vector Addition
This kernel does a vector add, i. e.,
Two source vectors are loaded from memory, added together, and the resulting vector is stored back to memory. The kernel is used extensively in graphics and combining signals. BDT specifies that the vector length is 40 and since the maximum vector length for 16-bit operation on VIRAM is 256, the kernel is handled in one iteration through the strip-mining loop.
3.2 Vector Dot Product
The kernel calculates the dot product of two vectors, i. e.,
It is used extensively in computing convolution and correlation. In the strip-mining loop, the two source vectors are loaded, multiplied together, and the vector result is added into a vector base register. After exiting the strip-mining loop, we all the elements of the base register are accumulated and the scalar result is stored back. The vector length chosen by BDT is 40.
3.3 Real Single-Sample FIR
The above figure shows the block diagram of an N-tap FIR which is basically the dot product of the delayed input vector X and coefficient C, i. e.,
where the new input sample is X(n) and X(1),..., X(n-1) are the elements in the delay line. After calculating the result, the delay line is advanced by storing back X(2), X(3),..., and X(n) (and throwing away X(1)). The kernel is extensively used in speech processing and filtering. The BDT benchmark specifies 16 taps for this kernel.
3.4 Real Block FIR
The kernel is extensively used in speech processing. It basically repeats the single-sample FIR calculation for a certain number of input samples. The delay line should be advanced upon finishing the calculation of the last sample. The BDT benchmark specifies 40 input samples for this kernel and for each of the samples the delay line is 16-tap deep.
3.5 Complex Block FIR
The kernel is used in modem channel equalization. It is similar to the real block FIR except that the input, output, and filter coefficients are all complex. The core operation here is the complex multiplication which is implemented as 4 real multiplications plus 1 real addition and 1 real subtraction. There are 40 samples and 16 taps per sample.
3.6 LMS Adaptive FIR
This kernel is used in channel equalization and linear predictive coding. The whole operation of the kernel is depicted in the following two equations
The first part of the kernel is a single-sample FIR. Then the result is used to update the filter coefficients (rounding required) as shown in the second equation. There are 16 taps for this kernel as specified in BDT benchmark.
3.7 IIR
This kernel is used in audio processing and general filtering. The top of the above graph shows the block diagram for a direct form second-order IIR filter (it directly corresponds to the equation below). After interchanging the two sections in this linear system, one set of delays could be eliminated from the filter structure. The bottom of the above graph shows the equivalent canonical form which is adopted in the implmentation. The operation of the kernel can be expressed as
where X(n) is the new input sample, n is the filter order, a's are poles, and b's are zeros. The filter order for this kernel according to BDT is 16. Extra scaling step are performed to handle large coefficient values as specified by BDT (on fixed-point processors).
3.8 Vector Maximum
The kernel is mostly used in error control coding. According to BDT, the processor must scan through a vector of 40 elements and find the largest element and its position within the vector. Strip-mining is used to load the vector and in each iteration of the strip-mining loop log2-reduction is used to find the maximum element and its index of the iteration.
3.9 Convolutional Encoder
This kernel is a bit-manipulation function used in error correction coding for IS-54 digital cellular telephone standard. The block diagram of the kernel is show below.
The input has 89 bits, each has a 5-element delay line. The output g0 and g1 are calculated as shown in the figure. g0 and g1 are interleaved in the output.
4. Results and Discussion
Without a cycle-level VIRAM simulator, the execution cycle counts of the kernels are estimated by hand calculation. The calculation is based on the following assumptions of the hardware architecture:
(1) there are four 64-bit vector processing lanes with one load, one store, and one arithmetic unit per lane;
(2) no start-up latency;
(3) no chaining;
(4) single issue per cycle;
(5) each scalar instruction takes one cycle;
(6) each vector instruction that has no vector length dependence takes one cycle;
(7) for vector instruction that does depend on vector length, the number of cycles takes to finish the instruction is calculated by number of lanes dividing vector length (rounding up if there is an remainder);
(8) Without dependence, scalar and vector instructions can proceed execution in parallel once issued so that the execution time of the longer one hide that of the shorter one.
Cycle Counts
|
|
VIRAM
|
GPP
|
DSP
|
|
Vector Addition
|
17 (8)
|
89--255
|
40--168
|
|
Vector Dot Product
|
61 (24)
|
48--325
|
32--73
|
|
Real Single FIR
|
31 (22)
|
33-145
|
25--58
|
|
Real Block FIR
|
1046 (930)
|
905--3785
|
347--1264
|
|
Complex Block FIR
|
2088 (1015)
|
2941--14185
|
1307--4080
|
|
LMS Adaptive FIR
|
43 (32)
|
72--290
|
43--117
|
|
IIR
|
49 (28)
|
59--251
|
38--119
|
|
Vector Maximum
|
48 (30)
|
130--345
|
42--309
|
|
Convolutional Encoder
|
64 (19)
|
191--540
|
91--999
|
Table 1 shows the cycle counts for the kernels we implemented. There are two numbers in the VIRAM column for each kernel. The first one the cycle counts for the vector instructions. The one in the parentheses is the number of scalar instructions. If we assume the execution of the scalar instruction can be completely hidden (no data dependence) in that of the vector instructions, the vector execution cycle counts are the true running cycle counts for the kernels. Table 2 shows the normalized cycle counts. The performance of VIRAM is impressive. As a general purpose processor, it almost achieves the performance of the best DSP processor (TI TMS320C62xx which has 8 execution units that include 2 multipliers and 4 ALUs).
.
Normalized
Cycle Counts
|
|
VIRAM
|
GPP
|
DSP
|
|
Vector Addition
|
0.15
|
0.78--2.24
|
0.35--1.48
|
|
Vector Dot Product
|
0.59
|
0.47--3.15
|
0.31--0.48
|
|
Real Single FIR
|
0.59
|
0.63--2.75
|
0.51--0.53
|
|
Real Block FIR
|
0.73
|
0.63--2.64
|
0.24--0.76
|
|
Complex Block FIR
|
0.42
|
0.60--2.88
|
0.27--0.83
|
|
LMS Adaptive FIR
|
0.38
|
0.64--2.59
|
0.38--1.00
|
|
IIR
|
0.47
|
0.57--2.43
|
0.37--1.15
|
|
Vector Maximum
|
0.33
|
0.88--1.40
|
0.29--2.10
|
|
Convolutional Encoder
|
0.14
|
1.04--1.19
|
0.42--2.20
|
|
Total
|
3.81
|
6.39--21.13
|
3.14--10.54
|
We had some trouble during the original implementation of the kernels and the performance of VIRAM was not as expected. We found that the element-wise operations on vector are costly. There was a major problem involving accumulating all the elements in a vector of variable length which is required in 6 of the 9 kernels. To maximally exploit the hardware parallelism, we used log2-reduction adding. But this requires that the initial length of the vector is multiples of 2. If we just simply set the accumulation length to be the maximum vector length, for example 128 for 32-bit vector width, then if the input vector length is small (for example 40 in Vector Dot Product or 16 in Real Single-Sample FIR), we are actually wasting cycles to do useless work.
After comparing several different schemes, we decided to use a look-up-table approach to minimize the accumulation cost. We keep a look-up table of 512 elements (maximum possible vector length) where the value of element i is expressed as
Since the use of such a table is not limited to BDT benchmark kernels, it should not be considered against BDT's benchmark guideline. The initial operation length of a vector is thus expanded to the nearest 2's multiple using the look-up table.
When the number of remaining elements in the vector is less than the number of elements can be parallel processed (number of lanes times number of elements per lane), it may not be wise to use the vector processor since some hardware resource is wasted. It may be worthwhile to switch to the scalar processor when the remaining number of elements in the vector to be accumulated is less than certain threshold. This is especially useful when the vector processor has other tasks waiting while the scalar processor is idle. It is not a concern in our case since we are considering a single kernel each time. Switching to scalar processor below certain threshold is not faster than directly using the vector processor since the scalar execution can not be hidden due to data dependence.
We also encounted problems when implementing the Vector Maximum kernel. Our original implementation extract each element into a scalar register and then do the comparison. This way we didn't exploit at all the hardware parallelism provided by the vector architecture. In the current implementation, we keep the values in a vector register while the corresponding indexes in another. We perform the log2-reduction comparison on the "value" register and screen out the bigger values by flag masking. We perform the same flag masking operation on the "index" register to keep the indexes updated. The resulting implementation has almost the best performance among all the processors.
.
Cycle Counts per Element
|
|
VIRAM
|
DSP
|
|
Vector Addition
|
0.25
|
0.75--4
|
|
Vector Dot Product
|
0.5
|
0.5--1
|
|
Real Single FIR
|
0.625
|
0.5--1
|
|
Real Block FIR
|
0.5
|
0.5--1
|
|
Complex Block FIR
|
1.5
|
2--4
|
|
LMS Adaptive FIR
|
1.25
|
1.125--5
|
|
IIR
|
1
|
1.5--7
|
|
Vector Maximum
|
0.44
|
0.75--6
|
Table 3 shows the performance comparison in the limit of infinite vector length. When the input vector length is very big, the fixed initialization and termination costs (for example, log2 reduction at the end) become negligible and the hardware resource is fully utilized. The performance comparison thus boils down to comparing the number of aviable parallel hardware units. Since the VIRAM has duplicated processing units, it performs best in most of the kernels. The disadvantage of VIRAM is that it needs two cycles to execute a MAC operation while most DSP processors support a single-cycle MAC. This is why some DSP processors can still meet the VIRAM performance in the limit of infinite vector length. (By chaining vector instructions, the performance of VIRAM could be boosted.)
Since MAC operation is crucial in most DSP applications (it is actually the core operation of 6 of the above kernels), adding hardware support for MAC could be very beneficial. The hardware cost is not very much. For example, a particular design in 0.6um technology, with supply voltage of 1.5V, and clock rate of 25MHz, a 16-bit by 16-bit MAC occupies about 45% more area and consumes about 36% more power compared to a 16-bit by 16-bit multiplier. (A 16-bit by 16-bit -> 40-bit MAC has area of 0.86mm2 and power of 2.6mW.) The benefits of having the MAC unit are: first, it reduces multiply-accumulate operation cycle counts by half; second, since there are fixed number of lanes, after executing MAC instruction on two vectors, the results from each lane could be extracted to scalar registers and added together, in a sense it does the log2 reduction for free and there is no need for look-up table. (No chaining is assumed for VIRAM. With chaining, the result of the prior multiply is added every clock cycle so that MAC will take only one cycle.)
If there is instruction set power characteristics for VIRAM, we would like to estimate and compare the energy efficiency on the benchmarks with GPP and DSP.
5. Summary
In this project, we studied the BDT benchmark; hand-coded and -optimized 9 of the 11 kernels; verified the functional correctness of the code using VIRAM ISA simulator; and estimated and analyzed the performance of VIRAM on the kernels. Our implementation shows as expected that the most of DSP application kernels are intrisically vectorizable and from the above analysis, we believe VIRAM is a very competitve candidate for DSP applications.
Reference
[1] D. Patterson et al., "A Case for Intelligent RAM," IEEE Mirco, Mar./Apr. 1997, pp. 34-44
[2] D. Patterson et al., "Intelligent RAM (IRAM): chips that remember and compute," 1997 ISSCC Digest, pp. 224-225
[3] E. Christoforos et al., "Scalable Processors in the Billion-Transistor Era: IRAM," IEEE Computer, Sep. 1997, pp. 75-78
[4] "DSP on General-Purpose Processors", Technical Report, BDTI
[5] "Buyer's Guide to DSP Processors", Technical Report, BDTI
| |
|
|