Abstract
A hierarchical automated design flow for low-energy
direct-mapped signal processing integrated circuits is presented. A modular framework based on a combined
Simulink and floorplan description drives automatic layout generation. Automatic characterization of layout
improves system-level estimates. The
flow is demonstrated on the subsystems of CDMA and OFDM receivers and a 300k transistor test-chip.
Introduction
The signal processing architectures that are the
most efficient in terms of power and area can be obtained by directly mapping
the algorithms into hardware. In this
way, the maximum parallelism can be obtained, allowing the minimum clock rate
and supply voltage to be used, resulting in reduced energy per operation. Direct-mapped design requires an initial
description which allows specification of algorithmic parallelism as well as
control structures. Typical metrics
that can be achieved with this approach are computational efficiencies of
100-1000 MOPS/mW with computation densities of 100-1000 MOPS/mm2 .
These efficiencies are 2 to 3 orders of magnitude higher than can be achieved
by software processors (1). This
efficiency makes these architectures especially attractive for wireless
systems.
In spite of the enormous advantage of this type of
architecture, it is not commonly used unless the application cannot be
accomplished by any other means. Direct-mapped IP cores for some communication
system components (such as Viterbi decoders) are readily available, but most of
today’s wireless devices are based on programmable solutions. Sequential execution destroys the
parallelism inherent in algorithms and wastes the opportunity to save power by
lowering the supply voltage. The complexity of algorithms for future wireless
receivers is so great that they are not presently feasible with programmable
solutions (2).
Direct mapping is typically avoided because of the
indeterminism of design time for such highly dedicated chip solutions. This is due in part to the unpredictability
of timing closure (3) but is exacerbated by the lack of a consistent flow from
an initial system level description through to the physical layout. A standard
design flow has three phases which are typically handled by three different
engineering teams. System designers
conceive the chip and deliver a specification to ASIC designers, often in the
form of a Matlab or C simulation. This
system level description can be used in a simulation to generate system
characterizations such as a bit-error-rate vs. signal-to-noise ratio (BER vs.
SNR) curve for a communications chip. The ASIC designers map the simulation to
VHDL or Verilog code that they deliver to the physical design team. Physical designers synthesize the code to a
standard-cell library and use place route tools to generate layout mask
patterns for fabrication. This
procedure requires 3 translations of the design with requirements for
re-verification at each stage. Opportunities for algorithmic modifications to
reduce power and area are lost, since the system designer does not have
information about physical performance bottlenecks.
Recent studies of improved design methodologies stress
ASIC designer control over physical structure (4) and early floorplanning
(5). To achieve the full benefit of
direct-mapping, however, control of the physical structure and early
floorplanning is needed by system designers as well. To accomplish this, a new kind of design environment is needed
which offers fast, automatic generation of physical information from a
system-level description. This
contrasts the majority of recent work, such as (3), which focuses on automatic
generation of physical information from register-transfer level (RTL)
descriptions.
The goal of the design environment described here is to
take an initial description that is appropriate for a signal-processing
algorithm (including data-flow and control) and to automatically generate a
direct mapped architecture. The
MathWorks’ tool Simulink was chosen as the basis of this description because it
captures algorithmic parallelism and is tightly integrated with the popular
system design tool Matlab. A framework
for module generators aids the creation of raw physical design information, and
a design flow for fast creation of mask-layout is supported. This paper presents this framework and flow
with some examples of their usage.
Chip-in-a-Day Design Flow
This environment accelerates the design process by
automating the design flow to mask layout from a single description. This description encapsulates the important
decisions typically made by system, ASIC and physical designers
separately. These design decisions are
divided into the following types:
·
Function Decisions – basic block diagram and
input-output behavior of each block
·
Signal Decisions – physical signals specifications
·
Circuit Decisions – specification of the
transistors to implement each block
·
Floorplan Decisions – physical location specifications
By pulling all of these decisions into a single
description, system, ASIC and physical designers can more easily come together
to make the trade-offs necessary to optimize the design. In order for the designers to understand
these trade-offs, an accurate method of estimating power, area and delay from
the unified description is needed. To
facilitate these estimates, our design flow assumes that each functional block
corresponds to a unit of layout (sometimes called a “hard macro”) which has
been extracted and characterized for power, area, and delay. These functional blocks are the leaf-cells
of the design hierarchy and will be simply called “macros” for the remainder of
this paper.
As the design progresses, the system becomes larger, and
interconnect capacitance makes these estimates less accurate. We therefore use the automated flow to
produce layout, extract and characterize portions of the design hierarchy. This process is called “hardening the
hierarchy”. As illustrated in Fig. 1,
hardening creates a new hard macro which can be instantiated in the design to
improve estimation accuracy. The design
converges manually as the hierarchy is hardened rather than relying on a
timing-driven flow to manage constraints.
Once the entire design hierarchy is hardened in this way, the chip is
ready to be fabricated.
Fig. 1: Hardening the
Hierarchy
Signal processing architectures are typically data-path
heavy. Control logic accounts for less
than one-tenth the power and area of the data-path logic for baseband signal
processing in wireless sytems. Simulink
offers a number of primitives such as adders, multipliers, and switches which
allow easy description of a data-path.
Fig. 2 shows shows a simple Simulink implementation of an
Add/Compare/Select (ACS) block from a Viterbi decoder. In addition to data-path primitives,
Simulink also offers a control primitive called Stateflow. Stateflow is a graphical language for
describing control flow with state machines.
Fig. 3 shows an example of a simple Stateflow chart. Using the data-path and control primitives
provided with Simulink, it is possible to create a discrete-time simulation
which completely specifies the behavior of a system.

Fig. 2: Simulink
implementation of an ACS block from a Viterbi decoder

Fig. 3: An example of a
Stateflow chart that counts from 0 to 7
and restarts from zero if the input signal sat_en is low or stays
at 7 while sat_en is high
Signal decisions are made through the use of
Simulink’s Fixed-Point Blockset. This
blockset allows the replacement of normal Simulink primitives with versions
that truncate values according to fixed-point types of arbitrary
word-lengths. These signal decisions
are captured by a tool developed in-house which creates a hierarchical EDIF
file from a Simulink system. Another
type of signal-level optimizations is use of multiple clock domains. The design flow recognizes the Simulink
Enable block as a gated clock specification and builds a clock tree
automatically.
Circuit decisions are
encapsulated through the specification of a parameterized generator for each
macro. Simulink parameters can be
passed to the macro generators, thus allowing exploration of design trade-offs
from Simulink. Each generator produces
a netlist of cells for which layout and schematic views exists. There are three main types of macros
supported by our flow:
·
Data-path Macros – Ranging in complexity from arithmetic blocks (such as adders,
registers, and multpliers) to complex pipelines, these macros are typically
implemented with Synopsys’ Design Compiler or Module Compiler or semi-custom
tiled layout generators.
·
Stateflow Macros –
Stateflow charts are translated to VHDL by a tool developed in-house. This VHDL code is then synthesized with
Design Compiler.
·
Composite Macros – These are portions of a design hierarchy which have been hardened.
Specification of physical
placements and other floorplan decisons in Simulink is inconvenient, so one
additional input description, the floorplan, is required. The Simulink system and floorplans are
developed side by side. When a block is added in Simulink, the macro generators
execute automatically, and an unplaced block will appear in the floorplanning
tool. The main design decisions
encapsulated by the floorplan are the boundary for the cell and the
standard-cell rows for automatic placement.
Timing Issues
One of the greatest difficulties of digital design is
timing closure. Timing closure is the
process of making sure that the layout meets the timing constraints assumed in
the initial description. The difficulty
arises from the fact that interconnect capacitance is unknown at design
time. The most common problems that
arise during hardening are failure to meet cycle-time goals and generation of
clock-trees with excessive skew
Cycle time goals can be met by improving the accuracy of
pre-layout delay estimates. Several
aspects of our flow allow us to improve this accuracy. First, early specification of the floorplan
provides detailed placement information to estimation tools. This placement information can be used to
estimate interconnect capacitance more accurately than statistical wire-load
models. When combined with macro timing
models, these capacitance estimates improve delay estimates. Secondly, our
focus on reduced supply voltages and low clock rates reduces the significance
of distributed RC effects. Fig. 4 shows
simulated results for an inverter with a distributed RC load. For lower supply voltages, the delay at the
driver (logic delay) grows while the additional RC delay of the interconnect
remains relatively constant.

Fig. 4: Logic and interconnect
delay scaling for an inverter
with a distributed RC load in
a 0.25 mm technology.
A tool is being developed in-house for the purpose of
automatic, hierarchical insertion of the clock tree during hardening. This tool uses a tree topology because the
lower wire capacitance leads to less power consumption. This approach does require optimization and
balancing. The clock is inserted into
the floorplan and balanced for a target non-zero skew. The amount of allowed clock skew is
determined by the inherent race immunity of the flip-flops, which is given by
the difference between the clock-to-output delay and the hold time. For lower supply voltages, flip-flops are
more race immune, allowing more clock skew.
The cycle-time penalty of large skew is small, given the low clock rates
of interest.
Examples
To demonstrate the power of this approach, we present here
five complex macros and a test-chip designed with simple macros. All examples were designed in Simulink out
of macros for which the generator had already been established as functionally
equivalent to its corresponding Simulink model. Test-vectors from the Simulink simulation were saved and used for
power estimation.
At this point, the automated flow takes over and invokes
each macro generator. The resulting
hierarchy is then hardened by merging the placement information in the
floorplan and routing with Cadence’s IC Craftsman. The resulting layout is verified with Calibre design rule and
layout vs. schematic checks (DRC & LVS), and parasitics are extracted with
Arcadia. EPIC PowerMill simulations of
the extracted netlist then characterize the power consumption of the layout
using the Simulink test-vectors, and
EPIC PathMill finds the critical-path delay.
The macros presented here were designed with Synopsys’
Module Compiler to be used in one of two receiver systems under investigation
for indoor wireless networks. The first is a CDMA system described in (6). The synchronization subsystem shown in Fig.
5. is intended to provide coherent timing recovery and code acquisition for a
stream of soft symbols. The white
blocks in Fig. 5 were hardened with results given in Table 1. The results show large discrepancies of
critical-path delays. These
discrepancies represent flexibility to the system designer when making
performance trade-offs.
Fig. 5: Synchronization
Subsystem for a CDMA Receiver
Table
1: Results of hardening the CDMA system macros
|
|
|
decimation
filter
|
adaptive
pilot detect
|
frequency
offset estimator
|
|
area
in 0.25 mm
|
1.4
mm2
|
1.2
mm2
|
0.22
mm2
|
|
power
@
25 MHz
|
2.5
V
|
120
mW
|
88
mW
|
5.2
mW
|
|
1.0
V
|
13
mW
|
7.6
mW
|
0.47
mW
|
|
critical-path
delay
|
2.5
V
|
5.6
ns
|
10
ns
|
5.2
ns
|
|
1.0
V
|
18
ns
|
32
ns
|
17
ns
|
|
|
cells
|
21
k
|
15
k
|
3500
|
|
|
transistors
|
240
k
|
220
k
|
37
k
|
|
|
|
|
|
|
The second system
under investigation is an orthogonal frequency-division multiplexing (OFDM)
system with 128 sub-carriers with a target aggregated sample rate of 25
Msamples/sec. Two dominant
computational blocks in an OFDM receiver are fast Fourer transform (FFT) and
Viterbi decoder (7). A pipelined FFT architecture with 16-bit precision and
single-path delay feedback was used for the macro in Table 2. The 64-state, 8-level-soft-input Viterbi
decoder in Table 2 has a parallel state metric update unit, which uses modulo
arithmetic based normalization scheme.
The register-exchange method is used for survivor path decoding.
Table
2: Results of hardening the OFDM system macros
|
|
|
FFT
|
viterbi
decoder
|
|
area
in 0.25 mm
|
1.4
mm2
|
0.71
mm2
|
|
power
@
25 MHz
|
2.5
V
|
150
mW
|
69
mW
|
|
1.0
V
|
16
mW
|
7.0
mW
|
|
critical-path
delay
|
2.5
V
|
20
ns
|
4.8
ns
|
|
1.0
V
|
63
ns
|
15
ns
|
|
|
cells
|
19 k
|
10
k
|
|
|
transistors
|
270
k
|
130
k
|
|
|
|
|
|
All results listed here were generated in less than 24
hours for each macro. Smaller macros
ran through the flow in 3 hours. Most of the time is spent routing and
extracting.
To verify the approach of this flow, a test-chip was
developed based on the decimation filter block described earlier. Unlike the complex macro above, the test
chip was designed with several hundred hard macros placed manually in 3 layers
of physical hierarchy. The design
contained 307 k transistors and was fabricated in a 0.25 mm
technology. Table 3 shows the
performance evaluation of the chip at 1.0 V and 25 MHz. The table shows very close agreement between
Simulink and layout estimates and measured values of power and delay. The area discrepancy arises from the fact
that the hard macros could not be abutted, leading to a wasted space. The performance of this chip relative to the
Module Compiler macro indicates that this level of detailed floorplanning is
not necessary for this particular structure.
Table 3: Comparison of
test-chip performance estimates
|
|
Simulink
|
layout
|
actual
|
|
area
|
2.0
mm2
|
6.8
mm2
|
6.8
mm2
|
|
power
|
16
mW
|
17
mW
|
14
mW
|
|
criti.-path
delay
|
19
ns
|
21
ns
|
in
progress
|

Fig. 6: Die photo of the
Decimation Filter Test-Chip
Conclusions and
Future Work
The success of the test-chip and ease of the macro
hardening flow are encouraging. The
next step is to apply the flow to the design of systems in the 1M to 10M
transistor range. The most difficult
aspect of this flow is the verification of functional equivalency of macro
generators and their Simulink models.
As macros become more complex, more opportunities for discrepancy arise,
leading to potential problems when macros are combined. Future work will focus on comparisons of the
estimates gained from this approach to estimates made with other system-level
design methods. Also, much more
investigation is needed into the level of detail needed during floorplanning
and into which macro granularities scale best to future process generations.
Acknowledgements
Many thanks to Cadence Design Systems, Ericsson and Lucent
for their support, and to ST Microelectronics for fabricating our test
chip. Thanks to Hayden So, Ben Coates,
and Dave Wang for their help developing the flow, and to Paul Hustead, Josie
Ammer, and Mike Sheets for using the flow and providing valuable feedback.
References
(1)
R.
Brodersen, “The network computer and its future,” Proc. of the 1997 International
Solid State Circuits Conference.
(2)
N.
Zhang, et al., “Trade-offs of performance and single chip implementation of indoor wireless
multi-access receivers,” Proc. of the 1999 Wireless Communications and
Networking Conference.
(3)
D.
E. Lackey, “Applying placement-based synthesis for on-time system-on-a-chip
design,” Proc. of the 2000 Custom Integrated Circuits Conference.
(4)
W.
J. Dally, A. Chang, “The role of custom design in ASIC Chips,” Proc. of the 2000 Design Automation
Conference (DAC).
(5)
J.
S. Yim, et al., “A floorplan-based planning methodology for power and
clock distribution in ASIC’s,” Proc. of the Design Automation Conference,
New Orleans, LA, 1999.
(6)
C.
Teuscher, et al., "Design of a Wideband Spread Spectrum Radio Using
Adaptive Multiuser Detection," Proc. of the 1998 International
Symposium on Circuits and Systems.
(7)
N.
Weste, D. J. Skellern, “VLSI for OFDM,” IEEE Communications Magazine,
p.127-31, vol.36, (no.10), Oct. 1998.