Comparison of Low-Swing Driver/Receiver Circuits for
Reconfigurable Interconnect
1 Introduction
As described in our midterm
report, this project has focused on global interconnect in a chip
with a core processor and several satellite processors. This work was
inspired by the Pleiades
project at UC Berkeley which aims to reduce power consumption
of multi-media processors through use of this "domain specific"
architecture. The interconnect between the processors is required to
be reconfigurable to allow "on the fly" speed and power optimization.
Furthermore, the interconnect must support a self-timed signaling
scheme i.e. no system clock. Our goal has been to examine the
tradeoffs in the design of such a system and propose a course of
action to the Pleiades project team.

Figure 1: General Interconnect Network Model
The general form of our architecture was determined early in this project and
is shown in Figure 1. Note that the system uses a segmented bus with
programmable interconnect points to connect each processor to the bus.
Because of our focus on low-power design, we considered each interconnect
point as a switch with no driving capability. This lack of repeaters in
the global interconnect limits us to relatively low speeds, with delays in
the range of 5 to 20 ns for wires 2 to 10 mm long. It also led us to
develop the bit-line model shown in Figure 2 which consists of a driver, a
receiver, and three distributed RC networks separated by two pass
transistors.

Figure 2: Circuit Diagram for Performance Analysis
The project has not proceeded as originally planned in the midterm
report. We had originally focused on the design of individual
bit-lines and promised to examine the design tradeoffs of various
driver/receiver circuits in terms of power, speed, noise, and area.
As we developed metrics for area and noise performance, we discovered
that a number of system level parameters, such as line pitch, could
have a major effect on the choices of certain bit-line parameters,
such as voltage swing. Thus, much of this report covers the
system-level design methods we developed to help us make fair
comparisons between the driver/receiver circuits.
This report seeks to answer the following questions:
- Can low swing interconnect reduce power?
- Can methods be found to handle the problem of noise?
- How much area would such an interconnect scheme need?
- Are the interconnect schemes examined in this report feasible, in the long
run?
We approach these questions by examining the system design and
bit-line design individually. Section 2 will examine how area and
noise in the system are affected by such parameters as line pitch, line
width, and pass transistor design. Section 3 compares the different
driver/receiver pairs which determine the delay and power consumption
of the system. Section 4 summarizes our findings and addresses the
question of feasibility.
2 System Design Analysis
There is an inherent tradeoff in this system between area and noise
sensitivity. The area required for driver/receiver circuits will
probably be insignificant compared to the area needed for the long
interconnect wires. The distance between the wires (or line pitch), the
width of the wires, and the size/placement of the pass transistors are the
key parameters to consider.
Because the interconnect must support a self-timed signaling scheme,
our noise requirement is that the system must be glitch free. Glitching
is caused by noise spikes which vary in size depending on wire
capacitance, coupling capacitance, wire resistance, signal swings and
rise/fall times. With the exception of signal characteristics, all of
these parameters can be calculated when given the line pitch, line
width, and pass transistor geometry. We therefore assume in this section
that signals can be described by a voltage swing and a rise/fall time. These
parameters are considered design criteria for the line drivers and otherwise
have no effect on area and coupling noise. We must also assume that the
receiver can be described by a parameter VPmax, which represents
the size a noise spike must reach before becoming a glitch.
We will begin by discussing the layout of the programmable
interconnect matrix which we used to derive our basic model. Then, we will
discuss our methods for determining coupling capacitances. Lastly, we will
describe how we cope with coupling noise and supply noise.
2.1 Programmable Interconnect Matrix (PIM) Layout
The PIM layout in Figure 3 shows the NMOS pass transistors which connect a
four-bit bus of global wires (horizontal, metal 2) to a four-bit bus of local
wires (vertical, metal 1). The poly-silicon gates are shown disconnected in
the figure, but would conceivably be all tied together.
The term programmable interconnect matrix (PIM as opposed to
PIP) stems from the fact that only one memory cell is needed for a
number of bit-lines. Area must be set aside for the memory, but the
memory circuit will be inactive during normal operation and will thus
not contribute to coupling noise. A memory cell could be made out of
metal 1 and poly-silicon alone and tucked under metal 2 wires. Since
global wires are assumed to take up large area, the memory area is
considered to be negligible.

Figure 3: Four Bit PIM Layout
This layout is the most dense possible for the desired
interconnections using the MOSIS 0.6um technology. It is based on a
cell which is 16 lambda (4.8um) wide and 7 lambda (2.1um) high. Line
pitch can be increased but not decreased. This gives us a way of
calculating the minimum area of the PIMs. Assuming 16 bit buses with
2 signaling wires, 6 processors and 4 channels, the area will be
AreaPIM = (4.8 um x 18 wires) x (2.1 um x 18 wires) = 3270 um2
AreaTotal = 3270 um2 x 6 processors x 2 PIMs/proc. x 4 channels = 157000 um2
Here we have assumed that each processor contributes one PIM for the
driver and a second for the receiver. Note that the size of the pass
transistor does not add to the total area unless it exceeds a maximum.
This width can be calculated as
Wmax = (2.1 um x no. of wires) - 1.2um = 36.6 um
If transmission gates are used (i.e., PMOS transistors in addition
to NMOS transistors) then 3.6 um must be subtracted for diffusion spacing.
In this case, 33.0 um of total pass transistor width can be inserted without
affecting the area. With this layout in mind, we can see the primary sources
of coupling noise, namely the neighboring wires and the cross-over wires.
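The area and width arithmetic above can be reproduced with a short script. This is a minimal sketch, assuming the 4.8 um x 2.1 um cell and the 18-wire bus (16 data bits plus 2 signaling wires) stated in the text; the function names are illustrative and not part of the original work.

```python
# Sketch of the PIM area estimate; cell dimensions are taken from the text.
CELL_WIDTH_UM = 4.8   # 16 lambda in the 0.6 um process
CELL_HEIGHT_UM = 2.1  # 7 lambda in the 0.6 um process


def pim_area_um2(n_wires):
    """Minimum area of one n_wires x n_wires PIM, in square microns."""
    return (CELL_WIDTH_UM * n_wires) * (CELL_HEIGHT_UM * n_wires)


def total_pim_area_um2(n_wires, n_processors, pims_per_processor, n_channels):
    """Total PIM area summed over all processors and channels."""
    return pim_area_um2(n_wires) * n_processors * pims_per_processor * n_channels


def max_pass_width_um(n_wires, transmission_gate=False):
    """Largest total pass-transistor width that fits without growing the PIM."""
    width = CELL_HEIGHT_UM * n_wires - 1.2
    if transmission_gate:
        width -= 3.6  # extra diffusion spacing when PMOS devices are added
    return width


if __name__ == "__main__":
    wires = 16 + 2  # 16 data bits plus 2 signaling wires
    print(f"PIM area:   {pim_area_um2(wires):.0f} um^2")
    print(f"Total area: {total_pim_area_um2(wires, 6, 2, 4):.0f} um^2")
    print(f"Wmax NMOS:  {max_pass_width_um(wires):.1f} um")
    print(f"Wmax TG:    {max_pass_width_um(wires, True):.1f} um")
```

Changing `wires` or the processor and channel counts shows how quickly the PIM area grows, since the area of each PIM scales with the square of the bus width.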
2.2 Methods for Calculating Coupling Capacitances
The crosstalk noise between any two metal lines can be viewed as the
superposition of capacitive coupling and inductive coupling.
In integrated circuits fabricated in processes with 0.35 um to 1.0 um
gate lengths, currents are very small and switching speeds are high, so
capacitive coupling tends to dominate. Therefore, the capacitive
coupling between metal lines can be modeled as a single capacitor
instead of a distributed complex impedance [8]. One should also note
that as the metal wires become narrower, the lateral capacitances
Ca1 and Cb1 tend to become larger than the wire-to-substrate capacitance, as shown in
Figure 4.

Figure 4: Parasitic Lateral Capacitances
The packing materials also contribute second-order effects to the
parasitic capacitances. A study in [7] shows the importance of these
effects and incorporates into its coupling capacitance model both the
nonlinear (power-law) dependence on spacing and the exponential
behavior of field sharing as the width of the metal wires increases.
The model is as follows:
$$C_c = \left[k_1 + k_2\left(1 - e^{-w/m}\right)\right]\left(\frac{S}{h}\right)^{-n} \qquad (1)$$
The coefficients n, m ,k1 and k2 are all related to the packing
materials used where n is the nonlinear spacing coefficient,
m is the exponential curve parameter for width dependency and
k1 and k2 are surrounding geometric dependent constants.
For more details refer to [7]. S, w, and h are the spacing
between the wires, the width of the wires, and the height, respectively.
The values for the parameters above were chosen with consideration to
typical values for an HP 0.6 um CMOS process supplied by MOSIS. The
following values are therefore obtained from [7]:
- k1 = 4.69e-5 pF/um
- k2 = 3.96e-5 pF/um
- m = 2.35 um
- n = 0.83
- h = 1um
Figure 5 demonstrates how the cross-coupling capacitance varies
with the width of the wires and the spacing between the
wires. For this particular model, spacing the wires closer than 3 um
results in a rapid increase in the capacitance.

Figure 5: Coupling Capacitance as a Function of Spacing and Width
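The trend plotted in Figure 5 can be explored numerically by evaluating equation (1) directly. The following is a minimal sketch using the parameter values listed above; the function name and the particular spacings and widths swept in the example are illustrative choices, not values taken from the original figure.

```python
import math

# Parameter values quoted in the text for the HP 0.6 um process, from [7].
K1_PF_PER_UM = 4.69e-5
K2_PF_PER_UM = 3.96e-5
M_UM = 2.35    # exponential curve parameter for width dependency
N_EXP = 0.83   # nonlinear spacing coefficient
H_UM = 1.0     # height


def coupling_cap_pf_per_um(spacing_um, width_um):
    """Lateral coupling capacitance per unit length, equation (1)."""
    if spacing_um <= 0 or width_um < 0:
        raise ValueError("spacing must be positive and width non-negative")
    width_term = K1_PF_PER_UM + K2_PF_PER_UM * (1.0 - math.exp(-width_um / M_UM))
    return width_term * (spacing_um / H_UM) ** (-N_EXP)


if __name__ == "__main__":
    # Illustrative sweep: rows are spacings, columns are wire widths.
    for spacing in (1.2, 3.0, 10.0):
        row = [coupling_cap_pf_per_um(spacing, w) for w in (0.9, 1.2, 1.5)]
        print(f"S = {spacing:4.1f} um:", ", ".join(f"{c:.3g} pF/um" for c in row))
```

The capacitance falls off as a power law in spacing and saturates with width, which is why tight pitches are penalized much more heavily than wide wires.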
If we examine one bit line in a channel of the architecture shown in
Figure 6, the wires which may contribute coupling capacitance are all
of the neighboring wires and the orthogonal wires which cross the bit
line on a different metal layer. Each bit line has 48 orthogonal wires (3 channels
x 16 bits), 12
transmission gates (only 2 of which are active at any time), and at
least one neighboring wire. Since a handshaking signal is
routed along with every channel, there are an additional 3
orthogonal wires (one for each channel), which increases the total to 51.
If a coupling
capacitor is placed between the bit line and every neighboring and orthogonal wire,
the complexity of the noise model increases dramatically.
Valid simplifications are therefore in order.

Figure 6: Coupling Capacitances of Architecture
The capacitance created between two orthogonal wires is dependent on
the amount of area in which the two overlap. Using area and fringing
capacitances between metal 1 and metal 2 obtained from MOSIS for a 0.6um HP
process, Table 1 is generated. This table
shows the coupling capacitance produced for various wire
widths.
Table 1: Coupling Capacitance as a Function of Spacing and Width
| Width of Wire | Coupling Capacitance |
|---|---|
| 0.5 um | 12.25 aF/um |
| 1 um | 40 aF/um |
| 1.5 um | 83.25 aF/um |
| 2 um | 142 aF/um |
| 2.5 um | 216.25 aF/um |
| 3 um | 306 aF/um |
The coupling capacitance induced between two wires of width 2um in the
same metal and spaced 10um apart is 10pF/um. The total orthogonal
capacitance experienced by one bit line of width 2um is (51*142) =
7.2fF/um. The capacitance induced by all of the orthogonal wires is
insignificant in comparison to that produced by the neighboring wire.
Therefore, for simplicity, only coupling capacitances between two
wires in the same metal will be explored further as potential noise
sources.
2.3 Methods for Handling Coupling Noise
From the perspective of a designer, we would like to examine our layout
and be able to calculate whether or not a glitch is possible. We would
also like to have some idea of how much "noise headroom" we have. In order
to determine these aspects, we examined the circuit in Figure 7.

Figure 7: Coupling Noise Model
Here we have assumed that the entire line capacitance is lumped into C,
and the resistance of the wire, pass transistors, and line driver are
lumped into R. Cc represents the coupling capacitance to noise source Vn.
If we assume that the noise source can be modeled as a linear ramp and that
the initial voltage V is 0, then
the solution to the differential equation is
$$V(t) = R\,C_c\,\frac{dV_n}{dt}\left(1 - e^{-t/\left[R\,(C + C_c)\right]}\right) \qquad (2)$$
Using this model, we can assume a linear superposition of all noise sources.
We can make this model more general if we do not assume that the initial
value of V is zero. In this case, the voltage will be in the process of
settling to its final output value. To ensure that no glitch is possible, we
must show that the derivative of V(t) can never be positive when in the
transition region. This requirement reduces to the following expression:
(3)
where VPmax is the maximum peak voltage allowed at the
input of the sense amp (in steady state) to prevent glitching.
VPmax basically sets the boundary of the transition region.
This expression assumes we are summing over all possible noise
sources.
The value of R will tend to be dominated by either the pass transistors or
the line driver, both of which are non-linear resistances. A minimum-sized
inverter was simulated in the HP 0.6 um process and found to have a
resistance ranging from 25 kOhms (for Vout=1.5 V) to 5 kOhms (for Vout=0 V).
To illustrate the use of this model, consider the example of noise coupling
between two adjacent global wires. The wires are 2mm long and 1.2 um apart.
Using the methods presented in section 2.2, we calculate the coupling
capacitance to be 60 fF/mm. The pass transistor is assumed to be a
transmission gate with the PMOS and NMOS both 20um wide. The majority of the
resistance is assumed to come from the CMOS inverter in steady state, thus
giving R a value of 5 kOhms. The total line capacitance, including the local
wires and transistor parasitics, is calculated to be 390 fF. The single line
is simulated and found to have a rise rate of 115 mV/ns. Using this data,
we can compute the following:
V(infinity) = R Cc dVn/dt = 5k * 120 fF * 115 mV/ns = 0.069 V
time constant = R (C + Cc) = 5k * (120 fF + 390 fF) = 2.55 ns
Two bit lines with these parameters were simulated with the coupling
capacitance distributed along the length of the wire. The peak induced on
the steady state wire was 0.071 V with a time constant of 1.9 ns. The
discrepancy between the time constants is probably due to inaccuracies in
the total line capacitance extraction.
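The hand calculation in this example is easy to script. The sketch below is a minimal illustration, assuming the first-order response implied by the lumped model (an exponential approach to V(infinity) = R Cc dVn/dt with time constant R (C + Cc)); the function names are hypothetical, and the numbers are those of the 2 mm example.

```python
import math


def noise_asymptote_v(r_ohm, cc_f, slew_v_per_s):
    """Asymptotic induced voltage R * Cc * dVn/dt for a ramp aggressor."""
    return r_ohm * cc_f * slew_v_per_s


def noise_time_constant_s(r_ohm, c_f, cc_f):
    """Time constant R * (C + Cc) of the lumped victim node."""
    return r_ohm * (c_f + cc_f)


def induced_noise_v(t_s, r_ohm, c_f, cc_f, slew_v_per_s):
    """Induced voltage at time t, assuming V(0) = 0 and a ramp that is still rising."""
    v_inf = noise_asymptote_v(r_ohm, cc_f, slew_v_per_s)
    tau = noise_time_constant_s(r_ohm, c_f, cc_f)
    return v_inf * (1.0 - math.exp(-t_s / tau))


if __name__ == "__main__":
    R_OHM = 5e3                 # inverter output resistance in steady state
    CC_F = 60e-15 * 2           # 60 fF/mm over a 2 mm wire
    C_F = 390e-15               # total line capacitance
    SLEW_V_PER_S = 115e-3 / 1e-9  # 115 mV/ns

    print(f"V(inf) = {noise_asymptote_v(R_OHM, CC_F, SLEW_V_PER_S):.3f} V")
    print(f"tau    = {noise_time_constant_s(R_OHM, C_F, CC_F) * 1e9:.2f} ns")
    for t_ns in (1, 2, 5, 10):
        v = induced_noise_v(t_ns * 1e-9, R_OHM, C_F, CC_F, SLEW_V_PER_S)
        print(f"V({t_ns:2d} ns) = {v:.3f} V")
```

Because the aggressor ramp lasts only a finite time, the asymptote acts as an upper bound on the noise peak predicted by this lumped model.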
The success of this model in predicting the noise from a neighboring wire
shows that lumping distributed noise contributions into one node can
greatly simplify the noise analysis. These equations can be used to gain
an intuition about how to design a system which is coupling noise resistant.
Please note that assumptions were made about film thicknesses to calculate
coupling capacitance in spite of sketchy data on the HP 0.6 um process.
The induced noise peak in this example may not be entirely accurate.
All that remains is to measure VPmax for each receiver
circuit. A glitch is defined as an erroneous transition. We can
further define an erroneous transition as an intermediate output.
Thus, VPmax for a given receiver circuit will be the change
in input voltage required to produce an output of VIL or
VIH.
$$V_{Pmax} = \min\left\{ V_{in}(V_{out}=V_{IL}),\; V_{dd} - V_{in}(V_{out}=V_{IH}) \right\} \qquad (4)$$
These values can be obtained from examination of a DC transfer
characteristic. For the minimum sized CMOS inverter, VPmax
was found to be 0.69 V (for the HP 0.6 um process with a 1.5 V supply).
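The "noise headroom" mentioned at the start of this section can be estimated by comparing the summed worst-case noise contributions to VPmax. The sketch below is an illustration under stated assumptions rather than a transcription of expression (3): it takes each aggressor's asymptotic contribution R Cc dVn/dt from the lumped model, adds the contributions by superposition, and subtracts the sum from VPmax. The `Aggressor` type and the function names are hypothetical.

```python
from typing import NamedTuple, Sequence


class Aggressor(NamedTuple):
    """One noise source coupled to the victim line (hypothetical helper type)."""
    cc_f: float          # coupling capacitance to the victim, in farads
    slew_v_per_s: float  # rise or fall rate of the aggressor, in V/s


def worst_case_noise_v(r_ohm: float, aggressors: Sequence[Aggressor]) -> float:
    """Sum of asymptotic contributions R * Cc * dVn/dt over all aggressors."""
    return sum(r_ohm * a.cc_f * a.slew_v_per_s for a in aggressors)


def noise_headroom_v(vp_max_v: float, r_ohm: float,
                     aggressors: Sequence[Aggressor]) -> float:
    """Margin left before the summed noise reaches VPmax; negative means a glitch is possible."""
    return vp_max_v - worst_case_noise_v(r_ohm, aggressors)


if __name__ == "__main__":
    # Values from the 2 mm example: 120 fF of coupling, 115 mV/ns, R = 5 kOhms.
    neighbor = Aggressor(cc_f=120e-15, slew_v_per_s=115e6)
    vp_max = 0.69  # minimum-sized CMOS inverter at a 1.5 V supply
    print(f"one neighbor:  {noise_headroom_v(vp_max, 5e3, [neighbor]):.3f} V")
    print(f"two neighbors: {noise_headroom_v(vp_max, 5e3, [neighbor, neighbor]):.3f} V")
```

Assuming every aggressor switches in the same direction at the same time is pessimistic, so the result is best read as a conservative bound.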
To summarize, a method has been created for estimating the resistance of
the interconnect to coupling noise. This method requires extraction
of resistance and capacitance values from a layout and an approximation
of the line driver output resistance. The method also requires a
priori knowledge of the noise signal's voltage swing and rise/fall time.
The receiver circuit sensitivity to noise is accounted for in the
VPmax term. Using this method, we will evaluate the noise
performance of different driver and receiver circuits.
2.4 Methods for Handling Supply Noise
Power supply noise is the variation of the supply and ground nets of
the chip. The variations can be attributed to both DC and sinusoidal
components. The DC component is produced by the IR drop through the
power and ground nets, and the sinusoidal component is produced by the RLC
response of the chip and package to current demands [5]. Because
variations of the power supply are quite slow compared to the circuit's frequency
response, they are often treated as DC noise for
analysis.
Unfortunately, our project group has had very little experience with
this subject and has not performed an adequate analysis. In order to
predict what kind of supply noise is caused by our circuits, however,
we have measured the peak supply current drawn during transitions.
This data will help us see how much care must be taken in designing
the power distribution system.
3 Driver/Receiver Circuit Analyses
In this section, we will examine several different options for
driver/receiver circuits and the delay-power tradeoff for each one.
First, the case of a standard CMOS progressively-sized inverter chain
will be examined as a basis for comparison. The base case analysis
demonstrates how delay and power relate to pass transistor sizing and
wire length.
Designs for low swing driver/receiver circuits fall roughly into two
categories. First, there are the circuits such as the Hitachi
architecture which use extra supply rails to limit the swing. This
architecture is limited in that it needs low-threshold transistors to
function properly. Second, there are the precharged circuits, such as
the "Precharged to half Vdd" scheme, which rely on special signaling
to initialize and limit the swing on the wires. Both the Hitachi and
"precharged to 1/2 Vdd" scheme will be examined in this section.
We had intended to also analyze a differential driver/receiver pair but
due to the complexity of the circuit, we did not provide performance analysis.
The differential circuit uses reduced swing supply rails in addition to a
precharging
scheme. This circuit will be discussed briefly followed by a summary of
the circuit performances.
3.1 CMOS Inverter Chain
We chose the HP 0.6 um process from MOSIS and a supply voltage of 1.5
V as a starting point for our simulations because it fits the
Pleiades project group's goals. The circuit in Figure 2 (from
Section 1) was implemented in SPICE with pi-3 distributed RC models
and CMOS inverters as the drivers and receivers. Using the technology
parameters available from MOSIS, we calculated the total line
capacitance and resistance for the global wires. Several values are
tabulated in Table 2.
Table 2: Various Line Capacitances and Resistances in Metal 2
| Length \ Width | 0.9 um | 1.2 um | 1.5 um |
|---|---|---|---|
| 2 mm | 178 Ohms, 174 fF | 133 Ohms, 180 fF | 106 Ohms, 187 fF |
| 5 mm | 445 Ohms, 435 fF | 333 Ohms, 451 fF | 267 Ohms, 468 fF |
| 10 mm | 889 Ohms, 869 fF | 667 Ohms, 902 fF | 533 Ohms, 935 fF |
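As a sanity check on Table 2, the sketch below backs out the sheet resistance implied by each entry and computes the RC product of the wire alone. It assumes the usual relation R = Rsheet x length / width, which is not stated explicitly in the text; the dictionary simply copies the table, and the function names are illustrative.

```python
# (length_mm, width_um) -> (resistance_ohm, capacitance_fF), copied from Table 2.
TABLE2 = {
    (2, 0.9): (178, 174), (2, 1.2): (133, 180), (2, 1.5): (106, 187),
    (5, 0.9): (445, 435), (5, 1.2): (333, 451), (5, 1.5): (267, 468),
    (10, 0.9): (889, 869), (10, 1.2): (667, 902), (10, 1.5): (533, 935),
}


def implied_sheet_resistance(length_mm, width_um, r_ohm):
    """Sheet resistance in Ohms per square, assuming R = Rsheet * length / width."""
    squares = (length_mm * 1000.0) / width_um
    return r_ohm / squares


def wire_rc_ns(r_ohm, c_ff):
    """RC product of the wire alone, in nanoseconds."""
    return r_ohm * (c_ff * 1e-15) * 1e9


if __name__ == "__main__":
    for (length_mm, width_um), (r_ohm, c_ff) in sorted(TABLE2.items()):
        rs = implied_sheet_resistance(length_mm, width_um, r_ohm)
        rc = wire_rc_ns(r_ohm, c_ff)
        print(f"{length_mm:2d} mm, {width_um} um: Rsheet = {rs:.4f} Ohms/sq, RC = {rc:.3f} ns")
```

The implied sheet resistance comes out near 0.08 Ohms per square for every entry, and even the 10 mm, 0.9 um wire has an RC product below 1 ns, which is small next to the 5 to 20 ns delays quoted in the introduction.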
We had originally planned to use NMOS pass transistors and deal with the
threshold voltage drop by adjusting the switching threshold of the
receiver down by Vth/2, but we found it difficult to design a CMOS inverter
with a switching threshold of 0.4 V. In the end, we decided that adding
a PMOS transistor was necessary for proper operation. Without the
PMOS, the high-level noise margin of the combined driver and receiver is
very low. The transmission gate has a significantly
higher resistance at around 0.75 V, where both the NMOS and PMOS transistors
are on the verge of shutting off. We found that the delay and power were
generally minimized for a ratio of 3:1 for PMOS to NMOS width.
We found that the delay and switching energy depend almost entirely on
the total line capacitance, independent of where local wires intersect
the global wires. As one might expect, increasing the transmission gate
widths gives an optimum in terms of delay, at the point where resistance
becomes insignificant and parasitic capacitances dominate. Also, varying
the number of hanging transmission gates connected to the wire has a major
effect on delay and energy.
Figure 8 shows delay and switching energy plots for a 5 mm wire while
varying the combined NMOS and PMOS widths. A two-stage, progressively
sized CMOS buffer with a scaling factor of 5 was used for these
simulations. The single processor case considers no
hanging drains from pass transistors of unconnected processors.
Increasing the number of processors in the system increases the number
of drain capacitances which must be switched. There will be two extra
drains per processor: one for the driver and one for the receiver.

Figure 8: CMOS Inverter Performance vs. Pass Transistor Size
(for a 5mm wire)
We had wanted to investigate the use of boot-strapping or charge pumps to
boost the gate voltage of the pass transistors. The circuits proved to be
complex and beyond the scope of this project, but Figure 9 shows what
gains could be made from such a scheme. The simulations illustrated here
are identical to those in Figure 8 except that the NMOS pass transistor
gate is tied to 2.3 V (Vdd + 0.8 V) and the PMOS pass transistor is removed.
Switching energy can be significantly reduced with this method.

Figure 9: CMOS Inverter Performance with Bootstrapped Gate NMOS
Pass Transistor (for a 5mm wire)
3.2 Hitachi Low-Swing Architecture
A schematic of the Hitachi low-swing architecture appears in Figure 10.
The rails of the reduced swing on the interconnect are set by the scaled
voltages Vsl and Vcl, and low-threshold MOSFETs are used to help compensate for the reduced swing.

Figure 10: Hitachi Low-Swing Driver/Receiver
The important design parameters of this scheme are :
- to achieve high-speed level conversion, both from high to low and from low to high
- to achieve low standby current compared to the base case, which results in
overall reduced energy
The swing is limited by the Vsl and Vcl supplies while the current driving
capacity is maintained by using low threshold transistors. The receiver
amplifies the reduced signal to that of full scale by utilizing positive
feedback. An important factor in determining the overall power consumption of
the scheme will depend on the type of internal supply voltages used to generate
Vcl and Vsl. This issue will be addressed in section 3.2.3,
but first the operation of this circuit will be examined independently.
Source Offset Driver
The driver is a single CMOS gate with the recommended ratio between the
NMOS and PMOS set to 1 : 10/3 [1]. Both are low-threshold devices, but we were unable to obtain SPICE models for them. Consequently, we improvised and
created these special MOSFETs by modifying the thresholds of the HP 0.6 um SPICE
models from the original 0.7 V to 0.2 V. Though this modification will skew the
results compared to the behavior of actual low-threshold devices, it
will at least assist in obtaining an approximation of the circuit's
performance and in verifying its operation. The thresholds should be lowered such
that dVTN = Vsl and dVTP =
Vcc-Vsl. The effective gate voltages then become
VGS - VT, which helps maintain a low standby current.
Sense Amplifier
The receiver is also composed of a low threshold NMOS and PMOS, and two sets
of standard NMOSs and PMOSs which create two symmetric level converters. An
input signal sensed at about Vcl will be scaled to Vcc
by the upper converter
and similarly a signal sensed at Vsl is converted to ground by the
lower converter. Figure 11 demonstrates the conversion process for a low to
high transition. The feedback of the receiver initially slows the response of
the receiver and turns on when node B reaches Vcl. Then node A rises to Vcc,
which turns off the PMOSs and pulls the output down to ground.
The reduced output drive current of the driver contributes to the delayed
response of the receiver.
As a result, the conversion delay added by the receiver becomes
significant, though some of the conversion penalty is compensated by the
reduction in interconnect delay.

Figure 11: Illustration of receiver's operation
We are interested in evaluating the driver/receiver's performance when used to drive the interconnect architecture of Figure 1. The length of the
interconnect will vary the delay by a constant factor as long as the
resistance of the driver is always larger than that of the interconnect.
Regardless of the length chosen, it is important to examine the impact of the
size of driver and transmission gates on performance. Choosing
optimum sizes in terms of energy and delay will therefore allow us to exploit all the
advantages of this scheme.
3.2.1 Using low threshold transistors
The transistor sizes of the driver are set as discussed above. Through SPICE
simulations it was discovered that the optimum transmission gate ratio for the
interconnect architecture was about 1:3. An optimum driver/transmission gate
ratio is desired, so a first-order analysis is performed with the
interconnect length set to 2 mm. The widths of the transmission gates and the driver
are varied from 1.2 um to 50 um while maintaining the NMOS:PMOS ratio.
The swing is also varied from 0.7 V to 1.5 V. Figure 12 reveals that neither the size of the transmission gate nor the swing
significantly affects the optimum driver size. It is inferred from Figure 13 that the swing of the signal impacts the
optimum size of the transmission gate. As the swing decreases from 1.5 V to
0.7 V, the optimum size of the transmission gate increases in order to compensate for the loss in current drive, which consequently increases the overall energy.
Table 3: Optimum Driver Sizes
| | Wdriver | Wtransmission | Average Delay | Average Energy | Average Energy*Delay |
|---|---|---|---|---|---|
| Lowest Energy | 11 um | 11 um | 5.7 ns | 1.8 pJ | 1e-20 Js |
| Lowest Delay | 28 um | 23 um | 3.7 ns | 2.5 pJ | 8.6e-21 Js |
| Lowest Energy*Delay | 23 um | 18 um | 3.9 ns | 2.2 pJ | 8.1e-21 Js |
Our goal is to optimize for energy while minimizing the delay, so sizes
are chosen which optimize the energy-delay product.
Performance is then evaluated by setting the interconnect length to 5 mm and
varying the swing from 0.7 V to 1.5 V. The simulated
results are tabulated in Table 4. A signal with a 0.8 V swing reduces the energy by about 1/3 relative to a 1.5 V swing, but with a propagation delay that is about 15% of its 50 ns period. Table 4 demonstrates that the swing with the best energy-delay product (EDP) is in fact full swing, though it has the highest energy consumption.
Table 4: Performance
| | Voltage Swing | Average Delay | Average Energy | Average Energy*Delay |
|---|---|---|---|---|
| Lowest Energy | 0.8 V | 8 ns | 2.1 pJ | 1.7e-20 Js |
| Lowest Delay | 1.5 V | 2.6 ns | 3.1 pJ | 8e-21 Js |
| Lowest Energy*Delay | 1.5 V | 2.6 ns | 3.1 pJ | 8e-21 Js |
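The comparisons drawn from Tables 3 and 4 can be checked with a few lines of code. This is a minimal sketch that copies the tabulated averages; the `Point` type and helper names are hypothetical, and the 50 ns period is the one quoted in the text.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """One row of a results table (hypothetical helper type)."""
    label: str
    delay_ns: float
    energy_pj: float
    edp_js: float


# Rows copied from Table 3 (2 mm wire, driver and transmission gate sizing).
TABLE3 = [
    Point("Lowest Energy", 5.7, 1.8, 1e-20),
    Point("Lowest Delay", 3.7, 2.5, 8.6e-21),
    Point("Lowest Energy*Delay", 3.9, 2.2, 8.1e-21),
]

# Distinct rows copied from Table 4 (5 mm wire, swing varied).
LOW_SWING = Point("0.8 V", 8.0, 2.1, 1.7e-20)
FULL_SWING = Point("1.5 V", 2.6, 3.1, 8e-21)


def best_by_edp(points):
    """Design point with the smallest tabulated energy-delay product."""
    return min(points, key=lambda p: p.edp_js)


def energy_saving(low, full):
    """Fractional energy reduction of `low` relative to `full`."""
    return 1.0 - low.energy_pj / full.energy_pj


def delay_fraction(point, period_ns):
    """Propagation delay as a fraction of the signaling period."""
    return point.delay_ns / period_ns


if __name__ == "__main__":
    print("Best sizing by EDP:", best_by_edp(TABLE3).label)
    print("Best swing by EDP: ", best_by_edp([LOW_SWING, FULL_SWING]).label)
    print(f"Energy saved at 0.8 V: {energy_saving(LOW_SWING, FULL_SWING):.0%}")
    print(f"Delay / 50 ns period:  {delay_fraction(LOW_SWING, 50.0):.0%}")
```

The tabulated energy-delay products are averages over the simulated cases, so they need not equal the product of the average delay and average energy in the same row.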
The effect of wiring capacitance and resistance is simulated as shown in Figure 14. This scheme is quite insensitive to
increases in the distributed RC, as the delay and energy increase proportionally.
This is due to the small ratio of the interconnect delay to that of the entire
scheme. The length must then become quite large before it can begin to
dominate the delay. The operation of this scheme can then be maintained as long
as the distributed RC is smaller than the interconnect delay.
3.2.2 Using standard transistors
The same analyses are performed for this case as for the previous one.
The results are similar, but due to the increased thresholds the architecture could
not operate below a swing of 0.9 V. The optimized results for the driver can be examined in
Figure 15 and those for the transmission gate in Figure 16. Table 5 compares the tradeoffs
between optimizing for energy and delay.
Table 5: Optimized Driver Sizes
| | Wdriver | Wtransmission | Average Delay | Average Energy | Average Energy*Delay |
|---|---|---|---|---|---|
| Lowest Energy | 13 um | 12 um | 7.4 ns | 1.8 pJ | 1.3e-20 Js |
| Lowest Delay | 34 um | 20 um | 4.6 ns | 2.4 pJ | 1.1e-20 Js |
| Lowest Energy*Delay | 30 um | 16 um | 4.9 ns | 2.2 pJ | 1e-20 Js |
The simulated effects of the voltage swing and length are in Figure 17 and
Figure 18. Table 6 summarizes the results.
Table 6: Performance
| | Voltage Swing | Average Delay | Average Energy | Average Energy*Delay |
|---|---|---|---|---|
| Lowest Energy | 0.9 V | 7 ns | 2 pJ | 1.4e-20 Js |
| Lowest Delay | 1.5 V | 2.5 ns | 3.3 pJ | 1e-20 Js |
| Lowest Energy*Delay | 1.5 V | 2.5 ns | 3.3 pJ | 1e-20 Js |
3.2.3 Voltage Scaling
The energy measurements above do not take into consideration the
efficiency of the supply. There are currently three independent supplies: Vsl,
Vcl and Vdd. The measurements were performed by integrating the current from
each of the supplies across a capacitor and then summing the results.
Vsl and Vcl will be generated on chip from
the main supply, so examining the effect of these supplies on the
overall performance provides a better approximation of the actual power
dissipation. The supplies can be generated at the desired
voltage through the use of voltage converters. Such converters can be
divided into two main groups: linear and switching regulators. Though
there are many variations of the two, we are interested in the overall efficiency
of the converter. The linear regulator can be modeled as a resistor network
which divides the voltage. As the new voltage becomes much smaller than the
supply, the efficiency of the linear regulator will begin to decrease. The
efficiency of such supplies is usually about 80% [10].
The switching regulator, also called a pulse width modulator, is able to
provide a higher efficiency, but the overhead is much greater than that of the
linear regulator in terms of area and complexity. The efficiency of this regulator is in the range of 90-95% [10].
The energy measurements provided above can therefore be modified to reflect
the behavior of either regulator by adding the inefficiency of the supply,
which is a function of the desired swing.
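As a rough illustration of this adjustment, the sketch below divides the energy drawn from the regulated rails by the regulator efficiency and adds the energy drawn directly from Vdd. The split between rails used in the example is hypothetical, since only the summed energy is reported here; the efficiencies are the approximate figures quoted from [10], and the function name is illustrative.

```python
def energy_from_main_supply_pj(e_vdd_pj, e_regulated_pj, efficiency):
    """Energy drawn from the main supply when the scaled rails come from a regulator.

    e_vdd_pj:        energy drawn directly from Vdd
    e_regulated_pj:  energy delivered by the regulated rails (Vsl and Vcl)
    efficiency:      regulator efficiency, between 0 and 1
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    return e_vdd_pj + e_regulated_pj / efficiency


if __name__ == "__main__":
    # Hypothetical split of a 2.1 pJ measurement between the rails.
    E_VDD_PJ = 0.6
    E_REGULATED_PJ = 1.5
    for name, eta in (("linear", 0.80), ("switching", 0.90), ("switching", 0.95)):
        total = energy_from_main_supply_pj(E_VDD_PJ, E_REGULATED_PJ, eta)
        print(f"{name:9s} regulator, efficiency {eta:.0%}: {total:.2f} pJ")
```

A fixed efficiency is a simplification; for a linear regulator in particular, the efficiency falls as the output voltage drops further below the supply, so the penalty grows as the swing is reduced.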
3.2.4 Summary of Results
The results for the first and second cases are
very close. Case 1 (low threshold) has less delay and EDP on average, while
the energy is about the same as that of case 2 (standard threshold).
Simulations were performed using the modified low-threshold transistors instead
of an actual SPICE model because none was available. The
similarities in performance between the two cases can therefore be attributed to this.
Table 7 compares the performance of the two cases using the optimized
versions of each one.
Table 7: Performance Comparison
Modified MOS
| Signal Swing | Total Energy per Cycle | Edyn % | Esta % | Maximum tp |
|---|---|---|---|---|
| 1.5V | 2.9 pJ | 93 % | 7 % | 2.9 ns |
| 1V | 2.4 pJ | 90 % | 10 % | 5 ns |
Standard MOS
| Signal Swing | Total Energy per Cycle | Edyn % | Esta % | Maximum tp |
|---|---|---|---|---|
| 1.5V | 3.3 pJ | 97 % | 3 % | 3 ns |
| 1V | 2.4 pJ | 97 % | 3 % | 6.3 ns |
Due to factors such as conversion delay and the RC effects of the interconnect
proposed, the advantages of this architecture would be best achieved at lower
frequencies. The reduced supply (1.5V) decreases the output current drive of
the driver, which contributes to the conversion delay of the receiver.
Voltage regulators are also needed as additional overhead, with
tradeoffs in terms of power, area and complexity. The linear regulator is the
least complex but is less efficient, while the switching regulator is more
complex and uses more area but results in a more efficient conversion.
The proposed architecture is intended to operate with a supply of 2 Volts, but
we were interested in evaluating its performance at 1.5V because speed
degradation may not be large enough to hinder the operation of the type of low
power application in which it may be utilized. Using low threshold transistors
may have improved the overall performance of the scheme, but for feasibility
purposes, the scheme was examined without low thresholds. The delay increases
as expected, which only reinforces the fact that this scheme cannot operate at high
frequencies with a reduced supply voltage Vdd.
3.3 Precharged to Half Vdd Scheme
The conventional single-ended precharged circuit appears in Figure 19. During
the precharge clock phase, the output of inverter M3-M4 is connected to its
own input. This causes the inverter to precharge the line to the switching
threshold. This method allows for very low power consumption and voltage
swing, but these values are frequency dependent. This circuit can be
implemented in the reconfigurable interconnect scheme only if it is operated
in a fixed synchronous scheme.

Figure 19(a): Single-Ended precharged driver/receiver

Figure 19(b): Timing diagram of single-Ended precharged driver/receiver
The circuit operates as follows. During the precharging phase, the output
of the precharging inverter (M3-M4) is short-circuited to its input so that
the inverter is forced to its switching point (Vdd/2) when M3 and M4 are
appropriately sized. During the evaluation phase, the short-circuit is
disabled, and the interconnect capacitance is charged or discharged
according to the input. The interconnect swing depends on the current through M1 and
M2 (i.e., the sizes of M1 and M2), the interconnect capacitance, and how long
the evaluation signal is high. The reduced interconnect swing results in a large
output swing since the output inverter acts as an amplifier.
In our midterm report, we promised to examine this circuit with M3 and M4
at the driver instead of the receiver. This would allow the driver to
handle all precharging. Such a modification does not work, however,
because the gate voltage rises too quickly to the median value and reduces
the precharge current to a trickle. Therefore, M3 and M4 must be at the
receiver.
Several important design issues for optimizing this circuit need to be
addressed:
- The interconnect swing is approximated by
(5)
Thus, the swing can be adjusted mainly by changing the sizes of M1 and M2
and the evaluation period. It can also be adjusted by changing the sizes of
M3 and M4 (i.e., changing Cint).
- The total delay is the sum of the delay through the NAND (NOR) gate, the
delay through M1 and M2, the delay through the interconnect, and the delay
through M3 and M4. The portion of the delay due to the NAND (NOR) gate can be
significant and needs to be reduced by sizing up its transistors, at the cost
of increased circuit area.
- The total power dissipation can be broken into four categories:
  - Precharge power dissipation, which is attributed to the static current
    through M3 and M4 during the precharge phase, and is given by
    (6)
  - Dynamic power dissipation, which is attributed to the interconnect
    charge switching and the interconnect swing during the evaluation phase,
    and is given by
    (7)
  - Static power dissipation, which comes mostly from the leakage
    current after the evaluation phase and before the next precharge signal.
  - Standby power dissipation, which is the power dissipated when no
    precharge, evaluation, or input signal occurs. For instance, when the
    bus line is inactive, the only power dissipated is standby power.
- The switching threshold of INV1 has to be significantly below half Vdd
to reduce the power dissipation of INV1 during the precharge phase. This can
be achieved by properly sizing INV1.
- The sizes of M3 and M4 are selected so that the switching point is at
Vdd/2. Large M3 and M4 are desirable in order to amplify a small swing and
meet the delay constraints. However, sizing up M3 and M4 increases the
precharge power dissipation while decreasing the dynamic power dissipation,
because the swing is reduced.
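As a rough illustration of the swing dependence described in the first design
issue, the sketch below assumes that the driver (M1 or M2) behaves as a
constant current source during evaluation, so that the swing is approximately
I_drv * T_eval / Cint, capped at Vdd/2 because the line starts from the
half-Vdd precharge level. This constant-current form is an assumption made for
illustration and is not necessarily the form of equation (5); the function
names and the numeric values in the usage note are hypothetical.

```python
def swing_estimate(i_drv, t_eval, c_int, vdd=2.5):
    """Approximate interconnect swing in volts.

    Assumption (not from the report): the driver supplies a constant
    current i_drv (A) for the evaluation period t_eval (s) into the
    interconnect capacitance c_int (F). The line starts at Vdd/2, so
    the swing in either direction cannot exceed Vdd/2.
    """
    if c_int <= 0:
        raise ValueError("c_int must be positive")
    dv = i_drv * t_eval / c_int
    return min(dv, vdd / 2.0)


def eval_time_for_swing(target_swing, i_drv, c_int):
    """Evaluation period (s) needed to reach target_swing (V),
    under the same constant-current assumption."""
    if i_drv <= 0:
        raise ValueError("i_drv must be positive")
    return target_swing * c_int / i_drv
```

For example, with a hypothetical 250 uA drive current, a 2 ns evaluation period
and 1 pF of interconnect capacitance, `swing_estimate(250e-6, 2e-9, 1e-12)`
gives a swing of roughly 0.5 V. Doubling either the drive current or the
evaluation period doubles the estimated swing until the Vdd/2 cap is reached.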
Figures 20(a), (b), and (c) plot the swing, delay, and total power dissipation
for various sizes of M1 and M2 (= M1/3) and various evaluation periods.
Table 8 lists the optimized power and delay for three different swings.
Table 7 lists the corresponding transistor sizes for an interconnect length
of 5 mm, a Vdd of 2.5 V, and an operating frequency of 20 MHz.
Table 8: Optimization in conventional half-Vdd circuit
| Swing (V) | 0.5 | 0.75 | 1 |
| tdelay (ns) | 2.14 | 2.03 | 2.03 |
| Power_precharge (mW) | 0.186 | 0.168 | 0.155 |
| Power_dynamic (mW) | 0.166 | 0.17 | 0.174 |
| Power_static (mW) | 0.085 | 0.094 | 0.104 |
| Power_standby (mW) | 0.186 | 0.168 | 0.155 |
| HSPICE | plot1 | plot2 | plot3 |
Because this circuit depends on the operating frequency, implementing it in
the reconfigurable interconnect scheme is complicated by the fact that our
project is geared towards locally synchronous, globally asynchronous,
data-driven interconnect; that is, each processor may operate at a different
frequency and voltage. The challenge is therefore to combine data-driven
asynchronous communication at the protocol level with a simple self-timed
signaling scheme that does not depend on the operating clock.
We propose a modified circuit, shown in Figure 21a, in which the precharge
signal is self-timed and derived from the handshaking ACK signal of the
previous communication. Instead of using an evaluation clock, the evaluation
phase is enabled by an evaluation signal that is triggered by the input
signals. The timing diagram for this precharged to half-Vdd circuit is
shown in Figure 21b.

Figure 21a: Modified precharged driver/receiver scheme

Figure 21b: Timing diagram for the precharged circuit
Several optimization issues are addressed as follows:
- The periods of the handshaking request and acknowledge signals are
determined by the wire delay. In the extreme case of a 10 mm bus, the wire
delay is approximately 2.5 ns and thus the precharge period is about
5 ns, which is sufficient for precharging.
- While this self-timed precharged interconnect scheme is feasible,
its potential has not yet been fulfilled because of its non-zero standby
power dissipation. The bus lines remain at half Vdd during the standby
(communication-inactive) period, so power is dissipated even when no data
is sent. Because this reconfigurable bus is often inactive, "zero-power"
standby operation is critical for a low-power application. Hence, we choose
minimum sizes for M3-M4 and INV1 to reduce the standby power dissipation as
far as possible.
Table 9 lists the optimized power and delay for three different swings. It
indicates that about 50% of the total energy is consumed by the precharge and
standby power when communication is active 100% of the time. If each bus
line is only 10% active, 95% of the total power is wasted in the standby
phase, which is unacceptable for a low-power application.
Table 9: Optimization of swing in modified half-Vdd circuit
| Swing (V) | 0.5 | 0.75 | 1 |
| tdelay (ns) | 2.06 | 2.13 | 2.04 |
| Power_precharge (mW) | 0.26 | 0.25 | 0.24 |
| Power_dynamic (mW) | 0.86 | 0.61 | 0.63 |
| Power_static (mW) | 0.023 | 0.041 | 0.023 |
| Power_standby (mW) | 0.26 | 0.25 | 0.24 |
| HSPICE | plot1 | plot2 | plot3 |
Therefore, we propose another self-timed precharged interconnect scheme to
reduce the standby power dissipation, as presented in Figure 22a. In this
scheme, the precharge signal is triggered by the input signal, and the
evaluation signal is in turn triggered by the precharge signal once
precharging is complete. Figure 22b shows the timing diagram.

Figure 22a: Zero standby-power precharged driver/receiver scheme

Figure 22b: Timing diagram for this improved precharged circuit
This improved circuit shows superior standby power dissipation, as
summarized in Table 10. Since the bus line in this case is set at either
0 or Vdd during the standby period, it consumes essentially no standby
power, which reduces the total power consumption. The cost is an increased
delay, because the input signal has to wait for the precharge phase to
complete before the bus line can start to send the signal. Fortunately,
this additional delay overhead is usually less than 5 ns.
Table 10: Optimization in zero standby-power half-Vdd circuit
| Swing (V) | 0.5 | 0.75 | 1 |
| tdelay (ns) | 7.37 | 8.36 | 7.94 |
| Power_precharge (mW) | 0.15 | 0.11 | 0.14 |
| Power_dynamic (mW) | 0.78 | 0.69 | 0.41 |
| Power_static (mW) | 0.15 | 0.07 | 0.06 |
| Power_standby (mW) | 0.15 | 0.07 | 0.06 |
| HSPICE | plot1 | plot2 | plot3 |
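The latency cost of placing the precharge phase on the critical path can be
sketched with a very simple model. The function name and its arguments are
hypothetical; the only figures taken from this report are the 5 ns precharge
and 3 ns transmission times quoted for the zero standby-power scheme in the
notes to Table 11. The model assumes the two phases are strictly sequential
and ignores any handshaking overhead.

```python
def signal_latency_ns(t_transmit_ns, t_precharge_ns, precharge_on_critical_path):
    """Latency from input arrival to received data, in ns.

    precharge_on_critical_path=True models a scheme in which the input
    signal starts the precharge and evaluation only begins once
    precharging is complete (the zero standby-power scheme).
    False models a scheme in which precharging was already done after
    the previous transfer, so only the transmission time is seen.
    """
    if t_transmit_ns < 0 or t_precharge_ns < 0:
        raise ValueError("times must be non-negative")
    if precharge_on_critical_path:
        return t_precharge_ns + t_transmit_ns
    return t_transmit_ns
```

With the quoted figures, `signal_latency_ns(3.0, 5.0, True)` returns 8.0 ns,
whereas the same transmission with the precharge hidden in the idle time would
take 3.0 ns in this model. The difference is the delay penalty paid for letting
the bus rest at 0 or Vdd between transfers.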
It is also worth noting that the bus length varies because of the
reconfigurable nature of the interconnect. Assuming the bus length ranges from
1 mm to 10 mm for a 1 cm x 1 cm chip, the simulation results show that the
total power dissipation increases by only 5% while the delay increases by 37%
as the bus length increases from 1 mm to 10 mm.
3.4 Differential Scheme
An extensive search for a low-swing, low-energy differential scheme was carried
out. Figures 23(a) and (b) show the schematic of the chosen scheme, which
possesses the desired attributes. Differential data transfer inherently has
higher noise margins and has been employed as an effective means of reducing
bus power consumption on heavily loaded data lines. To save power, the proposed
architecture shares the complementary signal bits of neighboring wires in order
to halve the number of wires needed. This results in a power savings but
requires the use of a multiplexing scheme.
Figure 23a: Driver Schematic
Figure 23b: Receiver Schematic
This architecture is intended for very high speed applications and can drive
a bus using a differential swing of 100 mV with a 1.2 V supply [6]. To obtain
the low swing while operating at high speed, the bus is precharged to half of
VDD during half of the cycle.
The driver is composed of switches and an equalizer, while the receiver is
composed of a current sense amplifier and a gate-receiver. The gate-receiver
amplifies the voltage difference between the differential lines. The driver and
receiver are synchronized through the use of a main clock (MCLK) and a receiver
clock (RCLK).
The transferred data is sensed by the receiver when it is clocked by RCLK.
RCLK operates at the same frequency as MCLK but is delayed by a quarter of the
MCLK cycle.
The driver precharges the complementary lines of each bit to half VDD during
the former half of the cycle, and in the latter half the bus switching is
controlled by the input data. The bus lines are equalized by charge sharing,
which is achieved by shorting the two floating lines together. The driver uses
two additional supplies to generate the desired low swing, similar to the
concept of the Hitachi circuit. It also needs to generate the complement of the
signal.
To permit low-voltage operation, a larger VGS is generated at the
input gates of the receiver and the output gates of the driver by employing
low-threshold CMOS transistors. The larger VGS helps compensate for the
reduction in supply voltage. During the latter half of the RCLK cycle, the
internal nodes of the receiver are precharged to half of VDD. In
the former half, any slight voltage deviation from VDD/2 on the
primary inputs is amplified. This quickly pulls the output high or to ground,
and for the remainder of the cycle the receiver sustains the signal at its
current level.
Simulations were performed; a handshaking protocol like the one suggested for
the precharged scheme in Section 3.3 would need to be used. The most complex
portion of this scheme is implementing the multiplexing between neighboring
wires, which is desirable because of its power savings.
Preliminary simulations demonstrated the high degree of accuracy required in
implementing this scheme. Many high-speed switches are required, and a
third clock, called the data clock (DCLK), would be needed for multiplexing
data between neighboring wires. As can be inferred, taking advantage of this
scheme's benefits requires a large amount of overhead:
- The scheme needs to be converted to self-timed operation using a handshaking
protocol.
- Regulators or additional supplies are needed to generate the reduced voltages
that set the signal swing.
- Low-threshold CMOS transistors are needed to compensate for the reduced
current drive.
- Data lines are precharged, which increases static power consumption.
- The complementary signal needs to be generated, requiring additional
circuitry.
- Multiplexing circuitry is needed, and it must be combined with the
handshaking protocol.
Attempts to modify the circuit to avoid some of this overhead were
futile, as the entire scheme would need to be modified.
We therefore concluded, partway through the examination, that the overhead of
the circuit outweighed its promising low-power results and high noise
tolerance. If the focus of this project included high speed, pursuing the
scheme might have been more appropriate, though the overhead would still have
to be well justified.
3.5 Summary
In an attempt to make a fair comparison between these circuits, we
decided to optimize each driver/receiver pair for energy when driving
a 5 mm wire at both 10 ns and 5 ns. Only one processor is considered,
that is, no hanging pass-transistor drains were included in these
simulations. Table 11 shows the results, including the transmission gate
dimensions and width of the NMOS driver transistor (assuming a 3:1 ratio
for the PMOS to NMOS size). The energy listed is the total supply energy for
a low-to-high and high-to-low transition.
Table 11: Comparison of Drivers/Receivers optimized for a 5 mm Wire
| Architecture | Delay | Energy | Swing | WpassN | WpassP | WdriverN | Rdrv | VPmax | Ipeak |
| CMOS Inverters | 10 ns | 1.46 pJ | 1.5 V | 2.4 um | 6.9 um | 2.1 um | 8600 Ohms | 0.68 V | 125 uA |
| | 5 ns | 4.7 pJ | 1.5 V | 21 um | 50.1 um | 32.4 um (1) | 560 Ohms | 0.68 V | 1.34 mA |
| Hitachi | 10 ns | 2.08 pJ (2) | 1.0 V | 8 um | 8 um | 14 um | 500 kOhms | 0.41 V | 295 uA |
| | 5 ns | 2.11 pJ | 1.0 V | 20 um | 20 um | 18 um | 500 kOhms | 0.41 V | 401 uA |
| Prech. to 1/2 Vdd (3) | 8 ns (4) | 3.44 pJ | 0.9 V | 12 um | 36 um | 1.2 um | 15 kOhms | 0.63 V | 503 uA |
| | 2 ns (5) | 6.8 pJ | 0.75 V | 12 um | 36 um | 1.2 um | 15 kOhms | 0.63 V | 641 uA |
Notes:
(1) The driver is the last inverter in a chain of 4 progressively sized
inverters. The delay includes the transitions of the entire chain.
(2) Assumes 100% efficient switching regulators for low-swing supply rails.
(3) Uses a 2.5 V supply instead of 1.5 V. The circuit could not be made
to function at lower voltages.
(4) Uses the Zero-Standby Power signaling scheme. Delay includes 5 ns
for precharging and 3 ns for transmission.
(5) Assumes the synchronous timing scheme is used with Teval = 2 ns.
Also included in the table are the noise parameters discussed in Section 2.
Rdrv is the output resistance of the driver in the middle of the
signal swing, where the circuit is most vulnerable to noise.
VPmax is the maximum allowable steady state noise voltage input
to the sense amp. Both values were measured from DC simulations of the
drivers and receivers. Lastly, Ipeak was measured as the
peak supply current during the transient simulations.
In retrospect, the circuits we chose to test were intended to increase
the speed of interconnect and not to reduce power. The CMOS circuit
has the lowest driver resistance and a large VPmax,
suggesting that it will be the most noise resistant. This circuit
tends to induce large current peaks in the supply rails and be energy
inefficient for small delays.
The precharged circuit is slightly less noise resistant due to its high driver
resistance. This circuit can achieve speeds much greater than the
CMOS inverter chain can, but at a large energy cost. This circuit is
probably not suited to a low-power system, even if the timing problems
could be solved.
The Hitachi circuit has an energy advantage over the CMOS inverter for a
delay of 5 ns, but uses more energy than the CMOS inverter with a
delay of 10 ns. The circuit was difficult to optimize for swings below
1 V, and did not function below 0.7 V. Note that VPmax is
reduced, meaning that this circuit will be more sensitive to noise.
Also, the driver output resistance is huge (500 kOhms), but this may simply
be an artifact of our doctored technology file. While the circuit does
seem to be significantly more prone to coupling noise, it is important to
remember that the rise rate of neighboring bitline voltages will be much
smaller for this scheme than for the CMOS inverter.
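The energy comparisons drawn in this summary can be checked directly from the
delay and energy columns of Table 11. The sketch below copies those values
into a small list and computes energy ratios and energy-delay products; the
helper names are hypothetical, and the energy-delay product is an added
figure of merit rather than one used in the report. Note that the precharged
circuit was simulated with a 2.5 V supply, so its entries are not a
like-for-like comparison with the other two.

```python
# (architecture, delay in ns, energy in pJ), copied from Table 11
TABLE_11 = [
    ("CMOS inverters", 10.0, 1.46),
    ("CMOS inverters", 5.0, 4.7),
    ("Hitachi", 10.0, 2.08),
    ("Hitachi", 5.0, 2.11),
    ("Precharged to 1/2 Vdd", 8.0, 3.44),
    ("Precharged to 1/2 Vdd", 2.0, 6.8),
]


def energy_pj(arch, delay_ns):
    """Look up the energy (pJ) of one table entry."""
    for name, delay, energy in TABLE_11:
        if name == arch and delay == delay_ns:
            return energy
    raise KeyError((arch, delay_ns))


def energy_ratio(arch, ref_arch, delay_ns):
    """Energy of arch relative to ref_arch at the same delay target."""
    return energy_pj(arch, delay_ns) / energy_pj(ref_arch, delay_ns)


def energy_delay_product(arch, delay_ns):
    """Energy-delay product in pJ*ns for one table entry."""
    return energy_pj(arch, delay_ns) * delay_ns
```

At the 5 ns target, `energy_ratio("Hitachi", "CMOS inverters", 5.0)` is about
0.45, so the Hitachi circuit uses less than half the energy of the CMOS
inverter chain. At the 10 ns target the same ratio is about 1.42, which is the
crossover described above.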
4 Conclusions and Recommendations
In conclusion, let us readdress the original questions this project posed.
- Can low swing interconnect reduce power?
Yes, but only for higher switching speeds or with complex signaling schemes.
The overhead of the Hitachi architecture causes it to consume more power than
the standard CMOS inverter when the total delay is around 10 ns. The
power used by precharged schemes depends entirely on the signaling architecture
which supports it.
- Can methods be found to handle the problem of noise?
Coupling noise for the speeds of interest can be modeled and predicted
using the methods presented. Layouts can be made noise resistant to an
arbitrary degree. The real problem is likely to be supply noise which has
not yet been adequately examined.
- How much area would such an interconnect scheme need?
The area depends on the dimensions of the PIM and the number of PIMs required.
Additional area is needed for the global wires. For the 4-channel, 6-processor
case, the total area is likely to be around 1 mm^2.
- Are the interconnect schemes examined in this report feasible, in the long
run?
Probably. The base case system with CMOS inverters operating in full-swing
mode seems to be the best of the circuits analyzed for the Pleiades
team. This circuit is robust, energy efficient, and relatively noise free
for delays around 10ns. But it is still possible that a low-swing
architecture exists that will outperform the static CMOS inverter. What will
be the characteristics of such an architecture?
- Only one extra supply rail - The Hitachi scheme examined would
not work unless two extra supply rails were used to put the swing around the
midpoint voltage of 0.75 V. If a circuit could be developed which allows
the voltage to swing between ground and an intermediate voltage, then the PMOS
pass transistors could be eliminated as well as the complexity of a second
voltage regulator.
- Static receiver with large noise margins - The receiver should
be static to support handshaking signals. Making a static level restorer
circuit with noise margins equal to half the voltage swing is non-trivial.
5 References
[1] Nakagome, Y. et al., "Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSI's," IEEE Journal of Solid-State Circuits, April 1993.
[2] Shepard, K.L. et al., "Noise in Deep Submicron Digital Design," 1996 IEEE/ACM International Conference on Computer-Aided Design, November 1996.
[3] Okada, T. et al., "Characterization of Net Configurations for Multichip Modules," IEEE Multi-Chip Module Conference, July 1994.
[4] Vassiliou, I. and Prihadi, K., "A Comparison of CMOS Driver/Receiver Circuits for Reduced Swing Interconnect," U.C. Berkeley, May 1993.
[5] Rabaey, J.M., "Digital Integrated Circuits: A Design Perspective," Prentice Hall, 1996.
[6] Yamauchi, H. et al., "A Signal-Swing Suppressing Strategy for Power and Layout Area Savings Using Time-Multiplexed Differential Data-Transfer Scheme," IEEE Journal of Solid-State Circuits, September 1996.
[7] Lee, M., "A Fringing and Coupling Interconnect Line Capacitance Model for VLSI On-Chip Wiring Delay and Crosstalk," IEEE International Symposium on Circuits and Systems, May 1996.
[8] Sicard, E., "Analysis of Crosstalk Interference in CMOS Integrated Circuits," IEEE Transactions on Electromagnetic Compatibility, May 1992.
[9] Kuroda, T. et al., "A High Speed Low-Power 0.3um CMOS Gate Array with Variable Threshold Voltage (VT) Scheme," Proceedings of the IEEE Custom Integrated Circuits Conference, 1996.
Peggy Laramie, Roawen Chen, Rhett Davis
15 May 1997