
Comparison of Low-Swing Driver/Receiver Circuits for Reconfigurable Interconnect

EECS 241 Project Final Report
by Peggy Laramie, Roawen Chen, and Rhett Davis
Presentation Slides (175 kB PostScript File)

1 Introduction

As described in our midterm report, this project has focused on global interconnect in a chip with a core processor and several satellite processors. This work was inspired by the Pleiades project at UC Berkeley, which aims to reduce the power consumption of multimedia processors through use of this "domain specific" architecture. The interconnect between the processors is required to be reconfigurable to allow "on the fly" speed and power optimization. Furthermore, the interconnect must support a self-timed signaling scheme, i.e., one with no system clock. Our goal has been to examine the tradeoffs in the design of such a system and propose a course of action to the Pleiades project team.



Figure 1: General Interconnect Network Model

The general form of our architecture was determined early in this project and is shown in Figure 1. Note that the system uses a segmented bus with programmable interconnect points to connect each processor to the bus. Because of our focus on low-power design, we considered each interconnect point as a switch with no driving capability. This lack of repeaters in the global interconnect limits us to relatively low speeds, with delays in the range of 5 to 20 ns for wires 2 to 10 mm long. It also led us to develop the bit-line model shown in Figure 2, which consists of a driver, a receiver, and three distributed RC networks separated by two pass transistors.



Figure 2: Circuit Diagram for Performance Analysis

The project has not proceeded as originally planned in the midterm report. We had originally focused on the design of individual bit-lines and promised to examine the design tradeoffs of various driver/receiver circuits in terms of power, speed, noise, and area. As we developed metrics for area and noise performance, we discovered that a number of system level parameters, such as line pitch, could have a major effect on the choices of certain bit-line parameters, such as voltage swing. Thus, much of this report covers the system-level design methods we developed to help us make fair comparisons between the driver/receiver circuits.

This report seeks to answer the following questions:

  • Can low swing interconnect reduce power?
  • Can methods be found to handle the problem of noise?
  • How much area would such an interconnect scheme need?
  • Are the interconnect schemes examined in this report feasible, in the long run?
We approach these questions by examining the system design and bit-line design individually. Section 2 will examine how area and noise in the system are affected by such parameters as line pitch, line width, and pass transistor design. Section 3 compares the different driver/receiver pairs which determine the delay and power consumption of the system. Section 4 summarizes our findings and addresses the question of feasibility.

2 System Design Analysis

There is an inherent tradeoff in this system between area and noise sensitivity. The area required for driver/receiver circuits will probably be insignificant compared to the area needed for the long interconnect wires. The distance between the wires (or line pitch), the width of the wires, and the size/placement of the pass transistors are the key parameters to consider.

Because the interconnect must support a self-timed signaling scheme, our noise requirement is that the system must be glitch free. Glitching is caused by noise spikes which vary in size depending on wire capacitance, coupling capacitance, wire resistance, signal swings and rise/fall times. With the exception of signal characteristics, all of these parameters can be calculated when given the line pitch, line width, and pass transistor geometry. We therefore assume in this section that signals can be described by a voltage swing and a rise/fall time. These parameters are considered design criteria for the line drivers and otherwise have no effect on area and coupling noise. We must also assume that the receiver can be described by a parameter VPmax, which represents the size a noise spike must reach before becoming a glitch.

We will begin by discussing the layout of the programmable interconnect matrix which we used to derive our basic model. Then, we will discuss our methods for determining coupling capacitances. Lastly, we will describe how we cope with coupling noise and supply noise.

2.1 Programmable Interconnect Matrix (PIM) Layout

The PIM layout in Figure 3 shows the NMOS pass transistors which connect a four-bit bus of global wires (horizontal, metal 2) to a four-bit bus of local wires (vertical, metal 1). The poly-silicon gates are shown disconnected in the figure, but would conceivably be all tied together.

The term programmable interconnect matrix (PIM as opposed to PIP) stems from the fact that only one memory cell is needed for a number of bit-lines. Area must be set aside for the memory, but the memory circuit will be inactive during normal operation and will thus not contribute to coupling noise. A memory cell could be made out of metal1 and poly-silicon alone and tucked under metal 2 wires. Since global wires are assumed to take up large area, the memory area is considered to be negligible.



Figure 3: Four Bit PIM Layout

This layout is the most dense possible for the desired interconnections using the MOSIS 0.6um technology. It is based on a cell which is 16 lambda (4.8um) wide and 7 lambda (2.1um) high. Line pitch can be increased but not decreased. This gives us a way of calculating the minimum area of the PIMs. Assuming 16 bit buses with 2 signaling wires, 6 processors and 4 channels, the area will be

AreaPIM = (4.8 um x 18 wires) x (2.1um x 18 wires) = 3270 um2
AreaTotal = 3270 um2 x 6 processors x 2 PIMs/proc. x 4 channels = 157000 um2

Here we have assumed that each processor contributes one PIM for the driver and a second for the receiver. Note that the size of the pass transistor does not add to the total area unless it exceeds a maximum. This width can be calculated as

Wmax = (2.1 um x no. of wires) - 1.2um = 36.6 um

If transmission gates are used (i.e., PMOS transistors in addition to NMOS transistors), then 3.6 um must be subtracted for diffusion spacing. In this case, 33.0 um of total pass transistor width can be inserted without affecting the area. With this layout in mind, we can see the primary sources of coupling noise, namely the neighboring wires and the cross-over wires.
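The area and width arithmetic of this section can be reproduced with a short script. This is only a sketch: the function and constant names are illustrative and are not part of the original design flow, and the dimensions are simply the values quoted above for the MOSIS 0.6 um layout.

```python
# Sketch of the PIM area estimate from Section 2.1.
# All names are illustrative; the dimensions are the values quoted in the text.

CELL_WIDTH_UM = 4.8    # 16 lambda per wire (horizontal pitch)
CELL_HEIGHT_UM = 2.1   # 7 lambda per wire (vertical pitch)
NMOS_MARGIN_UM = 1.2   # subtracted from the PIM height to get Wmax
PMOS_SPACING_UM = 3.6  # extra diffusion spacing when transmission gates are used


def pim_area_um2(wires: int) -> float:
    """Area of one PIM joining `wires` global wires to `wires` local wires."""
    return (CELL_WIDTH_UM * wires) * (CELL_HEIGHT_UM * wires)


def total_pim_area_um2(wires: int, processors: int, channels: int,
                       pims_per_processor: int = 2) -> float:
    """Total PIM area, with one driver PIM and one receiver PIM per processor."""
    return pim_area_um2(wires) * processors * pims_per_processor * channels


def max_pass_width_um(wires: int, transmission_gate: bool = False) -> float:
    """Largest total pass transistor width that fits without growing the PIM."""
    width = CELL_HEIGHT_UM * wires - NMOS_MARGIN_UM
    if transmission_gate:
        width -= PMOS_SPACING_UM
    return width


if __name__ == "__main__":
    wires = 16 + 2  # 16 data bits plus 2 signaling wires
    print(f"PIM area:   {pim_area_um2(wires):.0f} um^2")
    print(f"Total area: {total_pim_area_um2(wires, 6, 4):.0f} um^2")
    print(f"Wmax NMOS:  {max_pass_width_um(wires):.1f} um")
    print(f"Wmax TG:    {max_pass_width_um(wires, True):.1f} um")
```

The unrounded total differs slightly from the figure in the text because the text rounds the single-PIM area to 3270 um2 before multiplying.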

2.2 Methods for Calculating Coupling Capacitances

The crosstalk noise between any two metal lines can be viewed as the superposition of capacitive coupling and inductive coupling. Integrated circuits fabricated in 0.35 um to 1.0 um gate processes have very small currents and high switching speeds, so the capacitive coupling tends to dominate. Therefore, the capacitive coupling between metal lines can be modeled as a single capacitor instead of a distributed complex impedance [8]. One should also note that as the metal wires become narrower, the lateral capacitances Ca1 and Cb1 tend to become larger than the wire-to-substrate capacitance, as shown in Figure 4.



Figure 4: Parasitic Lateral Capacitances

Packing materials also contribute second-order effects to the parasitic capacitances. The study in [7] shows the importance of these effects and incorporates into its coupling capacitance model both the nonlinear (power-law) dependence on spacing and the exponential behavior of field sharing as the width of the metal wires increases. The model is as follows:

$$C_c = \left[k_1 + k_2\left(1 - e^{-w/m}\right)\right]\left(\frac{S}{h}\right)^{-n} \qquad (1)$$

The coefficients n, m, k1 and k2 all depend on the packing materials used: n is the nonlinear spacing coefficient, m is the exponential curve parameter for the width dependency, and k1 and k2 are constants that depend on the surrounding geometry. For more details refer to [7]. S, w and h are the spacing between the wires, the width of the wires, and the height, respectively.

The values for the parameters above were chosen with consideration to typical values for an HP 0.6um CMOS process supplied by MOSIS. The following values are therefore obtained from [7]:

k1 = 4.69e-5 pF/um
k2 = 3.96e-5 pF/um
m = 2.35 um
n = 0.83
h = 1um
Figure 5 demonstrates how the cross coupling capacitance varies with the width of the wires and the spacing between them. For this particular model, spacing the wires closer than 3um results in a rapid increase in the capacitance.
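Equation 1 is easy to evaluate numerically. The sketch below uses the parameter values listed above; the function name and the choice of example geometry (2 um wide wires at 1.2 um spacing) are illustrative assumptions rather than part of the original analysis.

```python
import math

# Parameters of Equation 1 as listed in the text (from [7]).
K1_PF_PER_UM = 4.69e-5
K2_PF_PER_UM = 3.96e-5
M_UM = 2.35   # exponential curve parameter for the width dependency
N = 0.83      # nonlinear spacing coefficient
H_UM = 1.0    # height


def coupling_cap_pf_per_um(width_um: float, spacing_um: float,
                           height_um: float = H_UM) -> float:
    """Same-layer coupling capacitance per unit length, Equation 1."""
    width_term = K1_PF_PER_UM + K2_PF_PER_UM * (1.0 - math.exp(-width_um / M_UM))
    return width_term * (spacing_um / height_um) ** (-N)


if __name__ == "__main__":
    # Illustrative geometry: 2 um wide wires at several spacings.
    for spacing in (1.2, 2.0, 3.0, 5.0, 10.0):
        cap = coupling_cap_pf_per_um(2.0, spacing)
        # 1 pF/um equals 1e6 fF/mm
        print(f"S = {spacing:4.1f} um -> {cap * 1e6:6.1f} fF/mm")
```

For 2 um wide wires spaced 1.2 um apart this evaluates to roughly 60 fF/mm, and the printed values show the rapid growth in capacitance at small spacings.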



Figure 5: Coupling Capacitance as a Function of Spacing and Width

If we examine one bit line in a channel of the architecture shown in Figure 6, the wires which may contribute coupling capacitance are all of the neighboring wires and the orthogonal wires which cross the bit line on a different metal layer. Each bit line has 48 orthogonal wires (3 channels x 16 bits), 12 transmission gates (only 2 of which are active at any time), and at least one neighboring wire. Considering that a handshaking signal is routed along with every channel, there would be an additional 3 orthogonal wires (one for each channel), which increases the total to 51. If a coupling capacitor is placed between the bit line and every neighboring and orthogonal wire, the complexity of the noise model increases dramatically. Therefore, some valid simplifications are in order.



Figure 6: Coupling Capacitances of Architecture

The capacitance created between two orthogonal wires depends on the area over which the two overlap. Using area and fringing capacitances between metal 1 and metal 2 obtained from MOSIS for a 0.6um HP process, Table 1 is generated. This table shows the coupling capacitance produced for various wire widths.

Table 1: Coupling Capacitance as a Function of Wire Width

Width of Wire    Coupling Capacitance
0.5 um           12.25 aF/um
1 um             40 aF/um
1.5 um           83.25 aF/um
2 um             142 aF/um
2.5 um           216.25 aF/um
3 um             306 aF/um

The coupling capacitance induced between two wires of width 2um in the same metal and spaced 10um apart is 10pF/um. The total orthogonal capacitance experienced by one bit line of width 2um is (51*142) = 7.2fF/um. The capacitance induced by all of the orthogonal wires is insignificant in comparison to that produced by the neighboring wire. Therefore, for simplicity, only the coupling capacitances between two wires in the same metal will be explored further as potential noise sources.
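As a cross-check on Table 1, the tabulated values are reproduced by a quadratic in the wire width, which is what one would expect when an area term (proportional to the w x w overlap) is added to a fringing term (proportional to w). The coefficients 31 and 9 below were inferred by fitting the table and are not quoted from the MOSIS data; the names are illustrative, and results are in the units used by Table 1.

```python
# Values copied from Table 1 (wire width in um -> coupling capacitance, table units).
TABLE_1 = {0.5: 12.25, 1.0: 40.0, 1.5: 83.25, 2.0: 142.0, 2.5: 216.25, 3.0: 306.0}

# Coefficients inferred by fitting Table 1 (an assumption, not MOSIS data).
AREA_COEFF = 31.0    # multiplies width squared (overlap area term)
FRINGE_COEFF = 9.0   # multiplies width (fringing term)


def crossing_cap(width_um: float) -> float:
    """Coupling capacitance of one orthogonal crossing, in Table 1 units."""
    return AREA_COEFF * width_um ** 2 + FRINGE_COEFF * width_um


def total_orthogonal_cap(width_um: float, crossings: int = 51) -> float:
    """Capacitance summed over all orthogonal wires crossing one bit line."""
    return crossings * crossing_cap(width_um)


if __name__ == "__main__":
    for width, tabulated in TABLE_1.items():
        print(f"w = {width} um: fit {crossing_cap(width):.2f}, table {tabulated}")
    print(f"51 crossings at 2 um: {total_orthogonal_cap(2.0):.0f}")
```

With 51 crossings of 2 um wires the sum is 7242 in table units, which matches the 7.2 figure quoted above once the aF-to-fF conversion is applied.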

2.3 Methods for Handling Coupling Noise

From the perspective of a designer, we would like to examine our layout and be able to calculate whether or not a glitch is possible. We would also like to have some idea of how much "noise headroom" we have. In order to determine these aspects, we examined the circuit in Figure 7.



Figure 7: Coupling Noise Model

Here we have assumed that the entire line capacitance is lumped into C, and the resistance of the wire, pass transistors, and line driver are lumped into R. Cc represents the coupling capacitance to noise source Vn. If we assume that the noise source can be modeled as a linear ramp and that the initial voltage V is 0, then the solution to the differential equation is

$$V(t) = R\,C_c\,\frac{dV_n}{dt}\left(1 - e^{-t/\left[R\,(C + C_c)\right]}\right) \qquad (2)$$

Using this model, we can assume a linear superposition of all noise sources. We can make this model more general if we do not assume that the initial value of V is zero. In this case, the voltage will be in the process of settling to its final output value. To ensure that no glitch is possible, we must show that the derivative of V(t) can never be positive when in the transition region. This requirement reduces to the following expression:

(3)

where VPmax is the maximum peak voltage allowed at the input of the sense amp (in steady state) to prevent glitching. VPmax basically sets the boundary of the transition region. This expression assumes we are summing over all possible noise sources.
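For reference, the node equation behind this model can be sketched as follows. The sketch assumes that Figure 7 places R and C between the victim node and a quiet reference, with Cc connecting the victim node to the noise source; this topology is inferred from the description above, since the figure is not reproduced here. The last line is one plausible reading of the no-glitch requirement and is not necessarily the exact form of Equation 3.

```latex
% Current balance at the victim node V (assumed topology: R and C to a quiet
% reference, C_c to the noise source V_n)
C\,\frac{dV}{dt} + \frac{V}{R} = C_c\,\frac{d(V_n - V)}{dt}
\quad\Longrightarrow\quad
R\,(C + C_c)\,\frac{dV}{dt} + V = R\,C_c\,\frac{dV_n}{dt}

% With several ramp sources, setting dV/dt = 0 gives the level at which the
% driver current balances the injected current:
V_{\mathrm{bal}} = \sum_i R\,C_{c,i}\,\frac{dV_{n,i}}{dt}

% A plausible no-glitch condition (assumed reading of Equation 3):
\sum_i R\,C_{c,i}\,\left|\frac{dV_{n,i}}{dt}\right| < V_{Pmax}
```

Under this reading, the induced voltage cannot grow past the balance level, so keeping that level below VPmax keeps every noise spike out of the transition region.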

The value of R will tend to be dominated by either the pass transistors or the line driver, both of which are non-linear resistances. A minimum-sized inverter was simulated in the HP 0.6 um process and found to have a resistance ranging from 25 kOhms (for Vout=1.5 V) to 5 kOhms (for Vout=0 V).

To illustrate the use of this model, consider the example of noise coupling between two adjacent global wires. The wires are 2mm long and 1.2 um apart. Using the methods presented in section 2.2, we calculate the coupling capacitance to be 60 fF/mm, or 120 fF over the full length. The pass transistor is assumed to be a transmission gate with the PMOS and NMOS both 20um wide. The majority of the resistance is assumed to come from the CMOS inverter in steady state, thus giving R a value of 5 kOhms. The total line capacitance, including the local wires and transistor parasitics, is calculated to be 390 fF. The single line is simulated and found to have a rise rate of 115 mV/ns. Using this data, we can compute the following:

V(infinity) = R Cc dVn/dt = 5k * 120 fF * 115 mV/ns = 0.069 V
time constant = R (C + Cc) = 5k * (120 fF + 390 fF) = 2.55 ns

Two bit lines with these parameters were simulated with the coupling capacitance distributed along the length of the wire. The peak induced on the steady state wire was 0.071 V with a time constant of 1.9 ns. The discrepancy between the time constants is probably due to inaccuracies in the total line capacitance extraction.
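The lumped estimate above can be reproduced in a few lines. This sketch simply evaluates the two expressions used in the worked example with the values given in the text; the function names are illustrative.

```python
import math


def induced_noise(r_ohm: float, c_line_f: float, c_couple_f: float,
                  slope_v_per_s: float) -> tuple[float, float]:
    """Return (asymptotic noise voltage, time constant) of the lumped model."""
    v_final = r_ohm * c_couple_f * slope_v_per_s
    tau = r_ohm * (c_line_f + c_couple_f)
    return v_final, tau


def noise_at(t_s: float, v_final: float, tau_s: float) -> float:
    """First-order response to a ramp aggressor, starting from V = 0."""
    return v_final * (1.0 - math.exp(-t_s / tau_s))


if __name__ == "__main__":
    r = 5e3                        # driver resistance in steady state, ohms
    c_line = 390e-15               # total line capacitance, farads
    c_couple = 60e-15 * 2          # 60 fF/mm over a 2 mm wire
    slope = 115e-3 / 1e-9          # 115 mV/ns expressed in V/s

    v_final, tau = induced_noise(r, c_line, c_couple, slope)
    print(f"V(infinity)   = {v_final:.3f} V")
    print(f"time constant = {tau * 1e9:.2f} ns")
    print(f"V at 3 tau    = {noise_at(3 * tau, v_final, tau):.3f} V")
```

Because both quantities scale linearly with R, halving the driver resistance halves the asymptotic noise and also shortens the recovery time.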

The success of this model in predicting the noise from a neighboring wire shows that lumping distributed noise contributions into one node can greatly simplify the noise analysis. These equations can be used to gain an intuition about how to design a system which is coupling noise resistant. Please note that assumptions were made about film thicknesses to calculate coupling capacitance in spite of sketchy data on the HP 0.6 um process. The induced noise peak in this example may not be entirely accurate.

All that remains is to measure VPmax for each receiver circuit. A glitch is defined as an erroneous transition. We can further define an erroneous transition as an intermediate output. Thus, VPmax for a given receiver circuit will be the change in input voltage required to produce an output of VIL or VIH.

VPmax = min { Vin(Vout=VIL), Vdd - Vin(Vout=VIH) } (4)

These values can be obtained from examination of a DC transfer characteristic. For the minimum sized CMOS inverter, VPmax was found to be 0.69 V (for the HP 0.6 um process with a 1.5 V supply).

To summarize, a method has been created for estimating the immunity of the interconnect to coupling noise. This method requires extraction of resistance and capacitance values from a layout and an approximation of the line driver output resistance. The method also requires a priori knowledge of the noise signal's voltage swing and rise/fall time. The receiver circuit's sensitivity to noise is accounted for in the VPmax term. Using this method, we will evaluate the noise performance of different driver and receiver circuits.

2.4 Methods for Handling Supply Noise

Power supply noise is the variation of the supply and ground nets of the chip. The variations can be attributed to both DC and sinusoidal components. The DC component is produced by the IR drop through the power and ground nets, and the sinusoidal component is produced by the RLC response of the chip and package to current demands [5]. Compared to the circuit's frequency response, the variation of the power supply is quite slow and is often treated as DC noise for analysis.

Unfortunately, our project group has had very little experience with this subject and has not performed an adequate analysis. In order to predict what kind of supply noise is caused by our circuits, however, we have measured the peak supply current drawn during transitions. This data will help us see how much care must be taken in designing the power distribution system.

3 Driver/Receiver Circuit Analyses

In this section, we will examine several different options for driver/receiver circuits and the delay-power tradeoff for each one. First, the case of a standard CMOS progressively-sized inverter chain will be examined as a basis for comparison. The base case analysis demonstrates how delay and power relate to pass transistor sizing and wire length.

Designs for low swing driver/receiver circuits fall roughly into two categories. First, there are the circuits such as the Hitachi architecture which use extra supply rails to limit the swing. This architecture is limited in that it needs low-threshold transistors to function properly. Second, there are the precharged circuits, such as the "Precharged to half Vdd" scheme, which rely on special signaling to initialize and limit the swing on the wires. Both the Hitachi and "precharged to 1/2 Vdd" scheme will be examined in this section.

We had intended to also analyze a differential driver/receiver pair but due to the complexity of the circuit, we did not provide performance analysis. The differential circuit uses reduced swing supply rails in addition to a precharging scheme. This circuit will be discussed briefly followed by a summary of the circuit performances.

3.1 CMOS Inverter Chain

We chose the HP 0.6 um process from MOSIS and a supply voltage of 1.5 V as a starting point for our simulations because it fits the Pleiades project group's goals. The circuit in Figure 2 (from Section 1) was implemented in SPICE with pi-3 distributed RC models and CMOS inverters as the drivers and receivers. Using the technology parameters available from MOSIS, we calculated the total line capacitance and resistance for the global wires. Several values are tabulated in Table 2.

Table 2: Various Line Capacitances and Resistances in Metal 2

Length    Width = 0.9 um      Width = 1.2 um      Width = 1.5 um
2 mm      178 Ohms, 174 fF    133 Ohms, 180 fF    106 Ohms, 187 fF
5 mm      445 Ohms, 435 fF    333 Ohms, 451 fF    267 Ohms, 468 fF
10 mm     889 Ohms, 869 fF    667 Ohms, 902 fF    533 Ohms, 935 fF
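To put the Table 2 values in perspective, the sketch below computes the Elmore delay of an n-section pi ladder, the kind of pi-3 model used for the wires in the SPICE decks. The Elmore delay is a first-order time constant estimate rather than a simulated 50% delay, and the function name and the optional driver resistance argument are illustrative assumptions.

```python
def elmore_delay_pi_ladder(r_wire: float, c_wire: float,
                           sections: int = 3, r_driver: float = 0.0) -> float:
    """Elmore delay (seconds) at the far end of an n-section pi model.

    Each section has a series resistance r_wire/n with half of its
    capacitance c_wire/n at each end, so interior nodes carry c_wire/n
    and the two end nodes carry c_wire/(2n).
    """
    r_seg = r_wire / sections
    c_seg = c_wire / sections
    node_caps = [c_seg / 2] + [c_seg] * (sections - 1) + [c_seg / 2]

    # The driver resistance charges every node on the wire.
    delay = r_driver * sum(node_caps)
    # Series resistor k (1-based) charges all nodes downstream of it.
    for k in range(1, sections + 1):
        delay += r_seg * sum(node_caps[k:])
    return delay


if __name__ == "__main__":
    # 5 mm, 1.2 um wide metal 2 wire from Table 2.
    r_wire, c_wire = 333.0, 451e-15
    wire_only = elmore_delay_pi_ladder(r_wire, c_wire)
    with_driver = elmore_delay_pi_ladder(r_wire, c_wire, r_driver=5e3)
    print(f"wire alone:         {wire_only * 1e12:.1f} ps")
    print(f"with 5 kOhm driver: {with_driver * 1e9:.2f} ns")
```

For the pi model the wire term works out to R*C/2 regardless of the number of sections. For the 5 mm wire this is tens of picoseconds, small next to the nanosecond-scale delays discussed in this report, which is consistent with the driver and pass transistor resistances dominating.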


We had originally planned to use NMOS pass transistors and deal with the threshold voltage drop by adjusting the switching threshold of the receiver down by Vth/2, but we found it difficult to design a CMOS inverter with a switching threshold of 0.4 V. In the end, we decided that adding a PMOS transistor was necessary for proper operation. Without the PMOS, the high-level noise margin of the combined driver and receiver is very low. The transmission gate has a high resistance around 0.75 V, where both the NMOS and PMOS transistors are on the verge of shutting off. We found that the delay and power were generally minimized for a PMOS to NMOS width ratio of 3:1.

We found that the delay and switching energy depend almost entirely on the total line capacitance, independent of where local wires intersect the global wires. As one might expect, increasing the transmission gate widths gives an optimum in terms of delay, at the point where resistance becomes insignificant and parasitic capacitances dominate. Also, varying the number of hanging transmission gates connected to the wire has a major effect on delay and energy.

Figure 8 shows delay and switching energy plots for a 5 mm wire while varying the combined NMOS and PMOS widths. A two-stage, progressively sized CMOS buffer with a scaling factor of 5 was used for these simulations. The single processor case considers no hanging drains from pass transistors of unconnected processors. Increasing the number of processors in the system increases the number of drain capacitances which must be switched. There will be two extra drains per processor: one for the driver and one for the receiver.



Figure 8: CMOS Inverter Performance vs. Pass Transistor Size (for a 5mm wire)

We had wanted to investigate the use of boot-strapping or charge pumps to boost the gate voltage of the pass transistors. The circuits proved to be complex and beyond the scope of this project, but Figure 9 shows what gains could be made from such a scheme. The simulations illustrated here are identical to those in Figure 8 except that the NMOS pass transistor gate is tied to 2.3V (Vdd + 0.8V) and the PMOS pass transistor is deleted. Switching energy can be significantly reduced with this method.



Figure 9: CMOS Inverter Performance with Bootstrapped Gate NMOS Pass Transistor (for a 5mm wire)

3.2 Hitachi Low-Swing Architecture

A schematic of the Hitachi low-swing architecture appears in Figure 10. The rails of the reduced swing on the interconnect are set by the scaled voltages Vsl and Vcl, and low-threshold MOSFETs are used to help compensate for the scaled swing.



Figure 10: Hitachi Low-Swing Driver/Receiver

The important design goals of this scheme are:
  1. to achieve high-speed level conversion, both from high to low and from low to high
  2. to achieve a low standby current compared to the base case, which results in reduced overall energy
The swing is limited by the Vsl and Vcl supplies, while the current driving capacity is maintained by using low-threshold transistors. The receiver amplifies the reduced signal to full scale by utilizing positive feedback. An important factor in determining the overall power consumption of the scheme is the type of internal supply used to generate Vcl and Vsl. This issue will be addressed in section 3.2.3, but first the operation of this circuit will be examined independently.

Source Offset Driver
The driver is a single CMOS gate with the recommended NMOS to PMOS ratio of 1 : 10/3 [1]. Both are low-threshold devices, but we were unable to obtain SPICE models for them. Consequently, we improvised and created these special MOSFETs by lowering the thresholds of the HP 0.6um SPICE models from the original 0.7V to 0.2V. Though this modification will skew the results relative to the behavior of actual low-threshold devices, it at least provides an approximation of the circuit's performance and verifies its operation. The thresholds should be lowered such that dVTN = Vsl and dVTP = Vcc-Vsl. The effective gate voltages then become VGS - VT, which helps maintain a low standby current.

Sense Amplifier
The receiver is also composed of a low threshold NMOS and PMOS, and two sets of standard NMOSs and PMOSs which create two symmetric level converters. An input signal sensed at about Vcl will be scaled to Vcc by the upper converter and similarly a signal sensed at Vsl is converted to ground by the lower converter. Figure 11 demonstrates the conversion process for a low to high transition. The feedback of the receiver initially slows the response of the receiver and turns on when node B reaches Vcl. Then node A rises to Vcc which turns off the PMOSs and pulls the output down to ground.

The reduced output drive current of the driver contributes to the delayed response of the receiver. As a result, the conversion delay added by the receiver becomes significant, though some of the conversion penalty is compensated by the reduction in interconnect delay.



Figure 11: Illustration of receiver's operation

We are interested in evaluating the driver/receiver's performance when used to drive the interconnect architecture of Figure 1. The length of the interconnect will vary the delay by a constant factor as long as the resistance of the driver is always larger than that of the interconnect. Regardless of the length chosen, it is important to examine the impact of the size of the driver and transmission gates on performance. Therefore, choosing optimum sizes in terms of energy and delay will allow us to explore all the advantages of this scheme.

3.2.1 Using low threshold transistors
The transistor sizes of the driver are set as discussed above. SPICE simulations showed that the optimum NMOS:PMOS ratio for the transmission gates in the interconnect architecture is about 1:3. To find an optimum driver/transmission gate sizing, a first-order analysis is performed with the interconnect length set to 2mm. The widths of the transmission gates and the driver are varied from 1.2um to 50um while maintaining the NMOS:PMOS ratio. The swing is also varied from 0.7V to 1.5V. Figure 12 reveals that neither the size of the transmission gate nor the swing significantly affects the optimum driver size. Figure 13 indicates that the swing of the signal does affect the optimum size of the transmission gate: as the swing decreases from 1.5V to 0.7V, the optimum transmission gate size increases to compensate for the loss in current drive, which consequently increases the overall energy.

Table 3: Optimum Driver Sizes

Optimized for          Wdriver    Wtransmission    Average Delay    Average Energy    Average Energy*Delay
Lowest Energy          11 um      11 um            5.7 ns           1.8 pJ            1e-20 Js
Lowest Delay           28 um      23 um            3.7 ns           2.5 pJ            8.6e-21 Js
Lowest Energy*Delay    23 um      18 um            3.9 ns           2.2 pJ            8.1e-21 Js

Our goal is to optimize for energy while minimizing the delay; therefore, sizes are chosen which optimize the energy-delay product. Performance is then evaluated by setting the interconnect length to 5mm and varying the swing between 0.7V and 1.5V. The simulated results are tabulated in Table 4. A signal with a 0.8V swing reduces the energy by about one third relative to the full 1.5V swing, but with a propagation delay that is about 15% of its 50ns period. Table 4 shows that the swing with the best Energy Delay Product (EDP) is in fact full swing, though it has the highest energy consumption.

Table 4: Performance

Optimized for          Voltage Swing    Average Delay    Average Energy    Average Energy*Delay
Lowest Energy          0.8 V            8 ns             2.1 pJ            1.7e-20 Js
Lowest Delay           1.5 V            2.6 ns           3.1 pJ            8e-21 Js
Lowest Energy*Delay    1.5 V            2.6 ns           3.1 pJ            8e-21 Js
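The selection of operating points in Table 4 amounts to minimizing energy, delay, or their product over the simulated swings. The sketch below shows that bookkeeping using only the two swings listed in the table; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwingResult:
    swing_v: float
    delay_s: float
    energy_j: float

    @property
    def edp_js(self) -> float:
        """Energy-delay product in joule-seconds."""
        return self.delay_s * self.energy_j


# Operating points copied from Table 4 (low-threshold case, 5 mm wire).
TABLE_4 = [
    SwingResult(swing_v=0.8, delay_s=8e-9, energy_j=2.1e-12),
    SwingResult(swing_v=1.5, delay_s=2.6e-9, energy_j=3.1e-12),
]


def lowest_energy(results):
    return min(results, key=lambda r: r.energy_j)


def lowest_delay(results):
    return min(results, key=lambda r: r.delay_s)


def lowest_edp(results):
    return min(results, key=lambda r: r.edp_js)


if __name__ == "__main__":
    for r in TABLE_4:
        print(f"{r.swing_v} V: EDP = {r.edp_js:.2e} Js")
    print("lowest energy at", lowest_energy(TABLE_4).swing_v, "V")
    print("lowest EDP at", lowest_edp(TABLE_4).swing_v, "V")
```

The low swing wins on energy alone, but its longer delay outweighs the saving in the product, which is why full swing has the best EDP.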

The effect of wiring capacitance and resistance is simulated as shown in Figure 14. This scheme is quite insensitive to distributed RC increases, as the delay and energy increase proportionally. This is due to the small ratio of the interconnect delay to that of the entire scheme. The length must become quite large before the interconnect can begin to dominate the delay. The operation of this scheme can then be maintained as long as the distributed RC is smaller than the interconnect delay.

3.2.2 Using standard transistors
The same analyses are performed for this case as for the previous one. The results are similar, but due to the increased thresholds the architecture could not operate below a swing of 0.9V. The optimized results for the driver can be examined in Figure 15 and those for the transmission gate in Figure 16. Table 5 compares the tradeoffs between optimizing for energy and delay.

Table 5: Optimized Driver Sizes

Optimized for          Wdriver    Wtransmission    Average Delay    Average Energy    Average Energy*Delay
Lowest Energy          13 um      12 um            7.4 ns           1.8 pJ            1.3e-20 Js
Lowest Delay           34 um      20 um            4.6 ns           2.4 pJ            1.1e-20 Js
Lowest Energy*Delay    30 um      16 um            4.9 ns           2.2 pJ            1e-20 Js

The simulated effects of the voltage swing and length are in Figure 17 and Figure 18. Table 6 summarizes the results.

Table 6: Performance

Optimized for          Voltage Swing    Average Delay    Average Energy    Average Energy*Delay
Lowest Energy          0.9 V            7 ns             2 pJ              1.4e-20 Js
Lowest Delay           1.5 V            2.5 ns           3.3 pJ            1e-20 Js
Lowest Energy*Delay    1.5 V            2.5 ns           3.3 pJ            1e-20 Js

3.2.3 Voltage Scaling
The energy measured above does not take into consideration the efficiency of the supply. There are currently three independent supplies, Vsl, Vcl and Vdd. The measurements were performed by integrating the current from each of the supplies across a capacitor and then summing up the results.

Vsl and Vcl will be generated on chip from the main supply, so examining the effect of these supplies on the overall performance provides a better approximation of the actual power dissipation. The supplies can be generated at the desired voltage through the use of voltage converters. Such converters can be divided into two main groups: linear regulators and switching regulators. Though there are many variations of the two, we are interested in the overall efficiency of the converter. The linear regulator can be modeled as a resistor network which divides the voltage. As the new voltage becomes much smaller than the supply, the efficiency of the linear regulator will begin to decrease. The efficiency of such supplies is usually about 80% [10].

The switching regulator, also called a pulse width modulator, is able to provide a higher efficiency, but its overhead is much greater than that of the linear regulator in terms of area and complexity. The efficiency of this regulator is in the range of 90-95% [10].

Therefore, the energy measurements provided above can be modified to reflect the behavior of either regulator by adding the inefficiency of the supply, which is a function of the desired swing.
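A minimal sketch of that adjustment is shown below. It makes two simplifying assumptions that are not from the original analysis: a single efficiency figure is applied to the whole measured energy (which overstates the penalty, since the part drawn directly from Vdd is not regulated), and the ideal linear regulator is modeled with efficiency Vout/Vin, ignoring quiescent current. The function names are illustrative.

```python
def energy_from_main_supply(e_load_j: float, efficiency: float) -> float:
    """Energy drawn from the main supply to deliver e_load_j through a regulator."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return e_load_j / efficiency


def ideal_linear_efficiency(v_out: float, v_in: float) -> float:
    """Upper bound on linear regulator efficiency (quiescent current ignored)."""
    if not 0.0 < v_out <= v_in:
        raise ValueError("require 0 < v_out <= v_in")
    return v_out / v_in


if __name__ == "__main__":
    e_measured = 2.1e-12  # lowest-energy point of Table 4, joules
    for label, eta in (("linear, 80%", 0.80), ("switching, 90%", 0.90),
                       ("switching, 95%", 0.95)):
        e_total = energy_from_main_supply(e_measured, eta)
        print(f"{label:15s}: {e_total * 1e12:.2f} pJ")
```

Even at 80% efficiency the adjusted 0.8 V figure stays below the 3.1 pJ measured at full swing, although the margin narrows considerably.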

3.2.4 Summary of Results
The results for the first and second case are very close. Case 1 (low threshold) has less delay and a lower EDP on average, while the energy is about the same as that of case 2 (standard threshold). Simulations were performed using the modified low-threshold transistors instead of an actual SPICE model because none was available. The similarities in performance between the two cases can therefore be attributed to this.

Table 7 compares the performance of the two cases using the optimized versions of each one.

Table 7: Performance Comparison

Devices         Signal Swing    Total Energy per Cycle    Edyn %    Esta %    Maximum tp
Modified MOS    1.5V            2.9 pJ                    93 %      7 %       2.9 ns
Modified MOS    1V              2.4 pJ                    90 %      10 %      5 ns
Standard MOS    1.5V            3.3 pJ                    97 %      3 %       3 ns
Standard MOS    1V              2.4 pJ                    97 %      3 %       6.3 ns

Due to factors such as conversion delay and the RC effects of the proposed interconnect, the advantages of this architecture would be best achieved at lower frequencies. The reduced supply (1.5V) decreases the output current drive of the driver, which contributes to the conversion delay of the receiver. Voltage regulators are also needed as additional overhead, with tradeoffs in terms of power, area and complexity. The linear regulator is the least complex but is less efficient, while the switching regulator is more complex and uses more area but results in a more efficient conversion.

The proposed architecture is intended to operate with a supply of 2 Volts, but we were interested in evaluating its performance at 1.5V because the speed degradation may not be large enough to hinder the type of low-power application in which it may be used. Using low-threshold transistors may have improved the overall performance of the scheme, but for feasibility purposes, the scheme was also examined without low thresholds. The delay increases as expected, which only reinforces the fact that this scheme cannot operate at high frequencies with a reduced supply voltage Vdd.

3.3 Precharged to Half Vdd Scheme

The conventional single-ended precharged circuit appears in Figure 19. During the precharge clock phase, the output of inverter M3-M4 is connected to its own input. This causes the inverter to precharge the line to the switching threshold. This method allows for very low power consumption and voltage swing, but these values are frequency dependent. This circuit can be implemented in the reconfigurable interconnect scheme only if it is operated in a fixed synchronous scheme.



Figure 19(a): Single-Ended precharged driver/receiver



Figure 19(b): Timing diagram of single-Ended precharged driver/receiver

The circuit operates as follows. During the precharging phase, the output of the precharging inverter (M3-M4) is short-circuited to its input so that the inverter is forced to its switching point (Vdd/2) when M3 and M4 are appropriately sized. During the evaluation phase, the short-circuit is disabled, and the interconnect capacitance is charged up or discharged according to the input. The interconnect swing depends on the current through M1 and M2 (i.e., the sizes of M1 and M2), the interconnect capacitance, and how long the evaluation signal is high. The reduced interconnect swing results in a large output swing since the output inverter acts as an amplifier.

In our midterm report, we promised to examine this circuit with M3 and M4 at the driver instead of the receiver. This would allow the driver to handle all precharging. Such a modification does not work, however, because the gate voltage rises too quickly to the median value and reduces the precharge current to a trickle. Therefore, M3 and M4 must be at the receiver.

Several important design issues for optimizing this circuit need to be addressed:

  1. The interconnect swing is approximated by

    (5)

    Thus, the swing can be adjusted mainly by changing the sizes of M1 and M2 and the evaluation period. It can also be adjusted by changing the sizes of M3 and M4 (i.e., changing Cint).

  2. The total delay is the sum of the delay through the NAND (NOR) gate, the delay through M1 and M2, the delay through the interconnect, and the delay through M3 and M4. The portion of the delay due to the NAND (NOR) gate can be significant and needs to be reduced by sizing up the transistors, at the cost of increased circuit area.

  3. The total power dissipation can be broken up into four categories as described below:

    • precharge power dissipation which is attributed to the static current through M3 and M4 during precharge phase, and is given by

      (6)

    • dynamic power dissipation attributed to the interconnect charge switching and the interconnect swing during evaluation phase, and is given by

      (7)

    • static power dissipation which comes mostly from the leakage current after the evaluation phase and before the next precharge signal.

    • standby power dissipation, the power dissipated when no precharge, evaluation, or input signal occurs. For instance, when a bus line is inactive, the only power dissipated is standby power.

  4. The switching threshold of INV1 has to be significantly below half Vdd to reduce the power dissipation of INV1 during the precharge phase. This can be achieved by properly sizing INV1.

  5. The sizes of M3 and M4 are selected so that the switching point is at Vdd/2. Large M3 and M4 are desirable in order to amplify a small swing and meet the delay constraints. However, sizing up M3 and M4 increases the precharge power dissipation while decreasing the dynamic power dissipation, because the swing is reduced.
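As a rough illustration of the swing dependence described in item 1, the sketch below uses a first-order charge model in which the driver delivers a constant current into a lumped interconnect capacitance for the evaluation period. This is a textbook approximation written for illustration and is not a reproduction of the numbered equations in this report. The current, capacitance, and timing values are hypothetical, and the Vdd/2 limit assumes the line starts from the precharge level and cannot pass a supply rail.

```python
def interconnect_swing(i_drive_a: float, t_eval_s: float,
                       c_int_f: float, v_dd: float) -> float:
    """First-order swing: dV = I * t / C, limited to Vdd/2.

    Assumes a constant drive current (set by M1/M2 sizing) and a lumped
    capacitance c_int_f that includes the receiver (M3/M4) gate load.
    """
    return min(i_drive_a * t_eval_s / c_int_f, v_dd / 2)


def eval_time_for_swing(v_swing: float, i_drive_a: float,
                        c_int_f: float) -> float:
    """Evaluation period needed to reach v_swing under the same model."""
    return v_swing * c_int_f / i_drive_a


# Hypothetical operating point: 500 uA drive, 1 pF line, 2 ns evaluation.
swing = interconnect_swing(500e-6, 2e-9, 1e-12, v_dd=2.5)
print(f"estimated swing: {swing:.2f} V")

# Doubling the capacitance (e.g. larger M3/M4) halves the swing.
print(f"with 2 pF: {interconnect_swing(500e-6, 2e-9, 2e-12, v_dd=2.5):.2f} V")
```

The model shows the three knobs named in item 1 directly: more drive current or a longer evaluation period raises the swing, and more capacitance lowers it.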

Figures 20(a), (b), and (c) plot the swing, delay, and total power dissipation for various sizes of M1 and M2 (= M1/3) and various evaluation periods. Table 8 lists the optimized power and delay for three different swings, and Table 7 lists the corresponding transistor sizes, for an interconnect length of 5 mm, a Vdd of 2.5 V, and an operating frequency of 20 MHz.

Table 8: Optimization in conventional half-Vdd circuit

Swing(volt) 0.5 0.75 1
tdelay(ns) 2.14 2.03 2.03
Power_precharge (mW) 0.186 0.168 0.155
Power_dynamic (mW) 0.166 0.17 0.174
Power_static (mW) 0.085 0.094 0.104
Power_standby (mW) 0.186 0.168 0.155
HSPICE plot1 plot2 plot3

Because this circuit depends on the operating frequency, implementing it in the reconfigurable interconnect scheme is complicated by the fact that our project is geared towards locally synchronous, globally asynchronous, data-driven interconnect; that is, each processor may operate at a different frequency and voltage. The challenge is therefore to combine data-driven asynchronous communication at the protocol level with a simple self-timed signaling scheme that does not depend on the operating clock.

We propose a modified circuit as shown in Figure 21a, in which the precharging signal is self-timed and sent by the handshaking ACK signals from the previous communication. Instead of using an evaluation clock, the evaluation signal is triggered by the input signals, and enables the evaluation phase. The timing diagram for this precharged to half-Vdd circuit is shown in Figure 21b.



Figure 21a: Modified precharged driver/receiver scheme



Figure 21b: Timing diagram for the precharged circuit

Several optimization issues are addressed as follows:

  1. The periods of the handshaking request and acknowledge signals are determined by the wire delay. In the extreme case of a 10 mm bus, the wire delay is approximately 2.5 ns, so the precharge period is about 5 ns, which is sufficient for precharging.

  2. While this self-timed precharged interconnect scheme is feasible, its potential has not yet been fulfilled because of its non-zero standby power dissipation. The reconfigurable bus is often inactive, so a "zero-power" standby operation is critical for a low-power application. The standby power arises because the bus lines stay at half Vdd during the standby period (when communication is inactive). Hence, we choose minimum sizes for M3-M4 and INV1 to reduce the standby power dissipation.
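The timing argument in item 1 can be checked with a few lines of arithmetic. The sketch below assumes that the precharge window equals one request traversal plus one acknowledge traversal of the wire, which is our reading of the 2.5 ns and 5 ns figures above. The function names and the required precharge time used in the example are hypothetical.

```python
def precharge_window_ns(wire_delay_ns: float) -> float:
    """Window available for precharging, assumed to be one request
    traversal plus one acknowledge traversal of the wire."""
    return 2 * wire_delay_ns


def precharge_fits(wire_delay_ns: float, required_precharge_ns: float) -> bool:
    """True if the handshake round trip leaves enough time to precharge."""
    return precharge_window_ns(wire_delay_ns) >= required_precharge_ns


# 10 mm bus from the text: about 2.5 ns of wire delay.
print(precharge_window_ns(2.5))                        # window in ns
print(precharge_fits(2.5, required_precharge_ns=4.0))  # 4.0 ns is hypothetical
```

Under this assumption, shorter bus segments give a shorter window, so the precharge time should be checked at the shortest configured length as well as the longest.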

Table 9 lists the optimized power and delay for three different swings. It indicates that about 50% of the total energy is consumed by the precharge and standby power when communication is active 100% of the time. If each bus line is only 10% active, 95% of the total power is wasted in the standby phase. This is clearly unacceptable for a low-power application.

Table 9: Optimization of swing in modified half-Vdd circuit

Swing(volt) 0.5 0.75 1
tdelay(ns) 2.06 2.13 2.04
Power_precharge (mW) 0.26 0.25 0.24
Power_dynamic (mW) 0.86 0.61 0.63
Power_static (mW) 0.023 0.041 0.023
Power_standby (mW) 0.26 0.25 0.24
HSPICE plot1 plot2 plot3

We therefore propose another self-timed precharged interconnect scheme to reduce the standby power dissipation, presented in Figure 22a. Here the precharge signal is triggered by the input signal, and the evaluation signal is in turn triggered by the precharge signal once precharging is complete. Figure 22b shows the timing diagram.



Figure 22a: Zero standby-power precharged driver/receiver scheme



Figure 22b: Timing diagram for this improved precharged circuit

This improved circuit shows superior standby power dissipation, as summarized in Table 10. Since the bus line in this case sits at either 0 or Vdd during the standby period, it consumes essentially no standby power, which reduces the total power consumption. The cost is an increased delay, because the input signal has to wait for the precharge phase to complete before the bus line can start to send the signal. Fortunately, this additional delay overhead is usually less than 5 ns.
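The sequencing of this scheme can be summarized as a small state machine. The sketch below is a behavioral illustration only, not a model of the actual circuit: the phase names, the event flags, and the function names are our own labels, and the 5 ns and 3 ns figures in the example are the precharge and transmission times quoted for this scheme in the notes to Table 11.

```python
from enum import Enum


class Phase(Enum):
    STANDBY = "standby"      # line parked at 0 or Vdd, no static current
    PRECHARGE = "precharge"  # line driven to Vdd/2
    EVALUATE = "evaluate"    # low-swing data placed on the line


def next_phase(phase: Phase, input_event: bool = False,
               precharge_done: bool = False,
               transfer_done: bool = False) -> Phase:
    """Advance one step: the input triggers precharge, completion of
    precharge triggers evaluation, and the line returns to standby
    once the transfer is finished."""
    if phase is Phase.STANDBY and input_event:
        return Phase.PRECHARGE
    if phase is Phase.PRECHARGE and precharge_done:
        return Phase.EVALUATE
    if phase is Phase.EVALUATE and transfer_done:
        return Phase.STANDBY
    return phase


def total_delay_ns(t_precharge_ns: float, t_transmit_ns: float) -> float:
    """Precharge and transmission happen in sequence, so delays add."""
    return t_precharge_ns + t_transmit_ns


p = Phase.STANDBY
p = next_phase(p, input_event=True)
p = next_phase(p, precharge_done=True)
p = next_phase(p, transfer_done=True)
print(p, total_delay_ns(5.0, 3.0))
```

Because precharging sits on the critical path of every transfer, the delay penalty is paid on each communication, while the standby saving grows with the fraction of time the bus is idle.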

Table 10: Optimization in zero standby-power half-Vdd circuit

Swing(volt) 0.5 0.75 1
tdelay(ns) 7.37 8.36 7.94
Power_precharge (mW) 0.15 0.11 0.14
Power_dynamic (mW) 0.78 0.69 0.41
Power_static (mW) 0.15 0.07 0.06
Power_standby (mW) 0.15 0.07 0.06
HSPICE plot1 plot2 plot3

It is also worth noting that the bus length varies because of the reconfigurable nature of the interconnect. Assuming the bus length ranges from 1 mm to 10 mm on a 1 cm x 1 cm chip, the simulation results show that the total power dissipation increases by only 5% while the delay increases by 37% as the bus grows from 1 mm to 10 mm.

3.4 Differential Scheme

An extensive search for a low-swing, low-energy differential scheme was carried out. Figures 23a and 23b show the schematic of the chosen scheme, which possesses the desired attributes. Differential data transfer inherently has higher noise margins and has been employed as an effective means of reducing bus power consumption on heavily loaded data lines. To save power, the proposed architecture shares the complementary signal bits of neighboring wires in order to halve the number of wires needed. This results in a power savings, but requires a multiplexing scheme.

Figure 23a: Driver Schematic

Figure 23b: Receiver Schematic

This architecture is intended for very high speed applications and can drive a bus using a differential swing of 100 mV with a 1.2 V supply [6]. To obtain the low swing and operate at a fast rate, the bus is precharged to half of VDD during half of the cycle. The driver is composed of switches and an equalizer, while the receiver is composed of a current sense amplifier and a gate-receiver. The gate-receiver amplifies the voltage difference between the differential lines. The driver and receiver are synchronized through a main clock (MCLK) and a receiver clock (RCLK). The transferred data is sensed by the receiver when it is clocked by RCLK, which runs at the same frequency as MCLK but is delayed by a quarter of MCLK's cycle.

The driver precharges the complementary lines of each bit to half VDD during the former half of the cycle, and in the latter half the bus switching is controlled by the input data. The bus lines are equalized by charge sharing, which is achieved by shorting the two floating lines together. The driver uses two additional supplies to generate the desired low swing, similar to the concept of the Hitachi circuit. It also needs to generate the complement of the signal.

To obtain low-voltage operation, a larger VGS is generated at the input gates of the receiver and the output gates of the driver by employing low-threshold CMOS transistors. The larger VGS helps compensate for the reduction in supply voltage. During the latter half of RCLK, the internal nodes of the receiver are precharged to half of VDD. In the former half, any slight voltage deviation from VDD/2 on the primary inputs is amplified. The resulting charge causes the output to be pulled quickly to high or to ground, and for the remainder of the cycle the receiver sustains the signal at its current level.

Simulations were performed, and a handshaking protocol as suggested by the precharged scheme discussed in Section 3.3 would need to be used. The most complex portion of this scheme is implementing the multiplexing between neighboring wires, which is desirable because of its power savings. Preliminary simulations demonstrated the high degree of accuracy required in implementing this scheme. Many high-speed switches are required, and a third clock, called the data clock (DCLK), would be needed for multiplexing data between neighboring wires. As can be inferred, a large amount of overhead is required in order to take advantage of this scheme's benefits:

  1. The scheme needs to be converted to self-timed operation using a handshaking protocol.
  2. Regulators or additional supplies are needed to generate the reduced voltages that set the signal swing.
  3. Low-threshold CMOS transistors are needed to compensate for the reduced current drive.
  4. Data lines are precharged, which increases static power consumption.
  5. The complementary signal needs to be generated, requiring additional circuitry.
  6. Multiplexing circuitry is needed, which must be combined with the handshaking protocol.

Attempts to modify the circuit to avoid some of this overhead were futile, as the entire scheme would need to be modified. In the course of the examination, therefore, the overhead of the circuit outweighed the promising low-power results and high noise tolerance. If the focus of this project had included high speed, pursuing the scheme might have been more appropriate, though the overhead would still have to be well justified.

3.5 Summary

In an attempt to make a fair comparison between these circuits, we decided to optimize each driver/receiver pair for energy when driving a 5 mm wire at both 10 ns and 5 ns. Only one processor is considered, that is, no hanging pass-transistor drains were included in these simulations. Table 11 shows the results, including the transmission gate dimensions and width of the NMOS driver transistor (assuming a 3:1 ratio for the PMOS to NMOS size). The energy listed is the total supply energy for a low-to-high and high-to-low transition.

Table 11: Comparison of Drivers/Receivers Optimized for a 5 mm Wire

Architecture           Delay      Energy       Swing   WpassN  WpassP   WdriverN     Rdrv       VPmax   Ipeak
CMOS Inverters         10 ns      1.46 pJ      1.5 V   2.4 um  6.9 um   2.1 um       8600 Ohms  0.68 V  125 uA
CMOS Inverters         5 ns       4.7 pJ       1.5 V   21 um   50.1 um  32.4 um (1)  560 Ohms   0.68 V  1.34 mA
Hitachi                10 ns      2.08 pJ (2)  1.0 V   8 um    8 um     14 um        500 kOhms  0.41 V  295 uA
Hitachi                5 ns       2.11 pJ      1.0 V   20 um   20 um    18 um        500 kOhms  0.41 V  401 uA
Prech. to 1/2 Vdd (3)  8 ns (4)   3.44 pJ      0.9 V   12 um   36 um    1.2 um       15 kOhms   0.63 V  503 uA
Prech. to 1/2 Vdd (3)  2 ns (5)   6.8 pJ       0.75 V  12 um   36 um    1.2 um       15 kOhms   0.63 V  641 uA

Notes:

  1. The driver is the last inverter in a chain of 4 progressively sized inverters. The delay includes the transitions of the entire chain.
  2. Assumes 100% efficient switching regulators for low-swing supply rails.
  3. Uses a 2.5 V supply instead of 1.5 V. The circuit could not be made to function at lower voltages.
  4. Uses the Zero-Standby Power signaling scheme. Delay includes 5 ns for precharging and 3 ns for transmission.
  5. Assumes the synchronous timing scheme is used with Teval = 2 ns.

Also included in the table are the noise parameters discussed in Section 2. Rdrv is the output resistance of the driver in the middle of the signal swing, where the circuit is most vulnerable to noise. VPmax is the maximum allowable steady-state noise voltage at the input to the sense amp. Both values were measured from DC simulations of the drivers and receivers. Lastly, Ipeak was measured as the peak supply current during the transient simulations.

In retrospect, the circuits we chose to test were intended to increase the speed of interconnect and not to reduce power. The CMOS circuit has the lowest driver resistance and a large VPmax, suggesting that it will be the most noise resistant. However, it tends to induce large current peaks in the supply rails and is energy inefficient for small delays.

The precharged circuit is slightly less noise resistant because of its high driver resistance. This circuit can achieve speeds much greater than the CMOS inverter chain can, but at a large energy cost. It is probably not suited to a low-power system, even if the timing problems could be solved.

The Hitachi circuit has an energy advantage over the CMOS inverter for a delay of 5 ns, but uses more energy than the CMOS inverter with a delay of 10 ns. The circuit was difficult to optimize for swings below 1 V, and did not function below 0.7 V. Note that VPmax is reduced, meaning that this circuit will be more sensitive to noise. Also, the driver output resistance is huge (500 kOhms), but this may simply be an artifact of our doctored technology file. While the circuit does seem to be significantly more prone to coupling noise, it is important to remember that the rise rate of neighboring bitline voltages will be much smaller for this scheme than for the CMOS inverter.
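The energy comparison between the Hitachi and CMOS circuits can be read directly from Table 11. The short sketch below computes the ratio at each delay target using the energies listed in the table; the dictionary layout and function name are only illustrative.

```python
# Energy per low-to-high plus high-to-low transition, in pJ, from Table 11.
ENERGY_PJ = {
    ("CMOS", 10): 1.46,
    ("CMOS", 5): 4.7,
    ("Hitachi", 10): 2.08,
    ("Hitachi", 5): 2.11,
}


def hitachi_to_cmos_ratio(delay_ns: int) -> float:
    """Ratio of Hitachi energy to CMOS inverter energy at a delay target."""
    return ENERGY_PJ[("Hitachi", delay_ns)] / ENERGY_PJ[("CMOS", delay_ns)]


for delay in (10, 5):
    print(f"{delay} ns: Hitachi/CMOS energy ratio = {hitachi_to_cmos_ratio(delay):.2f}")
```

A ratio above 1 at 10 ns and below 1 at 5 ns matches the observation above: the Hitachi energy is nearly flat between the two delay targets, while the CMOS inverter energy roughly triples when the delay is halved.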

4 Conclusions and Recommendations

In conclusion, let us readdress the original questions this project posed.

  • Can low swing interconnect reduce power?

    Yes, but only for higher switching speeds or with complex signaling schemes. The overhead of the Hitachi architecture causes it to consume more power than the standard CMOS inverter when the total delay is around 10 ns. The power used by precharged schemes depends entirely on the signaling architecture which supports it.

  • Can methods be found to handle the problem of noise?

    Coupling noise for the speeds of interest can be modeled and predicted using the methods presented. Layouts can be made noise resistant to an arbitrary degree. The real problem is likely to be supply noise which has not yet been adequately examined.

  • How much area would such an interconnect scheme need?

    The area depends on the dimensions of the PIM and the number of PIMs required. Additional area is needed for the global wires. For the 4-channel, 6-processor case, the total area is likely to be around 1 mm².

  • Are the interconnect schemes examined in this report feasible, in the long run?

    Probably. The base case system with CMOS inverters operating in full-swing mode seems to be the best of the circuits analyzed for the Pleiades team. This circuit is robust, energy efficient, and relatively noise free for delays around 10ns. But it is still possible that a low-swing architecture exists that will outperform the static CMOS inverter. What will be the characteristics of such an architecture?

    1. Only one extra supply rail - The Hitachi scheme examined would not work unless two extra supply rails were used to put the swing around the midpoint voltage of 0.75 V. If a circuit could be developed which allows the voltage to swing between ground and an intermediate voltage, then the PMOS pass transistors could be eliminated as well as the complexity of a second voltage regulator.
    2. Static receiver with large noise margins - The receiver should be static to support handshaking signals. Making a static level restorer circuit with noise margins equal to half the voltage swing is non-trivial.
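The reasoning behind the first characteristic can be sketched numerically. An NMOS pass transistor with its gate at Vdd passes voltages only up to roughly Vdd minus its threshold, so a swing centered on the midpoint may exceed what an NMOS alone can pass, while a ground-referenced swing of the same size may not. The sketch below is a first-order check that ignores body effect; the 0.5 V threshold and the function names are hypothetical.

```python
def swing_rails(v_mid: float, v_swing: float) -> tuple[float, float]:
    """Low and high signal levels for a swing centered on v_mid."""
    return v_mid - v_swing / 2, v_mid + v_swing / 2


def nmos_pass_sufficient(v_high: float, v_dd: float, v_tn: float) -> bool:
    """True if an NMOS pass transistor with its gate at v_dd can pass
    v_high without a threshold drop (body effect ignored)."""
    return v_high <= v_dd - v_tn


V_DD = 1.5  # supply used for the Hitachi comparison (volts)
V_TN = 0.5  # hypothetical NMOS threshold (volts)

# Swing centered on the 0.75 V midpoint, as in the scheme examined.
low, high = swing_rails(v_mid=0.75, v_swing=1.0)
print(low, high, nmos_pass_sufficient(high, V_DD, V_TN))

# Same swing referenced to ground instead.
low, high = swing_rails(v_mid=0.5, v_swing=1.0)
print(low, high, nmos_pass_sufficient(high, V_DD, V_TN))
```

With these assumed values the midpoint-centered swing needs a PMOS in the pass gate to reach its upper level, while the ground-referenced swing does not, and it also needs only one intermediate supply.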

5 References

  1. Nakagome, Y. et al, Sub-1-V Swing Internal Bus Architecture for Future Low-Power ULSI's, IEEE Journal of Solid-State Circuits, April 1993
  2. Shepard, K.L. et al, Noise in Deep Submicron Digital Design, 1996 IEEE/ACM International Conference on Computer-Aided Design, November 1996
  3. Okada, T. et al, Characterization of Net Configurations for Multichip Modules, IEEE Multi-Chip Module Conference, July 1994
  4. Vassiliou, I. and Prihadi, K., A Comparison of CMOS Driver/Receiver Circuits for Reduced Swing Interconnect, U.C. Berkeley, May 1993
  5. Rabaey, J.M., Digital Integrated Circuits: A Design Perspective, Prentice Hall, 1996
  6. Yamauchi, H. et al, A Signal-Swing Suppressing Strategy for Power and Layout Area Savings Using Time-Multiplexed Differential Data-Transfer Scheme, IEEE Journal of Solid-State Circuits, September 1996
  7. Lee, M., A Fringing and Coupling Interconnect Line Capacitance Model for VLSI On-Chip Wiring Delay and Crosstalk, IEEE International Symposium on Circuits and Systems, May 1996
  8. Sicard, E., Analysis of Crosstalk Interference in CMOS Integrated Circuits, IEEE Transactions on Electromagnetic Compatibility, May 1992
  9. Kuroda, T. et al, A High Speed Low-Power 0.3um CMOS Gate Array with Variable Threshold Voltage (VT) Scheme, Proceedings of the IEEE Custom Integrated Circuits Conference, 1996

Peggy Laramie, Roawen Chen, Rhett Davis
15 May 1997