Project Goals

1. Ultra-Low Power DSP Processing based on Heterogeneous Co-Processors

Spawning computationally-intensive tasks from a programmable core to heterogeneous, low power co-processors makes it possible to have the flexibility of a microprocessor, while achieving the power dissipation of specialized architectures. This hybrid processing approach will lead to orders of magnitude reductions in power consumption, when compared to implementations on conventional programmable processors, allowing implementations of complex speech, communication and video applications in the mW range, yet providing computational performance in the range of multiple GOPS. Software processes (running on the embedded control processor) and hardware processes (on the co-processors) can be freely interchanged, hence making it possible to employ traditional compilation environments. Claims 2 to 6 elaborate on the architecture and circuit innovations that enable the development of such a system.

2. Self-timed Reconfigurable Interconnect

A low-swing reconfigurable interconnect scheme using self-timed signaling allows for a transparent and effortless connection and synchronization of core processor and co-processors, each of which may be operating at different voltages and clock frequencies.

3. An Embedded Core Processor System with Power-Scalable Performance

Performance-adaptive voltage scaling, using circuits that allow variable-voltage operation and ultra-low-swing interconnect (0.2 V) integrated with programmable peripheral circuitry, yields an embedded microprocessor system with optimal power efficiency over a wide range of performance levels. Estimated numbers for the complete system are 700 mW at 80 MIPS and 10 mW at 10 MIPS (in a conservative 0.6 µm CMOS technology).

4. Low Power Programmable Logic Modules (PLM)

The use of DRAM and SRAM low-power memory circuit techniques drastically reduces dissipation in PLMs, especially in the interconnect, without a penalty in performance. This includes the use of reduced swing on reconfigurable interconnect, sense amplifiers, self-timed signaling and low-voltage PGA cells. PLMs are intended to be used as co-processors in the hybrid DSP architecture, but can also be employed in stand-alone mode (as an FPGA).

5. High Efficiency Voltage Conversion

Integrated CMOS solutions to DC-DC conversion will be developed that yield high efficiencies (> 90%) over rapidly varying load conditions (1000-to-1 step variation), while developing variable output voltages (between 1 and 5V).

6. A System-Level Power Exploration and Implementation Methodology

The combination of dynamic and static profiling of benchmark applications with parameterizable macromodels of switching capacitance yields meaningful power estimates that can be used to guide function partitioning between hardware and software implementations. System-level simulators and emulators will be extended to yield power predictions of multimedia terminals under real operating conditions.

Project Summary

Multimedia computation is generally acknowledged to be one of the most important drivers behind the growth in the semiconductor industry in the next decade [Sasaki96]. The increase in integration levels and device performance will make possible the integration of complete multimedia computing resources on a single die. This will lead to widespread use of portable multimedia access devices in a wide range of application domains, such as military, manufacturing, computing, services, and broadcast media. However, along with the increase in density and integration has come the problem of increased power dissipation. In the span of five years, it has risen from an afterthought in the design process to one of the highest-priority design issues. Without significant improvements in this area, power dissipation will increasingly limit the capability and use of future multimedia devices, which are cost sensitive or have limited energy sources, as in portable applications.

In the future, multimedia terminals and network computers will need to be able to support multi-modal communications and display flexibility and adaptivity in the peripheral devices to accommodate for the changing conditions in communication bandwidth and data sources. For instance, different video decompression schemes can be used, graphics resolution and encoding scheme can change depending upon the available bandwidth and the application, a variety of security and encryption techniques can be used based on the type of data transmitted, and the radio-modem protocols depend upon the location (cellular GSM, PCS, micro- or pico-cellular). It is therefore believed necessary to base the design of such a terminal around programmable and reconfigurable components.

The goal of this proposal is to demonstrate that ultra-low power implementations of programmable components are attainable using a well-established and reusable low-power design methodology. The resulting devices will have an energy-efficiency that is orders of magnitude better than presently available solutions (though still not as high as the fully dedicated, application-specific approach). The basic idea behind the proposed architecture is pictured in the Figure below. While general computing (including task-scheduling, interrupt handling, and overall control functionality) is best implemented on a general-purpose processor, this is presently done at a high energy cost per operation. Multimedia computation, on the other hand, has intrinsic properties that make it more amenable to high performance, low-energy implementation: it contains a large amount of inherent concurrency and is centered around a few, regular kernels of computation that are executed over and over again and can be optimized for energy efficiency. While an application-specific implementation would be the optimal approach to exploit these properties for power-reduction, significant power-savings approaching the dedicated approach can be obtained by implementing the multimedia kernels on heterogeneous, dedicated co-processors that are optimized for the task. This leads to the concept of the domain-specific processors, as advocated in this proposal.

The goal of this project is to design a number of representative processors that illustrate the energy-efficiency of the proposed approach. An accompanying design methodology and the necessary circuit components and modules will be developed, making the realization of new reconfigurable processor modules rather painless. Finally, we are introducing a number of architecture and circuit approaches that will make the energy-cost of reconfiguration rather minimal. This translates into a number of tasks to be performed that are summarized in the subsequent paragraphs.

Conception, Design and Test of Domain-Specific Reconfigurable Processors

To guide and steer the development of the proposed domain-specific methodology, we have selected a number of target applications that cover most of the scope of multimedia processing. This will ensure the generality and the wide applicability of the proposed architectures and techniques. More specifically, we have selected voice coding, video decompression, RF baseband processing, and encryption as benchmarks. These applications span the complete range of computational requirements and computational kernels that typically occur in multimedia. The tasks at hand include:

the definition and development of the general architecture, including the composing modules, the communication mechanisms, and the reconfigurable interconnect.

the mapping of the proposed applications on the architecture.

the development of a design methodology to support the mapping process including code analysis, kernel identification, hardware allocation and compilation.

prototyping and test.

Low-Power Design Techniques

Underlying the proposed architecture are a number of innovative low-power design techniques that address issues such as energy efficiency versus performance, synchronization between co-processors, and reconfiguration in a fundamentally different fashion.

First of all, we propose to use dynamic voltage scaling to achieve an optimal ratio of energy versus performance; in other words, operations are delivered at an energy level that is in accordance with the requested performance level. This requires the availability of integrated DC-DC converters that are efficient over a wide range of voltage and load conditions and that have a fast response time.

Secondly, the synchronization and communication between a number of processors that operate at varying clock rates and voltage levels requires a new approach. We propose the use of a data-driven asynchronous approach at the protocol level. This is combined with a "locally synchronous island" approach at the circuit level that allows the reconfigurable interconnect network to operate in a self-timed mode at low voltage swing. Processor modules can operate in either synchronous or self-timed mode at arbitrary voltage levels. The combination of a data-driven communication protocol and the locally synchronous islands eliminates the occurrence of synchronization failures.

To check the validity of the proposed approaches, we plan first on a thorough analysis, followed by the development of a number of test circuits. The techniques will later be demonstrated in the multimedia processors.

Design of Energy-Efficient Modules

The proposed heterogeneous reconfigurable processor modules consist mainly of four different components: reconfigurable interconnect (see above), core processor, dedicated co-processors, and reconfigurable co-processors (or programmable logic modules). For each of these, we plan to develop low-energy implementations. These will be developed in a parameterizable fashion so that they can be re-used over different applications. All modules will be developed in the most up-to-date technology available through MOSIS (currently 0.6 µm, 3-metal layer CMOS) with the aid of both industrial design tools (Cadence, Synopsys, Epic) and in-house power analysis and optimization tools. Energy-efficiency will have the highest priority during the design of these modules. Depending on the technology parameters of the process available, supply voltages are projected to go as low as 0.9 V.

While most of the modules developed here are intended to be used in an embedded fashion as a part of the domain-specific heterogeneous processor concept, most of them will be usable in a stand-alone mode as well. This is especially true for the performance-scalable co-processor and the programmable logic modules.

Design Infrastructure

To support low-power design from the earliest phases of the design process, we propose to develop a power analysis and guidance environment at the conceptual system level. This includes the development of static and dynamic profiling tools and the mapping of the obtained statistics into realistic performance and power estimates. This requires the availability of flexible, simple, yet accurate macromodels for the various components of the heterogeneous architecture. The guidance tools will use regularity information to propose potential candidates for implementation on co-processors and will predict the expected savings.
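As a rough illustration of how profiling statistics and switched-capacitance macromodels might be combined into an energy estimate, the sketch below multiplies profiled operation counts by an assumed effective capacitance per operation and by the square of the supply voltage. The module names, capacitance values, operation counts, and the 1.5 V supply are invented placeholders, not results from this project.

```python
from dataclasses import dataclass


@dataclass
class MacroModel:
    """Hypothetical macromodel: effective switched capacitance per operation."""
    name: str
    cap_per_op_farads: float  # assumed value, for illustration only

    def energy_per_op(self, vdd: float) -> float:
        # First-order CMOS switching energy: E = C_eff * Vdd^2
        return self.cap_per_op_farads * vdd ** 2


def estimate_energy(op_counts, models, vdd):
    """Combine profiled operation counts with macromodels (joules per module)."""
    return {name: count * models[name].energy_per_op(vdd)
            for name, count in op_counts.items()}


# Placeholder numbers only; real values would come from characterized modules
# and from profiling a benchmark application.
models = {
    "multiplier": MacroModel("multiplier", 20e-12),
    "adder": MacroModel("adder", 2e-12),
    "sram_read": MacroModel("sram_read", 10e-12),
}
profile = {"multiplier": 1_000_000, "adder": 3_000_000, "sram_read": 2_000_000}

breakdown = estimate_energy(profile, models, vdd=1.5)
total = sum(breakdown.values())
for name, joules in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {joules * 1e6:8.2f} uJ  ({100 * joules / total:4.1f}%)")
```

A breakdown of this kind is what a guidance tool could rank to suggest which operations are worth moving to a co-processor.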

Besides developing techniques that address the design of reconfigurable components at the system level, we also plan to examine and analyze the impact of our proposed low-power techniques on reliability and integrity of the produced processors. While we plan to use industrial tools to the maximum extent, some dedicated tool development will also be necessary (e.g. for the optimization and design of the dc-dc converters).

  • problem. The concept of data-enabled synchronous islands is applicable to a wide range of multi-processor architectures.

  • Low-power design methodologies and their encapsulation into well-established design flows with the accompanying design tools are of general interest to the design automation industry. In particular, we believe that tools that address design at the "conceptual level" are the next important step in design automation at the system and architectural level.

    Technical Summary

    1. Introduction

    This project addresses the problem of high-performance / low-power programmable computing as needed in future portable multimedia and network-access terminals. Research initiated in our research group at Berkeley (and later exploited at many other locations) has demonstrated that ultra-low-power solutions for the various components of such a terminal are attainable if one is willing to rely on application-specific implementations that are uniquely geared towards a single application [Chan92].

    This approach, while effective, comes at a substantial price:

    (i) it requires a redesign for each new application, communication protocol, or algorithm.

    (ii) it does not allow for in-the-field reconfiguration and adaptation.

    (iii) the time required for the design and the risks involved in the implementation are high.

    In particular, future multimedia terminals and network computers will need to be able to support multi-modal communications and display flexibility and adaptivity in the peripheral devices to accommodate for the changing conditions in communication bandwidth and data sources. For instance, different video decompression schemes can be used, graphics resolution and encoding scheme can change depending upon the available bandwidth and the application, different security and encryption techniques can be used based on the type of data transmitted, and the radio-modem protocols depend upon the availability of infrastructure at a given location (cellular GSM, PCS, micro- or pico-cellular). It is therefore believed necessary to base the design of such a terminal around programmable and reconfigurable components. The goal of this proposal is to demonstrate that ultra-low power implementations of these programmable components are attainable using a well-established and reusable low-power design methodology. The resulting devices will have an energy-efficiency that will be orders of magnitude better than presently available solutions (though still not as high as the fully dedicated, application-specific approach).

    To accomplish this, we propose a hybrid system architecture that addresses the high computational requirements of the multimedia environment, while keeping power dissipation at ultra-low levels. Future general-purpose microprocessors and DSP processors (eventually in a multiprocessor configuration) might be able to meet the computational requirements, but will do so at an excessive cost in energy. Furthermore, these general purpose architectures miss the opportunities intrinsic to multimedia applications. This was clearly illustrated by Sasaki (NEC) in his keynote talk at ISSCC96 [Sasaki96], where he argued that general purpose performance (as measured in IPS) is bound to saturate in the next decade, while multimedia performance (as measured in OPS) will continue to follow the exponential growth curve of the past (Figure 1a). This can be attributed to the ample opportunities for optimization and parallelism in the latter. The potential of distributed computation and the presence of often-repeated computational kernels also make it possible to deliver multimedia OPS at a much lower energy level, as is illustrated in Figure 1b.

    Different architectures have been proposed to address multimedia programmable computing. Important trends are the use of high-performance technology (as exemplified in the MicroUnity multimedia processor [Moussouris95]), and the native signal processing approach, where a traditional RISC architecture is augmented with dedicated instructions for multimedia functions. An example of the latter is the UltraSparc architecture of Sun Microsystems [Sun95]. Both of these approaches are simple extensions of current architectures and fail to exploit the power and performance opportunities offered by multimedia.

    Our power-efficient approach is based on a number of premises:

  • A reconfigurable architecture that adapts to the task at hand to improve the power spent-per-computation.

  • A hybrid combination of an embedded energy-efficient microprocessor with power-scalable performance, heterogeneous co-processors, and low-power programmable logic.

  • The use of novel circuit techniques to reduce power in interconnect, to enable automatic power-up on demand and "zero-power-when-not-in-use", and to support dynamic supply variation.

  • An integrated tool design flow for low-power that enables power optimization at all levels and supports architectural exploration.

    The remainder of the technical rationale demonstrates how these goals can be accomplished in a heterogeneous, domain-specific processor architecture. We first describe the overall architecture, followed by a detailed description of the power-reduction approaches used for the core processor, the satellite processors, and the programmable logic modules. The section is concluded with an overview of the design flow that we propose to use for the development of these multimedia processors.

    2. Heterogeneous Domain-Specific Processors for Multimedia

    Beyond the scaling of the supply voltage, most of the architectural techniques to reduce power consumption are based on the elimination of waste, i.e. minimizing excess switching capacitance. A number of basic techniques to do so have been identified: use of application-specific processing, preservation of data correlations, locality of reference, distributed processing, and demand- or data-driven computation (often called power management) [Rabaey95]. Most of these concepts can only be applied with limited success to general-purpose computing. The amount of concurrency in generic programs is typically small, while the generic nature of the computation makes it hard to discern meaningful application-specific units that could be widely used. There is some room for the exploitation of locality of reference and demand-driven computation, but the overall impact is unclear at present and is likely to be limited.

    The picture is substantially different when considering communication and multimedia applications, where the computational complexity (and thus power dissipation) is high (between 1 GOP and 1 TOP) but is mostly located in a few kernels that are executed repeatedly (such as the CDMA correlators in a radio modem or the DCT and the motion compensation in video compression). Improving the power-efficiency of these kernels by using application-specific modules therefore has a substantial impact. Furthermore, the kernels typically display a high amount of parallelism, operate on data-streams with strong temporal correlations, and tend to be more adaptable to data- or demand-driven paradigms than a generic von Neumann-oriented program. These features are optimally exploited in a so-called domain-specific processor, i.e. a processor that is targeted to a specific application domain such as video compression or spread spectrum, yet is programmable enough to support a wide range of alternative algorithms and programs within that domain.

    The proposed hybrid architecture is based on a fixed template (as shown in Figure 2), consisting of a "performance-scalable" microprocessor, surrounded by an array of heterogeneous, autonomous satellite processors. The former manages the overall control flow of the computation and handles issues such as context switching, process management and interrupt handling. The power-intensive kernel computations are implemented as a set of dedicated, communicating processes on the co-processors. These "weakly programmable" or (maybe only) parameterizable units are optimized for the execution of a single computational kernel. Examples are a DCT processor, an add-compare-select processor for Viterbi decoding, a Galois-module for encryption, an FFT butterfly, or a memory processor. While most of these satellites are very dedicated, others might need a higher degree of dynamic reconfiguration capabilities to support the implementation of a range of kernels. Some satellites are therefore implemented as PLMs (programmable logic modules or embedded PGAs), which support reconfiguration up to the arithmetic operator or even the gate level. The low-power implementation of these PLMs is discussed further in this proposal.

    The type and the number of satellite processors is determined by the targeted application domain and a family of these will be developed with the appropriate software tools to automate the process. Central to the total concept is that each of the satellites adheres to the following architectural concepts: dedicated application-specific datapath, data storage which assumes a high degree of locality, minimal control and a standardized data-driven interface.

    Communication between processors proceeds over a programmable, reconfigurable network [Yeung95]. Processors communicate in a data-driven fashion, which makes synchronization and initialization easy to implement, and which allows for automatic power-up. A single communication channel is allocated to a given data stream, ensuring preservation of data-correlations and thus reducing power consumption.

    To understand the overall flow of the computation, let us consider a simple scenario. Upon the occurrence of an external or internal control event, the core processor wakes up. It determines the status of the overall system and moves to the next operation phase. It programs (parameterizes) the various satellite processors (over a control bus) and configures the interconnect network. In a sense, it can be stated that a "hardware process" is "spawned". Once this initialization is completed, the system proceeds into data-flow mode. The control processor goes to sleep (or at least the operating system part of it) and the satellite processors move into execution mode, communicating with each other in data-flow fashion. This continues until the current task is completed or interrupted by an external event and the control is passed back to the RISC. One can observe that the system thus uses two computational paradigms: demand-driven at the operating system level and data-flow driven at the computational level. This clear definition of operation paradigms makes the programming of the engine considerably simpler.
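The scenario above can be sketched in software as two phases: a configuration phase run by the core processor, followed by a data-flow phase in which satellites fire only when an input token is present. This is a minimal behavioral model under stated assumptions; the `Satellite` class, the function names, and the two toy kernels are hypothetical and do not describe the actual hardware interfaces.

```python
from collections import deque


class Satellite:
    """Hypothetical satellite processor that fires only when data is present."""

    def __init__(self, name):
        self.name = name
        self.kernel = None       # function selected by the core processor
        self.inbox = deque()     # data-driven input channel
        self.dest = None         # downstream satellite, set by the interconnect
        self.fired = 0

    def configure(self, kernel):
        self.kernel = kernel

    def step(self, sink):
        if not self.inbox:       # no token: the unit stays idle
            return False
        result = self.kernel(self.inbox.popleft())
        self.fired += 1
        target = self.dest.inbox if self.dest is not None else sink
        target.append(result)
        return True


def spawn_hardware_process(satellites, kernels, routes):
    """Core phase: parameterize the satellites and configure the interconnect."""
    for name, kernel in kernels.items():
        satellites[name].configure(kernel)
    for src, dst in routes:
        satellites[src].dest = satellites[dst]


def run_dataflow(satellites, source_name, samples):
    """Data-flow phase: the core sleeps while satellites fire on arriving tokens."""
    sink = []
    satellites[source_name].inbox.extend(samples)
    # Step every satellite once per round until none of them has work left.
    while any([s.step(sink) for s in satellites.values()]):
        pass
    return sink


sats = {name: Satellite(name) for name in ("scale", "square")}
spawn_hardware_process(
    sats,
    kernels={"scale": lambda x: 2 * x, "square": lambda x: x * x},
    routes=[("scale", "square")],
)
out = run_dataflow(sats, "scale", [1, 2, 3])
print(out)
```

Once the last token has drained, no satellite fires again, which mirrors the point at which control would return to the core processor.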

    Estimates indicate that the use of a heterogeneous, reconfigurable processor architecture reduces dissipation for selected application domains by more than an order of magnitude compared to programmable (DSP) processors, while maintaining a high level of programmability. For example, we project that a programmable voice coding processor (as used in cellular applications) implemented in this approach in a conservative technology would dissipate only 5 mW. This should be compared to the best (application-specific) implementation, which currently consumes more than 70 mW [Ueda93].

    3. Energy-efficient Core Processor with Power-Scalable Performance

    At the core of the proposed architecture resides an energy-efficient processor. While "power-conscious" design techniques have succeeded in keeping the power consumption of general purpose and embedded processors under control (e.g. PowerPC 603, ARM 6/7), the reduction has not been significant. Not until the recently announced StrongARM [Montanaro96], which delivers Pentium-like performance at 0.5 W, has a processor design made significant improvements in energy efficiency. That design, however, relies on a very advanced fabrication technology, and the 0.5 W figure is for the processor core only.

    A commonly used scheme to reduce computation power in power-critical conditions is to reduce the clock frequency. However, the energy for a given function in CMOS implementations is independent of clock frequency, so we still expend the same amount of work (energy) per operation; we just take longer to do it (as is illustrated in Figure 3).

    Since the computation per battery life remains constant, there is no benefit in reducing the clock (assuming that the processor has an effective power-down mode when not in use).

    This observation is especially important for the control processor in a multimedia terminal where computation requests often come in bursts and the latency between a request for computation and its completion is important. If the supply voltage is now reduced in conjunction with the clock frequency, we get a quadratic reduction in energy/operation. By scaling the supply voltage, we can scale the throughput to meet the performance demands of the user (Figure 3). The fundamental trade-off is that higher throughput costs higher energy/operation. As illustrated in Figure 4, a reduction of the supply voltage from 3.3 V to 1.2 V reduces the energy/operation by a factor of approximately 10, while reducing the throughput (and clock frequency) by a factor of 8.5.
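The distinction between clock scaling and voltage scaling can be checked with the first-order switching-energy model E = C·V². The sketch below is only that idealized model: the 1 nF effective capacitance and the 100 MHz and 10 MHz clock rates are arbitrary assumptions, and the model is not expected to reproduce the measured curve of Figure 4 exactly. It gives a reduction of about 7.6 for the 3.3 V to 1.2 V step, the same order as the factor of approximately 10 quoted above.

```python
def energy_per_op(c_eff, vdd):
    """First-order CMOS switching energy per operation: E = C_eff * Vdd^2."""
    return c_eff * vdd ** 2


def power(c_eff, vdd, f_clk):
    """Average switching power, assuming one operation per clock cycle."""
    return energy_per_op(c_eff, vdd) * f_clk


C_EFF = 1e-9  # assumed effective switched capacitance per operation (farads)

# Clock scaling alone: power drops tenfold, energy per operation is unchanged
# because energy_per_op does not depend on the clock frequency at all.
p_ratio = power(C_EFF, 3.3, 100e6) / power(C_EFF, 3.3, 10e6)

# Voltage scaling: energy per operation falls with the square of Vdd.
e_ratio = energy_per_op(C_EFF, 3.3) / energy_per_op(C_EFF, 1.2)

print(f"power ratio, 100 MHz vs 10 MHz at 3.3 V: {p_ratio:.1f}x")
print(f"energy/op ratio, 3.3 V vs 1.2 V:         {e_ratio:.2f}x")
```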

    To make this voltage scaling useful, it needs to be performed dynamically, which is possible by building on existing research in DC-DC converters. A voltage (and clock frequency) change occurring on the order of 100 microseconds is feasible. A register containing a desired frequency can be written to under software control. This frequency value then sets the supply voltage that meets the desired frequency via an adaptive feedback loop. This approach has the advantage of delivering performance-on-demand, yet can trade performance-for-power.
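The frequency-register scheme can be sketched as a simple integrating feedback loop: software writes a target frequency, and the loop adjusts the supply until the achievable clock matches it. Everything numeric here is an assumption made for illustration, including the delay model f ∝ (Vdd − Vt)²/Vdd, the 0.7 V threshold, the 0.9 V to 3.3 V supply range, the 100 MHz calibration point, and the loop gain. The actual converter and loop design are not specified in the text.

```python
V_T = 0.7                  # assumed threshold voltage (V)
V_MIN, V_MAX = 0.9, 3.3    # assumed supply range (V)
# Scale factor chosen so that the model gives 100 MHz at 3.3 V (assumption).
K = 100.0 / ((V_MAX - V_T) ** 2 / V_MAX)


def max_frequency_mhz(vdd):
    """Assumed delay model: achievable clock proportional to (Vdd - Vt)^2 / Vdd."""
    return K * (vdd - V_T) ** 2 / vdd


def regulate(f_target_mhz, vdd=V_MAX, gain=0.01, steps=500):
    """Integrating feedback: nudge Vdd until the achievable clock meets the target."""
    for _ in range(steps):
        error = f_target_mhz - max_frequency_mhz(vdd)
        vdd = min(V_MAX, max(V_MIN, vdd + gain * error))
    return vdd


# The "frequency register" is modeled as the argument passed to regulate().
for target in (100.0, 50.0, 12.0):
    v = regulate(target)
    print(f"target {target:5.1f} MHz -> Vdd = {v:.2f} V, "
          f"f = {max_frequency_mhz(v):.1f} MHz")
```

The loop settles at the lowest voltage that still meets the requested frequency, which is the behavior the text describes as performance-on-demand.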

    Optimization of the processor power alone is not sufficient. Because power is distributed throughout the whole processor system, power optimization should be performed on all components (processor, main memory, processor bus, glue logic). For instance, the microprocessor sub-system implemented in the InfoPad multimedia terminal presently consists of an ARM60 processor, 0.5 MByte of SRAM for main memory, a boot-EPROM, and a PAL for external glue logic. Total measured power dissipation is 1.2 W, while throughput is about 10 MIPS. By redesigning the processor, the SRAM, and the bus I/O, we project a system with equivalent throughput, but with a power dissipation down to 10 mW. For those times when greater throughput is desired, the system will dynamically adjust its throughput to a peak value of about 80 MIPS, while dissipating 700 mW (in a 0.6 µm CMOS technology). The CPU will be compatible with the ARM instruction set, so that existing software code can be directly ported over to take advantage of the energy-efficient processor system. Energy savings will be made by aggressively pursuing low-energy design techniques, such as the use of low-voltage-swing bus transceivers, use of parallel-access embedded memory integrated on the same die, removal of excess functionality that is not needed for multimedia computation, and integration of support functionality such as clock generator and interrupt handler into the processor core.
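The power and throughput figures quoted above can be converted into energy per instruction, which makes the comparison independent of clock rate. The sketch below uses only the numbers from the text; the helper name is hypothetical. Since 1 mW divided by 1 MIPS equals 1 nJ per instruction, the conversion is a single division.

```python
def nj_per_instruction(power_mw, mips):
    """Energy per instruction in nJ: (mW * 1e-3 W) / (MIPS * 1e6 instr/s) * 1e9."""
    return power_mw / mips


# (power in mW, throughput in MIPS), all taken from the paragraph above.
systems = {
    "InfoPad ARM60 subsystem (measured)": (1200.0, 10.0),
    "proposed system, low-power point": (10.0, 10.0),
    "proposed system, peak throughput": (700.0, 80.0),
}

energy = {name: nj_per_instruction(p, m) for name, (p, m) in systems.items()}
for name, nj in energy.items():
    print(f"{name:38s} {nj:7.2f} nJ/instruction")
```

The peak-throughput point costs more energy per instruction than the low-power point, consistent with the stated trade-off that higher throughput costs higher energy/operation.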

    4. Satellite Processors

    The computational core of the architecture consists of a set of autonomous, application-specific satellite processors, that are optimized for the execution of a single computational kernel. The dedicated nature of the processors makes it possible to execute multimedia operations with minimal overhead, hence achieving low energy per operation. The controller overhead is minimal as the instruction set of a given satellite processor is typically small, and very often is nothing more than a single control register. High performance can be achieved at low voltage level through the use of concurrency, both within a processor or by dividing a task over multiple processors. As mentioned before, multimedia kernels typically have a large degree of inherent parallelism. Finally, most of the data transfers and memory references within a processor access only local resources and are hence energy-efficient. It should be mentioned that a co-processor that is not operational must have zero standby power. How this can be accomplished is described in the section on the reconfigurable network. An example of a potential co-processor for filtering operations is shown in Figure 5.

    To quantify the potential impact of spawning multimedia kernels onto dedicated satellite processors, consider the example of the VSELP speech coder. The results of a dynamic profiling of the voice coder code when excited with actual speech data are given in Figure 6.

    It can be derived that the top three functions account for 77% of the total computational complexity of the algorithm (and hence the energy) and that vector-dot products account for 70% of the total operation count. Being able to implement this operation at minimum voltage with minimum overhead will have an enormous impact on the overall dissipation. Figure 7 illustrates how one of the most power-hungry functions (lag computation) can be mapped on a cluster of satellite processors. Notice how the final part of the processing (which performs a minimum operation and stores the result) is performed as a software process on the core processor. A given task can thus be mapped partly on the core and partly on the satellite processors, hence avoiding a proliferation in the number of satellites needed.
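To get a feel for what the 70% figure implies, the following sketch applies an Amdahl-style estimate: a fraction of the operations moves to a co-processor that spends less energy per operation, while the rest stays where it is. It assumes that energy is proportional to operation count and that the remaining 30% is unchanged; the co-processor gains of 10, 100, and 1000 are hypothetical values, not project results.

```python
def energy_reduction(offloaded_fraction, coproc_gain):
    """Overall energy reduction when a fraction of the operations moves to a
    co-processor whose energy per operation is `coproc_gain` times lower."""
    remaining = (1.0 - offloaded_fraction) + offloaded_fraction / coproc_gain
    return 1.0 / remaining


DOT_PRODUCT_FRACTION = 0.70  # share of the operation count, from the profile

for gain in (10, 100, 1000):  # hypothetical co-processor efficiency gains
    overall = energy_reduction(DOT_PRODUCT_FRACTION, gain)
    print(f"co-processor gain {gain:5d}x -> overall reduction {overall:.2f}x")
```

Under these assumptions the overall reduction stays below 1/0.3, however efficient the co-processor is. The simplified model therefore suggests that the energy of the remaining software on the core processor has to come down as well, in line with the system-wide optimization argued for earlier.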

    5. Programmable Logic Modules

    While most computational kernels can easily be mapped onto dedicated processors, each application typically contains a number of operations that are executed frequently enough to require energy-efficient implementation, yet do not merit a special co-processor. Examples are division operations, the computation of trigonometric functions, and bit-level manipulations. An adequate way to address these functions is by means of programmable logic modules (PGAs), which offer reconfigurability up to the arithmetic operator or the gate level. A number of the satellite processors could be simple PLMs that are reconfigured by the core processor to perform a range of functions. We define a PLM as an embedded PGA; both terms could be used interchangeably. Since this proposal concentrates on dynamic reconfigurability, we concentrate here on SRAM-based PGA architectures that can be reconfigured (or reprogrammed) under software control. This is in contrast with the EEPROM and fuse-based PGAs, which offer a higher density but are either write-once or require high voltages for reprogramming. Also, while this proposal concentrates on PLMs, most of the results are also applicable to commodity stand-alone PGAs.

    Programmable and reconfigurable logic modules are an essential ingredient of most prototyping efforts. Even more, they are increasingly being used in large-volume systems and as sub-modules in integrated circuits. The idea of using programmable logic to dynamically modify the instruction set of a microprocessor, or to change its periphery, is receiving considerable attention. An example is the Clay-based Reconfigurable Signal Processor architecture, proposed by National Semiconductor [Clay96] (with whom we are cooperating). Most efforts in the PLM domain have concentrated on increasing performance and density and little effort has been devoted to improving the power efficiency. This is bound to change for a number of reasons. First of all, it can be seen that the scaling of current PGA implementations to higher gate densities and larger complexity levels will undoubtedly lead to unacceptable consumption levels. For instance, it is projected [Xilinx96] that a 100 KGate PGA running at 100 MHz will consume hundreds of watts (100 KGates translate to 10K signals switching at 100 MHz @ 0.3 mW/MHz for current technologies...). This does not include the power consumed in the projected 700 input-output pins. Obviously, a dramatic reduction in the energy dissipation is necessary for this level of integration to become a reality. The increasing use of PGAs in volume systems and even portable applications is another incentive to address the dissipation issue in PGAs. For example, a fully programmed Xilinx4000 easily consumes 1 to 2 W, which makes it unsuitable for most portable applications.
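The "hundreds of watts" projection follows directly from the parenthetical figures. The short calculation below reproduces it, reading the 0.3 mW/MHz figure as dissipation per switching signal, which is an interpretation of the text rather than something it states explicitly.

```python
signals_switching = 10_000    # 100 KGates -> about 10K switching signals (from the text)
clock_mhz = 100               # operating frequency in MHz (from the text)
mw_per_mhz_per_signal = 0.3   # assumed to be per switching signal

# mW -> W conversion; I/O pin power is excluded, as in the text.
power_watts = signals_switching * clock_mhz * mw_per_mhz_per_signal / 1000.0
print(f"projected core power: {power_watts:.0f} W")
```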

    When attempting to reduce power consumption in PGAs, it is essential to first identify the dominant sources of energy dissipation. The architecture of an SRAM-based PGA is shown in Figure 8. Studies at Xilinx have demonstrated that the power dissipation of a PGA under normal operation is distributed evenly over three components:

  • Input/Output circuitry

  • Global interconnect (mostly used for clock distribution)

  • Local interconnect and CLBs (combinational logic blocks)

    Lowering the dissipation requires equal attention to each component. In an embedded application, I/O power dissipation is not a real issue and can be reduced to negligible levels. Where I/O dissipation does matter, however, reducing the signal swing at the board level and using advanced packaging techniques (such as chip-on-board) are the only ways to deal with it. More important in the scope of this proposal are the other two components. Reducing the supply voltage is an obvious way of reducing the consumption, but it also introduces a performance penalty that might or might not be acceptable, depending upon the application.
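
    The trade-off between supply voltage, power and speed can be illustrated with a first-order model. The sketch below is not taken from the proposal: it assumes the textbook relation P = a C V^2 f for dynamic power and a simple delay model proportional to V / (V - Vt)^2. The threshold voltage (0.7 V), the reference supply (3.3 V) and the function names are illustrative assumptions only.

```python
def dynamic_power(c_farads, vdd, freq_hz, activity=1.0):
    """First-order dynamic power: activity * C * Vdd^2 * f (assumed model)."""
    return activity * c_farads * vdd ** 2 * freq_hz


def relative_delay(vdd, vt=0.7, vref=3.3):
    """Gate delay relative to operation at vref, assuming delay ~ V / (V - Vt)^2.

    vt and vref are illustrative values, not figures from the proposal.
    """
    if vdd <= vt or vref <= vt:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def gate(v):
        return v / (v - vt) ** 2

    return gate(vdd) / gate(vref)


# Halving the supply: compare power at a fixed clock rate, and the slowdown.
power_ratio = dynamic_power(1e-12, 3.3, 1e6) / dynamic_power(1e-12, 1.65, 1e6)
slowdown = relative_delay(1.65)
print(f"power reduction: {power_ratio:.1f}x, delay increase: {slowdown:.2f}x")
```

    Under these assumptions the quadratic power saving is paid for by a more than proportional increase in delay, which is why voltage scaling alone is not sufficient.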

    Fortunately, the PLM architecture offers room for a substantial improvement in power efficiency beyond what can be achieved by voltage scaling. In a broad sense, soft-programmable FPGAs resemble memory architectures. A large number of the techniques that have been exploited in low-power SRAMs and DRAMs can come to fruition here as well:

  • Low-swing interconnect combined with sense amplifiers at the receiver end. Both drivers and receivers can be part of the standard CLB. This approach can certainly be applied to the global interconnect network and, with some care, can also be used for the local interconnect. The PGA architecture is very regular, which makes it easier to compensate for or cancel crosstalk and interference effects.

  • Complete shut-down of non-programmed or non-active modules

  • Self-timed signaling to reduce glitching on the buses

  • Reduced operation voltage on the core, combined with reduced thresholds and back-gate biasing for higher performance, and on-chip voltage converters.
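
    To give a feeling for the first item in the list above, the sketch below compares the energy drawn from the supply when a wire is charged over the full supply range with the energy for a reduced swing. It assumes the first-order relation E = C * V_supply * V_swing, with the low-swing driver fed from a dedicated supply equal to the swing. The 0.2 V swing is the figure mentioned in the project goals; the wire capacitance and the 3.3 V supply are hypothetical values chosen for illustration.

```python
def transition_energy(c_farads, v_supply, v_swing):
    """Energy drawn from the supply to charge a wire by v_swing (first-order model)."""
    return c_farads * v_supply * v_swing


WIRE_CAPACITANCE = 1e-12   # 1 pF, hypothetical interconnect load
FULL_SUPPLY = 3.3          # volts, hypothetical full-swing supply
LOW_SWING = 0.2            # volts, swing quoted in the project goals

full_swing_energy = transition_energy(WIRE_CAPACITANCE, FULL_SUPPLY, FULL_SUPPLY)
low_swing_energy = transition_energy(WIRE_CAPACITANCE, LOW_SWING, LOW_SWING)
savings = full_swing_energy / low_swing_energy

print(f"full swing: {full_swing_energy:.3e} J, low swing: {low_swing_energy:.3e} J")
print(f"reduction factor: {savings:.1f}")
```

    The model ignores the energy consumed by the sense amplifiers and by the converter that generates the low supply, so the real gain is smaller than the ratio printed here.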

    This has to be combined with architectural improvements. It has been observed that every gate in current PGA architectures carries a power overhead of approximately a factor of 8 to 10 with respect to an equivalent custom-designed gate. The granularity of the basic cell determines the ratio of power consumption between the combinational logic block (CLB) and the local interconnect. More complex cells have the advantage of providing increased locality; that is, most switching events happen at low capacitance levels. A similar observation prompts the hardware mapper in logic synthesis to prefer larger fan-in gates over smaller ones when energy minimization is the primary goal [Pedram96]. Smaller-granularity cells tend to ease the technology mapping process and improve utilization, but rely more heavily on the local interconnect network. The mechanism used to program a cell has an impact on the dissipation as well. For instance, the Actel cell uses the interconnect network to determine the cell functionality, while Xilinx employs a programmable table-lookup structure (implemented as a large multiplexer) within the cell [Trimberger94]. Our goal is to investigate the energy efficiency of the various approaches and to propose a low-power CLB that minimizes energy per operation while maintaining performance.
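
    The table-lookup style of cell mentioned above can be modeled behaviorally in a few lines. The sketch below is a generic illustration, not a description of any vendor's actual cell: the configuration bits play the role of the SRAM contents, and the cell inputs act as the select lines of the multiplexer. The function name make_lut is our own.

```python
def make_lut(config_bits):
    """Build a k-input lookup-table cell from 2**k configuration bits.

    The inputs select one stored bit, exactly as the select lines of a
    multiplexer would; the first input is the most significant select bit.
    """
    size = len(config_bits)
    num_inputs = size.bit_length() - 1
    if size < 2 or 2 ** num_inputs != size:
        raise ValueError("table size must be a power of two (at least 2)")
    table = tuple(1 if bit else 0 for bit in config_bits)

    def evaluate(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError(f"expected {num_inputs} inputs")
        index = 0
        for bit in inputs:
            index = (index << 1) | (1 if bit else 0)
        return table[index]

    return evaluate


# Reprogramming the same cell structure only changes the stored bits.
xor2 = make_lut([0, 1, 1, 0])
and2 = make_lut([0, 0, 0, 1])
print(xor2(1, 0), and2(1, 0))
```

    Because the function of the cell is fixed by stored bits rather than by routing, reconfiguration under software control amounts to rewriting the table.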

    It is our further belief that substantial savings in the power dissipation of PLMs are possible by focusing on the application layer, i.e. how applications are mapped onto the hardware structure. Refocusing the technology mapping phase in the synthesis process can lead to important power savings. Unfortunately, even the most advanced technology mapping currently available from EDA vendors does not effectively utilize the many features offered by the hardware. One way to arrive at truly energy-efficient solutions is to use predefined (manually mapped) library modules. It is our intention to develop an energy-efficient library that encapsulates the most important functions needed in multimedia.

    6. Reconfigurable Low-Swing Interconnect Scheme

    The proposed heterogeneous architecture is centered around communications between a core processor and a set of co-processors. While the distributed architecture enables important energy savings, it also introduces synchronization problems. These are further complicated by the fact that each (co-)processor may have its own operating frequency and supply voltage, some of which might vary dynamically. This implies that a fixed synchronous scheme with static scheduling of the communication events is impractical and even counterproductive. We propose a scheme that combines data-driven asynchronous communication at the protocol level with simple self-timed signaling at the circuit level. The scheme has the advantage that it offers flexibility at the architectural and circuit levels, while requiring little, if any, hardware overhead. It furthermore enables dynamic data-triggered awakening of inactive co-processors and "zero-power" standby operation.

    At the protocol layer, processors communicate in a data-driven fashion, which makes synchronization and initialization easy to implement. A single communication channel is allocated to a given data stream for the duration of the computation, which preserves data correlations and thus reduces power consumption. The presence of a token on the corresponding signaling line indicates the availability of data and activates the destination processor. When implemented for a homogeneous fine-grained processor array [Yeung95], we have demonstrated that this approach reduces power consumption in the global interconnect network to at most 10% over a range of applications.

    The challenge at the circuit layer is to provide a signaling scheme that does not depend on the operating frequency of any of the connected modules. A simple two-phase single-wire signaling scheme is presented in Figure 9. As can be observed, individual processors operate on a locally generated clock and hence follow the synchronous design approach. The operation of the clock is enabled by the presence of a token event on the data receiver, i.e. when no data is available the local clock is automatically turned off. The period of the clock generator is programmable and can be adapted to the required performance or operating voltage. The combination of the chosen two-phase signaling and the local clock generation ensures that synchronization failures cannot occur.
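
    The behavior described above can be mimicked with a small behavioral model. The sketch below is not the circuit of Figure 9; it only assumes that, in two-phase signaling, every change of level on the signaling wire (rising or falling) counts as one token, and that each token enables a programmable number of local clock cycles. The class and attribute names are hypothetical.

```python
class CoProcessorModel:
    """Behavioral model of a co-processor whose local clock is gated by tokens."""

    def __init__(self, cycles_per_token, initial_level=0):
        self.cycles_per_token = cycles_per_token  # programmable local clock burst
        self.clock_cycles = 0                     # cycles actually spent clocking
        self.tokens_seen = 0
        self._last_level = initial_level

    def observe(self, level):
        """Sample the signaling wire; any transition is one token (two-phase)."""
        if level != self._last_level:
            self._last_level = level
            self.tokens_seen += 1
            # The local clock runs only to process the data that just arrived.
            self.clock_cycles += self.cycles_per_token
        # With no transition the clock stays off and no cycles are spent.


wire_samples = [0, 0, 1, 1, 1, 0, 0, 1]
model = CoProcessorModel(cycles_per_token=4)
for sample in wire_samples:
    model.observe(sample)

print(model.tokens_seen, model.clock_cycles)
```

    In this model an idle wire costs no clock cycles at all, which is the behavioral counterpart of the data-triggered awakening and standby operation described in the text.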

    It is worth observing that our research group has been at the forefront of the application of self-timed circuits to high performance signal processing systems, both dedicated [Meng90] and programmable [Jacobs88].

    7. DC/DC Conversion and Other Low-power Circuit Techniques

    In the architectural discussions of the previous sections, we mentioned a number of enabling circuit techniques that are instrumental to the success of the project. We intend to include efforts in each of the following areas: (i) on-chip dc-dc converters; (ii) very low-swing busses and reconfigurable interconnect, where the voltage swing is controlled by a dc-dc converter and efficient transceiver circuits and sense amplifiers restore the signal levels; (iii) self-timed bus interfaces and processor communication protocols; (iv) embedded memories for low-voltage operation; and (v) low-voltage FPGA cells.

    The dc-dc converter plays an essential role in the proposed scheme and merits some detailed attention. For a dynamic voltage adaptation scheme to be meaningful and worthwhile, it is essential that the voltage converters (i) are compact; (ii) are tightly integrated with the load circuitry; (iii) use a minimal number of external components; and (iv) are efficient over a wide range of loading conditions and operating voltages.

    In some preliminary studies [Stratakos95], we have demonstrated that these goals can be accomplished by customizing the traditional buck converter. Converter parameters are optimized for the voltage range and load conditions at hand. Low frequency operation going as low as 1 MHz reduces the size of the external reactive components; typically 1 inductor and 3 capacitors are required. The use of innovative control schemes such as adaptive dead-time control makes efficient conversion possible over a wide range of loads. To minimize power dissipation in the control, we propose to investigate a number of novel control techniques such as phase-locked and - based pulse-width modulators. It is our projection that the proposed techniques will yield converters with high efficiencies (> 90%) over rapidly varying load conditions (1000-to-1 step variation), while developing variable output voltages (between 1 and 5 V). A simple prototype circuit we have developed confirms that these projections are realistic. A micro-photograph and specifications of the prototype are shown in Figure 10.
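
    The efficiency target translates directly into a budget for the losses in the converter, including its control circuitry. The sketch below uses only the definition of efficiency (output power divided by input power) and the 90% and 1000-to-1 figures quoted above; the absolute load levels of 1 W and 1 mW are hypothetical and serve only to span a 1000-to-1 range.

```python
def loss_budget_w(p_out_w, efficiency):
    """Maximum converter loss (in watts) compatible with a target efficiency.

    Follows from efficiency = P_out / (P_out + P_loss).
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return p_out_w * (1.0 / efficiency - 1.0)


TARGET_EFFICIENCY = 0.90
heavy_load_w = 1.0     # hypothetical full load
light_load_w = 1e-3    # hypothetical light load, 1000 times smaller

heavy_budget = loss_budget_w(heavy_load_w, TARGET_EFFICIENCY)
light_budget = loss_budget_w(light_load_w, TARGET_EFFICIENCY)

print(f"loss budget at full load:  {heavy_budget * 1e3:.1f} mW")
print(f"loss budget at light load: {light_budget * 1e6:.1f} uW")
```

    Since the loss budget shrinks in proportion to the load, any fixed dissipation in the control loop dominates at light load, which motivates the search for low-power control techniques.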

    8. Low-Power Design Flow and Support Software

    The implementation and realization of the low-power heterogeneous processor architecture requires the presence of a well-established design methodology with the accompanying tools. A viable design flow should include the following components:

  • Power analysis and estimation tools at all levels of abstraction (ranging from the conceptual and system levels through the architecture level down to the circuit level).

  • Power optimization at the circuit and gate level.

  • Architectural optimization.

  • Automatic compilation and mapping of applications on the selected architecture template.

    As part of the InfoPad project, we have developed a suite of low-power design tools that includes power analysis, optimization, and synthesis tools for application-specific circuits [Mehra96]. An overview of the resulting low-power design methodology is shown in Figure 11.

    When attempting to apply this methodology to the design of heterogeneous, reconfigurable processors, we observe that a number of tools carry across the implementation styles, but also that some important capabilities are lacking. A first necessity is the early exploration of the impact of design decisions and architectural choices on the power budget. In the realm of programmable processors, this involves the study of switching statistics over a set of meaningful benchmarks and input data. A combination of static and dynamic code profilers can provide accurate switching profiles from behavioral code (C or C++) at the primitive operation level [Rabaey96]. This information can be combined with parameterizable macro-models of the architectural building blocks to yield first-order power estimates, which can help to determine the impact of low-power techniques such as partitioning, voltage scaling and swing reduction.
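
    A first-order estimate of the kind described above can be sketched as follows. The operation counts stand in for the output of a code profiler and the energy-per-operation table stands in for the macro-models; all numbers are made up for illustration, and the quadratic scaling with supply voltage is an assumption of the sketch rather than a result from [Rabaey96].

```python
def estimate_energy_j(op_counts, energy_per_op_j, vdd, vref):
    """First-order energy estimate for one run of a profiled program.

    op_counts:        operation name -> number of executions (from profiling)
    energy_per_op_j:  operation name -> energy per execution at supply vref
    The result is scaled by (vdd / vref)^2 to model voltage scaling.
    """
    energy_at_vref = sum(
        count * energy_per_op_j[op] for op, count in op_counts.items()
    )
    return energy_at_vref * (vdd / vref) ** 2


# Hypothetical profile and macro-model values, for illustration only.
profile = {"multiply": 1000, "add": 2000, "memory_read": 500}
macro_models = {"multiply": 40e-12, "add": 5e-12, "memory_read": 20e-12}

energy_nominal = estimate_energy_j(profile, macro_models, vdd=3.3, vref=3.3)
energy_scaled = estimate_energy_j(profile, macro_models, vdd=1.65, vref=3.3)

print(f"nominal supply: {energy_nominal * 1e9:.1f} nJ")
print(f"half supply:    {energy_scaled * 1e9:.1f} nJ")
```

    Multiplying the energy per run by the required number of runs per second gives an average power figure that can be compared across partitioning and voltage-scaling options.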

    The automatic identification of computational kernels that are candidates for implementation on dedicated co-processors is another important component of the design methodology. Such a function relies on code analysis (template extraction), statistics analysis (as obtained from the conceptual analysis) and estimation of the impact on the power dissipation.
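
    One simple way to turn such estimates into a shortlist of candidate kernels is to rank the kernels by their share of the total estimated energy. The sketch below is a deliberately naive selection rule, not the template extraction method referred to in the text; the kernel names, the energy figures and the 20% threshold are hypothetical.

```python
def candidate_kernels(kernel_energy, min_share=0.2):
    """Return kernel names whose share of total energy is at least min_share.

    The result is ordered from the largest to the smallest contributor.
    """
    total = sum(kernel_energy.values())
    if total <= 0:
        return []
    ranked = sorted(kernel_energy.items(), key=lambda item: item[1], reverse=True)
    return [name for name, energy in ranked if energy / total >= min_share]


# Hypothetical per-kernel energy estimates (arbitrary units).
estimates = {
    "fir_filter": 50.0,
    "codebook_search": 30.0,
    "bit_packing": 15.0,
    "control": 5.0,
}

selected = candidate_kernels(estimates)
print(selected)
```

    A practical tool would also weigh the cost of the co-processor itself and the communication overhead, both of which this sketch ignores.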

    Once kernels are identified, we believe that the compilation of the application onto the selected architecture can proceed easily. In fact, due to the selected and standardized communication protocols, code migration from core to co-processor is rather straightforward. Key challenges will be to ensure that the real-time requirements are met and to choose the overall control functionality needed for a given application. For instance, applications such as voice coders allow for a static ordering in the firing of the processes. This drastically simplifies the operating system that has to run on the processor core. Other applications, however, require dynamic scheduling and need a process kernel.
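
    When the data dependencies between processes are fixed, a static firing order can be derived once, at compile time, by a topological sort of the process graph. The sketch below uses the graphlib module of the Python standard library (Python 3.9 or later); the process names form a hypothetical four-stage coder pipeline and are not taken from an actual application.

```python
from graphlib import TopologicalSorter

# Each process maps to the set of processes whose output it consumes.
# The process names are hypothetical.
dependencies = {
    "analysis": {"input"},
    "quantize": {"analysis"},
    "pack": {"quantize"},
}

# static_order() yields every process after all of its predecessors.
firing_order = list(TopologicalSorter(dependencies).static_order())
print(firing_order)
```

    With such an order known in advance, the core processor only needs to step through a fixed list, whereas data-dependent applications require the scheduling decision to be taken at run time.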

    Finally, the (adaptive) scaling of the supply voltages to very low levels can have a substantial impact on the system reliability and signal integrity. We propose to study this impact for actual IC processes under a variety of operating conditions.

    9. Case Studies

    We have selected a number of potential case studies to drive the development of the heterogeneous processor concept. All of them are related to the domain of multimedia, portable communication and computing and have stringent power requirements. Other applications can be considered during the execution of the project in coordination with our partners.

  • Speech coder for cellular applications. This application is critically important for the implementation of ultra-low power cellular terminals. The coder should be reprogrammable to support various coding techniques (such as VSELP and half-rate coding).

  • Baseband processor for radio communications. Such a processor is an essential ingredient for any multi-modal radio and should combine efficient filtering, channel coding, modulation, correlation, and error correction units.

  • Encryption processor. This processor would implement a variety of encryption schemes such as DES and RCA.

  • Video decompression processor. Combines a number of decompression schemes such as VQ, subband coding, and MPEG-2. The availability of a programmable decompression unit avoids the necessity of having a transcoder in the network. The decompression code could even be downloaded over the communication link.


    This FrameMaker Document was converted to HTML by maker2html v1.1a.
    (This file was created: Tue Dec 3 19:28:10 PST 1996 )