In the future, multimedia terminals and network computers will need to support multi-modal communications and display flexibility and adaptivity in the peripheral devices to accommodate changing conditions in communication bandwidth and data sources. For instance, different video decompression schemes can be used, graphics resolution and encoding schemes can change depending upon the available bandwidth and the application, a variety of security and encryption techniques can be used based on the type of data transmitted, and the radio-modem protocols depend upon the location (cellular GSM, PCS, micro- or pico-cellular). It is therefore believed necessary to base the design of such a terminal around programmable and reconfigurable components.
The goal of this proposal is to demonstrate that ultra-low power implementations of programmable components are attainable using a well-established and reusable low-power design methodology. The resulting devices will have an energy-efficiency that is orders of magnitude better than presently available solutions (though still not as high as the fully dedicated, application-specific approach). The basic idea behind the proposed architecture is pictured in the Figure below. While general computing (including task-scheduling, interrupt handling,
The goal of this project is to design a number of representative processors that illustrate the energy-efficiency of the proposed approach. An accompanying design methodology and the necessary circuit components and modules will be developed, making the realization of new reconfigurable processor modules rather painless. Finally, we are introducing a number of architecture and circuit approaches that will make the energy-cost of reconfiguration rather minimal. This translates into a number of tasks to be performed that are summarized in the subsequent paragraphs.
Conception, Design and Test of Domain-Specific Reconfigurable Processors
To guide and steer the development of the proposed domain-specific methodology, we have selected a number of target applications that cover most of the scope of multimedia processing. This will ensure the generality and the wide applicability of the proposed architectures and techniques. More specifically, we have selected voice coding, video decompression, RF baseband processing, and encryption as benchmarks. These applications span the complete range of computational requirements and computational kernels that typically occur in multimedia. The tasks at hand include:
(i) the definition and development of the general architecture, including the composing modules, the communication mechanisms, and the reconfigurable interconnect.
(ii) the mapping of the proposed applications onto the architecture.
(iii) the development of a design methodology to support the mapping process, including code analysis, kernel identification, hardware allocation, and compilation.
Underlying the proposed architecture are a number of innovative low-power design techniques that address issues such as energy efficiency versus performance, synchronization between co-processors, and reconfiguration in a fundamentally different fashion.
First of all, we propose to use dynamic voltage scaling to achieve an optimal ratio of energy versus performance; in other words, operations are delivered at an energy level that is in accordance with the requested performance level. This requires the availability of integrated dc-dc converters that are efficient over a wide range of voltage and load conditions and that have a fast response time.
Secondly, the synchronization and communication between a number of processors that operate at varying clock rates and voltage levels requires a new approach. We propose the use of a data-driven asynchronous approach at the protocol level. This is combined with a "locally synchronous island" approach at the circuit level that allows the reconfigurable interconnect network to operate in a self-timed mode at low voltage swing. Processor modules can operate in either synchronous or self-timed mode at arbitrary voltage levels. The combination of a data-driven communication protocol and the locally synchronous islands eliminates the occurrence of synchronization failures.
To check the validity of the proposed approaches, we plan first on a thorough analysis, followed by the development of a number of test circuits. The techniques will later be demonstrated in the multimedia processors.
Design of Energy-Efficient Modules
The proposed heterogeneous reconfigurable processor modules consist mainly of four different components: reconfigurable interconnect (see above), core processor, dedicated co-processors, and reconfigurable co-processors (or programmable logic modules). For each of these, we plan to develop low-energy implementations. These will be developed in a parameterizable fashion so that they can be re-used over different applications. All modules will be developed in the most up-to-date technology available through MOSIS (currently 0.6 µm, 3-metal layer CMOS) with the aid of both industrial design tools (Cadence, Synopsys, Epic) and in-house power analysis and optimization tools. Energy-efficiency will be at the highest priority level during the design of these modules. Depending on the technology parameters of the process available, supply voltages are projected to go as low as 0.9 V.
While most of the modules developed here are intended to be used in an embedded fashion as a part of the domain-specific heterogeneous processor concept, most of them will be usable in a stand-alone mode as well. This is especially true for the performance-scalable co-processor and the programmable logic modules.
To support low-power design from the earliest phases of the design process, we propose to develop a power analysis and guidance environment at the conceptual system level. This includes the development of static and dynamic profiling tools and the mapping of the obtained statistics into realistic performance and power estimates. This requires the availability of flexible, simple, yet accurate macromodels for the various components of the heterogeneous architecture. The guidance tools will use regularity information to propose potential candidates for implementation on co-processors and will predict the expected savings.
Besides developing techniques that address the design of reconfigurable components at the system level, we also plan to examine and analyze the impact of our proposed low-power techniques on reliability and integrity of the produced processors. While we plan to use industrial tools to the maximum extent, some dedicated tool development will also be necessary (e.g. for the optimization and design of the dc-dc converters).
This approach, while effective, comes at a substantial price:
(i) it requires a redesign for each new application, communication protocol, or algorithm.
(ii) it does not allow for in-the-field reconfiguration and adaptation.
(iii) the time required for the design and the risks involved in the implementation are high.
In particular, future multimedia terminals and network computers will need to support multi-modal communications and display flexibility and adaptivity in the peripheral devices to accommodate changing conditions in communication bandwidth and data sources. For instance, different video decompression schemes can be used, graphics resolution and encoding schemes can change depending upon the available bandwidth and the application, different security and encryption techniques can be used based on the type of data transmitted, and the radio-modem protocols depend upon the availability of infrastructure at a given location (cellular GSM, PCS, micro- or pico-cellular). It is therefore believed necessary to base the design of such a terminal around programmable and reconfigurable components. The goal of this proposal is to demonstrate that ultra-low power implementations of these programmable components are attainable using a well-established and reusable low-power design methodology. The resulting devices will have an energy-efficiency that is orders of magnitude better than presently available solutions (though still not as high as the fully dedicated, application-specific approach).
To accomplish this, we propose a hybrid system architecture that addresses the high computational requirements of the multimedia environment, while keeping power dissipation at ultra-low levels. Future general-purpose microprocessors and DSP processors (eventually in a multiprocessor configuration) might be able to meet the computational requirements, but will do so at an excessive cost in energy. Furthermore, these general-purpose architectures miss the opportunities intrinsic to multimedia applications. This was clearly illustrated by Sasaki (NEC) in his keynote talk at ISSCC96 [Sasaki96], where he argued that general-purpose performance (as measured in IPS) is bound to saturate in the next decade, while multimedia performance (as measured in OPS) will continue to follow the exponential growth curve of the past (Figure 1a). This can be attributed to the ample opportunities for optimization and parallelism in the latter. The potential of distributed computation and the presence of often-repeated computational kernels also make it possible to deliver multimedia OPS at a much lower energy level, as is illustrated in Figure 1b.
Different architectures have been proposed to address multimedia programmable computing. Important trends are the use of high-performance technology (as exemplified in the MicroUnity multimedia processor [Moussouris95]), and the native signal processing approach, where a traditional RISC architecture is augmented with dedicated instructions for multimedia functions. An example of the latter is the UltraSPARC architecture of Sun Microsystems [Sun95]. Both of these approaches are simple extensions of current architectures and fail to exploit the power and performance opportunities offered by multimedia.
Our power-efficient approach is based on a number of premises:
The remainder of the technical rationale demonstrates how these goals can be accomplished in a heterogeneous, domain-specific processor architecture. We first describe the overall architecture, followed by a detailed description of the power-reduction approaches used for the core processor, the satellite processors, and the programmable logic modules. The section is concluded with an overview of the design flow that we propose to use for the development of these multimedia processors.
The picture is substantially different when considering communication and multimedia applications, where the computational complexity (and thus power dissipation) is high (between 1 GOP and 1 TOP) but is mostly located in a few kernels that are executed repeatedly (such as the CDMA correlators in a radio modem or the DCT and the motion compensation in video compression). Improving the power-efficiency of these kernels by using application-specific modules therefore has a substantial impact. Furthermore, the kernels typically display a high amount of parallelism, operate on data-streams with strong temporal correlations, and tend to be more adaptable to data- or demand-driven paradigms than a generic von Neumann oriented program. These features are optimally exploited in a so-called domain-specific processor, i.e. a processor that is targeted to a specific application domain such as video compression or spread spectrum, yet is programmable enough to support a wide range of alternative algorithms and programs within that domain.
The proposed hybrid architecture is based on a fixed template (as shown in Figure 2), consisting of a "performance-scalable" microprocessor, surrounded by an array of heterogeneous, autonomous satellite processors. The former manages the overall control flow of the computation and handles issues such as context switching, process management and interrupt handling. The power-intensive kernel computations are implemented as a set of dedicated, communicating processes on the co-processors. These "weakly programmable" or (maybe only) parameterizable units are optimized for the execution of a single computational kernel. Examples are a DCT processor, an add-compare-select processor for Viterbi decoding, a Galois-module for encryption, an FFT butterfly, or a memory processor. While most of these satellites are very dedicated, others might need a higher degree of dynamic reconfiguration capabilities to support the implementation of a range of kernels. Some satellites are therefore implemented as PLMs (programmable logic modules or embedded PGAs), which support reconfiguration up to the arithmetic operator or even the gate level. The low-power implementation of these PLMs is discussed further in this proposal.
The type and the number of satellite processors are determined by the targeted application domain, and a family of these will be developed with the appropriate software tools to automate the process. Central to the total concept is that each of the satellites adheres to the following architectural concepts: a dedicated application-specific datapath, data storage which assumes a high degree of locality, minimal control, and a standardized data-driven interface.
Communication between processors proceeds over a programmable, reconfigurable network [Yeung95]. Processors communicate in a data-driven fashion, which makes synchronization and initialization easy to implement, and which allows for automatic power-up. A single communication channel is allocated to a given data stream, ensuring preservation of data-correlations and thus reducing power consumption.
To understand the overall flow of the computation, let us consider a simple scenario. Upon the occurrence of an external or internal control event, the core processor wakes up. It determines the status of the overall system and moves to the next operation phase. It programs (parameterizes) the various satellite processors (over a control bus) and configures the interconnect network. In a sense, it can be stated that a "hardware process" is "spawned". Once this initialization is completed, the system proceeds into data-flow mode. The control processor goes to sleep (or at least the operating system part of it) and the satellite processors move into execution mode, communicating with each other in data-flow fashion. This continues until the current task is completed or interrupted by an external event and the control is passed back to the RISC. One can observe that the system thus uses two computational paradigms: demand-driven at the operating system level and data-flow driven at the computational level. This clear definition of operation paradigms makes the programming of the engine considerably simpler.
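The scenario above can be sketched as a small behavioral simulation. This is only an illustration under stated assumptions: the class names `Satellite` and `Core`, the linear chain topology, and the two example kernels are hypothetical and not part of the proposed architecture.

```python
from collections import deque


class Satellite:
    """Hypothetical satellite processor that fires only when a token is present."""

    def __init__(self, name, kernel):
        self.name = name
        self.kernel = kernel    # function(token, param) applied to each token
        self.param = None       # written by the core over the control bus
        self.inbox = deque()    # data-driven input channel
        self.dest = None        # downstream satellite, set by network configuration
        self.outputs = []       # results when no downstream satellite exists

    def step(self):
        """Consume one token if available; return True if the satellite fired."""
        if not self.inbox:
            return False        # no data available: stay idle
        result = self.kernel(self.inbox.popleft(), self.param)
        if self.dest is not None:
            self.dest.inbox.append(result)
        else:
            self.outputs.append(result)
        return True


class Core:
    """Hypothetical core processor: configures, sleeps, wakes when the task ends."""

    def __init__(self):
        self.asleep = False

    def spawn(self, satellites, params, stream):
        # Configuration phase: parameterize satellites and connect them in a chain.
        for sat, param in zip(satellites, params):
            sat.param = param
        for upstream, downstream in zip(satellites, satellites[1:]):
            upstream.dest = downstream
        satellites[0].inbox.extend(stream)
        # Data-flow phase: the core sleeps while the satellites exchange tokens.
        self.asleep = True
        while True:
            fired = [sat.step() for sat in satellites]
            if not any(fired):
                break           # task completed: control returns to the core
        self.asleep = False
        return list(satellites[-1].outputs)


scale = Satellite("scale", lambda x, p: x * p)
offset = Satellite("offset", lambda x, p: x + p)
core = Core()
result = core.spawn([scale, offset], params=[2, 1], stream=[1, 2, 3])
print(result)
```

The core touches the satellites only during configuration; all later activity is triggered by token availability, which mirrors the split between the demand-driven control level and the data-flow computational level.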
Estimates indicate that the use of a heterogeneous, reconfigurable processor architecture reduces dissipation for selected application domains by more than an order of magnitude compared to programmable (DSP) processors, while maintaining a high level of programmability. For example, we project that a programmable voice coding processor (as used in cellular applications) implemented in this approach in a conservative technology would dissipate only 5 mW. This should be compared to the best (application-specific) implementation, which currently consumes more than 70 mW [Ueda93].
A commonly used scheme to reduce computation power in power-critical conditions is to reduce the clock frequency. However, energy for a given function in CMOS implementations is independent of clock frequency, so we still expend the same amount of work (energy) per operation; we just take longer to do it (as is illustrated in Figure 3).
This observation is especially important for the control processor in a multimedia terminal, where computation requests often come in bursts and the latency between a request for computation and its completion is important. If the supply voltage is now reduced in conjunction with the clock frequency, we get a quadratic reduction in energy/operation. By scaling the supply voltage, we can scale the throughput to meet the performance demands of the user (Figure 3). The fundamental trade-off is that higher throughput costs higher energy/operation. As illustrated in Figure 4, a reduction of the supply voltage from 3.3 V to 1.2 V reduces the energy/operation by a factor of approximately 10, while reducing the throughput (and clock frequency) by a factor of 8.5.
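As a rough cross-check of the trade-off described above, the following sketch evaluates two textbook first-order models: switching energy proportional to the square of the supply voltage, and a long-channel delay model proportional to V/(V - Vt)^2. The threshold voltage of 0.7 V and the function names are assumptions made for illustration; the figures quoted in the text come from Figure 4, not from these models, so the results are expected to differ somewhat.

```python
def energy_per_op(v_dd, c_eff=1.0):
    """First-order CMOS switching energy per operation: E = C_eff * V_dd**2."""
    return c_eff * v_dd ** 2


def gate_delay(v_dd, v_t=0.7):
    """Long-channel delay model in arbitrary units; v_t = 0.7 V is an assumed value."""
    if v_dd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return v_dd / (v_dd - v_t) ** 2


# Ratios for the 3.3 V -> 1.2 V supply reduction discussed in the text.
energy_ratio = energy_per_op(3.3) / energy_per_op(1.2)
slowdown = gate_delay(1.2) / gate_delay(3.3)
print(f"energy/operation reduced by about {energy_ratio:.1f}x")
print(f"throughput reduced by about {slowdown:.1f}x")
```

The pure quadratic model gives a somewhat smaller energy reduction than the factor of approximately 10 read from Figure 4, and the delay ratio depends strongly on the assumed threshold voltage, so both numbers should be read as order-of-magnitude checks only.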
To make this voltage scaling useful, it needs to be performed dynamically, which is possible by building on existing research in DC-DC converters. A voltage (and clock frequency) change occurring on the order of 100 microseconds is feasible. A register containing a desired frequency can be written to under software control. This frequency value then sets the supply voltage that meets the desired frequency via an adaptive feedback loop. This approach has the advantage of delivering performance-on-demand, yet can trade performance-for-power.
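The frequency-register mechanism can be illustrated with a simple stepping loop. This is a minimal sketch, not the proposed controller: the frequency model `freq_mhz`, its constants, the 50 mV step, and the voltage limits are all hypothetical values chosen only to make the loop concrete.

```python
def freq_mhz(supply_mv, k=60.0, v_t=0.7):
    """Hypothetical model of the achievable clock frequency at a given supply."""
    v = supply_mv / 1000.0
    return 0.0 if v <= v_t else k * (v - v_t) ** 2 / v


def settle_supply_mv(target_mhz, start_mv=3300, min_mv=900, max_mv=3300, step_mv=50):
    """Step the supply until it is the lowest setting that still meets the target."""
    mv = start_mv
    for _ in range(1000):  # bounded number of feedback iterations
        if freq_mhz(mv) < target_mhz and mv + step_mv <= max_mv:
            mv += step_mv  # too slow: raise the supply
        elif mv - step_mv >= min_mv and freq_mhz(mv - step_mv) >= target_mhz:
            mv -= step_mv  # one step lower still meets the target: save energy
        else:
            break          # settled, or pinned at a voltage limit
    return mv


# Software writes the desired frequency; the loop finds the matching supply.
settled = settle_supply_mv(target_mhz=20.0)
print(settled, "mV gives", round(freq_mhz(settled), 1), "MHz")
```

A real converter loop would act on a measured frequency (for example from a replica oscillator) rather than on a closed-form model, and its settling time would be set by the converter dynamics.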
Optimization of the processor power alone is not sufficient. Because power is distributed throughout the whole processor system, power optimization should be performed on all components (processor, main memory, processor bus, glue logic). For instance, the microprocessor sub-system implemented in the InfoPad multimedia terminal presently consists of an ARM60 processor, 0.5 MByte of SRAM for main memory, a boot-EPROM, and a PAL for external glue logic. Total measured power dissipation is 1.2 W, while throughput is about 10 MIPS. By redesigning the processor, the SRAM, and the bus I/O, we project a system with equivalent throughput, but with a power dissipation down to 10 mW. For those times when greater throughput is desired, the system will dynamically adjust its throughput to a peak value of about 80 MIPS, while dissipating 700 mW (in a 0.6 µm CMOS technology). The CPU will be compatible with the ARM Instruction Set, so that existing software code can be directly ported over to take advantage of the energy efficient processor system. Energy savings will be made by aggressively pursuing low-energy design techniques, such as the use of low-voltage swing bus-transceivers, use of parallel-access embedded memory integrated on the same die, removal of excess functionality that is not needed for multimedia computation, and integration of support functionality such as clock generator and interrupt handler into the processor core.
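The power and throughput figures quoted above can be converted into energy per instruction, which makes the comparison independent of clock rate. The helper name below is an illustrative assumption; the input numbers are the ones given in the text.

```python
def nj_per_instruction(power_mw, mips):
    """mW / MIPS = (1e-3 J/s) / (1e6 instr/s) = 1e-9 J/instr, i.e. nJ per instruction."""
    return power_mw / mips


current_system = nj_per_instruction(1200.0, 10.0)   # measured InfoPad sub-system
projected_low = nj_per_instruction(10.0, 10.0)      # projected, equivalent throughput
projected_peak = nj_per_instruction(700.0, 80.0)    # projected, peak throughput

print(current_system, projected_low, projected_peak)
```

Read this way, the projected system spends less energy per instruction than the current one at both operating points, and the peak-throughput point costs more energy per instruction than the low-throughput point, consistent with the trade-off stated for voltage scaling.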
To quantify the potential impact of spawning multimedia kernels onto dedicated satellite processors, consider the example of the VSELP speech coder. The results of a dynamic profiling of the voice coder code when excited with actual speech data are given in Figure 6.
Programmable and reconfigurable logic modules are an essential ingredient of most prototyping efforts. Moreover, they are increasingly being used in large-volume systems and as sub-modules in integrated circuits. The idea of using programmable logic to dynamically modify the instruction set of a microprocessor, or to change its periphery, is receiving considerable attention. An example is the Clay-based Reconfigurable Signal Processor architecture, proposed by National Semiconductor [Clay96] (with whom we are cooperating). Most efforts in the PLM domain have concentrated on increasing performance and density, and little effort has been devoted to improving the power efficiency. This is bound to change for a number of reasons. First of all, the scaling of current PGA implementations to higher gate densities and larger complexity levels will undoubtedly lead to unacceptable consumption levels. For instance, it is projected [Xilinx96] that a 100 KGate PGA running at 100 MHz will consume hundreds of watts (100 KGates translate to 10K signals switching at 100 MHz @ 0.3 mW/MHz for current technologies...). This does not include the power consumed in the projected 700 input-output pins. Obviously, a dramatic reduction in the energy dissipation is necessary for this level of integration to become a reality. The increasing use of PGAs in volume systems and even portable applications is another incentive to address the dissipation issue in PGAs. For example, a fully programmed Xilinx 4000 easily consumes 1 to 2 W, which makes it unsuitable for most portable applications.
When attempting to reduce power consumption in PGAs, it is essential to first identify the dominant sources of energy dissipation. The architecture of an SRAM-based PGA is shown in Figure 8. Studies at Xilinx have demonstrated that the power dissipation of a PGA under normal operation distributes evenly over three components:
Lowering the dissipation requires equal attention to each component. In an embedded application, I/O power dissipation is not a real issue and can be reduced to negligible levels. Otherwise, reducing the signal swing at the board level and using advanced packaging techniques (such as chip-on-board) are the only ways to deal with I/O dissipation. More important in the scope of this proposal are the other two components. Reducing the supply voltage presents an obvious way of reducing the consumption, but at the same time reduces the performance, which might or might not be acceptable depending upon the application.
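The parenthetical estimate for the 100 KGate PGA can be reproduced directly. The function name is an illustrative assumption; the three input values are the ones quoted in the text.

```python
def pga_switching_power_w(switching_signals, clock_mhz, mw_per_mhz_per_signal):
    """Total switching power in watts for signals that each cost a fixed mW per MHz."""
    return switching_signals * clock_mhz * mw_per_mhz_per_signal / 1000.0


# 10K switching signals at 100 MHz and 0.3 mW/MHz per signal, as quoted above.
core_power_w = pga_switching_power_w(10_000, 100.0, 0.3)
print(f"{core_power_w:.0f} W")
```

This covers only the internal switching signals and, as the text notes, excludes the power consumed in the input-output pins.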
Fortunately, the PLM architecture offers room for a substantial improvement in power efficiency beyond what can be achieved by voltage scaling. In a broad sense, soft-programmable FPGAs resemble memory architectures. A large number of the techniques that have been exploited in low-power SRAMs and DRAMs can come to fruition here as well:
This has to be combined with architectural improvements. It has been observed that every gate in current PGA architectures carries a power overhead of approximately a factor of 8 to 10 with respect to an equivalent custom-designed gate. The granularity of the basic cell determines the ratio of power consumption between combinational logic block (CLB) and local interconnect. More complex cells have the advantage of providing increased locality, that is, most switching events happen at low capacitance levels. A similar observation prompts the hardware mapper in logic synthesis to prefer larger fan-in gates over smaller ones when energy minimization is the primary goal [Pedram96]. Smaller granularity cells tend to ease the technology mapping process and improve the utilization, but rely more heavily on the local interconnect network. The mechanism used to program a cell has an impact on the dissipation as well. For instance, the Actel cell uses the interconnect network to determine the cell functionality, while Xilinx employs a programmable table-lookup structure (implemented as a large multiplexer) within the cell [Trimberger94]. Our goal is to investigate the energy-efficiency of the various approaches and to propose a low-power CLB that minimizes energy/operation while at the same time maintaining performance.
It is our further belief that substantial savings in power dissipation of PLMs are possible by focusing on the applications layer, i.e. how applications are mapped onto the hardware structure. Refocusing the technology mapping phase in the synthesis process can lead to important power savings. Unfortunately, even the most advanced technology mapping currently available from EDA vendors does not effectively utilize the many features offered by the hardware. One way to arrive at truly energy-efficient solutions is by using predefined (manually mapped) library modules. It is our intention to develop an energy-efficient library that encapsulates the most important functions needed in multimedia.
At the protocol layer, processors communicate in a data-driven fashion, which makes synchronization and initialization easy to implement. A single communication channel is allocated to a given data stream for the duration of the computation, ensuring preservation of data-correlations and thus reducing power consumption. The presence of a token on the corresponding signaling line indicates the availability of data and activates the destination processor. When implemented for a homogeneous fine-grained processor array [Yeung95], we have demonstrated that this approach reduces power consumption in the global interconnect network to no more than 10% over a range of applications.
The challenge at the circuit layer is to provide a signaling scheme that does not depend upon the operating frequency of any of the connecting modules. A simple two-phase single-wire signaling scheme is presented in Figure 9. As can be observed, individual processors operate on a locally generated clock and hence follow the synchronous design approach. The operation of the clock is enabled by the presence of a token event on the data receiver, i.e. when no data is available the local clock is automatically turned off. The period of the clock generator is programmable and can be adapted to the required performance or operation voltage. The combination of the chosen two-phase signaling and the local clock generation ensures that synchronization failures cannot occur.
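The token-enabled local clock can be illustrated with a small behavioral model. This is a sketch under stated assumptions only: the class `LocalClockIsland`, the fixed number of cycles per token, and the two clock periods are hypothetical, and the model abstracts away the two-phase signaling circuit of Figure 9 entirely.

```python
class LocalClockIsland:
    """Behavioral model of one locally clocked processor (hypothetical names)."""

    def __init__(self, period_ns, cycles_per_token):
        self.period_ns = period_ns              # programmable local clock period
        self.cycles_per_token = cycles_per_token
        self.pending_tokens = 0
        self.cycles_run = 0

    def token_event(self):
        """A token on the signaling line marks one available data item."""
        self.pending_tokens += 1

    def run_until_idle(self):
        """The clock runs only while tokens are pending; returns active time in ns."""
        active_ns = 0
        while self.pending_tokens > 0:
            self.cycles_run += self.cycles_per_token
            active_ns += self.cycles_per_token * self.period_ns
            self.pending_tokens -= 1
        return active_ns  # clock is off again here: no tokens, no cycles


# Two islands with different clock periods receive the same three tokens.
fast = LocalClockIsland(period_ns=10, cycles_per_token=4)
slow = LocalClockIsland(period_ns=85, cycles_per_token=4)
for island in (fast, slow):
    for _ in range(3):
        island.token_event()
fast_ns = fast.run_until_idle()
slow_ns = slow.run_until_idle()
idle_ns = fast.run_until_idle()  # no tokens pending: the clock never starts
print(fast_ns, slow_ns, idle_ns)
```

Both islands execute the same number of cycles for the same data and differ only in elapsed time, and an island without pending tokens consumes no clock cycles at all.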
The dc-dc converter plays an essential role in the proposed scheme and merits some detailed attention. For a dynamic voltage adaptation scheme to be meaningful and worthwhile, it is essential that the voltage converters (i) are compact; (ii) are tightly integrated with the load circuitry; (iii) use a minimal number of external components; and (iv) are efficient over a wide range of loading conditions and operating voltages.
In some preliminary studies [Stratakos95], we have demonstrated that these goals can be accomplished by using a custom approach to the traditional buck converter. Converter parameters are optimized for the voltage range and load conditions at hand. Low frequency operation going as low as 1 MHz reduces the size of the external reactive components; 1 inductor and 3 capacitors are typically required. The use of innovative control schemes such as adaptive dead-time control makes efficient conversion possible over a wide range of loads. To minimize power dissipation in the control, we propose to investigate a number of novel control techniques such as phase-locked and
-
based pulse-width modulators. It is our projection that the proposed techniques will yield converters with high efficiencies (> 90%) over rapidly varying load conditions (1000-to-1 step variation), while developing variable output voltages (between 1 and 5V). A simple prototype circuit we have developed confirms that these projections are realistic. A micro-photograph and specifications of the prototype are shown in Figure 10.
As part of the InfoPad project, we have developed a suite of low-power design tools that includes power analysis, optimization, and synthesis tools for application-specific circuits [Mehra96]. An overview of the resulting low-power design methodology is shown in Figure 11.
When attempting to apply this methodology to the design of heterogeneous, reconfigurable processors, we observe that a number of tools carry across the implementation styles, but also that some important capabilities are lacking. A first necessity is the early exploration of the impact of design decisions and architectural choices on the power budget. In the realm of programmable processors, this involves the study of switching statistics over a set of meaningful benchmarks and input data. A combination of static and dynamic code profilers can provide accurate switching profiles from behavioral code (C or C++) at the primitive operation level [Rabaey96]. This information can be combined with parameterizable macro-models of the architectural building blocks to yield first-order power estimates, which can help to determine the impact of low-power techniques such as partitioning, voltage scaling and swing reduction.
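A first-order estimate of this kind can be sketched as operation counts multiplied by per-operation energies from a macro-model, scaled quadratically with supply voltage. All names and numbers below (the operation classes, the energies in pJ, the 3.3 V reference, the 50 runs per second) are hypothetical placeholders, not measured data.

```python
def estimate_power_mw(op_counts, energy_pj_at_ref, runs_per_second, v_dd, v_ref=3.3):
    """First-order power estimate from profiled operation counts.

    op_counts: operations executed per run, keyed by operation class.
    energy_pj_at_ref: macro-model energy per operation in pJ at v_ref (assumed).
    """
    voltage_scale = (v_dd / v_ref) ** 2  # switching energy scales with V_dd**2
    pj_per_run = sum(count * energy_pj_at_ref[op] for op, count in op_counts.items())
    # pJ per second is 1e-12 W, which equals 1e-9 mW.
    return pj_per_run * voltage_scale * runs_per_second * 1e-9


# Hypothetical profile of one frame of a benchmark and a hypothetical macro-model.
profile = {"mul": 1000, "add": 2000, "mem": 500}
macro_model_pj = {"mul": 50.0, "add": 10.0, "mem": 100.0}

at_reference = estimate_power_mw(profile, macro_model_pj, runs_per_second=50, v_dd=3.3)
at_half_supply = estimate_power_mw(profile, macro_model_pj, runs_per_second=50, v_dd=1.65)
print(at_reference, at_half_supply)
```

Because the counts come from profiling and the energies from the macro-model, the same profile can be re-evaluated quickly for alternative voltages or alternative building blocks.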
The automatic identification of computational kernels that are candidates for implementation on dedicated co-processors is another important component of the design methodology. Such a function relies on code analysis (template extraction), statistics analysis (as obtained from the conceptual analysis) and estimation of the impact on the power dissipation.
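One simple way to picture the final estimation step is to rank profiled code regions by their share of total energy and predict the saving if a region moved to a dedicated co-processor. The sketch below is an assumption-laden illustration: the region names, the energy figures, the uniform efficiency gain of 10, and the 20% threshold are all hypothetical, and it does not perform the code analysis or template extraction mentioned above.

```python
def rank_kernel_candidates(energy_profile, gain=10.0, min_share=0.2):
    """Return (name, share, predicted_saving) for regions worth off-loading.

    energy_profile: estimated energy per code region, in arbitrary units.
    gain: assumed energy-efficiency improvement on a dedicated co-processor.
    min_share: smallest fraction of total energy considered worthwhile.
    """
    total = sum(energy_profile.values())
    candidates = []
    for name, energy in energy_profile.items():
        share = energy / total
        if share >= min_share:
            # Saving as a fraction of total energy if this region runs 'gain' times cheaper.
            candidates.append((name, share, share * (1.0 - 1.0 / gain)))
    return sorted(candidates, key=lambda item: item[2], reverse=True)


# Hypothetical energy profile of a voice coder (labels are illustrative only).
profile = {"codebook_search": 60.0, "lpc_filter": 25.0, "control": 10.0, "io": 5.0}
ranked = rank_kernel_candidates(profile)
for name, share, saving in ranked:
    print(f"{name}: {share:.0%} of energy, predicted saving {saving:.1%}")
```

The predicted saving is bounded by the region's share of the total, which is why only the few dominant kernels are attractive candidates.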
Once kernels are identified, we believe that the compilation of the application onto the selected architecture can proceed easily. In fact, due to the selected and standardized communication protocols, code migration from core to co-processor is rather straightforward. Key challenges will be to ensure that the real-time requirements are met and to choose the overall control functionality needed for a given application. For instance, applications such as voice coders allow for a static ordering in the firing of the processes. This drastically simplifies the operating system that has to run on the processor core. Other applications, however, require dynamic scheduling and need a process kernel.
Finally, the (adaptive) scaling of the supply voltages to very low levels can have a substantial impact on the system reliability and signal integrity. We propose to study this impact for actual IC processes under a variety of operating conditions.