

swiss scientific initiative in health / security / environment systems



# **Ultra-Low-Power Processing Platform**

Antonio Pullini<sup>1</sup>, Frank K. Gürkaynak<sup>1</sup>, Luca Benini<sup>1</sup>, Adam Teman<sup>2</sup>, Jeremy Constantin<sup>2</sup>, Andreas Burg<sup>2</sup>

<sup>1</sup>Integrated Systems Laboratory, ETH Zürich, <sup>2</sup>Telecommunications Circuits Lab, EPFL





PULP stands for Parallel Ultra-Low Power Processor Architecture and is actively being developed by the Integrated Systems Laboratory of ETH Zürich. Our goal is to develop a system whose energy efficiency stays the same regardless of the computational load. We call this property energy proportionality. Our system works very well when there is little to do and is equally efficient when the workload increases. This is different from other processor systems, which are optimized to work well at one operating corner but do not scale well.

To allow us to scale efficiently, we have designed a many-core architecture organized in clusters.

Each cluster consists of several simple processor cores. Our current system is based on an open-source architecture called OpenRISC.

All processor cores within a cluster have access to a common data memory that we call Tightly-Coupled Data-Memory (TCDM).

### How do we achieve energy efficiency?

We have designed the architecture so that it can operate in the near/subthreshold regime. In this mode the circuits run more slowly, but they are more energy efficient (see below). We make up the speed deficit by running multiple cores in parallel. In addition, we can switch cores on and off and use body-biasing techniques to improve energy efficiency.
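The trade-off described above can be illustrated with a rough back-of-the-envelope sketch. It is not taken from the PULP design: it assumes that dynamic energy per operation scales with the square of the supply voltage, that clock frequency follows a simple alpha-power law, that leakage is ignored, and that work parallelizes perfectly across cores. The threshold voltage, the exponent `alpha`, and both supply voltages are hypothetical values chosen only for illustration.

```python
import math


def dynamic_energy_per_op(c_eff: float, vdd: float) -> float:
    """Dynamic switching energy per operation, E = C_eff * Vdd^2 (leakage ignored)."""
    return c_eff * vdd ** 2


def relative_frequency(vdd: float, vth: float = 0.3, alpha: float = 1.5) -> float:
    """Relative clock frequency from an alpha-power-law model (assumed, not measured)."""
    if vdd <= vth:
        raise ValueError("model only covers supply voltages above the threshold voltage")
    return (vdd - vth) ** alpha / vdd


def cores_to_match(v_nominal: float, v_low: float) -> int:
    """Cores needed at v_low to match one core at v_nominal, assuming ideal speedup."""
    slowdown = relative_frequency(v_nominal) / relative_frequency(v_low)
    return math.ceil(slowdown)


# Hypothetical operating points, in volts.
V_NOMINAL = 1.0
V_LOW = 0.5

energy_ratio = dynamic_energy_per_op(1.0, V_NOMINAL) / dynamic_energy_per_op(1.0, V_LOW)
cores_needed = cores_to_match(V_NOMINAL, V_LOW)

print(f"energy per operation drops by a factor of {energy_ratio:.1f}")
print(f"cores needed to recover the throughput: {cores_needed}")
```

Under these assumptions each operation costs a quarter of the energy at the lower voltage, and a handful of slower cores recover the throughput of one fast core. Real designs must also account for leakage, which grows in relative importance as the voltage and clock frequency drop.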





Graph showing the energy efficiency in different operating regimes, plotted against Logic Vcc / Memory Vcc (V). Operation in the yellow-circled region is almost 5 times more efficient than operation at the nominal operating voltage. Taken from Vivek De, Intel, DATE-201

### What will we do in IcySoC?

In the IcySoC project we will combine two exciting ideas for more efficient processing platforms: operating in the near/subthreshold regime and using inexact computing. We will use PULP-based systems for our investigations. As part of the project we envision several modifications to the original PULP architecture.

The performance of the system can be improved by developing dedicated hardware accelerators that perform certain operations faster than a standard processor can. These can be tightly coupled to the processor.

It is also possible to design hardware accelerators that work more independently and even have access to their own local memory.

Inexact computing allows a trade-off between the accuracy of calculated results and the performance metrics, such as energy. It is



### A better processing core

In the current PULP architecture we are using the OpenRISC architecture. As this project is open source, it allows us to share our platform with all project partners without problems.

There is also a publicly available HDL implementation of the OpenRISC architecture. Under ideal conditions such a processor is expected to execute one instruction per clock cycle, that is, to reach an IPC (instructions per cycle) of 1. We soon realized that this is not the case for this implementation.

The original architecture had three problems, effectively reducing its IPC:

- Reading from and writing to memory was slow (LSU stalls).
- Multiplications took 3 cycles.
- Time was lost during branches.
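As a rough illustration of how these three effects pull the IPC below 1, the sketch below adds the extra cycles to an otherwise ideal one-instruction-per-cycle pipeline. Only the 3-cycle multiplication comes from the list above. The instruction mix and the per-event penalties for memory accesses and branches are hypothetical numbers, not measurements of the OpenRISC implementation.

```python
def effective_ipc(instructions: int, extra_cycles: dict[str, int]) -> float:
    """IPC of a pipeline that ideally retires one instruction per cycle,
    once the extra (stall) cycles of each category are added."""
    total_cycles = instructions + sum(extra_cycles.values())
    return instructions / total_cycles


# Hypothetical program: 1000 instructions in total.
INSTRUCTIONS = 1000
MEMORY_ACCESSES = 200   # assumed: 1 stall cycle per load/store (LSU stall)
MULTIPLICATIONS = 50    # 3 cycles each, i.e. 2 cycles more than the ideal 1
TAKEN_BRANCHES = 100    # assumed: 1 lost cycle per taken branch

extra = {
    "lsu_stalls": MEMORY_ACCESSES * 1,
    "multiplier": MULTIPLICATIONS * (3 - 1),
    "branches": TAKEN_BRANCHES * 1,
}

ipc = effective_ipc(INSTRUCTIONS, extra)
print(f"effective IPC: {ipc:.3f}")
```

With this assumed mix, 400 extra cycles are spent on 1000 instructions, so the pipeline retires noticeably fewer than one instruction per cycle. Removing any of the three penalty sources moves the IPC back toward 1.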

We improved this implementation and the new micro-architecture has

Block diagram of the improved OpenRISC processor



### New Memory Styles

Memories occupy a large part of any processing system. They not only occupy circuit area, but also contribute significantly to the power consumption. This is why it is important to optimize the memory hierarchy in a system as much as possible.

Typically, local memories in such systems are built using standard SRAMs (Static Random Access Memory). For large memory sizes these SRAM blocks offer high density; however, when smaller memory blocks are used, they carry some overhead. Most importantly, it is not easy to scale down the operating voltage of the SRAM macros, which prevents us from operating in the near-threshold region.

One of the solutions we are investigating is the use of **S**tandard-**C**ell-based **M**emories (SCM) to address these problems in applications where smaller memories are needed. For a 64x64 bit



