

# **Precise and Approximate Logarithmic Number Units** shared in a Multi-Core Cluster

Michael Gautschi, Michael Schaffner, Frank Gurkaynak, Luca Benini

Integrated Systems Laboratory, ETH Zurich, Switzerland

## 1. The logarithmic number system:

The logarithmic number system (LNS) can be used to exploit a larger dynamic range.
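As a minimal sketch of the idea (not the hardware described on this poster), a positive value can be stored as its base-2 logarithm. Multiplication, division and square root then reduce to adding, subtracting and halving the stored exponent, while addition needs a nonlinear function of the exponent difference. All function names below are hypothetical, and the sketch uses Python floats and ignores sign and zero handling rather than modelling a fixed-point format.

```python
import math


def to_lns(x: float) -> float:
    """Encode a positive real value as its base-2 logarithm."""
    if x <= 0.0:
        raise ValueError("this sketch only handles positive values")
    return math.log2(x)


def from_lns(e: float) -> float:
    """Decode an LNS exponent back to a real value."""
    return 2.0 ** e


def lns_mul(a: float, b: float) -> float:
    """Multiplication becomes an addition of exponents."""
    return a + b


def lns_div(a: float, b: float) -> float:
    """Division becomes a subtraction of exponents."""
    return a - b


def lns_sqrt(a: float) -> float:
    """Square root becomes a halving of the exponent."""
    return a / 2.0


def lns_add(a: float, b: float) -> float:
    """Addition needs the nonlinear function log2(1 + 2**d), with d <= 0."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```

For example, `from_lns(lns_mul(to_lns(3.0), to_lns(4.0)))` recovers the product up to floating-point rounding.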





## 2. Sharing an LNU in a cluster of simple cores:



- Errors can be tolerated in many applications, for example in image processing
- The area of the LNU increases with the precision requirements
- Approximation can be done by:
  - Reducing the bit width of the interpolators
  - Pruning lookup tables
- Tolerating errors leads to smaller LNUs and smaller delay
  - This further allows the number of pipeline stages to be reduced!
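To illustrate the trade-off between table size and accuracy, the sketch below approximates the LNS addition function log2(1 + 2^d) with a uniformly sampled lookup table and linear interpolation, then measures the worst-case error for different table sizes. This is only an illustration under assumed parameters: the uniform table, the range [-16, 0] and all function names are assumptions, not the interpolator architecture of the LNUs presented here.

```python
import math


def sb(d: float) -> float:
    """Exact LNS addition function log2(1 + 2**d) for d <= 0."""
    return math.log2(1.0 + 2.0 ** d)


def build_table(entries: int, d_min: float = -16.0) -> list:
    """Sample sb() uniformly on [d_min, 0]; fewer entries = a pruned table."""
    step = -d_min / (entries - 1)
    return [sb(d_min + i * step) for i in range(entries)]


def lookup(table: list, d: float, d_min: float = -16.0) -> float:
    """Evaluate the table with linear interpolation between entries."""
    if d <= d_min:
        return table[0]
    step = -d_min / (len(table) - 1)
    pos = (d - d_min) / step
    i = min(int(pos), len(table) - 2)  # clamp so that i + 1 stays in range
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


def max_error(entries: int, samples: int = 10_000, d_min: float = -16.0) -> float:
    """Worst absolute error of the table over a dense sweep of [d_min, 0]."""
    table = build_table(entries, d_min)
    worst = 0.0
    for k in range(samples + 1):
        d = d_min * k / samples
        worst = max(worst, abs(lookup(table, d, d_min) - sb(d)))
    return worst


if __name__ == "__main__":
    for entries in (17, 65, 257):
        print(entries, max_error(entries))
```

Smaller tables give larger worst-case errors, which is the kind of accuracy loss an approximate unit accepts in exchange for area and delay.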

Approx2 is 24% more energy efficient than the exact LNU due to:

- 2 cycles latency instead of 4
- 28% faster execution
- Only 6% higher power consumption
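The figures above are consistent with a simple energy = power x time estimate. The sketch below assumes that "28% faster execution" means a 28% shorter run time and that "more energy efficient" means less energy per kernel; both readings are assumptions, and the function name is hypothetical.

```python
def energy_saving_percent(power_increase: float = 0.06,
                          time_reduction: float = 0.28) -> float:
    """Relative energy saving when energy = power * time."""
    power_ratio = 1.0 + power_increase   # 6% higher power
    time_ratio = 1.0 - time_reduction    # 28% shorter execution (assumed reading)
    energy_ratio = power_ratio * time_ratio
    return (1.0 - energy_ratio) * 100.0


if __name__ == "__main__":
    print(f"{energy_saving_percent():.1f}% less energy")
```

Under these assumptions the saving comes out at roughly 24%, matching the stated figure.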








#### **Applications:**

- **Gradient Magnitude:**
  - Sobel Filter
  - Edge detection
- **Bilateral Filter:**
  - Nonlinear, edge preserving
  - Noise-reducing smoothing filter
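As a sketch of why a kernel such as the gradient magnitude maps well to LNS, the example below computes sqrt(gx^2 + gy^2) in the log domain: the two squares and the square root become a doubling and a halving of the exponent, and only one nonlinear addition remains. The Sobel kernels are the standard 3x3 ones; the function names, the use of Python floats and the special-casing of zero are assumptions made for illustration, not the implementation evaluated here.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel(patch):
    """Horizontal and vertical Sobel responses of a 3x3 patch."""
    gx = sum(SOBEL_X[r][c] * patch[r][c] for r in range(3) for c in range(3))
    gy = sum(SOBEL_Y[r][c] * patch[r][c] for r in range(3) for c in range(3))
    return gx, gy


def magnitude_lns(gx: float, gy: float) -> float:
    """Gradient magnitude sqrt(gx**2 + gy**2) evaluated in the log domain."""
    if gx == 0 or gy == 0:
        # log2(0) is undefined; this sketch special-cases zero.
        return float(abs(gx) + abs(gy))
    ex = 2.0 * math.log2(abs(gx))   # squaring doubles the exponent
    ey = 2.0 * math.log2(abs(gy))
    hi, lo = max(ex, ey), min(ex, ey)
    e_sum = hi + math.log2(1.0 + 2.0 ** (lo - hi))  # the one nonlinear LNS add
    return 2.0 ** (e_sum / 2.0)     # square root halves the exponent


if __name__ == "__main__":
    edge = ((0, 0, 10), (0, 0, 10), (0, 0, 10))  # vertical edge
    print(magnitude_lns(*sobel(edge)))
```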

## 3. LNU vs FPU Comparison: [3]



## Private FPU:



- The LNU is up to 4x more energy efficient than the FPU when computing complex kernels.
- One LNU can be efficiently shared in a cluster of four processing cores.

- **5x5 FIR Filter:**
  - Smoothing
  - Blurring

=> No visible, discernible errors


## 5. LNU Demonstrator:

- Image processing using approximate computing
- Demo core with two LNUs
  - 64x32b registers
  - ALU to handle LNS MUL/DIV/SQRT
  - Precise or approx. LNU can be selected to compute ADD, SUB, EXP, LOG, casts
  - Stencil memory for input image
  - Output written to frame buffer
- Implemented on FPGA
  - Altera Stratix IV
  - 40 MHz
  - 113 kGE



**Mandelbrot:** 

LNU 1 (single precision):

- 8.23, 16 ulp
- 8.23, 0.72 ulp
- 21 kGE
- 31.1 kGE



### Shared LNU:



| Implementation details            |    | Private FPU [3]   | Shared LNU [3]     | ELM [1,2]          |
|-----------------------------------|----|-------------------|--------------------|--------------------|
| Technology                        |    | 65nm LVT          | 65nm LVT           | 180nm              |
| Max. speed [MHz]                  |    | 374               | 337                | 125                |
| Max. throughput [GFLOPS]          |    | 1.1               | 0.9                | 0.084              |
| Power @ 100MHz, 1.2V, 25°C [mW]   |    | 41.84             | 44.0               | -                  |
| Leakage @ 1.2V, 25°C [mW]         |    | 2.823             | 3.019              | -                  |
| Precision (max. err.) [ulp]       |    | 0.5               | 0.478 <sup>1</sup> | 0.454 <sup>1</sup> |
| Avg. LNU/FPU utilization          |    | 0.21              | 0.37               | -                  |
| Total area [kGE]                  |    | 719               | 749                | -                  |
| Single core area [kGE]            |    | 51.1 <sup>2</sup> | 44.5               | -                  |
| **Instruction support**           |    | **Private FPU**   | **Shared LNU**     | **ELM [1,2]**      |
| Latency add/sub/casts             | HW | 2/2/2             | 4/4/4              | 3/3(4)/-           |
| Latency mul/div/sqrt <sup>3</sup> | HW | 2/-/-             | 1/1/1              | 1/1/1              |
|                                   | SW | -/62/56           | -/-/-              | -/-/-              |
| Latency exp/log <sup>3</sup>      | HW | -/-               | 4/4                | -/-                |
|                                   | SW | 51/85             | -/-                | -/-                |
#### **Gradient Magnitude:**



## 6. References:

[1] "The European Logarithmic Microprocessor", J.N. Coleman et al., 2008

[2] "ROM-less LNS", R. Che Ismail and J.N. Coleman, 2011

[3] "A 65nm CMOS 6.4-to-29.2 pJ/FLOP@ 0.8V shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster", M. Gautschi et al., ISSCC 2016

[4] "Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters", M. Schaffner et al., ARITH 2016