Hardware-based computation of the Roughness Index for infrared imagers |
This paper presents a compact and low-power digital implementation of the Roughness Index, which provides an estimate of the fixed-pattern noise present in an infrared video stream. This noise is caused by imperfections in the fabrication of the focal plane array and its readout circuitry, in addition to other time-variant processes. The index is continuously used during normal operation of the imager to activate calibration mechanisms that compensate for these limitations. Our FPGA-based prototype computes the Roughness Index on infrared video at 30 fps and a resolution of 720x480 pixels. On a low-cost Xilinx Spartan 3E XC3S500 FPGA, adding our circuit to an existing system increases logic resource utilization by only 6% and power consumption by 1.59 mW. Apr 05, 2012 |
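The abstract does not define the index. As a rough software sketch, the code below assumes the definition commonly used in the nonuniformity-correction literature: the L1 norm of the horizontal and vertical first differences of a frame, divided by the L1 norm of the frame itself. The function name `roughness_index` and the choice to ignore border pixels are illustrative assumptions, not details taken from the paper's hardware design.

```python
def roughness_index(frame):
    """Roughness of a 2-D frame given as a list of rows of non-negative pixels.

    Assumed definition: (sum |horizontal diffs| + sum |vertical diffs|) / sum |pixels|.
    Only differences between pixels inside the frame are counted (no padding).
    """
    rows, cols = len(frame), len(frame[0])
    # Differences between each pixel and its left neighbour.
    horizontal = sum(abs(frame[r][c] - frame[r][c - 1])
                     for r in range(rows) for c in range(1, cols))
    # Differences between each pixel and its upper neighbour.
    vertical = sum(abs(frame[r][c] - frame[r - 1][c])
                   for r in range(1, rows) for c in range(cols))
    total = sum(abs(p) for row in frame for p in row)
    if total == 0:
        raise ValueError("roughness is undefined for an all-zero frame")
    return (horizontal + vertical) / total


if __name__ == "__main__":
    flat = [[5, 5], [5, 5]]        # perfectly uniform frame
    striped = [[1, 3], [1, 3]]     # column-wise fixed pattern
    print(roughness_index(flat), roughness_index(striped))
```

Under this definition a uniform frame scores zero, and stronger pixel-to-pixel variation raises the score, which is why a threshold on the index could plausibly be used to trigger recalibration.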
|
A1CSA: An Energy-Efficient Fast Adder Architecture for Cell-Based VLSI Design |
Energy-efficient fast adders are needed in the design of battery-powered portable devices. Although many fast adder architectures exist, most of them require transistor-level optimizations that prevent their synthesis in a standard-cell flow. This paper presents two energy-efficient Add-One Carry-Select Adders (A1CSA and A1CSAH) suited for standard-cell synthesis. Synthesis results showed that the A1CSA is the smallest fast adder, requiring, on average, 22.2% less area than the Carry-Select Adder. They also showed that the A1CSAH is, on average, 10.8% faster and 3.4% more energy-efficient than the Carry-Lookahead Adder, making it the best choice for high-speed, high-efficiency addition. Apr 05, 2012 |
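The general add-one carry-select idea can be illustrated with a behavioral model. In the sketch below, each block computes its sum once for a carry-in of 0, derives the carry-in-1 result by incrementing that sum, and lets the incoming carry select between the two. This is only an assumed, simplified model of the technique; the function name, the block size, and the structure are hypothetical and do not reproduce the A1CSA or A1CSAH circuits.

```python
def add_one_carry_select(a, b, width=16, block=4):
    """Behavioral model of an add-one carry-select adder.

    Assumes width is a multiple of block. Returns (sum, carry_out).
    """
    mask = (1 << block) - 1
    carry = 0
    result = 0
    for shift in range(0, width, block):
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        raw = x + y                      # block sum for carry-in = 0
        sum0, cout0 = raw & mask, raw >> block
        # "Add-one" stage: derive the carry-in = 1 result from sum0
        # instead of using a second full adder chain.
        sum1 = (sum0 + 1) & mask
        cout1 = cout0 | (1 if sum0 == mask else 0)
        # The real carry-in selects one of the two precomputed results.
        block_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= block_sum << shift
    return result, carry


if __name__ == "__main__":
    print(add_one_carry_select(0x0FFF, 0x0001))  # carry ripples across blocks
```

The area saving claimed for this family of adders comes from replacing the duplicated adder of a conventional carry-select block with the cheaper increment logic.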
|
An Energy-Efficient FDCT/IDCT Configurable IP Core for Mobile Multimedia Platforms |
The development of mobile multimedia devices follows the platform-based design methodology, in which IP cores are the building blocks. In mobile devices, battery lifetime is a concern, which creates a need for energy-efficient IP cores. This paper presents an energy-efficient FDCT/IDCT configurable IP core. Synthesis for 90 nm resulted in 50 MHz as maximum frequency and 1.66 mW as total power, achieving a throughput of 188.2 Mpixels/s, which is enough to process two HDTV@1080p videos in real time. The IP core architecture is based on Massimino's algorithm, which was chosen for its accuracy and parallelism. The exploration of its parallelism resulted in a fully-combinational 1-D FDCT/IDCT configurable datapath. In addition, the IP core is IEEE-1180 compliant. Comparisons with related work, in terms of energy efficiency (mJ/Mpixel), revealed that our architecture... Apr 05, 2012 |
|
MARC II: A Parametrized Speculative Multi-Ported Memory Subsystem for Reconfigurable Computers |
We describe a parameterized memory system suitable as target for automatic high-level language to hardware compilers for reconfigurable computers. It fully supports the spatial computation paradigm by allowing the realization of each memory operator by a dedicated hardware memory port. Interport coherency is maintained only for those ports that actually require it, and efficient speculative execution is enabled by a dynamic scheme for arbitrating access to shared resources (such as main memory), relying on techniques inspired by the branch prediction of conventional software-programmable processors. Apr 05, 2012 |
|
Compiling Geometric Algebra Computations into Reconfigurable Hardware Accelerators |
Geometric Algebra (GA), a generalization of quaternions and complex numbers, is a very powerful framework for intuitively expressing and manipulating the complex geometric relationships common to engineering problems. However, actual processing of GA expressions is very compute intensive, and acceleration is generally required for practical use. GPUs and FPGAs offer such acceleration while requiring only low power per operation. In this paper, we present key components of a proof-of-concept compile flow combining symbolic and hardware optimization techniques to automatically generate hardware accelerators from the abstract GA descriptions that are suitable for high-performance embedded computing. Apr 05, 2012 |
|
Subspace-based face recognition on an FPGA |
We present a custom hardware system for image recognition, featuring a dimensionality reduction network and a classification stage. We use Bi-Directional PCA and Linear Discriminant Analysis for feature extraction, and classify based on Manhattan distances. Our FPGA-based implementation runs at 75 MHz, consumes 157.24 mW of power, and can classify a 61 x 49-pixel image in 143.7 μs, with a sustained throughput of more than 7,000 classifications per second. Compared to a software implementation on a workstation, our solution achieves the same classification performance (93.3% hit rate), with more than twice the throughput and more than an order of magnitude less power. Mar 05, 2012 |
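The classification stage can be illustrated with a nearest-template search under the Manhattan (L1) distance. The sketch below assumes that feature extraction has already produced a fixed-length vector per image and that each known class is represented by one stored template; the names `manhattan_distance`, `classify`, and the sample labels are hypothetical and not taken from the paper.

```python
def manhattan_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def classify(features, class_templates):
    """Return the label whose template is closest to `features` in L1 distance.

    class_templates maps a label to its stored feature vector (an assumption
    made for this sketch; the paper's storage scheme is not described here).
    """
    return min(class_templates,
               key=lambda label: manhattan_distance(features, class_templates[label]))


if __name__ == "__main__":
    templates = {"subject_a": [0, 0, 0], "subject_b": [10, 10, 10]}
    print(classify([1, 2, 0], templates))
```

The L1 distance needs only subtractions, absolute values, and additions, which is one plausible reason it suits a hardware implementation better than a Euclidean distance requiring multipliers.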
|
Long-Tail Behavior of Process Variation with Application to Domino Keeper Sizing |
Designers use Monte Carlo simulations to evaluate the impact of variability on circuits, but such simulations require prohibitive amounts of computation to characterize rare events. In this paper, we propose a method by which the long-tail behavior of circuits can be modeled with a reasonable number of simulations. This technique is then applied to the problem of domino keeper sizing to determine the sizing necessary to ensure a reliable circuit. We find that to ensure reliability for a commercial 45 nm process, the width of the keeper must be 0.17 times the effective width of the pull-down stack. Such a wide keeper results in a delay penalty of 9.9% compared to a circuit with no keeper. Mar 05, 2012 |
|
Parallel High-Radix Montgomery Multipliers |
This paper describes the algorithm and design tradeoffs for multiple hardware implementations of parallel high-radix scalable Montgomery multipliers. Hardware implementations of Montgomery multipliers require choosing a radix, shift direction, and whether to use Booth encoding. Presented are processing element designs exploring combinations of radices 2, 4, and 8, right vs. left shifting, and Booth encoding. A radix-4, left-shifting, non-Booth encoded design performs a 1024-bit modular exponentiation in 9.4 ms using 4997 LUTs and 4051 REGs and appears to maximize performance/hardware in an FPGA implementation. A Booth encoded version of the above multiplier performs a 1024-bit modular exponentiation in 13 ms using 4852 LUTs and 2887 REGs. This design may be beneficial for systems constrained by the cycle time of other elements because the design minimizes hardware usage and requires no precomputed multiples. The radix-8, right-shifting, Booth encoded design offers no performance/hardware advantage over a comparable radix-4 design. Mar 05, 2012 |
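For readers unfamiliar with the underlying operation, the sketch below shows a plain bit-serial (radix-2, right-shifting) Montgomery multiplication in software. It is a textbook formulation given only as background; it does not model the scalable, parallel processing elements, the higher radices, or the Booth encoding explored in the paper, and the function name and parameters are illustrative.

```python
def montgomery_multiply(a, b, modulus, n_bits):
    """Return a * b * 2**(-n_bits) mod modulus.

    Requires an odd modulus below 2**n_bits and operands a, b < modulus.
    """
    s = 0
    for i in range(n_bits):
        # Add b when the current multiplier bit is set.
        if (a >> i) & 1:
            s += b
        # Make the partial sum even by adding the modulus, then halve it.
        if s & 1:
            s += modulus
        s >>= 1
    # The loop keeps s below 2 * modulus, so one subtraction suffices.
    if s >= modulus:
        s -= modulus
    return s


if __name__ == "__main__":
    # With modulus 13 and n_bits 4, the result is 5 * 7 * 16^(-1) mod 13.
    print(montgomery_multiply(5, 7, 13, 4))
```

Each iteration needs only additions and a one-bit shift, with no trial division, which is what makes the method attractive for hardware. A higher radix processes several multiplier bits per iteration at the cost of selecting among more multiples.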
|
Voltage Scalable High-Speed Robust Hybrid Arithmetic Units Using Adaptive Clocking |
In this paper, we explore various arithmetic units for possible use in high-speed, high-yield ALUs operated at scaled supply voltage with adaptive clock stretching. We demonstrate that careful logic optimization of the existing arithmetic units (to create hybrid units) indeed makes them further amenable to supply voltage scaling. Such hybrid units result from mixing the right amount of fast arithmetic into the slower ones. Simulations on different hybrid adders and multipliers in BPTM 70 nm technology show 18%-50% improvements in power compared to standard adders with only a 2%-8% increase in die area at iso-yield. These optimized datapath units can be used to construct voltage scalable robust ALUs that can operate at high clock frequency with minimal performance degradation due to occasional clock stretching. Mar 05, 2012 |
|
Multiple-Parameter Side-Channel Analysis: A Non-Invasive Hardware Trojan Detection Approach |
Malicious alterations of integrated circuits during fabrication in untrusted foundries pose a major concern in terms of their reliable and trusted field operation. It is extremely difficult to discover such alterations, also referred to as “hardware Trojans”, using conventional structural or functional testing strategies. In this paper, we propose a novel non-invasive, multiple-parameter side-channel analysis based Trojan detection approach that is capable of detecting malicious hardware modifications in the presence of large process variation induced noise. We exploit the intrinsic relationship between dynamic current (IDDT) and maximum operating frequency (Fmax) of a circuit to distinguish the effect of a Trojan from process induced fluctuations in IDDT. We propose a vector generation approach for IDDT measurement that can improve the Trojan detection sensitivity for arbitrary Trojan instances. Simulation results with two large circuits, a 32-bit integer execution unit (IEU) and a 128-bit Advanced Encryption Standard (AES) cipher, show that a detection resolution of 0.04% can be achieved in the presence of ±20% parameter (Vth) variations. The approach is also validated with experimental results using 120 nm FPGA (Xilinx Virtex-II) chips. Mar 05, 2012 |
|
Dynamic Bit-Width Adaptation in DCT: An Approach to Trade Off Image Quality and Computation Energy |
This paper presents a dynamic bit-width adaptation scheme for applications using discrete cosine transform (DCT). The technique can efficiently trade off image quality and computation energy. Based on sensitivity differences of 64 DCT coefficients, separate operand bit-widths are used for different frequency components to reduce computation energy. To select the appropriate operand bit-widths that achieve significant reduction of power consumption with minimum image quality degradation, we also propose a bit-width selection algorithm. The proposed variable bit precision DCT algorithm can be efficiently implemented using carry save adder trees. The reconfigurable DCT architecture can achieve power savings ranging from 36% to 75% compared to normal operation at the expense of minor image quality degradation. Mar 05, 2012 |
|
An FPGA-based Real-Time Nonuniformity Correction System for Infrared Focal Plane Arrays |
Spatial and temporal nonuniformity in Infrared Focal Plane Arrays (IRFPA) severely degrades the quality of images obtained from modern infrared cameras. An efficient implementation of a nonuniformity correction algorithm is therefore necessary in real-time thermal-image visualization systems. This paper presents an FPGA-based implementation of the scene-based Constant Range algorithm for adaptive nonuniformity correction. The system processes an NTSC infrared video signal at 30 fps in real time and consumes only 157 mW of power. The performance of our system is currently limited by the input video Feb 10, 2012 |
|
WCET-driven Cache-aware Code Positioning |
Code positioning is a well-known compiler optimization aiming at the improvement of the instruction cache behavior. A contiguous mapping of code fragments in memory avoids overlapping of cache sets and thus decreases the number of cache conflict misses.
We present a novel cache-aware code positioning optimization driven by worst-case execution time (WCET) information. For this purpose, we introduce a formal cache model based on a conflict graph which is able to capture a broad class of cache architectures. This cache model is combined with a formal WCET timing model, resulting in a cache conflict graph weighted with WCET data. This conflict graph is then exploited by heuristics for code positioning of both basic blocks and entire functions.
Code positioning is able to decrease the accumulated cache misses for a total of 18 real-life benchmarks by 15.5% on average for an automotive processor featuring a 2-way set-associative cache. These cache miss reductions translate to average WCET reductions of 6.1%. For direct-mapped caches, even larger savings of 18.8% (cache misses) and 9.0% (WCET) were achieved. Feb 10, 2012 |
|
WCET-driven Branch Prediction aware Code Positioning |
In the past decades, embedded system designers moved from simple, predictable system designs towards complex systems equipped with caches, branch prediction units and speculative execution. This step was necessary in order to fulfill increasing requirements on computational power. Static analysis techniques considering such speculative units had to be developed to allow the estimation of an upper bound of the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems.
In this paper, we propose a WCET-driven branch prediction aware optimization which reorders basic blocks of a function in order to reduce the number of jump instructions and mispredicted branches. We employed a genetic algorithm which rearranges basic blocks in order to decrease the WCET of a program. This enables a first estimation of the possible optimization potential at the cost of high optimization runtimes. To avoid time-consuming repetitive WCET analyses, we developed a new algorithm employing integer-linear programming (ILP). The ILP models the worst-case execution path (WCEP) of a program and takes branch prediction effects into account. This algorithm enables short optimization runtimes at slightly decreased optimization results. In a case study, the genetic algorithm is able to reduce the benchmarks’ WCET by up to 24.7% whereas our ILP-based approach is able to decrease the WCET by up to 20.0%.
Feb 10, 2012 |
|
Optimized Communication Architecture of MPSoCs with a Hardware Scheduler: A System View |
With increasing complexity of MPSoCs, efficient runtime management of system resources becomes of vital importance for improving the system performance and energy efficiency. OSIP [1] – an operating system application-specific instruction-set processor – provides a promising solution to this. It delivers high computational performance to deal with dynamic task scheduling and mapping, while still being programmable. However, the distributed computation among the different processing elements introduces complexity to the communication architecture, which tends to become the bottleneck of such systems. In this work, we show a detailed analysis and optimization for the communication architecture of OSIP-based MPSoCs. In particular, the joint effects of OSIP and the communication architecture are investigated from the system point of view. Feb 10, 2012 |
|
NetStage/DPR: A Self-adaptable FPGA Platform for Application-Level Network Security |
Increasing transmission speeds in high-performance networks pose significant challenges to protecting the systems and networking infrastructure. Reconfigurable devices have already been used with great success to implement the lower levels of appropriate security measures (e.g., deep-packet inspection). We present a reconfigurable processing architecture capable of handling even application-level tasks, and also able to autonomously adapt itself to varying traffic patterns using dynamic partial reconfiguration. As a first use case, we examine the collection of malware by emulating an entire honeynet of potentially hundreds of thousands of hosts using a single-chip implementation of the architecture. Feb 10, 2012 |
|
Low Power Passive RFID Transponder Frontend Design for Implantable Biosensor Applications |
This paper presents a passive 13.56 MHz RFID transponder frontend design using 0.18 μm CMOS technology for implantable biosensor applications. Power is provided to the system through a dual-output full-wave rectifier that provides power at two different voltage levels: the low level to the transponder frontend to reduce its power consumption, and the high level to the biosensor to increase its dynamic range. The low-voltage operation of the frontend is supplemented further by a current-starved design, reducing its power consumption to a minimum and leaving most of the available power to the biosensor. The design is verified using HSPICE simulation, showing a maximum frontend power consumption of only 6.5 μW and leaving at least 88% of the available power for the biosensor’s operation. Feb 10, 2012 |
|
IMPACT: IMPrecise adders for low-power Approximate Computing |
Low power is an imperative requirement for portable multimedia devices employing various signal processing algorithms and architectures. In most multimedia applications, the final output is interpreted by human senses, which are not perfect. This fact obviates the need to produce exactly correct numerical outputs. Previous research in this context exploits error-resiliency primarily through voltage overscaling, utilizing algorithmic and architectural techniques to mitigate the resulting errors. In this paper, we propose logic complexity reduction as an alternative approach to take advantage of the relaxation of numerical accuracy. We demonstrate this concept by proposing various imprecise or approximate Full Adder (FA) cells with reduced complexity at the transistor level, and utilize them to design approximate multi-bit adders. In addition to the inherent reduction in switched capacitance, our techniques result in significantly shorter critical paths, enabling voltage scaling. We design architectures for video and image compression algorithms using the proposed approximate arithmetic units, and evaluate them to demonstrate the efficacy of our approach. Post-layout simulations indicate power savings of up to 60% and area savings of up to 37% with an insignificant loss in output quality, when compared to existing implementations. Jan 27, 2012 |
|
Fast and Compact Binary-to-BCD Conversion Circuits for Decimal Multiplication |
Decimal arithmetic has received considerable attention recently due to its suitability for many financial and commercial applications. In particular, numerous algorithms have been recently proposed for decimal multiplication. A major approach to decimal multiplication shaped by these proposals is based on performing the decimal digit-by-digit multiplication in binary, converting the binary partial product back to decimal, and then adding the decimal partial products as appropriate to form the final product in decimal. With this approach, the efficiency of binary-to-BCD partial product conversion is critical for the efficiency of the overall multiplication process. A recently proposed algorithm for this conversion is based on splitting the binary partial product into two parts (i.e., two groups of bits), and then computing the contributions of the two parts to the partial BCD result in parallel. This paper proposes two new algorithms (Three-Four split and Four-Three split) based on this principle. We present our proposed architectures that implement these algorithms and compare them to existing algorithms. The synthesis results show that the Three-Four split algorithm runs 15% faster and occupies 26.1% less area than the best performing equivalent circuit found in the literature. Furthermore, the Four-Three split algorithm occupies 37.5% less area than the state-of-the-art equivalent circuit. Jan 27, 2012 |
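The splitting principle can be sketched in software. The example below takes the 7-bit binary product of two decimal digits (0 to 81), splits it into an upper and a lower group of bits, converts each group's contribution to decimal independently, and then merges the two with a decimal addition. It assumes that "Three-Four split" means the upper three and lower four bits; that reading, the function name, and the merge step are illustrative assumptions rather than the circuits proposed in the paper.

```python
def split_binary_to_bcd(product):
    """Convert a digit-by-digit product (0..81) to (tens, units) by bit splitting."""
    if not 0 <= product <= 81:
        raise ValueError("expected the product of two decimal digits")
    high = product >> 4      # upper three bits, weight 16
    low = product & 0xF      # lower four bits, value 0..15
    # The two contributions are independent, so hardware could evaluate
    # them in parallel (for example with small lookup tables).
    high_tens, high_units = divmod(16 * high, 10)
    low_tens, low_units = divmod(low, 10)
    # Decimal merge: add the units, then propagate a carry into the tens.
    carry, units = divmod(high_units + low_units, 10)
    tens = high_tens + low_tens + carry
    return tens, units


if __name__ == "__main__":
    print(split_binary_to_bcd(9 * 7))
```

Splitting keeps each lookup small: the upper group has only six possible values for digit products, and the lower group has sixteen.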
|
Methodology for Local Resonant Clock Synthesis using LC-Assisted Local Clock Buffers |
Resonant clocking is a form of adiabatic clocking that retains much of the energy present in clock switching and recycles it into the following clock cycle. In this paper we present the first automated methodology using LC-assisted local clock buffers (LCLCB) for generating local resonant clocks. This uses a single-buffer, single-inductor sector topology applied to non-uniform trees as found in most ASIC designs. We show that this form of adiabatic clocking can achieve power savings of as much as 75% over traditional buffered clock networks. Jan 27, 2012 |
|
Shot-Noise-Induced Failure in Nanoscale Flip-Flops—Part I: Numerical Framework |
As CMOS technology continues the path of miniaturization, noise-induced fluctuations raise heightened reliability concerns. In previous work, an analytical framework based on Markov queueing theory and Poisson shot noise was presented to model the probabilistic behavior of a CMOS flip-flop operated in the subthreshold regime. In this paper, this model is extended to also account for the above-threshold shot noise, where the noise distribution is no longer Poissonian. The formulas for the time-dependent charging and discharging of node capacitors of a four-transistor flip-flop are derived for different regimes of operation characterized by distinct Fano factors. The statistics of electron arrival and departure at node capacitors are incorporated in an algebraic representation based on Markov queueing theory to map the effects of charge fluctuations on the logic stability of a flip-flop. This framework is used in Part II of this work to investigate failure in time for end-of-roadmap CMOS at the 10-nm gate-length scale. Jan 27, 2012 |
|
Variable-Latency Adder (VL-Adder) Designs for Low Power and NBTI Tolerance |
We propose a new adder design, called the Variable-Latency Adder (VL-adder). This technique allows the adder to work at a lower supply voltage than that required by a conventional adder, while maintaining the same throughput. The VL-adder design can be further modified to overcome the effects of Negative Bias Temperature Instability (NBTI) on circuit delay. By applying the VL-adder concept to a 64-bit carry-select adder design, an energy saving of more than 40% is obtained while a similar throughput is maintained. Jan 27, 2012 |
|
Approximating Pareto Optimal Compiler Optimization Sequences – a Trade-off between WCET, ACET and Code Size |
With the growing complexity of embedded systems software, high code quality can only be achieved using a compiler. Sophisticated compilers provide a vast spectrum of optimizations to improve code aggressively w.r.t. different objective functions, e.g. average-case execution time (ACET) or code size. Owing to the complex interactions between the optimizations, the choice of a promising sequence of code transformations is not trivial. Compiler developers address this problem by proposing standard optimization levels, e.g. O3 or Os. However, previous studies have shown that these standard levels often miss optimization potential or might even result in performance degradation. In this paper, we propose the first adaptive worst-case execution time (WCET)-aware compiler framework for an automatic search of compiler optimization sequences that yield highly optimized code. Besides the objective functions ACET and code size, we consider the WCET, which is a crucial parameter for real-time systems. To find suitable trade-offs between these objectives, stochastic evolutionary multi-objective algorithms identifying Pareto optimal solutions for the objectives (WCET, ACET) and (WCET, code size) are exploited. A comparison based on statistical performance assessments is performed that helps to determine the most suitable multi-objective optimizer. The effectiveness of our approach is demonstrated on real-life benchmarks showing that standard optimization levels can be significantly outperformed. Jan 12, 2012 |
|
WCET-aware Register Allocation based on Integer-Linear Programming |
Current compilers lack precise timing models guiding their built-in optimizations. Hence, compilers apply ad-hoc heuristics during optimization to improve code quality. One of the most important optimizations is register allocation. Many compilers heuristically decide when and where to spill a register to memory, without having a clear understanding of the impact of such spill code on a program’s runtime. This paper presents an integer-linear programming (ILP) based register allocator that uses precise worst-case execution time (WCET) models. Using this WCET timing data, the compiler avoids spill code generation along the critical path defining a program’s WCET. To the best of our knowledge, this paper is the first one to present a WCET-aware ILP-based register allocator. Our results underline the effectiveness of the proposed techniques. For a total of 55 realistic benchmarks, we reduced WCETs by 20.2% on average and ACETs by 14%, compared to a standard graph coloring allocator. Furthermore, our ILP-based register allocator outperforms a WCET-aware graph coloring allocator by more than a factor of two for the considered benchmarks, while requiring less runtime. Jan 12, 2012 |
|