Full-stack Optimization for Accelerating CNNs with FPGA Validation

Abstract

We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate array (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization, and inference accuracy. As a validation vehicle, we have implemented a 170 MHz FPGA inference chip achieving 2.28 ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature, at comparable accuracy. However, our chip shines in that it has 9x higher energy efficiency compared to other implementations of comparable latency. A highlight of our full-stack approach, which contributes to the achieved high energy efficiency, is an efficient Selector-Accumulator (SAC) architecture for implementing the multiplier-accumulator (MAC) operation present in any digital CNN hardware. For instance, compared to an FPGA implementation of a traditional 8-bit MAC, SAC substantially reduces the required hardware resources (4.85x fewer look-up tables) and power consumption (2.48x).
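To make the SAC idea concrete, below is a minimal Python sketch (our own illustration under an assumed weight encoding, not the chip's RTL). When a weight is constrained to a power of two, w = ±2^e, the multiplication in a MAC reduces to selecting a shifted copy of the input, leaving only an addition; the function name `sac` and the (sign, exponent) encoding are assumptions made here for illustration.

```python
# Minimal sketch: with a power-of-two weight w = sign * 2**exp,
# the multiply in "acc += w * x" becomes a bit shift (a "selection"
# of a shifted copy of x), so the datapath needs no multiplier.

def sac(acc: int, x: int, sign: int, exp: int) -> int:
    """One Selector-Accumulator step: acc += sign * (x << exp)."""
    shifted = x << exp                    # the selection: x shifted by exp
    return acc + shifted if sign > 0 else acc - shifted

# Example: dot product of x with power-of-two weights [+2, -1, +4]
xs, signs, exps = [3, 5, 7], [+1, -1, +1], [1, 0, 2]
acc = 0
for x, s, e in zip(xs, signs, exps):
    acc = sac(acc, x, s, e)
assert acc == 2 * 3 - 1 * 5 + 4 * 7       # 29, same result as a MAC
```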

Publication
The 33rd ACM International Conference on Supercomputing (ICS 2019)

As a validation vehicle, we have implemented a 170 MHz FPGA inference chip achieving 2.28 ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature at comparable accuracy, and the chip delivers 9x higher energy efficiency than other implementations of comparable latency. Highlights of our system are:

  • Power-of-two and low-bit weights
  • Low-bit activations
  • Zero skipping to reduce I/O and computation
  • Efficient ResNet-like architecture with no shortcut connections
  • Shift operation [1] for convolution layers
  • Column combining [2] for efficient matrix multiplication (see the sketch after this list)
  • Automatically generated instructions
  • Systolic array [3] architecture (a toy dataflow model appears under Hardware implementation below)
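Below is a minimal Python sketch of the intuition behind column combining [2]: after pruning, several sparse columns of a weight matrix whose nonzeros land in distinct rows can be packed into a single physical column of the systolic array. The greedy grouping and the hard no-conflict rule are simplifying assumptions made here for illustration; the paper instead resolves conflicts by pruning small weights during joint training.

```python
import numpy as np

def column_combine(W, group_size=4):
    """Greedily group columns of sparse W so that, within a group,
    nonzeros occupy distinct rows; each group can then share one
    systolic-array column. Returns the list of column-index groups."""
    n_rows, n_cols = W.shape
    groups, used = [], set()
    for c in range(n_cols):
        if c in used:
            continue
        group = [c]
        occupied = set(np.nonzero(W[:, c])[0])   # rows already taken
        used.add(c)
        for c2 in range(c + 1, n_cols):
            if len(group) == group_size:
                break
            if c2 in used:
                continue
            rows2 = set(np.nonzero(W[:, c2])[0])
            if not (occupied & rows2):           # no row conflict: combine
                group.append(c2)
                occupied |= rows2
                used.add(c2)
        groups.append(group)
    return groups

# Example: a highly sparse 8x8 weight matrix packs into fewer physical columns
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * (rng.random((8, 8)) < 0.25)
print(column_combine(W))
```

Fewer physical columns for the same logical matrix means higher utilization of the array when running sparse CNNs.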

[Figure: Hardware implementation]
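As a rough mental model of the dataflow (an illustrative assumption, not the actual microarchitecture of our chip), the sketch below simulates a weight-stationary systolic array computing y = Wx: each cell holds one weight, inputs are skewed so that column j's value reaches row i at cycle i + j, and every cell performs one MAC (or SAC, when weights are powers of two) per cycle.

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-skew model of a weight-stationary systolic array."""
    n_rows, n_cols = W.shape
    acc = np.zeros(n_rows)
    for t in range(n_rows + n_cols - 1):   # pipeline drains in n_rows+n_cols-1 cycles
        for i in range(n_rows):
            j = t - i                      # input element reaching row i this cycle
            if 0 <= j < n_cols:
                acc[i] += W[i, j] * x[j]   # one MAC per cell per cycle
    return acc

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(systolic_matvec(W, x), W @ x)
```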


  1. Wu, Bichen, et al. “Shift: A zero FLOP, zero parameter alternative to spatial convolutions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
  2. Kung, H. T., Bradley McDanel, and Sai Qian Zhang. “Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization.” ASPLOS. 2019.
  3. Kung, H. T., and Charles E. Leiserson. “Systolic arrays (for VLSI).” Sparse Matrix Proceedings 1978, Vol. 1. Society for Industrial and Applied Mathematics, 1979.