We present a full-stack optimization framework for accelerating inference of CNNs (Convolutional Neural Networks) and validate the approach with field-programmable gate array (FPGA) implementations. By jointly optimizing CNN models, computing architectures, and hardware implementations, our full-stack approach achieves unprecedented performance in the trade-off space characterized by inference latency, energy efficiency, hardware utilization, and inference accuracy. As a validation vehicle, we have implemented a 170MHz FPGA inference chip achieving 2.28ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature at comparable accuracy. Moreover, our chip delivers 9x higher energy efficiency than other implementations with comparable latency. A highlight of our full-stack approach, which contributes to the achieved high energy efficiency, is an efficient Selector-Accumulator (SAC) architecture for implementing the multiplier-accumulator (MAC) operation present in any digital CNN hardware. For instance, compared to an FPGA implementation of a traditional 8-bit MAC, SAC substantially reduces the required hardware resources (4.85x fewer Look-up Tables) and power consumption (2.48x lower).
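The abstract does not spell out how SAC removes the multiplier; the minimal sketch below assumes the common approach of encoding each weight as a sign and a power-of-two shift, so the "multiply" reduces to selecting a shifted copy of the input before accumulation. The function names and the weight encoding are illustrative assumptions, not the authors' exact design.

```c
/* Sketch: conventional MAC vs. a selector-accumulator (SAC) style update.
 * Assumption (not stated in the abstract): weights are restricted to
 * signed powers of two, so the multiplier is replaced by a shift select. */
#include <stdint.h>
#include <stdio.h>

/* Conventional 8-bit MAC: full multiply, then accumulate. */
static int32_t mac(int32_t acc, int8_t x, int8_t w) {
    return acc + (int32_t)x * (int32_t)w;
}

/* SAC-style update: weight encoded as (sign, shift); the update is a
 * shifted selection of the input plus an add -- no hardware multiplier. */
static int32_t sac(int32_t acc, int8_t x, int sign, unsigned shift) {
    int32_t selected = (int32_t)x << shift;   /* select shifted input */
    return (sign >= 0) ? acc + selected : acc - selected;
}

int main(void) {
    /* With w = -4, i.e. sign = -1 and shift = 2, both updates agree. */
    printf("MAC: %d\n", mac(0, 10, -4));      /* prints -40 */
    printf("SAC: %d\n", sac(0, 10, -1, 2));   /* prints -40 */
    return 0;
}
```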
As a validation vehicle, we have implemented a 170MHz FPGA inference chip achieving 2.28ms latency for the ImageNet benchmark. The achieved latency is among the lowest reported in the literature at comparable accuracy, and our chip delivers 9x higher energy efficiency than other implementations with comparable latency. Highlights of our system are: