The past few years have seen a surge of deep learning (DL) innovations. These innovations range from DL models to software stack optimizations (e.g., frameworks such as MXNet or PyTorch, libraries such as cuDNN or MKL-DNN) and hardware stack improvements (e.g., CPU, GPU, FPGA). Among all the innovations, however, DL models are the most rapidly evolving and prolific. This is true in both academia and industry, where models are tweaked and introduced on a weekly, daily, or even hourly basis.

Both industry and academia have invested heavily in developing benchmarks to characterize DL models and systems. Characterization is followed by optimizations to improve model performance. However, there is currently a gap between the benchmarking results and the optimizations that could be performed. Researchers use profilers, such as nvprof, Nsight, and VTune, to collect low-level GPU and CPU information. Drawing on detailed knowledge of how models execute and utilize system resources, researchers then manually identify bottlenecks and inefficiencies within model execution from the profiler output. They form hypotheses about possible solutions and try out different ideas to optimize the model execution, which may or may not pan out. This manual and ad-hoc process requires a lot of effort and expertise and slows down the turnaround time for model optimization and system tuning.

Thus there is a need for a systematic DL benchmarking and subsequent analysis design that can guide researchers to potential optimization opportunities and assess hypothetical execution scenarios. On GPUs, model execution latency is determined by the hardware, framework, and system libraries (primarily cuDNN and cuBLAS for DL), so researchers would like answers to the following questions: What is the potential latency speedup if optimizations are performed? Are independent layers executed in parallel? Are convolution layers using the optimal convolution algorithms? Are there any inefficiencies or unexpected behavior in a framework? Does the execution fuse layers or leverage Tensor Cores, and what are the benefits? We motivate our design by answering these six questions, while ensuring the sustainability and extensibility of the design.

To answer these questions, we first propose a new benchmarking metric: “lower-bound” latency. The “lower-bound” latency estimates the ideal latency of a DL model given a software and hardware stack, and is based on the following observations: (1) DL models are executed as layers in frameworks and thus layers form the performance building blocks of DL models. (2) Frameworks delegate execution of common layers to either cuDNN or cuBLAS. The “lower-bound” latency is defined in terms of the latencies of the cuDNN and cuBLAS API functions invoked by model layers (framework overhead and memory transfers are ignored). We refine the “lower-bound” latency and define it under sequential execution mode (all layers are executed sequentially) and parallel execution mode (data-independent layers are executed asynchronously).
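
As a minimal sketch of how the two execution modes differ, the snippet below computes both bounds from per-layer library-call latencies. It assumes that the sequential bound is the sum of layer latencies and that the parallel bound is the longest (critical) path through the layer dependency graph; the layer names, latency values, and function names are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def sequential_lower_bound(latency_ms):
    """Sequential mode: layers run one after another, so latencies add up."""
    return sum(latency_ms.values())


def parallel_lower_bound(latency_ms, deps):
    """Parallel mode (assumed): latency of the longest path in the layer graph.

    deps maps a layer to the layers whose outputs it consumes.
    """
    graph = {layer: tuple(deps.get(layer, ())) for layer in latency_ms}
    finish = {}
    # static_order() yields every layer after all of its dependencies.
    for layer in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[layer]), default=0.0)
        finish[layer] = start + latency_ms[layer]
    return max(finish.values(), default=0.0)


# Hypothetical latencies (ms) of the cuDNN/cuBLAS calls behind four layers.
latency_ms = {"conv_a": 2.0, "conv_b": 3.0, "conv_c": 1.5, "add": 0.5}
# conv_b and conv_c both read conv_a's output and are independent of each other.
deps = {"conv_b": ["conv_a"], "conv_c": ["conv_a"], "add": ["conv_b", "conv_c"]}

seq = sequential_lower_bound(latency_ms)
par = parallel_lower_bound(latency_ms, deps)
print(f"sequential: {seq} ms, parallel: {par} ms")
```

In this toy graph the two independent convolutions overlap in parallel mode, so only the slower of the two contributes to the bound.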

This paper presents Benanza (pronounced bonanza), a sustainable and extensible benchmarking and analysis design. Benanza consists of a set of modular components: (1) a model processor to process input ONNX models into a set of unique layers (layers are considered the same if they have the same layer type, shape, and parameters), (2) a benchmark generator to automatically generate parameterized cuDNN and cuBLAS micro-benchmarks from the unique layers, (3) a performance database to store historical benchmark results, and (4) an analyzer to compute the “lower-bound” latency of DL models and inform potential optimizations.
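
The layer-uniqueness rule used by the model processor can be illustrated with a short sketch. It assumes layers have already been parsed into plain dictionaries; the field names (`op_type`, `input_shape`, `params`) and the helper names are hypothetical and do not reflect the actual implementation or the ONNX API.

```python
from collections import Counter
from typing import NamedTuple


class LayerKey(NamedTuple):
    """Identity of a layer: same type, shape, and parameters means same layer."""
    op_type: str
    input_shape: tuple
    params: tuple  # sorted (name, value) pairs, so ordering does not matter


def _freeze(value):
    """Turn (nested) lists into tuples so the value can be hashed."""
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    return value


def layer_key(layer):
    params = layer.get("params", {})
    return LayerKey(
        layer["op_type"],
        _freeze(layer["input_shape"]),
        tuple(sorted((name, _freeze(val)) for name, val in params.items())),
    )


def unique_layers(layers):
    """Map each unique layer to the number of times it occurs."""
    return Counter(layer_key(layer) for layer in layers)


# Hypothetical parsed layers: the two convolutions are identical.
layers = [
    {"op_type": "Conv", "input_shape": [1, 64, 56, 56],
     "params": {"kernel_shape": [3, 3], "strides": [1, 1]}},
    {"op_type": "Relu", "input_shape": [1, 64, 56, 56]},
    {"op_type": "Conv", "input_shape": [1, 64, 56, 56],
     "params": {"strides": [1, 1], "kernel_shape": [3, 3]}},
]

counts = unique_layers(layers)
print(len(layers), "layers,", len(counts), "unique")
```

Only the unique keys would need a generated micro-benchmark; the occurrence counts can be reused later when per-layer latencies are aggregated for a whole model.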

Benanza is architected to be sustainable. The benchmarking workflow of Benanza is highly automated and minimizes the benchmark development and maintenance effort. Benanza exploits the observation that layers are repeated (i.e., non-unique) within and across DL models to decrease the time to benchmark. When a new model is introduced, only the layers that have not yet been benchmarked (those not in the performance database) need to be benchmarked. Although the focus of the paper is on NVIDIA GPUs using cuDNN and cuBLAS, the proposed design is extensible, and users can incorporate other benchmark runtimes that target other software libraries or hardware, such as framework APIs or MKL-DNN for CPUs.
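
The incremental workflow described above can be sketched as a simple lookup against the performance database. The sketch assumes the database behaves like a dictionary from a hashable layer key to a measured latency for one fixed system; the keys, the latency values, and the `run_benchmark` callback are hypothetical placeholders rather than the actual interface.

```python
def layers_to_benchmark(model_layers, perf_db):
    """Return the layer keys of a model that have no stored result yet."""
    missing, seen = [], set()
    for key in model_layers:
        if key not in perf_db and key not in seen:
            seen.add(key)  # a layer repeated within the model is run once
            missing.append(key)
    return missing


def update_database(model_layers, perf_db, run_benchmark):
    """Benchmark only the missing layers and record their latencies."""
    for key in layers_to_benchmark(model_layers, perf_db):
        perf_db[key] = run_benchmark(key)
    return perf_db


# Hypothetical database with one convolution already measured (latency in ms).
conv = ("Conv", (1, 64, 56, 56))
relu = ("Relu", (1, 64, 56, 56))
perf_db = {conv: 0.8}

calls = []


def fake_benchmark(key):
    """Stand-in for a generated micro-benchmark; records what was run."""
    calls.append(key)
    return 0.1


new_model = [conv, relu, conv, relu]
update_database(new_model, perf_db, fake_benchmark)
print("benchmarked:", calls)
```

Here the new model contains four layers, but only the ReLU is run, because the convolution is already in the database and the repeated ReLU is deduplicated.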

In summary, this paper makes the following contributions:

  • We propose a “lower-bound” latency metric for DL models based on the observation that the latency of a DL model is bounded by the latencies of the cuDNN and cuBLAS API calls invoked by the model layers. This metric estimates the ideal latency of a model given a specific GPU hardware and software stack.

  • We present Benanza, a benchmarking and analysis design that automatically generates micro-benchmarks given a set of models, computes their “lower-bound” latencies using the benchmark data, and informs optimizations of their executions on GPUs. The sustainable and extensible design of Benanza lets it keep pace with the fast evolution of DL innovations.

  • Using Benanza, we characterized the “lower-bound” latencies of 30 ONNX models in MXNet, ONNX Runtime, and PyTorch on 7 systems. We performed a comprehensive “lower-bound” latency analysis as we varied the model, execution mode, batch size, and system. For example, under parallel execution mode, MXNet could achieve a latency speedup of up to 2.87× (with a geometric mean of 1.32× across models) using batch size 1 on the Tesla_V100 system.

  • We further identified optimization opportunities through Benanza in cuDNN convolution algorithm selection (up to 1.32× geometric mean speedup across models), inefficiencies within MXNet (up to 1.15× speedup across models) and PyTorch (up to 2.3× speedup using batch size 1), layer fusion, and Tensor Cores (up to 1.09× and 1.72× speedup for ResNet50-v1, respectively). We evaluated the above optimizations jointly and obtained up to 1.95× speedup for ResNet50-v1 across systems and batch sizes.