A Survey of Model Compression and Acceleration for Deep Neural Networks
Techniques for compacting and accelerating CNNs fall roughly into four schemes:
1. Parameter pruning and sharing: explore the redundancy in the model parameters and remove redundant or uncritical ones (e.g. via vector quantization, binary coding, sparse constraints); see the pruning sketch after this list
a. model quantization and binarization; b. parameter sharing; c. structural matrix
2. Low-rank factorization: use matrix/tensor decomposition to estimate the informative parameters of deep CNNs
3. Transferred/compact convolutional filters: design special structural convolutional filters to reduce storage and computational complexity
4. Knowledge distillation: learn a distilled model by training a more compact neural network to reproduce the output of a larger network; a sketch of the distillation loss follows this list
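As a rough illustration of scheme 1, the sketch below zeroes out the smallest-magnitude weights of a toy layer. The helper `magnitude_prune` and the 50% sparsity target are assumptions made for this example, not part of the survey; real pruning pipelines normally fine-tune the network after pruning.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude entries of a weight array.
    Hypothetical helper for illustration only."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Toy fully connected layer: prune half of its weights.
w = np.random.randn(256, 128).astype(np.float32)
w_pruned, mask = magnitude_prune(w, sparsity=0.5)
print("fraction of weights kept:", mask.mean())
```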
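For scheme 4, here is a minimal sketch of the soft-target distillation loss, assuming the usual temperature-softened softmax and a KL-divergence term; the toy logits and the temperature T=4 are invented for the example.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student
    distributions (the soft-target term of knowledge distillation)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return (T ** 2) * kl.mean()  # T**2 keeps gradient scale comparable across temperatures

teacher_logits = np.random.randn(8, 10)  # toy logits from the large teacher network
student_logits = np.random.randn(8, 10)  # toy logits from the compact student network
print("soft-target loss:", distillation_loss(student_logits, teacher_logits))
```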
It makes sense to combine two or three of these schemes to maximize the compression/speedup rate. For some applications, such as object detection, which require both convolutional and fully connected layers, the convolutional layers can be compressed with low-rank factorization and the fully connected layers with a pruning method, as sketched below.
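As a back-of-the-envelope sketch of the low-rank idea (scheme 2), the snippet below factorizes a fully connected weight matrix with a truncated SVD. The matrix size and rank are arbitrary assumptions; trained weights usually have a faster-decaying spectrum than this random example, so the approximation error printed here overstates what happens in practice.

```python
import numpy as np

# Toy fully connected weight matrix W (1024 x 512).
W = np.random.randn(1024, 512).astype(np.float32)

# Truncated SVD: W ~= A @ B with A (1024 x k) and B (k x 512), k << min(m, n).
k = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]
B = Vt[:k, :]

original_params = W.size            # 524,288
factored_params = A.size + B.size   # 98,304 -> ~5.3x fewer parameters
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(original_params, factored_params, rel_error)
```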
DNNDK does not yet publicly support TensorFlow models.
TensorFlow Lite and the list of supported models
TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size.
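A minimal conversion sketch, assuming a TF 2.x environment: the toy Keras model is made up, but `tf.lite.TFLiteConverter.from_keras_model` and `tf.lite.Optimize.DEFAULT` (which enables post-training dynamic-range quantization) are the standard converter entry points.

```python
import tensorflow as tf

# Toy Keras model standing in for a trained network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert to a TensorFlow Lite flatbuffer for on-device inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training dynamic-range quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```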