This talk explores cutting-edge approaches to neural network compression, motivated by the growing need to deploy deep learning models efficiently in resource-constrained environments. The presentation provides a comprehensive overview of three primary compression paradigms: compact architecture design, pruning, and quantization. Special attention is given to sparsification, examining both structured and unstructured pruning and their impact on model efficiency. The discussion covers the theoretical foundations and practical implementation of these techniques, along with the trade-offs among model size, computational cost, and accuracy.
George Retsinas
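
As a concrete illustration of the structured vs. unstructured pruning distinction mentioned in the abstract, the following is a minimal PyTorch sketch; it is not taken from the talk, and the layer sizes, sparsity levels, and helper names are illustrative assumptions.

```python
# Sketch contrasting unstructured (fine-grained) and structured (channel) pruning.
# All sizes, ratios, and function names are illustrative assumptions.
import torch
import torch.nn as nn


def unstructured_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude individual weights (irregular sparsity)."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask


def structured_prune_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Remove whole output channels with the smallest L2 norm (hardware-friendly)."""
    norms = conv.weight.detach().flatten(1).norm(dim=1)   # one norm per output channel
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep = norms.topk(n_keep).indices.sort().values       # surviving channel indices
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()
    return pruned


if __name__ == "__main__":
    conv = nn.Conv2d(16, 32, 3)
    # Unstructured: same layer shape, half the weights set to zero.
    conv.weight.data = unstructured_prune(conv.weight.data, sparsity=0.5)
    print("fraction of zero weights:", (conv.weight == 0).float().mean().item())
    # Structured: a smaller dense layer with half the output channels removed.
    smaller = structured_prune_channels(conv, keep_ratio=0.5)
    print("output channels:", conv.out_channels, "->", smaller.out_channels)
```

The sketch highlights the trade-off discussed in the talk: unstructured pruning leaves the tensor shape unchanged and needs sparse kernels to yield speedups, whereas structured pruning produces a genuinely smaller dense layer (at the cost of also adjusting the next layer's input channels in a full network).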