The Nanoscale Integrated Circuits and System Lab, Energy Efficient Computing Group (NICS-EFC) in the Department of Electronic Engineering at Tsinghua University is led by Professor Yu Wang. The Efficient Algorithm Team (EffAlg) in the NICS-EFC group is led by Research Assistant Professor Xuefei Ning. Our team maintains in-depth academic collaborations with Infinigence-AI and with researchers from many institutions, including SJTU, MSR, and HKU.
Our current research primarily focuses on efficient deep learning, including algorithm-level acceleration, model-level compression, model architecture design, system co-optimization, and other techniques. Our work targets several application domains, including language generative models (i.e., LLMs), vision generative models, and vision understanding models. Most of our projects are open-sourced under the thu-nics GitHub organization (most efficient-DL projects) or the imagination-research GitHub organization (some efficient-DL projects and projects on broader topics; these are research efforts co-led with Dr. Zinan Lin from MSR).
Our group welcomes all kinds of collaboration and is continuously recruiting visiting students and engineers interested in efficient deep learning. If you are interested in collaboration or visiting-student opportunities, email Xuefei or Prof. Yu Wang.
News
2024/12/05
Won 2nd place in both the Model Compression Track and the Training From Scratch Track of the NeurIPS 2024 Edge-Device LLM Competition.
2024/12/04
Zhihang gave an invited talk on DiTFastAttn at Jiangmen TechBeat.
2024/12/03
Gave an invited talk on our efficient AIGC research at VIVO.
2024/11/23
Gave an invited talk on our efficient AIGC research at UESTC.
2024/11/20
Gave an invited talk introducing our EffAlg team and our recent research at AI Time.
2024/11/06
Will serve as a TPC member for DAC 2025's AI Track.
Competition Awards
2024 NeurIPS Edge-Device Large Language Model Competition, Model Compression Track: 2nd place
2024 NeurIPS Edge-Device Large Language Model Competition, Training From Scratch Track: 2nd place
2020 CVPR Low-Power CV Challenge: 3rd place
2018 NeurIPS Adversarial Robustness Competition, Model Track: 2nd place
Efficient DL Projects
Technique
Target
Domain
Publishing House of Electronics Industry 2024
(Chinese Book) Efficient Deep Learning: Model Compression and Design 《高效深度学习:模型压缩与设计》 (available on JD.com)
Model-level | Efficient Inference | Vision Recognition, Vision Generation, Language
ArXiv 2024
Distilling Auto-regressive Models into Few Steps 1: Image Generation
Algorithm-level | Efficient Inference | Vision Generation
ArXiv 2024
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Algorithm-level | Efficient Inference | Vision Generation
Paper
NeurIPS 2024
Rad-NeRF: Ray-decoupled Training of Neural Radiance Field
Algorithm-level | Efficient Inference | 3D Modeling
Paper
Code
Video
NeurIPS 2024
Can LLMs Learn by Teaching for Better Reasoning? A Preliminary Study
Algorithm-level | Better Reasoning | Language
Paper
Code
Website
Video
This study explores whether current LLMs can learn by teaching (LbT), a well-recognized paradigm in human learning. As one can imagine, the LbT ability could offer exciting opportunities for models to continuously evolve by teaching other (potentially weaker) models. We implement the LbT idea in well-established pipelines to see whether it improves reasoning outcomes and ability on complex tasks (e.g., mathematical reasoning and competition-level code synthesis). The results show promise, and, importantly, we share our thoughts on the research rationale and roadmap in detail.
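As a concrete illustration, here is a minimal sketch of one way an LbT-style scoring loop can be instantiated: the teacher's candidate rationales are ranked by how accurately a student, given each rationale as an in-context demonstration, answers held-out exam questions. The `teacher`/`student` callables and the exam format are hypothetical stand-ins; the paper's actual pipelines are more elaborate.

```python
from typing import Callable, List, Tuple

def lbt_select(
    teacher: Callable[[str], str],                    # problem -> rationale + answer
    student: Callable[[str, Tuple[str, str]], str],   # (question, demo) -> answer
    problem: str,
    exam: List[Tuple[str, str]],                      # (question, gold answer) pairs
    n_candidates: int = 8,
) -> str:
    """Return the teacher rationale whose 'teaching' helps the student most."""
    best_score, best_rationale = -1.0, ""
    for _ in range(n_candidates):
        rationale = teacher(problem)                  # sample one teaching attempt
        correct = sum(
            student(q, (problem, rationale)) == gold for q, gold in exam
        )
        score = correct / len(exam)                   # LbT score: student exam accuracy
        if score > best_score:
            best_score, best_rationale = score, rationale
    return best_rationale
```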
ArXiv 2024
Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs
Model-level (Pruning) | Efficient Inference | Language
Paper
Code
ArXiv 2024
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
Model-level (Sparsification) | Efficient Inference | Language
Paper
Code
Website
Mixture of Sparse Attention (MoA) addresses the computational and memory challenges of long-context LLM inference. It proposes an automatic compression pipeline that assigns an optimal heterogeneous, elastic sparse attention pattern to each attention head. MoA reduces GPU memory by 1.2-1.4x and boosts decode throughput by 6.6-8.2x over FlashAttention2 and 1.7-1.9x over vLLM, with minimal impact on performance.
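For intuition, the sketch below builds one causal sliding-window attention mask per head, with a different window size for each head; the window sizes are made-up placeholders, and the automatic pipeline that selects and elastically scales them per head is described in the paper.

```python
import numpy as np

def moa_style_masks(seq_len: int, window_per_head) -> np.ndarray:
    """Build one causal sliding-window attention mask per head.

    Simplified view of MoA's key idea: instead of a single uniform sparse
    pattern, each head gets its own window size (and, in the paper, the
    window can grow elastically with input length).
    """
    masks = np.zeros((len(window_per_head), seq_len, seq_len), dtype=bool)
    for h, w in enumerate(window_per_head):
        for i in range(seq_len):
            lo = max(0, i - w + 1)        # each query attends to its last w keys
            masks[h, i, lo:i + 1] = True  # causal: never attend to the future
    return masks

# e.g. an 8-head layer mixing local and near-global heads (placeholder sizes)
masks = moa_style_masks(1024, [64, 64, 128, 128, 256, 256, 512, 1024])
```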
NeurIPS 2024
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Model-level (Sparsification), Model-level (Structure Optimization) | Efficient Inference | Vision Generation
Paper
Code
Website
Video
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention. We identify three types of redundancy in DiT and propose DiTFastAttn, a post-training compression method, to reduce them. Our method removes up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup for high-resolution (2k x 2k) generation.
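As a rough illustration of one of these redundancy types (similarity of attention inputs across adjacent diffusion steps), the sketch below reuses a cached attention output when a layer's input barely changes between steps. The `attn_fn` callable, the runtime drift test, and the threshold are our own simplifications; DiTFastAttn itself decides where to skip or share via a post-training compression plan and also exploits spatial-window and CFG redundancy.

```python
import torch

class SharedAttnCache:
    """Simplified cross-step attention reuse (not the paper's exact method).

    `attn_fn` stands in for a DiT layer's full self-attention computation.
    """

    def __init__(self, attn_fn, tol: float = 5e-2):
        self.attn_fn = attn_fn
        self.tol = tol              # hypothetical reuse threshold
        self.last_x = None
        self.last_out = None

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        if self.last_x is not None:
            drift = (x - self.last_x).norm() / self.last_x.norm()
            if drift < self.tol:    # input nearly unchanged since last step
                return self.last_out  # skip the quadratic attention entirely
        out = self.attn_fn(x)
        self.last_x, self.last_out = x.detach(), out.detach()
        return out
```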
ArXiv 2024
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
Model-level (Quantization) | Efficient Inference | Vision Generation
Paper
Code
Website
This paper investigates the quantization of diffusion transformers, conducting a systematic analysis of the sources of quantization error. It designs a static-dynamic channel-balancing technique to address the time-varying channel imbalance problem, and adopts a metric-decoupled mixed-precision approach to handle failures at lower bitwidths (W4A8, W4A4). The method achieves lossless W4A8 quantization for a variety of popular text-to-image and text-to-video models.
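To give a flavor of channel balancing, here is a minimal SmoothQuant-style sketch that migrates activation outliers into the weights through a per-channel scale; ViDiT-Q's actual technique additionally handles the time-varying (dynamic) part of the imbalance, which is omitted here, and `alpha` is a hypothetical migration-strength knob.

```python
import torch

def channel_balance(act: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Move activation outlier magnitude into the weights, channel by channel.

    For a linear layer, (act / s) @ (s * W).T == act @ W.T exactly, but the
    rescaled activations have far milder per-channel ranges, so both tensors
    quantize more easily afterwards.  act: (..., C_in), weight: (C_out, C_in).
    """
    a_max = act.abs().amax(dim=tuple(range(act.dim() - 1)))  # per input channel
    w_max = weight.abs().amax(dim=0)
    s = (a_max ** alpha) / (w_max ** (1 - alpha))            # balancing scale
    return act / s, weight * s   # quantize these two tensors afterwards
```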
ArXiv 2024
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
Model-level (Structure Optimization) | Efficient Inference | Vision Generation
Paper
Code
ArXiv 2024
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better
Algorithm-level | Efficient Training, Efficient Inference | Vision Generation
Paper
Code
ECCV 2024
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization
Model-level (Quantization) | Efficient Inference | Vision Generation
Paper
Code
Website
Video
This paper addresses the challenge that existing quantization techniques face when quantizing "few-step" diffusion models. It introduces a BOS-aware quantization technique to handle the highly sensitive text-embedding-related layers, and proposes a novel metric-decoupled sensitivity analysis that separates the effects of quantization on image quality and image content. MixDQ achieves lossless W4A8 quantization for the challenging one-step SDXL-Turbo model, where existing methods fall short even at W8A8.
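As an illustration of the mixed-precision side, the sketch below greedily assigns 4-bit or 8-bit weights from per-layer sensitivity scores under an average-bit budget. The data format and the greedy rule are our own assumptions for illustration; MixDQ's metric-decoupled analysis scores content and quality degradation separately before allocating precision.

```python
def allocate_bits(sensitivity: dict, budget_bits: float) -> dict:
    """Greedy mixed-precision allocation from a (hypothetical) sensitivity table.

    `sensitivity[layer][bits]` = degradation when `layer` is quantized to
    `bits`. Start everything at 8 bits, then repeatedly drop to 4 bits the
    layer whose move hurts least, until the average bit budget is met.
    """
    bits = {layer: 8 for layer in sensitivity}
    while sum(bits.values()) / len(bits) > budget_bits:
        candidates = {
            layer: sensitivity[layer][4] - sensitivity[layer][8]
            for layer in bits if bits[layer] == 8
        }
        if not candidates:
            break
        victim = min(candidates, key=candidates.get)  # cheapest layer to shrink
        bits[victim] = 4
    return bits

# e.g. two layers where only `ffn` tolerates 4-bit weights (toy numbers)
plan = allocate_bits(
    {"attn": {8: 0.01, 4: 0.30}, "ffn": {8: 0.01, 4: 0.02}}, budget_bits=6
)  # -> {"attn": 8, "ffn": 4}
```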
ArXiv 2024
A Survey on Efficient Inference for Large Language Models
Survey
| Efficient Inference | Language
Paper
ArXiv 2024
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
Benchmark, Evaluation
| |
Paper
Code
Website
ICCAD 2024
Towards Floating Point-Based Attention-Free LLM: Hybrid PIM with Non-Uniform Data Format and Reduced Multiplications
System-level | Efficient Inference | Language
Paper
FPGA 2024
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs
System-level, Model-level (Quantization), Model-level (Sparsification) | Efficient Inference | Language
Paper
Evaluating Quantized Large Language Models
Evaluation
Model-level (Quantization) | Efficient Inference | Language
Paper
Code
Video
We evaluate 11 LLM families under weight-only (W), weight-activation (WA), and KV-cache (KV) quantization on various tasks. Based on the evaluation results, we summarize tensor-level, model-level, and task-level knowledge about quantized LLMs.
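For reference, the weight-only ("W") setting can be simulated with textbook round-to-nearest per-channel quantization, as in the sketch below; this is a generic baseline for illustration, not the paper's evaluation code.

```python
import torch

def quantize_weight_rtn(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Round-to-nearest, symmetric, per-output-channel weight quantization.

    w: (out_features, in_features). Returns dequantized ("fake-quantized")
    weights so the quantization error can be measured in a full-precision run.
    """
    qmax = 2 ** (n_bits - 1) - 1                       # signed range, e.g. 7 for 4 bits
    scale = w.abs().amax(dim=1, keepdim=True) / qmax   # one scale per output channel
    q = (w / scale).round().clamp(-qmax - 1, qmax)     # integer grid
    return q * scale                                   # back to float for simulation
```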
CVPR 2024
FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models
Algorithm-level | Efficient Optimization Process | Vision Generation
Paper
Code
Video
Evaluating text-to-image generation models typically involves generating images for 1K-4K prompts, which is cumbersome in application scenarios that require iterative evaluation. This paper constructs an extensive model zoo and introduces an evolutionary search method to identify a "representative subset" of the textual dataset. At equivalent evaluation quality (measured by Kendall's tau), FlashEval's 50-item prompt set can replace a randomly sampled 500-item subset, achieving a 10x speedup in evaluation efficiency.
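Here is a minimal sketch of the subset-search idea, using plain mutation hill-climbing as a stand-in for the paper's evolutionary algorithm; the `scores` format (a model-zoo-by-prompt score matrix) is our assumption for illustration.

```python
import random
from scipy.stats import kendalltau

def search_subset(scores, k=50, iters=2000, seed=0):
    """Find k prompts whose induced model ranking best matches the full set.

    scores[m][p] = evaluation metric of model m on prompt p, over a model zoo.
    Agreement with the full-set ranking is measured by Kendall's tau.
    """
    rng = random.Random(seed)
    n_prompts = len(scores[0])
    full_rank = [sum(m) for m in scores]        # each model's score on all prompts

    def tau(subset):
        sub_rank = [sum(m[p] for p in subset) for m in scores]
        return kendalltau(full_rank, sub_rank)[0]

    best = rng.sample(range(n_prompts), k)
    best_tau = tau(best)
    for _ in range(iters):
        child = best.copy()
        child[rng.randrange(k)] = rng.randrange(n_prompts)  # mutate one prompt
        if len(set(child)) == k and (t := tau(child)) > best_tau:
            best, best_tau = child, t           # keep the better subset
    return best, best_tau
```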
ICLR 2024
A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models
Algorithm-level | Efficient Inference | Vision Generation
Paper
Code
Video
ICLR 2024
Skeleton-of-Thought: Prompting Large Language Models for Efficient Parallel Generation
Algorithm-level | Efficient Inference | Language
Paper
Code
Video
WACV 2024
TCP: Triplet Contrastive-relationship Preserving for Class-Incremental Learning
| |
Paper
NeurIPS Workshop 2023
LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment
Model-level (Quantization) | Efficient Inference | Language
Paper
NeurIPS 2023
Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels
| | Vision Recognition
Paper
Code
ICCV 2023
Ada3D: Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection
Model-level (Sparsification) | Efficient Inference | Vision Recognition
Paper
Video
ICML 2023
OMS-DPM: Deciding The Optimal Model Schedule for Diffusion Probabilistic Model
Algorithm-level | Efficient Inference | Vision Generation
Paper
Code
Website
Video
AAAI 2023 (Oral)
Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start"
Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition
Paper
Code
AAAI 2023
Memory-Oriented Structural Pruning for Efficient Image Restoration
Model-level (Structure Optimization) | Efficient Inference | Vision Generation
Paper
AAAI 2023
Ensemble-in-One: Ensemble Learning within Random Gated Networks for Enhanced Adversarial Robustness
| | Vision Recognition
Paper
TPAMI 2023
A Generic Graph-based Neural Architecture Encoding Scheme with Multifaceted Information
Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition
Paper
Code
DATE 2022 & TCAD 2023
Gibbon: Efficient Co-Exploration of NN Model and Processing-In-Memory Architecture
Model-level (Structure Optimization), System-level | Efficient Optimization Process, Efficient Inference | Vision Recognition
Paper
TCAD 2022
Exploring the Potential of Low-bit Training of Convolutional Neural Networks
Model-level (Quantization) | Efficient Training | Vision Recognition
Paper
CVPR 2022
CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance
Model-level (Structure Optimization) | Efficient Inference | Vision Recognition
Paper
CVPR 2022
FedCor: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning
Algorithm-level | Efficient Training | Vision Recognition
Paper
ECCV 2022
CLOSE: Curriculum Learning On the Sharing Extent Towards Better One-shot NAS
Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition
Paper
NeurIPS 2022 (Spotlight)
TA-GATES: An Encoding Scheme for Neural Network Architectures
Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition
Paper
Low-Power CV 2022
Hardware Design and Software Practices for Efficient Neural Network Inference
Model-level (Structure Optimization), Model-level (Quantization), System-level | Efficient Inference | Vision Recognition
Paper
TODAES 2021
Machine learning for electronic design automation: A survey
| | Other
Paper
Code
NeurIPS 2021
Evaluating Efficient Performance Estimators of Neural Architectures
Evaluation
Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition
Paper
Code
ASP-DAC 2020
Black Box Search Space Profiling for Accelerator-Aware Neural Architecture Search
Model-level (Structure Optimization) | Efficient Optimization Process, Efficient Inference | Vision Recognition
Paper
Code
ECCV 2020
A Generic Graph-based Neural Architecture Encoding Scheme for Predictor-based NAS
Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition
Paper
Code
ECCV 2020 (Spotlight)
DSA: More Efficient Budgeted Pruning via Differentiable Sparsity Allocation
Model-level (Structure Optimization) | Efficient Inference | Vision Recognition
Paper
ArXiv 2020
aw_nas: A Modularized and Extensible NAS framework
Model-level (Structure Optimization) | Efficient Inference, Efficient Optimization Process | Vision Recognition, Language
Paper
Code