Who We Are
The Nanoscale Integrated Circuits and System Lab, Energy Efficient Computing Group (NICS-EFC) in the Department of Electronic Engineering at Tsinghua University is led by Professor Yu Wang. The Efficient Algorithm Team (EffAlg) in the NICS-EFC group is led by Research Assistant Professor Xuefei Ning. Our team collaborates closely with Infinigence-AI and with researchers from many institutions, including SJTU, MSR, and HKU.
Our current research primarily focuses on efficient deep learning, including algorithm-level acceleration, model-level compression, model architecture design, system co-optimization, and other techniques. Our work targets several application domains, including language generative models (i.e., LLMs), vision generative models, vision understanding models, and more. Most of our projects are open-sourced under the thu-nics GitHub organization (most of our efficient DL projects) or the imagination-research GitHub organization (some efficient DL projects, as well as projects on broader topics; the latter are research efforts co-led with Dr. Zinan Lin from MSR).
Our group welcomes all kinds of collaborations and is continuously recruiting visiting students and engineers who are interested in efficient deep learning. If you are interested in collaboration or visiting-student opportunities, please email Xuefei or Prof. Yu Wang.
News
- 2024/09/26 Three of our papers are accepted by NeurIPS 2024! (1) "Can LLMs Learn by Teaching? A Preliminary Study" explores a novel pathway (learning by teaching) toward better and continually evolving reasoning abilities of LLMs. (2) "DiTFastAttn: Attention Compression for Diffusion Transformer Models" proposes post-training attention sparsification and sharing techniques to accelerate Diffusion Transformers. (3) "Rad-NeRF: Ray-decoupled Training of Neural Radiance Field" proposes a ray-decoupled soft ensemble of NeRF MLPs to better model complex scenes, offering a new and parameter-efficient scaling dimension.
- 2024/07/01 Our paper Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs is public on arXiv. We prune the experts of MoE models and use weight merging so that the retained experts inherit the knowledge of the discarded ones, a novel, gradient-free way to preserve knowledge during model compression; a toy sketch of the idea follows below. The code will be available soon.
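For intuition, here is a minimal, hypothetical sketch of expert pruning with weight merging, not the paper's exact algorithm: experts are ranked by routing frequency, and each discarded expert is folded into its most similar retained expert via a frequency-weighted average. The function name, scoring, and merging coefficients are all our own simplifications.

```python
# Toy sketch of MoE expert pruning + weight merging (illustrative, not the paper's method).
import torch
import torch.nn.functional as F

def prune_and_merge_experts(expert_weights, router_freq, keep_k):
    """expert_weights: (E, d_out, d_in) stacked expert matrices;
    router_freq: (E,) empirical routing frequency of each expert."""
    keep = torch.topk(router_freq, keep_k).indices            # retain the most-used experts
    dropped = [e for e in range(expert_weights.shape[0]) if e not in set(keep.tolist())]
    merged = expert_weights[keep].clone()
    for d in dropped:
        # fold the discarded expert into the most similar retained expert,
        # weighted by routing frequency -- a gradient-free knowledge transfer
        sim = F.cosine_similarity(merged.flatten(1),
                                  expert_weights[d].flatten().unsqueeze(0), dim=1)
        j = int(sim.argmax())
        alpha = router_freq[d] / (router_freq[d] + router_freq[keep[j]])
        merged[j] = (1 - alpha) * merged[j] + alpha * expert_weights[d]
    return keep, merged

keep, merged = prune_and_merge_experts(torch.randn(8, 16, 16), torch.rand(8), keep_k=4)
```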
- 2024/06/24 Our paper Can LLMs Learn by Teaching? A Preliminary Study is public on arXiv. We make a first attempt at adapting the "learning by teaching" strategy from human education to LLMs, examining whether contemporary LLMs can learn by teaching to improve their mathematical reasoning and code synthesis. The results show some promise, and we provide a roadmap for future research. Welcome to check the code.
- 2024/06/24 Our paper MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression is public on arXiv. MoA compresses attention in LLMs so that they compute only short attention spans yet remember long contexts. It achieves 5.5-6.7x higher throughput than dense FlashAttention2 and improves retrieval accuracy by 1.5-7.1x compared to uniform sparse attention. Welcome to check the code; a toy illustration of heterogeneous sparse attention follows below.
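As a rough illustration of the heterogeneity MoA exploits (the actual method searches the per-head spans automatically and runs on optimized kernels), here is a toy sliding-window attention where each head gets its own span; the shapes and spans below are made up for the example.

```python
# Toy heterogeneous sliding-window attention: each head gets its own span.
import torch

def banded_attention(q, k, v, window):
    """q, k, v: (T, d) for one head; window: how many past tokens this head can see."""
    T, d = q.shape
    scores = q @ k.T / d ** 0.5
    idx = torch.arange(T)
    causal = idx[None, :] <= idx[:, None]              # never attend to the future
    local = idx[:, None] - idx[None, :] < window       # head-specific local span
    scores = scores.masked_fill(~(causal & local), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(16, 8) for _ in range(3))
out_global = banded_attention(q, k, v, window=16)      # a "long-range" head
out_local = banded_attention(q, k, v, window=4)        # a "local" head
```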
- 2024/06/14 Our paper DiTFastAttn: Attention Compression for Diffusion Transformer Models is public on arXiv. We design three training-free techniques to compress the attention operation, which is the efficiency bottleneck when generating high-resolution images. For example, applying DiTFastAttn to PixArt-Sigma-XL reduces the FLOPs and latency of attention computation by 88% and 37%, respectively, when generating 2Kx2K images. Welcome to check the code; a toy sketch of the cross-step sharing idea follows below.
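DiTFastAttn combines three techniques; the snippet below illustrates only one of them, sharing attention outputs across neighboring diffusion steps, in a deliberately simplified form. The caching policy shown here is hypothetical, not the paper's.

```python
# Toy sketch of attention sharing across diffusion steps: recompute only at
# "anchor" steps and reuse the cached output in between (simplified).
import torch

def attention(q, k, v):
    return torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1) @ v

_cache = {}

def shared_attention(layer_id, step, q, k, v, share_every=2):
    if step % share_every == 0 or layer_id not in _cache:
        _cache[layer_id] = attention(q, k, v)          # full attention at anchor steps
    return _cache[layer_id]                            # neighbors reuse the cached output
```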
- 2024/06/02 Two of our papers are public on arXiv. In the first, we propose MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization, which successfully tackles the challenging quantization of few-step text-to-image diffusion models. With negligible visual-quality degradation and content change, MixDQ achieves W4A8, equivalent to 3.4x memory compression and 1.5x latency speedup. Check our code and project page. The second is ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation, a quantization method specialized for transformer-based video and image diffusion models. For popular large-scale models (e.g., Open-Sora, Latte, PixArt) on video and image generation tasks, ViDiT-Q achieves W8A8 quantization without metric degradation and W4A8 without notable visual-quality degradation. Check our code and project page; a minimal fake-quantization sketch follows below. Update 07/01: MixDQ is accepted by ECCV'24!
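To unpack the "W4A8" notation: weights are stored in 4 bits and activations in 8. The snippet below is a minimal per-tensor fake-quantization sketch of that idea only; the actual MixDQ/ViDiT-Q schemes are mixed-precision and metric-aware, which this does not reproduce.

```python
# Minimal per-tensor symmetric fake quantization illustrating "W4A8" (simplified).
import torch

def fake_quantize(x, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax() / qmax                      # one scale for the whole tensor
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

w = torch.randn(64, 64)                                # a layer's weight
a = torch.randn(8, 64)                                 # a batch of activations
y = fake_quantize(a, bits=8) @ fake_quantize(w, bits=4).T   # simulated W4A8 matmul
```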
- 2024/05/02 One paper, Evaluating Quantized Large Language Models, is accepted by ICML'24. This work evaluates 11 LLM families across diverse tasks (including emergent abilities, dialogue, long-context tasks, and so on) and different tensor types (weights, weight-activation, and the key-value cache). We provide quantitative suggestions and qualitative insights on quantization, giving practitioners a comprehensive reference for quantization decisions.
- 2024/04/23 Our survey A Survey on Efficient Inference for Large Language Models is public on arXiv. Any discussions and suggestions are welcome!
- 2024/04/05 Our paper Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better is public on arXiv. This work proposes Linear Combination of Saved Checkpoints (LCSC), which uses a gradient-free, search-based combination of checkpoints to obtain the final weights, achieving significant training speedups (23x on CIFAR-10 and 15x on ImageNet-64) compared to full gradient-based training. LCSC can also enhance pre-trained models at a small cost; a toy search sketch follows below. Check our code.
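The core of LCSC is easy to picture: keep the checkpoints saved along a training trajectory and search, without gradients, for combination coefficients that maximize some quality metric. Below is a toy random-mutation search; `evaluate` is a placeholder for a real metric (e.g., negative FID on a validation set), and the mutation scheme is our own simplification.

```python
# Toy gradient-free search over linear combinations of saved checkpoints (simplified).
import torch

def combine(checkpoints, coeffs):
    """checkpoints: list of state dicts with identical keys."""
    return {k: sum(c * ckpt[k] for c, ckpt in zip(coeffs, checkpoints))
            for k in checkpoints[0]}

def search(checkpoints, evaluate, iters=100):
    n = len(checkpoints)
    best_c = torch.full((n,), 1.0 / n)                 # start from the plain average
    best_score = evaluate(combine(checkpoints, best_c))
    for _ in range(iters):
        c = best_c + 0.1 * torch.randn(n)              # mutate the current best
        c = c / c.sum()                                # keep coefficients summing to 1
        score = evaluate(combine(checkpoints, c))
        if score > best_score:
            best_c, best_score = c, score
    return combine(checkpoints, best_c)
```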
- 2024/02/27 One paper, FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models, is accepted by CVPR'24. This work selects a compact data subset for evaluating text-to-image diffusion models.
- 2024/02/09 Our paper on long-context benchmarking, LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K, is public on arXiv. Check the code.
- 2024/01/17 Two papers are accepted by ICLR'24. One is Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation, which accelerates LLM generation by letting the LLM itself plan the answer skeleton and then generate the segments in parallel, achieving ~2x speedups; a toy sketch follows below. The other is A Unified Sampling Framework for Solver Searching of Diffusion Probabilistic Models, which unifies diffusion sampling strategies and searches for the best one.
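The mechanism behind Skeleton-of-Thought is simple to sketch: first ask the model for a short outline, then expand every outline point in parallel so the expansions do not wait on one another. In the toy version below, `llm` is a placeholder for any text-in/text-out model call; the prompts are illustrative, not the paper's.

```python
# Toy Skeleton-of-Thought: outline first, then expand the points in parallel.
from concurrent.futures import ThreadPoolExecutor

def skeleton_of_thought(llm, question):
    skeleton = llm(f"Give 3-5 short bullet points outlining the answer to: {question}")
    points = [p.strip() for p in skeleton.splitlines() if p.strip()]
    with ThreadPoolExecutor() as pool:                 # expansions run concurrently
        bodies = list(pool.map(
            lambda p: llm(f"Question: {question}\nExpand this point briefly: {p}"),
            points))
    return "\n".join(bodies)
```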
Efficient DL Projects
| Technique | Target | Domain | Paper |
| --- | --- | --- | --- |
| Model-level | Efficient Inference | Vision Recognition, Vision Generation, Language | |
| Algorithm-level | Efficient Inference | Vision Generation | Paper |
| Algorithm-level | Efficient Inference | Vision Generation | Paper |
| | Efficient Inference | 3D Modeling | |
| | Efficient Inference | Language | Paper |
| System-level | Efficient Inference | Language | Paper |
| System-level, Model-level (Quantization), Model-level (Sparsification) | Efficient Inference | Language | Paper |
| System-level | Efficient Inference | Vision Recognition | Paper |
| Algorithm-level | Efficient Inference | Vision Generation | Paper |
| Model-level (Quantization) | Efficient Inference | Language | Paper |
| Model-level (Sparsification) | Efficient Inference | Vision Recognition | Paper |
| Model-level (Structure Optimization) | Efficient Inference | Vision Generation | Paper |
| | | Vision Recognition | Paper |
| Model-level (Structure Optimization), System-level | Efficient Optimization Process, Efficient Inference | Vision Recognition | Paper |
| Model-level (Quantization) | Efficient Training | Vision Recognition | Paper |
| Model-level (Structure Optimization) | Efficient Inference | Vision Recognition | Paper |
| Algorithm-level | Efficient Training | Vision Recognition | Paper |
| Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition | Paper |
| Model-level (Structure Optimization) | Efficient Optimization Process | Vision Recognition | Paper |
| Model-level (Structure Optimization), Model-level (Quantization), System-level | Efficient Inference | Vision Recognition | Paper |
| Model-level (Structure Optimization) | Efficient Inference | Vision Recognition | Paper |