Talk 13: Understanding and Improving LLM Training: Insights into Adam and Advent of Adam-mini
2024/10/21


Speaker: 孙若愚 (Ruoyu Sun), The Chinese University of Hong Kong


Title: Understanding and Improving LLM Training: Insights into Adam and Advent of Adam-mini


Abstract: Adam is the default algorithm for training large foundation models. In this talk, we aim to understand why Adam outperforms SGD when training large foundation models, and we propose a memory-efficient alternative called Adam-mini. First, we show that the original version of Adam does converge, and we explain that earlier work on the non-convergence of Adam used an unusual notion of convergence. Second, we offer an explanation for the failure of SGD on Transformers: (i) Transformers are “heterogeneous”: the Hessian spectra across parameter blocks vary dramatically; (ii) heterogeneity hampers SGD: SGD performs poorly on problems with block heterogeneity. Third, motivated by this finding, we introduce Adam-mini, which partitions the parameters according to the Hessian structure and assigns a single second-moment term to all weights in a block. We empirically show that Adam-mini saves 45–50% memory over Adam without compromising performance, on various models including 8B-parameter language models and ViT.
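To make the block-wise idea in the abstract concrete, below is a minimal NumPy sketch of an Adam-style update in which every parameter block shares a single second-moment scalar (the running mean of the squared gradients in that block), so the memory for the second-moment state shrinks from one entry per parameter to one entry per block. The block names, function signature, and hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: Adam-style step with one shared second-moment scalar per block.
import numpy as np

def adam_mini_style_step(params, grads, m, v, t, lr=1e-3,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    """params/grads/m: dict block_name -> array; v: dict block_name -> float
    (a single scalar per block); t: step count starting from 1."""
    for name, g in grads.items():
        # First moment is kept per parameter, as in Adam.
        m[name] = beta1 * m[name] + (1 - beta1) * g
        # Second moment is a single scalar per block: mean squared gradient.
        v[name] = beta2 * v[name] + (1 - beta2) * float(np.mean(g ** 2))
        # Bias correction, as in Adam.
        m_hat = m[name] / (1 - beta1 ** t)
        v_hat = v[name] / (1 - beta2 ** t)
        # All weights in the block share one adaptive step size.
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Illustrative usage with two hypothetical parameter blocks.
rng = np.random.default_rng(0)
params = {"attn.q": rng.normal(size=(4, 4)), "mlp.fc": rng.normal(size=(8, 4))}
m = {k: np.zeros_like(p) for k, p in params.items()}
v = {k: 0.0 for k in params}
for t in range(1, 11):
    grads = {k: rng.normal(size=p.shape) for k, p in params.items()}
    params, m, v = adam_mini_style_step(params, grads, m, v, t)
```

In this sketch, the dictionary keys stand in for the Hessian-informed parameter blocks mentioned in the abstract; only the scalar `v` per block is stored instead of a full per-parameter second-moment tensor.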