Talk 13: Understanding and Improving LLM Training: Insights into Adam and Advent of Adam-mini
2024/10/21


Speaker: 孙若愚 (Ruoyu Sun), The Chinese University of Hong Kong


Title: Understanding and Improving LLM Training: Insights into Adam and Advent of Adam-mini


Abstract: Adam is the default algorithm for training large foundation models. In this talk, we aim to understand why Adam outperforms SGD when training large foundation models, and we propose a memory-efficient alternative called Adam-mini. First, we show that the original version of Adam does converge, and we explain that earlier work on the non-convergence of Adam used an unusual notion of convergence. Second, we offer an explanation for the failure of SGD on Transformers: (i) Transformers are “heterogeneous”: the Hessian spectra across parameter blocks vary dramatically; (ii) heterogeneity hampers SGD: SGD performs poorly on problems with block heterogeneity. Third, motivated by this finding, we introduce Adam-mini, which partitions the parameters according to the Hessian structure and assigns a single second-moment term to all weights in a block. We empirically show that Adam-mini saves 45–50% memory over Adam without compromising performance, on various models including 8B-parameter language models and ViT.
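To make the block-wise idea in the abstract concrete, below is a minimal NumPy sketch of an Adam-style update in which every parameter block shares a single second-moment scalar (the running mean of the squared gradients in that block), so the memory for the second-moment state shrinks from one entry per parameter to one entry per block. The block names, function signature, and hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: Adam-style step with one shared second-moment scalar per block.
import numpy as np

def adam_mini_style_step(params, grads, m, v, t, lr=1e-3,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    """params/grads/m: dict block_name -> array; v: dict block_name -> float
    (a single scalar per block); t: step count starting from 1."""
    for name, g in grads.items():
        # First moment is kept per parameter, as in Adam.
        m[name] = beta1 * m[name] + (1 - beta1) * g
        # Second moment is a single scalar per block: mean squared gradient.
        v[name] = beta2 * v[name] + (1 - beta2) * float(np.mean(g ** 2))
        # Bias correction, as in Adam.
        m_hat = m[name] / (1 - beta1 ** t)
        v_hat = v[name] / (1 - beta2 ** t)
        # All weights in the block share one adaptive step size.
        params[name] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Illustrative usage with two hypothetical parameter blocks.
rng = np.random.default_rng(0)
params = {"attn.q": rng.normal(size=(4, 4)), "mlp.fc": rng.normal(size=(8, 4))}
m = {k: np.zeros_like(p) for k, p in params.items()}
v = {k: 0.0 for k in params}
for t in range(1, 11):
    grads = {k: rng.normal(size=p.shape) for k, p in params.items()}
    params, m, v = adam_mini_style_step(params, grads, m, v, t)
```

In this sketch, the dictionary keys stand in for the Hessian-informed parameter blocks mentioned in the abstract; only the scalar `v` per block is stored instead of a full per-parameter second-moment tensor.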