Large-scale pre-trained models (PTMs), such as Transformer models, have advanced deep learning (DL) on a variety of complex tasks, including natural language processing (e.g., BERT [9], GPT [6], T5 [41]), computer vision (e.g., ViT [10], Swin [25]), advertising recommendation (e.g., M6 [24]), and so on. These models are also known as foundation models, since they are trained on hundreds of gigabytes of data and can be adapted, e.g., via task-specific fine-tuning, to a wide range of downstream tasks.