2024-09-19
Distributed Model Training and Parallelism Techniques
How to train large language models
once we have fully optimized our architecture on a single device, only then should we move towards distributed training
there are two main parallelism paradigms:
- model parallelism
a straightforward way to do distributed training is to divide the model's ops into smaller sets so that they can be executed on different devices. still, ops usually depend on each other (forward and backward pass), which makes this difficult to implement.
to work around these dependencies, we can split the work in a few ways:
inter-layer strategy (pipeline parallelism)
in this strategy, the model's layers are placed on different devices; batches are split into micro-batches and pipelined through the stages so the devices can work in parallel (see the first sketch after this list)
inter-operation strategy
we divide the operations executed within each layer into smaller chunks and place them on different devices
intra-operation strategy (tensor parallelism)
a single operation (e.g. a large matrix multiplication) is split so that different devices execute parts of it (see the second sketch after this list)
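here's a minimal sketch of the inter-layer idea in PyTorch, assuming two GPUs are available (the layer sizes are made up). real pipeline parallelism (e.g. GPipe) also splits each batch into micro-batches so both devices stay busy instead of waiting on each other:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # stage 1 lives on the first GPU, stage 2 on the second (sizes are illustrative)
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # the activation is copied across devices between stages
        return self.stage2(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.shape, out.device)   # torch.Size([8, 1024]) cuda:1
```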
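and a minimal sketch of the intra-operation idea: a single matmul y = x @ w split column-wise across two devices. frameworks like Megatron-LM do this for the big matmuls inside transformer layers, using collectives instead of the naive concat below:

```python
import torch

x = torch.randn(8, 1024)        # a batch of activations
w = torch.randn(1024, 4096)     # the full weight matrix of one linear layer

# split the weight columns across two devices
w0, w1 = w.chunk(2, dim=1)
y0 = x.to("cuda:0") @ w0.to("cuda:0")   # each device computes half of the output features
y1 = x.to("cuda:1") @ w1.to("cuda:1")

# gather the partial outputs back together
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)
print((y - x @ w).abs().max())   # matches the unsplit matmul up to float rounding
```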
- data parallelism
divide the training dataset into smaller sets and then use these sets to train distinct replicas of the model
at each step, we'd synchronize gradients across devices. this phase is the key to the data parallel approach.
this synchronization phase can be implemented with a parameter server (one server aggregates the gradients, averages them and sends the result back to all the training servers, but this server can become a major bottleneck) or with all-reduce (every device participates in the reduction, so no single machine is the bottleneck)
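to make the synchronization step concrete, here's a minimal sketch of the all-reduce variant in PyTorch (launched with torchrun on a single node; the tiny linear "model" and the random data are placeholders). in practice you'd wrap the model in torch.nn.parallel.DistributedDataParallel, which does this averaging for you and overlaps it with the backward pass:

```python
# launch with e.g.: torchrun --nproc_per_node=2 data_parallel_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)              # single-node sketch: global rank == local rank

torch.manual_seed(0)                     # same seed so every rank starts with an identical replica
model = nn.Linear(1024, 1024).cuda()     # tiny placeholder "model"
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

torch.manual_seed(1234 + rank)           # different seed per rank so each replica sees different data
for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()

    # the key data-parallel step: average gradients across all replicas
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()

    opt.step()    # every replica now applies the same averaged update

dist.destroy_process_group()
```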
the data parallel approach is very popular, but with large language models coming into the picture we are shifting to other strategies in order to train these huge models efficiently
we can see how these large language models are trained on multi-GPU clusters with FSDP or DeepSpeed
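as a rough illustration (not a full recipe), this is roughly what wrapping a model in PyTorch's FSDP looks like; the tiny Sequential model and the hyperparameters are placeholders, and DeepSpeed's ZeRO stages give an analogous setup:

```python
# launch with e.g.: torchrun --nproc_per_node=2 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())   # single-node sketch

# placeholder model; a real LLM setup would also pass an auto-wrap policy so each
# transformer block becomes its own shard unit
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)                      # parameters, gradients and optimizer state get sharded
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()      # FSDP all-gathers shards for compute and reduce-scatters gradients
opt.step()

dist.destroy_process_group()
```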
Lilian Weng has an in-depth blog post on these strategies 👉 https://lnkd.in/gQiqAbDp