2024-09-19
Distributed Model Training and Parallelism Techniques
How to train large language models
once we have fully optimized our architecture on a single device, only then should we move towards distributed training
there are two main parallelism paradigms:
- model parallelism
a straightforward way to do distributed training is to divide the model's ops into smaller sets so that they can be executed on different devices. still, ops usually depend on each other (forward and backward pass), which makes this difficult to implement.
to work around these dependencies, we can split the work in a few ways:
inter-layer strategy (pipeline parallelism)
in this strategy, the model's layers are placed on different devices; batches are split into micro-batches and pipelined through the stages so the devices can work in parallel (see the first sketch after this list)
inter-operation strategy
we divide the operations executed within each layer into smaller chunks and place them on different devices
intra-operation strategy (tensor parallelism)
a single operation (e.g. a large matrix multiplication) is split so that different devices execute parts of it (see the second sketch after this list)
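here's a minimal sketch of the inter-layer idea in PyTorch, assuming two GPUs are available (the layer sizes are made up). real pipeline parallelism (e.g. GPipe) also splits each batch into micro-batches so both devices stay busy instead of waiting on each other:

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # stage 1 lives on the first GPU, stage 2 on the second (sizes are illustrative)
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # the activation is copied across devices between stages
        return self.stage2(x.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 1024))
print(out.shape, out.device)   # torch.Size([8, 1024]) cuda:1
```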
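and a minimal sketch of the intra-operation idea: a single matmul y = x @ w split column-wise across two devices. frameworks like Megatron-LM do this for the big matmuls inside transformer layers, using collectives instead of the naive concat below:

```python
import torch

x = torch.randn(8, 1024)        # a batch of activations
w = torch.randn(1024, 4096)     # the full weight matrix of one linear layer

# split the weight columns across two devices
w0, w1 = w.chunk(2, dim=1)
y0 = x.to("cuda:0") @ w0.to("cuda:0")   # each device computes half of the output features
y1 = x.to("cuda:1") @ w1.to("cuda:1")

# gather the partial outputs back together
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)
print((y - x @ w).abs().max())   # matches the unsplit matmul up to float rounding
```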
- data parallelism
divide the training dataset into smaller sets and then use these sets to train distinct replicas of the model
at each step, we'd synchronize gradients across devices. this phase is the key to the data parallel approach.
this synchronization phase can be implemented with a parameter server (one server aggregates the gradients, averages them and sends the result back to all the training servers, but this server can become a major bottleneck) or with all-reduce (every device participates in the reduction, so no single machine is the bottleneck)
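to make the synchronization step concrete, here's a minimal sketch of the all-reduce variant in PyTorch (launched with torchrun on a single node; the tiny linear "model" and the random data are placeholders). in practice you'd wrap the model in torch.nn.parallel.DistributedDataParallel, which does this averaging for you and overlaps it with the backward pass:

```python
# launch with e.g.: torchrun --nproc_per_node=2 data_parallel_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)              # single-node sketch: global rank == local rank

torch.manual_seed(0)                     # same seed so every rank starts with an identical replica
model = nn.Linear(1024, 1024).cuda()     # tiny placeholder "model"
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

torch.manual_seed(1234 + rank)           # different seed per rank so each replica sees different data
for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()

    # the key data-parallel step: average gradients across all replicas
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()

    opt.step()    # every replica now applies the same averaged update

dist.destroy_process_group()
```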
the data parallel approach is very popular, but with large language models coming into the picture we are shifting to other strategies in order to train these huge models efficiently
we can see how these large language models are trained on multi-GPU clusters with FSDP or DeepSpeed
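as a rough illustration (not a full recipe), this is roughly what wrapping a model in PyTorch's FSDP looks like; the tiny Sequential model and the hyperparameters are placeholders, and DeepSpeed's ZeRO stages give an analogous setup:

```python
# launch with e.g.: torchrun --nproc_per_node=2 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())   # single-node sketch

# placeholder model; a real LLM setup would also pass an auto-wrap policy so each
# transformer block becomes its own shard unit
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()
model = FSDP(model)                      # parameters, gradients and optimizer state get sharded
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()      # FSDP all-gathers shards for compute and reduce-scatters gradients
opt.step()

dist.destroy_process_group()
```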
Lilian Weng has an in-depth blog post on these strategies 👉 https://lnkd.in/gQiqAbDp