2024-09-13

Efficient Data Pipeline

Create efficient data pipelines by using pinned memory and increasing the number of workers


before feeding data to the model for training, we need to

  1. load the data from disk into memory
  2. do preprocessing like normalization and data augmentation

to build an efficient data pipeline pytorch provides

  1. Dataset - wraps the data source (e.g. files on disk) plus its transformations and returns individual samples
  2. DataLoader - the interface we iterate over to get batches of samples (a minimal sketch follows this list)
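
to make this concrete, here is a minimal sketch of the two pieces together, using a hypothetical in-memory dataset of random images (the class name, sizes, and shapes are made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class RandomImageDataset(Dataset):
    # hypothetical data source: random "images" with integer labels, kept in memory
    def __init__(self, n=10_000):
        self.images = torch.randn(n, 3, 32, 32)
        self.labels = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # per-sample transformations (normalization, augmentation) would go here
        return self.images[idx], self.labels[idx]

training_data = RandomImageDataset()
loader = DataLoader(training_data, batch_size=128, shuffle=True)
images, labels = next(iter(loader))   # images.shape == (128, 3, 32, 32)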

our goal is to minimize GPU idle time (a rough way to measure this follows the list), and to do this we need to

  1. minimize data transfer time between the CPU and the GPU
  2. load and preprocess samples in parallel with more workers
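
a rough way to check whether the GPU is actually starved is to time how long each iteration waits on the loader versus how long the whole step takes; this sketch assumes the loader from above and a hypothetical train_step function:

import time

data_time, total_time = 0.0, 0.0
end = time.perf_counter()
for batch in loader:
    data_time += time.perf_counter() - end    # time spent waiting on the DataLoader
    # train_step(batch)                       # hypothetical forward/backward pass
    total_time += time.perf_counter() - end   # data wait + compute for this step
    end = time.perf_counter()
print(f"spent {data_time / total_time:.0%} of the loop waiting on data")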

data transfer from CPU to GPU

pageable host memory can't be read by the GPU directly: CUDA first copies the data into a pinned (page-locked) staging area and transfers it to the device from there. to skip that extra copy, we can have the DataLoader write batches straight into pinned memory.

DataLoader(training_data, batch_size=128, pin_memory=True)

a request for pinned memory can fail when the system is low on RAM, and pinning increases host memory usage because page-locked memory can't be swapped out
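
with pinned batches, the copy to the GPU can also be made asynchronous by passing non_blocking=True to .to(); a minimal sketch of the resulting loop, reusing training_data from above (the forward/backward pass is omitted):

import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")
loader = DataLoader(training_data, batch_size=128, pin_memory=True)

for images, labels in loader:
    # batches are already in pinned memory, so these copies can overlap
    # with other work instead of blocking until the transfer finishes
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # forward/backward pass would go here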

refer to the blog post by nvidia to understand more

by default DataLoader uses no worker processes (num_workers=0), so loading and preprocessing run in the main process

so if we increase the number of workers, pytorch spawns that many extra processes that load and preprocess batches in parallel, asynchronously from the training loop

DataLoader(training_data, batch_size=128, num_workers=8)
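
the two settings are usually combined; here is a sketch with a couple of related knobs (persistent_workers keeps the worker processes alive across epochs, prefetch_factor controls how many batches each worker loads ahead; 8 workers is just the value from above, tune it to your machine):

from torch.utils.data import DataLoader

loader = DataLoader(
    training_data,
    batch_size=128,
    num_workers=8,            # 8 worker processes load and preprocess in parallel
    pin_memory=True,          # workers put batches straight into pinned memory
    persistent_workers=True,  # don't tear workers down at the end of each epoch
    prefetch_factor=2,        # each worker keeps 2 batches ready in advance
)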