2024-09-13

Efficient Data Pipeline

Create efficient data pipelines by using pinned memory and increasing the number of workers


before feeding data to the model for training, we need to

  1. load the data from disk into memory
  2. do preprocessing like normalization and data augmentation

to build an efficient data pipeline pytorch provides

  1. Dataset - wraps the data source (e.g. files on disk) plus its transformations and returns individual samples
  2. DataLoader - the interface we iterate over to get batches of samples (a minimal sketch follows this list)
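
to make this concrete, here is a minimal sketch of the two pieces together, using a hypothetical in-memory dataset of random images (the class name, sizes, and shapes are made up for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class RandomImageDataset(Dataset):
    # hypothetical data source: random "images" with integer labels, kept in memory
    def __init__(self, n=10_000):
        self.images = torch.randn(n, 3, 32, 32)
        self.labels = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # per-sample transformations (normalization, augmentation) would go here
        return self.images[idx], self.labels[idx]

training_data = RandomImageDataset()
loader = DataLoader(training_data, batch_size=128, shuffle=True)
images, labels = next(iter(loader))   # images.shape == (128, 3, 32, 32)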

our goal is to minimize GPU idle time (a rough way to measure this follows the list), and to do this we need to

  1. minimize data transfer time between the CPU and the GPU
  2. load and preprocess samples in parallel with more workers
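
a rough way to check whether the GPU is actually starved is to time how long each iteration waits on the loader versus how long the whole step takes; this sketch assumes the loader from above and a hypothetical train_step function:

import time

data_time, total_time = 0.0, 0.0
end = time.perf_counter()
for batch in loader:
    data_time += time.perf_counter() - end    # time spent waiting on the DataLoader
    # train_step(batch)                       # hypothetical forward/backward pass
    total_time += time.perf_counter() - end   # data wait + compute for this step
    end = time.perf_counter()
print(f"spent {data_time / total_time:.0%} of the loop waiting on data")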

data transfer from CPU to GPU

pageable host memory can't be read by the GPU directly: CUDA first copies the data into a pinned (page-locked) staging area and transfers it to the device from there. to skip that extra copy, we can have the DataLoader write batches straight into pinned memory.

DataLoader(training_data, batch_size=128, pin_memory=True)

a request for pinned memory can fail when the system is low on RAM, and pinning increases host memory usage because page-locked memory can't be swapped out
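
with pinned batches, the copy to the GPU can also be made asynchronous by passing non_blocking=True to .to(); a minimal sketch of the resulting loop, reusing training_data from above (the forward/backward pass is omitted):

import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")
loader = DataLoader(training_data, batch_size=128, pin_memory=True)

for images, labels in loader:
    # batches are already in pinned memory, so these copies can overlap
    # with other work instead of blocking until the transfer finishes
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # forward/backward pass would go here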

refer to the blog post by nvidia to understand more

by default DataLoader uses no worker processes (num_workers=0), so loading and preprocessing run in the main process

so if we increase the number of workers, pytorch spawns that many extra processes that load and preprocess batches in parallel, asynchronously from the training loop

DataLoader(training_data, batch_size=128, num_workers=8)
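
the two settings are usually combined; here is a sketch with a couple of related knobs (persistent_workers keeps the worker processes alive across epochs, prefetch_factor controls how many batches each worker loads ahead; 8 workers is just the value from above, tune it to your machine):

from torch.utils.data import DataLoader

loader = DataLoader(
    training_data,
    batch_size=128,
    num_workers=8,            # 8 worker processes load and preprocess in parallel
    pin_memory=True,          # workers put batches straight into pinned memory
    persistent_workers=True,  # don't tear workers down at the end of each epoch
    prefetch_factor=2,        # each worker keeps 2 batches ready in advance
)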