2024-09-13
Efficient Data Pipeline
Create efficient data pipelines by using pinned memory and increasing the number of workers
before feeding data to the model for training we need to
- load the data from disk into memory
- do preprocessing like normalization and data augmentation (a sketch of such a transform pipeline follows below)
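as a rough sketch of the preprocessing step, assuming an image dataset and torchvision — the specific transforms and the normalization statistics below are only illustrative, not something prescribed by this post:

```python
from torchvision import transforms

# illustrative preprocessing pipeline: augmentation first, then normalization
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # data augmentation: random crop + resize
    transforms.RandomHorizontalFlip(),   # data augmentation: random flip
    transforms.ToTensor(),               # PIL image -> float tensor in [0, 1]
    transforms.Normalize(                # normalization; these are the usual ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])
```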
to build an efficient data pipeline, pytorch provides
Dataset
- a source of data, like files on disk, together with the transformations applied to each sample
DataLoader
- the interface that iterates over the dataset and yields batches of samples (minimal sketch below)
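a minimal sketch of the two abstractions, assuming a folder of image files; `ImageFolderDataset`, the `data/train` path, and the reuse of `train_transform` from the sketch above are all hypothetical, and labels are left out to keep it short:

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class ImageFolderDataset(Dataset):
    """Hypothetical Dataset: reads image files from a directory and applies a transform."""

    def __init__(self, root, transform=None):
        self.paths = [os.path.join(root, name) for name in sorted(os.listdir(root))]
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # load one sample from disk into memory and preprocess it
        img = Image.open(self.paths[idx]).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img

training_data = ImageFolderDataset("data/train", transform=train_transform)
loader = DataLoader(training_data, batch_size=128, shuffle=True)  # yields batches of 128 samples
```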
our goal is to minimize GPU idle time and to do this we need to
- minimize data transfer time between CPU and GPU
- increase the number of DataLoader workers
data transfer from CPU to GPU
host data in pageable memory is never copied straight to the GPU; CUDA first stages it in a pinned (page-locked) buffer and then transfers it. to skip that extra staging copy, we can have the DataLoader write our batches directly into pinned memory.
DataLoader(training_data, batch_size=128, pin_memory=True)
requests for pinned memory can fail, and because pinned pages cannot be swapped out, host memory usage goes up
refer to nvidia's blog post on pinned memory to learn more
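putting this together, a minimal sketch of the transfer side; `training_data` is the dataset from the sketch above, and the `non_blocking=True` copy is an extra detail not covered in this post — it is what pinned memory enables, letting the host-to-device copy overlap with compute:

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")

# collate batches directly into page-locked (pinned) host memory
loader = DataLoader(training_data, batch_size=128, pin_memory=True)

for batch in loader:
    # with a pinned source tensor, the copy to the GPU can be asynchronous
    batch = batch.to(device, non_blocking=True)
    # ... forward / backward pass on `batch` ...
```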
DataLoader
by default loads data in the main process (num_workers=0)
if we increase the number of workers, pytorch spawns that many worker processes, which fetch and preprocess dataset samples asynchronously while the GPU is busy
DataLoader(training_data, batch_size=128, num_workers=8)
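a sketch combining both settings; `persistent_workers` and `prefetch_factor` are extra DataLoader options not discussed above, and 8 workers is just a starting point worth benchmarking against your CPU core count, disk speed, and preprocessing cost:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    training_data,
    batch_size=128,
    num_workers=8,            # 8 worker processes load and preprocess samples in parallel
    pin_memory=True,          # collate batches into pinned memory for faster GPU transfers
    persistent_workers=True,  # keep workers alive across epochs instead of respawning them
    prefetch_factor=2,        # each worker prepares 2 batches ahead of time (the default)
)
```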