2024-09-13
Efficient Data Pipeline
Create efficient data pipelines by using pinned memory and increasing the number of workers
before feeding data to the model for training we need to
- load the data from disk into memory
- do preprocessing like normalization and data augmentation (a sketch follows below)
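a minimal sketch of such a preprocessing pipeline, assuming image data and torchvision; the normalization statistics are the common ImageNet values, used here purely for illustration:

```python
from torchvision import transforms

# illustrative preprocessing pipeline: augmentation followed by normalization
preprocess = transforms.Compose([
    transforms.RandomHorizontalFlip(),                  # data augmentation
    transforms.ToTensor(),                              # PIL image -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # example ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```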
to build an efficient data pipeline pytorch has
Dataset
- a source of data, like files, together with their transformations
DataLoader
- an interface to get the samples (see the sketch after this list)
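a minimal sketch of how the two fit together; FilesDataset and the file names are hypothetical, for illustration only:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class FilesDataset(Dataset):
    # hypothetical dataset: one tensor sample per file on disk
    def __init__(self, paths, transform=None):
        self.paths = paths
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        sample = torch.load(self.paths[idx])  # load from disk into memory
        if self.transform is not None:
            sample = self.transform(sample)   # preprocessing
        return sample

# DataLoader wraps the Dataset and yields (shuffled) batches of samples
loader = DataLoader(FilesDataset(paths=["a.pt", "b.pt"]),
                    batch_size=2, shuffle=True)
```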
our goal is to minimize GPU idle time and to do this we need to
- minimize data transfer time between CPU and GPU
- increase the number of DataLoader workers
data transfer from CPU to GPU
data in pageable memory never gets copied directly to the GPU; the CUDA driver first stages it in a pinned (page-locked) buffer and transfers it from there. to skip this extra staging copy, we can write our data directly into pinned memory.
a request for pinned memory can fail, and pinning increases host memory usage, so it should be used with care
refer to the blog post by nvidia to understand more
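a minimal sketch of using pinned memory from the DataLoader: pin_memory=True makes each batch land in pinned host memory, which also enables asynchronous host-to-device copies via non_blocking=True (assumes a CUDA device is available; the data is dummy):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 32, 32))  # dummy data for illustration
loader = DataLoader(dataset, batch_size=32, pin_memory=True)

device = torch.device("cuda")
for (batch,) in loader:
    # batch is already in pinned memory, so this copy can overlap with compute
    batch = batch.to(device, non_blocking=True)
    # ... forward/backward pass ...
```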
DataLoader
by default (num_workers=0) it loads data synchronously in the main process
so if we increase the number of workers, pytorch creates that many additional processes, which load and preprocess batches asynchronously while the main process trains
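a minimal sketch, assuming 4 workers is a reasonable count for the machine (tune it empirically, e.g. starting from the number of CPU cores):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 8))  # dummy data for illustration
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # 4 worker processes fetch samples in parallel
    pin_memory=True,
    persistent_workers=True,  # keep workers alive across epochs
    prefetch_factor=2,        # batches each worker loads ahead (the default)
)
```

on platforms that spawn subprocesses (e.g. windows), code creating a multi-worker DataLoader should run under an if __name__ == "__main__": guard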