Deep dive into FFCV internals

There’s been a lot of hype about FFCV, a pytorch-compatible data-loading library that claims major improvements over current pytorch data-loading solutions. But not a lot has been written about what it actually does, so I spent some time diving into FFCV’s internals. It was a fun and very instructive exercise, as there is a lot of smart engineering going on.

This post summarizes my findings. I wasn’t involved in FFCV’s development at all, so if I got something wrong or missed critical details, please do let me know!

TL;DR:

- FFCV controls the whole pipeline: a custom .beton dataset format, its own data loader, and its own transforms, and the three are tightly coupled.
- With os_cache = False, the loader pre-fetches entire pages of the .beton file ahead of time, because it knows the full sampling order in advance; the QUASI_RANDOM order keeps the number of pages that must be resident small.
- Parallelism happens inside the transforms (numba threads over the images of a batch), not across worker processes as in torch.utils.data.DataLoader.

Overview

FFCV controls the 3 main parts of the entire data loading pipeline:

- the dataset format: the data has to be converted into FFCV’s custom .beton format;
- the data loader: a replacement for torch.utils.data.DataLoader, in charge of pre-fetching and caching;
- the transforms: FFCV ships its own, which support memory pre-allocation and batch-level parallelism.

These 3 components are inter-dependent and cannot be used in isolation: it’s impossible to use FFCV’s transforms without relying on its own data loader, and you can’t use the data loader without using the custom dataset format.
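
To make the coupling concrete, here is roughly what end-to-end usage looks like. I’m writing this sketch from memory of FFCV’s public docs, so treat the exact argument names and values as approximate:

```python
# Rough sketch of typical FFCV usage, based on its public docs (argument
# names/values are approximate and may differ across versions).
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor

# Step 1: convert an existing (map-style) dataset to the .beton format.
writer = DatasetWriter("train.beton", {
    "image": RGBImageField(),
    "label": IntField(),
})
writer.from_indexed_dataset(my_dataset)  # `my_dataset` is a placeholder

# Step 2: build the loader, which replaces torch.utils.data.DataLoader and
# runs the per-field decoding/transform pipelines itself.
loader = Loader(
    "train.beton",
    batch_size=128,
    num_workers=8,
    order=OrderOption.QUASI_RANDOM,  # sampling order, discussed below
    os_cache=False,                  # page-based pre-fetching, discussed below
    pipelines={
        "image": [SimpleRGBImageDecoder(), ToTensor()],
        "label": [IntDecoder(), ToTensor()],
    },
)
```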

The most interesting bit of FFCV IMHO is the data loader, and how it pre-fetches/caches the data. To understand how it works, we first need to look at the dataset format:

Custom Dataset Format

FFCV requires storing the data in a specific .beton format. There are 2 main aspects to this format:

- samples are made of typed fields (e.g. an RGB image field and an integer label field), each with its own encoder and decoder;
- samples are grouped into fixed-size pages within the file, and a page is the unit that gets read from disk (see the sketch below).
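
Here’s a purely conceptual sketch of the page idea (hypothetical page size and offsets, not FFCV’s actual code): knowing a sample’s byte offset in the file tells you which page has to be resident in RAM to read it.

```python
# Conceptual sketch (not FFCV's actual code): if samples are packed into
# fixed-size pages, a sample's byte offset in the file determines which
# page must be loaded in RAM to read it.
PAGE_SIZE = 8 * 1024 * 1024  # hypothetical 8 MiB pages

def page_of(sample_offset: int) -> int:
    """Index of the page containing the byte at `sample_offset`."""
    return sample_offset // PAGE_SIZE

# Toy example: byte offsets of 4 samples in the .beton file.
offsets = [0, 5_000_000, 9_000_000, 16_900_000]
print([page_of(o) for o in offsets])  # -> [0, 0, 1, 2]
```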

This concept of page in the .beton file is directly related to the way samples are pre-fetched, cached and loaded by the data loader:

DataLoader: pre-fetching and caching

FFCV’s data loader has an os_cache parameter that determines how the data is pre-fetched:

When os_cache = True things are pretty simple: the entire .beton file is memory-mapped into RAM. Needless to say, this doesn’t work for big-ish datasets.
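
As an analogy (not FFCV’s actual implementation), memory-mapping a file with numpy looks like this; the OS then pages the file in lazily and keeps the hot parts in its page cache:

```python
import numpy as np

# Analogy only, not FFCV's code: map the whole file into the process's
# address space. Reads go through the OS page cache, so frequently used
# parts of the file end up (and stay) in RAM.
data = np.memmap("train.beton", dtype=np.uint8, mode="r")
first_bytes = bytes(data[:16])  # touching the data triggers actual page reads
```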

The most interesting engineering happens when os_cache = False. In this case, FFCV is able to pre-fetch the samples in advance, because it knows exactly when each sample will be needed during the training loop. Instead of storing the entire .beton file in memory, it only stores a small-ish number of pages. The number of pages it stores (called the number of slots) is determined at runtime by figuring out how many pages need to be loaded for any given batch of the training loop.

With a fully random sampling order, this number of slots would be quite high, because 2 samples from the same page can end up in 2 batches that are needed “far apart” during the training loop, e.g. the first and the last batch. Since a page can only be loaded (and unloaded) once, that page would need to stay loaded for the entire epoch. In the worst case, this means we potentially need all pages to be loaded at any given time. This is where FFCV’s QUASI_RANDOM order comes in: it keeps the number of slots needed reasonably small by only shuffling within N pages at any given time, restricting the number of slots to N. Unfortunately QUASI_RANDOM isn’t supported for distributed training, so one always needs to load the entire dataset in RAM when using DDP. This isn’t a hard limitation, just something that isn’t supported yet.
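
To get an intuition for why the sampling order matters, here is a small standalone simulation (hypothetical page and batch sizes, and a simplified notion of “quasi-random”, not FFCV’s actual bookkeeping) that counts how many distinct pages must be resident at once:

```python
import random

def max_live_pages(order, samples_per_page=100, batch_size=50):
    """Max number of pages that must be resident at the same time, assuming a
    page stays loaded from the first batch that needs it until the last one."""
    first, last = {}, {}
    for i, sample in enumerate(order):
        batch, page = i // batch_size, sample // samples_per_page
        first.setdefault(page, batch)
        last[page] = batch
    n_batches = (len(order) + batch_size - 1) // batch_size
    return max(
        sum(1 for p in first if first[p] <= b <= last[p])
        for b in range(n_batches)
    )

n_samples = 10_000  # 100 pages of 100 samples each

fully_random = random.sample(range(n_samples), n_samples)

# Quasi-random-ish order: only shuffle within consecutive windows of 10 pages.
window = 10 * 100
quasi_random = []
for start in range(0, n_samples, window):
    chunk = list(range(start, min(start + window, n_samples)))
    random.shuffle(chunk)
    quasi_random.extend(chunk)

print(max_live_pages(fully_random))   # ~100: nearly every page must stay loaded
print(max_live_pages(quasi_random))   # ~10: bounded by the shuffling window
```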

The actual page loading (transfer from .beton file to RAM) is handled by an army of 12 threads running in the background. Each thread continuously waits for a load request, and performs the page read when requested. No GIL, no problem.
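
A stripped-down version of that pattern (a sketch, not FFCV’s actual code) could look like this: a pool of daemon threads blocks on a queue of page requests and copies each requested page into a pre-allocated slot. Since the file reads release the GIL, the threads really do run in parallel.

```python
import queue
import threading
import numpy as np

PAGE_SIZE = 8 * 1024 * 1024   # hypothetical page size
N_READER_THREADS = 12

requests = queue.Queue()                            # holds (page_index, slot_index) tuples
slots = np.empty((16, PAGE_SIZE), dtype=np.uint8)   # pre-allocated page slots in RAM

def page_reader(path):
    """Wait for load requests and copy the requested page into its slot."""
    with open(path, "rb") as f:
        while True:
            page, slot = requests.get()              # blocks until a request arrives
            f.seek(page * PAGE_SIZE)
            f.readinto(memoryview(slots[slot]))      # the disk read releases the GIL
            requests.task_done()

for _ in range(N_READER_THREADS):
    threading.Thread(target=page_reader, args=("train.beton",), daemon=True).start()

# The loader then enqueues pages ahead of time, e.g. requests.put((3, 0)),
# and later decodes samples straight out of `slots`.
```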

Overall, this page-based pre-fetching and caching mechanism is very different from what the current torch.utils.data.DataLoader does: FFCV pre-loads entire pages of samples ahead of time, as it knows when they will be needed. Most samples are already available in RAM when the data loader needs them, as the sample’s page has already been read from disk and loaded. In Pytorch’s DataLoader, individual samples are read and loaded only when they are needed (just a couple of batches in advance, as controlled by the prefetch_factor parameter), which requires less memory, but more disk reads.
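
For reference, these are the corresponding knobs on the PyTorch side: each worker process runs the per-sample pipeline and only keeps prefetch_factor batches ready ahead of consumption.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.randn(10_000, 3, 32, 32),     # stand-in images
    torch.randint(0, 10, (10_000,)),    # stand-in labels
)

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,      # worker processes, each loading/transforming individual samples
    prefetch_factor=2,  # each worker prepares at most 2 batches in advance
)

for images, labels in loader:
    pass  # training step goes here
```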

DataLoader: parallelism and transforms

FFCV’s built-in transforms implement a protocol to declare the amount of memory that their output requires. This allows the data loader to pre-allocate all of that space at the beginning of the training loop and to re-use it for each batch, which saves allocation time. While arbitrary transforms (e.g. torchvision’s transforms) can be used with FFCV, only the FFCV built-ins can leverage this pre-allocation mechanism, and writing a transform that is fully compatible with FFCV isn’t always easy.
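
Here is a conceptual sketch of that protocol. This is not FFCV’s actual API (custom FFCV transforms subclass its Operation class), just the shape of the idea: declare the output buffer once, then write into it on every batch.

```python
import numpy as np

class PreallocatedFlip:
    """Toy transform following a pre-allocation protocol (not FFCV's API)."""

    def declare_memory(self, batch_shape, dtype):
        # Tell the loader how much output memory this transform needs.
        return batch_shape, dtype  # same shape/dtype as the input here

    def __call__(self, images, dst):
        # Write the horizontally flipped batch into the pre-allocated buffer:
        # no new allocation happens per batch.
        np.copyto(dst, images[:, :, :, ::-1])
        return dst

# Done once, before the training loop:
transform = PreallocatedFlip()
batch_shape, dtype = (128, 3, 32, 32), np.uint8
out_shape, out_dtype = transform.declare_memory(batch_shape, dtype)
dst = np.empty(out_shape, dtype=out_dtype)   # allocated a single time

# Done on every batch: the same buffer is reused.
batch = np.random.randint(0, 256, size=batch_shape, dtype=dtype)
flipped = transform(batch, dst)
```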

The parallelism of FFCV’s data loader is implemented very differently from torch.utils.data: there are only 2 workers: the main worker, which consumes samples from its queue, and a background worker which loads samples, transforms them, and puts them in that queue. The bulk of the parallelism happens at the transform level: most transforms do a prange from numba over all the images in the batch they’re given, i.e. the worker processes each image of a batch in parallel across num_workers threads. Only the CPU transforms can leverage that kind of parallelism.
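
That batch-level parallelism is the pattern you get from numba’s prange. Here is a minimal standalone example, a hypothetical brightness transform rather than one of FFCV’s built-ins, where the threads split the images of a batch among themselves:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def brighten_batch(images, dst, delta):
    # prange distributes the images of the batch across numba threads
    # (thread count plays the role of num_workers, see numba.set_num_threads);
    # each thread handles whole images independently, with no GIL involved.
    n, c, h, w = images.shape
    for i in prange(n):
        for j in range(c):
            for y in range(h):
                for x in range(w):
                    v = images[i, j, y, x] + delta
                    dst[i, j, y, x] = min(max(v, 0.0), 255.0)
    return dst

batch = np.random.randint(0, 255, size=(128, 3, 32, 32)).astype(np.float32)
out = np.empty_like(batch)
brighten_batch(batch, out, 10.0)  # the 128 images are processed in parallel
```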

This is very different from pytorch’s DataLoader, where num_workers workers (usually processes) run in parallel, each worker processing the images of a batch individually and putting the transformed result in the main worker’s queue.

Other random facts:

Nicolas Hug

ML Engineer
