Deep dive into FFCV internals
There’s been a lot of hype about FFCV, a pytorch-compatible data-loading library that claims major improvements over current pytorch data-loading solutions. But not a lot has been written about what it actually does, so I spent some time diving into FFCV’s internals. It was a fun and very instructive exercise, as there is a lot of smart engineering going on.
This post summarizes my findings. I wasn’t involved in FFCV’s development at all, so if I got something wrong or missed critical details, please do let me know!
- FFCV markets itself as data-loading for computer vision, but a lot of what it does is about general data loading, not just about vision.
- FFCV relies on a page-based pre-fetching mechanism, where “pages” of multiple samples are loaded in advance and cached in RAM. Most samples are already available in RAM when the data loader needs them. This contrasts with most uses of pytorch’s DataLoader, where samples are loaded and read from disk individually each time they’re needed (or a bit before). Basically, in FFCV: bigger cache, fewer IO operations.
- The parallelism of FFCV’s data loader relies on threads from
numba, not on multi-processing.
- You can’t use one of FFCV’s parts in isolation: you either use the whole thing (custom dataset format + custom loader + transforms) or none of it.
- If you want distributed training, you need to store the entire dataset in RAM. This isn’t a hard limitation, just something that isn’t (yet?) supported.
FFCV controls the 3 main parts of the entire data loading pipeline:
- The dataset format itself - how the data is stored on disk
- The data loader - how the data is read from disk and loaded into memory (RAM or GPU)
- The pre-processing transforms - how the jpg files are decoded, augmented, etc.
These 3 components are inter-dependent and cannot be used in isolation: it’s impossible to use FFCV’s transforms without relying on their own data loader, and you can’t use the data loader without using the custom dataset format.
The most interesting bit of FFCV IMHO is the data loader, and how it pre-fetches/caches the data. To understand how it works, we first need to look at the dataset format:
Custom Dataset Format
FFCV requires storing the data in a specific
.beton format. There are 2 main
aspects to this format:
- Images can be stored as encoded jpegs or decoded jpegs. You can choose whether you want all images encoded, all decoded, or a given percentage of each. This is a one-time decision made when writing the dataset; it cannot be changed on the fly at runtime.
- The entire .beton file is divided into memory segments called “pages”. The page size is 8 MB by default. A page typically contains multiple images: ~100 images per page for compressed ImageNet jpegs.
This concept of page in the
.beton file is directly related to the way samples
are pre-fetched, cached and loaded by the data loader:
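As a back-of-the-envelope illustration (my own sketch, not FFCV code), mapping a sample’s byte offset in the .beton file to its page is simple integer arithmetic; the ~80 KB average jpeg size below is my assumption:

```python
PAGE_SIZE = 8 * 2**20  # 8 MB, the default page size

def page_of(byte_offset: int) -> int:
    """Index of the page containing the byte at `byte_offset` in the .beton file."""
    return byte_offset // PAGE_SIZE

# ~80 KB per compressed ImageNet jpeg -> roughly 100 images per page
print(PAGE_SIZE // (80 * 1024))        # -> 102
print(page_of(0), page_of(PAGE_SIZE))  # -> 0 1
```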
DataLoader: pre-fetching and caching
FFCV’s data loader has an os_cache parameter that determines how the data is cached. With os_cache = True things are pretty simple: the entire .beton file is loaded and cached in RAM. Needless to say, this doesn’t work for big-ish datasets.
The most interesting engineering happens when os_cache = False. In this case, FFCV is able to pre-fetch the samples in advance, because it knows exactly when each sample will be needed during the training loop. Instead of storing the entire .beton file in memory, it only stores a small-ish number of pages. The number of pages it stores (called the number of slots) is determined at runtime by figuring out how many pages need to be loaded for any given batch of the training loop.
With a fully random ordering, this number of slots would be quite high, because 2 samples of the same page can be present in 2 batches that are needed “far apart” during the training loop: e.g. the first and last batch. Since a page can only be loaded (and unloaded) once, the page would need to stay loaded during the entire iteration. In the worst case, this means we potentially need all pages to be loaded at any given time. But this is where FFCV’s QUASI_RANDOM ordering comes in: it makes sure that the number of slots needed stays reasonably low by only shuffling within N pages at any given time, restricting the number of pages that must be resident simultaneously. Note that QUASI_RANDOM isn’t supported for distributed training, so one always needs to load the entire dataset in RAM for DDP uses. This isn’t a hard limitation, just something that isn’t supported yet.
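To see why restricting the shuffle helps, here is a toy simulation (entirely my own, not FFCV’s code, and much simpler than the real QUASI_RANDOM algorithm): a page is “live” from the first batch that touches one of its samples until the last one, and the number of slots needed is the peak number of simultaneously live pages.

```python
import random

SAMPLES_PER_PAGE = 100   # ~100 jpegs fit in one 8 MB page
N_PAGES = 50
WINDOW_PAGES = 4         # quasi-random: shuffle only within this many pages at a time

def peak_live_pages(order):
    """A page must stay resident from its first use to its last use;
    return the peak number of simultaneously resident pages (slots needed)."""
    pages = [s // SAMPLES_PER_PAGE for s in order]
    first, last = {}, {}
    for i, p in enumerate(pages):
        first.setdefault(p, i)
        last[p] = i
    live = peak = 0
    for i, p in enumerate(pages):
        if first[p] == i:
            live += 1
            peak = max(peak, live)
        if last[p] == i:
            live -= 1
    return peak

rng = random.Random(0)
samples = list(range(N_PAGES * SAMPLES_PER_PAGE))

# fully random ordering: nearly every page stays live for the whole epoch
fully_random = samples[:]
rng.shuffle(fully_random)

# windowed ("quasi-random") ordering: shuffle within groups of WINDOW_PAGES pages
quasi, window = [], WINDOW_PAGES * SAMPLES_PER_PAGE
for start in range(0, len(samples), window):
    chunk = samples[start:start + window]
    rng.shuffle(chunk)
    quasi += chunk

print(peak_live_pages(fully_random))  # close to N_PAGES
print(peak_live_pages(quasi))         # at most WINDOW_PAGES
```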
The actual page loading (the transfer from the .beton file to RAM) is handled by an army of 12 numba threads running in the background. Each thread continuously waits for a load request, and performs the page read when one comes in. No GIL, no problem.
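A minimal sketch of this producer/consumer pattern, with plain Python threads standing in for FFCV’s numba threads (which, unlike these, actually release the GIL during the reads):

```python
import threading
import queue

N_THREADS = 12  # FFCV's army of background reader threads

def reader(requests: queue.Queue, cache: dict, lock: threading.Lock):
    """Each thread blocks on the request queue and loads one page per request."""
    while True:
        page_idx = requests.get()
        if page_idx is None:          # sentinel: shut down
            return
        data = b"x" * 16              # stand-in for reading the page's bytes from disk
        with lock:
            cache[page_idx] = data    # page is now available in RAM
        requests.task_done()

requests, cache, lock = queue.Queue(), {}, threading.Lock()
threads = [threading.Thread(target=reader, args=(requests, cache, lock), daemon=True)
           for _ in range(N_THREADS)]
for t in threads:
    t.start()

# The loader requests the pages it knows it will soon need...
for page_idx in range(30):
    requests.put(page_idx)
requests.join()                       # ...and they all end up cached
print(len(cache))                     # -> 30

for _ in threads:                     # tell the threads to exit
    requests.put(None)
```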
Overall, this page-based pre-fetching and caching mechanism is very different from what the current torch.utils.data.DataLoader does: FFCV pre-loads entire pages of samples ahead of time, as it knows when they will be needed. Most samples are already available in RAM when the data loader needs them, as the sample’s page has already been read from disk and loaded. In pytorch’s DataLoader, individual samples are read and loaded only shortly before they are needed (controlled by the prefetch_factor parameter), which uses less memory, but requires more disk reads.
DataLoader: parallelism and transforms
FFCV’s built-in transforms implement a protocol to declare the amount of memory that their output requires. This allows the data-loader to pre-allocate all of that space at the beginning of the training loop, and to re-use that space for each batch. This saves allocation time. While arbitrary transforms (e.g. torchvision’s transforms) can be used with FFCV, only the FFCV built-ins can leverage this pre-allocation mechanism. Writing a transform that is fully compatible with FFCV isn’t always easy.
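Here is a simplified sketch of what such a protocol can look like (the names and interface are hypothetical, loosely inspired by FFCV; its real transform protocol is richer): each transform declares how much output memory it needs, the loader allocates the buffers once, and every batch is written into the same buffers.

```python
import array

class Scale:
    """Toy transform: multiplies each value; output is the same size as its input."""
    def __init__(self, factor):
        self.factor = factor

    def declare_memory(self, n_values: int) -> int:
        # Tell the loader how big our output buffer must be (hypothetical protocol).
        return n_values

    def run(self, inp, out):
        # Write into the pre-allocated buffer instead of allocating a new one.
        for i, v in enumerate(inp):
            out[i] = v * self.factor

BATCH_VALUES = 8
pipeline = [Scale(2), Scale(3)]

# The loader allocates one buffer per transform, once, before the training loop...
buffers = [array.array("d", [0.0] * t.declare_memory(BATCH_VALUES)) for t in pipeline]

# ...and reuses them for every batch: no per-batch allocations.
for batch_id in range(2):
    data = array.array("d", range(BATCH_VALUES))
    for t, out in zip(pipeline, buffers):
        t.run(data, out)
        data = out

print(list(buffers[-1]))  # each input value multiplied by 2 * 3
```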
The parallelism of FFCV’s data loader is implemented very differently from torch.utils.data: there are only 2 workers: the main worker, and a second worker which works in the background to load samples, transform them, and put them in the queue to be consumed by the main worker’s loader. The bulk of the parallelism happens at the transform level: most transforms do a parallel loop from numba over all the images in the batch they’re given, i.e. the worker will process each image of a batch in parallel over num_workers threads. Only the CPU transforms can leverage that kind of parallelism.
This is very different from pytorch’s DataLoader, where num_workers workers (processes) are run in parallel, with each worker processing each image in a batch individually, and putting the transformed result in the main worker’s queue.
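A rough sketch of the FFCV-style scheme, using a stdlib thread pool to imitate numba’s parallel loop over the images of a single batch (real numba-jitted loops bypass the GIL, which plain Python threads don’t):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8

def transform_image(img):
    """Stand-in for a CPU transform applied to one image of the batch."""
    return [px + 1 for px in img]

batch = [[i, i, i] for i in range(16)]  # 16 tiny "images"

# One loader worker; parallelism happens across the images *within* the batch.
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    transformed = list(pool.map(transform_image, batch))

print(transformed[0], transformed[15])  # -> [1, 1, 1] [16, 16, 16]
```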
Other random facts:
- FFCV’s data loader can handle GPU transforms (but these can’t be parallelized). I assume this is because there are only 2 workers and thus no multi-processing issues?
- For now, it’s impossible to transform both the target and the image. They say they’re working on it though.
- For decoding the images, they use libjpeg-turbo.
- CPU transforms are jitted with numba and sometimes “fused” together when possible. I’m not sure this amounts to a significant gain compared to pytorch’s C++ operators though. Perhaps the fusion also saves the time of writing down intermediate results? I could be (completely) wrong, but I assume this is just a nice “plus” they eventually implemented, because they already needed numba to implement parallel transforms anyway.