Deep dive into FFCV internals
There’s been a lot of hype about FFCV, a pytorch-compatible data-loading library that claims major improvements over current pytorch data-loading solutions. But not a lot has been written about what it actually does, so I spent some time diving into FFCV’s internals. It was a fun and very instructive exercise, as there is a lot of smart engineering going on.
This post summarizes my findings. I wasn’t involved in FFCV’s development at all, so if I got something wrong or missed critical details, please do let me know!
- FFCV markets itself as data-loading for computer vision, but a lot of what it does is about general data loading, not just about vision.
- FFCV relies on a page-based pre-fetching mechanism, where “pages” of multiple samples are loaded in advance and cached in RAM. Most samples are already available in RAM when the data loader needs them. This contrasts with most uses of pytorch’s DataLoader, where samples are loaded and read from disk individually each time they’re needed (or a bit before). Basically, in FFCV: bigger cache, fewer IO operations.
- The parallelism of FFCV’s data loader relies on threads from
numba, not on multi-processing.
- You can’t use one of FFCV’s parts in isolation: you either use the whole thing (custom dataset format + custom loader + transforms) or none of it.
- If you want distributed training, you need to store the entire dataset in RAM. This isn’t a hard limitation, just something that isn’t (yet?) supported.
FFCV controls the 3 main parts of the entire data loading pipeline:
- The dataset format itself - how the data is stored on disk
- The data loader - how the data is read from disk and loaded into memory (RAM or GPU)
- The pre-processing transforms - how the jpg files are decoded, augmented, etc.
These 3 components are inter-dependent and cannot be used in isolation: it’s impossible to use FFCV’s transforms without relying on their own data loader, and you can’t use the data loader without using the custom dataset format.
The most interesting bit of FFCV IMHO is the data loader, and how it pre-fetches/caches the data. To understand how it works, we first need to look at the dataset format:
Custom Dataset Format
FFCV requires storing the data in a specific
.beton format. There are 2 main
aspects to this format:
- Images can be stored as encoded jpegs or decoded jpegs. You can choose whether you want all images encoded, all decoded, or a given percentage of each. This is a one-time decision made when writing the dataset; it cannot be changed on the fly at runtime.
- The entire .beton file is divided into memory segments called “pages”. The page size is 8 MB by default. A page typically contains multiple images: ~100 images per page for compressed ImageNet jpegs.
This concept of page in the
.beton file is directly related to the way samples
are pre-fetched, cached and loaded by the data loader:
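As a back-of-the-envelope illustration (my own sketch, not FFCV code), mapping a sample’s byte offset in the .beton file to its page is simple integer arithmetic; the ~80 KB average jpeg size below is my assumption:

```python
PAGE_SIZE = 8 * 2**20  # 8 MB, the default page size

def page_of(byte_offset: int) -> int:
    """Index of the page containing the byte at `byte_offset` in the .beton file."""
    return byte_offset // PAGE_SIZE

# ~80 KB per compressed ImageNet jpeg -> roughly 100 images per page
print(PAGE_SIZE // (80 * 1024))        # -> 102
print(page_of(0), page_of(PAGE_SIZE))  # -> 0 1
```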
DataLoader: pre-fetching and caching
FFCV’s data loader has an os_cache parameter that determines how the data is cached. With os_cache = True things are pretty simple: the entire .beton file is loaded and cached in RAM. Needless to say, this doesn’t work for big-ish datasets.
The most interesting engineering happens when os_cache = False. In this case, FFCV is able to pre-fetch the samples in advance, because it knows exactly when each sample will be needed during the training loop. Instead of storing the entire .beton file in memory, it only stores a small-ish number of pages. The number of pages it stores (called the number of slots) is determined at runtime by figuring out how many pages need to be loaded for any given batch of the training loop.
With a fully random ordering, this number of slots would be quite high, because 2 samples of the same page can be present in 2 batches that are needed “far apart” during the training loop: e.g. the first and last batch. Since a page can only be loaded (and unloaded) once, the page would need to stay loaded during the entire iteration. In the worst case, this means we potentially need all pages to be loaded at any given time. But this is where FFCV’s QUASI_RANDOM ordering comes in: it makes sure that the number of slots needed stays reasonably low by only shuffling within N pages at any given time, restricting the number of pages that must be resident simultaneously. Note that QUASI_RANDOM isn’t supported for distributed training, so one always needs to load the entire dataset in RAM for DDP uses. This isn’t a hard limitation, just something that isn’t supported yet.
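To see why restricting the shuffle helps, here is a toy simulation (entirely my own, not FFCV’s code, and much simpler than the real QUASI_RANDOM algorithm): a page is “live” from the first batch that touches one of its samples until the last one, and the number of slots needed is the peak number of simultaneously live pages.

```python
import random

SAMPLES_PER_PAGE = 100   # ~100 jpegs fit in one 8 MB page
N_PAGES = 50
WINDOW_PAGES = 4         # quasi-random: shuffle only within this many pages at a time

def peak_live_pages(order):
    """A page must stay resident from its first use to its last use;
    return the peak number of simultaneously resident pages (slots needed)."""
    pages = [s // SAMPLES_PER_PAGE for s in order]
    first, last = {}, {}
    for i, p in enumerate(pages):
        first.setdefault(p, i)
        last[p] = i
    live = peak = 0
    for i, p in enumerate(pages):
        if first[p] == i:
            live += 1
            peak = max(peak, live)
        if last[p] == i:
            live -= 1
    return peak

rng = random.Random(0)
samples = list(range(N_PAGES * SAMPLES_PER_PAGE))

# fully random ordering: nearly every page stays live for the whole epoch
fully_random = samples[:]
rng.shuffle(fully_random)

# windowed ("quasi-random") ordering: shuffle within groups of WINDOW_PAGES pages
quasi, window = [], WINDOW_PAGES * SAMPLES_PER_PAGE
for start in range(0, len(samples), window):
    chunk = samples[start:start + window]
    rng.shuffle(chunk)
    quasi += chunk

print(peak_live_pages(fully_random))  # close to N_PAGES
print(peak_live_pages(quasi))         # at most WINDOW_PAGES
```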
The actual page loading (the transfer from the .beton file to RAM) is handled by an army of 12 numba threads running in the background. Each thread continuously waits for a load request, and performs the page read when one comes in. No GIL, no problem.
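A minimal sketch of this producer/consumer pattern, with plain Python threads standing in for FFCV’s numba threads (which, unlike these, actually release the GIL during the reads):

```python
import threading
import queue

N_THREADS = 12  # FFCV's army of background reader threads

def reader(requests: queue.Queue, cache: dict, lock: threading.Lock):
    """Each thread blocks on the request queue and loads one page per request."""
    while True:
        page_idx = requests.get()
        if page_idx is None:          # sentinel: shut down
            return
        data = b"x" * 16              # stand-in for reading the page's bytes from disk
        with lock:
            cache[page_idx] = data    # page is now available in RAM
        requests.task_done()

requests, cache, lock = queue.Queue(), {}, threading.Lock()
threads = [threading.Thread(target=reader, args=(requests, cache, lock), daemon=True)
           for _ in range(N_THREADS)]
for t in threads:
    t.start()

# The loader requests the pages it knows it will soon need...
for page_idx in range(30):
    requests.put(page_idx)
requests.join()                       # ...and they all end up cached
print(len(cache))                     # -> 30

for _ in threads:                     # tell the threads to exit
    requests.put(None)
```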
Overall, this page-based pre-fetching and caching mechanism is very different from what the current torch.utils.data.DataLoader does: FFCV pre-loads entire pages of samples ahead of time, as it knows when they will be needed. Most samples are already available in RAM when the data loader needs them, as the sample’s page has already been read from disk and loaded. In pytorch’s DataLoader, individual samples are read and loaded only shortly before they are needed (controlled by the prefetch_factor parameter), which uses less memory, but requires more disk reads.
DataLoader: parallelism and transforms
FFCV’s built-in transforms implement a protocol to declare the amount of memory that their output requires. This allows the data-loader to pre-allocate all of that space at the beginning of the training loop, and to re-use that space for each batch. This saves allocation time. While arbitrary transforms (e.g. torchvision’s transforms) can be used with FFCV, only the FFCV built-ins can leverage this pre-allocation mechanism. Writing a transform that is fully compatible with FFCV isn’t always easy.
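Here is a simplified sketch of what such a protocol can look like (the names and interface are hypothetical, loosely inspired by FFCV; its real transform protocol is richer): each transform declares how much output memory it needs, the loader allocates the buffers once, and every batch is written into the same buffers.

```python
import array

class Scale:
    """Toy transform: multiplies each value; output is the same size as its input."""
    def __init__(self, factor):
        self.factor = factor

    def declare_memory(self, n_values: int) -> int:
        # Tell the loader how big our output buffer must be (hypothetical protocol).
        return n_values

    def run(self, inp, out):
        # Write into the pre-allocated buffer instead of allocating a new one.
        for i, v in enumerate(inp):
            out[i] = v * self.factor

BATCH_VALUES = 8
pipeline = [Scale(2), Scale(3)]

# The loader allocates one buffer per transform, once, before the training loop...
buffers = [array.array("d", [0.0] * t.declare_memory(BATCH_VALUES)) for t in pipeline]

# ...and reuses them for every batch: no per-batch allocations.
for batch_id in range(2):
    data = array.array("d", range(BATCH_VALUES))
    for t, out in zip(pipeline, buffers):
        t.run(data, out)
        data = out

print(list(buffers[-1]))  # each input value multiplied by 2 * 3
```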
The parallelism of FFCV’s data loader is implemented very differently from torch.utils.data: there are only 2 workers: the main worker, and a second worker which works in the background to load samples, transform them, and put them in the queue to be consumed by the main worker’s loader. The bulk of the parallelism happens at the transform level: most transforms do a parallel loop from numba over all the images in the batch they’re given, i.e. the worker will process each image of a batch in parallel over num_workers threads. Only the CPU transforms can leverage that kind of parallelism.
This is very different from pytorch’s DataLoader, where num_workers workers (processes) are run in parallel, with each worker processing each image in a batch individually, and putting the transformed result in the main worker’s queue.
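A rough sketch of the FFCV-style scheme, using a stdlib thread pool to imitate numba’s parallel loop over the images of a single batch (real numba-jitted loops bypass the GIL, which plain Python threads don’t):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8

def transform_image(img):
    """Stand-in for a CPU transform applied to one image of the batch."""
    return [px + 1 for px in img]

batch = [[i, i, i] for i in range(16)]  # 16 tiny "images"

# One loader worker; parallelism happens across the images *within* the batch.
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    transformed = list(pool.map(transform_image, batch))

print(transformed[0], transformed[15])  # -> [1, 1, 1] [16, 16, 16]
```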
Other random facts:
- FFCV’s data loader can handle GPU transforms (but these can’t be parallelized). I assume this is because there are only 2 workers and thus no multi-processing issues?
- For now, it’s impossible to transform both the target and the image. They say they’re working on it though.
- For decoding the images, they use libjpeg-turbo.
- CPU transforms are jitted with numba and sometimes “fused” together when possible. I’m not sure this amounts to a significant gain compared to pytorch’s C++ operators though. Perhaps the fusion also saves the time of writing down intermediate results? I could be (completely) wrong, but I assume this is just a nice “plus” they eventually implemented, because they already needed numba to implement parallel transforms anyway.