# Deep dive into FFCV internals
There’s been a lot of hype about FFCV, a pytorch-compatible data-loading library that claims major improvements over current pytorch data-loading solutions. But not a lot was written about what it actually does, so I spent some time diving into FFCV’s internals. It was a fun and very instructive exercise, as there is a lot of smart engineering going on.
This post summarizes my findings. I wasn’t involved in FFCV’s development at all, so if I got something wrong or missed critical details, please do let me know!
TL;DR:
- FFCV markets itself as data-loading for computer vision, but a lot of what it does is about general data loading, not just about vision.
- FFCV relies on a page-based pre-fetching mechanism, where “pages” of multiple samples are loaded in advance and cached in RAM. Most samples are already available in RAM when the data loader needs them. This contrasts with most uses of pytorch’s `DataLoader`, where samples are loaded and read from disk individually each time they’re needed (or a bit before). Basically, in FFCV: bigger cache, fewer IO operations.
- The parallelism of FFCV’s data loader relies on threads from `numba`, not on multi-processing.
- You can’t use one of FFCV’s parts in isolation: you either use the whole thing (custom dataset format + custom loader + transforms) or none of it.
- If you want distributed training, you need to store the entire dataset in RAM. This isn’t a hard limitation, just something that isn’t (yet?) supported.
## Overview
FFCV controls the 3 main parts of the entire data loading pipeline:
- The dataset format itself: how the data is stored on disk.
- The data loader: how the data is read from disk and loaded into memory (RAM or GPU).
- The pre-processing transforms: how the jpg files are decoded, augmented, etc.
These 3 components are inter-dependent and cannot be used in isolation: it’s impossible to use FFCV’s transforms without relying on FFCV’s own data loader, and you can’t use the data loader without using the custom dataset format.
The most interesting bit of FFCV IMHO is the data loader, and how it pre-fetches/caches the data. To understand how it works, we first need to look at the dataset format:
## Custom Dataset Format
FFCV requires storing the data in a specific `.beton` format. There are 2 main aspects to this format:
- Images can be stored as encoded jpegs or as decoded (raw pixel) arrays. You can choose whether you want all images to be encoded, all to be decoded, or a given percentage of each. This is a one-time decision made when writing the file; it cannot be changed on the fly at runtime.
- The entire `.beton` file is divided into memory segments called “pages”. The page size is 8 MB by default. A page typically contains multiple images: ~100 images per page for compressed ImageNet jpegs.
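For illustration, here’s roughly what writing a `.beton` file looks like with FFCV’s `DatasetWriter`, assuming `my_dataset` is any indexed dataset returning `(image, label)` pairs (parameter names are taken from FFCV’s docs and may vary across versions):

```python
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField

writer = DatasetWriter(
    "imagenet.beton",
    {
        # "proportion" stores compress_probability of the images as encoded
        # jpegs and the rest as decoded pixels: the one-time decision
        # mentioned above.
        "image": RGBImageField(write_mode="proportion", compress_probability=0.5),
        "label": IntField(),
    },
    page_size=8 * 1024 * 1024,  # the 8 MB default
)
writer.from_indexed_dataset(my_dataset)
```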
This concept of page in the `.beton` file is directly related to the way samples are pre-fetched, cached and loaded by the data loader:
## DataLoader: pre-fetching and caching
FFCV’s data loader has an `os_cache` parameter that determines how the data is pre-fetched:

When `os_cache = True`, things are pretty simple: the entire `.beton` file is `memmap`ed into RAM. Needless to say, this doesn’t work for big-ish datasets.
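Conceptually, this path boils down to something like the following numpy sketch (the file name is made up, and FFCV’s actual memory manager is more involved):

```python
import numpy as np

# Memory-map the whole file: the OS pages it into RAM lazily and keeps it
# in the page cache, so repeated reads hit RAM instead of disk.
data = np.memmap("imagenet.beton", dtype=np.uint8, mode="r")
```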
The most interesting engineering happens when `os_cache = False`. In this case, FFCV is able to pre-fetch the samples in advance, because it knows exactly when each sample will be needed during the training loop. Instead of storing the entire `.beton` file in memory, it only stores a small-ish number of pages. The number of pages it stores (called the number of slots) is determined at runtime by figuring out how many pages need to be loaded for any given batch of the training loop.
With a fully random ordering, this number of slots would be quite high, because 2 samples of the same page can be present in 2 batches that are needed “far apart” during the training loop: e.g. the first and the last batch. Since a page can only be loaded (and unloaded) once, the page would need to stay loaded during the entire epoch. In the worst case, this means we potentially need all pages to be loaded at any given time. But this is where FFCV’s `QUASI_RANDOM` sampler comes in: it makes sure that the number of slots needed stays reasonably small, by only shuffling within `N` pages at any given time, restricting the number of slots to `N`. Unfortunately `QUASI_RANDOM` isn’t supported for distributed training, so one always needs to load the entire dataset in RAM for DDP use-cases. This isn’t a hard limitation, just something that isn’t supported yet.
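In code, this behavior is selected when constructing the loader. A minimal sketch, assuming an `imagenet.beton` file with an image and a label field (the exact pipelines depend on your fields):

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.transforms import ToTensor

loader = Loader(
    "imagenet.beton",
    batch_size=256,
    num_workers=12,
    order=OrderOption.QUASI_RANDOM,  # shuffle within a window of N pages only
    os_cache=False,                  # page-based pre-fetching instead of memmap
    pipelines={
        "image": [SimpleRGBImageDecoder(), ToTensor()],
        "label": [IntDecoder(), ToTensor()],
    },
)

for images, labels in loader:
    ...  # training step
```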
The actual page loading (transfer from the `.beton` file to RAM) is handled by an army of 12 threads running in the background. Each thread continuously waits for a load request, and performs the page read when requested. No GIL, no problem.
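To make that pattern concrete, here is a toy sketch of the request/worker design in plain Python threads. This is not FFCV’s actual code (FFCV drives its readers through numba and its own page table); it just shows the shape of the mechanism:

```python
import threading
import queue

PAGE_SIZE = 8 * 1024 * 1024
NUM_READERS = 12

load_requests = queue.Queue()  # page numbers that should be fetched
page_cache = {}                # page number -> raw bytes
cache_lock = threading.Lock()

def page_reader(path):
    # Each thread blocks until a page is requested, then reads it.
    # File reads release the GIL, so the 12 threads truly overlap.
    with open(path, "rb") as f:
        while True:
            page_no = load_requests.get()
            f.seek(page_no * PAGE_SIZE)
            data = f.read(PAGE_SIZE)
            with cache_lock:
                page_cache[page_no] = data

for _ in range(NUM_READERS):
    threading.Thread(target=page_reader, args=("data.beton",), daemon=True).start()
```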
Overall, this page-based pre-fetching and caching mechanism is very different from what the current `torch.utils.data.DataLoader` does: FFCV pre-loads entire pages of samples ahead of time, as it knows when they will be needed. Most samples are already available in RAM when the data loader needs them, as the sample’s page has already been read from disk and loaded. In pytorch’s `DataLoader`, individual samples are read and loaded only when they are needed (or slightly before, controlled by the `prefetch_factor` parameter), which uses less memory, but requires more disk reads.
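For comparison, per-sample prefetching on the pytorch side looks like this (`dataset` is assumed to be any map-style dataset; `prefetch_factor` counts batches per worker):

```python
from torch.utils.data import DataLoader

# Each of the 8 worker processes keeps at most 2 batches pre-loaded;
# every sample is still read from disk individually, shortly before use.
loader = DataLoader(dataset, batch_size=256, num_workers=8, prefetch_factor=2)
```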
## DataLoader: parallelism and transforms
FFCV’s built-in transforms implement a protocol to declare the amount of memory that their output requires. This allows the data-loader to pre-allocate all of that space at the beginning of the training loop, and to re-use that space for each batch. This saves allocation time. While arbitrary transforms (e.g. torchvision’s transforms) can be used with FFCV, only the FFCV built-ins can leverage this pre-allocation mechanism. Writing a transform that is fully compatible with FFCV isn’t always easy.
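As an illustration, a custom transform roughly follows this shape (adapted from the custom-transform pattern in FFCV’s docs; treat the exact signatures as version-dependent):

```python
from dataclasses import replace

from ffcv.pipeline.operation import Operation
from ffcv.pipeline.allocation_query import AllocationQuery

class Invert(Operation):
    """Toy transform: invert pixel values, writing into a pre-allocated buffer."""

    def declare_state_and_memory(self, previous_state):
        # Declare an output of the same shape/dtype as the input; the loader
        # allocates this buffer once and reuses it for every batch.
        return (
            replace(previous_state, jit_mode=True),
            AllocationQuery(previous_state.shape, previous_state.dtype),
        )

    def generate_code(self):
        def invert(images, dst):
            dst[:] = 255 - images  # dst is the pre-allocated buffer
            return dst
        return invert
```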
The parallelism of FFCV’s data loader is implemented very differently from `torch.utils.data`: there are only 2 workers, the main worker, which consumes samples from its queue, and another worker, which works in the background to load samples, transform them, and put them in the queue to be consumed by the main worker. The bulk of the parallelism happens at the transform level: most transforms do a numba `prange` over all images in the batch they’re given, i.e. the worker will process each image of a batch in parallel over `num_workers` threads. Only the CPU transforms can leverage that kind of parallelism.
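Here’s a standalone example of that batch-level parallelism pattern with numba (a hypothetical normalization step, not one of FFCV’s built-ins):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def normalize_batch(images, out, mean, std):
    # Each image of the batch is handled by a separate thread.
    for i in prange(images.shape[0]):
        out[i] = (images[i] - mean) / std
    return out

batch = np.random.rand(64, 224, 224, 3).astype(np.float32)
out = np.empty_like(batch)
normalize_batch(batch, out, np.float32(0.485), np.float32(0.229))
```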
This is very different from pytorch’s `DataLoader`, where `num_workers` workers (separate processes) run in parallel, with each worker processing each image in a batch individually, and putting the transformed result in the main worker’s queue.
Other random facts:
- FFCV’s data loader can handle GPU transforms (but these can’t be parallelized). I assume this is because there are only 2 workers and thus no multi-processing issues?
- For now, it’s impossible to transform both the target and the image. They say they’re working on it though.
- For decoding the images, they use libjpeg-turbo.
- CPU transforms are jitted with `numba` and sometimes “fused” together when possible. I’m not sure this amounts to a significant gain compared to pytorch’s C++ operators though. Perhaps the fusion also saves the time of writing down intermediate results? I could be (completely) wrong, but I assume this is just a nice “plus” they eventually implemented, because they already needed `numba` to implement parallel transforms anyway.
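To illustrate what saving intermediate results could look like, here’s a hypothetical two-op example (not FFCV code): the unfused version materializes a full intermediate array, while the fused one does a single pass over a 1-D array.

```python
import numpy as np
from numba import njit

@njit
def two_passes(x):
    tmp = x * (1.0 / 255.0)   # first pass writes a full intermediate array
    return (tmp - 0.5) * 2.0  # second pass reads it back from memory

@njit
def fused(x):
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        # single pass: the intermediate value stays in a register
        out[i] = (x[i] * (1.0 / 255.0) - 0.5) * 2.0
    return out
```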