Parallelism (notes)
Hugging Face Accelerate
A nice explanation about model partitioning across devices for inference. From the explanation of load_checkpoint_and_dispatch:
By passing device_map="auto", we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources:

- First, we use the maximum space available on the GPU(s).
- If we still need space, we store the remaining weights on the CPU (meaning RAM).
- If there is not enough RAM, we store the remaining weights on the hard drive as memory-mapped tensors.
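A minimal sketch of this loading flow, assuming a causal LM from Transformers; the model name, the local weights directory, and the offload folder below are placeholders:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "bigscience/bloom-7b1"          # placeholder: any large checkpoint
weights_dir = "/path/to/downloaded/weights"  # placeholder: local checkpoint folder

# Build the model skeleton without allocating real weights
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Let Accelerate spread the weights across GPU(s), CPU RAM and, if needed, disk
model = load_checkpoint_and_dispatch(
    model,
    weights_dir,
    device_map="auto",
    offload_folder="offload",  # where disk-offloaded weights are stored
)
```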
A bit later, in the discussion about running the model: Now that we have done this, our model lies across several devices, and maybe the hard drive. But it can still be used as a regular PyTorch model (...) behind the scenes, Accelerate added hooks to the model, so that:
- At each layer, the inputs are put on the right device (so even if your model is spread across several GPUs, it works).
- For the weights offloaded on the CPU, they are put on a GPU just before the forward pass and cleaned up just after.
- For the weights offloaded on the hard drive, they are loaded in RAM, then put on a GPU just before the forward pass and cleaned up just after.
This way, your model can run inference even if it does not fit on one of the GPUs or the CPU RAM, as long as the largest layer fits on the GPU.
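Continuing the sketch above, running the dispatched model looks like ordinary PyTorch inference; the main thing to remember is to put the inputs on the device of the first layer (assumed here to be GPU 0), with the tokenizer loaded via AutoTokenizer:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)  # first GPU

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```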
Important comment: this only supports the inference of your model, [not training]{.underline}. Most of the computation happens behind torch.no_grad() context managers to avoid spending GPU memory on intermediate activations. To do distributed training, check the DeepSpeed library with ZeRO.
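A small, self-contained illustration of why torch.no_grad() saves memory (generic PyTorch, not Accelerate internals): inside the context manager no autograd graph is recorded, so intermediate activations are not kept around for a backward pass.

```python
import torch

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024)

y_train = layer(x)          # graph recorded: activations kept for backward
with torch.no_grad():
    y_infer = layer(x)      # no graph recorded: nothing kept for backward

print(y_train.requires_grad)  # True
print(y_infer.requires_grad)  # False
```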
From this link:
Libraries in the Hugging Face ecosystem, like Transformers or Diffusers, support Big Model Inference in their from_pretrained constructors. You just need to add device_map="auto" in the from_pretrained method to enable Big Model Inference.
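For example (the model name is just a placeholder for any checkpoint that does not fit on a single GPU):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"  # placeholder: any large checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```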
An interesting comment about CUDA.
On the same page: “When a first allocation happens in PyTorch, it loads CUDA kernels which take about 1-2GB of memory depending on the GPU. Therefore, [you always have less usable memory than the actual size of the GPU]{.underline}. To see how much memory is actually used do torch.ones(1).cuda() and look at the memory usage.”
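A quick way to see this (a sketch; the exact overhead depends on the GPU and the CUDA/PyTorch versions):

```python
import torch

torch.ones(1).cuda()  # first allocation: initializes the CUDA context and loads kernels

# PyTorch's allocator only tracks the tiny tensor itself...
print(torch.cuda.memory_allocated())  # a few bytes

# ...while the context/kernel overhead shows up in the driver's view,
# e.g. via nvidia-smi or torch.cuda.mem_get_info()
free, total = torch.cuda.mem_get_info()
print(f"{(total - free) / 1e9:.2f} GB used on the device")
```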