PeANo

Parallelism (notes)

Hugging Face Accelerate

A nice explanation of model partitioning across devices for inference.

From the explanation of load_checkpoint_and_dispatch: By passing device_map="auto", we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources (...)
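A minimal sketch of what that looks like in code, assuming a Transformers causal LM; the model id and the checkpoint/offload paths below are placeholders:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom-560m")  # placeholder model id

# Build the model skeleton without allocating real weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Load the weights and let Accelerate decide, layer by layer, whether each one
# goes to a GPU, to CPU RAM, or to disk offload.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/checkpoint_folder",  # placeholder path
    device_map="auto",
    offload_folder="offload",                # used only if weights spill to disk
)
```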

A bit later, in the discussion about running the model: Now that we have done this, our model lies across several devices, and maybe the hard drive. But it can still be used as a regular PyTorch model (...) behind the scenes, Accelerate added hooks to the model, so that (...)

This way, your model can run inference even if it does not fit on one of the GPUs or the CPU RAM, as long as the largest layer fits on the GPU.
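A hedged sketch of using the dispatched model like a regular PyTorch model, continuing from the sketch above; the tokenizer id and prompt are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)  # first GPU

# The hooks move each layer's weights/inputs to the right device behind the
# scenes, so this looks like any other forward pass or generate() call.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```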

Important comment: this only supports the inference of your model, [not training]{.underline}. Most of the computation happens behind torch.no_grad() context managers to avoid spending GPU memory on intermediate activations. For distributed training, check out the DeepSpeed library with ZeRO.
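A tiny standalone illustration of the torch.no_grad() point (plain PyTorch, not Accelerate-specific):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

# Regular forward: autograd records the graph and saves what it needs for
# backward, which costs extra GPU memory.
y_train = layer(x)

# Inference-only forward: no graph is built, nothing is kept for backward.
with torch.no_grad():
    y_infer = layer(x)

print(y_train.requires_grad, y_infer.requires_grad)  # True False
```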

From this link: Libraries in the Hugging Face ecosystem, like Transformers or Diffusers, support Big Model Inference in their from_pretrained constructors. You just need to add device_map="auto" in the from_pretrained method to enable Big Model Inference.
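A minimal sketch with Transformers (the model id is again a placeholder):

```python
from transformers import AutoModelForCausalLM

# device_map="auto" in from_pretrained enables Big Model Inference directly,
# without the manual init_empty_weights / load_checkpoint_and_dispatch steps.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", device_map="auto"
)
print(model.hf_device_map)  # which device each module was placed on
```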

An interesting comment about CUDA.

On the same page: “When a first allocation happens in PyTorch, it loads CUDA kernels which take about 1-2GB of memory depending on the GPU. Therefore, [you always have less usable memory than the actual size of the GPU]{.underline}. To see how much memory is actually used, do torch.ones(1).cuda() and look at the memory usage.”
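A small sketch of that check; torch.cuda.mem_get_info is one way to see the driver-level usage, nvidia-smi shows the same thing:

```python
import torch

torch.ones(1).cuda()  # first allocation: loads the CUDA context/kernels

# Memory tracked by PyTorch's allocator: only a few bytes for the tensor itself.
print(torch.cuda.memory_allocated(), "bytes allocated by tensors")

# Free/total memory reported by the driver: the gap versus the allocator figure
# (roughly 1-2GB) is the CUDA context/kernels.
free, total = torch.cuda.mem_get_info()
print(f"{(total - free) / 2**30:.2f} GiB used on the device")
```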