PeANo

Parallelism (notes)

Hugging Face Accelerate

A nice explanation of model partitioning across devices for inference.

From the explanation of load_checkpoint_and_dispatch: By passing device_map="auto", we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources (...)
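A minimal sketch of what that looks like in code, assuming a Transformers causal LM; the model id and the checkpoint/offload paths below are placeholders:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom-560m")  # placeholder model id

# Build the model skeleton without allocating real weights.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Load the weights and let Accelerate decide, layer by layer, whether each one
# goes to a GPU, to CPU RAM, or to disk offload.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="path/to/checkpoint_folder",  # placeholder path
    device_map="auto",
    offload_folder="offload",                # used only if weights spill to disk
)
```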

A bit later, in the discussion about running the model: Now that we have done this, our model lies across several devices, and maybe the hard drive. But it can still be used as a regular PyTorch model (...) behind the scenes, Accelerate added hooks to the model, so that (...)

This way, your model can run inference even if it does not fit on one of the GPUs or the CPU RAM, as long as the largest layer fits on the GPU.
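A hedged sketch of using the dispatched model like a regular PyTorch model, continuing from the sketch above; the tokenizer id and prompt are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)  # first GPU

# The hooks move each layer's weights/inputs to the right device behind the
# scenes, so this looks like any other forward pass or generate() call.
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```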

Important comment: this only supports the inference of your model, [not training]{.underline}. Most of the computation happens behind torch.no_grad() context managers to avoid spending GPU memory on intermediate activations. For distributed training, check out the DeepSpeed library with ZeRO.
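A tiny standalone illustration of the torch.no_grad() point (plain PyTorch, not Accelerate-specific):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

# Regular forward: autograd records the graph and saves what it needs for
# backward, which costs extra GPU memory.
y_train = layer(x)

# Inference-only forward: no graph is built, nothing is kept for backward.
with torch.no_grad():
    y_infer = layer(x)

print(y_train.requires_grad, y_infer.requires_grad)  # True False
```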

From this link: Libraries in the Hugging Face ecosystem, like Transformers or Diffusers, support Big Model Inference in their from_pretrained constructors. You just need to add device_map="auto" in the from_pretrained method to enable Big Model Inference.
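A minimal sketch with Transformers (the model id is again a placeholder):

```python
from transformers import AutoModelForCausalLM

# device_map="auto" in from_pretrained enables Big Model Inference directly,
# without the manual init_empty_weights / load_checkpoint_and_dispatch steps.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", device_map="auto"
)
print(model.hf_device_map)  # which device each module was placed on
```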

An interesting comment about CUDA.

On the same page: “When a first allocation happens in PyTorch, it loads CUDA kernels which take about 1-2GB of memory depending on the GPU. Therefore, [you always have less usable memory than the actual size of the GPU]{.underline}. To see how much memory is actually used, do torch.ones(1).cuda() and look at the memory usage.”
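A small sketch of that check; torch.cuda.mem_get_info is one way to see the driver-level usage, nvidia-smi shows the same thing:

```python
import torch

torch.ones(1).cuda()  # first allocation: loads the CUDA context/kernels

# Memory tracked by PyTorch's allocator: only a few bytes for the tensor itself.
print(torch.cuda.memory_allocated(), "bytes allocated by tensors")

# Free/total memory reported by the driver: the gap versus the allocator figure
# (roughly 1-2GB) is the CUDA context/kernels.
free, total = torch.cuda.mem_get_info()
print(f"{(total - free) / 2**30:.2f} GiB used on the device")
```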