PeANo

Scaling Laws

Important papers:

  1. Scaling Laws for Neural Language Models (Kaplan et al., OpenAI, 2020).

  2. Training Compute-Optimal Large Language Models (Hoffmann et al., DeepMind, 2022; the "Chinchilla" paper).

  3. Scaling Laws for Autoregressive Generative Modeling (Henighan et al., OpenAI, 2020).

  4. Language Models are Few-Shot Learners (Brown et al., OpenAI, 2020; the GPT-3 paper).

Definition

The name given to the line (in log-log scale) that a model's test error cannot cross as its size, dataset, or compute (or whatever quantity the scaling law studies) increases.

Three main scaling laws have been observed:

  1. Validation error x compute (FLOPs). "Compute" here means the total amount of computation used for training, measured in floating-point operations (reported as PF-days in the OpenAI paper).

  2. Validation error x model size

  3. Validation error x dataset size. (A minimal fit sketch for these curves follows this list.)
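
As an illustration of how such a curve can be fit, here is a minimal sketch using assumed, synthetic numbers: the (model size, validation loss) pairs, the function names, and the initial guesses are all placeholders, and the single-variable form L(N) = (N_c / N)^alpha is only roughly the one reported in the OpenAI paper.

```python
# Fit a power law L(N) = (N_c / N)**alpha to (model size, validation loss)
# pairs. All data points below are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])   # parameter counts
losses = np.array([5.1, 4.6, 4.1, 3.7, 3.3, 3.0, 2.7])  # validation losses

def power_law(n, n_c, alpha):
    # Loss as a power law in model size: a straight line on log-log axes.
    return (n_c / n) ** alpha

def log_power_law(log_n, log_n_c, alpha):
    # Same law in log space, so every decade of model size is weighted equally.
    return alpha * (log_n_c - log_n)

(log_n_c, alpha), _ = curve_fit(log_power_law, np.log(sizes), np.log(losses),
                                p0=[np.log(1e13), 0.08])
print(f"alpha ~ {alpha:.3f}, N_c ~ {np.exp(log_n_c):.2e}")

# Extrapolate the fitted law to a larger (hypothetical) model.
print(f"predicted loss at 1e10 params: {power_law(1e10, np.exp(log_n_c), alpha):.2f}")
```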

Scaling laws do not depend much on architecture or other minor algorithmic details.

Definition

The entropy of natural language: the fact that, for a given input, there may be many possible next tokens that are "correct".

The entropy of natural language introduces a constant error term in the scaling law, implying that the test error cannot drop below this floor even for an idealized, infinitely large model trained with infinite compute and data.

Scaling laws do not need to be straight lines: they follow a power law, which appears as a straight line on log-log axes only until the loss approaches the entropy floor.
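
A compact way to write both observations, sketched in roughly the notation of the autoregressive-generative-modeling paper (x stands for model size, dataset size, or compute; x_0 and the exponent are fitted constants):

```latex
% Single-variable scaling law with an irreducible term.
% L_\infty is the entropy floor of natural language; the power-law term
% is a straight line on log-log axes until the loss nears that floor.
L(x) = L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x}
```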

EXERCISE.

At minute 16, he shows an interesting animation of how the low-dimensional UMAP representation evolves as the network trains. Reproduce this yourself.
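
One possible way to reproduce it, sketched below under assumptions that are mine rather than the lecture's: a small MLP on the scikit-learn digits dataset, with a 32-unit hidden layer projected by UMAP every 100 training steps. The dataset, architecture, snapshot schedule, and the PyTorch / umap-learn stack are all placeholder choices.

```python
# Snapshot a hidden layer during training and watch its UMAP projection
# evolve. Dataset, model, and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
import umap                      # pip install umap-learn
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
X = torch.tensor(digits.data, dtype=torch.float32) / 16.0
y = torch.tensor(digits.target)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Linear(128, 32), nn.ReLU(),   # hidden layer we visualize
                      nn.Linear(32, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

snapshots = []                   # hidden embeddings of the full dataset over time
for step in range(501):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        with torch.no_grad():
            hidden = model[:4](X)            # output of the 32-unit layer
        snapshots.append((step, hidden.numpy()))

# One UMAP projection per snapshot; each frame of the "animation" is a subplot.
fig, axes = plt.subplots(1, len(snapshots), figsize=(4 * len(snapshots), 4))
for ax, (step, hidden) in zip(axes, snapshots):
    proj = umap.UMAP(n_components=2, random_state=0).fit_transform(hidden)
    ax.scatter(proj[:, 0], proj[:, 1], c=digits.target, s=4, cmap="tab10")
    ax.set_title(f"step {step}")
plt.tight_layout()
plt.savefig("umap_evolution.png")
```

Fitting UMAP independently per snapshot means consecutive frames are not aligned with each other; for a smoother animation, one could fit UMAP once on the final snapshot, project the earlier ones through its transform method, and stitch the frames into a GIF.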

Scaling laws may be explained by the low-dimensional manifold on which the data lives.
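
One way this argument is made quantitative (a sketch following the data-manifold scaling-law paper by Sharma and Kaplan; the constant should be taken as approximate) ties the model-size exponent to the intrinsic dimension d of the data manifold:

```latex
% Manifold-dimension argument (approximate): data lying on a manifold of
% intrinsic dimension d gives a model-size exponent of roughly
\alpha_N \approx \frac{4}{d}
% so lower intrinsic dimension means a steeper scaling curve.
```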