PyTorch v2.12.0 Release Notes

Release Date: 2026-05-13
  • 🚀 PyTorch 2.12.0 Release Notes

    Highlights

    • Batched linalg.eigh on CUDA is up to 100x faster due to updated cuSolver backend selection.
    • New torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends.
    • torch.export.save now supports Microscaling (MX) quantization formats, enabling full export of aggressively compressed models.
    • Adagrad now supports fused=True, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation.
    • torch.cond control flow can now be captured and replayed inside CUDA Graphs.
    • ROCm users gain expandable memory segments, rocSHMEM symmetric memory collectives, and FlexAttention pipelining.

    🚀 For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.

    Backwards Incompatible Changes

    🏗 Build Frontend

    • 🔧 Strengthened SVE compile checks in FindARM.cmake, which may reject previously accepted but incorrect SVE configurations (#176646)

    • ⚡️ Updated the minimum CUDA version required to build PyTorch from source to CUDA 12.6 (#178925)

    • 🏗 Enforced a C++20 minimum in CMake build files (#178662)

    Distributed

    • torch.distributed.nn.functional ops now raise RuntimeError under torch.compile (#177342)

    TorchElastic

    • 0️⃣ torchrun now defaults to an OS-assigned free port for single-node training instead of port 29500 (#175699)
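For readers unfamiliar with the term, "OS-assigned free port" usually means binding a socket to port 0 and letting the kernel pick an unused port. The sketch below illustrates that general mechanism only; it is not torchrun's implementation, and `find_free_port` is a hypothetical helper name.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the kernel to choose any free port for us.
        sock.bind((host, 0))
        # getsockname() returns (host, port); the port is the one the OS chose.
        return sock.getsockname()[1]


port = find_free_port()
print(f"OS picked port {port}")
```

Because the port now differs between runs, scripts or firewall rules that assumed port 29500 for single-node jobs may need to set the port explicitly.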

    MPS

    • All MPS tensors are now allocated in unified memory (#175818)

    Inductor

    • The max_autotune layout-constraint deferral introduced in 2.11 is now opt-in (#175330)

    🗄 Deprecations

    🚀 Release Engineering

    • 🏗 Deprecate CUDA 12.8 builds in favor of CUDA 13.0 (#179072)

    • 🚀 Compatibility with CMake < 3.10 will be removed in a future release (#166259)

    Linear Algebra

    • Several CUDA linear algebra operators no longer use the MAGMA backend and now dispatch to cuSolver or cuBLAS unconditionally:

    FullyShardedDataParallel2 (FSDP2)

    • ⚡️ Compiling through FSDP2 hooks without graph breaks is no longer supported (#174863, #174906). If you use compiled autograd with FSDP2, update your code to allow graph breaks around FSDP2 hooks or disable compiled autograd for the FSDP2 training step.

    Profiler

    • 🗄 Profiler's metadata_json field is now deprecated; use event_metadata instead (#179417)

    Dynamo

    • torch.compile(fullgraph=True) now warns when a call runs no compiled code; this will become an error in 2.13 (#181940)

Previous changes from v2.11.0

  • 🚀 PyTorch 2.11.0 Release Notes

    Highlights

    • Added Support for Differentiable Collectives for Distributed Training
    • FlexAttention now has a FlashAttention-4 backend on Hopper and Blackwell GPUs
    • MPS (Apple Silicon) Comprehensive Operator Expansion
    • Added RNN/LSTM GPU Export Support
    • Added XPU Graph Support

    🚀 For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.

    Backwards Incompatible Changes

    🚀 Release Engineering

    🏗 Volta (SM 7.0) GPU support removed from CUDA 12.8 and 12.9 binary builds (#172598)

    ⚡️ Starting with PyTorch 2.11, the CUDA 12.8 and 12.9 pre-built binaries no longer include support for Volta GPUs (compute capability 7.0, e.g. V100). This change was necessary to enable updating to cuDNN 9.15.1, which is incompatible with Volta.

    Users with Volta GPUs who need CUDA 12.8+ should use the CUDA 12.6 builds, which continue to include Volta support. Alternatively, build PyTorch from source with Volta included in TORCH_CUDA_ARCH_LIST.

    🔖 Version 2.10:

    # CUDA 12.8 builds supported Volta (SM 7.0)
    pip install torch --index-url https://download.pytorch.org/whl/cu128
    # Works on V100
    

    🔖 Version 2.11:

    # CUDA 12.8 builds no longer support Volta
    # For V100 users, use CUDA 12.6 builds instead:
    pip install torch --index-url https://download.pytorch.org/whl/cu126
    

    🚀 PyPI wheels now ship with CUDA 13.0 instead of CUDA 12.x (#172663, announcement)

    🐧 Starting with PyTorch 2.11, pip install torch on PyPI installs CUDA 13.0 wheels by default for both Linux x86_64 and Linux aarch64. Previously, PyPI wheels shipped with CUDA 12.x and only Linux x86_64 CUDA wheels were available on PyPI. Users whose systems have only CUDA 12.x drivers installed may encounter errors when running pip install torch without specifying an index URL.

    ➕ Additionally, CUDA 13.0 only supports Turing (SM 7.5) and newer GPU architectures on Linux x86_64. Maxwell and Pascal GPUs are no longer supported under CUDA 13.0. Users with these older GPUs should use the CUDA 12.6 builds instead.

    CUDA 12.6 and 12.8 binaries remain available via download.pytorch.org.

    🔖 Version 2.10:

    # PyPI wheel used CUDA 12.x
    pip install torch
    

    🔖 Version 2.11:

    # PyPI wheel now uses CUDA 13.0
    pip install torch
    # To get CUDA 12.8 wheels instead:
    pip install torch --index-url https://download.pytorch.org/whl/cu128
    # To get CUDA 12.6 wheels (includes Maxwell/Pascal/Volta support):
    pip install torch --index-url https://download.pytorch.org/whl/cu126
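The architecture rules above can be condensed into a small decision helper. This is a minimal sketch that encodes only what these notes state (Turing, SM 7.5, and newer for the default CUDA 13.0 wheels; CUDA 12.6 wheels for Volta, Pascal, and Maxwell). `suggest_wheel_index` is a hypothetical name, and the sketch ignores driver versions and non-Linux platforms. On a machine with PyTorch already installed, the capability tuple could come from `torch.cuda.get_device_capability()`.

```python
from typing import Optional

# Index URL taken from the notes above; cu126 wheels keep older architectures.
CU126_INDEX = "https://download.pytorch.org/whl/cu126"


def suggest_wheel_index(major: int, minor: int) -> Optional[str]:
    """Return an --index-url to use, or None for the default PyPI wheel.

    Assumption: only the GPU compute capability is considered here.
    """
    if (major, minor) >= (7, 5):
        # Turing and newer are covered by the default CUDA 13.0 wheels.
        return None
    # Volta (7.0), Pascal (6.x), and Maxwell (5.x) need the CUDA 12.6 builds.
    return CU126_INDEX


for cap in [(6, 1), (7, 0), (7, 5), (9, 0)]:
    index = suggest_wheel_index(*cap)
    cmd = "pip install torch" + (f" --index-url {index}" if index else "")
    print(cap, "->", cmd)
```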
    

    Python Frontend

    torch.hub.list(), torch.hub.load(), and torch.hub.help() now default the trust_repo parameter to "check" instead of None. The trust_repo=None option has been removed. (#174101)

    ⚠ Previously, passing trust_repo=None (or relying on the default) would download and run code from untrusted repositories after printing only a warning. Now, the default "check" behavior will prompt the user for explicit confirmation before running code from repositories not on the trusted list.

    ⚡️ Users who were explicitly passing trust_repo=None must update their code. Users who were already passing trust_repo=True, trust_repo=False, or trust_repo="check" are not affected.

    🔖 Version 2.10:

    # Default trust_repo=None: downloads with a warning
    torch.hub.load("user/repo", "model")
    # Explicit None: same behavior
    torch.hub.load("user/repo", "model", trust_repo=None)
    

    🔖 Version 2.11:

    # Default trust_repo="check": prompts for confirmation if repo is not trusted
    torch.hub.load("user/repo", "model")
    # To skip the prompt, explicitly trust the repo
    torch.hub.load("user/repo", "model", trust_repo=True)
    

    torch.nn

    Add sliding window support to varlen_attn via window_size, making optional arguments keyword-only (#172238)

    The signature of torch.nn.attention.varlen_attn has changed: a * (keyword-only separator) has been inserted before the optional arguments. Previously, optional arguments like is_causal, return_aux, and scale could be passed positionally; they must now be passed as keyword arguments. A new window_size keyword argument has also been added.

    # Before (2.10)
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, True, None, 1.0)
    # After (2.11): pass optional arguments as keywords
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0), return_aux=None, scale=1.0)
    

    Remove is_causal flag from varlen_attn (#172245)

    🚚 The is_causal parameter has been removed from torch.nn.attention.varlen_attn. Causal attention is now expressed through the window_size parameter: use window_size=(-1, 0) for causal masking, or window_size=(W, 0) for causal attention with a sliding window of size W. The default window_size=(-1, -1) corresponds to full (non-causal) attention.

    # Before (2.10)
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, is_causal=True)
    # After (2.11): use window_size instead
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0))
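To make the window_size values concrete, the pure-Python sketch below builds the boolean attention mask implied by a (left, right) pair. It assumes the FlashAttention-style convention (query i may attend to key j when i - left <= j <= i + right, with -1 meaning unbounded) and equal query and key lengths. `window_mask` is a hypothetical helper for illustration, not part of torch.nn.attention.

```python
from typing import List, Tuple


def window_mask(seq_len: int, window_size: Tuple[int, int]) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    left, right = window_size
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # A negative bound means "no limit" on that side.
            left_ok = left < 0 or j >= i - left
            right_ok = right < 0 or j <= i + right
            row.append(left_ok and right_ok)
        mask.append(row)
    return mask


causal = window_mask(4, (-1, 0))   # causal: keys up to and including i
full = window_mask(4, (-1, -1))    # default: full (non-causal) attention
sliding = window_mask(4, (1, 0))   # causal with a sliding window of size 1

for name, m in [("causal", causal), ("full", full), ("sliding", sliding)]:
    print(name)
    for row in m:
        print("  ", "".join("x" if allowed else "." for allowed in row))
```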
    

    Distributed

    DebugInfoWriter now honors $XDG_CACHE_HOME for its cache directory in C++ code, consistent with the Python side. Previously it always used ~/.cache/torch. (#168232)

    This avoids issues where $HOME is not set or not writable. Users who relied on ~/.cache/torch being used regardless of $XDG_CACHE_HOME may see debug info written to a different location.

    🔖 Version 2.10:

    # C++ DebugInfoWriter always wrote to ~/.cache/torch
    

    🔖 Version 2.11:

    # C++ DebugInfoWriter now respects $XDG_CACHE_HOME/torch (same as Python code)
    # Falls back to ~/.cache/torch if $XDG_CACHE_HOME is not set
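The lookup order described above can be sketched in Python as follows. This mirrors only the rule stated in these notes ($XDG_CACHE_HOME/torch if the variable is set, otherwise ~/.cache/torch); `torch_cache_dir` is a hypothetical helper, not the actual C++ or Python implementation.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def torch_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the cache directory using the rule described in the notes."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME")
    # Treat an unset or empty variable as "not set" and fall back to ~/.cache.
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "torch"


print(torch_cache_dir({"XDG_CACHE_HOME": "/tmp/xdg"}))  # uses the override
print(torch_cache_dir({}))                              # falls back to ~/.cache/torch
```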
    

    DeviceMesh now stores a process group registry (_pg_registry) directly, enabling torch.compile to trace through get_group(). (#172272)

    🖨 This may break code that skips init_process_group, loads a saved DTensor (constructing a DeviceMesh with no PGs), and later creates PGs separately — during torch.compile runtime the PG lookup will fail. Users should ensure process groups are initialized before constructing the DeviceMesh.

    🔖 Version 2.10:

    # PGs resolved via global _resolve_process_group at runtime
    mesh = DeviceMesh(...)
    # PGs could be created later
    

    🔖 Version 2.11:

    # PGs now stored on DeviceMesh._pg_registry; must exist at mesh creation
    dist.init_process_group(...)  # Must be called before creating mesh
    mesh = DeviceMesh(...)
    

    Distributed (DTensor)

    0️⃣ DTensor.to_local() backward now converts Partial placements to Replicate by default when grad_placements is not provided. (#173454)

    Previously, calling to_local() on a Partial DTensor would preserve the Partial placement in the backward gradient, which could produce incorrect gradients when combined with from_local(). Now, the backward pass automatically maps Partial forward placements to Replicate gradient placements, matching the behavior of from_local().

    👀 Users who relied on the previous behavior (where to_local() backward preserved Partial gradients) may see different gradient values. To ensure correctness, explicitly pass grad_placements to to_local().

    🔖 Version 2.10:

    # Partial placement preserved in backward, which could produce incorrect gradients
    local_tensor = partial_dtensor.to_local()
    

    🔖 Version 2.11:

    # Partial → Replicate in backward by default (correct behavior)
    local_tensor = partial_dtensor.to_local()
    # Or explicitly specify grad_placements for full control:
    local_tensor = partial_dtensor.to_local(grad_placements=[Replicate()])
    

    👀 _PhiloxState.seed and _PhiloxState.offset now return torch.Tensor instead of int (#173876)

    👀 The DTensor RNG internal _PhiloxState class changed its seed and offset properties to return tensors instead of Python ints, and the setters now expect tensors. This makes the RNG state compatible with PT2 tracing (the previous .item() calls were not fake-tensor friendly).

    👀 Code that directly reads _PhiloxState.seed or _PhiloxState.offset and treats them as ints will break. Call .item() to get the int value. When setting, wrap the value in a tensor.

    🔖 Version 2.10:

    from torch.distributed.tensor._random import _PhiloxState

    philox = _PhiloxState(state)
    seed: int = philox.seed  # returned int
    philox.offset = 42  # accepted int
    

    🔖 Version 2.11:

    from torch.distributed.tensor._random import _PhiloxState

    philox = _PhiloxState(state)
    seed: int = philox.seed.item()  # now returns Tens...