PyTorch v2.12.0 Release Notes
Release Date: 2026-05-13
PyTorch 2.12.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
- Batched linalg.eigh on CUDA is up to 100x faster due to updated cuSolver backend selection.
- New torch.accelerator.Graph API unifies graph capture and replay across CUDA, XPU, and out-of-tree backends.
- torch.export.save now supports Microscaling (MX) quantization formats, enabling full export of aggressively compressed models.
- Adagrad now supports fused=True, joining Adam, AdamW, and SGD with a single-kernel optimizer implementation.
- torch.cond control flow can now be captured and replayed inside CUDA Graphs.
- ROCm users gain expandable memory segments, rocSHMEM symmetric memory collectives, and FlexAttention pipelining.

For more details about these highlighted features, see the release blog post. Below are the full release notes for this release.
Backwards Incompatible Changes
Build Frontend
- Strengthened SVE compile checks in FindARM.cmake, which may reject previously accepted but incorrect SVE configurations (#176646)
- Updated the minimum CUDA version required to build PyTorch from source to CUDA 12.6 (#178925)
- Enforced a C++20 minimum in CMake build files (#178662)
Distributed
- torch.distributed.nn.functional ops now raise RuntimeError under torch.compile (#177342)
TorchElastic
- torchrun now defaults to an OS-assigned free port for single-node training instead of port 29500 (#175699)
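To illustrate what an OS-assigned free port means in practice, here is a minimal sketch using only the Python standard library. It asks the operating system for an unused port by binding to port 0. This shows the general technique and is not a description of torchrun's internals; the helper name find_free_port is hypothetical.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the kernel choose any free port.
        sock.bind((host, 0))
        # getsockname() returns (address, port) for IPv4 sockets.
        return sock.getsockname()[1]


port = find_free_port()
print(f"OS-assigned port: {port}")
```

The port is only guaranteed to be free at the moment of the call, since another process could claim it after the socket closes. Scripts that assumed port 29500 for single-node runs will likely need to set the port explicitly at launch.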
MPS
- All MPS tensors are now allocated in unified memory (#175818)
Inductor
- The max_autotune layout-constraint deferral introduced in 2.11 is now opt-in (#175330)
Deprecations
Release Engineering
- Deprecated CUDA 12.8 builds in favor of CUDA 13.0 (#179072)
- Compatibility with CMake < 3.10 will be removed in a future release (#166259)
Linear Algebra
- Several CUDA linear algebra operators no longer use the MAGMA backend and now dispatch to cuSolver or cuBLAS unconditionally:
FullyShardedDataParallel2 (FSDP2)
- Compiling through FSDP2 hooks without graph breaks is no longer supported (#174863, #174906). If you use compiled autograd with FSDP2, update your code to allow graph breaks around FSDP2 hooks or disable compiled autograd for the FSDP2 training step.
Profiler
- Profiler's metadata_json field is now deprecated; use event_metadata instead (#179417)
Dynamo
- torch.compile(fullgraph=True) now warns when a call runs no compiled code; this will become an error in 2.13 (#181940)
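Because this warning is slated to become an error in 2.13, one way to find affected call sites early is to promote warnings to exceptions in a test run. The sketch below uses only the standard warnings module. The notes do not name the warning category, so this version promotes every warning raised inside the block; call_strict and the stand-in function noisy are hypothetical names.

```python
import warnings


def call_strict(fn, *args, **kwargs):
    """Run fn with all warnings promoted to exceptions (hypothetical helper).

    No category filter is applied because the release notes do not state
    which warning class is emitted; narrow the filter once that is known.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        return fn(*args, **kwargs)


def noisy(x):
    # Stand-in for a compiled callable that emits a warning when called.
    warnings.warn("no compiled code ran")
    return x


try:
    call_strict(noisy, 1)
    raised = False
except UserWarning:
    # warnings.warn defaults to UserWarning, which is now raised.
    raised = True

print(raised)
```

The previous warning filters are restored when the with block exits, so the stricter behavior stays local to the wrapped call.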
Previous changes from v2.11.0
PyTorch 2.11.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
- Added Support for Differentiable Collectives for Distributed Training
- FlexAttention now has a FlashAttention-4 backend on Hopper and Blackwell GPUs
- MPS (Apple Silicon) Comprehensive Operator Expansion
- Added RNN/LSTM GPU Export Support
- Added XPU Graph Support

For more details about these highlighted features, see the release blog post. Below are the full release notes for this release.
Backwards Incompatible Changes
Release Engineering
Volta (SM 7.0) GPU support removed from CUDA 12.8 and 12.9 binary builds (#172598)
Starting with PyTorch 2.11, the CUDA 12.8 and 12.9 pre-built binaries no longer include support for Volta GPUs (compute capability 7.0, e.g. V100). This change was necessary to enable updating to cuDNN 9.15.1, which is incompatible with Volta.
Users with Volta GPUs who need CUDA 12.8+ should use the CUDA 12.6 builds, which continue to include Volta support. Alternatively, build PyTorch from source with Volta included in TORCH_CUDA_ARCH_LIST.
Version 2.10:
    # CUDA 12.8 builds supported Volta (SM 7.0)
    pip install torch --index-url https://download.pytorch.org/whl/cu128  # Works on V100
Version 2.11:
    # CUDA 12.8 builds no longer support Volta
    # For V100 users, use CUDA 12.6 builds instead:
    pip install torch --index-url https://download.pytorch.org/whl/cu126

PyPI wheels now ship with CUDA 13.0 instead of CUDA 12.x (#172663, announcement)
Starting with PyTorch 2.11, pip install torch on PyPI installs CUDA 13.0 wheels by default for both Linux x86_64 and Linux aarch64. Previously, PyPI wheels shipped with CUDA 12.x and only Linux x86_64 CUDA wheels were available on PyPI. Users whose systems have only CUDA 12.x drivers installed may encounter errors when running pip install torch without specifying an index URL.
Additionally, CUDA 13.0 only supports Turing (SM 7.5) and newer GPU architectures on Linux x86_64. Maxwell and Pascal GPUs are no longer supported under CUDA 13.0. Users with these older GPUs should use the CUDA 12.6 builds instead.
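The architecture cutoffs above can be summarized as a small decision rule. The sketch below is a hypothetical helper, not part of PyTorch or pip: it takes a GPU compute capability and returns the CUDA 12.6 index URL used in this section's install commands for anything older than Turing (SM 7.5), or None when the default PyPI wheel should work. It assumes Linux x86_64 and ignores driver versions.

```python
from typing import Optional

# Index URL taken from the install commands in these notes.
CU126_INDEX = "https://download.pytorch.org/whl/cu126"


def wheel_index_url(major: int, minor: int) -> Optional[str]:
    """Pick an --index-url for a GPU compute capability (hypothetical helper).

    Returns the CUDA 12.6 index for GPUs older than Turing (SM 7.5),
    which covers Maxwell, Pascal and Volta, and None when the default
    CUDA 13.0 wheel from PyPI is expected to support the GPU.
    """
    if (major, minor) < (7, 5):
        return CU126_INDEX
    return None


print(wheel_index_url(7, 0))  # Volta (V100)
print(wheel_index_url(8, 0))  # newer than Turing
```

Tuple comparison keeps the cutoff readable: (7, 0) sorts below (7, 5), while (7, 5) and anything newer falls through to the PyPI default.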
CUDA 12.6 and 12.8 binaries remain available via download.pytorch.org.
Version 2.10:
    # PyPI wheel used CUDA 12.x
    pip install torch
Version 2.11:
    # PyPI wheel now uses CUDA 13.0
    pip install torch
    # To get CUDA 12.8 wheels instead:
    pip install torch --index-url https://download.pytorch.org/whl/cu128
    # To get CUDA 12.6 wheels (includes Maxwell/Pascal/Volta support):
    pip install torch --index-url https://download.pytorch.org/whl/cu126
Python Frontend
torch.hub.list(), torch.hub.load(), and torch.hub.help() now default the trust_repo parameter to "check" instead of None. The trust_repo=None option has been removed. (#174101)
Previously, passing trust_repo=None (or relying on the default) would download and run code from untrusted repositories with only a warning. Now, the default "check" behavior will prompt the user for explicit confirmation before running code from repositories not on the trusted list.
Users who were explicitly passing trust_repo=None must update their code. Users who were already passing trust_repo=True, trust_repo=False, or trust_repo="check" are not affected.
Version 2.10:
    # Default trust_repo=None: downloads with a warning
    torch.hub.load("user/repo", "model")
    # Explicit None: same behavior
    torch.hub.load("user/repo", "model", trust_repo=None)
Version 2.11:
    # Default trust_repo="check": prompts for confirmation if repo is not trusted
    torch.hub.load("user/repo", "model")
    # To skip the prompt, explicitly trust the repo
    torch.hub.load("user/repo", "model", trust_repo=True)
torch.nn
Add sliding window support to varlen_attn via window_size, making optional arguments keyword-only (#172238)
The signature of torch.nn.attention.varlen_attn has changed: a * (keyword-only separator) has been inserted before the optional arguments. Previously, optional arguments like is_causal, return_aux, and scale could be passed positionally; they must now be passed as keyword arguments. A new window_size keyword argument has also been added.
    # Before (2.10)
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, True, None, 1.0)
    # After (2.11): pass as keyword arguments
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0), return_aux=None, scale=1.0)

Remove is_causal flag from varlen_attn (#172245)
The is_causal parameter has been removed from torch.nn.attention.varlen_attn. Causal attention is now expressed through the window_size parameter: use window_size=(-1, 0) for causal masking, or window_size=(W, 0) for causal attention with a sliding window of size W. The default window_size=(-1, -1) corresponds to full (non-causal) attention.
    # Before (2.10)
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, is_causal=True)
    # After (2.11): use window_size instead
    output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0))
Distributed
DebugInfoWriter now honors $XDG_CACHE_HOME for its cache directory in C++ code, consistent with the Python side. Previously it always used ~/.cache/torch. (#168232)
This avoids issues where $HOME is not set or not writable. Users who relied on ~/.cache/torch being used regardless of $XDG_CACHE_HOME may see debug info written to a different location.
Version 2.10:
    # C++ DebugInfoWriter always wrote to ~/.cache/torch
Version 2.11:
    # C++ DebugInfoWriter now respects $XDG_CACHE_HOME/torch (same as Python code)
    # Falls back to ~/.cache/torch if $XDG_CACHE_HOME is not set

DeviceMesh now stores a process group registry (_pg_registry) directly, enabling torch.compile to trace through get_group(). (#172272)
This may break code that skips init_process_group, loads a saved DTensor (constructing a DeviceMesh with no PGs), and later creates PGs separately: during torch.compile runtime the PG lookup will fail. Users should ensure process groups are initialized before constructing the DeviceMesh.
Version 2.10:
    # PGs resolved via global _resolve_process_group at runtime
    mesh = DeviceMesh(...)  # PGs could be created later
Version 2.11:
    # PGs now stored on DeviceMesh._pg_registry; must exist at mesh creation
    dist.init_process_group(...)  # Must be called before creating mesh
    mesh = DeviceMesh(...)
Distributed (DTensor)
DTensor.to_local() backward now converts Partial placements to Replicate by default when grad_placements is not provided. (#173454)
Previously, calling to_local() on a Partial DTensor would preserve the Partial placement in the backward gradient, which could produce incorrect gradients when combined with from_local(). Now, the backward pass automatically maps Partial forward placements to Replicate gradient placements, matching the behavior of from_local().
Users who relied on the previous behavior (where to_local() backward preserved Partial gradients) may see different gradient values. To ensure correctness, explicitly pass grad_placements to to_local().
Version 2.10:
    # Partial placement preserved in backward; could produce incorrect gradients
    local_tensor = partial_dtensor.to_local()
Version 2.11:
    # Partial → Replicate in backward by default (correct behavior)
    local_tensor = partial_dtensor.to_local()
    # Or explicitly specify grad_placements for full control:
    local_tensor = partial_dtensor.to_local(grad_placements=[Replicate()])
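The new default for to_local() can be pictured as a simple mapping over placements. The sketch below models that rule in plain Python so it runs without a distributed setup. The Partial, Replicate and Shard classes here are stand-ins defined for illustration, not the real torch.distributed.tensor types, and leaving non-Partial placements unchanged is an assumption that the notes do not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Stand-in for a pending-reduction placement (not the torch class)."""
    reduce_op: str = "sum"


@dataclass(frozen=True)
class Replicate:
    """Stand-in for a replicated placement (not the torch class)."""


@dataclass(frozen=True)
class Shard:
    """Stand-in for a placement sharded along one tensor dimension."""
    dim: int


def default_grad_placements(placements):
    """Map forward placements to gradient placements (hypothetical helper).

    Partial becomes Replicate, as described for 2.11; other placements
    are passed through unchanged, which is an assumption of this sketch.
    """
    return [Replicate() if isinstance(p, Partial) else p for p in placements]


forward = [Partial(), Shard(0)]
print(default_grad_placements(forward))
```

Passing grad_placements explicitly, as in the 2.11 example, bypasses this default entirely.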
_PhiloxState.seed and _PhiloxState.offset now return torch.Tensor instead of int (#173876)
The DTensor RNG internal _PhiloxState class changed its seed and offset properties to return tensors instead of Python ints, and the setters now expect tensors. This makes the RNG state compatible with PT2 tracing (the previous .item() calls were not fake-tensor friendly).
Code that directly reads _PhiloxState.seed or _PhiloxState.offset and treats them as ints will break. Call .item() to get the int value. When setting, wrap the value in a tensor.
Version 2.10:
    from torch.distributed.tensor._random import _PhiloxState
    philox = _PhiloxState(state)
    seed: int = philox.seed  # returned int
    philox.offset = 42  # accepted int
Version 2.11:
    from torch.distributed.tensor._random import _PhiloxState
    philox = _PhiloxState(state)
    seed: int = philox.seed.item()  # now returns Tens...