Ray v2.55.1 Release Notes
Release Date: 2026-04-22
Previous changes from v2.55.0
Ray Data
🎉 New Features
- ➕ Add DataSourceV2 API with scanner/reader framework, file listing, and file partitioning (#61220, #61615, #61997)
- 👌 Support GPU shuffle with rapidsmpf 26.2 (#61371, #62062)
- ➕ Add Kafka datasink, migrate to confluent-kafka, support datetime offsets (#60307, #61284, #60909)
- ➕ Add Turbopuffer datasink (#58910)
- ➕ Add 2-phase commit checkpointing with trie recovery and load method (#61821, #60951)
- Queue-based autoscaling policy integrated with task consumers (#59548, #60851)
- Enable autoscaling for GPU stages (#61130)
- 👍 Expressions: add random(), uuid(), cast, and map namespace support (#59656, #60695, #59879)
- ➕ Add support for Arrow native fixed-shape tensor type (#56284)
- 👌 Support writing tensors to tfrecords (#60859)
- ➕ Add pathlib.Path support to read_* functions (#61126)
- ➕ Add cudf as a batch_format (#61329)
- 👍 Allow ActorPoolStrategy for read_datasource() via compute parameter (#59633)
- Introduce ExecutionCache for streamlined caching (#60996)
- 👌 Support strict=False mode for StreamingRepartition (#60295)
- Port changes from lance-ray into Ray Data (#60497)
- Enable PyArrow compute-to-expression conversion for predicate pushdown (#61617)
- ➕ Add vLLM metrics export and Data LLM Grafana dashboard (#60385)
- ⏱ Include logical memory in resource manager scheduling decisions (#60774)
- ➕ Add monotonically increasing ID support (#59290)
💫 Enhancements
- Performance: cache _map_task args, heap-based actor ranking, actor pool map improvements (#61996, #62114, #61591)
- ⚡️ Optimize concat tables and PyArrow schema hashing (#61315, #62108)
- ⬇️ Reduce default DownstreamCapacityBackpressurePolicy threshold to 50% (#61890)
- 👌 Improve reproducibility for random APIs (#59662)
- Clamp batch size to fall within C++ 32-bit int range (#62242)
- Account for external consumer object store usage in resource manager budget (#62117)
- Make get_parquet_dataset configurable in number of fragments to scan (#61670)
- Consolidate schema inference and make all preprocessors implement SerializablePreprocessorBase (#61213, #61341)
- 0️⃣ Disable hanging issue detection by default (#62405)
- 👉 Make execution callback dataflow explicit to prevent state leakage (#61405)
- 🌲 Log DataContext in JSON format at execution start for traceability (#61150, #61428)
- 🔧 Autoscaler: configurable traceback, Prometheus gauges, relaxed constraints (#62210, #62209, #61917, #61385)
- ➕ Add metrics for task scheduling time, output backpressure, and logical memory (#61192, #61007, #61436)
- Prevent operators from dominating entire shared object store budget (#61605)
- 📌 Eliminate generators to avoid intermediate state pinning (#60598)
- 🏁 Default log encoding to UTF-8 on Windows (#61143)
- Remove legacy BlockList, locality_with_output, old callback API, PyArrow 9.0 checks (#60575, #61044, #62055, #61483)
- ⬆️ Upgrade to pyiceberg 0.11.0; cap pandas to <3 (#61062, #60406)
- 🔨 Refactor logical operators to frozen dataclasses (#61059, #61308, #61348, #61349, #61351, #61364, #61481)
- ⏱ Prevent aggregator head node scheduling (#61288)
- ➕ Add error for local:// paths with a zero-resource head node (#60709)
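The batch-size clamp listed above (#62242) can be pictured with a small sketch. This is not Ray's code: the helper name `clamp_batch_size` and the choice to reject non-positive sizes are assumptions made for illustration. It only shows the general idea of keeping a Python integer within the range of a signed 32-bit C++ int.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit C++ int can hold


def clamp_batch_size(requested: int) -> int:
    """Hypothetical helper: keep a batch size within [1, INT32_MAX]."""
    if requested < 1:
        # Assumption for this sketch: non-positive sizes are treated as errors.
        raise ValueError("batch size must be positive")
    return min(requested, INT32_MAX)


print(clamp_batch_size(1024))   # 1024
print(clamp_batch_size(2**40))  # 2147483647
```

Python integers are unbounded, so without a clamp like this an oversized value would only fail once it crossed into native code.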
🛠 🔨 Fixes
- 🛠 Fix RCE in Arrow extension type deserialization from Parquet (#62056)
- 🛠 Fix StreamingSplitDataIterator.schema() (#62057)
- 🛠 Fix ParquetDatasource handling of FileSystemFactory.inspect (#62065)
- 🛠 Fix read_parquet file-extension filtering for versioned object-store URIs (#61376)
- Fix wide_schema_pipeline_tensors cloudpickle deserialization (#62149)
- 🛠 Fix OpBufferQueue race condition (#60828)
- 🛠 Fix scheduling metrics computation (#62031)
- 🛠 Fix OneHotEncoder max_categories to use global top-k instead of per-partition (#60790)
- 🛠 Fix ReservationOpResourceAllocator resource borrowing for ActorPoolMapOperator (#60882)
- 🛠 Fix DatabricksUCDatasource schema() shadowing by schema string attribute (#61282)
- 🛠 Fix AliasExpr structural equality to respect rename flag (#60711)
- Fix _align_struct_fields failure with unaligned scalar fields (#58364)
- ⏱ Fix min_scheduling_resources fallback to incremental_resource_usage (#60997)
- 🛠 Fix output backpressure unblocking sequence for terminal ops (#60798)
- 🛠 Fix multi-input operator object store memory attribution (#61208)
- 🛠 Fix reference cycle by moving to module scope (#61934)
- 🛠 Fix autoscaler logging: reduce verbose output and move traceback to debug (#61989, #62126)
- Fix double counting ref_bundle + input_files (#61774)
- Replace on_exit hook with __ray_shutdown__ to fix UDF cleanup race (#61700)
- Prevent Limit from getting pushed past map_groups (#60881)
- Propagate schema in empty _shuffle_block to fix ColumnNotFound in chained left joins (#61507)
- 🛠 Fix unclear metadata warning and incorrect operator name logging (#61380)
- Clamp rolling utilization averages to zero (#61543)
- 🛠 Fix floating point errors in TimeWindowAverageCalculator (#61580)
- ✂ Remove default task-level timeout and clamp end_offset in Kafka datasource (#61476)
- ✅ Avoid redundant reads in train_test_split (#60274)
- Return None when no outputs have been produced (#62029)
- Replace bare raise with TypeError in string concatenation (#60795)
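To see why the OneHotEncoder max_categories fix above (#60790) matters, here is a minimal standalone sketch of how a per-partition top-k can disagree with a global top-k. It does not use Ray and is not Ray's implementation; the function names and the sample partitions are hypothetical.

```python
from collections import Counter


def per_partition_top_k(partitions, k):
    """Take the k most frequent values inside each partition, then union them."""
    kept = set()
    for part in partitions:
        kept.update(value for value, _ in Counter(part).most_common(k))
    return kept


def global_top_k(partitions, k):
    """Merge counts across all partitions first, then take the k most frequent."""
    total = Counter()
    for part in partitions:
        total.update(part)
    return {value for value, _ in total.most_common(k)}


# "b" is never the most frequent value locally, but it is globally (6 vs 3).
partitions = [
    ["a", "a", "a", "b", "b"],
    ["c", "c", "c", "b", "b"],
    ["d", "d", "d", "b", "b"],
]

print(sorted(per_partition_top_k(partitions, 1)))  # ['a', 'c', 'd']
print(sorted(global_top_k(partitions, 1)))         # ['b']
```

In this toy data the per-partition approach keeps three categories when only one was requested and misses the value that is most frequent overall, which is the kind of discrepancy a global count avoids.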
📚 📖 Documentation
- ➕ Add job-level checkpointing documentation (#60921)
- ⚡️ Update exclude_resources docs for Train autoscaling changes (#61990)
- Add locality_with_output migration instructions (#61151)
- Document max_tasks_in_flight_per_actor vs max_concurrent_batches (#60477)
- ➕ Add missing MOD operation docs; improve ray.data.Datasource docs (#60803, #59654)
- ➕ Add polars usage instructions (#60029)
Ray Serve
🎉 New Features:
- ➕ Added end-to-end gRPC client and bidirectional streaming support, including public APIs, proxy handling, proto updates, and developer docs, so Serve apps can handle streaming workloads natively instead of building custom transport layers. (#60767, #60768, #60769, #60770, #60771)
- 👍 Introduced HAProxy-based serving with fallback proxy support and load-balancer tunables, giving operators a higher-throughput ingress path and more control over traffic behavior in production. (#60586, #61180, #61271, #61468, #61988)
- ➕ Added queue-based autoscaling for async inference and Taskiq-backed workloads, so scaling decisions can account for both HTTP in-flight load and queued tasks. (#59548, #60851, #60977, #61008)
- ⚡️ Rolled out gang scheduling support across validation, core scheduling, fault tolerance, downscaling, autoscaling, rolling updates, and migration, enabling coordinated multi-replica placement for tightly coupled workloads. (#60944, #61205, #61206, #61207, #61215, #61467, #61216, #61659)
- 🚀 Introduced deployment-scoped actors with config/schema, lifecycle management, public API, and controller health checks, making it easier to run durable per-deployment sidecar-like logic inside Serve. (#61639, #61648, #61664, #61833, #62161)
💫 Enhancements:
- ➕ Added first-class tracing support for Serve, including inter-deployment gRPC propagation and richer streaming-path attributes, improving end-to-end observability across distributed request flows. (#61230, #61089, #61451)
- 🔊 Expanded operational metrics with replica utilization, richer error labeling, and client IP logging in access logs, helping teams diagnose bottlenecks and user-impacting issues faster. (#60758, #61092, #60967)
- 👌 Improved autoscaling extensibility with class-based policies and policy_kwargs, so advanced users can package reusable autoscaling logic without custom forks. (#60964)
- 🚀 Reduced controller overhead with broad algorithmic improvements (indexing, cache reuse, and avoiding repeated per-tick work), which improves scalability as deployment and replica counts grow. (#60810, #60829, #60830, #60838, #60842, #60843, #60844, #60832, #60806)
- 👌 Improved throughput-oriented operation controls by adding environment-based tuning and explicit throughput optimization logging, making performance behavior easier to configure and audit. (#60757, #62146)
- ⬆️ Upgraded Serve internals to Pydantic v2 and refined time-series aggregation behavior for more predictable metric accuracy under high load. (#61061, #61403)
🛠 🔨 Fixes:
- 🛠 Fixed a direct-ingress shutdown bug where replicas could hang indefinitely while draining stuck requests, ensuring bounded shutdown behavior in failure scenarios. (#60754)
- 🛠 Fixed HAProxy reliability issues, including config race conditions, draining guards, and platform compatibility edge cases, improving stability in production rollouts. (#61120, #60955)
- 🛠 Fixed autoscaling correctness issues that could cause runaway scaling or delayed reactions, including feedback-loop regressions, streaming scale-down behavior, and wall-clock delay handling. (#61731, #61920, #62331, #61844, #60613)
- 🛠 Fixed high-percentile latency regression in request routing and queue-length accounting, reducing tail-latency spikes under load. (#61755)
- 🛠 Fixed replica-state and health-state edge cases during migration and ingress transitions, preventing false errors and unhealthy/healthy misreporting. (#60365, #61818, #62213)
- 🛠 Fixed chained upstream actor-failure handling so request failures are attributed correctly and no longer hang when upstream deployments die mid-chain. (#61758, #62147)
- 🛠 Fixed HTTP status classification for client disconnects after successful responses, improving accuracy of error-rate monitoring and alerting. (#61396)
📚 📖 Documentation:
- ➕ Added AsyncInferenceAutoscalingPolicy documentation and clarified Serve performance guidance for HAProxy and inter-deployment gRPC use cases. (#61086, #61386)
- 🚀 Updated scheduling and configuration docs, including replica scheduling guidance and a catalog of Serve environment variables, so operators can tune deployments with less guesswork. (#60922, #60807)
- 📄 Clarified multiplexing and async behavior docs (including model pre-warming con...
- ➕ Add