Release Notes

AI 2.3.0

New and Optimized Features

Ray-based Distributed Workloads

KubeRay and CodeFlare SDK enable developers to run Ray-based distributed workloads from Workbench and manage remote Ray clusters on Kubernetes. Developers can create and monitor RayCluster resources, submit RayJob workloads, and define distributed compute jobs for Python-based environments.
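As a concrete illustration, a RayCluster resource can be expressed as a plain Python dict and submitted with a Kubernetes client. This is a minimal sketch following the KubeRay CRD schema (`ray.io/v1`); the image tag, namespace, and resource sizes are illustrative placeholders, not values mandated by this release.

```python
# Minimal RayCluster manifest as a Python dict (KubeRay CRD schema).
# Image tags and resource sizes below are placeholders for illustration.
ray_cluster = {
    "apiVersion": "ray.io/v1",
    "kind": "RayCluster",
    "metadata": {"name": "demo-raycluster", "namespace": "demo"},
    "spec": {
        "headGroupSpec": {
            "rayStartParams": {"dashboard-host": "0.0.0.0"},
            "template": {
                "spec": {
                    "containers": [{
                        "name": "ray-head",
                        "image": "rayproject/ray:2.9.0",  # placeholder tag
                        "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
                    }]
                }
            },
        },
        "workerGroupSpecs": [{
            "groupName": "workers",
            "replicas": 2,
            "minReplicas": 1,
            "maxReplicas": 4,
            "rayStartParams": {},
            "template": {
                "spec": {
                    "containers": [{
                        "name": "ray-worker",
                        "image": "rayproject/ray:2.9.0",  # placeholder tag
                        "resources": {"limits": {"cpu": "4", "memory": "8Gi"}},
                    }]
                }
            },
        }],
    },
}
```

The same dict structure, with a `rayClusterSpec` nested under a job spec, applies to RayJob submissions; pin field names to the CRD version installed in your cluster.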

Feast Feature Store

Feast Feature Store provides a consistent way to manage reusable machine learning features across training, batch scoring, and online inference. Administrators can deploy Feast on Kubernetes through the FeatureStore custom resource, which manages core services such as the online store, offline store, registry, UI, and client configuration.
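A FeatureStore custom resource can be sketched the same way. The API group/version and field names below are assumptions based on the Feast operator's conventions, and the project name is hypothetical; verify against the CRD deployed in your environment.

```python
# Minimal FeatureStore custom resource as a Python dict.
# API group/version and spec fields are assumed; the project name is hypothetical.
feature_store = {
    "apiVersion": "feast.dev/v1alpha1",  # assumed operator API group/version
    "kind": "FeatureStore",
    "metadata": {"name": "sample-feast", "namespace": "feast"},
    "spec": {
        "feastProject": "credit_scoring",  # hypothetical Feast project name
    },
}
```

Once applied, the operator reconciles the spec into the core services listed above (online store, offline store, registry, UI) with default backends unless overridden.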

Connection Hub S3 Support

Connection Hub supports S3 connection types, expanding the set of reusable connection configurations for AI workflows. It enables users to configure S3-compatible object storage access once and reuse the connection across supported model, data, and development workflows.
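In practice, a workload consumes such a connection through injected credentials. The sketch below builds keyword arguments for an S3 client (for example, `boto3.client("s3", **kwargs)`) from environment variables; the variable names shown are the conventional AWS-style ones and are an assumption here, so adjust them to match what your connection actually injects.

```python
import os

def s3_client_kwargs(env=os.environ):
    """Build S3 client keyword arguments from connection environment variables.

    Variable names (AWS_S3_ENDPOINT, AWS_ACCESS_KEY_ID, ...) are the
    conventional ones and assumed here; adjust to your deployment.
    """
    return {
        "endpoint_url": env["AWS_S3_ENDPOINT"],          # S3-compatible endpoint
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
```

Because the connection is configured once in Connection Hub, every workflow that mounts it can reuse the same helper without duplicating credentials in code.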

TrainingHub Fine-Tuning and Post-Tuning

TrainingHub provides a unified high-level API for model fine-tuning and post-tuning in Workbench environments. It supports SFT and OSFT workflows across single-GPU, multi-GPU, and multi-node execution, simplifying distributed training configuration, memory management, checkpointing, and experiment tracking.
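The idea of a unified API that scales from single-GPU to multi-node runs can be sketched as a base job configuration plus a topology overlay. All parameter and model names below are hypothetical placeholders, not the TrainingHub API; consult the TrainingHub documentation for the actual interface.

```python
# Illustrative SFT job configuration. Parameter names and the model id are
# hypothetical placeholders, not the TrainingHub API.
base_config = {
    "model_path": "example-org/base-model-8b",   # placeholder model id
    "data_path": "/data/train.jsonl",
    "ckpt_output_dir": "/checkpoints/sft-run",
    "num_epochs": 3,
    "learning_rate": 1e-5,
    "max_seq_len": 4096,
}

def with_distribution(config, nnodes=1, nproc_per_node=1):
    """Layer torchrun-style topology settings onto a base config,
    leaving the training hyperparameters untouched."""
    return {**config, "nnodes": nnodes, "nproc_per_node": nproc_per_node}

# Same hyperparameters, scaled out to 2 nodes with 8 GPUs each.
multi_node = with_distribution(base_config, nnodes=2, nproc_per_node=8)
```

Keeping topology separate from hyperparameters is the design point: a job validated on one GPU moves to multi-node execution without rewriting its training configuration.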

Expanded Notebook Base Image Library for ARM

The Notebook Base Image Library for ARM now includes minimal CANN, PyTorch CANN, MindSpore CANN, and datascience code-server images. The CANN-based images provide framework support for Ascend NPUs, expanding ARM-based development options for notebook and code-server environments.

Deprecated Features

None in this release.

Fixed Issues

  • [LWS] Added master/control-plane node tolerations to the LWS controller to resolve pods stuck in the Pending state.
  • Fixed an inconsistency between the tag and the startup tag of a Node Feature Discovery package that caused an anomalous deployment state in the global cluster (the business cluster was unaffected).
  • Fixed an issue where, after the ServingRuntime parameters were updated in the management view, inference services referencing that ServingRuntime continued to use the old parameters, even after being stopped and restarted or having some of their own parameters updated.

Known Issues

  • Modifying library_name by directly editing the README file in GitLab does not synchronize the model type change on the page.
    Workaround: Modify library_name through the UI instead of editing the file directly in GitLab.
  • When VictoriaMetrics is used to collect monitoring data for inference services running in Serverless mode, the inference services cannot scale down to zero.