NVIDIA AI Dynamo Creation Guide
This guide explains how to create and configure NVIDIA AI Dynamo through the Cordatus Platform.
NVIDIA AI Dynamo is a powerful distributed runtime framework that enables efficient execution of large language models (LLMs) across multi-GPU and multi-node environments.
Through Cordatus, users can easily launch NVIDIA AI Dynamo on vLLM Runtime with their preferred model, using a minimal configuration.
This allows prefill and decode processes to be distributed intelligently across multiple GPUs, optimizing both performance and scalability.
See details → Application Hub Overview | Application Hub Quickstart | Standard Application Launch Guide | Container Management Guide
About NVIDIA AI Dynamo
NVIDIA AI Dynamo is a high-performance distributed execution engine designed for large-scale AI model workloads. It allows multiple GPUs or nodes to collaboratively process large LLM tasks by efficiently managing memory, computation, and communication.
Dynamo offers the following key capabilities:
- Parallel execution of prefill and decode tasks across GPUs
- Optimized management of KV Cache blocks
- Low-latency data transfers between workers
- Dynamic scalability for distributed inference workloads
The Cordatus Platform abstracts Dynamo’s technical complexity with a user-friendly interface, allowing users to define runtime modes, routing strategies, connectors, and GPU assignments effortlessly.
Steps to Launch NVIDIA AI Dynamo
Selecting the Application
- From the left-side menu, click Containers.
- From the dropdown submenu, select Applications.
- Choose Dynamo from the list of applications and navigate to its Detail page.
Device and Version Selection
- Click Start Application.
- In the modal window, select the device on which you want to run NVIDIA Dynamo and connect to it.
- Proceed to the Version step, where available Docker image versions are displayed.
- If the image is already installed on your device, a downloaded icon appears.
- If not, a to-be-downloaded icon is shown.
Advanced Settings
The final step takes you to the Advanced Settings page, where you can configure all parameters for launching your NVIDIA AI Dynamo container.
Advanced Settings Details
General Settings
- Environment Name: Assign a custom name to your container. If left blank, Cordatus automatically generates one.
- Enable Open Web UI: When enabled, Cordatus automatically creates an Open Web UI container during deployment. This allows you to interact with your running model directly through a browser interface.
  Learn more: The functionality and configuration are identical to standard applications. See Enable Open Web UI - Section 5.2 for a detailed explanation and video tutorial.
- Model Selection: On the bottom-right panel, select the model to run. Three options are available:
  - Cordatus Models: Displays pre-tested and pre-registered models within the Cordatus system, along with their tags.
    💡 Smart Volume Detection: If you have configured model paths on your device, Cordatus automatically detects installed models and checks whether the selected Cordatus Model already exists on your device. If found, Cordatus automatically configures the volume mount to use your local copy. If the model is not found locally, the default volume mount path is used (the model is downloaded during container launch); you can modify this manually in the Docker Options section if needed.
  - Custom Model: If the model you wish to run is not in the system, enter its name manually in the Model Name field.
    Learn more: The usage is identical to standard applications. See Custom Model - Section 5.2 for a detailed explanation and video tutorial.
  - User Models: Displays models that you have previously defined and added to your Cordatus account.
    Learn more: The configuration and usage process is identical to standard applications. See User Models - Section 5.2 for complete details, including:
    - Same Device Usage
    - Model Transfer Feature
    - Automatic Volume Configuration
    - Video tutorials for both scenarios
- Resource Limits: Control how much CPU and RAM your container can use. The Resource Limits section allows you to optimize container performance and manage system resources efficiently.

  CPU Core Assignment: Limit the maximum number of CPU cores the container may use.
  - Setting Limits: Manually specify how many CPU cores the container can utilize.
  - Host Reserved: Cordatus automatically reserves a number of CPU cores for system operations so your device continues functioning normally. You therefore cannot assign all CPU cores to the container.
  - No Limit Option: To allocate all CPU resources to the container, check the No Limit option in the upper-right corner. This removes the Host Reserved restriction and makes all cores available to the container.

  RAM Assignment: Specify how much of your system's total RAM can be allocated to the container.
  - Setting Limits: Manually define the amount of RAM the container can use.
  - Host Reserved: Cordatus automatically reserves a certain amount of RAM for system stability so your device operates properly. You therefore cannot assign all RAM to the container.
  - No Limit Option: To allocate all RAM to the container, check the No Limit option in the upper-right corner. This removes the Host Reserved restriction and makes all RAM available to the container.

  ⚠️ Warning: The No Limit option allocates the entire system to the container, which may negatively impact your device's performance and stability. Use it with caution.
  💡 Recommendation: For optimal performance and system stability, keep the Host Reserved values automatically set by Cordatus when allocating resources.
Processing Mode
This section defines how NVIDIA Dynamo distributes workloads between GPUs.
- Aggregated Mode: Both prefill and decode processes are executed on the same GPU. Suitable for small to medium-scale deployments.
- Disaggregated Mode: Prefill and decode processes run on different GPUs, optimizing performance for large-scale models.
The choice of processing mode directly impacts GPU assignment and worker creation, as described below.
Router Configuration
The Router defines how incoming requests are distributed among workers.
- KV-Aware (Smart) Router: Tracks which worker holds specific KV Cache blocks and routes requests to the optimal worker, reducing recomputation and latency.
- Round Robin Router: Distributes requests sequentially across all workers for balanced workload distribution. Does not use KV awareness.
- Random Router: Routes requests to randomly selected workers. Simple, but less efficient for large-scale workloads.
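The three strategies can be sketched as routing functions over a worker pool. This is a toy illustration only — the worker names, request shape, and KV index are invented for the example; the real KV-aware router operates on live KV Cache block state:

```python
# Toy sketch of the three routing strategies. Worker names, the request shape,
# and the kv_index are illustrative; real routing tracks live KV Cache blocks.
import itertools
import random

workers = ["worker-0", "worker-1", "worker-2"]

# Round Robin: cycle through the workers in a fixed order.
_rr = itertools.cycle(workers)
def round_robin(_request: dict) -> str:
    return next(_rr)

# Random: pick any worker, with no memory of past assignments.
def random_route(_request: dict) -> str:
    return random.choice(workers)

# KV-aware: prefer the worker that already holds KV blocks for this prefix,
# falling back to round robin on a cache miss (fallback choice is ours).
kv_index = {"session-42": "worker-1"}  # prefix -> worker holding its KV blocks
def kv_aware(request: dict) -> str:
    return kv_index.get(request.get("prefix"), round_robin(request))

print(kv_aware({"prefix": "session-42"}))  # routed to the cache holder
```

A cache hit avoids recomputing the prefill for that prefix on a cold worker, which is exactly the recomputation-and-latency saving the KV-aware router targets.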
Connector Configuration
The Connector defines how data is exchanged between workers or nodes in a distributed system.
It becomes active when KV blocks or intermediate data need to be transferred.
- KVBM Connector: Acts as a translation layer between runtime outputs (e.g., TRT-LLM, vLLM) and the KV Block Manager. Manages KV block memory across RAM and disk.
- NIXL Connector: Enables low-latency, RDMA-like memory access between workers for optimized data transfers in large-scale setups.
- LM Cache Connector: Provides a reusable cache layer for text or preset KV blocks, preventing redundant computation and improving performance.
Memory Allocation:
- For LM Cache Connector, specify the amount of RAM to allocate for LM cache.
- For KVBM Connector, define how much RAM and Disk space to allocate for KV block management.
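To give an intuition for why KVBM asks for both a RAM and a disk budget, here is a toy tiered-placement sketch: blocks fill the faster RAM tier first and spill to disk once RAM is full. The function, block sizes, and budgets are invented for illustration and are not the KV Block Manager's actual algorithm:

```python
# Toy sketch of tiered KV block placement: RAM first, spill to disk when the
# RAM budget is exhausted. Names, sizes, and budgets are illustrative only.

def place_blocks(block_sizes_gb: list[float],
                 ram_budget_gb: float,
                 disk_budget_gb: float) -> list[str]:
    """Assign each KV block to 'ram' or 'disk'; raise when both tiers are full."""
    placement: list[str] = []
    ram_used = disk_used = 0.0
    for size in block_sizes_gb:
        if ram_used + size <= ram_budget_gb:
            ram_used += size
            placement.append("ram")
        elif disk_used + size <= disk_budget_gb:
            disk_used += size
            placement.append("disk")
        else:
            raise MemoryError("KV block budgets exhausted")
    return placement

# Four 2 GB blocks against a 5 GB RAM budget: two fit in RAM, two spill to disk.
print(place_blocks([2, 2, 2, 2], ram_budget_gb=5, disk_budget_gb=10))
```

Sizing the RAM budget too small pushes more blocks onto the slower disk tier, which is the trade-off behind the memory allocation fields above.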
GPU Assignment and Worker Creation
This section allows you to create workers and assign GPUs to them, depending on the selected processing mode.
- The left-side Available GPUs list displays all GPUs in your system that are not yet assigned to any worker.
Aggregated Mode
- Only Decode Workers can be created.
- Each worker must be assigned at least one GPU.
- You can create as many workers as needed.
Disaggregated Mode
- You can create both Prefill Workers and Decode Workers.
- Each worker must be assigned at least one GPU.
- The number of workers is fully customizable to match your workload.
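The worker rules above can be summarized as a small validation sketch: every worker needs at least one GPU, a GPU belongs to at most one worker, and prefill workers exist only in disaggregated mode. The function and data shapes are illustrative assumptions, not Cordatus internals:

```python
# Sketch of the worker-creation rules: >=1 GPU per worker, no GPU shared
# between workers, prefill workers only in disaggregated mode.
# Worker naming ('prefill-0', 'decode-0') is an assumption for the example.

def validate_assignment(mode: str, workers: dict[str, list[int]]) -> None:
    """workers maps a worker name to the list of GPU ids assigned to it."""
    seen: set[int] = set()
    for name, gpus in workers.items():
        if mode == "aggregated" and name.startswith("prefill"):
            raise ValueError("aggregated mode only allows decode workers")
        if not gpus:
            raise ValueError(f"{name} must be assigned at least one GPU")
        if seen & set(gpus):
            raise ValueError("a GPU cannot be assigned to two workers")
        seen |= set(gpus)

# A valid disaggregated layout: one prefill GPU, two decode GPUs.
validate_assignment("disaggregated", {"prefill-0": [0], "decode-0": [1, 2]})
```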
Configuration Settings
After selecting your model, Cordatus automatically loads the default configuration required for Dynamo to function properly.
Docker Options
- Configure standard `docker run` parameters.
- Use the Choose Options menu on the left to add arguments such as `--env`, `--volume`, or `--network`.
- Port Mapping: Cordatus automatically assigns an available port, which you can modify manually. If a port is already in use, the label Port is Allocated appears.
- Volume Mapping: Cordatus provides a visual interface to browse disks, select directories, and create new folders for volume bindings.
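The "Port is Allocated" check boils down to testing whether the candidate host port can still be bound. A minimal sketch of such a check (the function name is ours; how Cordatus actually detects allocated ports is not specified here):

```python
# Minimal sketch of a port-availability check: try to bind the candidate
# host port; a failed bind means the port is already allocated.
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if `host:port` can currently be bound by a new socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True   # bind succeeded: port is available
        except OSError:
            return False  # address already in use (or not permitted)
```

A UI would run a check like this before showing the auto-assigned port, and re-run it whenever you type a custom value.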
Environment Variables
- Displays the default environment variables defined for Dynamo.
- Add new variables using the Add your own Environment Variables feature.
- If a variable is a Token, select an existing token from your Cordatus account.
(See the relevant section for instructions on adding tokens to Cordatus.)
Dynamo Arguments
- Define engine-level arguments specific to your Dynamo instance.
- Depending on the selected Processing Mode, configure how Prefill and Decode workers operate — e.g., batch size, concurrency limits, or cache strategy.
Launching the Application
Once you have configured GPU assignments, workers, Docker Options, Environment Variables, and Dynamo Arguments, the Start Environment button becomes active.
- Cordatus will request your Sudo Password to authorize container creation.
- If the selected Docker image is not present on your device, Cordatus will prompt you to confirm whether it should be downloaded.
- If already available, Cordatus will start the container using your defined configuration.
You can monitor container statuses under:
- Applications > Containers, or
- The Containers section in the main sidebar.
Cordatus will also automatically configure any supporting applications required for Dynamo’s operation, ensuring optimal runtime setup.