# Frequently Asked Questions

## General

### Overview

#### What is UCX?
UCX is a framework (a collection of libraries and interfaces) that provides an efficient
and relatively easy way to construct widely used HPC protocols: MPI tag matching,
RMA operations, rendezvous protocols, stream, fragmentation, remote atomic operations, etc.

#### What are UCP, UCT, and UCS?
* **UCT** is a transport layer that abstracts the differences across various hardware architectures and provides a low-level API that enables the implementation of communication protocols. The primary goal of the layer is to provide direct and efficient access to hardware network resources with minimal software overhead. For this purpose, UCT relies on low-level drivers provided by vendors, such as InfiniBand Verbs, Cray's uGNI, libfabrics, etc. In addition, the layer provides constructs for communication context management (thread-based and application level), and allocation and management of device-specific memories, including those found in accelerators. In terms of communication APIs, UCT defines interfaces for immediate (short), buffered copy-and-send (bcopy), and zero-copy (zcopy) communication operations. The short operations are optimized for small messages that can be posted and completed in place. The bcopy operations are optimized for medium-size messages that are typically sent through a so-called bouncing buffer. Finally, the zcopy operations expose zero-copy memory-to-memory communication semantics.

* **UCP** implements higher-level protocols that are typically used by message passing (MPI) and PGAS programming models, using the lower-level capabilities exposed through the UCT layer.
UCP is responsible for the following functionality: initialization of the library, selection of transports for communication, message fragmentation, and multi-rail communication.
Currently, the API has the following classes of interfaces: Initialization, Remote Memory Access (RMA) communication, Atomic Memory Operations (AMO), Active Message, Tag-Matching, and Collectives.

* **UCS** is a service layer that provides the necessary functionality for implementing portable and efficient utilities.

#### How can I contribute?
1. Fork
2. Fix a bug or implement a new feature
3. Open a pull request

#### How do I get in touch with UCX developers?
Please join our mailing list: https://elist.ornl.gov/mailman/listinfo/ucx-group or
submit issues on GitHub: https://github.com/openucx/ucx/issues

<br/>

### UCX mission

#### What are the key features of UCX?
* **Open source framework supported by vendors.**
The UCX framework is maintained and supported by hardware vendors, in addition to the open source community. Every pull request is tested on multiple hardware platforms supported by the vendor community.

* **Performance, performance, performance!**
The framework design, data structures, and components are designed to provide highly optimized access to the network hardware.

* **High-level API for a broad range of HPC programming models.**
UCX provides a high-level API, implemented in software ('UCP'), to fill in the gaps across interconnects. This allows using a single set of APIs in a library to support multiple interconnects, which reduces the complexity of implementing libraries such as Open MPI or OpenSHMEM. As a result, UCX is performance portable: a single implementation (in Open MPI or OpenSHMEM) will work efficiently on multiple interconnects (e.g. uGNI, Verbs, libfabrics, etc.).

* **Support for interaction between multiple transports (or providers) to deliver messages.**
For example, UCX has the logic (in UCP) to make 'GPUDirect', IB, and shared memory work together efficiently to deliver the data where it is needed, without the user having to deal with this.
* **Cross-transport multi-rail capabilities.** The UCX protocol layer can utilize multiple transports,
  even on different types of hardware, to deliver messages faster, without the need for
  any special tuning.

* **Utilizing hardware offloads for optimized performance**, such as RDMA, hardware tag matching,
  hardware atomic operations, etc.

#### What protocols are supported by UCX?
UCP implements RMA put/get, send/receive with tag matching, active messages, and atomic operations. In the near future we plan to add support for commonly used collective operations.

#### Is UCX a replacement for GASNET?
No. GASNET exposes a high-level API for PGAS programming that provides symmetric memory management capabilities and built-in runtime environments. These capabilities are out of the scope of the UCX project.
Instead, GASNET can leverage the UCX framework for a fast and efficient implementation of GASNET for the network technologies supported by UCX.

#### What is the relation between UCX and network drivers?
The UCX framework does not provide drivers; instead, it relies on the drivers provided by vendors. Currently we use: OFA Verbs, Cray's uGNI, NVIDIA CUDA.

#### What is the relation between UCX and OFA Verbs or Libfabrics?
UCX is a middleware communication layer that relies on vendor-provided user-level drivers, including OFA Verbs or libfabrics (or any other drivers provided by other communities or vendors), to implement high-level protocols which can be used to close functionality gaps between various vendor drivers, including various libfabrics providers: coordination across various drivers, multi-rail capabilities, software-based RMA, AMOs, and tag matching for transports and drivers that do not support such capabilities natively.

#### Is UCX a user-level driver?
No. Typically, drivers aim to expose fine-grained access to the network architecture-specific features.
UCX abstracts the differences across various drivers and fills in the gaps using software protocols for architectures that don't provide hardware-level support for all the operations.

<br/>

### Dependencies

#### What should I have on my machine to use UCX?

UCX detects the existing libraries on the build machine and enables/disables support
for various features accordingly.
If some of the modules UCX was built with are not found during runtime, they will
be silently disabled.

* **Basic shared memory and TCP support** - always enabled.
* **Optimized shared memory** - requires knem or xpmem drivers. On modern kernels, the CMA (cross-memory attach) mechanism will also be used.
* **RDMA support** - requires the rdma-core or libibverbs library.
* **NVIDIA GPU support** - requires CUDA drivers.
* **AMD GPU support** - requires ROCm drivers.


#### Does UCX depend on an external runtime environment?
UCX does not depend on an external runtime environment.

`ucx_perftest` (a UCX-based application/benchmark) can be linked with an external runtime environment that can be used for remote `ucx_perftest` launch, but this is an optional configuration which is only used for environments that do not provide direct access to compute nodes. By default this option is disabled.

<br/>


### Configuration and tuning

#### How can I specify special configuration and tunings for UCX?

UCX takes parameters from specific **environment variables**, which start with the
prefix `UCX_`.
> **IMPORTANT NOTE:** Changing the values of UCX environment variables to non-default
> values may lead to undefined behavior. The environment variables are mostly intended for
> advanced users, or for specific tunings or workarounds recommended by the UCX community.

#### Where can I see all UCX environment variables?

* Running `ucx_info -c` prints all environment variables and their default values.
* Running `ucx_info -cf` prints the documentation for all environment variables.


<br/>

---
<br/>

## Network capabilities

### Selecting networks and transports

#### Which network devices does UCX use?

By default, UCX tries to use all available devices on the machine, and selects the
best ones based on performance characteristics (bandwidth, latency, NUMA locality, etc.).
Setting `UCX_NET_DEVICES=<dev1>,<dev2>,...` restricts UCX to using **only**
the specified devices.
For example:
* `UCX_NET_DEVICES=eth2` - Use the Ethernet device eth2 for the TCP sockets transport.
* `UCX_NET_DEVICES=mlx5_2:1` - Use the RDMA device mlx5_2, port 1.

Running `ucx_info -d` shows all available devices on the system that UCX can utilize.

#### Which transports does UCX use?

By default, UCX tries to use all available transports, and selects the best ones
according to their performance capabilities and scale (passed as the estimated number
of endpoints to the *ucp_init()* API).
For example:
* On machines with Ethernet devices only, shared memory will be used for intra-node
  communication and TCP sockets for inter-node communication.
* On machines with RDMA devices, the RC transport will be used for small scale, and
  the DC transport (available with Connect-IB devices and above) will be used for large
  scale. If DC is not available, UD will be used for large scale.
* If GPUs are present on the machine, GPU transports will be enabled for detecting
  the memory pointer type and copying to/from GPU memory.

It's possible to restrict the transports in use by setting `UCX_TLS=<tl1>,<tl2>,...`.
The list of all transports supported by UCX on the current machine can be generated
by the `ucx_info -d` command.
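As a concrete illustration, here is a hedged sketch of restricting UCX through environment variables before launching an application; the device name `mlx5_2:1` is purely illustrative, and the `UCX_TLS`/`UCX_NET_DEVICES` variables are the ones documented above:

```shell
# Hedged sketch: restrict UCX to RC transports plus CUDA memory support,
# and to a single (hypothetical) RDMA device mlx5_2, port 1.
export UCX_TLS=rc,cuda
export UCX_NET_DEVICES=mlx5_2:1

# These variables are simply inherited by whatever command is launched next
# (e.g. `mpirun ...`). Here we just display what would be passed along:
env | grep '^UCX_' | sort
```

The same settings can also be applied to a single command without exporting, e.g. `UCX_TLS=rc,cuda mpirun ...`.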
> **IMPORTANT NOTE**
> In some cases, restricting the transports can lead to unexpected and undefined behavior:
> * Using *rc_verbs* or *rc_mlx5* also requires the *ud_verbs* or *ud_mlx5* transport for bootstrap.
> * Applications using GPU memory must also specify GPU transports for detecting and
>   handling non-host memory.

In addition to the built-in transports, it's possible to use aliases which specify multiple transports.

##### List of main transports and aliases
<table>
<tr><td>all</td><td>use all the available transports.</td></tr>
<tr><td>sm or shm</td><td>all shared memory transports.</td></tr>
<tr><td>ugni</td><td>ugni_rdma and ugni_udt.</td></tr>
<tr><td>rc</td><td>RC (= reliable connection); "accelerated" transports are used if possible.</td></tr>
<tr><td>ud</td><td>UD (= unreliable datagram); "accelerated" transports are used if possible.</td></tr>
<tr><td>dc</td><td>DC - Mellanox scalable offloaded dynamic connection transport.</td></tr>
<tr><td>rc_x</td><td>Same as "rc", but using accelerated transports only.</td></tr>
<tr><td>rc_v</td><td>Same as "rc", but using Verbs-based transports only.</td></tr>
<tr><td>ud_x</td><td>Same as "ud", but using accelerated transports only.</td></tr>
<tr><td>ud_v</td><td>Same as "ud", but using Verbs-based transports only.</td></tr>
<tr><td>cuda</td><td>CUDA (NVIDIA GPU) memory support: cuda_copy, cuda_ipc, gdr_copy.</td></tr>
<tr><td>rocm</td><td>ROCm (AMD GPU) memory support: rocm_copy, rocm_ipc, rocm_gdr.</td></tr>
<tr><td>tcp</td><td>TCP over SOCK_STREAM sockets.</td></tr>
<tr><td>self</td><td>Loopback transport to communicate within the same process.</td></tr>
</table>

For example:
- `UCX_TLS=rc` will select RC (with UD for bootstrap), and prefer accelerated transports.
- `UCX_TLS=rc,cuda` will select RC along with CUDA memory transports.


<br/>


### Multi-rail

#### Does UCX support multi-rail?

Yes.
189 190#### What is the default behavior in a multi-rail environment? 191 192By default UCX would pick the 2 best network devices, and split large 193messages between the rails. For example, in a 100MB message - the 1st 50MB 194would be sent on the 1st device, and the 2nd 50MB would be sent on the 2nd device. 195If the device network speeds are not the same, the split will be proportional to 196their speed ratio. 197 198The devices to use are selected according to best network speed, PCI bandwidth, 199and NUMA locality. 200 201#### Is it possible to use more than 2 rails? 202 203Yes, by setting `UCX_MAX_RNDV_RAILS=<num-rails>`. Currently up to 4 are supported. 204 205#### Is it possible that each process would just use the closest device? 206 207Yes, by `UCX_MAX_RNDV_RAILS=1` each process would use a single network device 208according to NUMA locality. 209 210#### Can I disable multi-rail? 211 212Yes, by setting `UCX_NET_DEVICES=<dev>` to the single device that should be used. 213 214<br/> 215 216### Adaptive routing 217 218#### Does UCX support adaptive routing fabrics? 219 220Yes. 221 222#### What do I need to do to run UCX with adaptive routing? 223 224When adaptive routing is configured on an Infiniband fabric, it is enabled per SL 225(IB Service Layer). 226Setting `UCX_IB_SL=<sl-num>` will make UCX run on the given 227service level and utilize adaptive routing. 228 229<br/> 230 231### RoCE 232 233#### How to specify service level with UCX? 234 235Setting `UCX_IB_SL=<sl-num>` will make UCX run on the given service level. 236 237#### How to specify DSCP priority? 238 239Setting `UCX_IB_TRAFFIC_CLASS=<num>`. 240 241#### How to specify which address to use? 242 243Setting `UCX_IB_GID_INDEX=<num>` would make UCX use the specified GID index on 244the RoCE port. The system command `show_gids` would print all available addresses 245and their indexes. 246 247--- 248<br/> 249 250## Working with GPU 251 252### GPU support 253 254#### How UCX supports GPU? 
UCX protocol operations can work with GPU memory pointers in the same way as with host
memory pointers. For example, the 'buffer' argument passed to `ucp_tag_send_nb()` can
be either host or GPU memory.


#### Which GPUs are supported?

Currently, UCX supports NVIDIA GPUs through the CUDA library, and AMD GPUs through the ROCm library.


#### Which UCX APIs support GPU memory?

Currently, only the UCX tagged APIs (`ucp_tag_send_XX`/`ucp_tag_recv_XX`) and stream APIs
(`ucp_stream_send`/`ucp_stream_recv_XX`) support GPU memory.

#### How to run UCX with GPU support?

In order to run UCX with GPU support, you will need an application which allocates
GPU memory (for example,
[MPI OSU benchmarks with CUDA support](https://mvapich.cse.ohio-state.edu/benchmarks)),
and UCX compiled with GPU support. Then you can run the application as usual (for
example, with MPI), and whenever GPU memory is passed to UCX, it either uses GPU-direct
for zero-copy operations, or copies the data to/from host memory.
> **NOTE:** When specifying UCX_TLS explicitly, you must also specify cuda/rocm for GPU memory
> support, otherwise the GPU memory will not be recognized.
> For example: `UCX_TLS=rc,cuda` or `UCX_TLS=dc,rocm`.

#### I'm running UCX with GPU memory and getting a segfault. Why?

Most likely UCX does not detect that the pointer is GPU memory and tries to
access it from the CPU. This can happen if UCX is not compiled with GPU support, or fails
to load the CUDA or ROCm modules due to missing library paths or a version mismatch.
Please run `ucx_info -d | grep cuda` or `ucx_info -d | grep rocm` to check for
UCX GPU support.

#### What are the current limitations of using GPU memory?

* **Static compilation** - programs which are statically compiled with CUDA libraries
  must disable the memory detection cache by setting `UCX_MEMTYPE_CACHE=n`. The reason
  is that memory allocation hooks do not work with static compilation.
  Disabling this cache could have a negative effect on performance, especially for small messages.

<br/>

### Performance considerations

#### Does UCX support zero-copy for GPU memory over RDMA?

Yes. For large messages, UCX can transfer GPU memory using zero-copy RDMA with the
rendezvous protocol. This requires the peer memory driver for the relevant GPU type
to be loaded on the system.
> **NOTE:** In some cases, if the RDMA network device and the GPU are not on
> the same NUMA node, such zero-copy transfers are inefficient.



<br/>
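As a quick sanity check, one can look for a peer-memory module among the loaded kernel modules. This is a hedged sketch: the module names below are common ones and are assumptions on our part, since they vary across driver generations and vendors.

```shell
# Hedged sketch: check /proc/modules for common GPU peer-memory kernel modules.
# The module names (nvidia_peermem, nv_peer_mem, ib_peer_mem) are assumptions;
# consult your GPU/RDMA driver documentation for the exact name on your system.
if grep -qE 'nvidia_peermem|nv_peer_mem|ib_peer_mem' /proc/modules 2>/dev/null; then
    echo "peer memory module: loaded"
else
    echo "peer memory module: not found"
fi
```

If no module is found, zero-copy RDMA from GPU memory is unlikely to be used, and UCX would fall back to staging the data through host memory.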