.. include:: replace.txt

MPI for Distributed Simulation
------------------------------

Parallel and distributed discrete event simulation allows the execution of a
single simulation program on multiple processors. By splitting up the
simulation into logical processes, LPs, each LP can be executed by a different
processor. This simulation methodology enables very large-scale simulations by
leveraging increased processing power and memory availability. In order to
ensure proper execution of a distributed simulation, message passing between
LPs is required. To support distributed simulation in |ns3|, the standard
Message Passing Interface (MPI) is used, along with a new distributed
simulator class. Currently, dividing a simulation for distributed purposes in
|ns3| can only occur across point-to-point links.

.. _current-implementation-details:

Current Implementation Details
******************************

During the course of a distributed simulation, many packets must cross
simulator boundaries. In other words, a packet that originated on one LP is
destined for a different LP, and in order to make this transition, a message
containing the packet contents must be sent to the remote LP. Upon receiving
this message, the remote LP can rebuild the packet and proceed as normal. The
process of sending and receiving messages between LPs is handled easily by the
new MPI interface in |ns3|.

Along with simple message passing between LPs, a distributed simulator is used
on each LP to determine which events to process. It is important to process
events in time-stamped order to ensure proper simulation execution. If an LP
receives a message containing an event from the past, clearly this is an
issue, since this event could change other events which have already been
executed. To address this problem, two conservative synchronization algorithms
with lookahead are used in |ns3|.
For more information on different synchronization approaches and parallel and
distributed simulation in general, please refer to "Parallel and Distributed
Simulation Systems" by Richard Fujimoto.

The default parallel synchronization strategy implemented in the
DistributedSimulatorImpl class is based on a globally synchronized algorithm
using an MPI collective operation to synchronize simulation time across all
LPs. A second synchronization strategy based on local communication and null
messages is implemented in the NullMessageSimulatorImpl class. For the null
message strategy the global all-to-all gather is not required; LPs only need
to communicate with LPs that share point-to-point links. The algorithm to use
is controlled by the |ns3| global value SimulatorImplementationType.

The best algorithm to use depends on the communication and event scheduling
pattern of the application. In general, null message synchronization
algorithms scale better, because local communication scales better than the
global all-to-all gather required by DistributedSimulatorImpl. There are two
known cases where global synchronization performs better. The first is when
most LPs have point-to-point links with most other LPs; in other words, the
LPs are nearly fully connected. In this case the null message algorithm will
generate more message-passing traffic than the all-to-all gather. A second
case where the global all-to-all gather is more efficient is when there are
long periods of simulation time in which no events occur. The all-to-all
gather algorithm can quickly determine the next event time globally, whereas
the nearest-neighbor behavior of the null message algorithm requires more
communication to propagate that knowledge, since each LP is only aware of its
neighbors' next event times.
Remote point-to-point links
+++++++++++++++++++++++++++

As described in the introduction, dividing a simulation for distributed
purposes in |ns3| currently can only occur across point-to-point links;
therefore, the idea of remote point-to-point links is very important for
distributed simulation in |ns3|. When a point-to-point link is installed,
connecting two nodes, the point-to-point helper checks the system id, or rank,
of both nodes. The rank should be assigned during node creation for
distributed simulation and is intended to signify the LP to which a node
belongs. If the two nodes are on the same rank, a regular point-to-point link
is created. If, however, the two nodes are on different ranks, then these
nodes are intended for different LPs, and a remote point-to-point link is
used. If a packet is to be sent across a remote point-to-point link, MPI is
used to send the message to the remote LP.

Distributing the topology
+++++++++++++++++++++++++

Currently, the full topology is created on each rank, regardless of the
individual node system ids. Only the applications are specific to a rank. For
example, consider node 1 on LP 1 and node 2 on LP 2, with a traffic generator
on node 1. Both node 1 and node 2 will be created on both LP 1 and LP 2;
however, the traffic generator will only be installed on LP 1. While this is
not optimal for memory efficiency, it does simplify routing, since all current
routing implementations in |ns3| will work with distributed simulation.

Running Distributed Simulations
*******************************

Prerequisites
+++++++++++++
.. highlight:: bash

Ensure that MPI is installed, as well as mpic++. In Ubuntu repositories, these
are openmpi-bin, openmpi-common, openmpi-doc, and libopenmpi-dev. In Fedora,
these are openmpi and openmpi-devel.

Note:

There is a conflict on some Fedora systems between libotf and openmpi.
A possible "quick-fix" is to yum remove libotf before installing openmpi. This
will remove the conflict, but it will also remove emacs. Alternatively, these
steps can be followed to resolve the conflict:

1) Rename the tiny otfdump which emacs says it needs::

     $ mv /usr/bin/otfdump /usr/bin/otfdump.emacs-version

2) Manually resolve openmpi dependencies::

     $ sudo yum install libgfortran libtorque numactl

3) Download rpm packages:

   .. sourcecode:: text

      openmpi-1.3.1-1.fc11.i586.rpm
      openmpi-devel-1.3.1-1.fc11.i586.rpm
      openmpi-libs-1.3.1-1.fc11.i586.rpm
      openmpi-vt-1.3.1-1.fc11.i586.rpm

   from http://mirrors.kernel.org/fedora/releases/11/Everything/i386/os/Packages/

4) Force the packages in::

     $ sudo rpm -ivh --force \
       openmpi-1.3.1-1.fc11.i586.rpm \
       openmpi-libs-1.3.1-1.fc11.i586.rpm \
       openmpi-devel-1.3.1-1.fc11.i586.rpm \
       openmpi-vt-1.3.1-1.fc11.i586.rpm

Also, it may be necessary to add the openmpi bin directory to PATH in order to
execute mpic++ and mpirun from the command line. Alternatively, the full path
to these executables can be used. Finally, if openmpi complains about the
inability to open shared libraries, such as libmpi_cxx.so.0, it may be
necessary to add the openmpi lib directory to LD_LIBRARY_PATH.

Here is an example of setting up PATH and LD_LIBRARY_PATH using a bash shell:

* For a 32-bit Linux distribution::

    $ export PATH=$PATH:/usr/lib/openmpi/bin
    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib

* For a 64-bit Linux distribution::

    $ export PATH=$PATH:/usr/lib64/openmpi/bin
    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib

These lines can be added into ~/.bash_profile or ~/.bashrc to avoid having to
retype them when a new shell is opened.
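After updating PATH and LD_LIBRARY_PATH, a quick check can confirm that the
MPI tools are actually visible from the shell. The snippet below is a sketch,
not part of the original instructions; it only inspects PATH and does not run
any MPI program::

    # Report whether each MPI tool is reachable from the current shell.
    for tool in mpic++ mpirun; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: not found (check PATH)"
        fi
    done

If either tool is reported missing after the exports above, double-check that
the openmpi bin directory actually exists at the path added to PATH.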
Note 2: There is a separate issue on recent Fedora distributions, which is
that the libraries are built with AVX instructions. On older machines or some
virtual machines, this results in an illegal instruction being thrown. This is
not an |ns3| issue; a simple MPI test case will also fail. The AVX
instructions are being called during initialization.

The symptom of this is that attempts to run an ns-3 MPI program will fail with
the error: `terminated with signal SIGILL`. To check if this is the problem,
run::

  $ grep avx /proc/cpuinfo

and it will not return anything if AVX is not present.

If AVX is not supported, it is recommended to switch to a different MPI
implementation such as MPICH::

  $ dnf remove openmpi openmpi-devel
  $ dnf install mpich mpich-devel environment-modules
  $ module load mpi/mpich-x86_64

Building and Running Examples
+++++++++++++++++++++++++++++

If you already built |ns3| without MPI enabled, you must re-build::

  $ ./waf distclean

Configure |ns3| with the --enable-mpi option::

  $ ./waf -d debug configure --enable-examples --enable-tests --enable-mpi

Ensure that MPI is enabled by checking the optional features shown in the
output of configure.

Next, build |ns3|::

  $ ./waf

After building |ns3| with MPI enabled, the example programs are now ready to
run with `mpiexec`. It is advised to avoid running Waf directly with
`mpiexec`; two options that should be more robust are to either use the
`--command-template` way of running the mpiexec program, or to use
`./waf shell` and run the executables directly on the command line.
Here are a few examples (from the root |ns3| directory)::

  $ ./waf --command-template="mpiexec -np 2 %s" --run simple-distributed
  $ ./waf --command-template="mpiexec -np 2 -machinefile mpihosts %s --nix=0" --run nms-p2p-nix-distributed

An example using the null message synchronization algorithm::

  $ ./waf --command-template="mpiexec -np 2 %s --nullmsg" --run simple-distributed

The -np switch sets the number of MPI processes (one per LP) to launch. The
-machinefile switch specifies which machines to use. In order to use a
machinefile, the target file must exist (in this case mpihosts). This can
simply contain something like:

.. sourcecode:: text

  localhost
  localhost
  localhost
  ...

Or if you have a cluster of machines, you can name them.

The other alternative to `command-template` is to use `./waf shell`. Here are
the equivalent examples to the above (assuming optimized build profile)::

  $ ./waf shell
  $ cd build/src/mpi/examples
  $ mpiexec -np 2 ns3-dev-simple-distributed-optimized
  $ mpiexec -np 2 -machinefile mpihosts ns3-dev-nms-p2p-nix-distributed-optimized --nix=0
  $ mpiexec -np 2 ns3-dev-simple-distributed-optimized --nullmsg
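When naming a cluster of machines in the mpihosts file, Open MPI also accepts
an optional slots count per host, which caps how many ranks are placed on that
machine. The hostnames below are placeholders, not machines from the original
text:

.. sourcecode:: text

  node0.cluster.example slots=2
  node1.cluster.example slots=2

With this file, `mpiexec -np 4 -machinefile mpihosts ...` would place two
ranks on each of the two hosts.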
Here is an example code snippet showing how to add a command line argument to
control the synchronization algorithm choice::

  cmd.AddValue ("nullmsg", "Enable the use of null-message synchronization", nullmsg);
  if (nullmsg)
    {
      GlobalValue::Bind ("SimulatorImplementationType",
                         StringValue ("ns3::NullMessageSimulatorImpl"));
    }
  else
    {
      GlobalValue::Bind ("SimulatorImplementationType",
                         StringValue ("ns3::DistributedSimulatorImpl"));
    }

  // Enable parallel simulator with the command line arguments
  MpiInterface::Enable (&argc, &argv);


Creating custom topologies
++++++++++++++++++++++++++
.. highlight:: cpp

The example programs in src/mpi/examples give a good idea of how to create
different topologies for distributed simulation. The main points are assigning
system ids to individual nodes, creating point-to-point links where the
simulation should be divided, and installing applications only on the LP
associated with the target node.

Assigning system ids to nodes is simple and can be handled two different ways.
First, a NodeContainer can be used to create the nodes and assign system
ids::

  NodeContainer nodes;
  nodes.Create (5, 1); // Creates 5 nodes with system id 1.

Alternatively, nodes can be created individually, assigned system ids, and
added to a NodeContainer. This is useful if a NodeContainer holds nodes with
different system ids::

  NodeContainer nodes;
  Ptr<Node> node1 = CreateObject<Node> (0); // Create node1 with system id 0
  Ptr<Node> node2 = CreateObject<Node> (1); // Create node2 with system id 1
  nodes.Add (node1);
  nodes.Add (node2);

Next, where the simulation is divided is determined by the placement of
point-to-point links.
If a point-to-point link is created between two nodes with different system
ids, a remote point-to-point link is created, as described in
:ref:`current-implementation-details`.

Finally, installing applications only on the LP associated with the target
node is very important. For example, if a traffic generator is to be placed on
node 0, which is on LP 0, only LP 0 should install this application. This is
easily accomplished by first checking the simulator system id, and ensuring
that it matches the system id of the target node before installing the
application.

Tracing During Distributed Simulations
**************************************

Depending on the system id (rank) of the simulator, the information traced
will be different, since traffic originating on one simulator is not seen by
another simulator until it reaches nodes specific to that simulator. The
easiest way to keep track of different traces is to just name the trace files
or pcaps differently, based on the system id of the simulator. For example,
something like this should work well, assuming all of these local variables
were previously defined::

  if (MpiInterface::GetSystemId () == 0)
    {
      pointToPoint.EnablePcapAll ("distributed-rank0");
      phy.EnablePcap ("distributed-rank0", apDevices.Get (0));
      csma.EnablePcap ("distributed-rank0", csmaDevices.Get (0), true);
    }
  else if (MpiInterface::GetSystemId () == 1)
    {
      pointToPoint.EnablePcapAll ("distributed-rank1");
      phy.EnablePcap ("distributed-rank1", apDevices.Get (0));
      csma.EnablePcap ("distributed-rank1", csmaDevices.Get (0), true);
    }
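Since the two branches above differ only in the rank embedded in the file name
prefix, the same effect can be had by building the prefix from the rank
directly. The following is a sketch rather than code from the ns-3 examples;
it assumes the same helper variables (pointToPoint, phy, csma, apDevices,
csmaDevices) as the snippet above::

  // Build a per-rank file prefix instead of branching on the rank.
  std::string prefix = "distributed-rank" + std::to_string (MpiInterface::GetSystemId ());
  pointToPoint.EnablePcapAll (prefix);
  phy.EnablePcap (prefix, apDevices.Get (0));
  csma.EnablePcap (prefix, csmaDevices.Get (0), true);

This scales to any number of ranks without additional branches, though the
explicit if/else form makes it easier to trace different devices on different
ranks.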