.. include:: replace.txt

MPI for Distributed Simulation
------------------------------

Parallel and distributed discrete event simulation allows the execution of a
single simulation program on multiple processors. By splitting the simulation
into logical processes (LPs), each LP can be executed by a different processor.
This simulation methodology enables very large-scale simulations by leveraging
increased processing power and memory availability. In order to ensure proper
execution of a distributed simulation, message passing between LPs is required.
To support distributed simulation in |ns3|, the standard Message Passing
Interface (MPI) is used, along with a new distributed simulator class.
Currently, dividing a simulation for distributed purposes in |ns3| can only
occur across point-to-point links.

.. _current-implementation-details:

Current Implementation Details
******************************

During the course of a distributed simulation, many packets must cross simulator
boundaries. In other words, a packet that originated on one LP is destined for a
different LP, and in order to make this transition, a message containing the
packet contents must be sent to the remote LP. Upon receiving this message, the
remote LP can rebuild the packet and proceed as normal. The process of sending
and receiving messages between LPs is handled by the MPI interface in |ns3|.

Along with simple message passing between LPs, a distributed simulator is used
on each LP to determine which events to process. It is important to process
events in time-stamped order to ensure proper simulation execution. If an LP
receives a message containing an event from the past, clearly this is an issue,
since this event could change other events which have already been executed. To
address this problem, two conservative synchronization algorithms with lookahead
are used in |ns3|. For more information on different synchronization approaches
and on parallel and distributed simulation in general, please refer to "Parallel
and Distributed Simulation Systems" by Richard Fujimoto.

The default parallel synchronization strategy implemented in the
DistributedSimulatorImpl class is based on a globally synchronized
algorithm using an MPI collective operation to synchronize simulation
time across all LPs.  A second synchronization strategy based on local
communication and null messages is implemented in the
NullMessageSimulatorImpl class.  For the null message strategy the
global all-to-all gather is not required; LPs only need to
communicate with LPs that share point-to-point links.  The
algorithm to use is controlled by the |ns3| global value
SimulatorImplementationType.

The best algorithm to use is dependent on the communication and event
scheduling pattern for the application.  In general, null message
synchronization algorithms will scale better because local
communication scales better than the global all-to-all gather
required by DistributedSimulatorImpl.  There are two known cases where
the global synchronization performs better.  The first is when most
LPs have a point-to-point link with most other LPs; in other words, the
LPs are nearly fully connected.  In this case the null message
algorithm will generate more message passing traffic than the
all-to-all gather.  A second case where the global all-to-all gather
is more efficient is when there are long periods of simulation time
when no events are occurring.  The all-to-all gather algorithm is able
to quickly determine the next event time globally.  The nearest
neighbor behavior of the null message algorithm will require more
communication to propagate that knowledge; each LP is only aware of
its neighbors' next event times.


Remote point-to-point links
+++++++++++++++++++++++++++

As described in the introduction, dividing a simulation for distributed purposes
in |ns3| currently can only occur across point-to-point links; therefore, the
idea of remote point-to-point links is very important for distributed simulation
in |ns3|. When a point-to-point link is installed, connecting two nodes, the
point-to-point helper checks the system id, or rank, of both nodes. The rank
should be assigned during node creation for distributed simulation and is
intended to signify the LP to which a node belongs. If the two nodes are on the
same rank, a regular point-to-point link is created. If, however, the two nodes
are on different ranks, then these nodes are intended for different LPs, and a
remote point-to-point link is used. If a packet is to be sent across a remote
point-to-point link, MPI is used to send the message to the remote LP.
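
Below is a minimal sketch of connecting two nodes that live on different ranks;
it assumes MpiInterface::Enable has already been called, and the data rate and
delay values are purely illustrative::

    NodeContainer nodes;
    Ptr<Node> node0 = CreateObject<Node> (0); // node0 belongs to rank 0
    Ptr<Node> node1 = CreateObject<Node> (1); // node1 belongs to rank 1
    nodes.Add (node0);
    nodes.Add (node1);

    PointToPointHelper pointToPoint;
    pointToPoint.SetDeviceAttribute ("DataRate", StringValue ("5Mbps"));
    pointToPoint.SetChannelAttribute ("Delay", StringValue ("2ms"));

    // Because node0 and node1 have different system ids, the helper
    // transparently creates a remote point-to-point link backed by MPI.
    NetDeviceContainer devices = pointToPoint.Install (nodes);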

Distributing the topology
+++++++++++++++++++++++++

Currently, the full topology is created on each rank, regardless of the
individual node system ids. Only the applications are specific to a rank. For
example, consider node 1 on LP 1 and node 2 on LP 2, with a traffic generator on
node 1. Both node 1 and node 2 will be created on both LP 1 and LP 2; however,
the traffic generator will only be installed on LP 1. While this is not optimal
for memory efficiency, it does simplify routing, since all current routing
implementations in |ns3| will work with distributed simulation.

Running Distributed Simulations
*******************************

Prerequisites
+++++++++++++
.. highlight:: bash

Ensure that MPI is installed, as well as mpic++. In Ubuntu repositories,
these are openmpi-bin, openmpi-common, openmpi-doc, libopenmpi-dev. In
Fedora, these are openmpi and openmpi-devel.

Note:

There is a conflict on some Fedora systems between libotf and openmpi. A
possible "quick-fix" is to yum remove libotf before installing openmpi.
This will remove the conflict, but it will also remove emacs. Alternatively,
these steps could be followed to resolve the conflict:

    1) Rename the tiny otfdump which emacs says it needs::

         $ mv /usr/bin/otfdump /usr/bin/otfdump.emacs-version

    2) Manually resolve openmpi dependencies::

         $ sudo yum install libgfortran libtorque numactl

    3) Download rpm packages:

       .. sourcecode:: text

         openmpi-1.3.1-1.fc11.i586.rpm
         openmpi-devel-1.3.1-1.fc11.i586.rpm
         openmpi-libs-1.3.1-1.fc11.i586.rpm
         openmpi-vt-1.3.1-1.fc11.i586.rpm

       from http://mirrors.kernel.org/fedora/releases/11/Everything/i386/os/Packages/

    4) Force the packages in::

         $ sudo rpm -ivh --force \
         openmpi-1.3.1-1.fc11.i586.rpm \
         openmpi-libs-1.3.1-1.fc11.i586.rpm \
         openmpi-devel-1.3.1-1.fc11.i586.rpm \
         openmpi-vt-1.3.1-1.fc11.i586.rpm

Also, it may be necessary to add the openmpi bin directory to PATH in order to
execute mpic++ and mpirun from the command line. Alternatively, the full path to
these executables can be used. Finally, if openmpi complains about the inability
to open shared libraries, such as libmpi_cxx.so.0, it may be necessary to add
the openmpi lib directory to LD_LIBRARY_PATH.

Here is an example of setting up PATH and LD_LIBRARY_PATH using a bash shell:

    * For a 32-bit Linux distribution::

         $ export PATH=$PATH:/usr/lib/openmpi/bin
         $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/openmpi/lib

    * For a 64-bit Linux distribution::

         $ export PATH=$PATH:/usr/lib64/openmpi/bin
         $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/lib

These lines can be added into ~/.bash_profile or ~/.bashrc to avoid having to
retype them when a new shell is opened.

Note 2:  There is a separate issue on recent Fedora distributions, which is
that the libraries are built with AVX instructions.  On older machines or
some virtual machines, this results in an illegal instruction
being thrown.  This is not an |ns3| issue; a simple MPI test case will also
fail.  The AVX instructions are being called during initialization.

The symptom of this is that attempts to run an ns-3 MPI program will fail
with the error: `terminated with signal SIGILL`.  To check if this is the
problem, run::

    $ grep avx /proc/cpuinfo

This command will return nothing if AVX is not present.

If AVX is not supported, it is recommended to switch to a different MPI
implementation such as MPICH::

    $ dnf remove openmpi openmpi-devel
    $ dnf install mpich mpich-devel environment-modules
    $ module load mpi/mpich-x86_64

Building and Running Examples
+++++++++++++++++++++++++++++

If you already built |ns3| without MPI enabled, you must re-build::

    $ ./waf distclean

Configure |ns3| with the --enable-mpi option::

    $ ./waf -d debug configure --enable-examples --enable-tests --enable-mpi

Ensure that MPI is enabled by checking the optional features shown from the
output of configure.

Next, build |ns3|::

    $ ./waf

After building |ns3| with MPI enabled, the example programs are ready to run
with `mpiexec`.  It is advised to avoid running Waf directly with `mpiexec`;
two more robust options are to either use the `--command-template` way of
running the mpiexec program, or to use `./waf shell` and run the executables
directly on the command line.  Here are a few examples (from the root |ns3|
directory)::

    $ ./waf --command-template="mpiexec -np 2 %s" --run simple-distributed
    $ ./waf --command-template="mpiexec -np 2 -machinefile mpihosts %s --nix=0" --run nms-p2p-nix-distributed

An example using the null message synchronization algorithm::

    $ ./waf --command-template="mpiexec -np 2 %s --nullmsg" --run simple-distributed

The np switch sets the number of logical processes (MPI ranks) to use. The
machinefile switch specifies which machines to use. In order to use machinefile,
the target file must exist (in this case mpihosts). This can simply contain
something like:

.. sourcecode:: text

    localhost
    localhost
    localhost
    ...

Or, if you have a cluster of machines, you can name them in the machinefile,
as shown in the example below.
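
For instance, an Open MPI machinefile naming two cluster hosts (the hostnames
here are placeholders), each allowed to run two MPI processes, might look like
this:

.. sourcecode:: text

    node0.cluster.example slots=2
    node1.cluster.example slots=2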

The other alternative to `command-template` is to use `./waf shell`.  Here
are the equivalent examples to the above (assuming optimized build profile)::

    $ ./waf shell
    $ cd build/src/mpi/examples
    $ mpiexec -np 2 ns3-dev-simple-distributed-optimized
    $ mpiexec -np 2 -machinefile mpihosts ns3-dev-nms-p2p-nix-distributed-optimized --nix=0
    $ mpiexec -np 2 ns3-dev-simple-distributed-optimized --nullmsg

Setting the synchronization algorithm to use
++++++++++++++++++++++++++++++++++++++++++++

The global value SimulatorImplementationType is used to set the
synchronization algorithm to use.  This value must be set before the
MpiInterface::Enable method is invoked if the default
DistributedSimulatorImpl is not used.  Here is an example code snippet
showing how to add a command line argument to control the
synchronization algorithm choice::

  bool nullmsg = false;

  CommandLine cmd;
  cmd.AddValue ("nullmsg", "Enable the use of null-message synchronization", nullmsg);
  cmd.Parse (argc, argv);

  if (nullmsg)
    {
      GlobalValue::Bind ("SimulatorImplementationType",
                         StringValue ("ns3::NullMessageSimulatorImpl"));
    }
  else
    {
      GlobalValue::Bind ("SimulatorImplementationType",
                         StringValue ("ns3::DistributedSimulatorImpl"));
    }

  // Enable parallel simulator with the command line arguments
  MpiInterface::Enable (&argc, &argv);


Creating custom topologies
++++++++++++++++++++++++++
.. highlight:: cpp

The example programs in src/mpi/examples give a good idea of how to create
different topologies for distributed simulation. The main points are assigning
system ids to individual nodes, creating point-to-point links where the
simulation should be divided, and installing applications only on the LP
associated with the target node.

Assigning system ids to nodes is simple and can be handled two different ways.
First, a NodeContainer can be used to create the nodes and assign system ids::

    NodeContainer nodes;
    nodes.Create (5, 1); // Creates 5 nodes with system id 1.

Alternatively, nodes can be created individually, assigned system ids, and added
to a NodeContainer. This is useful if a NodeContainer holds nodes with different
system ids::

    NodeContainer nodes;
    Ptr<Node> node1 = CreateObject<Node> (0); // Create node1 with system id 0
    Ptr<Node> node2 = CreateObject<Node> (1); // Create node2 with system id 1
    nodes.Add (node1);
    nodes.Add (node2);

Next, where the simulation is divided is determined by the placement of
point-to-point links. If a point-to-point link is created between two
nodes with different system ids, a remote point-to-point link is created,
as described in :ref:`current-implementation-details`.

Finally, installing applications only on the LP associated with the target node
is very important. For example, if a traffic generator is to be placed on node
0, which is on LP 0, only LP 0 should install this application.  This is easily
accomplished by first checking the simulator system id, and ensuring that it
matches the system id of the target node before installing the application.
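
A minimal sketch of this check is shown below; it assumes a NodeContainer named
nodes already exists, and the OnOff application, remoteAddress, and port used
here are purely illustrative::

    uint32_t systemId = MpiInterface::GetSystemId ();

    // Install the traffic generator only on the rank that owns node 0.
    if (systemId == nodes.Get (0)->GetSystemId ())
      {
        OnOffHelper onoff ("ns3::UdpSocketFactory",
                           InetSocketAddress (remoteAddress, port));
        ApplicationContainer apps = onoff.Install (nodes.Get (0));
        apps.Start (Seconds (1.0));
        apps.Stop (Seconds (10.0));
      }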

Tracing During Distributed Simulations
**************************************

Depending on the system id (rank) of the simulator, the information traced will
be different, since traffic originating on one simulator is not seen by another
simulator until it reaches nodes specific to that simulator. The easiest way to
keep track of different traces is to name the trace files or pcaps differently,
based on the system id of the simulator. For example, something like this should
work well, assuming all of these local variables were previously defined::

    if (MpiInterface::GetSystemId () == 0)
      {
        pointToPoint.EnablePcapAll ("distributed-rank0");
        phy.EnablePcap ("distributed-rank0", apDevices.Get (0));
        csma.EnablePcap ("distributed-rank0", csmaDevices.Get (0), true);
      }
    else if (MpiInterface::GetSystemId () == 1)
      {
        pointToPoint.EnablePcapAll ("distributed-rank1");
        phy.EnablePcap ("distributed-rank1", apDevices.Get (0));
        csma.EnablePcap ("distributed-rank1", csmaDevices.Get (0), true);
      }