1\documentclass[a4paper,12pt]{article}
2\usepackage{fancyvrb}
3\usepackage{amsfonts}
4\usepackage{graphicx}
5\usepackage{tabularx}
6% \usepackage{color}
7\usepackage{xcolor}
8\usepackage{psfrag}
9\usepackage[hyphens]{url}
10% \usepackage{vmargin}
11\usepackage{cite}
12\usepackage{caption2}
13\usepackage{hyperref}
14\usepackage{makeidx}
15\usepackage[fleqn]{amsmath}
16\renewcommand{\captionlabeldelim}{.}
17\def\thesection{\arabic{section}}
18\renewcommand{\thefigure}{\arabic{figure}}
19\renewcommand{\thetable}{\arabic{table}}
20\newcommand{\hl}{\noalign{\vskip3pt}\hline\noalign{\vskip3pt}}
21\newcommand{\hrf}{\hrulefill}
22\newcommand{\tc}[1]{\textcolor{black}{#1}}
23\newcommand{\dc}[1]{\textcolor{green}{#1}}
24% \usepackage[none]{hyphenat}
25\newcommand{\rep}[3][1]
26{
27  \psfrag{#2}[c][c][#1]{#3}
28}
29\newcommand{\x}[1]{\texttt{#1}}
30\newcommand{\sfv}{$\Lambda$V}
31
32\DeclareMathOperator*{\Ex}{\mathbb{E}}
33\newcommand{\Var}{\mathrm{Var}}
34
35% \setpapersize{A4}
36% \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red,linkbordercolor=000}
37\hypersetup{colorlinks=true,linkcolor=blue,urlcolor=orange}
38\renewcommand{\baselinestretch}{1.}
39\makeindex
40\begin{document}
41\sloppy
42\begin{center}
43\thispagestyle{empty}
44\vfill\vfill
45% \rule{\linewidth}{0.02cm}\\
46{\Huge \textbf{ P~h~y~M~L~~--~~M~a~n~u~a~l}}
47\vspace{-0.4cm}\\
48% \rule{\linewidth}{0.02cm}\\
49\vfill
50{\huge Version 3.0 \\
51\today
52\vfill
53\normalsize
54\href{https://github.com/stephaneguindon/phyml}{https://github.com/stephaneguindon/phyml}\\
55\vspace{0.4cm}
56\href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}}
57\end{center}
58\clearpage
59\tableofcontents
60\clearpage
61
62{\par
63\small
64\noindent
65\copyright Copyright 1999 - 2008 by PhyML Development Team.\\
66\noindent The software PhyML is provided ``as is''  without warranty of any kind.  In no event shall
67the authors  or his  employer be  held responsible  for any damage  resulting from  the use  of this
68software, including but not limited to the frustration that you may experience in using the package.
69All parts of the source and documentation except where indicated are distributed under
70the GNU public licence. See http://www.opensource.org for details.
71
72}
73
74{
75\noindent
76\setlength{\baselineskip}{0.5\baselineskip}
77\section{Availability}
78\begin{itemize}
79\item Binaries: \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}
80\item Sources: \href{http://stephaneguindon.github.io/phyml-downloads/}{http://stephaneguindon.github.io/phyml-downloads/}
81\item Discussion forum: \href{http://groups.google.com/group/phyml-forum}{http://groups.google.com/group/phyml-forum}
82\end{itemize}
83}
84
85{
86\noindent
87\setlength{\baselineskip}{0.7\baselineskip}
88\section{Authors}
89\begin{itemize}
90\item { St\'ephane Guindon} and { Olivier Gascuel} conceived the original PhyML algorithm.
91\item { St\'ephane Guindon} conceived the PhyTime method.
92\item { St\'ephane Guindon, David Welch and Louis Ranjard} conceived the PhyloGeo method.
93\item { St\'ephane Guindon, Wim Hordjik} and { Olivier Gascuel} conceived the SPR-based tree search algorithm.
94\item { Maria Anisimova} and { Olivier Gascuel} conceived the aLRT method for branch support.
95\item { St\'ephane Guindon, Franck Lethiec}, Jean-Francois Dufayard and Vincent Lefort implemented PhyML.
96\item St\'ephane Guindon implemented PhyloGeo, PhyREX and PhyTime.
97\item { Jean-Francois Dufayard} created the benchmark and implemented the tools that are used to check
98  PhyML accuracy and performances.
99\item { Vincent Lefort, St\'ephane Guindon, Patrice Duroux} and { Olivier Gascuel} conceived and
100  implemented PhyML web server.
101\item { Imran Fanaswala} interfaced PhyML with BEAGLE.
102\item St\'ephane Guindon wrote this document.
103\end{itemize}
104}
105\clearpage
106
107% \section{ML in phylogenetics: the basics.}
108\section{Overview}
109
110PhyML  \cite{guindon03} is  a  software  package which  primary  task that  is  to estimate  maximum
111likelihood phylogenies  from alignments of nucleotide or  amino-acid sequences.  It  provides a wide
112range  of  options that  were  designed  to facilitate  standard  phylogenetic  analyses.  The  main
113strength of  PhyML lies in the  large number of substitution  models coupled to  various options to
114search the  space of phylogenetic  tree topologies,  going from very  fast and efficient  methods to
115slower but  generally more accurate approaches.  It  also implements two methods  to evaluate branch
116supports  in  a  sound statistical  framework  (the  non-parametric  bootstrap and  the  approximate
117likelihood ratio test).
118
119PhyML was designed to  process moderate to large data sets.  In theory,  alignments with up to 4,000
120sequences 2,000,000 character-long can analyzed.  In practice however, the amount of memory required
121to process  a data set is proportional  of the product of  the number of sequences  by their length.
122Hence, a large number  of sequences can only be processed provided that  they are short. Also, PhyML
123can  handle long  sequences  provided  that they  are  not numerous.   With  most standard  personal
124computers, the ``comfort  zone'' for PhyML generally lies around 100-500  sequences less than 10,000
125character  long.   For  larger  data  sets,  we  recommend  using  other  softwares  such  as  RAxML
126\cite{raxml}\index{RAxML}      or       GARLI      \cite{garli}\index{GARLI}      or      Treefinder
127(\href{http://www.treefinder.de}{http://www.treefinder.de}).
128
129
130\section{Bug report}\index{bug}
131
132While PhyML is, of  course, bug-free (!) (please read the disclaimer  carefuly...), if you ever come
133across an  issue, please feel free to  report it using the  discuss group web site  at the following
134address:  \url{https://groups.google.com/forum/?fromgroups#!forum/phyml-forum}.  Alternatively,  you
135can send an email  to \url{s.guindon@auckland.ac.nz}. Do not forget to mention  the version of PhyML
136and program options you are using.
137
138
139\section{Installing PhyML}
140
141\subsection{Sources and compilation}\index{compilation}
142
143The sources of the  program are available free of charge from \url{http://stephaneguindon.github.io/phyml-downloads/}.  The  compilation on UNIX-like
144systems  is  fairly standard.  It  is  described  in the  `\x{INSTALL}'  file  that comes  with  the
145sources. In a command-line window, go to the directory that contains the sources and type:
146
147{\setlength{\baselineskip}{0.5\baselineskip}
148\begin{verbatim}
149./configure;
150make clean;
151make V=0;
152\end{verbatim}
153}
154
155By default, PhyML will be compiled with optimization flags turned on. It is possible to generate a
156version of PhyML that can run through a debugging tool (such as \x{ddd}\label{ddd}) or a profiling
157tool (such as \x{gprof}\label{gprof}) using the following instructions:
158
159{\setlength{\baselineskip}{0.5\baselineskip}
160\begin{verbatim}
161./configure --enable-debug;
162make clean;
163make V=0;
164\end{verbatim}
165}
166
167% {\em Note} -- when PhyML  is going to be used mostly of exclusively in  batch mode, it is preferable
168% to turn on the batch mode option in the  Makefile. In order to do so, the file \x{Makefile.am} needs
169% to be modified: add \x{-DBATCH} to the line with \x{DEFS=-DUNIX -D\$(PROG) -DDEBUG}.
170
171
172
173\subsection{Installing PhyML on UNIX-like systems (including Mac OS)}
174
175Copy PhyML binary file in the directory you like.  For the operating system to be able to locate the
176program, this directory must be specified in the global variable \x{PATH}. In order to achieve this,
177you  will   have  to  add  \x{export   PATH="/your\_path/:\${PATH}"}  to  the   \x{.bashrc}  or  the
178\x{.bash\_profile} located in your home directory  (\x{your\_path} is the path to the directory that
179contains PhyML binary).
180
181
182\subsection{Installing PhyML on Microsoft Windows}\label{sec:install-windows}
183
184Copy the files \x{phyml.exe} and \x{phyml.bat} in  the same directory. To launch PhyML, click on the
185icon  corresponding to \x{phyml.bat}.   Clicking on  the icon  for \x{phyml.exe}  works too  but the
186dimensions of the window will not fit PhyML PHYLIP-like interface.
187
188\subsection{Installing the parallel version of PhyML}\label{sec:MPI}\index{MPI}\index{bootstrap!parallel}
189
190Bootstrap analysis can run on multiple  processors. Each processor analyses one bootstraped dataset.
191Therefore, the computing time needed to perform $R$ bootstrap replicates is divided by the number of
192processors available.
193
194This  feature of  PhyML relies  on the  MPI (Message  Passing Interface)  library. To  use  it, your
195computer must  have MPI  installed on  it. In case  MPI is  not installed, you  can dowload  it from
196\href{http://www.mcs.anl.gov/research/projects/mpich2/}{http://www.mcs.anl.gov/research/projects/mpich2/}.
197Once MPI is  installed, it is necessary to launch  the MPI daemon. This can be  done by entering the
198following instruction: \x{mpd \&}.  Note however that in most cases, the  MPI daemon will already be
199running on your server so  that you most likely do not need to worry  about this. You can then just
200go in the \x{phyml/} directory (the directory that contains the \x{src/}, \x{examples/} and \x{doc/}
201folders) and enter the commands below:
202
203{\setlength{\baselineskip}{0.5\baselineskip}
204\begin{verbatim}
205./configure --enable-mpi;
206make clean;
207make;
208\end{verbatim}
209}
210
211A binary file named \x{phyml-mpi} has now been created in the \x{src/} directory and is ready to use
212with MPI. A typical MPI command-line which uses 4 CPUs is given below:
213
214{\setlength{\baselineskip}{0.5\baselineskip}
215\begin{verbatim}
216mpirun -n 4 ./phyml-mpi -i myseq -b 100
217\end{verbatim}
218}
219
220\noindent Please read section \ref{sec:parallel_bootstrap} of this document for more information.
221
222\subsection{Installing PhyML-BEAGLE}\label{sec:install-phyml-beagle}
223
224{\em Note: I haven't found the time nor the resources to make sure the code of PhyML stays
225  compatible with the BEAGLE library. Please do not hesitate at all to contact me if you'd
226  like to contribute here.}
227\\
228\\
229PhyML  can use  the  BEAGLE\cite{ayres12}  library for  the  likelihood computation.  BEAGLE
230provides provides  significant speed-up: the  single core  version of PhyML-BEAGLE  can be up  to 10
231times  faster  than  PhyML  on  a  single   core  and  up  to  150  times  on  Graphical  Processing
232Units. PhyML-BEAGLE  will eventually have of  the features of PhyML,  even though at  the moment the
233boostrap and the invariant site options are not available. Also, please note that in some cases, the
234final  log-likelihood  reported  by  PhyML  and  PhyML-BEALGE may  not  exactly  match,  though  the
235differences observed are very minor (in the 10$^{-4}$ to 10$^{-4}$ range).
236
237In  order to  install  PhyML-BEAGLE, you  first  need to  download and  install  the BEAGLE  library
238available from \url{https://code.google.com/p/beagle-lib/}.  Then run the following commands:
239
240{\setlength{\baselineskip}{0.5\baselineskip}
241\begin{verbatim}
242./configure --enable-beagle;
243make clean;
244make;
245\end{verbatim} } A binary file named \x{phyml-beagle} will be created in the \x{src/} directory. The
246interface  to  \x{phyml-beagle} (i.e.,  commandline  option  of  PHYLIP-like interface)  is  exactly
247identical to that of PhyML.
248
249\section{Program usage.}\label{sec:phyml_new}
250
251PhyML has  three distinct user-interfaces.   The first corresponds  to a PHYLIP-like  text interface
252that makes the choice of the options self-explanatory. The command-line interface is well-suited for
253people that are familiar with PhyML options or for running PhyML in batch mode. The XML interface is
254more sophisticated. It allows the user to analyse partitionned data using flexible mixture models of evolution.
255
256\subsection{PHYLIP-like interface}
257
258The default is to use the PHYLIP-like  text interface by simply typing `\x{phyml}' in a command-line
259window or by clicking on the PhyML icon (see Section \ref{sec:install-windows}).  After entering the
260name of the input sequence file, a list of  sub-menus helps the users set up the analysis.  There
261are currently four distinct sub-menus:
262
263% \begin{figure}
264% \resizebox{15cm}{9cm}{\includegraphics{./fig/interface.eps}}
265% \caption{PHYLIP-like interface to PhyML.}
266% \label{fig:interface}
267% \end{figure}
268
269\begin{enumerate}
270
271\item  {\em  Input  Data}:  specify  whether  the  input  file  contains  amino-acid  or  nucleotide
272sequences. What the  sequence format is (see Section \ref{sec:input_output}) and  how many data sets
273should be analysed.
274
275\item  {\em  Substitution  Model}: selection  of  the  Markov  model  of substitution.
276
277\item  {\em  Tree Searching}:  selection  of  the tree  topology  searching  algorithm.
278
279\item {\em  Branch Support}: selection  of the method  that is used  to measure branch  support.
280
281\end{enumerate}
282\noindent `\x{+}' and `\x{-}' keys are used to  move forward and backward in the sub-menu list. Once
283the model parameters  have been defined, typing `\x{Y}' (or `\x{y}')  launches the calculations. The
284meaning of  some options may not  be obvious to users  that are not familiar  with phylogenetics. In
285such situation, we strongly recommend to use the default options. As long as the format of the input
286sequence file is  correctly specified (sub-menu {\em Input data}), the  safest option for non-expert
287users  is to  use the  default settings.  The different  options provided  within each  sub-menu are
288described in what follows.
289
290
291\subsubsection{Input  Data sub-menu}
292
293\begin{center}\framebox{\x{[D] ............................... Data type (DNA/AA)}} \end{center}
294Type of data in the input file. It can be either DNA or amino-acid sequences in PHYLIP format (see
295Section \ref{sec:input_output}). Type \x{D} to change settings.
296
297\vspace{0.7cm}
298\begin{center} \framebox{\x{[I] ...... Input sequences interleaved (or sequential)}} \end{center}
299PHYLIP format comes in two flavours: interleaved or sequential (see Section
300\ref{sec:input_output}). Type \x{I} to selected among the two formats.
301
302\vspace{0.7cm}
303\begin{center} \framebox{\x{[M] ....................... Analyze multiple data sets}} \end{center}
304If the input sequence file contains more than one data sets, PhyML can analyse each of them
305in a single run of the program. Type \x{M} to change settings.
306
307\vspace{0.7cm}
308\begin{center}  \framebox{\x{[R] ............................................ Run  ID}} \end{center}
309This option allows  you to append a string  that identifies the current PhyML run.  Say for instance
310that you want to analyse  the same data set with two models. You can  then `tag' the first PhyML run
311with the name of the first model while the second run is tagged with the name of the second model.\index{run ID}
312
313
314
315\subsubsection{Substitution model sub-menu}\label{sec:submenus}
316
317\begin{center} \framebox{\x{[M] ................. Model of nucleotide substitution}} \end{center}
318\begin{center}  \framebox{\x{[M] ................ Model  of amino-acids  substitution}} \end{center}
319PhyML implements a wide range of  substitution models: JC69 \cite{jukes69}, K80 \cite{kimura80}, F81
320\cite{felsenstein81a},  F84  \cite{phylip2},   HKY85  \cite{hasegawa85},  TN93  \cite{tamura93}  GTR
321\cite{lanave84,tavare86} and  custom for nucleotides; LG \cite{le08},  WAG \cite{whelan01b}, Dayhoff
322\cite{dayhoff78},  JTT  \cite{jones92},  Blosum62  \cite{henikoff92}, mtREV  \cite{adachi96},  rtREV
323\cite{dimmic02},  cpREV  \cite{adachi00},  DCMut   \cite{kosiol04},  VT  \cite{muller00}  and  mtMAM
324\cite{cao98}  and custom  for amino  acids.  Cycle  through the  list of  nucleotide  or amino-acids
325substitution models by typing \x{M}. Both  nucleotide and amino-acid lists include a `custom' model.
326The custom option provides the most flexible  way to specify the nucleotide substitution model.  The
327model is defined  by a string made of  six digits.  The default string is  `\x{000000}', which means
328that the six relative rates of nucleotide  changes: $A \leftrightarrow C$, $A \leftrightarrow G$, $A
329\leftrightarrow T$,  $C \leftrightarrow  G$, $C  \leftrightarrow T$ and  $G \leftrightarrow  T$, are
330equal.   The   string  `\x{010010}'  indicates  that   the  rates  $A  \leftrightarrow   G$  and  $C
331\leftrightarrow  T$ are equal  and distinct  from $A  \leftrightarrow C  = A  \leftrightarrow T  = C
332\leftrightarrow G = G  \leftrightarrow T$.  This model corresponds to HKY85  (default) or K80 if the
333nucleotide frequencies  are all set to 0.25.   `\x{010020}' and `\x{012345}' correspond  to TN93 and
334GTR models respectively.  The digit string  therefore defines groups of relative substitution rates.
335The  initial rate within  each group  is set  to 1.0,  which corresponds  to F81  (JC69 if  the base
336frequencies are equal).   Users also have the  opportunity to define their own  initial rate values.
337These rates are  then optimised afterwards (option  `\x{O}') or fixed to their  initial values.  The
338custom option can be used to implement all substitution models that are special cases of
339GTR. Table \ref{tab:modelcode} on page \pageref{tab:modelcode} gives the correspondence between the `standard' name of the model
340(see \url{http://mbe.oxfordjournals.org/content/18/6/897/F2.large.jpg}) and the custom model code.
341The custom model also exists for protein sequences. It is useful when one wants to use an amino-acid
342substitution model that is  not hard-coded in PhyML. The symmetric part of  the rate matrix, as well
343as the equilibrium amino-acid  frequencies, are given in a file which name is  given as input of the
344program. The format of this file is described in the section \ref{sec:customaa}.
345
346\vspace{0.7cm}
347\begin{center} \framebox{\x{[F] ................. Optimise equilibrium frequencies}} \end{center}
348\begin{center} \framebox{\x{[E] ......... Equilibrium frequencies (empirical/user)}} \end{center}
349\begin{center} \framebox{\x{[F]  . Amino  acid frequencies (empirical/model  defined)}} \end{center}
350For  nucleotide  sequences,  optimising  equilibrium  frequencies  means that  the  values  of  these
351parameters  are estimated in  the maximum  likelihood framework.   When the  custom model  option is
352selected, it is  also possible to give the program a  user-defined nucleotide frequency distribution
353at equilibrium  (option \x{E}).   For protein sequences,  the stationary amino-acid  frequencies are
354either  those defined  by  the substitution  model  or those  estimated by  counting  the number  of
355different amino-acids observed  in the data. Hence, the  meaning of the \x{F} option  depends on the
356type of the data to be processed.
357
358\vspace{0.7cm}
359\begin{center}          \framebox{\x{[T]          ....................          Ts/tv          ratio
360(fixed/estimated)}}   \end{center}\index{$\kappa$}\index{ts/tv ratio}    Fix   or   estimate   the
361transition/transversion ratio  in the maximum likelihood  framework.  This option  is only available
362when DNA sequences are to be analysed under  K80, HKY85 or TN93 models. The definition given to this
363parameter by PhyML  is the same as  PAML's\index{PAML} one.  Therefore, the value  of this parameter
364does {\it not} correspond  to the ratio between the expected number  of transitions and the expected
365number  of  transversions  during  a unit  of  time.   This  last  definition  is the  one  used  in
366PHYLIP\index{PHYLIP}.   PAML's  manual gives  more  detail about  the  distinction  between the  two
367definitions (\url{http://abacus.gene.ucl.ac.uk/software/paml.html}).
368
369\vspace{0.7cm}
370\begin{center}       \framebox{\x{[V]       .        Proportion      of       invariable       sites
371(fixed/estimated)}}   \end{center}\index{invariable sites}\index{proportion of invariants}  The
372proportion of  invariable sites, i.e., the  expected frequency of sites  that do not  evolve, can be
373fixed or estimated. The default is to fix this proportion to 0.0. By doing so, we consider that each
374site in  the sequence may accumulate  substitutions at some point  during its evolution,  even if no
375differences across sequences are actually observed at  that site.  Users can also fix this parameter
376to  any value  in the  $[0.0,1.0]$ range  or estimate  it from  the data  in  the maximum-likelihood
377framework.
378
379\vspace{0.7cm}
380\index{gamma distribution (discrete)!mean vs. median}
381\index{gamma distribution (discrete)!number of categories}
382\index{gamma distribution (discrete)!shape parameter}
383\begin{center} \framebox{\x{[R]  ....... One category  of substitution rate  (yes/no)}} \end{center}
384\begin{center} \framebox{\x{[C] ........... Number of substitution rate categories}} \end{center}
385\begin{center} \framebox{\x{[A] ... Gamma distribution parameter (fixed/estimated)}} \end{center}
386\begin{center} \framebox{\x{[G]  .........`Middle' of  each rate class  (mean/median)}} \end{center}
387Rates of evolution often vary from site to site. This heterogeneity can be modelled using a discrete
388gamma distribution.  Type \x{R} to switch  this option on or  off. The different  categories of this
389discrete  distribution  correspond  to  different  (relative)  rates of  evolution.  The  number  of
390categories of this  distribution is set to 4 by  default.  It is probably not wise  to go below this
391number.   Larger values  are  generally preferred.  However,  the computational  burden involved  is
392proportional to the  number of categories (i.e.,  an analysis with 8 categories  will generally take
393twice the  time of  the same analysis  with only 4  categories). Note  that the likelihood  will not
394necessarily increase as  the number of categories increases. Hence, the  number of categories should
395be kept below a  ``reasonable'' number, say 20.  The default number of  categories can be changed by
396typing \x{C}.
397
398The middle  of each  discretized substitution rate  class can  be determined using  the mean  or the
399median. PAML,  MrBayes and RAxML  use the  mean.  However, the  median is generally  associated with
400greater likelihoods than the  mean.  This conclusion is based on our  analysis of several real-world
401data sets extracted from TreeBase.  Despite this, the  default option in PhyML is to use the mean in
402order to make PhyML  likelihoods comparable to those of other phylogenetic  software.  One must bare
403in  mind that  {\color{red}{likelihoods  calculated with  the  mean approximation  are not  directly
404comparable to the likelihoods calculated using the median approximation}}.
405
406The shape  of the  gamma distribution  determines the range  of rate  variation across  sites. Small
407values,  typically  in  the $[0.1,1.0]$  range,  correspond  to  large variability.   Larger  values
408correspond to moderate to  low heterogeneity. The gamma shape parameter can be  fixed by the user or
409estimated via maximum-likelihood. Type \x{A} to select one or the other option.
410
411
412\subsubsection{Tree searching sub-menu}
413
414\begin{center} \framebox{\x{[O] ........................... Optimise tree topology}} \end{center} By
415default the  tree topology is  optimised in order  to maximise the  likelihood. However, it  is also
416possible to avoid  any topological alteration. This option  is useful when one wants  to compute the
417likelihood of a tree given as input (see below). Type \x{O} to select among these two options.
418
419\vspace{0.7cm}
420\begin{center}  \framebox{\x{[S] .................. Tree  topology search  operations}} \end{center}
421PhyML proposes three different  methods to estimate tree topologies. The default  approach is to use
422simultaneous  NNI. This option  corresponds to  the original  PhyML algorithm  \cite{guindon03}. The
423second approach  relies on  subtree pruning and  regrafting (SPR).   It generally finds  better tree
424topologies  compared to NNI  but is  also significantly  slower.  The  third approach,  termed BEST,
425simply estimates the phylogeny using both methods  and returns the best solution among the two. Type
426\x{S} to choose among these three choices.
427
428\vspace{0.7cm}
429\begin{center}  \framebox{\x{[R] ......................... Use  random starting  tree}} \end{center}
430\begin{center} \framebox{\x{[N]  .................. Number  of random starting  trees}} \end{center}
431When the SPR or the BEST options are selected,  is is possible to use random trees rather than BioNJ
432or a user-defined tree,  as starting tree. If this option is turned on  (type \x{R} to change), five
433trees, corresponding to five random starts, will be estimated. The output tree file will contain the
434best  tree  found  among  those five.  The  number  of  random  starts  can be  modified  by  typing
435\x{N}.  Setting the  number of  random  starting trees  to $N$  means  that the  analysis will  take
436(slightly more than) $N$ times the time required for a standard analysis where only one (BioNJ)
437starting tree is used.  However, the analysis of real data sets shows  that the best trees estimated
438using the  random start  option almost  systematically have higher  likelihoods than  those inferred
439using a single starting tree.
440
441\vspace{0.7cm}
442\begin{center}      \framebox{\x{[U]     ........       Starting      tree     (BioNJ/parsimony/user
443tree)}} \end{center}\index{BioNJ}  When the  tree topology optimisation  option is turned  on, PhyML
444proceeds  by  refining an  input  tree.   By  default, this  input  tree  is estimated  using  BioNJ
445\cite{gascuelNJ}. The alternative option is to use  a parsimony tree. We found this option specially
446useful when analysing  large data sets with NNI  moves as it generally leads  to greater likelihoods
447than  those obtained  when starting  from a  BioNJ trees.  The user  can also  to input  her/his own
448tree. This  tree should  be in Newick  format (see  Section \ref{sec:input_output}). This  option is
449useful when  one wants  to evaluate  the likelihood  of a given  tree with  a fixed  topology, using
450PhyML. Type \x{U} to choose among these two options.
451
452\subsubsection{Branch support sub-menu}
453
454\begin{center}  \framebox{\x{[B] ................ Non  parametric bootstrap  analysis}} \end{center}
455The  support  of the  data  for  each  internal branch  of  the  phylogeny  can be  estimated  using
456non-parametric bootstrap.   By default, this option is  switched off.  Typing \x{B}  switches on the
457bootstrap analysis. The user is then prompted for a number of bootstrap replicates. The largest this
458number the more precise the bootstrap  support estimates are.  However, for each bootstrap replicate
459a phylogeny is estimated. Hence, the time  needed to analyse $N$ bootstrap replicates corresponds to
460$N$-times the time spent  on the analysis of the original data  set. $N=100$ is generally considered
461as a reasonable number of replicates.
462
463\begin{center}  \framebox{\x{[A] ................ Approximate  likelihood ratio  test}} \end{center}
464When the  bootstrap option is switched off  (see above), approximate likelihood  branch supports are
465estimated. This approach is considerably faster than the bootstrap one. However, both methods intend
466to  estimate different quantities  and conducting  a fair  comparison between  both criteria  is not
467straightforward.  The  estimation  of  approximate  likelihood  branch  support  comes  in  multiple
468flavours. The default is  set to aBayes, corresponding to the approximate  Bayes method described in
469\cite{anisimova11}.   The    approximate   likelihood   ratio    test   (aLRT)   \cite{anisimova06},
470Shimodaira–Hasegawa aLRT (SH-aLRT) statistics are the other available options.
471
472
473\subsection{Command-line interface}
474
475An alternative to the PHYLIP-like interface is the command-line interface. Users that do not need to
476modify  the default  parameters  can launch  the  program with  the  `\x{phyml -i  seq\_file\_name}'
477command.  The list of all command line arguments and  how to use them is given in the `Help' section
478which is displayed when entering the `\x{phyml --help}' command.  The available command-line options
479are described in what follows.
480
481\begin{itemize}
482
483\item \x{-i} (or \x{--input}) \x{seq\_file\_name}\index{command-line options!\x{--input}} \\
484\x{seq\_file\_name} is the name of the nucleotide or amino-acid sequence file in PHYLIP format.
485
486\item \x{-d} (or \x{--datatype}) \x{data\_type}\index{command-line options!\x{--data\_type}}\\
487\x{data\_type} is \x{nt} for nucleotide (default) and \x{aa} for amino-acid sequences.
488
489\item \x{-q} (or \x{--sequential})\index{sequence format!interleaved}\index{sequence format!sequential}\index{command-line options!\x{--sequential}} \\
490Changes interleaved format (default) to sequential format.
491
492\item \x{-n} (or \x{--multiple}) \x{nb\_data\_sets}\index{multiple data sets}\index{command-line options!\x{--multiple}}\\
493\x{nb\_data\_sets} is an integer giving the number of data sets to analyse.
494
495\item \x{-p} (or \x{--pars})\index{command-line options!\x{--pars}}\\
496Use a minimum parsimony starting tree. This option is taken into account when the `-u' option
497is absent and when tree topology modifications are to be done.
498
499\item \x{-b} (or \x{--bootstrap}) \x{int}\index{bootstrap}\index{command-line options!\x{--bootstrap}}
500\begin{itemize}
501\item \x{int} $>$  0: \x{int} is the number of bootstrap replicates.
502\item \x{int} =  0: neither approximate likelihood ratio test nor bootstrap values are computed.
503\item \x{int} = -1: approximate likelihood ratio test returning aLRT statistics.
504\item \x{int} = -2: approximate likelihood ratio test returning Chi2-based parametric branch supports.
505% \item \x{int} = -3: minimum of Chi2-based parametric and SH-like branch supports.
506\item \x{int} = -4: SH-like branch supports alone.
507\item \x{int} = -5: (default) approximate Bayes branch supports.
508\end{itemize}
509
510\item \x{-m} (or \x{--model}) \x{model\_name}\index{substitution models!DNA}\index{substitution models!amino acids}\index{command-line options!\x{--model}} \\
511\x{model\_name} : substitution model name.
512\begin{itemize}
513\item {\it Nucleotide-based models}: \x{HKY85} (default) \x{| JC69 |  K80 | F81 | F84 | TN93 | GTR |
514    custom | r(ac),r(ag),r(at),r(cg),r(ct),r(gt)} \\ JC90, K80, F81. F84, TN93 and GTR denote the
515  ``standard'' substitution models. The \x{custom} option can be  used to define a  new substitution model. A  string of six
516digits  identifies  the model.  For  instance,  000000 corresponds  to  F81  (or  JC69 provided  the
517distribution of nucleotide  frequencies is uniform).  012345 corresponds to GTR.  This option can be
518used for encoding any model that is a nested within GTR. See Section \ref{sec:submenus} and Table \ref{tab:modelcode}. {\em NOTE:}
519the  substitution  parameters  of  the  custom  model  will be  optimised  so  as  to  maximise  the
520likelihood.  Also, it is  possible  to  specify and  fix  (i.e., avoid  optimisation)  the  values of  the
521substitution rates in the GTR model by passing the corresponding five rates to the \x{-m}
522option. Commas are used to separate the five relative rates and no blank space should be introduced here.
523
524\item {\it Amino-acid based models}: \x{LG} (default) \x{| WAG | JTT | MtREV | Dayhoff | DCMut | RtREV
525 | CpREV | VT | Blosum62 | MtMam | MtArt | HIVw |  HIVb | custom} \\
526The \x{custom} option is  useful when one wants to use an amino-acid  substitution model that is not
527available by  default in PhyML. The  symmetric part of the  rate matrix, as well  as the equilibrium
528amino-acid frequencies, are  given in a file which name  is asked for by the  program. The format of
529this file is described in section \ref{sec:customaa}.
530\end{itemize}
531
532\begin{table}\index{custom models}
533\begin{center}
534\begin{tabular}{ll}
535\hline
536Name & Command-line option \\
537\hline
538JC69 &  \x{-m 000000 -f 0.25,0.25,0.25,0.25} \\
539F81 &   \x{-m 000000}\\
540K80 &   \x{-m 010010 -f 0.25,0.25,0.25,0.25} \\
541HKY85 & \x{-m 010010}\\
542TrNef & \x{-m 010020 -f 0.25,0.25,0.25,0.25} \\
543TrN &   \x{-m 010020}\\
544K81 &   \x{-m 123321 -f 0.25,0.25,0.25,0.25}\\
545K81uf & \x{-m 123321}\\
546TIMef & \x{-m 132241 -f 0.25,0.25,0.25,0.25}\\
547TIM &   \x{-m 132241}\\
548TVMef & \x{-m 102304 -f 0.25,0.25,0.25,0.25}\\
549TVM &   \x{-m 102304}\\
550SYM &   \x{-m 123456 -f 0.25,0.25,0.25,0.25}\\
551GTR &   \x{-m 123456}\\
552\hline
553\end{tabular}
554\caption{Nucleotide substitution model names (as defined in \cite{posada01}) and the corresponding
555  custom model code used in PhyML.}\label{tab:modelcode}
556\end{center}
557\end{table}
558
559\item \x{--aa\_rate\_file file\_name}\index{command-line options!\x{--aa\_rate\_file}} \\
560This option is compulsory when analysing amino-acid sequences under a `custom' model (see above). \x{file\_name}
561should provide a rate matrix and equilibrium amino acid in PAML format (see Section \ref{sec:customaa}).
562
563\item \x{-f e}, \x{m}, or \x{fA,fC,fG,fT} \index{frequencies!nucleotide}\index{frequencies!amino-acid}\index{stationary frequencies}\index{command-line options!\x{-f}}\\
564Nucleotide or amino-acid frequencies.
565\begin{itemize}
566\item \x{e} : the character frequencies are determined as follows :
567\begin{itemize}
568\item {\it Nucleotide sequences}: (Empirical) the equilibrium base frequencies are estimated by counting
569                 the occurence of the different bases in the alignment.
570\item {\it Amino-acid sequences}: (Empirical) the equilibrium amino-acid frequencies are estimated by counting
571                 the occurence of the different amino-acids in the alignment.
572\end{itemize}
573
574\item \x{m} : the character frequencies are determined as follows :
575\begin{itemize}
576\item {\it Nucleotide sequences}: (ML) the equilibrium base frequencies are estimated using maximum
577  likelihood.
578\item {\it Amino-acid sequences}: (Model) the equilibrium amino-acid frequencies are estimated using
579                 the frequencies defined by the substitution model.
580\end{itemize}
581
582\item \x{fA,fC,fG,fT}: only valid for nucleotide-based models. \x{fA}, \x{fC}, \x{fG} and \x{fT} are floating numbers that
583correspond to the frequencies of A, C, G and T respectively.
584\end{itemize}
585
586\item \x{-t} (or \x{--ts/tv}) \x{ts/tv\_ratio} \index{$\kappa$}\index{ts/tv ratio}\index{command-line options!\x{--ts/tv}}\\
587\x{ts/tv\_ratio}: transition/transversion ratio. DNA sequences only. Can be a fixed positive value
588(e.g., 4.0) or type \x{e} to get the maximum likelihood estimate.
589
590\item \x{-v} (or \x{--pinv}) \x{prop\_invar}\index{proportion of invariants}\index{invariable sites} \index{command-line options!\x{--pinv}}\\
591\x{prop\_invar}: proportion of invariable sites. Can be a fixed value in the [0,1] range or type \x{e} to get the maximum likelihood estimate.
592
593\item \x{-c} (or \x{--nclasses}) \x{nb\_subst\_cat}\index{gamma distribution (discrete)!number of categories} \index{command-line options!\x{--nclasses}}\\
594\x{nb\_subst\_cat}: number of relative substitution rate categories. Default: \x{nb\_subst\_cat=4}. Must be a positive integer.
595
596\item \x{-a} (or \x{--alpha}) \x{gamma} \index{gamma distribution (discrete)!shape parameter}\index{command-line options!\x{--alpha}} \\
597\x{gamma}: value of the gamma shape parameter. Can be a fixed positive value or e to get the maximum
598likelihood estimate. The value of this parameter is estimated in the maximum likelihood framework by default.
599
600\item \x{--use\_median} \index{gamma distribution (discrete)!mean vs. median} \index{command-line options!\x{--use\_median}}\\
601The middle of each substitution rate class in the discrete gamma distribution is taken as the
602median. The mean is used by default.
603
604\item \x{--free\_rates} or \x{--freerates} \index{command-line options!\x{--free\_rates}}\\
605As an alternative to the discrete gamma model, it is possible to estimate the (relative) rate in
606each class of the (mixture) model and the corresponding frequencies directly from the data. This
607model, called the FreeRate model, has more parameters
608than the discrete gamma one but usually provides a significantly better fit to the data. See
609\cite{soubrier12} for more information about this model and an illustration of its use.
610
611\item \x{--il} \index{command-line options!\x{--il}}\\ \x{il} stands here for integrated (branch)
612  length. This model, described in \cite{guindon13} in the context of molecular dating, provides an
613  efficient way to implement the covarion model \index{covarion}. Under the integrated length (IL) model, the length
614  of each edge is described by a distribution of values, instead of a single value corresponding to
615  the expected number of substitutions per position along the sequence. Let $l_{a,s}$ and $l_{b,s}$
616  be the number of substitutions at site $s$ along edges $a$ and $b$, and $l_{a,t}$ and $l_{b,t}$,
617  the number of substitutions at site $t$. Standard models have $l_{a,s}=l_{a,t}$ and $l_{b,s}=l_{b,t}$,
618  or $l_{a,s}=\alpha l_{a,t}$ and $l_{b,s}=\alpha l_{b,t}$ if rates vary across sites. The IL model
619  has instead $l_{a,s}= \alpha l_{a,t}$ and $l_{b,s}= \beta l_{b,t}$ with $\alpha \neq \beta$,
620  i.e. substitution rates vary across sites and edges. The IL approach is somehow an analytical
621  approximation to the covarion model that, unlike the covarion model, does not incur any computational overhead compared to the
622  traditional models. A notable difference with the plain vanilla covarion model and the IL model however is that
623  substitution rates are not autocorrelated along the phylogeny under the IL model.
624
625
626
627\item \x{--codpos} \x{1,2 or 3} \index{command-line options!\x{--codpos}}\\
628When analysing an alignment of coding sequences, use this option to consider only the first, second
629or the third coding position for the estimation.
630
631\item \x{-s} (or \x{--search}) \x{move}\index{NNI}\index{SPR} \index{command-line options!\x{--search}}\\
632Tree topology search operation option. Can be either \x{NNI} (default, fast) or \x{SPR} (usually
633slower than \x{NNI} but more accurate) or \x{BEST} (best of NNI and SPR search).
634
635\item \x{-u} (or \x{--inputtree}) \x{user\_tree\_file}\index{input tree}\index{user tree} \index{command-line options!\x{--inputtree}}\\
636\x{user\_tree\_file}: starting tree filename. The tree must be in Newick format.
637
638\item \x{-o params}\index{optimisation!topology}\index{optimisation!substitution parameters} \index{command-line options!\x{-o}}\\
639This option focuses on specific parameter optimisation.
640\begin{itemize}
641\item \x{params=tlr}: tree topology (\x{t}), branch length (\x{l}) and substitution rate parameters (\x{r}) are optimised.
642\item \x{params=tl}: tree topology and branch lengths are optimised.
643\item \x{params=lr}: branch lengths and substitution rate parameters are optimised.
644\item \x{params=l}: branch lengths are optimised.
645\item \x{params=r}: substitution rate parameters are optimised.
646\item \x{params=n}: no parameter is optimised.
647\end{itemize}
648
649\item \x{--rand\_start}\index{random tree} \index{command-line options!\x{--rand\_start}}\\
650This option sets the initial tree to random. It is only valid if SPR searches are to be performed.
651
652\item \x{--n\_rand\_starts num} \index{command-line options!\x{--n\_rand\_starts}}\\
653\x{num} is the number of initial random trees to be used. It is only valid if SPR searches are to be performed.
654
655\item \x{--r\_seed num}\index{random number} \index{command-line options!\x{--r\_seed}}\\
656\x{num} is the seed used to initiate the random number generator. Must be an integer.
657
658\item \x{--print\_site\_lnl}\index{likelihood!print site likelihood} \index{command-line options!\x{--print\_site\_lk}}\\
659Print the likelihood for each site in file *\_phyml\_lk.txt. For $\Gamma$ or $\Gamma$+I or
660FreeRate models, this option returns the posterior probability of each relative rate class at
661each site. Such information can then be used to identify fast- and slow-evolving regions of the alignment.
662
663\item \x{--print\_trace}\index{command-line options!\x{--print\_trace}}\\
664Print each phylogeny explored during the tree search process in file *\_phyml\_trace.txt. This
665option can be useful for monitoring the progress of the analysis for very large data sets and have
666an approximate idea of what the final phylogeny will look like.
667
668\item \x{--json\_trace}\index{command-line options!\x{--json\_trace}}\\
669Print each phylogeny explored during the tree search process in file *\_phyml\_json\_trace.txt in
670JSON format (see \url{http://www.json.org/}). This option can be useful for monitoring the progress of the analysis for very large data sets and have
671an approximate idea of what the final phylogeny will look like.
672
673
674\item \x{--run\_id ID\_string}\index{run ID} \index{command-line options!\x{--run\_id}}\\
675Append the string ID\_string at the end of each PhyML output file. This option may be useful when
676running simulations involving PhyML. It can also be used to `tag' multiple analysis of the same data
677set with various program settings.
678
679\item \x{--no\_memory\_check} \index{command-line options!\x{--no\_memory\_check}}\\
680By default, when processing a large data set, PhyML will pause and ask the user to confirm that
681she/he wants to continue with the execution of the analysis despite the large amount of memory
682required. The \x{--no\_memory\_check}  skips this question. It is especially useful when running
683PhyML in batch mode.
684
685\item \x{--no\_colalias} \index{command-line options!\x{--no\_colalias}}\\
686By default, PhyML preprocesses each alignment by putting together (or aliasing) the columns that are
687identical. Use this option to skip this step but be aware that the analysis might then take more
688time to complete.
689
690\item \x{--constrained\_lens} \index{command-line options!\x{--constrained\_lens}}\\
691When an input tree with branch lengths is provided, this option will find the branch multiplier that
692maximises the likelihood (i.e., the relative branch lengths remain constant)
693
694\item  \x{--constraint\_file}  \x{file\_name} \index{command-line  options!\x{--constraint\_file}}\\
695\x{file\_name}  lists  the  topological  constraints   under  which  the  tree  topology  search  is
696conducted.  This option  should  be used  in  conjunction with  \x{-u}  \x{file\_name}. See  Section
697\ref{sec:topoconstraints} for more information.
698
699\item \x{--quiet} \index{command-line options!\x{--quiet}}\\
700Runs PhyML in quiet mode. The program will not pause if the memory required to run the analysis
701exceeds 256MB and will not output the progression of the log-likelihood scores on the standard output.
702
703\item \x{--ancestral} \index{command-line options!\x{--ancestral}}\\
704PhyML calculates the marginal probabilities of each character state at each internal node and each
705site of the sequence alignment. It then uses the ``Minimum Posterior Expected Error'' criterion to
706infer ancestral sequences. Works for both nucleotide and amino-acid data.
707
708
709\item \x{--leave\_duplicates} \index{command-line options!\x{--leave\_duplicates}}\\
710  PhyML removes duplicate sequences by default. Use this option if you do not want PhyML to discard
711  them (even though leaving these sequences in will slow down the analysis).
712
713\end{itemize}
714
715\subsection{XML interface}
716\begin{itemize}
717\item \x{--xml=xml\_file\_name}\index{command-line options!\x{--xml}} \\
718\x{xml\_file\_name} is the name of the XML file containing the information required to run the
719analysis. More details about this type of file is given in the section \ref{sec:xmlio}.
720\end{itemize}
721
722\subsection{Parallel  bootstrap}\label{sec:parallel_bootstrap}\index{MPI}\index{bootstrap!parallel}
723
724Bootstrapping is  a highly  parallelizable task. Indeed,  bootstrap replicates are  independent from
725one another.   Each bootstrap replicate can then  be analysed separately. Modern  computers often have
726more than one CPU. Each CPU can therefore be used to process a bootstrap sample. Using this parallel
727strategy, performing  $R$ bootstrap replicates  on $C$ CPUs  `costs' the same amount  of computation
728time as processing $R  \times C$ bootstrap replicates on a single CPU.  In  other words, for a given
729number of replicates, the computation time is divided by $R$ compared to the non-parallel approach.
730
731PhyML sources  must be compiled with  specific options to turn  on the parallel  option (see Section
732\ref{sec:MPI}). Once  the binary file (\x{phyml})  has been generated, running  a bootstrap analysis
733with, say 100 replicates on 2 CPUs, can be done by typing the following command-line:
734\begin{verbatim}
735mpd &;
736mpirun -np 2 ./phyml-mpi -i seqfile -b 100;
737\end{verbatim}
738The  first command  launches  the mpi  daemon  while the  second launches  the  analysis. Note  that
739launching the daemon needs to be done only once.  The output files are similar to the ones generated
740using the standard, non-parallel, analysis (see Section \ref{sec:input_output}). Note that running
741the program in batch mode, i.e.:
742\begin{verbatim}
743mpirun -np 2 ./phyml-mpi -i seqfile -b 100 &
744\end{verbatim}
745will probably NOT work. I do not know how to run a mpi process in batch mode yet. Suggestions welcome...
746Also, at the moment, the number of bootstrap replicates must be a multiple of the number of CPUs
747required in the mpirun command.
748
749\section{Inputs \& outputs for command-line and PHYLIP interface }\label{sec:input_output}
750
751PhyML reads data from standard text files,  without the need for any particular file name extension.
752
753\subsection{Sequence formats}
754
755\begin{figure}
756\begin{small}
757\begin{Verbatim}[frame=single, label=PHYLIP interleaved, samepage=true, baselinestretch=0.5]
758
7595 80
760seq1  CCATCTCACGGTCGGTACGATACACCKGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
761seq2  CCATCTCACGGTCAG---GATACACCKGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
762seq3  RCATCTCCCGCTCAG---GATACCCCKGCTGTTG????????????????ATTAAAAGGT
763seq4  RCATCTCATGGTCAA---GATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
764seq5  RCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAAT????????GT
765
766ATCKGCTTTTGGCAGGAAAT
767ATCKGCTTTTGGCGGGAAAT
768AGCKGCTGTTG?????????
769ATCTGCTTTTGGCGGGAAAT
770ATCTGCTTTTGGCGGGAAAT
771
772\end{Verbatim}
773\begin{Verbatim}[frame=single, label=PHYLIP sequential, samepage=true, baselinestretch=0.5]
774
7755 40
776seq1  CCATCTCANNNNNNNNACGATACACCKGCTTTTGGCAGG
777seq2  CCATCTCANNNNNNNNGGGATACACCKGCTTTTGGCGGG
778seq3  RCATCTCCCGCTCAGTGAGATACCCCKGCTGTTGXXXXX
779seq4  RCATCTCATGGTCAATG-AATACTCCTGCTTTTGXXXXX
780seq5  RCATCTCACGGTCGGTAAGATACACCTGCTTTTGxxxxx
781
782\end{Verbatim}
783\end{small}
784\label{fig:align_tree}
785\caption{\bf PHYLIP interleaved and sequential formats.}
786\end{figure}
787
788
789
790\begin{figure}
791\begin{small}
792\begin{Verbatim}[frame=single, label=Nexus nucleotides, samepage=true, baselinestretch=0.5]
793
794[ This is a comment ]
795#NEXUS
796BEGIN DATA;
797DIMENSIONS NTAX=10 NCHAR=20;
798FORMAT DATATYPE=DNA;
799MATRIX
800tax1       ?ATGATTTCCTTAGTAGCGG
801tax2       CAGGATTTCCTTAGTAGCGG
802tax3       ?AGGATTTCCTTAGTAGCGG
803tax4       ?????????????GTAGCGG
804tax5       CAGGATTTCCTTAGTAGCGG
805tax6       CAGGATTTCCTTAGTAGCGG
806tax7       ???GATTTCCTTAGTAGCGG
807tax8       ????????????????????
808tax9       ???GGATTTCTTCGTAGCGG
809tax10      ???????????????AGCGG;
810END;
811
812\end{Verbatim}
813\end{small}
814
815\begin{small}
816\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5]
817
818[ This is a comment ]
819#NEXUS
820BEGIN DATA;
821DIMENSIONS NTAX=10 NCHAR=20;
822FORMAT DATATYPE=STANDARD SYMBOLS="0 1 2 3";
823MATRIX
824tax1       ?0320333113302302122
825tax2       10220333113302302122
826tax3       ?0220333113302302122
827tax4       ?????????????2302122
828tax5       10220333113302302122
829tax6       10220333113302302122
830tax7       ???20333113302302122
831tax8       ????????????????????
832tax9       ???22033313312302122
833tax10      ???????????????02122;
834END;
835
836\end{Verbatim}
837\end{small}
838
839\begin{small}
840\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5]
841
842[ This is a comment ]
843#NEXUS
844BEGIN DATA;
845DIMENSIONS NTAX=10 NCHAR=20;
846FORMAT DATATYPE=STANDARD SYMBOLS="00 01 02 03";
847MATRIX
848tax1       ??00030200030303010103030002030002010202
849tax2       0100020200030303010103030002030002010202
850tax3       ??00020200030303010103030002030002010202
851tax4       ??????????????????????????02030002010202
852tax5       0100020200030303010103030002030002010202
853tax6       0100020200030303010103030002030002010202
854tax7       ??????0200030303010103030002030002010202
855tax8       ????????????????????????????????????????
856tax9       ??????0202000303030103030102030002010202
857tax10      ??????????????????????????????0002010202;
858END;
859
860\end{Verbatim}
861\end{small}
862\caption{\bf NEXUS formats.}\label{fig:nexus}
863\end{figure}
864
865
866Alignments   of   DNA   or   protein   sequences   must  be   in   PHYLIP\index{PHYLIP}   or   NEXUS
867\cite{maddison97}\index{NEXUS} sequential\index{sequential} or interleaved\index{interleaved} format
868(Figures \ref{fig:align_tree}  and \ref{fig:nexus}).  For  PHYLIP formated sequence  alignments, the
869first line of  the input file contains the number  of species and the number  of characters, in free
870format, separated by blank characters.  One slight difference with PHYLIP format deals with sequence
871name  lengths.  While PHYLIP  format limits  this length  to ten  characters, PhyML  can read  up to
872hundred  character long sequence  names.  Blanks  and the  symbols ``(),:''  are not  allowed within
873sequence names because  the Newick tree format  makes special use of these  symbols.  Another slight
874difference with  PHYLIP format is  that actual sequences  must be separated  from their names  by at
875least one blank character.
876
877A PHYLIP input sequence file  may also display more than a single data set.  Each of these data sets
878must  be  in  PHYLIP   format  and  two  successive  alignments  must  be   separated  by  an  empty
879line. Processing multiple  data sets requires to toggle  the `\x{M}' option in the  {\em Input Data}
880sub-menu or use the `\x{-n}' command line option  and enter the number of data sets to analyse.  The
881multiple  data set  option can  be  used to  process re-sampled  data  that were  generated using  a
882non-parametric  procedure such  as  cross-validation or  jackknife  (a bootstrap  option is  already
883included in PhyML).  This  option is also useful in multiple gene studies,  even if fitting the same
884substitution model to all data sets may not be suitable.
885
886PhyML can  also process alignments in  NEXUS format. Although not  all the options provided  by this
887format are supported  by PhyML, a few specific  features are exploited.  Of course,  this format can
888handle nucleotide and protein  sequence alignments in sequential or interleaved  format.  It is also
889possible  to  use custom  alphabets,  replacing  the standard  4-state  and  20-state alphabets  for
890nucleotides and amino-acids respectively. Examples of a  4-state custom alphabet are given in Figure
891\ref{fig:nexus}. Each state must here  correspond to one digit or more. The set  of states must be a
892list of consecutive  digits starting from 0.  For instance,  the list ``0, 1, 3, 4''  is not a valid
893alphabet. Each  state in the  symbol list must  be separated  from the next  one by a  space. Hence,
894alphabets with large number of states can be easily defined by using two-digit number (starting with
89500 up  to 19  for a 20  state alphabet).  Most  importantly, this  feature gives the  opportunity to
896analyse data sets made of presence/absence character states (use the \texttt{symbols=``0 1''} option
897for such data).\index{binary characters} Alignments made  of custom-defined states will be processed
898using the Jukes and Cantor model.  Other options  of the program (e.g., number of rate classes, tree
899topology search algorithm) are freely configurable. Note  that, at the moment, the maximum number of
900different states is  set to 22 in order to  save memory space.  It is however  possible to lift this
901threshold   by  modifiying   the   value  of   the  variable   \x{T\_MAX\_ALPHABET}   in  the   file
902`\x{utilities.h}'. The program will then have to be re-compiled.
903
904
905
906\subsubsection{Gaps and ambiguous characters}
907
908Gaps correspond to  the `\x{-}' symbol.  They are systematically treated  as unknown characters ``on
909the grounds  that we  don't know what  would be  there if something  were there''  (J.  Felsenstein,
910PHYLIP main documentation).   The likelihood at these  sites is summed over all  the possible states
911(i.e.,  nucleotides  or   amino  acids)  that  could  actually  be   observed  at  these  particular
912positions. Note however that  columns of the alignment that display only  gaps or unknown characters
913are simply discarded because  they do not carry any phylogenetic information  (they are equally well
914explained  by any  model).  PhyML  also handles  ambiguous characters  such as  $R$ for  $A$  or $G$
915(purines) and $Y$ for $C$  or $T$ (pyrimidines).  Tables \ref{tab:ambigu_nt} and \ref{tab:ambigu_aa}
916give the list of valid characters/symbols and the corresponding nucleotides or amino acids.
917
918\begin{table}
919\begin{center}
920\begin{tabular}{lr|lr}
921\hline
922Character & Nucleotide &   Character & Nucleotide \\
923\hline
924$A$       & Adenosine &     $Y$       & $C$ or $T$ \\
925$G$       & Guanosine &       $K$       & $G$ or $T$ \\
926$C$       & Cytidine &      $B$       & $C$ or $G$ or $T$\\
927$T$       & Thymidine &       $D$       & $A$ or $G$ or $T$ \\
928$U$       & Uridine (=$T$) & $H$       & $A$ or $C$ or $T$ \\
929$M$       & $A$ or $C$ &    $V$       & $A$ or $C$ or $G$ \\
930$R$       & $A$ or $G$ &    $-$ or $N$ or $X$ or $?$ & unknown  \\
931$W$       & $A$ or $T$ &    & (=$A$ or $C$ or $G$ or $T$)\\
932$S$       & $C$ or $G$ &   & \\
933\hline
934\end{tabular}
935\end{center}
936\caption{{\bf List of valid characters in DNA sequences and the corresponding nucleotides.}}\label{tab:ambigu_nt}
937\end{table}
938\begin{table}
939\begin{center}
940\begin{tabular}{lr|lr}
941\hline
942Character & Amino-Acid & Character & Amino-Acid \\
943\hline
944$A$       & Alanine &         $L$       & Leucine \\
945$R$       & Arginine &        $K$       & Lysine \\
946$N$ or $B$& Asparagine &      $M$       & Methionine \\
947$D$       & Aspartic acid &   $F$       & Phenylalanine \\
948$C$       & Cysteine &        $P$       & Proline \\
949$Q$ or $Z$& Glutamine &       $S$       & Serine \\
950$E$       & Glutamic acid &   $T$       & Threonine \\
951$G$       & Glycine &         $W$       & Tryptophan \\
952$H$       & Histidine &       $Y$       & Tyrosine \\
953$I$       & Isoleucine &      $V$       & Valine \\
954$L$       & Leucine &         $-$ or $X$ or $?$ & unknown \\
955$K$       & Lysine &          & (can be any amino acid) \\
956\hline
957\end{tabular}
958\end{center}
959\caption{{\bf List of valid characters in protein sequences and the corresponding amino acids.}}\label{tab:ambigu_aa}
960\end{table}
961
962\subsubsection{Specifying outgroup sequences}\label{sec:outgroupspecify}
963
964PhyML can return rooted trees provided outgroup taxa are identified from the sequence file. In
965order to do so, sequence names that display a `*' character will be automatically considered as
966belonging to the outgroup.
967
968The topology of  the rooted tree is  exactly the same as the  unrooted version of the  same tree. In
969other words,  PhyML first ignores the distinction  between ingroup and outgroup  sequences, builds a
970maximum likelihood unrooted tree  and then tries to add the root. If the  outgroup has more than one
971sequence, the position  of the root might be  ambiguous. In such situation, PhyML  tries to identify
972the  most relevant  position of  the root  by considering  which edge  provides the  best separation
973between ingroup  and outgroup taxa (i.e.,  we are trying to  make the outgroup  ``as monophyletic as
974possible'').
975
976\subsection{Tree format}
977
978PhyML can  read one or  several phylogenetic trees  from an input  file.  This option  is accessible
979through the  {\em Tree Searching} sub  menu or the `\x{-u}'  argument from the  command line.  Input
980trees are generally used as initial maximum  likelihood estimates to be subsequently adjusted by the
981tree searching algorithm.   Trees can be either rooted or unrooted  and multifurcations are allowed.
982Taxa names must, of course, match the corresponding sequence names.
983
984\begin{figure}[h]
985\begin{small}
986\begin{minipage}{\textwidth}
987\begin{verbatim}
988((seq1:0.03,seq2:0.01):0.04,(seq3:0.01,(seq4:0.2,seq5:0.05):0.2):0.01);
989((seq3,seq2),seq1,(seq4,seq5));
990\end{verbatim}
991\end{minipage}
992\end{small}
993\caption{{\bf Input trees}. The first tree (top) is rooted and has branch lengths. The second tree
994  (bottom) is unrooted and does not have branch lengths.}
995\label{fig:trees}\index{Newick format}
996\end{figure}
997
998
999\subsection{Multiple alignments and trees}\index{multiple data sets}
1000
1001Single or  multiple sequence  data sets may  be used  in combination with  single or  multiple input
1002trees. When the number of data sets is one ($n_D = 1$) and there is only one input tree ($n_T = 1$),
1003then this tree is simply  used as input for the single data set analysis. When  $n_D = 1$ and $n_T >
10041$,  each input tree  is used  successively for  the analysis  of the  single alignment.  PhyML then
1005outputs the tree  with the highest likelihood.  If $n_D > 1$ and  $n_T = 1$, the same  input tree is
1006used for the analysis  of each data set.  The last  combination is $n_D > 1$ and $n_T  > 1$. In this
1007situation, the  $i$-th tree in the input  tree file is used  to analyse the $i$-th  data set. Hence,
1008$n_D$ and $n_T$ must be equal here.
1009
1010
1011\subsection{Custom amino-acid rate model}\label{sec:customaa}
1012
1013The custom amino-acid model of substitutions can be used to implement a model that is not hard-coded
1014in  PhyML.   This model  must  be  time-reversible.  Hence,  the  matrix  of  substitution rates  is
1015symmetrical. The format  of the rate matrix with the associated  stationary frequencies is identical
1016to the one used in PAML\index{PAML}. An example is given below:
1017
1018\begin{center}
1019{\tiny
1020\begin{tabular}{p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}}
1021% Ala & Arg & Asn & Asp & Cys & Gln & Glu & Gly & His & Ile & Leu & Lys & Met & Phe & Pro & Ser & Thr & Trp & Tyr & Val \\
1022  &&&&&&&&&&&&&&&&&&& \\
10230.55 &  &&&&&&&&&&&&&&&&&& \\
10240.51 & 0.64 &  &&&&&&&&&&&&&&&& \\
10250.74 & 0.15 & 5.43 &  &&&&&&&&&&&&&&&& \\
10261.03 & 0.53 & 0.27 & 0.03 &  &&&&&&&&&&&&&&& \\
10270.91 & 3.04 & 1.54 & 0.62 & 0.10 &   &&&&&&&&&&&&&& \\
10281.58 & 0.44 & 0.95 & 6.17 & 0.02 & 5.47 &  &&&&&&&&&&&&& \\
10291.42 & 0.58 & 1.13 & 0.87 & 0.31 & 0.33 & 0.57 &  &&&&&&&&&&&& \\
10300.32 & 2.14 & 3.96 & 0.93 & 0.25 & 4.29 & 0.57 & 0.25 &  &&&&&&&&&&& \\
10310.19 & 0.19 & 0.55 & 0.04 & 0.17 & 0.11 & 0.13 & 0.03 & 0.14 &  &&&&&&&&&& \\
10320.40 & 0.50 & 0.13 & 0.08 & 0.38 & 0.87 & 0.15 & 0.06 & 0.50 & 3.17 &  &&&&&&&&& \\
10330.91 & 5.35 & 3.01 & 0.48 & 0.07 & 3.89 & 2.58 & 0.37 & 0.89 & 0.32 & 0.26 &  &&&&&&&& \\
10340.89 & 0.68 & 0.20 & 0.10 & 0.39 & 1.55 & 0.32 & 0.17 & 0.40 & 4.26 & 4.85 & 0.93 &  &&&&&&& \\
10350.21 & 0.10 & 0.10 & 0.05 & 0.40 & 0.10 & 0.08 & 0.05 & 0.68 & 1.06 & 2.12 & 0.09 & 1.19 &  &&&&&& \\
10361.44 & 0.68 & 0.20 & 0.42 & 0.11 & 0.93 & 0.68 & 0.24 & 0.70 & 0.10 & 0.42 & 0.56 & 0.17 & 0.16 &  &&&&& \\
10373.37 & 1.22 & 3.97 & 1.07 & 1.41 & 1.03 & 0.70 & 1.34 & 0.74 & 0.32 & 0.34 & 0.97 & 0.49 & 0.55 & 1.61 &  &&&& \\
10382.12 & 0.55 & 2.03 & 0.37 & 0.51 & 0.86 & 0.82 & 0.23 & 0.47 & 1.46 & 0.33 & 1.39 & 1.52 & 0.17 & 0.80 & 4.38 &  &&& \\
10390.11 & 1.16 & 0.07 & 0.13 & 0.72 & 0.22 & 0.16 & 0.34 & 0.26 & 0.21 & 0.67 & 0.14 & 0.52 & 1.53 & 0.14 & 0.52 & 0.11 &  && \\
10400.24 & 0.38 & 1.09 & 0.33 & 0.54 & 0.23 & 0.20 & 0.10 & 3.87 & 0.42 & 0.40 & 0.13 & 0.43 & 6.45 & 0.22 & 0.79 & 0.29 & 2.49 &   & \\
10412.01 & 0.25 & 0.20 & 0.15 & 1.00 & 0.30 & 0.59 & 0.19 & 0.12 & 7.82 & 1.80 & 0.31 & 2.06 & 0.65 & 0.31 & 0.23 & 1.39 & 0.37 & 0.31 &   \\
1042\\
10438.66 & 4.40 & 3.91 & 5.70 & 1.93 & 3.67 & 5.81 & 8.33 & 2.44 & 4.85 & 8.62 & 6.20 & 1.95 & 3.84 & 4.58 & 6.95 & 6.10 & 1.44 & 3.53 & 7.09  \\
1044\end{tabular}
1045}
1046\end{center}
1047
1048The  entry  on the  $i$-th  row  and  $j$-th  column of  this  matrix  corresponds  to the  rate  of
1049substitutions between  amino-acids $i$  and $j$.   The last line  in the  file gives  the stationary
1050frequencies and must be separated from the rate  matrix by one line. The ordering of the amino-acids
1051is alphabetical,  i.e, Ala, Arg, Asn, Asp,  Cys, Gln, Glu, Gly,  His, Ile, Leu, Lys,  Met, Phe, Pro,
1052Ser, Thr, Trp, Tyr and Val.
1053
1054
1055\subsection{Topological constraint file}\label{sec:topoconstraints}
1056
1057PhyML can perform phylogenetic tree estimation under user-specified topological constraints. In
1058order to do so, one should use the \x{--constraint\_file}  \x{file\_name} command-line option where
1059\x{file\_name} lists the topological constraints. Such constraints are  straightforward to
1060define. For instance, the following constraints:
1061\vspace{0.2cm}
1062\begin{Verbatim}
1063((A,B),C,(D,E,F));
1064\end{Verbatim}
1065indicate that taxa D, E and F belong to the same clade. A, B and C also belong to the
1066same  clade and the  two clades  hence defined  should not  overlap. Under  these two
1067constraints, the tree ((A,B),D,((E,F),C)) is not valid. From the example above, you will notice that
1068the constraints are defined  using a multifurcating tree in NEWICK format.  Note that this tree does
1069not need to display the whole list of taxa. For instance, while the only taxa involved in specifying
1070topological constraints above  are A, B, C, D, E  \& F, the actual data set  could include more than
1071these six taxa only.
1072
1073PhyML tree topology  search algorithms all rely on  improving a starting tree. By  default, BioNJ is
1074the method  of choice  for building this  tree. However,  there is no  guarantee that  the phylogeny
1075estimated with PhyML does comply with the  topological constraints. While it is probably possible to
1076implement  BioNJ  with  topological constraints,  we  have  not  done  so yet.   Instead,  the  same
1077multifurcating tree that  defines the topological constraints  should also be used  as starting tree
1078using  the \x{-u}  (\x{--inputtree})  option. Altogether,  the  command line  should  look like  the
1079following: \x{-u}=\x{file\_name} \x{--constraint\_file}=\x{file\_name}. It is not possible to use as
1080input  tree a  non-binary phylogeny  that is  distinct  from that  provided in  the constraint  tree
1081file. However, any binary tree compatible with the constraint one can be used as input tree.
1082
1083
1084\subsection{Output files}
1085
1086\begin{table}
1087Sequence file name~: `{\x seq}'\\
1088\begin{center}
1089\begin{tabular}{ll}
1090\hline
1091Output file name & Content \\
1092\hline
1093\x{seq\_phyml\_tree.txt} & ML tree\\
1094\x{seq\_phyml\_stats.txt} &  ML model parameters\\
1095\x{seq\_phyml\_boot\_trees.txt} & ML trees -- bootstrap replicates\\
1096\x{seq\_phyml\_boot\_stats.txt} & ML model parameters -- bootstrap replicates \\
1097\x{seq\_phyml\_rand\_trees.txt} & ML trees -- multiple random starts\\
1098\x{seq\_phyml\_ancestral\_seq.txt} & ancestral sequences\\
1099\x{seq\_phyml\_ancestral\_tree.txt} & ML tree with node labels as in ancestral sequence file\\
1100\hline
1101\end{tabular}
1102\end{center}
1103\caption{{\bf Standard output files}}\label{tab:output}
1104\end{table}
1105
1106Table  \ref{tab:output} presents  the list  of files  resulting from  an analysis.   Basically, each
1107output file  name can be divided into  three parts.  The first  part is the sequence  file name, the
1108second part corresponds to  the extension `\x{\_phyml\_}' and the third part  is related to the file
1109content.  When launched with the default options,  PhyML only generates two files: the tree file and
1110the model parameter file.   The estimated maximum likelihood tree is in  standard Newick format (see
1111Figure  \ref{fig:trees}).  The  model  parameters file,  or  statistics file,  displays the  maximum
1112likelihood estimates of the substitution model  parameters, the likelihood of the maximum likelihood
1113phylogenetic model, and  other important information concerning the settings  of the analysis (e.g.,
1114type of data, name of the substitution model, starting tree, etc.).  Two additional output files are
1115created if  bootstrap supports were  evaluated.  These files  simply contain the  maximum likelihood
1116trees  and  the  substitution  model  parameters  estimated from  each  bootstrap  replicate.   Such
1117information can be used to estimate sampling errors around each parameter of the phylogenetic model.
1118When the random  tree option is turned on,  the maximum likelihood trees estimated  from each random
1119starting trees are printed in a separate tree file (see last row of Table \ref{tab:output}).
1120
1121PhyML estimates ancestral sequences by calculating the {\em marginal} (as opposed to the {\em joint}) probability of each character
1122  state at each internal node of the phylogeny. These probabilities are given in the file
1123\x{seq\_phyml\_ancestral\_seq.txt}. The bulk of this file is a table where each row corresponds to a
1124site in the original alignment and a number corresponding labeling each internal node. The different
1125columns of that file give the probability of each character state given the data observed at the
1126corresponding sites and the estimated phylogeny. PhyML also outputs
1127an extra column called MPEE\index{MPEE}\index{ancestral reconstruction}, which stands for Minimum Posterior Expected Error. This column gives the
1128state, which can be a single nucleotide/amino-acid or a combination of more than one nucleotide/amino-acid, that is
1129optimal under the MPEE criterion \cite{oliva19}.
1130
1131Also, it is relatively straightforward to identify which number corresponds to which node in the tree by examining the
1132information provided in the Newick-formatted file \x{seq\_phyml\_ancestral\_tree.txt}.
1133
1134
1135\subsection{Treatment of invariable sites with fixed branch lengths}
1136
1137PhyML  allows users  to give  an input  tree with  fixed topology  and branch  lengths and  find the
1138proportion of invariable sites that maximise the likelihood (option \x{-o r}). These two options can
1139be considered  as conflicting since  branch lengths depend  on the proportion of  invariants. Hence,
1140changing the proportion  of invariants implies that branch lengths are  changing too. More formally,
1141let $l$ denote the length of a branch,  i.e., the expected number of substitutions per site, and $p$
1142be  the proportion  of invariants.  We have  $l =  (1-p)l'$, where  $l'$ is  the expected  number of
1143substitutions {\em at  variable sites}.  When  asked to optimize  $p$ but leave $l$  unchanged, PhyML
1144does the following:
1145\begin{enumerate}
1146\item Calculate $l' = l/(1-p)$ and leave $l'$ unchanged throughout the optimization.
1147\item Find the value of $p$ that maximises the likelihood. Let $p^{*}$ denote this value.
1148\item Set $l^{*} = (1-p^{*})l'$ and print out the tree with $l^{*}$ (instead of $l$).
1149\end{enumerate}
1150
1151PhyML therefore  assumes that the  users wants  to fix the  branch lengths measured  at {\em variable}
1152sites only  (i.e., $l^{*}$ is  fixed). This is the  reason why the  branch lengths in the  input and
1153output trees  do differ  despite the  use of the  the \x{-o  r} option. While  we believe  that this
1154approach relies on a sound rationale, it  is not perfect. In particular, the original transformation
1155of  branch lengths  ($l' =  l/(1-p)$) relies  on a  default  value for  $p$ with  is set  to 0.2  in
1156practice. It is difficult  to justify the use of this value rather  than another one. One suggestion
1157proposed by  Bart Hazes is  to avoid fixing  the branch lengths  altogether and rather  estimate the
1158value  of   a  scaling  factor   applied  to   each  branch  length   in  the  input   tree  (option
1159\x{--contrained\_lens}).  We  agree  that  this  solution  probably matches  very  well  most  users
1160expectation, i.e., ``find the  best value of $p$ while constraining the ratio  of branch lengths to be
1161that given in the input tree''. Please feel free to send us your suggestions regarding this problem
1162by posting on the forum (\url{http://groups.google.com/group/phyml-forum}).
1163
1164
1165\section{Inputs \& outputs for the XML interface }\label{sec:xmlio}\index{XML}
1166
1167\subsection{Mixture models in PhyML}\index{mixture models}\label{sec:mixtures}
1168
1169PhyML implements a wide range of mixture models. The discrete gamma model \cite{yang94b} is arguably
1170the  most popular  of these  models in  phylogenetics. However,  in theory,  mixture models  are not
1171restricted to the description of the variation  of substitution rates across sites. For instance, if
1172there are good reasons  to believe that the relative rates of  substitution between nucleotides vary
1173along the  sequence alignments, it  makes sense to  use a mixture of  GTR models. Consider  the case
1174where substitutions between $A$ and $C$ occur at  high rate in some regions of the alignment and low
1175rate elsewhere,  a mixture with two  classes, each class having  its own GTR rate  matrix, would be
1176suitable. The likelihood at any site of  the alignment is then obtained by averaging the likelihoods
1177obtained for each GTR rate matrix, with the same weight given to each of these matrices.
1178
1179PhyML implements  a generic framework  that allows users to  define mixtures on  substitution rates,
1180rate matrices and nucleotide or amino-acid equilibrium  frequencies. Each class of the mixture model
1181is built by assembling a substitution rate,  a rate matrix\footnote{the rate matrix corresponds here
1182the symmetrical  matrix giving the so-called  ``echangeability rates''} and a  vector of equilibrium
1183frequencies.  For  instance, let $\{R_1,R_2,R_3\}$ be  a set of substitution  rates, $\{M_1,M_2\}$ a
1184set of rate matrices and $\{F_1,F_2\}$ a set  of vectors of equilibrium frequencies.  One could then
1185define the first class of the mixture model  as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as
1186$\mathcal{C}_2  = \{R_2,M_1,F_1\}$,  and a  third class  as $\mathcal{C}_3  = \{R_3,M_2,F_2\}$.   If
1187$R_1$, $R_2$  and $R_3$ correspond to  slow, medium and  fast substitution rates, then  this mixture
1188model allows the  fast evolving rates to have  their own vector of equilibrium  frequencies and rate
1189matrix, distinct from that found at the medium  or slow evolving sites.  The likelihood at any given
1190site $D_s$ of the alignment is then:
1191\begin{eqnarray*}
1192\Pr(D_s) = \sum_{c=1}^{3} \Pr(D_s | \mathcal{C}_s=c) \Pr(\mathcal{C}_s=c),
1193\label{equ:mixtlk}
1194\end{eqnarray*}
1195where  $\Pr(\mathcal{C}_s=c)$ is  obtained by  multiplying the  probability (density)  of  the three
1196components (i.e., rate, matrix, frequencies). For instance, $\Pr(\mathcal{C}_1=\{R_1,M_1,F_1\}) =
1197 \Pr(R_1)\times \Pr(M_1) \times \Pr(F_1)$.
1198We therefore assume here that substitution rates, rate
1199matrices and equilibrium frequencies are independent from one another.
1200
1201Note that, using the  same substitution rates, rate matrices and  vector of equilibrium frequencies,
1202it is  possible to construct  many other mixture  models. For instance,  the mixture model  with the
1203largest  number of  classes  can be  created  by considering  all the  combinations  of these  three
1204components.  We would  then get a mixture of  $3\times 2 \times 2=12$ classes,  corresponding to all
1205the possible combinations of 3 rates, 2 matrices and 2 vectors of frequencies.
1206
1207
1208% :  $\mathcal{C}_1 =
1209% \{R_1,M_1,F_1\}$   $\mathcal{C}_2    =   \{R_1,M_1,F_2\}$,   $\mathcal{C}_3    =   \{R_1,M_2,F_1\}$,
1210% $\mathcal{C}_4   =   \{R_1,M_2,F_2\}$,   $\mathcal{C}_5   =   \{R_2,M_1,F_1\}$,   $\mathcal{C}_6   =
1211% \{R_2,M_1,F_2\}$,   $\mathcal{C}_7   =    \{R_2,M_2,F_1\}$,   $\mathcal{C}_8   =   \{R_2,M_2,F_2\}$,
1212% $\mathcal{C}_9  =  \{R_3,M_1,F_1\}$,  $\mathcal{C}_{10}   =  \{R_3,M_1,F_2\}$,  $\mathcal{C}_{11}  =
1213% \{R_3,M_2,F_1\}$ and $\mathcal{C}_{12} = \{R_3,M_2,F_2\}$.
1214
1215
1216\subsection{Partitionned (i.e., multiple-gene) analyses}\index{partitionned analysis}\index{data partitions}\index{multiple-gene analysis}
1217
1218We first introduce some terms of vocabulary that have not been presented before. A partitionned data
1219set, also referred to as partition, is a  set of partition elements.  Typically, a partitionned data
1220set will be made of  a set of distinct gene alignments. A partition  element will then correspond to
1221one (or  several) of these gene  alignments. Note that the  biology litterature often uses  the term
1222partition to refer to an element of a  partitionned data.  We thus use here instead the mathematical
1223definition of the terms `partition' and `partition element'.
1224
1225Phylogenetics models usually assume individual columns  of an alignment to evolve independently from
1226one  another. Codon-based  models (e.g.,  \cite{yang98,yang00b,yang02,guindon04}) are  exceptions to
1227this rule  since the substitution process  applies here to  triplets of consecutive sites  of coding
1228sequences.  The non-independence of  the substitution process at the three  coding positions (due to
1229the specificities of the genetic code), can  therefore be accounted for.  Assuming that sites evolve
1230independently  does not  mean  that a  distinct  model is  fitted  to each  site  of the  alignment.
1231Estimating the  parameters of these  models would not  make much sense in  practice due to  the very
1232limited amount of phylogenetic signal conveyed by individual sites.  Site independence means instead
1233that the  columns of  the observed  alignment were sampled  randomly from  the same  ``population of
1234columns''.   The  stochasticity  of the  substitution  process  running  along  the tree  is  deemed
1235responsible to the variability of site patterns.
1236
1237Some parameters  of the  phylogenetic model  are considered  to be common  to all  the sites  in the
1238alignment. The tree topology is typically  one such parameter.  The transition/transversion ratio is
1239also generally assumed to be the same for all columns.  Other parameters can vary from site to site.
1240The rate at  which substitutions accumulate is  one of these parameters. Hence,  different sites can
1241have distinct rates. However,  such rates are all ``drawn'' from  the same probabilitic distribution
1242(generally a  discrete Gamma  density).  Hence,  while different  sites may  have distinct  rates of
1243evolution, they all share the same {\em distribution} of rates.
1244
1245This reasonning  also applies on a  larger scale. When  analysing multiple genes, one  can indeed
1246assume that the same  mechanism generated the different site patterns observed  for every gene. Here
1247again, we can assume that all the genes share the same underlying tree topology (commonly refered to
1248as the ``species  tree'').  Other parameters of  the phylogenetic model, such as  branch lengths for
1249instance, might  be shared across  genes. However,  due to the  specificities of the  gene evolution
1250processes, some  model parameters  need to be  adjusted for  each gene separately.   To sum  up, the
1251phylogenetic analysis of partitionned data requires flexible models with parameters, or distribution
1252of parameters,  shared across several partition  elements and other parameters  estimated separately
1253for each element of the partition.
1254
1255The likelihood of a  data set made of the concatenation of $n$  sequence alignments noted $D^{(1)}$,
1256$D^{(2)}, \ldots, D^{(n)}$ is then obtained as follows:
1257
1258\begin{eqnarray*}
1259\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n}  \Pr(D^{(i)}) \\
1260&=& \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \Pr(D^{(i)}_s),
1261\end{eqnarray*}
1262
1263where $L_i$ is the number of site columns in partition element $i$.  $\Pr(D^{(i)}_s)$
1264is then obtained using Equation \ref{equ:mixtlk}, i.e., by summing over the different classes of the
1265mixture model that  applies to site $s$ for  partition element $i$. Hence, the  joint probability of
1266all the partition elements is here broken down into the product of likelihood of every site for each
1267partition  element. As  noted just  above,  any given  component of  the  mixture model  at a  given
1268particular site is shared by the other sites that belong to the same partition element and, for some
1269of them, by  sites in other partition  elements (e.g., the same  tree topology is shared  by all the
1270sites, throughout all the partition elements).
1271
1272PhyML implements a wide  variety of partition models.  The only parameter that  is constrained to be
1273shared  by all  the  partition elements  is  the tree  topology. This  constraint  makes sense  when
1274considering distantly  related taxa, typically inter-species  data. For closely related  taxa, i.e.,
1275when  analysing intra-species  or population-level  data,  not all  the  genes might  have the  same
1276evolutionary history.   Recombination events combined  to the incomplete lineage  sorting phenomenon
1277can  generate  discrepancies   between  the  gene  trees  and  the   underlying  species  tree  (see
1278\cite{degnan09}  for  a review).   The  phylogenetic  softwares BEST  \cite{best}\index{BEST},  STEM
1279\cite{stem}\index{STEM} and  *BEAST \cite{startbeast}\index{*BEAST} are dedicated  to the estimation
1280of species tree phylogenies  from the analysis of multi-gene data and  allow gene-tree topologies to
1281vary across genes.
1282
1283Aside from the tree topology  that is common to all the sites and  all the partition elements, other
1284parameters of  the phylogenetic model  can be either shared  across partition elements  or estimated
1285separately  for each  of  these. When  analysing  three partition  elements, $A$,  $B$  and $C$  for
1286instance, PhyML can  fit a model where the same  set of branch lengths applies to  $A$ and $B$ while
1287$C$ has its  own estimated lengths.  The same  goes for the substitution model: the  same GTR model,
1288with identical parameter values, can be fitted to $A$  and $C$ and JC69 for instance can be used for
1289$B$. The sections below  give more detailed information on the range of  models available and how to
1290set up the corresponding XML configuration files to implement them.
1291
1292
1293\subsection{Combining mixture and partitions in PhyML: the  theory}
1294
1295The rationale behind mixture  models as implemented in PhyML lies in (1)  the definition of suitable
1296rate matrices, equilibrium frequency vectors and relative rates of substitution and (2) the assembly
1297of these  components so as  to create the  classes of a mixture.  The main idea  behind partitionned
1298analysis in  PhyML lies  in (1)  the hypothesis of  statistical independance  of the  different data
1299partition elements and (2) distinct data partition can share model components such as rate matrices,
1300equilibrium frequencies or  distribution of rates across  sites. More formally, the  likelihood of a
1301data set made of $n$ partition elements is written as follows:
1302\begin{eqnarray*}
1303\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \Pr(D^{(i)}_s) \\
1304&=& \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c),
1305\end{eqnarray*}
1306where $L_i$ is  the number of sites in partition  element $i$ and $K_i$ is the  number of classes in
1307the mixture model that applies to this same partition  element. Each class of a mixture is made of a
1308rate  matrix $M$,  a vector  of equilibrium  frequencies  $F$ and  a relative  rate of  substitution
1309$R$. Branch  lengths, $L$ and  tree topology $\tau$  are also required  for the calculation  of the
1310likelihood. Hence we have:
1311\begin{eqnarray*}
1312&& \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)})  \\
1313&&=  \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c) \\
1314&&= \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i}  \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) \mathcal{I}(m,f,r,i)
1315% &&= \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i}
1316% \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})
1317\end{eqnarray*}
1318where $\mathcal{M}_i$, $\mathcal{F}_i$ and $\mathcal{R}_i$ are the number of rate matrices,
1319vector of equilibrium frequencies and relative rates that apply to partition element $i$
1320respectively. $\mathcal{I}(m,f,r,i)$ is an indicator function that takes value 1 if the combination
1321$M_m$, $F_f$ and $R_r$ is acually defined in the model for this particular partition element
1322$i$. Its value is 0 otherwise. In the example given in section \ref{sec:mixtures} $\{R_1,R_2,R_3\}$
1323is the  set of substitution  rates, $\{M_1,M_2\}$ the set of rate matrices and $\{F_1,F_2\}$ the set  of vectors of equilibrium frequencies.  We then
1324define the first class of the mixture model  as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as
1325$\mathcal{C}_2  = \{R_2,M_1,F_1\}$ and the third as $\mathcal{C}_3  = \{R_3,M_2,F_2\}$. Hence, we
1326have $\mathcal{I}(1,1,1,i)$, $\mathcal{I}(1,1,2,i)$ and $\mathcal{I}(2,2,3,i)$ equal to one while
1327the nine other values that this indicator function takes, corresponding to the possible combinations of
1328two vectors of frequencies, two  matrices and three rates, are all zero.
1329
1330As stated before, our implementation assumes that the different components of a mixture are
1331independant. In other words, we have $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \Pr(M_m^{(i)}) \times
1332\Pr(F_f^{(i)}) \times \Pr(R_r^{(i)})$. In practice, the joint probability
1333$\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)})$ is obtained as follows:
1334\begin{eqnarray*}
1335\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \frac{\Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})}{
1336  \sum_{m,f,r} \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) \mathcal{I}(m,f,r,i)}
1337\label{equ:weights}
1338\end{eqnarray*}
1339The probabilities $\Pr(M_m^{(i)})$, $\Pr(F_f^{(i)})$ and $\Pr(R_r^{(i)})$, also called `weights', can be fixed or estimated
1340from the data.
1341
1342\subsection{The XML format and its use in PhyML}\label{sec:XML format}
1343
1344The few paragraphs below are largely inspired from  the Wikipedia page that describes the XML format
1345(\url{http://en.wikipedia.org/wiki/XML}). XML (eXtensible Markup Language) is a markup language that
1346defines  a set  of  rules  for encoding  documents  in  a format  that  is  both human-readable  and
1347machine-readable.  An  XML document is  divided into  {\em markup} and  {\em content}, which  may be
1348distinguished  by the  application of  simple syntactic  rules. Generally,  strings that  constitute
1349markup either begin  with the character `\x{<}' and  end with a `\x{>}'. Strings  of characters that
1350are not markup are content:
1351
1352\vspace{0.2cm}
1353\begin{Verbatim}[frame=single, label=XML markup and content example, samepage=true,
1354  baselinestretch=0.5, fontsize=\small]
1355<markup>
1356 content
1357</markup>
1358\end{Verbatim}
1359
1360A markup construct that begins  with `\x{<}' and ends with `\x{>}' is called  a {\em tag}. Tags come
1361in  three  flavors:  (1)  start-tags  (e.g,  \x{<section>}),  end-tags  (e.g.,  \x{</section>})  and
1362empty-element tags (e.g., \x{<line-break />}). A {\em  component} either begins with a start-tag and
1363ends with a  matching end-tag or consists only  of an empty-element tag. The  characters between the
1364start- and  end-tags, if any,  are the  element's content, and  may contain markup,  including other
1365elements, which are  called child elements.  In  the following example, the element  \x{img} has two
1366{\em  attributes},  \x{src}   and  \x{alt}:  \x{<img  src="madonna.jpg"   alt="Foligno  Madonna,  by
1367Raphael"/>}. Another example would be \x{<step number="3">Connect  A to B.</step>} where the name of
1368the attribute is ``\x{number}" and the value is ``\x{3}".
1369
1370In  practice,  building   a  mixture  model  in   a  XML  file  readable  by   PhyML  is  relatively
1371straightforward.  The first  step  is  to define  the  different components  of  each  class of  the
1372mixture.  Consider for  instance that  the fitted  model will  have a  Gamma distribution  with four
1373classes plus  a proportion of invariants.  The rate component of  the mixture can then  be specified
1374using the following XML code:
1375
1376\vspace{0.2cm}
1377\begin{Verbatim}[frame=single, label=$\Gamma4$+I rates, samepage=true, baselinestretch=0.5,
1378  fontsize=\small, numbers=left]
1379
1380<siterates id="SiteRates1">
1381  <weights  id="Distrib" family="gamma+inv" alpha=".1" \
1382  optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1383  </weights>
1384  <instance id="R1" init.value="1.0"/>
1385  <instance id="R2" init.value="1.0"/>
1386  <instance id="R5" init.value="0.0"/>
1387  <instance id="R3" init.value="1.0"/>
1388  <instance id="R4" init.value="1.0"/>
1389</siterates>
1390
1391\end{Verbatim}
1392
1393In the example above, the \x{<siterates>} component  completely defines a model of substitution rate
1394variation across  sites. This component  has a  particular identity, i.e.,  a name associated  to it
1395(``\x{SiteRates1}''  here),  which  is  not  mandatory.   This  \x{<siterates>}  component  has  six
1396sub-components.   The  first  is  the  \x{<weights>}  component,  followed  by  five  \x{<instance>}
1397components. The  \x{<weights>} component  defines the  type of  distribution that  characterizes the
1398variation of  rates across sites.  A discrete  Gamma plus invariants  is used here.   Two parameters
1399specify this distribution: the gamma shape and the proportion of invariant parameters. Their initial
1400values  are set  by using  the corresponding  attributes and  attribute values  (\x{alpha="0.1"} and
1401\x{pinv="0.4"}). Also, PhyML can  optimise these parameters so as to maximise  the likelihood of the
1402whole phylogenetic model (\x{optimise.pinv="yes"}  and \x{optimise.alpha="yes"}). The following five
1403\x{<instance>}  components  define  the  rate  classes themselves.  The  \x{id}  attribute  is  here
1404mandatory  and must  be  unique  to each  class.   Note  that one  of  the  initial (relative)  rate
1405(\x{init.value} attribute) is set to zero. The  corresponding rate class (the third in this example)
1406will then correspond to the invariant site category.
1407
1408Having specified the  part of the phylogenetic  model that describes the variation  of rates across
1409sites,  we can  now move  on  to build  the rest  of the  model.   The component  below defines  two
1410substitution models:
1411
1412\vspace{0.2cm}
1413\begin{Verbatim}[frame=single, label=Rate matrices, samepage=true, baselinestretch=0.5,
1414  fontsize=\small, numbers=left]
1415
1416<ratematrices id="RateMatrices">
1417  <instance id="M1" model="HKY85" tstv="4.0" optimise.tstv="no"/>
1418  <instance id="M2" model="GTR" optimise.rr="yes"/>
1419</ratematrices>
1420\end{Verbatim}
1421
1422This \x{<ratematrices>} component sets out a list  of substitution models (HKY85 and GTR here). Here
1423again, the  different elements in  this list correspond  to the \x{<instance>}  sub-components. Each
1424instance must  have a unique \x{id}  attribute for a reason  that will become obvious  shortly.  The
1425remaining attributes and their functions are described in Section \ref{sec:xmlratematrices}.
1426
1427The next ``ingredient'' in our phylogenetic model are vectors of nucleotide frequencies. The
1428\x{<equfreqs>} component below specifies two of such vectors:
1429
1430
1431\vspace{0.2cm}
1432\begin{Verbatim}[frame=single, label=Equilibrium frequencies, samepage=true, baselinestretch=0.5,
1433  fontsize=\small, numbers=left]
1434
1435<equfreqs id="EquFreq">
1436  <instance id="F1"/>
1437  <instance id="F2"/>
1438</equfreqs>
1439
1440\end{Verbatim}
1441
1442Now, we need to assemble these three components (rate variation across sites, rate matrices and
1443vectors of equilibrium frequencies) into a mixture model. The \x{<partitionelem>} component below
1444defines one such model:
1445
1446\vspace{0.2cm}
1447\begin{Verbatim}[frame=single, label=Mixture model, samepage=true, baselinestretch=0.5,
1448  fontsize=\small, numbers=left]
1449
1450<partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt">
1451  <mixtureelem list="R1, R2, R3, R4, R5"/>
1452  <mixtureelem list="M1, M1, M1, M2, M2"/>
1453  <mixtureelem list="F1, F2, F1, F2, F2"/>
1454</partitionelem>
1455
1456\end{Verbatim}
1457
1458The  \x{<partitionelem>} component  defines a  particular partition  element. In  this example,  the
1459partition element corresponds to the sequence file  called \x{nucleic.txt}, which is an alignment of
1460nucleotide  sequences   (see  the   \x{data.type}  attribute   value).   The   \x{<mixtureelem>}  are
1461sub-components  of  the  \x{<partitionelem>}  component.   Each  \x{<mixtureelem>}  has  a  \x{list}
1462atrribute.   Each such  \x{list} gives  the ID  of components  that have  been defined  before.  For
1463instance,  the  first   \x{<mixtureelem>}  refers  to  the  five  classes   of  the  \x{<siterates>}
1464component. The  ordering of  the different term  in these list  matters a  lot since it  is directly
1465related  to the  elements in  each class  of  the mixture  model. Hence,  the first  element in  the
1466\x{<list>} attribute  of the first  \x{<mixtureelem>} added to the  first element in  the \x{<list>}
1467attribute of the second \x{<mixtureelem>} plus the  the first element in \x{<list>} attribute of the
1468third \x{<mixtureelem>} defines the  first class of the mixture model.  Therefore, the mixture model
1469defined   above   has   five   classes:  $\mathcal{C}_1   =   \{R_1,M_1,F_1\}$,   $\mathcal{C}_2   =
1470\{R_2,M_1,F_2\}$,  $\mathcal{C}_3   =  \{R_3,M_1,F_1\}$,   $\mathcal{C}_4  =   \{R_4,M_2,F_2\}$  and
1471$\mathcal{C}_5 = \{R_5,M_2,F_2\}$.
1472
1473% Going back  to the different components  of this model, the  XML code dealing  with the substitution
1474% rates defines five  classes with names {\tt R1} to  {\tt R5}. The initial values  of these rates are
1475% set to 1.0, except for  {\tt R5}, which is set to 0 and will  therefore correspond to the invariable
1476% site class. The {\tt  <weight>} tag that follows indicate that these  rates define a $\Gamma 4$+Inv
1477% model, with  initial gamma shape parameter  set to 0.1 and  initial proportion of  invariants set to
1478% 0.4.  These  two parameters  will  be  estimated in  the  analysis  ({\tt  optimise.alpha} and  {\tt
1479% optimise.pinv} attributes  set to {\tt yes}).   The two rate matrices  have names {\tt  M1} and {\tt
1480% M2}.  {\tt M1} corresponds  to a HKY85 model, with transition/transversion ratio  set to 4.0 and set
1481% to be  optimised in the  analysis.  {\tt M2}  is a GTR  model, which parameters  are also set  to be
1482% optimised. {\tt F1} and {\tt F2} are two vectors of nucleotide frequencies at equilibrium. These two
1483% sets of frequencies will therefore be estimated during the analysis.
1484
1485
1486
1487\subsection{Setting up mixture and partition models in PhyML: the basics}\index{mixture
1488  models}\index{partitionned analysis}\index{data partitions}
1489
1490Mixture models are particularly relevant to the analysis of partitionned data. Indeed, some features
1491of  evolution are  gene-specific (e.g.,  substitution  rates vary  across genes).   Models that  can
1492accomodate  for such  variation,  as mixture  models do,  are  therefore relevant  in this  context.
1493However, other evolutionary features are shared across loci (e.g., genes located in the same genomic
1494region usually have similar  GC contents). As a consequence, some components  of mixture models need
1495to be  estimated separately for each  partition element while  others should be shared  by different
1496partition elements.
1497
1498Below is a simple example with a partitionned data set made of two elements, corresponding to the
1499sequence alignment files \x{nucleic1.txt} and \x{nucleic2.txt}. Importantly, the number and names of
1500sequences in these two alignments have to match exactly.
1501
1502\vspace{0.2cm}
1503\begin{Verbatim}[frame=single, label=Two sets of branch lengths (one per partition element),
1504  samepage=true, baselinestretch=0.5, fontsize=\small, numbers=left]
1505
1506<branchlengths id="BranchLens">
1507  <instance id="L1"/>
1508  <instance id="L2"/>
1509</branchlengths>
1510
1511<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1512  <mixtureelem list="R1, R2, R3, R4, R5"/>
1513  <mixtureelem list="L1, L1, L1, L1, L1"/>
1514</partitionelem>
1515
1516<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
1517  <mixtureelem list="R1, R2, R3, R4, R5"/>
1518  <mixtureelem list="L2, L2, L2, L2, L2"/>
1519</partitionelem>
1520
1521\end{Verbatim}
1522
1523Mixture elements with names  \x{R1},$\ldots$, \x{R5} refer to the $\Gamma4+$I model defined
1524previsouly  (see Section  \ref{sec:XML format}).   The \x{<branchlengths>}  XML component  defines a
1525mixture element  that had not  been introduced  before.  It defined  vectors of branch  lengths that
1526apply to the estimated phylogeny. Two instances of  such vectors are defined: \x{L1} and \x{L2}.
1527When examining the  two partition elements (\x{<partitionelem>} component), it  appears that \x{L1}
1528is associated with \x{Part1} while \x{L2} is associated with \x{Part2}.  Hence, branch lengths
1529will be estimated separately for these two partition elements.
1530
1531Note that  a given partition element  can only have  one {\tt branchlengths} instance  associated to
1532it. For instance, the example given below is not valid:
1533
1534\vspace{0.2cm}
1535\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5,
1536  fontsize=\small, numbers=left]
1537
1538<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1539  <mixtureelem list="R1, R2, R3, R4, R5"/>
1540  <mixtureelem list="L1, L1, L1, L2, L2"/>
1541</partitionelem>
1542
1543\end{Verbatim}
1544
1545In other words, mixture of branch lengths are forbidden. One reason for this restriction is that
1546mixture of edge lengths sometimes lead to non-identifiable models (i.e., models with distinct sets
1547of branch lengths have the same likelihood) \cite{matsen07}. But mostly, combining mixture of branch
1548lengths with mixture of rates appears like a deadly combination. Consider for instance the following model:
1549
1550\vspace{0.2cm}
1551\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5,
1552  fontsize=\small, numbers=left]
1553
1554<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1555  <mixtureelem list="R1, R2, R3"/>
1556  <mixtureelem list="L1, L2, L3"/>
1557</partitionelem>
1558
1559\end{Verbatim}
1560
1561It is here impossible to tell apart  branch lengths and substitution rates. Such model is strongly
1562non-identifiable and therefore not relevant.
1563
1564In the example given above, the same $\Gamma4+$I model (i.e. the same gamma shape parameter and
1565proportion of invariant ) applies to the two partition elements. It is possible to use two distinct
1566$\Gamma4+$I models instead using the following XML code:
1567
1568
1569
1570\vspace{0.2cm}
1571\begin{Verbatim}[frame=single, label=Two distinct $\Gamma4+$I models, samepage=true,
1572  baselinestretch=0.5, fontsize=\small, numbers=left]
1573
1574<siterates id="SiteRates1">
1575  <weights  id="Distrib1" family="gamma+inv" alpha=".1" \
1576  optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1577  </weights>
1578  <instance id="R1" init.value="1.0"/>
1579  <instance id="R2" init.value="1.0"/>
1580  <instance id="R5" init.value="0.0"/>
1581  <instance id="R3" init.value="1.0"/>
1582  <instance id="R4" init.value="1.0"/>
1583</siterates>
1584
1585<siterates id="SiteRates2">
1586  <weights  id="Distrib2" family="gamma+inv" alpha=".1" \
1587  optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1588  </weights>
1589  <instance id="R6"  init.value="1.0"/>
1590  <instance id="R7"  init.value="1.0"/>
1591  <instance id="R8"  init.value="0.0"/>
1592  <instance id="R9"  init.value="1.0"/>
1593  <instance id="R10" init.value="1.0"/>
1594</siterates>
1595
1596<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1597  <mixtureelem list="R1, R2, R3, R4, R5"/>
1598  <mixtureelem list="L1, L1, L1, L1, L1"/>
1599</partitionelem>
1600
1601<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
1602  <mixtureelem list="R6, R7, R8, R9, R10"/>
1603  <mixtureelem list="L2, L2, L2, L2, L2"/>
1604</partitionelem>
1605
1606\end{Verbatim}
1607
1608\x{SiteRates1} and \x{SiteRates2} here define two distinct $\Gamma4+$I models. Each of these models apply to
1609one of the two partition elements (\x{nucleic1.txt} and \x{nucleic2.txt}), allowing them to display
1610different patterns of rate variation across sites.
1611
1612
1613\subsection{XML options}
1614\subsubsection{{\tt phyml} component}\index{XML options!{\tt phyml} component}
1615Options:
1616\begin{itemize}
1617\item \x{output.file="filename"}. The main output files of PhyML analysis will be named
1618  \x{filename\_phyml\_tree} and \x{filename\_phyml\_stats}.
1619\item \x{bootstrap="nreplicates"}. Run \x{nreplicates} replicates for the non-parametric bootstrap analysis.
1620\item \x{run.id="idstring"}. PhyML will append the string \x{idstring} to each output file.
1621\item \x{print.json.trace="yes|true|no|false"}. PhyML will print the estimated trees, the
1622  corresponding loglikelihoods and various model parameters at multiple stages of the estimation process. This option is useful
1623  for monitoring the progress of the analysis when processing large data sets.
1624\item \x{print.trace="yes|true|no|false"}. PhyML will print the estimated trees (and the
1625  corresponding loglikelihoods) at multiple stages of the estimation process. This option is also useful
1626  for monitoring the progress of the analysis when processing large data sets.
1627\item \x{branch.test="aBayes|aLRT|SH|no"}. Calculate fast branch support using the aBayes method
1628  \cite{anisimova11}, aLRT \cite{anisimova06} or SH \cite{shimodaira99} tests. These branch
1629  statistics are much faster to estimate than the bootrap proportions and usually provide good
1630  estimates of the probabilities that the corresponding edges are correctly inferred (see Anisimova et
1631  al. 2011 for more precision). By default and if no bootstrap analysis is performed, branch supports
1632  are estimated using the aBayes approach.
1633\item \x{quiet="yes|no"}. Runs PhyML in quiet mode when \x{quiet=yes}. The program will not pause if the memory required to run the analysis
1634exceeds 256MB and will not output the progresssion of the log-likelihood scores on the standard output.
1635\item \x{memory.check="yes|no"}. By default, when processing a large data set, PhyML will pause and ask the user to confirm that
1636she/he wants to continue with the execution of the analysis despite the large amount of memory
1637required. Setting \x{memory.check=no}  skips this question. It is especially useful when running
1638PhyML in batch mode.
1639
1640
1641
1642\end{itemize}
1643\subsubsection{{\tt topology} component}\index{XML options!{\tt topology} component}
1644Each instance of the \x{topology} component (there should be only one...) has the following options:
1645\begin{itemize}
1646\item \x{init.tree="bionj"|"user"|"random"}.  Starting tree. Default is \x{bionj}.
1647\item \x{n.rand.starts="X"}.  Number of random starting trees. Default is 5.
1648\item \x{file.name="name\_of\_tree\_file"}. In case \x{init.tree="user"}, this
1649  attribute  is mandatory. \x{name\_of\_tree\_file} is a
1650  text file containing a tree in NEWICK format.
1651\item \x{optimise.tree="yes"|"true"|"no"|"false"}. The starting tree topology as defined by
1652  \x{init.tree} is to be optimised (or not) so as to maximise the likelihood function.
1653\item \x{search="nni"|"spr"|"none"}. Tree topology search is conducted using NNI (fast), SPR (a bit
1654  slower but more accurate) or no moves.
1655\end{itemize}
1656\vspace{0.2cm}
1657\begin{Verbatim}[frame=single, label=Example of `topology' component, samepage=true,
1658  baselinestretch=0.5, fontsize=\small, numbers=left]
1659
1660<topology>
1661  <instance id="T1" init.tree="bionj" optimise.tree="true" \
1662   search="spr"/>
1663</topology>
1664
1665\end{Verbatim}
1666
1667\subsubsection{{\tt ratematrices} component}\index{XML options!{\tt ratematrices} component}\label{sec:xmlratematrices}
1668Each instance of a \x{ratematrices} component have the following options:
1669\begin{itemize}
1670\item \x{model="JC69"|"K80"|"F81"|"F84"|"HKY85"|"TN93"|"GTR"|"custom"} for nucleotide data. The default is \x{"HKY85"}.\\
1671\x{model="LG"|"WAG"|"JTT"|"MtREV"|"Dayhoff"|"DCMut"|"RtREV"|"CpREV"|"VT"}\\\x{|"Blosum62"|"MtMam"|"MtArt"|"HIVw"|"HIVb"|"customaa"}
1672for amino-acid sequences. The default is \x{"LG"}.
1673\item \x{model.code="012345"}. For \x{custom} model applied to nucleotide sequences: set the
1674  string of digits that define a custom substitution model. See Table \ref{tab:modelcode} on page
1675  \pageref{tab:modelcode} for more
1676  information about the model codes.
1677\item \x{ratematrix.code="filename"}. When used in conjunction with \x{model="customaa"},
1678  \x{filename} is the name of the file that gives the rates of substitution between amino-acids as
1679  well as their frequences at equilibrium using PAML rate matrix format. An example of such file is
1680  provided in {phyml/examples/X1.mat}.
1681\item \x{optimise.rr="yes"|"true"|"no"|"false"}. For \x{custom} and \x{GTR} nucleotide models only:
1682  optimise the substitution rate model parameters.
1683\item \x{optimise.tstv="yes"|"true"|"no"|"false"}. For \x{K80}, \x{F84}, \x{HKY85} and \x{TN93}
1684  models only: optimise the transition/transversion rate ratio.
1685\item \x{tstv="value"}. For \x{K80}, \x{HKY85} and \x{TN93} models only: set the transition/transversion to a
1686  given value.
1687\end{itemize}
1688
1689An instance of a \x{ratematrices} component where a GTR or a custom model of substitutions between
1690nucleotides is used can have a \x{rr} component associated to it in order to specificy the relative
1691rates of substitutions (see example below).
1692
1693Also, the {\tt ratematrices} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt
1694  no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each matrix in the
1695set of matrices defined by this component (see Equation \ref{equ:weights}), will be estimated from the data.
1696
1697\vspace{0.2cm}
1698\begin{Verbatim}[frame=single, label=Example of `ratematrices' component, samepage=true,
1699  baselinestretch=0.5, fontsize=\small, numbers=left]
1700
1701  <ratematrices id="RM1" optimise.weights="yes">
1702    <instance id="M1" model="custom" model.code="000000"/>
1703      <rr AC="1.2" AG="1.5" AT="2.3" CT="1.2" CG="1.5"/>
1704    <instance id="M2" model="GTR" optimise.rr="yes"/>
1705    <instance id="M3" model="WAG"/>
1706  </ratematrices>
1707
1708\end{Verbatim}
1709
1710\subsubsection{{\tt equfreqs} component}\index{XML options!{\tt equfreqs} component}
1711Each instance of a \x{equfreqs} component has the following options:
1712\begin{itemize}
1713\item \x{base.freqs="a,b,c,d"} where \x{a-d} are nucleotide frequencies. Make sure that these
1714  frequencies are separated by comas and no space character is inserted.
1715\item \x{aa.freqs="empirical|model"}. Amino-acid frequencies are derived from counting the number of
1716  occurence of each amino-acid in the alignment (\x{aa.freqs="empirical"}) or given by the
1717  substitution model (\x{aa.freqs="model"}).
1718\item \x{optimise.freqs="true|yes|false|no"}. Nucleotide frequencies can be optimised so as to maximise
1719  the likelihood (\x{optimise.freqs="yes|true"}).
1720\end{itemize}
1721
1722The {\tt equfreqs} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt
1723  no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each vector of
1724equilibrium frequencies in the
1725set of vectors defined by this component (see Equation \ref{equ:weights}), will be estimated from the data.
1726
1727\vspace{0.2cm}
1728\begin{Verbatim}[frame=single, label=Example of `equfreqs' component, samepage=true,
1729  baselinestretch=0.5, fontsize=\small, numbers=left]
1730
1731  <equfreqs id="EF1" optimise.weights="yes">
1732    <instance id="F1" base.freqs="0.25,0.25,0.25,0.25"/>
1733    <instance id="F2" aa.freqs="empirical"/>
1734    <instance id="F3" optimise.freqs="yes"/>
1735  </equfreqs>
1736
1737\end{Verbatim}
1738
1739\subsubsection{{\tt branchlengths} component}\index{XML options!{\tt branchlengths} component}
1740Options:
1741\begin{itemize}
1742\item \x{optimise.lens="yes"|"true"|"no"|"false"}: branch lengths are optimised or not. The default
1743  is set to \x{"yes"}.
1744\end{itemize}
1745\vspace{0.2cm}
1746\begin{Verbatim}[frame=single, label=Example of `branchlengths' component, baselinestretch=0.5,
1747  fontsize=\small, numbers=left]
1748
1749  <branchlengths id="BL1">
1750    <instance id="L1" optimise.lens="yes"/>
1751    <instance id="L2"/>
1752    <instance id="L3" optimise.lens="false"/>
1753  </branchlengths>
1754
1755\end{Verbatim}
1756
1757\subsubsection{{\tt siterates} component}\index{XML options!{\tt siterates} component}
1758Each instance defines a class of relative rate of substitution. It has the following options:
1759\begin{itemize}
1760\item \x{value="val"}, where \x{"val"} is the relative substitution rate for the corresponding class.
1761\end{itemize}
1762A \x{siterates} component generally includes a \x{weights} element that specifies the probabilitic
1763distribution of the relative rates. The available options for such element are:
1764\begin{itemize}
1765\item \x{family="gamma|gamma+inv|freerates"}. \x{gamma} indicates that the distribution of the
1766  relative rates is set to be a discrete Gamma density. \x{gamma+inv} indicates that the relative rate model
1767  is a mixture of Gamma and invariant sites (this is the common $\Gamma+$I model). FreeRate is
1768  a model that does not use any parametric function to describe the distribution of the relative
1769  rates (see \cite{soubrier12}). Under this option, relative rates and the corresponding frequencies of these classes are
1770  directly estimated from the data. While such approach is slightly more computationally demanding
1771  than the $\Gamma$ (or $\Gamma$+I) model, it often provides a significantly better fit to the data.
1772\item \x{alpha="value|optimised"}, where \x{value} is a real positive number. Use this option to set
1773   the gamma shape parameter to the selected value. \x{optimised}: the parameter is estimated from
1774   the data (see also next option).
1775\item \x{optimise.alpha="yes|true|no|false"}. Optimise the shape of the Gamma distribution of
1776  relative rates (or not).
1777\item \x{pinv="value|optimised"}, where \x{value} is in $[0,1]$. Use this option to set
1778   the proportion of invariants to the selected value. \x{optimised}: the parameter is estimated from
1779   the data (see also next option).
1780\item \x{optimise.pinv="yes|true|no|false"}. Optimise the proportion of invariable sites (or not).
1781\item \x{optimise.freerates="yes|true|no|false"}. Optimise the parameters of the FreeRate model,
1782  i.e., the relative rates and the corresponding frequencies.
1783\end{itemize}
1784\vspace{0.2cm}
1785\begin{Verbatim}[frame=single, label=Example of `siterates' component (discrete gamma model), samepage=true,
1786  baselinestretch=0.5, fontsize=\small, numbers=left]
1787
1788  <siterates id="SR1">
1789    <instance id="R1" init.value="1.0"/>
1790    <instance id="R2" init.value="1.0"/>
1791    <instance id="R3" init.value="1.0"/>
1792    <instance id="R4" init.value="1.0"/>
1793    <weights  id="D1" family="gamma" optimise.alpha="yes" \
1794    optimise.pinv="no">
1795    </weights>
1796  </siterates>
1797\end{Verbatim}
1798
1799Setting up an analysis using the FreeRate\index{FreeRate} mixture model of rate variation across
1800sites \cite{soubrier12} requires a
1801bit more work. As in the discrete gamma model, the FreeRate model relies on definies classes of
1802(relative) rates. In the discrete gamma model, each class has the same frequency. This is no longer
1803the case with FreeRate. It is thus necessary to specify the frequency, or weight, of each class of rate in the
1804XML file. The example below illustrates how these weights are defined in practice:
1805\vspace{0.2cm}
1806\begin{Verbatim}[frame=single, label=Example of `siterates' component (FreeRate model), samepage=true,
1807  baselinestretch=0.5, fontsize=\small, numbers=left]
1808  <!-- Freerate model of variation of rates across sites -->
1809  <siterates id="SR1">
1810    <instance id="R1" init.value="0.1"/>
1811    <instance id="R2" init.value="0.7"/>
1812    <instance id="R3" init.value="1.9"/>
1813    <instance id="R4" init.value="5.1"/>
1814    <weights  id="D1" family="freerates" optimise.freerates="yes">
1815      <instance appliesto="R1" value="0.4"/>
1816      <instance appliesto="R2" value="0.3"/>
1817      <instance appliesto="R3" value="0.1"/>
1818      <instance appliesto="R4" value="0.1"/>
1819    </weights>
1820  </siterates>
1821\end{Verbatim}
1822Note that the weights do not have to sum to one. PhyML rescales them before starting the analysis
1823such that these weights are transformed into proper probabilities. The relative rates themselves are
1824rescaled too such that the weighted average (relative) rate is equal to one after rescaling.
1825
1826
1827\subsubsection{{\tt partitionelem} and {\tt mixtureelem} components}\index{XML options!{\tt partitionelem}
1828  component}\index{XML options!{\tt mixtureelem} component}
1829
1830Options:
1831\begin{itemize}
1832\item \x{file.name="inputfilename"}, where \x{inputfilename} is the name of the input sequence file
1833  (in PHYLIP format) to be analysed.
1834\item \x{data.type="nt|aa"}. Specify the type of sequences to be processed (nucleotide of amino-acid sequences).
1835\item \x{interleaved="yes|true|no|false"}. Interleaved (\x{yes|true}) or sequential format
1836  (\x{no|false}) for the sequence alignment.
1837\item \x{optimise.tree.scale="yes|true|no|false"}. The sum of edge length (or tree size) is
1838  optimized. This option is relevant when different data partition elements point to the same set of
1839  edge lengths so that setting \x{optimise.tree.scale="yes"} will find the optimal ratio of tree sizes
1840  considering the two elements. In other words, the different trees all share the same (relative) edge lengths
1841  but the corresponding partition elements have different mean rates of substitution.
1842\item \x{tree.scale="val"}. \x{val} is the value of the (relative) substitution rate. By default,
1843  its value is set to 1.0.
1844\item \x{print.site.lk="yes|true|no|false"}. The likelihood at each site (and other information)
1845  will be written in a file named  \x{inputfilename\_phyml\_lk}.
1846\end{itemize}
1847
1848Each \x{partitionelem} element should include exactly four \x{mixtureelem} elements, corresponding to
1849branch lengths, equilibrium frequencies, substitution rate model and tree topology. The ordering of
1850in which the \x{mixtureelem} elements are given does not matter, though exceptions apply for the
1851$\Gamma+I$ model (see below). The $n$-th element in the \x{list}
1852attribute of each \x{mixtureelem} defines the $n$-th class of the mixture model. In the example given
1853below, the first class of the mixture is made of the following elements: \x{T1}, \x{F1}, \x{R1} and
1854\x{L1}, the second class is made of \x{T1}, \x{F1}, \x{R2} and \x{L1}, etc.
1855
1856
1857\vspace{0.2cm}
1858\begin{Verbatim}[frame=single, label=Example of `partitionelem' component, samepage=true,
1859  baselinestretch=0.5, fontsize=\small, numbers=left]
1860
1861  <partitionelem id="partition1" file.name="./small_p1.nxs" \
1862   data.type="nt" interleaved="yes">
1863    <mixtureelem list="T1, T1, T1, T1"/>
1864    <mixtureelem list="F1, F1, F1, F1"/>
1865    <mixtureelem list="R1, R2, R3, R4"/>
1866    <mixtureelem list="L1, L1, L1, L1"/>
1867  </partitionelem>
1868
1869\end{Verbatim}
1870
1871In general, the  ordering of the \x{mixtureelem}  elements does not matter. However,  when the model
1872has invariable sites, then  the corresponding class should be first in  the list of classes provided
1873by \x{mixtureelem}. For instance, in the example above, if the rates are defined as follows:
1874
1875\vspace{0.2cm}
1876\begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true,
1877  baselinestretch=0.5, fontsize=\small, numbers=left]
1878
1879  <siterates id="SR1">
1880    <instance id="R1" init.value="0.0"/>
1881    <instance id="R2" init.value="1.0"/>
1882    <instance id="R3" init.value="1.0"/>
1883    <instance id="R4" init.value="1.0"/>
1884    <weights  id="D1" family="gamma+inv" optimise.alpha="yes" \
1885    optimise.pinv="no">
1886    </weights>
1887  </siterates>
1888\end{Verbatim}
1889
1890then \x{R1} corresponds to the invariable  rate class (as \x{init.value="0.0"}).  As \x{R1} is first
1891in the  \x{mixtureelem} (see line  6 in  the example of  \x{`partionelem'} given above),  PhyML will
1892print out an  explicit error message and bail out.   One way to avoid this  shortcoming is to define
1893\x{mixtureelem} as \x{R4, R2, R3, R1} instead.
1894
1895
1896\subsection{Example: GTR + $\Gamma$4 + I}
1897
1898The example below provides all the required options to fit a $\Gamma$4+I model to a single alignment
1899of nucleotide  sequences under the GTR  model of substitution using  a SPR search for  the best tree
1900topology. The \x{phyml} component sets the name for the analysis to \x{simple.example}, meaning that
1901each output file  will display this particular  string of characters. Also, the  tree and statistics
1902file names will begin with \x{p1.output}. The tree  topology will be estimated so as to maximise the
1903likelihood and  the topology search  algorithm used here  is SPR, as indicated  by the value  of the
1904corresponding attribute  (i.e., \x{search="spr"}). Only  one vector of  branch lengths will  be used
1905here since  only one partition element  will be processed. Hence,  the \x{<branchlengths>} component
1906only has one  \x{<instance>} sub-component. Also, a single  GTR model will apply to  all the classes
1907for the mixture model -- the \x{<ratematrices>} component has only one \x{<instance>} sub-component,
1908corresponding to this  particular substitution model. The next  component, \x{<equfreqs>}, indicates
1909that a single vector of equilibrium frequencies will apply here. Next, the \x{<siterates>} component
1910has five \x{<instance>} sub-components.  Four of these  correspond to the non-zero relative rates of
1911evolution  a  defined  by  a  discrete  Gamma distribution.   The  last  one  (\x{<instance  id="R5"
1912value="0.0"/>})  defines  the  class  of  the   mixture  corresponding  to  invariable  sites.   The
1913\x{<weight>} component indicates that a $\Gamma+$I model will be fitted here. The shape parameter of
1914the  Gamma distribution  and the  proportion of  invariants will  be estimated  from the  data.  The
1915\x{<partitionelem>} gives information about the sequence alignment (the corresponding file name, the
1916type of  data and the  alignment format). The \x{<mixtureelem>}  components next define  the mixture
1917model. Each class of the  fitted model corresponds to one column, with the  first column made of the
1918following elements: \x{T1, M1, F1, R1} and \x{L1}. The second class of the mixture is made of \x{T1,
1919M1, F1, R2, L1} and so forth.
1920
1921\vspace{0.2cm}
1922\begin{Verbatim}[frame=single, label=Simple PhyML XML example, samepage=true, baselinestretch=0.5,
1923  fontsize=\small, numbers=left]
1924
1925<phyml runid="simple.example" output.file="p1.output">
1926
1927  <topology>
1928    <instance id="T1" init.tree="bionj" optimise.tree="yes" \
1929    search="spr"/>
1930  </topology>
1931
1932  <branchlengths id="BL1">
1933    <instance id="L1" optimise.lens="yes"/>
1934  </branchlengths>
1935
1936  <ratematrices id="RM1">
1937    <instance id="M1" model="GTR"/>
1938  </ratematrices>
1939
1940  <equfreqs id="EF1">
1941    <instance id="F1"/>
1942  </equfreqs>
1943
1944  <siterates id="SR1">
1945    <instance id="R1" value="1.0"/>
1946    <instance id="R2" value="1.0"/>
1947    <instance id="R3" value="1.0"/>
1948    <instance id="R4" value="1.0"/>
1949    <instance id="R5" value="0.0"/>
1950    <weights  id="D1" family="gamma+inv" optimise.alpha="yes" \
1951    optimise.pinv="yes">
1952    </weights>
1953  </siterates>
1954
1955  <partitionelem id="partition_elem1" file.name=\
1956  "./p1.seq" data.type="nt" interleaved="yes">
1957    <mixtureelem list="T1, T1, T1, T1, T1"/>
1958    <mixtureelem list="M1, M1, M1, M1, M1"/>
1959    <mixtureelem list="F1, F1, F1, F1, F1"/>
1960    <mixtureelem list="R1, R2, R3, R4, R5"/>
1961    <mixtureelem list="L1, L1, L1, L1, L1"/>
1962  </partitionelem>
1963
1964</phyml>
1965
1966\end{Verbatim}
1967
1968
1969\subsection{Example: LG4X}\index{lg4x}
1970
1971The example below shows how to fit the LG4X model \cite{lg4x} to a given alignment of amino-acid
1972sequences (file \x{M587.nex.Phy}). LG4X is a mixture model with four classes. Each class has its own
1973rate and corresponding frequencies (hence the use of the FreeRate model below, see the
1974\x{<siterates>} component). In the particular example given here, the rate values and frequencies
1975are set by the users. These parameters will then be optimized by PhyML (\x{optimise.freerates="yes"}).
1976Each class also has its own rate matrix and vector of equilibrium frequencies, which need to be provided by
1977the user (Note that these matrices can be downloaded from the following web address:
1978\url{http://www.atgc-montpellier.fr/download/datasets/models/lg4x/LG4X_4M.txt}. They are also
1979provided in the PhyML package \x{example/lg4x/} directory.)
1980
1981\vspace{0.2cm}
1982\begin{Verbatim}[frame=single, label=LG4X, samepage=true, baselinestretch=0.5,
1983  fontsize=\small, numbers=left]
1984<phyml run.id="lg4x" output.file="M587.tests" branch.test="no">
1985
1986  <!-- Tree topology: start with BioNJ and then SPRs -->
1987  <topology>
1988    <instance id="T1" init.tree="user" file.name="user_tree.txt" \
1989    search="spr" optimise.tree="no"/>
1990  </topology>
1991
1992
1993  <!-- Four rate matrices, read from files -->
1994  <ratematrices id="RM1">
1995    <instance id="M1" model="customaa" ratematrix.file="X1.mat"/>
1996    <instance id="M2" model="customaa" ratematrix.file="X2.mat"/>
1997    <instance id="M3" model="customaa" ratematrix.file="X3.mat"/>
1998    <instance id="M4" model="customaa" ratematrix.file="X4.mat"/>
1999  </ratematrices>
2000
2001  <!-- Freerate model of variation of rates across sites -->
2002  <siterates id="SR1">
2003    <instance id="R1" init.value="0.197063"/>
2004    <instance id="R2" init.value="0.750275"/>
2005    <instance id="R3" init.value="1.951569"/>
2006    <instance id="R4" init.value="5.161586"/>
2007    <weights  id="D1" family="freerates" optimise.freerates="yes">
2008      <instance appliesto="R1" value="0.422481"/>
2009      <instance appliesto="R2" value="0.336848"/>
2010      <instance appliesto="R3" value="0.180132"/>
2011      <instance appliesto="R4" value="0.060539"/>
2012    </weights>
2013  </siterates>
2014
2015  <!-- Amino-acid equilibrium freqs. are given by the models -->
2016  <equfreqs id="EF1">
2017    <instance id="F1" aa.freqs="model"/>
2018    <instance id="F2" aa.freqs="model"/>
2019    <instance id="F3" aa.freqs="model"/>
2020    <instance id="F4" aa.freqs="model"/>
2021  </equfreqs>
2022
2023
2024  <!-- One vector of branch lengths -->
2025  <branchlengths id="BL1" >
2026    <instance id="L1" optimise.lens="yes"/>
2027  </branchlengths>
2028
2029
2030  <!-- Mixture model assemblage -->
2031  <partitionelem id="partition1" file.name="M587.nex.Phy" \
2032  data.type="aa" interleaved="yes">
2033    <mixtureelem list="T1, T1, T1, T1"/>
2034    <mixtureelem list="M1, M2, M3, M4"/>
2035    <mixtureelem list="F1, F2, F3, F4"/>
2036    <mixtureelem list="R1, R2, R3, R4"/>
2037    <mixtureelem list="L1, L1, L1, L1"/>
2038  </partitionelem>
2039
2040</phyml>
2041\end{Verbatim}
2042
2043In order to fit the LG4X model to the \x{proteic} sequence file provided in the \x{examples/}
2044directory, simply type \x{./phyml --xml=../examples/lg4x/lg4x.xml} (assuming the PhyML binary is installed
2045in the \x{src/} directory). You can of course slightly tweak the file \x{../examples/lg4x/lg4x.xml}
2046and use it as a template to fit this model to another data set.
2047
2048
2049\subsection{Example: CAT-like model}\index{CAT model}
2050
2051The CAT model \cite{lartillot04} is a mixture model whereby each site may have its own
2052exchangeability rate matrix and vector of state frequencies. In its original exposition, the number
2053of classes in the mixture was also a parameter to be estimated from the data. PhyML uses a fixed
2054number of classes instead. This lesser degree of sophistication facilitates the interpretation
2055of parameter estimates. Examination of the statistics and site-likelihood files produced by PhyML
2056under the CAT-like model makes it indeed straightforward to spot sites with peculiar substitution
2057pattern (e.g., sites that sustained only certain types of substitutions, for instance between
2058purines or between pyrimidines).
2059
2060In the  example that follows,  the CAT-like model  implemented has six  rate classes goverened  by a
2061FreeRate model. The first  and second class share the same GTR matrix  (M1) and vector of nucleotide
2062frequencies (F1) (see the two leftmost columns in  the assemblage matrix). The third and fourth also
2063share the same GTR matrix (M2) and vector of frequencies (F2) which are estimated independantly from
2064M1 and F1. Finally, the last two classes of  the mixture share M3 and F3, which are independant from
2065the other rate matrices and state frequencies.
2066
2067
2068
2069
2070
2071\vspace{0.2cm}
2072\begin{Verbatim}[frame=single, label=CAT, samepage=true, baselinestretch=0.5,
2073  fontsize=\small, numbers=left]
2074  <phyml runid="CAT" output.file="example" branch.test="no" \
2075  print.json.trace="no" print.site.lk="yes">
2076
2077  <topology>
2078  <instance id="T1" init.tree="bionj"  optimise.tree="yes" \
2079  search="spr"/>
2080  </topology>
2081
2082  <branchlengths id="BL1">
2083    <instance id="L1" optimise.lens="yes"/>
2084  </branchlengths>
2085
2086  <!-- GTR rate matrices -->
2087  <ratematrices id="RM1">
2088    <instance id="M1" model="GTR" optimise.rr="yes"/>
2089    <instance id="M2" model="GTR" optimise.rr="yes"/>
2090    <instance id="M3" model="GTR" optimise.rr="yes"/>
2091  </ratematrices>
2092
2093
2094  <!-- Vectors of nucleotide frequencies -->
2095  <equfreqs id="EF1">
2096    <instance id="F1" optimise.freqs="yes"/>
2097    <instance id="F2" optimise.freqs="yes"/>
2098    <instance id="F3" optimise.freqs="yes"/>
2099  </equfreqs>
2100
2101
2102  <!-- Variability of rates across sites -->
2103  <siterates id="SR1">
2104    <instance id="R1" init.value="1.0"/>
2105    <instance id="R2" init.value="1.0"/>
2106    <instance id="R3" init.value="1.0"/>
2107    <instance id="R4" init.value="1.0"/>
2108    <instance id="R5" init.value="1.0"/>
2109    <instance id="R6" init.value="1.0"/>
2110    <weights id="D1" family="freerates" optimise.alpha="yes" \
2111    optimise.pinv="yes">
2112    </weights>
2113  </siterates>
2114
2115
2116  <!-- Assemblage -->
2117  <partitionelem id="partition1" file.name="./seqfile.phy" \
2118  data.type="nt" interleaved="no" optimise.tree.scale="no">
2119    <mixtureelem list="T1,T1,T1,T1,T1,T1"/>
2120    <mixtureelem list="R1,R2,R3,R4,R5,R6"/>
2121    <mixtureelem list="M1,M1,M2,M2,M3,M3"/>
2122    <mixtureelem list="F1,F1,F2,F2,F3,F3"/>
2123    <mixtureelem list="L1,L1,L1,L1,L1,L1"/>
2124  </partitionelem>
2125
2126
2127</phyml>
2128\end{Verbatim}
2129
2130\subsection{Example: multiple partition elements}
2131
2132The example below gives  the complete XML file to specify the analysis  of three partition elements,
2133corresponding to the nucleotide  sequence files \x{small\_p1\_pos1.seq}, \x{small\_p1\_pos2.seq} and
2134\x{small\_p1\_pos3.seq} in  interleaved PHYLIP  format. Importantly, the number and names of
2135sequences in these three alignments match exactly. \x{small\_p1\_pos1.seq}  is fitted  with the
2136HKY85 model of substitution (with the  transition/transversion ratio being estimated from the data),
2137combined to a $\Gamma4$  model of rate variation across sites (with the  gamma shape parameter being
2138estimated from the data).  \x{small\_p1\_pos2.seq} is fitted to a custom substitution model with the
2139constraint  $A\leftrightarrow  G$=$C\leftrightarrow  T$.  The  nucleotide  frequencies  are  set  to
2140$\frac{1}{4}$   here.    The   model   does   not  allow   substitution   rates   to   vary   across
2141sites. \x{small\_p1\_pos3.seq} is fitted  using a GTR model conbined to a  $\Gamma4+$I model of rate
2142variation across sites.  Note that the equilibrium  nucleotide frequencies for the  fourth and fifth
2143class of the mixture are  set to be equal to that estimated from  the first partition element (i.e.,
2144\x{F1}) . The initial phylogeny is built using BioNJ  and the tree topology is to be estimated using
2145a NNI search algorithm.
2146
2147\vspace{0.2cm}
2148\begin{Verbatim}[frame=single, label=Example of PhyML XML file, samepage=true, baselinestretch=0.5,
2149  fontsize=\small, numbers=left]
2150
2151<phyml runid="nnisearch" output.file="small_p1_output">
2152
2153  <topology>
2154    <instance id="T1" init.tree="bionj" optimise.tree="yes" \
2155    search="nni"/>
2156  </topology>
2157
2158  <branchlengths id="BL1">
2159    <instance id="L1" optimise.lens="yes"/>
2160    <instance id="L2"/>
2161    <instance id="L3"/>
2162  </branchlengths>
2163
2164  <ratematrices id="RM1">
2165    <instance id="M1" model="HKY85" optimise.tstv="yes"/>
2166    <instance id="M2" model="custom" model.code="102304" \
2167    optimise.rr="yes"/>
2168    <instance id="M3" model="GTR"/>
2169  </ratematrices>
2170
2171  <equfreqs id="EF1">
2172    <instance id="F1"/>
2173    <instance id="F2" base.freqs="0.25,0.25,0.25,0.25"/>
2174    <instance id="F3"/>
2175  </equfreqs>
2176
2177  <siterates id="SR1">
2178    <instance id="R1" value="1.0"/>
2179    <instance id="R2" value="1.0"/>
2180    <instance id="R3" value="1.0"/>
2181    <instance id="R4" value="1.0"/>
2182    <weights  id="D1" family="gamma" optimise.alpha="yes" \
2183    optimise.pinv="no">
2184    </weights>
2185  </siterates>
2186
2187  <siterates id="SR2">
2188    <instance id="R8" value="1.0"/>
2189    <weights  id="D2" family="gamma" optimise.alpha="yes" \
2190    optimise.pinv="yes">
2191    </weights>
2192  </siterates>
2193
2194  <siterates id="SR3">
2195    <instance id="R10" value="1.0"/>
2196    <instance id="R11" value="1.0"/>
2197    <instance id="R12" value="1.0"/>
2198    <instance id="R13" value="1.0"/>
2199    <instance id="R14" value="1.0"/>
2200    <weights  id="D3" family="gamma" optimise.alpha="yes" \
2201    optimise.pinv="yes">
2202    </weights>
2203  </siterates>
2204
2205\end{Verbatim}
2206
2207\vspace{0.2cm}
2208\begin{Verbatim}[frame=single, label=Example of PhyML XML file (ctnd), samepage=true,
2209  baselinestretch=0.5, fontsize=\small, numbers=left]
2210
2211  <partitionelem id="partition_elem1" file.name=\
2212  "./small_p1_pos1.seq" data.type="nt" interleaved="yes">
2213    <mixtureelem list="T1, T1, T1, T1"/>
2214    <mixtureelem list="M1, M1, M1, M1"/>
2215    <mixtureelem list="F1, F1, F1, F1"/>
2216    <mixtureelem list="R1, R2, R3, R4"/>
2217    <mixtureelem list="L1, L1, L1, L1"/>
2218  </partitionelem>
2219
2220  <partitionelem id="partition_elem2" file.name=\
2221  "./small_p1_pos2.seq" data.type="nt" interleaved="yes">
2222    <mixtureelem list="T1"/>
2223    <mixtureelem list="M2"/>
2224    <mixtureelem list="R8"/>
2225    <mixtureelem list="F2"/>
2226    <mixtureelem list="L2"/>
2227  </partitionelem>
2228
2229  <partitionelem id="partition_elem3" file.name=\
2230  "./small_p1_pos3.seq" data.type="nt" interleaved="yes">
2231    <mixtureelem list="T1, T1, T1, T1, T1"/>
2232    <mixtureelem list="M3, M3, M3, M3, M3"/>
2233    <mixtureelem list="R10, R11, R12, R13, R14"/>
2234    <mixtureelem list="F3, F3, F3, F1, F1"/>
2235    <mixtureelem list="L3, L3, L3, L3, L3"/>
2236  </partitionelem>
2237
2238</phyml>
2239
2240\end{Verbatim}
2241
2242
2243\subsection{Branch lengths with invariants and partionned data}
2244
2245Accommodating for models with invariable sites applying to some elements of a partitioned data, with
2246these elements  sharing the  same set  of edge lengths  can lead  to inconsistencies.   Consider for
2247instance a partitioned  data set with two elements.   Assume that these two elements  share the same
2248set of edge lengths.  Also, consider that GTR+I  applies to the first element and HKY applies to the
2249second. Now, the expected number of substitutions per site for the first element of the partition is
2250equal  to  $(1-p)l$,  where  $p$  is  the   estimated  proportion  of  invariants  and  $l$  is  the
2251maximum-likelihood estimate  for the length  of that  specific edge. For  the second element  of the
2252partition, the  expected number of  substitutions per  site is equal  to $l$, rather  than $(1-p)l$.
2253While $l$ are common to the two elements,  matching the specification of the input model, the actual
2254edge  lengths do  differ across  the  two partition  elements.  Please  be  aware that,  due to  the
2255programming structure implemented in PhyML, the program  will only return one value here, which will
2256be equal to $(1-p)l$.
2257
2258\section{Citing PhyML}
2259The ``default citation'' for PhyML is:
2260\begin{itemize}
2261\item
2262``New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance
2263of PhyML 3.0''. Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010, {\it Systematic
2264  Biology}, 59(3):307-321
2265
2266\end{itemize}
2267The ``historic citation'' for PhyML is:
2268\begin{itemize}
2269\item ``A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood''
2270  Guindon S., Gascuel O. 2003, {\it Systematic Biology}, 52(5):696-704
2271\end{itemize}
2272
2273
2274
2275\section{Other programs}
2276
2277PhyML  is software package  that provides  tools to  tackle problems  other than  estimating maximum
2278likelihood  phylogenies.  Installing these  tools  and  processing data  sets  is  explained is  the
2279following sections.
2280
2281\subsection{PhyTime}\index{PhyTime} PhyTime is  a program that uses Bayesian sampling techniques to
2282infer phylogenies from the analysis of genetic and fossil data. Edge lengths (and node heights) are
2283expressed in calendar time units as opposed to expected number of substitutions as in
2284PhyML. PhyTime can thus be used to date past evolutionary events. The main original features
2285compared to other software for molecular dating are:
2286\begin{itemize}
2287\item a model of rate evolution whereby rates can vary along each branch (other
2288  software consider that the rates of evolution vary from one branch to another but stay constant
2289  along any given branch) \cite{guindon13}.
2290\item the opportunity to accomodate for uncertainty in calibration information. A single time
2291  constraint can apply to multiple clades with corresponding probabilities as determined from prior
2292  analysis of fossil data \cite{guindon18}.
2293\end{itemize}
2294
2295\subsubsection{Installing PhyTime}
2296
2297Compiling PhyTime is straightforward on Unix-like  machines (i.e., linux and MacOS systems). PhyTime
2298is not readily available for Windows machines but  compilation should be easy on this system too. In
2299the `phyml' directory, where the `src/'  and `doc/' directories stand, enter the following commands:
2300{\setlength{\baselineskip}{0.5\baselineskip}
2301\begin{verbatim}
2302./configure --enable-phytime;
2303make clean;
2304make;
2305\end{verbatim} } This set of commands generates  a binary file called \x{phytime} which can be found
2306  in the `src/' directory.
2307
2308\subsubsection{Running  PhyTime} Passing  options and  running  PhyTime on  your data  set is  quite
2309similar to running  PhyML using an XML parameter input file, i.e., a typical run would be launched
2310using the following command: \x{./phytime --xml=./dating\_example.xml}, assuming that the \x{phytime} binary file is in the same
2311directory as the XML control file \x{dating\_example.xml}, which happens to be the current directory here. The main
2312differences between PhyML and PhyTime are explained below:
2313\begin{itemize}
2314\item Unlike PhyML, PhyTime requires calibration (along with genetic sequence) data as input. The format for
2315  defining the relevant time intervals is described below.
2316\item  PhyTime does not allow partitionned (i.e., multigene) analysis yet.
2317\end{itemize}
2318
2319\subsubsection{PhyTime input}
2320As stated above, PhyTime takes as input an XML file very similar to those compatible with PhyML. An
2321example is given thereafter. The first part of the input file (up to the \x{partitionelem} block) corresponds to a standard XML input
2322file for PhyML. Note however that the tag name ``\x{phytime}'' replaces ``\x{phyml}''. The second
2323part is PhyTime-specific. Three new tags can be found here: \x{lineagerates}, \x{clade}
2324and \x{calibration}\index{PhyTime!calibration}.
2325
2326The \x{lineagerates}\index{PhyTime!lineagerates} element is used to define the model of rate
2327variation across lineages. Its only attribute is \x{model} which can take the following values:
2328\begin{itemize}
2329\item \x{model=strictclock} or \x{model=clock}  implements the strict clock model, i.e., all lineages
2330  evolve at the same instantaneous rate, at any point in time.
2331\item \x{model=lognormal} or \x{model=normal}. The logarithm of the average rate along each branch is distributed as a
2332  normal distribution. The mean and variance of this distribution are the same for all branches. More
2333  precisely, let $R_b$ be the average rate of substitution along edge $b$. We have $R_b=X_b\cdot
2334  \mu$, where $\mu$ is the ``baseline'' (or average over lineages) rate of substitution and $X_b$ is
2335  thus the branch-specific {\em relative} rate of substitution. The lognormal model has $\log(X_b)
2336  \sim \mathcal{N}(1,\nu)$, i.e., the logarithm of the relative rate is normally distributed with
2337  mean (mode and median) set to 1.0 and standard deviation equal to $\nu$.
2338\item \x{model=geometricbrownian} or \x{model=brownian} or \x{model=geo}. The logarithm of the rate
2339  of substitution evolves according to a Brownian motion process along the tree. More precisely, the
2340  logarithm of the rate of evolution at the end of edge $b$, noted as $Y_b^{\text{stop}}=\log(R_b^{\text{stop}})$, is
2341  normally distributed with variance $v := \nu \cdot \Delta t_b$, where $\Delta t_b$ is the (calendar) time elapsed
2342  along edge $b$, and mean equal to $Y_b^{\text{start}}-\frac{1}{2} v$, so that $\Ex(R_b^{\text{stop}})=\Ex(R_b^{\text{start}})$.
2343  Under the geometric Brownian motion model, the average rate of evolution is also random. This
2344  variability is taken into account in the calculation of the transition between nucleotides or
2345  amino-acids along edges. See Guindon, 2012, {\em Syst. Biol.} for more information.
2346\end{itemize}
2347
2348The \x{clade}\index{PhyTime!clade} element is used to define a subset of taxa. Each of these taxa is
2349given in a \x{taxon} element. The  value that each taxon takes is a string corresponding to the
2350name of one of the sequences in the alignment file. In the example that follows, \x{Gymno\_Araucaria},
2351\x{Gymno\_Ginko}, \x{Gymno\_Juniperus} and
2352\x{Gymno\_Juniperus} are three taxa that define a clade called \x{Gymno1}. Note that  this clade may
2353not be monophyletic. In fact, it is not considered as such (i.e., monophyletic) during the
2354inference. The second important element is \x{calibration}. It defines time intervals corresponding
2355to the time of diversification (i.e., the time of the most recent common ancestor) of the set of
2356taxa the calibration points to. In other words, a given time
2357interval defines the calibration for the timing of the crown node of a given clade. Each interval is defined using
2358the \x{upper} and \x{lower} tags. The upper (resp. lower) bound for each interval is a date
2359expressed in any unit you fancy.
2360
2361\vspace{0.2cm}
2362\begin{Verbatim}[frame=single, label=Example of PhyTime XML file, samepage=true, baselinestretch=0.5,
2363  fontsize=\small, numbers=left]
2364  <phytime run.id="June2018" output.file="out" \
2365  mcmc.chain.len="1E+7" mcmc.sample.every="100" \
2366  mcmc.print.every="50" mcmc.burnin="10000" \
2367  mutmap="no">
2368
2369  <!-- Tree topology -->
2370  <topology>
2371    <instance id="T1" init.tree="bionj" optimise.tree="yes"/>
2372  </topology>
2373
2374
2375  <!-- Model of rate variation across lineages -->
2376  <lineagerates model="lognormal"/>
2377
2378  <!-- Substitution model -->
2379  <ratematrices id="RM1">
2380    <instance id="M1" model="HKY85" optimise.tstv="no" tstv="4.0"/>
2381  </ratematrices>
2382
2383
2384  <!-- Freerate model of variation of rates across sites -->
2385  <siterates id="SR1">
2386    <instance id="R3" init.value="0.5"/>
2387    <instance id="R2" init.value="1.0"/>
2388    <instance id="R1" init.value="2.0"/>
2389    <weights  id="D1" family="freerates" optimise.freerates="no">
2390      <instance appliesto="R3" value="0.33"/>
2391      <instance appliesto="R2" value="0.33"/>
2392      <instance appliesto="R1" value="0.33"/>
2393    </weights>
2394  </siterates>
2395
2396  <!-- Nucleotide frequencies -->
2397  <equfreqs id="EF1">
2398    <instance id="F1" optimise.freqs="no"/>
2399  </equfreqs>
2400
2401
2402  <!-- Vector of edge lengths -->
2403  <branchlengths id="BL1" >
2404    <instance id="L1" optimise.lens="no"/>
2405  </branchlengths>
2406
2407  <!-- Model assembly -->
2408  <partitionelem id="partition1" file.name="./seq.txt" \
2409  data.type="nt" interleaved="no">
2410    <mixtureelem list="T1, T1, T1"/>
2411    <mixtureelem list="M1, M1, M1"/>
2412    <mixtureelem list="F1, F1, F1"/>
2413    <mixtureelem list="R1, R2, R3"/>
2414    <mixtureelem list="L1, L1, L1"/>
2415  </partitionelem>
2416\end{Verbatim}
2417
2418\begin{Verbatim}[frame=single, label=Example of PhyTime XML file, samepage=true, baselinestretch=0.5,
2419  fontsize=\small, numbers=left]
2420  <clade id="Gymno1">
2421    <taxon value="Gymno_Araucaria"/>
2422    <taxon value="Gymno_Ginkgo"/>
2423    <taxon value="Gymno_Juniperus"/>
2424    <taxon value="Gymno_Sciadopitys"/>
2425  </clade>
2426
2427  <clade id="Gymno2">
2428    <taxon value="Gymno_Araucaria"/>
2429    <taxon value="Gymno_Juniperus"/>
2430    <taxon value="Gymno_Sciadopitys"/>
2431  </clade>
2432
2433  <calibration id="cal1">
2434    <lower>40</lower>
2435    <upper>60</upper>
2436    <appliesto clade.id="Gymno1" probability="0.8"/>
2437    <appliesto clade.id="Gymno2" probability="0.2"/>
2438  </calibration>
2439
2440  <calibration id="cal2">
2441    <lower>50</lower>
2442    <upper>200</upper>
2443    <appliesto clade.id="Gymno1">
2444    </appliesto>
2445  </calibration>
2446
2447</phytime>
2448\end{Verbatim}
2449
2450\subsubsection{Accounting for calibration uncertainty}
2451
2452It is not always obvious to determine with full precision and accuracy where in the tree a given
2453fossil branches in {\em a priori}. For instance, a first morphological feature of an ancestral (fossilized) species
2454might be shared by species A but not B nor C, while a second feature might be displayed by A and B but
2455not C. Assuming for simplicity that the true tree topology is ((A,B),C), then the first feature calibrates the most recent common ancestor (MRCA)\index{MRCA} of A and
2456B. Indeed, this ancestor cannot be younger than the fossil itself because, if this was  the case, then
2457A {\em and} B would display the morphological feature of interest. In a similar fashion, the second
2458feature suggests that the calibration should instead apply to the MRCA of A, B and C. Therefore, assuming  that the fossil was discovered in a
2459well-defined geological layer which age range is known without ambiguity, there is still uncertainty
2460around the node in the tree this time interval calibrates. In the particular example given above, it
2461is not clear whether the calibration constraints applies to the node corresponding to the MRCA of A
2462and B, or the root of the tree.
2463
2464PhyTime accomodates for this uncertainty by giving a
2465probability to the two scenari: with probability $\alpha$ the calibration time interval applies to
2466the MRCA of species A and B and with probability $1-\alpha$, this interval calibrates the age of the
2467MRCA of A, B and C.
2468
2469It is fairly straightforward to set up an analysis that incorporates probabilistic distributions on
2470calibrations in PhyTime. The section of the XML file below gives a simple example whereby the
2471calibration  ``\x{cal1}'' applies to clade ``\x{Gymno1}'' (i.e., the smallest clade in the tree that
2472displays the four corresponding taxa) with probability 0.8 and to clade ``\x{Gymno2}''  with probability 0.2.
2473Also, calibration \x{cal2} applies to ``\x{Gymno1}'' with probability 1.0, which is implicit here and
2474thus does not need to be specified in the XML file.
2475
2476\begin{Verbatim}[frame=single, label=Calibrating with uncertainty, samepage=true, baselinestretch=0.5,
2477  fontsize=\small, numbers=left]
2478  <clade id="Gymno1">
2479    <taxon value="Gymno_Araucaria"/>
2480    <taxon value="Gymno_Ginkgo"/>
2481    <taxon value="Gymno_Juniperus"/>
2482    <taxon value="Gymno_Sciadopitys"/>
2483  </clade>
2484
2485  <clade id="Gymno2">
2486    <taxon value="Gymno_Araucaria"/>
2487    <taxon value="Gymno_Juniperus"/>
2488    <taxon value="Gymno_Sciadopitys"/>
2489  </clade>
2490
2491  <calibration id="cal1">
2492    <lower>40</lower>
2493    <upper>60</upper>
2494    <appliesto clade.id="Gymno1" probability="0.8"/>
2495    <appliesto clade.id="Gymno2" probability="0.2"/>
2496  </calibration>
2497
2498  <calibration id="cal2">
2499    <lower>50</lower>
2500    <upper>200</upper>
2501    <appliesto clade.id="Gymno1">
2502    </appliesto>
2503  </calibration>
2504
2505</phytime>
2506\end{Verbatim}
2507
2508Note that in this example, it can be the case that both \x{cal1} and \x{cal2} apply to the same
2509clade (\x{Gymno1}). In that particular situation, the actual calibration interval
2510is conservative and derives from the overlap of the two calibration time intervals (one
2511corresponding to \x{cal1}, the other to \x{cal2}). The MRCA of
2512\x{Gymno\_Araucaria}, \x{Gymno\_Ginko}, \x{Gymno\_Juniperus} and \x{Gymno\_Juniperus} has an age
2513that falls here in the $[50,60]$ interval.
2514
2515\subsubsection{MCMC settings}\label{sec:phytimesettings}
2516
2517PhyTime estimates the joint posterior distribution  of the phylogenetic model parameters  using an
2518MCMC algorithm. This algorithm relies on sampling values of these parameters iteratively. The
2519default number of iterations is fixed to 1E+07 and values of parameters are recorded every 1E+03
2520iteration of the MCMC algorithm. The burn-in period, during which tuning parameters of the MCMC as
2521adjusted, lasts for 1E+04 iterations. These three parameters can be modified by setting the values
2522of the attributes \x{mcmc.chain.len}, \x{mcmc.sample.every} and \x{mcmc.burnin} in the XML parameter
2523file accordingly. For instance, the first line of this file could look as follows:
2524\vspace{0.2cm}
2525\begin{Verbatim}[frame=single, label=MCMC settings, samepage=true, baselinestretch=0.5,
2526  fontsize=\small, numbers=left]
2527<phytime
2528run.id="example" output.file="small" mcmc.chain.len="1E+7"
2529mcmc.sample.every="100" mcmc.print.every="50" mcmc.burnin="10000">
2530\end{Verbatim}
2531
2532The  majority of analyses will not warrant 1E+07 iterations  to converge to the target
2533distribution (i.e., converge to the solution). I thus recommend  monitoring each analysis by loading
2534on a regular basis the statistics output file produced by PhyTime and interrupt the analysis once
2535the effective sample sizes of all parameters have reached 200.
2536
2537The MCMC algorithm implemented in
2538PhyTime relies on multiple operators that update different parameters, or the same parameter in
2539different ways. The weights of these operators, i.e., the frequency with which they are applied,
2540were adjusted for the analysis of the plant data set described in \cite{guindon18}. If you think
2541these default values are not suitable for your own analysis, please feel free to have a look at the
2542function \x{MCMC\_Complete\_MCMC} in the file \x{mcmc.c} to make  the appropriate changes to the
2543values of \x{mcmc->move\_weight[XXX]}, where \x{XXX} is the name of the operator. Alternatively, you
2544might want to send me an email (\url{guindon@lirmm.fr}) if you are unsure about all this.
2545
2546It is generally very useful to compare the posterior estimates of model parameters to that obtained
2547from a sampler that ignores sequence data (i.e., the Bayesian sampler only takes as input data the
2548calibration information). Such analysis helps quantifying how sensitive are the marginal posterior
2549estimates to the estimates one would get prior collecting and analysing genetic sequences. The
2550attribute \x{ignore.sequences} used in the \x{phytime} tag permits to run such analysis. The first
2551lines of a corresponding XML file would then look as follows:
2552
2553\vspace{0.2cm}
2554\begin{Verbatim}[frame=single, label=Sampling from the prior in PhyTime, samepage=true, baselinestretch=0.5,
2555  fontsize=\small, numbers=left]
2556<phytime run.id="prior" output.file="seq" ignore.sequences>
2557\end{Verbatim}
2558
2559\subsubsection{PhyTime output}
2560
2561The     program     PhyTime     generates     two     output     files.      The statistics file
2562is called  `\x{outputfile\_phytime\_stats\_runid.txt}'. It lists the
2563node times  and (relative) substitution rates on edges  sampled during the  estimation process.   It
2564also  gives the sampled values for other parameters, such as  the autocorrelation of rates
2565(parameter `nu'), and the rate of evolution (parameter `clock') amongst others.   This output file
2566can  be analysed with the program        Tracer\index{Tracer}        from         the
2567BEAST\index{BEAST}        package (\url{http://beast.bio.ed.ac.uk/Main_Page}) or Icylog
2568(\url{http://tgvaughan.github.io/icylog/}\index{Icylog}). It is also possible to process this file using R here. Section
2569\ref{sec:phytimeexample} of this document gives more information about the statistics output file.       The       second      file       is
2570called `\x{outputfile\_phytime\_trees\_runid.txt}'.  It is  the list of rooted trees that were  collected
2571during the estimation process, i.e., phylogenies sampled from the posterior density of trees.  This
2572file can be processed   using   the   software   TreeAnnotator,   also  part   of   the   BEAST
2573package   (see \url{http://beast.bio.ed.ac.uk/Main_Page}) in  order to generate  confidence sets for
2574the  node time estimates.
2575
2576
2577\subsubsection{An example of PhyTime input and output files}\label{sec:phytimeexample}
2578
2579The directory \x{phyml/examples/phytime/} provides a sequence alignment (\x{seq.txt}) and an XML
2580input file (\x{dating\_example.xml}) that can be used to run an analysis (using the following command:
2581\x{phytime --xml=dating\_example.xml}). The XML file is very similar to that described above and will thus not
2582be discussed further. The columns of the statistics output file are as follows:
2583\begin{itemize}
2584\item \x{sample}: the index of the iteration in the MCMC algorithm. For instance, a sample value of
2585  100 means that 100 ``moves'' in the MCMC have been applied so far, one such move can correspond to
2586  an attempt to modify the tree topology, or  height of some internal nodes, etc.
2587\item \x{lnL(posterior)}: the logarithm of the posterior density of the current model parameters. It is this function
2588  that the MCMC attempts to get samples from.
2589\item \x{lnL(seq)}: the logarithm of the conditional probability of the sequence alignment given the current model
2590  parameters.
2591\item \x{lnL(times)}: the logarithm of the probability density of the phylogeny evaluated under the
2592  birth and death branching process.
2593\item \x{lnL(rates)}: the logarithm of the probability density of the relative substitution rates
2594  along every edge of the current phylogeny.
2595\item \x{birth}: sampled values of the birth parameter. Histogram of these values provide an
2596  estimate of the marginal posterior distribution for this parameter.
2597\item \x{death}: death parameter for the birth-death branching process.
2598\item \x{clock}: mean rate of substitution.
2599\item \x{root}: age of the root node, i.e. the MRCA of all taxa in the sample.
2600\item \x{tstv}: transition/transversion ratio (the substitution model used
2601  here is HKY, as selected in the \x{param.xml} file).
2602\item \x{nu}: rate autocorrelation parameter.
2603\item \x{rr0}, \x{rr1}, \x{rr2}, \x{pr0}, \x{pr1}, \x{pr2}: relative rate of substitution and the
2604  corresponding frequencies. The mixture model of rate variation used in this analysis is the
2605  FreeRate model \cite{soubrier12} with three rate classes. These  columns give the values of
2606  these six parameters (though there are only four free parameters) sampled from the target
2607  (posterior) distribution.
2608\item \x{t(calib:cal1\_clade:Gymno1)}: sampled age of the MRCA of \x{Gymno1} when calibration
2609  \x{cal1} applies to this particular node.
2610\item \x{t(calib:cal1\_clade:Gymno2)}: sampled age of the MRCA of \x{Gymno2} when calibration
2611  \x{cal1} applies to this node.
2612\item \x{t(calib:cal2\_clade:Gymno1)}: sampled age of the MRCA of \x{Gymno1} when calibration
2613  \x{cal12} applies to this node.
2614\item \x{clade(calib:cal1)}: the clade id (\x{Gymno1} or \x{Gymno2}) calibration \x{cal1} applies
2615  to (0 for \x{Gymno1} and 1 for \x{Gymno2}).
2616\item \x{clade(calib:cal2)}: the clade id calibration \x{cal2} applies
2617  to.
2618\item \x{br0}, $\ldots$, \x{br41}: relative rate on all edges in the tree.
2619\item \x{t22}, $\ldots$, \x{t42}: ages of all internal nodes.
2620\item The remaining columns provides information about the acceptance rate of various operators in
2621  the MCMC and the corresponding tuning parameters.
2622\end{itemize}
2623
2624The script below can be copied and pasted into R in order to produce the plots in Figure
2625\ref{fig:phytimetrace} (assuming R was launched from the \x{phyml/examples/phytime} directory). The two plots at the top left and center give the trace of the node age of
2626the MRCA of \x{Gymno1} when calibration \x{cal1} applies to it (left) and that of \x{Gymno2} when
2627the same calibration \x{cal1} applies to it (centre). The plot on the top right gives the clade (which of
2628\x{Gymno1} or \x{Gymno2}) the calibration \x{cal1} applies to.
2629
2630The analysis of these  plots
2631shows the impact of applying the calibration constraint \x{cal1} to \x{Gymno1} and \x{Gymno2} alternatively.
2632When \x{cal1} applies to \x{Gymno2} (i.e., ({\em Gymno Araucaria},  {\em Gymno Juniperus}, {\em Gymno
2633  Sciadopitys})) the age of the MRCA of \x{Gymno1} ((i.e., ({\em Gymno\_Ginkgo}, {\em Gymno Araucaria},  {\em Gymno Juniperus}, {\em Gymno
2634  Sciadopitys}) is  free to wander towards older values compared to the situation where both
2635\x{cal1} and \x{cal2} apply to \x{Gymno1}. The age of the MRCA of \x{Gymno2} is here bound to fall in the
2636$[40,60]$ time interval. When \x{cal1} applies to \x{Gymno1} instead, then the MRCA of that clade
2637has to fall in the $[50,60]$ interval, while that of \x{Gymno2} has to be younger than the MRCA of \x{Gymno1}.
2638
2639
2640
2641\begin{Verbatim}[frame=single, label=R script to produce traces from the PhyTime statistics file, samepage=true, baselinestretch=0.5,
2642fontsize=\small, numbers=left]
2643file=paste("out_phytime_stats_June2018.txt",sep="");
2644d=read.table(file,header=T);
2645idx=floor(0.1*length(d$sample)):length(d$sample)
2646par(mfrow=c(2,3),mar=c(5,4,2,2));
2647plot(d$sample[idx],d$t.calib.cal1_clade.Gymno1.[idx],type="l",
2648col="red",ylab="Time",xlab="Sample",main="Time Gymno1")
2649plot(d$sample[idx],d$t.calib.cal1_clade.Gymno2.[idx],type="l",
2650col="red",ylab="Time",xlab="Sample",main="Time Gymno2")
2651plot(d$sample[idx],d$clade.calib.cal1.[idx],type="l",col="black",
2652ylab="Clade",xlab="Sample",main="(0->Gymno1, 1->Gymno2)")
2653plot(d$sample[idx],d$lnL.seq.[idx],type="l",col="orange",
2654ylab="log(Probability)",xlab="Sample",main="Likelihood sequences")
2655plot(d$sample[idx],d$lnL.posterior.[idx],type="l",col="orange",
2656ylab="log(density)",xlab="Sample",main="Joint posterior")
2657plot(d$sample[idx],d$lnL.times.[idx],type="l",col="orange",
2658ylab="log(density)",xlab="Sample",main="Branching process")
2659\end{Verbatim}
2660
2661
2662\begin{figure}
2663\begin{center}
2664\includegraphics[width=13.8cm]{./fig/phytimetrace}
2665\end{center}
2666\caption{{\bf Traces from the statistics output file produced by PhyTime.}}
2667\label{fig:phytimetrace}
2668\end{figure}
2669
2670% Important  information is also  displayed on  the standard  output of  PhyTime (the  standard output
2671% generally corresponds to the terminal window from  which PhyTime was launched).  The first column of
2672% this output gives the current  generation, or run, of the chain. It starts at  1 and goes up to 1E+7
2673% by default.  The second column gives the time elapsed  in seconds since  the sampling  began. The
2674% third column  gives the  log likelihood  of the phylogenetic model (i.e., `Felsenstein's
2675% likelihood'). The  fourth column gives the logarithm of the joint prior  probability of substitution
2676% rates along  the tree and  node heights. The  fifth column gives the  current sampled value  of the
2677% EvolRate  parameter along with the  corresponding Effective Sample Size (ESS)  for this parameter.
2678% The sixth  column gives the tree height  and the  corresponding  ESS.  The seventh  column  gives
2679% the  value  of the  autocorrelation parameter followed by the  corresponding ESS. The eightth column
2680% gives the  values of the birth rate parameter that  governs the birth-rate model  of species
2681% divergence  dates.  The last column  of the standard  output gives  the minimum  of the  ESS  values
2682% taken  over the  whole set  of node  height estimates.  It provides useful information when one has
2683% to decide whether or not the sample size is large enough to draw  valid conclusion, i.e., decide
2684% whether the chain was  run for long enough (see Section \ref{sec:recomphytime} for more detail about
2685% adequate chain length).
2686
2687% \subsubsection{ClockRate vs. EvolRate}
2688
2689% The average rate of evolution along a branch  is broken into two components. One is called ClockRate
2690% and is  the same throughout  the tree. The  other is called EvolRate  and corresponds to  a weighted
2691% average of  branch-specific rates.  The  model of rate  evolution implemented in PhyTime  forces the
2692% branch-specific rate values to  be greater than one. As a consequence,  ClockRate is usually smaller
2693% EvolRate.
2694
2695% In more mathematical terms, let $\mu$ be the value of ClockRate, $r_i$ be the value of the relative
2696% rate along branch $i$ and $\Delta_i$ the time elapsed along branch $i$. The value of EvolRate is
2697% then given by:
2698% \begin{eqnarray*}
2699% \mathrm{EvolRate} = \mu \frac{\sum_{i}^{2n-3} r_i \Delta_i}{\sum_{i}^{2n-3} \Delta_i}.
2700% \end{eqnarray*}  It is  clear from  this equation  that  multiplying each  $r_i$ by  a constant  and
2701% dividing $\mu$ by the same constant does not  change the value of EvolRate. The $r_i$s and $\mu$
2702% are then confounded,  or non-identifiable, and only  the value of EvolRate can  be estimated from
2703% the data.   {\color{red}{Please make sure  that you use  the value of  EvolRate rather than  that of
2704% ClockRate when referring to the estimate of the substitution rate}}.
2705
2706% \subsubsection{Effective sample size}\label{sec:ess}
2707
2708% The MCMC technique  generates samples from a  target distribution (in our case,  the joint posterior
2709% density  of  parameters).  Due  to  the  Markovian  nature of  the  method,  these samples  are  not
2710% independent.  The  ESS is the estimated  number of independent  measurements obtained from a  set of
2711% (usually dependent) measurements. It is calculated using the following formula:
2712% \begin{eqnarray*}
2713% \mathrm{ESS} = N\left(\frac{1-r}{1+r}\right),
2714% \end{eqnarray*}
2715% where  $N$  is the  length  of  the  chain  (i.e., the  `raw'  or `correlated' sample  size)  and $r$  is  the
2716% autocorrelation value, which is obtained using the following formula:
2717% \begin{eqnarray*}
2718% r = \frac{1}{(N-k)\sigma_x^2} \sum_{i=1}^{N-k} (X_i - \mu_x)(X_{i+k}-\mu_x),
2719% \end{eqnarray*} where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the $X_i$ values
2720% respectively and $k$ is the lag. The value of $r$ that is used in PhyTime corresponds to the case where $k=1$,
2721% which therefore gives a first order  approximation of the `average' autocorrelation value (i.e., the
2722% autocorrelation averaged over the set of possible values of the lag).
2723
2724
2725% \subsubsection{Prior distributions of model parameters}\label{sec:prior}
2726
2727% Any Bayesian analysis requires  specifying a prior distribution of model  parameters. The outcome of
2728% the data analysis, i.e., the posterior distribution,  is influenced by the priors.  It is especially
2729% true if the signal conveyed  by the data is weak.  While some have  argued that the specification of
2730% priors relies more  on arbitrary decisions than sound scientific  reasoning, choosing relevant prior
2731% distributions  is in  fact fully  integrated in  the process  of building  model that  generates the
2732% observed data.  In particular, the problem of  estimating divergence times naturally lends itself to
2733% hierarchical Bayesian modelling.  Based on the hypothesis  that rates of evolution are conserved (to
2734% some extant)  throughout the  course of  evolution, the hierarchical  Bayesian approach  provides an
2735% adequate framework for inferring substitution rates  and divergence dates separately. Hence, in this
2736% situation, it makes good  sense to use what is known about a  relatively well-defined feature of the
2737% evolution of genetic sequences (the ``molecular  clock'' hypothesis combined to stochastic variations of
2738% rates across lineages) to build a prior distribution on rates along edges.
2739
2740
2741
2742
2743\subsubsection{Citing PhyTime}\label{sec:citephytime}
2744
2745The ``default citation'' is:
2746
2747\begin{itemize}
2748\item Guindon S. ``From trajectories to averages: an improved description of the heterogeneity of
2749substitution rates along lineages'', {\it Systematic Biology}, 2013. 62(1): 22-34.
2750\end{itemize}
2751
2752If you are using mutliple calibrations on a given clade, please cite the following article:
2753\begin{itemize}
2754\item Guindon S. ``Accounting for calibration uncertainty: Bayesian molecular dating as a ``doubly intractable'' problem'', {\it Systematic Biology}, 2018. 67(4): 651-661.
2755\end{itemize}
2756
2757An earlier article also describes some of the methods implemented in PhyTime:
2758
2759\begin{itemize}
2760\item Guindon  S. ``Bayesian estimation of divergence  times from large data  sets'', {\it Molecular
2761    Biology and Evolution}, 2010,
276227(8):1768:81.
2763\end{itemize}
2764
2765
2766
2767\subsection{PhyloGeo}\index{PhyloGeo} PhyloGeo is  a program that implements the
2768competition-dispersal phylogeography model described in Ranjard, Welch, Paturel and Guindon
2769``Modelling competition  and  dispersal in a statistical phylogeographic framework''. Accepted for
2770publication in {\it Systematic Biology}.
2771
2772It implements a  Markov Chain Monte Carlo  approach that samples from the  posterior distribution of
2773the three  parameters of  interest in this  model, namely  the competition intensity  $\lambda$, the
2774dispersal  bias parameter  $\sigma$ and  the overal  dispersal rate  $\tau$. The  data consist  in a
2775phylogeny with node heights proportional to their ages and geographical locations for evety taxon in
2776this tree. An important assumption of the model is that each node in the phylogeny corresponds to a
2777speciation {\em and} a dispersal event. As a consequence, this model does not authorize a given taxon to
2778occupy more than one locations. Note however that the converse is not true: a given location can be
2779occupied by several different taxa.
2780
2781\subsubsection{Installing PhyloGeo}
2782
2783Compiling PhyloGeo is straightforward on Unix-like  machines (i.e., linux and MacOS systems). PhyloGeo
2784is not readily available for Windows machines but  compilation should be easy on this system too. In
2785the `phyml' directory, where the `src/'  and `doc/' directories stand, enter the following commands:
2786{\setlength{\baselineskip}{0.5\baselineskip}
2787\begin{verbatim}
2788./configure --enable-geo;
2789make clean;
2790make;
2791\end{verbatim} } This set of commands generates  a binary file called \x{phylogeo} which can be found
2792  in the `src/' directory.
2793
2794\subsubsection{Running PhyloGeo} PhyloGeo takes as input a rooted tree file in Newick format and a file
2795with geographical locations for all the tips of the phylogeny. Here is an example of valid tree and
2796the corresponding spatial locations just below:
2797
2798
2799\begin{scriptsize}
2800\begin{Verbatim}[frame=single, label=Valid PhyloGeo input tree, samepage=true, baselinestretch=0.5]
2801(((unicaA:1.30202,unicaB:1.30202):1.34596,(((nitidaC:0.94617,(nitidaA:0.31497,
2802nitidaB:0.31497):0.63120):0.18955,(((mauiensisA:0.00370,mauiensisB:0.00370):0.20068,
2803(pilimauiensisA:0.05151,pilimauiensisB:0.05151):0.15287):0.78769,(brunneaA:0.10582,
2804brunneaB:0.10582):0.88625):0.14365):0.80126,(((molokaiensisA:0.03728,
2805molokaiensisB:0.03728):0.71371,(deplanataA:0.01814,deplanataB:0.01814):0.73285):0.34764,
2806((parvulaA:0.20487,parvulaB:0.20487):0.40191,(kauaiensisA:0.24276,
2807kauaiensisB:0.24276):0.36401):0.49186):0.83835):0.71099):1.38043,
2808(nihoaA:0.05673,nihoaB:0.05673):3.97168);
2809\end{Verbatim}
2810
2811\begin{Verbatim}[frame=single, label=Valid PhyloGeo spatial location file, samepage=true, baselinestretch=0.5]
2812nihoaA                  23.062222	161.926111
2813nihoaB                  23.062222	161.926111
2814kauaiensisA             22.0644445	159.5455555
2815kauaiensisB             22.0644445	159.5455555
2816unicaA                  21.436875	158.0524305
2817unicaB                  21.436875	158.0524305
2818parvulaA                21.436875	158.0524305
2819parvulaB                21.436875	158.0524305
2820molokaiensisA           20.90532	156.6499
2821molokaiensisB           20.90532	156.6499
2822deplanataA              20.90532	156.6499
2823deplanataB              20.90532	156.6499
2824brunneaA                20.90532	156.6499
2825brunneaB                20.90532	156.6499
2826mauiensisA              20.90532	156.6499
2827mauiensisB              20.90532	156.6499
2828pilimauiensisA          20.90532	156.6499
2829pilimauiensisB          20.90532	156.6499
2830nitidaA                 19.7362	        155.6069
2831nitidaB                 19.7362	        155.6069
2832nitidaC                 19.7362	        155.6069
2833\end{Verbatim}
2834\end{scriptsize}
2835
2836In order to run PhyloGeo, enter the following command: \x{./phylogeo ./tree\_file
2837  ./spatial\_location\_file > phylogeo\_output}.
2838PhyloGeo will then print out the sampled values of the model parameters in the file
2839\x{phylogeo\_output}. This file can then be used to generate the marginal posterior densities of the
2840model parameters. In particular, evidence for competition corresponds to value of $\lambda$ smaller
2841than 1.0. Please see the original article for more information on how to interpret the model
2842parameters.
2843
2844\subsubsection{Citing PhyloGeo}\label{sec:citephylogeo}
2845
2846Ranjard, L., Welch D., Paturel M. and Guindon S. ``Modelling competition  and  dispersal in a
2847statistical phylogeographic framework''. 2014. Systematic Biology.
2848
2849
2850
2851\subsection{PhyREX}\index{PhyREX} PhyREX is a program that implements the spatial
2852$\Lambda$-Fleming-Viot model
2853\cite{etheridge2008,berestycki2009,barton2010,barton2010b,veber2012,barton2013} or \sfv\ for short.
2854This model can be thought of as a structured-coalescent where space is considered as continuous
2855rather than organized into separate demes.
2856Under the \sfv\ model, the spatial distribution of individuals in a population is uniform and does
2857not change during the course of evolution, as opposed to other models such as the very popular
2858isolation by distance model proposed by Wright and Mal\'ecot. The \sfv\ model  does not
2859suffer from the ``pain in the torus'' \cite{felsenstein1975} and clumps of individuals with
2860increasing densities do not arise. Also, estimates of migration parameters are not sensitive to
2861variation in sampling intensities across regions, unlike ``mugration'' or ``discrete trait
2862analysis'' models \cite{lemey2009}.
2863
2864PhyREX implements an original data augmentation technique embedded in a Bayesian sampler to
2865estimate two important biological parameters: the neighorhood size ($\mathcal{N}$) and the dispersal
2866intensity ($\sigma^2$). These two parameters are closely related to the effective size of the
2867population of interest per unit area since $\mathcal{N}$ is defined as follows: $\mathcal{N} := 4\pi
2868\rho_e \sigma^2$, where $\rho_e$ is the effective population density.
2869
2870\subsubsection{Installing PhyREX}
2871
2872Compiling PhyREX is straightforward on Unix-like  machines (i.e., linux and MacOS systems). PhyREX
2873is not readily available for Windows machines but  compilation should be easy on this system too. In
2874the `phyml' directory, where the `src/'  and `doc/' directories stand, enter the following commands:
2875{\setlength{\baselineskip}{0.5\baselineskip}
2876\begin{verbatim}
2877./configure --enable-phyrex;
2878make clean;
2879make;
2880\end{verbatim} } This set of commands generates  a binary file called \x{phyrex} which can be found
2881  in the `src/' directory.
2882
2883\subsubsection{Running PhyREX}
2884
2885Example input files for PhyREX can be found in the \x{examples/phyrex/} directory.
2886PhyREX takes as input a XML file (See Section \ref{sec:xmlio}), in a PhyML-compatible format. This
2887file defines the models and various parameter settings for conducting an analysis. We give more
2888information below about this file. The XML file gives the names of a sequence alignment file in the
2889standard PHYLIP (or NEXUS) format along with a file providing the spatial coordinates of the
2890corresponding sequences.  An example of the coordinates file is given
2891below:
2892\begin{Verbatim}[frame=single, label=Valid PhyREX spatial location file, samepage=true, baselinestretch=0.5]
2893# state.name lon lat
2894|SouthWest| 0 0
2895|NorthEast| 10 10
2896levosl 5.082206 4.133893
2897kmgcwv 5.914294 4.603446
2898uhfwja 4.990937 4.445124
2899ndmwkc 5.178017 4.442268
2900jpadex 3.747484 4.571090
2901lqcdcw 7.081925 5.133123
2902wnsbtg 4.164588 4.720346
2903ptwgnn 5.711159 4.462993
2904jhqdsm 3.539525 4.537706
2905vlnoes 4.613251 4.470530
2906pfrnpk 4.117791 4.489819
2907elwdvr 5.649958 4.824092
2908lptxiv 4.563302 4.005124
2909\end{Verbatim}
2910This first column in this  file gives the sequence names, the second is  the longitude and the third
2911column gives the latitude. The first row in this file gives the names of the columns. It starts with
2912the  `\#' character  signaling  a comment.  The  second and  third  rows define  the  limits of  the
2913population's habitat.  At the moment,  this habitat is  assumed to be  a rectangle. The  position in
2914space of that rectangle are determined by  the coordinates of the bottom-left corner (\x{|SouthWest|
29150 0}) and the top-right one (\x{|NorthEast| 10  10}). The `\x{|}' characters help identify the terms
2916in the list of coordinate corresponding indeed to these two particular points and should thus not be
2917omitted. The  remaining row give  the coordinates of  each taxon. Every  taxon in the  sequence file
2918should also be listed in the coordinate file.
2919
2920Below is an excerpt of a XML file that can be used as input to run a PhyREX analysis. The \sfv\
2921model parameters are estimated using a Bayesian sampler which hyperparameters (number of
2922samples, frequency at which samples are collected, length of the ``burnin'' phase, etc.) are set
2923through the corresponding XML attributes (see \ref{sec:phytimesettings}). PhyREX is similar to
2924PhyTime in the sense that substitution rates and times are separate parameters. The rates can
2925vary across lineages (using for instance a log-normal model) or not (strict clock). In case all
2926sequences were collected at (approximately) the same time, it is recommended to set the average
2927rate of evolution to a specific value (\x{<clockrate value="1E-4"/>}). Yet, with serially-sampled
2928data, the average rate of evolution can be estimated directly from the data and setting it to
2929a particular value is thus not mandatory. Finally, it is necessary to give information about the
2930time at which each sequence was collected. This is done through seeting appropriate calibration
2931constraints. From a technical perspective, these constraints apply to clades where each clade here
2932is a single tip.
2933
2934
2935\begin{Verbatim}[frame=single, label=Example (excerpt) of PhyREX XML file, samepage=true, baselinestretch=0.5,
2936  fontsize=\small, numbers=left]
2937  <phyrex run.id="example" output.file="out"
2938  mcmc.chain.len="1E+7"
2939  mcmc.sample.every="500"
2940  mcmc.print.every="100"
2941  mcmc.burnin="10000"
2942  mutmap="no">
2943
2944  <!-- ``Standard'' stuff goes here, i.e., <topology>, -->
2945  <!-- <partitionelem>, etc -->
2946
2947  <!-- Model of rate variation across lineages -->
2948  <lineagerates model="lognormal"/>
2949
2950  <!-- Set the average (clock) rate of substitution -->
2951  <!-- to a fixed value (optional) -->
2952  <clockrate value="1E-4"/>
2953
2954  <-- File where 2-D spatial coordinates are found -->
2955  <coordinates id="coordinates" file.name="./usa_coord.txt"/>
2956
2957  <!-- Set the date for each tip. Useful in serial sample -->
2958  <!-- analysis -->
2959  <clade id="clad1">
2960  <taxon value="CY130177|South_Carolina|12_13|H1N1"/>
2961  </clade>
2962
2963  <!-- Clade 1 (made of a single sequence) has time set to 0 -->
2964  <calibration id="cal1">
2965    <lower>0</lower>
2966    <upper>0</upper>
2967    <appliesto clade.id="clad1"/>
2968  </calibration>
2969
2970
2971  <clade id="clad2">
2972  <taxon value="KF647978|Idaho|12_13|H1N1"/>
2973  </clade>
2974
2975  <!-- Clade 2 has time set to 10 -->
2976  <calibration id="cal2">
2977    <lower>10</lower>
2978    <upper>10</upper>
2979    <appliesto clade.id="clad2"/>
2980  </calibration>
2981
2982  <!-- More calibration info goes here. At least, one -->
2983  <!-- calibrated clade per tip. -->
2984
2985\end{Verbatim}
2986
2987PhyREX generates correlated samples from the posterior distribution of the \sfv\ model
2988parameters. The estimated distributions and related summary statistics can be monitored during the
2989analysis using the MCMC vizualization software Icylog. Figure \ref{fig:phyrextrace} shows the traces
2990for two parameters after about an hour of calculation on the example data set (36 taxa, 50
2991geographic locations). Most of the names of the sampled parameters are self-explanatory but some of
2992them (listed below) need explanations:
2993\begin{itemize}
2994\item \texttt{lnP} : log-posterior.
2995\item \texttt{alnL} : logarithm of the probability of the sequence alignment given the current tree.
2996\item \texttt{glnL} : logarithm of the probability of the current tree and location data.
2997\item \texttt{clock} : substitution rate
2998\item \texttt{lbda} : value of $\lambda$, the rate at which REX events take place in the \sfv~model.
2999\item \texttt{mu} : value of $\mu$, the term governing the probability for the lineage to be hit by
3000  a REX event.
3001\item \texttt{rad} : value of the parameter $\theta$, i.e., the radius parameter in the \sfv~model.
3002\item \texttt{neigh} : neighborhood size (equal to $2/\mu$ under the \sfv~model).
3003\item \texttt{rhoe} : effective population density.
3004\item \texttt{sigsq} : $\sigma^2 := 4\theta^4\lambda \pi \mu /s$ where $s$ is the surface of the habitat.
3005\item \texttt{sigsqobs} : mean, taken over the (most recent) tips, of one half (in a two-dimensional space) times the Euclidean distance
3006  covered in one unit of time, squared.
3007\item \texttt{dispdist} : mean, taken over the (most recent) tips, of the Haversine distance
3008  covered in one unit of time.
3009\item \texttt{nInt} : number of REX events.
3010\item \texttt{nCoal} : number of coalescent events.
3011\item \texttt{nHit} : number of REX events where a single lineage was hit.
3012\item \texttt{rootTime} : time at the root node.
3013\item \texttt{rootLon} : longitude of the root node location.
3014\item \texttt{rootLat} : latitude of the root node location.
3015\end{itemize}
3016If you
3017are unsure about the precise definition of one of these parameters or have any questions regarding
3018how to interpret the results returned by PhyREX, please do not hesitate to contact me (\url{guindon@lirmm.fr}).
3019
3020\begin{figure}
3021\begin{center}
3022  \includegraphics[width=13.8cm]{./fig/phyrexlog}
3023\end{center}
3024\caption{{\bf Statistics file generated by PhyREX and loaded on Icylog
3025    (\url{http://tgvaughan.github.io/icylog/icylog.html}).} The traces for the neighborhood size
3026  (top) and
3027the dispersal intensity (bottom) are shown here.}
3028\label{fig:phyrextrace}
3029\end{figure}
3030
3031\subsubsection{Citing PhyREX}
3032
3033The ``default citation'' is:
3034
3035\begin{itemize}
3036\item Guindon S.,  H Guo, D Welch. ``Demographic inference under the coalescent in a spatial
3037  continuum'', {\it Theoretical Population Biology}, 2016, 111:43-50.
3038\end{itemize}
3039
3040\section{Recommendations on program usage}\label{sec:progusage}
3041
3042\subsection{PhyML}
3043
3044The choice of the  tree searching algorithm among those provided by PhyML  is generally a tough one.
3045The  fastest option  relies  on local  and simultaneous  modifications  of the  phylogeny using  NNI
3046moves. More  thorough explorations of  the space  of topologies are  also available through  the SPR
3047options.  As these  two classes of tree topology moves involve  different computational burdens, it
3048is important to determine which option is the most suitable for the type of data set or analysis one
3049wants to perform. Below is a list of recommendations for typical phylogenetic analyses.
3050
3051\begin{enumerate}
3052\item {\em Single data set, unlimited computing time.} The best option here is probably to use a SPR
3053search (i.e., straight SPR of best of SPR and NNI).  If the focus is on estimating the relationships
3054between species,  it is a good  idea to use  more than one starting  tree to decrease the  chance of
3055getting stuck  in a  local maximum of  the likelihood  function.  Using NNIs  is appropriate  if the
3056analysis does not mainly focus on  estimating the evolutionary relationships between species (e.g. a
3057tree is needed to  estimate the parameters of codon-based models later  on).  Branch supports can be
3058estimated using bootstrap and approximate likelihood ratios.
3059
3060\item {\em  Single data set, restricted  computing time.}  The  three tree searching options  can be
3061used depending on  the computing time available and the  size of the data set.   For small data sets
3062(i.e., $<$ 50 sequences),  NNI will generally perform well provided that  the phylogenetic signal is
3063strong.  It  is relevant  to estimate a  first tree  using NNI moves  and examine  the reconstructed
3064phylogeny in order to have a rough idea  of the strength of the phylogenetic signal (the presence of
3065small internal  branch lengths  is generally  considered as a  sign of  a weak  phylogenetic signal,
3066specially when  sequences are  short).  For larger  data sets  ($>$ 50 sequences),  a SPR  search is
3067recommended if there  are good evidence of  a lack of phylogenetic signal.   Bootstrap analysis will
3068generally  involve  large  computational  burdens.   Estimating branch  supports  using  approximate
3069likelihood ratios therefore provides an interesting alternative here.
3070
3071\item {\em  Multiple data  sets, unlimited computing  time.} Comparative genomic  analyses sometimes
3072rely on building phylogenies from the analysis of  a large number of gene families.  Here again, the
3073NNI option is the most  relevant if the focus is not on recovering the  most accurate picture of the
3074evolutionary relationships  between species.   Slower SPR-based heuristics  should be used  when the
3075topology of the tree is an important parameter of the analysis (e.g., identification of horizontally
3076transferred genes using phylogenetic tree comparisons).   Internal branch support is generally not a
3077crucial parameter of the multiple data  set analyses. Using approximate likelihood ratio is probably
3078the best choice here.
3079
3080\item {\em Multiple data sets, limited computing time.}  The large amount of data to be processed in
3081a  limited time  generally  requires  the use  of  the fastest  tree  searching  and branch  support
3082estimation methods Hence,  NNI and approximate likelihood ratios rather  than SPR and non-parametric
3083bootstrap are generally the most appropriate here.
3084\end{enumerate}
3085
3086Another important  point is the  choice of the  substitution model. While default  options generally
3087provide acceptable results, it is often warranted to perform a pre-analysis in order to identify the
3088best-fit substitution model.  This pre-analysis can be done using popular software such as Modeltest
3089\cite{posada98} or ProtTest  \cite{abascal05} for instance.  These programs  generally recommend the
3090use of a discrete gamma distribution to model the substitution process as variability of rates among
3091sites is a common  feature of molecular evolution.  The choice of the number  of rate classes to use
3092for this  distribution is  also an important  one. While the  default is  set to four  categories in
3093PhyML, it is  recommended to use larger number  of classes if possible in order  to best approximate
3094the  patterns of rate  variation across  sites \cite{galtier04}.   Note however  that run  times are
3095directly proportional to  the number of classes  of the discrete gamma distribution.   Here again, a
3096pre-analysis with the  simplest model should help the  user to determine the number  of rate classes
3097that represents the best trade-off between computing time and fit of the model to the data.
3098
3099
3100\subsection{PhyTime}\label{sec:recomphytime}
3101
3102Analysing a data set using PhyTime should  involve three steps based on the following questions: (1)
3103do the priors seem to be adequate (2) can I use the fast approximation of the likelihood and (3) how
3104long shall I run the program for? I explain below how to provide answers to these questions.
3105
3106\begin{itemize}
3107\item {\em  Are the priors adequate?} Bayesian  analysis relies on  specifiying the joint
3108prior density  of model parameters.  In  the case of  node age estimation, these  priors essentially
3109describe how rates of substitution vary across lineages and the probabilistic distribution that node
3110ages have when  ignoring the information provided  by the genetic sequences. These  priors vary from
3111tree to tree. It is therefore essential to  check the adequacy of priors for each user-defined input
3112tree. In order to do so, PhyTime needs to be run with the \x{--no\_data} option. When this option is
3113required, the  sequence data provided  as input will  be ignored and the  rest of the  analysis will
3114proceed  normally. The  prior distribution  of  model parameters,  essentially edge  rates and  node
3115heights, can then be  checked using the program Tracer as one would  do for the standard `posterior'
3116analysis.
3117
3118\item {\em  Can I use the  fast approximation to the  likelihood?} The suface  of the log-likelihood
3119function can  be approximated using  a multivariate normal  density.  This technique is  saving very
3120substantial amounts  of computation  time. However, like  most approximations, there  are situations
3121where it does not provide a good fit to the actual function. This usually happens when the phylogeny
3122displays  a lot  of short  branches, i.e.,  the  signal conveyed  by the  sequences is  weak. It  is
3123therefore important to first check whether  using the approximate likelihood is reasonable. In order
3124to do  so, it is  recommended to first  run the program without  the approximation, i.e.,  using the
3125default settings. Once  the minimum value of the ESS  of node ages (the last column  on the right of
3126the  standard output)  has reached  40-50, open  the \x{phytime.XXXX}  output file  with  Tracer and
3127examine the correlation  between the exact and approximate likelihood values.  If the correlation is
3128deemed to be good enough, PhyTime can be re-run using the \x{--fast\_lk} option, which uses the fast
3129normal approximation to the likelihood function.
3130
3131
3132
3133% Figure \ref{fig:approxbad} gives an example where  the correlation is too weak and the approximation
3134% of  the  likelihood  should be  avoided.  Figure  \ref{fig:approxbad}  gives  an example  where  the
3135% approximation  is  good enough.   The  current  execution of  PhyTime  can  be  terminated and  then
3136% re-launched using the \x{--fast\_lk} option.
3137
3138% \begin{figure}
3139% \begin{center}
3140% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_bad.eps}}
3141% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated
3142%   (Y-axis) and the exact (X-axis) likelihoods is weak here. The exact likelihood should be used (option \x{fastlk=no}).}
3143% \label{fig:approxbad}
3144% \end{center}
3145% \end{figure}
3146
3147
3148% \begin{figure}
3149% \begin{center}
3150% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_good.eps}}
3151% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated
3152%   (Y-axis) and the exact (X-axis) likelihoods is good. The approximation of the  likelihood can be used (option \x{fastlk=yes}).}
3153% \label{fig:approxgood}
3154% \end{center}
3155% \end{figure}
3156
3157\item {\em How  long shall I run the program  for?} PhyTime should be run long  enough such that the
3158ESS of  each parameter  is `large enough'.  The last column  on the  right handside of  the standard
3159output gives the minimum ESS across all internal  node heights. It is recommended to run the program
3160so that this number reaches at least 100.
3161
3162\end{itemize}
3163
3164
3165\section{Frequently asked questions}
3166
3167\begin{enumerate}
3168\item {\it PhyML crashes before reading the sequences. What's wrong ?}\\
3169\begin{itemize}
3170\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output}
3171\item The carriage return characters in your sequence files are not recognized by PhyML. You must
3172  make sure that your sequence file is a plain text file, with standard carriage return characters (i.e.,
3173  corresponding to ``$\backslash$\x{n}'', or ``$\backslash$\x{r}'')
3174\end{itemize}
3175\item {\it The program crashes after reading the sequences. What's wrong ?}\\
3176\begin{itemize}
3177\item You analyse protein sequences and did not enter the \x{-d aa} option in the command-line.
3178\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output}
3179\end{itemize}
3180\item {\it Does PhyML handle outgroup sequences ?}\\
3181\begin{itemize}
3182\item Yes, it does. Outgroup taxa are identified by adding the `*' sign at the end of each
3183  corresponding sequence name (see Section \ref{sec:outgroupspecify})
3184\end{itemize}
3185\item {\it Does PhyML estimate clock-constrained trees ?}\\
3186\begin{itemize}
3187\item No, the PhyML program does not estimate clock-contrained trees. One can however use the
3188  program PhyTime to perform such analysis but the tree topology will not be estimated.
3189\end{itemize}
3190\item {\it Can PhyML analyse partitioned data, such as multiple gene sequences ?}\\
3191\begin{itemize}
3192\item We are currently  working on this topic.  Future releases of  the program will provide options
3193to estimate  trees from phylogenomic data sets,  with the opportunity to  use different substitution
3194models on  the different data partitions (e.g.,  different genes). PhyML will  also include specific
3195algorithms to search the space of tree topologies for this type of data.
3196\end{itemize}
3197\end{enumerate}
3198
3199
3200
3201\section{Acknowledgements}
3202The development of PhyML since 2000 has been supported by the Centre National de la Recherche
3203Scientifique (CNRS) and the Minist\`ere de l'\'Education Nationale.
3204
3205\bibliographystyle{./naturemag}
3206\bibliography{./ref.bib}
3207
3208\printindex
3209\end{document}
3210
3211