1\documentclass[a4paper,12pt]{article} 2\usepackage{fancyvrb} 3\usepackage{amsfonts} 4\usepackage{graphicx} 5\usepackage{tabularx} 6% \usepackage{color} 7\usepackage{xcolor} 8\usepackage{psfrag} 9\usepackage[hyphens]{url} 10% \usepackage{vmargin} 11\usepackage{cite} 12\usepackage{caption2} 13\usepackage{hyperref} 14\usepackage{makeidx} 15\usepackage[fleqn]{amsmath} 16\renewcommand{\captionlabeldelim}{.} 17\def\thesection{\arabic{section}} 18\renewcommand{\thefigure}{\arabic{figure}} 19\renewcommand{\thetable}{\arabic{table}} 20\newcommand{\hl}{\noalign{\vskip3pt}\hline\noalign{\vskip3pt}} 21\newcommand{\hrf}{\hrulefill} 22\newcommand{\tc}[1]{\textcolor{black}{#1}} 23\newcommand{\dc}[1]{\textcolor{green}{#1}} 24% \usepackage[none]{hyphenat} 25\newcommand{\rep}[3][1] 26{ 27 \psfrag{#2}[c][c][#1]{#3} 28} 29\newcommand{\x}[1]{\texttt{#1}} 30\newcommand{\sfv}{$\Lambda$V} 31 32\DeclareMathOperator*{\Ex}{\mathbb{E}} 33\newcommand{\Var}{\mathrm{Var}} 34 35% \setpapersize{A4} 36% \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red,linkbordercolor=000} 37\hypersetup{colorlinks=true,linkcolor=blue,urlcolor=orange} 38\renewcommand{\baselinestretch}{1.} 39\makeindex 40\begin{document} 41\sloppy 42\begin{center} 43\thispagestyle{empty} 44\vfill\vfill 45% \rule{\linewidth}{0.02cm}\\ 46{\Huge \textbf{ P~h~y~M~L~~--~~M~a~n~u~a~l}} 47\vspace{-0.4cm}\\ 48% \rule{\linewidth}{0.02cm}\\ 49\vfill 50{\huge Version 3.0 \\ 51\today 52\vfill 53\normalsize 54\href{https://github.com/stephaneguindon/phyml}{https://github.com/stephaneguindon/phyml}\\ 55\vspace{0.4cm} 56\href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}} 57\end{center} 58\clearpage 59\tableofcontents 60\clearpage 61 62{\par 63\small 64\noindent 65\copyright Copyright 1999 - 2008 by PhyML Development Team.\\ 66\noindent The software PhyML is provided ``as is'' without warranty of any kind. In no event shall 67the authors or his employer be held responsible for any damage resulting from the use of this 68software, including but not limited to the frustration that you may experience in using the package. 69All parts of the source and documentation except where indicated are distributed under 70the GNU public licence. See http://www.opensource.org for details. 71 72} 73 74{ 75\noindent 76\setlength{\baselineskip}{0.5\baselineskip} 77\section{Availability} 78\begin{itemize} 79\item Binaries: \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml} 80\item Sources: \href{http://stephaneguindon.github.io/phyml-downloads/}{http://stephaneguindon.github.io/phyml-downloads/} 81\item Discussion forum: \href{http://groups.google.com/group/phyml-forum}{http://groups.google.com/group/phyml-forum} 82\end{itemize} 83} 84 85{ 86\noindent 87\setlength{\baselineskip}{0.7\baselineskip} 88\section{Authors} 89\begin{itemize} 90\item { St\'ephane Guindon} and { Olivier Gascuel} conceived the original PhyML algorithm. 91\item { St\'ephane Guindon} conceived the PhyTime method. 92\item { St\'ephane Guindon, David Welch and Louis Ranjard} conceived the PhyloGeo method. 93\item { St\'ephane Guindon, Wim Hordjik} and { Olivier Gascuel} conceived the SPR-based tree search algorithm. 94\item { Maria Anisimova} and { Olivier Gascuel} conceived the aLRT method for branch support. 95\item { St\'ephane Guindon, Franck Lethiec}, Jean-Francois Dufayard and Vincent Lefort implemented PhyML. 96\item St\'ephane Guindon implemented PhyloGeo, PhyREX and PhyTime. 97\item { Jean-Francois Dufayard} created the benchmark and implemented the tools that are used to check 98 PhyML accuracy and performances. 99\item { Vincent Lefort, St\'ephane Guindon, Patrice Duroux} and { Olivier Gascuel} conceived and 100 implemented PhyML web server. 101\item { Imran Fanaswala} interfaced PhyML with BEAGLE. 102\item St\'ephane Guindon wrote this document. 103\end{itemize} 104} 105\clearpage 106 107% \section{ML in phylogenetics: the basics.} 108\section{Overview} 109 110PhyML \cite{guindon03} is a software package which primary task that is to estimate maximum 111likelihood phylogenies from alignments of nucleotide or amino-acid sequences. It provides a wide 112range of options that were designed to facilitate standard phylogenetic analyses. The main 113strength of PhyML lies in the large number of substitution models coupled to various options to 114search the space of phylogenetic tree topologies, going from very fast and efficient methods to 115slower but generally more accurate approaches. It also implements two methods to evaluate branch 116supports in a sound statistical framework (the non-parametric bootstrap and the approximate 117likelihood ratio test). 118 119PhyML was designed to process moderate to large data sets. In theory, alignments with up to 4,000 120sequences 2,000,000 character-long can analyzed. In practice however, the amount of memory required 121to process a data set is proportional of the product of the number of sequences by their length. 122Hence, a large number of sequences can only be processed provided that they are short. Also, PhyML 123can handle long sequences provided that they are not numerous. With most standard personal 124computers, the ``comfort zone'' for PhyML generally lies around 100-500 sequences less than 10,000 125character long. For larger data sets, we recommend using other softwares such as RAxML 126\cite{raxml}\index{RAxML} or GARLI \cite{garli}\index{GARLI} or Treefinder 127(\href{http://www.treefinder.de}{http://www.treefinder.de}). 128 129 130\section{Bug report}\index{bug} 131 132While PhyML is, of course, bug-free (!) (please read the disclaimer carefuly...), if you ever come 133across an issue, please feel free to report it using the discuss group web site at the following 134address: \url{https://groups.google.com/forum/?fromgroups#!forum/phyml-forum}. Alternatively, you 135can send an email to \url{s.guindon@auckland.ac.nz}. Do not forget to mention the version of PhyML 136and program options you are using. 137 138 139\section{Installing PhyML} 140 141\subsection{Sources and compilation}\index{compilation} 142 143The sources of the program are available free of charge from \url{http://stephaneguindon.github.io/phyml-downloads/}. The compilation on UNIX-like 144systems is fairly standard. It is described in the `\x{INSTALL}' file that comes with the 145sources. In a command-line window, go to the directory that contains the sources and type: 146 147{\setlength{\baselineskip}{0.5\baselineskip} 148\begin{verbatim} 149./configure; 150make clean; 151make V=0; 152\end{verbatim} 153} 154 155By default, PhyML will be compiled with optimization flags turned on. It is possible to generate a 156version of PhyML that can run through a debugging tool (such as \x{ddd}\label{ddd}) or a profiling 157tool (such as \x{gprof}\label{gprof}) using the following instructions: 158 159{\setlength{\baselineskip}{0.5\baselineskip} 160\begin{verbatim} 161./configure --enable-debug; 162make clean; 163make V=0; 164\end{verbatim} 165} 166 167% {\em Note} -- when PhyML is going to be used mostly of exclusively in batch mode, it is preferable 168% to turn on the batch mode option in the Makefile. In order to do so, the file \x{Makefile.am} needs 169% to be modified: add \x{-DBATCH} to the line with \x{DEFS=-DUNIX -D\$(PROG) -DDEBUG}. 170 171 172 173\subsection{Installing PhyML on UNIX-like systems (including Mac OS)} 174 175Copy PhyML binary file in the directory you like. For the operating system to be able to locate the 176program, this directory must be specified in the global variable \x{PATH}. In order to achieve this, 177you will have to add \x{export PATH="/your\_path/:\${PATH}"} to the \x{.bashrc} or the 178\x{.bash\_profile} located in your home directory (\x{your\_path} is the path to the directory that 179contains PhyML binary). 180 181 182\subsection{Installing PhyML on Microsoft Windows}\label{sec:install-windows} 183 184Copy the files \x{phyml.exe} and \x{phyml.bat} in the same directory. To launch PhyML, click on the 185icon corresponding to \x{phyml.bat}. Clicking on the icon for \x{phyml.exe} works too but the 186dimensions of the window will not fit PhyML PHYLIP-like interface. 187 188\subsection{Installing the parallel version of PhyML}\label{sec:MPI}\index{MPI}\index{bootstrap!parallel} 189 190Bootstrap analysis can run on multiple processors. Each processor analyses one bootstraped dataset. 191Therefore, the computing time needed to perform $R$ bootstrap replicates is divided by the number of 192processors available. 193 194This feature of PhyML relies on the MPI (Message Passing Interface) library. To use it, your 195computer must have MPI installed on it. In case MPI is not installed, you can dowload it from 196\href{http://www.mcs.anl.gov/research/projects/mpich2/}{http://www.mcs.anl.gov/research/projects/mpich2/}. 197Once MPI is installed, it is necessary to launch the MPI daemon. This can be done by entering the 198following instruction: \x{mpd \&}. Note however that in most cases, the MPI daemon will already be 199running on your server so that you most likely do not need to worry about this. You can then just 200go in the \x{phyml/} directory (the directory that contains the \x{src/}, \x{examples/} and \x{doc/} 201folders) and enter the commands below: 202 203{\setlength{\baselineskip}{0.5\baselineskip} 204\begin{verbatim} 205./configure --enable-mpi; 206make clean; 207make; 208\end{verbatim} 209} 210 211A binary file named \x{phyml-mpi} has now been created in the \x{src/} directory and is ready to use 212with MPI. A typical MPI command-line which uses 4 CPUs is given below: 213 214{\setlength{\baselineskip}{0.5\baselineskip} 215\begin{verbatim} 216mpirun -n 4 ./phyml-mpi -i myseq -b 100 217\end{verbatim} 218} 219 220\noindent Please read section \ref{sec:parallel_bootstrap} of this document for more information. 221 222\subsection{Installing PhyML-BEAGLE}\label{sec:install-phyml-beagle} 223 224{\em Note: I haven't found the time nor the resources to make sure the code of PhyML stays 225 compatible with the BEAGLE library. Please do not hesitate at all to contact me if you'd 226 like to contribute here.} 227\\ 228\\ 229PhyML can use the BEAGLE\cite{ayres12} library for the likelihood computation. BEAGLE 230provides provides significant speed-up: the single core version of PhyML-BEAGLE can be up to 10 231times faster than PhyML on a single core and up to 150 times on Graphical Processing 232Units. PhyML-BEAGLE will eventually have of the features of PhyML, even though at the moment the 233boostrap and the invariant site options are not available. Also, please note that in some cases, the 234final log-likelihood reported by PhyML and PhyML-BEALGE may not exactly match, though the 235differences observed are very minor (in the 10$^{-4}$ to 10$^{-4}$ range). 236 237In order to install PhyML-BEAGLE, you first need to download and install the BEAGLE library 238available from \url{https://code.google.com/p/beagle-lib/}. Then run the following commands: 239 240{\setlength{\baselineskip}{0.5\baselineskip} 241\begin{verbatim} 242./configure --enable-beagle; 243make clean; 244make; 245\end{verbatim} } A binary file named \x{phyml-beagle} will be created in the \x{src/} directory. The 246interface to \x{phyml-beagle} (i.e., commandline option of PHYLIP-like interface) is exactly 247identical to that of PhyML. 248 249\section{Program usage.}\label{sec:phyml_new} 250 251PhyML has three distinct user-interfaces. The first corresponds to a PHYLIP-like text interface 252that makes the choice of the options self-explanatory. The command-line interface is well-suited for 253people that are familiar with PhyML options or for running PhyML in batch mode. The XML interface is 254more sophisticated. It allows the user to analyse partitionned data using flexible mixture models of evolution. 255 256\subsection{PHYLIP-like interface} 257 258The default is to use the PHYLIP-like text interface by simply typing `\x{phyml}' in a command-line 259window or by clicking on the PhyML icon (see Section \ref{sec:install-windows}). After entering the 260name of the input sequence file, a list of sub-menus helps the users set up the analysis. There 261are currently four distinct sub-menus: 262 263% \begin{figure} 264% \resizebox{15cm}{9cm}{\includegraphics{./fig/interface.eps}} 265% \caption{PHYLIP-like interface to PhyML.} 266% \label{fig:interface} 267% \end{figure} 268 269\begin{enumerate} 270 271\item {\em Input Data}: specify whether the input file contains amino-acid or nucleotide 272sequences. What the sequence format is (see Section \ref{sec:input_output}) and how many data sets 273should be analysed. 274 275\item {\em Substitution Model}: selection of the Markov model of substitution. 276 277\item {\em Tree Searching}: selection of the tree topology searching algorithm. 278 279\item {\em Branch Support}: selection of the method that is used to measure branch support. 280 281\end{enumerate} 282\noindent `\x{+}' and `\x{-}' keys are used to move forward and backward in the sub-menu list. Once 283the model parameters have been defined, typing `\x{Y}' (or `\x{y}') launches the calculations. The 284meaning of some options may not be obvious to users that are not familiar with phylogenetics. In 285such situation, we strongly recommend to use the default options. As long as the format of the input 286sequence file is correctly specified (sub-menu {\em Input data}), the safest option for non-expert 287users is to use the default settings. The different options provided within each sub-menu are 288described in what follows. 289 290 291\subsubsection{Input Data sub-menu} 292 293\begin{center}\framebox{\x{[D] ............................... Data type (DNA/AA)}} \end{center} 294Type of data in the input file. It can be either DNA or amino-acid sequences in PHYLIP format (see 295Section \ref{sec:input_output}). Type \x{D} to change settings. 296 297\vspace{0.7cm} 298\begin{center} \framebox{\x{[I] ...... Input sequences interleaved (or sequential)}} \end{center} 299PHYLIP format comes in two flavours: interleaved or sequential (see Section 300\ref{sec:input_output}). Type \x{I} to selected among the two formats. 301 302\vspace{0.7cm} 303\begin{center} \framebox{\x{[M] ....................... Analyze multiple data sets}} \end{center} 304If the input sequence file contains more than one data sets, PhyML can analyse each of them 305in a single run of the program. Type \x{M} to change settings. 306 307\vspace{0.7cm} 308\begin{center} \framebox{\x{[R] ............................................ Run ID}} \end{center} 309This option allows you to append a string that identifies the current PhyML run. Say for instance 310that you want to analyse the same data set with two models. You can then `tag' the first PhyML run 311with the name of the first model while the second run is tagged with the name of the second model.\index{run ID} 312 313 314 315\subsubsection{Substitution model sub-menu}\label{sec:submenus} 316 317\begin{center} \framebox{\x{[M] ................. Model of nucleotide substitution}} \end{center} 318\begin{center} \framebox{\x{[M] ................ Model of amino-acids substitution}} \end{center} 319PhyML implements a wide range of substitution models: JC69 \cite{jukes69}, K80 \cite{kimura80}, F81 320\cite{felsenstein81a}, F84 \cite{phylip2}, HKY85 \cite{hasegawa85}, TN93 \cite{tamura93} GTR 321\cite{lanave84,tavare86} and custom for nucleotides; LG \cite{le08}, WAG \cite{whelan01b}, Dayhoff 322\cite{dayhoff78}, JTT \cite{jones92}, Blosum62 \cite{henikoff92}, mtREV \cite{adachi96}, rtREV 323\cite{dimmic02}, cpREV \cite{adachi00}, DCMut \cite{kosiol04}, VT \cite{muller00} and mtMAM 324\cite{cao98} and custom for amino acids. Cycle through the list of nucleotide or amino-acids 325substitution models by typing \x{M}. Both nucleotide and amino-acid lists include a `custom' model. 326The custom option provides the most flexible way to specify the nucleotide substitution model. The 327model is defined by a string made of six digits. The default string is `\x{000000}', which means 328that the six relative rates of nucleotide changes: $A \leftrightarrow C$, $A \leftrightarrow G$, $A 329\leftrightarrow T$, $C \leftrightarrow G$, $C \leftrightarrow T$ and $G \leftrightarrow T$, are 330equal. The string `\x{010010}' indicates that the rates $A \leftrightarrow G$ and $C 331\leftrightarrow T$ are equal and distinct from $A \leftrightarrow C = A \leftrightarrow T = C 332\leftrightarrow G = G \leftrightarrow T$. This model corresponds to HKY85 (default) or K80 if the 333nucleotide frequencies are all set to 0.25. `\x{010020}' and `\x{012345}' correspond to TN93 and 334GTR models respectively. The digit string therefore defines groups of relative substitution rates. 335The initial rate within each group is set to 1.0, which corresponds to F81 (JC69 if the base 336frequencies are equal). Users also have the opportunity to define their own initial rate values. 337These rates are then optimised afterwards (option `\x{O}') or fixed to their initial values. The 338custom option can be used to implement all substitution models that are special cases of 339GTR. Table \ref{tab:modelcode} on page \pageref{tab:modelcode} gives the correspondence between the `standard' name of the model 340(see \url{http://mbe.oxfordjournals.org/content/18/6/897/F2.large.jpg}) and the custom model code. 341The custom model also exists for protein sequences. It is useful when one wants to use an amino-acid 342substitution model that is not hard-coded in PhyML. The symmetric part of the rate matrix, as well 343as the equilibrium amino-acid frequencies, are given in a file which name is given as input of the 344program. The format of this file is described in the section \ref{sec:customaa}. 345 346\vspace{0.7cm} 347\begin{center} \framebox{\x{[F] ................. Optimise equilibrium frequencies}} \end{center} 348\begin{center} \framebox{\x{[E] ......... Equilibrium frequencies (empirical/user)}} \end{center} 349\begin{center} \framebox{\x{[F] . Amino acid frequencies (empirical/model defined)}} \end{center} 350For nucleotide sequences, optimising equilibrium frequencies means that the values of these 351parameters are estimated in the maximum likelihood framework. When the custom model option is 352selected, it is also possible to give the program a user-defined nucleotide frequency distribution 353at equilibrium (option \x{E}). For protein sequences, the stationary amino-acid frequencies are 354either those defined by the substitution model or those estimated by counting the number of 355different amino-acids observed in the data. Hence, the meaning of the \x{F} option depends on the 356type of the data to be processed. 357 358\vspace{0.7cm} 359\begin{center} \framebox{\x{[T] .................... Ts/tv ratio 360(fixed/estimated)}} \end{center}\index{$\kappa$}\index{ts/tv ratio} Fix or estimate the 361transition/transversion ratio in the maximum likelihood framework. This option is only available 362when DNA sequences are to be analysed under K80, HKY85 or TN93 models. The definition given to this 363parameter by PhyML is the same as PAML's\index{PAML} one. Therefore, the value of this parameter 364does {\it not} correspond to the ratio between the expected number of transitions and the expected 365number of transversions during a unit of time. This last definition is the one used in 366PHYLIP\index{PHYLIP}. PAML's manual gives more detail about the distinction between the two 367definitions (\url{http://abacus.gene.ucl.ac.uk/software/paml.html}). 368 369\vspace{0.7cm} 370\begin{center} \framebox{\x{[V] . Proportion of invariable sites 371(fixed/estimated)}} \end{center}\index{invariable sites}\index{proportion of invariants} The 372proportion of invariable sites, i.e., the expected frequency of sites that do not evolve, can be 373fixed or estimated. The default is to fix this proportion to 0.0. By doing so, we consider that each 374site in the sequence may accumulate substitutions at some point during its evolution, even if no 375differences across sequences are actually observed at that site. Users can also fix this parameter 376to any value in the $[0.0,1.0]$ range or estimate it from the data in the maximum-likelihood 377framework. 378 379\vspace{0.7cm} 380\index{gamma distribution (discrete)!mean vs. median} 381\index{gamma distribution (discrete)!number of categories} 382\index{gamma distribution (discrete)!shape parameter} 383\begin{center} \framebox{\x{[R] ....... One category of substitution rate (yes/no)}} \end{center} 384\begin{center} \framebox{\x{[C] ........... Number of substitution rate categories}} \end{center} 385\begin{center} \framebox{\x{[A] ... Gamma distribution parameter (fixed/estimated)}} \end{center} 386\begin{center} \framebox{\x{[G] .........`Middle' of each rate class (mean/median)}} \end{center} 387Rates of evolution often vary from site to site. This heterogeneity can be modelled using a discrete 388gamma distribution. Type \x{R} to switch this option on or off. The different categories of this 389discrete distribution correspond to different (relative) rates of evolution. The number of 390categories of this distribution is set to 4 by default. It is probably not wise to go below this 391number. Larger values are generally preferred. However, the computational burden involved is 392proportional to the number of categories (i.e., an analysis with 8 categories will generally take 393twice the time of the same analysis with only 4 categories). Note that the likelihood will not 394necessarily increase as the number of categories increases. Hence, the number of categories should 395be kept below a ``reasonable'' number, say 20. The default number of categories can be changed by 396typing \x{C}. 397 398The middle of each discretized substitution rate class can be determined using the mean or the 399median. PAML, MrBayes and RAxML use the mean. However, the median is generally associated with 400greater likelihoods than the mean. This conclusion is based on our analysis of several real-world 401data sets extracted from TreeBase. Despite this, the default option in PhyML is to use the mean in 402order to make PhyML likelihoods comparable to those of other phylogenetic software. One must bare 403in mind that {\color{red}{likelihoods calculated with the mean approximation are not directly 404comparable to the likelihoods calculated using the median approximation}}. 405 406The shape of the gamma distribution determines the range of rate variation across sites. Small 407values, typically in the $[0.1,1.0]$ range, correspond to large variability. Larger values 408correspond to moderate to low heterogeneity. The gamma shape parameter can be fixed by the user or 409estimated via maximum-likelihood. Type \x{A} to select one or the other option. 410 411 412\subsubsection{Tree searching sub-menu} 413 414\begin{center} \framebox{\x{[O] ........................... Optimise tree topology}} \end{center} By 415default the tree topology is optimised in order to maximise the likelihood. However, it is also 416possible to avoid any topological alteration. This option is useful when one wants to compute the 417likelihood of a tree given as input (see below). Type \x{O} to select among these two options. 418 419\vspace{0.7cm} 420\begin{center} \framebox{\x{[S] .................. Tree topology search operations}} \end{center} 421PhyML proposes three different methods to estimate tree topologies. The default approach is to use 422simultaneous NNI. This option corresponds to the original PhyML algorithm \cite{guindon03}. The 423second approach relies on subtree pruning and regrafting (SPR). It generally finds better tree 424topologies compared to NNI but is also significantly slower. The third approach, termed BEST, 425simply estimates the phylogeny using both methods and returns the best solution among the two. Type 426\x{S} to choose among these three choices. 427 428\vspace{0.7cm} 429\begin{center} \framebox{\x{[R] ......................... Use random starting tree}} \end{center} 430\begin{center} \framebox{\x{[N] .................. Number of random starting trees}} \end{center} 431When the SPR or the BEST options are selected, is is possible to use random trees rather than BioNJ 432or a user-defined tree, as starting tree. If this option is turned on (type \x{R} to change), five 433trees, corresponding to five random starts, will be estimated. The output tree file will contain the 434best tree found among those five. The number of random starts can be modified by typing 435\x{N}. Setting the number of random starting trees to $N$ means that the analysis will take 436(slightly more than) $N$ times the time required for a standard analysis where only one (BioNJ) 437starting tree is used. However, the analysis of real data sets shows that the best trees estimated 438using the random start option almost systematically have higher likelihoods than those inferred 439using a single starting tree. 440 441\vspace{0.7cm} 442\begin{center} \framebox{\x{[U] ........ Starting tree (BioNJ/parsimony/user 443tree)}} \end{center}\index{BioNJ} When the tree topology optimisation option is turned on, PhyML 444proceeds by refining an input tree. By default, this input tree is estimated using BioNJ 445\cite{gascuelNJ}. The alternative option is to use a parsimony tree. We found this option specially 446useful when analysing large data sets with NNI moves as it generally leads to greater likelihoods 447than those obtained when starting from a BioNJ trees. The user can also to input her/his own 448tree. This tree should be in Newick format (see Section \ref{sec:input_output}). This option is 449useful when one wants to evaluate the likelihood of a given tree with a fixed topology, using 450PhyML. Type \x{U} to choose among these two options. 451 452\subsubsection{Branch support sub-menu} 453 454\begin{center} \framebox{\x{[B] ................ Non parametric bootstrap analysis}} \end{center} 455The support of the data for each internal branch of the phylogeny can be estimated using 456non-parametric bootstrap. By default, this option is switched off. Typing \x{B} switches on the 457bootstrap analysis. The user is then prompted for a number of bootstrap replicates. The largest this 458number the more precise the bootstrap support estimates are. However, for each bootstrap replicate 459a phylogeny is estimated. Hence, the time needed to analyse $N$ bootstrap replicates corresponds to 460$N$-times the time spent on the analysis of the original data set. $N=100$ is generally considered 461as a reasonable number of replicates. 462 463\begin{center} \framebox{\x{[A] ................ Approximate likelihood ratio test}} \end{center} 464When the bootstrap option is switched off (see above), approximate likelihood branch supports are 465estimated. This approach is considerably faster than the bootstrap one. However, both methods intend 466to estimate different quantities and conducting a fair comparison between both criteria is not 467straightforward. The estimation of approximate likelihood branch support comes in multiple 468flavours. The default is set to aBayes, corresponding to the approximate Bayes method described in 469\cite{anisimova11}. The approximate likelihood ratio test (aLRT) \cite{anisimova06}, 470Shimodaira–Hasegawa aLRT (SH-aLRT) statistics are the other available options. 471 472 473\subsection{Command-line interface} 474 475An alternative to the PHYLIP-like interface is the command-line interface. Users that do not need to 476modify the default parameters can launch the program with the `\x{phyml -i seq\_file\_name}' 477command. The list of all command line arguments and how to use them is given in the `Help' section 478which is displayed when entering the `\x{phyml --help}' command. The available command-line options 479are described in what follows. 480 481\begin{itemize} 482 483\item \x{-i} (or \x{--input}) \x{seq\_file\_name}\index{command-line options!\x{--input}} \\ 484\x{seq\_file\_name} is the name of the nucleotide or amino-acid sequence file in PHYLIP format. 485 486\item \x{-d} (or \x{--datatype}) \x{data\_type}\index{command-line options!\x{--data\_type}}\\ 487\x{data\_type} is \x{nt} for nucleotide (default) and \x{aa} for amino-acid sequences. 488 489\item \x{-q} (or \x{--sequential})\index{sequence format!interleaved}\index{sequence format!sequential}\index{command-line options!\x{--sequential}} \\ 490Changes interleaved format (default) to sequential format. 491 492\item \x{-n} (or \x{--multiple}) \x{nb\_data\_sets}\index{multiple data sets}\index{command-line options!\x{--multiple}}\\ 493\x{nb\_data\_sets} is an integer giving the number of data sets to analyse. 494 495\item \x{-p} (or \x{--pars})\index{command-line options!\x{--pars}}\\ 496Use a minimum parsimony starting tree. This option is taken into account when the `-u' option 497is absent and when tree topology modifications are to be done. 498 499\item \x{-b} (or \x{--bootstrap}) \x{int}\index{bootstrap}\index{command-line options!\x{--bootstrap}} 500\begin{itemize} 501\item \x{int} $>$ 0: \x{int} is the number of bootstrap replicates. 502\item \x{int} = 0: neither approximate likelihood ratio test nor bootstrap values are computed. 503\item \x{int} = -1: approximate likelihood ratio test returning aLRT statistics. 504\item \x{int} = -2: approximate likelihood ratio test returning Chi2-based parametric branch supports. 505% \item \x{int} = -3: minimum of Chi2-based parametric and SH-like branch supports. 506\item \x{int} = -4: SH-like branch supports alone. 507\item \x{int} = -5: (default) approximate Bayes branch supports. 508\end{itemize} 509 510\item \x{-m} (or \x{--model}) \x{model\_name}\index{substitution models!DNA}\index{substitution models!amino acids}\index{command-line options!\x{--model}} \\ 511\x{model\_name} : substitution model name. 512\begin{itemize} 513\item {\it Nucleotide-based models}: \x{HKY85} (default) \x{| JC69 | K80 | F81 | F84 | TN93 | GTR | 514 custom | r(ac),r(ag),r(at),r(cg),r(ct),r(gt)} \\ JC90, K80, F81. F84, TN93 and GTR denote the 515 ``standard'' substitution models. The \x{custom} option can be used to define a new substitution model. A string of six 516digits identifies the model. For instance, 000000 corresponds to F81 (or JC69 provided the 517distribution of nucleotide frequencies is uniform). 012345 corresponds to GTR. This option can be 518used for encoding any model that is a nested within GTR. See Section \ref{sec:submenus} and Table \ref{tab:modelcode}. {\em NOTE:} 519the substitution parameters of the custom model will be optimised so as to maximise the 520likelihood. Also, it is possible to specify and fix (i.e., avoid optimisation) the values of the 521substitution rates in the GTR model by passing the corresponding five rates to the \x{-m} 522option. Commas are used to separate the five relative rates and no blank space should be introduced here. 523 524\item {\it Amino-acid based models}: \x{LG} (default) \x{| WAG | JTT | MtREV | Dayhoff | DCMut | RtREV 525 | CpREV | VT | Blosum62 | MtMam | MtArt | HIVw | HIVb | custom} \\ 526The \x{custom} option is useful when one wants to use an amino-acid substitution model that is not 527available by default in PhyML. The symmetric part of the rate matrix, as well as the equilibrium 528amino-acid frequencies, are given in a file which name is asked for by the program. The format of 529this file is described in section \ref{sec:customaa}. 530\end{itemize} 531 532\begin{table}\index{custom models} 533\begin{center} 534\begin{tabular}{ll} 535\hline 536Name & Command-line option \\ 537\hline 538JC69 & \x{-m 000000 -f 0.25,0.25,0.25,0.25} \\ 539F81 & \x{-m 000000}\\ 540K80 & \x{-m 010010 -f 0.25,0.25,0.25,0.25} \\ 541HKY85 & \x{-m 010010}\\ 542TrNef & \x{-m 010020 -f 0.25,0.25,0.25,0.25} \\ 543TrN & \x{-m 010020}\\ 544K81 & \x{-m 123321 -f 0.25,0.25,0.25,0.25}\\ 545K81uf & \x{-m 123321}\\ 546TIMef & \x{-m 132241 -f 0.25,0.25,0.25,0.25}\\ 547TIM & \x{-m 132241}\\ 548TVMef & \x{-m 102304 -f 0.25,0.25,0.25,0.25}\\ 549TVM & \x{-m 102304}\\ 550SYM & \x{-m 123456 -f 0.25,0.25,0.25,0.25}\\ 551GTR & \x{-m 123456}\\ 552\hline 553\end{tabular} 554\caption{Nucleotide substitution model names (as defined in \cite{posada01}) and the corresponding 555 custom model code used in PhyML.}\label{tab:modelcode} 556\end{center} 557\end{table} 558 559\item \x{--aa\_rate\_file file\_name}\index{command-line options!\x{--aa\_rate\_file}} \\ 560This option is compulsory when analysing amino-acid sequences under a `custom' model (see above). \x{file\_name} 561should provide a rate matrix and equilibrium amino acid in PAML format (see Section \ref{sec:customaa}). 562 563\item \x{-f e}, \x{m}, or \x{fA,fC,fG,fT} \index{frequencies!nucleotide}\index{frequencies!amino-acid}\index{stationary frequencies}\index{command-line options!\x{-f}}\\ 564Nucleotide or amino-acid frequencies. 565\begin{itemize} 566\item \x{e} : the character frequencies are determined as follows : 567\begin{itemize} 568\item {\it Nucleotide sequences}: (Empirical) the equilibrium base frequencies are estimated by counting 569 the occurence of the different bases in the alignment. 570\item {\it Amino-acid sequences}: (Empirical) the equilibrium amino-acid frequencies are estimated by counting 571 the occurence of the different amino-acids in the alignment. 572\end{itemize} 573 574\item \x{m} : the character frequencies are determined as follows : 575\begin{itemize} 576\item {\it Nucleotide sequences}: (ML) the equilibrium base frequencies are estimated using maximum 577 likelihood. 578\item {\it Amino-acid sequences}: (Model) the equilibrium amino-acid frequencies are estimated using 579 the frequencies defined by the substitution model. 580\end{itemize} 581 582\item \x{fA,fC,fG,fT}: only valid for nucleotide-based models. \x{fA}, \x{fC}, \x{fG} and \x{fT} are floating numbers that 583correspond to the frequencies of A, C, G and T respectively. 584\end{itemize} 585 586\item \x{-t} (or \x{--ts/tv}) \x{ts/tv\_ratio} \index{$\kappa$}\index{ts/tv ratio}\index{command-line options!\x{--ts/tv}}\\ 587\x{ts/tv\_ratio}: transition/transversion ratio. DNA sequences only. Can be a fixed positive value 588(e.g., 4.0) or type \x{e} to get the maximum likelihood estimate. 589 590\item \x{-v} (or \x{--pinv}) \x{prop\_invar}\index{proportion of invariants}\index{invariable sites} \index{command-line options!\x{--pinv}}\\ 591\x{prop\_invar}: proportion of invariable sites. Can be a fixed value in the [0,1] range or type \x{e} to get the maximum likelihood estimate. 592 593\item \x{-c} (or \x{--nclasses}) \x{nb\_subst\_cat}\index{gamma distribution (discrete)!number of categories} \index{command-line options!\x{--nclasses}}\\ 594\x{nb\_subst\_cat}: number of relative substitution rate categories. Default: \x{nb\_subst\_cat=4}. Must be a positive integer. 595 596\item \x{-a} (or \x{--alpha}) \x{gamma} \index{gamma distribution (discrete)!shape parameter}\index{command-line options!\x{--alpha}} \\ 597\x{gamma}: value of the gamma shape parameter. Can be a fixed positive value or e to get the maximum 598likelihood estimate. The value of this parameter is estimated in the maximum likelihood framework by default. 599 600\item \x{--use\_median} \index{gamma distribution (discrete)!mean vs. median} \index{command-line options!\x{--use\_median}}\\ 601The middle of each substitution rate class in the discrete gamma distribution is taken as the 602median. The mean is used by default. 603 604\item \x{--free\_rates} or \x{--freerates} \index{command-line options!\x{--free\_rates}}\\ 605As an alternative to the discrete gamma model, it is possible to estimate the (relative) rate in 606each class of the (mixture) model and the corresponding frequencies directly from the data. This 607model, called the FreeRate model, has more parameters 608than the discrete gamma one but usually provides a significantly better fit to the data. See 609\cite{soubrier12} for more information about this model and an illustration of its use. 610 611\item \x{--il} \index{command-line options!\x{--il}}\\ \x{il} stands here for integrated (branch) 612 length. This model, described in \cite{guindon13} in the context of molecular dating, provides an 613 efficient way to implement the covarion model \index{covarion}. Under the integrated length (IL) model, the length 614 of each edge is described by a distribution of values, instead of a single value corresponding to 615 the expected number of substitutions per position along the sequence. Let $l_{a,s}$ and $l_{b,s}$ 616 be the number of substitutions at site $s$ along edges $a$ and $b$, and $l_{a,t}$ and $l_{b,t}$, 617 the number of substitutions at site $t$. Standard models have $l_{a,s}=l_{a,t}$ and $l_{b,s}=l_{b,t}$, 618 or $l_{a,s}=\alpha l_{a,t}$ and $l_{b,s}=\alpha l_{b,t}$ if rates vary across sites. The IL model 619 has instead $l_{a,s}= \alpha l_{a,t}$ and $l_{b,s}= \beta l_{b,t}$ with $\alpha \neq \beta$, 620 i.e. substitution rates vary across sites and edges. The IL approach is somehow an analytical 621 approximation to the covarion model that, unlike the covarion model, does not incur any computational overhead compared to the 622 traditional models. A notable difference with the plain vanilla covarion model and the IL model however is that 623 substitution rates are not autocorrelated along the phylogeny under the IL model. 624 625 626 627\item \x{--codpos} \x{1,2 or 3} \index{command-line options!\x{--codpos}}\\ 628When analysing an alignment of coding sequences, use this option to consider only the first, second 629or the third coding position for the estimation. 630 631\item \x{-s} (or \x{--search}) \x{move}\index{NNI}\index{SPR} \index{command-line options!\x{--search}}\\ 632Tree topology search operation option. Can be either \x{NNI} (default, fast) or \x{SPR} (usually 633slower than \x{NNI} but more accurate) or \x{BEST} (best of NNI and SPR search). 634 635\item \x{-u} (or \x{--inputtree}) \x{user\_tree\_file}\index{input tree}\index{user tree} \index{command-line options!\x{--inputtree}}\\ 636\x{user\_tree\_file}: starting tree filename. The tree must be in Newick format. 637 638\item \x{-o params}\index{optimisation!topology}\index{optimisation!substitution parameters} \index{command-line options!\x{-o}}\\ 639This option focuses on specific parameter optimisation. 640\begin{itemize} 641\item \x{params=tlr}: tree topology (\x{t}), branch length (\x{l}) and substitution rate parameters (\x{r}) are optimised. 642\item \x{params=tl}: tree topology and branch lengths are optimised. 643\item \x{params=lr}: branch lengths and substitution rate parameters are optimised. 644\item \x{params=l}: branch lengths are optimised. 645\item \x{params=r}: substitution rate parameters are optimised. 646\item \x{params=n}: no parameter is optimised. 647\end{itemize} 648 649\item \x{--rand\_start}\index{random tree} \index{command-line options!\x{--rand\_start}}\\ 650This option sets the initial tree to random. It is only valid if SPR searches are to be performed. 651 652\item \x{--n\_rand\_starts num} \index{command-line options!\x{--n\_rand\_starts}}\\ 653\x{num} is the number of initial random trees to be used. It is only valid if SPR searches are to be performed. 654 655\item \x{--r\_seed num}\index{random number} \index{command-line options!\x{--r\_seed}}\\ 656\x{num} is the seed used to initiate the random number generator. Must be an integer. 657 658\item \x{--print\_site\_lnl}\index{likelihood!print site likelihood} \index{command-line options!\x{--print\_site\_lk}}\\ 659Print the likelihood for each site in file *\_phyml\_lk.txt. For $\Gamma$ or $\Gamma$+I or 660FreeRate models, this option returns the posterior probability of each relative rate class at 661each site. Such information can then be used to identify fast- and slow-evolving regions of the alignment. 662 663\item \x{--print\_trace}\index{command-line options!\x{--print\_trace}}\\ 664Print each phylogeny explored during the tree search process in file *\_phyml\_trace.txt. This 665option can be useful for monitoring the progress of the analysis for very large data sets and have 666an approximate idea of what the final phylogeny will look like. 667 668\item \x{--json\_trace}\index{command-line options!\x{--json\_trace}}\\ 669Print each phylogeny explored during the tree search process in file *\_phyml\_json\_trace.txt in 670JSON format (see \url{http://www.json.org/}). This option can be useful for monitoring the progress of the analysis for very large data sets and have 671an approximate idea of what the final phylogeny will look like. 672 673 674\item \x{--run\_id ID\_string}\index{run ID} \index{command-line options!\x{--run\_id}}\\ 675Append the string ID\_string at the end of each PhyML output file. This option may be useful when 676running simulations involving PhyML. It can also be used to `tag' multiple analysis of the same data 677set with various program settings. 678 679\item \x{--no\_memory\_check} \index{command-line options!\x{--no\_memory\_check}}\\ 680By default, when processing a large data set, PhyML will pause and ask the user to confirm that 681she/he wants to continue with the execution of the analysis despite the large amount of memory 682required. The \x{--no\_memory\_check} skips this question. It is especially useful when running 683PhyML in batch mode. 684 685\item \x{--no\_colalias} \index{command-line options!\x{--no\_colalias}}\\ 686By default, PhyML preprocesses each alignment by putting together (or aliasing) the columns that are 687identical. Use this option to skip this step but be aware that the analysis might then take more 688time to complete. 689 690\item \x{--constrained\_lens} \index{command-line options!\x{--constrained\_lens}}\\ 691When an input tree with branch lengths is provided, this option will find the branch multiplier that 692maximises the likelihood (i.e., the relative branch lengths remain constant) 693 694\item \x{--constraint\_file} \x{file\_name} \index{command-line options!\x{--constraint\_file}}\\ 695\x{file\_name} lists the topological constraints under which the tree topology search is 696conducted. This option should be used in conjunction with \x{-u} \x{file\_name}. See Section 697\ref{sec:topoconstraints} for more information. 698 699\item \x{--quiet} \index{command-line options!\x{--quiet}}\\ 700Runs PhyML in quiet mode. The program will not pause if the memory required to run the analysis 701exceeds 256MB and will not output the progression of the log-likelihood scores on the standard output. 702 703\item \x{--ancestral} \index{command-line options!\x{--ancestral}}\\ 704PhyML calculates the marginal probabilities of each character state at each internal node and each 705site of the sequence alignment. It then uses the ``Minimum Posterior Expected Error'' criterion to 706infer ancestral sequences. Works for both nucleotide and amino-acid data. 707 708 709\item \x{--leave\_duplicates} \index{command-line options!\x{--leave\_duplicates}}\\ 710 PhyML removes duplicate sequences by default. Use this option if you do not want PhyML to discard 711 them (even though leaving these sequences in will slow down the analysis). 712 713\end{itemize} 714 715\subsection{XML interface} 716\begin{itemize} 717\item \x{--xml=xml\_file\_name}\index{command-line options!\x{--xml}} \\ 718\x{xml\_file\_name} is the name of the XML file containing the information required to run the 719analysis. More details about this type of file is given in the section \ref{sec:xmlio}. 720\end{itemize} 721 722\subsection{Parallel bootstrap}\label{sec:parallel_bootstrap}\index{MPI}\index{bootstrap!parallel} 723 724Bootstrapping is a highly parallelizable task. Indeed, bootstrap replicates are independent from 725one another. Each bootstrap replicate can then be analysed separately. Modern computers often have 726more than one CPU. Each CPU can therefore be used to process a bootstrap sample. Using this parallel 727strategy, performing $R$ bootstrap replicates on $C$ CPUs `costs' the same amount of computation 728time as processing $R \times C$ bootstrap replicates on a single CPU. In other words, for a given 729number of replicates, the computation time is divided by $R$ compared to the non-parallel approach. 730 731PhyML sources must be compiled with specific options to turn on the parallel option (see Section 732\ref{sec:MPI}). Once the binary file (\x{phyml}) has been generated, running a bootstrap analysis 733with, say 100 replicates on 2 CPUs, can be done by typing the following command-line: 734\begin{verbatim} 735mpd &; 736mpirun -np 2 ./phyml-mpi -i seqfile -b 100; 737\end{verbatim} 738The first command launches the mpi daemon while the second launches the analysis. Note that 739launching the daemon needs to be done only once. The output files are similar to the ones generated 740using the standard, non-parallel, analysis (see Section \ref{sec:input_output}). Note that running 741the program in batch mode, i.e.: 742\begin{verbatim} 743mpirun -np 2 ./phyml-mpi -i seqfile -b 100 & 744\end{verbatim} 745will probably NOT work. I do not know how to run a mpi process in batch mode yet. Suggestions welcome... 746Also, at the moment, the number of bootstrap replicates must be a multiple of the number of CPUs 747required in the mpirun command. 748 749\section{Inputs \& outputs for command-line and PHYLIP interface }\label{sec:input_output} 750 751PhyML reads data from standard text files, without the need for any particular file name extension. 752 753\subsection{Sequence formats} 754 755\begin{figure} 756\begin{small} 757\begin{Verbatim}[frame=single, label=PHYLIP interleaved, samepage=true, baselinestretch=0.5] 758 7595 80 760seq1 CCATCTCACGGTCGGTACGATACACCKGCTTTTGGCAGGAAATGGTCAATATTACAAGGT 761seq2 CCATCTCACGGTCAG---GATACACCKGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT 762seq3 RCATCTCCCGCTCAG---GATACCCCKGCTGTTG????????????????ATTAAAAGGT 763seq4 RCATCTCATGGTCAA---GATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT 764seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAAT????????GT 765 766ATCKGCTTTTGGCAGGAAAT 767ATCKGCTTTTGGCGGGAAAT 768AGCKGCTGTTG????????? 769ATCTGCTTTTGGCGGGAAAT 770ATCTGCTTTTGGCGGGAAAT 771 772\end{Verbatim} 773\begin{Verbatim}[frame=single, label=PHYLIP sequential, samepage=true, baselinestretch=0.5] 774 7755 40 776seq1 CCATCTCANNNNNNNNACGATACACCKGCTTTTGGCAGG 777seq2 CCATCTCANNNNNNNNGGGATACACCKGCTTTTGGCGGG 778seq3 RCATCTCCCGCTCAGTGAGATACCCCKGCTGTTGXXXXX 779seq4 RCATCTCATGGTCAATG-AATACTCCTGCTTTTGXXXXX 780seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGxxxxx 781 782\end{Verbatim} 783\end{small} 784\label{fig:align_tree} 785\caption{\bf PHYLIP interleaved and sequential formats.} 786\end{figure} 787 788 789 790\begin{figure} 791\begin{small} 792\begin{Verbatim}[frame=single, label=Nexus nucleotides, samepage=true, baselinestretch=0.5] 793 794[ This is a comment ] 795#NEXUS 796BEGIN DATA; 797DIMENSIONS NTAX=10 NCHAR=20; 798FORMAT DATATYPE=DNA; 799MATRIX 800tax1 ?ATGATTTCCTTAGTAGCGG 801tax2 CAGGATTTCCTTAGTAGCGG 802tax3 ?AGGATTTCCTTAGTAGCGG 803tax4 ?????????????GTAGCGG 804tax5 CAGGATTTCCTTAGTAGCGG 805tax6 CAGGATTTCCTTAGTAGCGG 806tax7 ???GATTTCCTTAGTAGCGG 807tax8 ???????????????????? 808tax9 ???GGATTTCTTCGTAGCGG 809tax10 ???????????????AGCGG; 810END; 811 812\end{Verbatim} 813\end{small} 814 815\begin{small} 816\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5] 817 818[ This is a comment ] 819#NEXUS 820BEGIN DATA; 821DIMENSIONS NTAX=10 NCHAR=20; 822FORMAT DATATYPE=STANDARD SYMBOLS="0 1 2 3"; 823MATRIX 824tax1 ?0320333113302302122 825tax2 10220333113302302122 826tax3 ?0220333113302302122 827tax4 ?????????????2302122 828tax5 10220333113302302122 829tax6 10220333113302302122 830tax7 ???20333113302302122 831tax8 ???????????????????? 832tax9 ???22033313312302122 833tax10 ???????????????02122; 834END; 835 836\end{Verbatim} 837\end{small} 838 839\begin{small} 840\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5] 841 842[ This is a comment ] 843#NEXUS 844BEGIN DATA; 845DIMENSIONS NTAX=10 NCHAR=20; 846FORMAT DATATYPE=STANDARD SYMBOLS="00 01 02 03"; 847MATRIX 848tax1 ??00030200030303010103030002030002010202 849tax2 0100020200030303010103030002030002010202 850tax3 ??00020200030303010103030002030002010202 851tax4 ??????????????????????????02030002010202 852tax5 0100020200030303010103030002030002010202 853tax6 0100020200030303010103030002030002010202 854tax7 ??????0200030303010103030002030002010202 855tax8 ???????????????????????????????????????? 856tax9 ??????0202000303030103030102030002010202 857tax10 ??????????????????????????????0002010202; 858END; 859 860\end{Verbatim} 861\end{small} 862\caption{\bf NEXUS formats.}\label{fig:nexus} 863\end{figure} 864 865 866Alignments of DNA or protein sequences must be in PHYLIP\index{PHYLIP} or NEXUS 867\cite{maddison97}\index{NEXUS} sequential\index{sequential} or interleaved\index{interleaved} format 868(Figures \ref{fig:align_tree} and \ref{fig:nexus}). For PHYLIP formated sequence alignments, the 869first line of the input file contains the number of species and the number of characters, in free 870format, separated by blank characters. One slight difference with PHYLIP format deals with sequence 871name lengths. While PHYLIP format limits this length to ten characters, PhyML can read up to 872hundred character long sequence names. Blanks and the symbols ``(),:'' are not allowed within 873sequence names because the Newick tree format makes special use of these symbols. Another slight 874difference with PHYLIP format is that actual sequences must be separated from their names by at 875least one blank character. 876 877A PHYLIP input sequence file may also display more than a single data set. Each of these data sets 878must be in PHYLIP format and two successive alignments must be separated by an empty 879line. Processing multiple data sets requires to toggle the `\x{M}' option in the {\em Input Data} 880sub-menu or use the `\x{-n}' command line option and enter the number of data sets to analyse. The 881multiple data set option can be used to process re-sampled data that were generated using a 882non-parametric procedure such as cross-validation or jackknife (a bootstrap option is already 883included in PhyML). This option is also useful in multiple gene studies, even if fitting the same 884substitution model to all data sets may not be suitable. 885 886PhyML can also process alignments in NEXUS format. Although not all the options provided by this 887format are supported by PhyML, a few specific features are exploited. Of course, this format can 888handle nucleotide and protein sequence alignments in sequential or interleaved format. It is also 889possible to use custom alphabets, replacing the standard 4-state and 20-state alphabets for 890nucleotides and amino-acids respectively. Examples of a 4-state custom alphabet are given in Figure 891\ref{fig:nexus}. Each state must here correspond to one digit or more. The set of states must be a 892list of consecutive digits starting from 0. For instance, the list ``0, 1, 3, 4'' is not a valid 893alphabet. Each state in the symbol list must be separated from the next one by a space. Hence, 894alphabets with large number of states can be easily defined by using two-digit number (starting with 89500 up to 19 for a 20 state alphabet). Most importantly, this feature gives the opportunity to 896analyse data sets made of presence/absence character states (use the \texttt{symbols=``0 1''} option 897for such data).\index{binary characters} Alignments made of custom-defined states will be processed 898using the Jukes and Cantor model. Other options of the program (e.g., number of rate classes, tree 899topology search algorithm) are freely configurable. Note that, at the moment, the maximum number of 900different states is set to 22 in order to save memory space. It is however possible to lift this 901threshold by modifiying the value of the variable \x{T\_MAX\_ALPHABET} in the file 902`\x{utilities.h}'. The program will then have to be re-compiled. 903 904 905 906\subsubsection{Gaps and ambiguous characters} 907 908Gaps correspond to the `\x{-}' symbol. They are systematically treated as unknown characters ``on 909the grounds that we don't know what would be there if something were there'' (J. Felsenstein, 910PHYLIP main documentation). The likelihood at these sites is summed over all the possible states 911(i.e., nucleotides or amino acids) that could actually be observed at these particular 912positions. Note however that columns of the alignment that display only gaps or unknown characters 913are simply discarded because they do not carry any phylogenetic information (they are equally well 914explained by any model). PhyML also handles ambiguous characters such as $R$ for $A$ or $G$ 915(purines) and $Y$ for $C$ or $T$ (pyrimidines). Tables \ref{tab:ambigu_nt} and \ref{tab:ambigu_aa} 916give the list of valid characters/symbols and the corresponding nucleotides or amino acids. 917 918\begin{table} 919\begin{center} 920\begin{tabular}{lr|lr} 921\hline 922Character & Nucleotide & Character & Nucleotide \\ 923\hline 924$A$ & Adenosine & $Y$ & $C$ or $T$ \\ 925$G$ & Guanosine & $K$ & $G$ or $T$ \\ 926$C$ & Cytidine & $B$ & $C$ or $G$ or $T$\\ 927$T$ & Thymidine & $D$ & $A$ or $G$ or $T$ \\ 928$U$ & Uridine (=$T$) & $H$ & $A$ or $C$ or $T$ \\ 929$M$ & $A$ or $C$ & $V$ & $A$ or $C$ or $G$ \\ 930$R$ & $A$ or $G$ & $-$ or $N$ or $X$ or $?$ & unknown \\ 931$W$ & $A$ or $T$ & & (=$A$ or $C$ or $G$ or $T$)\\ 932$S$ & $C$ or $G$ & & \\ 933\hline 934\end{tabular} 935\end{center} 936\caption{{\bf List of valid characters in DNA sequences and the corresponding nucleotides.}}\label{tab:ambigu_nt} 937\end{table} 938\begin{table} 939\begin{center} 940\begin{tabular}{lr|lr} 941\hline 942Character & Amino-Acid & Character & Amino-Acid \\ 943\hline 944$A$ & Alanine & $L$ & Leucine \\ 945$R$ & Arginine & $K$ & Lysine \\ 946$N$ or $B$& Asparagine & $M$ & Methionine \\ 947$D$ & Aspartic acid & $F$ & Phenylalanine \\ 948$C$ & Cysteine & $P$ & Proline \\ 949$Q$ or $Z$& Glutamine & $S$ & Serine \\ 950$E$ & Glutamic acid & $T$ & Threonine \\ 951$G$ & Glycine & $W$ & Tryptophan \\ 952$H$ & Histidine & $Y$ & Tyrosine \\ 953$I$ & Isoleucine & $V$ & Valine \\ 954$L$ & Leucine & $-$ or $X$ or $?$ & unknown \\ 955$K$ & Lysine & & (can be any amino acid) \\ 956\hline 957\end{tabular} 958\end{center} 959\caption{{\bf List of valid characters in protein sequences and the corresponding amino acids.}}\label{tab:ambigu_aa} 960\end{table} 961 962\subsubsection{Specifying outgroup sequences}\label{sec:outgroupspecify} 963 964PhyML can return rooted trees provided outgroup taxa are identified from the sequence file. In 965order to do so, sequence names that display a `*' character will be automatically considered as 966belonging to the outgroup. 967 968The topology of the rooted tree is exactly the same as the unrooted version of the same tree. In 969other words, PhyML first ignores the distinction between ingroup and outgroup sequences, builds a 970maximum likelihood unrooted tree and then tries to add the root. If the outgroup has more than one 971sequence, the position of the root might be ambiguous. In such situation, PhyML tries to identify 972the most relevant position of the root by considering which edge provides the best separation 973between ingroup and outgroup taxa (i.e., we are trying to make the outgroup ``as monophyletic as 974possible''). 975 976\subsection{Tree format} 977 978PhyML can read one or several phylogenetic trees from an input file. This option is accessible 979through the {\em Tree Searching} sub menu or the `\x{-u}' argument from the command line. Input 980trees are generally used as initial maximum likelihood estimates to be subsequently adjusted by the 981tree searching algorithm. Trees can be either rooted or unrooted and multifurcations are allowed. 982Taxa names must, of course, match the corresponding sequence names. 983 984\begin{figure}[h] 985\begin{small} 986\begin{minipage}{\textwidth} 987\begin{verbatim} 988((seq1:0.03,seq2:0.01):0.04,(seq3:0.01,(seq4:0.2,seq5:0.05):0.2):0.01); 989((seq3,seq2),seq1,(seq4,seq5)); 990\end{verbatim} 991\end{minipage} 992\end{small} 993\caption{{\bf Input trees}. The first tree (top) is rooted and has branch lengths. The second tree 994 (bottom) is unrooted and does not have branch lengths.} 995\label{fig:trees}\index{Newick format} 996\end{figure} 997 998 999\subsection{Multiple alignments and trees}\index{multiple data sets} 1000 1001Single or multiple sequence data sets may be used in combination with single or multiple input 1002trees. When the number of data sets is one ($n_D = 1$) and there is only one input tree ($n_T = 1$), 1003then this tree is simply used as input for the single data set analysis. When $n_D = 1$ and $n_T > 10041$, each input tree is used successively for the analysis of the single alignment. PhyML then 1005outputs the tree with the highest likelihood. If $n_D > 1$ and $n_T = 1$, the same input tree is 1006used for the analysis of each data set. The last combination is $n_D > 1$ and $n_T > 1$. In this 1007situation, the $i$-th tree in the input tree file is used to analyse the $i$-th data set. Hence, 1008$n_D$ and $n_T$ must be equal here. 1009 1010 1011\subsection{Custom amino-acid rate model}\label{sec:customaa} 1012 1013The custom amino-acid model of substitutions can be used to implement a model that is not hard-coded 1014in PhyML. This model must be time-reversible. Hence, the matrix of substitution rates is 1015symmetrical. The format of the rate matrix with the associated stationary frequencies is identical 1016to the one used in PAML\index{PAML}. An example is given below: 1017 1018\begin{center} 1019{\tiny 1020\begin{tabular}{p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}} 1021% Ala & Arg & Asn & Asp & Cys & Gln & Glu & Gly & His & Ile & Leu & Lys & Met & Phe & Pro & Ser & Thr & Trp & Tyr & Val \\ 1022 &&&&&&&&&&&&&&&&&&& \\ 10230.55 & &&&&&&&&&&&&&&&&&& \\ 10240.51 & 0.64 & &&&&&&&&&&&&&&&& \\ 10250.74 & 0.15 & 5.43 & &&&&&&&&&&&&&&&& \\ 10261.03 & 0.53 & 0.27 & 0.03 & &&&&&&&&&&&&&&& \\ 10270.91 & 3.04 & 1.54 & 0.62 & 0.10 & &&&&&&&&&&&&&& \\ 10281.58 & 0.44 & 0.95 & 6.17 & 0.02 & 5.47 & &&&&&&&&&&&&& \\ 10291.42 & 0.58 & 1.13 & 0.87 & 0.31 & 0.33 & 0.57 & &&&&&&&&&&&& \\ 10300.32 & 2.14 & 3.96 & 0.93 & 0.25 & 4.29 & 0.57 & 0.25 & &&&&&&&&&&& \\ 10310.19 & 0.19 & 0.55 & 0.04 & 0.17 & 0.11 & 0.13 & 0.03 & 0.14 & &&&&&&&&&& \\ 10320.40 & 0.50 & 0.13 & 0.08 & 0.38 & 0.87 & 0.15 & 0.06 & 0.50 & 3.17 & &&&&&&&&& \\ 10330.91 & 5.35 & 3.01 & 0.48 & 0.07 & 3.89 & 2.58 & 0.37 & 0.89 & 0.32 & 0.26 & &&&&&&&& \\ 10340.89 & 0.68 & 0.20 & 0.10 & 0.39 & 1.55 & 0.32 & 0.17 & 0.40 & 4.26 & 4.85 & 0.93 & &&&&&&& \\ 10350.21 & 0.10 & 0.10 & 0.05 & 0.40 & 0.10 & 0.08 & 0.05 & 0.68 & 1.06 & 2.12 & 0.09 & 1.19 & &&&&&& \\ 10361.44 & 0.68 & 0.20 & 0.42 & 0.11 & 0.93 & 0.68 & 0.24 & 0.70 & 0.10 & 0.42 & 0.56 & 0.17 & 0.16 & &&&&& \\ 10373.37 & 1.22 & 3.97 & 1.07 & 1.41 & 1.03 & 0.70 & 1.34 & 0.74 & 0.32 & 0.34 & 0.97 & 0.49 & 0.55 & 1.61 & &&&& \\ 10382.12 & 0.55 & 2.03 & 0.37 & 0.51 & 0.86 & 0.82 & 0.23 & 0.47 & 1.46 & 0.33 & 1.39 & 1.52 & 0.17 & 0.80 & 4.38 & &&& \\ 10390.11 & 1.16 & 0.07 & 0.13 & 0.72 & 0.22 & 0.16 & 0.34 & 0.26 & 0.21 & 0.67 & 0.14 & 0.52 & 1.53 & 0.14 & 0.52 & 0.11 & && \\ 10400.24 & 0.38 & 1.09 & 0.33 & 0.54 & 0.23 & 0.20 & 0.10 & 3.87 & 0.42 & 0.40 & 0.13 & 0.43 & 6.45 & 0.22 & 0.79 & 0.29 & 2.49 & & \\ 10412.01 & 0.25 & 0.20 & 0.15 & 1.00 & 0.30 & 0.59 & 0.19 & 0.12 & 7.82 & 1.80 & 0.31 & 2.06 & 0.65 & 0.31 & 0.23 & 1.39 & 0.37 & 0.31 & \\ 1042\\ 10438.66 & 4.40 & 3.91 & 5.70 & 1.93 & 3.67 & 5.81 & 8.33 & 2.44 & 4.85 & 8.62 & 6.20 & 1.95 & 3.84 & 4.58 & 6.95 & 6.10 & 1.44 & 3.53 & 7.09 \\ 1044\end{tabular} 1045} 1046\end{center} 1047 1048The entry on the $i$-th row and $j$-th column of this matrix corresponds to the rate of 1049substitutions between amino-acids $i$ and $j$. The last line in the file gives the stationary 1050frequencies and must be separated from the rate matrix by one line. The ordering of the amino-acids 1051is alphabetical, i.e, Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, 1052Ser, Thr, Trp, Tyr and Val. 1053 1054 1055\subsection{Topological constraint file}\label{sec:topoconstraints} 1056 1057PhyML can perform phylogenetic tree estimation under user-specified topological constraints. In 1058order to do so, one should use the \x{--constraint\_file} \x{file\_name} command-line option where 1059\x{file\_name} lists the topological constraints. Such constraints are straightforward to 1060define. For instance, the following constraints: 1061\vspace{0.2cm} 1062\begin{Verbatim} 1063((A,B),C,(D,E,F)); 1064\end{Verbatim} 1065indicate that taxa D, E and F belong to the same clade. A, B and C also belong to the 1066same clade and the two clades hence defined should not overlap. Under these two 1067constraints, the tree ((A,B),D,((E,F),C)) is not valid. From the example above, you will notice that 1068the constraints are defined using a multifurcating tree in NEWICK format. Note that this tree does 1069not need to display the whole list of taxa. For instance, while the only taxa involved in specifying 1070topological constraints above are A, B, C, D, E \& F, the actual data set could include more than 1071these six taxa only. 1072 1073PhyML tree topology search algorithms all rely on improving a starting tree. By default, BioNJ is 1074the method of choice for building this tree. However, there is no guarantee that the phylogeny 1075estimated with PhyML does comply with the topological constraints. While it is probably possible to 1076implement BioNJ with topological constraints, we have not done so yet. Instead, the same 1077multifurcating tree that defines the topological constraints should also be used as starting tree 1078using the \x{-u} (\x{--inputtree}) option. Altogether, the command line should look like the 1079following: \x{-u}=\x{file\_name} \x{--constraint\_file}=\x{file\_name}. It is not possible to use as 1080input tree a non-binary phylogeny that is distinct from that provided in the constraint tree 1081file. However, any binary tree compatible with the constraint one can be used as input tree. 1082 1083 1084\subsection{Output files} 1085 1086\begin{table} 1087Sequence file name~: `{\x seq}'\\ 1088\begin{center} 1089\begin{tabular}{ll} 1090\hline 1091Output file name & Content \\ 1092\hline 1093\x{seq\_phyml\_tree.txt} & ML tree\\ 1094\x{seq\_phyml\_stats.txt} & ML model parameters\\ 1095\x{seq\_phyml\_boot\_trees.txt} & ML trees -- bootstrap replicates\\ 1096\x{seq\_phyml\_boot\_stats.txt} & ML model parameters -- bootstrap replicates \\ 1097\x{seq\_phyml\_rand\_trees.txt} & ML trees -- multiple random starts\\ 1098\x{seq\_phyml\_ancestral\_seq.txt} & ancestral sequences\\ 1099\x{seq\_phyml\_ancestral\_tree.txt} & ML tree with node labels as in ancestral sequence file\\ 1100\hline 1101\end{tabular} 1102\end{center} 1103\caption{{\bf Standard output files}}\label{tab:output} 1104\end{table} 1105 1106Table \ref{tab:output} presents the list of files resulting from an analysis. Basically, each 1107output file name can be divided into three parts. The first part is the sequence file name, the 1108second part corresponds to the extension `\x{\_phyml\_}' and the third part is related to the file 1109content. When launched with the default options, PhyML only generates two files: the tree file and 1110the model parameter file. The estimated maximum likelihood tree is in standard Newick format (see 1111Figure \ref{fig:trees}). The model parameters file, or statistics file, displays the maximum 1112likelihood estimates of the substitution model parameters, the likelihood of the maximum likelihood 1113phylogenetic model, and other important information concerning the settings of the analysis (e.g., 1114type of data, name of the substitution model, starting tree, etc.). Two additional output files are 1115created if bootstrap supports were evaluated. These files simply contain the maximum likelihood 1116trees and the substitution model parameters estimated from each bootstrap replicate. Such 1117information can be used to estimate sampling errors around each parameter of the phylogenetic model. 1118When the random tree option is turned on, the maximum likelihood trees estimated from each random 1119starting trees are printed in a separate tree file (see last row of Table \ref{tab:output}). 1120 1121PhyML estimates ancestral sequences by calculating the {\em marginal} (as opposed to the {\em joint}) probability of each character 1122 state at each internal node of the phylogeny. These probabilities are given in the file 1123\x{seq\_phyml\_ancestral\_seq.txt}. The bulk of this file is a table where each row corresponds to a 1124site in the original alignment and a number corresponding labeling each internal node. The different 1125columns of that file give the probability of each character state given the data observed at the 1126corresponding sites and the estimated phylogeny. PhyML also outputs 1127an extra column called MPEE\index{MPEE}\index{ancestral reconstruction}, which stands for Minimum Posterior Expected Error. This column gives the 1128state, which can be a single nucleotide/amino-acid or a combination of more than one nucleotide/amino-acid, that is 1129optimal under the MPEE criterion \cite{oliva19}. 1130 1131Also, it is relatively straightforward to identify which number corresponds to which node in the tree by examining the 1132information provided in the Newick-formatted file \x{seq\_phyml\_ancestral\_tree.txt}. 1133 1134 1135\subsection{Treatment of invariable sites with fixed branch lengths} 1136 1137PhyML allows users to give an input tree with fixed topology and branch lengths and find the 1138proportion of invariable sites that maximise the likelihood (option \x{-o r}). These two options can 1139be considered as conflicting since branch lengths depend on the proportion of invariants. Hence, 1140changing the proportion of invariants implies that branch lengths are changing too. More formally, 1141let $l$ denote the length of a branch, i.e., the expected number of substitutions per site, and $p$ 1142be the proportion of invariants. We have $l = (1-p)l'$, where $l'$ is the expected number of 1143substitutions {\em at variable sites}. When asked to optimize $p$ but leave $l$ unchanged, PhyML 1144does the following: 1145\begin{enumerate} 1146\item Calculate $l' = l/(1-p)$ and leave $l'$ unchanged throughout the optimization. 1147\item Find the value of $p$ that maximises the likelihood. Let $p^{*}$ denote this value. 1148\item Set $l^{*} = (1-p^{*})l'$ and print out the tree with $l^{*}$ (instead of $l$). 1149\end{enumerate} 1150 1151PhyML therefore assumes that the users wants to fix the branch lengths measured at {\em variable} 1152sites only (i.e., $l^{*}$ is fixed). This is the reason why the branch lengths in the input and 1153output trees do differ despite the use of the the \x{-o r} option. While we believe that this 1154approach relies on a sound rationale, it is not perfect. In particular, the original transformation 1155of branch lengths ($l' = l/(1-p)$) relies on a default value for $p$ with is set to 0.2 in 1156practice. It is difficult to justify the use of this value rather than another one. One suggestion 1157proposed by Bart Hazes is to avoid fixing the branch lengths altogether and rather estimate the 1158value of a scaling factor applied to each branch length in the input tree (option 1159\x{--contrained\_lens}). We agree that this solution probably matches very well most users 1160expectation, i.e., ``find the best value of $p$ while constraining the ratio of branch lengths to be 1161that given in the input tree''. Please feel free to send us your suggestions regarding this problem 1162by posting on the forum (\url{http://groups.google.com/group/phyml-forum}). 1163 1164 1165\section{Inputs \& outputs for the XML interface }\label{sec:xmlio}\index{XML} 1166 1167\subsection{Mixture models in PhyML}\index{mixture models}\label{sec:mixtures} 1168 1169PhyML implements a wide range of mixture models. The discrete gamma model \cite{yang94b} is arguably 1170the most popular of these models in phylogenetics. However, in theory, mixture models are not 1171restricted to the description of the variation of substitution rates across sites. For instance, if 1172there are good reasons to believe that the relative rates of substitution between nucleotides vary 1173along the sequence alignments, it makes sense to use a mixture of GTR models. Consider the case 1174where substitutions between $A$ and $C$ occur at high rate in some regions of the alignment and low 1175rate elsewhere, a mixture with two classes, each class having its own GTR rate matrix, would be 1176suitable. The likelihood at any site of the alignment is then obtained by averaging the likelihoods 1177obtained for each GTR rate matrix, with the same weight given to each of these matrices. 1178 1179PhyML implements a generic framework that allows users to define mixtures on substitution rates, 1180rate matrices and nucleotide or amino-acid equilibrium frequencies. Each class of the mixture model 1181is built by assembling a substitution rate, a rate matrix\footnote{the rate matrix corresponds here 1182the symmetrical matrix giving the so-called ``echangeability rates''} and a vector of equilibrium 1183frequencies. For instance, let $\{R_1,R_2,R_3\}$ be a set of substitution rates, $\{M_1,M_2\}$ a 1184set of rate matrices and $\{F_1,F_2\}$ a set of vectors of equilibrium frequencies. One could then 1185define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as 1186$\mathcal{C}_2 = \{R_2,M_1,F_1\}$, and a third class as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. If 1187$R_1$, $R_2$ and $R_3$ correspond to slow, medium and fast substitution rates, then this mixture 1188model allows the fast evolving rates to have their own vector of equilibrium frequencies and rate 1189matrix, distinct from that found at the medium or slow evolving sites. The likelihood at any given 1190site $D_s$ of the alignment is then: 1191\begin{eqnarray*} 1192\Pr(D_s) = \sum_{c=1}^{3} \Pr(D_s | \mathcal{C}_s=c) \Pr(\mathcal{C}_s=c), 1193\label{equ:mixtlk} 1194\end{eqnarray*} 1195where $\Pr(\mathcal{C}_s=c)$ is obtained by multiplying the probability (density) of the three 1196components (i.e., rate, matrix, frequencies). For instance, $\Pr(\mathcal{C}_1=\{R_1,M_1,F_1\}) = 1197 \Pr(R_1)\times \Pr(M_1) \times \Pr(F_1)$. 1198We therefore assume here that substitution rates, rate 1199matrices and equilibrium frequencies are independent from one another. 1200 1201Note that, using the same substitution rates, rate matrices and vector of equilibrium frequencies, 1202it is possible to construct many other mixture models. For instance, the mixture model with the 1203largest number of classes can be created by considering all the combinations of these three 1204components. We would then get a mixture of $3\times 2 \times 2=12$ classes, corresponding to all 1205the possible combinations of 3 rates, 2 matrices and 2 vectors of frequencies. 1206 1207 1208% : $\mathcal{C}_1 = 1209% \{R_1,M_1,F_1\}$ $\mathcal{C}_2 = \{R_1,M_1,F_2\}$, $\mathcal{C}_3 = \{R_1,M_2,F_1\}$, 1210% $\mathcal{C}_4 = \{R_1,M_2,F_2\}$, $\mathcal{C}_5 = \{R_2,M_1,F_1\}$, $\mathcal{C}_6 = 1211% \{R_2,M_1,F_2\}$, $\mathcal{C}_7 = \{R_2,M_2,F_1\}$, $\mathcal{C}_8 = \{R_2,M_2,F_2\}$, 1212% $\mathcal{C}_9 = \{R_3,M_1,F_1\}$, $\mathcal{C}_{10} = \{R_3,M_1,F_2\}$, $\mathcal{C}_{11} = 1213% \{R_3,M_2,F_1\}$ and $\mathcal{C}_{12} = \{R_3,M_2,F_2\}$. 1214 1215 1216\subsection{Partitionned (i.e., multiple-gene) analyses}\index{partitionned analysis}\index{data partitions}\index{multiple-gene analysis} 1217 1218We first introduce some terms of vocabulary that have not been presented before. A partitionned data 1219set, also referred to as partition, is a set of partition elements. Typically, a partitionned data 1220set will be made of a set of distinct gene alignments. A partition element will then correspond to 1221one (or several) of these gene alignments. Note that the biology litterature often uses the term 1222partition to refer to an element of a partitionned data. We thus use here instead the mathematical 1223definition of the terms `partition' and `partition element'. 1224 1225Phylogenetics models usually assume individual columns of an alignment to evolve independently from 1226one another. Codon-based models (e.g., \cite{yang98,yang00b,yang02,guindon04}) are exceptions to 1227this rule since the substitution process applies here to triplets of consecutive sites of coding 1228sequences. The non-independence of the substitution process at the three coding positions (due to 1229the specificities of the genetic code), can therefore be accounted for. Assuming that sites evolve 1230independently does not mean that a distinct model is fitted to each site of the alignment. 1231Estimating the parameters of these models would not make much sense in practice due to the very 1232limited amount of phylogenetic signal conveyed by individual sites. Site independence means instead 1233that the columns of the observed alignment were sampled randomly from the same ``population of 1234columns''. The stochasticity of the substitution process running along the tree is deemed 1235responsible to the variability of site patterns. 1236 1237Some parameters of the phylogenetic model are considered to be common to all the sites in the 1238alignment. The tree topology is typically one such parameter. The transition/transversion ratio is 1239also generally assumed to be the same for all columns. Other parameters can vary from site to site. 1240The rate at which substitutions accumulate is one of these parameters. Hence, different sites can 1241have distinct rates. However, such rates are all ``drawn'' from the same probabilitic distribution 1242(generally a discrete Gamma density). Hence, while different sites may have distinct rates of 1243evolution, they all share the same {\em distribution} of rates. 1244 1245This reasonning also applies on a larger scale. When analysing multiple genes, one can indeed 1246assume that the same mechanism generated the different site patterns observed for every gene. Here 1247again, we can assume that all the genes share the same underlying tree topology (commonly refered to 1248as the ``species tree''). Other parameters of the phylogenetic model, such as branch lengths for 1249instance, might be shared across genes. However, due to the specificities of the gene evolution 1250processes, some model parameters need to be adjusted for each gene separately. To sum up, the 1251phylogenetic analysis of partitionned data requires flexible models with parameters, or distribution 1252of parameters, shared across several partition elements and other parameters estimated separately 1253for each element of the partition. 1254 1255The likelihood of a data set made of the concatenation of $n$ sequence alignments noted $D^{(1)}$, 1256$D^{(2)}, \ldots, D^{(n)}$ is then obtained as follows: 1257 1258\begin{eqnarray*} 1259\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \Pr(D^{(i)}) \\ 1260&=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s), 1261\end{eqnarray*} 1262 1263where $L_i$ is the number of site columns in partition element $i$. $\Pr(D^{(i)}_s)$ 1264is then obtained using Equation \ref{equ:mixtlk}, i.e., by summing over the different classes of the 1265mixture model that applies to site $s$ for partition element $i$. Hence, the joint probability of 1266all the partition elements is here broken down into the product of likelihood of every site for each 1267partition element. As noted just above, any given component of the mixture model at a given 1268particular site is shared by the other sites that belong to the same partition element and, for some 1269of them, by sites in other partition elements (e.g., the same tree topology is shared by all the 1270sites, throughout all the partition elements). 1271 1272PhyML implements a wide variety of partition models. The only parameter that is constrained to be 1273shared by all the partition elements is the tree topology. This constraint makes sense when 1274considering distantly related taxa, typically inter-species data. For closely related taxa, i.e., 1275when analysing intra-species or population-level data, not all the genes might have the same 1276evolutionary history. Recombination events combined to the incomplete lineage sorting phenomenon 1277can generate discrepancies between the gene trees and the underlying species tree (see 1278\cite{degnan09} for a review). The phylogenetic softwares BEST \cite{best}\index{BEST}, STEM 1279\cite{stem}\index{STEM} and *BEAST \cite{startbeast}\index{*BEAST} are dedicated to the estimation 1280of species tree phylogenies from the analysis of multi-gene data and allow gene-tree topologies to 1281vary across genes. 1282 1283Aside from the tree topology that is common to all the sites and all the partition elements, other 1284parameters of the phylogenetic model can be either shared across partition elements or estimated 1285separately for each of these. When analysing three partition elements, $A$, $B$ and $C$ for 1286instance, PhyML can fit a model where the same set of branch lengths applies to $A$ and $B$ while 1287$C$ has its own estimated lengths. The same goes for the substitution model: the same GTR model, 1288with identical parameter values, can be fitted to $A$ and $C$ and JC69 for instance can be used for 1289$B$. The sections below give more detailed information on the range of models available and how to 1290set up the corresponding XML configuration files to implement them. 1291 1292 1293\subsection{Combining mixture and partitions in PhyML: the theory} 1294 1295The rationale behind mixture models as implemented in PhyML lies in (1) the definition of suitable 1296rate matrices, equilibrium frequency vectors and relative rates of substitution and (2) the assembly 1297of these components so as to create the classes of a mixture. The main idea behind partitionned 1298analysis in PhyML lies in (1) the hypothesis of statistical independance of the different data 1299partition elements and (2) distinct data partition can share model components such as rate matrices, 1300equilibrium frequencies or distribution of rates across sites. More formally, the likelihood of a 1301data set made of $n$ partition elements is written as follows: 1302\begin{eqnarray*} 1303\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s) \\ 1304&=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c), 1305\end{eqnarray*} 1306where $L_i$ is the number of sites in partition element $i$ and $K_i$ is the number of classes in 1307the mixture model that applies to this same partition element. Each class of a mixture is made of a 1308rate matrix $M$, a vector of equilibrium frequencies $F$ and a relative rate of substitution 1309$R$. Branch lengths, $L$ and tree topology $\tau$ are also required for the calculation of the 1310likelihood. Hence we have: 1311\begin{eqnarray*} 1312&& \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) \\ 1313&&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c) \\ 1314&&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i} \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) \mathcal{I}(m,f,r,i) 1315% &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i} 1316% \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) 1317\end{eqnarray*} 1318where $\mathcal{M}_i$, $\mathcal{F}_i$ and $\mathcal{R}_i$ are the number of rate matrices, 1319vector of equilibrium frequencies and relative rates that apply to partition element $i$ 1320respectively. $\mathcal{I}(m,f,r,i)$ is an indicator function that takes value 1 if the combination 1321$M_m$, $F_f$ and $R_r$ is acually defined in the model for this particular partition element 1322$i$. Its value is 0 otherwise. In the example given in section \ref{sec:mixtures} $\{R_1,R_2,R_3\}$ 1323is the set of substitution rates, $\{M_1,M_2\}$ the set of rate matrices and $\{F_1,F_2\}$ the set of vectors of equilibrium frequencies. We then 1324define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as 1325$\mathcal{C}_2 = \{R_2,M_1,F_1\}$ and the third as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. Hence, we 1326have $\mathcal{I}(1,1,1,i)$, $\mathcal{I}(1,1,2,i)$ and $\mathcal{I}(2,2,3,i)$ equal to one while 1327the nine other values that this indicator function takes, corresponding to the possible combinations of 1328two vectors of frequencies, two matrices and three rates, are all zero. 1329 1330As stated before, our implementation assumes that the different components of a mixture are 1331independant. In other words, we have $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \Pr(M_m^{(i)}) \times 1332\Pr(F_f^{(i)}) \times \Pr(R_r^{(i)})$. In practice, the joint probability 1333$\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)})$ is obtained as follows: 1334\begin{eqnarray*} 1335\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \frac{\Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})}{ 1336 \sum_{m,f,r} \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) \mathcal{I}(m,f,r,i)} 1337\label{equ:weights} 1338\end{eqnarray*} 1339The probabilities $\Pr(M_m^{(i)})$, $\Pr(F_f^{(i)})$ and $\Pr(R_r^{(i)})$, also called `weights', can be fixed or estimated 1340from the data. 1341 1342\subsection{The XML format and its use in PhyML}\label{sec:XML format} 1343 1344The few paragraphs below are largely inspired from the Wikipedia page that describes the XML format 1345(\url{http://en.wikipedia.org/wiki/XML}). XML (eXtensible Markup Language) is a markup language that 1346defines a set of rules for encoding documents in a format that is both human-readable and 1347machine-readable. An XML document is divided into {\em markup} and {\em content}, which may be 1348distinguished by the application of simple syntactic rules. Generally, strings that constitute 1349markup either begin with the character `\x{<}' and end with a `\x{>}'. Strings of characters that 1350are not markup are content: 1351 1352\vspace{0.2cm} 1353\begin{Verbatim}[frame=single, label=XML markup and content example, samepage=true, 1354 baselinestretch=0.5, fontsize=\small] 1355<markup> 1356 content 1357</markup> 1358\end{Verbatim} 1359 1360A markup construct that begins with `\x{<}' and ends with `\x{>}' is called a {\em tag}. Tags come 1361in three flavors: (1) start-tags (e.g, \x{<section>}), end-tags (e.g., \x{</section>}) and 1362empty-element tags (e.g., \x{<line-break />}). A {\em component} either begins with a start-tag and 1363ends with a matching end-tag or consists only of an empty-element tag. The characters between the 1364start- and end-tags, if any, are the element's content, and may contain markup, including other 1365elements, which are called child elements. In the following example, the element \x{img} has two 1366{\em attributes}, \x{src} and \x{alt}: \x{<img src="madonna.jpg" alt="Foligno Madonna, by 1367Raphael"/>}. Another example would be \x{<step number="3">Connect A to B.</step>} where the name of 1368the attribute is ``\x{number}" and the value is ``\x{3}". 1369 1370In practice, building a mixture model in a XML file readable by PhyML is relatively 1371straightforward. The first step is to define the different components of each class of the 1372mixture. Consider for instance that the fitted model will have a Gamma distribution with four 1373classes plus a proportion of invariants. The rate component of the mixture can then be specified 1374using the following XML code: 1375 1376\vspace{0.2cm} 1377\begin{Verbatim}[frame=single, label=$\Gamma4$+I rates, samepage=true, baselinestretch=0.5, 1378 fontsize=\small, numbers=left] 1379 1380<siterates id="SiteRates1"> 1381 <weights id="Distrib" family="gamma+inv" alpha=".1" \ 1382 optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> 1383 </weights> 1384 <instance id="R1" init.value="1.0"/> 1385 <instance id="R2" init.value="1.0"/> 1386 <instance id="R5" init.value="0.0"/> 1387 <instance id="R3" init.value="1.0"/> 1388 <instance id="R4" init.value="1.0"/> 1389</siterates> 1390 1391\end{Verbatim} 1392 1393In the example above, the \x{<siterates>} component completely defines a model of substitution rate 1394variation across sites. This component has a particular identity, i.e., a name associated to it 1395(``\x{SiteRates1}'' here), which is not mandatory. This \x{<siterates>} component has six 1396sub-components. The first is the \x{<weights>} component, followed by five \x{<instance>} 1397components. The \x{<weights>} component defines the type of distribution that characterizes the 1398variation of rates across sites. A discrete Gamma plus invariants is used here. Two parameters 1399specify this distribution: the gamma shape and the proportion of invariant parameters. Their initial 1400values are set by using the corresponding attributes and attribute values (\x{alpha="0.1"} and 1401\x{pinv="0.4"}). Also, PhyML can optimise these parameters so as to maximise the likelihood of the 1402whole phylogenetic model (\x{optimise.pinv="yes"} and \x{optimise.alpha="yes"}). The following five 1403\x{<instance>} components define the rate classes themselves. The \x{id} attribute is here 1404mandatory and must be unique to each class. Note that one of the initial (relative) rate 1405(\x{init.value} attribute) is set to zero. The corresponding rate class (the third in this example) 1406will then correspond to the invariant site category. 1407 1408Having specified the part of the phylogenetic model that describes the variation of rates across 1409sites, we can now move on to build the rest of the model. The component below defines two 1410substitution models: 1411 1412\vspace{0.2cm} 1413\begin{Verbatim}[frame=single, label=Rate matrices, samepage=true, baselinestretch=0.5, 1414 fontsize=\small, numbers=left] 1415 1416<ratematrices id="RateMatrices"> 1417 <instance id="M1" model="HKY85" tstv="4.0" optimise.tstv="no"/> 1418 <instance id="M2" model="GTR" optimise.rr="yes"/> 1419</ratematrices> 1420\end{Verbatim} 1421 1422This \x{<ratematrices>} component sets out a list of substitution models (HKY85 and GTR here). Here 1423again, the different elements in this list correspond to the \x{<instance>} sub-components. Each 1424instance must have a unique \x{id} attribute for a reason that will become obvious shortly. The 1425remaining attributes and their functions are described in Section \ref{sec:xmlratematrices}. 1426 1427The next ``ingredient'' in our phylogenetic model are vectors of nucleotide frequencies. The 1428\x{<equfreqs>} component below specifies two of such vectors: 1429 1430 1431\vspace{0.2cm} 1432\begin{Verbatim}[frame=single, label=Equilibrium frequencies, samepage=true, baselinestretch=0.5, 1433 fontsize=\small, numbers=left] 1434 1435<equfreqs id="EquFreq"> 1436 <instance id="F1"/> 1437 <instance id="F2"/> 1438</equfreqs> 1439 1440\end{Verbatim} 1441 1442Now, we need to assemble these three components (rate variation across sites, rate matrices and 1443vectors of equilibrium frequencies) into a mixture model. The \x{<partitionelem>} component below 1444defines one such model: 1445 1446\vspace{0.2cm} 1447\begin{Verbatim}[frame=single, label=Mixture model, samepage=true, baselinestretch=0.5, 1448 fontsize=\small, numbers=left] 1449 1450<partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt"> 1451 <mixtureelem list="R1, R2, R3, R4, R5"/> 1452 <mixtureelem list="M1, M1, M1, M2, M2"/> 1453 <mixtureelem list="F1, F2, F1, F2, F2"/> 1454</partitionelem> 1455 1456\end{Verbatim} 1457 1458The \x{<partitionelem>} component defines a particular partition element. In this example, the 1459partition element corresponds to the sequence file called \x{nucleic.txt}, which is an alignment of 1460nucleotide sequences (see the \x{data.type} attribute value). The \x{<mixtureelem>} are 1461sub-components of the \x{<partitionelem>} component. Each \x{<mixtureelem>} has a \x{list} 1462atrribute. Each such \x{list} gives the ID of components that have been defined before. For 1463instance, the first \x{<mixtureelem>} refers to the five classes of the \x{<siterates>} 1464component. The ordering of the different term in these list matters a lot since it is directly 1465related to the elements in each class of the mixture model. Hence, the first element in the 1466\x{<list>} attribute of the first \x{<mixtureelem>} added to the first element in the \x{<list>} 1467attribute of the second \x{<mixtureelem>} plus the the first element in \x{<list>} attribute of the 1468third \x{<mixtureelem>} defines the first class of the mixture model. Therefore, the mixture model 1469defined above has five classes: $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, $\mathcal{C}_2 = 1470\{R_2,M_1,F_2\}$, $\mathcal{C}_3 = \{R_3,M_1,F_1\}$, $\mathcal{C}_4 = \{R_4,M_2,F_2\}$ and 1471$\mathcal{C}_5 = \{R_5,M_2,F_2\}$. 1472 1473% Going back to the different components of this model, the XML code dealing with the substitution 1474% rates defines five classes with names {\tt R1} to {\tt R5}. The initial values of these rates are 1475% set to 1.0, except for {\tt R5}, which is set to 0 and will therefore correspond to the invariable 1476% site class. The {\tt <weight>} tag that follows indicate that these rates define a $\Gamma 4$+Inv 1477% model, with initial gamma shape parameter set to 0.1 and initial proportion of invariants set to 1478% 0.4. These two parameters will be estimated in the analysis ({\tt optimise.alpha} and {\tt 1479% optimise.pinv} attributes set to {\tt yes}). The two rate matrices have names {\tt M1} and {\tt 1480% M2}. {\tt M1} corresponds to a HKY85 model, with transition/transversion ratio set to 4.0 and set 1481% to be optimised in the analysis. {\tt M2} is a GTR model, which parameters are also set to be 1482% optimised. {\tt F1} and {\tt F2} are two vectors of nucleotide frequencies at equilibrium. These two 1483% sets of frequencies will therefore be estimated during the analysis. 1484 1485 1486 1487\subsection{Setting up mixture and partition models in PhyML: the basics}\index{mixture 1488 models}\index{partitionned analysis}\index{data partitions} 1489 1490Mixture models are particularly relevant to the analysis of partitionned data. Indeed, some features 1491of evolution are gene-specific (e.g., substitution rates vary across genes). Models that can 1492accomodate for such variation, as mixture models do, are therefore relevant in this context. 1493However, other evolutionary features are shared across loci (e.g., genes located in the same genomic 1494region usually have similar GC contents). As a consequence, some components of mixture models need 1495to be estimated separately for each partition element while others should be shared by different 1496partition elements. 1497 1498Below is a simple example with a partitionned data set made of two elements, corresponding to the 1499sequence alignment files \x{nucleic1.txt} and \x{nucleic2.txt}. Importantly, the number and names of 1500sequences in these two alignments have to match exactly. 1501 1502\vspace{0.2cm} 1503\begin{Verbatim}[frame=single, label=Two sets of branch lengths (one per partition element), 1504 samepage=true, baselinestretch=0.5, fontsize=\small, numbers=left] 1505 1506<branchlengths id="BranchLens"> 1507 <instance id="L1"/> 1508 <instance id="L2"/> 1509</branchlengths> 1510 1511<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> 1512 <mixtureelem list="R1, R2, R3, R4, R5"/> 1513 <mixtureelem list="L1, L1, L1, L1, L1"/> 1514</partitionelem> 1515 1516<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt"> 1517 <mixtureelem list="R1, R2, R3, R4, R5"/> 1518 <mixtureelem list="L2, L2, L2, L2, L2"/> 1519</partitionelem> 1520 1521\end{Verbatim} 1522 1523Mixture elements with names \x{R1},$\ldots$, \x{R5} refer to the $\Gamma4+$I model defined 1524previsouly (see Section \ref{sec:XML format}). The \x{<branchlengths>} XML component defines a 1525mixture element that had not been introduced before. It defined vectors of branch lengths that 1526apply to the estimated phylogeny. Two instances of such vectors are defined: \x{L1} and \x{L2}. 1527When examining the two partition elements (\x{<partitionelem>} component), it appears that \x{L1} 1528is associated with \x{Part1} while \x{L2} is associated with \x{Part2}. Hence, branch lengths 1529will be estimated separately for these two partition elements. 1530 1531Note that a given partition element can only have one {\tt branchlengths} instance associated to 1532it. For instance, the example given below is not valid: 1533 1534\vspace{0.2cm} 1535\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5, 1536 fontsize=\small, numbers=left] 1537 1538<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> 1539 <mixtureelem list="R1, R2, R3, R4, R5"/> 1540 <mixtureelem list="L1, L1, L1, L2, L2"/> 1541</partitionelem> 1542 1543\end{Verbatim} 1544 1545In other words, mixture of branch lengths are forbidden. One reason for this restriction is that 1546mixture of edge lengths sometimes lead to non-identifiable models (i.e., models with distinct sets 1547of branch lengths have the same likelihood) \cite{matsen07}. But mostly, combining mixture of branch 1548lengths with mixture of rates appears like a deadly combination. Consider for instance the following model: 1549 1550\vspace{0.2cm} 1551\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5, 1552 fontsize=\small, numbers=left] 1553 1554<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> 1555 <mixtureelem list="R1, R2, R3"/> 1556 <mixtureelem list="L1, L2, L3"/> 1557</partitionelem> 1558 1559\end{Verbatim} 1560 1561It is here impossible to tell apart branch lengths and substitution rates. Such model is strongly 1562non-identifiable and therefore not relevant. 1563 1564In the example given above, the same $\Gamma4+$I model (i.e. the same gamma shape parameter and 1565proportion of invariant ) applies to the two partition elements. It is possible to use two distinct 1566$\Gamma4+$I models instead using the following XML code: 1567 1568 1569 1570\vspace{0.2cm} 1571\begin{Verbatim}[frame=single, label=Two distinct $\Gamma4+$I models, samepage=true, 1572 baselinestretch=0.5, fontsize=\small, numbers=left] 1573 1574<siterates id="SiteRates1"> 1575 <weights id="Distrib1" family="gamma+inv" alpha=".1" \ 1576 optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> 1577 </weights> 1578 <instance id="R1" init.value="1.0"/> 1579 <instance id="R2" init.value="1.0"/> 1580 <instance id="R5" init.value="0.0"/> 1581 <instance id="R3" init.value="1.0"/> 1582 <instance id="R4" init.value="1.0"/> 1583</siterates> 1584 1585<siterates id="SiteRates2"> 1586 <weights id="Distrib2" family="gamma+inv" alpha=".1" \ 1587 optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> 1588 </weights> 1589 <instance id="R6" init.value="1.0"/> 1590 <instance id="R7" init.value="1.0"/> 1591 <instance id="R8" init.value="0.0"/> 1592 <instance id="R9" init.value="1.0"/> 1593 <instance id="R10" init.value="1.0"/> 1594</siterates> 1595 1596<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> 1597 <mixtureelem list="R1, R2, R3, R4, R5"/> 1598 <mixtureelem list="L1, L1, L1, L1, L1"/> 1599</partitionelem> 1600 1601<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt"> 1602 <mixtureelem list="R6, R7, R8, R9, R10"/> 1603 <mixtureelem list="L2, L2, L2, L2, L2"/> 1604</partitionelem> 1605 1606\end{Verbatim} 1607 1608\x{SiteRates1} and \x{SiteRates2} here define two distinct $\Gamma4+$I models. Each of these models apply to 1609one of the two partition elements (\x{nucleic1.txt} and \x{nucleic2.txt}), allowing them to display 1610different patterns of rate variation across sites. 1611 1612 1613\subsection{XML options} 1614\subsubsection{{\tt phyml} component}\index{XML options!{\tt phyml} component} 1615Options: 1616\begin{itemize} 1617\item \x{output.file="filename"}. The main output files of PhyML analysis will be named 1618 \x{filename\_phyml\_tree} and \x{filename\_phyml\_stats}. 1619\item \x{bootstrap="nreplicates"}. Run \x{nreplicates} replicates for the non-parametric bootstrap analysis. 1620\item \x{run.id="idstring"}. PhyML will append the string \x{idstring} to each output file. 1621\item \x{print.json.trace="yes|true|no|false"}. PhyML will print the estimated trees, the 1622 corresponding loglikelihoods and various model parameters at multiple stages of the estimation process. This option is useful 1623 for monitoring the progress of the analysis when processing large data sets. 1624\item \x{print.trace="yes|true|no|false"}. PhyML will print the estimated trees (and the 1625 corresponding loglikelihoods) at multiple stages of the estimation process. This option is also useful 1626 for monitoring the progress of the analysis when processing large data sets. 1627\item \x{branch.test="aBayes|aLRT|SH|no"}. Calculate fast branch support using the aBayes method 1628 \cite{anisimova11}, aLRT \cite{anisimova06} or SH \cite{shimodaira99} tests. These branch 1629 statistics are much faster to estimate than the bootrap proportions and usually provide good 1630 estimates of the probabilities that the corresponding edges are correctly inferred (see Anisimova et 1631 al. 2011 for more precision). By default and if no bootstrap analysis is performed, branch supports 1632 are estimated using the aBayes approach. 1633\item \x{quiet="yes|no"}. Runs PhyML in quiet mode when \x{quiet=yes}. The program will not pause if the memory required to run the analysis 1634exceeds 256MB and will not output the progresssion of the log-likelihood scores on the standard output. 1635\item \x{memory.check="yes|no"}. By default, when processing a large data set, PhyML will pause and ask the user to confirm that 1636she/he wants to continue with the execution of the analysis despite the large amount of memory 1637required. Setting \x{memory.check=no} skips this question. It is especially useful when running 1638PhyML in batch mode. 1639 1640 1641 1642\end{itemize} 1643\subsubsection{{\tt topology} component}\index{XML options!{\tt topology} component} 1644Each instance of the \x{topology} component (there should be only one...) has the following options: 1645\begin{itemize} 1646\item \x{init.tree="bionj"|"user"|"random"}. Starting tree. Default is \x{bionj}. 1647\item \x{n.rand.starts="X"}. Number of random starting trees. Default is 5. 1648\item \x{file.name="name\_of\_tree\_file"}. In case \x{init.tree="user"}, this 1649 attribute is mandatory. \x{name\_of\_tree\_file} is a 1650 text file containing a tree in NEWICK format. 1651\item \x{optimise.tree="yes"|"true"|"no"|"false"}. The starting tree topology as defined by 1652 \x{init.tree} is to be optimised (or not) so as to maximise the likelihood function. 1653\item \x{search="nni"|"spr"|"none"}. Tree topology search is conducted using NNI (fast), SPR (a bit 1654 slower but more accurate) or no moves. 1655\end{itemize} 1656\vspace{0.2cm} 1657\begin{Verbatim}[frame=single, label=Example of `topology' component, samepage=true, 1658 baselinestretch=0.5, fontsize=\small, numbers=left] 1659 1660<topology> 1661 <instance id="T1" init.tree="bionj" optimise.tree="true" \ 1662 search="spr"/> 1663</topology> 1664 1665\end{Verbatim} 1666 1667\subsubsection{{\tt ratematrices} component}\index{XML options!{\tt ratematrices} component}\label{sec:xmlratematrices} 1668Each instance of a \x{ratematrices} component have the following options: 1669\begin{itemize} 1670\item \x{model="JC69"|"K80"|"F81"|"F84"|"HKY85"|"TN93"|"GTR"|"custom"} for nucleotide data. The default is \x{"HKY85"}.\\ 1671\x{model="LG"|"WAG"|"JTT"|"MtREV"|"Dayhoff"|"DCMut"|"RtREV"|"CpREV"|"VT"}\\\x{|"Blosum62"|"MtMam"|"MtArt"|"HIVw"|"HIVb"|"customaa"} 1672for amino-acid sequences. The default is \x{"LG"}. 1673\item \x{model.code="012345"}. For \x{custom} model applied to nucleotide sequences: set the 1674 string of digits that define a custom substitution model. See Table \ref{tab:modelcode} on page 1675 \pageref{tab:modelcode} for more 1676 information about the model codes. 1677\item \x{ratematrix.code="filename"}. When used in conjunction with \x{model="customaa"}, 1678 \x{filename} is the name of the file that gives the rates of substitution between amino-acids as 1679 well as their frequences at equilibrium using PAML rate matrix format. An example of such file is 1680 provided in {phyml/examples/X1.mat}. 1681\item \x{optimise.rr="yes"|"true"|"no"|"false"}. For \x{custom} and \x{GTR} nucleotide models only: 1682 optimise the substitution rate model parameters. 1683\item \x{optimise.tstv="yes"|"true"|"no"|"false"}. For \x{K80}, \x{F84}, \x{HKY85} and \x{TN93} 1684 models only: optimise the transition/transversion rate ratio. 1685\item \x{tstv="value"}. For \x{K80}, \x{HKY85} and \x{TN93} models only: set the transition/transversion to a 1686 given value. 1687\end{itemize} 1688 1689An instance of a \x{ratematrices} component where a GTR or a custom model of substitutions between 1690nucleotides is used can have a \x{rr} component associated to it in order to specificy the relative 1691rates of substitutions (see example below). 1692 1693Also, the {\tt ratematrices} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt 1694 no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each matrix in the 1695set of matrices defined by this component (see Equation \ref{equ:weights}), will be estimated from the data. 1696 1697\vspace{0.2cm} 1698\begin{Verbatim}[frame=single, label=Example of `ratematrices' component, samepage=true, 1699 baselinestretch=0.5, fontsize=\small, numbers=left] 1700 1701 <ratematrices id="RM1" optimise.weights="yes"> 1702 <instance id="M1" model="custom" model.code="000000"/> 1703 <rr AC="1.2" AG="1.5" AT="2.3" CT="1.2" CG="1.5"/> 1704 <instance id="M2" model="GTR" optimise.rr="yes"/> 1705 <instance id="M3" model="WAG"/> 1706 </ratematrices> 1707 1708\end{Verbatim} 1709 1710\subsubsection{{\tt equfreqs} component}\index{XML options!{\tt equfreqs} component} 1711Each instance of a \x{equfreqs} component has the following options: 1712\begin{itemize} 1713\item \x{base.freqs="a,b,c,d"} where \x{a-d} are nucleotide frequencies. Make sure that these 1714 frequencies are separated by comas and no space character is inserted. 1715\item \x{aa.freqs="empirical|model"}. Amino-acid frequencies are derived from counting the number of 1716 occurence of each amino-acid in the alignment (\x{aa.freqs="empirical"}) or given by the 1717 substitution model (\x{aa.freqs="model"}). 1718\item \x{optimise.freqs="true|yes|false|no"}. Nucleotide frequencies can be optimised so as to maximise 1719 the likelihood (\x{optimise.freqs="yes|true"}). 1720\end{itemize} 1721 1722The {\tt equfreqs} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt 1723 no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each vector of 1724equilibrium frequencies in the 1725set of vectors defined by this component (see Equation \ref{equ:weights}), will be estimated from the data. 1726 1727\vspace{0.2cm} 1728\begin{Verbatim}[frame=single, label=Example of `equfreqs' component, samepage=true, 1729 baselinestretch=0.5, fontsize=\small, numbers=left] 1730 1731 <equfreqs id="EF1" optimise.weights="yes"> 1732 <instance id="F1" base.freqs="0.25,0.25,0.25,0.25"/> 1733 <instance id="F2" aa.freqs="empirical"/> 1734 <instance id="F3" optimise.freqs="yes"/> 1735 </equfreqs> 1736 1737\end{Verbatim} 1738 1739\subsubsection{{\tt branchlengths} component}\index{XML options!{\tt branchlengths} component} 1740Options: 1741\begin{itemize} 1742\item \x{optimise.lens="yes"|"true"|"no"|"false"}: branch lengths are optimised or not. The default 1743 is set to \x{"yes"}. 1744\end{itemize} 1745\vspace{0.2cm} 1746\begin{Verbatim}[frame=single, label=Example of `branchlengths' component, baselinestretch=0.5, 1747 fontsize=\small, numbers=left] 1748 1749 <branchlengths id="BL1"> 1750 <instance id="L1" optimise.lens="yes"/> 1751 <instance id="L2"/> 1752 <instance id="L3" optimise.lens="false"/> 1753 </branchlengths> 1754 1755\end{Verbatim} 1756 1757\subsubsection{{\tt siterates} component}\index{XML options!{\tt siterates} component} 1758Each instance defines a class of relative rate of substitution. It has the following options: 1759\begin{itemize} 1760\item \x{value="val"}, where \x{"val"} is the relative substitution rate for the corresponding class. 1761\end{itemize} 1762A \x{siterates} component generally includes a \x{weights} element that specifies the probabilitic 1763distribution of the relative rates. The available options for such element are: 1764\begin{itemize} 1765\item \x{family="gamma|gamma+inv|freerates"}. \x{gamma} indicates that the distribution of the 1766 relative rates is set to be a discrete Gamma density. \x{gamma+inv} indicates that the relative rate model 1767 is a mixture of Gamma and invariant sites (this is the common $\Gamma+$I model). FreeRate is 1768 a model that does not use any parametric function to describe the distribution of the relative 1769 rates (see \cite{soubrier12}). Under this option, relative rates and the corresponding frequencies of these classes are 1770 directly estimated from the data. While such approach is slightly more computationally demanding 1771 than the $\Gamma$ (or $\Gamma$+I) model, it often provides a significantly better fit to the data. 1772\item \x{alpha="value|optimised"}, where \x{value} is a real positive number. Use this option to set 1773 the gamma shape parameter to the selected value. \x{optimised}: the parameter is estimated from 1774 the data (see also next option). 1775\item \x{optimise.alpha="yes|true|no|false"}. Optimise the shape of the Gamma distribution of 1776 relative rates (or not). 1777\item \x{pinv="value|optimised"}, where \x{value} is in $[0,1]$. Use this option to set 1778 the proportion of invariants to the selected value. \x{optimised}: the parameter is estimated from 1779 the data (see also next option). 1780\item \x{optimise.pinv="yes|true|no|false"}. Optimise the proportion of invariable sites (or not). 1781\item \x{optimise.freerates="yes|true|no|false"}. Optimise the parameters of the FreeRate model, 1782 i.e., the relative rates and the corresponding frequencies. 1783\end{itemize} 1784\vspace{0.2cm} 1785\begin{Verbatim}[frame=single, label=Example of `siterates' component (discrete gamma model), samepage=true, 1786 baselinestretch=0.5, fontsize=\small, numbers=left] 1787 1788 <siterates id="SR1"> 1789 <instance id="R1" init.value="1.0"/> 1790 <instance id="R2" init.value="1.0"/> 1791 <instance id="R3" init.value="1.0"/> 1792 <instance id="R4" init.value="1.0"/> 1793 <weights id="D1" family="gamma" optimise.alpha="yes" \ 1794 optimise.pinv="no"> 1795 </weights> 1796 </siterates> 1797\end{Verbatim} 1798 1799Setting up an analysis using the FreeRate\index{FreeRate} mixture model of rate variation across 1800sites \cite{soubrier12} requires a 1801bit more work. As in the discrete gamma model, the FreeRate model relies on definies classes of 1802(relative) rates. In the discrete gamma model, each class has the same frequency. This is no longer 1803the case with FreeRate. It is thus necessary to specify the frequency, or weight, of each class of rate in the 1804XML file. The example below illustrates how these weights are defined in practice: 1805\vspace{0.2cm} 1806\begin{Verbatim}[frame=single, label=Example of `siterates' component (FreeRate model), samepage=true, 1807 baselinestretch=0.5, fontsize=\small, numbers=left] 1808 <!-- Freerate model of variation of rates across sites --> 1809 <siterates id="SR1"> 1810 <instance id="R1" init.value="0.1"/> 1811 <instance id="R2" init.value="0.7"/> 1812 <instance id="R3" init.value="1.9"/> 1813 <instance id="R4" init.value="5.1"/> 1814 <weights id="D1" family="freerates" optimise.freerates="yes"> 1815 <instance appliesto="R1" value="0.4"/> 1816 <instance appliesto="R2" value="0.3"/> 1817 <instance appliesto="R3" value="0.1"/> 1818 <instance appliesto="R4" value="0.1"/> 1819 </weights> 1820 </siterates> 1821\end{Verbatim} 1822Note that the weights do not have to sum to one. PhyML rescales them before starting the analysis 1823such that these weights are transformed into proper probabilities. The relative rates themselves are 1824rescaled too such that the weighted average (relative) rate is equal to one after rescaling. 1825 1826 1827\subsubsection{{\tt partitionelem} and {\tt mixtureelem} components}\index{XML options!{\tt partitionelem} 1828 component}\index{XML options!{\tt mixtureelem} component} 1829 1830Options: 1831\begin{itemize} 1832\item \x{file.name="inputfilename"}, where \x{inputfilename} is the name of the input sequence file 1833 (in PHYLIP format) to be analysed. 1834\item \x{data.type="nt|aa"}. Specify the type of sequences to be processed (nucleotide of amino-acid sequences). 1835\item \x{interleaved="yes|true|no|false"}. Interleaved (\x{yes|true}) or sequential format 1836 (\x{no|false}) for the sequence alignment. 1837\item \x{optimise.tree.scale="yes|true|no|false"}. The sum of edge length (or tree size) is 1838 optimized. This option is relevant when different data partition elements point to the same set of 1839 edge lengths so that setting \x{optimise.tree.scale="yes"} will find the optimal ratio of tree sizes 1840 considering the two elements. In other words, the different trees all share the same (relative) edge lengths 1841 but the corresponding partition elements have different mean rates of substitution. 1842\item \x{tree.scale="val"}. \x{val} is the value of the (relative) substitution rate. By default, 1843 its value is set to 1.0. 1844\item \x{print.site.lk="yes|true|no|false"}. The likelihood at each site (and other information) 1845 will be written in a file named \x{inputfilename\_phyml\_lk}. 1846\end{itemize} 1847 1848Each \x{partitionelem} element should include exactly four \x{mixtureelem} elements, corresponding to 1849branch lengths, equilibrium frequencies, substitution rate model and tree topology. The ordering of 1850in which the \x{mixtureelem} elements are given does not matter, though exceptions apply for the 1851$\Gamma+I$ model (see below). The $n$-th element in the \x{list} 1852attribute of each \x{mixtureelem} defines the $n$-th class of the mixture model. In the example given 1853below, the first class of the mixture is made of the following elements: \x{T1}, \x{F1}, \x{R1} and 1854\x{L1}, the second class is made of \x{T1}, \x{F1}, \x{R2} and \x{L1}, etc. 1855 1856 1857\vspace{0.2cm} 1858\begin{Verbatim}[frame=single, label=Example of `partitionelem' component, samepage=true, 1859 baselinestretch=0.5, fontsize=\small, numbers=left] 1860 1861 <partitionelem id="partition1" file.name="./small_p1.nxs" \ 1862 data.type="nt" interleaved="yes"> 1863 <mixtureelem list="T1, T1, T1, T1"/> 1864 <mixtureelem list="F1, F1, F1, F1"/> 1865 <mixtureelem list="R1, R2, R3, R4"/> 1866 <mixtureelem list="L1, L1, L1, L1"/> 1867 </partitionelem> 1868 1869\end{Verbatim} 1870 1871In general, the ordering of the \x{mixtureelem} elements does not matter. However, when the model 1872has invariable sites, then the corresponding class should be first in the list of classes provided 1873by \x{mixtureelem}. For instance, in the example above, if the rates are defined as follows: 1874 1875\vspace{0.2cm} 1876\begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true, 1877 baselinestretch=0.5, fontsize=\small, numbers=left] 1878 1879 <siterates id="SR1"> 1880 <instance id="R1" init.value="0.0"/> 1881 <instance id="R2" init.value="1.0"/> 1882 <instance id="R3" init.value="1.0"/> 1883 <instance id="R4" init.value="1.0"/> 1884 <weights id="D1" family="gamma+inv" optimise.alpha="yes" \ 1885 optimise.pinv="no"> 1886 </weights> 1887 </siterates> 1888\end{Verbatim} 1889 1890then \x{R1} corresponds to the invariable rate class (as \x{init.value="0.0"}). As \x{R1} is first 1891in the \x{mixtureelem} (see line 6 in the example of \x{`partionelem'} given above), PhyML will 1892print out an explicit error message and bail out. One way to avoid this shortcoming is to define 1893\x{mixtureelem} as \x{R4, R2, R3, R1} instead. 1894 1895 1896\subsection{Example: GTR + $\Gamma$4 + I} 1897 1898The example below provides all the required options to fit a $\Gamma$4+I model to a single alignment 1899of nucleotide sequences under the GTR model of substitution using a SPR search for the best tree 1900topology. The \x{phyml} component sets the name for the analysis to \x{simple.example}, meaning that 1901each output file will display this particular string of characters. Also, the tree and statistics 1902file names will begin with \x{p1.output}. The tree topology will be estimated so as to maximise the 1903likelihood and the topology search algorithm used here is SPR, as indicated by the value of the 1904corresponding attribute (i.e., \x{search="spr"}). Only one vector of branch lengths will be used 1905here since only one partition element will be processed. Hence, the \x{<branchlengths>} component 1906only has one \x{<instance>} sub-component. Also, a single GTR model will apply to all the classes 1907for the mixture model -- the \x{<ratematrices>} component has only one \x{<instance>} sub-component, 1908corresponding to this particular substitution model. The next component, \x{<equfreqs>}, indicates 1909that a single vector of equilibrium frequencies will apply here. Next, the \x{<siterates>} component 1910has five \x{<instance>} sub-components. Four of these correspond to the non-zero relative rates of 1911evolution a defined by a discrete Gamma distribution. The last one (\x{<instance id="R5" 1912value="0.0"/>}) defines the class of the mixture corresponding to invariable sites. The 1913\x{<weight>} component indicates that a $\Gamma+$I model will be fitted here. The shape parameter of 1914the Gamma distribution and the proportion of invariants will be estimated from the data. The 1915\x{<partitionelem>} gives information about the sequence alignment (the corresponding file name, the 1916type of data and the alignment format). The \x{<mixtureelem>} components next define the mixture 1917model. Each class of the fitted model corresponds to one column, with the first column made of the 1918following elements: \x{T1, M1, F1, R1} and \x{L1}. The second class of the mixture is made of \x{T1, 1919M1, F1, R2, L1} and so forth. 1920 1921\vspace{0.2cm} 1922\begin{Verbatim}[frame=single, label=Simple PhyML XML example, samepage=true, baselinestretch=0.5, 1923 fontsize=\small, numbers=left] 1924 1925<phyml runid="simple.example" output.file="p1.output"> 1926 1927 <topology> 1928 <instance id="T1" init.tree="bionj" optimise.tree="yes" \ 1929 search="spr"/> 1930 </topology> 1931 1932 <branchlengths id="BL1"> 1933 <instance id="L1" optimise.lens="yes"/> 1934 </branchlengths> 1935 1936 <ratematrices id="RM1"> 1937 <instance id="M1" model="GTR"/> 1938 </ratematrices> 1939 1940 <equfreqs id="EF1"> 1941 <instance id="F1"/> 1942 </equfreqs> 1943 1944 <siterates id="SR1"> 1945 <instance id="R1" value="1.0"/> 1946 <instance id="R2" value="1.0"/> 1947 <instance id="R3" value="1.0"/> 1948 <instance id="R4" value="1.0"/> 1949 <instance id="R5" value="0.0"/> 1950 <weights id="D1" family="gamma+inv" optimise.alpha="yes" \ 1951 optimise.pinv="yes"> 1952 </weights> 1953 </siterates> 1954 1955 <partitionelem id="partition_elem1" file.name=\ 1956 "./p1.seq" data.type="nt" interleaved="yes"> 1957 <mixtureelem list="T1, T1, T1, T1, T1"/> 1958 <mixtureelem list="M1, M1, M1, M1, M1"/> 1959 <mixtureelem list="F1, F1, F1, F1, F1"/> 1960 <mixtureelem list="R1, R2, R3, R4, R5"/> 1961 <mixtureelem list="L1, L1, L1, L1, L1"/> 1962 </partitionelem> 1963 1964</phyml> 1965 1966\end{Verbatim} 1967 1968 1969\subsection{Example: LG4X}\index{lg4x} 1970 1971The example below shows how to fit the LG4X model \cite{lg4x} to a given alignment of amino-acid 1972sequences (file \x{M587.nex.Phy}). LG4X is a mixture model with four classes. Each class has its own 1973rate and corresponding frequencies (hence the use of the FreeRate model below, see the 1974\x{<siterates>} component). In the particular example given here, the rate values and frequencies 1975are set by the users. These parameters will then be optimized by PhyML (\x{optimise.freerates="yes"}). 1976Each class also has its own rate matrix and vector of equilibrium frequencies, which need to be provided by 1977the user (Note that these matrices can be downloaded from the following web address: 1978\url{http://www.atgc-montpellier.fr/download/datasets/models/lg4x/LG4X_4M.txt}. They are also 1979provided in the PhyML package \x{example/lg4x/} directory.) 1980 1981\vspace{0.2cm} 1982\begin{Verbatim}[frame=single, label=LG4X, samepage=true, baselinestretch=0.5, 1983 fontsize=\small, numbers=left] 1984<phyml run.id="lg4x" output.file="M587.tests" branch.test="no"> 1985 1986 <!-- Tree topology: start with BioNJ and then SPRs --> 1987 <topology> 1988 <instance id="T1" init.tree="user" file.name="user_tree.txt" \ 1989 search="spr" optimise.tree="no"/> 1990 </topology> 1991 1992 1993 <!-- Four rate matrices, read from files --> 1994 <ratematrices id="RM1"> 1995 <instance id="M1" model="customaa" ratematrix.file="X1.mat"/> 1996 <instance id="M2" model="customaa" ratematrix.file="X2.mat"/> 1997 <instance id="M3" model="customaa" ratematrix.file="X3.mat"/> 1998 <instance id="M4" model="customaa" ratematrix.file="X4.mat"/> 1999 </ratematrices> 2000 2001 <!-- Freerate model of variation of rates across sites --> 2002 <siterates id="SR1"> 2003 <instance id="R1" init.value="0.197063"/> 2004 <instance id="R2" init.value="0.750275"/> 2005 <instance id="R3" init.value="1.951569"/> 2006 <instance id="R4" init.value="5.161586"/> 2007 <weights id="D1" family="freerates" optimise.freerates="yes"> 2008 <instance appliesto="R1" value="0.422481"/> 2009 <instance appliesto="R2" value="0.336848"/> 2010 <instance appliesto="R3" value="0.180132"/> 2011 <instance appliesto="R4" value="0.060539"/> 2012 </weights> 2013 </siterates> 2014 2015 <!-- Amino-acid equilibrium freqs. are given by the models --> 2016 <equfreqs id="EF1"> 2017 <instance id="F1" aa.freqs="model"/> 2018 <instance id="F2" aa.freqs="model"/> 2019 <instance id="F3" aa.freqs="model"/> 2020 <instance id="F4" aa.freqs="model"/> 2021 </equfreqs> 2022 2023 2024 <!-- One vector of branch lengths --> 2025 <branchlengths id="BL1" > 2026 <instance id="L1" optimise.lens="yes"/> 2027 </branchlengths> 2028 2029 2030 <!-- Mixture model assemblage --> 2031 <partitionelem id="partition1" file.name="M587.nex.Phy" \ 2032 data.type="aa" interleaved="yes"> 2033 <mixtureelem list="T1, T1, T1, T1"/> 2034 <mixtureelem list="M1, M2, M3, M4"/> 2035 <mixtureelem list="F1, F2, F3, F4"/> 2036 <mixtureelem list="R1, R2, R3, R4"/> 2037 <mixtureelem list="L1, L1, L1, L1"/> 2038 </partitionelem> 2039 2040</phyml> 2041\end{Verbatim} 2042 2043In order to fit the LG4X model to the \x{proteic} sequence file provided in the \x{examples/} 2044directory, simply type \x{./phyml --xml=../examples/lg4x/lg4x.xml} (assuming the PhyML binary is installed 2045in the \x{src/} directory). You can of course slightly tweak the file \x{../examples/lg4x/lg4x.xml} 2046and use it as a template to fit this model to another data set. 2047 2048 2049\subsection{Example: CAT-like model}\index{CAT model} 2050 2051The CAT model \cite{lartillot04} is a mixture model whereby each site may have its own 2052exchangeability rate matrix and vector of state frequencies. In its original exposition, the number 2053of classes in the mixture was also a parameter to be estimated from the data. PhyML uses a fixed 2054number of classes instead. This lesser degree of sophistication facilitates the interpretation 2055of parameter estimates. Examination of the statistics and site-likelihood files produced by PhyML 2056under the CAT-like model makes it indeed straightforward to spot sites with peculiar substitution 2057pattern (e.g., sites that sustained only certain types of substitutions, for instance between 2058purines or between pyrimidines). 2059 2060In the example that follows, the CAT-like model implemented has six rate classes goverened by a 2061FreeRate model. The first and second class share the same GTR matrix (M1) and vector of nucleotide 2062frequencies (F1) (see the two leftmost columns in the assemblage matrix). The third and fourth also 2063share the same GTR matrix (M2) and vector of frequencies (F2) which are estimated independantly from 2064M1 and F1. Finally, the last two classes of the mixture share M3 and F3, which are independant from 2065the other rate matrices and state frequencies. 2066 2067 2068 2069 2070 2071\vspace{0.2cm} 2072\begin{Verbatim}[frame=single, label=CAT, samepage=true, baselinestretch=0.5, 2073 fontsize=\small, numbers=left] 2074 <phyml runid="CAT" output.file="example" branch.test="no" \ 2075 print.json.trace="no" print.site.lk="yes"> 2076 2077 <topology> 2078 <instance id="T1" init.tree="bionj" optimise.tree="yes" \ 2079 search="spr"/> 2080 </topology> 2081 2082 <branchlengths id="BL1"> 2083 <instance id="L1" optimise.lens="yes"/> 2084 </branchlengths> 2085 2086 <!-- GTR rate matrices --> 2087 <ratematrices id="RM1"> 2088 <instance id="M1" model="GTR" optimise.rr="yes"/> 2089 <instance id="M2" model="GTR" optimise.rr="yes"/> 2090 <instance id="M3" model="GTR" optimise.rr="yes"/> 2091 </ratematrices> 2092 2093 2094 <!-- Vectors of nucleotide frequencies --> 2095 <equfreqs id="EF1"> 2096 <instance id="F1" optimise.freqs="yes"/> 2097 <instance id="F2" optimise.freqs="yes"/> 2098 <instance id="F3" optimise.freqs="yes"/> 2099 </equfreqs> 2100 2101 2102 <!-- Variability of rates across sites --> 2103 <siterates id="SR1"> 2104 <instance id="R1" init.value="1.0"/> 2105 <instance id="R2" init.value="1.0"/> 2106 <instance id="R3" init.value="1.0"/> 2107 <instance id="R4" init.value="1.0"/> 2108 <instance id="R5" init.value="1.0"/> 2109 <instance id="R6" init.value="1.0"/> 2110 <weights id="D1" family="freerates" optimise.alpha="yes" \ 2111 optimise.pinv="yes"> 2112 </weights> 2113 </siterates> 2114 2115 2116 <!-- Assemblage --> 2117 <partitionelem id="partition1" file.name="./seqfile.phy" \ 2118 data.type="nt" interleaved="no" optimise.tree.scale="no"> 2119 <mixtureelem list="T1,T1,T1,T1,T1,T1"/> 2120 <mixtureelem list="R1,R2,R3,R4,R5,R6"/> 2121 <mixtureelem list="M1,M1,M2,M2,M3,M3"/> 2122 <mixtureelem list="F1,F1,F2,F2,F3,F3"/> 2123 <mixtureelem list="L1,L1,L1,L1,L1,L1"/> 2124 </partitionelem> 2125 2126 2127</phyml> 2128\end{Verbatim} 2129 2130\subsection{Example: multiple partition elements} 2131 2132The example below gives the complete XML file to specify the analysis of three partition elements, 2133corresponding to the nucleotide sequence files \x{small\_p1\_pos1.seq}, \x{small\_p1\_pos2.seq} and 2134\x{small\_p1\_pos3.seq} in interleaved PHYLIP format. Importantly, the number and names of 2135sequences in these three alignments match exactly. \x{small\_p1\_pos1.seq} is fitted with the 2136HKY85 model of substitution (with the transition/transversion ratio being estimated from the data), 2137combined to a $\Gamma4$ model of rate variation across sites (with the gamma shape parameter being 2138estimated from the data). \x{small\_p1\_pos2.seq} is fitted to a custom substitution model with the 2139constraint $A\leftrightarrow G$=$C\leftrightarrow T$. The nucleotide frequencies are set to 2140$\frac{1}{4}$ here. The model does not allow substitution rates to vary across 2141sites. \x{small\_p1\_pos3.seq} is fitted using a GTR model conbined to a $\Gamma4+$I model of rate 2142variation across sites. Note that the equilibrium nucleotide frequencies for the fourth and fifth 2143class of the mixture are set to be equal to that estimated from the first partition element (i.e., 2144\x{F1}) . The initial phylogeny is built using BioNJ and the tree topology is to be estimated using 2145a NNI search algorithm. 2146 2147\vspace{0.2cm} 2148\begin{Verbatim}[frame=single, label=Example of PhyML XML file, samepage=true, baselinestretch=0.5, 2149 fontsize=\small, numbers=left] 2150 2151<phyml runid="nnisearch" output.file="small_p1_output"> 2152 2153 <topology> 2154 <instance id="T1" init.tree="bionj" optimise.tree="yes" \ 2155 search="nni"/> 2156 </topology> 2157 2158 <branchlengths id="BL1"> 2159 <instance id="L1" optimise.lens="yes"/> 2160 <instance id="L2"/> 2161 <instance id="L3"/> 2162 </branchlengths> 2163 2164 <ratematrices id="RM1"> 2165 <instance id="M1" model="HKY85" optimise.tstv="yes"/> 2166 <instance id="M2" model="custom" model.code="102304" \ 2167 optimise.rr="yes"/> 2168 <instance id="M3" model="GTR"/> 2169 </ratematrices> 2170 2171 <equfreqs id="EF1"> 2172 <instance id="F1"/> 2173 <instance id="F2" base.freqs="0.25,0.25,0.25,0.25"/> 2174 <instance id="F3"/> 2175 </equfreqs> 2176 2177 <siterates id="SR1"> 2178 <instance id="R1" value="1.0"/> 2179 <instance id="R2" value="1.0"/> 2180 <instance id="R3" value="1.0"/> 2181 <instance id="R4" value="1.0"/> 2182 <weights id="D1" family="gamma" optimise.alpha="yes" \ 2183 optimise.pinv="no"> 2184 </weights> 2185 </siterates> 2186 2187 <siterates id="SR2"> 2188 <instance id="R8" value="1.0"/> 2189 <weights id="D2" family="gamma" optimise.alpha="yes" \ 2190 optimise.pinv="yes"> 2191 </weights> 2192 </siterates> 2193 2194 <siterates id="SR3"> 2195 <instance id="R10" value="1.0"/> 2196 <instance id="R11" value="1.0"/> 2197 <instance id="R12" value="1.0"/> 2198 <instance id="R13" value="1.0"/> 2199 <instance id="R14" value="1.0"/> 2200 <weights id="D3" family="gamma" optimise.alpha="yes" \ 2201 optimise.pinv="yes"> 2202 </weights> 2203 </siterates> 2204 2205\end{Verbatim} 2206 2207\vspace{0.2cm} 2208\begin{Verbatim}[frame=single, label=Example of PhyML XML file (ctnd), samepage=true, 2209 baselinestretch=0.5, fontsize=\small, numbers=left] 2210 2211 <partitionelem id="partition_elem1" file.name=\ 2212 "./small_p1_pos1.seq" data.type="nt" interleaved="yes"> 2213 <mixtureelem list="T1, T1, T1, T1"/> 2214 <mixtureelem list="M1, M1, M1, M1"/> 2215 <mixtureelem list="F1, F1, F1, F1"/> 2216 <mixtureelem list="R1, R2, R3, R4"/> 2217 <mixtureelem list="L1, L1, L1, L1"/> 2218 </partitionelem> 2219 2220 <partitionelem id="partition_elem2" file.name=\ 2221 "./small_p1_pos2.seq" data.type="nt" interleaved="yes"> 2222 <mixtureelem list="T1"/> 2223 <mixtureelem list="M2"/> 2224 <mixtureelem list="R8"/> 2225 <mixtureelem list="F2"/> 2226 <mixtureelem list="L2"/> 2227 </partitionelem> 2228 2229 <partitionelem id="partition_elem3" file.name=\ 2230 "./small_p1_pos3.seq" data.type="nt" interleaved="yes"> 2231 <mixtureelem list="T1, T1, T1, T1, T1"/> 2232 <mixtureelem list="M3, M3, M3, M3, M3"/> 2233 <mixtureelem list="R10, R11, R12, R13, R14"/> 2234 <mixtureelem list="F3, F3, F3, F1, F1"/> 2235 <mixtureelem list="L3, L3, L3, L3, L3"/> 2236 </partitionelem> 2237 2238</phyml> 2239 2240\end{Verbatim} 2241 2242 2243\subsection{Branch lengths with invariants and partionned data} 2244 2245Accommodating for models with invariable sites applying to some elements of a partitioned data, with 2246these elements sharing the same set of edge lengths can lead to inconsistencies. Consider for 2247instance a partitioned data set with two elements. Assume that these two elements share the same 2248set of edge lengths. Also, consider that GTR+I applies to the first element and HKY applies to the 2249second. Now, the expected number of substitutions per site for the first element of the partition is 2250equal to $(1-p)l$, where $p$ is the estimated proportion of invariants and $l$ is the 2251maximum-likelihood estimate for the length of that specific edge. For the second element of the 2252partition, the expected number of substitutions per site is equal to $l$, rather than $(1-p)l$. 2253While $l$ are common to the two elements, matching the specification of the input model, the actual 2254edge lengths do differ across the two partition elements. Please be aware that, due to the 2255programming structure implemented in PhyML, the program will only return one value here, which will 2256be equal to $(1-p)l$. 2257 2258\section{Citing PhyML} 2259The ``default citation'' for PhyML is: 2260\begin{itemize} 2261\item 2262``New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance 2263of PhyML 3.0''. Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010, {\it Systematic 2264 Biology}, 59(3):307-321 2265 2266\end{itemize} 2267The ``historic citation'' for PhyML is: 2268\begin{itemize} 2269\item ``A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood'' 2270 Guindon S., Gascuel O. 2003, {\it Systematic Biology}, 52(5):696-704 2271\end{itemize} 2272 2273 2274 2275\section{Other programs} 2276 2277PhyML is software package that provides tools to tackle problems other than estimating maximum 2278likelihood phylogenies. Installing these tools and processing data sets is explained is the 2279following sections. 2280 2281\subsection{PhyTime}\index{PhyTime} PhyTime is a program that uses Bayesian sampling techniques to 2282infer phylogenies from the analysis of genetic and fossil data. Edge lengths (and node heights) are 2283expressed in calendar time units as opposed to expected number of substitutions as in 2284PhyML. PhyTime can thus be used to date past evolutionary events. The main original features 2285compared to other software for molecular dating are: 2286\begin{itemize} 2287\item a model of rate evolution whereby rates can vary along each branch (other 2288 software consider that the rates of evolution vary from one branch to another but stay constant 2289 along any given branch) \cite{guindon13}. 2290\item the opportunity to accomodate for uncertainty in calibration information. A single time 2291 constraint can apply to multiple clades with corresponding probabilities as determined from prior 2292 analysis of fossil data \cite{guindon18}. 2293\end{itemize} 2294 2295\subsubsection{Installing PhyTime} 2296 2297Compiling PhyTime is straightforward on Unix-like machines (i.e., linux and MacOS systems). PhyTime 2298is not readily available for Windows machines but compilation should be easy on this system too. In 2299the `phyml' directory, where the `src/' and `doc/' directories stand, enter the following commands: 2300{\setlength{\baselineskip}{0.5\baselineskip} 2301\begin{verbatim} 2302./configure --enable-phytime; 2303make clean; 2304make; 2305\end{verbatim} } This set of commands generates a binary file called \x{phytime} which can be found 2306 in the `src/' directory. 2307 2308\subsubsection{Running PhyTime} Passing options and running PhyTime on your data set is quite 2309similar to running PhyML using an XML parameter input file, i.e., a typical run would be launched 2310using the following command: \x{./phytime --xml=./dating\_example.xml}, assuming that the \x{phytime} binary file is in the same 2311directory as the XML control file \x{dating\_example.xml}, which happens to be the current directory here. The main 2312differences between PhyML and PhyTime are explained below: 2313\begin{itemize} 2314\item Unlike PhyML, PhyTime requires calibration (along with genetic sequence) data as input. The format for 2315 defining the relevant time intervals is described below. 2316\item PhyTime does not allow partitionned (i.e., multigene) analysis yet. 2317\end{itemize} 2318 2319\subsubsection{PhyTime input} 2320As stated above, PhyTime takes as input an XML file very similar to those compatible with PhyML. An 2321example is given thereafter. The first part of the input file (up to the \x{partitionelem} block) corresponds to a standard XML input 2322file for PhyML. Note however that the tag name ``\x{phytime}'' replaces ``\x{phyml}''. The second 2323part is PhyTime-specific. Three new tags can be found here: \x{lineagerates}, \x{clade} 2324and \x{calibration}\index{PhyTime!calibration}. 2325 2326The \x{lineagerates}\index{PhyTime!lineagerates} element is used to define the model of rate 2327variation across lineages. Its only attribute is \x{model} which can take the following values: 2328\begin{itemize} 2329\item \x{model=strictclock} or \x{model=clock} implements the strict clock model, i.e., all lineages 2330 evolve at the same instantaneous rate, at any point in time. 2331\item \x{model=lognormal} or \x{model=normal}. The logarithm of the average rate along each branch is distributed as a 2332 normal distribution. The mean and variance of this distribution are the same for all branches. More 2333 precisely, let $R_b$ be the average rate of substitution along edge $b$. We have $R_b=X_b\cdot 2334 \mu$, where $\mu$ is the ``baseline'' (or average over lineages) rate of substitution and $X_b$ is 2335 thus the branch-specific {\em relative} rate of substitution. The lognormal model has $\log(X_b) 2336 \sim \mathcal{N}(1,\nu)$, i.e., the logarithm of the relative rate is normally distributed with 2337 mean (mode and median) set to 1.0 and standard deviation equal to $\nu$. 2338\item \x{model=geometricbrownian} or \x{model=brownian} or \x{model=geo}. The logarithm of the rate 2339 of substitution evolves according to a Brownian motion process along the tree. More precisely, the 2340 logarithm of the rate of evolution at the end of edge $b$, noted as $Y_b^{\text{stop}}=\log(R_b^{\text{stop}})$, is 2341 normally distributed with variance $v := \nu \cdot \Delta t_b$, where $\Delta t_b$ is the (calendar) time elapsed 2342 along edge $b$, and mean equal to $Y_b^{\text{start}}-\frac{1}{2} v$, so that $\Ex(R_b^{\text{stop}})=\Ex(R_b^{\text{start}})$. 2343 Under the geometric Brownian motion model, the average rate of evolution is also random. This 2344 variability is taken into account in the calculation of the transition between nucleotides or 2345 amino-acids along edges. See Guindon, 2012, {\em Syst. Biol.} for more information. 2346\end{itemize} 2347 2348The \x{clade}\index{PhyTime!clade} element is used to define a subset of taxa. Each of these taxa is 2349given in a \x{taxon} element. The value that each taxon takes is a string corresponding to the 2350name of one of the sequences in the alignment file. In the example that follows, \x{Gymno\_Araucaria}, 2351\x{Gymno\_Ginko}, \x{Gymno\_Juniperus} and 2352\x{Gymno\_Juniperus} are three taxa that define a clade called \x{Gymno1}. Note that this clade may 2353not be monophyletic. In fact, it is not considered as such (i.e., monophyletic) during the 2354inference. The second important element is \x{calibration}. It defines time intervals corresponding 2355to the time of diversification (i.e., the time of the most recent common ancestor) of the set of 2356taxa the calibration points to. In other words, a given time 2357interval defines the calibration for the timing of the crown node of a given clade. Each interval is defined using 2358the \x{upper} and \x{lower} tags. The upper (resp. lower) bound for each interval is a date 2359expressed in any unit you fancy. 2360 2361\vspace{0.2cm} 2362\begin{Verbatim}[frame=single, label=Example of PhyTime XML file, samepage=true, baselinestretch=0.5, 2363 fontsize=\small, numbers=left] 2364 <phytime run.id="June2018" output.file="out" \ 2365 mcmc.chain.len="1E+7" mcmc.sample.every="100" \ 2366 mcmc.print.every="50" mcmc.burnin="10000" \ 2367 mutmap="no"> 2368 2369 <!-- Tree topology --> 2370 <topology> 2371 <instance id="T1" init.tree="bionj" optimise.tree="yes"/> 2372 </topology> 2373 2374 2375 <!-- Model of rate variation across lineages --> 2376 <lineagerates model="lognormal"/> 2377 2378 <!-- Substitution model --> 2379 <ratematrices id="RM1"> 2380 <instance id="M1" model="HKY85" optimise.tstv="no" tstv="4.0"/> 2381 </ratematrices> 2382 2383 2384 <!-- Freerate model of variation of rates across sites --> 2385 <siterates id="SR1"> 2386 <instance id="R3" init.value="0.5"/> 2387 <instance id="R2" init.value="1.0"/> 2388 <instance id="R1" init.value="2.0"/> 2389 <weights id="D1" family="freerates" optimise.freerates="no"> 2390 <instance appliesto="R3" value="0.33"/> 2391 <instance appliesto="R2" value="0.33"/> 2392 <instance appliesto="R1" value="0.33"/> 2393 </weights> 2394 </siterates> 2395 2396 <!-- Nucleotide frequencies --> 2397 <equfreqs id="EF1"> 2398 <instance id="F1" optimise.freqs="no"/> 2399 </equfreqs> 2400 2401 2402 <!-- Vector of edge lengths --> 2403 <branchlengths id="BL1" > 2404 <instance id="L1" optimise.lens="no"/> 2405 </branchlengths> 2406 2407 <!-- Model assembly --> 2408 <partitionelem id="partition1" file.name="./seq.txt" \ 2409 data.type="nt" interleaved="no"> 2410 <mixtureelem list="T1, T1, T1"/> 2411 <mixtureelem list="M1, M1, M1"/> 2412 <mixtureelem list="F1, F1, F1"/> 2413 <mixtureelem list="R1, R2, R3"/> 2414 <mixtureelem list="L1, L1, L1"/> 2415 </partitionelem> 2416\end{Verbatim} 2417 2418\begin{Verbatim}[frame=single, label=Example of PhyTime XML file, samepage=true, baselinestretch=0.5, 2419 fontsize=\small, numbers=left] 2420 <clade id="Gymno1"> 2421 <taxon value="Gymno_Araucaria"/> 2422 <taxon value="Gymno_Ginkgo"/> 2423 <taxon value="Gymno_Juniperus"/> 2424 <taxon value="Gymno_Sciadopitys"/> 2425 </clade> 2426 2427 <clade id="Gymno2"> 2428 <taxon value="Gymno_Araucaria"/> 2429 <taxon value="Gymno_Juniperus"/> 2430 <taxon value="Gymno_Sciadopitys"/> 2431 </clade> 2432 2433 <calibration id="cal1"> 2434 <lower>40</lower> 2435 <upper>60</upper> 2436 <appliesto clade.id="Gymno1" probability="0.8"/> 2437 <appliesto clade.id="Gymno2" probability="0.2"/> 2438 </calibration> 2439 2440 <calibration id="cal2"> 2441 <lower>50</lower> 2442 <upper>200</upper> 2443 <appliesto clade.id="Gymno1"> 2444 </appliesto> 2445 </calibration> 2446 2447</phytime> 2448\end{Verbatim} 2449 2450\subsubsection{Accounting for calibration uncertainty} 2451 2452It is not always obvious to determine with full precision and accuracy where in the tree a given 2453fossil branches in {\em a priori}. For instance, a first morphological feature of an ancestral (fossilized) species 2454might be shared by species A but not B nor C, while a second feature might be displayed by A and B but 2455not C. Assuming for simplicity that the true tree topology is ((A,B),C), then the first feature calibrates the most recent common ancestor (MRCA)\index{MRCA} of A and 2456B. Indeed, this ancestor cannot be younger than the fossil itself because, if this was the case, then 2457A {\em and} B would display the morphological feature of interest. In a similar fashion, the second 2458feature suggests that the calibration should instead apply to the MRCA of A, B and C. Therefore, assuming that the fossil was discovered in a 2459well-defined geological layer which age range is known without ambiguity, there is still uncertainty 2460around the node in the tree this time interval calibrates. In the particular example given above, it 2461is not clear whether the calibration constraints applies to the node corresponding to the MRCA of A 2462and B, or the root of the tree. 2463 2464PhyTime accomodates for this uncertainty by giving a 2465probability to the two scenari: with probability $\alpha$ the calibration time interval applies to 2466the MRCA of species A and B and with probability $1-\alpha$, this interval calibrates the age of the 2467MRCA of A, B and C. 2468 2469It is fairly straightforward to set up an analysis that incorporates probabilistic distributions on 2470calibrations in PhyTime. The section of the XML file below gives a simple example whereby the 2471calibration ``\x{cal1}'' applies to clade ``\x{Gymno1}'' (i.e., the smallest clade in the tree that 2472displays the four corresponding taxa) with probability 0.8 and to clade ``\x{Gymno2}'' with probability 0.2. 2473Also, calibration \x{cal2} applies to ``\x{Gymno1}'' with probability 1.0, which is implicit here and 2474thus does not need to be specified in the XML file. 2475 2476\begin{Verbatim}[frame=single, label=Calibrating with uncertainty, samepage=true, baselinestretch=0.5, 2477 fontsize=\small, numbers=left] 2478 <clade id="Gymno1"> 2479 <taxon value="Gymno_Araucaria"/> 2480 <taxon value="Gymno_Ginkgo"/> 2481 <taxon value="Gymno_Juniperus"/> 2482 <taxon value="Gymno_Sciadopitys"/> 2483 </clade> 2484 2485 <clade id="Gymno2"> 2486 <taxon value="Gymno_Araucaria"/> 2487 <taxon value="Gymno_Juniperus"/> 2488 <taxon value="Gymno_Sciadopitys"/> 2489 </clade> 2490 2491 <calibration id="cal1"> 2492 <lower>40</lower> 2493 <upper>60</upper> 2494 <appliesto clade.id="Gymno1" probability="0.8"/> 2495 <appliesto clade.id="Gymno2" probability="0.2"/> 2496 </calibration> 2497 2498 <calibration id="cal2"> 2499 <lower>50</lower> 2500 <upper>200</upper> 2501 <appliesto clade.id="Gymno1"> 2502 </appliesto> 2503 </calibration> 2504 2505</phytime> 2506\end{Verbatim} 2507 2508Note that in this example, it can be the case that both \x{cal1} and \x{cal2} apply to the same 2509clade (\x{Gymno1}). In that particular situation, the actual calibration interval 2510is conservative and derives from the overlap of the two calibration time intervals (one 2511corresponding to \x{cal1}, the other to \x{cal2}). The MRCA of 2512\x{Gymno\_Araucaria}, \x{Gymno\_Ginko}, \x{Gymno\_Juniperus} and \x{Gymno\_Juniperus} has an age 2513that falls here in the $[50,60]$ interval. 2514 2515\subsubsection{MCMC settings}\label{sec:phytimesettings} 2516 2517PhyTime estimates the joint posterior distribution of the phylogenetic model parameters using an 2518MCMC algorithm. This algorithm relies on sampling values of these parameters iteratively. The 2519default number of iterations is fixed to 1E+07 and values of parameters are recorded every 1E+03 2520iteration of the MCMC algorithm. The burn-in period, during which tuning parameters of the MCMC as 2521adjusted, lasts for 1E+04 iterations. These three parameters can be modified by setting the values 2522of the attributes \x{mcmc.chain.len}, \x{mcmc.sample.every} and \x{mcmc.burnin} in the XML parameter 2523file accordingly. For instance, the first line of this file could look as follows: 2524\vspace{0.2cm} 2525\begin{Verbatim}[frame=single, label=MCMC settings, samepage=true, baselinestretch=0.5, 2526 fontsize=\small, numbers=left] 2527<phytime 2528run.id="example" output.file="small" mcmc.chain.len="1E+7" 2529mcmc.sample.every="100" mcmc.print.every="50" mcmc.burnin="10000"> 2530\end{Verbatim} 2531 2532The majority of analyses will not warrant 1E+07 iterations to converge to the target 2533distribution (i.e., converge to the solution). I thus recommend monitoring each analysis by loading 2534on a regular basis the statistics output file produced by PhyTime and interrupt the analysis once 2535the effective sample sizes of all parameters have reached 200. 2536 2537The MCMC algorithm implemented in 2538PhyTime relies on multiple operators that update different parameters, or the same parameter in 2539different ways. The weights of these operators, i.e., the frequency with which they are applied, 2540were adjusted for the analysis of the plant data set described in \cite{guindon18}. If you think 2541these default values are not suitable for your own analysis, please feel free to have a look at the 2542function \x{MCMC\_Complete\_MCMC} in the file \x{mcmc.c} to make the appropriate changes to the 2543values of \x{mcmc->move\_weight[XXX]}, where \x{XXX} is the name of the operator. Alternatively, you 2544might want to send me an email (\url{guindon@lirmm.fr}) if you are unsure about all this. 2545 2546It is generally very useful to compare the posterior estimates of model parameters to that obtained 2547from a sampler that ignores sequence data (i.e., the Bayesian sampler only takes as input data the 2548calibration information). Such analysis helps quantifying how sensitive are the marginal posterior 2549estimates to the estimates one would get prior collecting and analysing genetic sequences. The 2550attribute \x{ignore.sequences} used in the \x{phytime} tag permits to run such analysis. The first 2551lines of a corresponding XML file would then look as follows: 2552 2553\vspace{0.2cm} 2554\begin{Verbatim}[frame=single, label=Sampling from the prior in PhyTime, samepage=true, baselinestretch=0.5, 2555 fontsize=\small, numbers=left] 2556<phytime run.id="prior" output.file="seq" ignore.sequences> 2557\end{Verbatim} 2558 2559\subsubsection{PhyTime output} 2560 2561The program PhyTime generates two output files. The statistics file 2562is called `\x{outputfile\_phytime\_stats\_runid.txt}'. It lists the 2563node times and (relative) substitution rates on edges sampled during the estimation process. It 2564also gives the sampled values for other parameters, such as the autocorrelation of rates 2565(parameter `nu'), and the rate of evolution (parameter `clock') amongst others. This output file 2566can be analysed with the program Tracer\index{Tracer} from the 2567BEAST\index{BEAST} package (\url{http://beast.bio.ed.ac.uk/Main_Page}) or Icylog 2568(\url{http://tgvaughan.github.io/icylog/}\index{Icylog}). It is also possible to process this file using R here. Section 2569\ref{sec:phytimeexample} of this document gives more information about the statistics output file. The second file is 2570called `\x{outputfile\_phytime\_trees\_runid.txt}'. It is the list of rooted trees that were collected 2571during the estimation process, i.e., phylogenies sampled from the posterior density of trees. This 2572file can be processed using the software TreeAnnotator, also part of the BEAST 2573package (see \url{http://beast.bio.ed.ac.uk/Main_Page}) in order to generate confidence sets for 2574the node time estimates. 2575 2576 2577\subsubsection{An example of PhyTime input and output files}\label{sec:phytimeexample} 2578 2579The directory \x{phyml/examples/phytime/} provides a sequence alignment (\x{seq.txt}) and an XML 2580input file (\x{dating\_example.xml}) that can be used to run an analysis (using the following command: 2581\x{phytime --xml=dating\_example.xml}). The XML file is very similar to that described above and will thus not 2582be discussed further. The columns of the statistics output file are as follows: 2583\begin{itemize} 2584\item \x{sample}: the index of the iteration in the MCMC algorithm. For instance, a sample value of 2585 100 means that 100 ``moves'' in the MCMC have been applied so far, one such move can correspond to 2586 an attempt to modify the tree topology, or height of some internal nodes, etc. 2587\item \x{lnL(posterior)}: the logarithm of the posterior density of the current model parameters. It is this function 2588 that the MCMC attempts to get samples from. 2589\item \x{lnL(seq)}: the logarithm of the conditional probability of the sequence alignment given the current model 2590 parameters. 2591\item \x{lnL(times)}: the logarithm of the probability density of the phylogeny evaluated under the 2592 birth and death branching process. 2593\item \x{lnL(rates)}: the logarithm of the probability density of the relative substitution rates 2594 along every edge of the current phylogeny. 2595\item \x{birth}: sampled values of the birth parameter. Histogram of these values provide an 2596 estimate of the marginal posterior distribution for this parameter. 2597\item \x{death}: death parameter for the birth-death branching process. 2598\item \x{clock}: mean rate of substitution. 2599\item \x{root}: age of the root node, i.e. the MRCA of all taxa in the sample. 2600\item \x{tstv}: transition/transversion ratio (the substitution model used 2601 here is HKY, as selected in the \x{param.xml} file). 2602\item \x{nu}: rate autocorrelation parameter. 2603\item \x{rr0}, \x{rr1}, \x{rr2}, \x{pr0}, \x{pr1}, \x{pr2}: relative rate of substitution and the 2604 corresponding frequencies. The mixture model of rate variation used in this analysis is the 2605 FreeRate model \cite{soubrier12} with three rate classes. These columns give the values of 2606 these six parameters (though there are only four free parameters) sampled from the target 2607 (posterior) distribution. 2608\item \x{t(calib:cal1\_clade:Gymno1)}: sampled age of the MRCA of \x{Gymno1} when calibration 2609 \x{cal1} applies to this particular node. 2610\item \x{t(calib:cal1\_clade:Gymno2)}: sampled age of the MRCA of \x{Gymno2} when calibration 2611 \x{cal1} applies to this node. 2612\item \x{t(calib:cal2\_clade:Gymno1)}: sampled age of the MRCA of \x{Gymno1} when calibration 2613 \x{cal12} applies to this node. 2614\item \x{clade(calib:cal1)}: the clade id (\x{Gymno1} or \x{Gymno2}) calibration \x{cal1} applies 2615 to (0 for \x{Gymno1} and 1 for \x{Gymno2}). 2616\item \x{clade(calib:cal2)}: the clade id calibration \x{cal2} applies 2617 to. 2618\item \x{br0}, $\ldots$, \x{br41}: relative rate on all edges in the tree. 2619\item \x{t22}, $\ldots$, \x{t42}: ages of all internal nodes. 2620\item The remaining columns provides information about the acceptance rate of various operators in 2621 the MCMC and the corresponding tuning parameters. 2622\end{itemize} 2623 2624The script below can be copied and pasted into R in order to produce the plots in Figure 2625\ref{fig:phytimetrace} (assuming R was launched from the \x{phyml/examples/phytime} directory). The two plots at the top left and center give the trace of the node age of 2626the MRCA of \x{Gymno1} when calibration \x{cal1} applies to it (left) and that of \x{Gymno2} when 2627the same calibration \x{cal1} applies to it (centre). The plot on the top right gives the clade (which of 2628\x{Gymno1} or \x{Gymno2}) the calibration \x{cal1} applies to. 2629 2630The analysis of these plots 2631shows the impact of applying the calibration constraint \x{cal1} to \x{Gymno1} and \x{Gymno2} alternatively. 2632When \x{cal1} applies to \x{Gymno2} (i.e., ({\em Gymno Araucaria}, {\em Gymno Juniperus}, {\em Gymno 2633 Sciadopitys})) the age of the MRCA of \x{Gymno1} ((i.e., ({\em Gymno\_Ginkgo}, {\em Gymno Araucaria}, {\em Gymno Juniperus}, {\em Gymno 2634 Sciadopitys}) is free to wander towards older values compared to the situation where both 2635\x{cal1} and \x{cal2} apply to \x{Gymno1}. The age of the MRCA of \x{Gymno2} is here bound to fall in the 2636$[40,60]$ time interval. When \x{cal1} applies to \x{Gymno1} instead, then the MRCA of that clade 2637has to fall in the $[50,60]$ interval, while that of \x{Gymno2} has to be younger than the MRCA of \x{Gymno1}. 2638 2639 2640 2641\begin{Verbatim}[frame=single, label=R script to produce traces from the PhyTime statistics file, samepage=true, baselinestretch=0.5, 2642fontsize=\small, numbers=left] 2643file=paste("out_phytime_stats_June2018.txt",sep=""); 2644d=read.table(file,header=T); 2645idx=floor(0.1*length(d$sample)):length(d$sample) 2646par(mfrow=c(2,3),mar=c(5,4,2,2)); 2647plot(d$sample[idx],d$t.calib.cal1_clade.Gymno1.[idx],type="l", 2648col="red",ylab="Time",xlab="Sample",main="Time Gymno1") 2649plot(d$sample[idx],d$t.calib.cal1_clade.Gymno2.[idx],type="l", 2650col="red",ylab="Time",xlab="Sample",main="Time Gymno2") 2651plot(d$sample[idx],d$clade.calib.cal1.[idx],type="l",col="black", 2652ylab="Clade",xlab="Sample",main="(0->Gymno1, 1->Gymno2)") 2653plot(d$sample[idx],d$lnL.seq.[idx],type="l",col="orange", 2654ylab="log(Probability)",xlab="Sample",main="Likelihood sequences") 2655plot(d$sample[idx],d$lnL.posterior.[idx],type="l",col="orange", 2656ylab="log(density)",xlab="Sample",main="Joint posterior") 2657plot(d$sample[idx],d$lnL.times.[idx],type="l",col="orange", 2658ylab="log(density)",xlab="Sample",main="Branching process") 2659\end{Verbatim} 2660 2661 2662\begin{figure} 2663\begin{center} 2664\includegraphics[width=13.8cm]{./fig/phytimetrace} 2665\end{center} 2666\caption{{\bf Traces from the statistics output file produced by PhyTime.}} 2667\label{fig:phytimetrace} 2668\end{figure} 2669 2670% Important information is also displayed on the standard output of PhyTime (the standard output 2671% generally corresponds to the terminal window from which PhyTime was launched). The first column of 2672% this output gives the current generation, or run, of the chain. It starts at 1 and goes up to 1E+7 2673% by default. The second column gives the time elapsed in seconds since the sampling began. The 2674% third column gives the log likelihood of the phylogenetic model (i.e., `Felsenstein's 2675% likelihood'). The fourth column gives the logarithm of the joint prior probability of substitution 2676% rates along the tree and node heights. The fifth column gives the current sampled value of the 2677% EvolRate parameter along with the corresponding Effective Sample Size (ESS) for this parameter. 2678% The sixth column gives the tree height and the corresponding ESS. The seventh column gives 2679% the value of the autocorrelation parameter followed by the corresponding ESS. The eightth column 2680% gives the values of the birth rate parameter that governs the birth-rate model of species 2681% divergence dates. The last column of the standard output gives the minimum of the ESS values 2682% taken over the whole set of node height estimates. It provides useful information when one has 2683% to decide whether or not the sample size is large enough to draw valid conclusion, i.e., decide 2684% whether the chain was run for long enough (see Section \ref{sec:recomphytime} for more detail about 2685% adequate chain length). 2686 2687% \subsubsection{ClockRate vs. EvolRate} 2688 2689% The average rate of evolution along a branch is broken into two components. One is called ClockRate 2690% and is the same throughout the tree. The other is called EvolRate and corresponds to a weighted 2691% average of branch-specific rates. The model of rate evolution implemented in PhyTime forces the 2692% branch-specific rate values to be greater than one. As a consequence, ClockRate is usually smaller 2693% EvolRate. 2694 2695% In more mathematical terms, let $\mu$ be the value of ClockRate, $r_i$ be the value of the relative 2696% rate along branch $i$ and $\Delta_i$ the time elapsed along branch $i$. The value of EvolRate is 2697% then given by: 2698% \begin{eqnarray*} 2699% \mathrm{EvolRate} = \mu \frac{\sum_{i}^{2n-3} r_i \Delta_i}{\sum_{i}^{2n-3} \Delta_i}. 2700% \end{eqnarray*} It is clear from this equation that multiplying each $r_i$ by a constant and 2701% dividing $\mu$ by the same constant does not change the value of EvolRate. The $r_i$s and $\mu$ 2702% are then confounded, or non-identifiable, and only the value of EvolRate can be estimated from 2703% the data. {\color{red}{Please make sure that you use the value of EvolRate rather than that of 2704% ClockRate when referring to the estimate of the substitution rate}}. 2705 2706% \subsubsection{Effective sample size}\label{sec:ess} 2707 2708% The MCMC technique generates samples from a target distribution (in our case, the joint posterior 2709% density of parameters). Due to the Markovian nature of the method, these samples are not 2710% independent. The ESS is the estimated number of independent measurements obtained from a set of 2711% (usually dependent) measurements. It is calculated using the following formula: 2712% \begin{eqnarray*} 2713% \mathrm{ESS} = N\left(\frac{1-r}{1+r}\right), 2714% \end{eqnarray*} 2715% where $N$ is the length of the chain (i.e., the `raw' or `correlated' sample size) and $r$ is the 2716% autocorrelation value, which is obtained using the following formula: 2717% \begin{eqnarray*} 2718% r = \frac{1}{(N-k)\sigma_x^2} \sum_{i=1}^{N-k} (X_i - \mu_x)(X_{i+k}-\mu_x), 2719% \end{eqnarray*} where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the $X_i$ values 2720% respectively and $k$ is the lag. The value of $r$ that is used in PhyTime corresponds to the case where $k=1$, 2721% which therefore gives a first order approximation of the `average' autocorrelation value (i.e., the 2722% autocorrelation averaged over the set of possible values of the lag). 2723 2724 2725% \subsubsection{Prior distributions of model parameters}\label{sec:prior} 2726 2727% Any Bayesian analysis requires specifying a prior distribution of model parameters. The outcome of 2728% the data analysis, i.e., the posterior distribution, is influenced by the priors. It is especially 2729% true if the signal conveyed by the data is weak. While some have argued that the specification of 2730% priors relies more on arbitrary decisions than sound scientific reasoning, choosing relevant prior 2731% distributions is in fact fully integrated in the process of building model that generates the 2732% observed data. In particular, the problem of estimating divergence times naturally lends itself to 2733% hierarchical Bayesian modelling. Based on the hypothesis that rates of evolution are conserved (to 2734% some extant) throughout the course of evolution, the hierarchical Bayesian approach provides an 2735% adequate framework for inferring substitution rates and divergence dates separately. Hence, in this 2736% situation, it makes good sense to use what is known about a relatively well-defined feature of the 2737% evolution of genetic sequences (the ``molecular clock'' hypothesis combined to stochastic variations of 2738% rates across lineages) to build a prior distribution on rates along edges. 2739 2740 2741 2742 2743\subsubsection{Citing PhyTime}\label{sec:citephytime} 2744 2745The ``default citation'' is: 2746 2747\begin{itemize} 2748\item Guindon S. ``From trajectories to averages: an improved description of the heterogeneity of 2749substitution rates along lineages'', {\it Systematic Biology}, 2013. 62(1): 22-34. 2750\end{itemize} 2751 2752If you are using mutliple calibrations on a given clade, please cite the following article: 2753\begin{itemize} 2754\item Guindon S. ``Accounting for calibration uncertainty: Bayesian molecular dating as a ``doubly intractable'' problem'', {\it Systematic Biology}, 2018. 67(4): 651-661. 2755\end{itemize} 2756 2757An earlier article also describes some of the methods implemented in PhyTime: 2758 2759\begin{itemize} 2760\item Guindon S. ``Bayesian estimation of divergence times from large data sets'', {\it Molecular 2761 Biology and Evolution}, 2010, 276227(8):1768:81. 2763\end{itemize} 2764 2765 2766 2767\subsection{PhyloGeo}\index{PhyloGeo} PhyloGeo is a program that implements the 2768competition-dispersal phylogeography model described in Ranjard, Welch, Paturel and Guindon 2769``Modelling competition and dispersal in a statistical phylogeographic framework''. Accepted for 2770publication in {\it Systematic Biology}. 2771 2772It implements a Markov Chain Monte Carlo approach that samples from the posterior distribution of 2773the three parameters of interest in this model, namely the competition intensity $\lambda$, the 2774dispersal bias parameter $\sigma$ and the overal dispersal rate $\tau$. The data consist in a 2775phylogeny with node heights proportional to their ages and geographical locations for evety taxon in 2776this tree. An important assumption of the model is that each node in the phylogeny corresponds to a 2777speciation {\em and} a dispersal event. As a consequence, this model does not authorize a given taxon to 2778occupy more than one locations. Note however that the converse is not true: a given location can be 2779occupied by several different taxa. 2780 2781\subsubsection{Installing PhyloGeo} 2782 2783Compiling PhyloGeo is straightforward on Unix-like machines (i.e., linux and MacOS systems). PhyloGeo 2784is not readily available for Windows machines but compilation should be easy on this system too. In 2785the `phyml' directory, where the `src/' and `doc/' directories stand, enter the following commands: 2786{\setlength{\baselineskip}{0.5\baselineskip} 2787\begin{verbatim} 2788./configure --enable-geo; 2789make clean; 2790make; 2791\end{verbatim} } This set of commands generates a binary file called \x{phylogeo} which can be found 2792 in the `src/' directory. 2793 2794\subsubsection{Running PhyloGeo} PhyloGeo takes as input a rooted tree file in Newick format and a file 2795with geographical locations for all the tips of the phylogeny. Here is an example of valid tree and 2796the corresponding spatial locations just below: 2797 2798 2799\begin{scriptsize} 2800\begin{Verbatim}[frame=single, label=Valid PhyloGeo input tree, samepage=true, baselinestretch=0.5] 2801(((unicaA:1.30202,unicaB:1.30202):1.34596,(((nitidaC:0.94617,(nitidaA:0.31497, 2802nitidaB:0.31497):0.63120):0.18955,(((mauiensisA:0.00370,mauiensisB:0.00370):0.20068, 2803(pilimauiensisA:0.05151,pilimauiensisB:0.05151):0.15287):0.78769,(brunneaA:0.10582, 2804brunneaB:0.10582):0.88625):0.14365):0.80126,(((molokaiensisA:0.03728, 2805molokaiensisB:0.03728):0.71371,(deplanataA:0.01814,deplanataB:0.01814):0.73285):0.34764, 2806((parvulaA:0.20487,parvulaB:0.20487):0.40191,(kauaiensisA:0.24276, 2807kauaiensisB:0.24276):0.36401):0.49186):0.83835):0.71099):1.38043, 2808(nihoaA:0.05673,nihoaB:0.05673):3.97168); 2809\end{Verbatim} 2810 2811\begin{Verbatim}[frame=single, label=Valid PhyloGeo spatial location file, samepage=true, baselinestretch=0.5] 2812nihoaA 23.062222 161.926111 2813nihoaB 23.062222 161.926111 2814kauaiensisA 22.0644445 159.5455555 2815kauaiensisB 22.0644445 159.5455555 2816unicaA 21.436875 158.0524305 2817unicaB 21.436875 158.0524305 2818parvulaA 21.436875 158.0524305 2819parvulaB 21.436875 158.0524305 2820molokaiensisA 20.90532 156.6499 2821molokaiensisB 20.90532 156.6499 2822deplanataA 20.90532 156.6499 2823deplanataB 20.90532 156.6499 2824brunneaA 20.90532 156.6499 2825brunneaB 20.90532 156.6499 2826mauiensisA 20.90532 156.6499 2827mauiensisB 20.90532 156.6499 2828pilimauiensisA 20.90532 156.6499 2829pilimauiensisB 20.90532 156.6499 2830nitidaA 19.7362 155.6069 2831nitidaB 19.7362 155.6069 2832nitidaC 19.7362 155.6069 2833\end{Verbatim} 2834\end{scriptsize} 2835 2836In order to run PhyloGeo, enter the following command: \x{./phylogeo ./tree\_file 2837 ./spatial\_location\_file > phylogeo\_output}. 2838PhyloGeo will then print out the sampled values of the model parameters in the file 2839\x{phylogeo\_output}. This file can then be used to generate the marginal posterior densities of the 2840model parameters. In particular, evidence for competition corresponds to value of $\lambda$ smaller 2841than 1.0. Please see the original article for more information on how to interpret the model 2842parameters. 2843 2844\subsubsection{Citing PhyloGeo}\label{sec:citephylogeo} 2845 2846Ranjard, L., Welch D., Paturel M. and Guindon S. ``Modelling competition and dispersal in a 2847statistical phylogeographic framework''. 2014. Systematic Biology. 2848 2849 2850 2851\subsection{PhyREX}\index{PhyREX} PhyREX is a program that implements the spatial 2852$\Lambda$-Fleming-Viot model 2853\cite{etheridge2008,berestycki2009,barton2010,barton2010b,veber2012,barton2013} or \sfv\ for short. 2854This model can be thought of as a structured-coalescent where space is considered as continuous 2855rather than organized into separate demes. 2856Under the \sfv\ model, the spatial distribution of individuals in a population is uniform and does 2857not change during the course of evolution, as opposed to other models such as the very popular 2858isolation by distance model proposed by Wright and Mal\'ecot. The \sfv\ model does not 2859suffer from the ``pain in the torus'' \cite{felsenstein1975} and clumps of individuals with 2860increasing densities do not arise. Also, estimates of migration parameters are not sensitive to 2861variation in sampling intensities across regions, unlike ``mugration'' or ``discrete trait 2862analysis'' models \cite{lemey2009}. 2863 2864PhyREX implements an original data augmentation technique embedded in a Bayesian sampler to 2865estimate two important biological parameters: the neighorhood size ($\mathcal{N}$) and the dispersal 2866intensity ($\sigma^2$). These two parameters are closely related to the effective size of the 2867population of interest per unit area since $\mathcal{N}$ is defined as follows: $\mathcal{N} := 4\pi 2868\rho_e \sigma^2$, where $\rho_e$ is the effective population density. 2869 2870\subsubsection{Installing PhyREX} 2871 2872Compiling PhyREX is straightforward on Unix-like machines (i.e., linux and MacOS systems). PhyREX 2873is not readily available for Windows machines but compilation should be easy on this system too. In 2874the `phyml' directory, where the `src/' and `doc/' directories stand, enter the following commands: 2875{\setlength{\baselineskip}{0.5\baselineskip} 2876\begin{verbatim} 2877./configure --enable-phyrex; 2878make clean; 2879make; 2880\end{verbatim} } This set of commands generates a binary file called \x{phyrex} which can be found 2881 in the `src/' directory. 2882 2883\subsubsection{Running PhyREX} 2884 2885Example input files for PhyREX can be found in the \x{examples/phyrex/} directory. 2886PhyREX takes as input a XML file (See Section \ref{sec:xmlio}), in a PhyML-compatible format. This 2887file defines the models and various parameter settings for conducting an analysis. We give more 2888information below about this file. The XML file gives the names of a sequence alignment file in the 2889standard PHYLIP (or NEXUS) format along with a file providing the spatial coordinates of the 2890corresponding sequences. An example of the coordinates file is given 2891below: 2892\begin{Verbatim}[frame=single, label=Valid PhyREX spatial location file, samepage=true, baselinestretch=0.5] 2893# state.name lon lat 2894|SouthWest| 0 0 2895|NorthEast| 10 10 2896levosl 5.082206 4.133893 2897kmgcwv 5.914294 4.603446 2898uhfwja 4.990937 4.445124 2899ndmwkc 5.178017 4.442268 2900jpadex 3.747484 4.571090 2901lqcdcw 7.081925 5.133123 2902wnsbtg 4.164588 4.720346 2903ptwgnn 5.711159 4.462993 2904jhqdsm 3.539525 4.537706 2905vlnoes 4.613251 4.470530 2906pfrnpk 4.117791 4.489819 2907elwdvr 5.649958 4.824092 2908lptxiv 4.563302 4.005124 2909\end{Verbatim} 2910This first column in this file gives the sequence names, the second is the longitude and the third 2911column gives the latitude. The first row in this file gives the names of the columns. It starts with 2912the `\#' character signaling a comment. The second and third rows define the limits of the 2913population's habitat. At the moment, this habitat is assumed to be a rectangle. The position in 2914space of that rectangle are determined by the coordinates of the bottom-left corner (\x{|SouthWest| 29150 0}) and the top-right one (\x{|NorthEast| 10 10}). The `\x{|}' characters help identify the terms 2916in the list of coordinate corresponding indeed to these two particular points and should thus not be 2917omitted. The remaining row give the coordinates of each taxon. Every taxon in the sequence file 2918should also be listed in the coordinate file. 2919 2920Below is an excerpt of a XML file that can be used as input to run a PhyREX analysis. The \sfv\ 2921model parameters are estimated using a Bayesian sampler which hyperparameters (number of 2922samples, frequency at which samples are collected, length of the ``burnin'' phase, etc.) are set 2923through the corresponding XML attributes (see \ref{sec:phytimesettings}). PhyREX is similar to 2924PhyTime in the sense that substitution rates and times are separate parameters. The rates can 2925vary across lineages (using for instance a log-normal model) or not (strict clock). In case all 2926sequences were collected at (approximately) the same time, it is recommended to set the average 2927rate of evolution to a specific value (\x{<clockrate value="1E-4"/>}). Yet, with serially-sampled 2928data, the average rate of evolution can be estimated directly from the data and setting it to 2929a particular value is thus not mandatory. Finally, it is necessary to give information about the 2930time at which each sequence was collected. This is done through seeting appropriate calibration 2931constraints. From a technical perspective, these constraints apply to clades where each clade here 2932is a single tip. 2933 2934 2935\begin{Verbatim}[frame=single, label=Example (excerpt) of PhyREX XML file, samepage=true, baselinestretch=0.5, 2936 fontsize=\small, numbers=left] 2937 <phyrex run.id="example" output.file="out" 2938 mcmc.chain.len="1E+7" 2939 mcmc.sample.every="500" 2940 mcmc.print.every="100" 2941 mcmc.burnin="10000" 2942 mutmap="no"> 2943 2944 <!-- ``Standard'' stuff goes here, i.e., <topology>, --> 2945 <!-- <partitionelem>, etc --> 2946 2947 <!-- Model of rate variation across lineages --> 2948 <lineagerates model="lognormal"/> 2949 2950 <!-- Set the average (clock) rate of substitution --> 2951 <!-- to a fixed value (optional) --> 2952 <clockrate value="1E-4"/> 2953 2954 <-- File where 2-D spatial coordinates are found --> 2955 <coordinates id="coordinates" file.name="./usa_coord.txt"/> 2956 2957 <!-- Set the date for each tip. Useful in serial sample --> 2958 <!-- analysis --> 2959 <clade id="clad1"> 2960 <taxon value="CY130177|South_Carolina|12_13|H1N1"/> 2961 </clade> 2962 2963 <!-- Clade 1 (made of a single sequence) has time set to 0 --> 2964 <calibration id="cal1"> 2965 <lower>0</lower> 2966 <upper>0</upper> 2967 <appliesto clade.id="clad1"/> 2968 </calibration> 2969 2970 2971 <clade id="clad2"> 2972 <taxon value="KF647978|Idaho|12_13|H1N1"/> 2973 </clade> 2974 2975 <!-- Clade 2 has time set to 10 --> 2976 <calibration id="cal2"> 2977 <lower>10</lower> 2978 <upper>10</upper> 2979 <appliesto clade.id="clad2"/> 2980 </calibration> 2981 2982 <!-- More calibration info goes here. At least, one --> 2983 <!-- calibrated clade per tip. --> 2984 2985\end{Verbatim} 2986 2987PhyREX generates correlated samples from the posterior distribution of the \sfv\ model 2988parameters. The estimated distributions and related summary statistics can be monitored during the 2989analysis using the MCMC vizualization software Icylog. Figure \ref{fig:phyrextrace} shows the traces 2990for two parameters after about an hour of calculation on the example data set (36 taxa, 50 2991geographic locations). Most of the names of the sampled parameters are self-explanatory but some of 2992them (listed below) need explanations: 2993\begin{itemize} 2994\item \texttt{lnP} : log-posterior. 2995\item \texttt{alnL} : logarithm of the probability of the sequence alignment given the current tree. 2996\item \texttt{glnL} : logarithm of the probability of the current tree and location data. 2997\item \texttt{clock} : substitution rate 2998\item \texttt{lbda} : value of $\lambda$, the rate at which REX events take place in the \sfv~model. 2999\item \texttt{mu} : value of $\mu$, the term governing the probability for the lineage to be hit by 3000 a REX event. 3001\item \texttt{rad} : value of the parameter $\theta$, i.e., the radius parameter in the \sfv~model. 3002\item \texttt{neigh} : neighborhood size (equal to $2/\mu$ under the \sfv~model). 3003\item \texttt{rhoe} : effective population density. 3004\item \texttt{sigsq} : $\sigma^2 := 4\theta^4\lambda \pi \mu /s$ where $s$ is the surface of the habitat. 3005\item \texttt{sigsqobs} : mean, taken over the (most recent) tips, of one half (in a two-dimensional space) times the Euclidean distance 3006 covered in one unit of time, squared. 3007\item \texttt{dispdist} : mean, taken over the (most recent) tips, of the Haversine distance 3008 covered in one unit of time. 3009\item \texttt{nInt} : number of REX events. 3010\item \texttt{nCoal} : number of coalescent events. 3011\item \texttt{nHit} : number of REX events where a single lineage was hit. 3012\item \texttt{rootTime} : time at the root node. 3013\item \texttt{rootLon} : longitude of the root node location. 3014\item \texttt{rootLat} : latitude of the root node location. 3015\end{itemize} 3016If you 3017are unsure about the precise definition of one of these parameters or have any questions regarding 3018how to interpret the results returned by PhyREX, please do not hesitate to contact me (\url{guindon@lirmm.fr}). 3019 3020\begin{figure} 3021\begin{center} 3022 \includegraphics[width=13.8cm]{./fig/phyrexlog} 3023\end{center} 3024\caption{{\bf Statistics file generated by PhyREX and loaded on Icylog 3025 (\url{http://tgvaughan.github.io/icylog/icylog.html}).} The traces for the neighborhood size 3026 (top) and 3027the dispersal intensity (bottom) are shown here.} 3028\label{fig:phyrextrace} 3029\end{figure} 3030 3031\subsubsection{Citing PhyREX} 3032 3033The ``default citation'' is: 3034 3035\begin{itemize} 3036\item Guindon S., H Guo, D Welch. ``Demographic inference under the coalescent in a spatial 3037 continuum'', {\it Theoretical Population Biology}, 2016, 111:43-50. 3038\end{itemize} 3039 3040\section{Recommendations on program usage}\label{sec:progusage} 3041 3042\subsection{PhyML} 3043 3044The choice of the tree searching algorithm among those provided by PhyML is generally a tough one. 3045The fastest option relies on local and simultaneous modifications of the phylogeny using NNI 3046moves. More thorough explorations of the space of topologies are also available through the SPR 3047options. As these two classes of tree topology moves involve different computational burdens, it 3048is important to determine which option is the most suitable for the type of data set or analysis one 3049wants to perform. Below is a list of recommendations for typical phylogenetic analyses. 3050 3051\begin{enumerate} 3052\item {\em Single data set, unlimited computing time.} The best option here is probably to use a SPR 3053search (i.e., straight SPR of best of SPR and NNI). If the focus is on estimating the relationships 3054between species, it is a good idea to use more than one starting tree to decrease the chance of 3055getting stuck in a local maximum of the likelihood function. Using NNIs is appropriate if the 3056analysis does not mainly focus on estimating the evolutionary relationships between species (e.g. a 3057tree is needed to estimate the parameters of codon-based models later on). Branch supports can be 3058estimated using bootstrap and approximate likelihood ratios. 3059 3060\item {\em Single data set, restricted computing time.} The three tree searching options can be 3061used depending on the computing time available and the size of the data set. For small data sets 3062(i.e., $<$ 50 sequences), NNI will generally perform well provided that the phylogenetic signal is 3063strong. It is relevant to estimate a first tree using NNI moves and examine the reconstructed 3064phylogeny in order to have a rough idea of the strength of the phylogenetic signal (the presence of 3065small internal branch lengths is generally considered as a sign of a weak phylogenetic signal, 3066specially when sequences are short). For larger data sets ($>$ 50 sequences), a SPR search is 3067recommended if there are good evidence of a lack of phylogenetic signal. Bootstrap analysis will 3068generally involve large computational burdens. Estimating branch supports using approximate 3069likelihood ratios therefore provides an interesting alternative here. 3070 3071\item {\em Multiple data sets, unlimited computing time.} Comparative genomic analyses sometimes 3072rely on building phylogenies from the analysis of a large number of gene families. Here again, the 3073NNI option is the most relevant if the focus is not on recovering the most accurate picture of the 3074evolutionary relationships between species. Slower SPR-based heuristics should be used when the 3075topology of the tree is an important parameter of the analysis (e.g., identification of horizontally 3076transferred genes using phylogenetic tree comparisons). Internal branch support is generally not a 3077crucial parameter of the multiple data set analyses. Using approximate likelihood ratio is probably 3078the best choice here. 3079 3080\item {\em Multiple data sets, limited computing time.} The large amount of data to be processed in 3081a limited time generally requires the use of the fastest tree searching and branch support 3082estimation methods Hence, NNI and approximate likelihood ratios rather than SPR and non-parametric 3083bootstrap are generally the most appropriate here. 3084\end{enumerate} 3085 3086Another important point is the choice of the substitution model. While default options generally 3087provide acceptable results, it is often warranted to perform a pre-analysis in order to identify the 3088best-fit substitution model. This pre-analysis can be done using popular software such as Modeltest 3089\cite{posada98} or ProtTest \cite{abascal05} for instance. These programs generally recommend the 3090use of a discrete gamma distribution to model the substitution process as variability of rates among 3091sites is a common feature of molecular evolution. The choice of the number of rate classes to use 3092for this distribution is also an important one. While the default is set to four categories in 3093PhyML, it is recommended to use larger number of classes if possible in order to best approximate 3094the patterns of rate variation across sites \cite{galtier04}. Note however that run times are 3095directly proportional to the number of classes of the discrete gamma distribution. Here again, a 3096pre-analysis with the simplest model should help the user to determine the number of rate classes 3097that represents the best trade-off between computing time and fit of the model to the data. 3098 3099 3100\subsection{PhyTime}\label{sec:recomphytime} 3101 3102Analysing a data set using PhyTime should involve three steps based on the following questions: (1) 3103do the priors seem to be adequate (2) can I use the fast approximation of the likelihood and (3) how 3104long shall I run the program for? I explain below how to provide answers to these questions. 3105 3106\begin{itemize} 3107\item {\em Are the priors adequate?} Bayesian analysis relies on specifiying the joint 3108prior density of model parameters. In the case of node age estimation, these priors essentially 3109describe how rates of substitution vary across lineages and the probabilistic distribution that node 3110ages have when ignoring the information provided by the genetic sequences. These priors vary from 3111tree to tree. It is therefore essential to check the adequacy of priors for each user-defined input 3112tree. In order to do so, PhyTime needs to be run with the \x{--no\_data} option. When this option is 3113required, the sequence data provided as input will be ignored and the rest of the analysis will 3114proceed normally. The prior distribution of model parameters, essentially edge rates and node 3115heights, can then be checked using the program Tracer as one would do for the standard `posterior' 3116analysis. 3117 3118\item {\em Can I use the fast approximation to the likelihood?} The suface of the log-likelihood 3119function can be approximated using a multivariate normal density. This technique is saving very 3120substantial amounts of computation time. However, like most approximations, there are situations 3121where it does not provide a good fit to the actual function. This usually happens when the phylogeny 3122displays a lot of short branches, i.e., the signal conveyed by the sequences is weak. It is 3123therefore important to first check whether using the approximate likelihood is reasonable. In order 3124to do so, it is recommended to first run the program without the approximation, i.e., using the 3125default settings. Once the minimum value of the ESS of node ages (the last column on the right of 3126the standard output) has reached 40-50, open the \x{phytime.XXXX} output file with Tracer and 3127examine the correlation between the exact and approximate likelihood values. If the correlation is 3128deemed to be good enough, PhyTime can be re-run using the \x{--fast\_lk} option, which uses the fast 3129normal approximation to the likelihood function. 3130 3131 3132 3133% Figure \ref{fig:approxbad} gives an example where the correlation is too weak and the approximation 3134% of the likelihood should be avoided. Figure \ref{fig:approxbad} gives an example where the 3135% approximation is good enough. The current execution of PhyTime can be terminated and then 3136% re-launched using the \x{--fast\_lk} option. 3137 3138% \begin{figure} 3139% \begin{center} 3140% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_bad.eps}} 3141% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated 3142% (Y-axis) and the exact (X-axis) likelihoods is weak here. The exact likelihood should be used (option \x{fastlk=no}).} 3143% \label{fig:approxbad} 3144% \end{center} 3145% \end{figure} 3146 3147 3148% \begin{figure} 3149% \begin{center} 3150% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_good.eps}} 3151% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated 3152% (Y-axis) and the exact (X-axis) likelihoods is good. The approximation of the likelihood can be used (option \x{fastlk=yes}).} 3153% \label{fig:approxgood} 3154% \end{center} 3155% \end{figure} 3156 3157\item {\em How long shall I run the program for?} PhyTime should be run long enough such that the 3158ESS of each parameter is `large enough'. The last column on the right handside of the standard 3159output gives the minimum ESS across all internal node heights. It is recommended to run the program 3160so that this number reaches at least 100. 3161 3162\end{itemize} 3163 3164 3165\section{Frequently asked questions} 3166 3167\begin{enumerate} 3168\item {\it PhyML crashes before reading the sequences. What's wrong ?}\\ 3169\begin{itemize} 3170\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output} 3171\item The carriage return characters in your sequence files are not recognized by PhyML. You must 3172 make sure that your sequence file is a plain text file, with standard carriage return characters (i.e., 3173 corresponding to ``$\backslash$\x{n}'', or ``$\backslash$\x{r}'') 3174\end{itemize} 3175\item {\it The program crashes after reading the sequences. What's wrong ?}\\ 3176\begin{itemize} 3177\item You analyse protein sequences and did not enter the \x{-d aa} option in the command-line. 3178\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output} 3179\end{itemize} 3180\item {\it Does PhyML handle outgroup sequences ?}\\ 3181\begin{itemize} 3182\item Yes, it does. Outgroup taxa are identified by adding the `*' sign at the end of each 3183 corresponding sequence name (see Section \ref{sec:outgroupspecify}) 3184\end{itemize} 3185\item {\it Does PhyML estimate clock-constrained trees ?}\\ 3186\begin{itemize} 3187\item No, the PhyML program does not estimate clock-contrained trees. One can however use the 3188 program PhyTime to perform such analysis but the tree topology will not be estimated. 3189\end{itemize} 3190\item {\it Can PhyML analyse partitioned data, such as multiple gene sequences ?}\\ 3191\begin{itemize} 3192\item We are currently working on this topic. Future releases of the program will provide options 3193to estimate trees from phylogenomic data sets, with the opportunity to use different substitution 3194models on the different data partitions (e.g., different genes). PhyML will also include specific 3195algorithms to search the space of tree topologies for this type of data. 3196\end{itemize} 3197\end{enumerate} 3198 3199 3200 3201\section{Acknowledgements} 3202The development of PhyML since 2000 has been supported by the Centre National de la Recherche 3203Scientifique (CNRS) and the Minist\`ere de l'\'Education Nationale. 3204 3205\bibliographystyle{./naturemag} 3206\bibliography{./ref.bib} 3207 3208\printindex 3209\end{document} 3210 3211