1# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-
2# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
3#
4# MDAnalysis --- https://www.mdanalysis.org
5# Copyright (c) 2006-2017 The MDAnalysis Development Team and contributors
6# (see the file AUTHORS for the full list of names)
7#
8# Released under the GNU Public Licence, v2 or any higher version
9#
10# Please cite your use of MDAnalysis in published work:
11#
12# R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler,
13# D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein.
14# MDAnalysis: A Python package for the rapid analysis of molecular dynamics
15# simulations. In S. Benthall and S. Rostrup editors, Proceedings of the 15th
16# Python in Science Conference, pages 102-109, Austin, TX, 2016. SciPy.
17#
18# N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein.
19# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations.
20# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787
21#
22
23"""Coordinate fitting and alignment --- :mod:`MDAnalysis.analysis.align`
24=====================================================================
25
26:Author: Oliver Beckstein, Joshua Adelman
27:Year: 2010--2013
28:Copyright: GNU Public License v3
29
30The module contains functions to fit a target structure to a reference
31structure. They use the fast QCP algorithm to calculate the root mean
32square distance (RMSD) between two coordinate sets [Theobald2005]_ and
33the rotation matrix *R* that minimizes the RMSD [Liu2010]_. (Please
34cite these references when using this module.).
35
36Typically, one selects a group of atoms (such as the C-alphas),
37calculates the RMSD and transformation matrix, and applys the
38transformation to the current frame of a trajectory to obtain the
39rotated structure. The :func:`alignto` and :class:`AlignTraj`
40functions can be used to do this for individual frames and
41trajectories respectively.
42
43The :ref:`RMS-fitting-tutorial` shows how to do the individual steps
44manually and explains the intermediate steps.
45
46See Also
47--------
48:mod:`MDAnalysis.analysis.rms`
49     contains functions to compute RMSD (when structural alignment is not
50     required)
51:mod:`MDAnalysis.lib.qcprot`
52     implements the fast RMSD algorithm.
53
54
55.. _RMS-fitting-tutorial:
56
57RMS-fitting tutorial
58--------------------
59
60The example uses files provided as part of the MDAnalysis test suite
61(in the variables :data:`~MDAnalysis.tests.datafiles.PSF`,
62:data:`~MDAnalysis.tests.datafiles.DCD`, and
63:data:`~MDAnalysis.tests.datafiles.PDB_small`). For all further
64examples execute first ::
65
66   >>> import MDAnalysis as mda
67   >>> from MDAnalysis.analysis import align
68   >>> from MDAnalysis.analysis.rms import rmsd
69   >>> from MDAnalysis.tests.datafiles import PSF, DCD, PDB_small
70
71
72In the simplest case, we can simply calculate the C-alpha RMSD between
73two structures, using :func:`rmsd`::
74
75   >>> ref = mda.Universe(PDB_small)
76   >>> mobile = mda.Universe(PSF,DCD)
77   >>> rmsd(mobile.select_atoms('name CA').positions, ref.select_atoms('name CA').positions)
78   16.282308620224068
79
80Note that in this example translations have not been removed. In order
81to look at the pure rotation one needs to superimpose the centres of
82mass (or geometry) first:
83
84   >>> rmsd(mobile.select_atoms('name CA').positions, ref.select_atoms('name CA').positions, center=True)
85   12.639693690256898
86
87This has only done a translational superposition. If you want to also do a
88rotational superposition use the superposition keyword. This will calculate a
89minimized RMSD between the reference and mobile structure.
90
91   >>> rmsd(mobile.select_atoms('name CA').positions, ref.select_atoms('name CA').positions,
92   >>>      superposition=True)
93   6.8093965864717951
94
95The rotation matrix that superimposes *mobile* on *ref* while
96minimizing the CA-RMSD is obtained with the :func:`rotation_matrix`
97function ::
98
99   >>> mobile0 = mobile.select_atoms('name CA').positions - mobile.atoms.center_of_mass()
100   >>> ref0 = ref.select_atoms('name CA').positions - ref.atoms.center_of_mass()
101   >>> R, rmsd = align.rotation_matrix(mobile0, ref0)
102   >>> print rmsd
103   6.8093965864717951
104   >>> print R
105   [[ 0.14514539 -0.27259113  0.95111876]
106    [ 0.88652593  0.46267112 -0.00268642]
107    [-0.43932289  0.84358136  0.30881368]]
108
109Putting all this together one can superimpose all of *mobile* onto *ref*::
110
111   >>> mobile.atoms.translate(-mobile.select_atoms('name CA').center_of_mass())
112   >>> mobile.atoms.rotate(R)
113   >>> mobile.atoms.translate(ref.select_atoms('name CA').center_of_mass())
114   >>> mobile.atoms.write("mobile_on_ref.pdb")
115
116
117Common usage
118------------
119
120To **fit a single structure** with :func:`alignto`::
121
122   >>> ref = mda.Universe(PSF, PDB_small)
123   >>> mobile = mda.Universe(PSF, DCD)     # we use the first frame
124   >>> align.alignto(mobile, ref, select="protein and name CA", weights="mass")
125
126This will change *all* coordinates in *mobile* so that the protein
127C-alpha atoms are optimally superimposed (translation and rotation).
128
129To **fit a whole trajectory** to a reference structure with the
130:class:`AlignTraj` class::
131
132   >>> ref = mda.Universe(PSF, PDB_small)   # reference structure 1AKE
133   >>> trj = mda.Universe(PSF, DCD)         # trajectory of change 1AKE->4AKE
134   >>> alignment = align.AlignTraj(trj, ref, filename='rmsfit.dcd')
135   >>> alignment.run()
136
137It is also possible to align two arbitrary structures by providing a
138mapping between atoms based on a sequence alignment. This allows
139fitting of structural homologs or wild type and mutant.
140
141If a alignment was provided as "sequences.aln" one would first produce
142the appropriate MDAnalysis selections with the :func:`fasta2select`
143function and then feed the resulting dictionary to :class:`AlignTraj`::
144
145   >>> seldict = align.fasta2select('sequences.aln')
146   >>> alignment = align.AlignTraj(trj, ref, filename='rmsfit.dcd', select=seldict)
147   >>> alignment.run()
148
149(See the documentation of the functions for this advanced usage.)
150
151
152Functions and Classes
153---------------------
154
155.. versionchanged:: 0.10.0
156   Function :func:`~MDAnalysis.analysis.rms.rmsd` was removed from
157   this module and is now exclusively accessible as
158   :func:`~MDAnalysis.analysis.rms.rmsd`.
159
160.. versionchanged:: 0.16.0
161   Function :func:`~MDAnalysis.analysis.align.rms_fit_trj` deprecated
162   in favor of :class:`AlignTraj` class.
163
164.. versionchanged:: 0.17.0
165   removed deprecated :func:`~MDAnalysis.analysis.align.rms_fit_trj`
166
167.. autofunction:: alignto
168.. autoclass:: AlignTraj
169.. autofunction:: rotation_matrix
170
171
172Helper functions
173----------------
174
175The following functions are used by the other functions in this
176module. They are probably of more interest to developers than to
177normal users.
178
179.. autofunction:: _fit_to
180.. autofunction:: fasta2select
181.. autofunction:: sequence_alignment
182.. autofunction:: get_matching_atoms
183
184"""
185from __future__ import division, absolute_import
186
187import os.path
188import warnings
189import logging
190
191from six.moves import range, zip, zip_longest
192from six import string_types
193
194import numpy as np
195
196import Bio.SeqIO
197import Bio.AlignIO
198import Bio.Align.Applications
199import Bio.Alphabet
200import Bio.pairwise2
201
202import MDAnalysis as mda
203import MDAnalysis.lib.qcprot as qcp
204from MDAnalysis.exceptions import SelectionError, SelectionWarning
205import MDAnalysis.analysis.rms as rms
206from MDAnalysis.coordinates.memory import MemoryReader
207from MDAnalysis.lib.util import get_weights, deprecate
208
209from .base import AnalysisBase
210
211logger = logging.getLogger('MDAnalysis.analysis.align')
212
213
214def rotation_matrix(a, b, weights=None):
215    r"""Returns the 3x3 rotation matrix `R` for RMSD fitting coordinate
216    sets `a` and `b`.
217
218    The rotation matrix `R` transforms vector `a` to overlap with
219    vector `b` (i.e., `b` is the reference structure):
220
221    .. math::
222       \mathbf{b} = \mathsf{R} \cdot \mathbf{a}
223
224    Parameters
225    ----------
226    a : array_like
227          coordinates that are to be rotated ("mobile set"); array of N atoms
228          of shape N*3 as generated by, e.g.,
229          :attr:`MDAnalysis.core.groups.AtomGroup.positions`.
230    b : array_like
231          reference coordinates; array of N atoms of shape N*3 as generated by,
232          e.g., :attr:`MDAnalysis.core.groups.AtomGroup.positions`.
233    weights : array_like (optional)
234          array of floats of size N for doing weighted RMSD fitting (e.g. the
235          masses of the atoms)
236
237    Returns
238    -------
239    R : ndarray
240        rotation matrix
241    rmsd : float
242        RMSD between `a` and `b` before rotation
243    ``(R, rmsd)`` rmsd and rotation matrix *R*
244
245    Example
246    -------
247    `R` can be used as an argument for
248    :meth:`MDAnalysis.core.groups.AtomGroup.rotate` to generate a rotated
249    selection, e.g. ::
250
251    >>> R = rotation_matrix(A.select_atoms('backbone').positions,
252    >>>                     B.select_atoms('backbone').positions)[0]
253    >>> A.atoms.rotate(R)
254    >>> A.atoms.write("rotated.pdb")
255
256    Notes
257    -----
258    The function does *not* shift the centers of mass or geometry;
259    this needs to be done by the user.
260
261    See Also
262    --------
263    MDAnalysis.analysis.rms.rmsd: Calculates the RMSD between *a* and *b*.
264    alignto: A complete fit of two structures.
265    AlignTraj: Fit a whole trajectory.
266    """
267
268    a = np.asarray(a, dtype=np.float64)
269    b = np.asarray(b, dtype=np.float64)
270
271    if a.shape != b.shape:
272        raise ValueError("'a' and 'b' must have same shape")
273
274    N = b.shape[0]
275
276    if weights is not None:
277        # qcp does NOT divide weights relative to the mean
278        weights = np.asarray(weights, dtype=np.float64) / np.mean(weights)
279
280    rot = np.zeros(9, dtype=np.float64)
281
282    # Need to transpose coordinates such that the coordinate array is
283    # 3xN instead of Nx3. Also qcp requires that the dtype be float64
284    # (I think we swapped the position of ref and traj in CalcRMSDRotationalMatrix
285    # so that R acts **to the left** and can be broadcasted; we're saving
286    # one transpose. [orbeckst])
287    rmsd = qcp.CalcRMSDRotationalMatrix(a, b, N, rot, weights)
288    return rot.reshape(3, 3), rmsd
289
290
291def _fit_to(mobile_coordinates, ref_coordinates, mobile_atoms,
292            mobile_com, ref_com, weights=None):
293    r"""Perform an rmsd-fitting to determine rotation matrix and align atoms
294
295    Parameters
296    ----------
297    mobile_coordinates : ndarray
298        Coordinates of atoms to be aligned
299    ref_coordinates : ndarray
300        Coordinates of atoms to be fit against
301    mobile_atoms : AtomGroup
302        Atoms to be translated
303    mobile_com: ndarray
304        array of xyz coordinate of mobile center of mass
305    ref_com: ndarray
306        array of xyz coordinate of reference center of mass
307    weights : array_like (optional)
308       choose weights. With ``None`` weigh each atom equally. If a float array
309       of the same length as `mobile_coordinates` is provided, use each element
310       of the `array_like` as a weight for the corresponding atom in
311       `mobile_coordinates`.
312
313    Returns
314    -------
315    mobile_atoms : AtomGroup
316        AtomGroup of translated and rotated atoms
317    min_rmsd : float
318        Minimum rmsd of coordinates
319
320    Notes
321    -----
322    This function assumes that `mobile_coordinates` and `ref_coordinates` have
323    already been shifted so that their centers of geometry (or centers of mass,
324    depending on `weights`) coincide at the origin. `mobile_com` and `ref_com`
325    are the centers *before* this shift.
326
327    1. The rotation matrix :math:`\mathsf{R}` is determined with
328       :func:`rotation_matrix` directly from `mobile_coordinates` and
329       `ref_coordinates`.
330    2. `mobile_atoms` :math:`X` is rotated according to the
331       rotation matrix and the centers according to
332
333       .. math::
334
335           X' = \mathsf{R}(X - \bar{X}) + \bar{X}_{\text{ref}}
336
337       where :math:`\bar{X}` is the center.
338
339    """
340    R, min_rmsd = rotation_matrix(mobile_coordinates, ref_coordinates,
341                                  weights=weights)
342
343    mobile_atoms.translate(-mobile_com)
344    mobile_atoms.rotate(R)
345    mobile_atoms.translate(ref_com)
346
347    return mobile_atoms, min_rmsd
348
349
350def alignto(mobile, reference, select="all", weights=None,
351            subselection=None, tol_mass=0.1, strict=False):
352    """Perform a spatial superposition by minimizing the RMSD.
353
354    Spatially align the group of atoms `mobile` to `reference` by
355    doing a RMSD fit on `select` atoms.
356
357    The superposition is done in the following way:
358
359    1. A rotation matrix is computed that minimizes the RMSD between
360       the coordinates of `mobile.select_atoms(sel1)` and
361       `reference.select_atoms(sel2)`; before the rotation, `mobile` is
362       translated so that its center of geometry (or center of mass)
363       coincides with the one of `reference`. (See below for explanation of
364       how *sel1* and *sel2* are derived from `select`.)
365
366    2. All atoms in :class:`~MDAnalysis.core.universe.Universe` that
367       contain `mobile` are shifted and rotated. (See below for how
368       to change this behavior through the `subselection` keyword.)
369
370    The `mobile` and `reference` atom groups can be constructed so that they
371    already match atom by atom. In this case, `select` should be set to "all"
372    (or ``None``) so that no further selections are applied to `mobile` and
373    `reference`, therefore preserving the exact atom ordering (see
374    :ref:`ordered-selections-label`).
375
376    .. Warning:: The atom order for `mobile` and `reference` is *only*
377       preserved when `select` is either "all" or ``None``. In any other case,
378       a new selection will be made that will sort the resulting AtomGroup by
379       index and therefore destroy the correspondence between the two groups.
380       **It is safest not to mix ordered AtomGroups with selection strings.**
381
382    Parameters
383    ----------
384    mobile : Universe or AtomGroup
385       structure to be aligned, a
386       :class:`~MDAnalysis.core.groups.AtomGroup` or a whole
387       :class:`~MDAnalysis.core.universe.Universe`
388    reference : Universe or AtomGroup
389       reference structure, a :class:`~MDAnalysis.core.groups.AtomGroup`
390       or a whole :class:`~MDAnalysis.core.universe.Universe`
391    select : str or dict or tuple (optional)
392       The selection to operate on; can be one of:
393
394       1. any valid selection string for
395          :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` that
396          produces identical selections in `mobile` and `reference`; or
397
398       2. a dictionary ``{'mobile': sel1, 'reference': sel2}`` where *sel1*
399          and *sel2* are valid selection strings that are applied to
400          `mobile` and `reference` respectively (the
401          :func:`MDAnalysis.analysis.align.fasta2select` function returns such
402          a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or
403
404       3. a tuple ``(sel1, sel2)``
405
406       When using 2. or 3. with *sel1* and *sel2* then these selection strings
407       are applied to `atomgroup` and `reference` respectively and should
408       generate *groups of equivalent atoms*.  *sel1* and *sel2* can each also
409       be a *list of selection strings* to generate a
410       :class:`~MDAnalysis.core.groups.AtomGroup` with defined atom order as
411       described under :ref:`ordered-selections-label`).
412    weights : {"mass", ``None``} or array_like (optional)
413       choose weights. With ``"mass"`` uses masses as weights; with ``None``
414       weigh each atom equally. If a float array of the same length as
415       `mobile` is provided, use each element of the `array_like` as a
416       weight for the corresponding atom in `mobile`.
417    tol_mass: float (optional)
418       Reject match if the atomic masses for matched atoms differ by more than
419       `tol_mass`, default [0.1]
420    strict: bool (optional)
421       ``True``
422           Will raise :exc:`SelectionError` if a single atom does not
423           match between the two selections.
424       ``False`` [default]
425           Will try to prepare a matching selection by dropping
426           residues with non-matching atoms. See :func:`get_matching_atoms`
427           for details.
428    subselection : str or AtomGroup or None (optional)
429       Apply the transformation only to this selection.
430
431       ``None`` [default]
432           Apply to ``mobile.universe.atoms`` (i.e., all atoms in the
433           context of the selection from `mobile` such as the rest of a
434           protein, ligands and the surrounding water)
435       *selection-string*
436           Apply to ``mobile.select_atoms(selection-string)``
437       :class:`~MDAnalysis.core.groups.AtomGroup`
438           Apply to the arbitrary group of atoms
439
440    Returns
441    -------
442    old_rmsd : float
443        RMSD before spatial alignment
444    new_rmsd : float
445        RMSD after spatial alignment
446
447    See Also
448    --------
449    AlignTraj: More efficient method for RMSD-fitting trajectories.
450
451
452    .. _ClustalW: http://www.clustal.org/
453    .. _STAMP: http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/
454
455
456    .. versionchanged:: 0.8
457       Added check that the two groups describe the same atoms including
458       the new *tol_mass* keyword.
459
460    .. versionchanged:: 0.10.0
461       Uses :func:`get_matching_atoms` to work with incomplete selections
462       and new `strict` keyword. The new default is to be lenient whereas
463       the old behavior was the equivalent of ``strict = True``.
464
465    .. versionchanged:: 0.16.0
466       new general 'weights' kwarg replace `mass_weighted`, deprecated `mass_weighted`
467    .. deprecated:: 0.16.0
468       Instead of ``mass_weighted=True`` use new ``weights='mass'``
469
470    .. versionchanged:: 0.17.0
471       Deprecated keyword `mass_weighted` was removed.
472    """
473    if select in ('all', None):
474        # keep the EXACT order in the input AtomGroups; select_atoms('all')
475        # orders them by index, which can lead to wrong results if the user
476        # has crafted mobile and reference to match atom by atom
477        mobile_atoms = mobile.atoms
478        ref_atoms = reference.atoms
479    else:
480        select = rms.process_selection(select)
481        mobile_atoms = mobile.select_atoms(*select['mobile'])
482        ref_atoms = reference.select_atoms(*select['reference'])
483
484    ref_atoms, mobile_atoms = get_matching_atoms(ref_atoms, mobile_atoms,
485                                                 tol_mass=tol_mass,
486                                                 strict=strict)
487
488    weights = get_weights(ref_atoms, weights)
489
490    mobile_com = mobile_atoms.center(weights)
491    ref_com = ref_atoms.center(weights)
492
493    ref_coordinates = ref_atoms.positions - ref_com
494    mobile_coordinates = mobile_atoms.positions - mobile_com
495
496    old_rmsd = rms.rmsd(mobile_coordinates, ref_coordinates, weights)
497
498    if subselection is None:
499        # mobile_atoms is Universe
500        mobile_atoms = mobile.universe.atoms
501    elif isinstance(subselection, string_types):
502        # select mobile_atoms from string
503        mobile_atoms = mobile.select_atoms(subselection)
504    else:
505        try:
506            # treat subselection as AtomGroup
507            mobile_atoms = subselection.atoms
508        except AttributeError:
509            raise TypeError("subselection must be a selection string, an"
510                            " AtomGroup or Universe or None")
511
512    # _fit_to DOES subtract center of mass, will provide proper min_rmsd
513    mobile_atoms, new_rmsd = _fit_to(mobile_coordinates, ref_coordinates,
514                                     mobile_atoms, mobile_com, ref_com,
515                                     weights=weights)
516    return old_rmsd, new_rmsd
517
518
519class AlignTraj(AnalysisBase):
520    """RMS-align trajectory to a reference structure using a selection.
521
522    Both the reference `reference` and the trajectory `mobile` must be
523    :class:`MDAnalysis.Universe` instances. If they contain a trajectory then
524    it is used. The output file format is determined by the file extension of
525    `filename`. One can also use the same universe if one wants to fit to the
526    current frame.
527
528    """
529
530    def __init__(self, mobile, reference, select='all', filename=None,
531                 prefix='rmsfit_', weights=None,
532                 tol_mass=0.1, strict=False, force=True, in_memory=False,
533                 **kwargs):
534        """Parameters
535        ----------
536        mobile : Universe
537            Universe containing trajectory to be fitted to reference
538        reference : Universe
539            Universe containing trajectory frame to be used as reference
540        select : str (optional)
541            Set as default to all, is used for Universe.select_atoms to choose
542            subdomain to be fitted against
543        filename : str (optional)
544            Provide a filename for results to be written to
545        prefix : str (optional)
546            Provide a string to prepend to filename for results to be written
547            to
548        weights : {"mass", ``None``} or array_like (optional)
549            choose weights. With ``"mass"`` uses masses of `reference` as
550            weights; with ``None`` weigh each atom equally. If a float array of
551            the same length as the selection is provided, use each element of
552            the `array_like` as a weight for the corresponding atom in the
553            selection.
554        tol_mass : float (optional)
555            Tolerance given to `get_matching_atoms` to find appropriate atoms
556        strict : bool (optional)
557            Force `get_matching_atoms` to fail if atoms can't be found using
558            exact methods
559        force : bool (optional)
560            Force overwrite of filename for rmsd-fitting
561        start : int (optional)
562            First frame of trajectory to analyse, Default: 0
563        stop : int (optional)
564            Last frame of trajectory to analyse, Default: -1
565        step : int (optional)
566            Step between frames to analyse, Default: 1
567        in_memory : bool (optional)
568            *Permanently* switch `mobile` to an in-memory trajectory
569            so that alignment can be done in-place, which can improve
570            performance substantially in some cases. In this case, no file
571            is written out (`filename` and `prefix` are ignored) and only
572            the coordinates of `mobile` are *changed in memory*.
573        verbose : bool (optional)
574             Set logger to show more information and show detailed progress of
575             the calculation if set to ``True``; the default is ``False``.
576
577
578        Attributes
579        ----------
580        reference_atoms : AtomGroup
581            Atoms of the reference structure to be aligned against
582        mobile_atoms : AtomGroup
583            Atoms inside each trajectory frame to be rmsd_aligned
584        rmsd : Array
585            Array of the rmsd values of the least rmsd between the mobile_atoms
586            and reference_atoms after superposition and minimimization of rmsd
587        filename : str
588            String reflecting the filename of the file where mobile_atoms
589            positions will be written to upon running RMSD alignment
590
591
592        Notes
593        -----
594        - If set to ``verbose=False``, it is recommended to wrap the statement
595          in a ``try ...  finally`` to guarantee restoring of the log level in
596          the case of an exception.
597        - The ``in_memory`` option changes the `mobile` universe to an
598          in-memory representation (see :mod:`MDAnalysis.coordinates.memory`)
599          for the remainder of the Python session. If ``mobile.trajectory`` is
600          already a :class:`MemoryReader` then it is *always* treated as if
601          ``in_memory`` had been set to ``True``.
602
603        .. deprecated:: 0.19.1
604           Default ``filename`` directory will change in 1.0 to the current directory.
605
606        .. versionchanged:: 0.16.0
607           new general ``weights`` kwarg replace ``mass_weights``
608
609        .. deprecated:: 0.16.0
610           Instead of ``mass_weighted=True`` use new ``weights='mass'``
611
612        .. versionchanged:: 0.17.0
613           removed deprecated `mass_weighted` keyword
614
615        """
616        select = rms.process_selection(select)
617        self.ref_atoms = reference.select_atoms(*select['reference'])
618        self.mobile_atoms = mobile.select_atoms(*select['mobile'])
619        if in_memory or isinstance(mobile.trajectory, MemoryReader):
620            mobile.transfer_to_memory()
621            filename = None
622            logger.info("Moved mobile trajectory to in-memory representation")
623        else:
624            if filename is None:
625                # DEPRECATED in 0.19.1
626                # Change in 1.0
627                #
628                # fn = os.path.split(mobile.trajectory.filename)[1]
629                # filename = prefix + fn
630                path, fn = os.path.split(mobile.trajectory.filename)
631                filename = os.path.join(path, prefix + fn)
632                logger.info('filename of rms_align with no filename given'
633                            ': {0}'.format(filename))
634
635            if os.path.exists(filename) and not force:
636                raise IOError(
637                    'Filename already exists in path and force is not set'
638                    ' to True')
639
640        # do this after setting the memory reader to have a reference to the
641        # right reader.
642        super(AlignTraj, self).__init__(mobile.trajectory, **kwargs)
643        if not self._verbose:
644            logging.disable(logging.WARN)
645
646        # store reference to mobile atoms
647        self.mobile = mobile.atoms
648
649        self.filename = filename
650
651        natoms = self.mobile.n_atoms
652        self.ref_atoms, self.mobile_atoms = get_matching_atoms(
653            self.ref_atoms, self.mobile_atoms, tol_mass=tol_mass,
654            strict=strict)
655
656        # with self.filename == None (in_memory), the NullWriter is chosen
657        # (which just ignores input) and so only the in_memory trajectory is
658        # retained
659        self._writer = mda.Writer(self.filename, natoms)
660
661        self._weights = get_weights(self.ref_atoms, weights)
662
663        logger.info("RMS-fitting on {0:d} atoms.".format(len(self.ref_atoms)))
664
665    def _prepare(self):
666        # reference centre of mass system
667        self._ref_com = self.ref_atoms.center(self._weights)
668        self._ref_coordinates = self.ref_atoms.positions - self._ref_com
669        # allocate the array for selection atom coords
670        self.rmsd = np.zeros((self.n_frames,))
671
672    def _single_frame(self):
673        index = self._frame_index
674        mobile_com = self.mobile_atoms.center(self._weights)
675        mobile_coordinates = self.mobile_atoms.positions - mobile_com
676        mobile_atoms, self.rmsd[index] = _fit_to(mobile_coordinates,
677                                                 self._ref_coordinates,
678                                                 self.mobile,
679                                                 mobile_com,
680                                                 self._ref_com, self._weights)
681        # write whole aligned input trajectory system
682        self._writer.write(mobile_atoms)
683
684    def _conclude(self):
685        self._writer.close()
686        if not self._verbose:
687            logging.disable(logging.NOTSET)
688
689    @deprecate(release="0.19.0", remove="1.0")
690    def save(self, rmsdfile):
691        """save rmsd as a numpy array
692        """
693        # these are the values of the new rmsd between the aligned trajectory
694        # and reference structure
695        np.savetxt(rmsdfile, self.rmsd)
696        logger.info("Wrote RMSD timeseries  to file %r", rmsdfile)
697
698
699def sequence_alignment(mobile, reference, match_score=2, mismatch_penalty=-1,
700                       gap_penalty=-2, gapextension_penalty=-0.1):
701    """Generate a global sequence alignment between two residue groups.
702
703    The residues in `reference` and `mobile` will be globally aligned.
704    The global alignment uses the Needleman-Wunsch algorithm as
705    implemented in :mod:`Bio.pairwise2`. The parameters of the dynamic
706    programming algorithm can be tuned with the keywords. The defaults
707    should be suitable for two similar sequences. For sequences with
708    low sequence identity, more specialized tools such as clustalw,
709    muscle, tcoffee, or similar should be used.
710
711    Parameters
712    ----------
713    mobile : AtomGroup
714        Atom group to be aligned
715    reference : AtomGroup
716        Atom group to be aligned against
717    match_score : float (optional), default 2
718         score for matching residues, default 2
719    mismatch_penalty : float (optional), default -1
720         penalty for residues that do not match , default : -1
721    gap_penalty : float (optional), default -2
722         penalty for opening a gap; the high default value creates compact
723         alignments for highly identical sequences but might not be suitable
724         for sequences with low identity, default : -2
725    gapextension_penalty : float (optional), default -0.1
726         penalty for extending a gap, default: -0.1
727
728    Returns
729    -------
730    alignment : tuple
731        Tuple of top sequence matching output `('Sequence A', 'Sequence B', score,
732        begin, end)`
733
734    See Also
735    --------
736    BioPython documentation for `pairwise2`_. Alternatively, use
737    :func:`fasta2select` with :program:`clustalw2` and the option
738    ``is_aligned=False``.
739
740    .. _`pairwise2`: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
741
742    .. versionadded:: 0.10.0
743
744    """
745    aln = Bio.pairwise2.align.globalms(
746        reference.residues.sequence(format="string"),
747        mobile.residues.sequence(format="string"),
748        match_score, mismatch_penalty, gap_penalty, gapextension_penalty)
749    # choose top alignment
750    return aln[0]
751
752
753def fasta2select(fastafilename, is_aligned=False,
754                 ref_resids=None, target_resids=None,
755                 ref_offset=0, target_offset=0, verbosity=3,
756                 alnfilename=None, treefilename=None, clustalw="clustalw2"):
757    """Return selection strings that will select equivalent residues.
758
759    The function aligns two sequences provided in a FASTA file and
760    constructs MDAnalysis selection strings of the common atoms. When
761    these two strings are applied to the two different proteins they
762    will generate AtomGroups of the aligned residues.
763
764    `fastafilename` contains the two un-aligned sequences in FASTA
765    format. The reference is assumed to be the first sequence, the
766    target the second. ClustalW_ produces a pairwise
767    alignment (which is written to a file with suffix ``.aln``).  The
768    output contains atom selection strings that select the same atoms
769    in the two structures.
770
771    Unless `ref_offset` and/or `target_offset` are specified, the resids
772    in the structure are assumed to correspond to the positions in the
773    un-aligned sequence, namely the first residue has resid == 1.
774
775    In more complicated cases (e.g., when the resid numbering in the
776    input structure has gaps due to missing parts), simply provide the
777    sequence of resids as they appear in the topology in `ref_resids` or
778    `target_resids`, e.g. ::
779
780       target_resids = [a.resid for a in trj.select_atoms('name CA')]
781
782    (This translation table *is* combined with any value for
783    `ref_offset` or `target_offset`!)
784
785    Parameters
786    ----------
787    fastafilename : str, path to filename
788        FASTA file with first sequence as reference and
789        second the one to be aligned (ORDER IS IMPORTANT!)
790    is_aligned : bool (optional)
791        ``False`` (default)
792            run clustalw for sequence alignment;
793        ``True``
794            use the alignment in the file (e.g. from STAMP) [``False``]
795    ref_offset : int (optional)
796        add this number to the column number in the FASTA file
797        to get the original residue number, default: 0
798    target_offset : int (optional)
799        add this number to the column number in the FASTA file
800        to get the original residue number, default: 0
801    ref_resids : str (optional)
802        sequence of resids as they appear in the reference structure
803    target_resids : str (optional)
804        sequence of resids as they appear in the target
805    alnfilename : str (optional)
806        filename of ClustalW alignment (clustal format) that is
807        produced by *clustalw* when *is_aligned* = ``False``.
808        default ``None`` uses the name and path of *fastafilename* and
809        subsititutes the suffix with '.aln'.
810    treefilename: str (optional)
811        filename of ClustalW guide tree (Newick format);
812        if default ``None``  the the filename is generated from *alnfilename*
813        with the suffix '.dnd' instead of '.aln'
814    clustalw : str (optional)
815        path to the ClustalW (or ClustalW2) binary; only
816        needed for `is_aligned` = ``False``, default: "ClustalW2"
817
818    Returns
819    -------
820    select_dict : dict
821        dictionary with 'reference' and 'mobile' selection string
822        that can be used immediately in :class:`AlignTraj` as
823        ``select=select_dict``.
824
825
826    See Also
827    --------
828    :func:`sequence_alignment`, which does not require external
829    programs.
830
831
832    .. _ClustalW: http://www.clustal.org/
833    .. _STAMP: http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/
834
835    """
836    protein_gapped = Bio.Alphabet.Gapped(Bio.Alphabet.IUPAC.protein)
837    if is_aligned:
838        logger.info("Using provided alignment {}".format(fastafilename))
839        with open(fastafilename) as fasta:
840            alignment = Bio.AlignIO.read(
841                fasta, "fasta", alphabet=protein_gapped)
842    else:
843        if alnfilename is None:
844            filepath, ext = os.path.splitext(fastafilename)
845            alnfilename = filepath + '.aln'
846        if treefilename is None:
847            filepath, ext = os.path.splitext(alnfilename)
848            treefilename = filepath + '.dnd'
849        run_clustalw = Bio.Align.Applications.ClustalwCommandline(
850            clustalw,
851            infile=fastafilename,
852            type="protein",
853            align=True,
854            outfile=alnfilename,
855            newtree=treefilename)
856        logger.debug(
857            "Aligning sequences in %(fastafilename)r with %(clustalw)r.",
858            vars())
859        logger.debug("ClustalW commandline: %r", str(run_clustalw))
860        try:
861            stdout, stderr = run_clustalw()
862        except:
863            logger.exception("ClustalW %(clustalw)r failed", vars())
864            logger.info(
865                "(You can get clustalw2 from http://www.clustal.org/clustal2/)")
866            raise
867        with open(alnfilename) as aln:
868            alignment = Bio.AlignIO.read(
869                aln, "clustal", alphabet=protein_gapped)
870        logger.info(
871            "Using clustalw sequence alignment {0!r}".format(alnfilename))
872        logger.info(
873            "ClustalW Newick guide tree was also produced: {0!r}".format(treefilename))
874
875    nseq = len(alignment)
876    if nseq != 2:
877        raise ValueError(
878            "Only two sequences in the alignment can be processed.")
879
880    # implict assertion that we only have two sequences in the alignment
881    orig_resids = [ref_resids, target_resids]
882    offsets = [ref_offset, target_offset]
883    for iseq, a in enumerate(alignment):
884        # need iseq index to change orig_resids
885        if orig_resids[iseq] is None:
886            # build default: assume consecutive numbering of all
887            # residues in the alignment
888            GAP = a.seq.alphabet.gap_char
889            length = len(a.seq) - a.seq.count(GAP)
890            orig_resids[iseq] = np.arange(1, length + 1)
891        else:
892            orig_resids[iseq] = np.asarray(orig_resids[iseq])
893    # add offsets to the sequence <--> resid translation table
894    seq2resids = [resids + offset for resids, offset in zip(
895        orig_resids, offsets)]
896    del orig_resids
897    del offsets
898
899    def resid_factory(alignment, seq2resids):
900        """Return a function that gives the resid for a position ipos in
901        the nseq'th alignment.
902
903        resid = resid_factory(alignment,seq2resids)
904        r = resid(nseq,ipos)
905
906        It is based on a look up table that translates position in the
907        alignment to the residue number in the original
908        sequence/structure.
909
910        The first index of resid() is the alignmment number, the
911        second the position in the alignment.
912
913        seq2resids translates the residues in the sequence to resid
914        numbers in the psf. In the simplest case this is a linear map
915        but if whole parts such as loops are ommitted from the protein
916        the seq2resids may have big gaps.
917
918        Format: a tuple of two numpy arrays; the first array is for
919        the reference, the second for the target, The index in each
920        array gives the consecutive number of the amino acid in the
921        sequence, the value the resid in the structure/psf.
922
923        Note: assumes that alignments have same length and are padded if
924        necessary.
925        """
926        # could maybe use Bio.PDB.StructureAlignment instead?
927        nseq = len(alignment)
928        t = np.zeros((nseq, alignment.get_alignment_length()), dtype=int)
929        for iseq, a in enumerate(alignment):
930            GAP = a.seq.alphabet.gap_char
931            t[iseq, :] = seq2resids[iseq][np.cumsum(np.where(
932                np.array(list(a.seq)) == GAP, 0, 1)) - 1]
933            # -1 because seq2resid is index-1 based (resids start at 1)
934
935        def resid(nseq, ipos, t=t):
936            return t[nseq, ipos]
937
938        return resid
939
940    resid = resid_factory(alignment, seq2resids)
941
942    res_list = []  # collect individual selection string
943    # could collect just resid and type (with/without CB) and
944    # then post-process and use ranges for continuous stretches, eg
945    # ( resid 1:35 and ( backbone or name CB ) ) or ( resid 36 and backbone )
946
947    # should be the same for both seqs
948    GAP = alignment[0].seq.alphabet.gap_char
949    if GAP != alignment[1].seq.alphabet.gap_char:
950        raise ValueError(
951            "Different gap characters in sequence 'target' and 'mobile'.")
952    for ipos in range(alignment.get_alignment_length()):
953        aligned = list(alignment[:, ipos])
954        if GAP in aligned:
955            continue  # skip residue
956        template = "resid %i"
957        if 'G' not in aligned:
958            # can use CB
959            template += " and ( backbone or name CB )"
960        else:
961            template += " and backbone"
962        template = "( " + template + " )"
963
964        res_list.append([template % resid(iseq, ipos) for iseq in range(nseq)])
965
966    sel = np.array(res_list).transpose()
967
968    ref_selection = " or ".join(sel[0])
969    target_selection = " or ".join(sel[1])
970    return {'reference': ref_selection, 'mobile': target_selection}
971
972
973def get_matching_atoms(ag1, ag2, tol_mass=0.1, strict=False):
974    """Return two atom groups with one-to-one matched atoms.
975
976    The function takes two :class:`~MDAnalysis.core.groups.AtomGroup`
977    instances `ag1` and `ag2` and returns two atom groups `g1` and `g2` that
978    consist of atoms so that the mass of atom ``g1[0]`` is the same as the mass
979    of atom ``g2[0]``, ``g1[1]`` and ``g2[1]`` etc.
980
981    The current implementation is very simplistic and works on a per-residue basis:
982
983    1. The two groups must contain the same number of residues.
984    2. Any residues in each group that have differing number of atoms are discarded.
985    3. The masses of corresponding atoms are compared. and if any masses differ
986       by more than `tol_mass` the test is considered failed and a
987       :exc:`SelectionError` is raised.
988
989    The log file (see :func:`MDAnalysis.start_logging`) will contain detailed
990    information about mismatches.
991
992    Parameters
993    ----------
994    ag1 : AtomGroup
995        First :class:`~MDAnalysis.core.groups.AtomGroup` instance that is
996        compared
997    ag2 : AtomGroup
998        Second :class:`~MDAnalysis.core.groups.AtomGroup` instance that is
999        compared
1000    tol_mass : float (optional)
1001         Reject if the atomic masses for matched atoms differ by more than
1002         `tol_mass` [0.1]
1003    strict : bool (optional)
1004        ``True``
1005            Will raise :exc:`SelectionError` if a single atom does not
1006            match between the two selections.
1007        ``False`` [default]
1008            Will try to prepare a matching selection by dropping
1009            residues with non-matching atoms. See :func:`get_matching_atoms`
1010            for details.
1011
1012    Returns
1013    -------
1014    (g1, g2) : tuple
1015        Tuple with :class:`~MDAnalysis.core.groups.AtomGroup`
1016        instances that match, atom by atom. The groups are either the
1017        original groups if all matched or slices of the original
1018        groups.
1019
1020    Raises
1021    ------
1022    :exc:`SelectionError`
1023        Error raised if the number of residues does not match or if in the final
1024        matching masses differ by more than *tol*.
1025
1026    Notes
1027    -----
1028    The algorithm could be improved by using e.g. the Needleman-Wunsch
1029    algorithm in :mod:`Bio.profile2` to align atoms in each residue (doing a
1030    global alignment is too expensive).
1031
1032    .. versionadded:: 0.8
1033
1034    .. versionchanged:: 0.10.0
1035       Renamed from :func:`check_same_atoms` to
1036       :func:`get_matching_atoms` and now returns matching atomgroups
1037       (possibly with residues removed)
1038
1039    """
1040
1041    if ag1.n_atoms != ag2.n_atoms:
1042        if ag1.n_residues != ag2.n_residues:
1043            errmsg = ("Reference and trajectory atom selections do not contain "
1044                      "the same number of atoms: \n"
1045                      "atoms:    N_ref={0}, N_traj={1}\n"
1046                      "and also not the same number of residues:\n"
1047                      "residues: N_ref={2}, N_traj={3}").format(
1048                          ag1.n_atoms, ag2.n_atoms,
1049                          ag1.n_residues, ag2.n_residues)
1050            logger.error(errmsg)
1051            raise SelectionError(errmsg)
1052        else:
1053            msg = ("Reference and trajectory atom selections do not contain "
1054                   "the same number of atoms: \n"
1055                   "atoms:    N_ref={0}, N_traj={1}").format(
1056                       ag1.n_atoms, ag2.n_atoms)
1057            if strict:
1058                logger.error(msg)
1059                raise SelectionError(msg)
1060
1061            # continue with trying to create a valid selection
1062            msg += ("\nbut we attempt to create a valid selection " +
1063                    "(use strict=True to disable this heuristic).")
1064            logger.info(msg)
1065            warnings.warn(msg, category=SelectionWarning)
1066
1067        # continue with trying to salvage the selection:
1068        # - number of atoms is different
1069        # - number of residues is the same
1070        # We will remove residues with mismatching number of atoms (e.g. not resolved
1071        # in an X-ray structure)
1072        assert ag1.n_residues == ag2.n_residues
1073
1074        # Alternatively, we could align all atoms but Needleman-Wunsch
1075        # pairwise2 consumes too much memory for thousands of characters in
1076        # each sequence. Perhaps a solution would be pairwise alignment per residue.
1077        #
1078        # aln_elem = Bio.pairwise2.align.globalms("".join([MDAnalysis.topology.
1079        # core.guess_atom_element(n) for n in gref.atoms.names]),
1080        # "".join([MDAnalysis.topology.core.guess_atom_element(n)
1081        # for n in models[0].atoms.names]),
1082        # 2, -1, -1, -0.1,
1083        # one_alignment_only=True)
1084
1085        # For now, just remove the residues that don't have matching numbers
1086        # NOTE: This can create empty selections, e.g., when comparing a structure
1087        #       with hydrogens to a PDB structure without hydrogens.
1088        rsize1 = np.array([r.atoms.n_atoms for r in ag1.residues])
1089        rsize2 = np.array([r.atoms.n_atoms for r in ag2.residues])
1090        rsize_mismatches = np.absolute(rsize1 - rsize2)
1091        mismatch_mask = (rsize_mismatches > 0)
1092        if np.any(mismatch_mask):
1093            if strict:
1094                # diagnostics
1095                mismatch_resindex = np.arange(ag1.n_residues)[mismatch_mask]
1096
1097                def log_mismatch(
1098                        number,
1099                        ag,
1100                        rsize,
1101                        mismatch_resindex=mismatch_resindex):
1102                    logger.error("Offending residues: group {0}: {1}".format(
1103                        number,
1104                        ", ".join(["{0[0]}{0[1]} ({0[2]})".format(r) for r in
1105                                   zip(ag.resnames[mismatch_resindex],
1106                                       ag.resids[mismatch_resindex],
1107                                       rsize[mismatch_resindex]
1108                                       )])))
1109                logger.error("Found {0} residues with non-matching numbers of atoms (#)".format(
1110                    mismatch_mask.sum()))
1111                log_mismatch(1, ag1, rsize1)
1112                log_mismatch(2, ag2, rsize2)
1113
1114                errmsg = ("Different number of atoms in some residues. "
1115                          "(Use strict=False to attempt using matching atoms only.)")
1116                logger.error(errmsg)
1117                raise SelectionError(errmsg)
1118
1119            def get_atoms_byres(g, match_mask=np.logical_not(mismatch_mask)):
1120                # not pretty... but need to do things on a per-atom basis in
1121                # order to preserve original selection
1122                ag = g.atoms
1123                good = ag.residues.resids[match_mask]  # resid for each residue
1124                resids = ag.resids                     # resid for each atom
1125                # boolean array for all matching atoms
1126                ix_good = np.in1d(resids, good)
1127                return ag[ix_good]
1128
1129            _ag1 = get_atoms_byres(ag1)
1130            _ag2 = get_atoms_byres(ag2)
1131
1132            assert _ag1.atoms.n_atoms == _ag2.atoms.n_atoms
1133
1134            # diagnostics
1135            # (ugly workaround for missing boolean indexing of AtomGroup)
1136            # note: ag[arange(len(ag))[boolean]] is ~2x faster than
1137            # ag[where[boolean]]
1138            mismatch_resindex = np.arange(ag1.n_residues)[mismatch_mask]
1139            logger.warning("Removed {0} residues with non-matching numbers of atoms"
1140                           .format(mismatch_mask.sum()))
1141            logger.debug("Removed residue ids: group 1: {0}"
1142                         .format(ag1.residues.resids[mismatch_resindex]))
1143            logger.debug("Removed residue ids: group 2: {0}"
1144                         .format(ag2.residues.resids[mismatch_resindex]))
1145            # replace after logging (still need old ag1 and ag2 for
1146            # diagnostics)
1147            ag1 = _ag1
1148            ag2 = _ag2
1149            del _ag1, _ag2
1150
1151            # stop if we created empty selections (by removing ALL residues...)
1152            if ag1.n_atoms == 0 or ag2.n_atoms == 0:
1153                errmsg = ("Failed to automatically find matching atoms: created empty selections. " +
1154                          "Try to improve your selections for mobile and reference.")
1155                logger.error(errmsg)
1156                raise SelectionError(errmsg)
1157
1158    # check again because the residue matching heuristic is not very
1159    # good and can easily be misled (e.g., when one of the selections
1160    # had fewer atoms but the residues in mobile and reference have
1161    # each the same number)
1162    try:
1163        mass_mismatches = (np.absolute(ag1.masses - ag2.masses) > tol_mass)
1164    except ValueError:
1165        errmsg = ("Failed to find matching atoms: len(reference) = {}, len(mobile) = {} " +
1166                  "Try to improve your selections for mobile and reference.").format(
1167                      ag1.n_atoms, ag2.n_atoms)
1168        logger.error(errmsg)
1169        raise SelectionError(errmsg)
1170
1171    if np.any(mass_mismatches):
1172        # Test 2 failed.
1173        # diagnostic output:
1174        logger.error("Atoms: reference | trajectory")
1175        for ar, at in zip(ag1[mass_mismatches], ag2[mass_mismatches]):
1176            logger.error(
1177                "{0!s:>4} {1:3d} {2!s:>3} {3!s:>3} {4:6.3f}  |  {5!s:>4} {6:3d} {7!s:>3} {8!s:>3} {9:6.3f}".format(
1178                    ar.segid,
1179                    ar.resid,
1180                    ar.resname,
1181                    ar.name,
1182                    ar.mass,
1183                    at.segid,
1184                    at.resid,
1185                    at.resname,
1186                    at.name,
1187                    at.mass))
1188        errmsg = ("Inconsistent selections, masses differ by more than {0}; "
1189                  "mis-matching atoms are shown above.").format(tol_mass)
1190        logger.error(errmsg)
1191        raise SelectionError(errmsg)
1192
1193    return ag1, ag2
1194