1# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*- 2# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 3# 4# MDAnalysis --- https://www.mdanalysis.org 5# Copyright (c) 2006-2017 The MDAnalysis Development Team and contributors 6# (see the file AUTHORS for the full list of names) 7# 8# Released under the GNU Public Licence, v2 or any higher version 9# 10# Please cite your use of MDAnalysis in published work: 11# 12# R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, 13# D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein. 14# MDAnalysis: A Python package for the rapid analysis of molecular dynamics 15# simulations. In S. Benthall and S. Rostrup editors, Proceedings of the 15th 16# Python in Science Conference, pages 102-109, Austin, TX, 2016. SciPy. 17# 18# N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. 19# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. 20# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787 21# 22 23"""Coordinate fitting and alignment --- :mod:`MDAnalysis.analysis.align` 24===================================================================== 25 26:Author: Oliver Beckstein, Joshua Adelman 27:Year: 2010--2013 28:Copyright: GNU Public License v3 29 30The module contains functions to fit a target structure to a reference 31structure. They use the fast QCP algorithm to calculate the root mean 32square distance (RMSD) between two coordinate sets [Theobald2005]_ and 33the rotation matrix *R* that minimizes the RMSD [Liu2010]_. (Please 34cite these references when using this module.). 35 36Typically, one selects a group of atoms (such as the C-alphas), 37calculates the RMSD and transformation matrix, and applys the 38transformation to the current frame of a trajectory to obtain the 39rotated structure. The :func:`alignto` and :class:`AlignTraj` 40functions can be used to do this for individual frames and 41trajectories respectively. 42 43The :ref:`RMS-fitting-tutorial` shows how to do the individual steps 44manually and explains the intermediate steps. 45 46See Also 47-------- 48:mod:`MDAnalysis.analysis.rms` 49 contains functions to compute RMSD (when structural alignment is not 50 required) 51:mod:`MDAnalysis.lib.qcprot` 52 implements the fast RMSD algorithm. 53 54 55.. _RMS-fitting-tutorial: 56 57RMS-fitting tutorial 58-------------------- 59 60The example uses files provided as part of the MDAnalysis test suite 61(in the variables :data:`~MDAnalysis.tests.datafiles.PSF`, 62:data:`~MDAnalysis.tests.datafiles.DCD`, and 63:data:`~MDAnalysis.tests.datafiles.PDB_small`). For all further 64examples execute first :: 65 66 >>> import MDAnalysis as mda 67 >>> from MDAnalysis.analysis import align 68 >>> from MDAnalysis.analysis.rms import rmsd 69 >>> from MDAnalysis.tests.datafiles import PSF, DCD, PDB_small 70 71 72In the simplest case, we can simply calculate the C-alpha RMSD between 73two structures, using :func:`rmsd`:: 74 75 >>> ref = mda.Universe(PDB_small) 76 >>> mobile = mda.Universe(PSF,DCD) 77 >>> rmsd(mobile.select_atoms('name CA').positions, ref.select_atoms('name CA').positions) 78 16.282308620224068 79 80Note that in this example translations have not been removed. In order 81to look at the pure rotation one needs to superimpose the centres of 82mass (or geometry) first: 83 84 >>> rmsd(mobile.select_atoms('name CA').positions, ref.select_atoms('name CA').positions, center=True) 85 12.639693690256898 86 87This has only done a translational superposition. If you want to also do a 88rotational superposition use the superposition keyword. This will calculate a 89minimized RMSD between the reference and mobile structure. 90 91 >>> rmsd(mobile.select_atoms('name CA').positions, ref.select_atoms('name CA').positions, 92 >>> superposition=True) 93 6.8093965864717951 94 95The rotation matrix that superimposes *mobile* on *ref* while 96minimizing the CA-RMSD is obtained with the :func:`rotation_matrix` 97function :: 98 99 >>> mobile0 = mobile.select_atoms('name CA').positions - mobile.atoms.center_of_mass() 100 >>> ref0 = ref.select_atoms('name CA').positions - ref.atoms.center_of_mass() 101 >>> R, rmsd = align.rotation_matrix(mobile0, ref0) 102 >>> print rmsd 103 6.8093965864717951 104 >>> print R 105 [[ 0.14514539 -0.27259113 0.95111876] 106 [ 0.88652593 0.46267112 -0.00268642] 107 [-0.43932289 0.84358136 0.30881368]] 108 109Putting all this together one can superimpose all of *mobile* onto *ref*:: 110 111 >>> mobile.atoms.translate(-mobile.select_atoms('name CA').center_of_mass()) 112 >>> mobile.atoms.rotate(R) 113 >>> mobile.atoms.translate(ref.select_atoms('name CA').center_of_mass()) 114 >>> mobile.atoms.write("mobile_on_ref.pdb") 115 116 117Common usage 118------------ 119 120To **fit a single structure** with :func:`alignto`:: 121 122 >>> ref = mda.Universe(PSF, PDB_small) 123 >>> mobile = mda.Universe(PSF, DCD) # we use the first frame 124 >>> align.alignto(mobile, ref, select="protein and name CA", weights="mass") 125 126This will change *all* coordinates in *mobile* so that the protein 127C-alpha atoms are optimally superimposed (translation and rotation). 128 129To **fit a whole trajectory** to a reference structure with the 130:class:`AlignTraj` class:: 131 132 >>> ref = mda.Universe(PSF, PDB_small) # reference structure 1AKE 133 >>> trj = mda.Universe(PSF, DCD) # trajectory of change 1AKE->4AKE 134 >>> alignment = align.AlignTraj(trj, ref, filename='rmsfit.dcd') 135 >>> alignment.run() 136 137It is also possible to align two arbitrary structures by providing a 138mapping between atoms based on a sequence alignment. This allows 139fitting of structural homologs or wild type and mutant. 140 141If a alignment was provided as "sequences.aln" one would first produce 142the appropriate MDAnalysis selections with the :func:`fasta2select` 143function and then feed the resulting dictionary to :class:`AlignTraj`:: 144 145 >>> seldict = align.fasta2select('sequences.aln') 146 >>> alignment = align.AlignTraj(trj, ref, filename='rmsfit.dcd', select=seldict) 147 >>> alignment.run() 148 149(See the documentation of the functions for this advanced usage.) 150 151 152Functions and Classes 153--------------------- 154 155.. versionchanged:: 0.10.0 156 Function :func:`~MDAnalysis.analysis.rms.rmsd` was removed from 157 this module and is now exclusively accessible as 158 :func:`~MDAnalysis.analysis.rms.rmsd`. 159 160.. versionchanged:: 0.16.0 161 Function :func:`~MDAnalysis.analysis.align.rms_fit_trj` deprecated 162 in favor of :class:`AlignTraj` class. 163 164.. versionchanged:: 0.17.0 165 removed deprecated :func:`~MDAnalysis.analysis.align.rms_fit_trj` 166 167.. autofunction:: alignto 168.. autoclass:: AlignTraj 169.. autofunction:: rotation_matrix 170 171 172Helper functions 173---------------- 174 175The following functions are used by the other functions in this 176module. They are probably of more interest to developers than to 177normal users. 178 179.. autofunction:: _fit_to 180.. autofunction:: fasta2select 181.. autofunction:: sequence_alignment 182.. autofunction:: get_matching_atoms 183 184""" 185from __future__ import division, absolute_import 186 187import os.path 188import warnings 189import logging 190 191from six.moves import range, zip, zip_longest 192from six import string_types 193 194import numpy as np 195 196import Bio.SeqIO 197import Bio.AlignIO 198import Bio.Align.Applications 199import Bio.Alphabet 200import Bio.pairwise2 201 202import MDAnalysis as mda 203import MDAnalysis.lib.qcprot as qcp 204from MDAnalysis.exceptions import SelectionError, SelectionWarning 205import MDAnalysis.analysis.rms as rms 206from MDAnalysis.coordinates.memory import MemoryReader 207from MDAnalysis.lib.util import get_weights, deprecate 208 209from .base import AnalysisBase 210 211logger = logging.getLogger('MDAnalysis.analysis.align') 212 213 214def rotation_matrix(a, b, weights=None): 215 r"""Returns the 3x3 rotation matrix `R` for RMSD fitting coordinate 216 sets `a` and `b`. 217 218 The rotation matrix `R` transforms vector `a` to overlap with 219 vector `b` (i.e., `b` is the reference structure): 220 221 .. math:: 222 \mathbf{b} = \mathsf{R} \cdot \mathbf{a} 223 224 Parameters 225 ---------- 226 a : array_like 227 coordinates that are to be rotated ("mobile set"); array of N atoms 228 of shape N*3 as generated by, e.g., 229 :attr:`MDAnalysis.core.groups.AtomGroup.positions`. 230 b : array_like 231 reference coordinates; array of N atoms of shape N*3 as generated by, 232 e.g., :attr:`MDAnalysis.core.groups.AtomGroup.positions`. 233 weights : array_like (optional) 234 array of floats of size N for doing weighted RMSD fitting (e.g. the 235 masses of the atoms) 236 237 Returns 238 ------- 239 R : ndarray 240 rotation matrix 241 rmsd : float 242 RMSD between `a` and `b` before rotation 243 ``(R, rmsd)`` rmsd and rotation matrix *R* 244 245 Example 246 ------- 247 `R` can be used as an argument for 248 :meth:`MDAnalysis.core.groups.AtomGroup.rotate` to generate a rotated 249 selection, e.g. :: 250 251 >>> R = rotation_matrix(A.select_atoms('backbone').positions, 252 >>> B.select_atoms('backbone').positions)[0] 253 >>> A.atoms.rotate(R) 254 >>> A.atoms.write("rotated.pdb") 255 256 Notes 257 ----- 258 The function does *not* shift the centers of mass or geometry; 259 this needs to be done by the user. 260 261 See Also 262 -------- 263 MDAnalysis.analysis.rms.rmsd: Calculates the RMSD between *a* and *b*. 264 alignto: A complete fit of two structures. 265 AlignTraj: Fit a whole trajectory. 266 """ 267 268 a = np.asarray(a, dtype=np.float64) 269 b = np.asarray(b, dtype=np.float64) 270 271 if a.shape != b.shape: 272 raise ValueError("'a' and 'b' must have same shape") 273 274 N = b.shape[0] 275 276 if weights is not None: 277 # qcp does NOT divide weights relative to the mean 278 weights = np.asarray(weights, dtype=np.float64) / np.mean(weights) 279 280 rot = np.zeros(9, dtype=np.float64) 281 282 # Need to transpose coordinates such that the coordinate array is 283 # 3xN instead of Nx3. Also qcp requires that the dtype be float64 284 # (I think we swapped the position of ref and traj in CalcRMSDRotationalMatrix 285 # so that R acts **to the left** and can be broadcasted; we're saving 286 # one transpose. [orbeckst]) 287 rmsd = qcp.CalcRMSDRotationalMatrix(a, b, N, rot, weights) 288 return rot.reshape(3, 3), rmsd 289 290 291def _fit_to(mobile_coordinates, ref_coordinates, mobile_atoms, 292 mobile_com, ref_com, weights=None): 293 r"""Perform an rmsd-fitting to determine rotation matrix and align atoms 294 295 Parameters 296 ---------- 297 mobile_coordinates : ndarray 298 Coordinates of atoms to be aligned 299 ref_coordinates : ndarray 300 Coordinates of atoms to be fit against 301 mobile_atoms : AtomGroup 302 Atoms to be translated 303 mobile_com: ndarray 304 array of xyz coordinate of mobile center of mass 305 ref_com: ndarray 306 array of xyz coordinate of reference center of mass 307 weights : array_like (optional) 308 choose weights. With ``None`` weigh each atom equally. If a float array 309 of the same length as `mobile_coordinates` is provided, use each element 310 of the `array_like` as a weight for the corresponding atom in 311 `mobile_coordinates`. 312 313 Returns 314 ------- 315 mobile_atoms : AtomGroup 316 AtomGroup of translated and rotated atoms 317 min_rmsd : float 318 Minimum rmsd of coordinates 319 320 Notes 321 ----- 322 This function assumes that `mobile_coordinates` and `ref_coordinates` have 323 already been shifted so that their centers of geometry (or centers of mass, 324 depending on `weights`) coincide at the origin. `mobile_com` and `ref_com` 325 are the centers *before* this shift. 326 327 1. The rotation matrix :math:`\mathsf{R}` is determined with 328 :func:`rotation_matrix` directly from `mobile_coordinates` and 329 `ref_coordinates`. 330 2. `mobile_atoms` :math:`X` is rotated according to the 331 rotation matrix and the centers according to 332 333 .. math:: 334 335 X' = \mathsf{R}(X - \bar{X}) + \bar{X}_{\text{ref}} 336 337 where :math:`\bar{X}` is the center. 338 339 """ 340 R, min_rmsd = rotation_matrix(mobile_coordinates, ref_coordinates, 341 weights=weights) 342 343 mobile_atoms.translate(-mobile_com) 344 mobile_atoms.rotate(R) 345 mobile_atoms.translate(ref_com) 346 347 return mobile_atoms, min_rmsd 348 349 350def alignto(mobile, reference, select="all", weights=None, 351 subselection=None, tol_mass=0.1, strict=False): 352 """Perform a spatial superposition by minimizing the RMSD. 353 354 Spatially align the group of atoms `mobile` to `reference` by 355 doing a RMSD fit on `select` atoms. 356 357 The superposition is done in the following way: 358 359 1. A rotation matrix is computed that minimizes the RMSD between 360 the coordinates of `mobile.select_atoms(sel1)` and 361 `reference.select_atoms(sel2)`; before the rotation, `mobile` is 362 translated so that its center of geometry (or center of mass) 363 coincides with the one of `reference`. (See below for explanation of 364 how *sel1* and *sel2* are derived from `select`.) 365 366 2. All atoms in :class:`~MDAnalysis.core.universe.Universe` that 367 contain `mobile` are shifted and rotated. (See below for how 368 to change this behavior through the `subselection` keyword.) 369 370 The `mobile` and `reference` atom groups can be constructed so that they 371 already match atom by atom. In this case, `select` should be set to "all" 372 (or ``None``) so that no further selections are applied to `mobile` and 373 `reference`, therefore preserving the exact atom ordering (see 374 :ref:`ordered-selections-label`). 375 376 .. Warning:: The atom order for `mobile` and `reference` is *only* 377 preserved when `select` is either "all" or ``None``. In any other case, 378 a new selection will be made that will sort the resulting AtomGroup by 379 index and therefore destroy the correspondence between the two groups. 380 **It is safest not to mix ordered AtomGroups with selection strings.** 381 382 Parameters 383 ---------- 384 mobile : Universe or AtomGroup 385 structure to be aligned, a 386 :class:`~MDAnalysis.core.groups.AtomGroup` or a whole 387 :class:`~MDAnalysis.core.universe.Universe` 388 reference : Universe or AtomGroup 389 reference structure, a :class:`~MDAnalysis.core.groups.AtomGroup` 390 or a whole :class:`~MDAnalysis.core.universe.Universe` 391 select : str or dict or tuple (optional) 392 The selection to operate on; can be one of: 393 394 1. any valid selection string for 395 :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` that 396 produces identical selections in `mobile` and `reference`; or 397 398 2. a dictionary ``{'mobile': sel1, 'reference': sel2}`` where *sel1* 399 and *sel2* are valid selection strings that are applied to 400 `mobile` and `reference` respectively (the 401 :func:`MDAnalysis.analysis.align.fasta2select` function returns such 402 a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or 403 404 3. a tuple ``(sel1, sel2)`` 405 406 When using 2. or 3. with *sel1* and *sel2* then these selection strings 407 are applied to `atomgroup` and `reference` respectively and should 408 generate *groups of equivalent atoms*. *sel1* and *sel2* can each also 409 be a *list of selection strings* to generate a 410 :class:`~MDAnalysis.core.groups.AtomGroup` with defined atom order as 411 described under :ref:`ordered-selections-label`). 412 weights : {"mass", ``None``} or array_like (optional) 413 choose weights. With ``"mass"`` uses masses as weights; with ``None`` 414 weigh each atom equally. If a float array of the same length as 415 `mobile` is provided, use each element of the `array_like` as a 416 weight for the corresponding atom in `mobile`. 417 tol_mass: float (optional) 418 Reject match if the atomic masses for matched atoms differ by more than 419 `tol_mass`, default [0.1] 420 strict: bool (optional) 421 ``True`` 422 Will raise :exc:`SelectionError` if a single atom does not 423 match between the two selections. 424 ``False`` [default] 425 Will try to prepare a matching selection by dropping 426 residues with non-matching atoms. See :func:`get_matching_atoms` 427 for details. 428 subselection : str or AtomGroup or None (optional) 429 Apply the transformation only to this selection. 430 431 ``None`` [default] 432 Apply to ``mobile.universe.atoms`` (i.e., all atoms in the 433 context of the selection from `mobile` such as the rest of a 434 protein, ligands and the surrounding water) 435 *selection-string* 436 Apply to ``mobile.select_atoms(selection-string)`` 437 :class:`~MDAnalysis.core.groups.AtomGroup` 438 Apply to the arbitrary group of atoms 439 440 Returns 441 ------- 442 old_rmsd : float 443 RMSD before spatial alignment 444 new_rmsd : float 445 RMSD after spatial alignment 446 447 See Also 448 -------- 449 AlignTraj: More efficient method for RMSD-fitting trajectories. 450 451 452 .. _ClustalW: http://www.clustal.org/ 453 .. _STAMP: http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/ 454 455 456 .. versionchanged:: 0.8 457 Added check that the two groups describe the same atoms including 458 the new *tol_mass* keyword. 459 460 .. versionchanged:: 0.10.0 461 Uses :func:`get_matching_atoms` to work with incomplete selections 462 and new `strict` keyword. The new default is to be lenient whereas 463 the old behavior was the equivalent of ``strict = True``. 464 465 .. versionchanged:: 0.16.0 466 new general 'weights' kwarg replace `mass_weighted`, deprecated `mass_weighted` 467 .. deprecated:: 0.16.0 468 Instead of ``mass_weighted=True`` use new ``weights='mass'`` 469 470 .. versionchanged:: 0.17.0 471 Deprecated keyword `mass_weighted` was removed. 472 """ 473 if select in ('all', None): 474 # keep the EXACT order in the input AtomGroups; select_atoms('all') 475 # orders them by index, which can lead to wrong results if the user 476 # has crafted mobile and reference to match atom by atom 477 mobile_atoms = mobile.atoms 478 ref_atoms = reference.atoms 479 else: 480 select = rms.process_selection(select) 481 mobile_atoms = mobile.select_atoms(*select['mobile']) 482 ref_atoms = reference.select_atoms(*select['reference']) 483 484 ref_atoms, mobile_atoms = get_matching_atoms(ref_atoms, mobile_atoms, 485 tol_mass=tol_mass, 486 strict=strict) 487 488 weights = get_weights(ref_atoms, weights) 489 490 mobile_com = mobile_atoms.center(weights) 491 ref_com = ref_atoms.center(weights) 492 493 ref_coordinates = ref_atoms.positions - ref_com 494 mobile_coordinates = mobile_atoms.positions - mobile_com 495 496 old_rmsd = rms.rmsd(mobile_coordinates, ref_coordinates, weights) 497 498 if subselection is None: 499 # mobile_atoms is Universe 500 mobile_atoms = mobile.universe.atoms 501 elif isinstance(subselection, string_types): 502 # select mobile_atoms from string 503 mobile_atoms = mobile.select_atoms(subselection) 504 else: 505 try: 506 # treat subselection as AtomGroup 507 mobile_atoms = subselection.atoms 508 except AttributeError: 509 raise TypeError("subselection must be a selection string, an" 510 " AtomGroup or Universe or None") 511 512 # _fit_to DOES subtract center of mass, will provide proper min_rmsd 513 mobile_atoms, new_rmsd = _fit_to(mobile_coordinates, ref_coordinates, 514 mobile_atoms, mobile_com, ref_com, 515 weights=weights) 516 return old_rmsd, new_rmsd 517 518 519class AlignTraj(AnalysisBase): 520 """RMS-align trajectory to a reference structure using a selection. 521 522 Both the reference `reference` and the trajectory `mobile` must be 523 :class:`MDAnalysis.Universe` instances. If they contain a trajectory then 524 it is used. The output file format is determined by the file extension of 525 `filename`. One can also use the same universe if one wants to fit to the 526 current frame. 527 528 """ 529 530 def __init__(self, mobile, reference, select='all', filename=None, 531 prefix='rmsfit_', weights=None, 532 tol_mass=0.1, strict=False, force=True, in_memory=False, 533 **kwargs): 534 """Parameters 535 ---------- 536 mobile : Universe 537 Universe containing trajectory to be fitted to reference 538 reference : Universe 539 Universe containing trajectory frame to be used as reference 540 select : str (optional) 541 Set as default to all, is used for Universe.select_atoms to choose 542 subdomain to be fitted against 543 filename : str (optional) 544 Provide a filename for results to be written to 545 prefix : str (optional) 546 Provide a string to prepend to filename for results to be written 547 to 548 weights : {"mass", ``None``} or array_like (optional) 549 choose weights. With ``"mass"`` uses masses of `reference` as 550 weights; with ``None`` weigh each atom equally. If a float array of 551 the same length as the selection is provided, use each element of 552 the `array_like` as a weight for the corresponding atom in the 553 selection. 554 tol_mass : float (optional) 555 Tolerance given to `get_matching_atoms` to find appropriate atoms 556 strict : bool (optional) 557 Force `get_matching_atoms` to fail if atoms can't be found using 558 exact methods 559 force : bool (optional) 560 Force overwrite of filename for rmsd-fitting 561 start : int (optional) 562 First frame of trajectory to analyse, Default: 0 563 stop : int (optional) 564 Last frame of trajectory to analyse, Default: -1 565 step : int (optional) 566 Step between frames to analyse, Default: 1 567 in_memory : bool (optional) 568 *Permanently* switch `mobile` to an in-memory trajectory 569 so that alignment can be done in-place, which can improve 570 performance substantially in some cases. In this case, no file 571 is written out (`filename` and `prefix` are ignored) and only 572 the coordinates of `mobile` are *changed in memory*. 573 verbose : bool (optional) 574 Set logger to show more information and show detailed progress of 575 the calculation if set to ``True``; the default is ``False``. 576 577 578 Attributes 579 ---------- 580 reference_atoms : AtomGroup 581 Atoms of the reference structure to be aligned against 582 mobile_atoms : AtomGroup 583 Atoms inside each trajectory frame to be rmsd_aligned 584 rmsd : Array 585 Array of the rmsd values of the least rmsd between the mobile_atoms 586 and reference_atoms after superposition and minimimization of rmsd 587 filename : str 588 String reflecting the filename of the file where mobile_atoms 589 positions will be written to upon running RMSD alignment 590 591 592 Notes 593 ----- 594 - If set to ``verbose=False``, it is recommended to wrap the statement 595 in a ``try ... finally`` to guarantee restoring of the log level in 596 the case of an exception. 597 - The ``in_memory`` option changes the `mobile` universe to an 598 in-memory representation (see :mod:`MDAnalysis.coordinates.memory`) 599 for the remainder of the Python session. If ``mobile.trajectory`` is 600 already a :class:`MemoryReader` then it is *always* treated as if 601 ``in_memory`` had been set to ``True``. 602 603 .. deprecated:: 0.19.1 604 Default ``filename`` directory will change in 1.0 to the current directory. 605 606 .. versionchanged:: 0.16.0 607 new general ``weights`` kwarg replace ``mass_weights`` 608 609 .. deprecated:: 0.16.0 610 Instead of ``mass_weighted=True`` use new ``weights='mass'`` 611 612 .. versionchanged:: 0.17.0 613 removed deprecated `mass_weighted` keyword 614 615 """ 616 select = rms.process_selection(select) 617 self.ref_atoms = reference.select_atoms(*select['reference']) 618 self.mobile_atoms = mobile.select_atoms(*select['mobile']) 619 if in_memory or isinstance(mobile.trajectory, MemoryReader): 620 mobile.transfer_to_memory() 621 filename = None 622 logger.info("Moved mobile trajectory to in-memory representation") 623 else: 624 if filename is None: 625 # DEPRECATED in 0.19.1 626 # Change in 1.0 627 # 628 # fn = os.path.split(mobile.trajectory.filename)[1] 629 # filename = prefix + fn 630 path, fn = os.path.split(mobile.trajectory.filename) 631 filename = os.path.join(path, prefix + fn) 632 logger.info('filename of rms_align with no filename given' 633 ': {0}'.format(filename)) 634 635 if os.path.exists(filename) and not force: 636 raise IOError( 637 'Filename already exists in path and force is not set' 638 ' to True') 639 640 # do this after setting the memory reader to have a reference to the 641 # right reader. 642 super(AlignTraj, self).__init__(mobile.trajectory, **kwargs) 643 if not self._verbose: 644 logging.disable(logging.WARN) 645 646 # store reference to mobile atoms 647 self.mobile = mobile.atoms 648 649 self.filename = filename 650 651 natoms = self.mobile.n_atoms 652 self.ref_atoms, self.mobile_atoms = get_matching_atoms( 653 self.ref_atoms, self.mobile_atoms, tol_mass=tol_mass, 654 strict=strict) 655 656 # with self.filename == None (in_memory), the NullWriter is chosen 657 # (which just ignores input) and so only the in_memory trajectory is 658 # retained 659 self._writer = mda.Writer(self.filename, natoms) 660 661 self._weights = get_weights(self.ref_atoms, weights) 662 663 logger.info("RMS-fitting on {0:d} atoms.".format(len(self.ref_atoms))) 664 665 def _prepare(self): 666 # reference centre of mass system 667 self._ref_com = self.ref_atoms.center(self._weights) 668 self._ref_coordinates = self.ref_atoms.positions - self._ref_com 669 # allocate the array for selection atom coords 670 self.rmsd = np.zeros((self.n_frames,)) 671 672 def _single_frame(self): 673 index = self._frame_index 674 mobile_com = self.mobile_atoms.center(self._weights) 675 mobile_coordinates = self.mobile_atoms.positions - mobile_com 676 mobile_atoms, self.rmsd[index] = _fit_to(mobile_coordinates, 677 self._ref_coordinates, 678 self.mobile, 679 mobile_com, 680 self._ref_com, self._weights) 681 # write whole aligned input trajectory system 682 self._writer.write(mobile_atoms) 683 684 def _conclude(self): 685 self._writer.close() 686 if not self._verbose: 687 logging.disable(logging.NOTSET) 688 689 @deprecate(release="0.19.0", remove="1.0") 690 def save(self, rmsdfile): 691 """save rmsd as a numpy array 692 """ 693 # these are the values of the new rmsd between the aligned trajectory 694 # and reference structure 695 np.savetxt(rmsdfile, self.rmsd) 696 logger.info("Wrote RMSD timeseries to file %r", rmsdfile) 697 698 699def sequence_alignment(mobile, reference, match_score=2, mismatch_penalty=-1, 700 gap_penalty=-2, gapextension_penalty=-0.1): 701 """Generate a global sequence alignment between two residue groups. 702 703 The residues in `reference` and `mobile` will be globally aligned. 704 The global alignment uses the Needleman-Wunsch algorithm as 705 implemented in :mod:`Bio.pairwise2`. The parameters of the dynamic 706 programming algorithm can be tuned with the keywords. The defaults 707 should be suitable for two similar sequences. For sequences with 708 low sequence identity, more specialized tools such as clustalw, 709 muscle, tcoffee, or similar should be used. 710 711 Parameters 712 ---------- 713 mobile : AtomGroup 714 Atom group to be aligned 715 reference : AtomGroup 716 Atom group to be aligned against 717 match_score : float (optional), default 2 718 score for matching residues, default 2 719 mismatch_penalty : float (optional), default -1 720 penalty for residues that do not match , default : -1 721 gap_penalty : float (optional), default -2 722 penalty for opening a gap; the high default value creates compact 723 alignments for highly identical sequences but might not be suitable 724 for sequences with low identity, default : -2 725 gapextension_penalty : float (optional), default -0.1 726 penalty for extending a gap, default: -0.1 727 728 Returns 729 ------- 730 alignment : tuple 731 Tuple of top sequence matching output `('Sequence A', 'Sequence B', score, 732 begin, end)` 733 734 See Also 735 -------- 736 BioPython documentation for `pairwise2`_. Alternatively, use 737 :func:`fasta2select` with :program:`clustalw2` and the option 738 ``is_aligned=False``. 739 740 .. _`pairwise2`: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html 741 742 .. versionadded:: 0.10.0 743 744 """ 745 aln = Bio.pairwise2.align.globalms( 746 reference.residues.sequence(format="string"), 747 mobile.residues.sequence(format="string"), 748 match_score, mismatch_penalty, gap_penalty, gapextension_penalty) 749 # choose top alignment 750 return aln[0] 751 752 753def fasta2select(fastafilename, is_aligned=False, 754 ref_resids=None, target_resids=None, 755 ref_offset=0, target_offset=0, verbosity=3, 756 alnfilename=None, treefilename=None, clustalw="clustalw2"): 757 """Return selection strings that will select equivalent residues. 758 759 The function aligns two sequences provided in a FASTA file and 760 constructs MDAnalysis selection strings of the common atoms. When 761 these two strings are applied to the two different proteins they 762 will generate AtomGroups of the aligned residues. 763 764 `fastafilename` contains the two un-aligned sequences in FASTA 765 format. The reference is assumed to be the first sequence, the 766 target the second. ClustalW_ produces a pairwise 767 alignment (which is written to a file with suffix ``.aln``). The 768 output contains atom selection strings that select the same atoms 769 in the two structures. 770 771 Unless `ref_offset` and/or `target_offset` are specified, the resids 772 in the structure are assumed to correspond to the positions in the 773 un-aligned sequence, namely the first residue has resid == 1. 774 775 In more complicated cases (e.g., when the resid numbering in the 776 input structure has gaps due to missing parts), simply provide the 777 sequence of resids as they appear in the topology in `ref_resids` or 778 `target_resids`, e.g. :: 779 780 target_resids = [a.resid for a in trj.select_atoms('name CA')] 781 782 (This translation table *is* combined with any value for 783 `ref_offset` or `target_offset`!) 784 785 Parameters 786 ---------- 787 fastafilename : str, path to filename 788 FASTA file with first sequence as reference and 789 second the one to be aligned (ORDER IS IMPORTANT!) 790 is_aligned : bool (optional) 791 ``False`` (default) 792 run clustalw for sequence alignment; 793 ``True`` 794 use the alignment in the file (e.g. from STAMP) [``False``] 795 ref_offset : int (optional) 796 add this number to the column number in the FASTA file 797 to get the original residue number, default: 0 798 target_offset : int (optional) 799 add this number to the column number in the FASTA file 800 to get the original residue number, default: 0 801 ref_resids : str (optional) 802 sequence of resids as they appear in the reference structure 803 target_resids : str (optional) 804 sequence of resids as they appear in the target 805 alnfilename : str (optional) 806 filename of ClustalW alignment (clustal format) that is 807 produced by *clustalw* when *is_aligned* = ``False``. 808 default ``None`` uses the name and path of *fastafilename* and 809 subsititutes the suffix with '.aln'. 810 treefilename: str (optional) 811 filename of ClustalW guide tree (Newick format); 812 if default ``None`` the the filename is generated from *alnfilename* 813 with the suffix '.dnd' instead of '.aln' 814 clustalw : str (optional) 815 path to the ClustalW (or ClustalW2) binary; only 816 needed for `is_aligned` = ``False``, default: "ClustalW2" 817 818 Returns 819 ------- 820 select_dict : dict 821 dictionary with 'reference' and 'mobile' selection string 822 that can be used immediately in :class:`AlignTraj` as 823 ``select=select_dict``. 824 825 826 See Also 827 -------- 828 :func:`sequence_alignment`, which does not require external 829 programs. 830 831 832 .. _ClustalW: http://www.clustal.org/ 833 .. _STAMP: http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/ 834 835 """ 836 protein_gapped = Bio.Alphabet.Gapped(Bio.Alphabet.IUPAC.protein) 837 if is_aligned: 838 logger.info("Using provided alignment {}".format(fastafilename)) 839 with open(fastafilename) as fasta: 840 alignment = Bio.AlignIO.read( 841 fasta, "fasta", alphabet=protein_gapped) 842 else: 843 if alnfilename is None: 844 filepath, ext = os.path.splitext(fastafilename) 845 alnfilename = filepath + '.aln' 846 if treefilename is None: 847 filepath, ext = os.path.splitext(alnfilename) 848 treefilename = filepath + '.dnd' 849 run_clustalw = Bio.Align.Applications.ClustalwCommandline( 850 clustalw, 851 infile=fastafilename, 852 type="protein", 853 align=True, 854 outfile=alnfilename, 855 newtree=treefilename) 856 logger.debug( 857 "Aligning sequences in %(fastafilename)r with %(clustalw)r.", 858 vars()) 859 logger.debug("ClustalW commandline: %r", str(run_clustalw)) 860 try: 861 stdout, stderr = run_clustalw() 862 except: 863 logger.exception("ClustalW %(clustalw)r failed", vars()) 864 logger.info( 865 "(You can get clustalw2 from http://www.clustal.org/clustal2/)") 866 raise 867 with open(alnfilename) as aln: 868 alignment = Bio.AlignIO.read( 869 aln, "clustal", alphabet=protein_gapped) 870 logger.info( 871 "Using clustalw sequence alignment {0!r}".format(alnfilename)) 872 logger.info( 873 "ClustalW Newick guide tree was also produced: {0!r}".format(treefilename)) 874 875 nseq = len(alignment) 876 if nseq != 2: 877 raise ValueError( 878 "Only two sequences in the alignment can be processed.") 879 880 # implict assertion that we only have two sequences in the alignment 881 orig_resids = [ref_resids, target_resids] 882 offsets = [ref_offset, target_offset] 883 for iseq, a in enumerate(alignment): 884 # need iseq index to change orig_resids 885 if orig_resids[iseq] is None: 886 # build default: assume consecutive numbering of all 887 # residues in the alignment 888 GAP = a.seq.alphabet.gap_char 889 length = len(a.seq) - a.seq.count(GAP) 890 orig_resids[iseq] = np.arange(1, length + 1) 891 else: 892 orig_resids[iseq] = np.asarray(orig_resids[iseq]) 893 # add offsets to the sequence <--> resid translation table 894 seq2resids = [resids + offset for resids, offset in zip( 895 orig_resids, offsets)] 896 del orig_resids 897 del offsets 898 899 def resid_factory(alignment, seq2resids): 900 """Return a function that gives the resid for a position ipos in 901 the nseq'th alignment. 902 903 resid = resid_factory(alignment,seq2resids) 904 r = resid(nseq,ipos) 905 906 It is based on a look up table that translates position in the 907 alignment to the residue number in the original 908 sequence/structure. 909 910 The first index of resid() is the alignmment number, the 911 second the position in the alignment. 912 913 seq2resids translates the residues in the sequence to resid 914 numbers in the psf. In the simplest case this is a linear map 915 but if whole parts such as loops are ommitted from the protein 916 the seq2resids may have big gaps. 917 918 Format: a tuple of two numpy arrays; the first array is for 919 the reference, the second for the target, The index in each 920 array gives the consecutive number of the amino acid in the 921 sequence, the value the resid in the structure/psf. 922 923 Note: assumes that alignments have same length and are padded if 924 necessary. 925 """ 926 # could maybe use Bio.PDB.StructureAlignment instead? 927 nseq = len(alignment) 928 t = np.zeros((nseq, alignment.get_alignment_length()), dtype=int) 929 for iseq, a in enumerate(alignment): 930 GAP = a.seq.alphabet.gap_char 931 t[iseq, :] = seq2resids[iseq][np.cumsum(np.where( 932 np.array(list(a.seq)) == GAP, 0, 1)) - 1] 933 # -1 because seq2resid is index-1 based (resids start at 1) 934 935 def resid(nseq, ipos, t=t): 936 return t[nseq, ipos] 937 938 return resid 939 940 resid = resid_factory(alignment, seq2resids) 941 942 res_list = [] # collect individual selection string 943 # could collect just resid and type (with/without CB) and 944 # then post-process and use ranges for continuous stretches, eg 945 # ( resid 1:35 and ( backbone or name CB ) ) or ( resid 36 and backbone ) 946 947 # should be the same for both seqs 948 GAP = alignment[0].seq.alphabet.gap_char 949 if GAP != alignment[1].seq.alphabet.gap_char: 950 raise ValueError( 951 "Different gap characters in sequence 'target' and 'mobile'.") 952 for ipos in range(alignment.get_alignment_length()): 953 aligned = list(alignment[:, ipos]) 954 if GAP in aligned: 955 continue # skip residue 956 template = "resid %i" 957 if 'G' not in aligned: 958 # can use CB 959 template += " and ( backbone or name CB )" 960 else: 961 template += " and backbone" 962 template = "( " + template + " )" 963 964 res_list.append([template % resid(iseq, ipos) for iseq in range(nseq)]) 965 966 sel = np.array(res_list).transpose() 967 968 ref_selection = " or ".join(sel[0]) 969 target_selection = " or ".join(sel[1]) 970 return {'reference': ref_selection, 'mobile': target_selection} 971 972 973def get_matching_atoms(ag1, ag2, tol_mass=0.1, strict=False): 974 """Return two atom groups with one-to-one matched atoms. 975 976 The function takes two :class:`~MDAnalysis.core.groups.AtomGroup` 977 instances `ag1` and `ag2` and returns two atom groups `g1` and `g2` that 978 consist of atoms so that the mass of atom ``g1[0]`` is the same as the mass 979 of atom ``g2[0]``, ``g1[1]`` and ``g2[1]`` etc. 980 981 The current implementation is very simplistic and works on a per-residue basis: 982 983 1. The two groups must contain the same number of residues. 984 2. Any residues in each group that have differing number of atoms are discarded. 985 3. The masses of corresponding atoms are compared. and if any masses differ 986 by more than `tol_mass` the test is considered failed and a 987 :exc:`SelectionError` is raised. 988 989 The log file (see :func:`MDAnalysis.start_logging`) will contain detailed 990 information about mismatches. 991 992 Parameters 993 ---------- 994 ag1 : AtomGroup 995 First :class:`~MDAnalysis.core.groups.AtomGroup` instance that is 996 compared 997 ag2 : AtomGroup 998 Second :class:`~MDAnalysis.core.groups.AtomGroup` instance that is 999 compared 1000 tol_mass : float (optional) 1001 Reject if the atomic masses for matched atoms differ by more than 1002 `tol_mass` [0.1] 1003 strict : bool (optional) 1004 ``True`` 1005 Will raise :exc:`SelectionError` if a single atom does not 1006 match between the two selections. 1007 ``False`` [default] 1008 Will try to prepare a matching selection by dropping 1009 residues with non-matching atoms. See :func:`get_matching_atoms` 1010 for details. 1011 1012 Returns 1013 ------- 1014 (g1, g2) : tuple 1015 Tuple with :class:`~MDAnalysis.core.groups.AtomGroup` 1016 instances that match, atom by atom. The groups are either the 1017 original groups if all matched or slices of the original 1018 groups. 1019 1020 Raises 1021 ------ 1022 :exc:`SelectionError` 1023 Error raised if the number of residues does not match or if in the final 1024 matching masses differ by more than *tol*. 1025 1026 Notes 1027 ----- 1028 The algorithm could be improved by using e.g. the Needleman-Wunsch 1029 algorithm in :mod:`Bio.profile2` to align atoms in each residue (doing a 1030 global alignment is too expensive). 1031 1032 .. versionadded:: 0.8 1033 1034 .. versionchanged:: 0.10.0 1035 Renamed from :func:`check_same_atoms` to 1036 :func:`get_matching_atoms` and now returns matching atomgroups 1037 (possibly with residues removed) 1038 1039 """ 1040 1041 if ag1.n_atoms != ag2.n_atoms: 1042 if ag1.n_residues != ag2.n_residues: 1043 errmsg = ("Reference and trajectory atom selections do not contain " 1044 "the same number of atoms: \n" 1045 "atoms: N_ref={0}, N_traj={1}\n" 1046 "and also not the same number of residues:\n" 1047 "residues: N_ref={2}, N_traj={3}").format( 1048 ag1.n_atoms, ag2.n_atoms, 1049 ag1.n_residues, ag2.n_residues) 1050 logger.error(errmsg) 1051 raise SelectionError(errmsg) 1052 else: 1053 msg = ("Reference and trajectory atom selections do not contain " 1054 "the same number of atoms: \n" 1055 "atoms: N_ref={0}, N_traj={1}").format( 1056 ag1.n_atoms, ag2.n_atoms) 1057 if strict: 1058 logger.error(msg) 1059 raise SelectionError(msg) 1060 1061 # continue with trying to create a valid selection 1062 msg += ("\nbut we attempt to create a valid selection " + 1063 "(use strict=True to disable this heuristic).") 1064 logger.info(msg) 1065 warnings.warn(msg, category=SelectionWarning) 1066 1067 # continue with trying to salvage the selection: 1068 # - number of atoms is different 1069 # - number of residues is the same 1070 # We will remove residues with mismatching number of atoms (e.g. not resolved 1071 # in an X-ray structure) 1072 assert ag1.n_residues == ag2.n_residues 1073 1074 # Alternatively, we could align all atoms but Needleman-Wunsch 1075 # pairwise2 consumes too much memory for thousands of characters in 1076 # each sequence. Perhaps a solution would be pairwise alignment per residue. 1077 # 1078 # aln_elem = Bio.pairwise2.align.globalms("".join([MDAnalysis.topology. 1079 # core.guess_atom_element(n) for n in gref.atoms.names]), 1080 # "".join([MDAnalysis.topology.core.guess_atom_element(n) 1081 # for n in models[0].atoms.names]), 1082 # 2, -1, -1, -0.1, 1083 # one_alignment_only=True) 1084 1085 # For now, just remove the residues that don't have matching numbers 1086 # NOTE: This can create empty selections, e.g., when comparing a structure 1087 # with hydrogens to a PDB structure without hydrogens. 1088 rsize1 = np.array([r.atoms.n_atoms for r in ag1.residues]) 1089 rsize2 = np.array([r.atoms.n_atoms for r in ag2.residues]) 1090 rsize_mismatches = np.absolute(rsize1 - rsize2) 1091 mismatch_mask = (rsize_mismatches > 0) 1092 if np.any(mismatch_mask): 1093 if strict: 1094 # diagnostics 1095 mismatch_resindex = np.arange(ag1.n_residues)[mismatch_mask] 1096 1097 def log_mismatch( 1098 number, 1099 ag, 1100 rsize, 1101 mismatch_resindex=mismatch_resindex): 1102 logger.error("Offending residues: group {0}: {1}".format( 1103 number, 1104 ", ".join(["{0[0]}{0[1]} ({0[2]})".format(r) for r in 1105 zip(ag.resnames[mismatch_resindex], 1106 ag.resids[mismatch_resindex], 1107 rsize[mismatch_resindex] 1108 )]))) 1109 logger.error("Found {0} residues with non-matching numbers of atoms (#)".format( 1110 mismatch_mask.sum())) 1111 log_mismatch(1, ag1, rsize1) 1112 log_mismatch(2, ag2, rsize2) 1113 1114 errmsg = ("Different number of atoms in some residues. " 1115 "(Use strict=False to attempt using matching atoms only.)") 1116 logger.error(errmsg) 1117 raise SelectionError(errmsg) 1118 1119 def get_atoms_byres(g, match_mask=np.logical_not(mismatch_mask)): 1120 # not pretty... but need to do things on a per-atom basis in 1121 # order to preserve original selection 1122 ag = g.atoms 1123 good = ag.residues.resids[match_mask] # resid for each residue 1124 resids = ag.resids # resid for each atom 1125 # boolean array for all matching atoms 1126 ix_good = np.in1d(resids, good) 1127 return ag[ix_good] 1128 1129 _ag1 = get_atoms_byres(ag1) 1130 _ag2 = get_atoms_byres(ag2) 1131 1132 assert _ag1.atoms.n_atoms == _ag2.atoms.n_atoms 1133 1134 # diagnostics 1135 # (ugly workaround for missing boolean indexing of AtomGroup) 1136 # note: ag[arange(len(ag))[boolean]] is ~2x faster than 1137 # ag[where[boolean]] 1138 mismatch_resindex = np.arange(ag1.n_residues)[mismatch_mask] 1139 logger.warning("Removed {0} residues with non-matching numbers of atoms" 1140 .format(mismatch_mask.sum())) 1141 logger.debug("Removed residue ids: group 1: {0}" 1142 .format(ag1.residues.resids[mismatch_resindex])) 1143 logger.debug("Removed residue ids: group 2: {0}" 1144 .format(ag2.residues.resids[mismatch_resindex])) 1145 # replace after logging (still need old ag1 and ag2 for 1146 # diagnostics) 1147 ag1 = _ag1 1148 ag2 = _ag2 1149 del _ag1, _ag2 1150 1151 # stop if we created empty selections (by removing ALL residues...) 1152 if ag1.n_atoms == 0 or ag2.n_atoms == 0: 1153 errmsg = ("Failed to automatically find matching atoms: created empty selections. " + 1154 "Try to improve your selections for mobile and reference.") 1155 logger.error(errmsg) 1156 raise SelectionError(errmsg) 1157 1158 # check again because the residue matching heuristic is not very 1159 # good and can easily be misled (e.g., when one of the selections 1160 # had fewer atoms but the residues in mobile and reference have 1161 # each the same number) 1162 try: 1163 mass_mismatches = (np.absolute(ag1.masses - ag2.masses) > tol_mass) 1164 except ValueError: 1165 errmsg = ("Failed to find matching atoms: len(reference) = {}, len(mobile) = {} " + 1166 "Try to improve your selections for mobile and reference.").format( 1167 ag1.n_atoms, ag2.n_atoms) 1168 logger.error(errmsg) 1169 raise SelectionError(errmsg) 1170 1171 if np.any(mass_mismatches): 1172 # Test 2 failed. 1173 # diagnostic output: 1174 logger.error("Atoms: reference | trajectory") 1175 for ar, at in zip(ag1[mass_mismatches], ag2[mass_mismatches]): 1176 logger.error( 1177 "{0!s:>4} {1:3d} {2!s:>3} {3!s:>3} {4:6.3f} | {5!s:>4} {6:3d} {7!s:>3} {8!s:>3} {9:6.3f}".format( 1178 ar.segid, 1179 ar.resid, 1180 ar.resname, 1181 ar.name, 1182 ar.mass, 1183 at.segid, 1184 at.resid, 1185 at.resname, 1186 at.name, 1187 at.mass)) 1188 errmsg = ("Inconsistent selections, masses differ by more than {0}; " 1189 "mis-matching atoms are shown above.").format(tol_mass) 1190 logger.error(errmsg) 1191 raise SelectionError(errmsg) 1192 1193 return ag1, ag2 1194