1# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*- 2# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 3# 4# MDAnalysis --- https://www.mdanalysis.org 5# Copyright (c) 2006-2017 The MDAnalysis Development Team and contributors 6# (see the file AUTHORS for the full list of names) 7# 8# Released under the GNU Public Licence, v2 or any higher version 9# 10# Please cite your use of MDAnalysis in published work: 11# 12# R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, 13# D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein. 14# MDAnalysis: A Python package for the rapid analysis of molecular dynamics 15# simulations. In S. Benthall and S. Rostrup editors, Proceedings of the 15th 16# Python in Science Conference, pages 102-109, Austin, TX, 2016. SciPy. 17# 18# N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. 19# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. 20# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787 21# 22 23r""" 24Calculating path similarity --- :mod:`MDAnalysis.analysis.psa` 25========================================================================== 26 27:Author: Sean Seyler 28:Year: 2015 29:Copyright: GNU Public License v3 30 31.. versionadded:: 0.10.0 32 33The module contains code to calculate the geometric similarity of trajectories 34using path metrics such as the Hausdorff or Fréchet distances 35[Seyler2015]_. The path metrics are functions of two paths and return a 36nonnegative number, i.e., a distance. Two paths are identical if their distance 37is zero, and large distances indicate dissimilarity. Each path metric is a 38function of the individual points (e.g., coordinate snapshots) that comprise 39each path and, loosely speaking, identify the two points, one per path of a 40pair of paths, where the paths deviate the most. The distance between these 41points of maximal deviation is measured by the root mean square deviation 42(RMSD), i.e., to compute structural similarity. 43 44One typically computes the pairwise similarity for an ensemble of paths to 45produce a symmetric distance matrix, which can be clustered to, at a glance, 46identify patterns in the trajectory data. To properly analyze a path ensemble, 47one must select a suitable reference structure to which all paths (each 48conformer in each path) will be universally aligned using the rotations 49determined by the best-fit rmsds. Distances between paths and their structures 50are then computed directly with no further alignment. This pre-processing step 51is necessary to preserve the metric properties of the Hausdorff and Fréchet 52metrics; using the best-fit rmsd on a pairwise basis does not generally 53preserve the triangle inequality. 54 55Note 56---- 57The `PSAnalysisTutorial`_ outlines a typical application of PSA to 58a set of trajectories, including doing proper alignment, 59performing distance comparisons, and generating heat 60map-dendrogram plots from hierarchical clustering. 61 62 63.. Rubric:: References 64 65 66.. [Seyler2015] Seyler SL, Kumar A, Thorpe MF, Beckstein O (2015) 67 Path Similarity Analysis: A Method for Quantifying 68 Macromolecular Pathways. PLoS Comput Biol 11(10): e1004568. 69 doi: `10.1371/journal.pcbi.1004568`_ 70 71.. _`10.1371/journal.pcbi.1004568`: http://dx.doi.org/10.1371/journal.pcbi.1004568 72.. _`PSAnalysisTutorial`: https://github.com/Becksteinlab/PSAnalysisTutorial 73 74 75Helper functions and variables 76------------------------------ 77The following convenience functions are used by other functions in this module. 78 79.. autofunction:: sqnorm 80.. autofunction:: get_msd_matrix 81.. autofunction:: get_coord_axes 82 83 84Classes, methods, and functions 85------------------------------- 86 87.. autofunction:: get_path_metric_func 88.. autofunction:: hausdorff 89.. autofunction:: hausdorff_wavg 90.. autofunction:: hausdorff_avg 91.. autofunction:: hausdorff_neighbors 92.. autofunction:: discrete_frechet 93.. autofunction:: dist_mat_to_vec 94 95.. autoclass:: Path 96 :members: 97 98 .. attribute:: u_original 99 100 :class:`MDAnalysis.Universe` object with a trajectory 101 102 .. attribute:: u_reference 103 104 :class:`MDAnalysis.Universe` object containing a reference structure 105 106 .. attribute:: ref_select 107 108 string, selection for 109 :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select frame 110 from :attr:`Path.u_reference` 111 112 .. attribute:: path_select 113 114 string, selection for 115 :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select atoms 116 to compose :attr:`Path.path` 117 118 .. attribute:: ref_frame 119 120 int, frame index to select frame from :attr:`Path.u_reference` 121 122 .. attribute:: u_fitted 123 124 :class:`MDAnalysis.Universe` object with the fitted trajectory 125 126 .. attribute:: path 127 128 :class:`numpy.ndarray` object representation of the fitted trajectory 129 130.. autoclass:: PSAPair 131 132 .. attribute:: npaths 133 134 int, total number of paths in the comparison in which *this* 135 :class:`PSAPair` was generated 136 137 .. attribute:: matrix_id 138 139 (int, int), (row, column) indices of the location of *this* 140 :class:`PSAPair` in the corresponding pairwise distance matrix 141 142 .. attribute:: pair_id 143 144 int, ID of *this* :class:`PSAPair` (the pair_id:math:`^\text{th}` 145 comparison) in the distance vector corresponding to the pairwise distance 146 matrix 147 148 .. attribute:: nearest_neighbors 149 150 dict, contains the nearest neighbors by frame index and the 151 nearest neighbor distances for each path in *this* :class:`PSAPair` 152 153 .. attribute:: hausdorff_pair 154 155 dict, contains the frame indices of the Hausdorff pair for each path in 156 *this* :class:`PSAPair` and the corresponding (Hausdorff) distance 157 158.. autoclass:: PSAnalysis 159 :members: 160 161 .. attribute:: universes 162 163 list of :class:`MDAnalysis.Universe` objects containing trajectories 164 165 .. attribute:: u_reference 166 167 :class:`MDAnalysis.Universe` object containing a reference structure 168 169 .. attribute:: ref_select 170 171 string, selection for 172 :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select frame 173 from :attr:`PSAnalysis.u_reference` 174 175 .. attribute:: path_select 176 177 string, selection for 178 :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select atoms 179 to compose :attr:`Path.path` 180 181 .. attribute:: ref_frame 182 183 int, frame index to select frame from :attr:`Path.u_reference` 184 185 .. attribute:: filename 186 187 string, name of file to store calculated distance matrix 188 (:attr:`PSAnalysis.D`) 189 190 .. attribute:: paths 191 192 list of :class:`numpy.ndarray` objects representing the set/ensemble of 193 fitted trajectories 194 195 .. attribute:: D 196 197 string, name of file to store calculated distance matrix 198 (:attr:`PSAnalysis.D`) 199 200 201.. Markup definitions 202.. ------------------ 203.. 204.. |3Dp| replace:: :math:`N_p \times N \times 3` 205.. |2Dp| replace:: :math:`N_p \times (3N)` 206.. |3Dq| replace:: :math:`N_q \times N \times 3` 207.. |2Dq| replace:: :math:`N_q \times (3N)` 208.. |3D| replace:: :math:`N_p\times N\times 3` 209.. |2D| replace:: :math:`N_p\times 3N` 210.. |Np| replace:: :math:`N_p` 211 212""" 213from __future__ import division, absolute_import, print_function 214 215import six 216from six.moves import range, cPickle 217from six import string_types 218 219import os 220import warnings 221import numbers 222 223import numpy as np 224from scipy import spatial, cluster 225from scipy.spatial.distance import directed_hausdorff 226import matplotlib 227 228import MDAnalysis 229import MDAnalysis.analysis.align 230from MDAnalysis import NoDataError 231from MDAnalysis.lib.util import deprecate 232 233import logging 234logger = logging.getLogger('MDAnalysis.analysis.psa') 235 236 237from ..due import due, Doi 238 239due.cite(Doi("10.1371/journal.pcbi.1004568"), 240 description="Path Similarity Analysis algorithm and implementation", 241 path="MDAnalysis.analysis.psa", 242 cite_module=True) 243del Doi 244 245def get_path_metric_func(name): 246 """Selects a path metric function by name. 247 248 Parameters 249 ---------- 250 name : str 251 name of path metric 252 253 Returns 254 ------- 255 path_metric : function 256 The path metric function specified by *name* (if found). 257 """ 258 path_metrics = { 259 'hausdorff' : hausdorff, 260 'weighted_average_hausdorff' : hausdorff_wavg, 261 'average_hausdorff' : hausdorff_avg, 262 'hausdorff_neighbors' : hausdorff_neighbors, 263 'discrete_frechet' : discrete_frechet 264 } 265 try: 266 return path_metrics[name] 267 except KeyError as key: 268 raise KeyError('Path metric "{}" not found. Valid selections: {}' 269 ''.format(key, " ".join('"{}"'.format(n) 270 for n in path_metrics.keys()))) 271 272 273def sqnorm(v, axis=None): 274 """Compute the sum of squares of elements along specified axes. 275 276 Parameters 277 ---------- 278 v : numpy.ndarray 279 coordinates 280 axes : None / int / tuple (optional) 281 Axes or axes along which a sum is performed. The default 282 (*axes* = ``None``) performs a sum over all the dimensions of 283 the input array. The value of *axes* may be negative, in 284 which case it counts from the last axis to the zeroth axis. 285 286 Returns 287 ------- 288 float 289 the sum of the squares of the elements of `v` along `axes` 290 291 """ 292 return np.sum(v*v, axis=axis) 293 294 295def get_msd_matrix(P, Q, axis=None): 296 r"""Generate the matrix of pairwise mean-squared deviations between paths. 297 298 The MSDs between all pairs of points in `P` and `Q` are 299 calculated, each pair having a point from `P` and a point from 300 `Q`. 301 302 `P` (`Q`) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time 303 steps, :math:`N` atoms, and :math:`3N` coordinates (e.g., 304 :attr:`MDAnalysis.core.groups.AtomGroup.positions`). The pairwise MSD 305 matrix has dimensions :math:`N_p` by :math:`N_q`. 306 307 Parameters 308 ---------- 309 P : numpy.ndarray 310 the points in the first path 311 Q : numpy.ndarray 312 the points in the second path 313 314 Returns 315 ------- 316 msd_matrix : numpy.ndarray 317 matrix of pairwise MSDs between points in `P` and points 318 in `Q` 319 320 Notes 321 ----- 322 We calculate the MSD matrix 323 324 .. math:: 325 M_{ij} = ||p_i - q_j||^2 326 327 where :math:`p_i \in P` and :math:`q_j \in Q`. 328 """ 329 return np.asarray([sqnorm(p - Q, axis=axis) for p in P]) 330 331 332def reshaper(path, axis): 333 """Flatten path when appropriate to facilitate calculations 334 requiring two dimensional input. 335 """ 336 if len(axis) > 1: 337 path = path.reshape(len(path), -1) 338 return path 339 340def get_coord_axes(path): 341 """Return the number of atoms and the axes corresponding to atoms 342 and coordinates for a given path. 343 344 The `path` is assumed to be a :class:`numpy.ndarray` where the 0th axis 345 corresponds to a frame (a snapshot of coordinates). The :math:`3N` 346 (Cartesian) coordinates are assumed to be either: 347 348 1. all in the 1st axis, starting with the x,y,z coordinates of the 349 first atom, followed by the *x*,*y*,*z* coordinates of the 2nd, etc. 350 2. in the 1st *and* 2nd axis, where the 1st axis indexes the atom 351 number and the 2nd axis contains the *x*,*y*,*z* coordinates of 352 each atom. 353 354 Parameters 355 ---------- 356 path : numpy.ndarray 357 representing a path 358 359 Returns 360 ------- 361 (int, (int, ...)) 362 the number of atoms and the axes containing coordinates 363 364 """ 365 path_dimensions = len(path.shape) 366 if path_dimensions == 3: 367 N = path.shape[1] 368 axis = (1,2) # 1st axis: atoms, 2nd axis: x,y,z coords 369 elif path_dimensions == 2: 370 # can use mod to check if total # coords divisible by 3 371 N = path.shape[1] / 3 372 axis = (1,) # 1st axis: 3N structural coords (x1,y1,z1,...,xN,xN,zN) 373 else: 374 raise ValueError("Path must have 2 or 3 dimensions; the first " 375 "dimensions (axis 0) must correspond to frames, " 376 "axis 1 (and axis 2, if present) must contain atomic " 377 "coordinates.") 378 return N, axis 379 380 381def hausdorff(P, Q): 382 r"""Calculate the symmetric Hausdorff distance between two paths. 383 384 The metric used is RMSD, as opposed to the more conventional L2 385 (Euclidean) norm, because this is convenient for i.e., comparing 386 protein configurations. 387 388 *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time 389 steps, :math:`N` atoms, and :math:`3N` coordinates (e.g., 390 :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has 391 either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form. 392 393 Note that reversing the path does not change the Hausdorff distance. 394 395 Parameters 396 ---------- 397 P : numpy.ndarray 398 the points in the first path 399 Q : numpy.ndarray 400 the points in the second path 401 402 Returns 403 ------- 404 float 405 the Hausdorff distance between paths `P` and `Q` 406 407 Example 408 ------- 409 Calculate the Hausdorff distance between two halves of a trajectory: 410 411 >>> from MDAnalysis.tests.datafiles import PSF, DCD 412 >>> u = Universe(PSF,DCD) 413 >>> mid = len(u.trajectory)/2 414 >>> ca = u.select_atoms('name CA') 415 >>> P = numpy.array([ 416 ... ca.positions for _ in u.trajectory[:mid:] 417 ... ]) # first half of trajectory 418 >>> Q = numpy.array([ 419 ... ca.positions for _ in u.trajectory[mid::] 420 ... ]) # second half of trajectory 421 >>> hausdorff(P,Q) 422 4.7786639840135905 423 >>> hausdorff(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory 424 4.7786639840135905 425 426 427 Notes 428 ----- 429 :func:`scipy.spatial.distance.directed_hausdorff` is an optimized 430 implementation of the early break algorithm of [Taha2015]_; the 431 latter code is used here to calculate the symmetric Hausdorff 432 distance with an RMSD metric 433 434 References 435 ---------- 436 .. [Taha2015] A. A. Taha and A. Hanbury. An efficient algorithm for 437 calculating the exact Hausdorff distance. IEEE Transactions On Pattern 438 Analysis And Machine Intelligence, 37:2153-63, 2015. 439 440 """ 441 N_p, axis_p = get_coord_axes(P) 442 N_q, axis_q = get_coord_axes(Q) 443 444 if N_p != N_q: 445 raise ValueError("P and Q must have matching sizes") 446 447 P = reshaper(P, axis_p) 448 Q = reshaper(Q, axis_q) 449 450 return max(directed_hausdorff(P, Q)[0], 451 directed_hausdorff(Q, P)[0]) / np.sqrt(N_p) 452 453 454def hausdorff_wavg(P, Q): 455 r"""Calculate the weighted average Hausdorff distance between two paths. 456 457 *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time 458 steps, :math:`N` atoms, and :math:`3N` coordinates (e.g., 459 :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has 460 either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form. The nearest 461 neighbor distances for *P* (to *Q*) and those of *Q* (to *P*) are averaged 462 individually to get the average nearest neighbor distance for *P* and 463 likewise for *Q*. These averages are then summed and divided by 2 to get a 464 measure that gives equal weight to *P* and *Q*. 465 466 Parameters 467 ---------- 468 P : numpy.ndarray 469 the points in the first path 470 Q : numpy.ndarray 471 the points in the second path 472 473 Returns 474 ------- 475 float 476 the weighted average Hausdorff distance between paths `P` and `Q` 477 478 Example 479 ------- 480 481 >>> from MDAnalysis import Universe 482 >>> from MDAnalysis.tests.datafiles import PSF, DCD 483 >>> u = Universe(PSF,DCD) 484 >>> mid = len(u.trajectory)/2 485 >>> ca = u.select_atoms('name CA') 486 >>> P = numpy.array([ 487 ... ca.positions for _ in u.trajectory[:mid:] 488 ... ]) # first half of trajectory 489 >>> Q = numpy.array([ 490 ... ca.positions for _ in u.trajectory[mid::] 491 ... ]) # second half of trajectory 492 >>> hausdorff_wavg(P,Q) 493 2.5669644353703447 494 >>> hausdorff_wavg(P,Q[::-1]) # weighted avg hausdorff dist w/ Q reversed 495 2.5669644353703447 496 497 Notes 498 ----- 499 The weighted average Hausdorff distance is not a true metric (it does not 500 obey the triangle inequality); see [Seyler2015]_ for further details. 501 502 503 """ 504 N, axis = get_coord_axes(P) 505 d = get_msd_matrix(P, Q, axis=axis) 506 out = 0.5*( np.mean(np.amin(d,axis=0)) + np.mean(np.amin(d,axis=1)) ) 507 return ( out / N )**0.5 508 509 510def hausdorff_avg(P, Q): 511 r"""Calculate the average Hausdorff distance between two paths. 512 513 *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time 514 steps, :math:`N` atoms, and :math:`3N` coordinates (e.g., 515 :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has 516 either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form. The nearest 517 neighbor distances for *P* (to *Q*) and those of *Q* (to *P*) are all 518 averaged together to get a mean nearest neighbor distance. This measure 519 biases the average toward the path that has more snapshots, whereas weighted 520 average Hausdorff gives equal weight to both paths. 521 522 Parameters 523 ---------- 524 P : numpy.ndarray 525 the points in the first path 526 Q : numpy.ndarray 527 the points in the second path 528 529 Returns 530 ------- 531 float 532 the average Hausdorff distance between paths `P` and `Q` 533 534 Example 535 ------- 536 537 >>> from MDAnalysis.tests.datafiles import PSF, DCD 538 >>> u = Universe(PSF,DCD) 539 >>> mid = len(u.trajectory)/2 540 >>> ca = u.select_atoms('name CA') 541 >>> P = numpy.array([ 542 ... ca.positions for _ in u.trajectory[:mid:] 543 ... ]) # first half of trajectory 544 >>> Q = numpy.array([ 545 ... ca.positions for _ in u.trajectory[mid::] 546 ... ]) # second half of trajectory 547 >>> hausdorff_avg(P,Q) 548 2.5669646575869005 549 >>> hausdorff_avg(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory 550 2.5669646575869005 551 552 553 Notes 554 ----- 555 The average Hausdorff distance is not a true metric (it does not obey the 556 triangle inequality); see [Seyler2015]_ for further details. 557 558 """ 559 N, axis = get_coord_axes(P) 560 d = get_msd_matrix(P, Q, axis=axis) 561 out = np.mean( np.append( np.amin(d,axis=0), np.amin(d,axis=1) ) ) 562 return ( out / N )**0.5 563 564 565def hausdorff_neighbors(P, Q): 566 r"""Find the Hausdorff neighbors of two paths. 567 568 *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time 569 steps, :math:`N` atoms, and :math:`3N` coordinates (e.g., 570 :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has 571 either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form. 572 573 Parameters 574 ---------- 575 P : numpy.ndarray 576 the points in the first path 577 Q : numpy.ndarray 578 the points in the second path 579 580 Returns 581 ------- 582 dict 583 dictionary of two pairs of numpy arrays, the first pair (key 584 "frames") containing the indices of (Hausdorff) nearest 585 neighbors for `P` and `Q`, respectively, the second (key 586 "distances") containing (corresponding) nearest neighbor 587 distances for `P` and `Q`, respectively 588 589 Notes 590 ----- 591 - Hausdorff neighbors are those points on the two paths that are separated by 592 the Hausdorff distance. They are the farthest nearest neighbors and are 593 maximally different in the sense of the Hausdorff distance [Seyler2015]_. 594 - :func:`scipy.spatial.distance.directed_hausdorff` can also provide the 595 hausdorff neighbors. 596 597 """ 598 N, axis = get_coord_axes(P) 599 d = get_msd_matrix(P, Q, axis=axis) 600 nearest_neighbors = { 601 'frames' : (np.argmin(d, axis=1), np.argmin(d, axis=0)), 602 'distances' : ((np.amin(d,axis=1)/N)**0.5, (np.amin(d, axis=0)/N)**0.5) 603 } 604 return nearest_neighbors 605 606 607def discrete_frechet(P, Q): 608 r"""Calculate the discrete Fréchet distance between two paths. 609 610 *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time 611 steps, :math:`N` atoms, and :math:`3N` coordinates (e.g., 612 :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has 613 either shape |3Dp| (|3Dq|), or :|2Dp| (|2Dq|) in flattened form. 614 615 Parameters 616 ---------- 617 P : numpy.ndarray 618 the points in the first path 619 Q : numpy.ndarray 620 the points in the second path 621 622 Returns 623 ------- 624 float 625 the discrete Fréchet distance between paths *P* and *Q* 626 627 Example 628 ------- 629 Calculate the discrete Fréchet distance between two halves of a 630 trajectory. 631 632 >>> u = Universe(PSF,DCD) 633 >>> mid = len(u.trajectory)/2 634 >>> ca = u.select_atoms('name CA') 635 >>> P = np.array([ 636 ... ca.positions for _ in u.trajectory[:mid:] 637 ... ]) # first half of trajectory 638 >>> Q = np.array([ 639 ... ca.positions for _ in u.trajectory[mid::] 640 ... ]) # second half of trajectory 641 >>> discrete_frechet(P,Q) 642 4.7786639840135905 643 >>> discrete_frechet(P,Q[::-1]) # frechet distance w/ 2nd trj reversed 2nd 644 6.8429011177113832 645 646 Note that reversing the direction increased the Fréchet distance: 647 it is sensitive to the direction of the path. 648 649 Notes 650 ----- 651 652 The discrete Fréchet metric is an approximation to the continuous Fréchet 653 metric [Frechet1906]_ [Alt1995]_. The calculation of the continuous 654 Fréchet distance is implemented with the dynamic programming algorithm of 655 [EiterMannila1994]_ [EiterMannila1997]_. 656 657 658 References 659 ---------- 660 661 .. [Frechet1906] M. Fréchet. Sur quelques points du calcul 662 fonctionnel. Rend. Circ. Mat. Palermo, 22(1):1–72, Dec. 1906. 663 664 .. [Alt1995] H. Alt and M. Godau. Computing the Fréchet distance between 665 two polygonal curves. Int J Comput Geometry & Applications, 666 5(01n02):75–91, 1995. doi: `10.1142/S0218195995000064`_ 667 668 .. _`10.1142/S0218195995000064`: http://doi.org/10.1142/S0218195995000064 669 670 .. [EiterMannila1994] T. Eiter and H. Mannila. Computing discrete Fréchet 671 distance. Technical Report CD-TR 94/64, Christian Doppler Laboratory for 672 Expert Systems, Technische Universität Wien, Wien, 1994. 673 674 .. [EiterMannila1997] T. Eiter and H. Mannila. Distance measures for point 675 sets and their computation. Acta Informatica, 34:109–133, 1997. doi: `10.1007/s002360050075`_. 676 677 678 .. _10.1007/s002360050075: http://doi.org/10.1007/s002360050075 679 680 """ 681 682 N, axis = get_coord_axes(P) 683 Np, Nq = len(P), len(Q) 684 d = get_msd_matrix(P, Q, axis=axis) 685 ca = -np.ones((Np, Nq)) 686 687 def c(i, j): 688 """Compute the coupling distance for two partial paths formed by *P* and 689 *Q*, where both begin at frame 0 and end (inclusive) at the respective 690 frame indices :math:`i-1` and :math:`j-1`. The partial path of *P* (*Q*) 691 up to frame *i* (*j*) is formed by the slicing ``P[0:i]`` (``Q[0:j]``). 692 693 :func:`c` is called recursively to compute the coupling distance 694 between the two full paths *P* and *Q* (i.e., the discrete Frechet 695 distance) in terms of coupling distances between their partial paths. 696 697 Parameters 698 ---------- 699 i : int 700 partial path of *P* through final frame *i-1* 701 j : int 702 partial path of *Q* through final frame *j-1* 703 704 Returns 705 ------- 706 dist : float 707 the coupling distance between partial paths `P[0:i]` and `Q[0:j]` 708 """ 709 if ca[i,j] != -1 : 710 return ca[i,j] 711 if i > 0: 712 if j > 0: 713 ca[i,j] = max( min(c(i-1,j),c(i,j-1),c(i-1,j-1)), d[i,j] ) 714 else: 715 ca[i,j] = max( c(i-1,0), d[i,0] ) 716 elif j > 0: 717 ca[i,j] = max( c(0,j-1), d[0,j] ) 718 else: 719 ca[i,j] = d[0,0] 720 return ca[i,j] 721 722 return (c(Np-1, Nq-1) / N)**0.5 723 724 725def dist_mat_to_vec(N, i, j): 726 """Convert distance matrix indices (in the upper triangle) to the index of 727 the corresponding distance vector. 728 729 This is a convenience function to locate distance matrix elements (and the 730 pair generating it) in the corresponding distance vector. The row index *j* 731 should be greater than *i+1*, corresponding to the upper triangle of the 732 distance matrix. 733 734 Parameters 735 ---------- 736 N : int 737 size of the distance matrix (of shape *N*-by-*N*) 738 i : int 739 row index (starting at 0) of the distance matrix 740 j : int 741 column index (starting at 0) of the distance matrix 742 743 Returns 744 ------- 745 int 746 index (of the matrix element) in the corresponding distance vector 747 748 """ 749 750 if not (isinstance(N, numbers.Integral) and isinstance(i, numbers.Integral) 751 and isinstance(j, numbers.Integral)): 752 raise ValueError("N, i, j all must be of type int") 753 754 if i < 0 or j < 0 or N < 2: 755 raise ValueError("Matrix indices are invalid; i and j must be greater " 756 "than 0 and N must be greater the 2") 757 758 if (j > i and (i > N - 1 or j > N)) or (j < i and (i > N or j > N - 1)): 759 raise ValueError("Matrix indices are out of range; i and j must be " 760 "less than N = {0:d}".format(N)) 761 if j > i: 762 return (N*i) + j - (i+2)*(i+1) // 2 # old-style division for int output 763 elif j < i: 764 warnings.warn("Column index entered (j = {:d} is smaller than row " 765 "index (i = {:d}). Using symmetric element in upper " 766 "triangle of distance matrix instead: i --> j, " 767 "j --> i".format(j, i)) 768 return (N*j) + i - (j+2)*(j+1) // 2 # old-style division for int output 769 else: 770 raise ValueError("Error in processing matrix indices; i and j must " 771 "be integers less than integer N = {0:d} such that" 772 " j >= i+1.".format(N)) 773 774 775class Path(object): 776 """Represent a path based on a :class:`~MDAnalysis.core.universe.Universe`. 777 778 Pre-process a :class:`Universe` object: (1) fit the trajectory to a 779 reference structure, (2) convert fitted time series to a 780 :class:`numpy.ndarray` representation of :attr:`Path.path`. 781 782 The analysis is performed with :meth:`PSAnalysis.run` and stores the result 783 in the :class:`numpy.ndarray` distance matrix :attr:`PSAnalysis.D`. 784 :meth:`PSAnalysis.run` also generates a fitted trajectory and path from 785 alignment of the original trajectories to a reference structure. 786 787 .. versionadded:: 0.9.1 788 789 """ 790 791 def __init__(self, universe, reference, ref_select='name CA', 792 path_select='all', ref_frame=0): 793 """Setting up trajectory alignment and fitted path generation. 794 795 Parameters 796 ---------- 797 universe : Universe 798 :class:`MDAnalysis.Universe` object containing a trajectory 799 reference : Universe 800 reference structure (uses `ref_frame` from the trajectory) 801 ref_select : str or dict or tuple (optional) 802 The selection to operate on for rms fitting; can be one of: 803 804 1. any valid selection string for 805 :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` that 806 produces identical selections in *mobile* and *reference*; or 807 2. a dictionary ``{'mobile':sel1, 'reference':sel2}`` (the 808 :func:`MDAnalysis.analysis.align.fasta2select` function returns 809 such a dictionary based on a ClustalW_ or STAMP_ sequence 810 alignment); or 811 3. a tuple ``(sel1, sel2)`` 812 813 When using 2. or 3. with *sel1* and *sel2* then these selections 814 can also each be a list of selection strings (to generate an 815 AtomGroup with defined atom order as described under 816 :ref:`ordered-selections-label`). 817 ref_frame : int 818 frame index to select the coordinate frame from 819 `ref_select.trajectory` 820 path_select : selection_string 821 atom selection composing coordinates of (fitted) path; if ``None`` 822 then `path_select` is set to `ref_select` [``None``] 823 824 """ 825 self.u_original = universe 826 self.u_reference = reference 827 self.ref_select = ref_select 828 self.ref_frame = ref_frame 829 self.path_select = path_select 830 831 self.top_name = self.u_original.filename 832 self.trj_name = self.u_original.trajectory.filename 833 self.newtrj_name = None 834 self.u_fitted = None 835 self.path = None 836 self.natoms = None 837 838 839 def fit_to_reference(self, filename=None, prefix='', postfix='_fit', 840 rmsdfile=None, targetdir=os.path.curdir, 841 weights=None, tol_mass=0.1): 842 """Align each trajectory frame to the reference structure 843 844 Parameters 845 ---------- 846 filename : str (optional) 847 file name for the RMS-fitted trajectory or pdb; defaults to the 848 original trajectory filename (from :attr:`Path.u_original`) with 849 `prefix` prepended 850 prefix : str (optional) 851 prefix for auto-generating the new output filename 852 rmsdfile : str (optional) 853 file name for writing the RMSD time series [``None``] 854 weights : {"mass", ``None``} or array_like (optional) 855 choose weights. With ``"mass"`` uses masses as weights; with 856 ``None`` weigh each atom equally. If a float array of the same 857 length as the selected AtomGroup is provided, use each element of 858 the `array_like` as a weight for the corresponding atom in the 859 AtomGroup. 860 tol_mass : float (optional) 861 Reject match if the atomic masses for matched atoms differ by more 862 than `tol_mass` [0.1] 863 864 Returns 865 ------- 866 Universe 867 :class:`MDAnalysis.Universe` object containing a fitted trajectory 868 869 Notes 870 ----- 871 Uses :class:`MDAnalysis.analysis.align.AlignTraj` for the fitting. 872 873 874 .. deprecated:: 0.16.1 875 Instead of ``mass_weighted=True`` use new ``weights='mass'``; 876 refactored to fit with AnalysisBase API 877 878 .. versionchanged:: 0.17.0 879 Deprecated keyword `mass_weighted` was removed. 880 """ 881 head, tail = os.path.split(self.trj_name) 882 oldname, ext = os.path.splitext(tail) 883 filename = filename or oldname 884 self.newtrj_name = os.path.join(targetdir, filename + postfix + ext) 885 self.u_reference.trajectory[self.ref_frame] # select frame from ref traj 886 aligntrj = MDAnalysis.analysis.align.AlignTraj(self.u_original, 887 self.u_reference, 888 select=self.ref_select, 889 filename=self.newtrj_name, 890 prefix=prefix, 891 weights=weights, 892 tol_mass=tol_mass).run() 893 if rmsdfile is not None: 894 aligntrj.save(rmsdfile) 895 return MDAnalysis.Universe(self.top_name, self.newtrj_name) 896 897 898 def to_path(self, fitted=False, select=None, flat=False): 899 r"""Generates a coordinate time series from the fitted universe 900 trajectory. 901 902 Given a selection of *N* atoms from *select*, the atomic positions for 903 each frame in the fitted universe (:attr:`Path.u_fitted`) trajectory 904 (with |Np| total frames) are appended sequentially to form a 3D or 2D 905 (if *flat* is ``True``) :class:`numpy.ndarray` representation of the 906 fitted trajectory (with dimensions |3D| or |2D|, respectively). 907 908 Parameters 909 ---------- 910 fitted : bool (optional) 911 construct a :attr:`Path.path` from the :attr:`Path.u_fitted` 912 trajectory; if ``False`` then :attr:`Path.path` is generated with 913 the trajectory from :attr:`Path.u_original` [``False``] 914 select : str (optional) 915 the selection for constructing the coordinates of each frame in 916 :attr:`Path.path`; if ``None`` then :attr:`Path.path_select` 917 is used, else it is overridden by *select* [``None``] 918 flat : bool (optional) 919 represent :attr:`Path.path` as a 2D (|2D|) :class:`numpy.ndarray`; 920 if ``False`` then :attr:`Path.path` is a 3D (|3D|) 921 :class:`numpy.ndarray` [``False``] 922 923 Returns 924 ------- 925 numpy.ndarray 926 representing a time series of atomic positions of an 927 :class:`MDAnalysis.core.groups.AtomGroup` selection from 928 :attr:`Path.u_fitted.trajectory` 929 930 """ 931 select = select if select is not None else self.path_select 932 if fitted: 933 if not isinstance(self.u_fitted, MDAnalysis.Universe): 934 raise TypeError("Fitted universe not found. Generate a fitted " + 935 "universe with fit_to_reference() first, or explicitly "+ 936 "set argument \"fitted\" to \"False\" to generate a " + 937 "path from the original universe.") 938 u = self.u_fitted 939 else: 940 u = self.u_original 941 frames = u.trajectory 942 atoms = u.select_atoms(select) 943 self.natoms = len(atoms) 944 frames.rewind() 945 if flat: 946 return np.array([atoms.positions.flatten() for _ in frames]) 947 else: 948 return np.array([atoms.positions for _ in frames]) 949 950 951 def run(self, align=False, filename=None, postfix='_fit', rmsdfile=None, 952 targetdir=os.path.curdir, weights=None, tol_mass=0.1, 953 flat=False): 954 r"""Generate a path from a trajectory and reference structure. 955 956 As part of the path generation, the trajectory can be superimposed 957 ("aligned") to a reference structure if specified. 958 959 This is a convenience method to generate a fitted trajectory from an 960 inputted universe (:attr:`Path.u_original`) and reference structure 961 (:attr:`Path.u_reference`). :meth:`Path.fit_to_reference` and 962 :meth:`Path.to_path` are used consecutively to generate a new universe 963 (:attr:`Path.u_fitted`) containing the fitted trajectory along with the 964 corresponding :attr:`Path.path` represented as an 965 :class:`numpy.ndarray`. The method returns a tuple of the topology name 966 and new trajectory name, which can be fed directly into an 967 :class:`MDAnalysis.Universe` object after unpacking the tuple using the 968 ``*`` operator, as in 969 ``MDAnalysis.Universe(*(top_name, newtraj_name))``. 970 971 Parameters 972 ---------- 973 align : bool (optional) 974 Align trajectory to atom selection :attr:`Path.ref_select` of 975 :attr:`Path.u_reference`. If ``True``, a universe containing an 976 aligned trajectory is produced with :meth:`Path.fit_to_reference` 977 [``False``] 978 filename : str (optional) 979 filename for the RMS-fitted trajectory or pdb; defaults to the 980 original trajectory filename (from :attr:`Path.u_original`) with 981 *prefix* prepended 982 postfix : str (optional) 983 prefix for auto-generating the new output filename 984 rmsdfile : str (optional) 985 file name for writing the RMSD time series [``None``] 986 weights : {"mass", ``None``} or array_like (optional) 987 choose weights. With ``"mass"`` uses masses as weights; with 988 ``None`` weigh each atom equally. If a float array of the same 989 length as the selected AtomGroup is provided, use each element of 990 the `array_like` as a weight for the corresponding atom in the 991 AtomGroup. 992 tol_mass : float (optional) 993 Reject match if the atomic masses for matched atoms differ by more 994 than *tol_mass* [0.1] 995 flat : bool (optional) 996 represent :attr:`Path.path` with 2D (|2D|) :class:`numpy.ndarray`; 997 if ``False`` then :attr:`Path.path` is a 3D (|3D|) 998 :class:`numpy.ndarray` [``False``] 999 1000 Returns 1001 ------- 1002 topology_trajectory : tuple 1003 A tuple of the topology name and new trajectory name. 1004 1005 1006 .. deprecated:: 0.16.1 1007 Instead of ``mass_weighted=True`` use new ``weights='mass'``; 1008 refactored to fit with AnalysisBase API 1009 1010 .. versionchanged:: 0.17.0 1011 Deprecated keyword `mass_weighted` was removed. 1012 """ 1013 if align: 1014 self.u_fitted = self.fit_to_reference( 1015 filename=filename, postfix=postfix, 1016 rmsdfile=rmsdfile, targetdir=targetdir, 1017 weights=weights, tol_mass=0.1) 1018 self.path = self.to_path(fitted=align, flat=flat) 1019 return self.top_name, self.newtrj_name 1020 1021 1022 def get_num_atoms(self): 1023 """Return the number of atoms used to construct the :class:`Path`. 1024 1025 Must run :meth:`Path.to_path` prior to calling this method. 1026 1027 Returns 1028 ------- 1029 int 1030 the number of atoms in the :class:`Path` 1031 1032 1033 """ 1034 if self.natoms is None: 1035 raise ValueError("No path data; do 'Path.to_path()' first.") 1036 return self.natoms 1037 1038 1039class PSAPair(object): 1040 """Generate nearest neighbor and Hausdorff pair information between a pair 1041 of paths from an all-pairs comparison generated by :class:`PSA`. 1042 1043 The nearest neighbors for each path of a pair of paths is generated by 1044 :meth:`PSAPair.compute_nearest_neighbors` and stores the result 1045 in a dictionary (:attr:`nearest_neighbors`): each path has a 1046 :class:`numpy.ndarray` of the frames of its nearest neighbors, and a 1047 :class:`numpy.ndarray` of its nearest neighbor distances 1048 :attr:`PSAnalysis.D`. For example, *nearest_neighbors['frames']* is a pair 1049 of :class:`numpy.ndarray`, the first being the frames of the nearest 1050 neighbors of the first path, *i*, the second being those of the second path, 1051 *j*. 1052 1053 The Hausdorff pair for the pair of paths is found by calling 1054 :meth:`find_hausdorff_pair` (locates the nearest neighbor pair having the 1055 largest overall distance separating them), which stores the result in a 1056 dictionary (:attr:`hausdorff_pair`) containing the frames (indices) of the 1057 pair along with the corresponding (Hausdorff) distance. 1058 *hausdorff_pair['frame']* contains a pair of frames in the first path, *i*, 1059 and the second path, *j*, respectively, that correspond to the Hausdorff 1060 distance between them. 1061 1062 .. versionadded:: 0.11 1063 """ 1064 1065 def __init__(self, npaths, i, j): 1066 """Set up a :class:`PSAPair` for a pair of paths that are part of a 1067 :class:`PSA` comparison of *npaths* total paths. 1068 1069 Each unique pair of paths compared using :class:`PSA` is related by 1070 their nearest neighbors (and corresponding distances) and the Hausdorff 1071 pair and distance. :class:`PSAPair` is a convenience class for 1072 calculating and encapsulating nearest neighbor and Hausdorff pair 1073 information for one pair of paths. 1074 1075 Given *npaths*, :class:`PSA` performs and all-pairs comparison among all 1076 paths for a total of :math:`\text{npaths}*(\text{npaths}-1)/2` unique 1077 comparisons. If distances between paths are computed, the all-pairs 1078 comparison can be summarized in a symmetric distance matrix whose upper 1079 triangle can be mapped to a corresponding distance vector form in a 1080 one-to-one manner. A particular comparison of a pair of paths in a 1081 given instance of :class:`PSAPair` is thus unique identified by the row 1082 and column indices in the distance matrix representation (whether or not 1083 distances are actually computed), or a single ID (index) in the 1084 corresponding distance vector. 1085 1086 Parameters 1087 ---------- 1088 npaths : int 1089 total number of paths in :class:`PSA` used to generate *this* 1090 :class:`PSAPair` 1091 i : int 1092 row index (starting at 0) of the distance matrix 1093 j : int 1094 column index (starting at 0) of the distance matrix 1095 """ 1096 self.npaths = npaths 1097 self.matrix_idx = (i,j) 1098 self.pair_idx = self._dvec_idx(i,j) 1099 1100 # Set by calling hausdorff_nn 1101 self.nearest_neighbors = {'frames' : None, 'distances' : None} 1102 1103 # Set by self.getHausdorffPair 1104 self.hausdorff_pair = {'frames' : (None, None), 'distance' : None} 1105 1106 1107 def _dvec_idx(self, i, j): 1108 """Convert distance matrix indices (in the upper triangle) to the index 1109 of the corresponding distance vector. 1110 1111 This is a convenience function to locate distance matrix elements (and 1112 the pair generating it) in the corresponding distance vector. The row 1113 index *j* should be greater than *i+1*, corresponding to the upper 1114 triangle of the distance matrix. 1115 1116 Parameters 1117 ---------- 1118 i : int 1119 row index (starting at 0) of the distance matrix 1120 j : int 1121 column index (starting at 0) of the distance matrix 1122 1123 Returns 1124 ------- 1125 int 1126 (matrix element) index in the corresponding distance vector 1127 """ 1128 return (self.npaths*i) + j - (i+2)*(i+1)/2 1129 1130 1131 def compute_nearest_neighbors(self, P,Q, N=None): 1132 """Generates Hausdorff nearest neighbor lists of *frames* (by index) and 1133 *distances* for *this* pair of paths corresponding to distance matrix 1134 indices (*i*,*j*). 1135 1136 :meth:`PSAPair.compute_nearest_neighbors` calls 1137 :func:`hausdorff_neighbors` to populate the dictionary of the nearest 1138 neighbor lists of frames (by index) and distances 1139 (:attr:`PSAPair.nearest_neighbors`). This method must explicitly take as 1140 arguments a pair of paths, *P* and *Q*, where *P* is the 1141 :math:`i^\text{th}` path and *Q* is the :math:`j^\text{th}` path among 1142 the set of *N* total paths in the comparison. 1143 1144 Parameters 1145 ---------- 1146 P : numpy.ndarray 1147 representing a path 1148 Q : numpy.ndarray 1149 representing a path 1150 N : int 1151 size of the distance matrix (of shape *N*-by-*N*) [``None``] 1152 1153 """ 1154 hn = hausdorff_neighbors(P, Q) 1155 self.nearest_neighbors['frames'] = hn['frames'] 1156 self.nearest_neighbors['distances'] = hn['distances'] 1157 1158 1159 def find_hausdorff_pair(self): 1160 r"""Find the Hausdorff pair (of frames) for *this* pair of paths. 1161 1162 :meth:`PSAPair.find_hausdorff_pair` requires that 1163 `:meth:`PSAPair.compute_nearest_neighbors` be called first to 1164 generate the nearest neighbors (and corresponding distances) for each 1165 path in *this* :class:`PSAPair`. The Hausdorff pair is the nearest 1166 neighbor pair (of snapshots/frames), one in the first path and one in 1167 the second, with the largest separation distance. 1168 """ 1169 if self.nearest_neighbors['distances'] is None: 1170 raise NoDataError("Nearest neighbors have not been calculated yet;" 1171 " run compute_nearest_neighbors() first.") 1172 1173 nn_idx_P, nn_idx_Q = self.nearest_neighbors['frames'] 1174 nn_dist_P, nn_dist_Q = self.nearest_neighbors['distances'] 1175 max_nn_dist_P = max(nn_dist_P) 1176 max_nn_dist_Q = max(nn_dist_Q) 1177 if max_nn_dist_P > max_nn_dist_Q: 1178 max_nn_idx_P = np.argmax(nn_dist_P) 1179 self.hausdorff_pair['frames'] = max_nn_idx_P, nn_idx_P[max_nn_idx_P] 1180 self.hausdorff_pair['distance'] = max_nn_dist_P 1181 else: 1182 max_nn_idx_Q = np.argmax(nn_dist_Q) 1183 self.hausdorff_pair['frames'] = nn_idx_Q[max_nn_idx_Q], max_nn_idx_Q 1184 self.hausdorff_pair['distance'] = max_nn_dist_Q 1185 1186 1187 def get_nearest_neighbors(self, frames=True, distances=True): 1188 """Returns the nearest neighbor frame indices, distances, or both, for 1189 each path in *this* :class:`PSAPair`. 1190 1191 :meth:`PSAPair.get_nearest_neighbors` requires that the nearest 1192 neighbors (:attr:`nearest_neighbors`) be initially computed by first 1193 calling :meth:`compute_nearest_neighbors`. At least one of *frames* 1194 or *distances* must be ``True``, or else a ``NoDataError`` is raised. 1195 1196 Parameters 1197 ---------- 1198 frames : bool 1199 if ``True``, return nearest neighbor frame indices 1200 [``True``] 1201 distances : bool 1202 if ``True``, return nearest neighbor distances [``True``] 1203 1204 Returns 1205 ------- 1206 dict or tuple 1207 If both *frames* and *distances* are ``True``, return the entire 1208 dictionary (:attr:`nearest_neighbors`); if only *frames* is 1209 ``True``, return a pair of :class:`numpy.ndarray` containing the 1210 indices of the frames (for the pair of paths) of the nearest 1211 neighbors; if only *distances* is ``True``, return a pair of 1212 :class:`numpy.ndarray` of the nearest neighbor distances (for the 1213 pair of paths). 1214 1215 """ 1216 if self.nearest_neighbors['distances'] is None: 1217 raise NoDataError("Nearest neighbors have not been calculated yet;" 1218 " run compute_nearest_neighbors() first.") 1219 1220 if frames: 1221 if distances: 1222 return self.nearest_neighbors 1223 else: 1224 return self.nearest_neighbors['frames'] 1225 elif distances: 1226 return self.nearest_neighbors['distances'] 1227 else: 1228 raise NoDataError('Need to select Hausdorff pair "frames" or' 1229 ' "distances" or both. "frames" and "distances"' 1230 ' cannot both be set to False.') 1231 1232 def get_hausdorff_pair(self, frames=True, distance=True): 1233 """Returns the Hausdorff pair of frames indices, the Hausdorff distance, 1234 or both, for the paths in *this* :class:`PSAPair`. 1235 1236 :meth:`PSAPair.get_hausdorff_pair` requires that the Hausdorff pair 1237 (and distance) be initially found by first calling 1238 :meth:`find_hausdorff_pair`. At least one of *frames* or *distance* 1239 must be ``True``, or else a ``NoDataError`` is raised. 1240 1241 Parameters 1242 ---------- 1243 frames : bool 1244 if ``True``, return the indices of the frames 1245 of the Hausdorff pair [``True``] 1246 distances : bool 1247 if ``True``, return Hausdorff distance [``True``] 1248 1249 Returns 1250 ------- 1251 dict or tuple 1252 If both *frames* and *distance* are ``True``, return the entire 1253 dictionary (:attr:`hausdorff_pair`); if only *frames* is 1254 ``True``, return a pair of ``int`` containing the indices of the 1255 frames (one index per path) of the Hausdorff pair; if only *distance* 1256 is ``True``, return the Hausdorff distance for this path pair. 1257 """ 1258 if self.hausdorff_pair['distance'] is None: 1259 raise NoDataError("Hausdorff pair has not been calculated yet;" 1260 " run find_hausdorff_pair() first.") 1261 1262 if frames: 1263 if distance: 1264 return self.hausdorff_pair 1265 else: 1266 return self.hausdorff_pair['frames'] 1267 elif distance: 1268 return self.hausdorff_pair['distance'] 1269 else: 1270 raise NoDataError('Need to select Hausdorff pair "frames" or' 1271 ' "distance" or both. "frames" and "distance"' 1272 ' cannot both be set to False.') 1273 1274 1275class PSAnalysis(object): 1276 """Perform Path Similarity Analysis (PSA) on a set of trajectories. 1277 1278 The analysis is performed with :meth:`PSAnalysis.run` and stores the result 1279 in the :class:`numpy.ndarray` distance matrix :attr:`PSAnalysis.D`. 1280 :meth:`PSAnalysis.run` also generates a fitted trajectory and path from 1281 alignment of the original trajectories to a reference structure. 1282 1283 .. versionadded:: 0.8 1284 """ 1285 def __init__(self, universes, reference=None, ref_select='name CA', 1286 ref_frame=0, path_select=None, labels=None, 1287 targetdir=os.path.curdir): 1288 """Setting up Path Similarity Analysis. 1289 1290 The mutual similarity between all unique pairs of trajectories 1291 are computed using a selected path metric. 1292 1293 Parameters 1294 ---------- 1295 universes : list 1296 a list of universes (:class:`MDAnalysis.Universe` object), each 1297 containing a trajectory 1298 reference : Universe 1299 reference coordinates; :class:`MDAnalysis.Universe` object; if 1300 ``None`` the first time step of the first item in `universes` is used 1301 [``None``] 1302 ref_select : str or dict or tuple 1303 The selection to operate on; can be one of: 1304 1305 1. any valid selection string for 1306 :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` that 1307 produces identical selections in *mobile* and *reference*; or 1308 2. a dictionary ``{'mobile':sel1, 'reference':sel2}`` (the 1309 :func:`MDAnalysis.analysis.align.fasta2select` function returns 1310 such a dictionary based on a ClustalW_ or STAMP_ sequence 1311 alignment); or 1312 3. a tuple ``(sel1, sel2)`` 1313 1314 When using 2. or 3. with *sel1* and *sel2* then these selections 1315 can also each be a list of selection strings (to generate an 1316 AtomGroup with defined atom order as described under 1317 :ref:`ordered-selections-label`). 1318 tol_mass : float 1319 Reject match if the atomic masses for matched atoms differ by more 1320 than *tol_mass* [0.1] 1321 ref_frame : int 1322 frame index to select frame from *reference* [0] 1323 path_select : str 1324 atom selection composing coordinates of (fitted) path; if ``None`` 1325 then *path_select* is set to *ref_select* [``None``] 1326 targetdir : str 1327 output files are saved there; if ``None`` then "./psadata" is 1328 created and used [.] 1329 labels : list 1330 list of strings, names of trajectories to be analyzed 1331 (:class:`MDAnalysis.Universe`); if ``None``, defaults to trajectory 1332 names [``None``] 1333 1334 1335 .. _ClustalW: http://www.clustal.org/ 1336 .. _STAMP: http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/ 1337 1338 """ 1339 self.universes = universes 1340 self.u_reference = self.universes[0] if reference is None else reference 1341 self.ref_select = ref_select 1342 self.ref_frame = ref_frame 1343 self.path_select = self.ref_select if path_select is None else path_select 1344 if targetdir is None: 1345 try: 1346 targetdir = os.path.join(os.path.curdir, 'psadata') 1347 os.makedirs(targetdir) 1348 except OSError: 1349 if not os.path.isdir(targetdir): 1350 raise 1351 self.targetdir = os.path.realpath(targetdir) 1352 1353 # Set default directory names for storing topology/reference structures, 1354 # fitted trajectories, paths, distance matrices, and plots 1355 self.datadirs = {'fitted_trajs' : '/fitted_trajs', 1356 'paths' : '/paths', 1357 'distance_matrices' : '/distance_matrices', 1358 'plots' : '/plots'} 1359 for dir_name, directory in six.iteritems(self.datadirs): 1360 try: 1361 full_dir_name = os.path.join(self.targetdir, dir_name) 1362 os.makedirs(full_dir_name) 1363 except OSError: 1364 if not os.path.isdir(full_dir_name): 1365 raise 1366 1367 # Keep track of topology, trajectory, and related files 1368 trj_names = [] 1369 for i, u in enumerate(self.universes): 1370 head, tail = os.path.split(u.trajectory.filename) 1371 filename, ext = os.path.splitext(tail) 1372 trj_names.append(filename) 1373 self.trj_names = trj_names 1374 self.fit_trj_names = None 1375 self.path_names = None 1376 self.top_name = self.universes[0].filename if len(universes) != 0 else None 1377 self.labels = labels or self.trj_names 1378 1379 # Names of persistence (pickle) files where topology and trajectory 1380 # filenames are stored--should not be modified by user 1381 self._top_pkl = os.path.join(self.targetdir, "psa_top-name.pkl") 1382 self._trjs_pkl = os.path.join(self.targetdir, "psa_orig-traj-names.pkl") 1383 self._fit_trjs_pkl = os.path.join(self.targetdir, "psa_fitted-traj-names.pkl") 1384 self._paths_pkl = os.path.join(self.targetdir, "psa_path-names.pkl") 1385 self._labels_pkl = os.path.join(self.targetdir, "psa_labels.pkl") 1386 # Pickle topology and trajectory filenames for this analysis to curdir 1387 with open(self._top_pkl, 'wb') as output: 1388 cPickle.dump(self.top_name, output) 1389 with open(self._trjs_pkl, 'wb') as output: 1390 cPickle.dump(self.trj_names, output) 1391 with open(self._labels_pkl, 'wb') as output: 1392 cPickle.dump(self.labels, output) 1393 1394 self.natoms = None 1395 self.npaths = None 1396 self.paths = None 1397 self.D = None # pairwise distances 1398 self._HP = None # (distance vector order) list of all Hausdorff pairs 1399 self._NN = None # (distance vector order) list of all nearest neighbors 1400 self._psa_pairs = None # (distance vector order) list of all PSAPairs 1401 1402 1403 def generate_paths(self, align=False, filename='fitted', infix='', weights=None, 1404 tol_mass=False, ref_frame=None, flat=False, save=True, store=True): 1405 """Generate paths, aligning each to reference structure if necessary. 1406 1407 Parameters 1408 ---------- 1409 align : bool 1410 Align trajectories to atom selection :attr:`PSAnalysis.ref_select` 1411 of :attr:`PSAnalysis.u_reference` [``False``] 1412 filename : str 1413 strings representing base filename for fitted trajectories and 1414 paths [``None``] 1415 infix : str 1416 additional tag string that is inserted into the output filename of 1417 the fitted trajectory files [''] 1418 weights : {"mass", ``None``} or array_like (optional) 1419 choose weights. With ``"mass"`` uses masses as weights; with 1420 ``None`` weigh each atom equally. If a float array of the same 1421 length as the selected AtomGroup is provided, use each element of 1422 the `array_like` as a weight for the corresponding atom in the 1423 AtomGroup. 1424 tol_mass : float 1425 Reject match if the atomic masses for matched atoms differ by more 1426 than *tol_mass* 1427 ref_frame : int 1428 frame index to select frame from *reference* 1429 flat : bool 1430 represent :attr:`Path.path` as a 2D (|2D|) :class:`numpy.ndarray`; 1431 if ``False`` then :attr:`Path.path` is a 3D (|3D|) 1432 :class:`numpy.ndarray` [``False``] 1433 save : bool 1434 if ``True``, pickle list of names for fitted trajectories 1435 [``True``] 1436 store : bool 1437 if ``True`` then writes each path (:class:`numpy.ndarray`) 1438 in :attr:`PSAnalysis.paths` to compressed npz (numpy) files 1439 [``False``] 1440 1441 1442 The fitted trajectories are written to new files in the 1443 "/trj_fit" subdirectory in :attr:`PSAnalysis.targetdir` named 1444 "filename(*trajectory*)XXX*infix*_psa", where "XXX" is a number between 1445 000 and 999; the extension of each file is the same as its original. 1446 Optionally, the trajectories can also be saved in numpy compressed npz 1447 format in the "/paths" subdirectory in :attr:`PSAnalysis.targetdir` for 1448 persistence and can be accessed as the attribute 1449 :attr:`PSAnalysis.paths`. 1450 1451 1452 .. deprecated:: 0.16.1 1453 Instead of ``mass_weighted=True`` use new ``weights='mass'``; 1454 refactored to fit with AnalysisBase API 1455 1456 .. versionchanged:: 0.17.0 1457 Deprecated keyword `mass_weighted` was removed. 1458 """ 1459 if ref_frame is None: 1460 ref_frame = self.ref_frame 1461 1462 paths = [] 1463 fit_trj_names = [] 1464 for i, u in enumerate(self.universes): 1465 p = Path(u, self.u_reference, ref_select=self.ref_select, 1466 path_select=self.path_select, ref_frame=ref_frame) 1467 trj_dir = self.targetdir + self.datadirs['fitted_trajs'] 1468 postfix = '{0}{1}{2:03n}'.format(infix, '_psa', i+1) 1469 top_name, fit_trj_name = p.run(align=align, filename=filename, 1470 postfix=postfix, 1471 targetdir=trj_dir, 1472 weights=weights, 1473 tol_mass=tol_mass, flat=flat) 1474 paths.append(p.path) 1475 fit_trj_names.append(fit_trj_name) 1476 self.natoms, axis = get_coord_axes(paths[0]) 1477 self.paths = paths 1478 self.npaths = len(paths) 1479 self.fit_trj_names = fit_trj_names 1480 if save: 1481 with open(self._fit_trjs_pkl, 'wb') as output: 1482 cPickle.dump(self.fit_trj_names, output) 1483 if store: 1484 self.save_paths(filename=filename) 1485 1486 1487 def run(self, **kwargs): 1488 """Perform path similarity analysis on the trajectories to compute 1489 the distance matrix. 1490 1491 A number of parameters can be changed from the defaults. The 1492 result is stored as the array :attr:`PSAnalysis.D`. 1493 1494 Parameters 1495 ---------- 1496 metric : str or callable 1497 selection string specifying the path metric to measure pairwise 1498 distances among :attr:`PSAnalysis.paths` or a callable with the 1499 same call signature as :func:`hausdorff` 1500 [``'hausdorff'``] 1501 start : int 1502 `start` and `stop` frame index with `step` size: analyze 1503 ``trajectory[start:stop:step]`` [``None``] 1504 stop : int 1505 step : int 1506 store : bool 1507 if ``True`` then writes :attr:`PSAnalysis.D` to text and 1508 compressed npz (numpy) files [``True``] 1509 1510 .. deprecated:: 0.19.0 1511 `store` will be removed together with :meth:`save_results` in 1.0.0. 1512 1513 filename : str 1514 string, filename to save :attr:`PSAnalysis.D` 1515 1516 .. deprecated:: 0.19.0 1517 `filename` will be removed together with :meth:`save_results` in 1.0.0. 1518 1519 1520 """ 1521 metric = kwargs.pop('metric', 'hausdorff') 1522 start = kwargs.pop('start', None) 1523 stop = kwargs.pop('stop', None) 1524 step = kwargs.pop('step', None) 1525 # DEPRECATED 0.19.0: remove in 1.0 1526 if 'store' in kwargs: 1527 warnings.warn("PSAnalysis.run(): 'store' was deprecated in 0.19.0 " 1528 "and will be removed in 1.0", 1529 category=DeprecationWarning) 1530 store = kwargs.pop('store', True) 1531 1532 if isinstance(metric, string_types): 1533 metric_func = get_path_metric_func(str(metric)) 1534 else: 1535 metric_func = metric 1536 numpaths = self.npaths 1537 D = np.zeros((numpaths,numpaths)) 1538 1539 for i in range(0, numpaths-1): 1540 for j in range(i+1, numpaths): 1541 P = self.paths[i][start:stop:step] 1542 Q = self.paths[j][start:stop:step] 1543 D[i,j] = metric_func(P, Q) 1544 D[j,i] = D[i,j] 1545 self.D = D 1546 if store: 1547 # DEPRECATED 0.19.0: remove in 1.0 1548 if 'filename' in kwargs: 1549 warnings.warn("PSAnalysis.run(): 'filename' was deprecated in " 1550 "0.19.0 and will be removed in 1.0", 1551 category=DeprecationWarning) 1552 filename = kwargs.pop('filename', metric) 1553 if not isinstance(metric, string_types): 1554 filename = 'custom_metric' 1555 self.save_result(filename=filename) 1556 1557 def run_pairs_analysis(self, **kwargs): 1558 """Perform PSA Hausdorff (nearest neighbor) pairs analysis on all unique 1559 pairs of paths in :attr:`PSAnalysis.paths`. 1560 1561 Partial results can be stored in separate lists, where each list is 1562 indexed according to distance vector convention (i.e., element *(i,j)* 1563 in distance matrix representation corresponds to element 1564 :math:`s=N*i+j-(i+1)*(i+2)` in distance vector representation, which is 1565 the :math:`s^\text{th}` comparison). For each unique pair of paths, the 1566 nearest neighbors for that pair can be stored in :attr:`NN` and the 1567 Hausdorff pair in :attr:`HP`. :attr:`PP` stores the full information 1568 of Hausdorff pairs analysis that is available for each pair of path, 1569 including nearest neighbors lists and the Hausdorff pairs. 1570 1571 The pairwise distances are stored as the array :attr:`PSAnalysis.D`. 1572 1573 Parameters 1574 ---------- 1575 start : int 1576 `start` and `stop` frame index with `step` size: analyze 1577 ``trajectory[start:stop:step]`` [``None``] 1578 stop : int 1579 step : int 1580 neighbors : bool 1581 if ``True``, then stores dictionary of nearest neighbor 1582 frames/distances in :attr:`PSAnalysis.NN` [``False``] 1583 hausdorff_pairs : bool 1584 if ``True``, then stores dictionary of Hausdorff pair 1585 frames/distances in :attr:`PSAnalysis.HP` [``False``] 1586 """ 1587 start = kwargs.pop('start', None) 1588 stop = kwargs.pop('stop', None) 1589 step = kwargs.pop('step', None) 1590 neighbors = kwargs.pop('neighbors', False) 1591 hausdorff_pairs = kwargs.pop('hausdorff_pairs', False) 1592 1593 numpaths = self.npaths 1594 D = np.zeros((numpaths,numpaths)) 1595 self._NN = [] # list of nearest neighbors pairs 1596 self._HP = [] # list of Hausdorff pairs 1597 self._psa_pairs = [] # list of PSAPairs 1598 1599 for i in range(0, numpaths-1): 1600 for j in range(i+1, numpaths): 1601 pp = PSAPair(i, j, numpaths) 1602 P = self.paths[i][start:stop:step] 1603 Q = self.paths[j][start:stop:step] 1604 pp.compute_nearest_neighbors(P, Q, self.natoms) 1605 pp.find_hausdorff_pair() 1606 D[i,j] = pp.hausdorff_pair['distance'] 1607 D[j,i] = D[i,j] 1608 self._psa_pairs.append(pp) 1609 if neighbors: 1610 self._NN.append(pp.get_nearest_neighbors()) 1611 if hausdorff_pairs: 1612 self._HP.append(pp.get_hausdorff_pair()) 1613 self.D = D 1614 1615 @deprecate(release="0.19.0", remove="1.0.0", 1616 message="You can save the distance matrix :attr:`D` to a numpy " 1617 "file with ``np.save(filename, PSAnalysis.D)``.") 1618 def save_result(self, filename=None): 1619 """Save distance matrix :attr:`PSAnalysis.D` to a numpy compressed npz 1620 file and text file. 1621 1622 The data are saved with :func:`numpy.savez_compressed` and 1623 :func:`numpy.savetxt` in the directory specified by 1624 :attr:`PSAnalysis.targetdir`. 1625 1626 Parameters 1627 ---------- 1628 filename : str 1629 specifies filename [``None``] 1630 1631 Returns 1632 ------- 1633 filename : str 1634 1635 """ 1636 filename = filename or 'psa_distances' 1637 head = self.targetdir + self.datadirs['distance_matrices'] 1638 outfile = os.path.join(head, filename) 1639 if self.D is None: 1640 raise NoDataError("Distance matrix has not been calculated yet") 1641 np.save(outfile + '.npy', self.D) 1642 np.savetxt(outfile + '.dat', self.D) 1643 logger.info("Wrote distance matrix to file %r.npz", outfile) 1644 logger.info("Wrote distance matrix to file %r.dat", outfile) 1645 return filename 1646 1647 1648 def save_paths(self, filename=None): 1649 """Save fitted :attr:`PSAnalysis.paths` to numpy compressed npz files. 1650 1651 The data are saved with :func:`numpy.savez_compressed` in the directory 1652 specified by :attr:`PSAnalysis.targetdir`. 1653 1654 Parameters 1655 ---------- 1656 filename : str 1657 specifies filename [``None``] 1658 1659 Returns 1660 ------- 1661 filename : str 1662 1663 """ 1664 filename = filename or 'path_psa' 1665 head = self.targetdir + self.datadirs['paths'] 1666 outfile = os.path.join(head, filename) 1667 if self.paths is None: 1668 raise NoDataError("Paths have not been calculated yet") 1669 path_names = [] 1670 for i, path in enumerate(self.paths): 1671 current_outfile = "{0}{1:03n}.npy".format(outfile, i+1) 1672 np.save(current_outfile, self.paths[i]) 1673 path_names.append(current_outfile) 1674 logger.info("Wrote path to file %r", current_outfile) 1675 self.path_names = path_names 1676 with open(self._paths_pkl, 'wb') as output: 1677 cPickle.dump(self.path_names, output) 1678 return filename 1679 1680 1681 def load(self): 1682 """Load fitted paths specified by 'psa_path-names.pkl' in 1683 :attr:`PSAnalysis.targetdir`. 1684 """ 1685 if not os.path.exists(self._paths_pkl): 1686 raise NoDataError("Fitted trajectories cannot be loaded; save file" + 1687 "{0} does not exist.".format(self._paths_pkl)) 1688 self.path_names = np.load(self._paths_pkl) 1689 self.paths = [np.load(pname) for pname in self.path_names] 1690 if os.path.exists(self._labels_pkl): 1691 self.labels = np.load(self._labels_pkl) 1692 print("Loaded paths from " + self._paths_pkl) 1693 1694 1695 def plot(self, filename=None, linkage='ward', count_sort=False, 1696 distance_sort=False, figsize=4.5, labelsize=12): 1697 """Plot a clustered distance matrix. 1698 1699 Usese method *linkage* and plots the corresponding dendrogram. Rows 1700 (and columns) are identified using the list of strings specified by 1701 :attr:`PSAnalysis.labels`. 1702 1703 If `filename` is supplied then the figure is also written to file (the 1704 suffix determines the file type, e.g. pdf, png, eps, ...). All other 1705 keyword arguments are passed on to :func:`matplotlib.pyplot.matshow`. 1706 1707 1708 Parameters 1709 ---------- 1710 filename : str 1711 save figure to *filename* [``None``] 1712 linkage : str 1713 name of linkage criterion for clustering [``'ward'``] 1714 count_sort : bool 1715 see :func:`scipy.cluster.hierarchy.dendrogram` [``False``] 1716 distance_sort : bool 1717 see :func:`scipy.cluster.hierarchy.dendrogram` [``False``] 1718 figsize : float 1719 set the vertical size of plot in inches [``4.5``] 1720 labelsize : float 1721 set the font size for colorbar labels; font size for path labels on 1722 dendrogram default to 3 points smaller [``12``] 1723 1724 Returns 1725 ------- 1726 Z 1727 `Z` from :meth:`cluster` 1728 dgram 1729 `dgram` from :meth:`cluster` 1730 dist_matrix_clus 1731 clustered distance matrix (reordered) 1732 1733 """ 1734 from matplotlib.pyplot import figure, colorbar, cm, savefig, clf 1735 1736 if self.D is None: 1737 raise ValueError( 1738 "No distance data; do 'PSAnalysis.run(store=True)' first.") 1739 npaths = len(self.D) 1740 dist_matrix = self.D 1741 1742 dgram_loc, hmap_loc, cbar_loc = self._get_plot_obj_locs() 1743 aspect_ratio = 1.25 1744 clf() 1745 fig = figure(figsize=(figsize*aspect_ratio, figsize)) 1746 ax_hmap = fig.add_axes(hmap_loc) 1747 ax_dgram = fig.add_axes(dgram_loc) 1748 1749 Z, dgram = self.cluster(method=linkage, \ 1750 count_sort=count_sort, \ 1751 distance_sort=distance_sort) 1752 rowidx = colidx = dgram['leaves'] # get row-wise ordering from clustering 1753 ax_dgram.invert_yaxis() # Place origin at up left (from low left) 1754 1755 minDist, maxDist = 0, np.max(dist_matrix) 1756 dist_matrix_clus = dist_matrix[rowidx,:] 1757 dist_matrix_clus = dist_matrix_clus[:,colidx] 1758 im = ax_hmap.matshow(dist_matrix_clus, aspect='auto', origin='lower', \ 1759 cmap=cm.YlGn, vmin=minDist, vmax=maxDist) 1760 ax_hmap.invert_yaxis() # Place origin at upper left (from lower left) 1761 ax_hmap.locator_params(nbins=npaths) 1762 ax_hmap.set_xticks(np.arange(npaths), minor=True) 1763 ax_hmap.set_yticks(np.arange(npaths), minor=True) 1764 ax_hmap.tick_params(axis='x', which='both', labelleft='off', \ 1765 labelright='off', labeltop='on', labelsize=0) 1766 ax_hmap.tick_params(axis='y', which='both', labelleft='on', \ 1767 labelright='off', labeltop='off', labelsize=0) 1768 rowlabels = [self.labels[i] for i in rowidx] 1769 collabels = [self.labels[i] for i in colidx] 1770 ax_hmap.set_xticklabels(collabels, rotation='vertical', \ 1771 size=(labelsize-4), multialignment='center', minor=True) 1772 ax_hmap.set_yticklabels(rowlabels, rotation='horizontal', \ 1773 size=(labelsize-4), multialignment='left', ha='right', \ 1774 minor=True) 1775 1776 ax_color = fig.add_axes(cbar_loc) 1777 colorbar(im, cax=ax_color, ticks=np.linspace(minDist, maxDist, 10), \ 1778 format="%0.1f") 1779 ax_color.tick_params(labelsize=labelsize) 1780 1781 # Remove major ticks from both heat map axes 1782 for tic in ax_hmap.xaxis.get_major_ticks(): 1783 tic.tick1On = tic.tick2On = False 1784 tic.label1On = tic.label2On = False 1785 for tic in ax_hmap.yaxis.get_major_ticks(): 1786 tic.tick1On = tic.tick2On = False 1787 tic.label1On = tic.label2On = False 1788 # Remove minor ticks from both heat map axes 1789 for tic in ax_hmap.xaxis.get_minor_ticks(): 1790 tic.tick1On = tic.tick2On = False 1791 for tic in ax_hmap.yaxis.get_minor_ticks(): 1792 tic.tick1On = tic.tick2On = False 1793 # Remove tickmarks from colorbar 1794 for tic in ax_color.yaxis.get_major_ticks(): 1795 tic.tick1On = tic.tick2On = False 1796 1797 if filename is not None: 1798 head = self.targetdir + self.datadirs['plots'] 1799 outfile = os.path.join(head, filename) 1800 savefig(outfile, dpi=300, bbox_inches='tight') 1801 1802 return Z, dgram, dist_matrix_clus 1803 1804 1805 def plot_annotated_heatmap(self, filename=None, linkage='ward', \ 1806 count_sort=False, distance_sort=False, \ 1807 figsize=8, annot_size=6.5): 1808 """Plot a clustered distance matrix. 1809 1810 Uses method `linkage` and plots annotated distances in the matrix. Rows 1811 (and columns) are identified using the list of strings specified by 1812 :attr:`PSAnalysis.labels`. 1813 1814 If `filename` is supplied then the figure is also written to file (the 1815 suffix determines the file type, e.g. pdf, png, eps, ...). All other 1816 keyword arguments are passed on to :func:`matplotlib.pyplot.imshow`. 1817 1818 Parameters 1819 ---------- 1820 filename : str 1821 save figure to *filename* [``None``] 1822 linkage : str 1823 name of linkage criterion for clustering [``'ward'``] 1824 count_sort : bool 1825 see :func:`scipy.cluster.hierarchy.dendrogram` [``False``] 1826 distance_sort : bool 1827 see :func:`scipy.cluster.hierarchy.dendrogram` [``False``] 1828 figsize : float 1829 set the vertical size of plot in inches [``4.5``] 1830 annot_size : float 1831 font size of annotation labels on heat map [``6.5``] 1832 1833 Returns 1834 ------- 1835 Z 1836 `Z` from :meth:`cluster` 1837 dgram 1838 `dgram` from :meth:`cluster` 1839 dist_matrix_clus 1840 clustered distance matrix (reordered) 1841 1842 1843 Note 1844 ---- 1845 This function requires the seaborn_ package, which can be installed 1846 with `pip install seaborn` or `conda install seaborn`. 1847 1848 .. _seaborn: https://seaborn.pydata.org/ 1849 1850 """ 1851 from matplotlib.pyplot import figure, colorbar, cm, savefig, clf 1852 1853 try: 1854 import seaborn.apionly as sns 1855 except ImportError: 1856 raise ImportError( 1857 """ERROR --- The seaborn package cannot be found! 1858 1859 The seaborn API could not be imported. Please install it first. 1860 You can try installing with pip directly from the 1861 internet: 1862 1863 pip install seaborn 1864 1865 Alternatively, download the package from 1866 1867 http://pypi.python.org/pypi/seaborn/ 1868 1869 and install in the usual manner. 1870 """ 1871 ) 1872 1873 if self.D is None: 1874 raise ValueError( 1875 "No distance data; do 'PSAnalysis.run(store=True)' first.") 1876 dist_matrix = self.D 1877 1878 Z, dgram = self.cluster(method=linkage, \ 1879 count_sort=count_sort, \ 1880 distance_sort=distance_sort, \ 1881 no_plot=True) 1882 rowidx = colidx = dgram['leaves'] # get row-wise ordering from clustering 1883 dist_matrix_clus = dist_matrix[rowidx,:] 1884 dist_matrix_clus = dist_matrix_clus[:,colidx] 1885 1886 clf() 1887 aspect_ratio = 1.25 1888 fig = figure(figsize=(figsize*aspect_ratio, figsize)) 1889 ax_hmap = fig.add_subplot(111) 1890 ax_hmap = sns.heatmap(dist_matrix_clus, \ 1891 linewidths=0.25, cmap=cm.YlGn, annot=True, fmt='3.1f', \ 1892 square=True, xticklabels=rowidx, yticklabels=colidx, \ 1893 annot_kws={"size": 7}, ax=ax_hmap) 1894 1895 # Remove major ticks from both heat map axes 1896 for tic in ax_hmap.xaxis.get_major_ticks(): 1897 tic.tick1On = tic.tick2On = False 1898 tic.label1On = tic.label2On = False 1899 for tic in ax_hmap.yaxis.get_major_ticks(): 1900 tic.tick1On = tic.tick2On = False 1901 tic.label1On = tic.label2On = False 1902 # Remove minor ticks from both heat map axes 1903 for tic in ax_hmap.xaxis.get_minor_ticks(): 1904 tic.tick1On = tic.tick2On = False 1905 for tic in ax_hmap.yaxis.get_minor_ticks(): 1906 tic.tick1On = tic.tick2On = False 1907 1908 if filename is not None: 1909 head = self.targetdir + self.datadirs['plots'] 1910 outfile = os.path.join(head, filename) 1911 savefig(outfile, dpi=600, bbox_inches='tight') 1912 1913 return Z, dgram, dist_matrix_clus 1914 1915 1916 def plot_nearest_neighbors(self, filename=None, idx=0, \ 1917 labels=('Path 1', 'Path 2'), figsize=4.5, \ 1918 multiplot=False, aspect_ratio=1.75, \ 1919 labelsize=12): 1920 """Plot nearest neighbor distances as a function of normalized frame 1921 number. 1922 1923 The frame number is mapped to the interval *[0, 1]*. 1924 1925 If `filename` is supplied then the figure is also written to file (the 1926 suffix determines the file type, e.g. pdf, png, eps, ...). All other 1927 keyword arguments are passed on to :func:`matplotlib.pyplot.imshow`. 1928 1929 Parameters 1930 ---------- 1931 filename : str 1932 save figure to *filename* [``None``] 1933 idx : int 1934 index of path (pair) comparison to plot [``0``] 1935 labels : (str, str) 1936 pair of names to label nearest neighbor distance 1937 curves [``('Path 1', 'Path 2')``] 1938 figsize : float 1939 set the vertical size of plot in inches [``4.5``] 1940 multiplot : bool 1941 set to ``True`` to enable plotting multiple nearest 1942 neighbor distances on the same figure [``False``] 1943 aspect_ratio : float 1944 set the ratio of width to height of the plot [``1.75``] 1945 labelsize : float 1946 set the font size for colorbar labels; font size for path labels on 1947 dendrogram default to 3 points smaller [``12``] 1948 1949 Returns 1950 ------- 1951 ax : axes 1952 1953 Note 1954 ---- 1955 This function requires the seaborn_ package, which can be installed 1956 with `pip install seaborn` or `conda install seaborn`. 1957 1958 .. _seaborn: https://seaborn.pydata.org/ 1959 1960 """ 1961 from matplotlib.pyplot import figure, savefig, tight_layout, clf, show 1962 try: 1963 import seaborn.apionly as sns 1964 except ImportError: 1965 raise ImportError( 1966 """ERROR --- The seaborn package cannot be found! 1967 1968 The seaborn API could not be imported. Please install it first. 1969 You can try installing with pip directly from the 1970 internet: 1971 1972 pip install seaborn 1973 1974 Alternatively, download the package from 1975 1976 http://pypi.python.org/pypi/seaborn/ 1977 1978 and install in the usual manner. 1979 """ 1980 ) 1981 1982 colors = sns.xkcd_palette(["cherry", "windows blue"]) 1983 1984 if self._NN is None: 1985 raise ValueError("No nearest neighbor data; run " 1986 "'PSAnalysis.run_pairs_analysis(neighbors=True)' first.") 1987 1988 sns.set_style('whitegrid') 1989 1990 if not multiplot: 1991 clf() 1992 fig = figure(figsize=(figsize*aspect_ratio, figsize)) 1993 ax = fig.add_subplot(111) 1994 1995 nn_dist_P, nn_dist_Q = self._NN[idx]['distances'] 1996 frames_P = len(nn_dist_P) 1997 frames_Q = len(nn_dist_Q) 1998 progress_P = np.asarray(range(frames_P))/(1.0*frames_P) 1999 progress_Q = np.asarray(range(frames_Q))/(1.0*frames_Q) 2000 2001 ax.plot(progress_P, nn_dist_P, color=colors[0], lw=1.5, label=labels[0]) 2002 ax.plot(progress_Q, nn_dist_Q, color=colors[1], lw=1.5, label=labels[1]) 2003 2004 ax.legend() 2005 ax.set_xlabel(r'(normalized) progress by frame number', fontsize=12) 2006 ax.set_ylabel(r'nearest neighbor rmsd ($\AA$)', fontsize=12) 2007 ax.tick_params(axis='both', which='major', labelsize=12, pad=4) 2008 2009 sns.despine(bottom=True, left=True, ax=ax) 2010 tight_layout() 2011 2012 if filename is not None: 2013 head = self.targetdir + self.datadirs['plots'] 2014 outfile = os.path.join(head, filename) 2015 savefig(outfile, dpi=300, bbox_inches='tight') 2016 2017 return ax 2018 2019 2020 def cluster(self, dist_mat=None, method='ward', count_sort=False, \ 2021 distance_sort=False, no_plot=False, no_labels=True, \ 2022 color_threshold=4): 2023 """Cluster trajectories and optionally plot the dendrogram. 2024 2025 This method is used by :meth:`PSAnalysis.plot` to generate a heatmap- 2026 dendrogram combination plot. By default, the distance matrix, 2027 :attr:`PSAnalysis.D`, is assumed to exist, converted to 2028 distance-vector form, and inputted to :func:`cluster.hierarchy.linkage` 2029 to generate a clustering. For convenience in plotting arbitrary 2030 distance matrices, one can also be specify `dist_mat`, which will be 2031 checked for proper distance matrix form by 2032 :func:`spatial.distance.squareform` 2033 2034 Parameters 2035 ---------- 2036 dist_mat : numpy.ndarray 2037 user-specified distance matrix to be clustered [``None``] 2038 method : str 2039 name of linkage criterion for clustering [``'ward'``] 2040 no_plot : bool 2041 if ``True``, do not render the dendrogram [``False``] 2042 no_labels : bool 2043 if ``True`` then do not label dendrogram [``True``] 2044 color_threshold : float 2045 For brevity, let t be the color_threshold. Colors all the 2046 descendent links below a cluster node k the same color if k is 2047 the first node below the cut threshold t. All links connecting 2048 nodes with distances greater than or equal to the threshold are 2049 colored blue. If t is less than or equal to zero, all nodes are 2050 colored blue. If color_threshold is None or ‘default’, 2051 corresponding with MATLAB(TM) behavior, the threshold is set to 2052 0.7*max(Z[:,2]). [``4``]] 2053 2054 Returns 2055 ------- 2056 Z 2057 output from :func:`scipy.cluster.hierarchy.linkage`; 2058 list of indices representing the row-wise order of the objects 2059 after clustering 2060 dgram 2061 output from :func:`scipy.cluster.hierarchy.dendrogram` 2062 """ 2063 # perhaps there is a better way to manipulate the plot... or perhaps it 2064 # is not even necessary? In any case, the try/finally makes sure that 2065 # we are not permanently changing the user's global state 2066 orig_linewidth = matplotlib.rcParams['lines.linewidth'] 2067 matplotlib.rcParams['lines.linewidth'] = 0.5 2068 try: 2069 if dist_mat: 2070 dist_vec = spatial.distance.squareform(dist_mat, 2071 force='tovector', 2072 checks=True) 2073 else: 2074 dist_vec = self.get_pairwise_distances(vectorform=True) 2075 Z = cluster.hierarchy.linkage(dist_vec, method=method) 2076 dgram = cluster.hierarchy.dendrogram( 2077 Z, no_labels=no_labels, orientation='left', 2078 count_sort=count_sort, distance_sort=distance_sort, 2079 no_plot=no_plot, color_threshold=color_threshold) 2080 finally: 2081 matplotlib.rcParams['lines.linewidth'] = orig_linewidth 2082 return Z, dgram 2083 2084 def _get_plot_obj_locs(self): 2085 """Find and return coordinates for dendrogram, heat map, and colorbar. 2086 2087 Returns 2088 ------- 2089 tuple 2090 tuple of coordinates for placing the dendrogram, heat map, and 2091 colorbar in the plot. 2092 """ 2093 plot_xstart = 0.04 2094 plot_ystart = 0.04 2095 label_margin = 0.155 2096 2097 dgram_height = 0.2 # dendrogram heights(s) 2098 hmap_xstart = plot_xstart + dgram_height + label_margin 2099 2100 # Set locations for dendrogram(s), matrix, and colorbar 2101 hmap_height = 0.8 2102 hmap_width = 0.6 2103 dgram_loc = [plot_xstart, plot_ystart, dgram_height, hmap_height] 2104 cbar_width = 0.02 2105 cbar_xstart = hmap_xstart + hmap_width + 0.01 2106 cbar_loc = [cbar_xstart, plot_ystart, cbar_width, hmap_height] 2107 hmap_loc = [hmap_xstart, plot_ystart, hmap_width, hmap_height] 2108 2109 return dgram_loc, hmap_loc, cbar_loc 2110 2111 2112 def get_num_atoms(self): 2113 """Return the number of atoms used to construct the :class:`Path` instances in 2114 :class:`PSA`. 2115 2116 Returns 2117 ------- 2118 int 2119 the number of atoms in any path 2120 2121 Note 2122 ---- 2123 Must run :meth:`PSAnalysis.generate_paths` prior to calling this 2124 method. 2125 """ 2126 if self.natoms is None: 2127 raise ValueError( 2128 "No path data; do 'PSAnalysis.generate_paths()' first.") 2129 return self.natoms 2130 2131 2132 def get_num_paths(self): 2133 """Return the number of paths in :class:`PSA`. 2134 2135 Note 2136 ---- 2137 Must run :meth:`PSAnalysis.generate_paths` prior to calling this method. 2138 2139 Returns 2140 ------- 2141 int 2142 the number of paths in :class:`PSA` 2143 """ 2144 if self.npaths is None: 2145 raise ValueError( 2146 "No path data; do 'PSAnalysis.generate_paths()' first.") 2147 return self.npaths 2148 2149 2150 def get_paths(self): 2151 """Return the paths in :class:`PSA`. 2152 2153 Note 2154 ---- 2155 Must run :meth:`PSAnalysis.generate_paths` prior to calling this 2156 method. 2157 2158 Returns 2159 ------- 2160 list 2161 list of :class:`numpy.ndarray` representations of paths in 2162 :class:`PSA` 2163 """ 2164 if self.paths is None: 2165 raise ValueError( 2166 "No path data; do 'PSAnalysis.generate_paths()' first.") 2167 return self.paths 2168 2169 2170 def get_pairwise_distances(self, vectorform=False, checks=False): 2171 """Return the distance matrix (or vector) of pairwise path distances. 2172 2173 Note 2174 ---- 2175 Must run :meth:`PSAnalysis.run` with ``store=True`` prior to 2176 calling this method. 2177 2178 Parameters 2179 ---------- 2180 vectorform : bool 2181 if ``True``, return the distance vector instead [``False``] 2182 checks : bool 2183 if ``True``, check that :attr:`PSAnalysis.D` is a proper distance 2184 matrix [``False``] 2185 2186 Returns 2187 ------- 2188 numpy.ndarray 2189 representation of the distance matrix (or vector) 2190 2191 """ 2192 if self.D is None: 2193 raise ValueError( 2194 "No distance data; do 'PSAnalysis.run(store=True)' first.") 2195 if vectorform: 2196 return spatial.distance.squareform(self.D, force='tovector', 2197 checks=checks) 2198 else: 2199 return self.D 2200 2201 2202 @property 2203 def psa_pairs(self): 2204 """The list of :class:`PSAPair` instances for each pair of paths. 2205 2206 :attr:`psa_pairs` is a list of all :class:`PSAPair` objects (in 2207 distance vector order). The elements of a :class:`PSAPair` are pairs of 2208 paths that have been compared using 2209 :meth:`PSAnalysis.run_pairs_analysis`. Each :class:`PSAPair` contains 2210 nearest neighbor and Hausdorff pair information specific to a pair of 2211 paths. The nearest neighbor frames and distances for a :class:`PSAPair` 2212 can be accessed in the nearest neighbor dictionary using the keys 2213 'frames' and 'distances', respectively. E.g., 2214 :attr:`PSAPair.nearest_neighbors['distances']` returns a *pair* of 2215 :class:`numpy.ndarray` corresponding to the nearest neighbor distances 2216 for each path. Similarly, Hausdorff pair information can be accessed 2217 using :attr:`PSAPair.hausdorff_pair` with the keys 'frames' and 2218 'distance'. 2219 2220 Note 2221 ---- 2222 Must run :meth:`PSAnalysis.run_pairs_analysis` prior to calling this 2223 method. 2224 2225 """ 2226 if self._psa_pairs is None: 2227 raise ValueError("No nearest neighbors data; do" 2228 " 'PSAnalysis.run_pairs_analysis()' first.") 2229 return self._psa_pairs 2230 2231 2232 @property 2233 def hausdorff_pairs(self): 2234 """The Hausdorff pair for each (unique) pairs of paths. 2235 2236 This attribute contains a list of Hausdorff pair information (in 2237 distance vector order), where each element is a dictionary containing 2238 the pair of frames and the (Hausdorff) distance between a pair of 2239 paths. See :meth:`PSAnalysis.psa_pairs` and 2240 :attr:`PSAPair.hausdorff_pair` for more information about accessing 2241 Hausdorff pair data. 2242 2243 Note 2244 ---- 2245 Must run :meth:`PSAnalysis.run_pairs_analysis` with 2246 ``hausdorff_pairs=True`` prior to calling this method. 2247 2248 """ 2249 if self._HP is None: 2250 raise ValueError("No Hausdorff pairs data; do " 2251 "'PSAnalysis.run_pairs_analysis(hausdorff_pairs=True)' " 2252 "first.") 2253 return self._HP 2254 2255 @property 2256 def nearest_neighbors(self): 2257 """The nearest neighbors for each (unique) pair of paths. 2258 2259 This attribute contains a list of nearest neighbor information (in 2260 distance vector order), where each element is a dictionary containing 2261 the nearest neighbor frames and distances between a pair of paths. See 2262 :meth:`PSAnalysis.psa_pairs` and :attr:`PSAPair.nearest_neighbors` for 2263 more information about accessing nearest neighbor data. 2264 2265 Note 2266 ---- 2267 Must run :meth:`PSAnalysis.run_pairs_analysis` with 2268 ``neighbors=True`` prior to calling this method. 2269 2270 """ 2271 if self._NN is None: 2272 raise ValueError("No nearest neighbors data; do" 2273 " 'PSAnalysis.run_pairs_analysis(neighbors=True)'" 2274 " first.") 2275 return self._NN 2276