1# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-
2# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
3#
4# MDAnalysis --- https://www.mdanalysis.org
5# Copyright (c) 2006-2017 The MDAnalysis Development Team and contributors
6# (see the file AUTHORS for the full list of names)
7#
8# Released under the GNU Public Licence, v2 or any higher version
9#
10# Please cite your use of MDAnalysis in published work:
11#
12# R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler,
13# D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein.
14# MDAnalysis: A Python package for the rapid analysis of molecular dynamics
15# simulations. In S. Benthall and S. Rostrup editors, Proceedings of the 15th
16# Python in Science Conference, pages 102-109, Austin, TX, 2016. SciPy.
17#
18# N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein.
19# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations.
20# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787
21#
22
23r"""
24Calculating path similarity --- :mod:`MDAnalysis.analysis.psa`
25==========================================================================
26
27:Author: Sean Seyler
28:Year: 2015
29:Copyright: GNU Public License v3
30
31.. versionadded:: 0.10.0
32
33The module contains code to calculate the geometric similarity of trajectories
34using path metrics such as the Hausdorff or Fréchet distances
35[Seyler2015]_. The path metrics are functions of two paths and return a
36nonnegative number, i.e., a distance. Two paths are identical if their distance
37is zero, and large distances indicate dissimilarity. Each path metric is a
38function of the individual points (e.g., coordinate snapshots) that comprise
39each path and, loosely speaking, identify the two points, one per path of a
40pair of paths, where the paths deviate the most.  The distance between these
41points of maximal deviation is measured by the root mean square deviation
42(RMSD), i.e., to compute structural similarity.
43
44One typically computes the pairwise similarity for an ensemble of paths to
45produce a symmetric distance matrix, which can be clustered to, at a glance,
46identify patterns in the trajectory data. To properly analyze a path ensemble,
47one must select a suitable reference structure to which all paths (each
48conformer in each path) will be universally aligned using the rotations
49determined by the best-fit rmsds. Distances between paths and their structures
50are then computed directly with no further alignment. This pre-processing step
51is necessary to preserve the metric properties of the Hausdorff and Fréchet
52metrics; using the best-fit rmsd on a pairwise basis does not generally
53preserve the triangle inequality.
54
55Note
56----
57The `PSAnalysisTutorial`_ outlines a typical application of PSA to
58a set of trajectories, including doing proper alignment,
59performing distance comparisons, and generating heat
60map-dendrogram plots from hierarchical clustering.
61
62
63.. Rubric:: References
64
65
66.. [Seyler2015] Seyler SL, Kumar A, Thorpe MF, Beckstein O (2015)
67                Path Similarity Analysis: A Method for Quantifying
68                Macromolecular Pathways. PLoS Comput Biol 11(10): e1004568.
69                doi: `10.1371/journal.pcbi.1004568`_
70
71.. _`10.1371/journal.pcbi.1004568`: http://dx.doi.org/10.1371/journal.pcbi.1004568
72.. _`PSAnalysisTutorial`: https://github.com/Becksteinlab/PSAnalysisTutorial
73
74
75Helper functions and variables
76------------------------------
77The following convenience functions are used by other functions in this module.
78
79.. autofunction:: sqnorm
80.. autofunction:: get_msd_matrix
81.. autofunction:: get_coord_axes
82
83
84Classes, methods, and functions
85-------------------------------
86
87.. autofunction:: get_path_metric_func
88.. autofunction:: hausdorff
89.. autofunction:: hausdorff_wavg
90.. autofunction:: hausdorff_avg
91.. autofunction:: hausdorff_neighbors
92.. autofunction:: discrete_frechet
93.. autofunction:: dist_mat_to_vec
94
95.. autoclass:: Path
96   :members:
97
98   .. attribute:: u_original
99
100      :class:`MDAnalysis.Universe` object with a trajectory
101
102   .. attribute:: u_reference
103
104      :class:`MDAnalysis.Universe` object containing a reference structure
105
106   .. attribute:: ref_select
107
108      string, selection for
109      :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select frame
110      from :attr:`Path.u_reference`
111
112   .. attribute:: path_select
113
114      string, selection for
115      :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select atoms
116      to compose :attr:`Path.path`
117
118   .. attribute:: ref_frame
119
120      int, frame index to select frame from :attr:`Path.u_reference`
121
122   .. attribute:: u_fitted
123
124      :class:`MDAnalysis.Universe` object with the fitted trajectory
125
126   .. attribute:: path
127
128      :class:`numpy.ndarray` object representation of the fitted trajectory
129
130.. autoclass:: PSAPair
131
132   .. attribute:: npaths
133
134      int, total number of paths in the comparison in which *this*
135      :class:`PSAPair` was generated
136
137   .. attribute:: matrix_id
138
139      (int, int), (row, column) indices of the location of *this*
140      :class:`PSAPair` in the corresponding pairwise distance matrix
141
142   .. attribute:: pair_id
143
144      int, ID of *this* :class:`PSAPair` (the pair_id:math:`^\text{th}`
145      comparison) in the distance vector corresponding to the pairwise distance
146      matrix
147
148   .. attribute:: nearest_neighbors
149
150      dict, contains the nearest neighbors by frame index and the
151      nearest neighbor distances for each path in *this* :class:`PSAPair`
152
153   .. attribute:: hausdorff_pair
154
155      dict, contains the frame indices of the Hausdorff pair for each path in
156      *this* :class:`PSAPair` and the corresponding (Hausdorff) distance
157
158.. autoclass:: PSAnalysis
159   :members:
160
161   .. attribute:: universes
162
163      list of :class:`MDAnalysis.Universe` objects containing trajectories
164
165   .. attribute:: u_reference
166
167      :class:`MDAnalysis.Universe` object containing a reference structure
168
169   .. attribute:: ref_select
170
171      string, selection for
172      :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select frame
173      from :attr:`PSAnalysis.u_reference`
174
175   .. attribute:: path_select
176
177      string, selection for
178      :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` to select atoms
179      to compose :attr:`Path.path`
180
181   .. attribute:: ref_frame
182
183      int, frame index to select frame from :attr:`Path.u_reference`
184
185   .. attribute:: filename
186
187      string, name of file to store calculated distance matrix
188      (:attr:`PSAnalysis.D`)
189
190   .. attribute:: paths
191
192      list of :class:`numpy.ndarray` objects representing the set/ensemble of
193      fitted trajectories
194
195   .. attribute:: D
196
197      string, name of file to store calculated distance matrix
198      (:attr:`PSAnalysis.D`)
199
200
201.. Markup definitions
202.. ------------------
203..
204.. |3Dp| replace:: :math:`N_p \times N \times 3`
205.. |2Dp| replace:: :math:`N_p \times (3N)`
206.. |3Dq| replace:: :math:`N_q \times N \times 3`
207.. |2Dq| replace:: :math:`N_q \times (3N)`
208.. |3D| replace:: :math:`N_p\times N\times 3`
209.. |2D| replace:: :math:`N_p\times 3N`
210.. |Np| replace:: :math:`N_p`
211
212"""
213from __future__ import division, absolute_import, print_function
214
215import six
216from six.moves import range, cPickle
217from six import string_types
218
219import os
220import warnings
221import numbers
222
223import numpy as np
224from scipy import spatial, cluster
225from scipy.spatial.distance import directed_hausdorff
226import matplotlib
227
228import MDAnalysis
229import MDAnalysis.analysis.align
230from MDAnalysis import NoDataError
231from MDAnalysis.lib.util import deprecate
232
233import logging
234logger = logging.getLogger('MDAnalysis.analysis.psa')
235
236
237from ..due import due, Doi
238
239due.cite(Doi("10.1371/journal.pcbi.1004568"),
240         description="Path Similarity Analysis algorithm and implementation",
241         path="MDAnalysis.analysis.psa",
242         cite_module=True)
243del Doi
244
245def get_path_metric_func(name):
246    """Selects a path metric function by name.
247
248    Parameters
249    ----------
250    name : str
251        name of path metric
252
253    Returns
254    -------
255    path_metric : function
256        The path metric function specified by *name* (if found).
257    """
258    path_metrics = {
259            'hausdorff' : hausdorff,
260            'weighted_average_hausdorff' : hausdorff_wavg,
261            'average_hausdorff' : hausdorff_avg,
262            'hausdorff_neighbors' : hausdorff_neighbors,
263            'discrete_frechet' : discrete_frechet
264    }
265    try:
266        return path_metrics[name]
267    except KeyError as key:
268        raise KeyError('Path metric "{}" not found. Valid selections: {}'
269                       ''.format(key, " ".join('"{}"'.format(n)
270                                               for n in path_metrics.keys())))
271
272
273def sqnorm(v, axis=None):
274    """Compute the sum of squares of elements along specified axes.
275
276    Parameters
277    ----------
278    v :  numpy.ndarray
279         coordinates
280    axes : None / int / tuple (optional)
281         Axes or axes along which a sum is performed. The default
282         (*axes* = ``None``) performs a sum over all the dimensions of
283         the input array.  The value of *axes* may be negative, in
284         which case it counts from the last axis to the zeroth axis.
285
286    Returns
287    -------
288    float
289          the sum of the squares of the elements of `v` along `axes`
290
291    """
292    return np.sum(v*v, axis=axis)
293
294
295def get_msd_matrix(P, Q, axis=None):
296    r"""Generate the matrix of pairwise mean-squared deviations between paths.
297
298    The MSDs between all pairs of points in `P` and `Q` are
299    calculated, each pair having a point from `P` and a point from
300    `Q`.
301
302    `P` (`Q`) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time
303    steps, :math:`N` atoms, and :math:`3N` coordinates (e.g.,
304    :attr:`MDAnalysis.core.groups.AtomGroup.positions`). The pairwise MSD
305    matrix has dimensions :math:`N_p` by :math:`N_q`.
306
307    Parameters
308    ----------
309    P : numpy.ndarray
310        the points in the first path
311    Q : numpy.ndarray
312        the points in the second path
313
314    Returns
315    -------
316    msd_matrix : numpy.ndarray
317         matrix of pairwise MSDs between points in `P` and points
318         in `Q`
319
320    Notes
321    -----
322    We calculate the MSD matrix
323
324    .. math::
325       M_{ij} = ||p_i - q_j||^2
326
327    where :math:`p_i \in P` and :math:`q_j \in Q`.
328    """
329    return np.asarray([sqnorm(p - Q, axis=axis) for p in P])
330
331
332def reshaper(path, axis):
333    """Flatten path when appropriate to facilitate calculations
334    requiring two dimensional input.
335    """
336    if len(axis) > 1:
337        path = path.reshape(len(path), -1)
338    return path
339
340def get_coord_axes(path):
341    """Return the number of atoms and the axes corresponding to atoms
342    and coordinates for a given path.
343
344    The `path` is assumed to be a :class:`numpy.ndarray` where the 0th axis
345    corresponds to a frame (a snapshot of coordinates). The :math:`3N`
346    (Cartesian) coordinates are assumed to be either:
347
348    1. all in the 1st axis, starting with the x,y,z coordinates of the
349       first atom, followed by the *x*,*y*,*z* coordinates of the 2nd, etc.
350    2. in the 1st *and* 2nd axis, where the 1st axis indexes the atom
351       number and the 2nd axis contains the *x*,*y*,*z* coordinates of
352       each atom.
353
354    Parameters
355    ----------
356    path : numpy.ndarray
357         representing a path
358
359    Returns
360    -------
361    (int, (int, ...))
362         the number of atoms and the axes containing coordinates
363
364    """
365    path_dimensions = len(path.shape)
366    if path_dimensions == 3:
367        N = path.shape[1]
368        axis = (1,2) # 1st axis: atoms, 2nd axis: x,y,z coords
369    elif path_dimensions == 2:
370        # can use mod to check if total # coords divisible by 3
371        N = path.shape[1] / 3
372        axis = (1,) # 1st axis: 3N structural coords (x1,y1,z1,...,xN,xN,zN)
373    else:
374        raise ValueError("Path must have 2 or 3 dimensions; the first "
375                         "dimensions (axis 0) must correspond to frames, "
376                         "axis 1 (and axis 2, if present) must contain atomic "
377                         "coordinates.")
378    return N, axis
379
380
381def hausdorff(P, Q):
382    r"""Calculate the symmetric Hausdorff distance between two paths.
383
384    The metric used is RMSD, as opposed to the more conventional L2
385    (Euclidean) norm, because this is convenient for i.e., comparing
386    protein configurations.
387
388    *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time
389    steps, :math:`N` atoms, and :math:`3N` coordinates (e.g.,
390    :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has
391    either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form.
392
393    Note that reversing the path does not change the Hausdorff distance.
394
395    Parameters
396    ----------
397    P : numpy.ndarray
398        the points in the first path
399    Q : numpy.ndarray
400        the points in the second path
401
402    Returns
403    -------
404    float
405        the Hausdorff distance between paths `P` and `Q`
406
407    Example
408    -------
409    Calculate the Hausdorff distance between two halves of a trajectory:
410
411     >>> from MDAnalysis.tests.datafiles import PSF, DCD
412     >>> u = Universe(PSF,DCD)
413     >>> mid = len(u.trajectory)/2
414     >>> ca = u.select_atoms('name CA')
415     >>> P = numpy.array([
416     ...                ca.positions for _ in u.trajectory[:mid:]
417     ...              ]) # first half of trajectory
418     >>> Q = numpy.array([
419     ...                ca.positions for _ in u.trajectory[mid::]
420     ...              ]) # second half of trajectory
421     >>> hausdorff(P,Q)
422     4.7786639840135905
423     >>> hausdorff(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory
424     4.7786639840135905
425
426
427    Notes
428    -----
429    :func:`scipy.spatial.distance.directed_hausdorff` is an optimized
430    implementation of the early break algorithm of [Taha2015]_; the
431    latter code is used here to calculate the symmetric Hausdorff
432    distance with an RMSD metric
433
434    References
435    ----------
436    .. [Taha2015] A. A. Taha and A. Hanbury. An efficient algorithm for
437       calculating the exact Hausdorff distance. IEEE Transactions On Pattern
438       Analysis And Machine Intelligence, 37:2153-63, 2015.
439
440    """
441    N_p, axis_p = get_coord_axes(P)
442    N_q, axis_q = get_coord_axes(Q)
443
444    if N_p != N_q:
445        raise ValueError("P and Q must have matching sizes")
446
447    P = reshaper(P, axis_p)
448    Q = reshaper(Q, axis_q)
449
450    return max(directed_hausdorff(P, Q)[0],
451               directed_hausdorff(Q, P)[0]) / np.sqrt(N_p)
452
453
454def hausdorff_wavg(P, Q):
455    r"""Calculate the weighted average Hausdorff distance between two paths.
456
457    *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time
458    steps, :math:`N` atoms, and :math:`3N` coordinates (e.g.,
459    :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has
460    either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form. The nearest
461    neighbor distances for *P* (to *Q*) and those of *Q* (to *P*) are averaged
462    individually to get the average nearest neighbor distance for *P* and
463    likewise for *Q*. These averages are then summed and divided by 2 to get a
464    measure that gives equal weight to *P* and *Q*.
465
466    Parameters
467    ----------
468    P : numpy.ndarray
469        the points in the first path
470    Q : numpy.ndarray
471        the points in the second path
472
473    Returns
474    -------
475    float
476        the weighted average Hausdorff distance between paths `P` and `Q`
477
478    Example
479    -------
480
481     >>> from MDAnalysis import Universe
482     >>> from MDAnalysis.tests.datafiles import PSF, DCD
483     >>> u = Universe(PSF,DCD)
484     >>> mid = len(u.trajectory)/2
485     >>> ca = u.select_atoms('name CA')
486     >>> P = numpy.array([
487     ...                ca.positions for _ in u.trajectory[:mid:]
488     ...              ]) # first half of trajectory
489     >>> Q = numpy.array([
490     ...                ca.positions for _ in u.trajectory[mid::]
491     ...              ]) # second half of trajectory
492     >>> hausdorff_wavg(P,Q)
493     2.5669644353703447
494     >>> hausdorff_wavg(P,Q[::-1]) # weighted avg hausdorff dist w/ Q reversed
495     2.5669644353703447
496
497    Notes
498    -----
499    The weighted average Hausdorff distance is not a true metric (it does not
500    obey the triangle inequality); see [Seyler2015]_ for further details.
501
502
503    """
504    N, axis = get_coord_axes(P)
505    d = get_msd_matrix(P, Q, axis=axis)
506    out = 0.5*( np.mean(np.amin(d,axis=0)) + np.mean(np.amin(d,axis=1)) )
507    return ( out / N )**0.5
508
509
510def hausdorff_avg(P, Q):
511    r"""Calculate the average Hausdorff distance between two paths.
512
513    *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time
514    steps, :math:`N` atoms, and :math:`3N` coordinates (e.g.,
515    :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has
516    either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form. The nearest
517    neighbor distances for *P* (to *Q*) and those of *Q* (to *P*) are all
518    averaged together to get a mean nearest neighbor distance. This measure
519    biases the average toward the path that has more snapshots, whereas weighted
520    average Hausdorff gives equal weight to both paths.
521
522    Parameters
523    ----------
524    P : numpy.ndarray
525        the points in the first path
526    Q : numpy.ndarray
527        the points in the second path
528
529    Returns
530    -------
531    float
532        the average Hausdorff distance between paths `P` and `Q`
533
534    Example
535    -------
536
537     >>> from MDAnalysis.tests.datafiles import PSF, DCD
538     >>> u = Universe(PSF,DCD)
539     >>> mid = len(u.trajectory)/2
540     >>> ca = u.select_atoms('name CA')
541     >>> P = numpy.array([
542     ...                ca.positions for _ in u.trajectory[:mid:]
543     ...              ]) # first half of trajectory
544     >>> Q = numpy.array([
545     ...                ca.positions for _ in u.trajectory[mid::]
546     ...              ]) # second half of trajectory
547     >>> hausdorff_avg(P,Q)
548     2.5669646575869005
549     >>> hausdorff_avg(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory
550     2.5669646575869005
551
552
553    Notes
554    -----
555    The average Hausdorff distance is not a true metric (it does not obey the
556    triangle inequality); see [Seyler2015]_ for further details.
557
558    """
559    N, axis = get_coord_axes(P)
560    d = get_msd_matrix(P, Q, axis=axis)
561    out = np.mean( np.append( np.amin(d,axis=0), np.amin(d,axis=1) ) )
562    return ( out / N )**0.5
563
564
565def hausdorff_neighbors(P, Q):
566    r"""Find the Hausdorff neighbors of two paths.
567
568    *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time
569    steps, :math:`N` atoms, and :math:`3N` coordinates (e.g.,
570    :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has
571    either shape |3Dp| (|3Dq|), or |2Dp| (|2Dq|) in flattened form.
572
573    Parameters
574    ----------
575    P : numpy.ndarray
576        the points in the first path
577    Q : numpy.ndarray
578        the points in the second path
579
580    Returns
581    -------
582    dict
583        dictionary of two pairs of numpy arrays, the first pair (key
584        "frames") containing the indices of (Hausdorff) nearest
585        neighbors for `P` and `Q`, respectively, the second (key
586        "distances") containing (corresponding) nearest neighbor
587        distances for `P` and `Q`, respectively
588
589    Notes
590    -----
591    - Hausdorff neighbors are those points on the two paths that are separated by
592      the Hausdorff distance. They are the farthest nearest neighbors and are
593      maximally different in the sense of the Hausdorff distance [Seyler2015]_.
594    - :func:`scipy.spatial.distance.directed_hausdorff` can also provide the
595      hausdorff neighbors.
596
597    """
598    N, axis = get_coord_axes(P)
599    d = get_msd_matrix(P, Q, axis=axis)
600    nearest_neighbors = {
601        'frames' : (np.argmin(d, axis=1), np.argmin(d, axis=0)),
602        'distances' : ((np.amin(d,axis=1)/N)**0.5, (np.amin(d, axis=0)/N)**0.5)
603    }
604    return nearest_neighbors
605
606
607def discrete_frechet(P, Q):
608    r"""Calculate the discrete Fréchet distance between two paths.
609
610    *P* (*Q*) is a :class:`numpy.ndarray` of :math:`N_p` (:math:`N_q`) time
611    steps, :math:`N` atoms, and :math:`3N` coordinates (e.g.,
612    :attr:`MDAnalysis.core.groups.AtomGroup.positions`). *P* (*Q*) has
613    either shape |3Dp| (|3Dq|), or :|2Dp| (|2Dq|) in flattened form.
614
615    Parameters
616    ----------
617    P : numpy.ndarray
618        the points in the first path
619    Q : numpy.ndarray
620        the points in the second path
621
622    Returns
623    -------
624    float
625        the discrete Fréchet distance between paths *P* and *Q*
626
627    Example
628    -------
629    Calculate the discrete Fréchet distance between two halves of a
630    trajectory.
631
632     >>> u = Universe(PSF,DCD)
633     >>> mid = len(u.trajectory)/2
634     >>> ca = u.select_atoms('name CA')
635     >>> P = np.array([
636     ...                ca.positions for _ in u.trajectory[:mid:]
637     ...              ]) # first half of trajectory
638     >>> Q = np.array([
639     ...                ca.positions for _ in u.trajectory[mid::]
640     ...              ]) # second half of trajectory
641     >>> discrete_frechet(P,Q)
642     4.7786639840135905
643     >>> discrete_frechet(P,Q[::-1]) # frechet distance w/ 2nd trj reversed 2nd
644     6.8429011177113832
645
646    Note that reversing the direction increased the Fréchet distance:
647    it is sensitive to the direction of the path.
648
649    Notes
650    -----
651
652    The discrete Fréchet metric is an approximation to the continuous Fréchet
653    metric [Frechet1906]_ [Alt1995]_. The calculation of the continuous
654    Fréchet distance is implemented with the dynamic programming algorithm of
655    [EiterMannila1994]_ [EiterMannila1997]_.
656
657
658    References
659    ----------
660
661    .. [Frechet1906] M. Fréchet. Sur quelques points du calcul
662       fonctionnel. Rend. Circ. Mat. Palermo, 22(1):1–72, Dec. 1906.
663
664    .. [Alt1995] H. Alt and M. Godau. Computing the Fréchet distance between
665       two polygonal curves. Int J Comput Geometry & Applications,
666       5(01n02):75–91, 1995. doi: `10.1142/S0218195995000064`_
667
668    .. _`10.1142/S0218195995000064`: http://doi.org/10.1142/S0218195995000064
669
670    .. [EiterMannila1994] T. Eiter and H. Mannila. Computing discrete Fréchet
671       distance. Technical Report CD-TR 94/64, Christian Doppler Laboratory for
672       Expert Systems, Technische Universität Wien, Wien, 1994.
673
674    .. [EiterMannila1997] T. Eiter and H. Mannila. Distance measures for point
675       sets and their computation. Acta Informatica, 34:109–133, 1997. doi: `10.1007/s002360050075`_.
676
677
678    .. _10.1007/s002360050075: http://doi.org/10.1007/s002360050075
679
680    """
681
682    N, axis = get_coord_axes(P)
683    Np, Nq = len(P), len(Q)
684    d = get_msd_matrix(P, Q, axis=axis)
685    ca = -np.ones((Np, Nq))
686
687    def c(i, j):
688        """Compute the coupling distance for two partial paths formed by *P* and
689        *Q*, where both begin at frame 0 and end (inclusive) at the respective
690        frame indices :math:`i-1` and :math:`j-1`. The partial path of *P* (*Q*)
691        up to frame *i* (*j*) is formed by the slicing ``P[0:i]`` (``Q[0:j]``).
692
693        :func:`c` is called recursively to compute the coupling distance
694        between the two full paths *P* and *Q*  (i.e., the discrete Frechet
695        distance) in terms of coupling distances between their partial paths.
696
697        Parameters
698        ----------
699        i : int
700            partial path of *P* through final frame *i-1*
701        j : int
702            partial path of *Q* through final frame *j-1*
703
704        Returns
705        -------
706        dist : float
707            the coupling distance between partial paths `P[0:i]` and `Q[0:j]`
708        """
709        if ca[i,j] != -1 :
710            return ca[i,j]
711        if i > 0:
712            if j > 0:
713                ca[i,j] = max( min(c(i-1,j),c(i,j-1),c(i-1,j-1)), d[i,j] )
714            else:
715                ca[i,j] = max( c(i-1,0), d[i,0] )
716        elif j > 0:
717            ca[i,j] = max( c(0,j-1), d[0,j] )
718        else:
719            ca[i,j] = d[0,0]
720        return ca[i,j]
721
722    return (c(Np-1, Nq-1) / N)**0.5
723
724
725def dist_mat_to_vec(N, i, j):
726    """Convert distance matrix indices (in the upper triangle) to the index of
727    the corresponding distance vector.
728
729    This is a convenience function to locate distance matrix elements (and the
730    pair generating it) in the corresponding distance vector. The row index *j*
731    should be greater than *i+1*, corresponding to the upper triangle of the
732    distance matrix.
733
734    Parameters
735    ----------
736    N : int
737        size of the distance matrix (of shape *N*-by-*N*)
738    i : int
739        row index (starting at 0) of the distance matrix
740    j : int
741        column index (starting at 0) of the distance matrix
742
743    Returns
744    -------
745    int
746        index (of the matrix element) in the corresponding distance vector
747
748    """
749
750    if not (isinstance(N, numbers.Integral) and isinstance(i, numbers.Integral)
751            and isinstance(j, numbers.Integral)):
752        raise ValueError("N, i, j all must be of type int")
753
754    if i < 0 or j < 0 or N < 2:
755        raise ValueError("Matrix indices are invalid; i and j must be greater "
756                         "than 0 and N must be greater the 2")
757
758    if (j > i and (i > N - 1 or j > N)) or (j < i and (i > N or j > N - 1)):
759        raise ValueError("Matrix indices are out of range; i and j must be "
760                         "less than N = {0:d}".format(N))
761    if j > i:
762        return (N*i) + j - (i+2)*(i+1) // 2  # old-style division for int output
763    elif j < i:
764        warnings.warn("Column index entered (j = {:d} is smaller than row "
765                      "index (i = {:d}). Using symmetric element in upper "
766                      "triangle of distance matrix instead: i --> j, "
767                      "j --> i".format(j, i))
768        return (N*j) + i - (j+2)*(j+1) // 2  # old-style division for int output
769    else:
770        raise ValueError("Error in processing matrix indices; i and j must "
771                         "be integers less than integer N = {0:d} such that"
772                         " j >= i+1.".format(N))
773
774
775class Path(object):
776    """Represent a path based on a :class:`~MDAnalysis.core.universe.Universe`.
777
778    Pre-process a :class:`Universe` object: (1) fit the trajectory to a
779    reference structure, (2) convert fitted time series to a
780    :class:`numpy.ndarray` representation of :attr:`Path.path`.
781
782    The analysis is performed with :meth:`PSAnalysis.run` and stores the result
783    in the :class:`numpy.ndarray` distance matrix :attr:`PSAnalysis.D`.
784    :meth:`PSAnalysis.run` also generates a fitted trajectory and path from
785    alignment of the original trajectories to a reference structure.
786
787    .. versionadded:: 0.9.1
788
789    """
790
791    def __init__(self, universe, reference, ref_select='name CA',
792                 path_select='all', ref_frame=0):
793        """Setting up trajectory alignment and fitted path generation.
794
795        Parameters
796        ----------
797        universe : Universe
798             :class:`MDAnalysis.Universe` object containing a trajectory
799        reference : Universe
800             reference structure (uses `ref_frame` from the trajectory)
801        ref_select : str or dict or tuple (optional)
802             The selection to operate on for rms fitting; can be one of:
803
804             1. any valid selection string for
805                :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` that
806                produces identical selections in *mobile* and *reference*; or
807             2. a dictionary ``{'mobile':sel1, 'reference':sel2}`` (the
808                :func:`MDAnalysis.analysis.align.fasta2select` function returns
809                such a dictionary based on a ClustalW_ or STAMP_ sequence
810                alignment); or
811             3. a tuple ``(sel1, sel2)``
812
813             When using 2. or 3. with *sel1* and *sel2* then these selections
814             can also each be a list of selection strings (to generate an
815             AtomGroup with defined atom order as described under
816             :ref:`ordered-selections-label`).
817        ref_frame : int
818             frame index to select the coordinate frame from
819             `ref_select.trajectory`
820        path_select : selection_string
821             atom selection composing coordinates of (fitted) path; if ``None``
822             then `path_select` is set to `ref_select` [``None``]
823
824        """
825        self.u_original = universe
826        self.u_reference = reference
827        self.ref_select = ref_select
828        self.ref_frame = ref_frame
829        self.path_select = path_select
830
831        self.top_name = self.u_original.filename
832        self.trj_name = self.u_original.trajectory.filename
833        self.newtrj_name = None
834        self.u_fitted = None
835        self.path = None
836        self.natoms = None
837
838
839    def fit_to_reference(self, filename=None, prefix='', postfix='_fit',
840                         rmsdfile=None, targetdir=os.path.curdir,
841                         weights=None, tol_mass=0.1):
842        """Align each trajectory frame to the reference structure
843
844        Parameters
845        ----------
846        filename : str (optional)
847             file name for the RMS-fitted trajectory or pdb; defaults to the
848             original trajectory filename (from :attr:`Path.u_original`) with
849             `prefix` prepended
850        prefix : str (optional)
851             prefix for auto-generating the new output filename
852        rmsdfile : str (optional)
853             file name for writing the RMSD time series [``None``]
854        weights : {"mass", ``None``} or array_like (optional)
855             choose weights. With ``"mass"`` uses masses as weights; with
856             ``None`` weigh each atom equally. If a float array of the same
857             length as the selected AtomGroup is provided, use each element of
858             the `array_like` as a weight for the corresponding atom in the
859             AtomGroup.
860        tol_mass : float (optional)
861             Reject match if the atomic masses for matched atoms differ by more
862             than `tol_mass` [0.1]
863
864        Returns
865        -------
866        Universe
867            :class:`MDAnalysis.Universe` object containing a fitted trajectory
868
869        Notes
870        -----
871        Uses :class:`MDAnalysis.analysis.align.AlignTraj` for the fitting.
872
873
874        .. deprecated:: 0.16.1
875           Instead of ``mass_weighted=True`` use new ``weights='mass'``;
876           refactored to fit with AnalysisBase API
877
878        .. versionchanged:: 0.17.0
879           Deprecated keyword `mass_weighted` was removed.
880        """
881        head, tail = os.path.split(self.trj_name)
882        oldname, ext = os.path.splitext(tail)
883        filename = filename or oldname
884        self.newtrj_name = os.path.join(targetdir, filename + postfix + ext)
885        self.u_reference.trajectory[self.ref_frame] # select frame from ref traj
886        aligntrj = MDAnalysis.analysis.align.AlignTraj(self.u_original,
887                                                       self.u_reference,
888                                                       select=self.ref_select,
889                                                       filename=self.newtrj_name,
890                                                       prefix=prefix,
891                                                       weights=weights,
892                                                       tol_mass=tol_mass).run()
893        if rmsdfile is not None:
894            aligntrj.save(rmsdfile)
895        return MDAnalysis.Universe(self.top_name, self.newtrj_name)
896
897
898    def to_path(self, fitted=False, select=None, flat=False):
899        r"""Generates a coordinate time series from the fitted universe
900        trajectory.
901
902        Given a selection of *N* atoms from *select*, the atomic positions for
903        each frame in the fitted universe (:attr:`Path.u_fitted`) trajectory
904        (with |Np| total frames) are appended sequentially to form a 3D or 2D
905        (if *flat* is ``True``) :class:`numpy.ndarray` representation of the
906        fitted trajectory (with dimensions |3D| or |2D|, respectively).
907
908        Parameters
909        ----------
910        fitted : bool (optional)
911             construct a :attr:`Path.path` from the :attr:`Path.u_fitted`
912             trajectory; if ``False`` then :attr:`Path.path` is generated with
913             the trajectory from :attr:`Path.u_original` [``False``]
914        select : str (optional)
915             the selection for constructing the coordinates of each frame in
916             :attr:`Path.path`; if ``None`` then :attr:`Path.path_select`
917             is used, else it is overridden by *select* [``None``]
918        flat : bool (optional)
919             represent :attr:`Path.path` as a 2D (|2D|) :class:`numpy.ndarray`;
920             if ``False`` then :attr:`Path.path` is a 3D (|3D|)
921             :class:`numpy.ndarray` [``False``]
922
923        Returns
924        -------
925        numpy.ndarray
926              representing a time series of atomic positions of an
927              :class:`MDAnalysis.core.groups.AtomGroup` selection from
928              :attr:`Path.u_fitted.trajectory`
929
930        """
931        select = select if select is not None else self.path_select
932        if fitted:
933            if not isinstance(self.u_fitted, MDAnalysis.Universe):
934                raise TypeError("Fitted universe not found. Generate a fitted " +
935                        "universe with fit_to_reference() first, or explicitly "+
936                        "set argument \"fitted\" to \"False\" to generate a "   +
937                        "path from the original universe.")
938            u = self.u_fitted
939        else:
940            u = self.u_original
941        frames = u.trajectory
942        atoms = u.select_atoms(select)
943        self.natoms = len(atoms)
944        frames.rewind()
945        if flat:
946            return np.array([atoms.positions.flatten() for _ in frames])
947        else:
948            return np.array([atoms.positions for _ in frames])
949
950
951    def run(self, align=False, filename=None, postfix='_fit', rmsdfile=None,
952            targetdir=os.path.curdir, weights=None, tol_mass=0.1,
953            flat=False):
954        r"""Generate a path from a trajectory and reference structure.
955
956        As part of the path generation, the trajectory can be superimposed
957        ("aligned") to a reference structure if specified.
958
959        This is a convenience method to generate a fitted trajectory from an
960        inputted universe (:attr:`Path.u_original`) and reference structure
961        (:attr:`Path.u_reference`). :meth:`Path.fit_to_reference` and
962        :meth:`Path.to_path` are used consecutively to generate a new universe
963        (:attr:`Path.u_fitted`) containing the fitted trajectory along with the
964        corresponding :attr:`Path.path` represented as an
965        :class:`numpy.ndarray`. The method returns a tuple of the topology name
966        and new trajectory name, which can be fed directly into an
967        :class:`MDAnalysis.Universe` object after unpacking the tuple using the
968        ``*`` operator, as in
969        ``MDAnalysis.Universe(*(top_name, newtraj_name))``.
970
971        Parameters
972        ----------
973        align : bool (optional)
974             Align trajectory to atom selection :attr:`Path.ref_select` of
975             :attr:`Path.u_reference`. If ``True``, a universe containing an
976             aligned trajectory is produced with :meth:`Path.fit_to_reference`
977             [``False``]
978        filename : str (optional)
979             filename for the RMS-fitted trajectory or pdb; defaults to the
980             original trajectory filename (from :attr:`Path.u_original`) with
981             *prefix* prepended
982        postfix : str (optional)
983             prefix for auto-generating the new output filename
984        rmsdfile : str (optional)
985             file name for writing the RMSD time series [``None``]
986        weights : {"mass", ``None``} or array_like (optional)
987             choose weights. With ``"mass"`` uses masses as weights; with
988             ``None`` weigh each atom equally. If a float array of the same
989             length as the selected AtomGroup is provided, use each element of
990             the `array_like` as a weight for the corresponding atom in the
991             AtomGroup.
992        tol_mass : float (optional)
993             Reject match if the atomic masses for matched atoms differ by more
994             than *tol_mass* [0.1]
995        flat : bool (optional)
996             represent :attr:`Path.path` with 2D (|2D|) :class:`numpy.ndarray`;
997             if ``False`` then :attr:`Path.path` is a 3D (|3D|)
998             :class:`numpy.ndarray` [``False``]
999
1000        Returns
1001        -------
1002        topology_trajectory : tuple
1003             A tuple of the topology name and new trajectory name.
1004
1005
1006        .. deprecated:: 0.16.1
1007           Instead of ``mass_weighted=True`` use new ``weights='mass'``;
1008           refactored to fit with AnalysisBase API
1009
1010        .. versionchanged:: 0.17.0
1011           Deprecated keyword `mass_weighted` was removed.
1012        """
1013        if align:
1014            self.u_fitted = self.fit_to_reference(
1015                                filename=filename, postfix=postfix,
1016                                rmsdfile=rmsdfile, targetdir=targetdir,
1017                                weights=weights, tol_mass=0.1)
1018        self.path = self.to_path(fitted=align, flat=flat)
1019        return self.top_name, self.newtrj_name
1020
1021
1022    def get_num_atoms(self):
1023        """Return the number of atoms used to construct the :class:`Path`.
1024
1025        Must run :meth:`Path.to_path` prior to calling this method.
1026
1027        Returns
1028        -------
1029        int
1030            the number of atoms in the :class:`Path`
1031
1032
1033        """
1034        if self.natoms is None:
1035            raise ValueError("No path data; do 'Path.to_path()' first.")
1036        return self.natoms
1037
1038
1039class PSAPair(object):
1040    """Generate nearest neighbor and Hausdorff pair information between a pair
1041    of paths from an all-pairs comparison generated by :class:`PSA`.
1042
1043    The nearest neighbors for each path of a pair of paths is generated by
1044    :meth:`PSAPair.compute_nearest_neighbors` and stores the result
1045    in a dictionary (:attr:`nearest_neighbors`): each path has a
1046    :class:`numpy.ndarray` of the frames of its nearest neighbors, and a
1047    :class:`numpy.ndarray` of its nearest neighbor distances
1048    :attr:`PSAnalysis.D`. For example, *nearest_neighbors['frames']* is a pair
1049    of :class:`numpy.ndarray`, the first being the frames of the nearest
1050    neighbors of the first path, *i*, the second being those of the second path,
1051    *j*.
1052
1053    The Hausdorff pair for the pair of paths is found by calling
1054    :meth:`find_hausdorff_pair` (locates the nearest neighbor pair having the
1055    largest overall distance separating them), which stores the result in a
1056    dictionary (:attr:`hausdorff_pair`) containing the frames (indices) of the
1057    pair along with the corresponding (Hausdorff) distance.
1058    *hausdorff_pair['frame']* contains a pair of frames in the first path, *i*,
1059    and the second path, *j*, respectively, that correspond to the Hausdorff
1060    distance between them.
1061
1062    .. versionadded:: 0.11
1063    """
1064
1065    def __init__(self, npaths, i, j):
1066        """Set up a :class:`PSAPair` for a pair of paths that are part of a
1067        :class:`PSA` comparison of *npaths* total paths.
1068
1069        Each unique pair of paths compared using :class:`PSA` is related by
1070        their nearest neighbors (and corresponding distances) and the Hausdorff
1071        pair and distance. :class:`PSAPair` is a convenience class for
1072        calculating and encapsulating nearest neighbor and Hausdorff pair
1073        information for one pair of paths.
1074
1075        Given *npaths*, :class:`PSA` performs and all-pairs comparison among all
1076        paths for a total of :math:`\text{npaths}*(\text{npaths}-1)/2` unique
1077        comparisons. If distances between paths are computed, the all-pairs
1078        comparison can be summarized in a symmetric distance matrix whose upper
1079        triangle can be mapped to a corresponding distance vector form in a
1080        one-to-one manner. A particular comparison of a pair of paths in a
1081        given instance of :class:`PSAPair` is thus unique identified by the row
1082        and column indices in the distance matrix representation (whether or not
1083        distances are actually computed), or a single ID (index) in the
1084        corresponding distance vector.
1085
1086        Parameters
1087        ----------
1088        npaths : int
1089            total number of paths in :class:`PSA` used to generate *this*
1090            :class:`PSAPair`
1091        i : int
1092             row index (starting at 0) of the distance matrix
1093        j : int
1094             column index (starting at 0) of the distance matrix
1095        """
1096        self.npaths = npaths
1097        self.matrix_idx = (i,j)
1098        self.pair_idx = self._dvec_idx(i,j)
1099
1100        # Set by calling hausdorff_nn
1101        self.nearest_neighbors = {'frames' : None, 'distances' : None}
1102
1103        # Set by self.getHausdorffPair
1104        self.hausdorff_pair = {'frames' : (None, None), 'distance' : None}
1105
1106
1107    def _dvec_idx(self, i, j):
1108        """Convert distance matrix indices (in the upper triangle) to the index
1109        of the corresponding distance vector.
1110
1111        This is a convenience function to locate distance matrix elements (and
1112        the pair generating it) in the corresponding distance vector. The row
1113        index *j* should be greater than *i+1*, corresponding to the upper
1114        triangle of the distance matrix.
1115
1116        Parameters
1117        ----------
1118        i : int
1119            row index (starting at 0) of the distance matrix
1120        j : int
1121            column index (starting at 0) of the distance matrix
1122
1123        Returns
1124        -------
1125        int
1126             (matrix element) index in the corresponding distance vector
1127        """
1128        return (self.npaths*i) + j - (i+2)*(i+1)/2
1129
1130
1131    def compute_nearest_neighbors(self, P,Q, N=None):
1132        """Generates Hausdorff nearest neighbor lists of *frames* (by index) and
1133        *distances* for *this* pair of paths corresponding to distance matrix
1134        indices (*i*,*j*).
1135
1136        :meth:`PSAPair.compute_nearest_neighbors` calls
1137        :func:`hausdorff_neighbors` to populate the dictionary of the nearest
1138        neighbor lists of frames (by index) and distances
1139        (:attr:`PSAPair.nearest_neighbors`). This method must explicitly take as
1140        arguments a pair of paths, *P* and *Q*, where *P* is the
1141        :math:`i^\text{th}` path and *Q* is the :math:`j^\text{th}` path among
1142        the set of *N* total paths in the comparison.
1143
1144        Parameters
1145        ----------
1146        P : numpy.ndarray
1147            representing a path
1148        Q : numpy.ndarray
1149            representing a path
1150        N : int
1151            size of the distance matrix (of shape *N*-by-*N*) [``None``]
1152
1153        """
1154        hn = hausdorff_neighbors(P, Q)
1155        self.nearest_neighbors['frames'] = hn['frames']
1156        self.nearest_neighbors['distances'] = hn['distances']
1157
1158
1159    def find_hausdorff_pair(self):
1160        r"""Find the Hausdorff pair (of frames) for *this* pair of paths.
1161
1162        :meth:`PSAPair.find_hausdorff_pair` requires that
1163        `:meth:`PSAPair.compute_nearest_neighbors` be called first to
1164        generate the nearest neighbors (and corresponding distances) for each
1165        path in *this* :class:`PSAPair`. The Hausdorff pair is the nearest
1166        neighbor pair (of snapshots/frames), one in the first path and one in
1167        the second, with the largest separation distance.
1168        """
1169        if self.nearest_neighbors['distances'] is None:
1170            raise NoDataError("Nearest neighbors have not been calculated yet;"
1171                              " run compute_nearest_neighbors() first.")
1172
1173        nn_idx_P, nn_idx_Q = self.nearest_neighbors['frames']
1174        nn_dist_P, nn_dist_Q = self.nearest_neighbors['distances']
1175        max_nn_dist_P = max(nn_dist_P)
1176        max_nn_dist_Q = max(nn_dist_Q)
1177        if max_nn_dist_P > max_nn_dist_Q:
1178            max_nn_idx_P = np.argmax(nn_dist_P)
1179            self.hausdorff_pair['frames'] = max_nn_idx_P, nn_idx_P[max_nn_idx_P]
1180            self.hausdorff_pair['distance']  = max_nn_dist_P
1181        else:
1182            max_nn_idx_Q = np.argmax(nn_dist_Q)
1183            self.hausdorff_pair['frames'] = nn_idx_Q[max_nn_idx_Q], max_nn_idx_Q
1184            self.hausdorff_pair['distance'] = max_nn_dist_Q
1185
1186
1187    def get_nearest_neighbors(self, frames=True, distances=True):
1188        """Returns the nearest neighbor frame indices, distances, or both, for
1189        each path in *this* :class:`PSAPair`.
1190
1191        :meth:`PSAPair.get_nearest_neighbors` requires that the nearest
1192        neighbors (:attr:`nearest_neighbors`) be initially computed by first
1193        calling :meth:`compute_nearest_neighbors`. At least one of *frames*
1194        or *distances* must be ``True``, or else a ``NoDataError`` is raised.
1195
1196        Parameters
1197        ----------
1198        frames :  bool
1199             if ``True``, return nearest neighbor frame indices
1200             [``True``]
1201         distances : bool
1202             if ``True``, return nearest neighbor distances [``True``]
1203
1204        Returns
1205        -------
1206        dict or tuple
1207             If both *frames* and *distances* are ``True``, return the entire
1208             dictionary (:attr:`nearest_neighbors`); if only *frames* is
1209             ``True``, return a pair of :class:`numpy.ndarray` containing the
1210             indices of the frames (for the pair of paths) of the nearest
1211             neighbors; if only *distances* is ``True``, return a pair of
1212             :class:`numpy.ndarray` of the nearest neighbor distances (for the
1213             pair of paths).
1214
1215        """
1216        if self.nearest_neighbors['distances'] is None:
1217            raise NoDataError("Nearest neighbors have not been calculated yet;"
1218                              " run compute_nearest_neighbors() first.")
1219
1220        if frames:
1221            if distances:
1222                return self.nearest_neighbors
1223            else:
1224                return self.nearest_neighbors['frames']
1225        elif distances:
1226            return self.nearest_neighbors['distances']
1227        else:
1228            raise NoDataError('Need to select Hausdorff pair "frames" or'
1229                              ' "distances" or both. "frames" and "distances"'
1230                              ' cannot both be set to False.')
1231
1232    def get_hausdorff_pair(self, frames=True, distance=True):
1233        """Returns the Hausdorff pair of frames indices, the Hausdorff distance,
1234        or both, for the paths in *this* :class:`PSAPair`.
1235
1236        :meth:`PSAPair.get_hausdorff_pair` requires that the Hausdorff pair
1237        (and distance) be initially found by first calling
1238        :meth:`find_hausdorff_pair`. At least one of *frames* or *distance*
1239        must be ``True``, or else a ``NoDataError`` is raised.
1240
1241        Parameters
1242        ----------
1243        frames : bool
1244             if ``True``, return the indices of the frames
1245             of the Hausdorff pair [``True``]
1246        distances : bool
1247             if ``True``, return Hausdorff distance [``True``]
1248
1249        Returns
1250        -------
1251        dict or tuple
1252             If both *frames* and *distance* are ``True``, return the entire
1253             dictionary (:attr:`hausdorff_pair`); if only *frames* is
1254             ``True``, return a pair of ``int`` containing the indices of the
1255             frames (one index per path) of the Hausdorff pair; if only *distance*
1256             is ``True``, return the Hausdorff distance for this path pair.
1257        """
1258        if self.hausdorff_pair['distance'] is None:
1259            raise NoDataError("Hausdorff pair has not been calculated yet;"
1260                              " run find_hausdorff_pair() first.")
1261
1262        if frames:
1263            if distance:
1264                return self.hausdorff_pair
1265            else:
1266                return self.hausdorff_pair['frames']
1267        elif distance:
1268            return self.hausdorff_pair['distance']
1269        else:
1270            raise NoDataError('Need to select Hausdorff pair "frames" or'
1271                              ' "distance" or both. "frames" and "distance"'
1272                              ' cannot both be set to False.')
1273
1274
1275class PSAnalysis(object):
1276    """Perform Path Similarity Analysis (PSA) on a set of trajectories.
1277
1278    The analysis is performed with :meth:`PSAnalysis.run` and stores the result
1279    in the :class:`numpy.ndarray` distance matrix :attr:`PSAnalysis.D`.
1280    :meth:`PSAnalysis.run` also generates a fitted trajectory and path from
1281    alignment of the original trajectories to a reference structure.
1282
1283    .. versionadded:: 0.8
1284    """
1285    def __init__(self, universes, reference=None, ref_select='name CA',
1286                 ref_frame=0, path_select=None, labels=None,
1287                 targetdir=os.path.curdir):
1288        """Setting up Path Similarity Analysis.
1289
1290        The mutual similarity between all unique pairs of trajectories
1291        are computed using a selected path metric.
1292
1293        Parameters
1294        ----------
1295        universes : list
1296             a list of universes (:class:`MDAnalysis.Universe` object), each
1297             containing a trajectory
1298        reference : Universe
1299             reference coordinates; :class:`MDAnalysis.Universe` object; if
1300             ``None`` the first time step of the first item in `universes` is used
1301             [``None``]
1302        ref_select : str or dict or tuple
1303             The selection to operate on; can be one of:
1304
1305             1. any valid selection string for
1306                :meth:`~MDAnalysis.core.groups.AtomGroup.select_atoms` that
1307                produces identical selections in *mobile* and *reference*; or
1308             2. a dictionary ``{'mobile':sel1, 'reference':sel2}`` (the
1309                :func:`MDAnalysis.analysis.align.fasta2select` function returns
1310                such a dictionary based on a ClustalW_ or STAMP_ sequence
1311                alignment); or
1312             3. a tuple ``(sel1, sel2)``
1313
1314             When using 2. or 3. with *sel1* and *sel2* then these selections
1315             can also each be a list of selection strings (to generate an
1316             AtomGroup with defined atom order as described under
1317             :ref:`ordered-selections-label`).
1318        tol_mass : float
1319             Reject match if the atomic masses for matched atoms differ by more
1320             than *tol_mass* [0.1]
1321        ref_frame : int
1322             frame index to select frame from *reference* [0]
1323        path_select : str
1324             atom selection composing coordinates of (fitted) path; if ``None``
1325             then *path_select* is set to *ref_select* [``None``]
1326        targetdir : str
1327            output files are saved there; if ``None`` then "./psadata" is
1328            created and used [.]
1329        labels : list
1330             list of strings, names of trajectories to be analyzed
1331             (:class:`MDAnalysis.Universe`); if ``None``, defaults to trajectory
1332             names [``None``]
1333
1334
1335        .. _ClustalW: http://www.clustal.org/
1336        .. _STAMP: http://www.compbio.dundee.ac.uk/manuals/stamp.4.2/
1337
1338        """
1339        self.universes = universes
1340        self.u_reference = self.universes[0] if reference is None else reference
1341        self.ref_select = ref_select
1342        self.ref_frame = ref_frame
1343        self.path_select = self.ref_select if path_select is None else path_select
1344        if targetdir is None:
1345            try:
1346                targetdir = os.path.join(os.path.curdir, 'psadata')
1347                os.makedirs(targetdir)
1348            except OSError:
1349                if not os.path.isdir(targetdir):
1350                    raise
1351        self.targetdir = os.path.realpath(targetdir)
1352
1353        # Set default directory names for storing topology/reference structures,
1354        # fitted trajectories, paths, distance matrices, and plots
1355        self.datadirs = {'fitted_trajs' : '/fitted_trajs',
1356                         'paths' : '/paths',
1357                         'distance_matrices' : '/distance_matrices',
1358                         'plots' : '/plots'}
1359        for dir_name, directory in six.iteritems(self.datadirs):
1360            try:
1361                full_dir_name = os.path.join(self.targetdir, dir_name)
1362                os.makedirs(full_dir_name)
1363            except OSError:
1364                if not os.path.isdir(full_dir_name):
1365                    raise
1366
1367        # Keep track of topology, trajectory, and related files
1368        trj_names = []
1369        for i, u in enumerate(self.universes):
1370            head, tail = os.path.split(u.trajectory.filename)
1371            filename, ext = os.path.splitext(tail)
1372            trj_names.append(filename)
1373        self.trj_names = trj_names
1374        self.fit_trj_names = None
1375        self.path_names = None
1376        self.top_name = self.universes[0].filename if len(universes) != 0 else None
1377        self.labels = labels or self.trj_names
1378
1379        # Names of persistence (pickle) files where topology and trajectory
1380        # filenames are stored--should not be modified by user
1381        self._top_pkl = os.path.join(self.targetdir, "psa_top-name.pkl")
1382        self._trjs_pkl = os.path.join(self.targetdir, "psa_orig-traj-names.pkl")
1383        self._fit_trjs_pkl = os.path.join(self.targetdir, "psa_fitted-traj-names.pkl")
1384        self._paths_pkl = os.path.join(self.targetdir, "psa_path-names.pkl")
1385        self._labels_pkl = os.path.join(self.targetdir, "psa_labels.pkl")
1386        # Pickle topology and trajectory filenames for this analysis to curdir
1387        with open(self._top_pkl, 'wb') as output:
1388            cPickle.dump(self.top_name, output)
1389        with open(self._trjs_pkl, 'wb') as output:
1390            cPickle.dump(self.trj_names, output)
1391        with open(self._labels_pkl, 'wb') as output:
1392            cPickle.dump(self.labels, output)
1393
1394        self.natoms = None
1395        self.npaths = None
1396        self.paths = None
1397        self.D = None   # pairwise distances
1398        self._HP = None # (distance vector order) list of all Hausdorff pairs
1399        self._NN = None # (distance vector order) list of all nearest neighbors
1400        self._psa_pairs = None # (distance vector order) list of all PSAPairs
1401
1402
1403    def generate_paths(self, align=False, filename='fitted', infix='', weights=None,
1404                       tol_mass=False, ref_frame=None, flat=False, save=True, store=True):
1405        """Generate paths, aligning each to reference structure if necessary.
1406
1407        Parameters
1408        ----------
1409        align : bool
1410             Align trajectories to atom selection :attr:`PSAnalysis.ref_select`
1411             of :attr:`PSAnalysis.u_reference` [``False``]
1412        filename : str
1413             strings representing base filename for fitted trajectories and
1414             paths [``None``]
1415        infix : str
1416             additional tag string that is inserted into the output filename of
1417             the fitted trajectory files ['']
1418        weights : {"mass", ``None``} or array_like (optional)
1419             choose weights. With ``"mass"`` uses masses as weights; with
1420             ``None`` weigh each atom equally. If a float array of the same
1421             length as the selected AtomGroup is provided, use each element of
1422             the `array_like` as a weight for the corresponding atom in the
1423             AtomGroup.
1424        tol_mass : float
1425             Reject match if the atomic masses for matched atoms differ by more
1426             than *tol_mass*
1427        ref_frame : int
1428             frame index to select frame from *reference*
1429        flat : bool
1430             represent :attr:`Path.path` as a 2D (|2D|) :class:`numpy.ndarray`;
1431             if ``False`` then :attr:`Path.path` is a 3D (|3D|)
1432             :class:`numpy.ndarray` [``False``]
1433        save : bool
1434             if ``True``, pickle list of names for fitted trajectories
1435             [``True``]
1436        store : bool
1437             if ``True`` then writes each path (:class:`numpy.ndarray`)
1438             in :attr:`PSAnalysis.paths` to compressed npz (numpy) files
1439             [``False``]
1440
1441
1442        The fitted trajectories are written to new files in the
1443        "/trj_fit" subdirectory in :attr:`PSAnalysis.targetdir` named
1444        "filename(*trajectory*)XXX*infix*_psa", where "XXX" is a number between
1445        000 and 999; the extension of each file is the same as its original.
1446        Optionally, the trajectories can also be saved in numpy compressed npz
1447        format in the "/paths" subdirectory in :attr:`PSAnalysis.targetdir` for
1448        persistence and can be accessed as the attribute
1449        :attr:`PSAnalysis.paths`.
1450
1451
1452        .. deprecated:: 0.16.1
1453           Instead of ``mass_weighted=True`` use new ``weights='mass'``;
1454           refactored to fit with AnalysisBase API
1455
1456        .. versionchanged:: 0.17.0
1457           Deprecated keyword `mass_weighted` was removed.
1458        """
1459        if ref_frame is None:
1460            ref_frame = self.ref_frame
1461
1462        paths = []
1463        fit_trj_names = []
1464        for i, u in enumerate(self.universes):
1465            p = Path(u, self.u_reference, ref_select=self.ref_select,
1466                     path_select=self.path_select, ref_frame=ref_frame)
1467            trj_dir = self.targetdir + self.datadirs['fitted_trajs']
1468            postfix = '{0}{1}{2:03n}'.format(infix, '_psa', i+1)
1469            top_name, fit_trj_name = p.run(align=align, filename=filename,
1470                                           postfix=postfix,
1471                                           targetdir=trj_dir,
1472                                           weights=weights,
1473                                           tol_mass=tol_mass, flat=flat)
1474            paths.append(p.path)
1475            fit_trj_names.append(fit_trj_name)
1476        self.natoms, axis = get_coord_axes(paths[0])
1477        self.paths = paths
1478        self.npaths = len(paths)
1479        self.fit_trj_names = fit_trj_names
1480        if save:
1481            with open(self._fit_trjs_pkl, 'wb') as output:
1482                cPickle.dump(self.fit_trj_names, output)
1483        if store:
1484            self.save_paths(filename=filename)
1485
1486
1487    def run(self, **kwargs):
1488        """Perform path similarity analysis on the trajectories to compute
1489        the distance matrix.
1490
1491        A number of parameters can be changed from the defaults. The
1492        result is stored as the array :attr:`PSAnalysis.D`.
1493
1494        Parameters
1495        ----------
1496        metric : str or callable
1497             selection string specifying the path metric to measure pairwise
1498             distances among :attr:`PSAnalysis.paths` or a callable with the
1499             same call signature as :func:`hausdorff`
1500             [``'hausdorff'``]
1501        start : int
1502             `start` and `stop` frame index with `step` size: analyze
1503             ``trajectory[start:stop:step]`` [``None``]
1504        stop : int
1505        step : int
1506        store : bool
1507             if ``True`` then writes :attr:`PSAnalysis.D` to text and
1508             compressed npz (numpy) files [``True``]
1509
1510             .. deprecated:: 0.19.0
1511                `store` will be removed together with :meth:`save_results` in 1.0.0.
1512
1513        filename : str
1514             string, filename to save :attr:`PSAnalysis.D`
1515
1516             .. deprecated:: 0.19.0
1517                `filename` will be removed together with :meth:`save_results` in 1.0.0.
1518
1519
1520        """
1521        metric = kwargs.pop('metric', 'hausdorff')
1522        start = kwargs.pop('start', None)
1523        stop = kwargs.pop('stop', None)
1524        step = kwargs.pop('step', None)
1525        # DEPRECATED 0.19.0: remove in 1.0
1526        if 'store' in kwargs:
1527            warnings.warn("PSAnalysis.run(): 'store' was deprecated in 0.19.0 "
1528                          "and will be removed in 1.0",
1529                          category=DeprecationWarning)
1530        store = kwargs.pop('store', True)
1531
1532        if isinstance(metric, string_types):
1533            metric_func = get_path_metric_func(str(metric))
1534        else:
1535            metric_func = metric
1536        numpaths = self.npaths
1537        D = np.zeros((numpaths,numpaths))
1538
1539        for i in range(0, numpaths-1):
1540            for j in range(i+1, numpaths):
1541                P = self.paths[i][start:stop:step]
1542                Q = self.paths[j][start:stop:step]
1543                D[i,j] = metric_func(P, Q)
1544                D[j,i] = D[i,j]
1545        self.D = D
1546        if store:
1547            # DEPRECATED 0.19.0: remove in 1.0
1548            if 'filename' in kwargs:
1549                warnings.warn("PSAnalysis.run(): 'filename' was deprecated in "
1550                              "0.19.0 and will be removed in 1.0",
1551                              category=DeprecationWarning)
1552            filename = kwargs.pop('filename', metric)
1553            if not isinstance(metric, string_types):
1554                filename = 'custom_metric'
1555            self.save_result(filename=filename)
1556
1557    def run_pairs_analysis(self, **kwargs):
1558        """Perform PSA Hausdorff (nearest neighbor) pairs analysis on all unique
1559        pairs of paths in :attr:`PSAnalysis.paths`.
1560
1561        Partial results can be stored in separate lists, where each list is
1562        indexed according to distance vector convention (i.e., element *(i,j)*
1563        in distance matrix representation corresponds to element
1564        :math:`s=N*i+j-(i+1)*(i+2)` in distance vector representation, which is
1565        the :math:`s^\text{th}` comparison). For each unique pair of paths, the
1566        nearest neighbors for that pair can be stored in :attr:`NN` and the
1567        Hausdorff pair in :attr:`HP`. :attr:`PP` stores the full information
1568        of Hausdorff pairs analysis that is available for each pair of path,
1569        including nearest neighbors lists and the Hausdorff pairs.
1570
1571        The pairwise distances are stored as the array :attr:`PSAnalysis.D`.
1572
1573        Parameters
1574        ----------
1575        start : int
1576             `start` and `stop` frame index with `step` size: analyze
1577             ``trajectory[start:stop:step]`` [``None``]
1578        stop : int
1579        step : int
1580        neighbors : bool
1581             if ``True``, then stores dictionary of nearest neighbor
1582             frames/distances in :attr:`PSAnalysis.NN` [``False``]
1583        hausdorff_pairs : bool
1584             if ``True``, then stores dictionary of Hausdorff pair
1585             frames/distances in :attr:`PSAnalysis.HP` [``False``]
1586        """
1587        start = kwargs.pop('start', None)
1588        stop = kwargs.pop('stop', None)
1589        step = kwargs.pop('step', None)
1590        neighbors = kwargs.pop('neighbors', False)
1591        hausdorff_pairs = kwargs.pop('hausdorff_pairs', False)
1592
1593        numpaths = self.npaths
1594        D = np.zeros((numpaths,numpaths))
1595        self._NN = [] # list of nearest neighbors pairs
1596        self._HP = [] # list of Hausdorff pairs
1597        self._psa_pairs = [] # list of PSAPairs
1598
1599        for i in range(0, numpaths-1):
1600            for j in range(i+1, numpaths):
1601                pp = PSAPair(i, j, numpaths)
1602                P = self.paths[i][start:stop:step]
1603                Q = self.paths[j][start:stop:step]
1604                pp.compute_nearest_neighbors(P, Q, self.natoms)
1605                pp.find_hausdorff_pair()
1606                D[i,j] = pp.hausdorff_pair['distance']
1607                D[j,i] = D[i,j]
1608                self._psa_pairs.append(pp)
1609                if neighbors:
1610                    self._NN.append(pp.get_nearest_neighbors())
1611                if hausdorff_pairs:
1612                    self._HP.append(pp.get_hausdorff_pair())
1613        self.D = D
1614
1615    @deprecate(release="0.19.0", remove="1.0.0",
1616               message="You can save the distance matrix :attr:`D` to a numpy "
1617               "file with ``np.save(filename, PSAnalysis.D)``.")
1618    def save_result(self, filename=None):
1619        """Save distance matrix :attr:`PSAnalysis.D` to a numpy compressed npz
1620        file and text file.
1621
1622        The data are saved with :func:`numpy.savez_compressed` and
1623        :func:`numpy.savetxt` in the directory specified by
1624        :attr:`PSAnalysis.targetdir`.
1625
1626        Parameters
1627        ----------
1628        filename : str
1629             specifies filename [``None``]
1630
1631        Returns
1632        -------
1633        filename : str
1634
1635        """
1636        filename = filename or 'psa_distances'
1637        head = self.targetdir + self.datadirs['distance_matrices']
1638        outfile = os.path.join(head, filename)
1639        if self.D is None:
1640            raise NoDataError("Distance matrix has not been calculated yet")
1641        np.save(outfile + '.npy', self.D)
1642        np.savetxt(outfile + '.dat', self.D)
1643        logger.info("Wrote distance matrix to file %r.npz", outfile)
1644        logger.info("Wrote distance matrix to file %r.dat", outfile)
1645        return filename
1646
1647
1648    def save_paths(self, filename=None):
1649        """Save fitted :attr:`PSAnalysis.paths` to numpy compressed npz files.
1650
1651        The data are saved with :func:`numpy.savez_compressed` in the directory
1652        specified by :attr:`PSAnalysis.targetdir`.
1653
1654        Parameters
1655        ----------
1656        filename : str
1657             specifies filename [``None``]
1658
1659        Returns
1660        -------
1661        filename : str
1662
1663        """
1664        filename = filename or 'path_psa'
1665        head = self.targetdir + self.datadirs['paths']
1666        outfile = os.path.join(head, filename)
1667        if self.paths is None:
1668            raise NoDataError("Paths have not been calculated yet")
1669        path_names = []
1670        for i, path in enumerate(self.paths):
1671            current_outfile = "{0}{1:03n}.npy".format(outfile, i+1)
1672            np.save(current_outfile, self.paths[i])
1673            path_names.append(current_outfile)
1674            logger.info("Wrote path to file %r", current_outfile)
1675        self.path_names = path_names
1676        with open(self._paths_pkl, 'wb') as output:
1677            cPickle.dump(self.path_names, output)
1678        return filename
1679
1680
1681    def load(self):
1682        """Load fitted paths specified by 'psa_path-names.pkl' in
1683        :attr:`PSAnalysis.targetdir`.
1684        """
1685        if not os.path.exists(self._paths_pkl):
1686            raise NoDataError("Fitted trajectories cannot be loaded; save file" +
1687                              "{0} does not exist.".format(self._paths_pkl))
1688        self.path_names = np.load(self._paths_pkl)
1689        self.paths = [np.load(pname) for pname in self.path_names]
1690        if os.path.exists(self._labels_pkl):
1691            self.labels = np.load(self._labels_pkl)
1692        print("Loaded paths from " + self._paths_pkl)
1693
1694
1695    def plot(self, filename=None, linkage='ward', count_sort=False,
1696             distance_sort=False, figsize=4.5, labelsize=12):
1697        """Plot a clustered distance matrix.
1698
1699        Usese method *linkage* and plots the corresponding dendrogram. Rows
1700        (and columns) are identified using the list of strings specified by
1701        :attr:`PSAnalysis.labels`.
1702
1703        If `filename` is supplied then the figure is also written to file (the
1704        suffix determines the file type, e.g. pdf, png, eps, ...). All other
1705        keyword arguments are passed on to :func:`matplotlib.pyplot.matshow`.
1706
1707
1708        Parameters
1709        ----------
1710        filename : str
1711             save figure to *filename* [``None``]
1712        linkage : str
1713             name of linkage criterion for clustering [``'ward'``]
1714        count_sort : bool
1715             see :func:`scipy.cluster.hierarchy.dendrogram` [``False``]
1716        distance_sort : bool
1717             see :func:`scipy.cluster.hierarchy.dendrogram` [``False``]
1718        figsize : float
1719             set the vertical size of plot in inches [``4.5``]
1720        labelsize : float
1721             set the font size for colorbar labels; font size for path labels on
1722             dendrogram default to 3 points smaller [``12``]
1723
1724        Returns
1725        -------
1726        Z
1727          `Z` from :meth:`cluster`
1728        dgram
1729          `dgram` from :meth:`cluster`
1730        dist_matrix_clus
1731          clustered distance matrix (reordered)
1732
1733        """
1734        from matplotlib.pyplot import figure, colorbar, cm, savefig, clf
1735
1736        if self.D is None:
1737            raise ValueError(
1738                "No distance data; do 'PSAnalysis.run(store=True)' first.")
1739        npaths = len(self.D)
1740        dist_matrix = self.D
1741
1742        dgram_loc, hmap_loc, cbar_loc = self._get_plot_obj_locs()
1743        aspect_ratio = 1.25
1744        clf()
1745        fig = figure(figsize=(figsize*aspect_ratio, figsize))
1746        ax_hmap = fig.add_axes(hmap_loc)
1747        ax_dgram = fig.add_axes(dgram_loc)
1748
1749        Z, dgram = self.cluster(method=linkage,                                 \
1750                                count_sort=count_sort,                          \
1751                                distance_sort=distance_sort)
1752        rowidx = colidx = dgram['leaves'] # get row-wise ordering from clustering
1753        ax_dgram.invert_yaxis() # Place origin at up left (from low left)
1754
1755        minDist, maxDist = 0, np.max(dist_matrix)
1756        dist_matrix_clus = dist_matrix[rowidx,:]
1757        dist_matrix_clus = dist_matrix_clus[:,colidx]
1758        im = ax_hmap.matshow(dist_matrix_clus, aspect='auto', origin='lower',   \
1759                    cmap=cm.YlGn, vmin=minDist, vmax=maxDist)
1760        ax_hmap.invert_yaxis() # Place origin at upper left (from lower left)
1761        ax_hmap.locator_params(nbins=npaths)
1762        ax_hmap.set_xticks(np.arange(npaths), minor=True)
1763        ax_hmap.set_yticks(np.arange(npaths), minor=True)
1764        ax_hmap.tick_params(axis='x', which='both', labelleft='off',            \
1765                        labelright='off', labeltop='on', labelsize=0)
1766        ax_hmap.tick_params(axis='y', which='both', labelleft='on',             \
1767                labelright='off', labeltop='off', labelsize=0)
1768        rowlabels = [self.labels[i] for i in rowidx]
1769        collabels = [self.labels[i] for i in colidx]
1770        ax_hmap.set_xticklabels(collabels, rotation='vertical',                 \
1771                size=(labelsize-4), multialignment='center', minor=True)
1772        ax_hmap.set_yticklabels(rowlabels, rotation='horizontal',               \
1773                size=(labelsize-4), multialignment='left', ha='right',          \
1774                minor=True)
1775
1776        ax_color = fig.add_axes(cbar_loc)
1777        colorbar(im, cax=ax_color, ticks=np.linspace(minDist, maxDist, 10),  \
1778                format="%0.1f")
1779        ax_color.tick_params(labelsize=labelsize)
1780
1781        # Remove major ticks from both heat map axes
1782        for tic in ax_hmap.xaxis.get_major_ticks():
1783            tic.tick1On = tic.tick2On = False
1784            tic.label1On = tic.label2On = False
1785        for tic in ax_hmap.yaxis.get_major_ticks():
1786            tic.tick1On = tic.tick2On = False
1787            tic.label1On = tic.label2On = False
1788        # Remove minor ticks from both heat map axes
1789        for tic in ax_hmap.xaxis.get_minor_ticks():
1790            tic.tick1On = tic.tick2On = False
1791        for tic in ax_hmap.yaxis.get_minor_ticks():
1792            tic.tick1On = tic.tick2On = False
1793        # Remove tickmarks from colorbar
1794        for tic in ax_color.yaxis.get_major_ticks():
1795            tic.tick1On = tic.tick2On = False
1796
1797        if filename is not None:
1798            head = self.targetdir + self.datadirs['plots']
1799            outfile = os.path.join(head, filename)
1800            savefig(outfile, dpi=300, bbox_inches='tight')
1801
1802        return Z, dgram, dist_matrix_clus
1803
1804
1805    def plot_annotated_heatmap(self, filename=None, linkage='ward',             \
1806                               count_sort=False, distance_sort=False,           \
1807                               figsize=8, annot_size=6.5):
1808        """Plot a clustered distance matrix.
1809
1810        Uses method `linkage` and plots annotated distances in the matrix. Rows
1811        (and columns) are identified using the list of strings specified by
1812        :attr:`PSAnalysis.labels`.
1813
1814        If `filename` is supplied then the figure is also written to file (the
1815        suffix determines the file type, e.g. pdf, png, eps, ...). All other
1816        keyword arguments are passed on to :func:`matplotlib.pyplot.imshow`.
1817
1818        Parameters
1819        ----------
1820        filename : str
1821             save figure to *filename* [``None``]
1822        linkage : str
1823             name of linkage criterion for clustering [``'ward'``]
1824        count_sort : bool
1825             see :func:`scipy.cluster.hierarchy.dendrogram` [``False``]
1826        distance_sort : bool
1827             see :func:`scipy.cluster.hierarchy.dendrogram` [``False``]
1828        figsize : float
1829             set the vertical size of plot in inches [``4.5``]
1830        annot_size : float
1831             font size of annotation labels on heat map [``6.5``]
1832
1833        Returns
1834        -------
1835        Z
1836          `Z` from :meth:`cluster`
1837        dgram
1838          `dgram` from :meth:`cluster`
1839        dist_matrix_clus
1840          clustered distance matrix (reordered)
1841
1842
1843        Note
1844        ----
1845        This function requires the seaborn_ package, which can be installed
1846        with `pip install seaborn` or `conda install seaborn`.
1847
1848        .. _seaborn: https://seaborn.pydata.org/
1849
1850        """
1851        from matplotlib.pyplot import figure, colorbar, cm, savefig, clf
1852
1853        try:
1854            import seaborn.apionly as sns
1855        except ImportError:
1856            raise ImportError(
1857                """ERROR --- The seaborn package cannot be found!
1858
1859                The seaborn API could not be imported. Please install it first.
1860                You can try installing with pip directly from the
1861                internet:
1862
1863                  pip install seaborn
1864
1865                Alternatively, download the package from
1866
1867                  http://pypi.python.org/pypi/seaborn/
1868
1869                and install in the usual manner.
1870                """
1871            )
1872
1873        if self.D is None:
1874            raise ValueError(
1875                "No distance data; do 'PSAnalysis.run(store=True)' first.")
1876        dist_matrix = self.D
1877
1878        Z, dgram = self.cluster(method=linkage,                                 \
1879                                count_sort=count_sort,                          \
1880                                distance_sort=distance_sort,                    \
1881                                no_plot=True)
1882        rowidx = colidx = dgram['leaves'] # get row-wise ordering from clustering
1883        dist_matrix_clus = dist_matrix[rowidx,:]
1884        dist_matrix_clus = dist_matrix_clus[:,colidx]
1885
1886        clf()
1887        aspect_ratio = 1.25
1888        fig = figure(figsize=(figsize*aspect_ratio, figsize))
1889        ax_hmap = fig.add_subplot(111)
1890        ax_hmap = sns.heatmap(dist_matrix_clus,                                 \
1891                         linewidths=0.25, cmap=cm.YlGn, annot=True, fmt='3.1f', \
1892                         square=True, xticklabels=rowidx, yticklabels=colidx,   \
1893                         annot_kws={"size": 7}, ax=ax_hmap)
1894
1895        # Remove major ticks from both heat map axes
1896        for tic in ax_hmap.xaxis.get_major_ticks():
1897            tic.tick1On = tic.tick2On = False
1898            tic.label1On = tic.label2On = False
1899        for tic in ax_hmap.yaxis.get_major_ticks():
1900            tic.tick1On = tic.tick2On = False
1901            tic.label1On = tic.label2On = False
1902        # Remove minor ticks from both heat map axes
1903        for tic in ax_hmap.xaxis.get_minor_ticks():
1904            tic.tick1On = tic.tick2On = False
1905        for tic in ax_hmap.yaxis.get_minor_ticks():
1906            tic.tick1On = tic.tick2On = False
1907
1908        if filename is not None:
1909            head = self.targetdir + self.datadirs['plots']
1910            outfile = os.path.join(head, filename)
1911            savefig(outfile, dpi=600, bbox_inches='tight')
1912
1913        return Z, dgram, dist_matrix_clus
1914
1915
1916    def plot_nearest_neighbors(self, filename=None, idx=0,                      \
1917                               labels=('Path 1', 'Path 2'), figsize=4.5,        \
1918                               multiplot=False, aspect_ratio=1.75,              \
1919                               labelsize=12):
1920        """Plot nearest neighbor distances as a function of normalized frame
1921        number.
1922
1923        The frame number is mapped to the interval *[0, 1]*.
1924
1925        If `filename` is supplied then the figure is also written to file (the
1926        suffix determines the file type, e.g. pdf, png, eps, ...). All other
1927        keyword arguments are passed on to :func:`matplotlib.pyplot.imshow`.
1928
1929        Parameters
1930        ----------
1931        filename : str
1932             save figure to *filename* [``None``]
1933        idx : int
1934             index of path (pair) comparison to plot [``0``]
1935        labels : (str, str)
1936             pair of names to label nearest neighbor distance
1937             curves [``('Path 1', 'Path 2')``]
1938        figsize : float
1939             set the vertical size of plot in inches [``4.5``]
1940        multiplot : bool
1941             set to ``True`` to enable plotting multiple nearest
1942             neighbor distances on the same figure [``False``]
1943        aspect_ratio : float
1944             set the ratio of width to height of the plot [``1.75``]
1945        labelsize : float
1946             set the font size for colorbar labels; font size for path labels on
1947             dendrogram default to 3 points smaller [``12``]
1948
1949        Returns
1950        -------
1951        ax : axes
1952
1953        Note
1954        ----
1955        This function requires the seaborn_ package, which can be installed
1956        with `pip install seaborn` or `conda install seaborn`.
1957
1958        .. _seaborn: https://seaborn.pydata.org/
1959
1960        """
1961        from matplotlib.pyplot import figure, savefig, tight_layout, clf, show
1962        try:
1963            import seaborn.apionly as sns
1964        except ImportError:
1965            raise ImportError(
1966                """ERROR --- The seaborn package cannot be found!
1967
1968                The seaborn API could not be imported. Please install it first.
1969                You can try installing with pip directly from the
1970                internet:
1971
1972                  pip install seaborn
1973
1974                Alternatively, download the package from
1975
1976                  http://pypi.python.org/pypi/seaborn/
1977
1978                and install in the usual manner.
1979                """
1980            )
1981
1982        colors = sns.xkcd_palette(["cherry", "windows blue"])
1983
1984        if self._NN is None:
1985            raise ValueError("No nearest neighbor data; run "
1986                             "'PSAnalysis.run_pairs_analysis(neighbors=True)' first.")
1987
1988        sns.set_style('whitegrid')
1989
1990        if not multiplot:
1991            clf()
1992        fig = figure(figsize=(figsize*aspect_ratio, figsize))
1993        ax = fig.add_subplot(111)
1994
1995        nn_dist_P, nn_dist_Q = self._NN[idx]['distances']
1996        frames_P = len(nn_dist_P)
1997        frames_Q = len(nn_dist_Q)
1998        progress_P = np.asarray(range(frames_P))/(1.0*frames_P)
1999        progress_Q = np.asarray(range(frames_Q))/(1.0*frames_Q)
2000
2001        ax.plot(progress_P, nn_dist_P, color=colors[0], lw=1.5, label=labels[0])
2002        ax.plot(progress_Q, nn_dist_Q, color=colors[1], lw=1.5, label=labels[1])
2003
2004        ax.legend()
2005        ax.set_xlabel(r'(normalized) progress by frame number', fontsize=12)
2006        ax.set_ylabel(r'nearest neighbor rmsd ($\AA$)', fontsize=12)
2007        ax.tick_params(axis='both', which='major', labelsize=12, pad=4)
2008
2009        sns.despine(bottom=True, left=True, ax=ax)
2010        tight_layout()
2011
2012        if filename is not None:
2013            head = self.targetdir + self.datadirs['plots']
2014            outfile = os.path.join(head, filename)
2015            savefig(outfile, dpi=300, bbox_inches='tight')
2016
2017        return ax
2018
2019
2020    def cluster(self, dist_mat=None, method='ward', count_sort=False,           \
2021                distance_sort=False, no_plot=False, no_labels=True,             \
2022                color_threshold=4):
2023        """Cluster trajectories and optionally plot the dendrogram.
2024
2025        This method is used by :meth:`PSAnalysis.plot` to generate a heatmap-
2026        dendrogram combination plot. By default, the distance matrix,
2027        :attr:`PSAnalysis.D`, is assumed to exist, converted to
2028        distance-vector form, and inputted to :func:`cluster.hierarchy.linkage`
2029        to generate a clustering. For convenience in plotting arbitrary
2030        distance matrices, one can also be specify `dist_mat`, which will be
2031        checked for proper distance matrix form by
2032        :func:`spatial.distance.squareform`
2033
2034        Parameters
2035        ----------
2036        dist_mat : numpy.ndarray
2037            user-specified distance matrix to be clustered [``None``]
2038        method : str
2039            name of linkage criterion for clustering [``'ward'``]
2040        no_plot : bool
2041            if ``True``, do not render the dendrogram [``False``]
2042        no_labels : bool
2043            if ``True`` then do not label dendrogram [``True``]
2044        color_threshold : float
2045            For brevity, let t be the color_threshold. Colors all the
2046            descendent links below a cluster node k the same color if k is
2047            the first node below the cut threshold t. All links connecting
2048            nodes with distances greater than or equal to the threshold are
2049            colored blue. If t is less than or equal to zero, all nodes are
2050            colored blue. If color_threshold is None or ‘default’,
2051            corresponding with MATLAB(TM) behavior, the threshold is set to
2052            0.7*max(Z[:,2]). [``4``]]
2053
2054        Returns
2055        -------
2056        Z
2057            output from :func:`scipy.cluster.hierarchy.linkage`;
2058            list of indices representing the row-wise order of the objects
2059            after clustering
2060        dgram
2061            output from :func:`scipy.cluster.hierarchy.dendrogram`
2062        """
2063        # perhaps there is a better way to manipulate the plot... or perhaps it
2064        # is not even necessary? In any case, the try/finally makes sure that
2065        # we are not permanently changing the user's global state
2066        orig_linewidth = matplotlib.rcParams['lines.linewidth']
2067        matplotlib.rcParams['lines.linewidth'] = 0.5
2068        try:
2069            if dist_mat:
2070                dist_vec = spatial.distance.squareform(dist_mat,
2071                                                       force='tovector',
2072                                                       checks=True)
2073            else:
2074                dist_vec = self.get_pairwise_distances(vectorform=True)
2075            Z = cluster.hierarchy.linkage(dist_vec, method=method)
2076            dgram = cluster.hierarchy.dendrogram(
2077                Z, no_labels=no_labels, orientation='left',
2078                count_sort=count_sort, distance_sort=distance_sort,
2079                no_plot=no_plot, color_threshold=color_threshold)
2080        finally:
2081            matplotlib.rcParams['lines.linewidth'] = orig_linewidth
2082        return Z, dgram
2083
2084    def _get_plot_obj_locs(self):
2085        """Find and return coordinates for dendrogram, heat map, and colorbar.
2086
2087        Returns
2088        -------
2089        tuple
2090          tuple of coordinates for placing the dendrogram, heat map, and
2091          colorbar in the plot.
2092        """
2093        plot_xstart = 0.04
2094        plot_ystart = 0.04
2095        label_margin = 0.155
2096
2097        dgram_height = 0.2 # dendrogram heights(s)
2098        hmap_xstart = plot_xstart + dgram_height + label_margin
2099
2100        # Set locations for dendrogram(s), matrix, and colorbar
2101        hmap_height = 0.8
2102        hmap_width = 0.6
2103        dgram_loc = [plot_xstart, plot_ystart, dgram_height, hmap_height]
2104        cbar_width = 0.02
2105        cbar_xstart = hmap_xstart + hmap_width + 0.01
2106        cbar_loc = [cbar_xstart, plot_ystart, cbar_width, hmap_height]
2107        hmap_loc =  [hmap_xstart, plot_ystart, hmap_width, hmap_height]
2108
2109        return dgram_loc, hmap_loc, cbar_loc
2110
2111
2112    def get_num_atoms(self):
2113        """Return the number of atoms used to construct the :class:`Path` instances in
2114        :class:`PSA`.
2115
2116        Returns
2117        -------
2118        int
2119            the number of atoms in any path
2120
2121        Note
2122        ----
2123        Must run :meth:`PSAnalysis.generate_paths` prior to calling this
2124        method.
2125        """
2126        if self.natoms is None:
2127            raise ValueError(
2128                "No path data; do 'PSAnalysis.generate_paths()' first.")
2129        return self.natoms
2130
2131
2132    def get_num_paths(self):
2133        """Return the number of paths in :class:`PSA`.
2134
2135        Note
2136        ----
2137        Must run :meth:`PSAnalysis.generate_paths` prior to calling this method.
2138
2139        Returns
2140        -------
2141        int
2142           the number of paths in :class:`PSA`
2143        """
2144        if self.npaths is None:
2145            raise ValueError(
2146                "No path data; do 'PSAnalysis.generate_paths()' first.")
2147        return self.npaths
2148
2149
2150    def get_paths(self):
2151        """Return the paths in :class:`PSA`.
2152
2153        Note
2154        ----
2155        Must run :meth:`PSAnalysis.generate_paths` prior to calling this
2156        method.
2157
2158        Returns
2159        -------
2160        list
2161            list of :class:`numpy.ndarray` representations of paths in
2162            :class:`PSA`
2163        """
2164        if self.paths is None:
2165            raise ValueError(
2166                "No path data; do 'PSAnalysis.generate_paths()' first.")
2167        return self.paths
2168
2169
2170    def get_pairwise_distances(self, vectorform=False, checks=False):
2171        """Return the distance matrix (or vector) of pairwise path distances.
2172
2173        Note
2174        ----
2175        Must run :meth:`PSAnalysis.run` with ``store=True`` prior to
2176        calling this method.
2177
2178        Parameters
2179        ----------
2180        vectorform : bool
2181             if ``True``, return the distance vector instead [``False``]
2182        checks : bool
2183             if ``True``, check that :attr:`PSAnalysis.D` is a proper distance
2184             matrix [``False``]
2185
2186        Returns
2187        -------
2188        numpy.ndarray
2189             representation of the distance matrix (or vector)
2190
2191        """
2192        if self.D is None:
2193            raise ValueError(
2194                "No distance data; do 'PSAnalysis.run(store=True)' first.")
2195        if vectorform:
2196            return spatial.distance.squareform(self.D, force='tovector',
2197                                               checks=checks)
2198        else:
2199            return self.D
2200
2201
2202    @property
2203    def psa_pairs(self):
2204        """The list of :class:`PSAPair` instances for each pair of paths.
2205
2206        :attr:`psa_pairs` is a list of all :class:`PSAPair` objects (in
2207        distance vector order). The elements of a :class:`PSAPair` are pairs of
2208        paths that have been compared using
2209        :meth:`PSAnalysis.run_pairs_analysis`. Each :class:`PSAPair` contains
2210        nearest neighbor and Hausdorff pair information specific to a pair of
2211        paths. The nearest neighbor frames and distances for a :class:`PSAPair`
2212        can be accessed in the nearest neighbor dictionary using the keys
2213        'frames' and 'distances', respectively. E.g.,
2214        :attr:`PSAPair.nearest_neighbors['distances']` returns a *pair* of
2215        :class:`numpy.ndarray` corresponding to the nearest neighbor distances
2216        for each path. Similarly, Hausdorff pair information can be accessed
2217        using :attr:`PSAPair.hausdorff_pair` with the keys 'frames' and
2218        'distance'.
2219
2220        Note
2221        ----
2222        Must run :meth:`PSAnalysis.run_pairs_analysis` prior to calling this
2223        method.
2224
2225        """
2226        if self._psa_pairs is None:
2227            raise ValueError("No nearest neighbors data; do"
2228                             " 'PSAnalysis.run_pairs_analysis()' first.")
2229        return self._psa_pairs
2230
2231
2232    @property
2233    def hausdorff_pairs(self):
2234        """The Hausdorff pair for each (unique) pairs of paths.
2235
2236        This attribute contains a list of Hausdorff pair information (in
2237        distance vector order), where each element is a dictionary containing
2238        the pair of frames and the (Hausdorff) distance between a pair of
2239        paths. See :meth:`PSAnalysis.psa_pairs` and
2240        :attr:`PSAPair.hausdorff_pair` for more information about accessing
2241        Hausdorff pair data.
2242
2243        Note
2244        ----
2245        Must run :meth:`PSAnalysis.run_pairs_analysis` with
2246        ``hausdorff_pairs=True`` prior to calling this method.
2247
2248        """
2249        if self._HP is None:
2250            raise ValueError("No Hausdorff pairs data; do "
2251                             "'PSAnalysis.run_pairs_analysis(hausdorff_pairs=True)' "
2252                             "first.")
2253        return self._HP
2254
2255    @property
2256    def nearest_neighbors(self):
2257        """The nearest neighbors for each (unique) pair of paths.
2258
2259        This attribute contains a list of nearest neighbor information (in
2260        distance vector order), where each element is a dictionary containing
2261        the nearest neighbor frames and distances between a pair of paths. See
2262        :meth:`PSAnalysis.psa_pairs` and :attr:`PSAPair.nearest_neighbors` for
2263        more information about accessing nearest neighbor data.
2264
2265        Note
2266        ----
2267        Must run :meth:`PSAnalysis.run_pairs_analysis` with
2268        ``neighbors=True`` prior to calling this method.
2269
2270        """
2271        if self._NN is None:
2272            raise ValueError("No nearest neighbors data; do"
2273                             " 'PSAnalysis.run_pairs_analysis(neighbors=True)'"
2274                             " first.")
2275        return self._NN
2276