utils/man/adist.Rd

% File src/library/utils/man/adist.Rd
% Part of the R package, https://www.R-project.org
% Copyright 2011-2015 R Core Team
% Distributed under GPL 2 or later

\name{adist}
\alias{adist}
\title{Approximate String Distances}
\description{
  Compute the approximate string distance between character vectors.
  The distance is a generalized Levenshtein (edit) distance, giving the
  minimal possibly weighted number of insertions, deletions and
  substitutions needed to transform one string into another.
}
\usage{
adist(x, y = NULL, costs = NULL, counts = FALSE, fixed = TRUE,
      partial = !fixed, ignore.case = FALSE, useBytes = FALSE)
}
\arguments{
  \item{x}{a character vector.  \link{Long vectors} are not supported.}
  \item{y}{a character vector, or \code{NULL} (default) indicating
    taking \code{x} as \code{y}.}
  \item{costs}{a numeric vector or list with names partially matching
    \samp{insertions}, \samp{deletions} and \samp{substitutions} giving
    the respective costs for computing the Levenshtein distance, or
    \code{NULL} (default) indicating using unit cost for all three
    possible transformations.}
  \item{counts}{a logical indicating whether to optionally return the
    transformation counts (numbers of insertions, deletions and
    substitutions) as the \code{"counts"} attribute of the return
    value.}
  \item{fixed}{a logical.  If \code{TRUE} (default), the \code{x}
    elements are used as string literals.  Otherwise, they are taken as
    regular expressions and \code{partial = TRUE} is implied
    (corresponding to the approximate string distance used by
    \code{\link{agrep}} with \code{fixed = FALSE}).}
  \item{partial}{a logical indicating whether the transformed \code{x}
    elements must exactly match the complete \code{y} elements, or only
    substrings of these.  The latter corresponds to the approximate
    string distance used by \code{\link{agrep}} (by default).}
  \item{ignore.case}{a logical.  If \code{TRUE}, case is ignored for
    computing the distances.}
  \item{useBytes}{a logical.  If \code{TRUE} distance computations are
    done byte-by-byte rather than character-by-character.}
}
\value{
  A matrix with the approximate string distances of the elements of
  \code{x} and \code{y}, with rows and columns corresponding to \code{x}
  and \code{y}, respectively.

  If \code{counts} is \code{TRUE}, the transformation counts are
  returned as the \code{"counts"} attribute of this matrix, as a
  3-dimensional array with dimensions corresponding to the elements of
  \code{x}, the elements of \code{y}, and the type of transformation
  (insertions, deletions and substitutions), respectively.
  Additionally, if \code{partial = FALSE}, the transformation sequences
  are returned as the \code{"trafos"} attribute of the return value, as
  character strings with elements \samp{M}, \samp{I}, \samp{D} and
  \samp{S} indicating a match, insertion, deletion and substitution,
  respectively.  If \code{partial = TRUE}, the offsets (positions of
  the first and last element) of the matched substrings are returned as
  the \code{"offsets"} attribute of the return value (with both offsets
  \eqn{-1} in case of no match).
}
\details{
  The (generalized) Levenshtein (or edit) distance between two strings
  \var{s} and \var{t} is the minimal possibly weighted number of
  insertions, deletions and substitutions needed to transform \var{s}
  into \var{t} (so that the transformation exactly matches \var{t}).
  This distance is computed for \code{partial = FALSE}, currently using
  a dynamic programming algorithm (see, e.g.,
  \url{https://en.wikipedia.org/wiki/Levenshtein_distance}) with space
  and time complexity \eqn{O(mn)}, where \eqn{m} and \eqn{n} are the
  lengths of \var{s} and \var{t}, respectively.  Additionally computing
  the transformation sequence and counts is \eqn{O(\max(m, n))}.

  The generalized Levenshtein distance can also be used for approximate
  (fuzzy) string matching, in which case one finds the substring of
  \var{t} with minimal distance to the pattern \var{s} (which could be
  taken as a regular expression, in which case the principle of using
  the leftmost and longest match applies), see, e.g.,
  \url{https://en.wikipedia.org/wiki/Approximate_string_matching}.  This
  distance is computed for \code{partial = TRUE} using \samp{tre} by
  Ville Laurikari (\url{https://github.com/laurikari/tre}) and
  corresponds to the distance used by \code{\link{agrep}}.  In this
  case, the given cost values are coerced to integer.

  Note that the costs for insertions and deletions can be different, in
  which case the distance between \var{s} and \var{t} can be different
  from the distance between \var{t} and \var{s}.
}
\seealso{
  \code{\link{agrep}} for approximate string matching (fuzzy matching)
  using the generalized Levenshtein distance.
}
\examples{
## Cf. https://en.wikipedia.org/wiki/Levenshtein_distance
adist("kitten", "sitting")
## To see the transformation counts for the Levenshtein distance:
drop(attr(adist("kitten", "sitting", counts = TRUE), "counts"))
## To see the transformation sequences:
attr(adist(c("kitten", "sitting"), counts = TRUE), "trafos")

## Cf. the examples for agrep:
adist("lasy", "1 lazy 2")
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)
}
\keyword{character}