README
NAME
    README - General information about Text::Similarity

DESCRIPTION
    Text-Similarity is a Perl module that measures the similarity between
    two strings or two files. One method for computing similarity is
    currently supported, Text::Similarity::Overlaps, and others can be
    added.

    When using Text::Similarity::Overlaps, text similarity is based on
    counting the number of overlapping words between the two files, and is
    (optionally) normalized by the length of the files.

    The lesk value provided in Text::Similarity::Overlaps is based on
    counting the number of overlapping words and phrases between the two
    files, and is (optionally) normalized by the length of the files.
    Phrasal matches are scored more highly.
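
    One way to picture phrase-weighted scoring is sketched below. It
    repeatedly finds the longest run of consecutive words shared by two
    strings and adds the square of its length to the score, so a two-word
    phrase counts 4 where two isolated words would count 2. The squared
    weighting and the helper names longest_common_run and
    phrase_weighted_score are assumptions made for illustration. They are
    not taken from Text::Similarity::Overlaps, whose exact scoring may
    differ.

```perl
use strict;
use warnings;

# Find the longest run of consecutive words common to both token lists.
# Returns (length, start index in @$x, start index in @$y).
sub longest_common_run {
    my ($x, $y) = @_;
    my ($best, $bx, $by) = (0, 0, 0);
    for my $i (0 .. $#$x) {
        for my $j (0 .. $#$y) {
            my $k = 0;
            # Extend the run while both words exist and are equal.
            $k++ while $i + $k <= $#$x
                    && $j + $k <= $#$y
                    && defined $x->[$i + $k]
                    && defined $y->[$j + $k]
                    && $x->[$i + $k] eq $y->[$j + $k];
            ($best, $bx, $by) = ($k, $i, $j) if $k > $best;
        }
    }
    return ($best, $bx, $by);
}

# Hypothetical scoring: each shared phrase of n words adds n * n.
sub phrase_weighted_score {
    my ($first, $second) = @_;
    my @x = split ' ', $first;
    my @y = split ' ', $second;
    my $score = 0;
    while (1) {
        my ($len, $i, $j) = longest_common_run(\@x, \@y);
        last if $len == 0;
        $score += $len * $len;
        # Blank out the matched words so they are not counted twice.
        @x[$i .. $i + $len - 1] = (undef) x $len;
        @y[$j .. $j + $len - 1] = (undef) x $len;
    }
    return $score;
}

# 'quick brown' is shared as a two-word phrase.
print phrase_weighted_score('the quick brown fox', 'a quick brown dog'), "\n";
```

    Blanking out matched words, rather than deleting them, keeps words on
    either side of a match from being joined into a phrase that never
    occurred in the input.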

    The smallest units considered for matches are white space separated
    strings. For example, 'the cat and the hat' and 'these cats and these
    hats' match only on 'and', because matches below the word level are
    not measured.
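
    The example above can be reproduced with a minimal sketch that splits
    on white space and compares whole tokens. The helpers tokens and
    shared_words are hypothetical names used only for illustration, and the
    handling of repeated words (each occurrence matches at most once) is an
    assumption rather than the module's documented behavior.

```perl
use strict;
use warnings;

# Split on white space only; no stemming or substring matching is applied.
sub tokens {
    my ($text) = @_;
    return grep { length } split /\s+/, $text;
}

# Return the words shared by both strings (hypothetical helper).
# A repeated word is matched at most as often as it occurs in both inputs.
sub shared_words {
    my ($first, $second) = @_;
    my %available;
    $available{$_}++ for tokens($second);
    my @shared;
    for my $word (tokens($first)) {
        next unless $available{$word};
        $available{$word}--;
        push @shared, $word;
    }
    return @shared;
}

my @shared = shared_words('the cat and the hat', 'these cats and these hats');
print "shared words: @shared\n";    # prints "shared words: and"
```

    Because 'the' and 'these', 'cat' and 'cats', and 'hat' and 'hats' are
    different tokens, only 'and' is reported.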

    Each input file is treated as a single string. Methods are provided
    that let you write programs that measure the similarity of files
    (getSimilarity) and identify the overlaps present in strings
    (getOverlaps).
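
    A program that calls getSimilarity on two files might look like the
    sketch below. The constructor call with a hash reference, the
    'normalize' option, and the two-file-name argument list are written
    from memory of the module's documentation and should be treated as
    assumptions; check the perldoc for Text::Similarity::Overlaps before
    relying on them. The helpers write_temp and file_similarity and the
    sample file contents are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a string to a temporary file and return the file name.
sub write_temp {
    my ($text) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $text;
    close $fh or die "cannot close $name: $!";
    return $name;
}

# Compare two files; returns undef when the module is not installed.
sub file_similarity {
    my ($file1, $file2) = @_;
    return unless eval { require Text::Similarity::Overlaps; 1 };
    # Assumed option: 'normalize' requests a length-normalized score.
    my $measure = Text::Similarity::Overlaps->new({ normalize => 1 });
    defined $measure
        or die "could not construct Text::Similarity::Overlaps";
    return scalar $measure->getSimilarity($file1, $file2);
}

my $score = file_similarity(write_temp("the cat and the hat\n"),
                            write_temp("these cats and these hats\n"));
print defined $score
    ? "similarity: $score\n"
    : "Text::Similarity::Overlaps is not installed\n";
```

    The require is wrapped in eval so the script reports a missing module
    instead of dying at compile time.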

CONTENTS
    When the distribution is unpacked, several subdirectories are created:

    /bin
        This directory contains a driver program called text_similarity.pl
        that can be used to measure the similarity of two files. Please see
        the perldoc for this program for more details.

    /lib
        This directory contains the Perl modules that do the actual work of
        measuring similarity. By default, these files are installed into
        /usr/local/lib/perl5/site_perl/PERL_VERSION (where PERL_VERSION is
        the version of Perl you are using). See the INSTALL file for more
        information.

    /doc
        This directory contains all of the *.pod files used to document the
        system. These are processed with pod2text and the output is placed
        in the top level directory. These top level text files should be
        considered read only.

    /t  This directory contains test scripts. These scripts are run when you
        execute 'make test'.

    /samples
        This directory includes stoplist files in two formats: one word per
        line (stoplist.txt) and regular expression format
        (stoplist-nsp.regex).
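
    As a rough illustration of the one-word-per-line format, the sketch
    below loads such a list into a hash and drops the listed words from a
    tokenized string. The stoplist contents shown are made up for the
    example and may not match the shipped stoplist.txt, the helper names
    read_stoplist and remove_stopwords are hypothetical, and the regular
    expression format is not covered here.

```perl
use strict;
use warnings;

# Read a one-word-per-line stoplist from an open file handle into a hash.
sub read_stoplist {
    my ($fh) = @_;
    my %stop;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;    # trim surrounding white space
        $stop{$line} = 1 if length $line;
    }
    return \%stop;
}

# Drop every token that appears in the stoplist.
sub remove_stopwords {
    my ($stop, @words) = @_;
    return grep { !$stop->{$_} } @words;
}

# Illustrative stoplist contents, read through an in-memory file handle.
my $sample = "the\nand\nof\n";
open my $fh, '<', \$sample or die "cannot open in-memory stoplist: $!";
my $stop = read_stoplist($fh);
close $fh;

my @kept = remove_stopwords($stop, split ' ', 'the cat and the hat');
print "@kept\n";    # prints "cat hat"
```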

SEE ALSO
    <http://text-similarity.sourceforge.net>

AUTHORS
    Ted Pedersen, University of Minnesota, Duluth
    tpederse at d.umn.edu

    Siddharth Patwardhan, University of Utah
    sidd at cs.utah.edu

    Satanjeev Banerjee, Carnegie Mellon University
    banerjee at cs.cmu.edu

    Jason Michelizzi

    Ying Liu, University of Minnesota, Twin Cities
    liux0395 at umn.edu

    Last modified by: $Id: README.pod,v 1.1.1.1 2013/06/26 02:38:12 tpederse
    Exp $

COPYRIGHT AND LICENSE
    Copyright (C) 2004-2008 by Jason Michelizzi, Ted Pedersen, Siddharth
    Patwardhan, Satanjeev Banerjee

    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on the
    web at <http://www.gnu.org/copyleft/fdl.html> and is included in this
    distribution as FDL.txt.