
#(@) Lingua::Ispell.pm - a module encapsulating access to the Ispell program.

=head1 NAME

Lingua::Ispell.pm - a module encapsulating access to the Ispell program.

Note: this module was previously known as Text::Ispell.  If you have
Text::Ispell installed on your system, it is now obsolete and should be
replaced by Lingua::Ispell.

=head1 NOTA BENE

When ispell reports a misspelled word, it indicates the string it was unable
to verify, as well as that string's starting offset in the input line.
No such information is returned for words which are deemed to be correctly spelled.
For example, in a line like "Can't buy a thrill", ispell simply reports that the
line contained four correctly spelled words.

Lingua::Ispell would like to identify which substrings of the input
line are words -- correctly spelled or otherwise.  It used to attempt to split
the input line into words according to the same rules ispell uses, but that
proved very difficult and resulted in code that was both slow and error-prone.

=head2 Consequences

Lingua::Ispell now operates only in "terse" mode.
In this mode, only misspelled words are reported.
Words which ispell verifies as correctly spelled are silently accepted.

In the report structures returned by C<spellcheck()>, the C<'term'> member
is now always identical to the C<'original'> member; of the two, you should
probably use the C<'term'> member.  (Also consider the C<'offset'> member,
which gives the position of the term in the input line.)
ispell does not report this information for correctly spelled words; if at
some point in the future this capability is added to ispell, Lingua::Ispell
will be updated to take advantage of it.

The C<$word_chars> variable is no longer used; setting it has no effect.

C<terse_mode()> now does nothing.
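
Since only misspelled terms come back, the C<'offset'> member is the main way
to tie a report to its place in the input.  The following is a minimal sketch
of that idea, not part of the module: it uses a hand-written report hash in
place of real C<spellcheck()> output, the helper name C<underline_terms> is
made up for illustration, and it assumes offsets are 0-based indexes into the
line.  Whether real offsets need adjusting by one (the module sends the line
to ispell with a leading C<^>) is worth checking against your ispell.

 use strict;
 use warnings;

 # Build a marker line with carets under each reported term.
 # Assumes each report's 'offset' is a 0-based index into $line.
 sub underline_terms {
   my( $line, @reports ) = @_;
   my $marks = ' ' x length $line;
   for my $r ( @reports ) {
     my $len = length $r->{'term'};
     substr( $marks, $r->{'offset'}, $len ) = '^' x $len;
   }
   $marks =~ s/\s+$//;   # drop trailing blanks
   return $marks;
 }

 # Hand-written stand-in for what spellcheck() might return.
 my $line    = 'hello shrdlu world';
 my @reports = ( { 'type' => 'none', 'term' => 'shrdlu', 'offset' => 6 } );

 print "$line\n", underline_terms( $line, @reports ), "\n";

With real data, the reports would come from C<spellcheck( $line )> instead of
the literal list.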

=cut


package Lingua::Ispell;
use Exporter;
@Lingua::Ispell::ISA = qw(Exporter);
@Lingua::Ispell::EXPORT_OK = qw(
  spellcheck
  add_word
  add_word_lc
  accept_word
  parse_according_to
  set_params_by_language
  save_dictionary
  allow_compounds
  make_wild_guesses
  use_dictionary
  use_personal_dictionary
);
%Lingua::Ispell::EXPORT_TAGS = (
  'all' => \@Lingua::Ispell::EXPORT_OK,
);


use FileHandle;
use IPC::Open2;
use Carp;

use strict;

use vars qw( $VERSION );
$VERSION = '0.07';


=head1 SYNOPSIS

 # Brief:
 use Lingua::Ispell;
 Lingua::Ispell::spellcheck( $string );
 # or
 use Lingua::Ispell qw( spellcheck ); # import the function
 spellcheck( $string );

 # Useful:
 use Lingua::Ispell qw( :all );  # import all symbols
 for my $r ( spellcheck( "hello hacking perl shrdlu 42" ) ) {
   print "$r->{'type'}: $r->{'term'}\n";
 }


=head1 DESCRIPTION

C<Lingua::Ispell::spellcheck()> takes one argument.  It must be a
string, and it should contain only printable characters.
One allowable exception is a terminal newline, which will be
chomped off anyway.  The line is fed to a coprocess running
ispell for analysis.  ispell parses the line into "terms"
according to the language-specific rules in effect.

The result of ispell's analysis of each term is a categorization
of the term into one of six types: ok, compound, root, miss, none,
and guess.  Some of these carry additional information.
The first three types are "correctly" spelled terms, and the last
three are for "incorrectly" spelled terms.

C<Lingua::Ispell::spellcheck()> returns a list of objects, each
corresponding to a term in the spellchecked string.  Each object
is a hash (hash-ref) with at least two entries: 'term' and 'type'.
The former contains the term ispell is reporting on, and the latter
is ispell's determination of that term's type (see above).
For types 'ok', 'compound', and 'none', ispell offers no suggestions
or other details about the term.
For the type 'root', an additional hash entry is present: 'root'.
Its value is the word which ispell identified in the dictionary
as being the likely root of the current term.
For the three "incorrectly" spelled types ('none', 'miss', and 'guess'),
the entries 'original' and 'offset' are also present; they hold the term
as ispell saw it and its position in the input line.
For the type 'miss', an additional hash entry is present: 'misses'.
Its value is a reference to an array of words which ispell
identified as being "near-misses" of the current term, when
scanning the dictionary.
For the type 'guess', the entry 'guesses' holds a reference to an
array of the root/affix combinations which ispell guessed.
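
To make the layout of these hashes concrete, here is a minimal, standalone
sketch that turns one line of ispell's C<-a> commentary into a hash of the
shape described above.  It follows the same steps as the parsing inside
C<spellcheck()> but is not the module's code; the function name
C<parse_commentary> and the sample line are illustrative assumptions.

 use strict;
 use warnings;

 my %types = (
   '*' => 'ok',   '-' => 'compound', '+' => 'root',
   '#' => 'none', '&' => 'miss',     '?' => 'guess',
 );

 sub parse_commentary {
   my $text = shift;
   my %h;

   # Suggestions, if any, follow a colon and are comma-separated.
   my @tail;
   if ( $text =~ s/:\s+(.*)// ) {
     my $tail = $1;
     @tail = split /, /, $tail;
   }

   my( $c, @args ) = split ' ', $text;
   $h{'type'} = $types{$c} || 'unknown';

   if ( $h{'type'} eq 'root' ) {
     $h{'root'} = $args[0];
   }
   elsif ( $h{'type'} eq 'none' ) {
     @h{ 'original', 'offset' } = @args;
   }
   elsif ( $h{'type'} eq 'miss' or $h{'type'} eq 'guess' ) {
     @h{ 'original', 'count', 'offset' } = @args;
     # The first 'count' suggestions are near-misses; the rest are guesses.
     $h{'misses'}  = [ splice @tail, 0, $h{'count'} ];
     $h{'guesses'} = [ @tail ];
   }

   $h{'term'} = $h{'original'};
   return \%h;
 }

 my $r = parse_commentary( '& perl 3 7: peal, pearl, per' );
 print "$r->{'type'} '$r->{'term'}' at $r->{'offset'}: @{ $r->{'misses'} }\n";

The sample line stands for "three near-misses for 'perl', found at offset 7".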

=head2 NOTE

As mentioned above, C<Lingua::Ispell::spellcheck()> currently reports only on
misspelled terms, so results of type 'ok', 'root', and 'compound' are never
returned.  The branches for those types in the example below are kept for
completeness.

=head2 EXAMPLE

 use Lingua::Ispell qw( spellcheck );
 Lingua::Ispell::allow_compounds(1);
 for my $r ( spellcheck( "hello hacking perl salmoning fruithammer shrdlu 42" ) ) {
   if ( $r->{'type'} eq 'ok' ) {
     # as in the case of 'hello'
     print "'$r->{'term'}' was found in the dictionary.\n";
   }
   elsif ( $r->{'type'} eq 'root' ) {
     # as in the case of 'hacking'
     print "'$r->{'term'}' can be formed from root '$r->{'root'}'\n";
   }
   elsif ( $r->{'type'} eq 'miss' ) {
     # as in the case of 'perl'
     print "'$r->{'term'}' was not found in the dictionary;\n";
     print "Near misses: @{$r->{'misses'}}\n";
   }
   elsif ( $r->{'type'} eq 'guess' ) {
     # as in the case of 'salmoning'
     print "'$r->{'term'}' was not found in the dictionary;\n";
     print "Root/affix Guesses: @{$r->{'guesses'}}\n";
   }
   elsif ( $r->{'type'} eq 'compound' ) {
     # as in the case of 'fruithammer'
     print "'$r->{'term'}' is a valid compound word.\n";
   }
   elsif ( $r->{'type'} eq 'none' ) {
     # as in the case of 'shrdlu'
     print "No match for term '$r->{'term'}'\n";
   }
   # and numbers are skipped entirely, as in the case of 42.
 }


=head2 ERRORS

C<Lingua::Ispell::spellcheck()> starts the ispell coprocess
if the coprocess seems not to exist.  Ordinarily this happens on
the first call, or on the first call after one of the restarting functions.

ispell is spawned via the C<IPC::Open2::open2()> function, which
throws an exception (i.e. dies) if the spawn fails.  The caller
should be prepared to catch this exception -- unless, of course,
the default behavior of die is acceptable.
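
One way to catch that exception is a block C<eval>.  The sketch below is only
an illustration of the pattern: the wrapper name C<try_spellcheck> is
hypothetical, and a stub that dies stands in for
C<\&Lingua::Ispell::spellcheck> so that the snippet runs without ispell
installed.  The text of the stub's error message is invented.

 use strict;
 use warnings;

 # Run a checker, turning any exception into a warning and an empty list.
 sub try_spellcheck {
   my( $checker, $line ) = @_;
   my @results = eval { $checker->( $line ) };
   if ( $@ ) {
     warn "spellcheck failed: $@";
     return;
   }
   return @results;
 }

 # Stand-in for \&Lingua::Ispell::spellcheck that fails like a bad spawn.
 my $broken = sub { die "could not start ispell\n" };

 my @r = try_spellcheck( $broken, 'hello shrdlu' );
 print scalar( @r ), " reports\n";

In a real program you would pass C<\&Lingua::Ispell::spellcheck> as the first
argument and decide for yourself whether a warning is the right reaction.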

=head2 Nota Bene

The full path of the ispell executable is stored
in the variable C<$Lingua::Ispell::path>.  The default
value is F</usr/local/bin/ispell>.
If your ispell executable is installed somewhere else,
you must set C<$Lingua::Ispell::path> accordingly
before you call C<Lingua::Ispell::spellcheck()> (or any other function
in the module) for the first time!
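
If you do not know in advance where ispell lives, you can probe a few likely
places before the first call.  This is a sketch under assumptions: the helper
name C<first_executable> is hypothetical, and the candidate paths are guesses
that you should replace with locations that make sense on your system.

 use strict;
 use warnings;
 no warnings 'once';   # $Lingua::Ispell::path is mentioned only once here

 # Return the first candidate that is an executable file, or undef.
 sub first_executable {
   for my $candidate ( @_ ) {
     return $candidate if -f $candidate && -x _;
   }
   return;
 }

 # Hypothetical install locations; adjust for your system.
 my @candidates = qw(
   /usr/local/bin/ispell
   /usr/bin/ispell
   /opt/local/bin/ispell
 );

 my $found = first_executable( @candidates );
 if ( defined $found ) {
   # Must happen before the first call into the module.
   $Lingua::Ispell::path = $found;
   print "using $found\n";
 }
 else {
   print "no ispell found among: @candidates\n";
 }

Setting the variable after the coprocess has started has no effect until one
of the restarting functions is called.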

=cut


# Start the ispell coprocess if it is not already running.
# Returns the pid of the coprocess.
sub _init {
  unless ( $Lingua::Ispell::pid ) {
    # Flatten the option table into a command-line argument list.
    my @options;
    while ( my( $k, $ar ) = each %Lingua::Ispell::options ) {
      if ( @$ar ) {
        for ( @$ar ) {
          #push @options, "$k $_";
          push @options, $k, $_;
        }
      }
      else {
        push @options, $k;
      }
    }

    $Lingua::Ispell::path ||= '/usr/local/bin/ispell';

    $Lingua::Ispell::pid = undef; # so that it's still undef if open2 fails.
    $Lingua::Ispell::pid = open2( # if open2 fails, it throws, but doesn't return.
      *Reader,
      *Writer,
      $Lingua::Ispell::path,
      '-a', '-S',
      @options,
    );

    my $hdr = scalar(<Reader>); # read and discard ispell's banner line

    # must be the same as ispell:
    $Lingua::Ispell::terse = 0;
    {
      # set up permanent terse mode:
      local $/ = "\n";
      local $\ = '';
      print Writer "!\n";
      $Lingua::Ispell::terse = 1;
    }
  }

  $Lingua::Ispell::pid
}

# Shut down the ispell coprocess, if there is one.
sub _exit {
  if ( $Lingua::Ispell::pid ) {
    close Reader;
    close Writer;
    kill 'TERM', $Lingua::Ispell::pid; # kill takes the signal first, then the pids
    waitpid $Lingua::Ispell::pid, 0;   # reap the child so it does not linger
    $Lingua::Ispell::pid = undef;
  }
}


sub spellcheck {
  _init() or return();  # caller should really catch the exception from a failed open2.
  my $line = shift;
  local $/ = "\n"; local $\ = '';
  chomp $line;
  $line =~ s/\r//g; # strip carriage returns
  $line =~ /\n/ and croak "newlines not allowed in arguments to Lingua::Ispell::spellcheck!";
  print Writer "^$line\n";

  # ispell answers with one line per reported term, then an empty line.
  my @commentary;
  local $_;
  while ( <Reader> ) {
    chomp;
    last unless $_ gt '';
    push @commentary, $_;
  }

  my %types = (
    # correct words:
    '*' => 'ok',
    '-' => 'compound',
    '+' => 'root',

    # misspelled words:
    '#' => 'none',
    '&' => 'miss',
    '?' => 'guess',
  );
  # and there's one more type, unknown, which is
  # used when the first char is not in the above set.

  # per-type handlers which fill in the report hash:
  my %modisp = (
      'root' => sub {
        my $h = shift;
        $h->{'root'} = shift;
      },
      'none' => sub {
        my $h = shift;
        $h->{'original'} = shift;
        $h->{'offset'} = shift;
      },
      'miss' => sub { # also used for 'guess'
        my $h = shift;
        $h->{'original'} = shift;
        $h->{'count'} = shift; # count will always be 0, when $c eq '?'.
        $h->{'offset'} = shift;

        my @misses  = splice @_, 0, $h->{'count'};
        my @guesses = @_;

        $h->{'misses'}  = \@misses;
        $h->{'guesses'} = \@guesses;
      },
  );
  $modisp{'guess'} = $modisp{'miss'}; # same handler.

  my @results;
  for my $i ( 0 .. $#commentary ) {
    my %h = (
      'commentary' => $commentary[$i],
    );

    my @tail; # will get stuff after a colon, if any.

    if ( $h{'commentary'} =~ s/:\s+(.*)// ) {
      my $tail = $1;
      @tail = split /, /, $tail;
    }

    my( $c, @args ) = split ' ', $h{'commentary'};

    my $type = $types{$c} || 'unknown';

    $modisp{$type} and $modisp{$type}->( \%h, @args, @tail );

    $h{'type'} = $type;
    $h{'term'} = $h{'original'};

    push @results, \%h;
  }

  @results
}

# Send a one-character ispell command, with an optional argument,
# to the coprocess (starting it first if necessary).
sub _send_command($$) {
  my( $cmd, $arg ) = @_;
  defined $arg or $arg = '';
  local $/ = "\n"; local $\ = '';
  chomp $arg;
  _init();
  print Writer "$cmd$arg\n";
}


=head1 AUX FUNCTIONS

=head2 add_word(word)

Adds a word to the personal dictionary.  Be careful of capitalization.
If you want the word to be added "case-insensitively", you should
call C<add_word_lc()> instead.

=cut

sub add_word($) {
  _send_command "\*", $_[0];
}

=head2 add_word_lc(word)

Adds a word to the personal dictionary, in lower-case form.
This allows ispell to match it in a case-insensitive manner.

=cut

sub add_word_lc($) {
  _send_command "\&", $_[0];
}

=head2 accept_word(word)

Similar to adding a word to the dictionary, in that it causes
ispell to accept the word as valid, but it does not actually
add it to the dictionary.  Presumably the effect lasts only for
the current ispell session, and that session ends whenever one of
the coprocess-restarting functions is called.

=cut

sub accept_word($) {
  _send_command "\@", $_[0];
}

=head2 parse_according_to(formatter)

Causes ispell to parse subsequent input lines according to
the specified formatter.  As of ispell v. 3.1.20, only
'tex' and 'nroff' are supported.

=cut

sub parse_according_to($) {
  # must be one of 'tex' or 'nroff'
  _send_command "\-", $_[0];
}

=head2 set_params_by_language(language)

Causes ispell to set its internal operational parameters
according to the given language.  The legal arguments to this
function, and its effects, are currently unknown to the
author of Lingua::Ispell.

=cut

sub set_params_by_language($) {
  _send_command "\~", $_[0];
}

=head2 save_dictionary()

Causes ispell to save the current state of the dictionary
to its disk file.  Presumably ispell would ordinarily
only do this upon exit.

=cut

sub save_dictionary() {
  _send_command "\#", '';
}

=head2 terse_mode(bool:terse)

I<B<NOTE:> This function has been disabled!
Lingua::Ispell now always operates in terse mode.>

In terse mode, ispell will not produce reports for "correct" words.
This means that the calling program will not receive results of the
types 'ok', 'root', and 'compound'.

=cut

sub terse_mode($) {
#  my $bool = shift;
#  my $cmd = $bool ?  "\!" : "\%";
#  _send_command $cmd, '';
#  $Lingua::Ispell::terse = $bool;
}


=head1 FUNCTIONS THAT RESTART ISPELL

The following functions cause the current ispell coprocess, if any, to terminate.
This means that all the changes to the state of ispell made by the above
functions will be lost, and their respective values reset to their defaults.
The only function above whose effect is persistent is C<save_dictionary()>.

Perhaps in the future we will figure out a good way to make this
state information carry over from one instantiation of the coprocess
to the next.
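
Until then, a caller can keep its own record of session state and send it
again after a restart.  The following sketch shows the idea for accepted
words only.  It is not part of the module: the helper names
C<accept_and_remember> and C<replay_accepted> are hypothetical, and a logging
stub stands in for C<\&Lingua::Ispell::accept_word> so that the snippet runs
without ispell.

 use strict;
 use warnings;

 my @accepted;   # words accepted so far in this program run

 # Accept a word via the given coderef and remember it for later replay.
 sub accept_and_remember {
   my( $accept, $word ) = @_;
   push @accepted, $word;
   $accept->( $word );
 }

 # Re-send every remembered word, e.g. after a restarting call.
 sub replay_accepted {
   my( $accept ) = @_;
   $accept->( $_ ) for @accepted;
   return scalar @accepted;
 }

 # Stand-in for \&Lingua::Ispell::accept_word that just logs.
 my @sent;
 my $fake_accept = sub { push @sent, $_[0] };

 accept_and_remember( $fake_accept, 'shrdlu' );
 accept_and_remember( $fake_accept, 'fruithammer' );
 replay_accepted( $fake_accept );   # as if ispell had been restarted
 print "sent: @sent\n";

With the real module, the replay would go right after a call such as
C<use_dictionary()>; the next command sent starts a fresh coprocess.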

=head2 allow_compounds(bool)

When this value is set to True, compound words are
accepted as legal -- as long as both words are found in the
dictionary; more than two words are always illegal.
When this value is set to False, run-together words are
considered spelling errors.

The default value of this setting is dictionary-dependent,
so the caller should set it explicitly if it really matters.

=cut

sub allow_compounds {
  my $bool = shift;
  _exit();
  if ( $bool ) {
    $Lingua::Ispell::options{'-C'} = [];
    delete $Lingua::Ispell::options{'-B'};
  }
  else {
    $Lingua::Ispell::options{'-B'} = [];
    delete $Lingua::Ispell::options{'-C'};
  }
}

=head2 make_wild_guesses(bool)

This setting controls when ispell makes "wild" guesses.

If False, ispell only makes "sane" guesses, i.e. possible
root/affix combinations that match the current dictionary;
only if it can find none will it make "wild" guesses,
which don't match the dictionary, and might in fact
be illegal words.

If True, wild guesses are always made, along with any "sane" guesses.
This feature can be useful if the dictionary has a limited word list,
or a word list with few suffixes.

The default value of this setting is dictionary-dependent,
so the caller should set it explicitly if it really matters.

=cut

sub make_wild_guesses {
  my $bool = shift;
  _exit();
  if ( $bool ) {
    $Lingua::Ispell::options{'-m'} = [];
    delete $Lingua::Ispell::options{'-P'};
  }
  else {
    $Lingua::Ispell::options{'-P'} = [];
    delete $Lingua::Ispell::options{'-m'};
  }
}

=head2 use_dictionary([dictionary])

Specifies what dictionary to use instead of the
default.  Dictionary names are actually file
names, and are searched for according to the
following rule: if the name does not contain a slash,
it is looked for in the directory containing the
default dictionary, typically F</usr/local/lib>.
Otherwise, it is used as is: if it does not begin
with a slash, it is interpreted relative to the current
directory.

If no argument is given, the default dictionary will be used.

=cut

sub use_dictionary {
  _exit();
  if ( @_ ) {
    $Lingua::Ispell::options{'-d'} = [ @_ ];
  }
  else {
    delete $Lingua::Ispell::options{'-d'};
  }
}

=head2 use_personal_dictionary([dictionary])

Specifies what personal dictionary to use
instead of the default.

Dictionary names are actually file names, and are
searched for according to the following rule:
if the name begins with a slash, it is used as
is (i.e. it is an absolute path name). Otherwise,
it is interpreted as relative to the user's home
directory ($HOME).

If no argument is given, the default personal
dictionary will be used.
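
The lookup rule for personal dictionaries can be written out in a few lines.
This sketch only restates the rule as described above; the lookup itself is
done by ispell, not by this module.  The function name
C<resolve_personal_dictionary>, the file names, and the fallback home
directory are all hypothetical.

 use strict;
 use warnings;

 # Apply the rule described above to a personal dictionary name.
 sub resolve_personal_dictionary {
   my( $name, $home ) = @_;
   return $name if $name =~ m{^/};   # absolute path: used as is
   return "$home/$name";             # otherwise relative to $HOME
 }

 # The fallback directory is a made-up placeholder.
 my $home = defined $ENV{'HOME'} ? $ENV{'HOME'} : '/home/user';

 print resolve_personal_dictionary( '/tmp/words.dict', $home ), "\n";
 print resolve_personal_dictionary( '.ispell_words',   $home ), "\n";

Passing an absolute path to C<use_personal_dictionary()> avoids any
dependence on the value of $HOME at the time ispell is started.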

=cut

sub use_personal_dictionary {
  _exit();
  if ( @_ ) {
    $Lingua::Ispell::options{'-p'} = [ @_ ];
  }
  else {
    delete $Lingua::Ispell::options{'-p'};
  }
}



1;


=head1 FUTURE ENHANCEMENTS

ispell options:

  -w chars
         Specify additional characters that can be part of a word.

=head1 DEPENDENCIES

Lingua::Ispell uses the external program ispell, which is
the "International Ispell", available at

  http://fmg-www.cs.ucla.edu/geoff/ispell.html

as well as various archives and mirrors, such as

  ftp://ftp.math.orst.edu/pub/ispell-3.1/

This is a very popular program, and may already be
installed on your system.

Lingua::Ispell also uses the standard perl modules FileHandle,
IPC::Open2, and Carp.

=head1 AUTHOR

jdporter@min.net (John Porter)

=head1 COPYRIGHT

This module is free software; you may redistribute it and/or
modify it under the same terms as Perl itself.

=cut