Name            Date          Size      #Lines   LOC
..              03-May-2022   -
bin/            06-Aug-2016   -         324      122
lib/URI/        06-Aug-2016   -         1,360    923
t/              06-Aug-2016   -         569      398
.travis.yml     06-Aug-2016   108       11       10
Build.PL        06-Aug-2016   944       43       29
Changes         06-Aug-2016   6.5 KiB   192      141
INSTALL         06-Aug-2016   861       39       23
LICENSE         06-Aug-2016   18 KiB    380      292
MANIFEST        06-Aug-2016   330       25       24
MANIFEST.SKIP   06-Aug-2016   1.1 KiB   67       42
META.json       06-Aug-2016   1.4 KiB   61       60
META.yml        06-Aug-2016   908       34       33
README          06-Aug-2016   6.6 KiB   230      150
TODO            06-Aug-2016   492       11       10
appveyor.yml    06-Aug-2016   378       22       16

README

NAME

    URI::Find - Find URIs in arbitrary text

SYNOPSIS

      require URI::Find;

      my $finder = URI::Find->new(\&callback);

      my $how_many_found = $finder->find(\$text);

DESCRIPTION

    This module does one thing: it finds URIs and URLs in plain text. It
    finds them quickly and it finds them all (or at least everything that
    URI.pm considers a URI). It only finds URIs which include a scheme
    (http:// or the like). For something a bit less strict, have a look at
    URI::Find::Schemeless.

    For a command-line interface, urifind is provided.

  Public Methods

    new

        my $finder = URI::Find->new(\&callback);

      Creates a new URI::Find object.

      &callback is a function which is called on each URI found. It is
      passed two arguments: the first is a URI object representing the URI
      found, and the second is the original text of the URI found. The
      return value of the callback will replace the original URI in the
      text.

    find

        my $how_many_found = $finder->find(\$text);

      $text is a string to search and possibly modify with your callback.
      It is passed by reference, and find returns the number of URIs found.

      Alternatively, find can be called with a replacement function for the
      rest of the text:

        use CGI qw(escapeHTML);
        # ...
        my $how_many_found = $finder->find(\$text, \&escapeHTML);

      This will not only call the callback function for every URL found
      (and perform the replacement instructions therein), but also run the
      rest of the text through escapeHTML(). This makes it easier to turn
      plain text which contains URLs into HTML (see the last example under
      EXAMPLES).

  Protected Methods

    I got a bunch of mail from people asking if I'd add certain features to
    URI::Find. Most wanted the search to be less restrictive, do more
    heuristics, etc. Since many of the requests were contradictory, I'm
    letting people create their own custom subclasses to do what they want.

    The following are methods internal to URI::Find which a subclass can
    override to change the way URI::Find acts. They are only to be called
    inside a URI::Find subclass. Users of this module are NOT to use these
    methods.

    uri_re

        my $uri_re = $self->uri_re;

      Returns the regex for finding absolute, schemed URIs
      (http://www.foo.com and such). This, combined with
      schemeless_uri_re(), is what finds candidate URIs.

      Usually this method does not have to be overridden.

    schemeless_uri_re

        my $schemeless_re = $self->schemeless_uri_re;

      Returns the regex for finding schemeless URIs (www.foo.com and such)
      and other things which might be URIs. By default this will match
      nothing (though it used to try to find schemeless URIs which started
      with www and ftp).

      Many people will want to override this method. See
      URI::Find::Schemeless for a subclass that does a reasonable job of
      finding URIs which might be missing the scheme.
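
      For readers who only need schemeless matching, a short sketch of
      using the URI::Find::Schemeless subclass in place of URI::Find may
      help. The sample text is invented, and the exact set of strings the
      subclass accepts is up to that module, so treat this as an
      illustration of the interface only.

```perl
use strict;
use warnings;
use URI::Find::Schemeless;

# Sample text invented for this sketch; note the missing "http://".
my $text = "The docs moved to www.example.com/docs last year.";

my @found;
my $finder = URI::Find::Schemeless->new(sub {
    my ($uri, $orig_uri) = @_;
    push @found, "$uri";   # stringify the URI object
    return $orig_uri;      # leave the text unchanged
});
my $count = $finder->find(\$text);

print "$count found: @found\n";
```

      The constructor and find() are used exactly as with URI::Find; only
      the class name changes.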

    uric_set

        my $uric_set = $self->uric_set;

      Returns a set matching the 'uric' set defined in RFC 2396, suitable
      for putting into a character set ([]) in a regex.

      You almost never have to override this.
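
      To make "a set suitable for a character set" concrete, here is a
      rough sketch that builds such a string from the RFC 2396 definitions
      (uric = reserved | unreserved | escaped). The variable names are
      hypothetical and this is not the module's own code; the "%" stands in
      loosely for the escaped production.

```perl
use strict;
use warnings;

# Character groups as listed in RFC 2396.
my $reserved = q{;/?:@&=+$,};
my $mark     = q{-_.!~*'()};

# quotemeta() makes the punctuation safe inside [...]; alphanumerics and
# "%" (for %XX escapes) are appended as plain ranges.
my $uric_set = quotemeta($reserved . $mark) . 'A-Za-z0-9%';

for my $candidate ('http://www.example.com/a?b=c', 'has space') {
    my $verdict = $candidate =~ /\A[$uric_set]+\z/ ? 'only uric' : 'other chars';
    print "$candidate: $verdict\n";
}
```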

    cruft_set

        my $cruft_set = $self->cruft_set;

      Returns a set of characters which are considered garbage. Used by
      decruft().

    decruft

        my $uri = $self->decruft($uri);

      Sometimes garbage characters like periods and parentheses get
      accidentally matched along with the URI. In order for the URI to be
      properly identified, it must sometimes be "decrufted": the garbage
      characters are stripped.

      This method takes a candidate URI and strips off any cruft it finds.

    recruft

        my $uri = $self->recruft($uri);

      This method puts back the cruft taken off with decruft(). This is
      necessary because the cruft is destructively removed from the string
      before invoking the user's callback, so it has to be put back
      afterwards.
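
      The strip-then-restore idea can be sketched with two standalone
      functions. The names strip_cruft and restore_cruft and the cruft
      characters chosen here are assumptions for illustration; the module's
      decruft() and recruft() may handle more cases.

```perl
use strict;
use warnings;

# Hypothetical cruft characters for this sketch; the module's own
# cruft_set() may differ.
my $cruft = q{]),.'";};

# Strip trailing cruft from a candidate and return both pieces.
sub strip_cruft {
    my ($candidate) = @_;
    my $end_cruft = '';
    if ( $candidate =~ s/([\Q$cruft\E]+)\z// ) {
        $end_cruft = $1;
    }
    return ($candidate, $end_cruft);
}

# Re-attach the cruft after the callback has produced its replacement.
sub restore_cruft {
    my ($replacement, $end_cruft) = @_;
    return $replacement . $end_cruft;
}

my ($uri, $end) = strip_cruft('http://www.example.com/page).');
my $linked = restore_cruft(qq{<a href="$uri">$uri</a>}, $end);
print "$linked\n";
```

      Keeping the removed characters around means the closing parenthesis
      and period end up outside the generated anchor instead of being lost.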

    schemeless_to_schemed

        my $schemed_uri = $self->schemeless_to_schemed($schemeless_uri);

      This takes a schemeless URI and returns an absolute, schemed URI. The
      standard implementation supplies ftp:// for URIs which start with
      ftp., and http:// otherwise.
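
      The rule just described is small enough to sketch directly. The
      function name add_default_scheme is hypothetical, and this mirrors
      only the behaviour stated above, not the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: ftp:// for hosts starting with "ftp.",
# http:// for everything else.
sub add_default_scheme {
    my ($schemeless) = @_;
    my $scheme = $schemeless =~ /^ftp\./ ? 'ftp' : 'http';
    return "$scheme://$schemeless";
}

print add_default_scheme('ftp.example.com/pub'), "\n";
print add_default_scheme('www.example.com'), "\n";
```

      A subclass wanting a different default, such as https://, would
      override schemeless_to_schemed with its own rule along these lines.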

    is_schemed

        $obj->is_schemed($uri);

      Returns true if the given URI is schemed and false if it is
      schemeless.
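
      One plausible way to make this distinction is to test for a leading
      scheme as defined by the RFC 3986 grammar. The helper looks_schemed
      below is an assumption for illustration and is not the module's own
      test.

```perl
use strict;
use warnings;

# Hypothetical check based on the RFC 3986 scheme grammar:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
sub looks_schemed {
    my ($candidate) = @_;
    return $candidate =~ /\A[A-Za-z][A-Za-z0-9+.\-]*:/ ? 1 : 0;
}

for my $candidate ('http://www.example.com/', 'mailto:user@example.com', 'www.example.com') {
    my $verdict = looks_schemed($candidate) ? 'schemed' : 'schemeless';
    print "$candidate: $verdict\n";
}
```

      A purely syntactic check like this also treats host:port strings such
      as www.example.com:8080 as schemed, so a real implementation may need
      to be stricter.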

    badinvo

        __PACKAGE__->badinvo($extra_levels, $msg)

      This is used to complain about bogus subroutine/method invocations.
      The args are optional.

  Old Functions

    The old find_uri() function is still around and it works, but it's
    deprecated.

EXAMPLES

    Store a list of all URIs (normalized) in the document.

      my @uris;
      my $finder = URI::Find->new(sub {
          my($uri, $orig_uri) = @_;
          push @uris, $uri;
          return $orig_uri;   # leave the text unchanged
      });
      $finder->find(\$text);

    Print the original URI text found and the normalized representation.

      my $finder = URI::Find->new(sub {
          my($uri, $orig_uri) = @_;
          print "The text '$orig_uri' represents '$uri'\n";
          return $orig_uri;
      });
      $finder->find(\$text);

    Check each URI in the document to see if it exists.

      use LWP::Simple;

      my $finder = URI::Find->new(sub {
          my($uri, $orig_uri) = @_;
          if( head $uri ) {
              print "$orig_uri is okay\n";
          }
          else {
              print "$orig_uri cannot be found\n";
          }
          return $orig_uri;
      });
      $finder->find(\$text);

    Turn plain text into HTML, with each URI found wrapped in an HTML
    anchor.

      use CGI qw(escapeHTML);
      use URI::Find;

      my $finder = URI::Find->new(sub {
          my($uri, $orig_uri) = @_;
          return qq|<a href="$uri">$orig_uri</a>|;
      });
      $finder->find(\$text, \&escapeHTML);
      print "<pre>$text</pre>";

NOTES

    Will not find URLs with Internationalized Domain Names or pretty much
    any non-ASCII stuff in them. See
    http://rt.cpan.org/Ticket/Display.html?id=44226

AUTHOR

    Michael G Schwern <schwern@pobox.com> with insight from Uri Gutman,
    Greg Bacon, Jeff Pinyan, Roderick Schertler and others.

    Roderick Schertler <roderick@argon.org> maintained versions 0.11 to
    0.16.

    Darren Chamberlain wrote urifind.

LICENSE

    Copyright 2000, 2009-2010, 2014, 2016 by Michael G Schwern
    <schwern@pobox.com>.

    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    See http://www.perlfoundation.org/artistic_license_1_0

SEE ALSO

    urifind, URI::Find::Schemeless, URI, RFC 3986 Appendix C