Name           Date         Size      #Lines  LOC
..             03-May-2022  -
examples/      17-Oct-2014  -         24      9
lib/HTML/      17-Oct-2014  -         677     262
t/             17-Oct-2014  -         564     392
Build.PL       17-Oct-2014  1.5 KiB   68      56
CONTRIBUTORS   17-Oct-2014  625       20      14
Changes        17-Oct-2014  4 KiB     112     85
INSTALL        17-Oct-2014  970       45      24
LICENSE        17-Oct-2014  17.9 KiB  380     292
MANIFEST       17-Oct-2014  439       31      30
META.json      17-Oct-2014  2.3 KiB   87      85
META.yml       17-Oct-2014  1.4 KiB   55      54
Makefile.PL    17-Oct-2014  1.8 KiB   87      70
README         17-Oct-2014  12.3 KiB  339     245
cpanfile       17-Oct-2014  719       30      26
dist.ini       17-Oct-2014  922       62      50

README

NAME
    HTML::Restrict - Strip unwanted HTML tags and attributes

VERSION
    version 2.2.2

SYNOPSIS
        use HTML::Restrict;

        my $hr = HTML::Restrict->new();

        # use default rules to start with (strip away all HTML)
        my $processed = $hr->process('  <b>i am bold</b>  ');

        # $processed now equals: 'i am bold'

        # Now, a less restrictive example:
        use HTML::Restrict;

        my $hr = HTML::Restrict->new(
            rules => {
                b   => [],
                img => [qw( src alt / )]
            }
        );

        my $html = q[<body><b>hello</b> <img src="pic.jpg" alt="me" id="test" /></body>];
        my $processed = $hr->process( $html );

        # $processed now equals: <b>hello</b> <img src="pic.jpg" alt="me" />

DESCRIPTION
    This module uses HTML::Parser to strip HTML from text in a restrictive
    manner. By default all HTML is restricted. You may alter the default
    behaviour by supplying your own tag rules.

CONSTRUCTOR AND STARTUP
  new()
    Creates and returns a new HTML::Restrict object.

        my $hr = HTML::Restrict->new();

    HTML::Restrict doesn't require any params to be passed to new. If your
    goal is to remove all HTML from text, then no further setup is required.
    Just pass your text to the process() method and you're done:

        my $plain_text = $hr->process( $html );

    If you need to set up specific rules, have a look at the params which
    HTML::Restrict recognizes:

    *   "rules => \%rules"

        Sets the rules which will be used to process your data. By default
        all HTML tags are off limits. Use this argument to define the HTML
        elements and corresponding attributes you'd like to allow.
        Essentially, consider the default behaviour to be:

            rules => {}

        Rules should be passed as a HASHREF of allowed tags. Each hash value
        should be an ARRAYREF of the attributes allowed for that tag. For
        example, if you want to allow a fair amount of HTML, you can try
        something like this:

            my %rules = (
                a       => [qw( href target )],
                b       => [],
                caption => [],
                center  => [],
                em      => [],
                i       => [],
                img     => [qw( alt border height width src style )],
                li      => [],
                ol      => [],
                p       => [qw(style)],
                span    => [qw(style)],
                strong  => [],
                sub     => [],
                sup     => [],
                table   => [qw( style border cellspacing cellpadding align )],
                tbody   => [],
                td      => [],
                tr      => [],
                u       => [],
                ul      => [],
            );

            my $hr = HTML::Restrict->new( rules => \%rules );

        Or, to allow only bolded text:

            my $hr = HTML::Restrict->new( rules => { b => [] } );

        Allow bolded text, images and some (but not all) image attributes:

            my %rules = (
                b   => [],
                img => [qw( src alt width height border / )],
            );
            my $hr = HTML::Restrict->new( rules => \%rules );

        Since HTML::Parser treats a closing slash as an attribute, you'll
        need to add "/" to your list of allowed attributes if you'd like
        your tags to retain closing slashes. For example:

            my $hr = HTML::Restrict->new( rules => { hr => [] } );
            $hr->process( "<hr />" ); # returns: <hr>

            my $hr = HTML::Restrict->new( rules => { hr => [qw( / )] } );
            $hr->process( "<hr />" ); # returns: <hr />

        HTML::Restrict strips away any tags and attributes which are not
        explicitly allowed. It also rebuilds your explicitly allowed tags
        and places their attributes in the order in which they appear in
        your rules.

        So, if you define the following rules:

            my %rules = (
                ...
                img => [qw( src alt title width height id / )],
                ...
            );

        then your image tags will all be built like this:

            <img src="..." alt="..." title="..." width="..." height="..." id="..." />

        This gives you greater consistency in your tag layout. If you don't
        care about attribute order you don't need to pay any attention to
        this, but you should be aware that your elements are being
        reconstructed rather than just stripped down.

        As of 2.1.0, you can also specify a regex to be tested against the
        attribute value. This feature should be considered experimental for
        the time being:

            my $hr = HTML::Restrict->new(
                rules => {
                    iframe => [
                        qw( width height allowfullscreen ),
                        {   src         => qr{^http://www\.youtube\.com},
                            frameborder => qr{^(0|1)$},
                        }
                    ],
                    img => [ qw( alt ), { src => qr{^/my/images/} }, ],
                },
            );

            my $html = '<img src="http://www.example.com/image.jpg" alt="Alt Text">';
            my $processed = $hr->process( $html );

            # $processed now equals: <img alt="Alt Text">

    *   "trim => [0|1]"

        By default all leading and trailing spaces will be removed when text
        is processed. Set this value to 0 in order to disable this
        behaviour.

    *   "uri_schemes => [undef, 'http', 'https', 'irc', ... ]"

        As of version 1.0.3, URI scheme checking is performed on all href
        and src tag attributes. The following schemes are allowed out of the
        box. No action is required on your part:

            [ undef, 'http', 'https' ]

        (undef represents relative URIs). These restrictions have been put
        in place to prevent XSS in the form of:

            <a href="javascript:alert(document.cookie)">click for cookie!</a>

        See URI for more detailed info on scheme parsing. If, for example,
        you wanted to filter out every scheme except https, you would do it
        like this:

            uri_schemes => ['https']

        This feature is new in 1.0.3. Previous to this, there was no scheme
        checking at all. Moving forward, you'll need to explicitly whitelist
        all URI schemes which are not supported by default. This is in
        keeping with the whitelisting behaviour of this module and is also
        the safest possible approach. Keep in mind that changes to
        uri_schemes are not additive, so you'll need to include the defaults
        in any changes you make, should you wish to keep them:

            # defaults + irc + mailto
            uri_schemes => [ undef, 'http', 'https', 'irc', 'mailto' ]

    *   "allow_declaration => [0|1]"

        Set this value to true if you'd like to allow/preserve DOCTYPE
        declarations in your content. Useful when cleaning up your own
        static files or templates. This feature is off by default.

            my $html = q[<!doctype html><body>foo</body>];

            my $hr = HTML::Restrict->new( allow_declaration => 1 );
            $html = $hr->process( $html );
            # $html is now: "<!doctype html>foo"

    *   "allow_comments => [0|1]"

        Set this value to true if you'd like to allow/preserve HTML comments
        in your content. Useful when cleaning up your own static files or
        templates. This feature is off by default.

            my $html = q[<body><!-- comments! -->foo</body>];

            my $hr = HTML::Restrict->new( allow_comments => 1 );
            $html = $hr->process( $html );
            # $html is now: "<!-- comments! -->foo"

    *   "replace_img => [0|1|CodeRef]"

        Set the value to true if you'd like to have img tags replaced with
        "[IMAGE: ...]" containing the alt attribute text. If you set it to a
        code reference, you can provide your own replacement (which may even
        contain HTML).

            sub replacer {
                my ($tagname, $attr, $text) = @_; # from HTML::Parser
                return qq{<a href="$attr->{src}">IMAGE: $attr->{alt}</a>};
            }

            my $hr = HTML::Restrict->new( replace_img => \&replacer );

        This option will only take effect if the img tag is not included
        in the allowed HTML.

    *   "strip_enclosed_content => [ 'script', 'style', ... ]"

        The default behaviour up to 1.0.4 was to preserve the content
        between script and style tags, even when the tags themselves were
        being deleted. So, you'd be left with a bunch of JavaScript or CSS,
        just with the enclosing tags missing. This is almost never what you
        want, so as of 1.0.5 the default is to remove any script or style
        info which is enclosed in these tags, unless they have specifically
        been whitelisted in the rules. This is a sane default when cleaning
        up content submitted via a web form. However, if you're using
        HTML::Restrict to purge your own HTML you can be more restrictive:

            # strip the head section, in addition to JS and CSS
            my $html = '<html><head>...</head><body>...<script>JS here</script>foo';

            my $hr = HTML::Restrict->new(
                strip_enclosed_content => [ 'script', 'style', 'head' ]
            );

            $html = $hr->process( $html );
            # $html is now '...foo' (with the default rules, the html and
            # body tags themselves are stripped as well)

        The caveat here is that HTML::Restrict will not try to fix broken
        HTML. In the above example, if you have any opening script, style or
        head tags which don't also include matching closing tags, all
        following content will be stripped away, regardless of any parent
        tags.

        Keep in mind that changes to strip_enclosed_content are not
        additive, so if you are adding additional tags you'll need to
        include the entire list of tags whose enclosed content you'd like to
        remove. This feature strips script and style tags by default.

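    Because uri_schemes and strip_enclosed_content replace their defaults
    rather than extend them, it can help to build both lists from the
    documented defaults plus your own extras. The following is a minimal
    sketch using only core Perl; the variable names are illustrative, and
    the default values are copied from the descriptions above rather than
    read from the module:

```perl
use strict;
use warnings;

# Defaults as documented above. They are repeated here because passing
# your own list replaces the module's list instead of adding to it.
my @default_schemes = ( undef, 'http', 'https' );    # undef = relative URIs
my @default_strip   = ( 'script', 'style' );

# Build the complete lists: defaults first, then the extras you want.
my %args = (
    uri_schemes            => [ @default_schemes, 'irc', 'mailto' ],
    strip_enclosed_content => [ @default_strip,   'head' ],
);
```

    The resulting hash can then be handed to the constructor, for example
    as HTML::Restrict->new( %args ).
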
SUBROUTINES/METHODS
  process( $html )
    This is the method which does the real work. It parses your data,
    removes any tags and attributes which are not specifically allowed and
    returns the resulting text. Requires and returns a SCALAR.

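    Since process() works on one scalar at a time, cleaning a whole set of
    form fields means calling it once per value. The sketch below shows one
    way to do that with a small helper. The helper name clean_fields is
    hypothetical and not part of HTML::Restrict, and the callback used in
    the demonstration is only a stand-in so that the example runs without
    the module installed:

```perl
use strict;
use warnings;

# Apply a cleaning callback to every defined value in a hash of fields.
# $clean is any code ref which takes one scalar and returns one scalar.
# Undefined values are passed through untouched.
sub clean_fields {
    my ( $fields, $clean ) = @_;
    my %out;
    for my $name ( keys %{$fields} ) {
        my $value = $fields->{$name};
        $out{$name} = defined $value ? $clean->($value) : undef;
    }
    return \%out;
}

# Stand-in callback (upper-casing) so the sketch is self-contained.
my $demo = clean_fields( { title => 'hi', body => undef }, sub { uc shift } );
```

    With the module loaded, the callback would instead be something like
    sub { $hr->process( $_[0] ) }, where $hr is your HTML::Restrict object.
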
CAVEATS
    Please note that all tag and attribute names passed via the rules param
    must be supplied in lower case.

        # correct
        my $hr = HTML::Restrict->new( rules => { body => ['onload'] } );

        # throws a fatal error
        my $hr = HTML::Restrict->new( rules => { Body => ['onLoad'] } );

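    If your rules come from a config file or another source you don't
    fully control, one option is to lower-case them before handing them to
    new(). This is a minimal sketch in core Perl; lc_rules is a
    hypothetical helper, not something the module provides, and it assumes
    the rules layout described above (plain attribute names, optionally
    mixed with hash refs mapping attribute names to regexes):

```perl
use strict;
use warnings;

# Return a copy of a rules hash with all tag and attribute names
# lower-cased. Hash-ref entries (attribute => regex) get their keys
# lower-cased; the regex values themselves are left untouched.
sub lc_rules {
    my ($rules) = @_;
    my %out;
    for my $tag ( keys %{$rules} ) {
        my @attrs;
        for my $item ( @{ $rules->{$tag} } ) {
            if ( ref($item) eq 'HASH' ) {
                my %constraints = map { ( lc($_) => $item->{$_} ) } keys %{$item};
                push @attrs, \%constraints;
            }
            else {
                push @attrs, lc $item;
            }
        }
        $out{ lc $tag } = \@attrs;
    }
    return \%out;
}

my $rules = lc_rules(
    { Body => ['onLoad'], IMG => [ 'SRC', { Alt => qr/\w/ } ] }
);
```

    The normalized hash ref can then be passed as the rules param. Note
    that two keys differing only in case (for example "IMG" and "img")
    would collapse into one entry, with one silently overwriting the
    other.
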
MOTIVATION
    There are already several modules on the CPAN which accomplish much of
    the same thing, but after doing a lot of poking around, I was unable to
    find a solution with a simple setup which I was happy with.

    The most common use case might be stripping HTML from user-submitted
    data completely or allowing just a few tags and attributes to be
    displayed. With the exception of URI scheme checking and the
    experimental attribute regexes described above, this module doesn't do
    any validation on the actual content of the tags or attributes. If
    this is a requirement, you can either mess with the parser object,
    post-process the text yourself or have a look at one of the more
    feature-rich modules in the SEE ALSO section below.

    My aim here is to keep things easy and, hopefully, cover a lot of the
    less complex use cases with just a few lines of code and some brief
    documentation. The idea is to be up and running quickly.

SEE ALSO
    HTML::TagFilter, HTML::Defang, HTML::Declaw, HTML::StripScripts,
    HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

ACKNOWLEDGEMENTS
    Thanks to Raybec Communications <http://www.raybec.com> for funding my
    work on this module and for releasing it to the world.

    Thanks also to the following for patches, bug reports and assistance:

    Mark Jubenville (ioncache)

    Duncan Forsyth

    Rick Moore

    Arthur Axel 'fREW' Schmidt

    perlpong

    David Golden

    Graham TerMarsch

    Dagfinn Ilmari Mannsåker

    Graham Knop

    Carwyn Ellis

AUTHOR
    Olaf Alders <olaf@wundercounter.com>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2013 by Olaf Alders.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.