NAME
    HTML::Restrict - Strip unwanted HTML tags and attributes

VERSION
    version 2.2.2

SYNOPSIS
        use HTML::Restrict;

        my $hr = HTML::Restrict->new();

        # use default rules to start with (strip away all HTML)
        my $processed = $hr->process(' <b>i am bold</b> ');

        # $processed now equals: 'i am bold'

        # Now, a less restrictive example:
        use HTML::Restrict;

        my $hr = HTML::Restrict->new(
            rules => {
                b   => [],
                img => [qw( src alt / )]
            }
        );

        my $html = q[<body><b>hello</b> <img src="pic.jpg" alt="me" id="test" /></body>];
        my $processed = $hr->process( $html );

        # $processed now equals: <b>hello</b> <img src="pic.jpg" alt="me" />

DESCRIPTION
    This module uses HTML::Parser to strip HTML from text in a restrictive
    manner. By default, all HTML is removed. You may alter the default
    behaviour by supplying your own tag rules.

CONSTRUCTOR AND STARTUP
  new()
    Creates and returns a new HTML::Restrict object.

        my $hr = HTML::Restrict->new();

    HTML::Restrict doesn't require any params to be passed to new. If your
    goal is to remove all HTML from text, then no further setup is required.
    Just pass your text to the process() method and you're done:

        my $plain_text = $hr->process( $html );

    If you need to set up specific rules, have a look at the params which
    HTML::Restrict recognizes:

    *   "rules => \%rules"

        Sets the rules which will be used to process your data. By default
        all HTML tags are off limits. Use this argument to define the HTML
        elements and corresponding attributes you'd like to allow.
        Essentially, consider the default behaviour to be:

            rules => {}

        Rules should be passed as a HASHREF of allowed tags. Each hash value
        should be an ARRAYREF of the attributes allowed for that tag. For
        example, if you want to allow a fair amount of HTML, you can try
        something like this:

            my %rules = (
                a       => [qw( href target )],
                b       => [],
                caption => [],
                center  => [],
                em      => [],
                i       => [],
                img     => [qw( alt border height width src style )],
                li      => [],
                ol      => [],
                p       => [qw(style)],
                span    => [qw(style)],
                strong  => [],
                sub     => [],
                sup     => [],
                table   => [qw( style border cellspacing cellpadding align )],
                tbody   => [],
                td      => [],
                tr      => [],
                u       => [],
                ul      => [],
            );

            my $hr = HTML::Restrict->new( rules => \%rules );

        Or, to allow only bolded text:

            my $hr = HTML::Restrict->new( rules => { b => [] } );

        Allow bolded text, images and some (but not all) image attributes:

            my %rules = (
                b   => [],
                img => [qw( src alt width height border / )],
            );
            my $hr = HTML::Restrict->new( rules => \%rules );

        Since HTML::Parser treats a closing slash as an attribute, you'll
        need to add "/" to your list of allowed attributes if you'd like
        your tags to retain closing slashes. For example:

            my $hr = HTML::Restrict->new( rules => { hr => [] } );
            $hr->process( "<hr />" ); # returns: <hr>

            $hr = HTML::Restrict->new( rules => { hr => [qw( / )] } );
            $hr->process( "<hr />" ); # returns: <hr />

        HTML::Restrict strips away any tags and attributes which are not
        explicitly allowed. It also rebuilds your explicitly allowed tags
        and places their attributes in the order in which they appear in
        your rules.

        So, if you define the following rules:

            my %rules = (
                ...
                img => [qw( src alt title width height id / )]
                ...
            );

        then your image tags will all be built like this:

            <img src="..." alt="..." title="..." width="..." height="..." id="..." />

        This gives you greater consistency in your tag layout. If you don't
        care about attribute order you don't need to pay any attention to
        this, but you should be aware that your tags are being
        reconstructed rather than just stripped down.

        As of 2.1.0, you can also specify a regex to be tested against the
        attribute value. An attribute whose value does not match its regex
        is dropped. This feature should be considered experimental for the
        time being:

            my $hr = HTML::Restrict->new(
                rules => {
                    iframe => [
                        qw( width height allowfullscreen ),
                        {   src         => qr{^http://www\.youtube\.com},
                            frameborder => qr{^(0|1)$},
                        }
                    ],
                    img => [ qw( alt ), { src => qr{^/my/images/} }, ],
                },
            );

            my $html = '<img src="http://www.example.com/image.jpg" alt="Alt Text">';
            my $processed = $hr->process( $html );

            # $processed now equals: <img alt="Alt Text">

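        The attribute reordering described for "rules" can be checked with
        a short sketch. The input markup and variable names here are made
        up for illustration, and the rules deliberately list the
        attributes in a different order from the input:

```perl
use strict;
use warnings;
use HTML::Restrict;

# The rules list the allowed attributes as src, alt, id.
my $hr = HTML::Restrict->new(
    rules => { img => [qw( src alt id )] },
);

# The (made-up) input lists them as id, alt, src and adds a class
# attribute, which the rules do not allow.
my $html = '<img id="logo" class="big" alt="Logo" src="/logo.png">';

my $processed = $hr->process($html);

# Going by the description of "rules", the rebuilt tag should follow
# the order given in the rules and should not carry the class attribute.
print "$processed\n";
```

        Printing the result is a quick way to confirm the layout you will
        get before relying on it in templates or tests.
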
    *   "trim => [0|1]"

        By default all leading and trailing spaces will be removed when text
        is processed. Set this value to 0 in order to disable this
        behaviour.

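        A minimal sketch of the difference, reusing the input string from
        the SYNOPSIS. The variable names are illustrative only:

```perl
use strict;
use warnings;
use HTML::Restrict;

my $input = ' <b>i am bold</b> ';

# Default behaviour: leading and trailing whitespace is removed.
my $trimmed = HTML::Restrict->new->process($input);

# trim => 0: the surrounding whitespace is left in place.
my $untrimmed = HTML::Restrict->new( trim => 0 )->process($input);

# Brackets make the surrounding whitespace visible.
printf "[%s]\n[%s]\n", $trimmed, $untrimmed;
```

        Turning trimming off can be useful when the processed fragments
        are later concatenated and the whitespace between them matters.
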
    *   "uri_schemes => [undef, 'http', 'https', 'irc', ... ]"

        As of version 1.0.3, URI scheme checking is performed on all href
        and src attributes. The following schemes are allowed out of the
        box. No action is required on your part:

            [ undef, 'http', 'https' ]

        (undef represents relative URIs). These restrictions have been put
        in place to prevent XSS in the form of:

            <a href="javascript:alert(document.cookie)">click for cookie!</a>

        See URI for more detailed info on scheme parsing. If, for example,
        you wanted to allow only https URIs, you would do it like this:

            uri_schemes => ['https']

        This feature is new in 1.0.3. Previous to this, there was no scheme
        checking at all. Moving forward, you'll need to whitelist explicitly
        all URI schemes which are not supported by default. This is in
        keeping with the whitelisting behaviour of this module and is also
        the safest possible approach. Keep in mind that changes to
        uri_schemes are not additive, so you'll need to include the defaults
        in any changes you make, should you wish to keep them:

            # defaults + irc + mailto
            uri_schemes => [ undef, 'http', 'https', 'irc', 'mailto' ]

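        The effect of the scheme list can be sketched as follows. The
        links and variable names are invented for the example, and the
        sketch assumes that a link with a disallowed scheme simply loses
        its href attribute:

```perl
use strict;
use warnings;
use HTML::Restrict;

my $html = join ' ',
    '<a href="javascript:alert(1)">script</a>',
    '<a href="mailto:someone@example.com">mail</a>',
    '<a href="https://example.com/">web</a>';

# Default schemes: undef (relative URIs), http and https.
my $default = HTML::Restrict->new(
    rules => { a => ['href'] },
);

# Defaults plus mailto. The first entry is a bare undef, matching the
# default list, so relative URIs stay allowed.
my $with_mailto = HTML::Restrict->new(
    rules       => { a => ['href'] },
    uri_schemes => [ undef, 'http', 'https', 'mailto' ],
);

print $default->process($html),     "\n";
print $with_mailto->process($html), "\n";
```

        Comparing the two printed lines shows which links keep their
        href under each configuration.
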
    *   allow_declaration => [0|1]

        Set this value to true if you'd like to allow/preserve DOCTYPE
        declarations in your content. Useful when cleaning up your own
        static files or templates. This feature is off by default.

            my $html = q[<!doctype html><body>foo</body>];

            my $hr = HTML::Restrict->new( allow_declaration => 1 );
            $html = $hr->process( $html );
            # $html is now: "<!doctype html>foo"

    *   allow_comments => [0|1]

        Set this value to true if you'd like to allow/preserve HTML comments
        in your content. Useful when cleaning up your own static files or
        templates. This feature is off by default.

            my $html = q[<body><!-- comments! -->foo</body>];

            my $hr = HTML::Restrict->new( allow_comments => 1 );
            $html = $hr->process( $html );
            # $html is now: "<!-- comments! -->foo"

    *   replace_img => [0|1|CodeRef]

        Set the value to true if you'd like to have img tags replaced with
        "[IMAGE: ...]" containing the alt attribute text. If you set it to a
        code reference, you can provide your own replacement (which may even
        contain HTML).

            sub replacer {
                my ($tagname, $attr, $text) = @_; # from HTML::Parser
                return qq{<a href="$attr->{src}">IMAGE: $attr->{alt}</a>};
            }

            my $hr = HTML::Restrict->new( replace_img => \&replacer );

        This option will only take effect if the img tag is not included
        in the allowed HTML.

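        For the simple true-value form, a minimal sketch might look like
        this. The markup and alt text are made up; img is deliberately
        left out of the rules so that the replacement applies:

```perl
use strict;
use warnings;
use HTML::Restrict;

# img is not in the rules, so replace_img is allowed to take effect.
my $hr = HTML::Restrict->new(
    rules       => { b => [] },
    replace_img => 1,
);

my $html = '<b>Photo:</b> <img src="pic.jpg" alt="me at the beach">';

my $processed = $hr->process($html);

# The img tag should be gone, with an "[IMAGE: ...]" marker carrying
# the alt text in its place.
print "$processed\n";
```

        This keeps a hint of the image in plain-text output, for example
        in email digests or search snippets.
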
    *   strip_enclosed_content => [0|1]

        The default behaviour up to 1.0.4 was to preserve the content
        between script and style tags, even when the tags themselves were
        being deleted. So, you'd be left with a bunch of JavaScript or CSS,
        just with the enclosing tags missing. This is almost never what you
        want, so as of 1.0.5 the default is to remove any script or style
        content which is enclosed in these tags, unless the tags have
        specifically been whitelisted in the rules. This is a sane default
        when cleaning up content submitted via a web form. However, if
        you're using HTML::Restrict to purge your own HTML you can be more
        restrictive.

            # strip the head section, in addition to JS and CSS
            my $html = '<html><head>...</head><body>...<script>JS here</script>foo';

            my $hr = HTML::Restrict->new(
                strip_enclosed_content => [ 'script', 'style', 'head' ]
            );

            $html = $hr->process( $html );
            # html and body are not in the rules, so those tags go too:
            # $html is now '...foo';

        The caveat here is that HTML::Restrict will not try to fix broken
        HTML. In the above example, if you have any opening script, style or
        head tags which don't also include matching closing tags, all
        following content will be stripped away, regardless of any parent
        tags.

        Keep in mind that changes to strip_enclosed_content are not
        additive, so if you are adding additional tags you'll need to
        include the entire list of tags whose enclosed content you'd like to
        remove. This feature strips script and style tags by default.

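        The contrast between the current default and the pre-1.0.5
        behaviour can be sketched as below. Passing an empty list to
        switch the feature off is an assumption of this sketch, based on
        the list not being additive; the input string is invented:

```perl
use strict;
use warnings;
use HTML::Restrict;

my $html = '<script>alert(1)</script>hello';

# Default: script and style content is removed along with the tags.
my $default = HTML::Restrict->new->process($html);

# Assumption: an empty list means no tag has its enclosed content
# stripped, so only the tags themselves are removed.
my $kept = HTML::Restrict->new( strip_enclosed_content => [] )->process($html);

print "$default\n$kept\n";
```

        The second form is rarely what you want for user-submitted
        content, since the script text ends up in the output.
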
SUBROUTINES/METHODS
  process( $html )
    This is the method which does the real work. It parses your data,
    removes any tags and attributes which are not specifically allowed and
    returns the resulting text. Requires and returns a SCALAR.

CAVEATS
    Please note that all tag and attribute names passed via the rules param
    must be supplied in lower case.

        # correct
        my $hr = HTML::Restrict->new( rules => { body => ['onload'] } );

        # throws a fatal error
        my $hr = HTML::Restrict->new( rules => { Body => ['onLoad'] } );

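    When rules come from configuration or other external input, the fatal
    error can be caught with a plain eval. This is only a sketch: it
    assumes the error is raised from new(), as the comment in the example
    above suggests, and the fallback to lower-cased names is a choice made
    purely for illustration:

```perl
use strict;
use warnings;
use HTML::Restrict;

# Mixed-case names are rejected, so guard construction with eval.
my $hr = eval {
    HTML::Restrict->new( rules => { Body => ['onLoad'] } );
};

if ( !$hr ) {
    my $error = $@ || 'unknown error';
    print "Could not build HTML::Restrict object: $error\n";

    # Fall back to the lower-cased equivalents of the same rules.
    $hr = HTML::Restrict->new( rules => { body => ['onload'] } );
}

print $hr->process('<body onload="init()">hi</body>'), "\n";
```

    Lower-casing rule names before passing them in avoids the error
    altogether and is usually the simpler option.
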
MOTIVATION
    There are already several modules on the CPAN which accomplish much of
    the same thing, but after doing a lot of poking around, I was unable to
    find a solution with a simple setup which I was happy with.

    The most common use case might be stripping HTML from user-submitted
    data completely or allowing just a few tags and attributes to be
    displayed. With the exception of URI scheme checking and the
    experimental regex matching of attribute values, this module doesn't do
    any validation on the actual content of the tags or attributes. If this
    is a requirement, you can either mess with the parser object,
    post-process the text yourself or have a look at one of the more
    feature-rich modules in the SEE ALSO section below.

    My aim here is to keep things easy and, hopefully, cover a lot of the
    less complex use cases with just a few lines of code and some brief
    documentation. The idea is to be up and running quickly.

SEE ALSO
    HTML::TagFilter, HTML::Defang, HTML::Declaw, HTML::StripScripts,
    HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

ACKNOWLEDGEMENTS
    Thanks to Raybec Communications <http://www.raybec.com> for funding my
    work on this module and for releasing it to the world.

    Thanks also to the following for patches, bug reports and assistance:

    Mark Jubenville (ioncache)

    Duncan Forsyth

    Rick Moore

    Arthur Axel 'fREW' Schmidt

    perlpong

    David Golden

    Graham TerMarsch

    Dagfinn Ilmari Mannsåker

    Graham Knop

    Carwyn Ellis

AUTHOR
    Olaf Alders <olaf@wundercounter.com>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2013 by Olaf Alders.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.