1NAME
2 HTML::StripScripts - Strip scripting constructs out of HTML
3
4SYNOPSIS
5 use HTML::StripScripts;
6
7 my $hss = HTML::StripScripts->new({ Context => 'Inline' });
8
9 $hss->input_start_document;
10
11 $hss->input_start('<i>');
12 $hss->input_text('hello, world!');
13 $hss->input_end('</i>');
14
15 $hss->input_end_document;
16
17 print $hss->filtered_document;
18
19DESCRIPTION
20 This module strips scripting constructs out of HTML, leaving as much
21 non-scripting markup in place as possible. This allows web applications
22 to display HTML originating from an untrusted source without introducing
23 XSS (cross site scripting) vulnerabilities.
24
25 You will probably use HTML::StripScripts::Parser rather than using this
26 module directly.
27
28 The process is based on whitelists of tags, attributes and attribute
29 values. This approach is the most secure against disguised scripting
30 constructs hidden in malicious HTML documents.
31
32 As well as removing scripting constructs, this module ensures that there
33 is a matching end for each start tag, and that the tags are properly
34 nested.
35
36 Previously, in order to customise the output, you needed to subclass
37 "HTML::StripScripts" and override methods. Now, most customisation can
38 be done through the "Rules" option provided to "new()". (See
39 examples/declaration/ and examples/tags/ for cases where subclassing is
40 necessary.)
41
42 The HTML document must be parsed into start tags, end tags and text
43 before it can be filtered by this module. Use either
44 HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you
45 want to input an unparsed HTML document.
46
47 See examples/direct/ for an example of how to feed tokens directly to
48 HTML::StripScripts.
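
    As a rough illustration of what feeding tokens involves, the sketch
    below dispatches a list of already-parsed tokens to the matching
    "input_*" methods. The helper name feed_tokens and the [ TYPE, TEXT ]
    token format are assumptions made for this example; they are not part
    of the module.

```perl
use strict;
use warnings;

# Hypothetical lookup table: token type => input_* method that handles it.
my %METHOD_FOR = (
    start       => 'input_start',
    end         => 'input_end',
    text        => 'input_text',
    comment     => 'input_comment',
    declaration => 'input_declaration',
    process     => 'input_process',
);

# Feed pre-parsed tokens, each a [ TYPE, TEXT ] pair, to a filter object
# and return whatever filtered_document() yields afterwards.
sub feed_tokens {
    my ( $filter, @tokens ) = @_;

    $filter->input_start_document;
    for my $token (@tokens) {
        my ( $type, $text ) = @$token;
        my $method = $METHOD_FOR{$type}
            or die "Unknown token type '$type'";
        $filter->$method($text);
    }
    $filter->input_end_document;

    return $filter->filtered_document;
}
```

    With HTML::StripScripts loaded, the SYNOPSIS could then be written as
    feed_tokens( HTML::StripScripts->new({ Context => 'Inline' }),
    [ start => '<i>' ], [ text => 'hello, world!' ], [ end => '</i>' ] ).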
49
50CONSTRUCTORS
51 new ( CONFIG )
52 Creates a new "HTML::StripScripts" filter object, bound to a
53 particular filtering policy. If present, the CONFIG parameter must
54 be a hashref. The following keys are recognized (unrecognized keys
55 will be silently ignored).
56
        $s = HTML::StripScripts->new({
58 Context => 'Document|Flow|Inline|NoTags',
59 BanList => [qw( br img )] | {br => '1', img => '1'},
60 BanAllBut => [qw(p div span)],
61 AllowSrc => 0|1,
62 AllowHref => 0|1,
63 AllowRelURL => 0|1,
64 AllowMailto => 0|1,
65 EscapeFiltered => 0|1,
66 Rules => { See below for details },
67 });
68
69 "Context"
70 A string specifying the context in which the filtered document
71 will be used. This influences the set of tags that will be
72 allowed.
73
74 If present, the "Context" value must be one of:
75
76 "Document"
77 If "Context" is "Document" then the filter will allow a full
78 HTML document, including the "HTML" tag and "HEAD" and
79 "BODY" sections.
80
81 "Flow"
82 If "Context" is "Flow" then most of the cosmetic tags that
83 one would expect to find in a document body are allowed,
84 including lists and tables but not including forms.
85
86 "Inline"
87 If "Context" is "Inline" then only inline tags such as "B"
88 and "FONT" are allowed.
89
90 "NoTags"
91 If "Context" is "NoTags" then no tags are allowed.
92
93 The default "Context" value is "Flow".
94
95 "BanList"
96 If present, this option must be an arrayref or a hashref. Any
97 tag that would normally be allowed (because it presents no XSS
98 hazard) will be blocked if the lowercase name of the tag is in
99 this list.
100
101 For example, in a guestbook application where "HR" tags are used
102 to separate posts, you may wish to prevent posts from including
103 "HR" tags, even though "HR" is not an XSS risk.
104
105 "BanAllBut"
        If present, this option must be a reference to an array holding a
107 list of lowercase tag names. This has the effect of adding all
108 but the listed tags to the ban list, so that only those tags
109 listed will be allowed.
110
111 "AllowSrc"
112 By default, the filter won't allow constructs that cause the
113 browser to fetch things automatically, such as "SRC" attributes
114 in "IMG" tags. If this option is present and true then those
115 constructs will be allowed.
116
117 "AllowHref"
118 By default, the filter won't allow constructs that cause the
119 browser to fetch things if the user clicks on something, such as
120 the "HREF" attribute in "A" tags. Set this option to a true
121 value to allow this type of construct.
122
123 "AllowRelURL"
124 By default, the filter won't allow relative URLs such as
125 "../foo.html" in "SRC" and "HREF" attribute values. Set this
126 option to a true value to allow them. "AllowHref" and / or
127 "AllowSrc" also need to be set to true for this to have any
128 effect.
129
130 "AllowMailto"
131 By default, "mailto:" links are not allowed. If "AllowMailto" is
132 set to a true value, then this construct will be allowed. This
133 can be enabled separately from AllowHref.
134
135 "EscapeFiltered"
136 By default, any filtered tags are outputted as
137 "<!--filtered-->". If "EscapeFiltered" is set to a true value,
138 then the filtered tags are converted to HTML entities.
139
140 For instance:
141
            <br>  -->  &lt;br&gt;
143
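        The two replacement strategies can be pictured with the stand-alone
        sketch below. The function name replacement_for and its escaping
        table are assumptions for illustration only; the module's own logic
        lives in its "reject_*" methods and may escape a different set of
        characters.

```perl
use strict;
use warnings;

# Illustrative escaping table (an assumption, not taken from the module).
my %ENTITY_FOR = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# Return what would stand in for a rejected construct: a fixed comment by
# default, or the escaped original text when $escape_filtered is true.
sub replacement_for {
    my ( $rejected_text, $escape_filtered ) = @_;
    return '<!--filtered-->' unless $escape_filtered;

    ( my $escaped = $rejected_text ) =~ s/([&<>"])/$ENTITY_FOR{$1}/g;
    return $escaped;
}

print replacement_for( '<br>', 0 ), "\n";    # <!--filtered-->
print replacement_for( '<br>', 1 ), "\n";    # &lt;br&gt;
```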
144 "Rules"
145 The "Rules" option provides a very flexible way of customising
146 the filter.
147
        The focus is safety-first, so it is applied after all of the
        previous validation. This means that all malicious data should
        already have been cleared by the time your rules are applied.
151
152 Rules can be specified for tags and for attributes. Any tag or
153 attribute not explicitly listed will be handled by the default
154 "*" rules.
155
156 The following is a synopsis of all of the options that you can
157 use to configure rules. Below, an example is broken into
158 sections and explained.
159
160 Rules => {
161
162 tag => 0 | 1 | sub { tag_callback }
163 | {
164 attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
165 '*' => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
166 required => [qw(attrname attrname)],
167 tag => sub { tag_callback }
168 },
169
170 '*' => 0 | 1 | sub { tag_callback }
171 | {
172 attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
173 '*' => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
174 tag => sub { tag_callback }
175 }
176
177 }
178
179 EXAMPLE:
180
181 Rules => {
182
183 ##########################
184 ##### EXPLICIT RULES #####
185 ##########################
186
187 ## Allow <br> tags, reject <img> tags
188 br => 1,
189 img => 0,
190
191 ## Send all <div> tags to a sub
192 div => sub { tag_callback },
193
            ## Allow <blockquote> tags, and allow the 'cite' attribute
            ## All other attributes are handled by the default '*' rule
196 blockquote => {
197 cite => 1,
198 },
199
200 ## Allow <a> tags, and
201 a => {
202
203 ## Allow the 'title' attribute
204 title => 1,
205
206 ## Allow the 'href' attribute if it matches the regex
                href => '^http://yourdomain.com',
                ## or, using a compiled regex:
                ## href => qr{^http://yourdomain.com},
209
210 ## 'style' attributes are handled by a sub
211 style => sub { attr_callback },
212
213 ## All other attributes are rejected
214 '*' => 0,
215
216 ## Additionally, the <a> tag should be handled by this sub
217 tag => sub { tag_callback},
218
219 ## If the <a> tag doesn't have these attributes, filter the tag
220 required => [qw(href title)],
221
222 },
223
224 ##########################
225 ##### DEFAULT RULES #####
226 ##########################
227
228 ## The default '*' rule - accepts all the same options as above.
229 ## If a tag or attribute is not mentioned above, then the default
230 ## rule is applied:
231
232 ## Reject all tags
233 '*' => 0,
234
235 ## Allow all tags and all attributes
236 '*' => 1,
237
238 ## Send all tags to the sub
239 '*' => sub { tag_callback },
240
241 ## Allow all tags, reject all attributes
242 '*' => { '*' => 0 },
243
244 ## Allow all tags, and
245 '*' => {
246
247 ## Allow the 'title' attribute
248 title => 1,
249
250 ## Allow the 'href' attribute if it matches the regex
                href => '^http://yourdomain.com',
                ## or, using a compiled regex:
                ## href => qr{^http://yourdomain.com},
253
254 ## 'style' attributes are handled by a sub
255 style => sub { attr_callback },
256
257 ## All other attributes are rejected
258 '*' => 0,
259
260 ## Additionally, all tags should be handled by this sub
261 tag => sub { tag_callback},
262
263 },
264
265 Tag Callbacks
266 sub tag_callback {
267 my ($filter,$element) = (@_);
268
269 $element = {
270 tag => 'tag',
271 content => 'inner_html',
272 attr => {
273 attr_name => 'attr_value',
274 }
275 };
276 return 0 | 1;
277 }
278
        A tag callback accepts two parameters, the $filter object
        and the $element. It should return 0 to completely ignore
        the tag and its content (which includes any nested HTML
        tags), or 1 to accept and output the tag.
283
284 The $element is a hash ref containing the keys:
285
286 "tag"
287 This is the tagname in lowercase, eg "a", "br", "img". If
288 you set the tag value to an empty string, then the tag will
289 not be outputted, but the tag contents will.
290
291 "content"
292 This is the equivalent of DOM's innerHTML. It contains the
293 text content and any HTML tags contained within this
294 element. You can change the content or set it to an empty
295 string so that it is not outputted.
296
297 "attr"
            "attr" contains a hashref containing the attribute names and
            values.
300
        If, for instance, you wanted to replace "<b>" tags with "<span>"
302 tags, you could do this:
303
304 sub b_callback {
305 my ($filter,$element) = @_;
306 $element->{tag} = 'span';
307 $element->{attr}{style} = 'font-weight:bold';
308 return 1;
309 }
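
        Two further callbacks, sketched under the behaviour described
        above: one sets the tag name to an empty string so that only the
        content is output, the other returns 0 to drop an element. The sub
        names and the %config hash are hypothetical.

```perl
use strict;
use warnings;

# Drop the <font> wrapper but keep whatever was inside it.
sub unwrap_font {
    my ( $filter, $element ) = @_;
    $element->{tag} = '';    # empty tag name: output the content only
    return 1;
}

# Ignore <div> elements that contain nothing but whitespace.
sub skip_empty_div {
    my ( $filter, $element ) = @_;
    my $content = defined $element->{content} ? $element->{content} : '';
    return $content =~ /\S/ ? 1 : 0;
}

# Hypothetical configuration wiring the callbacks to tags via Rules.
my %config = (
    Rules => {
        font => \&unwrap_font,
        div  => \&skip_empty_div,
    },
);
```

        A reference to %config would then be passed to new().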
310
311 Attribute Callbacks
312 sub attr_callback {
313 my ( $filter, $tag, $attr_name, $attr_val ) = @_;
314 return undef | '' | 'value';
315 }
316
        Attribute callbacks accept four parameters: the $filter object,
        the $tag name, the $attr_name and the $attr_val.

        The callback should return either "undef" to reject the attribute,
        or the value to be used. An empty string keeps the attribute, but
        without a value.
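
        A minimal sketch of an attribute callback that only keeps links to
        one site. The host name, the sub name and the %config hash are
        invented for the example. "AllowHref" is set on the assumption that
        the standard validation, which runs before the rules, would
        otherwise have removed the attribute already.

```perl
use strict;
use warnings;

# Hypothetical host name used only for this example.
my $ALLOWED_HOST = 'www.example.com';

# Keep href values that point at $ALLOWED_HOST over http or https;
# reject everything else by returning undef.
sub href_callback {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    return undef unless defined $attr_val;
    return $attr_val
        if $attr_val =~ m{\Ahttps?://\Q$ALLOWED_HOST\E(?:[/?#]|\z)}i;
    return undef;
}

# Hypothetical configuration: <a> keeps a vetted href and nothing else.
my %config = (
    AllowHref => 1,
    Rules     => {
        a => { href => \&href_callback, '*' => 0 },
    },
);
```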
323
324 "BanList" vs "BanAllBut" vs "Rules"
        It is not necessary to use "BanList" or "BanAllBut"; everything
        can be done via "Rules". However, it may be simpler to write:
327
328 BanAllBut => [qw(p div span)]
329
330 The logic works as follows:
331
332 * If BanAllBut exists, then ban everything but the tags in the list
333 * Add to the ban list any elements in BanList
334 * Any tags mentioned explicitly in Rules (eg a => 0, br => 1)
335 are added or removed from the BanList
336 * A default rule of { '*' => 0 } would ban all tags except
337 those mentioned in Rules
338 * A default rule of { '*' => 1 } would allow all tags except
339 those disallowed in the ban list, or by explicit rules
340
341METHODS
342 This class provides the following methods:
343
344 hss_init ()
345 This method is called by new() and does the actual initialisation
346 work for the new HTML::StripScripts object.
347
348 input_start_document ()
349 This method initializes the filter, and must be called once before
350 starting on each HTML document to be filtered.
351
352 input_start ( TEXT )
353 Handles a start tag from the input document. TEXT must be the full
354 text of the tag, including angle-brackets.
355
356 input_end ( TEXT )
357 Handles an end tag from the input document. TEXT must be the full
358 text of the end tag, including angle-brackets.
359
360 input_text ( TEXT )
361 Handles some non-tag text from the input document.
362
363 input_process ( TEXT )
364 Handles a processing instruction from the input document.
365
366 input_comment ( TEXT )
367 Handles an HTML comment from the input document.
368
369 input_declaration ( TEXT )
        Handles a declaration from the input document.
371
372 input_end_document ()
373 Call this method to signal the end of the input document.
374
375 filtered_document ()
376 Returns the filtered document as a string.
377
378SUBCLASSING
379 The only reason for subclassing this module now is to add to the list of
380 accepted tags, attributes and styles (See "WHITELIST INITIALIZATION
381 METHODS"). Everything else can be achieved with "Rules".
382
383 The "HTML::StripScripts" class is subclassable. Filter objects are plain
384 hashes and "HTML::StripScripts" reserves only hash keys that start with
385 "_hss". The filter configuration can be set up by invoking the
386 hss_init() method, which takes the same arguments as new().
387
388OUTPUT METHODS
389 The filter outputs a stream of start tags, end tags, text, comments,
390 declarations and processing instructions, via the following "output_*"
391 methods. Subclasses may override these to intercept the filter output.
392
393 The default implementations of the "output_*" methods pass the text on
394 to the output() method. The default implementation of the output()
395 method appends the text to a string, which can be fetched with the
396 filtered_document() method once processing is complete.
397
398 If the output() method or the individual "output_*" methods are
399 overridden in a subclass, then filtered_document() will not work in that
400 subclass.
401
402 output_start_document ()
403 This method gets called once at the start of each HTML document
404 passed through the filter. The default implementation does nothing.
405
406 output_end_document ()
407 This method gets called once at the end of each HTML document passed
408 through the filter. The default implementation does nothing.
409
410 output_start ( TEXT )
411 This method is used to output a filtered start tag.
412
413 output_end ( TEXT )
414 This method is used to output a filtered end tag.
415
416 output_text ( TEXT )
417 This method is used to output some filtered non-tag text.
418
419 output_declaration ( TEXT )
420 This method is used to output a filtered declaration.
421
422 output_comment ( TEXT )
423 This method is used to output a filtered HTML comment.
424
425 output_process ( TEXT )
426 This method is used to output a filtered processing instruction.
427
428 output ( TEXT )
429 This method is invoked by all of the default "output_*" methods. The
430 default implementation appends the text to the string that the
431 filtered_document() method will return.
432
433 output_stack_entry ( TEXT )
434 This method is invoked when a tag plus all text and nested HTML
435 content within the tag has been processed. It adds the tag plus its
436 content to the content for its parent tag.
437
438REJECT METHODS
439 When the filter encounters something in the input document which it
440 cannot transform into an acceptable construct, it invokes one of the
441 following "reject_*" methods to put something in the output document to
442 take the place of the unacceptable construct.
443
444 The TEXT parameter is the full text of the unacceptable construct.
445
446 The default implementations of these methods output an HTML comment
447 containing the text "filtered". If "EscapeFiltered" is set to true, then
448 the rejected text is HTML escaped instead.
449
450 Subclasses may override these methods, but should exercise caution. The
451 TEXT parameter is unfiltered input and may contain malicious constructs.
452
453 reject_start ( TEXT )
454 reject_end ( TEXT )
455 reject_text ( TEXT )
456 reject_declaration ( TEXT )
457 reject_comment ( TEXT )
458 reject_process ( TEXT )
459
460WHITELIST INITIALIZATION METHODS
461 The filter refers to various whitelists to determine which constructs
462 are acceptable. To modify these whitelists, subclasses can override the
463 following methods.
464
465 Each method is called once at object initialization time, and must
466 return a reference to a nested data structure. These references are
467 installed into the object, and used whenever the filter needs to refer
468 to a whitelist.
469
470 The default implementations of these methods can be invoked as class
471 methods.
472
473 See examples/tags/ and examples/declaration/ for examples of how to
474 override these methods.
475
476 init_context_whitelist ()
477 Returns a reference to the "Context" whitelist, which determines
478 which tags may appear at each point in the document, and which other
479 tags may be nested within them.
480
481 It is a hash, and the keys are context names, such as "Flow" and
482 "Inline".
483
484 The values in the hash are hashrefs. The keys in these subhashes are
485 lowercase tag names, and the values are context names, specifying
486 the context that the tag provides to any other tags nested within
487 it.
488
489 The special context "EMPTY" as a value in a subhash indicates that
490 nothing can be nested within that tag.
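
        The shape of this structure can be sketched with a cut-down,
        hypothetical whitelist. The tag selection below is invented for
        illustration and is much smaller than the real default; the two
        helper subs are likewise assumptions.

```perl
use strict;
use warnings;

# context name => { tag name => context offered to nested tags }
my %context_whitelist = (
    Flow => {
        div => 'Flow',
        p   => 'Inline',
        b   => 'Inline',
        br  => 'EMPTY',
    },
    Inline => {
        b  => 'Inline',
        i  => 'Inline',
        br => 'EMPTY',
    },
);

# Return the context that TAG provides to its children when it appears in
# CONTEXT, or undef if TAG is not allowed there.
sub nested_context {
    my ( $context, $tag ) = @_;
    my $allowed = $context_whitelist{$context} or return undef;
    return $allowed->{$tag};
}

# Check a chain of nested tags, outermost first, starting from CONTEXT.
sub path_is_valid {
    my ( $context, @tags ) = @_;
    for my $tag (@tags) {
        $context = nested_context( $context, $tag );
        return 0 unless defined $context;
    }
    return 1;
}
```

        Here path_is_valid( 'Flow', qw(div p b) ) is true, while
        path_is_valid( 'Flow', qw(br b) ) is false because "EMPTY" has no
        entry of its own and so nothing can nest inside "br".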
491
492 init_attrib_whitelist ()
493 Returns a reference to the "Attrib" whitelist, which determines
494 which attributes each tag can have and the values that those
495 attributes can take.
496
497 It is a hash, and the keys are lowercase tag names.
498
499 The values in the hash are hashrefs. The keys in these subhashes are
500 lowercase attribute names, and the values are attribute value class
501 names, which are short strings describing the type of values that
502 the attribute can take, such as "color" or "number".
503
504 init_attval_whitelist ()
505 Returns a reference to the "AttVal" whitelist, which is a hash that
506 maps attribute value class names from the "Attrib" whitelist to
507 coderefs to subs to validate (and optionally transform) a particular
508 attribute value.
509
510 The filter calls the attribute value validation subs with the
511 following parameters:
512
513 "filter"
514 A reference to the filter object.
515
516 "tagname"
517 The lowercase name of the tag in which the attribute appears.
518
519 "attrname"
520 The name of the attribute.
521
522 "attrval"
523 The attribute value found in the input document, in canonical
524 form (see "CANONICAL FORM").
525
526 The validation sub can return undef to indicate that the attribute
527 should be removed from the tag, or it can return the new value for
528 the attribute, in canonical form.
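
        A validation sub with this calling convention might look like the
        sketch below. The "percent" value class and the sub name are
        hypothetical; how to merge such an entry into the whitelist
        returned by a subclass is shown in the examples directories
        mentioned above.

```perl
use strict;
use warnings;

# Hypothetical "percent" value class: accept 0-100 with an optional "%",
# and return the value in a normalised form.
sub attval_percent {
    my ( $filter, $tagname, $attrname, $attrval ) = @_;
    return undef unless defined $attrval;
    return undef unless $attrval =~ /\A\s*(\d{1,3})\s*%?\s*\z/;

    my $number = $1 + 0;              # drop leading zeros
    return undef if $number > 100;    # out of range: remove the attribute
    return "$number%";
}

# Shape of an additional "AttVal" whitelist entry for that class.
my %attval_additions = ( percent => \&attval_percent );
```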
529
530 init_style_whitelist ()
531 Returns a reference to the "Style" whitelist, which determines which
532 CSS style directives are permitted in "style" tag attributes. The
533 keys are value names such as "color" and "background-color", and the
534 values are class names to be used as keys into the "AttVal"
535 whitelist.
536
    init_deinter_whitelist ()
538 Returns a reference to the "DeInter" whitelist, which determines
539 which inline tags the filter should attempt to automatically
540 de-interleave if they are encountered interleaved. For example, the
541 filter will transform:
542
543 <b>hello <i>world</b> !</i>
544
545 Into:
546
547 <b>hello <i>world</i></b><i> !</i>
548
549 because both "b" and "i" appear as keys in the "DeInter" whitelist.
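
        The transformation can be sketched with a simple stack. This is an
        illustration of the idea only, not the module's implementation: it
        handles bare lowercase tags without attributes, treats every tag
        as de-interleavable, and the sub name deinterleave is hypothetical.

```perl
use strict;
use warnings;

# Take a list of tokens (start tags, end tags, text) and return a string
# in which interleaved tags have been closed and reopened so that they
# nest properly.
sub deinterleave {
    my @tokens = @_;
    my @stack;          # names of the currently open tags, outermost first
    my $out = '';

    for my $token (@tokens) {
        if ( $token =~ m{\A<([a-z][a-z0-9]*)>\z} ) {
            push @stack, $1;
            $out .= $token;
        }
        elsif ( $token =~ m{\A</([a-z][a-z0-9]*)>\z} ) {
            my $name = $1;
            next unless grep { $_ eq $name } @stack;    # stray end tag

            # Close everything opened after $name, remembering the order.
            my @reopen;
            while ( ( my $top = pop @stack ) ne $name ) {
                $out .= "</$top>";
                unshift @reopen, $top;
            }
            $out .= "</$name>";

            # Reopen the tags that were closed early.
            for my $tag (@reopen) {
                $out .= "<$tag>";
                push @stack, $tag;
            }
        }
        else {
            $out .= $token;
        }
    }

    # Close anything still open at the end of the input.
    $out .= "</$_>" for reverse @stack;
    return $out;
}

print deinterleave( '<b>', 'hello ', '<i>', 'world', '</b>', ' !', '</i>' ), "\n";
# <b>hello <i>world</i></b><i> !</i>
```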
550
551CHARACTER DATA PROCESSING
552 These methods transform attribute values and non-tag text from the input
553 document into canonical form (see "CANONICAL FORM"), and transform text
554 in canonical form into a suitable form for the output document.
555
556 text_to_canonical_form ( TEXT )
557 This method is used to reduce non-tag text from the input document
558 to canonical form before passing it to the filter_text() method.
559
        The default implementation unescapes all entities that map to
        "US-ASCII" characters other than ampersand, and replaces any
        ampersands that don't form part of valid entities with "&amp;".
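
        The ampersand handling can be approximated with a single
        substitution. The sketch below covers only that half of the
        description (it does not unescape any entities), and both the sub
        name and the pattern used to recognise an entity are assumptions.

```perl
use strict;
use warnings;

# Replace every ampersand that does not start something shaped like a
# numeric or named entity with "&amp;".
sub escape_stray_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)/&amp;/g;
    return $text;
}

print escape_stray_ampersands('AT&T &amp; &foo; &#65;'), "\n";
# AT&amp;T &amp; &foo; &#65;
```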
563
564 quoted_to_canonical_form ( VALUE )
565 This method is used to reduce attribute values quoted with
566 doublequotes or singlequotes to canonical form before passing it to
567 the handler subs in the "AttVal" whitelist.
568
569 The default behavior is the same as that of
570 "text_to_canonical_form()", plus it converts any CR, LF or TAB
571 characters to spaces.
572
573 unquoted_to_canonical_form ( VALUE )
574 This method is used to reduce attribute values without quotes to
575 canonical form before passing it to the handler subs in the "AttVal"
576 whitelist.
577
        The default implementation simply replaces all ampersands with
        "&amp;", since that corresponds with the way most browsers treat
        entities in unquoted values.
581
582 canonical_form_to_text ( TEXT )
583 This method is used to convert the text in canonical form returned
584 by the filter_text() method to a form suitable for inclusion in the
585 output document.
586
587 The default implementation runs anything that doesn't look like a
588 valid entity through the escape_html_metachars() method.
589
590 canonical_form_to_attval ( ATTVAL )
591 This method is used to convert the text in canonical form returned
592 by the "AttVal" handler subs to a form suitable for inclusion in
593 doublequotes in the output tag.
594
595 The default implementation converts CR, LF and TAB characters to a
596 single space, and runs anything that doesn't look like a valid
597 entity through the escape_html_metachars() method.
598
599 validate_href_attribute ( TEXT )
600 If the "AllowHref" filter configuration option is set, then this
601 method is used to validate "href" type attribute values. TEXT is the
602 attribute value in canonical form. Returns a possibly modified
603 attribute value (in canonical form) or "undef" to reject the
604 attribute.
605
606 The default implementation allows only absolute "http" and "https"
607 URLs, permits port numbers and query strings, and imposes reasonable
608 length limits.
609
        It does not URI escape the query string, and it does not guarantee
        properly formatted URIs; it just tries to give safe URIs. You can
        always use an attribute callback (see "Attribute Callbacks") to
        provide stricter handling.
614
615 validate_mailto ( TEXT )
616 If the "AllowMailto" filter configuration option is set, then this
617 method is used to validate "href" type attribute values which begin
618 with "mailto:". TEXT is the attribute value in canonical form.
619 Returns a possibly modified attribute value (in canonical form) or
620 "undef" to reject the attribute.
621
622 This uses a lightweight regex and does not guarantee that email
623 addresses are properly formatted. You can always use an attribute
624 callback (see "Attribute Callbacks") to provide stricter handling.
625
626 validate_src_attribute ( TEXT )
627 If the "AllowSrc" filter configuration option is set, then this
628 method is used to validate "src" type attribute values. TEXT is the
629 attribute value in canonical form. Returns a possibly modified
630 attribute value (in canonical form) or "undef" to reject the
631 attribute.
632
633 The default implementation behaves as validate_href_attribute().
634
635OTHER METHODS TO OVERRIDE
636 As well as the output, reject, init and cdata methods listed above, it
637 might make sense for subclasses to override the following methods:
638
639 filter_text ( TEXT )
640 This method will be invoked to filter blocks of non-tag text in the
641 input document. Both input and output are in canonical form, see
642 "CANONICAL FORM".
643
644 The default implementation does no filtering.
645
646 escape_html_metachars ( TEXT )
647 This method is used to escape all HTML metacharacters in TEXT. The
648 return value must be a copy of TEXT with metacharacters escaped.
649
650 The default implementation escapes a minimal set of metacharacters
651 for security against XSS vulnerabilities. The set of characters to
652 escape is a compromise between the need for security and the need to
653 ensure that the filter will work for documents in as many different
654 character sets as possible.
655
656 Subclasses which make strong assumptions about the document
657 character set will be able to escape much more aggressively.
658
659 strip_nonprintable ( TEXT )
660 Returns a copy of TEXT with runs of nonprintable characters replaced
661 with spaces or some other harmless string. Avoids replacing anything
662 with the empty string, as that can lead to other security issues.
663
664 The default implementation strips out only NULL characters, in order
665 to avoid scrambling text for as many different character sets as
666 possible.
667
668 Subclasses which make some sort of assumption about the character
669 set in use will be able to have a much wider definition of a
670 nonprintable character, and hence a more secure strip_nonprintable()
671 implementation.
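
        A subclass that assumes an ASCII-compatible character set might
        widen the definition as sketched below. The package name is
        hypothetical, and the choice of which control characters to treat
        as nonprintable is an assumption of this example.

```perl
use strict;
use warnings;

package AsciiControlFilter;

# -norequire: assumes the calling program has already loaded
# HTML::StripScripts (for example with "use HTML::StripScripts;").
use parent -norequire, 'HTML::StripScripts';

# Replace each run of control characters with a single space, leaving
# TAB, LF and CR alone. Nothing is ever replaced with the empty string.
sub strip_nonprintable {
    my ( $self, $text ) = @_;
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/ /g;
    return $text;
}

package main;
```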
672
673ATTRIBUTE VALUE HANDLER SUBS
674 References to the following subs appear in the "AttVal" whitelist
675 returned by the init_attval_whitelist() method.
676
677 _hss_attval_style( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for the "style" attribute.
679
680 _hss_attval_size ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values are some sort of
682 size or length.
683
684 _hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values are a simple
686 integer.
687
688 _hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
689 Attribute value handler for color attributes.
690
691 _hss_attval_text ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
692 Attribute value handler for text attributes.
693
694 _hss_attval_word ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values must consist of
696 a single short word, with minus characters permitted.
697
698 _hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values must consist of
700 one or more words, separated by spaces and/or commas.
701
702 _hss_attval_wordlistq ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values must consist of
704 one or more words, separated by commas, with optional doublequotes
705 around words and spaces allowed within the doublequotes.
706
707 _hss_attval_href ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
708 Attribute value handler for "href" type attributes. If the
709 "AllowHref" or "AllowMailto" configuration options are set, uses the
710 validate_href_attribute() method to check the attribute value.
711
712 _hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
713 Attribute value handler for "src" type attributes. If the "AllowSrc"
714 configuration option is set, uses the validate_src_attribute()
715 method to check the attribute value.
716
717 _hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
718 Attribute value handler for "src" type style pseudo attributes.
719
720 _hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
721 Attribute value handler for attributes that have no value or a value
722 that is ignored. Just returns the attribute name as the value.
723
724CANONICAL FORM
725 Many of the methods described above deal with text from the input
726 document, encoded in what I call "canonical form", defined as follows:
727
    All characters other than ampersands represent themselves. Literal
    ampersands are encoded as "&amp;". Non "US-ASCII" characters may appear
    as literals in whatever character set is in use, or they may appear as
    named or numeric HTML entities such as "&aelig;", "&#31337;" and
    "&#xFF;". Unknown named entities such as "&foo;" may appear.
733
    The idea is to be able to reduce input text to a minimal form, without
    making too many assumptions about the character set in use.
737
738PRIVATE METHODS
739 The following methods are internal to this class, and should not be
740 invoked from elsewhere. Subclasses should not use or override these
741 methods.
742
743 _hss_prepare_ban_list (CFG)
744 Returns a hash ref representing all the banned tags, based on the
        values of BanList and BanAllBut.
746
747 _hss_prepare_rules (CFG)
748 Returns a hash ref representing the tag and attribute rules (See
749 "Rules").
750
751 Returns undef if no filters are specified, in which case the
752 attribute filter code has very little performance impact. If any
753 rules are specified, then every tag and attribute is checked.
754
    _hss_get_attr_filter ( DEFAULT_FILTERS, TAG_FILTERS, ATTR_NAME )
756 Returns the attribute filter rule to apply to this particular
757 attribute.
758
759 Checks for:
760
761 - a named attribute rule in a named tag
762 - a default * attribute rule in a named tag
763 - a named attribute rule in the default * rules
764 - a default * attribute rule in the default * rules
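
        The lookup order listed above can be sketched as follows. This is
        an illustration of the order only, not the private method itself;
        the sub name find_attr_rule and the plain-hash representation of
        the rules are assumptions.

```perl
use strict;
use warnings;

# Return the first matching rule, trying the four places in order:
# named attribute in the tag's rules, '*' in the tag's rules,
# named attribute in the default rules, '*' in the default rules.
sub find_attr_rule {
    my ( $default_rules, $tag_rules, $attr_name ) = @_;

    for my $candidate (
        [ $tag_rules,     $attr_name ],
        [ $tag_rules,     '*' ],
        [ $default_rules, $attr_name ],
        [ $default_rules, '*' ],
      )
    {
        my ( $rules, $key ) = @$candidate;
        next unless ref($rules) eq 'HASH' && exists $rules->{$key};
        return $rules->{$key};
    }

    return undef;    # no rule applies
}
```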
765
766 _hss_join_attribs (FILTERED_ATTRIBS)
767 Accepts a hash ref containing the attribute names as the keys, and
768 the attribute values as the values. Escapes them and returns a
769 string ready for output to HTML
770
771 _hss_decode_numeric ( NUMERIC )
772 Returns the string that should replace the numeric entity NUMERIC in
773 the text_to_canonical_form() method.
774
775 _hss_tag_is_banned ( TAGNAME )
776 Returns true if the lower case tag name TAGNAME is on the list of
777 harmless tags that the filter is configured to block, false
778 otherwise.
779
780 _hss_get_to_valid_context ( TAG )
781 Tries to get the filter to a context in which the tag TAG is
782 allowed, by introducing extra end tags or start tags if necessary.
783 TAG can be either the lower case name of a tag or the string
784 'CDATA'.
785
786 Returns 1 if an allowed context is reached, or 0 if there's no
787 reasonable way to get to an allowed context and the tag should just
788 be rejected.
789
790 _hss_close_innermost_tag ()
791 Closes the innermost open tag.
792
793 _hss_context ()
794 Returns the current named context of the filter.
795
796 _hss_valid_in_context ( TAG, CONTEXT )
797 Returns true if the lowercase tag name TAG is valid in context
798 CONTEXT, false otherwise.
799
800 _hss_valid_in_current_context ( TAG )
801 Returns true if the lowercase tag name TAG is valid in the filter's
802 current context, false otherwise.
803
804BUGS AND LIMITATIONS
805 Performance
806 This module does a lot of work to ensure that tags are correctly
807 nested and are not left open, causing unnecessary overhead for
808 applications where that doesn't matter.
809
810 Such applications may benefit from using the more lightweight
811 HTML::Scrubber::StripScripts module instead.
812
813 Strictness
        URIs and email addresses are cleaned up to be safe, but not
        necessarily accurate. Strict validation would have required adding
        dependencies. Attribute callbacks can be used to add this
        functionality if required, or the validation methods can be
        overridden.
818
819 By default, filtered HTML may not be valid strict XHTML, for
820 instance empty required attributes may be outputted. However, with
821 "Rules", it should be possible to force the HTML to validate.
822
823 REPORTING BUGS
824 Please report any bugs or feature requests to
825 bug-html-stripscripts@rt.cpan.org, or through the web interface at
826 <http://rt.cpan.org>.
827
828SEE ALSO
829 HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex
830
831AUTHOR
832 Original author Nick Cleaton <nick@cleaton.net>
833
834 New code added and module maintained by Clinton Gormley
835 <clint@traveljury.com>
836
837COPYRIGHT
838 Copyright (C) 2003 Nick Cleaton. All Rights Reserved.
839
840 Copyright (C) 2007 Clinton Gormley. All Rights Reserved.
841
842LICENSE
843 This module is free software; you can redistribute it and/or modify it
844 under the same terms as Perl itself.
845
846