README

1NAME
2    HTML::StripScripts - Strip scripting constructs out of HTML
3
4SYNOPSIS
5      use HTML::StripScripts;
6
7      my $hss = HTML::StripScripts->new({ Context => 'Inline' });
8
9      $hss->input_start_document;
10
11      $hss->input_start('<i>');
12      $hss->input_text('hello, world!');
13      $hss->input_end('</i>');
14
15      $hss->input_end_document;
16
17      print $hss->filtered_document;
18
19DESCRIPTION
20    This module strips scripting constructs out of HTML, leaving as much
21    non-scripting markup in place as possible. This allows web applications
22    to display HTML originating from an untrusted source without introducing
23    XSS (cross site scripting) vulnerabilities.
24
25    You will probably use HTML::StripScripts::Parser rather than using this
26    module directly.
27
28    The process is based on whitelists of tags, attributes and attribute
29    values. This approach is the most secure against disguised scripting
30    constructs hidden in malicious HTML documents.
31
32    As well as removing scripting constructs, this module ensures that there
33    is a matching end for each start tag, and that the tags are properly
34    nested.
35
36    Previously, in order to customise the output, you needed to subclass
37    "HTML::StripScripts" and override methods. Now, most customisation can
38    be done through the "Rules" option provided to "new()". (See
39    examples/declaration/ and examples/tags/ for cases where subclassing is
40    necessary.)
41
42    The HTML document must be parsed into start tags, end tags and text
43    before it can be filtered by this module. Use either
44    HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you
45    want to input an unparsed HTML document.
46
47    See examples/direct/ for an example of how to feed tokens directly to
48    HTML::StripScripts.
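As a rough sketch of direct token feeding, the snippet below routes each pre-split token to the matching input_* method. The helper name feed_tokens and the sample tokens are assumptions made for illustration; they are not part of the module.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Hypothetical helper (not part of HTML::StripScripts): send each
# pre-split token to the matching input_* method.
sub feed_tokens {
    my ( $hss, @tokens ) = @_;
    $hss->input_start_document;
    for my $token (@tokens) {
        if    ( $token =~ m{^</} ) { $hss->input_end($token) }
        elsif ( $token =~ m{^<} )  { $hss->input_start($token) }
        else                       { $hss->input_text($token) }
    }
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $hss = HTML::StripScripts->new( { Context => 'Flow' } );
my $out = feed_tokens( $hss,
    '<p>', 'safe text', '<script>', 'alert(1)', '</script>', '</p>' );
print "$out\n";
```

The paragraph and its text should survive, while the script element and its body are replaced by filter placeholders.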
49
50CONSTRUCTORS
51    new ( CONFIG )
52        Creates a new "HTML::StripScripts" filter object, bound to a
53        particular filtering policy. If present, the CONFIG parameter must
54        be a hashref. The following keys are recognized (unrecognized keys
55        will be silently ignored).
56
57            $s = HTML::StripScripts->new({
58                Context         => 'Document|Flow|Inline|NoTags',
59                BanList         => [qw( br img )] | {br => '1', img => '1'},
60                BanAllBut       => [qw(p div span)],
61                AllowSrc        => 0|1,
62                AllowHref       => 0|1,
63                AllowRelURL     => 0|1,
64                AllowMailto     => 0|1,
65                EscapeFiltered  => 0|1,
66                Rules           => { See below for details },
67            });
68
69        "Context"
70            A string specifying the context in which the filtered document
71            will be used. This influences the set of tags that will be
72            allowed.
73
74            If present, the "Context" value must be one of:
75
76            "Document"
77                If "Context" is "Document" then the filter will allow a full
78                HTML document, including the "HTML" tag and "HEAD" and
79                "BODY" sections.
80
81            "Flow"
82                If "Context" is "Flow" then most of the cosmetic tags that
83                one would expect to find in a document body are allowed,
84                including lists and tables but not including forms.
85
86            "Inline"
87                If "Context" is "Inline" then only inline tags such as "B"
88                and "FONT" are allowed.
89
90            "NoTags"
91                If "Context" is "NoTags" then no tags are allowed.
92
93            The default "Context" value is "Flow".
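To illustrate how "Context" changes what is allowed, this sketch passes the same token stream through a "Flow" filter and an "Inline" filter. The helper name filter_in_context is hypothetical.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Hypothetical helper: filter <p>hi</p> under the given Context.
sub filter_in_context {
    my ($context) = @_;
    my $hss = HTML::StripScripts->new( { Context => $context } );
    $hss->input_start_document;
    $hss->input_start('<p>');
    $hss->input_text('hi');
    $hss->input_end('</p>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $flow   = filter_in_context('Flow');      # block-level <p> is acceptable here
my $inline = filter_in_context('Inline');    # <p> is not an inline tag
print "Flow:   $flow\n";
print "Inline: $inline\n";
```

Under "Inline" the text is expected to remain, with the paragraph tags filtered out.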
94
95        "BanList"
96            If present, this option must be an arrayref or a hashref. Any
97            tag that would normally be allowed (because it presents no XSS
98            hazard) will be blocked if the lowercase name of the tag is in
99            this list.
100
101            For example, in a guestbook application where "HR" tags are used
102            to separate posts, you may wish to prevent posts from including
103            "HR" tags, even though "HR" is not an XSS risk.
104
105        "BanAllBut"
106            If present, this option must be a reference to an array holding a
107            list of lowercase tag names. This has the effect of adding all
108            but the listed tags to the ban list, so that only those tags
109            listed will be allowed.
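The guestbook case above could look roughly like the following sketch, which also shows "BanAllBut" for comparison. The helper name filter_with and the sample tokens are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Hypothetical helper: build a filter from $config and feed it tokens.
sub filter_with {
    my ( $config, @tokens ) = @_;
    my $hss = HTML::StripScripts->new($config);
    $hss->input_start_document;
    for my $token (@tokens) {
        if    ( $token =~ m{^</} ) { $hss->input_end($token) }
        elsif ( $token =~ m{^<} )  { $hss->input_start($token) }
        else                       { $hss->input_text($token) }
    }
    $hss->input_end_document;
    return $hss->filtered_document;
}

my @post = ( 'post one', '<hr>', 'still post one' );

my $default = filter_with( {}, @post );                        # <hr> allowed
my $banned  = filter_with( { BanList => ['hr'] }, @post );     # <hr> blocked
my $only_p  = filter_with( { BanAllBut => ['p'] },
    '<p>', '<b>', 'x', '</b>', '</p>' );                       # only <p> allowed

print "$_\n" for $default, $banned, $only_p;
```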
110
111        "AllowSrc"
112            By default, the filter won't allow constructs that cause the
113            browser to fetch things automatically, such as "SRC" attributes
114            in "IMG" tags. If this option is present and true then those
115            constructs will be allowed.
116
117        "AllowHref"
118            By default, the filter won't allow constructs that cause the
119            browser to fetch things if the user clicks on something, such as
120            the "HREF" attribute in "A" tags. Set this option to a true
121            value to allow this type of construct.
122
123        "AllowRelURL"
124            By default, the filter won't allow relative URLs such as
125            "../foo.html" in "SRC" and "HREF" attribute values. Set this
126            option to a true value to allow them. "AllowHref" and / or
127            "AllowSrc" also need to be set to true for this to have any
128            effect.
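A minimal sketch of the effect of "AllowHref", assuming a simple absolute URL (the helper name filter_link and the URL are illustrative):

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Hypothetical helper: filter one link under the given configuration.
sub filter_link {
    my ($config) = @_;
    my $hss = HTML::StripScripts->new($config);
    $hss->input_start_document;
    $hss->input_start('<a href="http://example.com/">');
    $hss->input_text('link');
    $hss->input_end('</a>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $without = filter_link( {} );                    # href removed by default
my $with    = filter_link( { AllowHref => 1 } );    # href kept
print "default:   $without\n";
print "AllowHref: $with\n";
```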
129
130        "AllowMailto"
131            By default, "mailto:" links are not allowed. If "AllowMailto" is
132            set to a true value, then this construct will be allowed. This
133            can be enabled separately from AllowHref.
134
135        "EscapeFiltered"
136            By default, any filtered tags are outputted as
137            "<!--filtered-->". If "EscapeFiltered" is set to a true value,
138            then the filtered tags are converted to HTML entities.
139
140            For instance:
141
142              <br>  -->  &lt;br&gt;
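For example, with "br" banned so that it is actually filtered, a sketch of the escaped output might be:

```perl
use strict;
use warnings;
use HTML::StripScripts;

# <br> is banned here only so that there is something to filter.
my $hss = HTML::StripScripts->new(
    { BanList => ['br'], EscapeFiltered => 1 } );

$hss->input_start_document;
$hss->input_text('line one');
$hss->input_start('<br>');
$hss->input_text('line two');
$hss->input_end_document;

my $out = $hss->filtered_document;
print "$out\n";
```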
143
144        "Rules"
145            The "Rules" option provides a very flexible way of customising
146            the filter.
147
148            The focus is safety-first, so the rules are applied after all of
149            the previous validation. This means that all malicious data
150            should already have been cleared.
151
152            Rules can be specified for tags and for attributes. Any tag or
153            attribute not explicitly listed will be handled by the default
154            "*" rules.
155
156            The following is a synopsis of all of the options that you can
157            use to configure rules. Below, an example is broken into
158            sections and explained.
159
160             Rules => {
161
162                 tag => 0 | 1 | sub { tag_callback }
163                        | {
164                            attr      => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
165                            '*'       => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
166                            required  => [qw(attrname attrname)],
167                            tag       => sub { tag_callback }
168                          },
169
170                '*' => 0 | 1 | sub { tag_callback }
171                       | {
172                           attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
173                           '*'  => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
174                           tag  => sub { tag_callback }
175                         }
176
177                }
178
179            EXAMPLE:
180
181                Rules => {
182
183                    ##########################
184                    ##### EXPLICIT RULES #####
185                    ##########################
186
187                    ## Allow <br> tags, reject <img> tags
188                    br          => 1,
189                    img         => 0,
190
191                    ## Send all <div> tags to a sub
192                    div         => sub { tag_callback },
193
194                    ## Allow <blockquote> tags, and allow the 'cite' attribute
195                    ## All other attributes are handled by the default '*' rule
196                    blockquote  => {
197                        cite    => 1,
198                    },
199
200                    ## Allow <a> tags, and
201                    a  => {
202
203                        ## Allow the 'title' attribute
204                        title     => 1,
205
206                        ## Allow the 'href' attribute if it matches the regex
207                        href    => '^http://yourdomain.com',
208                        ## or, equivalently: href => qr{^http://yourdomain.com},
209
210                        ## 'style' attributes are handled by a sub
211                        style     => sub { attr_callback },
212
213                        ## All other attributes are rejected
214                        '*'       => 0,
215
216                        ## Additionally, the <a> tag should be handled by this sub
217                        tag       => sub { tag_callback},
218
219                        ## If the <a> tag doesn't have these attributes, filter the tag
220                        required  => [qw(href title)],
221
222                    },
223
224                    ##########################
225                    ##### DEFAULT RULES #####
226                    ##########################
227
228                    ## The default '*' rule - accepts all the same options as above.
229                    ## If a tag or attribute is not mentioned above, then the default
230                    ## rule is applied:
231
232                    ## Reject all tags
233                    '*'         => 0,
234
235                    ## Allow all tags and all attributes
236                    '*'         => 1,
237
238                    ## Send all tags to the sub
239                    '*'         => sub { tag_callback },
240
241                    ## Allow all tags, reject all attributes
242                    '*'         => { '*'  => 0 },
243
244                    ## Allow all tags, and
245                    '*' => {
246
247                        ## Allow the 'title' attribute
248                        title   => 1,
249
250                        ## Allow the 'href' attribute if it matches the regex
251                        href    => '^http://yourdomain.com',
252                        ## or, equivalently: href => qr{^http://yourdomain.com},
253
254                        ## 'style' attributes are handled by a sub
255                        style   => sub { attr_callback },
256
257                        ## All other attributes are rejected
258                        '*'     => 0,
259
260                        ## Additionally, all tags should be handled by this sub
261                        tag     => sub { tag_callback},
262
263                    },
264
265            Tag Callbacks
266                    sub tag_callback {
267                        my ($filter,$element) = (@_);
268
269                        $element = {
270                            tag      => 'tag',
271                            content  => 'inner_html',
272                            attr     => {
273                                attr_name => 'attr_value',
274                            }
275                        };
276                        return 0 | 1;
277                    }
278
279                A tag callback accepts two parameters, the $filter object
280                and the $element. It should return 0 to completely ignore
281                the tag and its content (which includes any nested HTML
282                tags), or 1 to accept and output the tag.
283
284                The $element is a hash ref containing the keys:
285
286            "tag"
287                This is the tagname in lowercase, eg "a", "br", "img". If
288                you set the tag value to an empty string, then the tag will
289                not be outputted, but the tag contents will.
290
291            "content"
292                This is the equivalent of DOM's innerHTML. It contains the
293                text content and any HTML tags contained within this
294                element. You can change the content or set it to an empty
295                string so that it is not outputted.
296
297            "attr"
298                "attr" contains a hashref containing the attribute names and
299                values.
300
301            If, for instance, you wanted to replace "<b>" tags with "<span>"
302            tags, you could do this:
303
304                sub b_callback {
305                    my ($filter,$element)   = @_;
306                    $element->{tag}         = 'span';
307                    $element->{attr}{style} = 'font-weight:bold';
308                    return 1;
309                }
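Wired into a filter, the callback above might be used as in this sketch. The callback is registered for "b" through "Rules"; the sample input is an assumption.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Tag callback: rewrite <b> as a styled <span>.
sub b_callback {
    my ( $filter, $element ) = @_;
    $element->{tag}         = 'span';
    $element->{attr}{style} = 'font-weight:bold';
    return 1;    # accept and output the (rewritten) tag
}

my $hss = HTML::StripScripts->new( { Rules => { b => \&b_callback } } );

$hss->input_start_document;
$hss->input_start('<b>');
$hss->input_text('hello');
$hss->input_end('</b>');
$hss->input_end_document;

my $out = $hss->filtered_document;
print "$out\n";
```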
310
311        Attribute Callbacks
312                sub attr_callback {
313                    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
314                    return undef | '' | 'value';
315                }
316
317            Attribute callbacks accept four parameters, the $filter object,
318            the $tag name, the $attr_name and the $attr_val.
319
320            A callback should return either "undef" to reject the attribute,
321            or the value to be used. An empty string keeps the attribute, but
322            without a value.
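As an illustration, an attribute callback could restrict "href" values to a single site. In this sketch the callback name same_site_only, the helper filter_link and the domain are all hypothetical; "AllowHref" is enabled because the rule only runs on values that passed the earlier validation.

```perl
use strict;
use warnings;
use HTML::StripScripts;

# Hypothetical attribute callback: keep href only for example.com URLs.
sub same_site_only {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    return $attr_val =~ m{^https?://example\.com/} ? $attr_val : undef;
}

# Hypothetical helper: filter one link pointing at $url.
sub filter_link {
    my ($url) = @_;
    my $hss = HTML::StripScripts->new(
        {   AllowHref => 1,
            Rules     => { a => { href => \&same_site_only } },
        }
    );
    $hss->input_start_document;
    $hss->input_start(qq{<a href="$url">});
    $hss->input_text('link');
    $hss->input_end('</a>');
    $hss->input_end_document;
    return $hss->filtered_document;
}

my $kept    = filter_link('http://example.com/page');
my $dropped = filter_link('http://elsewhere.test/page');
print "$kept\n$dropped\n";
```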
323
324        "BanList" vs "BanAllBut" vs "Rules"
325            It is not necessary to use "BanList" or "BanAllBut": everything
326            can be done via "Rules". However, it may be simpler to write:
327
328                BanAllBut => [qw(p div span)]
329
330            The logic works as follows:
331
332               * If BanAllBut exists, then ban everything but the tags in the list
333               * Add to the ban list any elements in BanList
334               * Any tags mentioned explicitly in Rules (eg a => 0, br => 1)
335                 are added or removed from the BanList
336               * A default rule of { '*' => 0 } would ban all tags except
337                 those mentioned in Rules
338               * A default rule of { '*' => 1 } would allow all tags except
339                 those disallowed in the ban list, or by explicit rules
340
341METHODS
342    This class provides the following methods:
343
344    hss_init ()
345        This method is called by new() and does the actual initialisation
346        work for the new HTML::StripScripts object.
347
348    input_start_document ()
349        This method initializes the filter, and must be called once before
350        starting on each HTML document to be filtered.
351
352    input_start ( TEXT )
353        Handles a start tag from the input document. TEXT must be the full
354        text of the tag, including angle-brackets.
355
356    input_end ( TEXT )
357        Handles an end tag from the input document. TEXT must be the full
358        text of the end tag, including angle-brackets.
359
360    input_text ( TEXT )
361        Handles some non-tag text from the input document.
362
363    input_process ( TEXT )
364        Handles a processing instruction from the input document.
365
366    input_comment ( TEXT )
367        Handles an HTML comment from the input document.
368
369    input_declaration ( TEXT )
370        Handles a declaration from the input document.
371
372    input_end_document ()
373        Call this method to signal the end of the input document.
374
375    filtered_document ()
376        Returns the filtered document as a string.
377
378SUBCLASSING
379    The only reason for subclassing this module now is to add to the list of
380    accepted tags, attributes and styles (See "WHITELIST INITIALIZATION
381    METHODS"). Everything else can be achieved with "Rules".
382
383    The "HTML::StripScripts" class is subclassable. Filter objects are plain
384    hashes and "HTML::StripScripts" reserves only hash keys that start with
385    "_hss". The filter configuration can be set up by invoking the
386    hss_init() method, which takes the same arguments as new().
387
388OUTPUT METHODS
389    The filter outputs a stream of start tags, end tags, text, comments,
390    declarations and processing instructions, via the following "output_*"
391    methods. Subclasses may override these to intercept the filter output.
392
393    The default implementations of the "output_*" methods pass the text on
394    to the output() method. The default implementation of the output()
395    method appends the text to a string, which can be fetched with the
396    filtered_document() method once processing is complete.
397
398    If the output() method or the individual "output_*" methods are
399    overridden in a subclass, then filtered_document() will not work in that
400    subclass.
401
402    output_start_document ()
403        This method gets called once at the start of each HTML document
404        passed through the filter. The default implementation does nothing.
405
406    output_end_document ()
407        This method gets called once at the end of each HTML document passed
408        through the filter. The default implementation does nothing.
409
410    output_start ( TEXT )
411        This method is used to output a filtered start tag.
412
413    output_end ( TEXT )
414        This method is used to output a filtered end tag.
415
416    output_text ( TEXT )
417        This method is used to output some filtered non-tag text.
418
419    output_declaration ( TEXT )
420        This method is used to output a filtered declaration.
421
422    output_comment ( TEXT )
423        This method is used to output a filtered HTML comment.
424
425    output_process ( TEXT )
426        This method is used to output a filtered processing instruction.
427
428    output ( TEXT )
429        This method is invoked by all of the default "output_*" methods. The
430        default implementation appends the text to the string that the
431        filtered_document() method will return.
432
433    output_stack_entry ( TEXT )
434        This method is invoked when a tag plus all text and nested HTML
435        content within the tag has been processed. It adds the tag plus its
436        content to the content for its parent tag.
437
438REJECT METHODS
439    When the filter encounters something in the input document which it
440    cannot transform into an acceptable construct, it invokes one of the
441    following "reject_*" methods to put something in the output document to
442    take the place of the unacceptable construct.
443
444    The TEXT parameter is the full text of the unacceptable construct.
445
446    The default implementations of these methods output an HTML comment
447    containing the text "filtered". If "EscapeFiltered" is set to true, then
448    the rejected text is HTML escaped instead.
449
450    Subclasses may override these methods, but should exercise caution. The
451    TEXT parameter is unfiltered input and may contain malicious constructs.
452
453    reject_start ( TEXT )
454    reject_end ( TEXT )
455    reject_text ( TEXT )
456    reject_declaration ( TEXT )
457    reject_comment ( TEXT )
458    reject_process ( TEXT )
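One possible override is sketched below: a subclass that emits nothing at all for rejected constructs, which sidesteps the risk of echoing the unfiltered TEXT. The package name My::SilentFilter is an assumption.

```perl
use strict;
use warnings;
use HTML::StripScripts;

package My::SilentFilter;    # hypothetical subclass
use parent -norequire, 'HTML::StripScripts';

# Output nothing for rejected constructs. The unfiltered TEXT argument
# is deliberately ignored.
sub reject_start       { return }
sub reject_end         { return }
sub reject_text        { return }
sub reject_declaration { return }
sub reject_comment     { return }
sub reject_process     { return }

package main;

my $hss = My::SilentFilter->new;
$hss->input_start_document;
$hss->input_start('<script>');
$hss->input_text('alert(1)');
$hss->input_end('</script>');
$hss->input_text('ok');
$hss->input_end_document;

my $out = $hss->filtered_document;
print "$out\n";
```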
459
460WHITELIST INITIALIZATION METHODS
461    The filter refers to various whitelists to determine which constructs
462    are acceptable. To modify these whitelists, subclasses can override the
463    following methods.
464
465    Each method is called once at object initialization time, and must
466    return a reference to a nested data structure. These references are
467    installed into the object, and used whenever the filter needs to refer
468    to a whitelist.
469
470    The default implementations of these methods can be invoked as class
471    methods.
472
473    See examples/tags/ and examples/declaration/ for examples of how to
474    override these methods.
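The following is a minimal sketch of such an override, not a copy of the bundled examples. It assumes a hypothetical subclass My::LangFilter that adds a "lang" attribute to "span", validated by the existing "word" value class, and it also calls a default implementation as a class method.

```perl
use strict;
use warnings;
use HTML::StripScripts;

package My::LangFilter;    # hypothetical subclass
use parent -norequire, 'HTML::StripScripts';

sub init_attrib_whitelist {
    my ($self) = @_;

    # Work on copies, in case the default structure is shared.
    my %attrib = %{ $self->SUPER::init_attrib_whitelist };

    # Allow lang="..." on <span>, checked by the "word" value class.
    $attrib{span} = { %{ $attrib{span} || {} }, lang => 'word' };
    return \%attrib;
}

package main;

# Default implementations can be called as class methods.
my $contexts = HTML::StripScripts->init_context_whitelist;
print join( ', ', sort keys %$contexts ), "\n";

my $hss = My::LangFilter->new( { Context => 'Inline' } );
$hss->input_start_document;
$hss->input_start('<span lang="en">');
$hss->input_text('hello');
$hss->input_end('</span>');
$hss->input_end_document;

my $out = $hss->filtered_document;
print "$out\n";
```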
475
476    init_context_whitelist ()
477        Returns a reference to the "Context" whitelist, which determines
478        which tags may appear at each point in the document, and which other
479        tags may be nested within them.
480
481        It is a hash, and the keys are context names, such as "Flow" and
482        "Inline".
483
484        The values in the hash are hashrefs. The keys in these subhashes are
485        lowercase tag names, and the values are context names, specifying
486        the context that the tag provides to any other tags nested within
487        it.
488
489        The special context "EMPTY" as a value in a subhash indicates that
490        nothing can be nested within that tag.
491
492    init_attrib_whitelist ()
493        Returns a reference to the "Attrib" whitelist, which determines
494        which attributes each tag can have and the values that those
495        attributes can take.
496
497        It is a hash, and the keys are lowercase tag names.
498
499        The values in the hash are hashrefs. The keys in these subhashes are
500        lowercase attribute names, and the values are attribute value class
501        names, which are short strings describing the type of values that
502        the attribute can take, such as "color" or "number".
503
504    init_attval_whitelist ()
505        Returns a reference to the "AttVal" whitelist, which is a hash that
506        maps attribute value class names from the "Attrib" whitelist to
507        coderefs to subs to validate (and optionally transform) a particular
508        attribute value.
509
510        The filter calls the attribute value validation subs with the
511        following parameters:
512
513        "filter"
514            A reference to the filter object.
515
516        "tagname"
517            The lowercase name of the tag in which the attribute appears.
518
519        "attrname"
520            The name of the attribute.
521
522        "attrval"
523            The attribute value found in the input document, in canonical
524            form (see "CANONICAL FORM").
525
526        The validation sub can return undef to indicate that the attribute
527        should be removed from the tag, or it can return the new value for
528        the attribute, in canonical form.
529
530    init_style_whitelist ()
531        Returns a reference to the "Style" whitelist, which determines which
532        CSS style directives are permitted in "style" tag attributes. The
533        keys are value names such as "color" and "background-color", and the
534        values are class names to be used as keys into the "AttVal"
535        whitelist.
536
537    init_deinter_whitelist ()
538        Returns a reference to the "DeInter" whitelist, which determines
539        which inline tags the filter should attempt to automatically
540        de-interleave if they are encountered interleaved. For example, the
541        filter will transform:
542
543          <b>hello <i>world</b> !</i>
544
545        Into:
546
547          <b>hello <i>world</i></b><i> !</i>
548
549        because both "b" and "i" appear as keys in the "DeInter" whitelist.
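The interleaved input above can be fed token by token, as in this sketch (the split into tokens is an assumption about how a parser would deliver them):

```perl
use strict;
use warnings;
use HTML::StripScripts;

my $hss = HTML::StripScripts->new( { Context => 'Inline' } );

$hss->input_start_document;
$hss->input_start('<b>');
$hss->input_text('hello ');
$hss->input_start('<i>');
$hss->input_text('world');
$hss->input_end('</b>');    # closes <b> while <i> is still open
$hss->input_text(' !');
$hss->input_end('</i>');
$hss->input_end_document;

my $out = $hss->filtered_document;
print "$out\n";
```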
550
551CHARACTER DATA PROCESSING
552    These methods transform attribute values and non-tag text from the input
553    document into canonical form (see "CANONICAL FORM"), and transform text
554    in canonical form into a suitable form for the output document.
555
556    text_to_canonical_form ( TEXT )
557        This method is used to reduce non-tag text from the input document
558        to canonical form before passing it to the filter_text() method.
559
560        The default implementation unescapes all entities that map to
561        "US-ASCII" characters other than ampersand, and replaces any
562        ampersands that don't form part of valid entities with "&amp;".
563
564    quoted_to_canonical_form ( VALUE )
565        This method is used to reduce attribute values quoted with
566        doublequotes or singlequotes to canonical form before passing it to
567        the handler subs in the "AttVal" whitelist.
568
569        The default behavior is the same as that of
570        "text_to_canonical_form()", plus it converts any CR, LF or TAB
571        characters to spaces.
572
573    unquoted_to_canonical_form ( VALUE )
574        This method is used to reduce attribute values without quotes to
575        canonical form before passing it to the handler subs in the "AttVal"
576        whitelist.
577
578        The default implementation simply replaces all ampersands with
579        "&amp;", since that corresponds with the way most browsers treat
580        entities in unquoted values.
581
582    canonical_form_to_text ( TEXT )
583        This method is used to convert the text in canonical form returned
584        by the filter_text() method to a form suitable for inclusion in the
585        output document.
586
587        The default implementation runs anything that doesn't look like a
588        valid entity through the escape_html_metachars() method.
589
590    canonical_form_to_attval ( ATTVAL )
591        This method is used to convert the text in canonical form returned
592        by the "AttVal" handler subs to a form suitable for inclusion in
593        doublequotes in the output tag.
594
595        The default implementation converts CR, LF and TAB characters to a
596        single space, and runs anything that doesn't look like a valid
597        entity through the escape_html_metachars() method.
598
599    validate_href_attribute ( TEXT )
600        If the "AllowHref" filter configuration option is set, then this
601        method is used to validate "href" type attribute values. TEXT is the
602        attribute value in canonical form. Returns a possibly modified
603        attribute value (in canonical form) or "undef" to reject the
604        attribute.
605
606        The default implementation allows only absolute "http" and "https"
607        URLs, permits port numbers and query strings, and imposes reasonable
608        length limits.
609
610        It does not URI escape the query string, and it does not guarantee
611        properly formatted URIs; it just tries to give safe URIs. You can
612        always use an attribute callback (see "Attribute Callbacks") to
613        provide stricter handling.
614
615    validate_mailto ( TEXT )
616        If the "AllowMailto" filter configuration option is set, then this
617        method is used to validate "href" type attribute values which begin
618        with "mailto:". TEXT is the attribute value in canonical form.
619        Returns a possibly modified attribute value (in canonical form) or
620        "undef" to reject the attribute.
621
622        This uses a lightweight regex and does not guarantee that email
623        addresses are properly formatted. You can always use an attribute
624        callback (see "Attribute Callbacks") to provide stricter handling.
625
626    validate_src_attribute ( TEXT )
627        If the "AllowSrc" filter configuration option is set, then this
628        method is used to validate "src" type attribute values. TEXT is the
629        attribute value in canonical form. Returns a possibly modified
630        attribute value (in canonical form) or "undef" to reject the
631        attribute.
632
633        The default implementation behaves as validate_href_attribute().
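The URL validation methods can also be called directly on a filter object, which is a convenient way to see what the defaults accept. This sketch uses made-up sample URLs.

```perl
use strict;
use warnings;
use HTML::StripScripts;

my $strict  = HTML::StripScripts->new( { AllowHref => 1 } );
my $relaxed = HTML::StripScripts->new( { AllowHref => 1, AllowRelURL => 1 } );

# Absolute http URL with a query string: expected to be accepted.
my $ok = $strict->validate_href_attribute('http://example.com/index.html?x=1');

# Non-http scheme: expected to be rejected (undef).
my $bad = $strict->validate_href_attribute('javascript:alert(1)');

# Relative URL: only accepted when AllowRelURL is set.
my $rel_strict  = $strict->validate_href_attribute('../foo.html');
my $rel_relaxed = $relaxed->validate_href_attribute('../foo.html');

for my $result ( $ok, $bad, $rel_strict, $rel_relaxed ) {
    print defined $result ? "accepted: $result\n" : "rejected\n";
}
```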
634
635OTHER METHODS TO OVERRIDE
636    As well as the output, reject, init and cdata methods listed above, it
637    might make sense for subclasses to override the following methods:
638
639    filter_text ( TEXT )
640        This method will be invoked to filter blocks of non-tag text in the
641        input document. Both input and output are in canonical form, see
642        "CANONICAL FORM".
643
644        The default implementation does no filtering.
645
646    escape_html_metachars ( TEXT )
647        This method is used to escape all HTML metacharacters in TEXT. The
648        return value must be a copy of TEXT with metacharacters escaped.
649
650        The default implementation escapes a minimal set of metacharacters
651        for security against XSS vulnerabilities. The set of characters to
652        escape is a compromise between the need for security and the need to
653        ensure that the filter will work for documents in as many different
654        character sets as possible.
655
656        Subclasses which make strong assumptions about the document
657        character set will be able to escape much more aggressively.
658
659    strip_nonprintable ( TEXT )
660        Returns a copy of TEXT with runs of nonprintable characters replaced
661        with spaces or some other harmless string. Avoids replacing anything
662        with the empty string, as that can lead to other security issues.
663
664        The default implementation strips out only NULL characters, in order
665        to avoid scrambling text for as many different character sets as
666        possible.
667
668        Subclasses which make some sort of assumption about the character
669        set in use will be able to have a much wider definition of a
670        nonprintable character, and hence a more secure strip_nonprintable()
671        implementation.
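As one illustration of these override points, a subclass could replace filter_text() to rewrite non-tag text. The package name My::Censor and the word being masked are assumptions made for the sketch.

```perl
use strict;
use warnings;
use HTML::StripScripts;

package My::Censor;    # hypothetical subclass
use parent -norequire, 'HTML::StripScripts';

# Receives non-tag text in canonical form and must return canonical form.
sub filter_text {
    my ( $self, $text ) = @_;
    $text =~ s/\bdarn\b/d--n/g;
    return $text;
}

package main;

my $hss = My::Censor->new( { Context => 'Inline' } );
$hss->input_start_document;
$hss->input_start('<i>');
$hss->input_text('well, darn it');
$hss->input_end('</i>');
$hss->input_end_document;

my $out = $hss->filtered_document;
print "$out\n";
```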
672
673ATTRIBUTE VALUE HANDLER SUBS
674    References to the following subs appear in the "AttVal" whitelist
675    returned by the init_attval_whitelist() method.
676
677    _hss_attval_style ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
678        Attribute value handler for the "style" attribute.
679
680    _hss_attval_size ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
681        Attribute value handler for attributes whose values are some sort of
682        size or length.
683
684    _hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
685        Attribute value handler for attributes whose values are a simple
686        integer.
687
688    _hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
689        Attribute value handler for color attributes.
690
691    _hss_attval_text ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
692        Attribute value handler for text attributes.
693
694    _hss_attval_word ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
695        Attribute value handler for attributes whose values must consist of
696        a single short word, with minus characters permitted.
697
698    _hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
699        Attribute value handler for attributes whose values must consist of
700        one or more words, separated by spaces and/or commas.
701
702    _hss_attval_wordlistq ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
703        Attribute value handler for attributes whose values must consist of
704        one or more words, separated by commas, with optional doublequotes
705        around words and spaces allowed within the doublequotes.
706
707    _hss_attval_href ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
708        Attribute value handler for "href" type attributes. Uses
709        validate_href_attribute() if "AllowHref" is set, and
710        validate_mailto() for "mailto:" values if "AllowMailto" is set.
711
712    _hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
713        Attribute value handler for "src" type attributes. If the "AllowSrc"
714        configuration option is set, uses the validate_src_attribute()
715        method to check the attribute value.
716
717    _hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
718        Attribute value handler for "src" type style pseudo attributes.
719
720    _hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
721        Attribute value handler for attributes that have no value or a value
722        that is ignored. Just returns the attribute name as the value.
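
        As an illustration of the calling convention shared by the handlers
        above, here is a minimal standalone sketch. The function names, the
        30 character limit and the convention of returning undef to reject
        a value are assumptions made for this example, not the module's
        actual implementation.

```perl
use strict;
use warnings;

# Hypothetical handler using the argument order documented above:
# ( FILTER, TAGNAME, ATTRNAME, ATTRVAL ).
# Accepts a single short word (word characters and minus signs) and
# returns it with surrounding whitespace removed, or undef to reject
# the value (assumed convention).
sub my_attval_word {
    my ( $filter, $tagname, $attrname, $attrval ) = @_;
    return undef unless defined $attrval;
    return $attrval =~ /^\s*([\w\-]{1,30})\s*$/ ? $1 : undef;
}

# Hypothetical "no value" handler: the attribute name doubles as the value.
sub my_attval_novalue {
    my ( $filter, $tagname, $attrname, $attrval ) = @_;
    return $attrname;
}

print my_attval_word( undef, 'td', 'align', ' center ' ), "\n";
print my_attval_novalue( undef, 'input', 'checked', '' ), "\n";
```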
723
724CANONICAL FORM
725    Many of the methods described above deal with text from the input
726    document, encoded in what I call "canonical form", defined as follows:
727
728    All characters other than ampersands represent themselves. Literal
729    ampersands are encoded as "&amp;". Non "US-ASCII" characters may appear
730    as literals in whatever character set is in use, or they may appear as
731    named or numeric HTML entities such as "&aelig;", "&#31337;" and
732    "&#xFF;". Unknown named entities such as "&foo;" may appear.
733
734    The idea is to be able to reduce input text to a minimal
735    form, without making too many assumptions about the character set in
736    use.
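
    The definition above can be illustrated with a small standalone
    sketch that encodes any ampersand not starting a named or numeric
    entity. The helper name to_canonical_sketch() is hypothetical. The
    module's own text_to_canonical_form() does more work (for example, it
    handles numeric entities through _hss_decode_numeric()), which this
    sketch does not attempt.

```perl
use strict;
use warnings;

# Hypothetical helper: encode every ampersand that does not begin a
# named or numeric entity as "&amp;" and leave all other characters,
# including unknown named entities such as "&foo;", untouched.
sub to_canonical_sketch {
    my ($text) = @_;
    $text =~ s/&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)/&amp;/g;
    return $text;
}

# Prints: fish &amp; chips &aelig; &#xFF; &foo;
print to_canonical_sketch('fish & chips &aelig; &#xFF; &foo;'), "\n";
```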
737
738PRIVATE METHODS
739    The following methods are internal to this class, and should not be
740    invoked from elsewhere. Subclasses should not use or override these
741    methods.
742
743    _hss_prepare_ban_list (CFG)
744        Returns a hash ref representing all the banned tags, based on the
745        values of BanList and BanAllBut.
746
747    _hss_prepare_rules (CFG)
748        Returns a hash ref representing the tag and attribute rules (See
749        "Rules").
750
751        Returns undef if no filters are specified, in which case the
752        attribute filter code has very little performance impact. If any
753        rules are specified, then every tag and attribute is checked.
754
755    _hss_get_attr_filter ( DEFAULT_FILTERS, TAG_FILTERS, ATTR_NAME )
756        Returns the attribute filter rule to apply to this particular
757        attribute.
758
759        Checks for:
760
761          - a named attribute rule in a named tag
762          - a default * attribute rule in a named tag
763          - a named attribute rule in the default * rules
764          - a default * attribute rule in the default * rules
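
        The lookup order listed above can be sketched as a standalone
        function. The function name and the shape of the two hash refs
        (attribute name or '*' mapped to a rule) are assumptions made for
        illustration; the module's internal data may be organised
        differently.

```perl
use strict;
use warnings;

# Hypothetical lookup that tries the four candidates in the documented
# order and returns the first rule found, or undef if none applies.
sub get_attr_filter_sketch {
    my ( $default_filters, $tag_filters, $attr_name ) = @_;
    $default_filters ||= {};
    $tag_filters     ||= {};

    for my $candidate (
        [ $tag_filters,     $attr_name ],    # named attribute, named tag
        [ $tag_filters,     '*' ],           # default attribute, named tag
        [ $default_filters, $attr_name ],    # named attribute, default rules
        [ $default_filters, '*' ],           # default attribute, default rules
    ) {
        my ( $rules, $key ) = @$candidate;
        return $rules->{$key} if exists $rules->{$key};
    }
    return undef;
}

my $defaults = { '*' => 'default-any', title => 'default-title' };
my $for_a    = { href => 'a-href' };

print get_attr_filter_sketch( $defaults, $for_a, 'href' ),  "\n";    # a-href
print get_attr_filter_sketch( $defaults, $for_a, 'title' ), "\n";    # default-title
print get_attr_filter_sketch( $defaults, $for_a, 'class' ), "\n";    # default-any
```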
765
766    _hss_join_attribs (FILTERED_ATTRIBS)
767        Accepts a hash ref containing the attribute names as the keys, and
768        the attribute values as the values. Escapes them and returns a
769        string ready for output to HTML.
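
        A rough standalone sketch of this kind of join follows. The helper
        names are hypothetical, the values are assumed to be in canonical
        form already (so ampersands are left alone), and the keys are
        sorted only to make the output deterministic.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the escaping step: values are assumed to be
# in canonical form, so only '<', '>' and '"' need escaping here.
sub escape_attr_sketch {
    my ($value) = @_;
    my %map = ( '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $value =~ s/([<>"])/$map{$1}/g;
    return $value;
}

# Builds ' name="value"' for each attribute, in sorted key order.
sub join_attribs_sketch {
    my ($filtered_attribs) = @_;
    my $out = '';
    for my $name ( sort keys %$filtered_attribs ) {
        $out .= sprintf ' %s="%s"', $name,
            escape_attr_sketch( $filtered_attribs->{$name} );
    }
    return $out;
}

# Prints:  href="/a" title="say &quot;hi&quot;"
print join_attribs_sketch( { href => '/a', title => 'say "hi"' } ), "\n";
```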
770
771    _hss_decode_numeric ( NUMERIC )
772        Returns the string that should replace the numeric entity NUMERIC in
773        the text_to_canonical_form() method.
774
775    _hss_tag_is_banned ( TAGNAME )
776        Returns true if the lower case tag name TAGNAME is on the list of
777        harmless tags that the filter is configured to block, false
778        otherwise.
779
780    _hss_get_to_valid_context ( TAG )
781        Tries to get the filter to a context in which the tag TAG is
782        allowed, by introducing extra end tags or start tags if necessary.
783        TAG can be either the lower case name of a tag or the string
784        'CDATA'.
785
786        Returns 1 if an allowed context is reached, or 0 if there's no
787        reasonable way to get to an allowed context and the tag should just
788        be rejected.
789
790    _hss_close_innermost_tag ()
791        Closes the innermost open tag.
792
793    _hss_context ()
794        Returns the current named context of the filter.
795
796    _hss_valid_in_context ( TAG, CONTEXT )
797        Returns true if the lowercase tag name TAG is valid in context
798        CONTEXT, false otherwise.
799
800    _hss_valid_in_current_context ( TAG )
801        Returns true if the lowercase tag name TAG is valid in the filter's
802        current context, false otherwise.
803
804BUGS AND LIMITATIONS
805    Performance
806        This module does a lot of work to ensure that tags are correctly
807        nested and are not left open, causing unnecessary overhead for
808        applications where that doesn't matter.
809
810        Such applications may benefit from using the more lightweight
811        HTML::Scrubber::StripScripts module instead.
812
813    Strictness
814        URIs and email addresses are cleaned up to be safe, but not
815        necessarily accurate. Ensuring accuracy would have required adding
816        dependencies. Attribute callbacks can be used to add this
817        functionality if required, or the validation methods can be overridden.
818
819        By default, filtered HTML may not be valid strict XHTML; for
820        instance, empty required attributes may be output. However, with
821        "Rules", it should be possible to force the HTML to validate.
822
823    REPORTING BUGS
824        Please report any bugs or feature requests to
825        bug-html-stripscripts@rt.cpan.org, or through the web interface at
826        <http://rt.cpan.org>.
827
828SEE ALSO
829    HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex
830
831AUTHOR
832    Original author Nick Cleaton <nick@cleaton.net>
833
834    New code added and module maintained by Clinton Gormley
835    <clint@traveljury.com>
836
837COPYRIGHT
838    Copyright (C) 2003 Nick Cleaton. All Rights Reserved.
839
840    Copyright (C) 2007 Clinton Gormley. All Rights Reserved.
841
842LICENSE
843    This module is free software; you can redistribute it and/or modify it
844    under the same terms as Perl itself.
845
846