NAME
    HTML::StripScripts - Strip scripting constructs out of HTML

SYNOPSIS
      use HTML::StripScripts;

      my $hss = HTML::StripScripts->new({ Context => 'Inline' });

      $hss->input_start_document;

      $hss->input_start('<i>');
      $hss->input_text('hello, world!');
      $hss->input_end('</i>');

      $hss->input_end_document;

      print $hss->filtered_document;

DESCRIPTION
    This module strips scripting constructs out of HTML, leaving as much
    non-scripting markup in place as possible. This allows web applications
    to display HTML originating from an untrusted source without introducing
    XSS (cross site scripting) vulnerabilities.

    You will probably use HTML::StripScripts::Parser rather than using this
    module directly.

    The process is based on whitelists of tags, attributes and attribute
    values. This approach is the most secure against disguised scripting
    constructs hidden in malicious HTML documents.

    As well as removing scripting constructs, this module ensures that there
    is a matching end for each start tag, and that the tags are properly
    nested.

    Previously, in order to customise the output, you needed to subclass
    "HTML::StripScripts" and override methods. Now, most customisation can
    be done through the "Rules" option provided to "new()". (See
    examples/declaration/ and examples/tags/ for cases where subclassing is
    necessary.)

    The HTML document must be parsed into start tags, end tags and text
    before it can be filtered by this module. Use either
    HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you
    want to input an unparsed HTML document.

    See examples/direct/ for an example of how to feed tokens directly to
    HTML::StripScripts.

CONSTRUCTORS
    new ( CONFIG )
        Creates a new "HTML::StripScripts" filter object, bound to a
        particular filtering policy. If present, the CONFIG parameter must
        be a hashref.
        The following keys are recognized (unrecognized keys
        will be silently ignored).

           $s = HTML::StripScripts->new({
               Context         => 'Document|Flow|Inline|NoTags',
               BanList         => [qw( br img )] | {br => '1', img => '1'},
               BanAllBut       => [qw(p div span)],
               AllowSrc        => 0|1,
               AllowHref       => 0|1,
               AllowRelURL     => 0|1,
               AllowMailto     => 0|1,
               EscapeFiltered  => 0|1,
               Rules           => { See below for details },
           });

        "Context"
            A string specifying the context in which the filtered document
            will be used. This influences the set of tags that will be
            allowed.

            If present, the "Context" value must be one of:

            "Document"
                If "Context" is "Document" then the filter will allow a full
                HTML document, including the "HTML" tag and "HEAD" and
                "BODY" sections.

            "Flow"
                If "Context" is "Flow" then most of the cosmetic tags that
                one would expect to find in a document body are allowed,
                including lists and tables but not including forms.

            "Inline"
                If "Context" is "Inline" then only inline tags such as "B"
                and "FONT" are allowed.

            "NoTags"
                If "Context" is "NoTags" then no tags are allowed.

            The default "Context" value is "Flow".

        "BanList"
            If present, this option must be an arrayref or a hashref. Any
            tag that would normally be allowed (because it presents no XSS
            hazard) will be blocked if the lowercase name of the tag is in
            this list.

            For example, in a guestbook application where "HR" tags are used
            to separate posts, you may wish to prevent posts from including
            "HR" tags, even though "HR" is not an XSS risk.

        "BanAllBut"
            If present, this option must be a reference to an array holding
            a list of lowercase tag names. This has the effect of adding all
            but the listed tags to the ban list, so that only those tags
            listed will be allowed.

        "AllowSrc"
            By default, the filter won't allow constructs that cause the
            browser to fetch things automatically, such as "SRC" attributes
            in "IMG" tags. If this option is present and true then those
            constructs will be allowed.

        "AllowHref"
            By default, the filter won't allow constructs that cause the
            browser to fetch things if the user clicks on something, such as
            the "HREF" attribute in "A" tags. Set this option to a true
            value to allow this type of construct.

        "AllowRelURL"
            By default, the filter won't allow relative URLs such as
            "../foo.html" in "SRC" and "HREF" attribute values. Set this
            option to a true value to allow them. "AllowHref" and / or
            "AllowSrc" also need to be set to true for this to have any
            effect.

        "AllowMailto"
            By default, "mailto:" links are not allowed. If "AllowMailto" is
            set to a true value, then this construct will be allowed. This
            can be enabled separately from "AllowHref".

        "EscapeFiltered"
            By default, any filtered tags are output as
            "<!--filtered-->". If "EscapeFiltered" is set to a true value,
            then the filtered tags are converted to HTML entities.

            For instance:

              <br>  -->  &lt;br&gt;

        "Rules"
            The "Rules" option provides a very flexible way of customising
            the filter.

            The focus is safety-first, so it is applied after all of the
            previous validation. This means that all malicious data should
            already have been cleared.

            Rules can be specified for tags and for attributes. Any tag or
            attribute not explicitly listed will be handled by the default
            "*" rules.

            The following is a synopsis of all of the options that you can
            use to configure rules. Below, an example is broken into
            sections and explained.

             Rules => {

                tag => 0 | 1 | sub { tag_callback }
                       | {
                           attr      => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                           '*'       => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                           required  => [qw(attrname attrname)],
                           tag       => sub { tag_callback }
                         },

                '*' => 0 | 1 | sub { tag_callback }
                       | {
                           attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                           '*'  => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                           tag  => sub { tag_callback }
                         }

                }

            EXAMPLE:

                Rules => {

                    ##########################
                    ##### EXPLICIT RULES #####
                    ##########################

                    ## Allow <br> tags, reject <img> tags
                    br            => 1,
                    img           => 0,

                    ## Send all <div> tags to a sub
                    div           => sub { tag_callback },

                    ## Allow <blockquote> tags, and allow the 'cite' attribute
                    ## All other attributes are handled by the default '*'
                    blockquote    => {
                        cite          => 1,
                    },

                    ## Allow <a> tags, and
                    a             => {

                        ## Allow the 'title' attribute
                        title         => 1,

                        ## Allow the 'href' attribute if it matches the regex
                        href          => '^http://yourdomain.com',
                        ## or, written as a precompiled regex:
                        ## href       => qr{^http://yourdomain.com},

                        ## 'style' attributes are handled by a sub
                        style         => sub { attr_callback },

                        ## All other attributes are rejected
                        '*'           => 0,

                        ## Additionally, the <a> tag should be handled by this sub
                        tag           => sub { tag_callback },

                        ## If the <a> tag doesn't have these attributes, filter the tag
                        required      => [qw(href title)],

                    },

                    ##########################
                    ##### DEFAULT RULES  #####
                    ##########################

                    ## The default '*' rule - accepts all the same options as above.
                    ## If a tag or attribute is not mentioned above, then the default
                    ## rule is applied. The entries below are alternative forms of
                    ## the '*' rule; use only one of them:

                    ## Reject all tags
                    '*'           => 0,

                    ## Allow all tags and all attributes
                    '*'           => 1,

                    ## Send all tags to the sub
                    '*'           => sub { tag_callback },

                    ## Allow all tags, reject all attributes
                    '*'           => { '*'  => 0 },

                    ## Allow all tags, and
                    '*'           => {

                        ## Allow the 'title' attribute
                        title         => 1,

                        ## Allow the 'href' attribute if it matches the regex
                        href          => '^http://yourdomain.com',
                        ## or, written as a precompiled regex:
                        ## href       => qr{^http://yourdomain.com},

                        ## 'style' attributes are handled by a sub
                        style         => sub { attr_callback },

                        ## All other attributes are rejected
                        '*'           => 0,

                        ## Additionally, all tags should be handled by this sub
                        tag           => sub { tag_callback },

                    },

                }

            Tag Callbacks
                    sub tag_callback {
                        my ($filter,$element) = (@_);

                        ## $element has the following structure:
                        $element = {
                            tag      => 'tag',
                            content  => 'inner_html',
                            attr     => {
                                attr_name => 'attr_value',
                            }
                        };
                        return 0 | 1;
                    }

                A tag callback accepts two parameters, the $filter object
                and the $element. It should return 0 to completely ignore
                the tag and its content (which includes any nested HTML
                tags), or 1 to accept and output the tag.

                The $element is a hash ref containing the keys:

                "tag"
                    This is the tag name in lowercase, e.g. "a", "br", "img".
                    If you set the tag value to an empty string, then the tag
                    will not be output, but the tag contents will.

                "content"
                    This is the equivalent of DOM's innerHTML. It contains the
                    text content and any HTML tags contained within this
                    element. You can change the content or set it to an empty
                    string so that it is not output.

                "attr"
                    "attr" contains a hashref containing the attribute names
                    and values.

                If, for instance, you wanted to replace "<b>" tags with
                "<span>" tags, you could do this:

                    sub b_callback {
                        my ($filter,$element)    = @_;
                        $element->{tag}          = 'span';
                        $element->{attr}{style}  = 'font-weight:bold';
                        return 1;
                    }

            Attribute Callbacks
                    sub attr_callback {
                        my ( $filter, $tag, $attr_name, $attr_val ) = @_;
                        return undef | '' | 'value';
                    }

                Attribute callbacks accept four parameters: the $filter
                object, the $tag name, the $attr_name and the $attr_val.

                A callback should return either "undef" to reject the
                attribute, or the value to be used. An empty string keeps the
                attribute, but without a value.

            "BanList" vs "BanAllBut" vs "Rules"
                It is not necessary to use "BanList" or "BanAllBut" -
                everything can be done via "Rules". However, it may be simpler
                to write:

                    BanAllBut => [qw(p div span)]

                The logic works as follows:

                   * If BanAllBut exists, then ban everything but the tags in the list
                   * Add to the ban list any elements in BanList
                   * Any tags mentioned explicitly in Rules (eg a => 0, br => 1)
                     are added or removed from the BanList
                   * A default rule of { '*' => 0 } would ban all tags except
                     those mentioned in Rules
                   * A default rule of { '*' => 1 } would allow all tags except
                     those disallowed in the ban list, or by explicit rules

METHODS
    This class provides the following methods:

    hss_init ()
        This method is called by new() and does the actual initialisation
        work for the new HTML::StripScripts object.

    input_start_document ()
        This method initializes the filter, and must be called once before
        starting on each HTML document to be filtered.

    input_start ( TEXT )
        Handles a start tag from the input document. TEXT must be the full
        text of the tag, including angle-brackets.

    input_end ( TEXT )
        Handles an end tag from the input document. TEXT must be the full
        text of the end tag, including angle-brackets.

    input_text ( TEXT )
        Handles some non-tag text from the input document.

    input_process ( TEXT )
        Handles a processing instruction from the input document.

    input_comment ( TEXT )
        Handles an HTML comment from the input document.

    input_declaration ( TEXT )
        Handles a declaration from the input document.

    input_end_document ()
        Call this method to signal the end of the input document.

    filtered_document ()
        Returns the filtered document as a string.

SUBCLASSING
    The only reason for subclassing this module now is to add to the list of
    accepted tags, attributes and styles (See "WHITELIST INITIALIZATION
    METHODS"). Everything else can be achieved with "Rules".

    The "HTML::StripScripts" class is subclassable. Filter objects are plain
    hashes and "HTML::StripScripts" reserves only hash keys that start with
    "_hss". The filter configuration can be set up by invoking the
    hss_init() method, which takes the same arguments as new().

OUTPUT METHODS
    The filter outputs a stream of start tags, end tags, text, comments,
    declarations and processing instructions, via the following "output_*"
    methods. Subclasses may override these to intercept the filter output.

    The default implementations of the "output_*" methods pass the text on
    to the output() method. The default implementation of the output()
    method appends the text to a string, which can be fetched with the
    filtered_document() method once processing is complete.

    If the output() method or the individual "output_*" methods are
    overridden in a subclass, then filtered_document() will not work in that
    subclass.

    output_start_document ()
        This method gets called once at the start of each HTML document
        passed through the filter. The default implementation does nothing.

    output_end_document ()
        This method gets called once at the end of each HTML document passed
        through the filter. The default implementation does nothing.

    output_start ( TEXT )
        This method is used to output a filtered start tag.

    output_end ( TEXT )
        This method is used to output a filtered end tag.

    output_text ( TEXT )
        This method is used to output some filtered non-tag text.

    output_declaration ( TEXT )
        This method is used to output a filtered declaration.

    output_comment ( TEXT )
        This method is used to output a filtered HTML comment.

    output_process ( TEXT )
        This method is used to output a filtered processing instruction.

    output ( TEXT )
        This method is invoked by all of the default "output_*" methods. The
        default implementation appends the text to the string that the
        filtered_document() method will return.

    output_stack_entry ( TEXT )
        This method is invoked when a tag plus all text and nested HTML
        content within the tag has been processed. It adds the tag plus its
        content to the content for its parent tag.

REJECT METHODS
    When the filter encounters something in the input document which it
    cannot transform into an acceptable construct, it invokes one of the
    following "reject_*" methods to put something in the output document to
    take the place of the unacceptable construct.

    The TEXT parameter is the full text of the unacceptable construct.

    The default implementations of these methods output an HTML comment
    containing the text "filtered". If "EscapeFiltered" is set to true, then
    the rejected text is HTML escaped instead.

    Subclasses may override these methods, but should exercise caution.
    The TEXT parameter is unfiltered input and may contain malicious
    constructs.

    reject_start ( TEXT )
    reject_end ( TEXT )
    reject_text ( TEXT )
    reject_declaration ( TEXT )
    reject_comment ( TEXT )
    reject_process ( TEXT )

WHITELIST INITIALIZATION METHODS
    The filter refers to various whitelists to determine which constructs
    are acceptable. To modify these whitelists, subclasses can override the
    following methods.

    Each method is called once at object initialization time, and must
    return a reference to a nested data structure. These references are
    installed into the object, and used whenever the filter needs to refer
    to a whitelist.

    The default implementations of these methods can be invoked as class
    methods.

    See examples/tags/ and examples/declaration/ for examples of how to
    override these methods.

    init_context_whitelist ()
        Returns a reference to the "Context" whitelist, which determines
        which tags may appear at each point in the document, and which other
        tags may be nested within them.

        It is a hash, and the keys are context names, such as "Flow" and
        "Inline".

        The values in the hash are hashrefs. The keys in these subhashes are
        lowercase tag names, and the values are context names, specifying
        the context that the tag provides to any other tags nested within
        it.

        The special context "EMPTY" as a value in a subhash indicates that
        nothing can be nested within that tag.

    init_attrib_whitelist ()
        Returns a reference to the "Attrib" whitelist, which determines
        which attributes each tag can have and the values that those
        attributes can take.

        It is a hash, and the keys are lowercase tag names.

        The values in the hash are hashrefs.
        The keys in these subhashes are
        lowercase attribute names, and the values are attribute value class
        names, which are short strings describing the type of values that
        the attribute can take, such as "color" or "number".

    init_attval_whitelist ()
        Returns a reference to the "AttVal" whitelist, which is a hash that
        maps attribute value class names from the "Attrib" whitelist to
        coderefs to subs to validate (and optionally transform) a particular
        attribute value.

        The filter calls the attribute value validation subs with the
        following parameters:

        "filter"
            A reference to the filter object.

        "tagname"
            The lowercase name of the tag in which the attribute appears.

        "attrname"
            The name of the attribute.

        "attrval"
            The attribute value found in the input document, in canonical
            form (see "CANONICAL FORM").

        The validation sub can return undef to indicate that the attribute
        should be removed from the tag, or it can return the new value for
        the attribute, in canonical form.

    init_style_whitelist ()
        Returns a reference to the "Style" whitelist, which determines which
        CSS style directives are permitted in "style" tag attributes. The
        keys are value names such as "color" and "background-color", and the
        values are class names to be used as keys into the "AttVal"
        whitelist.

    init_deinter_whitelist ()
        Returns a reference to the "DeInter" whitelist, which determines
        which inline tags the filter should attempt to automatically
        de-interleave if they are encountered interleaved. For example, the
        filter will transform:

          <b>hello <i>world</b> !</i>

        Into:

          <b>hello <i>world</i></b><i> !</i>

        because both "b" and "i" appear as keys in the "DeInter" whitelist.
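
    As a rough illustration of how a subclass might extend one of these
    whitelists without altering the structure returned by the parent class,
    the sketch below copies an "Attrib"-style whitelist and adds entries to
    the copy. The helper name extend_attrib_whitelist, the package name
    My::StripScripts and the sample whitelist contents are hypothetical;
    only the shape (tag name => { attribute name => value class name })
    follows the description above.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of an "Attrib"-style whitelist
# (tag => { attr => value class name }) with extra entries merged in.
# The structure passed in as $base is left untouched.
sub extend_attrib_whitelist {
    my ( $base, $extra ) = @_;

    # Copy two levels deep, so that neither the top-level hash nor the
    # per-tag subhashes are shared with the original.
    my %copy;
    for my $tag ( keys %$base ) {
        $copy{$tag} = { %{ $base->{$tag} } };
    }

    # Merge the additional attributes, creating the tag entry if needed.
    for my $tag ( keys %$extra ) {
        $copy{$tag} = { %{ $copy{$tag} || {} }, %{ $extra->{$tag} } };
    }
    return \%copy;
}

# Illustrative data only, not the module's real whitelist.
my $base = { font => { color => 'color' } };
my $new  = extend_attrib_whitelist( $base, { font => { size => 'number' } } );

# A subclass could then use the helper along these lines (assumed usage,
# not executed here):
#
#   package My::StripScripts;
#   use base 'HTML::StripScripts';
#
#   sub init_attrib_whitelist {
#       my ($self) = @_;
#       return main::extend_attrib_whitelist(
#           $self->SUPER::init_attrib_whitelist,
#           { font => { size => 'number' } },
#       );
#   }
```

    Working on a copy is a precaution: the text above does not say whether
    the default methods return a fresh structure on each call, so the sketch
    avoids modifying whatever the parent class hands back.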

CHARACTER DATA PROCESSING
    These methods transform attribute values and non-tag text from the input
    document into canonical form (see "CANONICAL FORM"), and transform text
    in canonical form into a suitable form for the output document.

    text_to_canonical_form ( TEXT )
        This method is used to reduce non-tag text from the input document
        to canonical form before passing it to the filter_text() method.

        The default implementation unescapes all entities that map to
        "US-ASCII" characters other than ampersand, and replaces any
        ampersands that don't form part of valid entities with "&amp;".

    quoted_to_canonical_form ( VALUE )
        This method is used to reduce attribute values quoted with
        doublequotes or singlequotes to canonical form before passing them
        to the handler subs in the "AttVal" whitelist.

        The default behavior is the same as that of
        "text_to_canonical_form()", plus it converts any CR, LF or TAB
        characters to spaces.

    unquoted_to_canonical_form ( VALUE )
        This method is used to reduce attribute values without quotes to
        canonical form before passing them to the handler subs in the
        "AttVal" whitelist.

        The default implementation simply replaces all ampersands with
        "&amp;", since that corresponds with the way most browsers treat
        entities in unquoted values.

    canonical_form_to_text ( TEXT )
        This method is used to convert the text in canonical form returned
        by the filter_text() method to a form suitable for inclusion in the
        output document.

        The default implementation runs anything that doesn't look like a
        valid entity through the escape_html_metachars() method.

    canonical_form_to_attval ( ATTVAL )
        This method is used to convert the text in canonical form returned
        by the "AttVal" handler subs to a form suitable for inclusion in
        doublequotes in the output tag.

        The default implementation converts CR, LF and TAB characters to a
        single space, and runs anything that doesn't look like a valid
        entity through the escape_html_metachars() method.

    validate_href_attribute ( TEXT )
        If the "AllowHref" filter configuration option is set, then this
        method is used to validate "href" type attribute values. TEXT is the
        attribute value in canonical form. Returns a possibly modified
        attribute value (in canonical form) or "undef" to reject the
        attribute.

        The default implementation allows only absolute "http" and "https"
        URLs, permits port numbers and query strings, and imposes reasonable
        length limits.

        It does not URI escape the query string, and it does not guarantee
        properly formatted URIs; it just tries to give safe URIs. You can
        always use an attribute callback (see "Attribute Callbacks") to
        provide stricter handling.

    validate_mailto ( TEXT )
        If the "AllowMailto" filter configuration option is set, then this
        method is used to validate "href" type attribute values which begin
        with "mailto:". TEXT is the attribute value in canonical form.
        Returns a possibly modified attribute value (in canonical form) or
        "undef" to reject the attribute.

        This uses a lightweight regex and does not guarantee that email
        addresses are properly formatted. You can always use an attribute
        callback (see "Attribute Callbacks") to provide stricter handling.

    validate_src_attribute ( TEXT )
        If the "AllowSrc" filter configuration option is set, then this
        method is used to validate "src" type attribute values. TEXT is the
        attribute value in canonical form. Returns a possibly modified
        attribute value (in canonical form) or "undef" to reject the
        attribute.

        The default implementation behaves as validate_href_attribute().
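
    The stricter handling mentioned for validate_href_attribute() could be
    sketched with an attribute callback along the following lines. This is
    only an illustration: the sub name strict_href, the host
    www.example.com and the exact pattern are assumptions made for the
    example, not part of the module. The callback follows the convention
    described under "Attribute Callbacks": it returns "undef" to reject the
    attribute, or the value to be used.

```perl
use strict;
use warnings;

# Hypothetical attribute callback: accept only https URLs on one host.
# Parameters are those documented for attribute callbacks.
sub strict_href {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;

    return undef unless defined $attr_val;

    # The host must be followed by "/" plus an optional path and query,
    # or by the end of the value, so that lookalike hosts such as
    # www.example.com.evil.test do not match.
    return $attr_val
        if $attr_val =~ m{\A https://www\.example\.com (?: / [^\s"'<>]* )? \z}x;

    return undef;
}

# Configuration that could be passed to HTML::StripScripts->new().
# AllowHref is set because href attributes are not allowed by default.
my %config = (
    Context   => 'Flow',
    AllowHref => 1,
    Rules     => { a => { href => \&strict_href } },
);
```

    Passing \%config to new() would install the callback for "href"
    attributes on "a" tags. Since "Rules" are described as being applied
    after the built-in validation, the callback should only need to narrow
    down values that have already been accepted.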

OTHER METHODS TO OVERRIDE
    As well as the output, reject, init and cdata methods listed above, it
    might make sense for subclasses to override the following methods:

    filter_text ( TEXT )
        This method will be invoked to filter blocks of non-tag text in the
        input document. Both input and output are in canonical form, see
        "CANONICAL FORM".

        The default implementation does no filtering.

    escape_html_metachars ( TEXT )
        This method is used to escape all HTML metacharacters in TEXT. The
        return value must be a copy of TEXT with metacharacters escaped.

        The default implementation escapes a minimal set of metacharacters
        for security against XSS vulnerabilities. The set of characters to
        escape is a compromise between the need for security and the need to
        ensure that the filter will work for documents in as many different
        character sets as possible.

        Subclasses which make strong assumptions about the document
        character set will be able to escape much more aggressively.

    strip_nonprintable ( TEXT )
        Returns a copy of TEXT with runs of nonprintable characters replaced
        with spaces or some other harmless string. Avoids replacing anything
        with the empty string, as that can lead to other security issues.

        The default implementation strips out only NULL characters, in order
        to avoid scrambling text for as many different character sets as
        possible.

        Subclasses which make some sort of assumption about the character
        set in use will be able to have a much wider definition of a
        nonprintable character, and hence a more secure strip_nonprintable()
        implementation.

ATTRIBUTE VALUE HANDLER SUBS
    References to the following subs appear in the "AttVal" whitelist
    returned by the init_attval_whitelist() method.

    _hss_attval_style( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for the "style" attribute.

    _hss_attval_size ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values are some sort of
        size or length.

    _hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values are a simple
        integer.

    _hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for color attributes.

    _hss_attval_text ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for text attributes.

    _hss_attval_word ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values must consist of
        a single short word, with minus characters permitted.

    _hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values must consist of
        one or more words, separated by spaces and/or commas.

    _hss_attval_wordlistq ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes whose values must consist of
        one or more words, separated by commas, with optional doublequotes
        around words and spaces allowed within the doublequotes.

    _hss_attval_href ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for "href" type attributes. If the
        "AllowHref" or "AllowMailto" configuration options are set, uses the
        validate_href_attribute() method to check the attribute value.

    _hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for "src" type attributes. If the "AllowSrc"
        configuration option is set, uses the validate_src_attribute()
        method to check the attribute value.

    _hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for "src" type style pseudo attributes.

    _hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
        Attribute value handler for attributes that have no value or a value
        that is ignored. Just returns the attribute name as the value.

CANONICAL FORM
    Many of the methods described above deal with text from the input
    document, encoded in what I call "canonical form", defined as follows:

    All characters other than ampersands represent themselves. Literal
    ampersands are encoded as "&amp;". Non "US-ASCII" characters may appear
    as literals in whatever character set is in use, or they may appear as
    named or numeric HTML entities such as "&aelig;", "&#31337;" and
    "&#xFF;". Unknown named entities such as "&foo;" may appear.

    The idea is to be able to reduce input text to a minimal form, without
    making too many assumptions about the character set in use.

PRIVATE METHODS
    The following methods are internal to this class, and should not be
    invoked from elsewhere. Subclasses should not use or override these
    methods.

    _hss_prepare_ban_list (CFG)
        Returns a hash ref representing all the banned tags, based on the
        values of BanList and BanAllBut.

    _hss_prepare_rules (CFG)
        Returns a hash ref representing the tag and attribute rules (See
        "Rules").

        Returns undef if no filters are specified, in which case the
        attribute filter code has very little performance impact. If any
        rules are specified, then every tag and attribute is checked.

    _hss_get_attr_filter ( DEFAULT_FILTERS TAG_FILTERS ATTR_NAME)
        Returns the attribute filter rule to apply to this particular
        attribute.

        Checks for:

          - a named attribute rule in a named tag
          - a default * attribute rule in a named tag
          - a named attribute rule in the default * rules
          - a default * attribute rule in the default * rules

    _hss_join_attribs (FILTERED_ATTRIBS)
        Accepts a hash ref containing the attribute names as the keys, and
        the attribute values as the values. Escapes them and returns a
        string ready for output to HTML.

    _hss_decode_numeric ( NUMERIC )
        Returns the string that should replace the numeric entity NUMERIC in
        the text_to_canonical_form() method.

    _hss_tag_is_banned ( TAGNAME )
        Returns true if the lower case tag name TAGNAME is on the list of
        harmless tags that the filter is configured to block, false
        otherwise.

    _hss_get_to_valid_context ( TAG )
        Tries to get the filter to a context in which the tag TAG is
        allowed, by introducing extra end tags or start tags if necessary.
        TAG can be either the lower case name of a tag or the string
        'CDATA'.

        Returns 1 if an allowed context is reached, or 0 if there's no
        reasonable way to get to an allowed context and the tag should just
        be rejected.

    _hss_close_innermost_tag ()
        Closes the innermost open tag.

    _hss_context ()
        Returns the current named context of the filter.

    _hss_valid_in_context ( TAG, CONTEXT )
        Returns true if the lowercase tag name TAG is valid in context
        CONTEXT, false otherwise.

    _hss_valid_in_current_context ( TAG )
        Returns true if the lowercase tag name TAG is valid in the filter's
        current context, false otherwise.

BUGS AND LIMITATIONS
    Performance
        This module does a lot of work to ensure that tags are correctly
        nested and are not left open, causing unnecessary overhead for
        applications where that doesn't matter.

        Such applications may benefit from using the more lightweight
        HTML::Scrubber::StripScripts module instead.

    Strictness
        URIs and email addresses are cleaned up to be safe, but not
        necessarily accurate. That would have required adding dependencies.
        Attribute callbacks can be used to add this functionality if
        required, or the validation methods can be overridden.

        By default, filtered HTML may not be valid strict XHTML; for
        instance, empty required attributes may be output. However, with
        "Rules", it should be possible to force the HTML to validate.

    REPORTING BUGS
        Please report any bugs or feature requests to
        bug-html-stripscripts@rt.cpan.org, or through the web interface at
        <http://rt.cpan.org>.

SEE ALSO
    HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex

AUTHOR
    Original author Nick Cleaton <nick@cleaton.net>

    New code added and module maintained by Clinton Gormley
    <clint@traveljury.com>

COPYRIGHT
    Copyright (C) 2003 Nick Cleaton. All Rights Reserved.

    Copyright (C) 2007 Clinton Gormley. All Rights Reserved.

LICENSE
    This module is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.