.. _clean-chapter:
.. highlightlang:: python

=========================
Sanitizing text fragments
=========================

:py:func:`bleach.clean` is Bleach's HTML sanitization method.

Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing
algorithm and sanitize any disallowed tags or attributes. This algorithm also
takes care of things like unclosed and (some) misnested tags.

You may pass in a ``string`` or a ``unicode`` object, but Bleach will always
return ``unicode``.

.. Note::

   :py:func:`bleach.clean` is for sanitizing HTML **fragments** and not entire
   HTML documents.


.. Warning::

   :py:func:`bleach.clean` is for sanitizing HTML fragments to use in an HTML
   context, not for HTML attributes, CSS, JSON, XHTML, SVG, or other contexts.

   For example, this is a safe use of ``clean`` output in an HTML context::

     <p>
       {{ bleach.clean(user_bio) }}
     </p>


   This is **not a safe** use of ``clean`` output in an HTML attribute::

     <body data-bio="{{ bleach.clean(user_bio) }}">


   If you need to use the output of ``bleach.clean()`` in an HTML attribute, you
   need to pass it through your template library's escape function, for
   example Jinja2's ``escape`` or ``django.utils.html.escape``.

   If you need to use the output of ``bleach.clean()`` in any other context,
   you need to pass it through an appropriate sanitizer/escaper for that
   context.


.. autofunction:: bleach.clean


Allowed tags (``tags``)
=======================

The ``tags`` kwarg specifies the allowed set of HTML tags. It should be a list,
tuple, or other iterable. Any HTML tags not in this list will be escaped or
stripped from the text.

For example:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<b><i>an example</i></b>',
   ...     tags=['b'],
   ... )
   '<b>&lt;i&gt;an example&lt;/i&gt;</b>'


The default value is a relatively conservative list found in
``bleach.sanitizer.ALLOWED_TAGS``.


.. autodata:: bleach.sanitizer.ALLOWED_TAGS


Allowed Attributes (``attributes``)
===================================

The ``attributes`` kwarg lets you specify which attributes are allowed. The
value can be a list, a callable or a map of tag name to list or callable.

The default value is also a conservative dict found in
``bleach.sanitizer.ALLOWED_ATTRIBUTES``.


.. autodata:: bleach.sanitizer.ALLOWED_ATTRIBUTES

.. versionchanged:: 2.0

   Prior to 2.0, the ``attributes`` kwarg value could only be a list or a map.


As a list
---------

The ``attributes`` value can be a list of the attributes allowed for any tag.

For example:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<p class="foo" style="color: red; font-weight: bold;">blah blah blah</p>',
   ...     tags=['p'],
   ...     attributes=['style'],
   ...     styles=['color'],
   ... )
   '<p style="color: red;">blah blah blah</p>'


As a dict
---------

The ``attributes`` value can be a dict which maps tags to what attributes they can have.

You can also specify ``*``, which will match any tag.

For example, this allows "href" and "rel" for "a" tags, "alt" for the "img" tag
and "class" for any tag (including "a" and "img"):

.. doctest::

   >>> import bleach

   >>> attrs = {
   ...     '*': ['class'],
   ...     'a': ['href', 'rel'],
   ...     'img': ['alt'],
   ... }

   >>> bleach.clean(
   ...     '<img alt="an example" width=500>',
   ...     tags=['img'],
   ...     attributes=attrs
   ... )
   '<img alt="an example">'


Using functions
---------------

You can also use callables that take the tag, attribute name and attribute value
and return ``True`` to keep the attribute or ``False`` to drop it.

You can pass a callable as the ``attributes`` argument value and it will run
for every tag and attribute.

For example:

.. doctest::

   >>> import bleach

   >>> def allow_h(tag, name, value):
   ...     return name[0] == 'h'

   >>> bleach.clean(
   ...     '<a href="http://example.com" title="link">link</a>',
   ...     tags=['a'],
   ...     attributes=allow_h,
   ... )
   '<a href="http://example.com">link</a>'


You can also pass a callable as a value in an attributes dict and it will run
for the attributes of the specified tags:

.. doctest::

   >>> from six.moves.urllib.parse import urlparse
   >>> import bleach

   >>> def allow_src(tag, name, value):
   ...     if name in ('alt', 'height', 'width'):
   ...         return True
   ...     if name == 'src':
   ...         p = urlparse(value)
   ...         return (not p.netloc) or p.netloc == 'mydomain.com'
   ...     return False

   >>> bleach.clean(
   ...     '<img src="http://example.com" alt="an example">',
   ...     tags=['img'],
   ...     attributes={
   ...         'img': allow_src
   ...     }
   ... )
   '<img alt="an example">'


.. versionchanged:: 2.0

   In previous versions of Bleach, the callable took an attribute name and an
   attribute value. Now it takes a tag, an attribute name and an attribute
   value.


Allowed styles (``styles``)
===========================

If you allow the ``style`` attribute, you will also need to specify which
styles users are allowed to set, for example ``color`` and ``background-color``.

The default value is an empty list. In other words, the ``style`` attribute will
be allowed but no style declaration names will be allowed.

For example, to allow users to set the color and font-weight of text:

.. doctest::

   >>> import bleach

   >>> tags = ['p', 'em', 'strong']
   >>> attrs = {
   ...     '*': ['style']
   ... }
   >>> styles = ['color', 'font-weight']

   >>> bleach.clean(
   ...     '<p style="font-weight: heavy;">my html</p>',
   ...     tags=tags,
   ...     attributes=attrs,
   ...     styles=styles
   ... )
   '<p style="font-weight: heavy;">my html</p>'


Default styles are stored in ``bleach.sanitizer.ALLOWED_STYLES``.

.. autodata:: bleach.sanitizer.ALLOWED_STYLES


Allowed protocols (``protocols``)
=================================

If you allow tags that have attributes containing a URI value (like the ``href``
attribute of an anchor tag), you may want to adapt the accepted protocols.

For example, this sets allowed protocols to http, https and smb:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<a href="smb://more_text">allowed protocol</a>',
   ...     protocols=['http', 'https', 'smb']
   ... )
   '<a href="smb://more_text">allowed protocol</a>'


This adds smb to the Bleach-specified set of allowed protocols:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<a href="smb://more_text">allowed protocol</a>',
   ...     protocols=bleach.ALLOWED_PROTOCOLS + ['smb']
   ... )
   '<a href="smb://more_text">allowed protocol</a>'


Default protocols are in ``bleach.sanitizer.ALLOWED_PROTOCOLS``.

.. autodata:: bleach.sanitizer.ALLOWED_PROTOCOLS


Stripping markup (``strip``)
============================

By default, Bleach *escapes* tags that aren't specified in the allowed tags list
and invalid markup. For example:

.. doctest::

   >>> import bleach

   >>> bleach.clean('<span>is not allowed</span>')
   '&lt;span&gt;is not allowed&lt;/span&gt;'

   >>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'])
   '<b>&lt;span&gt;is not allowed&lt;/span&gt;</b>'


If you would rather Bleach stripped this markup entirely, you can pass
``strip=True``:

.. doctest::

   >>> import bleach

   >>> bleach.clean('<span>is not allowed</span>', strip=True)
   'is not allowed'

   >>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'], strip=True)
   '<b>is not allowed</b>'


Stripping comments (``strip_comments``)
=======================================

By default, Bleach will strip out HTML comments. To disable this behavior, set
``strip_comments=False``:

.. doctest::

   >>> import bleach

   >>> html = 'my<!-- commented --> html'

   >>> bleach.clean(html)
   'my html'

   >>> bleach.clean(html, strip_comments=False)
   'my<!-- commented --> html'


Using ``bleach.sanitizer.Cleaner``
==================================

If you're cleaning a lot of text or you need better control of things, you
should create a :py:class:`bleach.sanitizer.Cleaner` instance.

.. autoclass:: bleach.sanitizer.Cleaner
   :members:

.. versionadded:: 2.0


html5lib Filters (``filters``)
------------------------------

Bleach sanitizing is implemented as an html5lib filter. The consequence of this
is that we can pass the streamed content through additional specified filters
after the :py:class:`bleach.sanitizer.BleachSanitizerFilter` filter has run.

This lets you add data, drop data and change data as it is being serialized back
to a unicode string.

Documentation on html5lib Filters is here:
http://html5lib.readthedocs.io/en/latest/movingparts.html#filters

Trivial Filter example:

.. doctest::

   >>> from bleach.sanitizer import Cleaner
   >>> from bleach.html5lib_shim import Filter

   >>> class MooFilter(Filter):
   ...     def __iter__(self):
   ...         for token in Filter.__iter__(self):
   ...             if token['type'] in ['StartTag', 'EmptyTag'] and token['data']:
   ...                 for attr, value in token['data'].items():
   ...                     token['data'][attr] = 'moo'
   ...             yield token
   ...
   >>> ATTRS = {
   ...     'img': ['rel', 'src']
   ... }

   >>> TAGS = ['img']
   >>> cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter])
   >>> dirty = 'this is cute! <img src="http://example.com/puppy.jpg" rel="nofollow">'
   >>> cleaner.clean(dirty)
   'this is cute! <img rel="moo" src="moo">'


.. Warning::

   Filters change the output of cleaning. Make sure that whatever changes the
   filter is applying maintain the safety guarantees of the output.

.. versionadded:: 2.0


Using ``bleach.sanitizer.BleachSanitizerFilter``
================================================

``bleach.clean`` creates a ``bleach.sanitizer.Cleaner`` which creates a
``bleach.sanitizer.BleachSanitizerFilter`` which does the sanitizing work.

``BleachSanitizerFilter`` is an html5lib filter and can be used anywhere you can
use an html5lib filter.

.. autoclass:: bleach.sanitizer.BleachSanitizerFilter


.. versionadded:: 2.0
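
As a rough illustration of the layering between ``bleach.clean`` and
``Cleaner`` described in this section, the sketch below passes the same
arguments to both and compares the results. The sample input and the
``options`` dict are made up for this example, and the comparison assumes that
``bleach.clean`` forwards its arguments to ``Cleaner`` unchanged.

.. code-block:: python

   import bleach
   from bleach.sanitizer import Cleaner

   # Sample input, made up for this sketch.
   dirty = '<b><span>is not allowed</span></b>'

   # Settings shared by both calls (hypothetical helper dict).
   options = {'tags': ['b'], 'strip': True}

   # Convenience function: a Cleaner is created behind the scenes.
   via_function = bleach.clean(dirty, **options)

   # Explicit Cleaner: build it once, then reuse it for many fragments.
   cleaner = Cleaner(**options)
   via_cleaner = cleaner.clean(dirty)

   # Both paths are expected to give the same sanitized text.
   print(via_function == via_cleaner)

Building the ``Cleaner`` once and calling ``cleaner.clean()`` repeatedly is
the pattern to prefer when many fragments share the same settings.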