.. _clean-chapter:
.. highlightlang:: python

=========================
Sanitizing text fragments
=========================

:py:func:`bleach.clean` is Bleach's HTML sanitization function.

Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing
algorithm and sanitize any disallowed tags or attributes. This algorithm also
takes care of things like unclosed and (some) misnested tags.

You may pass in a ``string`` or a ``unicode`` object, but Bleach will always
return ``unicode``.

.. Note::

   :py:func:`bleach.clean` is for sanitizing HTML **fragments** and not entire
   HTML documents.


.. Warning::

   :py:func:`bleach.clean` is for sanitizing HTML fragments to use in an HTML
   context--not for HTML attributes, CSS, JSON, XHTML, SVG, or other contexts.

   For example, this is a safe use of ``clean`` output in an HTML context::

     <p>
       {{ bleach.clean(user_bio) }}
     </p>


   This is **not a safe** use of ``clean`` output in an HTML attribute::

     <body data-bio="{{ bleach.clean(user_bio) }}">


   If you need to use the output of ``bleach.clean()`` in an HTML attribute, you
   need to pass it through your template library's escape function, for example
   Jinja2's ``escape`` or ``django.utils.html.escape``.

   If you need to use the output of ``bleach.clean()`` in any other context,
   you need to pass it through an appropriate sanitizer/escaper for that
   context.
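
Where cleaned output does have to go into an HTML attribute, the extra
escaping step might look like the sketch below. It is an illustration only and
not part of Bleach: the standard library's ``html.escape`` (Python 3) stands in
for your template library's escape function, and ``cleaned`` is a hypothetical
stand-in for the result of ``bleach.clean(user_bio)``.

.. code-block:: python

   from html import escape

   # Hypothetical stand-in for the result of bleach.clean(user_bio).
   cleaned = '<b>Ann "The Hammer" O\'Neil</b> and friends'

   # quote=True also escapes double and single quotes, so the value
   # cannot terminate the attribute it is placed in.
   attr_value = escape(cleaned, quote=True)

   tag = '<body data-bio="{}">'.format(attr_value)
   print(tag)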


.. autofunction:: bleach.clean


Allowed tags (``tags``)
=======================

The ``tags`` kwarg specifies the allowed set of HTML tags. It should be a list,
tuple, or other iterable. Any HTML tags not in this list will be escaped or
stripped from the text.

For example:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<b><i>an example</i></b>',
   ...     tags=['b'],
   ... )
   '<b>&lt;i&gt;an example&lt;/i&gt;</b>'


The default value is a relatively conservative list found in
``bleach.sanitizer.ALLOWED_TAGS``.


.. autodata:: bleach.sanitizer.ALLOWED_TAGS
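
To extend the defaults rather than replace them, one option is to build a new
list from ``ALLOWED_TAGS``. The sketch below assumes only that ``ALLOWED_TAGS``
is an iterable of tag names that includes ``b``; ``MY_TAGS`` is a name made up
for this example.

.. code-block:: python

   import bleach
   from bleach.sanitizer import ALLOWED_TAGS

   # Copy the defaults into a new list so the module-level value is untouched.
   MY_TAGS = list(ALLOWED_TAGS) + ['p']

   cleaned = bleach.clean('<p>a <b>bold</b> claim</p>', tags=MY_TAGS)
   print(cleaned)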


Allowed attributes (``attributes``)
===================================

The ``attributes`` kwarg lets you specify which attributes are allowed. The
value can be a list, a callable, or a dict that maps tag names to lists or
callables.

The default value is also a conservative dict found in
``bleach.sanitizer.ALLOWED_ATTRIBUTES``.


.. autodata:: bleach.sanitizer.ALLOWED_ATTRIBUTES

.. versionchanged:: 2.0

   Prior to 2.0, the ``attributes`` kwarg value could only be a list or a map.


As a list
---------

The ``attributes`` value can be a list which specifies the attributes allowed
for any tag.

For example:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<p class="foo" style="color: red; font-weight: bold;">blah blah blah</p>',
   ...     tags=['p'],
   ...     attributes=['style'],
   ...     styles=['color'],
   ... )
   '<p style="color: red;">blah blah blah</p>'


As a dict
---------

The ``attributes`` value can be a dict which maps tags to what attributes they
can have.

You can also specify ``*``, which will match any tag.

For example, this allows "href" and "rel" for "a" tags, "alt" for the "img" tag
and "class" for any tag (including "a" and "img"):

.. doctest::

   >>> import bleach

   >>> attrs = {
   ...     '*': ['class'],
   ...     'a': ['href', 'rel'],
   ...     'img': ['alt'],
   ... }

   >>> bleach.clean(
   ...    '<img alt="an example" width=500>',
   ...    tags=['img'],
   ...    attributes=attrs
   ... )
   '<img alt="an example">'
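
The ``*`` entry applies to every tag. As a small sketch of that, the call below
reuses the same ``attrs`` mapping on an ``img`` tag that carries a ``class``
attribute; the input string is invented for the example.

.. code-block:: python

   import bleach

   attrs = {
       '*': ['class'],
       'a': ['href', 'rel'],
       'img': ['alt'],
   }

   # "class" is kept through the "*" entry; "width" is listed nowhere
   # and is dropped.
   cleaned = bleach.clean(
       '<img class="thumb" width=500>',
       tags=['img'],
       attributes=attrs,
   )
   print(cleaned)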


Using functions
---------------

You can also use callables that take the tag, attribute name, and attribute
value and return ``True`` to keep the attribute or ``False`` to drop it.

You can pass a callable as the ``attributes`` argument value, and it will run
for every tag and attribute.

For example:

.. doctest::

   >>> import bleach

   >>> def allow_h(tag, name, value):
   ...     return name[0] == 'h'

   >>> bleach.clean(
   ...    '<a href="http://example.com" title="link">link</a>',
   ...    tags=['a'],
   ...    attributes=allow_h,
   ... )
   '<a href="http://example.com">link</a>'


You can also pass a callable as a value in an ``attributes`` dict, and it will
run only for attributes of the specified tags:

.. doctest::

   >>> from six.moves.urllib.parse import urlparse
   >>> import bleach

   >>> def allow_src(tag, name, value):
   ...     if name in ('alt', 'height', 'width'):
   ...         return True
   ...     if name == 'src':
   ...         p = urlparse(value)
   ...         return (not p.netloc) or p.netloc == 'mydomain.com'
   ...     return False

   >>> bleach.clean(
   ...    '<img src="http://example.com" alt="an example">',
   ...    tags=['img'],
   ...    attributes={
   ...        'img': allow_src
   ...    }
   ... )
   '<img alt="an example">'
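
Because such a callable is plain Python, it can be exercised on its own before
it is handed to Bleach. The following sketch uses a hypothetical
``allow_data_attrs`` function; its name and the 64-character limit are
assumptions made for illustration.

.. code-block:: python

   def allow_data_attrs(tag, name, value):
       """Keep ``data-*`` attributes whose value is at most 64 characters."""
       return name.startswith('data-') and len(value) <= 64

   # Direct calls mirror the (tag, name, value) arguments described above.
   checks = [
       allow_data_attrs('div', 'data-id', '42'),
       allow_data_attrs('div', 'onclick', 'alert(1)'),
       allow_data_attrs('div', 'data-blob', 'x' * 100),
   ]
   print(checks)

A function like this can then be passed as ``attributes=allow_data_attrs`` or
used as a value in an ``attributes`` dict.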


.. versionchanged:: 2.0

   In previous versions of Bleach, the callable took an attribute name and an
   attribute value. Now it takes a tag, an attribute name and an attribute
   value.


Allowed styles (``styles``)
===========================

If you allow the ``style`` attribute, you will also need to specify which
styles users may set, for example ``color`` and ``background-color``.

The default value is an empty list. In other words, the ``style`` attribute will
be allowed but no style declaration names will be allowed.

For example, to allow users to set the color and font-weight of text:

.. doctest::

   >>> import bleach

   >>> tags = ['p', 'em', 'strong']
   >>> attrs = {
   ...     '*': ['style']
   ... }
   >>> styles = ['color', 'font-weight']

   >>> bleach.clean(
   ...     '<p style="font-weight: heavy;">my html</p>',
   ...     tags=tags,
   ...     attributes=attrs,
   ...     styles=styles
   ... )
   '<p style="font-weight: heavy;">my html</p>'


Default styles are stored in ``bleach.sanitizer.ALLOWED_STYLES``.

.. autodata:: bleach.sanitizer.ALLOWED_STYLES


Allowed protocols (``protocols``)
=================================

If you allow tags that have attributes containing a URI value (like the ``href``
attribute of an anchor tag), you may want to adapt the accepted protocols.

For example, this sets allowed protocols to http, https and smb:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<a href="smb://more_text">allowed protocol</a>',
   ...     protocols=['http', 'https', 'smb']
   ... )
   '<a href="smb://more_text">allowed protocol</a>'


This adds smb to the Bleach-specified set of allowed protocols:

.. doctest::

   >>> import bleach

   >>> bleach.clean(
   ...     '<a href="smb://more_text">allowed protocol</a>',
   ...     protocols=bleach.ALLOWED_PROTOCOLS + ['smb']
   ... )
   '<a href="smb://more_text">allowed protocol</a>'


Default protocols are in ``bleach.sanitizer.ALLOWED_PROTOCOLS``.

.. autodata:: bleach.sanitizer.ALLOWED_PROTOCOLS


Stripping markup (``strip``)
============================

By default, Bleach *escapes* tags that aren't specified in the allowed tags list
and invalid markup. For example:

.. doctest::

   >>> import bleach

   >>> bleach.clean('<span>is not allowed</span>')
   '&lt;span&gt;is not allowed&lt;/span&gt;'

   >>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'])
   '<b>&lt;span&gt;is not allowed&lt;/span&gt;</b>'


If you would rather Bleach stripped this markup entirely, you can pass
``strip=True``:

.. doctest::

   >>> import bleach

   >>> bleach.clean('<span>is not allowed</span>', strip=True)
   'is not allowed'

   >>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'], strip=True)
   '<b>is not allowed</b>'


Stripping comments (``strip_comments``)
=======================================

By default, Bleach will strip out HTML comments. To disable this behavior, set
``strip_comments=False``:

.. doctest::

   >>> import bleach

   >>> html = 'my<!-- commented --> html'

   >>> bleach.clean(html)
   'my html'

   >>> bleach.clean(html, strip_comments=False)
   'my<!-- commented --> html'


Using ``bleach.sanitizer.Cleaner``
==================================

If you're cleaning a lot of text or you need better control of things, you
should create a :py:class:`bleach.sanitizer.Cleaner` instance.

.. autoclass:: bleach.sanitizer.Cleaner
   :members:

.. versionadded:: 2.0
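
A reuse pattern might look like the sketch below: the ``Cleaner`` is configured
once and then applied to several strings. The ``comments`` list is made-up
sample data, and the settings mirror the ``strip=True`` example shown earlier.

.. code-block:: python

   from bleach.sanitizer import Cleaner

   # Configure once; the same settings apply to every call to clean().
   cleaner = Cleaner(tags=['b'], strip=True)

   comments = [
       '<span>is not allowed</span>',
       '<b><span>is not allowed</span></b>',
   ]
   cleaned = [cleaner.clean(comment) for comment in comments]
   print(cleaned)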


html5lib Filters (``filters``)
------------------------------

Bleach sanitizing is implemented as an html5lib filter. The consequence of this
is that we can pass the streamed content through additional specified filters
after the :py:class:`bleach.sanitizer.BleachSanitizerFilter` filter has run.

This lets you add data, drop data and change data as it is being serialized back
to a unicode string.

Documentation on html5lib Filters is here:
http://html5lib.readthedocs.io/en/latest/movingparts.html#filters

Trivial Filter example:

.. doctest::

   >>> from bleach.sanitizer import Cleaner
   >>> from bleach.html5lib_shim import Filter

   >>> class MooFilter(Filter):
   ...     def __iter__(self):
   ...         for token in Filter.__iter__(self):
   ...             if token['type'] in ['StartTag', 'EmptyTag'] and token['data']:
   ...                 for attr, value in token['data'].items():
   ...                     token['data'][attr] = 'moo'
   ...             yield token
   ...
   >>> ATTRS = {
   ...     'img': ['rel', 'src']
   ... }
   ...
   >>> TAGS = ['img']
   >>> cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter])
   >>> dirty = 'this is cute! <img src="http://example.com/puppy.jpg" rel="nofollow">'
   >>> cleaner.clean(dirty)
   'this is cute! <img rel="moo" src="moo">'


.. Warning::

   Filters change the output of cleaning. Make sure that whatever changes the
   filter is applying maintain the safety guarantees of the output.

.. versionadded:: 2.0


Using ``bleach.sanitizer.BleachSanitizerFilter``
================================================

``bleach.clean`` creates a ``bleach.sanitizer.Cleaner`` which creates a
``bleach.sanitizer.BleachSanitizerFilter`` which does the sanitizing work.

``BleachSanitizerFilter`` is an html5lib filter and can be used anywhere you can
use an html5lib filter.

.. autoclass:: bleach.sanitizer.BleachSanitizerFilter


.. versionadded:: 2.0