1.. highlight:: c
2
3.. _unicodeobjects:
4
5Unicode Objects and Codecs
6--------------------------
7
8.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
9.. sectionauthor:: Georg Brandl <georg@python.org>
10
11Unicode Objects
12^^^^^^^^^^^^^^^
13
14Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
15use a variety of representations, in order to allow handling the complete range
16of Unicode characters while staying memory efficient.  There are special cases
17for strings where all code points are below 128, 256, or 65536; otherwise, code
18points must be below 1114112 (which is the full Unicode range).
19
20:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
21in the Unicode object.  The :c:type:`Py_UNICODE*` representation is deprecated
22and inefficient.
23
24Due to the transition between the old APIs and the new APIs, Unicode objects
25can internally be in two states depending on how they were created:
26
27* "canonical" Unicode objects are all objects created by a non-deprecated
28  Unicode API.  They use the most efficient representation allowed by the
29  implementation.
30
31* "legacy" Unicode objects have been created through one of the deprecated
32  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
33  :c:type:`Py_UNICODE*` representation; you will have to call
34  :c:func:`PyUnicode_READY` on them before calling any other API.
35
36.. note::
   The "legacy" Unicode object will be removed in Python 3.12, together with
   the deprecated APIs.  From then on, all Unicode objects will be
   "canonical".  See :pep:`623` for more information.
40
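The size classes described above can be illustrated with a small, standalone
sketch.  The helper name ``bytes_per_code_point`` is hypothetical and is not
part of the C API; it only mirrors the thresholds listed in this section,
assuming that the "below 128" and "below 256" cases both use one byte per code
point.

.. code-block:: c

   #include <stdint.h>

   /* Hypothetical helper, not a CPython function: number of bytes needed per
      code point for a string whose largest code point is max_code_point. */
   static int
   bytes_per_code_point(uint32_t max_code_point)
   {
       if (max_code_point < 256) {
           return 1;   /* covers both the "below 128" and "below 256" cases */
       }
       if (max_code_point < 65536) {
           return 2;
       }
       return 4;       /* up to 1114111, the end of the Unicode range */
   }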
41
42Unicode Type
43""""""""""""
44
45These are the basic Unicode object types used for the Unicode implementation in
46Python:
47
48.. c:type:: Py_UCS4
49            Py_UCS2
50            Py_UCS1
51
52   These types are typedefs for unsigned integer types wide enough to contain
53   characters of 32 bits, 16 bits and 8 bits, respectively.  When dealing with
54   single Unicode characters, use :c:type:`Py_UCS4`.
55
56   .. versionadded:: 3.3
57
58
59.. c:type:: Py_UNICODE
60
61   This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type
62   depending on the platform.
63
64   .. versionchanged:: 3.3
65      In previous versions, this was a 16-bit type or a 32-bit type depending on
66      whether you selected a "narrow" or "wide" Unicode version of Python at
67      build time.
68
69
70.. c:type:: PyASCIIObject
71            PyCompactUnicodeObject
72            PyUnicodeObject
73
74   These subtypes of :c:type:`PyObject` represent a Python Unicode object.  In
75   almost all cases, they shouldn't be used directly, since all API functions
76   that deal with Unicode objects take and return :c:type:`PyObject` pointers.
77
78   .. versionadded:: 3.3
79
80
81.. c:var:: PyTypeObject PyUnicode_Type
82
83   This instance of :c:type:`PyTypeObject` represents the Python Unicode type.  It
84   is exposed to Python code as ``str``.
85
86
87The following APIs are really C macros and can be used to do fast checks and to
88access internal read-only data of Unicode objects:
89
90.. c:function:: int PyUnicode_Check(PyObject *o)
91
92   Return true if the object *o* is a Unicode object or an instance of a Unicode
93   subtype.  This function always succeeds.
94
95
96.. c:function:: int PyUnicode_CheckExact(PyObject *o)
97
98   Return true if the object *o* is a Unicode object, but not an instance of a
99   subtype.  This function always succeeds.
100
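   A minimal sketch of how the two checks differ.  ``classify_str`` is a
   hypothetical helper name used only for illustration:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: 2 for an exact str, 1 for an instance of a str
         subtype, 0 for any other object. */
      static int
      classify_str(PyObject *o)
      {
          if (PyUnicode_CheckExact(o)) {
              return 2;
          }
          if (PyUnicode_Check(o)) {
              return 1;
          }
          return 0;
      }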
101
102.. c:function:: int PyUnicode_READY(PyObject *o)
103
104   Ensure the string object *o* is in the "canonical" representation.  This is
105   required before using any of the access macros described below.
106
107   .. XXX expand on when it is not required
108
109   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
110   particular happens if memory allocation fails.
111
112   .. versionadded:: 3.3
113
114   .. deprecated-removed:: 3.10 3.12
115      This API will be removed with :c:func:`PyUnicode_FromUnicode`.
116
117
118.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)
119
120   Return the length of the Unicode string, in code points.  *o* has to be a
121   Unicode object in the "canonical" representation (not checked).
122
123   .. versionadded:: 3.3
124
125
126.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
127                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
128                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)
129
   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access.  No check is made that the
   canonical representation actually has the matching character size; use
   :c:func:`PyUnicode_KIND` to select the right macro.  Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.
135
136   .. versionadded:: 3.3
137
138
139.. c:macro:: PyUnicode_WCHAR_KIND
140             PyUnicode_1BYTE_KIND
141             PyUnicode_2BYTE_KIND
142             PyUnicode_4BYTE_KIND
143
144   Return values of the :c:func:`PyUnicode_KIND` macro.
145
146   .. versionadded:: 3.3
147
148   .. deprecated-removed:: 3.10 3.12
149      ``PyUnicode_WCHAR_KIND`` is deprecated.
150
151
152.. c:function:: unsigned int PyUnicode_KIND(PyObject *o)
153
   Return one of the PyUnicode kind constants (see above) that indicates how many
   bytes per character this Unicode object uses to store its data.  *o* has to
   be a Unicode object in the "canonical" representation (not checked).
157
158   .. XXX document "0" return value?
159
160   .. versionadded:: 3.3
161
162
163.. c:function:: void* PyUnicode_DATA(PyObject *o)
164
165   Return a void pointer to the raw Unicode buffer.  *o* has to be a Unicode
166   object in the "canonical" representation (not checked).
167
168   .. versionadded:: 3.3
169
170
171.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \
172                                     Py_UCS4 value)
173
174   Write into a canonical representation *data* (as obtained with
175   :c:func:`PyUnicode_DATA`).  This macro does not do any sanity checks and is
176   intended for usage in loops.  The caller should cache the *kind* value and
177   *data* pointer as obtained from other macro calls.  *index* is the index in
178   the string (starts at 0) and *value* is the new code point value which should
179   be written to that location.
180
181   .. versionadded:: 3.3
182
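   As a rough sketch of the intended loop usage, the following fills a freshly
   allocated string with the lowercase ASCII alphabet.  ``make_alphabet`` is a
   hypothetical helper, and the string comes from :c:func:`PyUnicode_New`, so
   it is assumed to be writable and not yet shared:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return a new str "abc...z", or NULL on error. */
      static PyObject *
      make_alphabet(void)
      {
          PyObject *s = PyUnicode_New(26, 127);   /* 26 ASCII code points */
          if (s == NULL) {
              return NULL;
          }
          int kind = PyUnicode_KIND(s);      /* cached once, outside the loop */
          void *data = PyUnicode_DATA(s);    /* cached once, outside the loop */
          for (Py_ssize_t i = 0; i < 26; i++) {
              PyUnicode_WRITE(kind, data, i, (Py_UCS4)('a' + i));
          }
          return s;
      }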
183
184.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)
185
186   Read a code point from a canonical representation *data* (as obtained with
187   :c:func:`PyUnicode_DATA`).  No checks or ready calls are performed.
188
189   .. versionadded:: 3.3
190
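   The following sketch shows one way to read every code point of a string
   with the *kind* and *data* values cached outside the loop.
   ``count_code_point`` is a hypothetical helper; it assumes *s* is a str
   object in the "canonical" representation, which is not checked:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: count how often ch occurs in the str object s. */
      static Py_ssize_t
      count_code_point(PyObject *s, Py_UCS4 ch)
      {
          Py_ssize_t length = PyUnicode_GET_LENGTH(s);
          int kind = PyUnicode_KIND(s);
          const void *data = PyUnicode_DATA(s);
          Py_ssize_t count = 0;

          for (Py_ssize_t i = 0; i < length; i++) {
              if (PyUnicode_READ(kind, data, i) == ch) {
                  count++;
              }
          }
          return count;
      }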
191
192.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)
193
194   Read a character from a Unicode object *o*, which must be in the "canonical"
195   representation.  This is less efficient than :c:func:`PyUnicode_READ` if you
196   do multiple consecutive reads.
197
198   .. versionadded:: 3.3
199
200
201.. c:macro:: PyUnicode_MAX_CHAR_VALUE(o)
202
203   Return the maximum code point that is suitable for creating another string
204   based on *o*, which must be in the "canonical" representation.  This is
205   always an approximation but more efficient than iterating over the string.
206
207   .. versionadded:: 3.3
208
209
210.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
211
212   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
213   code units (this includes surrogate pairs as 2 units).  *o* has to be a
214   Unicode object (not checked).
215
216   .. deprecated-removed:: 3.3 3.12
217      Part of the old-style Unicode API, please migrate to using
218      :c:func:`PyUnicode_GET_LENGTH`.
219
220
221.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
222
223   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
224   bytes.  *o* has to be a Unicode object (not checked).
225
226   .. deprecated-removed:: 3.3 3.12
227      Part of the old-style Unicode API, please migrate to using
228      :c:func:`PyUnicode_GET_LENGTH`.
229
230
231.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
232                const char* PyUnicode_AS_DATA(PyObject *o)
233
234   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.  The
235   returned buffer is always terminated with an extra null code point.  It
236   may also contain embedded null code points, which would cause the string
237   to be truncated when used in most C functions.  The ``AS_DATA`` form
238   casts the pointer to :c:type:`const char *`.  The *o* argument has to be
239   a Unicode object (not checked).
240
241   .. versionchanged:: 3.3
242      This macro is now inefficient -- because in many cases the
243      :c:type:`Py_UNICODE` representation does not exist and needs to be created
244      -- and can fail (return ``NULL`` with an exception set).  Try to port the
245      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
246      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.
247
248   .. deprecated-removed:: 3.3 3.12
249      Part of the old-style Unicode API, please migrate to using the
250      :c:func:`PyUnicode_nBYTE_DATA` family of macros.
251
252
253.. c:function:: int PyUnicode_IsIdentifier(PyObject *o)
254
255   Return ``1`` if the string is a valid identifier according to the language
256   definition, section :ref:`identifiers`. Return ``0`` otherwise.
257
258   .. versionchanged:: 3.9
259      The function does not call :c:func:`Py_FatalError` anymore if the string
260      is not ready.
261
262
263Unicode Character Properties
264""""""""""""""""""""""""""""
265
Unicode provides many different character properties. The most commonly needed
ones are available through these macros, which are mapped to C functions
depending on the Python configuration.
269
270
271.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
272
273   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.
274
275
276.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
277
278   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.
279
280
281.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
282
283   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.
284
285
286.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
287
288   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.
289
290
291.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
292
293   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.
294
295
296.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
297
298   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.
299
300
301.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
302
303   Return ``1`` or ``0`` depending on whether *ch* is a digit character.
304
305
306.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
307
308   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.
309
310
311.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
312
313   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.
314
315
316.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
317
318   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.
319
320
321.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
322
323   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
324   Nonprintable characters are those characters defined in the Unicode character
325   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
326   considered printable.  (Note that printable characters in this context are
327   those which should not be escaped when :func:`repr` is invoked on a string.
328   It has no bearing on the handling of strings written to :data:`sys.stdout` or
329   :data:`sys.stderr`.)
330
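As an illustration of the property macros, the sketch below counts the
whitespace characters in a string.  ``count_whitespace`` is a hypothetical
helper; it reads characters with the checked function
:c:func:`PyUnicode_ReadChar` rather than the faster access macros.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: number of whitespace characters in s, or -1 with
      an exception set if s is not a str object. */
   static Py_ssize_t
   count_whitespace(PyObject *s)
   {
       Py_ssize_t length = PyUnicode_GetLength(s);
       if (length < 0) {
           return -1;
       }
       Py_ssize_t count = 0;
       for (Py_ssize_t i = 0; i < length; i++) {
           Py_UCS4 ch = PyUnicode_ReadChar(s, i);
           if (ch == (Py_UCS4)-1) {
               return -1;   /* error value; never a valid code point */
           }
           if (Py_UNICODE_ISSPACE(ch)) {
               count++;
           }
       }
       return count;
   }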
331
332These APIs can be used for fast direct character conversions:
333
334
335.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
336
337   Return the character *ch* converted to lower case.
338
339   .. deprecated:: 3.3
340      This function uses simple case mappings.
341
342
343.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
344
345   Return the character *ch* converted to upper case.
346
347   .. deprecated:: 3.3
348      This function uses simple case mappings.
349
350
351.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
352
353   Return the character *ch* converted to title case.
354
355   .. deprecated:: 3.3
356      This function uses simple case mappings.
357
358
359.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
360
   Return the character *ch* converted to a positive decimal integer.  Return
   ``-1`` if this is not possible.  This macro does not raise exceptions.
363
364
365.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
366
367   Return the character *ch* converted to a single digit integer. Return ``-1`` if
368   this is not possible.  This macro does not raise exceptions.
369
370
371.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
372
373   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
374   possible.  This macro does not raise exceptions.
375
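A short sketch of how :c:func:`Py_UNICODE_TODECIMAL` might be used to parse a
string of decimal digits, including non-ASCII digits.  ``parse_decimal`` is a
hypothetical helper; it assumes *s* is a str object and does not handle
overflow.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: parse a str made only of Unicode decimal digits.
      Returns 0 and stores the value in *out, or returns -1 if s contains
      any other character.  An empty string yields the value 0. */
   static int
   parse_decimal(PyObject *s, long *out)
   {
       Py_ssize_t length = PyUnicode_GET_LENGTH(s);
       long value = 0;

       for (Py_ssize_t i = 0; i < length; i++) {
           int digit = Py_UNICODE_TODECIMAL(PyUnicode_READ_CHAR(s, i));
           if (digit < 0) {
               return -1;
           }
           value = value * 10 + digit;
       }
       *out = value;
       return 0;
   }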
376
377These APIs can be used to work with surrogates:
378
379.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)
380
381   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).
382
383.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)
384
385   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).
386
387.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)
388
389   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).
390
391.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)
392
   Join two surrogate characters and return a single :c:type:`Py_UCS4` value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.
396
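The surrogate macros can be combined as in the following sketch, which joins
two UTF-16 code units only when they form a valid pair.  ``join_utf16_units``
is a hypothetical helper name.

.. code-block:: c

   #define PY_SSIZE_T_CLEAN
   #include <Python.h>

   /* Hypothetical helper: combine two UTF-16 code units into one code point.
      Returns the joined code point for a valid surrogate pair; otherwise
      returns the first unit unchanged. */
   static Py_UCS4
   join_utf16_units(Py_UCS4 first, Py_UCS4 second)
   {
       if (Py_UNICODE_IS_HIGH_SURROGATE(first)
           && Py_UNICODE_IS_LOW_SURROGATE(second)) {
           return Py_UNICODE_JOIN_SURROGATES(first, second);
       }
       return first;
   }

For example, the pair ``0xD83D``, ``0xDE00`` joins to ``0x1F600``.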
397
398Creating and accessing Unicode strings
399""""""""""""""""""""""""""""""""""""""
400
401To create Unicode objects and access their basic sequence properties, use these
402APIs:
403
404.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
405
406   Create a new Unicode object.  *maxchar* should be the true maximum code point
407   to be placed in the string.  As an approximation, it can be rounded up to the
408   nearest value in the sequence 127, 255, 65535, 1114111.
409
410   This is the recommended way to allocate a new Unicode object.  Objects
411   created using this function are not resizable.
412
413   .. versionadded:: 3.3
414
415
416.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
417                                                    Py_ssize_t size)
418
419   Create a new Unicode object with the given *kind* (possible values are
420   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
421   :c:func:`PyUnicode_KIND`).  The *buffer* must point to an array of *size*
422   units of 1, 2 or 4 bytes per character, as given by the kind.
423
424   .. versionadded:: 3.3
425
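   A minimal sketch, assuming a fixed buffer of 16-bit units;
   ``make_greek_abc`` is a hypothetical helper name:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return a new str holding U+03B1 U+03B2 U+03B3,
         or NULL on error. */
      static PyObject *
      make_greek_abc(void)
      {
          /* All three code points fit in 16 bits, so UCS2 units suffice. */
          static const Py_UCS2 units[] = {0x03B1, 0x03B2, 0x03B3};
          return PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND, units, 3);
      }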
426
427.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
428
429   Create a Unicode object from the char buffer *u*.  The bytes will be
430   interpreted as being UTF-8 encoded.  The buffer is copied into the new
431   object. If the buffer is not ``NULL``, the return value might be a shared
432   object, i.e. modification of the data is not allowed.
433
434   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
435   with the buffer set to ``NULL``.  This usage is deprecated in favor of
436   :c:func:`PyUnicode_New`, and will be removed in Python 3.12.
437
438
439.. c:function:: PyObject *PyUnicode_FromString(const char *u)
440
441   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
442   *u*.
443
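   The sketch below contrasts this function with
   :c:func:`PyUnicode_FromStringAndSize`: passing an explicit size allows the
   buffer to contain embedded null bytes, whereas the null-terminated form
   stops at the first one.  ``str_with_embedded_null`` is a hypothetical
   helper name:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return the 3-character str "a\0b", or NULL on
         error.  The buffer is not null-terminated; its size is passed
         explicitly. */
      static PyObject *
      str_with_embedded_null(void)
      {
          static const char utf8[] = {'a', '\0', 'b'};
          return PyUnicode_FromStringAndSize(utf8, (Py_ssize_t)sizeof(utf8));
      }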
444
445.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)
446
447   Take a C :c:func:`printf`\ -style *format* string and a variable number of
448   arguments, calculate the size of the resulting Python Unicode string and return
449   a string with the values formatted into it.  The variable arguments must be C
450   types and must correspond exactly to the format characters in the *format*
451   ASCII-encoded string. The following format characters are allowed:
452
453   .. % This should be exactly the same as the table in PyErr_Format.
454   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
455   .. % because not all compilers support the %z width modifier -- we fake it
456   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
457   .. % Similar comments apply to the %ll width modifier and
458
459   .. tabularcolumns:: |l|l|L|
460
461   +-------------------+---------------------+----------------------------------+
462   | Format Characters | Type                | Comment                          |
463   +===================+=====================+==================================+
464   | :attr:`%%`        | *n/a*               | The literal % character.         |
465   +-------------------+---------------------+----------------------------------+
466   | :attr:`%c`        | int                 | A single character,              |
467   |                   |                     | represented as a C int.          |
468   +-------------------+---------------------+----------------------------------+
469   | :attr:`%d`        | int                 | Equivalent to                    |
470   |                   |                     | ``printf("%d")``. [1]_           |
471   +-------------------+---------------------+----------------------------------+
472   | :attr:`%u`        | unsigned int        | Equivalent to                    |
473   |                   |                     | ``printf("%u")``. [1]_           |
474   +-------------------+---------------------+----------------------------------+
475   | :attr:`%ld`       | long                | Equivalent to                    |
476   |                   |                     | ``printf("%ld")``. [1]_          |
477   +-------------------+---------------------+----------------------------------+
478   | :attr:`%li`       | long                | Equivalent to                    |
479   |                   |                     | ``printf("%li")``. [1]_          |
480   +-------------------+---------------------+----------------------------------+
481   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
482   |                   |                     | ``printf("%lu")``. [1]_          |
483   +-------------------+---------------------+----------------------------------+
484   | :attr:`%lld`      | long long           | Equivalent to                    |
485   |                   |                     | ``printf("%lld")``. [1]_         |
486   +-------------------+---------------------+----------------------------------+
487   | :attr:`%lli`      | long long           | Equivalent to                    |
488   |                   |                     | ``printf("%lli")``. [1]_         |
489   +-------------------+---------------------+----------------------------------+
490   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
491   |                   |                     | ``printf("%llu")``. [1]_         |
492   +-------------------+---------------------+----------------------------------+
493   | :attr:`%zd`       | Py_ssize_t          | Equivalent to                    |
494   |                   |                     | ``printf("%zd")``. [1]_          |
495   +-------------------+---------------------+----------------------------------+
496   | :attr:`%zi`       | Py_ssize_t          | Equivalent to                    |
497   |                   |                     | ``printf("%zi")``. [1]_          |
498   +-------------------+---------------------+----------------------------------+
499   | :attr:`%zu`       | size_t              | Equivalent to                    |
500   |                   |                     | ``printf("%zu")``. [1]_          |
501   +-------------------+---------------------+----------------------------------+
502   | :attr:`%i`        | int                 | Equivalent to                    |
503   |                   |                     | ``printf("%i")``. [1]_           |
504   +-------------------+---------------------+----------------------------------+
505   | :attr:`%x`        | int                 | Equivalent to                    |
506   |                   |                     | ``printf("%x")``. [1]_           |
507   +-------------------+---------------------+----------------------------------+
508   | :attr:`%s`        | const char\*        | A null-terminated C character    |
509   |                   |                     | array.                           |
510   +-------------------+---------------------+----------------------------------+
511   | :attr:`%p`        | const void\*        | The hex representation of a C    |
512   |                   |                     | pointer. Mostly equivalent to    |
513   |                   |                     | ``printf("%p")`` except that     |
514   |                   |                     | it is guaranteed to start with   |
515   |                   |                     | the literal ``0x`` regardless    |
516   |                   |                     | of what the platform's           |
517   |                   |                     | ``printf`` yields.               |
518   +-------------------+---------------------+----------------------------------+
519   | :attr:`%A`        | PyObject\*          | The result of calling            |
520   |                   |                     | :func:`ascii`.                   |
521   +-------------------+---------------------+----------------------------------+
522   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
523   +-------------------+---------------------+----------------------------------+
524   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
525   |                   | const char\*        | ``NULL``) and a null-terminated  |
526   |                   |                     | C character array as a second    |
527   |                   |                     | parameter (which will be used,   |
528   |                   |                     | if the first parameter is        |
529   |                   |                     | ``NULL``).                       |
530   +-------------------+---------------------+----------------------------------+
531   | :attr:`%S`        | PyObject\*          | The result of calling            |
532   |                   |                     | :c:func:`PyObject_Str`.          |
533   +-------------------+---------------------+----------------------------------+
534   | :attr:`%R`        | PyObject\*          | The result of calling            |
535   |                   |                     | :c:func:`PyObject_Repr`.         |
536   +-------------------+---------------------+----------------------------------+
537
538   An unrecognized format character causes all the rest of the format string to be
539   copied as-is to the result string, and any extra arguments discarded.
540
541   .. note::
      The width formatter unit is a number of characters rather than bytes.
543      The precision formatter unit is number of bytes for ``"%s"`` and
544      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
545      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
546      (if the ``PyObject*`` argument is not ``NULL``).
547
548   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
549      zu, i, x): the 0-conversion flag has effect even when a precision is given.
550
551   .. versionchanged:: 3.2
552      Support for ``"%lld"`` and ``"%llu"`` added.
553
554   .. versionchanged:: 3.3
555      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.
556
557   .. versionchanged:: 3.4
558      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
559      ``"%V"``, ``"%S"``, ``"%R"`` added.
560
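   A brief sketch combining ``%s``, ``%zd`` and ``%R`` from the table above.
   ``describe_item`` is a hypothetical helper name:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: format "name[index] = repr(value)" as a new
         str, or return NULL with an exception set. */
      static PyObject *
      describe_item(const char *name, Py_ssize_t index, PyObject *value)
      {
          return PyUnicode_FromFormat("%s[%zd] = %R", name, index, value);
      }

   The types of the variable arguments must match the format characters:
   here a ``const char*``, a ``Py_ssize_t`` and a ``PyObject*``.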
561
562.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
563
564   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
565   arguments.
566
567
568.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
569                               const char *encoding, const char *errors)
570
571   Decode an encoded object *obj* to a Unicode object.
572
573   :class:`bytes`, :class:`bytearray` and other
574   :term:`bytes-like objects <bytes-like object>`
575   are decoded according to the given *encoding* and using the error handling
576   defined by *errors*. Both can be ``NULL`` to have the interface use the default
577   values (see :ref:`builtincodecs` for details).
578
579   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
580   set.
581
   The API returns ``NULL`` if there was an error.  The caller is responsible for
   decref'ing the returned object.
584
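   A minimal sketch that decodes a bytes-like object as Latin-1 with strict
   error handling.  ``decode_latin1`` is a hypothetical helper name:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: decode a bytes-like object as Latin-1.
         Returns a new str, or NULL with an exception set (a TypeError if
         the argument is itself a str object). */
      static PyObject *
      decode_latin1(PyObject *bytes_like)
      {
          return PyUnicode_FromEncodedObject(bytes_like, "latin-1", "strict");
      }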
585
586.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)
587
588   Return the length of the Unicode object, in code points.
589
590   .. versionadded:: 3.3
591
592
593.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
594                                                    Py_ssize_t to_start, \
595                                                    PyObject *from, \
596                                                    Py_ssize_t from_start, \
597                                                    Py_ssize_t how_many)
598
599   Copy characters from one Unicode object into another.  This function performs
600   character conversion when necessary and falls back to :c:func:`memcpy` if
601   possible.  Returns ``-1`` and sets an exception on error, otherwise returns
602   the number of copied characters.
603
604   .. versionadded:: 3.3
605
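   As an illustration only, the sketch below joins two strings by allocating
   the result with :c:func:`PyUnicode_New` and copying both inputs into it.
   ``concat_two`` is a hypothetical helper; it assumes both arguments are str
   objects and ignores overflow of the combined length:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return a new str holding a followed by b, or
         NULL with an exception set. */
      static PyObject *
      concat_two(PyObject *a, PyObject *b)
      {
          Py_ssize_t len_a = PyUnicode_GET_LENGTH(a);
          Py_ssize_t len_b = PyUnicode_GET_LENGTH(b);
          Py_UCS4 max_a = PyUnicode_MAX_CHAR_VALUE(a);
          Py_UCS4 max_b = PyUnicode_MAX_CHAR_VALUE(b);

          /* The result must be able to hold the widest input. */
          PyObject *result = PyUnicode_New(len_a + len_b,
                                           max_a > max_b ? max_a : max_b);
          if (result == NULL) {
              return NULL;
          }
          if (PyUnicode_CopyCharacters(result, 0, a, 0, len_a) < 0
              || PyUnicode_CopyCharacters(result, len_a, b, 0, len_b) < 0) {
              Py_DECREF(result);
              return NULL;
          }
          return result;
      }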
606
607.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
608                        Py_ssize_t length, Py_UCS4 fill_char)
609
610   Fill a string with a character: write *fill_char* into
611   ``unicode[start:start+length]``.
612
613   Fail if *fill_char* is bigger than the string maximum character, or if the
614   string has more than 1 reference.
615
   Return the number of characters written, or return ``-1`` and raise an
   exception on error.
618
619   .. versionadded:: 3.3
620
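   A small sketch that builds a string consisting of one repeated character.
   ``make_rule`` is a hypothetical helper; it assumes *width* is not negative
   and returns early for an empty result, since there is nothing to fill:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return a new str of width copies of ch, or
         NULL with an exception set. */
      static PyObject *
      make_rule(Py_ssize_t width, Py_UCS4 ch)
      {
          PyObject *s = PyUnicode_New(width, ch);
          if (s == NULL || width == 0) {
              return s;
          }
          if (PyUnicode_Fill(s, 0, width, ch) < 0) {
              Py_DECREF(s);
              return NULL;
          }
          return s;
      }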
621
622.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
623                                        Py_UCS4 character)
624
625   Write a character to a string.  The string must have been created through
626   :c:func:`PyUnicode_New`.  Since Unicode strings are supposed to be immutable,
627   the string must not be shared, or have been hashed yet.
628
   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).
632
633   .. versionadded:: 3.3
634
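   The following sketch builds a string from an array of code points, first
   finding the largest code point so that :c:func:`PyUnicode_New` can pick a
   suitable representation.  ``str_from_code_points`` is a hypothetical helper
   name:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return a new str holding count code points
         from points, or NULL with an exception set. */
      static PyObject *
      str_from_code_points(const Py_UCS4 *points, Py_ssize_t count)
      {
          Py_UCS4 maxchar = 0;
          for (Py_ssize_t i = 0; i < count; i++) {
              if (points[i] > maxchar) {
                  maxchar = points[i];
              }
          }
          PyObject *s = PyUnicode_New(count, maxchar);
          if (s == NULL) {
              return NULL;
          }
          for (Py_ssize_t i = 0; i < count; i++) {
              if (PyUnicode_WriteChar(s, i, points[i]) < 0) {
                  Py_DECREF(s);
                  return NULL;
              }
          }
          return s;
      }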
635
636.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)
637
638   Read a character from a string.  This function checks that *unicode* is a
639   Unicode object and the index is not out of bounds, in contrast to the macro
640   version :c:func:`PyUnicode_READ_CHAR`.
641
642   .. versionadded:: 3.3
643
644
645.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
646                                              Py_ssize_t end)
647
648   Return a substring of *str*, from character index *start* (included) to
649   character index *end* (excluded).  Negative indices are not supported.
650
651   .. versionadded:: 3.3
652
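   Because negative indices are not supported, a caller that wants
   "the last *n* characters" has to compute the start index itself, as in this
   sketch.  ``last_n_characters`` is a hypothetical helper that assumes *n* is
   not negative:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return the last n characters of s (or all of s
         if it is shorter), or NULL with an exception set. */
      static PyObject *
      last_n_characters(PyObject *s, Py_ssize_t n)
      {
          Py_ssize_t length = PyUnicode_GetLength(s);
          if (length < 0) {
              return NULL;
          }
          Py_ssize_t start = (n < length) ? length - n : 0;
          return PyUnicode_Substring(s, start, length);
      }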
653
654.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
655                                          Py_ssize_t buflen, int copy_null)
656
657   Copy the string *u* into a UCS4 buffer, including a null character, if
658   *copy_null* is set.  Returns ``NULL`` and sets an exception on error (in
659   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
660   *u*).  *buffer* is returned on success.
661
662   .. versionadded:: 3.3
663
664
665.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)
666
667   Copy the string *u* into a new UCS4 buffer that is allocated using
668   :c:func:`PyMem_Malloc`.  If this fails, ``NULL`` is returned with a
669   :exc:`MemoryError` set.  The returned buffer always has an extra
670   null code point appended.
671
672   .. versionadded:: 3.3
673
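   A sketch of the allocate, use and free pattern for the returned buffer.
   ``highest_code_point`` is a hypothetical helper; it uses the string length
   rather than the null terminator, so embedded null code points do not cut
   the scan short:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: return the largest code point in s (0 for an
         empty string), or -1 with an exception set on error. */
      static long
      highest_code_point(PyObject *s)
      {
          Py_ssize_t length = PyUnicode_GetLength(s);
          if (length < 0) {
              return -1;
          }
          Py_UCS4 *buffer = PyUnicode_AsUCS4Copy(s);
          if (buffer == NULL) {
              return -1;
          }
          long highest = 0;
          for (Py_ssize_t i = 0; i < length; i++) {
              if ((long)buffer[i] > highest) {
                  highest = (long)buffer[i];
              }
          }
          PyMem_Free(buffer);   /* the copy was allocated with PyMem_Malloc */
          return highest;
      }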
674
675Deprecated Py_UNICODE APIs
676""""""""""""""""""""""""""
677
678.. deprecated-removed:: 3.3 3.12
679
These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them until they are removed in Python
3.12, but need to be aware that their use can now cause performance and memory hits.
683
684
685.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
686
687   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
688   may be ``NULL`` which causes the contents to be undefined. It is the user's
689   responsibility to fill in the needed data.  The buffer is copied into the new
690   object.
691
692   If the buffer is not ``NULL``, the return value might be a shared object.
693   Therefore, modification of the resulting Unicode object is only allowed when
694   *u* is ``NULL``.
695
696   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
697   string content has been filled before using any of the access macros such as
698   :c:func:`PyUnicode_KIND`.
699
700   .. deprecated-removed:: 3.3 3.12
701      Part of the old-style Unicode API, please migrate to using
702      :c:func:`PyUnicode_FromKindAndData`, :c:func:`PyUnicode_FromWideChar`, or
703      :c:func:`PyUnicode_New`.
704
705
706.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
707
708   Return a read-only pointer to the Unicode object's internal
709   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
710   :c:type:`Py_UNICODE*` representation of the object if it is not yet
711   available. The buffer is always terminated with an extra null code point.
712   Note that the resulting :c:type:`Py_UNICODE` string may also contain
713   embedded null code points, which would cause the string to be truncated when
714   used in most C functions.
715
716   .. deprecated-removed:: 3.3 3.12
717      Part of the old-style Unicode API, please migrate to using
718      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
719      :c:func:`PyUnicode_ReadChar` or similar new APIs.
720
721
722.. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size)
723
724   Create a Unicode object by replacing all decimal digits in
725   :c:type:`Py_UNICODE` buffer of the given *size* by ASCII digits 0--9
726   according to their decimal value.  Return ``NULL`` if an exception occurs.
727
728   .. deprecated-removed:: 3.3 3.11
729      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
730      :c:func:`Py_UNICODE_TODECIMAL`.
731
732
733.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)
734
   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
736   array length (excluding the extra null terminator) in *size*.
737   Note that the resulting :c:type:`Py_UNICODE*` string
738   may contain embedded null code points, which would cause the string to be
739   truncated when used in most C functions.
740
741   .. versionadded:: 3.3
742
743   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API; please migrate to using
745      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
746      :c:func:`PyUnicode_ReadChar` or similar new APIs.
747
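   As a rough migration sketch for code that used to walk the
   :c:type:`Py_UNICODE` buffer, the following reads code points through
   :c:func:`PyUnicode_GetLength` and :c:func:`PyUnicode_ReadChar` instead.
   The helper name ``count_code_point`` is hypothetical and only serves as an
   illustration:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: count how often the code point `ch` occurs in
         `unicode`.  Returns -1 with an exception set on error. */
      static Py_ssize_t
      count_code_point(PyObject *unicode, Py_UCS4 ch)
      {
          Py_ssize_t length = PyUnicode_GetLength(unicode);
          if (length < 0) {
              return -1;  /* `unicode` is not a str object */
          }
          Py_ssize_t count = 0;
          for (Py_ssize_t i = 0; i < length; i++) {
              Py_UCS4 c = PyUnicode_ReadChar(unicode, i);
              if (c == (Py_UCS4)-1 && PyErr_Occurred()) {
                  return -1;
              }
              if (c == ch) {
                  count++;
              }
          }
          return count;
      }

   Indexing is by code point, so no surrogate-pair handling is needed.
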
748
749.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
750
751   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
752   code units (this includes surrogate pairs as 2 units).
753
754   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API; please migrate to using
756      :c:func:`PyUnicode_GET_LENGTH`.
757
758
759.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)
760
761   Copy an instance of a Unicode subtype to a new true Unicode object if
762   necessary. If *obj* is already a true Unicode object (not a subtype),
763   return the reference with incremented refcount.
764
765   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.
766
767
768Locale Encoding
769"""""""""""""""
770
771The current locale encoding can be used to decode text from the operating
772system.
773
774.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
775                                                        Py_ssize_t len, \
776                                                        const char *errors)
777
778   Decode a string from UTF-8 on Android and VxWorks, or from the current
779   locale encoding on other platforms. The supported
780   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses the ``"strict"`` error handler if
782   *errors* is ``NULL``.  *str* must end with a null character but
783   cannot contain embedded null characters.
784
785   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
786   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
787   Python startup).
788
789   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.
790
791   .. seealso::
792
793      The :c:func:`Py_DecodeLocale` function.
794
795   .. versionadded:: 3.3
796
797   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_DecodeLocale` was used for the ``surrogateescape`` error
      handler, and the current locale encoding was used for ``strict``.
802
803
804.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
805
806   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
807   length using :c:func:`strlen`.
808
809   .. versionadded:: 3.3
810
811
812.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)
813
814   Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current
815   locale encoding on other platforms. The
816   supported error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The encoder uses the ``"strict"`` error handler if
818   *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot
819   contain embedded null characters.
820
821   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
822   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
823   Python startup).
824
825   This function ignores the :ref:`Python UTF-8 Mode <utf8-mode>`.
826
827   .. seealso::
828
829      The :c:func:`Py_EncodeLocale` function.
830
831   .. versionadded:: 3.3
832
833   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale` was used for the ``surrogateescape`` error
      handler, and the current locale encoding was used for ``strict``.
839
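   The two locale functions can be combined into a round trip.  The following
   is a minimal sketch, assuming a hypothetical helper named
   ``locale_round_trip``; it is not part of the C API:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: encode `unicode` to the current locale encoding
         and decode the bytes back.  Returns a new reference, or NULL with an
         exception set. */
      static PyObject *
      locale_round_trip(PyObject *unicode)
      {
          PyObject *bytes = PyUnicode_EncodeLocale(unicode, "surrogateescape");
          if (bytes == NULL) {
              return NULL;
          }
          /* The bytes buffer is null-terminated, as the decoder requires. */
          PyObject *result = PyUnicode_DecodeLocaleAndSize(
              PyBytes_AS_STRING(bytes), PyBytes_GET_SIZE(bytes),
              "surrogateescape");
          Py_DECREF(bytes);
          return result;
      }
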
840
841File System Encoding
842""""""""""""""""""""
843
844To encode and decode file names and other environment strings,
845:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
846:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
847(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
848argument parsing, the ``"O&"`` converter should be used, passing
849:c:func:`PyUnicode_FSConverter` as the conversion function:
850
851.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)
852
853   ParseTuple converter: encode :class:`str` objects -- obtained directly or
854   through the :class:`os.PathLike` interface -- to :class:`bytes` using
855   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be the address of a :c:type:`PyBytesObject*` variable; the
   reference stored in it must be released when it is no longer used.
858
859   .. versionadded:: 3.1
860
861   .. versionchanged:: 3.6
862      Accepts a :term:`path-like object`.
863
864To decode file names to :class:`str` during argument parsing, the ``"O&"``
865converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
866conversion function:
867
868.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
869
870   ParseTuple converter: decode :class:`bytes` objects -- obtained either
871   directly or indirectly through the :class:`os.PathLike` interface -- to
872   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
   objects are output as-is. *result* must be the address of a
   :c:type:`PyUnicodeObject*` variable; the reference stored in it must be
   released when it is no longer used.
875
876   .. versionadded:: 3.2
877
878   .. versionchanged:: 3.6
879      Accepts a :term:`path-like object`.
880
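   As an illustration of the ``"O&"`` usage described above, here is a small
   sketch built around :c:func:`PyUnicode_FSConverter`.  The function name
   ``encoded_path_length`` is hypothetical, and the result variable is
   declared as a plain ``PyObject *`` for convenience;
   :c:func:`PyUnicode_FSDecoder` is used in the same way:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical module function: return the length in bytes of a path
         after encoding it with the filesystem encoding. */
      static PyObject *
      encoded_path_length(PyObject *self, PyObject *args)
      {
          PyObject *path_bytes = NULL;
          if (!PyArg_ParseTuple(args, "O&",
                                PyUnicode_FSConverter, &path_bytes)) {
              return NULL;
          }
          Py_ssize_t size = PyBytes_GET_SIZE(path_bytes);
          Py_DECREF(path_bytes);  /* release the converter's result */
          return PyLong_FromSsize_t(size);
      }
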
881
882.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
883
884   Decode a string from the :term:`filesystem encoding and error handler`.
885
886   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
887   locale encoding.
888
889   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
890   locale encoding and cannot be modified later. If you need to decode a string
891   from the current locale encoding, use
892   :c:func:`PyUnicode_DecodeLocaleAndSize`.
893
894   .. seealso::
895
896      The :c:func:`Py_DecodeLocale` function.
897
898   .. versionchanged:: 3.6
899      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
900
901
902.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
903
904   Decode a null-terminated string from the :term:`filesystem encoding and
905   error handler`.
906
907   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
908   locale encoding.
909
910   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
911
912   .. versionchanged:: 3.6
913      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
914
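   A short sketch of decoding a C file name and encoding it back with
   :c:func:`PyUnicode_EncodeFSDefault` follows.  The helper name
   ``fs_round_trips`` is an assumption made for this example:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>
      #include <string.h>

      /* Hypothetical helper: return 1 if `name` survives a decode/encode
         round trip through the filesystem encoding, 0 if it does not, and
         -1 with an exception set on error. */
      static int
      fs_round_trips(const char *name)
      {
          PyObject *text = PyUnicode_DecodeFSDefault(name);
          if (text == NULL) {
              return -1;
          }
          PyObject *bytes = PyUnicode_EncodeFSDefault(text);
          Py_DECREF(text);
          if (bytes == NULL) {
              return -1;
          }
          int same = strcmp(PyBytes_AS_STRING(bytes), name) == 0;
          Py_DECREF(bytes);
          return same;
      }
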
915
916.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
917
918   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
919   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
920   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
921   null bytes.
922
923   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
924   locale encoding.
925
926   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
927   locale encoding and cannot be modified later. If you need to encode a string
928   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.
929
930   .. seealso::
931
932      The :c:func:`Py_EncodeLocale` function.
933
934   .. versionadded:: 3.2
935
936   .. versionchanged:: 3.6
937      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
938
939wchar_t Support
940"""""""""""""""
941
942:c:type:`wchar_t` support for platforms which support it:
943
944.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
945
   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given
   *size*.  Passing ``-1`` as the *size* indicates that the function must
   itself compute the length, using ``wcslen()``.
   Return ``NULL`` on failure.
950
951
952.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)
953
954   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*.  At most
955   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
956   null termination character).  Return the number of :c:type:`wchar_t` characters
957   copied or ``-1`` in case of an error.  Note that the resulting :c:type:`wchar_t*`
958   string may or may not be null-terminated.  It is the responsibility of the caller
959   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
960   required by the application. Also, note that the :c:type:`wchar_t*` string
961   might contain null characters, which would cause the string to be truncated
962   when used with most C functions.
963
964
965.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
966
967   Convert the Unicode object to a wide character string. The output string
968   always ends with a null character. If *size* is not ``NULL``, write the number
969   of wide characters (excluding the trailing null termination character) into
970   *\*size*. Note that the resulting :c:type:`wchar_t` string might contain
971   null characters, which would cause the string to be truncated when used with
972   most C functions. If *size* is ``NULL`` and the :c:type:`wchar_t*` string
973   contains null characters a :exc:`ValueError` is raised.
974
   Returns a buffer allocated by :c:func:`PyMem_New` (use
976   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.
979
980   .. versionadded:: 3.2
981
982   .. versionchanged:: 3.7
983      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:type:`wchar_t*`
984      string contains null characters.
985
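   The ownership rules can be sketched as follows, assuming a hypothetical
   helper ``wide_round_trip`` that converts a string to :c:type:`wchar_t` and
   back with :c:func:`PyUnicode_FromWideChar`:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>
      #include <wchar.h>

      /* Hypothetical helper: convert `unicode` to a wchar_t buffer and build
         a new string from it.  Returns a new reference, or NULL with an
         exception set. */
      static PyObject *
      wide_round_trip(PyObject *unicode)
      {
          Py_ssize_t size;
          wchar_t *wide = PyUnicode_AsWideCharString(unicode, &size);
          if (wide == NULL) {
              return NULL;
          }
          /* Passing the explicit size keeps embedded null characters. */
          PyObject *result = PyUnicode_FromWideChar(wide, size);
          PyMem_Free(wide);  /* the caller owns the buffer */
          return result;
      }
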
986
987.. _builtincodecs:
988
989Built-in Codecs
990^^^^^^^^^^^^^^^
991
992Python provides a set of built-in codecs which are written in C for speed. All of
993these codecs are directly usable via the following functions.
994
Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the corresponding parameters of the built-in
:func:`str` constructor.
998
Setting *encoding* to ``NULL`` causes the default encoding, UTF-8, to be
used.  File system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string, on others, it will change at run-time
(such as when the application invokes ``setlocale()``).
1006
Error handling is set by *errors*, which may also be set to ``NULL``, meaning
to use the default handling defined for the codec.  Default error handling for
all built-in codecs is "strict" (:exc:`ValueError` is raised).
1010
The codecs all use a similar interface.  For simplicity, only deviations from
the following generic ones are documented.
1013
1014
1015Generic Codecs
1016""""""""""""""
1017
1018These are the generic codec APIs:
1019
1020
1021.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
1022                              const char *encoding, const char *errors)
1023
1024   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
1025   *encoding* and *errors* have the same meaning as the parameters of the same name
1026   in the :func:`str` built-in function.  The codec to be used is looked up
1027   using the Python codec registry.  Return ``NULL`` if an exception was raised by
1028   the codec.
1029
1030
1031.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
1032                              const char *encoding, const char *errors)
1033
   Encode a Unicode object and return the result as a Python bytes object.
1035   *encoding* and *errors* have the same meaning as the parameters of the same
1036   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
1037   using the Python codec registry. Return ``NULL`` if an exception was raised by
1038   the codec.
1039
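   The generic decode and encode calls can be chained to transcode data.
   The sketch below assumes a hypothetical helper ``latin1_to_utf8``; the
   encoding names are ordinary codec registry names:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: re-encode `size` bytes of Latin-1 text as UTF-8.
         Returns a new bytes object, or NULL with an exception set. */
      static PyObject *
      latin1_to_utf8(const char *data, Py_ssize_t size)
      {
          PyObject *text = PyUnicode_Decode(data, size, "latin-1", "strict");
          if (text == NULL) {
              return NULL;
          }
          PyObject *encoded = PyUnicode_AsEncodedString(text, "utf-8",
                                                        "strict");
          Py_DECREF(text);
          return encoded;
      }
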
1040
1041.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \
1042                              const char *encoding, const char *errors)
1043
1044   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
1045   bytes object.  *encoding* and *errors* have the same meaning as the
1046   parameters of the same name in the Unicode :meth:`~str.encode` method.  The codec
1047   to be used is looked up using the Python codec registry.  Return ``NULL`` if an
1048   exception was raised by the codec.
1049
1050   .. deprecated-removed:: 3.3 3.11
1051      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1052      :c:func:`PyUnicode_AsEncodedString`.
1053
1054
1055UTF-8 Codecs
1056""""""""""""
1057
1058These are the UTF-8 codec APIs:
1059
1060
1061.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
1062
1063   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
1064   *s*. Return ``NULL`` if an exception was raised by the codec.
1065
1066
1067.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
1068                              const char *errors, Py_ssize_t *consumed)
1069
1070   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
1071   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
1072   treated as an error. Those bytes will not be decoded and the number of bytes
1073   that have been decoded will be stored in *consumed*.
1074
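   The *consumed* output is mainly useful when data arrives in chunks.  A
   minimal sketch, assuming a hypothetical helper ``decode_utf8_chunk``:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: decode as much of a UTF-8 chunk as possible.
         On success, *pending receives the number of trailing bytes that
         belong to an incomplete sequence; the caller should resubmit them
         together with the next chunk. */
      static PyObject *
      decode_utf8_chunk(const char *chunk, Py_ssize_t size,
                        Py_ssize_t *pending)
      {
          Py_ssize_t consumed = 0;
          PyObject *text = PyUnicode_DecodeUTF8Stateful(chunk, size,
                                                        "strict", &consumed);
          if (text == NULL) {
              return NULL;
          }
          *pending = size - consumed;
          return text;
      }

   For example, a chunk that ends in the first byte of a two-byte sequence
   decodes the complete characters and reports one pending byte.
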
1075
1076.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
1077
   Encode a Unicode object using UTF-8 and return the result as a Python bytes
1079   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1080   raised by the codec.
1081
1082
1083.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
1084
1085   Return a pointer to the UTF-8 encoding of the Unicode object, and
1086   store the size of the encoded representation (in bytes) in *size*.  The
1087   *size* argument can be ``NULL``; in this case no size will be stored.  The
1088   returned buffer always has an extra null byte appended (not included in
1089   *size*), regardless of whether there are any other null code points.
1090
1091   In the case of an error, ``NULL`` is returned with an exception set and no
1092   *size* is stored.
1093
1094   This caches the UTF-8 representation of the string in the Unicode object, and
1095   subsequent calls will return a pointer to the same buffer.  The caller is not
1096   responsible for deallocating the buffer.
1097
1098   .. versionadded:: 3.3
1099
1100   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1102
1103   .. versionchanged:: 3.10
1104      This function is a part of the :ref:`limited API <stable>`.
1105
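   Because the buffer may contain null bytes besides the terminator, the
   returned *size* is more reliable than ``strlen()``.  The following sketch
   uses that difference; the helper name ``utf8_has_embedded_null`` is
   hypothetical:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>
      #include <string.h>

      /* Hypothetical helper: return 1 if `unicode` contains U+0000, 0 if it
         does not, and -1 with an exception set on error. */
      static int
      utf8_has_embedded_null(PyObject *unicode)
      {
          Py_ssize_t size;
          const char *utf8 = PyUnicode_AsUTF8AndSize(unicode, &size);
          if (utf8 == NULL) {
              return -1;
          }
          /* The buffer is cached in `unicode`; the caller must not free it. */
          return strlen(utf8) != (size_t)size;
      }
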
1106
1107.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)
1108
1109   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
1110
1111   .. versionadded:: 3.3
1112
1113   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1115
1116
1117.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1118
1119   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
1120   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1121   the codec.
1122
1123   .. deprecated-removed:: 3.3 3.11
1124      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1125      :c:func:`PyUnicode_AsUTF8String`, :c:func:`PyUnicode_AsUTF8AndSize` or
1126      :c:func:`PyUnicode_AsEncodedString`.
1127
1128
1129UTF-32 Codecs
1130"""""""""""""
1131
1132These are the UTF-32 codec APIs:
1133
1134
1135.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
1136                              const char *errors, int *byteorder)
1137
1138   Decode *size* bytes from a UTF-32 encoded buffer string and return the
1139   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1140   handling. It defaults to "strict".
1141
1142   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1143   order::
1144
1145      *byteorder == -1: little endian
1146      *byteorder == 0:  native order
1147      *byteorder == 1:  big endian
1148
1149   If ``*byteorder`` is zero, and the first four bytes of the input data are a
1150   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1151   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1152   ``1``, any byte order mark is copied to the output.
1153
1154   After completion, *\*byteorder* is set to the current byte order at the end
1155   of input data.
1156
1157   If *byteorder* is ``NULL``, the codec starts in native order mode.
1158
1159   Return ``NULL`` if an exception was raised by the codec.
1160
1161
1162.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
1163                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1164
1165   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
1166   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
1167   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
1168   by four) as an error. Those bytes will not be decoded and the number of bytes
1169   that have been decoded will be stored in *consumed*.
1170
1171
1172.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
1173
1174   Return a Python byte string using the UTF-32 encoding in native byte
1175   order. The string always starts with a BOM mark.  Error handling is "strict".
1176   Return ``NULL`` if an exception was raised by the codec.
1177
1178
1179.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \
1180                              const char *errors, int byteorder)
1181
1182   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
1183   data in *s*.  Output is written according to the following byte order::
1184
1185      byteorder == -1: little endian
1186      byteorder == 0:  native byte order (writes a BOM mark)
1187      byteorder == 1:  big endian
1188
   If *byteorder* is ``0``, the output string will always start with the
   Unicode BOM (U+FEFF). In the other two modes, no BOM is prepended.
1191
1192   If ``Py_UNICODE_WIDE`` is not defined, surrogate pairs will be output
1193   as a single code point.
1194
1195   Return ``NULL`` if an exception was raised by the codec.
1196
1197   .. deprecated-removed:: 3.3 3.11
1198      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1199      :c:func:`PyUnicode_AsUTF32String` or :c:func:`PyUnicode_AsEncodedString`.
1200
1201
1202UTF-16 Codecs
1203"""""""""""""
1204
1205These are the UTF-16 codec APIs:
1206
1207
1208.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
1209                              const char *errors, int *byteorder)
1210
1211   Decode *size* bytes from a UTF-16 encoded buffer string and return the
1212   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1213   handling. It defaults to "strict".
1214
1215   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1216   order::
1217
1218      *byteorder == -1: little endian
1219      *byteorder == 0:  native order
1220      *byteorder == 1:  big endian
1221
1222   If ``*byteorder`` is zero, and the first two bytes of the input data are a
1223   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1224   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1225   ``1``, any byte order mark is copied to the output (where it will result in
1226   either a ``\ufeff`` or a ``\ufffe`` character).
1227
1228   After completion, *\*byteorder* is set to the current byte order at the end
1229   of input data.
1230
1231   If *byteorder* is ``NULL``, the codec starts in native order mode.
1232
1233   Return ``NULL`` if an exception was raised by the codec.
1234
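   The *byteorder* in/out parameter can be used to detect a BOM.  A minimal
   sketch, assuming a hypothetical helper ``decode_utf16_auto``:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: decode UTF-16 data, honouring a leading BOM if
         one is present.  On success, *detected is -1 (little endian),
         1 (big endian), or 0 if no BOM was found and native order was
         used. */
      static PyObject *
      decode_utf16_auto(const char *data, Py_ssize_t size, int *detected)
      {
          int byteorder = 0;  /* start in native order; a BOM overrides it */
          PyObject *text = PyUnicode_DecodeUTF16(data, size, "strict",
                                                 &byteorder);
          if (text == NULL) {
              return NULL;
          }
          *detected = byteorder;
          return text;
      }

   :c:func:`PyUnicode_DecodeUTF32` accepts the same *byteorder* convention.
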
1235
1236.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
1237                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1238
1239   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
1240   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
1241   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
1242   split surrogate pair) as an error. Those bytes will not be decoded and the
1243   number of bytes that have been decoded will be stored in *consumed*.
1244
1245
1246.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
1247
1248   Return a Python byte string using the UTF-16 encoding in native byte
1249   order. The string always starts with a BOM mark.  Error handling is "strict".
1250   Return ``NULL`` if an exception was raised by the codec.
1251
1252
1253.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \
1254                              const char *errors, int byteorder)
1255
1256   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
1257   data in *s*.  Output is written according to the following byte order::
1258
1259      byteorder == -1: little endian
1260      byteorder == 0:  native byte order (writes a BOM mark)
1261      byteorder == 1:  big endian
1262
   If *byteorder* is ``0``, the output string will always start with the
   Unicode BOM (U+FEFF). In the other two modes, no BOM is prepended.
1265
   If ``Py_UNICODE_WIDE`` is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.
1269
1270   Return ``NULL`` if an exception was raised by the codec.
1271
1272   .. deprecated-removed:: 3.3 3.11
1273      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1274      :c:func:`PyUnicode_AsUTF16String` or :c:func:`PyUnicode_AsEncodedString`.
1275
1276
1277UTF-7 Codecs
1278""""""""""""
1279
1280These are the UTF-7 codec APIs:
1281
1282
1283.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
1284
1285   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
1286   *s*.  Return ``NULL`` if an exception was raised by the codec.
1287
1288
1289.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
1290                              const char *errors, Py_ssize_t *consumed)
1291
1292   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
1293   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
1294   be treated as an error.  Those bytes will not be decoded and the number of
1295   bytes that have been decoded will be stored in *consumed*.
1296
1297
1298.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \
1299                              int base64SetO, int base64WhiteSpace, const char *errors)
1300
   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-7 and
1302   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1303   the codec.
1304
1305   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
1306   special meaning) will be encoded in base-64.  If *base64WhiteSpace* is
1307   nonzero, whitespace will be encoded in base-64.  Both are set to zero for the
1308   Python "utf-7" codec.
1309
1310   .. deprecated-removed:: 3.3 3.11
1311      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1312      :c:func:`PyUnicode_AsEncodedString`.
1313
1314
1315Unicode-Escape Codecs
1316"""""""""""""""""""""
1317
1318These are the "Unicode Escape" codec APIs:
1319
1320
1321.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
1322                              Py_ssize_t size, const char *errors)
1323
1324   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
1325   string *s*.  Return ``NULL`` if an exception was raised by the codec.
1326
1327
1328.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
1329
1330   Encode a Unicode object using Unicode-Escape and return the result as a
1331   bytes object.  Error handling is "strict".  Return ``NULL`` if an exception was
1332   raised by the codec.
1333
1334
1335.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
1336
1337   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
1338   return a bytes object.  Return ``NULL`` if an exception was raised by the codec.
1339
1340   .. deprecated-removed:: 3.3 3.11
1341      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1342      :c:func:`PyUnicode_AsUnicodeEscapeString`.
1343
1344
1345Raw-Unicode-Escape Codecs
1346"""""""""""""""""""""""""
1347
1348These are the "Raw Unicode Escape" codec APIs:
1349
1350
1351.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
1352                              Py_ssize_t size, const char *errors)
1353
1354   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
1355   encoded string *s*.  Return ``NULL`` if an exception was raised by the codec.
1356
1357
1358.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
1359
1360   Encode a Unicode object using Raw-Unicode-Escape and return the result as
1361   a bytes object.  Error handling is "strict".  Return ``NULL`` if an exception
1362   was raised by the codec.
1363
1364
1365.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \
1366                              Py_ssize_t size)
1367
1368   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
1369   and return a bytes object.  Return ``NULL`` if an exception was raised by the codec.
1370
1371   .. deprecated-removed:: 3.3 3.11
1372      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1373      :c:func:`PyUnicode_AsRawUnicodeEscapeString` or
1374      :c:func:`PyUnicode_AsEncodedString`.
1375
1376
1377Latin-1 Codecs
1378""""""""""""""
1379
1380These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
1381ordinals and only these are accepted by the codecs during encoding.
1382
1383
1384.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
1385
1386   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
1387   *s*.  Return ``NULL`` if an exception was raised by the codec.
1388
1389
1390.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
1391
   Encode a Unicode object using Latin-1 and return the result as a Python bytes
1393   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1394   raised by the codec.
1395
1396
1397.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1398
1399   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and
1400   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1401   the codec.
1402
1403   .. deprecated-removed:: 3.3 3.11
1404      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1405      :c:func:`PyUnicode_AsLatin1String` or
1406      :c:func:`PyUnicode_AsEncodedString`.
1407
1408
1409ASCII Codecs
1410""""""""""""
1411
1412These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
1413codes generate errors.
1414
1415
1416.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
1417
1418   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
1419   *s*.  Return ``NULL`` if an exception was raised by the codec.
1420
1421
1422.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
1423
   Encode a Unicode object using ASCII and return the result as a Python bytes
1425   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1426   raised by the codec.
1427
1428
1429.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1430
1431   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and
1432   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1433   the codec.
1434
1435   .. deprecated-removed:: 3.3 3.11
1436      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1437      :c:func:`PyUnicode_AsASCIIString` or
1438      :c:func:`PyUnicode_AsEncodedString`.
1439
1440
1441Character Map Codecs
1442""""""""""""""""""""
1443
1444This codec is special in that it can be used to implement many different codecs
1445(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
1447decode characters.  The mapping objects provided must support the
1448:meth:`__getitem__` mapping interface; dictionaries and sequences work well.
1449
1450These are the mapping codec APIs:
1451
1452.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
1453                              PyObject *mapping, const char *errors)
1454
   Create a Unicode object by decoding *size* bytes of the encoded string *data*
1456   using the given *mapping* object.  Return ``NULL`` if an exception was raised
1457   by the codec.
1458
1459   If *mapping* is ``NULL``, Latin-1 decoding will be applied.  Else
1460   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
1461   to Unicode strings, integers (which are then interpreted as Unicode
1462   ordinals) or ``None``.  Unmapped data bytes -- ones which cause a
1463   :exc:`LookupError`, as well as ones which get mapped to ``None``,
1464   ``0xFFFE`` or ``'\ufffe'``, are treated as undefined mappings and cause
1465   an error.
1466
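   A dictionary is often the simplest *mapping*.  The sketch below assumes a
   hypothetical helper ``decode_greek`` with a made-up two-entry table; any
   other byte is unmapped and fails in ``"strict"`` mode:

   .. code-block:: c

      #define PY_SSIZE_T_CLEAN
      #include <Python.h>

      /* Hypothetical helper: decode `data` with an illustrative mapping
         {0x01: 0x03B1, 0x02: 0x03B2} (Greek alpha and beta).  Returns a new
         reference, or NULL with an exception set. */
      static PyObject *
      decode_greek(const char *data, Py_ssize_t size)
      {
          PyObject *mapping = Py_BuildValue("{i:i,i:i}",
                                            0x01, 0x03B1, 0x02, 0x03B2);
          if (mapping == NULL) {
              return NULL;
          }
          PyObject *text = PyUnicode_DecodeCharmap(data, size, mapping,
                                                   "strict");
          Py_DECREF(mapping);
          return text;
      }
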
1467
1468.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
1469
1470   Encode a Unicode object using the given *mapping* object and return the
1471   result as a bytes object.  Error handling is "strict".  Return ``NULL`` if an
1472   exception was raised by the codec.
1473
1474   The *mapping* object must map Unicode ordinal integers to bytes objects,
1475   integers in the range from 0 to 255 or ``None``.  Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as ones mapped to
1477   ``None`` are treated as "undefined mapping" and cause an error.
1478

.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return the result as a bytes object.  Return ``NULL`` if
   an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsCharmapString` or
      :c:func:`PyUnicode_AsEncodedString`.


The following codec API is special in that it maps Unicode to Unicode.

.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object. Return ``NULL`` if an exception was raised by the
   codec.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be ``NULL``, which indicates
   that the default error handling should be used.

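   The three cases described above (remapping, deletion, and pass-through) can
   be sketched as follows.  The helper name ``translate_example`` and the
   sample table are illustrative assumptions.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: translate "a-b-a" with the table
         {ord('a'): ord('A'), ord('-'): None}.
         'a' is remapped, '-' is deleted, and 'b' is unmapped, so it is
         copied as-is.  Returns a new reference, or NULL on error. */
      PyObject *
      translate_example(void)
      {
          PyObject *table = Py_BuildValue("{i:i,i:O}",
                                          'a', 'A', '-', Py_None);
          PyObject *text = PyUnicode_FromString("a-b-a");
          PyObject *result = NULL;

          if (table != NULL && text != NULL) {
              /* errors == NULL selects the default error handling. */
              result = PyUnicode_Translate(text, table, NULL);
          }
          Py_XDECREF(table);
          Py_XDECREF(text);
          return result;
      }

   With this table the result should be the string ``"AbA"``.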

.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character *mapping* table to it and return the resulting Unicode object.
   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_Translate` or the :ref:`generic codec based API
      <codec-registry>`.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.

.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)

   Encode the Unicode object using the specified code page and return a Python
   bytes object.  Return ``NULL`` if an exception was raised by the codec. Use
   the :c:data:`CP_ACP` code page to get the MBCS encoder.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return
   a Python bytes object.  Return ``NULL`` if an exception was raised by the
   codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsMBCSString`, :c:func:`PyUnicode_EncodeCodePage` or
      :c:func:`PyUnicode_AsEncodedString`.


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return ``NULL`` or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings.  If *sep* is ``NULL``, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If *maxsplit* is negative,
   no limit is set.  Separators are not included in the resulting list.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.

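   As a minimal sketch, splitting and joining can be combined to normalize
   separators.  The helper name ``split_and_join`` and the sample strings are
   illustrative assumptions.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: split "a b  c" on whitespace and re-join the
         pieces with "-".  Returns a new reference, or NULL on error. */
      PyObject *
      split_and_join(void)
      {
          PyObject *text = PyUnicode_FromString("a b  c");
          PyObject *dash = PyUnicode_FromString("-");
          PyObject *parts = NULL;
          PyObject *result = NULL;

          if (text != NULL && dash != NULL) {
              /* sep == NULL splits at whitespace; maxsplit < 0 sets no limit. */
              parts = PyUnicode_Split(text, NULL, -1);
          }
          if (parts != NULL) {
              result = PyUnicode_Join(dash, parts);
          }
          Py_XDECREF(parts);
          Py_XDECREF(dash);
          Py_XDECREF(text);
          return result;
      }

   Here the result should be ``"a-b-c"``, since the run of two spaces counts
   as a single whitespace separator.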

.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
                        Py_ssize_t start, Py_ssize_t end, int direction)

   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
   ``0`` otherwise. Return ``-1`` if an error occurred.

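   A suffix test might be sketched as follows.  The helper name
   ``has_py_suffix`` is an illustrative assumption; passing ``0`` and
   ``PY_SSIZE_T_MAX`` as *start* and *end* is intended to cover the whole
   string.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: return 1 if name ends with ".py", 0 if it does
         not, and -1 (with an exception set) on error. */
      int
      has_py_suffix(PyObject *name)
      {
          PyObject *suffix = PyUnicode_FromString(".py");
          if (suffix == NULL) {
              return -1;
          }
          /* direction == 1 requests a suffix match. */
          Py_ssize_t match = PyUnicode_Tailmatch(name, suffix,
                                                 0, PY_SSIZE_T_MAX, 1);
          Py_DECREF(suffix);
          return (int)match;
      }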

.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

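   Because ``-1`` means "not found" and only ``-2`` signals an error, callers
   need to distinguish the two.  One possible way to do so is sketched below;
   the helper name ``find_first`` and its out-parameter convention are
   illustrative assumptions.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: search forward for needle in haystack.
         On success, store the index (or -1 if there is no match) in *index
         and return 0.  On error, return -1 with an exception set. */
      int
      find_first(PyObject *haystack, PyObject *needle, Py_ssize_t *index)
      {
          Py_ssize_t pos = PyUnicode_Find(haystack, needle,
                                          0, PY_SSIZE_T_MAX, 1);
          if (pos == -2) {
              return -1;      /* an exception has been set */
          }
          *index = pos;       /* a real index, or -1 for "no match" */
          return 0;
      }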

.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
                               Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of the character *ch* in ``str[start:end]`` using
   the given *direction* (*direction* == ``1`` means to do a forward search,
   *direction* == ``-1`` a backward search).  The return value is the index of the
   first match; a value of ``-1`` indicates that no match was found, and ``-2``
   indicates that an error occurred and an exception has been set.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      *start* and *end* are now adjusted to behave like ``str[start:end]``.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
                               Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
                              PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
   occurrences.

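   Counting and replacing can be combined as in the following minimal sketch.
   The helper name ``replace_all`` and the sample strings are illustrative
   assumptions.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: count the occurrences of "a" in "banana", then
         replace all of them with "o".  Stores the count in *count and
         returns a new reference, or NULL with an exception set. */
      PyObject *
      replace_all(Py_ssize_t *count)
      {
          PyObject *text = PyUnicode_FromString("banana");
          PyObject *old = PyUnicode_FromString("a");
          PyObject *repl = PyUnicode_FromString("o");
          PyObject *result = NULL;

          *count = -1;
          if (text != NULL && old != NULL && repl != NULL) {
              *count = PyUnicode_Count(text, old, 0, PY_SSIZE_T_MAX);
          }
          if (*count >= 0) {
              /* maxcount == -1 replaces every occurrence. */
              result = PyUnicode_Replace(text, old, repl, -1);
          }
          Py_XDECREF(repl);
          Py_XDECREF(old);
          Py_XDECREF(text);
          return result;
      }

   For this input the count should be ``3`` and the result ``"bonono"``.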

.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
   respectively.

   This function returns ``-1`` upon failure, so one should call
   :c:func:`PyErr_Occurred` to check for errors.

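   Since ``-1`` is both a valid comparison result and the failure value, the
   check described above might be wrapped as in this sketch.  The helper name
   ``compare_strings`` and its out-parameter convention are illustrative
   assumptions.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: compare left and right.
         On success, store -1, 0 or 1 in *order and return 0.
         On failure, return -1 with an exception set. */
      int
      compare_strings(PyObject *left, PyObject *right, int *order)
      {
          int cmp = PyUnicode_Compare(left, right);
          if (cmp == -1 && PyErr_Occurred()) {
              return -1;      /* a real error, not "less than" */
          }
          *order = cmp;
          return 0;
      }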

.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
   than, equal, and greater than, respectively. It is best to pass only
   ASCII-encoded strings, but the function interprets the input string as
   ISO-8859-1 if it contains non-ASCII characters.

   This function does not raise exceptions.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.

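   A minimal sketch of the ``format % args`` analogy follows.  The helper name
   ``format_example`` and the sample values are illustrative assumptions.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: the C equivalent of
         "%s has %d items" % ("cart", 3).
         Returns a new reference, or NULL with an exception set. */
      PyObject *
      format_example(void)
      {
          PyObject *format = PyUnicode_FromString("%s has %d items");
          PyObject *args = Py_BuildValue("(si)", "cart", 3);
          PyObject *result = NULL;

          if (format != NULL && args != NULL) {
              result = PyUnicode_Format(format, args);
          }
          Py_XDECREF(args);
          Py_XDECREF(format);
          return result;
      }

   The result here should be ``"cart has 3 items"``.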

.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return ``1`` (true)
   or ``0`` (false) accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned
   if there was an error.


.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. c:function:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :c:func:`PyUnicode_FromString` and
   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
   object that has been interned, or a new ("owned") reference to an earlier
   interned string object with the same value.

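   The effect of interning can be sketched as follows.  The helper name
   ``interned_strings_are_identical`` and the sample string are illustrative
   assumptions.  The sketch relies on the behavior described above: after
   interning, equal strings refer to the same object, and the caller's
   ownership of each reference is unchanged.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: return 1 if two equal strings become the same
         object once both are interned, 0 if not, and -1 on error. */
      int
      interned_strings_are_identical(void)
      {
          PyObject *a = PyUnicode_InternFromString("spam_key");
          PyObject *b = PyUnicode_FromString("spam_key");
          int same = -1;

          if (a != NULL && b != NULL) {
              /* b is replaced by the existing interned object; we still own
                 exactly one reference through b afterwards. */
              PyUnicode_InternInPlace(&b);
              same = (a == b);
          }
          Py_XDECREF(a);
          Py_XDECREF(b);
          return same;
      }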