1.. highlightlang:: c
2
3.. _unicodeobjects:
4
5Unicode Objects and Codecs
6--------------------------
7
8.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
9.. sectionauthor:: Georg Brandl <georg@python.org>
10
11Unicode Objects
12^^^^^^^^^^^^^^^
13
14Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
15use a variety of representations, in order to allow handling the complete range
16of Unicode characters while staying memory efficient.  There are special cases
17for strings where all code points are below 128, 256, or 65536; otherwise, code
18points must be below 1114112 (which is the full Unicode range).
19
20:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
21in the Unicode object.  The :c:type:`Py_UNICODE*` representation is deprecated
22and inefficient; it should be avoided in performance- or memory-sensitive
23situations.
24
25Due to the transition between the old APIs and the new APIs, Unicode objects
26can internally be in two states depending on how they were created:
27
28* "canonical" Unicode objects are all objects created by a non-deprecated
29  Unicode API.  They use the most efficient representation allowed by the
30  implementation.
31
32* "legacy" Unicode objects have been created through one of the deprecated
33  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
34  :c:type:`Py_UNICODE*` representation; you will have to call
35  :c:func:`PyUnicode_READY` on them before calling any other API.
36
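The size thresholds described at the start of this section can be modelled in
standalone C.  This is only an illustrative sketch, not CPython's
implementation: the helper names ``bytes_per_code_point`` and ``payload_bytes``
are hypothetical, and the sketch assumes that every code point of a string is
stored with the same width of 1, 2 or 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: width in bytes needed to store every code point of a
   string whose largest code point is max_char.  Returns 0 if max_char lies
   outside the Unicode range (above 0x10FFFF). */
static int bytes_per_code_point(uint32_t max_char)
{
    if (max_char < 256)      /* covers both the < 128 and the < 256 case */
        return 1;
    if (max_char < 65536)
        return 2;
    if (max_char < 1114112)  /* 0x110000: one past the last code point */
        return 4;
    return 0;
}

/* Hypothetical helper: find the largest code point in a buffer and return the
   number of payload bytes a fixed-width representation would need. */
static size_t payload_bytes(const uint32_t *code_points, size_t length)
{
    uint32_t max_char = 0;
    for (size_t i = 0; i < length; i++) {
        if (code_points[i] > max_char)
            max_char = code_points[i];
    }
    return length * (size_t)bytes_per_code_point(max_char);
}
```

Under this model, a three-character string that contains U+20AC needs 2 bytes
per character, so a single character outside the Latin-1 range doubles the
payload of the whole string.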
37
38Unicode Type
39""""""""""""
40
41These are the basic Unicode object types used for the Unicode implementation in
42Python:
43
44.. c:type:: Py_UCS4
45            Py_UCS2
46            Py_UCS1
47
48   These types are typedefs for unsigned integer types wide enough to contain
49   characters of 32 bits, 16 bits and 8 bits, respectively.  When dealing with
50   single Unicode characters, use :c:type:`Py_UCS4`.
51
52   .. versionadded:: 3.3
53
54
55.. c:type:: Py_UNICODE
56
57   This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type
58   depending on the platform.
59
60   .. versionchanged:: 3.3
61      In previous versions, this was a 16-bit type or a 32-bit type depending on
62      whether you selected a "narrow" or "wide" Unicode version of Python at
63      build time.
64
65
66.. c:type:: PyASCIIObject
67            PyCompactUnicodeObject
68            PyUnicodeObject
69
70   These subtypes of :c:type:`PyObject` represent a Python Unicode object.  In
71   almost all cases, they shouldn't be used directly, since all API functions
72   that deal with Unicode objects take and return :c:type:`PyObject` pointers.
73
74   .. versionadded:: 3.3
75
76
77.. c:var:: PyTypeObject PyUnicode_Type
78
79   This instance of :c:type:`PyTypeObject` represents the Python Unicode type.  It
80   is exposed to Python code as ``str``.
81
82
83The following APIs are really C macros and can be used to do fast checks and to
84access internal read-only data of Unicode objects:
85
86.. c:function:: int PyUnicode_Check(PyObject *o)
87
88   Return true if the object *o* is a Unicode object or an instance of a Unicode
89   subtype.
90
91
92.. c:function:: int PyUnicode_CheckExact(PyObject *o)
93
94   Return true if the object *o* is a Unicode object, but not an instance of a
95   subtype.
96
97
98.. c:function:: int PyUnicode_READY(PyObject *o)
99
100   Ensure the string object *o* is in the "canonical" representation.  This is
101   required before using any of the access macros described below.
102
103   .. XXX expand on when it is not required
104
105   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
106   particular happens if memory allocation fails.
107
108   .. versionadded:: 3.3
109
110
111.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)
112
113   Return the length of the Unicode string, in code points.  *o* has to be a
114   Unicode object in the "canonical" representation (not checked).
115
116   .. versionadded:: 3.3
117
118
119.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
120                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
121                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)
122
   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access.  No checks are performed to
   verify that the canonical representation has the correct character size;
   use :c:func:`PyUnicode_KIND` to select the right macro.  Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.
128
129   .. versionadded:: 3.3
130
131
132.. c:macro:: PyUnicode_WCHAR_KIND
133             PyUnicode_1BYTE_KIND
134             PyUnicode_2BYTE_KIND
135             PyUnicode_4BYTE_KIND
136
137   Return values of the :c:func:`PyUnicode_KIND` macro.
138
139   .. versionadded:: 3.3
140
141
142.. c:function:: int PyUnicode_KIND(PyObject *o)
143
144   Return one of the PyUnicode kind constants (see above) that indicate how many
145   bytes per character this Unicode object uses to store its data.  *o* has to
146   be a Unicode object in the "canonical" representation (not checked).
147
148   .. XXX document "0" return value?
149
150   .. versionadded:: 3.3
151
152
153.. c:function:: void* PyUnicode_DATA(PyObject *o)
154
155   Return a void pointer to the raw Unicode buffer.  *o* has to be a Unicode
156   object in the "canonical" representation (not checked).
157
158   .. versionadded:: 3.3
159
160
161.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \
162                                     Py_UCS4 value)
163
164   Write into a canonical representation *data* (as obtained with
165   :c:func:`PyUnicode_DATA`).  This macro does not do any sanity checks and is
166   intended for usage in loops.  The caller should cache the *kind* value and
167   *data* pointer as obtained from other macro calls.  *index* is the index in
168   the string (starts at 0) and *value* is the new code point value which should
169   be written to that location.
170
171   .. versionadded:: 3.3
172
173
174.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)
175
176   Read a code point from a canonical representation *data* (as obtained with
177   :c:func:`PyUnicode_DATA`).  No checks or ready calls are performed.
178
179   .. versionadded:: 3.3
180
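The way :c:func:`PyUnicode_READ` and :c:func:`PyUnicode_WRITE` use the *kind*
value to interpret the *data* pointer can be sketched without Python at all.
The names ``model_read``, ``model_write``, ``count_above`` and the ``KIND_*``
constants below are hypothetical stand-ins, and the sketch assumes that each
kind constant equals the number of bytes per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PyUnicode_1BYTE_KIND etc.  ASSUMPTION: each constant equals
   the number of bytes per character. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Model of PyUnicode_READ: no bounds or kind checks, like the macro. */
static uint32_t model_read(int kind, const void *data, size_t index)
{
    switch (kind) {
    case KIND_1BYTE: return ((const uint8_t *)data)[index];
    case KIND_2BYTE: return ((const uint16_t *)data)[index];
    default:         return ((const uint32_t *)data)[index];
    }
}

/* Model of PyUnicode_WRITE: the caller guarantees that value fits the kind. */
static void model_write(int kind, void *data, size_t index, uint32_t value)
{
    switch (kind) {
    case KIND_1BYTE: ((uint8_t *)data)[index] = (uint8_t)value; break;
    case KIND_2BYTE: ((uint16_t *)data)[index] = (uint16_t)value; break;
    default:         ((uint32_t *)data)[index] = value; break;
    }
}

/* Typical loop shape: obtain kind and data once, then index repeatedly. */
static size_t count_above(int kind, const void *data, size_t length,
                          uint32_t limit)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if (model_read(kind, data, i) > limit)
            count++;
    }
    return count;
}
```

With the real macros, the same loop shape applies: call
:c:func:`PyUnicode_READY` first, cache the results of :c:func:`PyUnicode_KIND`
and :c:func:`PyUnicode_DATA` outside the loop, and pass them to
:c:func:`PyUnicode_READ` for each index.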
181
182.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)
183
184   Read a character from a Unicode object *o*, which must be in the "canonical"
185   representation.  This is less efficient than :c:func:`PyUnicode_READ` if you
186   do multiple consecutive reads.
187
188   .. versionadded:: 3.3
189
190
191.. c:function:: PyUnicode_MAX_CHAR_VALUE(PyObject *o)
192
193   Return the maximum code point that is suitable for creating another string
194   based on *o*, which must be in the "canonical" representation.  This is
195   always an approximation but more efficient than iterating over the string.
196
197   .. versionadded:: 3.3
198
199
200.. c:function:: int PyUnicode_ClearFreeList()
201
202   Clear the free list. Return the total number of freed items.
203
204
205.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
206
207   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
208   code units (this includes surrogate pairs as 2 units).  *o* has to be a
209   Unicode object (not checked).
210
211   .. deprecated-removed:: 3.3 4.0
212      Part of the old-style Unicode API, please migrate to using
213      :c:func:`PyUnicode_GET_LENGTH`.
214
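The difference between the code units counted by :c:func:`PyUnicode_GET_SIZE`
and the code points counted by :c:func:`PyUnicode_GET_LENGTH` only shows up
where :c:type:`Py_UNICODE` is a 16-bit type.  The following standalone sketch
assumes such a platform and UTF-16 storage; the helper name
``utf16_code_units`` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units needed for a sequence of
   code points.  ASSUMPTION: a 16-bit Py_UNICODE holding UTF-16, where each
   code point above 0xFFFF is stored as a surrogate pair (2 units). */
static size_t utf16_code_units(const uint32_t *code_points, size_t length)
{
    size_t units = 0;
    for (size_t i = 0; i < length; i++)
        units += (code_points[i] > 0xFFFF) ? 2 : 1;
    return units;
}
```

For the three code points ``'h'``, ``'i'`` and U+1F600, this model reports four
code units although the string is three code points long.  Where
:c:type:`Py_UNICODE` is a 32-bit type, both counts agree.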
215
216.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
217
218   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
219   bytes.  *o* has to be a Unicode object (not checked).
220
221   .. deprecated-removed:: 3.3 4.0
222      Part of the old-style Unicode API, please migrate to using
223      :c:func:`PyUnicode_GET_LENGTH`.
224
225
226.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
227                const char* PyUnicode_AS_DATA(PyObject *o)
228
229   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.  The
230   returned buffer is always terminated with an extra null code point.  It
231   may also contain embedded null code points, which would cause the string
232   to be truncated when used in most C functions.  The ``AS_DATA`` form
233   casts the pointer to :c:type:`const char *`.  The *o* argument has to be
234   a Unicode object (not checked).
235
236   .. versionchanged:: 3.3
237      This macro is now inefficient -- because in many cases the
238      :c:type:`Py_UNICODE` representation does not exist and needs to be created
239      -- and can fail (return ``NULL`` with an exception set).  Try to port the
240      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
241      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.
242
243   .. deprecated-removed:: 3.3 4.0
244      Part of the old-style Unicode API, please migrate to using the
245      :c:func:`PyUnicode_nBYTE_DATA` family of macros.
246
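The truncation caused by embedded null code points can be reproduced with a
plain :c:type:`wchar_t` buffer.  This is a minimal sketch that does not use the
Python API; the helper name ``is_truncated_by_c_functions`` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: returns nonzero if a C function that stops at the
   first null would see fewer units than the buffer really stores. */
static int is_truncated_by_c_functions(const wchar_t *buffer,
                                       size_t real_length)
{
    return wcslen(buffer) < real_length;
}
```

A buffer holding ``L"ab\0cd"`` stores five units before its terminator, but
``wcslen()`` stops at the embedded null.  Code that needs the full content
therefore has to carry the length separately instead of relying on the
terminator.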
247
248Unicode Character Properties
249""""""""""""""""""""""""""""
250
251Unicode provides many different character properties. The most often needed ones
252are available through these macros which are mapped to C functions depending on
253the Python configuration.
254
255
256.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
257
258   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.
259
260
261.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
262
263   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.
264
265
266.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
267
268   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.
269
270
271.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
272
273   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.
274
275
276.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
277
278   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.
279
280
281.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
282
283   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.
284
285
286.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
287
288   Return ``1`` or ``0`` depending on whether *ch* is a digit character.
289
290
291.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
292
293   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.
294
295
296.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
297
298   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.
299
300
301.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
302
303   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.
304
305
306.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
307
308   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
309   Nonprintable characters are those characters defined in the Unicode character
310   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
311   considered printable.  (Note that printable characters in this context are
312   those which should not be escaped when :func:`repr` is invoked on a string.
313   It has no bearing on the handling of strings written to :data:`sys.stdout` or
314   :data:`sys.stderr`.)
315
316
317These APIs can be used for fast direct character conversions:
318
319
320.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
321
322   Return the character *ch* converted to lower case.
323
324   .. deprecated:: 3.3
325      This function uses simple case mappings.
326
327
328.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
329
330   Return the character *ch* converted to upper case.
331
332   .. deprecated:: 3.3
333      This function uses simple case mappings.
334
335
336.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
337
338   Return the character *ch* converted to title case.
339
340   .. deprecated:: 3.3
341      This function uses simple case mappings.
342
343
344.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
345
   Return the character *ch* converted to a non-negative decimal integer.  Return
   ``-1`` if this is not possible.  This macro does not raise exceptions.
348
349
350.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
351
   Return the character *ch* converted to a single-digit integer. Return ``-1`` if
353   this is not possible.  This macro does not raise exceptions.
354
355
356.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
357
358   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
359   possible.  This macro does not raise exceptions.
360
361
362These APIs can be used to work with surrogates:
363
364.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)
365
366   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).
367
368.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)
369
370   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).
371
372.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)
373
374   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).
375
376.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)
377
378   Join two surrogate characters and return a single Py_UCS4 value.
379   *high* and *low* are respectively the leading and trailing surrogates in a
380   surrogate pair.
381
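The ranges and the joining step described for the surrogate macros can be
written out as standalone C.  The function names below are hypothetical
equivalents rather than the macros themselves, and ``next_code_point`` is an
added illustration of how they are typically combined when walking a UTF-16
buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standalone equivalents of the surrogate checks; names are hypothetical. */
static int is_high_surrogate(uint32_t ch) { return 0xD800 <= ch && ch <= 0xDBFF; }
static int is_low_surrogate(uint32_t ch)  { return 0xDC00 <= ch && ch <= 0xDFFF; }

/* Combine a leading (high) and a trailing (low) surrogate into one code
   point using the standard UTF-16 decoding formula. */
static uint32_t join_surrogates(uint32_t high, uint32_t low)
{
    return 0x10000 + (((high & 0x03FF) << 10) | (low & 0x03FF));
}

/* Decode the code point starting at units[*index] and advance *index.
   An unpaired surrogate is returned unchanged. */
static uint32_t next_code_point(const uint16_t *units, size_t length,
                                size_t *index)
{
    uint32_t ch = units[(*index)++];
    if (is_high_surrogate(ch) && *index < length
            && is_low_surrogate(units[*index])) {
        ch = join_surrogates(ch, units[(*index)++]);
    }
    return ch;
}
```

For example, the pair ``0xD83D``, ``0xDE00`` joins to the code point
``0x1F600``.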
382
383Creating and accessing Unicode strings
384""""""""""""""""""""""""""""""""""""""
385
386To create Unicode objects and access their basic sequence properties, use these
387APIs:
388
389.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
390
391   Create a new Unicode object.  *maxchar* should be the true maximum code point
392   to be placed in the string.  As an approximation, it can be rounded up to the
393   nearest value in the sequence 127, 255, 65535, 1114111.
394
395   This is the recommended way to allocate a new Unicode object.  Objects
396   created using this function are not resizable.
397
398   .. versionadded:: 3.3
399
400
401.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
402                                                    Py_ssize_t size)
403
404   Create a new Unicode object with the given *kind* (possible values are
405   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
406   :c:func:`PyUnicode_KIND`).  The *buffer* must point to an array of *size*
407   units of 1, 2 or 4 bytes per character, as given by the kind.
408
409   .. versionadded:: 3.3
410
411
412.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
413
414   Create a Unicode object from the char buffer *u*.  The bytes will be
415   interpreted as being UTF-8 encoded.  The buffer is copied into the new
416   object. If the buffer is not ``NULL``, the return value might be a shared
417   object, i.e. modification of the data is not allowed.
418
419   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
420   with the buffer set to ``NULL``.  This usage is deprecated in favor of
421   :c:func:`PyUnicode_New`.
422
423
424.. c:function:: PyObject *PyUnicode_FromString(const char *u)
425
426   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
427   *u*.
428
429
430.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)
431
432   Take a C :c:func:`printf`\ -style *format* string and a variable number of
433   arguments, calculate the size of the resulting Python Unicode string and return
434   a string with the values formatted into it.  The variable arguments must be C
435   types and must correspond exactly to the format characters in the *format*
436   ASCII-encoded string. The following format characters are allowed:
437
438   .. % This should be exactly the same as the table in PyErr_Format.
439   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
440   .. % because not all compilers support the %z width modifier -- we fake it
441   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
442   .. % Similar comments apply to the %ll width modifier and
443
444   .. tabularcolumns:: |l|l|L|
445
446   +-------------------+---------------------+----------------------------------+
447   | Format Characters | Type                | Comment                          |
448   +===================+=====================+==================================+
449   | :attr:`%%`        | *n/a*               | The literal % character.         |
450   +-------------------+---------------------+----------------------------------+
451   | :attr:`%c`        | int                 | A single character,              |
452   |                   |                     | represented as a C int.          |
453   +-------------------+---------------------+----------------------------------+
454   | :attr:`%d`        | int                 | Equivalent to                    |
455   |                   |                     | ``printf("%d")``. [1]_           |
456   +-------------------+---------------------+----------------------------------+
457   | :attr:`%u`        | unsigned int        | Equivalent to                    |
458   |                   |                     | ``printf("%u")``. [1]_           |
459   +-------------------+---------------------+----------------------------------+
460   | :attr:`%ld`       | long                | Equivalent to                    |
461   |                   |                     | ``printf("%ld")``. [1]_          |
462   +-------------------+---------------------+----------------------------------+
463   | :attr:`%li`       | long                | Equivalent to                    |
464   |                   |                     | ``printf("%li")``. [1]_          |
465   +-------------------+---------------------+----------------------------------+
466   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
467   |                   |                     | ``printf("%lu")``. [1]_          |
468   +-------------------+---------------------+----------------------------------+
469   | :attr:`%lld`      | long long           | Equivalent to                    |
470   |                   |                     | ``printf("%lld")``. [1]_         |
471   +-------------------+---------------------+----------------------------------+
472   | :attr:`%lli`      | long long           | Equivalent to                    |
473   |                   |                     | ``printf("%lli")``. [1]_         |
474   +-------------------+---------------------+----------------------------------+
475   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
476   |                   |                     | ``printf("%llu")``. [1]_         |
477   +-------------------+---------------------+----------------------------------+
478   | :attr:`%zd`       | Py_ssize_t          | Equivalent to                    |
479   |                   |                     | ``printf("%zd")``. [1]_          |
480   +-------------------+---------------------+----------------------------------+
481   | :attr:`%zi`       | Py_ssize_t          | Equivalent to                    |
482   |                   |                     | ``printf("%zi")``. [1]_          |
483   +-------------------+---------------------+----------------------------------+
484   | :attr:`%zu`       | size_t              | Equivalent to                    |
485   |                   |                     | ``printf("%zu")``. [1]_          |
486   +-------------------+---------------------+----------------------------------+
487   | :attr:`%i`        | int                 | Equivalent to                    |
488   |                   |                     | ``printf("%i")``. [1]_           |
489   +-------------------+---------------------+----------------------------------+
490   | :attr:`%x`        | int                 | Equivalent to                    |
491   |                   |                     | ``printf("%x")``. [1]_           |
492   +-------------------+---------------------+----------------------------------+
493   | :attr:`%s`        | const char\*        | A null-terminated C character    |
494   |                   |                     | array.                           |
495   +-------------------+---------------------+----------------------------------+
496   | :attr:`%p`        | const void\*        | The hex representation of a C    |
497   |                   |                     | pointer. Mostly equivalent to    |
498   |                   |                     | ``printf("%p")`` except that     |
499   |                   |                     | it is guaranteed to start with   |
500   |                   |                     | the literal ``0x`` regardless    |
501   |                   |                     | of what the platform's           |
502   |                   |                     | ``printf`` yields.               |
503   +-------------------+---------------------+----------------------------------+
504   | :attr:`%A`        | PyObject\*          | The result of calling            |
505   |                   |                     | :func:`ascii`.                   |
506   +-------------------+---------------------+----------------------------------+
507   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
508   +-------------------+---------------------+----------------------------------+
509   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
510   |                   | const char\*        | ``NULL``) and a null-terminated  |
511   |                   |                     | C character array as a second    |
512   |                   |                     | parameter (which will be used,   |
513   |                   |                     | if the first parameter is        |
514   |                   |                     | ``NULL``).                       |
515   +-------------------+---------------------+----------------------------------+
516   | :attr:`%S`        | PyObject\*          | The result of calling            |
517   |                   |                     | :c:func:`PyObject_Str`.          |
518   +-------------------+---------------------+----------------------------------+
519   | :attr:`%R`        | PyObject\*          | The result of calling            |
520   |                   |                     | :c:func:`PyObject_Repr`.         |
521   +-------------------+---------------------+----------------------------------+
522
523   An unrecognized format character causes all the rest of the format string to be
524   copied as-is to the result string, and any extra arguments discarded.
525
526   .. note::
      The width formatter unit is a number of characters rather than bytes.
528      The precision formatter unit is number of bytes for ``"%s"`` and
529      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
530      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
531      (if the ``PyObject*`` argument is not ``NULL``).
532
533   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
534      zu, i, x): the 0-conversion flag has effect even when a precision is given.
535
536   .. versionchanged:: 3.2
537      Support for ``"%lld"`` and ``"%llu"`` added.
538
539   .. versionchanged:: 3.3
540      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.
541
542   .. versionchanged:: 3.4
      Support for width and precision formatters for ``"%s"``, ``"%A"``,
      ``"%U"``, ``"%V"``, ``"%S"`` and ``"%R"`` added.
545
546
547.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
548
549   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
550   arguments.
551
552
553.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
554                               const char *encoding, const char *errors)
555
556   Decode an encoded object *obj* to a Unicode object.
557
558   :class:`bytes`, :class:`bytearray` and other
559   :term:`bytes-like objects <bytes-like object>`
560   are decoded according to the given *encoding* and using the error handling
561   defined by *errors*. Both can be ``NULL`` to have the interface use the default
562   values (see :ref:`builtincodecs` for details).
563
564   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
565   set.
566
567   The API returns ``NULL`` if there was an error.  The caller is responsible for
568   decref'ing the returned objects.
569
570
571.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)
572
573   Return the length of the Unicode object, in code points.
574
575   .. versionadded:: 3.3
576
577
578.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
579                                                    Py_ssize_t to_start, \
580                                                    PyObject *from, \
581                                                    Py_ssize_t from_start, \
582                                                    Py_ssize_t how_many)
583
584   Copy characters from one Unicode object into another.  This function performs
585   character conversion when necessary and falls back to :c:func:`memcpy` if
586   possible.  Returns ``-1`` and sets an exception on error, otherwise returns
587   the number of copied characters.
588
589   .. versionadded:: 3.3
590
591
592.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
593                        Py_ssize_t length, Py_UCS4 fill_char)
594
595   Fill a string with a character: write *fill_char* into
596   ``unicode[start:start+length]``.
597
   Fail if *fill_char* is bigger than the string's maximum character, or if the
   string has more than one reference.
600
   Return the number of characters written, or return ``-1`` and raise an
   exception on error.
603
604   .. versionadded:: 3.3
605
606
607.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
608                                        Py_UCS4 character)
609
610   Write a character to a string.  The string must have been created through
611   :c:func:`PyUnicode_New`.  Since Unicode strings are supposed to be immutable,
612   the string must not be shared, or have been hashed yet.
613
   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).
617
618   .. versionadded:: 3.3
619
620
621.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)
622
623   Read a character from a string.  This function checks that *unicode* is a
624   Unicode object and the index is not out of bounds, in contrast to the macro
625   version :c:func:`PyUnicode_READ_CHAR`.
626
627   .. versionadded:: 3.3
628
629
630.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
631                                              Py_ssize_t end)
632
633   Return a substring of *str*, from character index *start* (included) to
634   character index *end* (excluded).  Negative indices are not supported.
635
636   .. versionadded:: 3.3
637
638
639.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
640                                          Py_ssize_t buflen, int copy_null)
641
642   Copy the string *u* into a UCS4 buffer, including a null character, if
643   *copy_null* is set.  Returns ``NULL`` and sets an exception on error (in
644   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
645   *u*).  *buffer* is returned on success.
646
647   .. versionadded:: 3.3
648
649
650.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)
651
652   Copy the string *u* into a new UCS4 buffer that is allocated using
653   :c:func:`PyMem_Malloc`.  If this fails, ``NULL`` is returned with a
654   :exc:`MemoryError` set.  The returned buffer always has an extra
655   null code point appended.
656
657   .. versionadded:: 3.3
658
659
660Deprecated Py_UNICODE APIs
661""""""""""""""""""""""""""
662
663.. deprecated-removed:: 3.3 4.0
664
These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them, as they will not be removed in Python
3.x, but their authors need to be aware that using them can now cost
performance and memory.
668
669
670.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
671
672   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
673   may be ``NULL`` which causes the contents to be undefined. It is the user's
674   responsibility to fill in the needed data.  The buffer is copied into the new
675   object.
676
677   If the buffer is not ``NULL``, the return value might be a shared object.
678   Therefore, modification of the resulting Unicode object is only allowed when
679   *u* is ``NULL``.
680
681   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
682   string content has been filled before using any of the access macros such as
683   :c:func:`PyUnicode_KIND`.
684
685   Please migrate to using :c:func:`PyUnicode_FromKindAndData`,
686   :c:func:`PyUnicode_FromWideChar` or :c:func:`PyUnicode_New`.
687
688
689.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
690
691   Return a read-only pointer to the Unicode object's internal
692   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
693   :c:type:`Py_UNICODE*` representation of the object if it is not yet
694   available. The buffer is always terminated with an extra null code point.
695   Note that the resulting :c:type:`Py_UNICODE` string may also contain
696   embedded null code points, which would cause the string to be truncated when
697   used in most C functions.
698
699   Please migrate to using :c:func:`PyUnicode_AsUCS4`,
700   :c:func:`PyUnicode_AsWideChar`, :c:func:`PyUnicode_ReadChar` or similar new
701   APIs.
702
703
704.. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size)
705
   Create a Unicode object by replacing all decimal digits in the
   :c:type:`Py_UNICODE` buffer *s* of the given *size* by ASCII digits 0--9
   according to their decimal value.  Return ``NULL`` if an exception occurs.
709
710
711.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)
712
   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
   Note that the resulting :c:type:`Py_UNICODE*` string
   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.
718
719   .. versionadded:: 3.3
720
721
722.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)
723
   Create a copy of a Unicode string ending with a null code point. Return ``NULL``
   and raise a :exc:`MemoryError` exception on memory allocation failure,
   otherwise return a newly allocated buffer (use :c:func:`PyMem_Free` to free
   the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
   contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.
730
731   .. versionadded:: 3.2
732
733   Please migrate to using :c:func:`PyUnicode_AsUCS4Copy` or similar new APIs.
734
735
736.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
737
738   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
739   code units (this includes surrogate pairs as 2 units).
740
741   Please migrate to using :c:func:`PyUnicode_GetLength`.
742
743
744.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)
745
746   Copy an instance of a Unicode subtype to a new true Unicode object if
747   necessary. If *obj* is already a true Unicode object (not a subtype),
748   return the reference with incremented refcount.
749
750   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.
751
752
753Locale Encoding
754"""""""""""""""
755
756The current locale encoding can be used to decode text from the operating
757system.
758
759.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
760                                                        Py_ssize_t len, \
761                                                        const char *errors)
762
763   Decode a string from UTF-8 on Android, or from the current locale encoding
764   on other platforms. The supported
765   error handlers are ``"strict"`` and ``"surrogateescape"``
766   (:pep:`383`). The decoder uses ``"strict"`` error handler if
767   *errors* is ``NULL``.  *str* must end with a null character but
768   cannot contain embedded null characters.
769
770   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
771   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
772   Python startup).
773
774   This function ignores the Python UTF-8 mode.
775
776   .. seealso::
777
778      The :c:func:`Py_DecodeLocale` function.
779
780   .. versionadded:: 3.3
781
782   .. versionchanged:: 3.7
783      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_DecodeLocale` was used for ``surrogateescape``, and the
      current locale encoding was used for ``strict``.
787
788
789.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
790
791   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
792   length using :c:func:`strlen`.
793
794   .. versionadded:: 3.3
795
796
797.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)
798
799   Encode a Unicode object to UTF-8 on Android, or to the current locale
800   encoding on other platforms. The
801   supported error handlers are ``"strict"`` and ``"surrogateescape"``
802   (:pep:`383`). The encoder uses ``"strict"`` error handler if
803   *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot
804   contain embedded null characters.
805
806   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
807   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
808   Python startup).
809
810   This function ignores the Python UTF-8 mode.
811
812   .. seealso::
813
814      The :c:func:`Py_EncodeLocale` function.
815
816   .. versionadded:: 3.3
817
818   .. versionchanged:: 3.7
819      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale` was used for ``surrogateescape``, and the
      current locale encoding was used for ``strict``.
824
825
826File System Encoding
827""""""""""""""""""""
828
829To encode and decode file names and other environment strings,
830:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
831:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
832(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
833argument parsing, the ``"O&"`` converter should be used, passing
834:c:func:`PyUnicode_FSConverter` as the conversion function:
835
836.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)
837
838   ParseTuple converter: encode :class:`str` objects -- obtained directly or
839   through the :class:`os.PathLike` interface -- to :class:`bytes` using
840   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
841   *result* must be a :c:type:`PyBytesObject*` which must be released when it is
842   no longer used.
843
844   .. versionadded:: 3.1
845
846   .. versionchanged:: 3.6
847      Accepts a :term:`path-like object`.
848
849To decode file names to :class:`str` during argument parsing, the ``"O&"``
850converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
851conversion function:
852
853.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
854
855   ParseTuple converter: decode :class:`bytes` objects -- obtained either
856   directly or indirectly through the :class:`os.PathLike` interface -- to
857   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
858   objects are output as-is. *result* must be a :c:type:`PyUnicodeObject*` which
859   must be released when it is no longer used.
860
861   .. versionadded:: 3.2
862
863   .. versionchanged:: 3.6
864      Accepts a :term:`path-like object`.
865
866
867.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
868
869   Decode a string using :c:data:`Py_FileSystemDefaultEncoding` and the
870   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
871
872   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
873   locale encoding.
874
875   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
876   locale encoding and cannot be modified later. If you need to decode a string
877   from the current locale encoding, use
878   :c:func:`PyUnicode_DecodeLocaleAndSize`.
879
880   .. seealso::
881
882      The :c:func:`Py_DecodeLocale` function.
883
884   .. versionchanged:: 3.6
885      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
886
887
888.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
889
890   Decode a null-terminated string using :c:data:`Py_FileSystemDefaultEncoding`
891   and the :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
892
893   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
894   locale encoding.
895
896   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
897
898   .. versionchanged:: 3.6
899      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
900
901
902.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
903
904   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
905   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
906   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
907   null bytes.
908
909   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
910   locale encoding.
911
912   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
913   locale encoding and cannot be modified later. If you need to encode a string
914   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.
915
916   .. seealso::
917
918      The :c:func:`Py_EncodeLocale` function.
919
920   .. versionadded:: 3.2
921
922   .. versionchanged:: 3.6
923      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.
924
925wchar_t Support
926"""""""""""""""
927
928:c:type:`wchar_t` support for platforms which support it:
929
930.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
931
   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the
   length, using :c:func:`wcslen`.
   Return ``NULL`` on failure.
936
937
938.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)
939
940   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*.  At most
941   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
942   null termination character).  Return the number of :c:type:`wchar_t` characters
943   copied or ``-1`` in case of an error.  Note that the resulting :c:type:`wchar_t*`
944   string may or may not be null-terminated.  It is the responsibility of the caller
945   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
946   required by the application. Also, note that the :c:type:`wchar_t*` string
947   might contain null characters, which would cause the string to be truncated
948   when used with most C functions.
949
950
951.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)
952
   Convert the Unicode object to a wide character string. The output string
   always ends with a null character. If *size* is not ``NULL``, write the number
   of wide characters (excluding the trailing null termination character) into
   *\*size*. Note that the resulting :c:type:`wchar_t*` string might contain
   null characters, which would cause the string to be truncated when used with
   most C functions. If *size* is ``NULL`` and the :c:type:`wchar_t*` string
   contains null characters, a :exc:`ValueError` is raised.

   Returns a buffer allocated by :c:func:`PyMem_Malloc` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.
965
966   .. versionadded:: 3.2
967
968   .. versionchanged:: 3.7
969      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:type:`wchar_t*`
970      string contains null characters.
971
972
973.. _builtincodecs:
974
975Built-in Codecs
976^^^^^^^^^^^^^^^
977
978Python provides a set of built-in codecs which are written in C for speed. All of
979these codecs are directly usable via the following functions.
980
Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the parameters of the same name of the built-in
:func:`str` string object constructor.

Setting *encoding* to ``NULL`` causes the default encoding to be used,
which is UTF-8.  The file system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string, on others, it will change at run-time
(such as when the application invokes ``setlocale``).

Error handling is set by *errors*, which may also be set to ``NULL``, meaning to
use the default handling defined for the codec.  Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface.  For simplicity, only deviations from
the following generic ones are documented.
999
1000
1001Generic Codecs
1002""""""""""""""
1003
1004These are the generic codec APIs:
1005
1006
1007.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
1008                              const char *encoding, const char *errors)
1009
1010   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
1011   *encoding* and *errors* have the same meaning as the parameters of the same name
1012   in the :func:`str` built-in function.  The codec to be used is looked up
1013   using the Python codec registry.  Return ``NULL`` if an exception was raised by
1014   the codec.
1015
1016
1017.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
1018                              const char *encoding, const char *errors)
1019
   Encode a Unicode object and return the result as a Python bytes object.
1021   *encoding* and *errors* have the same meaning as the parameters of the same
1022   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
1023   using the Python codec registry. Return ``NULL`` if an exception was raised by
1024   the codec.
1025
1026
1027.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \
1028                              const char *encoding, const char *errors)
1029
1030   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
1031   bytes object.  *encoding* and *errors* have the same meaning as the
1032   parameters of the same name in the Unicode :meth:`~str.encode` method.  The codec
1033   to be used is looked up using the Python codec registry.  Return ``NULL`` if an
1034   exception was raised by the codec.
1035
1036   .. deprecated-removed:: 3.3 4.0
1037      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1038      :c:func:`PyUnicode_AsEncodedString`.
1039
1040
1041UTF-8 Codecs
1042""""""""""""
1043
1044These are the UTF-8 codec APIs:
1045
1046
1047.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
1048
1049   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
1050   *s*. Return ``NULL`` if an exception was raised by the codec.
1051
1052
1053.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
1054                              const char *errors, Py_ssize_t *consumed)
1055
1056   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
1057   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
1058   treated as an error. Those bytes will not be decoded and the number of bytes
1059   that have been decoded will be stored in *consumed*.
1060
1061
1062.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
1063
   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
   raised by the codec.
1067
1068
1069.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)
1070
1071   Return a pointer to the UTF-8 encoding of the Unicode object, and
1072   store the size of the encoded representation (in bytes) in *size*.  The
1073   *size* argument can be ``NULL``; in this case no size will be stored.  The
1074   returned buffer always has an extra null byte appended (not included in
1075   *size*), regardless of whether there are any other null code points.
1076
1077   In the case of an error, ``NULL`` is returned with an exception set and no
1078   *size* is stored.
1079
1080   This caches the UTF-8 representation of the string in the Unicode object, and
1081   subsequent calls will return a pointer to the same buffer.  The caller is not
1082   responsible for deallocating the buffer.
1083
1084   .. versionadded:: 3.3
1085
1086   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1088
1089
1090.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)
1091
1092   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
1093
1094   .. versionadded:: 3.3
1095
1096   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.
1098
1099
1100.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1101
1102   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
1103   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1104   the codec.
1105
1106   .. deprecated-removed:: 3.3 4.0
1107      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1108      :c:func:`PyUnicode_AsUTF8String`, :c:func:`PyUnicode_AsUTF8AndSize` or
1109      :c:func:`PyUnicode_AsEncodedString`.
1110
1111
1112UTF-32 Codecs
1113"""""""""""""
1114
1115These are the UTF-32 codec APIs:
1116
1117
1118.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
1119                              const char *errors, int *byteorder)
1120
1121   Decode *size* bytes from a UTF-32 encoded buffer string and return the
1122   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1123   handling. It defaults to "strict".
1124
1125   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1126   order::
1127
1128      *byteorder == -1: little endian
1129      *byteorder == 0:  native order
1130      *byteorder == 1:  big endian
1131
1132   If ``*byteorder`` is zero, and the first four bytes of the input data are a
1133   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1134   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1135   ``1``, any byte order mark is copied to the output.
1136
1137   After completion, *\*byteorder* is set to the current byte order at the end
1138   of input data.
1139
1140   If *byteorder* is ``NULL``, the codec starts in native order mode.
1141
1142   Return ``NULL`` if an exception was raised by the codec.
1143
1144
1145.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
1146                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1147
1148   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
1149   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
1150   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
1151   by four) as an error. Those bytes will not be decoded and the number of bytes
1152   that have been decoded will be stored in *consumed*.
1153
1154
1155.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
1156
   Return a Python bytes object using the UTF-32 encoding in native byte
   order. The data always starts with a BOM.  Error handling is "strict".
1159   Return ``NULL`` if an exception was raised by the codec.
1160
1161
1162.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \
1163                              const char *errors, int byteorder)
1164
1165   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
1166   data in *s*.  Output is written according to the following byte order::
1167
1168      byteorder == -1: little endian
1169      byteorder == 0:  native byte order (writes a BOM mark)
1170      byteorder == 1:  big endian
1171
   If *byteorder* is ``0``, the output will always start with the Unicode BOM
   (U+FEFF). In the other two modes, no BOM is prepended.
1174
1175   If ``Py_UNICODE_WIDE`` is not defined, surrogate pairs will be output
1176   as a single code point.
1177
1178   Return ``NULL`` if an exception was raised by the codec.
1179
1180   .. deprecated-removed:: 3.3 4.0
1181      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1182      :c:func:`PyUnicode_AsUTF32String` or :c:func:`PyUnicode_AsEncodedString`.
1183
1184
1185UTF-16 Codecs
1186"""""""""""""
1187
1188These are the UTF-16 codec APIs:
1189
1190
1191.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
1192                              const char *errors, int *byteorder)
1193
1194   Decode *size* bytes from a UTF-16 encoded buffer string and return the
1195   corresponding Unicode object.  *errors* (if non-``NULL``) defines the error
1196   handling. It defaults to "strict".
1197
1198   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
1199   order::
1200
1201      *byteorder == -1: little endian
1202      *byteorder == 0:  native order
1203      *byteorder == 1:  big endian
1204
1205   If ``*byteorder`` is zero, and the first two bytes of the input data are a
1206   byte order mark (BOM), the decoder switches to this byte order and the BOM is
1207   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
1208   ``1``, any byte order mark is copied to the output (where it will result in
1209   either a ``\ufeff`` or a ``\ufffe`` character).
1210
1211   After completion, *\*byteorder* is set to the current byte order at the end
1212   of input data.
1213
1214   If *byteorder* is ``NULL``, the codec starts in native order mode.
1215
1216   Return ``NULL`` if an exception was raised by the codec.
1217
1218
1219.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
1220                              const char *errors, int *byteorder, Py_ssize_t *consumed)
1221
1222   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
1223   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
1224   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
1225   split surrogate pair) as an error. Those bytes will not be decoded and the
1226   number of bytes that have been decoded will be stored in *consumed*.
1227
1228
1229.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
1230
   Return a Python bytes object using the UTF-16 encoding in native byte
   order. The data always starts with a BOM.  Error handling is "strict".
1233   Return ``NULL`` if an exception was raised by the codec.
1234
1235
1236.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \
1237                              const char *errors, int byteorder)
1238
1239   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
1240   data in *s*.  Output is written according to the following byte order::
1241
1242      byteorder == -1: little endian
1243      byteorder == 0:  native byte order (writes a BOM mark)
1244      byteorder == 1:  big endian
1245
   If *byteorder* is ``0``, the output will always start with the Unicode BOM
   (U+FEFF). In the other two modes, no BOM is prepended.
1248
   If ``Py_UNICODE_WIDE`` is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.
1252
1253   Return ``NULL`` if an exception was raised by the codec.
1254
1255   .. deprecated-removed:: 3.3 4.0
1256      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1257      :c:func:`PyUnicode_AsUTF16String` or :c:func:`PyUnicode_AsEncodedString`.
1258
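For readers unfamiliar with surrogate pairs, the arithmetic defined by the
UTF-16 encoding form can be sketched in standalone C.  This is an
illustration of the encoding rule only, not CPython's implementation; the
function name ``to_utf16_units`` is hypothetical, and the input is assumed
to be a valid code point no greater than U+10FFFF:

```c
#include <stdint.h>

/* Hypothetical helper: write the UTF-16 code units for one code point
   into out and return how many units were written (1 or 2).
   Assumes code_point <= 0x10FFFF. */
static int
to_utf16_units(uint32_t code_point, uint16_t out[2])
{
    if (code_point < 0x10000) {
        out[0] = (uint16_t)code_point;   /* BMP: a single unit */
        return 1;
    }
    code_point -= 0x10000;               /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (code_point >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (code_point & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1F600 this gives the pair ``0xD83D``, ``0xDE00``.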
1259
1260UTF-7 Codecs
1261""""""""""""
1262
1263These are the UTF-7 codec APIs:
1264
1265
1266.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
1267
1268   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
1269   *s*.  Return ``NULL`` if an exception was raised by the codec.
1270
1271
1272.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
1273                              const char *errors, Py_ssize_t *consumed)
1274
1275   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`.  If
1276   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
1277   be treated as an error.  Those bytes will not be decoded and the number of
1278   bytes that have been decoded will be stored in *consumed*.
1279
1280
1281.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \
1282                              int base64SetO, int base64WhiteSpace, const char *errors)
1283
   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-7 and
1285   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1286   the codec.
1287
1288   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
1289   special meaning) will be encoded in base-64.  If *base64WhiteSpace* is
1290   nonzero, whitespace will be encoded in base-64.  Both are set to zero for the
1291   Python "utf-7" codec.
1292
1293   .. deprecated-removed:: 3.3 4.0
1294      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1295      :c:func:`PyUnicode_AsEncodedString`.
1296
1297
1298Unicode-Escape Codecs
1299"""""""""""""""""""""
1300
1301These are the "Unicode Escape" codec APIs:
1302
1303
1304.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
1305                              Py_ssize_t size, const char *errors)
1306
1307   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
1308   string *s*.  Return ``NULL`` if an exception was raised by the codec.
1309
1310
1311.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
1312
1313   Encode a Unicode object using Unicode-Escape and return the result as a
1314   bytes object.  Error handling is "strict".  Return ``NULL`` if an exception was
1315   raised by the codec.
1316
1317
1318.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
1319
1320   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
1321   return a bytes object.  Return ``NULL`` if an exception was raised by the codec.
1322
1323   .. deprecated-removed:: 3.3 4.0
1324      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1325      :c:func:`PyUnicode_AsUnicodeEscapeString`.
1326
1327
1328Raw-Unicode-Escape Codecs
1329"""""""""""""""""""""""""
1330
1331These are the "Raw Unicode Escape" codec APIs:
1332
1333
1334.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
1335                              Py_ssize_t size, const char *errors)
1336
1337   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
1338   encoded string *s*.  Return ``NULL`` if an exception was raised by the codec.
1339
1340
1341.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
1342
1343   Encode a Unicode object using Raw-Unicode-Escape and return the result as
1344   a bytes object.  Error handling is "strict".  Return ``NULL`` if an exception
1345   was raised by the codec.
1346
1347
1348.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \
1349                              Py_ssize_t size)
1350
1351   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
1352   and return a bytes object.  Return ``NULL`` if an exception was raised by the codec.
1353
1354   .. deprecated-removed:: 3.3 4.0
1355      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1356      :c:func:`PyUnicode_AsRawUnicodeEscapeString` or
1357      :c:func:`PyUnicode_AsEncodedString`.
1358
1359
1360Latin-1 Codecs
1361""""""""""""""
1362
These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.
1365
1366
1367.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
1368
1369   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
1370   *s*.  Return ``NULL`` if an exception was raised by the codec.
1371
1372
1373.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
1374
   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1377   raised by the codec.
1378
1379
1380.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1381
1382   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and
1383   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1384   the codec.
1385
1386   .. deprecated-removed:: 3.3 4.0
1387      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1388      :c:func:`PyUnicode_AsLatin1String` or
1389      :c:func:`PyUnicode_AsEncodedString`.
1390
1391
1392ASCII Codecs
1393""""""""""""
1394
1395These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
1396codes generate errors.
1397
1398
1399.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
1400
1401   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
1402   *s*.  Return ``NULL`` if an exception was raised by the codec.
1403
1404
1405.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
1406
   Encode a Unicode object using ASCII and return the result as a Python bytes
   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1409   raised by the codec.
1410
1411
1412.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1413
1414   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and
1415   return a Python bytes object.  Return ``NULL`` if an exception was raised by
1416   the codec.
1417
1418   .. deprecated-removed:: 3.3 4.0
1419      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1420      :c:func:`PyUnicode_AsASCIIString` or
1421      :c:func:`PyUnicode_AsEncodedString`.
1422
1423
1424Character Map Codecs
1425""""""""""""""""""""
1426
1427This codec is special in that it can be used to implement many different codecs
1428(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.  The mapping objects provided must support the
:meth:`__getitem__` mapping interface; dictionaries and sequences work well.
1432
1433These are the mapping codec APIs:
1434
1435.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
1436                              PyObject *mapping, const char *errors)
1437
   Create a Unicode object by decoding *size* bytes of the encoded string *data*
1439   using the given *mapping* object.  Return ``NULL`` if an exception was raised
1440   by the codec.
1441
1442   If *mapping* is ``NULL``, Latin-1 decoding will be applied.  Else
1443   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
1444   to Unicode strings, integers (which are then interpreted as Unicode
1445   ordinals) or ``None``.  Unmapped data bytes -- ones which cause a
1446   :exc:`LookupError`, as well as ones which get mapped to ``None``,
1447   ``0xFFFE`` or ``'\ufffe'``, are treated as undefined mappings and cause
1448   an error.
1449
1450
1451.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
1452
1453   Encode a Unicode object using the given *mapping* object and return the
1454   result as a bytes object.  Error handling is "strict".  Return ``NULL`` if an
1455   exception was raised by the codec.
1456
1457   The *mapping* object must map Unicode ordinal integers to bytes objects,
1458   integers in the range from 0 to 255 or ``None``.  Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as those mapped to
   ``None`` are treated as "undefined mapping" and cause an error.
1461
1462
1463.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \
1464                              PyObject *mapping, const char *errors)
1465
1466   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
1467   *mapping* object and return the result as a bytes object.  Return ``NULL`` if
1468   an exception was raised by the codec.
1469
1470   .. deprecated-removed:: 3.3 4.0
1471      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1472      :c:func:`PyUnicode_AsCharmapString` or
1473      :c:func:`PyUnicode_AsEncodedString`.
1474
1475
The following codec API is special in that it maps Unicode to Unicode.
1477
1478.. c:function:: PyObject* PyUnicode_Translate(PyObject *unicode, \
1479                              PyObject *mapping, const char *errors)
1480
1481   Translate a Unicode object using the given *mapping* object and return the
1482   resulting Unicode object.  Return ``NULL`` if an exception was raised by the
1483   codec.
1484
1485   The *mapping* object must map Unicode ordinal integers to Unicode strings,
1486   integers (which are then interpreted as Unicode ordinals) or ``None``
1487   (causing deletion of the character).  Unmapped character ordinals (ones
1488   which cause a :exc:`LookupError`) are left untouched and are copied as-is.
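
   As an illustration of the three mapping outcomes (replacement, deletion and
   pass-through), the sketch below uses a hypothetical helper,
   ``upper_a_drop_b``, which is not part of the C API.  It assumes a dictionary
   is used as the mapping object.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: map 'a' to 'A', delete 'b', and leave every
         other character untouched.  Returns a new reference or NULL. */
      static PyObject *
      upper_a_drop_b(PyObject *text)
      {
          PyObject *result = NULL;
          PyObject *key_a = NULL, *val_a = NULL, *key_b = NULL;
          PyObject *mapping = PyDict_New();
          if (mapping == NULL)
              goto done;
          if ((key_a = PyLong_FromLong('a')) == NULL)
              goto done;
          if ((val_a = PyLong_FromLong('A')) == NULL)
              goto done;
          if ((key_b = PyLong_FromLong('b')) == NULL)
              goto done;
          /* An integer value is read as a Unicode ordinal; None deletes. */
          if (PyDict_SetItem(mapping, key_a, val_a) < 0)
              goto done;
          if (PyDict_SetItem(mapping, key_b, Py_None) < 0)
              goto done;

          /* 'c' has no entry, so the lookup fails and 'c' is copied as-is. */
          result = PyUnicode_Translate(text, mapping, NULL);

      done:
          Py_XDECREF(key_b);
          Py_XDECREF(val_a);
          Py_XDECREF(key_a);
          Py_XDECREF(mapping);
          return result;
      }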
1489
1490
1491.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \
1492                              PyObject *mapping, const char *errors)
1493
1494   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
1495   character *mapping* table to it and return the resulting Unicode object.
1496   Return ``NULL`` when an exception was raised by the codec.
1497
1498   .. deprecated-removed:: 3.3 4.0
1499      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_Translate` or the :ref:`generic codec-based API
      <codec-registry>`.
1502
1503
1504MBCS codecs for Windows
1505"""""""""""""""""""""""
1506
1507These are the MBCS codec APIs. They are currently only available on Windows and
1508use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
1509DBCS) is a class of encodings, not just one.  The target encoding is defined by
1510the user settings on the machine running the codec.
1511
1512.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
1513
1514   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
1515   Return ``NULL`` if an exception was raised by the codec.
1516
1517
1518.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
1519                              const char *errors, Py_ssize_t *consumed)
1520
   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.
1525
1526
1527.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
1528
   Encode a Unicode object using MBCS and return the result as a Python bytes
1530   object.  Error handling is "strict".  Return ``NULL`` if an exception was
1531   raised by the codec.
1532
1533
1534.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)
1535
1536   Encode the Unicode object using the specified code page and return a Python
1537   bytes object.  Return ``NULL`` if an exception was raised by the codec. Use
   the :c:data:`CP_ACP` code page to get the MBCS encoder.
1539
1540   .. versionadded:: 3.3
1541
1542
1543.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
1544
1545   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return
1546   a Python bytes object.  Return ``NULL`` if an exception was raised by the
1547   codec.
1548
1549   .. deprecated-removed:: 3.3 4.0
1550      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
1551      :c:func:`PyUnicode_AsMBCSString`, :c:func:`PyUnicode_EncodeCodePage` or
1552      :c:func:`PyUnicode_AsEncodedString`.
1553
1554
1555Methods & Slots
1556"""""""""""""""
1557
1558
1559.. _unicodemethodsandslots:
1560
1561Methods and Slot Functions
1562^^^^^^^^^^^^^^^^^^^^^^^^^^
1563
1564The following APIs are capable of handling Unicode objects and strings on input
1565(we refer to them as strings in the descriptions) and return Unicode objects or
1566integers as appropriate.
1567
1568They all return ``NULL`` or ``-1`` if an exception occurs.
1569
1570
1571.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
1572
   Concatenate two strings, giving a new Unicode string.
1574
1575
1576.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
1577
   Split a string, giving a list of Unicode strings.  If *sep* is ``NULL``,
   splitting will be done at all whitespace substrings.  Otherwise, splits occur
   at the given separator.  At most *maxsplit* splits will be done.  If
   *maxsplit* is negative, no limit is set.  Separators are not included in the
   resulting list.
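
   The effect of *sep* and *maxsplit* can be sketched with a small helper.
   ``count_fields`` is a hypothetical name chosen for this example and is not
   part of the C API.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: split *text* and return the number of pieces,
         or -1 with an exception set.  Pass sep == NULL to split on
         whitespace. */
      static Py_ssize_t
      count_fields(const char *text, const char *sep, Py_ssize_t maxsplit)
      {
          Py_ssize_t n = -1;
          PyObject *sep_obj = NULL;
          PyObject *parts = NULL;
          PyObject *s = PyUnicode_FromString(text);
          if (s == NULL)
              goto done;
          if (sep != NULL) {
              sep_obj = PyUnicode_FromString(sep);
              if (sep_obj == NULL)
                  goto done;
          }
          parts = PyUnicode_Split(s, sep_obj, maxsplit);
          if (parts != NULL)
              n = PyList_Size(parts);

      done:
          Py_XDECREF(parts);
          Py_XDECREF(sep_obj);
          Py_XDECREF(s);
          return n;
      }

   For example, ``count_fields("a,b,,c", ",", -1)`` counts four pieces (one of
   them empty), whereas a *maxsplit* of ``1`` stops after the first separator
   and yields two.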
1582
1583
1584.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
1585
1586   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.
1589
1590
1591.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, \
1592                              const char *errors)
1593
1594   Translate a string by applying a character mapping table to it and return the
1595   resulting Unicode object.
1596
1597   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
1598   or ``None`` (causing deletion of the character).
1599
1600   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
1601   and sequences work well.  Unmapped character ordinals (ones which cause a
1602   :exc:`LookupError`) are left untouched and are copied as-is.
1603
   *errors* has the usual meaning for codecs.  It may be ``NULL``, which indicates
   that the default error handling should be used.
1606
1607
1608.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
1609
1610   Join a sequence of strings using the given *separator* and return the resulting
1611   Unicode string.
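
   A short sketch of joining a two-item list.  The helper name ``join_two`` is
   an assumption made for this example; building the list with
   :c:func:`Py_BuildValue` is just one convenient choice.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: return sep.join([a, b]) as a new reference,
         or NULL with an exception set. */
      static PyObject *
      join_two(const char *sep, const char *a, const char *b)
      {
          PyObject *sep_obj = PyUnicode_FromString(sep);
          if (sep_obj == NULL)
              return NULL;

          /* "[ss]" builds a list holding two str objects. */
          PyObject *items = Py_BuildValue("[ss]", a, b);
          if (items == NULL) {
              Py_DECREF(sep_obj);
              return NULL;
          }

          PyObject *joined = PyUnicode_Join(sep_obj, items);
          Py_DECREF(items);
          Py_DECREF(sep_obj);
          return joined;
      }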
1612
1613
1614.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
1615                        Py_ssize_t start, Py_ssize_t end, int direction)
1616
1617   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
1618   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
1619   ``0`` otherwise. Return ``-1`` if an error occurred.
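
   The two directions can be sketched with a hypothetical wrapper,
   ``tail_match``, which is not part of the C API.  It passes
   ``PY_SSIZE_T_MAX`` as *end* on the assumption that the whole string should
   be considered.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: direction == -1 tests for a prefix,
         direction == 1 for a suffix.  Returns 1, 0, or -1 on error. */
      static int
      tail_match(const char *text, const char *affix, int direction)
      {
          PyObject *s = PyUnicode_FromString(text);
          if (s == NULL)
              return -1;
          PyObject *sub = PyUnicode_FromString(affix);
          if (sub == NULL) {
              Py_DECREF(s);
              return -1;
          }
          Py_ssize_t rc = PyUnicode_Tailmatch(s, sub, 0, PY_SSIZE_T_MAX,
                                              direction);
          Py_DECREF(sub);
          Py_DECREF(s);
          return (int)rc;
      }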
1620
1621
1622.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \
1623                               Py_ssize_t start, Py_ssize_t end, int direction)
1624
1625   Return the first position of *substr* in ``str[start:end]`` using the given
1626   *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a
1627   backward search).  The return value is the index of the first match; a value of
1628   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
1629   occurred and an exception has been set.
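
   The three kinds of return value can be told apart as in the following
   sketch.  ``find_text`` is a hypothetical helper, not part of the C API; it
   searches the whole string by passing ``0`` and ``PY_SSIZE_T_MAX`` as the
   bounds.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: returns the match index, -1 if there is no
         match, or -2 with an exception set. */
      static Py_ssize_t
      find_text(const char *haystack, const char *needle, int direction)
      {
          PyObject *h = PyUnicode_FromString(haystack);
          if (h == NULL)
              return -2;
          PyObject *n = PyUnicode_FromString(needle);
          if (n == NULL) {
              Py_DECREF(h);
              return -2;
          }
          /* direction == 1 searches forward, direction == -1 backward. */
          Py_ssize_t pos = PyUnicode_Find(h, n, 0, PY_SSIZE_T_MAX, direction);
          Py_DECREF(n);
          Py_DECREF(h);
          return pos;
      }

   Searching ``"banana"`` for ``"na"`` gives index 2 forward and index 4
   backward.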
1630
1631
1632.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
1633                               Py_ssize_t start, Py_ssize_t end, int direction)
1634
1635   Return the first position of the character *ch* in ``str[start:end]`` using
1636   the given *direction* (*direction* == ``1`` means to do a forward search,
1637   *direction* == ``-1`` a backward search).  The return value is the index of the
1638   first match; a value of ``-1`` indicates that no match was found, and ``-2``
1639   indicates that an error occurred and an exception has been set.
1640
1641   .. versionadded:: 3.3
1642
1643   .. versionchanged:: 3.7
1644      *start* and *end* are now adjusted to behave like ``str[start:end]``.
1645
1646
1647.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
1648                               Py_ssize_t start, Py_ssize_t end)
1649
1650   Return the number of non-overlapping occurrences of *substr* in
1651   ``str[start:end]``.  Return ``-1`` if an error occurred.
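
   To make "non-overlapping" concrete, the sketch below counts occurrences
   over a whole string.  ``count_text`` is a hypothetical helper name, not part
   of the C API.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: number of non-overlapping occurrences of
         *needle* in *haystack*, or -1 with an exception set. */
      static Py_ssize_t
      count_text(const char *haystack, const char *needle)
      {
          PyObject *h = PyUnicode_FromString(haystack);
          if (h == NULL)
              return -1;
          PyObject *n = PyUnicode_FromString(needle);
          if (n == NULL) {
              Py_DECREF(h);
              return -1;
          }
          Py_ssize_t count = PyUnicode_Count(h, n, 0, PY_SSIZE_T_MAX);
          Py_DECREF(n);
          Py_DECREF(h);
          return count;
      }

   ``"banana"`` contains ``"ana"`` at indices 1 and 3, but the two matches
   overlap, so only one is counted.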
1652
1653
1654.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
1655                              PyObject *replstr, Py_ssize_t maxcount)
1656
1657   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
1658   return the resulting Unicode object. *maxcount* == ``-1`` means replace all
1659   occurrences.
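
   A sketch of the effect of *maxcount*, using a hypothetical helper named
   ``replace_text`` (not part of the C API):

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: replace up to *maxcount* occurrences of
         *old_text* with *new_text*.  Returns a new reference or NULL. */
      static PyObject *
      replace_text(const char *text, const char *old_text,
                   const char *new_text, Py_ssize_t maxcount)
      {
          PyObject *result = NULL;
          PyObject *old_obj = NULL, *new_obj = NULL;
          PyObject *s = PyUnicode_FromString(text);
          if (s == NULL)
              goto done;
          if ((old_obj = PyUnicode_FromString(old_text)) == NULL)
              goto done;
          if ((new_obj = PyUnicode_FromString(new_text)) == NULL)
              goto done;
          result = PyUnicode_Replace(s, old_obj, new_obj, maxcount);

      done:
          Py_XDECREF(new_obj);
          Py_XDECREF(old_obj);
          Py_XDECREF(s);
          return result;
      }

   With ``"banana"``, ``"a"`` and ``"o"``, a *maxcount* of ``2`` gives
   ``"bonona"`` and a *maxcount* of ``-1`` gives ``"bonono"``.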
1660
1661
1662.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)
1663
1664   Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than,
1665   respectively.
1666
1667   This function returns ``-1`` upon failure, so one should call
1668   :c:func:`PyErr_Occurred` to check for errors.
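
   Because ``-1`` is both a valid result and the failure value, a caller might
   separate the two cases as in this sketch.  ``compare_checked`` is a
   hypothetical helper, not part of the C API, and it assumes no exception is
   pending when it is called.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: store the comparison result in *result and
         return 0, or return -1 with an exception set. */
      static int
      compare_checked(PyObject *left, PyObject *right, int *result)
      {
          int r = PyUnicode_Compare(left, right);
          if (r == -1 && PyErr_Occurred())
              return -1;        /* genuine failure */
          *result = r;          /* -1 here means "less than" */
          return 0;
      }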
1669
1670
1671.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)
1672
1673   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
1674   than, equal, and greater than, respectively. It is best to pass only
1675   ASCII-encoded strings, but the function interprets the input string as
1676   ISO-8859-1 if it contains non-ASCII characters.
1677
1678   This function does not raise exceptions.
1679
1680
1681.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left,  PyObject *right,  int op)
1682
1683   Rich compare two Unicode strings and return one of the following:
1684
1685   * ``NULL`` in case an exception was raised
1686   * :const:`Py_True` or :const:`Py_False` for successful comparisons
1687   * :const:`Py_NotImplemented` in case the type combination is unknown
1688
1689   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
1690   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
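
   The three possible results can be reduced to a C truth value as sketched
   below.  ``strings_equal`` is a hypothetical helper, not part of the C API;
   treating :const:`Py_NotImplemented` as "not equal" is a choice made for this
   example only.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: 1 if equal, 0 if not equal (or if the
         comparison is not implemented), -1 with an exception set. */
      static int
      strings_equal(PyObject *left, PyObject *right)
      {
          PyObject *res = PyUnicode_RichCompare(left, right, Py_EQ);
          if (res == NULL)
              return -1;
          int equal = (res == Py_True);
          Py_DECREF(res);       /* the result is a new reference */
          return equal;
      }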
1691
1692
1693.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
1694
1695   Return a new string object from *format* and *args*; this is analogous to
1696   ``format % args``.
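
   A minimal sketch of the ``format % args`` analogy.  The helper name
   ``describe`` and the format string are assumptions made for this example.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: the C equivalent of
         "%s has %d items" % (name, count).  Returns a new reference or NULL. */
      static PyObject *
      describe(const char *name, int count)
      {
          PyObject *format = PyUnicode_FromString("%s has %d items");
          if (format == NULL)
              return NULL;

          /* "(si)" builds the tuple (str, int). */
          PyObject *args = Py_BuildValue("(si)", name, count);
          if (args == NULL) {
              Py_DECREF(format);
              return NULL;
          }

          PyObject *result = PyUnicode_Format(format, args);
          Py_DECREF(args);
          Py_DECREF(format);
          return result;
      }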
1697
1698
1699.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)
1700
1701   Check whether *element* is contained in *container* and return true or false
1702   accordingly.
1703
1704   *element* has to coerce to a one element Unicode string. ``-1`` is returned
1705   if there was an error.
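
   A short sketch of a membership test with a one-character element.
   ``contains_char`` is a hypothetical helper, not part of the C API.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: 1 if *ch* occurs in *text*, 0 if not,
         -1 with an exception set. */
      static int
      contains_char(const char *text, const char *ch)
      {
          PyObject *container = PyUnicode_FromString(text);
          if (container == NULL)
              return -1;
          PyObject *element = PyUnicode_FromString(ch);
          if (element == NULL) {
              Py_DECREF(container);
              return -1;
          }
          int rc = PyUnicode_Contains(container, element);
          Py_DECREF(element);
          Py_DECREF(container);
          return rc;
      }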
1706
1707
1708.. c:function:: void PyUnicode_InternInPlace(PyObject **string)
1709
1710   Intern the argument *\*string* in place.  The argument must be the address of a
1711   pointer variable pointing to a Python Unicode string object.  If there is an
1712   existing interned string that is the same as *\*string*, it sets *\*string* to
1713   it (decrementing the reference count of the old string object and incrementing
1714   the reference count of the interned string object), otherwise it leaves
1715   *\*string* alone and interns it (incrementing its reference count).
1716   (Clarification: even though there is a lot of talk about reference counts, think
1717   of this function as reference-count-neutral; you own the object after the call
1718   if and only if you owned it before the call.)
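
   The reference-count-neutral behaviour can be sketched as follows.
   ``interned_copies_are_identical`` is a hypothetical helper, not part of the
   C API; it assumes *text* is longer than one character, so that the two
   freshly created strings start out as distinct objects.

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: create two equal strings, intern both, and
         report whether they now refer to the same object.  Returns 1, 0,
         or -1 with an exception set. */
      static int
      interned_copies_are_identical(const char *text)
      {
          PyObject *a = PyUnicode_FromString(text);
          if (a == NULL)
              return -1;
          PyObject *b = PyUnicode_FromString(text);
          if (b == NULL) {
              Py_DECREF(a);
              return -1;
          }

          PyUnicode_InternInPlace(&a);
          PyUnicode_InternInPlace(&b);  /* b may now point to a's object */
          int same = (a == b);

          /* We owned one reference to each variable before the calls and
             still own exactly one to each afterwards. */
          Py_DECREF(a);
          Py_DECREF(b);
          return same;
      }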
1719
1720
1721.. c:function:: PyObject* PyUnicode_InternFromString(const char *v)
1722
1723   A combination of :c:func:`PyUnicode_FromString` and
1724   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
1725   object that has been interned, or a new ("owned") reference to an earlier
1726   interned string object with the same value.
1727