.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Georg Brandl <georg@python.org>

Unicode Objects
^^^^^^^^^^^^^^^

Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
use a variety of representations, in order to allow handling the complete range
of Unicode characters while staying memory efficient. There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object. The :c:type:`Py_UNICODE*` representation is deprecated
and inefficient; it should be avoided in performance- or memory-sensitive
situations.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
  Unicode API. They use the most efficient representation allowed by the
  implementation.

* "legacy" Unicode objects have been created through one of the deprecated
  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
  :c:type:`Py_UNICODE*` representation; you will have to call
  :c:func:`PyUnicode_READY` on them before calling any other API.


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. c:type:: Py_UCS4
            Py_UCS2
            Py_UCS1

   These types are typedefs for unsigned integer types wide enough to contain
   characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with
   single Unicode characters, use :c:type:`Py_UCS4`.

   .. versionadded:: 3.3


.. c:type:: Py_UNICODE

   This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type
   depending on the platform.

   .. versionchanged:: 3.3
      In previous versions, this was a 16-bit type or a 32-bit type depending on
      whether you selected a "narrow" or "wide" Unicode version of Python at
      build time.


.. c:type:: PyASCIIObject
            PyCompactUnicodeObject
            PyUnicodeObject

   These subtypes of :c:type:`PyObject` represent a Python Unicode object. In
   almost all cases, they shouldn't be used directly, since all API functions
   that deal with Unicode objects take and return :c:type:`PyObject` pointers.

   .. versionadded:: 3.3


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.


The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. c:function:: int PyUnicode_READY(PyObject *o)

   Ensure the string object *o* is in the "canonical" representation. This is
   required before using any of the access macros described below.

   .. XXX expand on when it is not required

   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
   particular happens if memory allocation fails.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

   Return the length of the Unicode string, in code points. *o* has to be a
   Unicode object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access. No checks are performed if the
   canonical representation has the correct character size; use
   :c:func:`PyUnicode_KIND` to select the right macro. Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
             PyUnicode_1BYTE_KIND
             PyUnicode_2BYTE_KIND
             PyUnicode_4BYTE_KIND

   Return values of the :c:func:`PyUnicode_KIND` macro.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_KIND(PyObject *o)

   Return one of the PyUnicode kind constants (see above) that indicate how many
   bytes per character this Unicode object uses to store its data. *o* has to
   be a Unicode object in the "canonical" representation (not checked).

   .. XXX document "0" return value?

   .. versionadded:: 3.3


.. c:function:: void* PyUnicode_DATA(PyObject *o)

   Return a void pointer to the raw Unicode buffer. *o* has to be a Unicode
   object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \
                                     Py_UCS4 value)

   Write into a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). This macro does not do any sanity checks and is
   intended for usage in loops. The caller should cache the *kind* value and
   *data* pointer as obtained from other macro calls. *index* is the index in
   the string (starts at 0) and *value* is the new code point value which should
   be written to that location.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

   Read a code point from a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). No checks or ready calls are performed.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

   Read a character from a Unicode object *o*, which must be in the "canonical"
   representation. This is less efficient than :c:func:`PyUnicode_READ` if you
   do multiple consecutive reads.

   .. versionadded:: 3.3


.. c:function:: PyUnicode_MAX_CHAR_VALUE(PyObject *o)

   Return the maximum code point that is suitable for creating another string
   based on *o*, which must be in the "canonical" representation. This is
   always an approximation but more efficient than iterating over the string.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units). *o* has to be a
   Unicode object (not checked).

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
   bytes. *o* has to be a Unicode object (not checked).

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object.
   The returned buffer is always terminated with an extra null code point. It
   may also contain embedded null code points, which would cause the string
   to be truncated when used in most C functions. The ``AS_DATA`` form
   casts the pointer to :c:type:`const char *`. The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
      This macro is now inefficient -- because in many cases the
      :c:type:`Py_UNICODE` representation does not exist and needs to be created
      -- and can fail (return ``NULL`` with an exception set). Try to port the
      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style Unicode API, please migrate to using the
      :c:func:`PyUnicode_nBYTE_DATA` family of macros.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.


.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable. (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.

   .. deprecated:: 3.3
      This function uses simple case mappings.
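
As a rough illustration of how the access macros and the character property
macros documented so far can be combined, the sketch below counts the
whitespace code points in a string. ``count_whitespace`` is a hypothetical
helper name, not part of the C API, and the code assumes a CPython version
(3.3 or later) whose headers still provide all of the macros used here.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper (not part of the C API): count the whitespace
    * code points in a Unicode object.  Returns -1 with an exception set
    * on error. */
   static Py_ssize_t
   count_whitespace(PyObject *str)
   {
       if (!PyUnicode_Check(str)) {
           PyErr_SetString(PyExc_TypeError, "expected a str object");
           return -1;
       }
       /* Make sure the canonical representation exists before using
        * any of the access macros. */
       if (PyUnicode_READY(str) < 0) {
           return -1;
       }

       /* Cache kind, data pointer and length once, outside the loop. */
       int kind = PyUnicode_KIND(str);
       const void *data = PyUnicode_DATA(str);
       Py_ssize_t length = PyUnicode_GET_LENGTH(str);
       Py_ssize_t count = 0;

       for (Py_ssize_t i = 0; i < length; i++) {
           Py_UCS4 ch = PyUnicode_READ(kind, data, i);
           if (Py_UNICODE_ISSPACE(ch)) {
               count++;
           }
       }
       return count;
   }

Caching *kind* and *data* outside the loop follows the advice given for
:c:func:`PyUnicode_WRITE`; calling :c:func:`PyUnicode_READ_CHAR` on every
iteration would also work but is described above as less efficient for
consecutive reads.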


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


These APIs can be used to work with surrogates:

.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)

   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)

   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).

.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)

   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)

   Join two surrogate characters and return a single Py_UCS4 value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.


Creating and accessing Unicode strings
""""""""""""""""""""""""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

   Create a new Unicode object. *maxchar* should be the true maximum code point
   to be placed in the string. As an approximation, it can be rounded up to the
   nearest value in the sequence 127, 255, 65535, 1114111.

   This is the recommended way to allocate a new Unicode object. Objects
   created using this function are not resizable.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
                                                    Py_ssize_t size)

   Create a new Unicode object with the given *kind* (possible values are
   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
   :c:func:`PyUnicode_KIND`). The *buffer* must point to an array of *size*
   units of 1, 2 or 4 bytes per character, as given by the kind.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be
   interpreted as being UTF-8 encoded. The buffer is copied into the new
   object. If the buffer is not ``NULL``, the return value might be a shared
   object, i.e. modification of the data is not allowed.

   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
   with the buffer set to ``NULL``. This usage is deprecated in favor of
   :c:func:`PyUnicode_New`.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python Unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   ASCII-encoded string. The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+----------------------------------+
   | Format Characters | Type                | Comment                          |
   +===================+=====================+==================================+
   | :attr:`%%`        | *n/a*               | The literal % character.         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%c`        | int                 | A single character,              |
   |                   |                     | represented as a C int.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%d`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%d")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%u`        | unsigned int        | Equivalent to                    |
   |                   |                     | ``printf("%u")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%ld`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%ld")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%li`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%li")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
   |                   |                     | ``printf("%lu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lld`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lld")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lli`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lli")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
   |                   |                     | ``printf("%llu")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zd")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zi`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zi")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zu`       | size_t              | Equivalent to                    |
   |                   |                     | ``printf("%zu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%i`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%i")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%x`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%x")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%s`        | const char\*        | A null-terminated C character    |
   |                   |                     | array.                           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%p`        | const void\*        | The hex representation of a C    |
   |                   |                     | pointer. Mostly equivalent to    |
   |                   |                     | ``printf("%p")`` except that     |
   |                   |                     | it is guaranteed to start with   |
   |                   |                     | the literal ``0x`` regardless    |
   |                   |                     | of what the platform's           |
   |                   |                     | ``printf`` yields.               |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling            |
   |                   |                     | :func:`ascii`.                   |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
   |                   | const char\*        | ``NULL``) and a null-terminated  |
   |                   |                     | C character array as a second    |
   |                   |                     | parameter (which will be used,   |
   |                   |                     | if the first parameter is        |
   |                   |                     | ``NULL``).                       |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Str`.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Repr`.         |
   +-------------------+---------------------+----------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. note::
      The width formatter unit is number of characters rather than bytes.
      The precision formatter unit is number of bytes for ``"%s"`` and
      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
      (if the ``PyObject*`` argument is not ``NULL``).

   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
      zu, i, x): the 0-conversion flag has effect even when a precision is given.

   .. versionchanged:: 3.2
      Support for ``"%lld"`` and ``"%llu"`` added.

   .. versionchanged:: 3.3
      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.

   .. versionchanged:: 3.4
      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
      ``"%V"``, ``"%S"``, ``"%R"`` added.


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
                               const char *encoding, const char *errors)

   Decode an encoded object *obj* to a Unicode object.

   :class:`bytes`, :class:`bytearray` and other
   :term:`bytes-like objects <bytes-like object>`
   are decoded according to the given *encoding* and using the error handling
   defined by *errors*. Both can be ``NULL`` to have the interface use the default
   values (see :ref:`builtincodecs` for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns ``NULL`` if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)

   Return the length of the Unicode object, in code points.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
                                                    Py_ssize_t to_start, \
                                                    PyObject *from, \
                                                    Py_ssize_t from_start, \
                                                    Py_ssize_t how_many)

   Copy characters from one Unicode object into another. This function performs
   character conversion when necessary and falls back to :c:func:`memcpy` if
   possible. Returns ``-1`` and sets an exception on error, otherwise returns
   the number of copied characters.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
                        Py_ssize_t length, Py_UCS4 fill_char)

   Fill a string with a character: write *fill_char* into
   ``unicode[start:start+length]``.

   Fail if *fill_char* is bigger than the string maximum character, or if the
   string has more than 1 reference.

   Return the number of written characters, or return ``-1`` and raise an
   exception on error.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
                                        Py_UCS4 character)

   Write a character to a string. The string must have been created through
   :c:func:`PyUnicode_New`.
   Since Unicode strings are supposed to be immutable,
   the string must not be shared, or have been hashed yet.

   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)

   Read a character from a string. This function checks that *unicode* is a
   Unicode object and the index is not out of bounds, in contrast to the macro
   version :c:func:`PyUnicode_READ_CHAR`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
                                              Py_ssize_t end)

   Return a substring of *str*, from character index *start* (included) to
   character index *end* (excluded). Negative indices are not supported.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
                                          Py_ssize_t buflen, int copy_null)

   Copy the string *u* into a UCS4 buffer, including a null character, if
   *copy_null* is set. Returns ``NULL`` and sets an exception on error (in
   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
   *u*). *buffer* is returned on success.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)

   Copy the string *u* into a new UCS4 buffer that is allocated using
   :c:func:`PyMem_Malloc`. If this fails, ``NULL`` is returned with a
   :exc:`MemoryError` set. The returned buffer always has an extra
   null code point appended.

   .. versionadded:: 3.3


Deprecated Py_UNICODE APIs
""""""""""""""""""""""""""

.. deprecated-removed:: 3.3 4.0

These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them, as they will not be removed in Python
3.x, but need to be aware that their use can now cause performance and memory hits.


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be ``NULL`` which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object.

   If the buffer is not ``NULL``, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when
   *u* is ``NULL``.

   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
   string content has been filled before using any of the access macros such as
   :c:func:`PyUnicode_KIND`.

   Please migrate to using :c:func:`PyUnicode_FromKindAndData`,
   :c:func:`PyUnicode_FromWideChar` or :c:func:`PyUnicode_New`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
   :c:type:`Py_UNICODE*` representation of the object if it is not yet
   available. The buffer is always terminated with an extra null code point.
   Note that the resulting :c:type:`Py_UNICODE` string may also contain
   embedded null code points, which would cause the string to be truncated when
   used in most C functions.

   Please migrate to using :c:func:`PyUnicode_AsUCS4`,
   :c:func:`PyUnicode_AsWideChar`, :c:func:`PyUnicode_ReadChar` or similar new
   APIs.


.. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size)

   Create a Unicode object by replacing all decimal digits in
   :c:type:`Py_UNICODE` buffer of the given *size* by ASCII digits 0--9
   according to their decimal value. Return ``NULL`` if an exception occurs.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
   Note that the resulting :c:type:`Py_UNICODE*` string
   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.3


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)

   Create a copy of a Unicode string ending with a null code point. Return ``NULL``
   and raise a :exc:`MemoryError` exception on memory allocation failure,
   otherwise return a newly allocated buffer (use :c:func:`PyMem_Free` to free
   the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
   contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.2

   Please migrate to using :c:func:`PyUnicode_AsUCS4Copy` or similar new APIs.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).

   Please migrate to using :c:func:`PyUnicode_GetLength`.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Copy an instance of a Unicode subtype to a new true Unicode object if
   necessary. If *obj* is already a true Unicode object (not a subtype),
   return the reference with incremented refcount.

   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.


Locale Encoding
"""""""""""""""

The current locale encoding can be used to decode text from the operating
system.

.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
                                                        Py_ssize_t len, \
                                                        const char *errors)

   Decode a string from UTF-8 on Android, or from the current locale encoding
   on other platforms. The supported
   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses ``"strict"`` error handler if
   *errors* is ``NULL``. *str* must end with a null character but
   cannot contain embedded null characters.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the Python UTF-8 mode.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously, :c:func:`Py_DecodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.


.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)

   Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string
   length using :c:func:`strlen`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors)

   Encode a Unicode object to UTF-8 on Android, or to the current locale
   encoding on other platforms. The
   supported error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The encoder uses ``"strict"`` error handler if
   *errors* is ``NULL``. Return a :class:`bytes` object.
   *unicode* cannot contain embedded null characters.

   Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the Python UTF-8 mode.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android. Previously,
      :c:func:`Py_EncodeLocale`
      was used for the ``surrogateescape``, and the current locale encoding was
      used for ``strict``.


File System Encoding
""""""""""""""""""""

To encode and decode file names and other environment strings,
:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler
(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during
argument parsing, the ``"O&"`` converter should be used, passing
:c:func:`PyUnicode_FSConverter` as the conversion function:

.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)

   ParseTuple converter: encode :class:`str` objects -- obtained directly or
   through the :class:`os.PathLike` interface -- to :class:`bytes` using
   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be a :c:type:`PyBytesObject*` which must be released when it is
   no longer used.

   .. versionadded:: 3.1

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.

To decode file names to :class:`str` during argument parsing, the ``"O&"``
converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
conversion function:

.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)

   ParseTuple converter: decode :class:`bytes` objects -- obtained either
   directly or indirectly through the :class:`os.PathLike` interface -- to
   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
   objects are output as-is. *result* must be a :c:type:`PyUnicodeObject*` which
   must be released when it is no longer used.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.


.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

   Decode a string using :c:data:`Py_FileSystemDefaultEncoding` and the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to decode a string
   from the current locale encoding, use
   :c:func:`PyUnicode_DecodeLocaleAndSize`.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)

   Decode a null-terminated string using :c:data:`Py_FileSystemDefaultEncoding`
   and the :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)

   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
   null bytes.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to encode a string
   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

wchar_t Support
"""""""""""""""

:c:type:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the length,
   using wcslen.
   Return ``NULL`` on failure.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
   null termination character). Return the number of :c:type:`wchar_t` characters
   copied or ``-1`` in case of an error. Note that the resulting :c:type:`wchar_t*`
   string may or may not be null-terminated. It is the responsibility of the caller
   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
   required by the application.
   Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.


.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)

   Convert the Unicode object to a wide character string. The output string
   always ends with a null character. If *size* is not ``NULL``, write the number
   of wide characters (excluding the trailing null termination character) into
   *\*size*. Note that the resulting :c:type:`wchar_t` string might contain
   null characters, which would cause the string to be truncated when used with
   most C functions. If *size* is ``NULL`` and the :c:type:`wchar_t*` string
   contains null characters, a :exc:`ValueError` is raised.

   Returns a buffer allocated by :c:func:`PyMem_Malloc` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.

   .. versionadded:: 3.2

   .. versionchanged:: 3.7
      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:type:`wchar_t*`
      string contains null characters.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the parameters of the same name of the built-in
:func:`str` string object constructor.

Setting *encoding* to ``NULL`` causes the default encoding to be used,
which is UTF-8. The file system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally.
This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string, on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to ``NULL``, meaning to
use the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from
the following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function. The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
                              const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   bytes object.
   *encoding* and *errors* have the same meaning as the
   parameters of the same name in the Unicode :meth:`~str.encode` method. The codec
   to be used is looked up using the Python codec registry. Return ``NULL`` if an
   exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)

   Return a pointer to the UTF-8 encoding of the Unicode object, and
   store the size of the encoded representation (in bytes) in *size*. The
   *size* argument can be ``NULL``; in this case no size will be stored. The
   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

   In the case of an error, ``NULL`` is returned with an exception set and no
   *size* is stored.

   This caches the UTF-8 representation of the string in the Unicode object, and
   subsequent calls will return a pointer to the same buffer. The caller is not
   responsible for deallocating the buffer.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)

   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF8String`, :c:func:`PyUnicode_AsUTF8AndSize` or
      :c:func:`PyUnicode_AsEncodedString`.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python byte string using the UTF-32 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*.
   Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is not defined, surrogate pairs will be output
   as a single code point.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF32String` or :c:func:`PyUnicode_AsEncodedString`.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.
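The byte order rule described for :c:func:`PyUnicode_DecodeUTF16` can be
illustrated in plain C, without the interpreter. The following is a minimal
sketch under stated assumptions, not CPython's decoder: the helper name
``sketch_utf16_bom`` is hypothetical, and it only covers the BOM check that a
decoder started with ``*byteorder == 0`` would perform. It reports the detected
order with the same ``-1``/``0``/``1`` convention and the number of BOM bytes
to skip.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
 * Mimics the documented rule for *byteorder == 0: if the buffer starts
 * with a UTF-16 BOM, return its byte order (-1 little endian, 1 big
 * endian) and set *bom_len to 2 so the caller can skip the BOM.
 * Otherwise return 0 (stay in native order) and set *bom_len to 0. */
int
sketch_utf16_bom(const unsigned char *s, size_t size, size_t *bom_len)
{
    *bom_len = 0;
    if (size >= 2) {
        if (s[0] == 0xFF && s[1] == 0xFE) {   /* U+FEFF stored little endian */
            *bom_len = 2;
            return -1;
        }
        if (s[0] == 0xFE && s[1] == 0xFF) {   /* U+FEFF stored big endian */
            *bom_len = 2;
            return 1;
        }
    }
    return 0;
}
```

A caller would skip ``bom_len`` bytes before decoding the rest. When
``*byteorder`` is ``-1`` or ``1`` no such check is made, which is why a leading
BOM then appears in the decoded output.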


.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python byte string using the UTF-16 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF16String` or :c:func:`PyUnicode_AsEncodedString`.
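The UTF-16 notes above mention that one value may be represented as a
surrogate pair. As a minimal plain-C sketch of that arithmetic (not CPython's
encoder; the helper name ``sketch_utf16_units`` is hypothetical), the function
below splits a code point into one or two 16-bit code units. Rejecting lone
surrogate code points is a choice made for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
 * Splits a code point into UTF-16 code units. Returns the number of
 * units written to out (1 or 2), or 0 if cp is outside the Unicode
 * range or is itself a surrogate code point. */
int
sketch_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        /* Code points in the Basic Multilingual Plane fit in one unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Subtract 0x10000, then split the remaining 20 bits into two halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low (trail) surrogate */
    return 2;
}
```

For instance, U+1F600 is split into the units ``0xD83D`` and ``0xDE00``, while
U+00E9 stays a single unit.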


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error. Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \
                              int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64. If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64. Both are set to zero for the
   Python "utf-7" codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return ``NULL`` if an exception was raised by the codec.
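Unicode-Escape data uses the escape syntax of Python string literals, for
example ``\u00e9`` for U+00E9. To make that concrete, here is a minimal plain-C
sketch that parses a single ``\uXXXX`` escape. It is an illustration only, not
CPython's decoder, and the helper name ``sketch_parse_u_escape`` is
hypothetical; the real codec also handles other escapes such as ``\n``,
``\xXX`` and ``\UXXXXXXXX``.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
 * Parses exactly one "\uXXXX" escape at the start of s.
 * Returns 0 and stores the code point in *cp on success,
 * or -1 if the input is too short or malformed. */
int
sketch_parse_u_escape(const char *s, size_t size, uint32_t *cp)
{
    uint32_t value = 0;
    size_t i;

    if (size < 6 || s[0] != '\\' || s[1] != 'u') {
        return -1;
    }
    for (i = 2; i < 6; i++) {
        char c = s[i];
        uint32_t digit;

        if (c >= '0' && c <= '9') {
            digit = (uint32_t)(c - '0');
        }
        else if (c >= 'a' && c <= 'f') {
            digit = (uint32_t)(c - 'a' + 10);
        }
        else if (c >= 'A' && c <= 'F') {
            digit = (uint32_t)(c - 'A' + 10);
        }
        else {
            return -1;   /* not a hexadecimal digit */
        }
        value = (value << 4) | digit;
    }
    *cp = value;
    return 0;
}
```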


.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a
   bytes object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
   return a bytes object. Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUnicodeEscapeString`.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   a bytes object. Error handling is "strict". Return ``NULL`` if an exception
   was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \
                              Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
   and return a bytes object. Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsRawUnicodeEscapeString` or
      :c:func:`PyUnicode_AsEncodedString`.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.


.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsLatin1String` or
      :c:func:`PyUnicode_AsEncodedString`.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsASCIIString` or
      :c:func:`PyUnicode_AsEncodedString`.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters. The mapping objects provided must support the
:meth:`__getitem__` mapping interface; dictionaries and sequences work well.

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *data*
   using the given *mapping* object. Return ``NULL`` if an exception was raised
   by the codec.

   If *mapping* is ``NULL``, Latin-1 decoding will be applied. Otherwise,
   *mapping* must map byte ordinals (integers in the range from 0 to 255)
   to Unicode strings, integers (which are then interpreted as Unicode
   ordinals) or ``None``. Unmapped data bytes -- ones which cause a
   :exc:`LookupError`, as well as ones which get mapped to ``None``,
   ``0xFFFE`` or ``'\ufffe'`` -- are treated as undefined mappings and cause
   an error.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the
   result as a bytes object.
   Error handling is "strict". Return ``NULL`` if an
   exception was raised by the codec.

   The *mapping* object must map Unicode ordinal integers to bytes objects,
   integers in the range from 0 to 255 or ``None``. Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as those mapped to
   ``None`` are treated as "undefined mapping" and cause an error.


.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return the result as a bytes object. Return ``NULL`` if
   an exception was raised by the codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsCharmapString` or
      :c:func:`PyUnicode_AsEncodedString`.


The following codec API is special in that it maps Unicode to Unicode.

.. c:function:: PyObject* PyUnicode_Translate(PyObject *unicode, \
                              PyObject *mapping, const char *errors)

   Translate a Unicode object using the given *mapping* object and return the
   resulting Unicode object. Return ``NULL`` if an exception was raised by the
   codec.

   The *mapping* object must map Unicode ordinal integers to Unicode strings,
   integers (which are then interpreted as Unicode ordinals) or ``None``
   (causing deletion of the character). Unmapped character ordinals (ones
   which cause a :exc:`LookupError`) are left untouched and are copied as-is.


.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character *mapping* table to it and return the resulting Unicode object.
1496 Return ``NULL`` when an exception was raised by the codec. 1497 1498 .. deprecated-removed:: 3.3 4.0 1499 Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using 1500 :c:func:`PyUnicode_Translate`. or :ref:`generic codec based API 1501 <codec-registry>` 1502 1503 1504MBCS codecs for Windows 1505""""""""""""""""""""""" 1506 1507These are the MBCS codec APIs. They are currently only available on Windows and 1508use the Win32 MBCS converters to implement the conversions. Note that MBCS (or 1509DBCS) is a class of encodings, not just one. The target encoding is defined by 1510the user settings on the machine running the codec. 1511 1512.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors) 1513 1514 Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*. 1515 Return ``NULL`` if an exception was raised by the codec. 1516 1517 1518.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \ 1519 const char *errors, Py_ssize_t *consumed) 1520 1521 If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If 1522 *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode 1523 trailing lead byte and the number of bytes that have been decoded will be stored 1524 in *consumed*. 1525 1526 1527.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode) 1528 1529 Encode a Unicode object using MBCS and return the result as Python bytes 1530 object. Error handling is "strict". Return ``NULL`` if an exception was 1531 raised by the codec. 1532 1533 1534.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors) 1535 1536 Encode the Unicode object using the specified code page and return a Python 1537 bytes object. Return ``NULL`` if an exception was raised by the codec. Use 1538 :c:data:`CP_ACP` code page to get the MBCS encoder. 1539 1540 .. versionadded:: 3.3 1541 1542 1543.. 
c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors) 1544 1545 Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return 1546 a Python bytes object. Return ``NULL`` if an exception was raised by the 1547 codec. 1548 1549 .. deprecated-removed:: 3.3 4.0 1550 Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using 1551 :c:func:`PyUnicode_AsMBCSString`, :c:func:`PyUnicode_EncodeCodePage` or 1552 :c:func:`PyUnicode_AsEncodedString`. 1553 1554 1555Methods & Slots 1556""""""""""""""" 1557 1558 1559.. _unicodemethodsandslots: 1560 1561Methods and Slot Functions 1562^^^^^^^^^^^^^^^^^^^^^^^^^^ 1563 1564The following APIs are capable of handling Unicode objects and strings on input 1565(we refer to them as strings in the descriptions) and return Unicode objects or 1566integers as appropriate. 1567 1568They all return ``NULL`` or ``-1`` if an exception occurs. 1569 1570 1571.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right) 1572 1573 Concat two strings giving a new Unicode string. 1574 1575 1576.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit) 1577 1578 Split a string giving a list of Unicode strings. If *sep* is ``NULL``, splitting 1579 will be done at all whitespace substrings. Otherwise, splits occur at the given 1580 separator. At most *maxsplit* splits will be done. If negative, no limit is 1581 set. Separators are not included in the resulting list. 1582 1583 1584.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend) 1585 1586 Split a Unicode string at line breaks, returning a list of Unicode strings. 1587 CRLF is considered to be one line break. If *keepend* is ``0``, the Line break 1588 characters are not included in the resulting strings. 1589 1590 1591.. 
c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, \ 1592 const char *errors) 1593 1594 Translate a string by applying a character mapping table to it and return the 1595 resulting Unicode object. 1596 1597 The mapping table must map Unicode ordinal integers to Unicode ordinal integers 1598 or ``None`` (causing deletion of the character). 1599 1600 Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries 1601 and sequences work well. Unmapped character ordinals (ones which cause a 1602 :exc:`LookupError`) are left untouched and are copied as-is. 1603 1604 *errors* has the usual meaning for codecs. It may be ``NULL`` which indicates to 1605 use the default error handling. 1606 1607 1608.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq) 1609 1610 Join a sequence of strings using the given *separator* and return the resulting 1611 Unicode string. 1612 1613 1614.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \ 1615 Py_ssize_t start, Py_ssize_t end, int direction) 1616 1617 Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end 1618 (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match), 1619 ``0`` otherwise. Return ``-1`` if an error occurred. 1620 1621 1622.. c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \ 1623 Py_ssize_t start, Py_ssize_t end, int direction) 1624 1625 Return the first position of *substr* in ``str[start:end]`` using the given 1626 *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a 1627 backward search). The return value is the index of the first match; a value of 1628 ``-1`` indicates that no match was found, and ``-2`` indicates that an error 1629 occurred and an exception has been set. 1630 1631 1632.. 
c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \
                        Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of the character *ch* in ``str[start:end]`` using
   the given *direction* (*direction* == ``1`` means to do a forward search,
   *direction* == ``-1`` a backward search).  The return value is the index of the
   first match; a value of ``-1`` indicates that no match was found, and ``-2``
   indicates that an error occurred and an exception has been set.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      *start* and *end* are now adjusted to behave like ``str[start:end]``.


.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \
                        Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.


.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \
                              PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object.  *maxcount* == ``-1`` means replace all
   occurrences.


.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return ``-1``, ``0``, or ``1`` for less than, equal,
   and greater than, respectively.

   This function returns ``-1`` upon failure, so one should call
   :c:func:`PyErr_Occurred` to check for errors.


.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

   Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, or
   ``1`` for less than, equal, and greater than, respectively.  It is best to
   pass only ASCII-encoded strings, but the function interprets the input string
   as ISO-8859-1 if it contains non-ASCII characters.

   This function does not raise exceptions.


.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich-compare two Unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.


.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string.  ``-1`` is returned
   if there was an error.


.. c:function:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument *\*string* in place.  The argument must be the address of a
   pointer variable pointing to a Python Unicode string object.  If there is an
   existing interned string that is the same as *\*string*, it sets *\*string* to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object); otherwise, it leaves
   *\*string* alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. 
c:function:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :c:func:`PyUnicode_FromString` and
   :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string
   object that has been interned, or a new ("owned") reference to an earlier
   interned string object with the same value.