.. highlight:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Georg Brandl <georg@python.org>

Unicode Objects
^^^^^^^^^^^^^^^

Since the implementation of :pep:`393` in Python 3.3, Unicode objects internally
use a variety of representations, in order to allow handling the complete range
of Unicode characters while staying memory efficient. There are special cases
for strings where all code points are below 128, 256, or 65536; otherwise, code
points must be below 1114112 (which is the full Unicode range).

:c:type:`Py_UNICODE*` and UTF-8 representations are created on demand and cached
in the Unicode object. The :c:type:`Py_UNICODE*` representation is deprecated
and inefficient.

Due to the transition between the old APIs and the new APIs, Unicode objects
can internally be in two states depending on how they were created:

* "canonical" Unicode objects are all objects created by a non-deprecated
  Unicode API. They use the most efficient representation allowed by the
  implementation.

* "legacy" Unicode objects have been created through one of the deprecated
  APIs (typically :c:func:`PyUnicode_FromUnicode`) and only bear the
  :c:type:`Py_UNICODE*` representation; you will have to call
  :c:func:`PyUnicode_READY` on them before calling any other API.

.. note::
   The "legacy" Unicode object will be removed in Python 3.12, together with
   the deprecated APIs. All Unicode objects will be "canonical" from then on.
   See :pep:`623` for more information.


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. c:type:: Py_UCS4
            Py_UCS2
            Py_UCS1

   These types are typedefs for unsigned integer types wide enough to contain
   characters of 32 bits, 16 bits and 8 bits, respectively.
   When dealing with single Unicode characters, use :c:type:`Py_UCS4`.

   .. versionadded:: 3.3


.. c:type:: Py_UNICODE

   This is a typedef of :c:type:`wchar_t`, which is a 16-bit type or 32-bit type
   depending on the platform.

   .. versionchanged:: 3.3
      In previous versions, this was a 16-bit type or a 32-bit type depending on
      whether you selected a "narrow" or "wide" Unicode version of Python at
      build time.


.. c:type:: PyASCIIObject
            PyCompactUnicodeObject
            PyUnicodeObject

   These subtypes of :c:type:`PyObject` represent a Python Unicode object. In
   almost all cases, they shouldn't be used directly, since all API functions
   that deal with Unicode objects take and return :c:type:`PyObject` pointers.

   .. versionadded:: 3.3


.. c:var:: PyTypeObject PyUnicode_Type

   This instance of :c:type:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``str``.


The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:

.. c:function:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. c:function:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. c:function:: int PyUnicode_READY(PyObject *o)

   Ensure the string object *o* is in the "canonical" representation. This is
   required before using any of the access macros described below.

   .. XXX expand on when it is not required

   Returns ``0`` on success and ``-1`` with an exception set on failure, which in
   particular happens if memory allocation fails.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      This API will be removed with :c:func:`PyUnicode_FromUnicode`.


.. c:function:: Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

   Return the length of the Unicode string, in code points. *o* has to be a
   Unicode object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
                Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
                Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

   Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4
   integer types for direct character access. No checks are performed to verify
   that the canonical representation has the correct character size; use
   :c:func:`PyUnicode_KIND` to select the right macro. Make sure
   :c:func:`PyUnicode_READY` has been called before accessing this.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_WCHAR_KIND
             PyUnicode_1BYTE_KIND
             PyUnicode_2BYTE_KIND
             PyUnicode_4BYTE_KIND

   Return values of the :c:func:`PyUnicode_KIND` macro.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.10 3.12
      ``PyUnicode_WCHAR_KIND`` is deprecated.


.. c:function:: int PyUnicode_KIND(PyObject *o)

   Return one of the PyUnicode kind constants (see above) that indicate how many
   bytes per character this Unicode object uses to store its data. *o* has to
   be a Unicode object in the "canonical" representation (not checked).

   .. XXX document "0" return value?

   .. versionadded:: 3.3


.. c:function:: void* PyUnicode_DATA(PyObject *o)

   Return a void pointer to the raw Unicode buffer. *o* has to be a Unicode
   object in the "canonical" representation (not checked).

   .. versionadded:: 3.3


.. c:function:: void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, \
                                     Py_UCS4 value)

   Write into a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). This macro does not do any sanity checks and is
   intended for usage in loops.
   The caller should cache the *kind* value and
   *data* pointer as obtained from other macro calls. *index* is the index in
   the string (starts at 0) and *value* is the new code point value which should
   be written to that location.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

   Read a code point from a canonical representation *data* (as obtained with
   :c:func:`PyUnicode_DATA`). No checks or ready calls are performed.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

   Read a character from a Unicode object *o*, which must be in the "canonical"
   representation. This is less efficient than :c:func:`PyUnicode_READ` if you
   do multiple consecutive reads.

   .. versionadded:: 3.3


.. c:macro:: PyUnicode_MAX_CHAR_VALUE(o)

   Return the maximum code point that is suitable for creating another string
   based on *o*, which must be in the "canonical" representation. This is
   always an approximation but more efficient than iterating over the string.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.


.. c:function:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units). *o* has to be a
   Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation in
   bytes. *o* has to be a Unicode object (not checked).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
                const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to a :c:type:`Py_UNICODE` representation of the object. The
   returned buffer is always terminated with an extra null code point. It
   may also contain embedded null code points, which would cause the string
   to be truncated when used in most C functions. The ``AS_DATA`` form
   casts the pointer to :c:type:`const char *`. The *o* argument has to be
   a Unicode object (not checked).

   .. versionchanged:: 3.3
      This macro is now inefficient -- because in many cases the
      :c:type:`Py_UNICODE` representation does not exist and needs to be created
      -- and can fail (return ``NULL`` with an exception set). Try to port the
      code to use the new :c:func:`PyUnicode_nBYTE_DATA` macros or use
      :c:func:`PyUnicode_WRITE` or :c:func:`PyUnicode_READ`.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using the
      :c:func:`PyUnicode_nBYTE_DATA` family of macros.


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. c:function:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a whitespace character.


.. c:function:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a lowercase character.


.. c:function:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an uppercase character.


.. c:function:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a titlecase character.


.. c:function:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a linebreak character.


.. c:function:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a decimal character.


.. c:function:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a digit character.


.. c:function:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a numeric character.


.. c:function:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphabetic character.


.. c:function:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is an alphanumeric character.


.. c:function:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

   Return ``1`` or ``0`` depending on whether *ch* is a printable character.
   Nonprintable characters are those characters defined in the Unicode character
   database as "Other" or "Separator", excepting the ASCII space (0x20) which is
   considered printable. (Note that printable characters in this context are
   those which should not be escaped when :func:`repr` is invoked on a string.
   It has no bearing on the handling of strings written to :data:`sys.stdout` or
   :data:`sys.stderr`.)


These APIs can be used for fast direct character conversions:


.. c:function:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.

   .. deprecated:: 3.3
      This function uses simple case mappings.


.. c:function:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. c:function:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. c:function:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.


These APIs can be used to work with surrogates:

.. c:macro:: Py_UNICODE_IS_SURROGATE(ch)

   Check if *ch* is a surrogate (``0xD800 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_IS_HIGH_SURROGATE(ch)

   Check if *ch* is a high surrogate (``0xD800 <= ch <= 0xDBFF``).

.. c:macro:: Py_UNICODE_IS_LOW_SURROGATE(ch)

   Check if *ch* is a low surrogate (``0xDC00 <= ch <= 0xDFFF``).

.. c:macro:: Py_UNICODE_JOIN_SURROGATES(high, low)

   Join two surrogate characters and return a single Py_UCS4 value.
   *high* and *low* are respectively the leading and trailing surrogates in a
   surrogate pair.


Creating and accessing Unicode strings
""""""""""""""""""""""""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. c:function:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)

   Create a new Unicode object. *maxchar* should be the true maximum code point
   to be placed in the string. As an approximation, it can be rounded up to the
   nearest value in the sequence 127, 255, 65535, 1114111.

   This is the recommended way to allocate a new Unicode object. Objects
   created using this function are not resizable.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromKindAndData(int kind, const void *buffer, \
                                                    Py_ssize_t size)

   Create a new Unicode object with the given *kind* (possible values are
   :c:macro:`PyUnicode_1BYTE_KIND` etc., as returned by
   :c:func:`PyUnicode_KIND`). The *buffer* must point to an array of *size*
   units of 1, 2 or 4 bytes per character, as given by the kind.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be
   interpreted as being UTF-8 encoded. The buffer is copied into the new
   object. If the buffer is not ``NULL``, the return value might be a shared
   object, i.e. modification of the data is not allowed.

   If *u* is ``NULL``, this function behaves like :c:func:`PyUnicode_FromUnicode`
   with the buffer set to ``NULL``. This usage is deprecated in favor of
   :c:func:`PyUnicode_New`, and will be removed in Python 3.12.


.. c:function:: PyObject *PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. c:function:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :c:func:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python Unicode string and return
   a string with the values formatted into it.
   The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   ASCII-encoded string. The following format characters are allowed:

   .. % This should be exactly the same as the table in PyErr_Format.
   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.
   .. % Similar comments apply to the %ll width modifier and

   .. tabularcolumns:: |l|l|L|

   +-------------------+---------------------+----------------------------------+
   | Format Characters | Type                | Comment                          |
   +===================+=====================+==================================+
   | :attr:`%%`        | *n/a*               | The literal % character.         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%c`        | int                 | A single character,              |
   |                   |                     | represented as a C int.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%d`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%d")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%u`        | unsigned int        | Equivalent to                    |
   |                   |                     | ``printf("%u")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%ld`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%ld")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%li`       | long                | Equivalent to                    |
   |                   |                     | ``printf("%li")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lu`       | unsigned long       | Equivalent to                    |
   |                   |                     | ``printf("%lu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lld`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lld")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%lli`      | long long           | Equivalent to                    |
   |                   |                     | ``printf("%lli")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%llu`      | unsigned long long  | Equivalent to                    |
   |                   |                     | ``printf("%llu")``. [1]_         |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zd")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zi`       | Py_ssize_t          | Equivalent to                    |
   |                   |                     | ``printf("%zi")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%zu`       | size_t              | Equivalent to                    |
   |                   |                     | ``printf("%zu")``. [1]_          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%i`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%i")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%x`        | int                 | Equivalent to                    |
   |                   |                     | ``printf("%x")``. [1]_           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%s`        | const char\*        | A null-terminated C character    |
   |                   |                     | array.                           |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%p`        | const void\*        | The hex representation of a C    |
   |                   |                     | pointer. Mostly equivalent to    |
   |                   |                     | ``printf("%p")`` except that     |
   |                   |                     | it is guaranteed to start with   |
   |                   |                     | the literal ``0x`` regardless    |
   |                   |                     | of what the platform's           |
   |                   |                     | ``printf`` yields.               |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling            |
   |                   |                     | :func:`ascii`.                   |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%U`        | PyObject\*          | A Unicode object.                |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%V`        | PyObject\*,         | A Unicode object (which may be   |
   |                   | const char\*        | ``NULL``) and a null-terminated  |
   |                   |                     | C character array as a second    |
   |                   |                     | parameter (which will be used,   |
   |                   |                     | if the first parameter is        |
   |                   |                     | ``NULL``).                       |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Str`.          |
   +-------------------+---------------------+----------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling            |
   |                   |                     | :c:func:`PyObject_Repr`.         |
   +-------------------+---------------------+----------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. note::
      The width formatter unit is number of characters rather than bytes.
      The precision formatter unit is number of bytes for ``"%s"`` and
      ``"%V"`` (if the ``PyObject*`` argument is ``NULL``), and a number of
      characters for ``"%A"``, ``"%U"``, ``"%S"``, ``"%R"`` and ``"%V"``
      (if the ``PyObject*`` argument is not ``NULL``).

   .. [1] For integer specifiers (d, u, ld, li, lu, lld, lli, llu, zd, zi,
      zu, i, x): the 0-conversion flag has effect even when a precision is given.

   .. versionchanged:: 3.2
      Support for ``"%lld"`` and ``"%llu"`` added.

   .. versionchanged:: 3.3
      Support for ``"%li"``, ``"%lli"`` and ``"%zi"`` added.

   .. versionchanged:: 3.4
      Support width and precision formatter for ``"%s"``, ``"%A"``, ``"%U"``,
      ``"%V"``, ``"%S"``, ``"%R"`` added.


.. c:function:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :c:func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. c:function:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, \
                               const char *encoding, const char *errors)

   Decode an encoded object *obj* to a Unicode object.

   :class:`bytes`, :class:`bytearray` and other
   :term:`bytes-like objects <bytes-like object>`
   are decoded according to the given *encoding* and using the error handling
   defined by *errors*. Both can be ``NULL`` to have the interface use the default
   values (see :ref:`builtincodecs` for details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns ``NULL`` if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. c:function:: Py_ssize_t PyUnicode_GetLength(PyObject *unicode)

   Return the length of the Unicode object, in code points.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_CopyCharacters(PyObject *to, \
                                                    Py_ssize_t to_start, \
                                                    PyObject *from, \
                                                    Py_ssize_t from_start, \
                                                    Py_ssize_t how_many)

   Copy characters from one Unicode object into another. This function performs
   character conversion when necessary and falls back to :c:func:`memcpy` if
   possible. Returns ``-1`` and sets an exception on error, otherwise returns
   the number of copied characters.

   .. versionadded:: 3.3


.. c:function:: Py_ssize_t PyUnicode_Fill(PyObject *unicode, Py_ssize_t start, \
                        Py_ssize_t length, Py_UCS4 fill_char)

   Fill a string with a character: write *fill_char* into
   ``unicode[start:start+length]``.

   Fail if *fill_char* is bigger than the string maximum character, or if the
   string has more than 1 reference.
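
   The two failure conditions can be avoided by filling a string that was just
   allocated with :c:func:`PyUnicode_New`. The following is only a minimal
   sketch under that assumption; ``make_repeated`` is a hypothetical helper
   name and is not part of the C API:

   .. code-block:: c

      #include <Python.h>

      /* Hypothetical helper: return a new str made of n copies of ch. */
      static PyObject *
      make_repeated(Py_UCS4 ch, Py_ssize_t n)
      {
          /* Passing ch as maxchar guarantees that ch fits into the string. */
          PyObject *s = PyUnicode_New(n, ch);
          if (s == NULL) {
              return NULL;
          }
          /* For n == 0, PyUnicode_New() may return a shared empty string,
             which must not be modified, so the fill is skipped in that case. */
          if (n > 0 && PyUnicode_Fill(s, 0, n, ch) < 0) {
              Py_DECREF(s);
              return NULL;
          }
          return s;
      }

   The new object is filled before any other reference to it exists, which is
   why the single-reference requirement holds here.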

   Return the number of written characters, or return ``-1`` and raise an
   exception on error.

   .. versionadded:: 3.3


.. c:function:: int PyUnicode_WriteChar(PyObject *unicode, Py_ssize_t index, \
                                        Py_UCS4 character)

   Write a character to a string. The string must have been created through
   :c:func:`PyUnicode_New`. Since Unicode strings are supposed to be immutable,
   the string must not be shared, or have been hashed yet.

   This function checks that *unicode* is a Unicode object, that the index is
   not out of bounds, and that the object can be modified safely (i.e. that
   its reference count is one).

   .. versionadded:: 3.3


.. c:function:: Py_UCS4 PyUnicode_ReadChar(PyObject *unicode, Py_ssize_t index)

   Read a character from a string. This function checks that *unicode* is a
   Unicode object and the index is not out of bounds, in contrast to the macro
   version :c:func:`PyUnicode_READ_CHAR`.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_Substring(PyObject *str, Py_ssize_t start, \
                                              Py_ssize_t end)

   Return a substring of *str*, from character index *start* (included) to
   character index *end* (excluded). Negative indices are not supported.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4(PyObject *u, Py_UCS4 *buffer, \
                                          Py_ssize_t buflen, int copy_null)

   Copy the string *u* into a UCS4 buffer, including a null character, if
   *copy_null* is set. Returns ``NULL`` and sets an exception on error (in
   particular, a :exc:`SystemError` if *buflen* is smaller than the length of
   *u*). *buffer* is returned on success.

   .. versionadded:: 3.3


.. c:function:: Py_UCS4* PyUnicode_AsUCS4Copy(PyObject *u)

   Copy the string *u* into a new UCS4 buffer that is allocated using
   :c:func:`PyMem_Malloc`. If this fails, ``NULL`` is returned with a
   :exc:`MemoryError` set.
   The returned buffer always has an extra
   null code point appended.

   .. versionadded:: 3.3


Deprecated Py_UNICODE APIs
""""""""""""""""""""""""""

.. deprecated-removed:: 3.3 3.12

These API functions are deprecated with the implementation of :pep:`393`.
Extension modules can continue using them, as they will not be removed in Python
3.x, but need to be aware that their use can now cause performance and memory hits.


.. c:function:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be ``NULL`` which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object.

   If the buffer is not ``NULL``, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when
   *u* is ``NULL``.

   If the buffer is ``NULL``, :c:func:`PyUnicode_READY` must be called once the
   string content has been filled before using any of the access macros such as
   :c:func:`PyUnicode_KIND`.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_FromKindAndData`, :c:func:`PyUnicode_FromWideChar`, or
      :c:func:`PyUnicode_New`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :c:type:`Py_UNICODE` buffer, or ``NULL`` on error. This will create the
   :c:type:`Py_UNICODE*` representation of the object if it is not yet
   available. The buffer is always terminated with an extra null code point.
   Note that the resulting :c:type:`Py_UNICODE` string may also contain
   embedded null code points, which would cause the string to be truncated when
   used in most C functions.

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
      :c:func:`PyUnicode_ReadChar` or similar new APIs.

   .. deprecated-removed:: 3.3 3.10


.. c:function:: PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size)

   Create a Unicode object by replacing all decimal digits in
   :c:type:`Py_UNICODE` buffer of the given *size* by ASCII digits 0--9
   according to their decimal value. Return ``NULL`` if an exception occurs.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`Py_UNICODE_TODECIMAL`.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeAndSize(PyObject *unicode, Py_ssize_t *size)

   Like :c:func:`PyUnicode_AsUnicode`, but also saves the :c:type:`Py_UNICODE`
   array length (excluding the extra null terminator) in *size*.
   Note that the resulting :c:type:`Py_UNICODE*` string
   may contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.3

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_AsUCS4`, :c:func:`PyUnicode_AsWideChar`,
      :c:func:`PyUnicode_ReadChar` or similar new APIs.


.. c:function:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)

   Create a copy of a Unicode string ending with a null code point. Return ``NULL``
   and raise a :exc:`MemoryError` exception on memory allocation failure,
   otherwise return a newly allocated buffer (use :c:func:`PyMem_Free` to free
   the buffer). Note that the resulting :c:type:`Py_UNICODE*` string may
   contain embedded null code points, which would cause the string to be
   truncated when used in most C functions.

   .. versionadded:: 3.2

   Please migrate to using :c:func:`PyUnicode_AsUCS4Copy` or similar new APIs.


.. c:function:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the size of the deprecated :c:type:`Py_UNICODE` representation, in
   code units (this includes surrogate pairs as 2 units).

   .. deprecated-removed:: 3.3 3.12
      Part of the old-style Unicode API, please migrate to using
      :c:func:`PyUnicode_GET_LENGTH`.


.. c:function:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Copy an instance of a Unicode subtype to a new true Unicode object if
   necessary. If *obj* is already a true Unicode object (not a subtype),
   return the reference with incremented refcount.

   Objects other than Unicode or its subtypes will cause a :exc:`TypeError`.


Locale Encoding
"""""""""""""""

The current locale encoding can be used to decode text from the operating
system.

.. c:function:: PyObject* PyUnicode_DecodeLocaleAndSize(const char *str, \
                                                        Py_ssize_t len, \
                                                        const char *errors)

   Decode a string from UTF-8 on Android and VxWorks, or from the current
   locale encoding on other platforms. The supported
   error handlers are ``"strict"`` and ``"surrogateescape"``
   (:pep:`383`). The decoder uses ``"strict"`` error handler if
   *errors* is ``NULL``. *str* must end with a null character but
   cannot contain embedded null characters.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` to decode a string from
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).

   This function ignores the Python UTF-8 mode.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The function now also uses the current locale encoding for the
      ``surrogateescape`` error handler, except on Android.
Previously, :c:func:`Py_DecodeLocale` 811 was used for the ``surrogateescape``, and the current locale encoding was 812 used for ``strict``. 813 814 815.. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors) 816 817 Similar to :c:func:`PyUnicode_DecodeLocaleAndSize`, but compute the string 818 length using :c:func:`strlen`. 819 820 .. versionadded:: 3.3 821 822 823.. c:function:: PyObject* PyUnicode_EncodeLocale(PyObject *unicode, const char *errors) 824 825 Encode a Unicode object to UTF-8 on Android and VxWorks, or to the current 826 locale encoding on other platforms. The 827 supported error handlers are ``"strict"`` and ``"surrogateescape"`` 828 (:pep:`383`). The encoder uses ``"strict"`` error handler if 829 *errors* is ``NULL``. Return a :class:`bytes` object. *unicode* cannot 830 contain embedded null characters. 831 832 Use :c:func:`PyUnicode_EncodeFSDefault` to encode a string to 833 :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at 834 Python startup). 835 836 This function ignores the Python UTF-8 mode. 837 838 .. seealso:: 839 840 The :c:func:`Py_EncodeLocale` function. 841 842 .. versionadded:: 3.3 843 844 .. versionchanged:: 3.7 845 The function now also uses the current locale encoding for the 846 ``surrogateescape`` error handler, except on Android. Previously, 847 :c:func:`Py_EncodeLocale` 848 was used for the ``surrogateescape``, and the current locale encoding was 849 used for ``strict``. 850 851 852File System Encoding 853"""""""""""""""""""" 854 855To encode and decode file names and other environment strings, 856:c:data:`Py_FileSystemDefaultEncoding` should be used as the encoding, and 857:c:data:`Py_FileSystemDefaultEncodeErrors` should be used as the error handler 858(:pep:`383` and :pep:`529`). To encode file names to :class:`bytes` during 859argument parsing, the ``"O&"`` converter should be used, passing 860:c:func:`PyUnicode_FSConverter` as the conversion function: 861 862.. 
.. c:function:: int PyUnicode_FSConverter(PyObject* obj, void* result)

   ParseTuple converter: encode :class:`str` objects -- obtained directly or
   through the :class:`os.PathLike` interface -- to :class:`bytes` using
   :c:func:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
   *result* must be a :c:type:`PyBytesObject*` which must be released when it is
   no longer used.

   .. versionadded:: 3.1

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.

To decode file names to :class:`str` during argument parsing, the ``"O&"``
converter should be used, passing :c:func:`PyUnicode_FSDecoder` as the
conversion function:

.. c:function:: int PyUnicode_FSDecoder(PyObject* obj, void* result)

   ParseTuple converter: decode :class:`bytes` objects -- obtained either
   directly or indirectly through the :class:`os.PathLike` interface -- to
   :class:`str` using :c:func:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str`
   objects are output as-is. *result* must be a :c:type:`PyUnicodeObject*` which
   must be released when it is no longer used.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Accepts a :term:`path-like object`.


.. c:function:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

   Decode a string using :c:data:`Py_FileSystemDefaultEncoding` and the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to decode a string
   from the current locale encoding, use
   :c:func:`PyUnicode_DecodeLocaleAndSize`.

   .. seealso::

      The :c:func:`Py_DecodeLocale` function.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

.. c:function:: PyObject* PyUnicode_DecodeFSDefault(const char *s)

   Decode a null-terminated string using :c:data:`Py_FileSystemDefaultEncoding`
   and the :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   Use :c:func:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.


.. c:function:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)

   Encode a Unicode object to :c:data:`Py_FileSystemDefaultEncoding` with the
   :c:data:`Py_FileSystemDefaultEncodeErrors` error handler, and return
   :class:`bytes`. Note that the resulting :class:`bytes` object may contain
   null bytes.

   If :c:data:`Py_FileSystemDefaultEncoding` is not set, fall back to the
   locale encoding.

   :c:data:`Py_FileSystemDefaultEncoding` is initialized at startup from the
   locale encoding and cannot be modified later. If you need to encode a string
   to the current locale encoding, use :c:func:`PyUnicode_EncodeLocale`.

   .. seealso::

      The :c:func:`Py_EncodeLocale` function.

   .. versionadded:: 3.2

   .. versionchanged:: 3.6
      Use :c:data:`Py_FileSystemDefaultEncodeErrors` error handler.

wchar_t Support
"""""""""""""""

:c:type:`wchar_t` support for platforms which support it:

.. c:function:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :c:type:`wchar_t` buffer *w* of the given *size*.
   Passing ``-1`` as the *size* indicates that the function must itself compute the length,
   using wcslen.
   Return ``NULL`` on failure.


.. c:function:: Py_ssize_t PyUnicode_AsWideChar(PyObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :c:type:`wchar_t` buffer *w*. At most
   *size* :c:type:`wchar_t` characters are copied (excluding a possibly trailing
   null termination character). Return the number of :c:type:`wchar_t` characters
   copied or ``-1`` in case of an error. Note that the resulting :c:type:`wchar_t*`
   string may or may not be null-terminated. It is the responsibility of the caller
   to make sure that the :c:type:`wchar_t*` string is null-terminated in case this is
   required by the application. Also, note that the :c:type:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.


.. c:function:: wchar_t* PyUnicode_AsWideCharString(PyObject *unicode, Py_ssize_t *size)

   Convert the Unicode object to a wide character string. The output string
   always ends with a null character. If *size* is not ``NULL``, write the number
   of wide characters (excluding the trailing null termination character) into
   *\*size*. Note that the resulting :c:type:`wchar_t` string might contain
   null characters, which would cause the string to be truncated when used with
   most C functions. If *size* is ``NULL`` and the :c:type:`wchar_t*` string
   contains null characters, a :exc:`ValueError` is raised.

   Returns a buffer allocated by :c:func:`PyMem_New` (use
   :c:func:`PyMem_Free` to free it) on success. On error, returns ``NULL``
   and *\*size* is undefined. Raises a :exc:`MemoryError` if memory allocation
   fails.

   .. versionadded:: 3.2

   .. versionchanged:: 3.7
      Raises a :exc:`ValueError` if *size* is ``NULL`` and the :c:type:`wchar_t*`
      string contains null characters.


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, and they
have the same semantics as the ones of the built-in :func:`str` string object
constructor.

Setting *encoding* to ``NULL`` causes the default encoding to be used,
which is UTF-8. The file system calls should use
:c:func:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :c:data:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems, it will be a
pointer to a static string, on others, it will change at run-time
(such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to ``NULL``, meaning to use
the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. c:function:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function. The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, \
                              const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
   *encoding* and *errors* have the same meaning as the parameters of the same
   name in the Unicode :meth:`~str.encode` method. The codec to be used is looked up
   using the Python codec registry. Return ``NULL`` if an exception was raised by
   the codec.


.. c:function:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *encoding, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   bytes object. *encoding* and *errors* have the same meaning as the
   parameters of the same name in the Unicode :meth:`~str.encode` method. The codec
   to be used is looked up using the Python codec registry. Return ``NULL`` if an
   exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF8`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: const char* PyUnicode_AsUTF8AndSize(PyObject *unicode, Py_ssize_t *size)

   Return a pointer to the UTF-8 encoding of the Unicode object, and
   store the size of the encoded representation (in bytes) in *size*. The
   *size* argument can be ``NULL``; in this case no size will be stored. The
   returned buffer always has an extra null byte appended (not included in
   *size*), regardless of whether there are any other null code points.

   In the case of an error, ``NULL`` is returned with an exception set and no
   *size* is stored.

   This caches the UTF-8 representation of the string in the Unicode object, and
   subsequent calls will return a pointer to the same buffer. The caller is not
   responsible for deallocating the buffer.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)

   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

   .. versionadded:: 3.3

   .. versionchanged:: 3.7
      The return type is now ``const char *`` rather than ``char *``.


.. c:function:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF8String`, :c:func:`PyUnicode_AsUTF8AndSize` or
      :c:func:`PyUnicode_AsEncodedString`.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF32`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.
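
The byte order rule described for :c:func:`PyUnicode_DecodeUTF32` can be
modelled in plain C. The sketch below is not CPython's implementation; the
helper name ``sketch_utf32_bom`` is hypothetical and only illustrates how
``*byteorder`` and a leading BOM interact under the rules stated above.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the C API).  Inspects the first four
   bytes of a UTF-32 buffer and applies the documented rule:
     - *byteorder == 0: a leading BOM selects the byte order and is skipped;
     - *byteorder == -1 or 1: the order is fixed, a BOM is ordinary data.
   Returns the number of leading bytes to skip (4 if a BOM was consumed). */
size_t
sketch_utf32_bom(const unsigned char *s, size_t size, int *byteorder)
{
    if (*byteorder != 0 || size < 4) {
        return 0;
    }
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;    /* FF FE 00 00: little-endian BOM */
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;     /* 00 00 FE FF: big-endian BOM */
        return 4;
    }
    return 0;               /* no BOM: keep decoding in native order */
}
```

In this model a caller that passes ``*byteorder == 0`` learns the detected
order from the updated value, which mirrors the "set to the current byte
order" behaviour described above.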

.. c:function:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python byte string using the UTF-32 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is not defined, surrogate pairs will be output
   as a single code point.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF32String` or :c:func:`PyUnicode_AsEncodedString`.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-``NULL``) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-``NULL``, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, *\*byteorder* is set to the current byte order at the end
   of input data.

   If *byteorder* is ``NULL``, the codec starts in native order mode.

   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF16`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python byte string using the UTF-16 encoding in native byte
   order. The string always starts with a BOM mark. Error handling is "strict".
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, \
                              const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*.
   Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If ``Py_UNICODE_WIDE`` is defined, a single :c:type:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :c:type:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUTF16String` or :c:func:`PyUnicode_AsEncodedString`.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeUTF7`. If
   *consumed* is not ``NULL``, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error. Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. c:function:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, \
                              int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given size using UTF-7 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64. If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64. Both are set to zero for the
   Python "utf-7" codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsEncodedString`.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a
   bytes object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Unicode-Escape and
   return a bytes object. Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsUnicodeEscapeString`.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. c:function:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, \
                              Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return ``NULL`` if an exception was raised by the codec.

.. c:function:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   a bytes object. Error handling is "strict". Return ``NULL`` if an exception
   was raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, \
                              Py_ssize_t size)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Raw-Unicode-Escape
   and return a bytes object. Return ``NULL`` if an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsRawUnicodeEscapeString` or
      :c:func:`PyUnicode_AsEncodedString`.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.


.. c:function:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using Latin-1 and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsLatin1String` or
      :c:func:`PyUnicode_AsEncodedString`.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.


.. c:function:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using ASCII and
   return a Python bytes object. Return ``NULL`` if an exception was raised by
   the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsASCIIString` or
      :c:func:`PyUnicode_AsEncodedString`.


Character Map Codecs
""""""""""""""""""""

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters. The mapping objects provided must support the
:meth:`__getitem__` mapping interface; dictionaries and sequences work well.

These are the mapping codec APIs:

.. c:function:: PyObject* PyUnicode_DecodeCharmap(const char *data, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *data*
   using the given *mapping* object. Return ``NULL`` if an exception was raised
   by the codec.

   If *mapping* is ``NULL``, Latin-1 decoding will be applied. Else
   *mapping* must map bytes ordinals (integers in the range from 0 to 255)
   to Unicode strings, integers (which are then interpreted as Unicode
   ordinals) or ``None``. Unmapped data bytes -- ones which cause a
   :exc:`LookupError`, as well as ones which get mapped to ``None``,
   ``0xFFFE`` or ``'\ufffe'`` -- are treated as undefined mappings and cause
   an error.


.. c:function:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the
   result as a bytes object. Error handling is "strict". Return ``NULL`` if an
   exception was raised by the codec.

   The *mapping* object must map Unicode ordinal integers to bytes objects,
   integers in the range from 0 to 255 or ``None``. Unmapped character
   ordinals (ones which cause a :exc:`LookupError`) as well as those mapped to
   ``None`` are treated as "undefined mapping" and cause an error.


.. c:function:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using the given
   *mapping* object and return the result as a bytes object. Return ``NULL`` if
   an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsCharmapString` or
      :c:func:`PyUnicode_AsEncodedString`.
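
To make the "undefined mapping" rule for charmap decoding concrete, here is
a simplified plain-C model. It is an illustration under stated assumptions,
not the codec's implementation: the mapping is reduced to a hypothetical
256-entry table of code points, and the names ``sketch_charmap_decode`` and
``SKETCH_UNDEFINED`` are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sentinel for "no mapping", using the 0xFFFE value named in the text. */
#define SKETCH_UNDEFINED 0xFFFEu

/* Hypothetical helper (not part of the C API).  Decodes `size` bytes
   through a 256-entry table of code points into `out`, which must have
   room for `size` entries.  Returns the number of code points written,
   or -1 at the first byte whose mapping is undefined (the case in which
   a "strict" codec would raise an error). */
ptrdiff_t
sketch_charmap_decode(const unsigned char *data, size_t size,
                      const uint32_t table[256], uint32_t *out)
{
    for (size_t i = 0; i < size; i++) {
        uint32_t code_point = table[data[i]];
        if (code_point == SKETCH_UNDEFINED) {
            return -1;
        }
        out[i] = code_point;
    }
    return (ptrdiff_t)size;
}
```

A real mapping object may also return strings or ``None``, which this
table-based model does not attempt to represent.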


The following codec API is special in that it maps Unicode to Unicode.

.. c:function:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object. Return ``NULL`` if an exception was raised by the
   codec.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well. Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be ``NULL`` which indicates to
   use the default error handling.


.. c:function:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, \
                              PyObject *mapping, const char *errors)

   Translate a :c:type:`Py_UNICODE` buffer of the given *size* by applying a
   character *mapping* table to it and return the resulting Unicode object.
   Return ``NULL`` when an exception was raised by the codec.

   .. deprecated-removed:: 3.3 3.11
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_Translate` or the :ref:`generic codec based API
      <codec-registry>`.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.

.. c:function:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return ``NULL`` if an exception was raised by the codec.


.. c:function:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, \
                              const char *errors, Py_ssize_t *consumed)

   If *consumed* is ``NULL``, behave like :c:func:`PyUnicode_DecodeMBCS`. If
   *consumed* is not ``NULL``, :c:func:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte and the number of bytes that have been decoded will be stored
   in *consumed*.


.. c:function:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object. Error handling is "strict". Return ``NULL`` if an exception was
   raised by the codec.


.. c:function:: PyObject* PyUnicode_EncodeCodePage(int code_page, PyObject *unicode, const char *errors)

   Encode the Unicode object using the specified code page and return a Python
   bytes object. Return ``NULL`` if an exception was raised by the codec. Use
   :c:data:`CP_ACP` code page to get the MBCS encoder.

   .. versionadded:: 3.3


.. c:function:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :c:type:`Py_UNICODE` buffer of the given *size* using MBCS and return
   a Python bytes object. Return ``NULL`` if an exception was raised by the
   codec.

   .. deprecated-removed:: 3.3 4.0
      Part of the old-style :c:type:`Py_UNICODE` API; please migrate to using
      :c:func:`PyUnicode_AsMBCSString`, :c:func:`PyUnicode_EncodeCodePage` or
      :c:func:`PyUnicode_AsEncodedString`.


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return ``NULL`` or ``-1`` if an exception occurs.


.. c:function:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. c:function:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings. If *sep* is ``NULL``, splitting
   will be done at all whitespace substrings. Otherwise, splits occur at the given
   separator. At most *maxsplit* splits will be done. If negative, no limit is
   set. Separators are not included in the resulting list.


.. c:function:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break. If *keepend* is ``0``, the line break
   characters are not included in the resulting strings.


.. c:function:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. c:function:: Py_ssize_t PyUnicode_Tailmatch(PyObject *str, PyObject *substr, \
                        Py_ssize_t start, Py_ssize_t end, int direction)

   Return ``1`` if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == ``-1`` means to do a prefix match, *direction* == ``1`` a suffix match),
   ``0`` otherwise. Return ``-1`` if an error occurred.


c:function:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, \ 1636 Py_ssize_t start, Py_ssize_t end, int direction) 1637 1638 Return the first position of *substr* in ``str[start:end]`` using the given 1639 *direction* (*direction* == ``1`` means to do a forward search, *direction* == ``-1`` a 1640 backward search). The return value is the index of the first match; a value of 1641 ``-1`` indicates that no match was found, and ``-2`` indicates that an error 1642 occurred and an exception has been set. 1643 1644 1645.. c:function:: Py_ssize_t PyUnicode_FindChar(PyObject *str, Py_UCS4 ch, \ 1646 Py_ssize_t start, Py_ssize_t end, int direction) 1647 1648 Return the first position of the character *ch* in ``str[start:end]`` using 1649 the given *direction* (*direction* == ``1`` means to do a forward search, 1650 *direction* == ``-1`` a backward search). The return value is the index of the 1651 first match; a value of ``-1`` indicates that no match was found, and ``-2`` 1652 indicates that an error occurred and an exception has been set. 1653 1654 .. versionadded:: 3.3 1655 1656 .. versionchanged:: 3.7 1657 *start* and *end* are now adjusted to behave like ``str[start:end]``. 1658 1659 1660.. c:function:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, \ 1661 Py_ssize_t start, Py_ssize_t end) 1662 1663 Return the number of non-overlapping occurrences of *substr* in 1664 ``str[start:end]``. Return ``-1`` if an error occurred. 1665 1666 1667.. c:function:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, \ 1668 PyObject *replstr, Py_ssize_t maxcount) 1669 1670 Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and 1671 return the resulting Unicode object. *maxcount* == ``-1`` means replace all 1672 occurrences. 1673 1674 1675.. c:function:: int PyUnicode_Compare(PyObject *left, PyObject *right) 1676 1677 Compare two strings and return ``-1``, ``0``, ``1`` for less than, equal, and greater than, 1678 respectively. 
1679 1680 This function returns ``-1`` upon failure, so one should call 1681 :c:func:`PyErr_Occurred` to check for errors. 1682 1683 1684.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string) 1685 1686 Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less 1687 than, equal, and greater than, respectively. It is best to pass only 1688 ASCII-encoded strings, but the function interprets the input string as 1689 ISO-8859-1 if it contains non-ASCII characters. 1690 1691 This function does not raise exceptions. 1692 1693 1694.. c:function:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op) 1695 1696 Rich compare two Unicode strings and return one of the following: 1697 1698 * ``NULL`` in case an exception was raised 1699 * :const:`Py_True` or :const:`Py_False` for successful comparisons 1700 * :const:`Py_NotImplemented` in case the type combination is unknown 1701 1702 Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`, 1703 :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`. 1704 1705 1706.. c:function:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args) 1707 1708 Return a new string object from *format* and *args*; this is analogous to 1709 ``format % args``. 1710 1711 1712.. c:function:: int PyUnicode_Contains(PyObject *container, PyObject *element) 1713 1714 Check whether *element* is contained in *container* and return true or false 1715 accordingly. 1716 1717 *element* has to coerce to a one element Unicode string. ``-1`` is returned 1718 if there was an error. 1719 1720 1721.. c:function:: void PyUnicode_InternInPlace(PyObject **string) 1722 1723 Intern the argument *\*string* in place. The argument must be the address of a 1724 pointer variable pointing to a Python Unicode string object. 
If there is an 1725 existing interned string that is the same as *\*string*, it sets *\*string* to 1726 it (decrementing the reference count of the old string object and incrementing 1727 the reference count of the interned string object), otherwise it leaves 1728 *\*string* alone and interns it (incrementing its reference count). 1729 (Clarification: even though there is a lot of talk about reference counts, think 1730 of this function as reference-count-neutral; you own the object after the call 1731 if and only if you owned it before the call.) 1732 1733 1734.. c:function:: PyObject* PyUnicode_InternFromString(const char *v) 1735 1736 A combination of :c:func:`PyUnicode_FromString` and 1737 :c:func:`PyUnicode_InternInPlace`, returning either a new Unicode string 1738 object that has been interned, or a new ("owned") reference to an earlier 1739 interned string object with the same value. 1740