.. _categorical:

{{ header }}

****************
Categorical data
****************

This is an introduction to the pandas categorical data type, including a short comparison
with R's ``factor``.

``Categoricals`` are a pandas data type corresponding to categorical variables in
statistics. A categorical variable takes on a limited, and usually fixed,
number of possible values (``categories``; ``levels`` in R). Examples are gender,
social class, blood type, country affiliation, observation time or rating via
Likert scales.

In contrast to statistical categorical variables, categorical data might have an order (e.g.
'strongly agree' vs 'agree' or 'first observation' vs. 'second observation'), but numerical
operations (additions, divisions, ...) are not possible.

All values of categorical data are either in ``categories`` or ``np.nan``. Order is defined by
the order of ``categories``, not lexical order of the values. Internally, the data structure
consists of a ``categories`` array and an integer array of ``codes`` which point to the real value in
the ``categories`` array.

The categorical data type is useful in the following cases:

* A string variable consisting of only a few different values. Converting such a string
  variable to a categorical variable will save some memory, see :ref:`here <categorical.memory>`.
* The lexical order of a variable is not the same as the logical order ("one", "two", "three").
  By converting to a categorical and specifying an order on the categories, sorting and
  min/max will use the logical order instead of the lexical order, see :ref:`here <categorical.sort>`.
* As a signal to other Python libraries that this column should be treated as a categorical
  variable (e.g. to use suitable statistical methods or plot types).

See also the :ref:`API docs on categoricals<api.arrays.categorical>`.

.. _categorical.objectcreation:

Object creation
---------------

Series creation
~~~~~~~~~~~~~~~

Categorical ``Series`` or columns in a ``DataFrame`` can be created in several ways:

By specifying ``dtype="category"`` when constructing a ``Series``:

.. ipython:: python

    s = pd.Series(["a", "b", "c", "a"], dtype="category")
    s

By converting an existing ``Series`` or column to a ``category`` dtype:

.. ipython:: python

    df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
    df["B"] = df["A"].astype("category")
    df

By using special functions, such as :func:`~pandas.cut`, which groups data into
discrete bins. See the :ref:`example on tiling <reshaping.tile.cut>` in the docs.

.. ipython:: python

    df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
    labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

    df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
    df.head(10)

By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it to a ``DataFrame``:

.. ipython:: python

    raw_cat = pd.Categorical(
        ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
    )
    s = pd.Series(raw_cat)
    s
    df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
    df["B"] = raw_cat
    df

Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:

.. ipython:: python

    df.dtypes

DataFrame creation
~~~~~~~~~~~~~~~~~~

Similar to the previous section where a single column was converted to categorical, all columns in a
``DataFrame`` can be batch converted to categorical either during or after construction.

This can be done during construction by specifying ``dtype="category"`` in the ``DataFrame`` constructor:

.. ipython:: python

    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
    df.dtypes

Note that the categories present in each column differ; the conversion is done column by column, so
only labels present in a given column are categories:

.. ipython:: python

    df["A"]
    df["B"]


Analogously, all columns in an existing ``DataFrame`` can be batch converted using :meth:`DataFrame.astype`:

.. ipython:: python

    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
    df_cat = df.astype("category")
    df_cat.dtypes

This conversion is likewise done column by column:

.. ipython:: python

    df_cat["A"]
    df_cat["B"]


Controlling behavior
~~~~~~~~~~~~~~~~~~~~

In the examples above where we passed ``dtype='category'``, we used the default
behavior:

1. Categories are inferred from the data.
2. Categories are unordered.

To control those behaviors, instead of passing ``'category'``, use an instance
of :class:`~pandas.api.types.CategoricalDtype`.

.. ipython:: python

    from pandas.api.types import CategoricalDtype

    s = pd.Series(["a", "b", "c", "a"])
    cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
    s_cat = s.astype(cat_type)
    s_cat

Similarly, a ``CategoricalDtype`` can be used with a ``DataFrame`` to ensure that categories
are consistent among all columns.

.. ipython:: python

    from pandas.api.types import CategoricalDtype

    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
    cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
    df_cat = df.astype(cat_type)
    df_cat["A"]
    df_cat["B"]

.. note::

    To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
    categories for each column, the ``categories`` parameter can be determined programmatically by
    ``categories = pd.unique(df.to_numpy().ravel())``.

If you already have ``codes`` and ``categories``, you can use the
:func:`~pandas.Categorical.from_codes` constructor to skip the factorize step
that the normal constructor performs:

.. ipython:: python

    splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])
    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))


Regaining original data
~~~~~~~~~~~~~~~~~~~~~~~

To get back to the original ``Series`` or NumPy array, use
``Series.astype(original_dtype)`` or ``np.asarray(categorical)``:

.. ipython:: python

    s = pd.Series(["a", "b", "c", "a"])
    s
    s2 = s.astype("category")
    s2
    s2.astype(str)
    np.asarray(s2)

.. note::

    In contrast to R's ``factor`` function, categorical data does not convert input values to
    strings; categories will end up with the same data type as the original values.

.. note::

    In contrast to R's ``factor`` function, there is currently no way to assign/change labels at
    creation time. Use ``categories`` to change the categories after creation time.

.. _categorical.categoricaldtype:

CategoricalDtype
----------------

A categorical's type is fully described by

1. ``categories``: a sequence of unique values and no missing values
2. ``ordered``: a boolean

This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
The ``categories`` argument is optional, which implies that the actual categories
should be inferred from whatever is present in the data when the
:class:`pandas.Categorical` is created. The categories are assumed to be unordered
by default.

.. ipython:: python

    from pandas.api.types import CategoricalDtype

    CategoricalDtype(["a", "b", "c"])
    CategoricalDtype(["a", "b", "c"], ordered=True)
    CategoricalDtype()

A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
expects a ``dtype``. For example :func:`pandas.read_csv`,
:func:`pandas.DataFrame.astype`, or in the ``Series`` constructor.

.. note::

    As a convenience, you can use the string ``'category'`` in place of a
    :class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
    the categories being unordered, and equal to the set of values present in the
    array. In other words, ``dtype='category'`` is equivalent to
    ``dtype=CategoricalDtype()``.

Equality semantics
~~~~~~~~~~~~~~~~~~

Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
whenever they have the same categories and order. When comparing two
unordered categoricals, the order of the ``categories`` is not considered.

.. ipython:: python

    c1 = CategoricalDtype(["a", "b", "c"], ordered=False)

    # Equal, since order is not considered when ordered=False
    c1 == CategoricalDtype(["b", "c", "a"], ordered=False)

    # Unequal, since the second CategoricalDtype is ordered
    c1 == CategoricalDtype(["a", "b", "c"], ordered=True)

All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.

.. ipython:: python

    c1 == "category"

.. warning::

    Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
    and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
    all instances of ``CategoricalDtype`` compare equal to a
    ``CategoricalDtype(None, False)``, regardless of ``categories`` or
    ``ordered``.
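Because ``==`` ignores the order of ``categories`` for unordered dtypes, code that
cares about the exact category order may want a stricter check. The following is a
minimal sketch of one way to do that; ``same_categories_and_order`` is a hypothetical
helper written for this example, not part of pandas.

.. code-block:: python

    from pandas.api.types import CategoricalDtype


    def same_categories_and_order(left, right):
        """Hypothetical strict check: same categories, same order, same ``ordered`` flag."""
        if left.categories is None or right.categories is None:
            # An unspecified dtype only matches another unspecified dtype here.
            return left.categories is right.categories and left.ordered == right.ordered
        return left.ordered == right.ordered and list(left.categories) == list(
            right.categories
        )


    c1 = CategoricalDtype(["a", "b", "c"], ordered=False)
    c2 = CategoricalDtype(["b", "c", "a"], ordered=False)

    loose = c1 == c2  # ``==`` ignores category order for unordered dtypes
    strict = same_categories_and_order(c1, c2)  # the helper does not
    print(loose, strict)

The helper compares the category lists element by element, so it is only meant for
dtypes with a modest number of categories.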

Description
-----------

Using :meth:`~DataFrame.describe` on categorical data will produce similar
output to a ``Series`` or ``DataFrame`` of type ``string``.

.. ipython:: python

    cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
    df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
    df.describe()
    df["cat"].describe()

.. _categorical.cat:

Working with categories
-----------------------

Categorical data has a ``categories`` and an ``ordered`` property, which list the
possible values and whether the ordering matters or not. These properties are
exposed as ``s.cat.categories`` and ``s.cat.ordered``. If you don't manually
specify categories and ordering, they are inferred from the passed arguments.

.. ipython:: python

    s = pd.Series(["a", "b", "c", "a"], dtype="category")
    s.cat.categories
    s.cat.ordered

It's also possible to pass in the categories in a specific order:

.. ipython:: python

    s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
    s.cat.categories
    s.cat.ordered

.. note::

    New categorical data are **not** automatically ordered. You must explicitly
    pass ``ordered=True`` to indicate an ordered ``Categorical``.


.. note::

    The result of :meth:`~Series.unique` is not always the same as ``Series.cat.categories``,
    because ``Series.unique()`` has a couple of guarantees, namely that it returns categories
    in the order of appearance, and it only includes values that are actually present.

    .. ipython:: python

        s = pd.Series(list("babc")).astype(CategoricalDtype(list("abcd")))
        s

        # categories
        s.cat.categories

        # uniques
        s.unique()

Renaming categories
~~~~~~~~~~~~~~~~~~~

Renaming categories is done by assigning new values to the
``Series.cat.categories`` property or by using the
:meth:`~pandas.Categorical.rename_categories` method:


.. ipython:: python

    s = pd.Series(["a", "b", "c", "a"], dtype="category")
    s
    s.cat.categories = ["Group %s" % g for g in s.cat.categories]
    s
    s = s.cat.rename_categories([1, 2, 3])
    s
    # You can also pass a dict-like object to map the renaming
    s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
    s

.. note::

    In contrast to R's ``factor``, categorical data can have categories of other types than string.

.. note::

    Be aware that assigning new categories is an inplace operation, while most other operations
    under ``Series.cat`` by default return a new ``Series`` of dtype ``category``.

Categories must be unique or a ``ValueError`` is raised:

.. ipython:: python

    try:
        s.cat.categories = [1, 1, 1]
    except ValueError as e:
        print("ValueError:", str(e))

Categories must also not be ``NaN`` or a ``ValueError`` is raised:

.. ipython:: python

    try:
        s.cat.categories = [1, 2, np.nan]
    except ValueError as e:
        print("ValueError:", str(e))

Appending new categories
~~~~~~~~~~~~~~~~~~~~~~~~

Appending categories can be done by using the
:meth:`~pandas.Categorical.add_categories` method:

.. ipython:: python

    s = s.cat.add_categories([4])
    s.cat.categories
    s

Removing categories
~~~~~~~~~~~~~~~~~~~

Removing categories can be done by using the
:meth:`~pandas.Categorical.remove_categories` method. Values which are removed
are replaced by ``np.nan``:

.. ipython:: python

    s = s.cat.remove_categories([4])
    s

Removing unused categories
~~~~~~~~~~~~~~~~~~~~~~~~~~

Removing unused categories can also be done:

.. ipython:: python

    s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
    s
    s.cat.remove_unused_categories()

Setting categories
~~~~~~~~~~~~~~~~~~

If you want to remove and add new categories in one step (which has some
speed advantage), or simply set the categories to a predefined scale,
use :meth:`~pandas.Categorical.set_categories`.


.. ipython:: python

    s = pd.Series(["one", "two", "four", "-"], dtype="category")
    s
    s = s.cat.set_categories(["one", "two", "three", "four"])
    s

.. note::
    Be aware that :func:`Categorical.set_categories` cannot know whether some category is omitted
    intentionally or because it is misspelled or (under Python3) due to a type difference (e.g.,
    NumPy S1 dtype and Python strings). This can result in surprising behaviour!

Sorting and order
-----------------

.. _categorical.sort:

If categorical data is ordered (``s.cat.ordered == True``), then the order of the categories has a
meaning and certain operations are possible. If the categorical is unordered, ``.min()/.max()`` will raise a ``TypeError``.

.. ipython:: python

    s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
    s.sort_values(inplace=True)
    s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
    s.sort_values(inplace=True)
    s
    s.min(), s.max()

You can set categorical data to be ordered by using ``as_ordered()`` or unordered by using ``as_unordered()``. These will by
default return a *new* object.

.. ipython:: python

    s.cat.as_ordered()
    s.cat.as_unordered()

Sorting will use the order defined by categories, not any lexical order present on the data type.
This is even true for strings and numeric data:

.. ipython:: python

    s = pd.Series([1, 2, 3, 1], dtype="category")
    s = s.cat.set_categories([2, 3, 1], ordered=True)
    s
    s.sort_values(inplace=True)
    s
    s.min(), s.max()


Reordering
~~~~~~~~~~

Reordering the categories is possible via the :meth:`Categorical.reorder_categories` and
the :meth:`Categorical.set_categories` methods. For :meth:`Categorical.reorder_categories`, all
old categories must be included in the new categories and no new categories are allowed. This will
necessarily make the sort order the same as the categories order.

.. ipython:: python

    s = pd.Series([1, 2, 3, 1], dtype="category")
    s = s.cat.reorder_categories([2, 3, 1], ordered=True)
    s
    s.sort_values(inplace=True)
    s
    s.min(), s.max()

.. note::

    Note the difference between assigning new categories and reordering the categories: the first
    renames categories and therefore the individual values in the ``Series``, but if the first
    position was sorted last, the renamed value will still be sorted last. Reordering means that the
    way values are sorted is different afterwards, but not that individual values in the
    ``Series`` are changed.

.. note::

    If the ``Categorical`` is not ordered, :meth:`Series.min` and :meth:`Series.max` will raise
    ``TypeError``. Numeric operations like ``+``, ``-``, ``*``, ``/`` and operations based on them
    (e.g. :meth:`Series.median`, which would need to compute the mean between two values if the length
    of an array is even) do not work and raise a ``TypeError``.

Multi column sorting
~~~~~~~~~~~~~~~~~~~~

A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns.
The ordering of the categorical is determined by the ``categories`` of that column.

.. ipython:: python

    dfs = pd.DataFrame(
        {
            "A": pd.Categorical(
                list("bbeebbaa"),
                categories=["e", "a", "b"],
                ordered=True,
            ),
            "B": [1, 2, 1, 2, 2, 1, 2, 1],
        }
    )
    dfs.sort_values(by=["A", "B"])

Reordering the ``categories`` changes a future sort.

.. ipython:: python

    dfs["A"] = dfs["A"].cat.reorder_categories(["a", "b", "e"])
    dfs.sort_values(by=["A", "B"])

Comparisons
-----------

Comparing categorical data with other objects is possible in three cases:

* Comparing equality (``==`` and ``!=``) to a list-like object (list, Series, array,
  ...) of the same length as the categorical data.
* All comparisons (``==``, ``!=``, ``>``, ``>=``, ``<``, and ``<=``) of categorical data to
  another categorical Series, when ``ordered==True`` and the ``categories`` are the same.
* All comparisons of categorical data to a scalar.

All other comparisons, especially "non-equality" comparisons of two categoricals with different
categories or a categorical with any list-like object, will raise a ``TypeError``.

.. note::

    Any "non-equality" comparisons of categorical data with a ``Series``, ``np.array``, ``list`` or
    categorical data with different categories or ordering will raise a ``TypeError`` because custom
    categories ordering could be interpreted in two ways: one with taking into account the
    ordering and one without.

.. ipython:: python

    cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
    cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
    cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

    cat
    cat_base
    cat_base2

Comparing to a categorical with the same categories and ordering or to a scalar works:

.. ipython:: python

    cat > cat_base
    cat > 2

Equality comparisons work with any list-like object of the same length and with scalars:

.. ipython:: python

    cat == cat_base
    cat == np.array([1, 2, 3])
    cat == 2

This doesn't work because the categories are not the same:

.. ipython:: python

    try:
        cat > cat_base2
    except TypeError as e:
        print("TypeError:", str(e))

If you want to do a "non-equality" comparison of a categorical series with a list-like object
which is not categorical data, you need to be explicit and convert the categorical data back to
the original values:

.. ipython:: python

    base = np.array([1, 2, 3])

    try:
        cat > base
    except TypeError as e:
        print("TypeError:", str(e))

    np.asarray(cat) > base

When you compare two unordered categoricals with the same categories, the order is not considered:

.. ipython:: python

    c1 = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=False)
    c2 = pd.Categorical(["a", "b"], categories=["b", "a"], ordered=False)
    c1 == c2

Operations
----------

Apart from :meth:`Series.min`, :meth:`Series.max` and :meth:`Series.mode`, the
following operations are possible with categorical data:

``Series`` methods like :meth:`Series.value_counts` will use all categories,
even if some categories are not present in the data:

.. ipython:: python

    s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
    s.value_counts()

``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories.

.. ipython:: python

    columns = pd.Categorical(
        ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
    )
    df = pd.DataFrame(
        data=[[1, 2, 3], [4, 5, 6]],
        columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
    )
    df.sum(axis=1, level=1)

Groupby will also show "unused" categories:

.. ipython:: python

    cats = pd.Categorical(
        ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
    )
    df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
    df.groupby("cats").mean()

    cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
    df2 = pd.DataFrame(
        {
            "cats": cats2,
            "B": ["c", "d", "c", "d"],
            "values": [1, 2, 3, 4],
        }
    )
    df2.groupby(["cats", "B"]).mean()


Pivot tables:

.. ipython:: python

    raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
    df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
    pd.pivot_table(df, values="values", index=["A", "B"])

Data munging
------------

The optimized pandas data access methods ``.loc``, ``.iloc``, ``.at``, and ``.iat``
work as normal. The only difference is the return type (for getting) and
that only values already in ``categories`` can be assigned.

Getting
~~~~~~~

If the slicing operation returns either a ``DataFrame`` or a column of type
``Series``, the ``category`` dtype is preserved.

.. ipython:: python

    idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
    cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
    values = [1, 2, 2, 2, 3, 4, 5]
    df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
    df.iloc[2:4, :]
    df.iloc[2:4, :].dtypes
    df.loc["h":"j", "cats"]
    df[df["cats"] == "b"]

An example where the category type is not preserved is if you take one single
row: the resulting ``Series`` is of dtype ``object``:

.. ipython:: python

    # get the complete "h" row as a Series
    df.loc["h", :]

Returning a single item from categorical data will also return the value, not a categorical
of length "1".

.. ipython:: python

    df.iat[0, 0]
    df["cats"].cat.categories = ["x", "y", "z"]
    df.at["h", "cats"]  # returns a string

.. note::
    This is in contrast to R's ``factor`` function, where ``factor(c(1,2,3))[1]``
    returns a single value ``factor``.

To get a single value ``Series`` of type ``category``, you pass in a list with
a single value:

.. ipython:: python

    df.loc[["h"], "cats"]

String and datetime accessors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The accessors ``.dt`` and ``.str`` will work if the ``s.cat.categories`` are of
an appropriate type:


.. ipython:: python

    str_s = pd.Series(list("aabb"))
    str_cat = str_s.astype("category")
    str_cat
    str_cat.str.contains("a")

    date_s = pd.Series(pd.date_range("1/1/2015", periods=5))
    date_cat = date_s.astype("category")
    date_cat
    date_cat.dt.day

.. note::

    The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the
    ``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type (and not of
    type ``category``!).
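As a small illustration of the note above, the sketch below applies a ``.str`` method to
a categorical ``Series`` and then converts the result back to ``category`` explicitly.
The variable names ``upper`` and ``upper_cat`` are chosen for this example only.

.. code-block:: python

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    str_cat = pd.Series(list("aabb"), dtype="category")

    # The accessor result is a regular Series, not a categorical one.
    upper = str_cat.str.upper()

    # Convert explicitly if a categorical result is wanted.
    upper_cat = upper.astype("category")
    print(upper.dtype, upper_cat.dtype)

Converting back infers the new categories from the transformed values, so any ordering
or unused categories of the original would have to be set again by hand.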

That means that the returned values from methods and properties on the accessors of a
``Series`` and the returned values from methods and properties on the accessors of this
``Series`` transformed to one of type ``category`` will be equal:

.. ipython:: python

    ret_s = str_s.str.contains("a")
    ret_cat = str_cat.str.contains("a")
    ret_s.dtype == ret_cat.dtype
    ret_s == ret_cat

.. note::

    The work is done on the ``categories`` and then a new ``Series`` is constructed. This has
    some performance implication if you have a ``Series`` of type string, where lots of elements
    are repeated (i.e. the number of unique elements in the ``Series`` is a lot smaller than the
    length of the ``Series``). In this case it can be faster to convert the original ``Series``
    to one of type ``category`` and use ``.str.<method>`` or ``.dt.<property>`` on that.

Setting
~~~~~~~

Setting values in a categorical column (or ``Series``) works as long as the
value is included in the ``categories``:

.. ipython:: python

    idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
    cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"], categories=["a", "b"])
    values = [1, 1, 1, 1, 1, 1, 1]
    df = pd.DataFrame({"cats": cats, "values": values}, index=idx)

    df.iloc[2:4, :] = [["b", 2], ["b", 2]]
    df
    try:
        df.iloc[2:4, :] = [["c", 3], ["c", 3]]
    except ValueError as e:
        print("ValueError:", str(e))

Setting values by assigning categorical data will also check that the ``categories`` match:

.. ipython:: python

    df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])
    df
    try:
        df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
    except ValueError as e:
        print("ValueError:", str(e))

Assigning a ``Categorical`` to parts of a column of other types will use the values:

.. ipython:: python

    df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
    df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
    df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
    df
    df.dtypes

.. _categorical.merge:
.. _categorical.concat:

Merging / concatenation
~~~~~~~~~~~~~~~~~~~~~~~

By default, combining ``Series`` or ``DataFrames`` which contain the same
categories results in ``category`` dtype, otherwise results will depend on the
dtype of the underlying categories. Merges that result in non-categorical
dtypes will likely have higher memory usage. Use ``.astype`` or
``union_categoricals`` to ensure ``category`` results.

.. ipython:: python

    from pandas.api.types import union_categoricals

    # same categories
    s1 = pd.Series(["a", "b"], dtype="category")
    s2 = pd.Series(["a", "b", "a"], dtype="category")
    pd.concat([s1, s2])

    # different categories
    s3 = pd.Series(["b", "c"], dtype="category")
    pd.concat([s1, s3])

    # Output dtype is inferred based on categories values
    int_cats = pd.Series([1, 2], dtype="category")
    float_cats = pd.Series([3.0, 4.0], dtype="category")
    pd.concat([int_cats, float_cats])

    pd.concat([s1, s3]).astype("category")
    union_categoricals([s1.array, s3.array])

The following table summarizes the results of merging ``Categoricals``:

+-------------------+------------------------+----------------------+-----------------------------+
| arg1              | arg2                   | identical            | result                      |
+===================+========================+======================+=============================+
| category          | category               | True                 | category                    |
+-------------------+------------------------+----------------------+-----------------------------+
| category (object) | category (object)      | False                | object (dtype is inferred)  |
+-------------------+------------------------+----------------------+-----------------------------+
| category (int)    | category (float)       | False                | float (dtype is inferred)   |
+-------------------+------------------------+----------------------+-----------------------------+

See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about
preserving merge dtypes and performance.

.. _categorical.union:

Unioning
~~~~~~~~

If you want to combine categoricals that do not necessarily have the same
categories, the :func:`~pandas.api.types.union_categoricals` function will
combine a list-like of categoricals. The new categories will be the union of
the categories being combined.

.. ipython:: python

    from pandas.api.types import union_categoricals

    a = pd.Categorical(["b", "c"])
    b = pd.Categorical(["a", "b"])
    union_categoricals([a, b])

By default, the resulting categories will be ordered as
they appear in the data. If you want the categories to
be lexsorted, use the ``sort_categories=True`` argument.

.. ipython:: python

    union_categoricals([a, b], sort_categories=True)

``union_categoricals`` also works with the "easy" case of combining two
categoricals of the same categories and order information
(e.g. what you could also ``append`` for).

.. ipython:: python

    a = pd.Categorical(["a", "b"], ordered=True)
    b = pd.Categorical(["a", "b", "a"], ordered=True)
    union_categoricals([a, b])

The below raises ``TypeError`` because the categories are ordered and not identical.

.. code-block:: ipython

    In [1]: a = pd.Categorical(["a", "b"], ordered=True)
    In [2]: b = pd.Categorical(["a", "b", "c"], ordered=True)
    In [3]: union_categoricals([a, b])
    Out[3]:
    TypeError: to union ordered Categoricals, all categories must be the same

Ordered categoricals with different categories or orderings can be combined by
using the ``ignore_order=True`` argument.

.. ipython:: python

    a = pd.Categorical(["a", "b", "c"], ordered=True)
    b = pd.Categorical(["c", "b", "a"], ordered=True)
    union_categoricals([a, b], ignore_order=True)

:func:`~pandas.api.types.union_categoricals` also works with a
``CategoricalIndex``, or ``Series`` containing categorical data, but note that
the resulting array will always be a plain ``Categorical``:

.. ipython:: python

    a = pd.Series(["b", "c"], dtype="category")
    b = pd.Series(["a", "b"], dtype="category")
    union_categoricals([a, b])

.. note::

    ``union_categoricals`` may recode the integer codes for categories
    when combining categoricals. This is likely what you want,
    but if you are relying on the exact numbering of the categories, be
    aware.

    .. ipython:: python

        c1 = pd.Categorical(["b", "c"])
        c2 = pd.Categorical(["a", "b"])

        c1
        # "b" is coded to 0
        c1.codes

        c2
        # "b" is coded to 1
        c2.codes

        c = union_categoricals([c1, c2])
        c
        # "b" is coded to 0 throughout, same as c1, different from c2
        c.codes


Getting data in/out
-------------------

You can write data that contains ``category`` dtypes to a ``HDFStore``.
See :ref:`here <io.hdf5-categorical>` for an example and caveats.

It is also possible to write data to and read data from *Stata* format files.
See :ref:`here <io.stata-categorical>` for an example and caveats.
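As a rough sketch of the *Stata* round trip, the following writes a small frame with a
categorical column to a temporary file and reads it back. The column names and the file
name ``grades.dta`` are made up for this example, and the linked section remains the
reference for what is and is not preserved.

.. code-block:: python

    import os
    import tempfile

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    df_out = pd.DataFrame(
        {
            "grade": pd.Categorical(["good", "bad", "good"], categories=["bad", "good"]),
            "score": [3, 1, 4],
        }
    )

    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "grades.dta")  # hypothetical file name
        df_out.to_stata(path, write_index=False)
        df_back = pd.read_stata(path)

    # The "grade" column is expected to come back as categorical data.
    print(df_back.dtypes)

Details such as the ``ordered`` flag and the exact integer codes may differ after the
round trip, so check them if your code relies on them.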

Writing to a CSV file will convert the data, effectively removing any information about the
categorical (categories and ordering). So if you read back the CSV file you have to convert the
relevant columns back to ``category`` and assign the right categories and category ordering.

.. ipython:: python

    import io

    s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))
    # rename the categories
    s = s.cat.rename_categories(["very good", "good", "bad"])
    # reorder the categories and add missing categories
    s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
    df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})
    csv = io.StringIO()
    df.to_csv(csv)
    df2 = pd.read_csv(io.StringIO(csv.getvalue()))
    df2.dtypes
    df2["cats"]
    # Redo the category
    df2["cats"] = df2["cats"].astype("category")
    df2["cats"] = df2["cats"].cat.set_categories(
        ["very bad", "bad", "medium", "good", "very good"]
    )
    df2.dtypes
    df2["cats"]

The same holds for writing to a SQL database with ``to_sql``.

Missing data
------------

pandas primarily uses the value ``np.nan`` to represent missing data. It is by
default not included in computations. See the :ref:`Missing Data section
<missing_data>`.

Missing values should **not** be included in the Categorical's ``categories``,
only in the ``values``.
Instead, it is understood that NaN is different, and is always a possibility.
When working with the Categorical's ``codes``, missing values will always have
a code of ``-1``.

.. ipython:: python

    s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
    # only two categories
    s
    s.cat.codes


Methods for working with missing data, e.g. :meth:`~Series.isna`, :meth:`~Series.fillna`,
:meth:`~Series.dropna`, all work normally:

.. ipython:: python

    s = pd.Series(["a", "b", np.nan], dtype="category")
    s
    pd.isna(s)
    s.fillna("a")

Differences to R's ``factor``
-----------------------------

The following differences to R's factor functions can be observed:

* R's ``levels`` are named ``categories``.
* R's ``levels`` are always of type string, while ``categories`` in pandas can be of any dtype.
* It's not possible to specify labels at creation time. Use ``s.cat.rename_categories(new_labels)``
  afterwards.
* In contrast to R's ``factor`` function, using categorical data as the sole input to create a
  new categorical series will *not* remove unused categories but create a new categorical series
  which is equal to the passed in one!
* R allows for missing values to be included in its ``levels`` (pandas' ``categories``). pandas
  does not allow ``NaN`` categories, but missing values can still be in the ``values``.


Gotchas
-------

.. _categorical.rfactor:

Memory usage
~~~~~~~~~~~~

.. _categorical.memory:

The memory usage of a ``Categorical`` is proportional to the number of categories plus the length of the data. In contrast,
the memory usage of an ``object`` dtype is a constant times the length of the data.

.. ipython:: python

    s = pd.Series(["foo", "bar"] * 1000)

    # object dtype
    s.nbytes

    # category dtype
    s.astype("category").nbytes

.. note::

   If the number of categories approaches the length of the data, the ``Categorical`` will use nearly the same or
   more memory than an equivalent ``object`` dtype representation.

   .. ipython:: python

      s = pd.Series(["foo%04d" % i for i in range(2000)])

      # object dtype
      s.nbytes

      # category dtype
      s.astype("category").nbytes


``Categorical`` is not a ``numpy`` array
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently, categorical data and the underlying ``Categorical`` are implemented as a Python
object and not as a low-level NumPy array dtype. This leads to some problems.

NumPy itself doesn't know about the new ``dtype``:

.. ipython:: python

    try:
        np.dtype("category")
    except TypeError as e:
        print("TypeError:", str(e))

    dtype = pd.Categorical(["a"]).dtype
    try:
        np.dtype(dtype)
    except TypeError as e:
        print("TypeError:", str(e))

Dtype comparisons work:

.. ipython:: python

    dtype == np.str_
    np.str_ == dtype

To check if a Series contains Categorical data, use ``hasattr(s, 'cat')``:

.. ipython:: python

    hasattr(pd.Series(["a"], dtype="category"), "cat")
    hasattr(pd.Series(["a"]), "cat")

Using NumPy functions on a ``Series`` of type ``category`` should not work as ``Categoricals``
are not numeric data (even in the case that ``.categories`` is numeric).

.. ipython:: python

    s = pd.Series(pd.Categorical([1, 2, 3, 4]))
    try:
        np.sum(s)
        # same with np.log(s),...
    except TypeError as e:
        print("TypeError:", str(e))

.. note::
    If such a function works, please file a bug at https://github.com/pandas-dev/pandas!

dtype in apply
~~~~~~~~~~~~~~

pandas currently does not preserve the dtype in apply functions: if you apply along rows you get
a ``Series`` of ``object`` ``dtype`` (the same as when getting a row, so getting one element returns a
basic type), and applying along columns will also convert to object. ``NaN`` values are unaffected.
You can use ``fillna`` to handle missing values before applying a function.

.. ipython:: python

    df = pd.DataFrame(
        {
            "a": [1, 2, 3, 4],
            "b": ["a", "b", "c", "d"],
            "cats": pd.Categorical([1, 2, 3, 2]),
        }
    )
    df.apply(lambda row: type(row["cats"]), axis=1)
    df.apply(lambda col: col.dtype, axis=0)

Categorical index
~~~~~~~~~~~~~~~~~

``CategoricalIndex`` is a type of index that is useful for supporting
indexing with duplicates. This is a container around a ``Categorical``
and allows efficient indexing and storage of an index with a large number of duplicated elements.
See the :ref:`advanced indexing docs <indexing.categoricalindex>` for a more detailed
explanation.

Setting the index will create a ``CategoricalIndex``:

.. ipython:: python

    cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])
    strings = ["a", "b", "c", "d"]
    values = [4, 2, 3, 1]
    df = pd.DataFrame({"strings": strings, "values": values}, index=cats)
    df.index
    # This now sorts by the categories order
    df.sort_index()

Side effects
~~~~~~~~~~~~

Constructing a ``Series`` from a ``Categorical`` will not copy the input
``Categorical``. This means that changes to the ``Series`` will in most cases
change the original ``Categorical``:

.. ipython:: python

    cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
    s = pd.Series(cat, name="cat")
    cat
    s.iloc[0:2] = 10
    cat
    df = pd.DataFrame(s)
    df["cat"].cat.categories = [1, 2, 3, 4, 5]
    cat

Use ``copy=True`` to prevent such a behaviour or simply don't reuse ``Categoricals``:

.. ipython:: python

    cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
    s = pd.Series(cat, name="cat", copy=True)
    cat
    s.iloc[0:2] = 10
    cat

.. note::

    This also happens in some cases when you supply a NumPy array instead of a ``Categorical``:
    using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behavior, while using
    a string array (e.g. ``np.array(["a","b","c","a"])``) will not.
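
The defensive pattern from this section can be sketched as follows. This is a minimal
illustration rather than a definitive recipe: it relies only on the ``copy=True`` argument of
the ``Series`` constructor, and the variable names (``s2``, ``arr``) are hypothetical.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Copy the Categorical on construction so the Series owns its own data.
    cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
    s = pd.Series(cat, name="cat", copy=True)
    s.iloc[0:2] = 10

    # The same precaution applied to an integer NumPy array.
    arr = np.array([1, 2, 3, 4])
    s2 = pd.Series(arr, copy=True)
    s2.iloc[0] = 99

    # Inspect the original objects: the assignments above went to the copies.
    print(list(cat))
    print(arr.tolist())

Copying costs extra memory, so it is a trade-off: it is mainly worthwhile when the input
object is reused elsewhere after the ``Series`` has been built.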