1.. _categorical:
2
3{{ header }}
4
5****************
6Categorical data
7****************
8
This is an introduction to the pandas categorical data type, including a short comparison
with R's ``factor``.
11
12``Categoricals`` are a pandas data type corresponding to categorical variables in
13statistics. A categorical variable takes on a limited, and usually fixed,
14number of possible values (``categories``; ``levels`` in R). Examples are gender,
15social class, blood type, country affiliation, observation time or rating via
16Likert scales.
17
In contrast to statistical categorical variables, categorical data might have an order (e.g.
'strongly agree' vs. 'agree' or 'first observation' vs. 'second observation'), but numerical
operations (addition, division, ...) are not possible.
21
22All values of categorical data are either in ``categories`` or ``np.nan``. Order is defined by
23the order of ``categories``, not lexical order of the values. Internally, the data structure
24consists of a ``categories`` array and an integer array of ``codes`` which point to the real value in
25the ``categories`` array.
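
As a minimal sketch of that layout (the sample values are illustrative), the
``codes`` and ``categories`` of a :class:`pandas.Categorical` can be inspected
directly, and the values can be rebuilt by looking each code up in
``categories``. Missing values are stored as the code ``-1``:

.. code-block:: python

    import numpy as np
    import pandas as pd

    cat = pd.Categorical(["b", "a", "b", np.nan], categories=["a", "b", "c"])

    codes = cat.codes            # integer positions into ``categories``
    categories = cat.categories  # the distinct allowed values

    # Rebuild the values by hand: a code of -1 marks a missing value,
    # any other code is a position in ``categories``.
    rebuilt = [categories[code] if code != -1 else np.nan for code in codes]

Note that ``"c"`` is a valid category even though no value uses it.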
26
27The categorical data type is useful in the following cases:
28
29* A string variable consisting of only a few different values. Converting such a string
30  variable to a categorical variable will save some memory, see :ref:`here <categorical.memory>`.
31* The lexical order of a variable is not the same as the logical order ("one", "two", "three").
32  By converting to a categorical and specifying an order on the categories, sorting and
33  min/max will use the logical order instead of the lexical order, see :ref:`here <categorical.sort>`.
34* As a signal to other Python libraries that this column should be treated as a categorical
35  variable (e.g. to use suitable statistical methods or plot types).
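
The second case can be sketched as follows. The variable names and the sample
words are assumptions chosen for illustration; the point is only that sorting
and ``max`` follow the declared category order once the data is an ordered
categorical:

.. code-block:: python

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    words = pd.Series(["two", "three", "one"])

    # Plain strings sort lexically.
    lexical = words.sort_values().tolist()

    # An ordered categorical sorts by the order of its categories.
    size_order = CategoricalDtype(["one", "two", "three"], ordered=True)
    as_cat = words.astype(size_order)
    logical = as_cat.sort_values().tolist()
    largest = as_cat.max()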
36
37See also the :ref:`API docs on categoricals<api.arrays.categorical>`.
38
39.. _categorical.objectcreation:
40
41Object creation
42---------------
43
44Series creation
45~~~~~~~~~~~~~~~
46
47Categorical ``Series`` or columns in a ``DataFrame`` can be created in several ways:
48
49By specifying ``dtype="category"`` when constructing a ``Series``:
50
51.. ipython:: python
52
53    s = pd.Series(["a", "b", "c", "a"], dtype="category")
54    s
55
56By converting an existing ``Series`` or column to a ``category`` dtype:
57
58.. ipython:: python
59
60    df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
61    df["B"] = df["A"].astype("category")
62    df
63
64By using special functions, such as :func:`~pandas.cut`, which groups data into
65discrete bins. See the :ref:`example on tiling <reshaping.tile.cut>` in the docs.
66
67.. ipython:: python
68
69    df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
70    labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
71
72    df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
73    df.head(10)
74
75By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it to a ``DataFrame``.
76
77.. ipython:: python
78
79    raw_cat = pd.Categorical(
80        ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
81    )
82    s = pd.Series(raw_cat)
83    s
84    df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
85    df["B"] = raw_cat
86    df
87
88Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
89
90.. ipython:: python
91
92    df.dtypes
93
94DataFrame creation
95~~~~~~~~~~~~~~~~~~
96
97Similar to the previous section where a single column was converted to categorical, all columns in a
98``DataFrame`` can be batch converted to categorical either during or after construction.
99
100This can be done during construction by specifying ``dtype="category"`` in the ``DataFrame`` constructor:
101
102.. ipython:: python
103
104    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
105    df.dtypes
106
107Note that the categories present in each column differ; the conversion is done column by column, so
108only labels present in a given column are categories:
109
110.. ipython:: python
111
112    df["A"]
113    df["B"]
114
115
116Analogously, all columns in an existing ``DataFrame`` can be batch converted using :meth:`DataFrame.astype`:
117
118.. ipython:: python
119
120    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
121    df_cat = df.astype("category")
122    df_cat.dtypes
123
124This conversion is likewise done column by column:
125
126.. ipython:: python
127
128    df_cat["A"]
129    df_cat["B"]
130
131
132Controlling behavior
133~~~~~~~~~~~~~~~~~~~~
134
135In the examples above where we passed ``dtype='category'``, we used the default
136behavior:
137
1381. Categories are inferred from the data.
1392. Categories are unordered.
140
141To control those behaviors, instead of passing ``'category'``, use an instance
142of :class:`~pandas.api.types.CategoricalDtype`.
143
144.. ipython:: python
145
146    from pandas.api.types import CategoricalDtype
147
148    s = pd.Series(["a", "b", "c", "a"])
149    cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
150    s_cat = s.astype(cat_type)
151    s_cat
152
153Similarly, a ``CategoricalDtype`` can be used with a ``DataFrame`` to ensure that categories
154are consistent among all columns.
155
156.. ipython:: python
157
158    from pandas.api.types import CategoricalDtype
159
160    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
161    cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
162    df_cat = df.astype(cat_type)
163    df_cat["A"]
164    df_cat["B"]
165
166.. note::
167
168    To perform table-wise conversion, where all labels in the entire ``DataFrame`` are used as
169    categories for each column, the ``categories`` parameter can be determined programmatically by
170    ``categories = pd.unique(df.to_numpy().ravel())``.
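
A short sketch of the table-wise conversion described in the note, assuming
the same small ``DataFrame`` used in the examples above. Sorting the labels is
an optional extra step added here so that the category order is predictable:

.. code-block:: python

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

    # Collect every label that occurs anywhere in the table.
    all_labels = pd.unique(df.to_numpy().ravel())

    # Use the same dtype for every column so the categories are shared.
    shared_type = CategoricalDtype(categories=sorted(all_labels))
    df_shared = df.astype(shared_type)

After this conversion both columns carry the categories ``a``, ``b``, ``c`` and
``d``, even though neither column contains all four labels.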
171
If you already have ``codes`` and ``categories``, you can use the
:func:`~pandas.Categorical.from_codes` constructor to skip the factorize step
that the normal constructor performs:
175
176.. ipython:: python
177
178    splitter = np.random.choice([0, 1], 5, p=[0.5, 0.5])
179    s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
180
181
182Regaining original data
183~~~~~~~~~~~~~~~~~~~~~~~
184
185To get back to the original ``Series`` or NumPy array, use
186``Series.astype(original_dtype)`` or ``np.asarray(categorical)``:
187
188.. ipython:: python
189
190    s = pd.Series(["a", "b", "c", "a"])
191    s
192    s2 = s.astype("category")
193    s2
194    s2.astype(str)
195    np.asarray(s2)
196
197.. note::
198
    In contrast to R's ``factor`` function, categorical data does not convert input values to
    strings; categories will end up the same data type as the original values.
201
202.. note::
203
204    In contrast to R's ``factor`` function, there is currently no way to assign/change labels at
205    creation time. Use ``categories`` to change the categories after creation time.
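
The point about category data types can be checked with a small sketch. The
integer values below are illustrative; the categories stay integers, so
converting back to an integer dtype needs no parsing:

.. code-block:: python

    import pandas as pd
    from pandas.api.types import is_integer_dtype

    s_int = pd.Series([3, 1, 3, 2], dtype="category")

    # The inferred categories keep the integer type of the input values.
    categories_are_ints = is_integer_dtype(s_int.cat.categories)

    # Going back to the original dtype restores the original values.
    restored = s_int.astype("int64")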
206
207.. _categorical.categoricaldtype:
208
209CategoricalDtype
210----------------
211
212A categorical's type is fully described by
213
2141. ``categories``: a sequence of unique values and no missing values
2152. ``ordered``: a boolean
216
217This information can be stored in a :class:`~pandas.api.types.CategoricalDtype`.
218The ``categories`` argument is optional, which implies that the actual categories
219should be inferred from whatever is present in the data when the
220:class:`pandas.Categorical` is created. The categories are assumed to be unordered
221by default.
222
223.. ipython:: python
224
225   from pandas.api.types import CategoricalDtype
226
227   CategoricalDtype(["a", "b", "c"])
228   CategoricalDtype(["a", "b", "c"], ordered=True)
229   CategoricalDtype()
230
231A :class:`~pandas.api.types.CategoricalDtype` can be used in any place pandas
232expects a ``dtype``. For example :func:`pandas.read_csv`,
233:func:`pandas.DataFrame.astype`, or in the ``Series`` constructor.
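
For instance, a ``CategoricalDtype`` can be passed per column to
:func:`pandas.read_csv`. This is a minimal sketch; the CSV text and the column
names ``size`` and ``count`` are made up for the example:

.. code-block:: python

    import io

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    csv_text = "size,count\nsmall,3\nlarge,1\nmedium,2\n"
    size_type = CategoricalDtype(["small", "medium", "large"], ordered=True)

    # The column is parsed straight into the requested ordered categorical.
    df = pd.read_csv(io.StringIO(csv_text), dtype={"size": size_type})

    # Sorting follows the category order, not the alphabetical order.
    counts_by_size = df.sort_values("size")["count"].tolist()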
234
235.. note::
236
237    As a convenience, you can use the string ``'category'`` in place of a
238    :class:`~pandas.api.types.CategoricalDtype` when you want the default behavior of
    the categories being unordered, and equal to the set of values present in the
240    array. In other words, ``dtype='category'`` is equivalent to
241    ``dtype=CategoricalDtype()``.
242
243Equality semantics
244~~~~~~~~~~~~~~~~~~
245
246Two instances of :class:`~pandas.api.types.CategoricalDtype` compare equal
247whenever they have the same categories and order. When comparing two
248unordered categoricals, the order of the ``categories`` is not considered.
249
250.. ipython:: python
251
252   c1 = CategoricalDtype(["a", "b", "c"], ordered=False)
253
254   # Equal, since order is not considered when ordered=False
255   c1 == CategoricalDtype(["b", "c", "a"], ordered=False)
256
257   # Unequal, since the second CategoricalDtype is ordered
258   c1 == CategoricalDtype(["a", "b", "c"], ordered=True)
259
260All instances of ``CategoricalDtype`` compare equal to the string ``'category'``.
261
262.. ipython:: python
263
264   c1 == "category"
265
266.. warning::
267
268   Since ``dtype='category'`` is essentially ``CategoricalDtype(None, False)``,
   and since all instances of ``CategoricalDtype`` compare equal to ``'category'``,
270   all instances of ``CategoricalDtype`` compare equal to a
271   ``CategoricalDtype(None, False)``, regardless of ``categories`` or
272   ``ordered``.
273
274Description
275-----------
276
277Using :meth:`~DataFrame.describe` on categorical data will produce similar
278output to a ``Series`` or ``DataFrame`` of type ``string``.
279
280.. ipython:: python
281
282    cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
283    df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})
284    df.describe()
285    df["cat"].describe()
286
287.. _categorical.cat:
288
289Working with categories
290-----------------------
291
Categorical data has a ``categories`` and an ``ordered`` property, which list the
possible values and whether the ordering matters or not. These properties are
294exposed as ``s.cat.categories`` and ``s.cat.ordered``. If you don't manually
295specify categories and ordering, they are inferred from the passed arguments.
296
297.. ipython:: python
298
299    s = pd.Series(["a", "b", "c", "a"], dtype="category")
300    s.cat.categories
301    s.cat.ordered
302
303It's also possible to pass in the categories in a specific order:
304
305.. ipython:: python
306
307    s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))
308    s.cat.categories
309    s.cat.ordered
310
311.. note::
312
313    New categorical data are **not** automatically ordered. You must explicitly
314    pass ``ordered=True`` to indicate an ordered ``Categorical``.
315
316
317.. note::
318
319    The result of :meth:`~Series.unique` is not always the same as ``Series.cat.categories``,
320    because ``Series.unique()`` has a couple of guarantees, namely that it returns categories
321    in the order of appearance, and it only includes values that are actually present.
322
323    .. ipython:: python
324
325         s = pd.Series(list("babc")).astype(CategoricalDtype(list("abcd")))
326         s
327
328         # categories
329         s.cat.categories
330
331         # uniques
332         s.unique()
333
334Renaming categories
335~~~~~~~~~~~~~~~~~~~
336
337Renaming categories is done by assigning new values to the
338``Series.cat.categories`` property or by using the
339:meth:`~pandas.Categorical.rename_categories` method:
340
341
342.. ipython:: python
343
344    s = pd.Series(["a", "b", "c", "a"], dtype="category")
345    s
346    s.cat.categories = ["Group %s" % g for g in s.cat.categories]
347    s
348    s = s.cat.rename_categories([1, 2, 3])
349    s
350    # You can also pass a dict-like object to map the renaming
351    s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})
352    s
353
354.. note::
355
356    In contrast to R's ``factor``, categorical data can have categories of other types than string.
357
358.. note::
359
    Be aware that assigning new categories is an in-place operation, while most other operations
    under ``Series.cat`` by default return a new ``Series`` of dtype ``category``.
362
363Categories must be unique or a ``ValueError`` is raised:
364
365.. ipython:: python
366
367    try:
368        s.cat.categories = [1, 1, 1]
369    except ValueError as e:
370        print("ValueError:", str(e))
371
372Categories must also not be ``NaN`` or a ``ValueError`` is raised:
373
374.. ipython:: python
375
376    try:
377        s.cat.categories = [1, 2, np.nan]
378    except ValueError as e:
379        print("ValueError:", str(e))
380
381Appending new categories
382~~~~~~~~~~~~~~~~~~~~~~~~
383
384Appending categories can be done by using the
385:meth:`~pandas.Categorical.add_categories` method:
386
387.. ipython:: python
388
389    s = s.cat.add_categories([4])
390    s.cat.categories
391    s
392
393Removing categories
394~~~~~~~~~~~~~~~~~~~
395
396Removing categories can be done by using the
:meth:`~pandas.Categorical.remove_categories` method. Values which are removed
are replaced by ``np.nan``:
399
400.. ipython:: python
401
402    s = s.cat.remove_categories([4])
403    s
404
405Removing unused categories
406~~~~~~~~~~~~~~~~~~~~~~~~~~
407
Categories that no value uses can be dropped with ``remove_unused_categories()``:
409
410.. ipython:: python
411
412    s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))
413    s
414    s.cat.remove_unused_categories()
415
416Setting categories
417~~~~~~~~~~~~~~~~~~
418
If you want to remove and add new categories in one step (which has some
speed advantage), or simply set the categories to a predefined scale,
use :meth:`~pandas.Categorical.set_categories`.
422
423
424.. ipython:: python
425
426    s = pd.Series(["one", "two", "four", "-"], dtype="category")
427    s
428    s = s.cat.set_categories(["one", "two", "three", "four"])
429    s
430
431.. note::
432    Be aware that :func:`Categorical.set_categories` cannot know whether some category is omitted
433    intentionally or because it is misspelled or (under Python3) due to a type difference (e.g.,
434    NumPy S1 dtype and Python strings). This can result in surprising behaviour!
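
The pitfall in the note can be illustrated with a small sketch. The misspelled
label ``"fuor"`` is deliberate, and comparing the old and new categories as
sets beforehand is just one possible guard, not something
``set_categories`` does for you:

.. code-block:: python

    import pandas as pd

    s = pd.Series(["one", "two", "four"], dtype="category")
    new_categories = ["one", "two", "fuor"]  # typo: should be "four"

    # Optional guard: list the categories that the new scale would drop.
    dropped = set(s.cat.categories) - set(new_categories)

    # No error is raised; values of the dropped category become missing.
    result = s.cat.set_categories(new_categories)
    missing_mask = result.isna().tolist()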
435
436Sorting and order
437-----------------
438
439.. _categorical.sort:
440
441If categorical data is ordered (``s.cat.ordered == True``), then the order of the categories has a
442meaning and certain operations are possible. If the categorical is unordered, ``.min()/.max()`` will raise a ``TypeError``.
443
444.. ipython:: python
445
446    s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
447    s.sort_values(inplace=True)
448    s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))
449    s.sort_values(inplace=True)
450    s
451    s.min(), s.max()
452
453You can set categorical data to be ordered by using ``as_ordered()`` or unordered by using ``as_unordered()``. These will by
454default return a *new* object.
455
456.. ipython:: python
457
458    s.cat.as_ordered()
459    s.cat.as_unordered()
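
A brief sketch showing that ``as_ordered()`` leaves the original object
untouched, so the result has to be assigned to a name to be used (the variable
names are illustrative):

.. code-block:: python

    import pandas as pd

    s = pd.Series(["a", "b", "c", "a"], dtype="category")

    # The call returns a new Series; ``s`` itself stays unordered.
    s_ordered = s.cat.as_ordered()

    original_is_ordered = s.cat.ordered
    new_is_ordered = s_ordered.cat.ordered
    smallest, largest = s_ordered.min(), s_ordered.max()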
460
461Sorting will use the order defined by categories, not any lexical order present on the data type.
462This is even true for strings and numeric data:
463
464.. ipython:: python
465
466    s = pd.Series([1, 2, 3, 1], dtype="category")
467    s = s.cat.set_categories([2, 3, 1], ordered=True)
468    s
469    s.sort_values(inplace=True)
470    s
471    s.min(), s.max()
472
473
474Reordering
475~~~~~~~~~~
476
477Reordering the categories is possible via the :meth:`Categorical.reorder_categories` and
478the :meth:`Categorical.set_categories` methods. For :meth:`Categorical.reorder_categories`, all
479old categories must be included in the new categories and no new categories are allowed. This will
480necessarily make the sort order the same as the categories order.
481
482.. ipython:: python
483
484    s = pd.Series([1, 2, 3, 1], dtype="category")
485    s = s.cat.reorder_categories([2, 3, 1], ordered=True)
486    s
487    s.sort_values(inplace=True)
488    s
489    s.min(), s.max()
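
The restriction on ``reorder_categories`` can be seen in the following sketch,
which uses the same data as above. Leaving out a category is rejected by
``reorder_categories``, whereas ``set_categories`` accepts the shorter list and
turns the affected values into missing values:

.. code-block:: python

    import pandas as pd

    s = pd.Series([1, 2, 3, 1], dtype="category")

    # Category 1 is missing from the new order, so this raises ValueError.
    try:
        s.cat.reorder_categories([2, 3])
        reorder_error = None
    except ValueError as exc:
        reorder_error = str(exc)

    # set_categories allows it; the values equal to 1 become NaN.
    trimmed = s.cat.set_categories([2, 3])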
490
491.. note::
492
493    Note the difference between assigning new categories and reordering the categories: the first
494    renames categories and therefore the individual values in the ``Series``, but if the first
495    position was sorted last, the renamed value will still be sorted last. Reordering means that the
496    way values are sorted is different afterwards, but not that individual values in the
497    ``Series`` are changed.
498
499.. note::
500
501    If the ``Categorical`` is not ordered, :meth:`Series.min` and :meth:`Series.max` will raise
502    ``TypeError``. Numeric operations like ``+``, ``-``, ``*``, ``/`` and operations based on them
503    (e.g. :meth:`Series.median`, which would need to compute the mean between two values if the length
504    of an array is even) do not work and raise a ``TypeError``.
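
A small sketch of the behavior described in the note, using an unordered
categorical with integer categories. Converting back to a numeric dtype, as
shown at the end, is one way to do arithmetic on the underlying values:

.. code-block:: python

    import pandas as pd

    s = pd.Series([1, 2, 3, 1], dtype="category")

    errors = {}

    # Arithmetic is not defined for categorical data.
    try:
        s + 1
    except TypeError as exc:
        errors["add"] = str(exc)

    # min/max need an ordered categorical.
    try:
        s.max()
    except TypeError as exc:
        errors["max"] = str(exc)

    # Convert to a numeric dtype first if numeric operations are needed.
    total = s.astype("int64").sum()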
505
506Multi column sorting
507~~~~~~~~~~~~~~~~~~~~
508
509A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns.
510The ordering of the categorical is determined by the ``categories`` of that column.
511
512.. ipython:: python
513
514   dfs = pd.DataFrame(
515       {
516           "A": pd.Categorical(
517               list("bbeebbaa"),
518               categories=["e", "a", "b"],
519               ordered=True,
520           ),
521           "B": [1, 2, 1, 2, 2, 1, 2, 1],
522       }
523   )
524   dfs.sort_values(by=["A", "B"])
525
526Reordering the ``categories`` changes a future sort.
527
528.. ipython:: python
529
530   dfs["A"] = dfs["A"].cat.reorder_categories(["a", "b", "e"])
531   dfs.sort_values(by=["A", "B"])
532
533Comparisons
534-----------
535
536Comparing categorical data with other objects is possible in three cases:
537
538* Comparing equality (``==`` and ``!=``) to a list-like object (list, Series, array,
539  ...) of the same length as the categorical data.
540* All comparisons (``==``, ``!=``, ``>``, ``>=``, ``<``, and ``<=``) of categorical data to
541  another categorical Series, when ``ordered==True`` and the ``categories`` are the same.
* All comparisons of categorical data to a scalar.
543
544All other comparisons, especially "non-equality" comparisons of two categoricals with different
545categories or a categorical with any list-like object, will raise a ``TypeError``.
546
547.. note::
548
    Any "non-equality" comparisons of categorical data with a ``Series``, ``np.array``, ``list`` or
    categorical data with different categories or ordering will raise a ``TypeError`` because custom
    category ordering could be interpreted in two ways: one taking the ordering into account
    and one ignoring it.
553
554.. ipython:: python
555
556    cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
557    cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
558    cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
559
560    cat
561    cat_base
562    cat_base2
563
564Comparing to a categorical with the same categories and ordering or to a scalar works:
565
566.. ipython:: python
567
568    cat > cat_base
569    cat > 2
570
Equality comparisons work with any list-like object of the same length and with scalars:
572
573.. ipython:: python
574
575    cat == cat_base
576    cat == np.array([1, 2, 3])
577    cat == 2
578
579This doesn't work because the categories are not the same:
580
581.. ipython:: python
582
583    try:
584        cat > cat_base2
585    except TypeError as e:
586        print("TypeError:", str(e))
587
588If you want to do a "non-equality" comparison of a categorical series with a list-like object
589which is not categorical data, you need to be explicit and convert the categorical data back to
590the original values:
591
592.. ipython:: python
593
594    base = np.array([1, 2, 3])
595
596    try:
597        cat > base
598    except TypeError as e:
599        print("TypeError:", str(e))
600
601    np.asarray(cat) > base
602
603When you compare two unordered categoricals with the same categories, the order is not considered:
604
605.. ipython:: python
606
607   c1 = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=False)
608   c2 = pd.Categorical(["a", "b"], categories=["b", "a"], ordered=False)
609   c1 == c2
610
611Operations
612----------
613
614Apart from :meth:`Series.min`, :meth:`Series.max` and :meth:`Series.mode`, the
615following operations are possible with categorical data:
616
617``Series`` methods like :meth:`Series.value_counts` will use all categories,
618even if some categories are not present in the data:
619
620.. ipython:: python
621
622    s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
623    s.value_counts()
624
625``DataFrame`` methods like :meth:`DataFrame.sum` also show "unused" categories.
626
627.. ipython:: python
628
629    columns = pd.Categorical(
630        ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
631    )
632    df = pd.DataFrame(
633        data=[[1, 2, 3], [4, 5, 6]],
634        columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
635    )
636    df.sum(axis=1, level=1)
637
638Groupby will also show "unused" categories:
639
640.. ipython:: python
641
642    cats = pd.Categorical(
643        ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
644    )
645    df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
646    df.groupby("cats").mean()
647
648    cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
649    df2 = pd.DataFrame(
650        {
651            "cats": cats2,
652            "B": ["c", "d", "c", "d"],
653            "values": [1, 2, 3, 4],
654        }
655    )
656    df2.groupby(["cats", "B"]).mean()
657
658
659Pivot tables:
660
661.. ipython:: python
662
663    raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
664    df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
665    pd.pivot_table(df, values="values", index=["A", "B"])
666
667Data munging
668------------
669
The optimized pandas data access methods ``.loc``, ``.iloc``, ``.at``, and ``.iat``
work as normal. The only difference is the return type (for getting) and
672that only values already in ``categories`` can be assigned.
673
674Getting
675~~~~~~~
676
677If the slicing operation returns either a ``DataFrame`` or a column of type
678``Series``, the ``category`` dtype is preserved.
679
680.. ipython:: python
681
682    idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
683    cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
684    values = [1, 2, 2, 2, 3, 4, 5]
685    df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
686    df.iloc[2:4, :]
687    df.iloc[2:4, :].dtypes
688    df.loc["h":"j", "cats"]
689    df[df["cats"] == "b"]
690
691An example where the category type is not preserved is if you take one single
692row: the resulting ``Series`` is of dtype ``object``:
693
694.. ipython:: python
695
696    # get the complete "h" row as a Series
697    df.loc["h", :]
698
699Returning a single item from categorical data will also return the value, not a categorical
700of length "1".
701
702.. ipython:: python
703
704    df.iat[0, 0]
705    df["cats"].cat.categories = ["x", "y", "z"]
706    df.at["h", "cats"]  # returns a string
707
708.. note::
    This is in contrast to R's ``factor`` function, where ``factor(c(1,2,3))[1]``
    returns a single value ``factor``.
711
712To get a single value ``Series`` of type ``category``, you pass in a list with
713a single value:
714
715.. ipython:: python
716
717    df.loc[["h"], "cats"]
718
719String and datetime accessors
720~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
721
The accessors ``.dt`` and ``.str`` will work if the ``s.cat.categories`` are of
an appropriate type:
724
725
726.. ipython:: python
727
728    str_s = pd.Series(list("aabb"))
729    str_cat = str_s.astype("category")
730    str_cat
731    str_cat.str.contains("a")
732
733    date_s = pd.Series(pd.date_range("1/1/2015", periods=5))
734    date_cat = date_s.astype("category")
735    date_cat
736    date_cat.dt.day
737
738.. note::
739
740    The returned ``Series`` (or ``DataFrame``) is of the same type as if you used the
741    ``.str.<method>`` / ``.dt.<method>`` on a ``Series`` of that type (and not of
742    type ``category``!).
743
This means that the values returned by methods and properties on the accessors of a
``Series`` are equal to the values returned by the same methods and properties on the
accessors of that ``Series`` converted to type ``category``:
747
748.. ipython:: python
749
750    ret_s = str_s.str.contains("a")
751    ret_cat = str_cat.str.contains("a")
752    ret_s.dtype == ret_cat.dtype
753    ret_s == ret_cat
754
755.. note::
756
    The work is done on the ``categories`` and then a new ``Series`` is constructed. This has
    performance implications if you have a ``Series`` of type string, where lots of elements
759    are repeated (i.e. the number of unique elements in the ``Series`` is a lot smaller than the
760    length of the ``Series``). In this case it can be faster to convert the original ``Series``
761    to one of type ``category`` and use ``.str.<method>`` or ``.dt.<property>`` on that.
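
The idea behind the note can be sketched by hand. This is only an illustration
of the principle under the assumption that no values are missing, not the
actual pandas implementation: the string method runs once per category, and
the result is spread over the rows through the codes:

.. code-block:: python

    import pandas as pd

    s = pd.Series(["apple", "banana", "apple"] * 1000)
    cat = s.astype("category")

    # Apply the string method to the two categories only.
    upper_categories = cat.cat.categories.str.upper()

    # Spread the per-category results over all rows via the integer codes.
    codes = cat.cat.codes.to_numpy()
    manual = pd.Series(upper_categories.take(codes).to_numpy(), index=cat.index)

    # The accessor on the categorical gives the same values.
    via_accessor = cat.str.upper()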
762
763Setting
764~~~~~~~
765
766Setting values in a categorical column (or ``Series``) works as long as the
767value is included in the ``categories``:
768
769.. ipython:: python
770
771    idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
772    cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"], categories=["a", "b"])
773    values = [1, 1, 1, 1, 1, 1, 1]
774    df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
775
776    df.iloc[2:4, :] = [["b", 2], ["b", 2]]
777    df
778    try:
779        df.iloc[2:4, :] = [["c", 3], ["c", 3]]
780    except ValueError as e:
781        print("ValueError:", str(e))
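
If a value outside the current ``categories`` has to be stored, one option is
to add the category first and assign afterwards. The following is a minimal
sketch with illustrative data:

.. code-block:: python

    import pandas as pd

    cats = pd.Series(pd.Categorical(["a", "a", "b"], categories=["a", "b"]))

    # "c" is not a known category yet, so register it first.
    cats = cats.cat.add_categories(["c"])

    # Now the assignment is accepted.
    cats.iloc[0] = "c"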
782
783Setting values by assigning categorical data will also check that the ``categories`` match:
784
785.. ipython:: python
786
787    df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])
788    df
789    try:
790        df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
791    except ValueError as e:
792        print("ValueError:", str(e))
793
794Assigning a ``Categorical`` to parts of a column of other types will use the values:
795
796.. ipython:: python
797
798    df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
799    df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
800    df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
801    df
802    df.dtypes
803
804.. _categorical.merge:
805.. _categorical.concat:
806
807Merging / concatenation
808~~~~~~~~~~~~~~~~~~~~~~~
809
810By default, combining ``Series`` or ``DataFrames`` which contain the same
811categories results in ``category`` dtype, otherwise results will depend on the
812dtype of the underlying categories. Merges that result in non-categorical
813dtypes will likely have higher memory usage. Use ``.astype`` or
814``union_categoricals`` to ensure ``category`` results.
815
816.. ipython:: python
817
818   from pandas.api.types import union_categoricals
819
820   # same categories
821   s1 = pd.Series(["a", "b"], dtype="category")
822   s2 = pd.Series(["a", "b", "a"], dtype="category")
823   pd.concat([s1, s2])
824
825   # different categories
826   s3 = pd.Series(["b", "c"], dtype="category")
827   pd.concat([s1, s3])
828
829   # Output dtype is inferred based on categories values
830   int_cats = pd.Series([1, 2], dtype="category")
831   float_cats = pd.Series([3.0, 4.0], dtype="category")
832   pd.concat([int_cats, float_cats])
833
834   pd.concat([s1, s3]).astype("category")
835   union_categoricals([s1.array, s3.array])
836
837The following table summarizes the results of merging ``Categoricals``:
838
839+-------------------+------------------------+----------------------+-----------------------------+
840| arg1              | arg2                   |      identical       | result                      |
841+===================+========================+======================+=============================+
842| category          | category               | True                 | category                    |
843+-------------------+------------------------+----------------------+-----------------------------+
844| category (object) | category (object)      | False                | object (dtype is inferred)  |
845+-------------------+------------------------+----------------------+-----------------------------+
846| category (int)    | category (float)       | False                | float (dtype is inferred)   |
847+-------------------+------------------------+----------------------+-----------------------------+
848
849See also the section on :ref:`merge dtypes<merging.dtypes>` for notes about
850preserving merge dtypes and performance.
851
852.. _categorical.union:
853
854Unioning
855~~~~~~~~
856
857If you want to combine categoricals that do not necessarily have the same
858categories, the :func:`~pandas.api.types.union_categoricals` function will
859combine a list-like of categoricals. The new categories will be the union of
860the categories being combined.
861
862.. ipython:: python
863
864    from pandas.api.types import union_categoricals
865
866    a = pd.Categorical(["b", "c"])
867    b = pd.Categorical(["a", "b"])
868    union_categoricals([a, b])
869
By default, the resulting categories will be ordered as
they appear in the data. If you want the categories to
be lexsorted, use the ``sort_categories=True`` argument.
873
874.. ipython:: python
875
876    union_categoricals([a, b], sort_categories=True)
877
``union_categoricals`` also works with the "easy" case of combining two
categoricals of the same categories and order information
(e.g. what you could also use ``append`` for).
881
882.. ipython:: python
883
884    a = pd.Categorical(["a", "b"], ordered=True)
885    b = pd.Categorical(["a", "b", "a"], ordered=True)
886    union_categoricals([a, b])
887
The following raises ``TypeError`` because the categoricals are ordered and their categories are not identical.
889
890.. code-block:: ipython
891
892   In [1]: a = pd.Categorical(["a", "b"], ordered=True)
893   In [2]: b = pd.Categorical(["a", "b", "c"], ordered=True)
894   In [3]: union_categoricals([a, b])
895   Out[3]:
896   TypeError: to union ordered Categoricals, all categories must be the same
897
Ordered categoricals with different categories or orderings can be combined by
using the ``ignore_order=True`` argument.
900
901.. ipython:: python
902
903    a = pd.Categorical(["a", "b", "c"], ordered=True)
904    b = pd.Categorical(["c", "b", "a"], ordered=True)
905    union_categoricals([a, b], ignore_order=True)
906
907:func:`~pandas.api.types.union_categoricals` also works with a
908``CategoricalIndex``, or ``Series`` containing categorical data, but note that
909the resulting array will always be a plain ``Categorical``:
910
911.. ipython:: python
912
913    a = pd.Series(["b", "c"], dtype="category")
914    b = pd.Series(["a", "b"], dtype="category")
915    union_categoricals([a, b])
916
917.. note::
918
919   ``union_categoricals`` may recode the integer codes for categories
920   when combining categoricals.  This is likely what you want,
921   but if you are relying on the exact numbering of the categories, be
922   aware.
923
924   .. ipython:: python
925
926      c1 = pd.Categorical(["b", "c"])
927      c2 = pd.Categorical(["a", "b"])
928
929      c1
930      # "b" is coded to 0
931      c1.codes
932
933      c2
934      # "b" is coded to 1
935      c2.codes
936
937      c = union_categoricals([c1, c2])
938      c
939      # "b" is coded to 0 throughout, same as c1, different from c2
940      c.codes
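
If old integer codes have been stored somewhere, they can be translated to the
codes of the combined categorical by looking up the old categories in the new
ones. The following is a sketch of that idea, not an official recipe; the name
``old_to_new`` is illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

c1 = pd.Categorical(["b", "c"])
c2 = pd.Categorical(["a", "b"])
c = union_categoricals([c1, c2])

# Position of each old category of c2 within the combined categories.
old_to_new = c.categories.get_indexer(c2.categories)

# Translate the old codes of c2 into codes that are valid for c.
translated = old_to_new[c2.codes]
print(list(c2.codes), "->", list(translated))

# The values themselves are unchanged by the recoding.
print(list(c.categories.take(c.codes)))
```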
941
942
943Getting data in/out
944-------------------
945
946You can write data that contains ``category`` dtypes to a ``HDFStore``.
947See :ref:`here <io.hdf5-categorical>` for an example and caveats.
948
It is also possible to write data to and read data from *Stata* format files.
950See :ref:`here <io.stata-categorical>` for an example and caveats.
951
Writing to a CSV file will convert the data, effectively removing any information about the
categorical (categories and ordering). So if you read back the CSV file, you have to convert the
relevant columns back to ``category`` and assign the right categories and category ordering.
955
956.. ipython:: python
957
    import io

    s = pd.Series(pd.Categorical(["a", "b", "b", "a", "a", "d"]))
    # rename the categories
    s = s.cat.rename_categories(["very good", "good", "bad"])
    # reorder the categories and add missing categories
    s = s.cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
    df = pd.DataFrame({"cats": s, "vals": [1, 2, 3, 4, 5, 6]})
    csv = io.StringIO()
    df.to_csv(csv)
    df2 = pd.read_csv(io.StringIO(csv.getvalue()))
    df2.dtypes
    df2["cats"]
    # Redo the category
    df2["cats"] = df2["cats"].astype("category")
    df2["cats"] = df2["cats"].cat.set_categories(
        ["very bad", "bad", "medium", "good", "very good"]
    )
    df2.dtypes
    df2["cats"]
978
979The same holds for writing to a SQL database with ``to_sql``.
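
As an alternative to converting after the fact, the categories can be supplied
while reading. The sketch below assumes the full list of categories is known in
advance and passes a ``CategoricalDtype`` through the ``dtype`` argument of
``read_csv``; the column names and ratings are made up for illustration.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed to be known up front: all categories, in their logical order.
ratings = ["very bad", "bad", "medium", "good", "very good"]
rating_dtype = CategoricalDtype(categories=ratings, ordered=True)

df = pd.DataFrame(
    {
        "cats": pd.Series(["good", "bad", "very good"], dtype=rating_dtype),
        "vals": [1, 2, 3],
    }
)

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read loses the categorical information ...
plain = pd.read_csv(io.StringIO(buffer.getvalue()))
# ... while passing the dtype restores categories and ordering in one step.
restored = pd.read_csv(io.StringIO(buffer.getvalue()), dtype={"cats": rating_dtype})

print(plain["cats"].dtype)
print(restored["cats"].dtype)
```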
980
981Missing data
982------------
983
984pandas primarily uses the value ``np.nan`` to represent missing data. It is by
985default not included in computations. See the :ref:`Missing Data section
986<missing_data>`.
987
Missing values should **not** be included in the Categorical's ``categories``,
only in the ``values``.
Instead, ``NaN`` is treated as a special value that is always allowed, whatever
the categories are.
When working with the Categorical's ``codes``, missing values will always have
a code of ``-1``.
993
994.. ipython:: python
995
996    s = pd.Series(["a", "b", np.nan, "a"], dtype="category")
997    # only two categories
998    s
999    s.cat.codes
1000
1001
1002Methods for working with missing data, e.g. :meth:`~Series.isna`, :meth:`~Series.fillna`,
1003:meth:`~Series.dropna`, all work normally:
1004
1005.. ipython:: python
1006
1007    s = pd.Series(["a", "b", np.nan], dtype="category")
1008    s
1009    pd.isna(s)
1010    s.fillna("a")
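
The fill value used above is already one of the categories. To fill with a new
label, a reasonable approach (sketched here under the assumption that filling
with a value outside the categories is rejected) is to add the category first.
The label ``"missing"`` is purely illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# NaN is not a category; it shows up as code -1.
print(list(s.cat.categories))
print(list(s.cat.codes))

# Add the new label as a category, then fill the missing values with it.
filled = s.cat.add_categories(["missing"]).fillna("missing")
print(filled)
```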
1011
1012Differences to R's ``factor``
1013-----------------------------
1014
1015The following differences to R's factor functions can be observed:
1016
1017* R's ``levels`` are named ``categories``.
1018* R's ``levels`` are always of type string, while ``categories`` in pandas can be of any dtype.
1019* It's not possible to specify labels at creation time. Use ``s.cat.rename_categories(new_labels)``
1020  afterwards.
* In contrast to R's ``factor`` function, using categorical data as the sole input to create a
  new categorical series will *not* remove unused categories but create a new categorical series
  which is equal to the passed-in one!
1024* R allows for missing values to be included in its ``levels`` (pandas' ``categories``). pandas
1025  does not allow ``NaN`` categories, but missing values can still be in the ``values``.
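
Two of the points above can be illustrated with a short sketch: unused
categories survive when a categorical is rebuilt from categorical data, and
labels are assigned after creation. The data and the new labels are
illustrative.

```python
import pandas as pd

# "c" is a category but never occurs in the values.
s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# Rebuilding from categorical data keeps the unused category.
rebuilt = pd.Series(pd.Categorical(s))
print(list(rebuilt.cat.categories))

# Dropping unused categories has to be requested explicitly.
trimmed = s.cat.remove_unused_categories()
print(list(trimmed.cat.categories))

# Labels are changed after creation with rename_categories.
relabeled = trimmed.cat.rename_categories({"a": "alpha", "b": "beta"})
print(list(relabeled))
```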
1026
1027
1028Gotchas
1029-------
1030
1031.. _categorical.rfactor:
1032
1033Memory usage
1034~~~~~~~~~~~~
1035
1036.. _categorical.memory:
1037
The memory usage of a ``Categorical`` is proportional to the number of categories plus the length of the data. In contrast,
the memory usage of an ``object`` dtype is a constant times the length of the data.
1040
1041.. ipython:: python
1042
1043   s = pd.Series(["foo", "bar"] * 1000)
1044
1045   # object dtype
1046   s.nbytes
1047
1048   # category dtype
1049   s.astype("category").nbytes
1050
1051.. note::
1052
1053   If the number of categories approaches the length of the data, the ``Categorical`` will use nearly the same or
1054   more memory than an equivalent ``object`` dtype representation.
1055
1056   .. ipython:: python
1057
1058      s = pd.Series(["foo%04d" % i for i in range(2000)])
1059
1060      # object dtype
1061      s.nbytes
1062
1063      # category dtype
1064      s.astype("category").nbytes
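
To see where the saving comes from, the following sketch inspects the integer
codes and compares ``memory_usage(deep=True)``, which also counts the string
payload. Exact byte counts depend on the pandas version and platform, so none
are shown here.

```python
import pandas as pd

s = pd.Series(["foo", "bar"] * 1000)
cat = s.astype("category")

# Each value is stored as a small integer code pointing into the categories.
print(cat.cat.codes.dtype)
print(len(cat.cat.categories))

# deep=True includes the memory used by the strings themselves.
plain_bytes = s.memory_usage(index=False, deep=True)
cat_bytes = cat.memory_usage(index=False, deep=True)
print(plain_bytes, cat_bytes)
```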
1065
1066
1067``Categorical`` is not a ``numpy`` array
1068~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1069
Currently, categorical data and the underlying ``Categorical`` are implemented as a Python
object and not as a low-level NumPy array dtype. This leads to some problems.
1072
1073NumPy itself doesn't know about the new ``dtype``:
1074
1075.. ipython:: python
1076
1077    try:
1078        np.dtype("category")
1079    except TypeError as e:
1080        print("TypeError:", str(e))
1081
1082    dtype = pd.Categorical(["a"]).dtype
1083    try:
1084        np.dtype(dtype)
1085    except TypeError as e:
1086        print("TypeError:", str(e))
1087
1088Dtype comparisons work:
1089
1090.. ipython:: python
1091
1092    dtype == np.str_
1093    np.str_ == dtype
1094
1095To check if a Series contains Categorical data, use ``hasattr(s, 'cat')``:
1096
1097.. ipython:: python
1098
1099    hasattr(pd.Series(["a"], dtype="category"), "cat")
1100    hasattr(pd.Series(["a"]), "cat")
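
An alternative check, sketched here, inspects the dtype directly rather than
relying on the presence of the ``cat`` accessor.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_series = pd.Series(["a"], dtype="category")
str_series = pd.Series(["a"])

# Check the dtype object itself.
is_cat = isinstance(cat_series.dtype, CategoricalDtype)
is_not_cat = isinstance(str_series.dtype, CategoricalDtype)
print(is_cat, is_not_cat)

# Comparing the dtype with the string "category" also works.
print(cat_series.dtype == "category")
```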
1101
Using NumPy functions on a ``Series`` of type ``category`` should not work, as ``Categoricals``
are not numeric data (even in the case that ``.categories`` is numeric).
1104
1105.. ipython:: python
1106
1107    s = pd.Series(pd.Categorical([1, 2, 3, 4]))
1108    try:
1109        np.sum(s)
1110        # same with np.log(s),...
1111    except TypeError as e:
1112        print("TypeError:", str(e))
1113
1114.. note::
1115    If such a function works, please file a bug at https://github.com/pandas-dev/pandas!
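
When the categories are numeric and a numeric result is really wanted, one
option is to convert back to a numeric dtype first. This is a sketch of that
workaround, not a recommendation to treat categorical data as numbers.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

# Convert the categorical back to integers before using NumPy functions.
as_int = s.astype("int64")
total = np.sum(as_int)
print(total)
```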
1116
1117dtype in apply
1118~~~~~~~~~~~~~~
1119
pandas currently does not preserve the dtype in apply functions. If you apply along rows, you get
a ``Series`` of ``object`` ``dtype`` (the same as getting a row, where getting one element returns a
basic type), and applying along columns will also convert to object. ``NaN`` values are unaffected.
You can use ``fillna`` to handle missing values before applying a function.
1124
1125.. ipython:: python
1126
1127    df = pd.DataFrame(
1128        {
1129            "a": [1, 2, 3, 4],
1130            "b": ["a", "b", "c", "d"],
1131            "cats": pd.Categorical([1, 2, 3, 2]),
1132        }
1133    )
1134    df.apply(lambda row: type(row["cats"]), axis=1)
1135    df.apply(lambda col: col.dtype, axis=0)
1136
1137Categorical index
1138~~~~~~~~~~~~~~~~~
1139
1140``CategoricalIndex`` is a type of index that is useful for supporting
1141indexing with duplicates. This is a container around a ``Categorical``
1142and allows efficient indexing and storage of an index with a large number of duplicated elements.
1143See the :ref:`advanced indexing docs <indexing.categoricalindex>` for a more detailed
1144explanation.
1145
1146Setting the index will create a ``CategoricalIndex``:
1147
1148.. ipython:: python
1149
1150    cats = pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1])
1151    strings = ["a", "b", "c", "d"]
1152    values = [4, 2, 3, 1]
1153    df = pd.DataFrame({"strings": strings, "values": values}, index=cats)
1154    df.index
1155    # This now sorts by the categories order
1156    df.sort_index()
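
The following sketch shows the duplicate-label case mentioned above: a label
lookup returns every matching row, and sorting follows the order of the
categories. The labels and column name are illustrative.

```python
import pandas as pd

# An index with a duplicated label and an explicit category order.
idx = pd.CategoricalIndex(["a", "b", "a", "c"], categories=["c", "b", "a"])
df = pd.DataFrame({"values": [1, 2, 3, 4]}, index=idx)

# Selecting a duplicated label returns all rows carrying that label.
rows_a = df.loc["a"]
print(rows_a)

# Sorting uses the category order ("c", "b", "a"), not the lexical order.
sorted_labels = list(df.sort_index().index)
print(sorted_labels)
```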
1157
1158Side effects
1159~~~~~~~~~~~~
1160
1161Constructing a ``Series`` from a ``Categorical`` will not copy the input
1162``Categorical``. This means that changes to the ``Series`` will in most cases
1163change the original ``Categorical``:
1164
1165.. ipython:: python
1166
1167    cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
1168    s = pd.Series(cat, name="cat")
1169    cat
1170    s.iloc[0:2] = 10
1171    cat
1172    df = pd.DataFrame(s)
1173    df["cat"].cat.categories = [1, 2, 3, 4, 5]
1174    cat
1175
Use ``copy=True`` to prevent such behaviour, or simply don't reuse ``Categoricals``:
1177
1178.. ipython:: python
1179
1180    cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
1181    s = pd.Series(cat, name="cat", copy=True)
1182    cat
1183    s.iloc[0:2] = 10
1184    cat
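
Another way to avoid sharing, sketched below, is to hand the ``Series`` its own
copy of the ``Categorical`` via ``Categorical.copy()``.

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

# Give the Series an independent copy instead of the original object.
s = pd.Series(cat.copy(), name="cat")
s.iloc[0:2] = 10

# The original Categorical keeps its values.
print(list(cat))
print(list(s))
```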
1185
1186.. note::
1187
1188    This also happens in some cases when you supply a NumPy array instead of a ``Categorical``:
1189    using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behavior, while using
1190    a string array (e.g. ``np.array(["a","b","c","a"])``) will not.
1191