1.. _compare_with_sas:
2
3{{ header }}
4
5Comparison with SAS
6********************
7For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__
8this page is meant to demonstrate how different SAS operations would be
9performed in pandas.
10
11If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>`
12to familiarize yourself with the library.
13
14As is customary, we import pandas and NumPy as follows:
15
16.. ipython:: python
17
18    import pandas as pd
19    import numpy as np
20
21
22.. note::
23
24   Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling
25   ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``.
26   This is often used in interactive work (e.g. `Jupyter notebook
27   <https://jupyter.org/>`_ or terminal) - the equivalent in SAS would be:
28
29   .. code-block:: sas
30
31      proc print data=df(obs=5);
32      run;
33
34Data structures
35---------------
36
37General terminology translation
38~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
39
40.. csv-table::
41    :header: "pandas", "SAS"
42    :widths: 20, 20
43
44    ``DataFrame``, data set
45    column, variable
46    row, observation
47    groupby, BY-group
48    ``NaN``, ``.``
49
50
51``DataFrame`` / ``Series``
52~~~~~~~~~~~~~~~~~~~~~~~~~~
53
54A ``DataFrame`` in pandas is analogous to a SAS data set - a two-dimensional
55data source with labeled columns that can be of different types. As will be
56shown in this document, almost any operation that can be applied to a data set
57using SAS's ``DATA`` step, can also be accomplished in pandas.
58
59A ``Series`` is the data structure that represents one column of a
60``DataFrame``. SAS doesn't have a separate data structure for a single column,
61but in general, working with a ``Series`` is analogous to referencing a column
62in the ``DATA`` step.
63
64``Index``
65~~~~~~~~~
66
67Every ``DataFrame`` and ``Series`` has an ``Index`` - which are labels on the
68*rows* of the data. SAS does not have an exactly analogous concept. A data set's
69rows are essentially unlabeled, other than an implicit integer index that can be
70accessed during the ``DATA`` step (``_N_``).
71
72In pandas, if no index is specified, an integer index is also used by default
73(first row = 0, second row = 1, and so on). While using a labeled ``Index`` or
74``MultiIndex`` can enable sophisticated analyses and is ultimately an important
75part of pandas to understand, for this comparison we will essentially ignore the
76``Index`` and just treat the ``DataFrame`` as a collection of columns. Please
77see the :ref:`indexing documentation<indexing>` for much more on how to use an
78``Index`` effectively.
79
80
81Data input / output
82-------------------
83
84Constructing a DataFrame from values
85~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
86
87A SAS data set can be built from specified values by
88placing the data after a ``datalines`` statement and
89specifying the column names.
90
91.. code-block:: sas
92
93   data df;
94       input x y;
95       datalines;
96       1 2
97       3 4
98       5 6
99       ;
100   run;
101
102A pandas ``DataFrame`` can be constructed in many different ways,
103but for a small number of values, it is often convenient to specify it as
104a Python dictionary, where the keys are the column names
105and the values are the data.
106
107.. ipython:: python
108
109   df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})
110   df
111
112
113Reading external data
114~~~~~~~~~~~~~~~~~~~~~
115
116Like SAS, pandas provides utilities for reading in data from
117many formats.  The ``tips`` dataset, found within the pandas
118tests (`csv <https://raw.github.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/tips.csv>`_)
119will be used in many of the following examples.
120
121SAS provides ``PROC IMPORT`` to read csv data into a data set.
122
123.. code-block:: sas
124
125   proc import datafile='tips.csv' dbms=csv out=tips replace;
126       getnames=yes;
127   run;
128
129The pandas method is :func:`read_csv`, which works similarly.
130
131.. ipython:: python
132
133   url = (
134       "https://raw.github.com/pandas-dev/"
135       "pandas/master/pandas/tests/io/data/csv/tips.csv"
136   )
137   tips = pd.read_csv(url)
138   tips.head()
139
140
141Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify
142how the data should be parsed.  For example, if the data was instead tab delimited,
143and did not have column names, the pandas command would be:
144
145.. code-block:: python
146
147   tips = pd.read_csv("tips.csv", sep="\t", header=None)
148
149   # alternatively, read_table is an alias to read_csv with tab delimiter
150   tips = pd.read_table("tips.csv", header=None)
151
152In addition to text/csv, pandas supports a variety of other data formats
153such as Excel, HDF5, and SQL databases.  These are all read via a ``pd.read_*``
154function.  See the :ref:`IO documentation<io>` for more details.
155
156Exporting data
157~~~~~~~~~~~~~~
158
159The inverse of ``PROC IMPORT`` in SAS is ``PROC EXPORT``
160
161.. code-block:: sas
162
163   proc export data=tips outfile='tips2.csv' dbms=csv;
164   run;
165
166Similarly in pandas, the opposite of ``read_csv`` is :meth:`~DataFrame.to_csv`,
167and other data formats follow a similar api.
168
169.. code-block:: python
170
171   tips.to_csv("tips2.csv")
172
173
174Data operations
175---------------
176
177Operations on columns
178~~~~~~~~~~~~~~~~~~~~~
179
180In the ``DATA`` step, arbitrary math expressions can
181be used on new or existing columns.
182
183.. code-block:: sas
184
185   data tips;
186       set tips;
187       total_bill = total_bill - 2;
188       new_bill = total_bill / 2;
189   run;
190
191pandas provides similar vectorized operations by
192specifying the individual ``Series`` in the ``DataFrame``.
193New columns can be assigned in the same way.
194
195.. ipython:: python
196
197   tips["total_bill"] = tips["total_bill"] - 2
198   tips["new_bill"] = tips["total_bill"] / 2.0
199   tips.head()
200
201.. ipython:: python
202   :suppress:
203
204   tips = tips.drop("new_bill", axis=1)
205
206Filtering
207~~~~~~~~~
208
209Filtering in SAS is done with an ``if`` or ``where`` statement, on one
210or more columns.
211
212.. code-block:: sas
213
214   data tips;
215       set tips;
216       if total_bill > 10;
217   run;
218
219   data tips;
220       set tips;
221       where total_bill > 10;
222       /* equivalent in this case - where happens before the
223          DATA step begins and can also be used in PROC statements */
224   run;
225
226DataFrames can be filtered in multiple ways; the most intuitive of which is using
227:ref:`boolean indexing <indexing.boolean>`
228
229.. ipython:: python
230
231   tips[tips["total_bill"] > 10].head()
232
233If/then logic
234~~~~~~~~~~~~~
235
236In SAS, if/then logic can be used to create new columns.
237
238.. code-block:: sas
239
240   data tips;
241       set tips;
242       format bucket $4.;
243
244       if total_bill < 10 then bucket = 'low';
245       else bucket = 'high';
246   run;
247
248The same operation in pandas can be accomplished using
249the ``where`` method from ``numpy``.
250
251.. ipython:: python
252
253   tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
254   tips.head()
255
256.. ipython:: python
257   :suppress:
258
259   tips = tips.drop("bucket", axis=1)
260
261Date functionality
262~~~~~~~~~~~~~~~~~~
263
264SAS provides a variety of functions to do operations on
265date/datetime columns.
266
267.. code-block:: sas
268
269   data tips;
270       set tips;
271       format date1 date2 date1_plusmonth mmddyy10.;
272       date1 = mdy(1, 15, 2013);
273       date2 = mdy(2, 15, 2015);
274       date1_year = year(date1);
275       date2_month = month(date2);
276       * shift date to beginning of next interval;
277       date1_next = intnx('MONTH', date1, 1);
278       * count intervals between dates;
279       months_between = intck('MONTH', date1, date2);
280   run;
281
282The equivalent pandas operations are shown below.  In addition to these
283functions pandas supports other Time Series features
284not available in Base SAS (such as resampling and custom offsets) -
285see the :ref:`timeseries documentation<timeseries>` for more details.
286
287.. ipython:: python
288
289   tips["date1"] = pd.Timestamp("2013-01-15")
290   tips["date2"] = pd.Timestamp("2015-02-15")
291   tips["date1_year"] = tips["date1"].dt.year
292   tips["date2_month"] = tips["date2"].dt.month
293   tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()
294   tips["months_between"] = tips["date2"].dt.to_period("M") - tips[
295       "date1"
296   ].dt.to_period("M")
297
298   tips[
299       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
300   ].head()
301
302.. ipython:: python
303   :suppress:
304
305   tips = tips.drop(
306       ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"],
307       axis=1,
308   )
309
310Selection of columns
311~~~~~~~~~~~~~~~~~~~~
312
313SAS provides keywords in the ``DATA`` step to select,
314drop, and rename columns.
315
316.. code-block:: sas
317
318   data tips;
319       set tips;
320       keep sex total_bill tip;
321   run;
322
323   data tips;
324       set tips;
325       drop sex;
326   run;
327
328   data tips;
329       set tips;
330       rename total_bill=total_bill_2;
331   run;
332
333The same operations are expressed in pandas below.
334
335.. ipython:: python
336
337   # keep
338   tips[["sex", "total_bill", "tip"]].head()
339
340   # drop
341   tips.drop("sex", axis=1).head()
342
343   # rename
344   tips.rename(columns={"total_bill": "total_bill_2"}).head()
345
346
347Sorting by values
348~~~~~~~~~~~~~~~~~
349
350Sorting in SAS is accomplished via ``PROC SORT``
351
352.. code-block:: sas
353
354   proc sort data=tips;
355       by sex total_bill;
356   run;
357
358pandas objects have a :meth:`~DataFrame.sort_values` method, which
359takes a list of columns to sort by.
360
361.. ipython:: python
362
363   tips = tips.sort_values(["sex", "total_bill"])
364   tips.head()
365
366
367String processing
368-----------------
369
370Length
371~~~~~~
372
373SAS determines the length of a character string with the
374`LENGTHN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__
375and `LENGTHC <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002283942.htm>`__
376functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailing blanks.
377
378.. code-block:: sas
379
380   data _null_;
381   set tips;
382   put(LENGTHN(time));
383   put(LENGTHC(time));
384   run;
385
386Python determines the length of a character string with the ``len`` function.
387``len`` includes trailing blanks.  Use ``len`` and ``rstrip`` to exclude
388trailing blanks.
389
390.. ipython:: python
391
392   tips["time"].str.len().head()
393   tips["time"].str.rstrip().str.len().head()
394
395
396Find
397~~~~
398
399SAS determines the position of a character in a string with the
400`FINDW <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function.
401``FINDW`` takes the string defined by the first argument and searches for the first position of the substring
402you supply as the second argument.
403
404.. code-block:: sas
405
406   data _null_;
407   set tips;
408   put(FINDW(sex,'ale'));
409   run;
410
411Python determines the position of a character in a string with the
412``find`` function.  ``find`` searches for the first position of the
413substring.  If the substring is found, the function returns its
414position.  Keep in mind that Python indexes are zero-based and
415the function will return -1 if it fails to find the substring.
416
417.. ipython:: python
418
419   tips["sex"].str.find("ale").head()
420
421
422Substring
423~~~~~~~~~
424
425SAS extracts a substring from a string based on its position with the
426`SUBSTR <https://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function.
427
428.. code-block:: sas
429
430   data _null_;
431   set tips;
432   put(substr(sex,1,1));
433   run;
434
435With pandas you can use ``[]`` notation to extract a substring
436from a string by position locations.  Keep in mind that Python
437indexes are zero-based.
438
439.. ipython:: python
440
441   tips["sex"].str[0:1].head()
442
443
444Scan
445~~~~
446
447The SAS `SCAN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__
448function returns the nth word from a string. The first argument is the string you want to parse and the
449second argument specifies which word you want to extract.
450
451.. code-block:: sas
452
453   data firstlast;
454   input String $60.;
455   First_Name = scan(string, 1);
456   Last_Name = scan(string, -1);
457   datalines2;
458   John Smith;
459   Jane Cook;
460   ;;;
461   run;
462
463Python extracts a substring from a string based on its text
464by using regular expressions. There are much more powerful
465approaches, but this just shows a simple approach.
466
467.. ipython:: python
468
469   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
470   firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
471   firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
472   firstlast
473
474
475Upcase, lowcase, and propcase
476~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
477
478The SAS `UPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__
479`LOWCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__ and
480`PROPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/a002598106.htm>`__
481functions change the case of the argument.
482
483.. code-block:: sas
484
485   data firstlast;
486   input String $60.;
487   string_up = UPCASE(string);
488   string_low = LOWCASE(string);
489   string_prop = PROPCASE(string);
490   datalines2;
491   John Smith;
492   Jane Cook;
493   ;;;
494   run;
495
496The equivalent Python functions are ``upper``, ``lower``, and ``title``.
497
498.. ipython:: python
499
500   firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
501   firstlast["string_up"] = firstlast["String"].str.upper()
502   firstlast["string_low"] = firstlast["String"].str.lower()
503   firstlast["string_prop"] = firstlast["String"].str.title()
504   firstlast
505
506Merging
507-------
508
509The following tables will be used in the merge examples
510
511.. ipython:: python
512
513   df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
514   df1
515   df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})
516   df2
517
518In SAS, data must be explicitly sorted before merging.  Different
519types of joins are accomplished using the ``in=`` dummy
520variables to track whether a match was found in one or both
521input frames.
522
523.. code-block:: sas
524
525   proc sort data=df1;
526       by key;
527   run;
528
529   proc sort data=df2;
530       by key;
531   run;
532
533   data left_join inner_join right_join outer_join;
534       merge df1(in=a) df2(in=b);
535
536       if a and b then output inner_join;
537       if a then output left_join;
538       if b then output right_join;
539       if a or b then output outer_join;
540   run;
541
542pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides
543similar functionality.  Note that the data does not have
544to be sorted ahead of time, and different join
545types are accomplished via the ``how`` keyword.
546
547.. ipython:: python
548
549   inner_join = df1.merge(df2, on=["key"], how="inner")
550   inner_join
551
552   left_join = df1.merge(df2, on=["key"], how="left")
553   left_join
554
555   right_join = df1.merge(df2, on=["key"], how="right")
556   right_join
557
558   outer_join = df1.merge(df2, on=["key"], how="outer")
559   outer_join
560
561
562Missing data
563------------
564
565Like SAS, pandas has a representation for missing data - which is the
566special float value ``NaN`` (not a number).  Many of the semantics
567are the same, for example missing data propagates through numeric
568operations, and is ignored by default for aggregations.
569
570.. ipython:: python
571
572   outer_join
573   outer_join["value_x"] + outer_join["value_y"]
574   outer_join["value_x"].sum()
575
576One difference is that missing data cannot be compared to its sentinel value.
577For example, in SAS you could do this to filter missing values.
578
579.. code-block:: sas
580
581   data outer_join_nulls;
582       set outer_join;
583       if value_x = .;
584   run;
585
586   data outer_join_no_nulls;
587       set outer_join;
588       if value_x ^= .;
589   run;
590
591Which doesn't work in pandas.  Instead, the ``pd.isna`` or ``pd.notna`` functions
592should be used for comparisons.
593
594.. ipython:: python
595
596   outer_join[pd.isna(outer_join["value_x"])]
597   outer_join[pd.notna(outer_join["value_x"])]
598
599pandas also provides a variety of methods to work with missing data - some of
600which would be challenging to express in SAS. For example, there are methods to
601drop all rows with any missing values, replacing missing values with a specified
602value, like the mean, or forward filling from previous rows. See the
603:ref:`missing data documentation<missing_data>` for more.
604
605.. ipython:: python
606
607   outer_join.dropna()
608   outer_join.fillna(method="ffill")
609   outer_join["value_x"].fillna(outer_join["value_x"].mean())
610
611
612GroupBy
613-------
614
615Aggregation
616~~~~~~~~~~~
617
618SAS's PROC SUMMARY can be used to group by one or
619more key variables and compute aggregations on
620numeric columns.
621
622.. code-block:: sas
623
624   proc summary data=tips nway;
625       class sex smoker;
626       var total_bill tip;
627       output out=tips_summed sum=;
628   run;
629
630pandas provides a flexible ``groupby`` mechanism that
631allows similar aggregations.  See the :ref:`groupby documentation<groupby>`
632for more details and examples.
633
634.. ipython:: python
635
636   tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()
637   tips_summed.head()
638
639
640Transformation
641~~~~~~~~~~~~~~
642
643In SAS, if the group aggregations need to be used with
644the original frame, it must be merged back together.  For
645example, to subtract the mean for each observation by smoker group.
646
647.. code-block:: sas
648
649   proc summary data=tips missing nway;
650       class smoker;
651       var total_bill;
652       output out=smoker_means mean(total_bill)=group_bill;
653   run;
654
655   proc sort data=tips;
656       by smoker;
657   run;
658
659   data tips;
660       merge tips(in=a) smoker_means(in=b);
661       by smoker;
662       adj_total_bill = total_bill - group_bill;
663       if a and b;
664   run;
665
666
667pandas ``groupby`` provides a ``transform`` mechanism that allows
668these type of operations to be succinctly expressed in one
669operation.
670
671.. ipython:: python
672
673   gb = tips.groupby("smoker")["total_bill"]
674   tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")
675   tips.head()
676
677
678By group processing
679~~~~~~~~~~~~~~~~~~~
680
681In addition to aggregation, pandas ``groupby`` can be used to
682replicate most other by group processing from SAS. For example,
683this ``DATA`` step reads the data by sex/smoker group and filters to
684the first entry for each.
685
686.. code-block:: sas
687
688   proc sort data=tips;
689      by sex smoker;
690   run;
691
692   data tips_first;
693       set tips;
694       by sex smoker;
695       if FIRST.sex or FIRST.smoker then output;
696   run;
697
698In pandas this would be written as:
699
700.. ipython:: python
701
702   tips.groupby(["sex", "smoker"]).first()
703
704
705Other considerations
706--------------------
707
708Disk vs memory
709~~~~~~~~~~~~~~
710
711pandas operates exclusively in memory, where a SAS data set exists on disk.
712This means that the size of data able to be loaded in pandas is limited by your
713machine's memory, but also that the operations on that data may be faster.
714
715If out of core processing is needed, one possibility is the
716`dask.dataframe <https://dask.pydata.org/en/latest/dataframe.html>`_
717library (currently in development) which
718provides a subset of pandas functionality for an on-disk ``DataFrame``
719
720Data interop
721~~~~~~~~~~~~
722
723pandas provides a :func:`read_sas` method that can read SAS data saved in
724the XPORT or SAS7BDAT binary format.
725
726.. code-block:: sas
727
728   libname xportout xport 'transport-file.xpt';
729   data xportout.tips;
730       set tips(rename=(total_bill=tbill));
731       * xport variable names limited to 6 characters;
732   run;
733
734.. code-block:: python
735
736   df = pd.read_sas("transport-file.xpt")
737   df = pd.read_sas("binary-file.sas7bdat")
738
739You can also specify the file format directly. By default, pandas will try
740to infer the file format based on its extension.
741
742.. code-block:: python
743
744   df = pd.read_sas("transport-file.xpt", format="xport")
745   df = pd.read_sas("binary-file.sas7bdat", format="sas7bdat")
746
747XPORT is a relatively limited format and the parsing of it is not as
748optimized as some of the other pandas readers. An alternative way
749to interop data between SAS and pandas is to serialize to csv.
750
751.. code-block:: ipython
752
753   # version 0.17, 10M rows
754
755   In [8]: %time df = pd.read_sas('big.xpt')
756   Wall time: 14.6 s
757
758   In [9]: %time df = pd.read_csv('big.csv')
759   Wall time: 4.86 s
760