1.. _compare_with_sas: 2 3{{ header }} 4 5Comparison with SAS 6******************** 7For potential users coming from `SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__ 8this page is meant to demonstrate how different SAS operations would be 9performed in pandas. 10 11If you're new to pandas, you might want to first read through :ref:`10 Minutes to pandas<10min>` 12to familiarize yourself with the library. 13 14As is customary, we import pandas and NumPy as follows: 15 16.. ipython:: python 17 18 import pandas as pd 19 import numpy as np 20 21 22.. note:: 23 24 Throughout this tutorial, the pandas ``DataFrame`` will be displayed by calling 25 ``df.head()``, which displays the first N (default 5) rows of the ``DataFrame``. 26 This is often used in interactive work (e.g. `Jupyter notebook 27 <https://jupyter.org/>`_ or terminal) - the equivalent in SAS would be: 28 29 .. code-block:: sas 30 31 proc print data=df(obs=5); 32 run; 33 34Data structures 35--------------- 36 37General terminology translation 38~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 39 40.. csv-table:: 41 :header: "pandas", "SAS" 42 :widths: 20, 20 43 44 ``DataFrame``, data set 45 column, variable 46 row, observation 47 groupby, BY-group 48 ``NaN``, ``.`` 49 50 51``DataFrame`` / ``Series`` 52~~~~~~~~~~~~~~~~~~~~~~~~~~ 53 54A ``DataFrame`` in pandas is analogous to a SAS data set - a two-dimensional 55data source with labeled columns that can be of different types. As will be 56shown in this document, almost any operation that can be applied to a data set 57using SAS's ``DATA`` step, can also be accomplished in pandas. 58 59A ``Series`` is the data structure that represents one column of a 60``DataFrame``. SAS doesn't have a separate data structure for a single column, 61but in general, working with a ``Series`` is analogous to referencing a column 62in the ``DATA`` step. 63 64``Index`` 65~~~~~~~~~ 66 67Every ``DataFrame`` and ``Series`` has an ``Index`` - which are labels on the 68*rows* of the data. SAS does not have an exactly analogous concept. A data set's 69rows are essentially unlabeled, other than an implicit integer index that can be 70accessed during the ``DATA`` step (``_N_``). 71 72In pandas, if no index is specified, an integer index is also used by default 73(first row = 0, second row = 1, and so on). While using a labeled ``Index`` or 74``MultiIndex`` can enable sophisticated analyses and is ultimately an important 75part of pandas to understand, for this comparison we will essentially ignore the 76``Index`` and just treat the ``DataFrame`` as a collection of columns. Please 77see the :ref:`indexing documentation<indexing>` for much more on how to use an 78``Index`` effectively. 79 80 81Data input / output 82------------------- 83 84Constructing a DataFrame from values 85~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 86 87A SAS data set can be built from specified values by 88placing the data after a ``datalines`` statement and 89specifying the column names. 90 91.. code-block:: sas 92 93 data df; 94 input x y; 95 datalines; 96 1 2 97 3 4 98 5 6 99 ; 100 run; 101 102A pandas ``DataFrame`` can be constructed in many different ways, 103but for a small number of values, it is often convenient to specify it as 104a Python dictionary, where the keys are the column names 105and the values are the data. 106 107.. ipython:: python 108 109 df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]}) 110 df 111 112 113Reading external data 114~~~~~~~~~~~~~~~~~~~~~ 115 116Like SAS, pandas provides utilities for reading in data from 117many formats. The ``tips`` dataset, found within the pandas 118tests (`csv <https://raw.github.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/tips.csv>`_) 119will be used in many of the following examples. 120 121SAS provides ``PROC IMPORT`` to read csv data into a data set. 122 123.. code-block:: sas 124 125 proc import datafile='tips.csv' dbms=csv out=tips replace; 126 getnames=yes; 127 run; 128 129The pandas method is :func:`read_csv`, which works similarly. 130 131.. ipython:: python 132 133 url = ( 134 "https://raw.github.com/pandas-dev/" 135 "pandas/master/pandas/tests/io/data/csv/tips.csv" 136 ) 137 tips = pd.read_csv(url) 138 tips.head() 139 140 141Like ``PROC IMPORT``, ``read_csv`` can take a number of parameters to specify 142how the data should be parsed. For example, if the data was instead tab delimited, 143and did not have column names, the pandas command would be: 144 145.. code-block:: python 146 147 tips = pd.read_csv("tips.csv", sep="\t", header=None) 148 149 # alternatively, read_table is an alias to read_csv with tab delimiter 150 tips = pd.read_table("tips.csv", header=None) 151 152In addition to text/csv, pandas supports a variety of other data formats 153such as Excel, HDF5, and SQL databases. These are all read via a ``pd.read_*`` 154function. See the :ref:`IO documentation<io>` for more details. 155 156Exporting data 157~~~~~~~~~~~~~~ 158 159The inverse of ``PROC IMPORT`` in SAS is ``PROC EXPORT`` 160 161.. code-block:: sas 162 163 proc export data=tips outfile='tips2.csv' dbms=csv; 164 run; 165 166Similarly in pandas, the opposite of ``read_csv`` is :meth:`~DataFrame.to_csv`, 167and other data formats follow a similar api. 168 169.. code-block:: python 170 171 tips.to_csv("tips2.csv") 172 173 174Data operations 175--------------- 176 177Operations on columns 178~~~~~~~~~~~~~~~~~~~~~ 179 180In the ``DATA`` step, arbitrary math expressions can 181be used on new or existing columns. 182 183.. code-block:: sas 184 185 data tips; 186 set tips; 187 total_bill = total_bill - 2; 188 new_bill = total_bill / 2; 189 run; 190 191pandas provides similar vectorized operations by 192specifying the individual ``Series`` in the ``DataFrame``. 193New columns can be assigned in the same way. 194 195.. ipython:: python 196 197 tips["total_bill"] = tips["total_bill"] - 2 198 tips["new_bill"] = tips["total_bill"] / 2.0 199 tips.head() 200 201.. ipython:: python 202 :suppress: 203 204 tips = tips.drop("new_bill", axis=1) 205 206Filtering 207~~~~~~~~~ 208 209Filtering in SAS is done with an ``if`` or ``where`` statement, on one 210or more columns. 211 212.. code-block:: sas 213 214 data tips; 215 set tips; 216 if total_bill > 10; 217 run; 218 219 data tips; 220 set tips; 221 where total_bill > 10; 222 /* equivalent in this case - where happens before the 223 DATA step begins and can also be used in PROC statements */ 224 run; 225 226DataFrames can be filtered in multiple ways; the most intuitive of which is using 227:ref:`boolean indexing <indexing.boolean>` 228 229.. ipython:: python 230 231 tips[tips["total_bill"] > 10].head() 232 233If/then logic 234~~~~~~~~~~~~~ 235 236In SAS, if/then logic can be used to create new columns. 237 238.. code-block:: sas 239 240 data tips; 241 set tips; 242 format bucket $4.; 243 244 if total_bill < 10 then bucket = 'low'; 245 else bucket = 'high'; 246 run; 247 248The same operation in pandas can be accomplished using 249the ``where`` method from ``numpy``. 250 251.. ipython:: python 252 253 tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high") 254 tips.head() 255 256.. ipython:: python 257 :suppress: 258 259 tips = tips.drop("bucket", axis=1) 260 261Date functionality 262~~~~~~~~~~~~~~~~~~ 263 264SAS provides a variety of functions to do operations on 265date/datetime columns. 266 267.. code-block:: sas 268 269 data tips; 270 set tips; 271 format date1 date2 date1_plusmonth mmddyy10.; 272 date1 = mdy(1, 15, 2013); 273 date2 = mdy(2, 15, 2015); 274 date1_year = year(date1); 275 date2_month = month(date2); 276 * shift date to beginning of next interval; 277 date1_next = intnx('MONTH', date1, 1); 278 * count intervals between dates; 279 months_between = intck('MONTH', date1, date2); 280 run; 281 282The equivalent pandas operations are shown below. In addition to these 283functions pandas supports other Time Series features 284not available in Base SAS (such as resampling and custom offsets) - 285see the :ref:`timeseries documentation<timeseries>` for more details. 286 287.. ipython:: python 288 289 tips["date1"] = pd.Timestamp("2013-01-15") 290 tips["date2"] = pd.Timestamp("2015-02-15") 291 tips["date1_year"] = tips["date1"].dt.year 292 tips["date2_month"] = tips["date2"].dt.month 293 tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin() 294 tips["months_between"] = tips["date2"].dt.to_period("M") - tips[ 295 "date1" 296 ].dt.to_period("M") 297 298 tips[ 299 ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"] 300 ].head() 301 302.. ipython:: python 303 :suppress: 304 305 tips = tips.drop( 306 ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"], 307 axis=1, 308 ) 309 310Selection of columns 311~~~~~~~~~~~~~~~~~~~~ 312 313SAS provides keywords in the ``DATA`` step to select, 314drop, and rename columns. 315 316.. code-block:: sas 317 318 data tips; 319 set tips; 320 keep sex total_bill tip; 321 run; 322 323 data tips; 324 set tips; 325 drop sex; 326 run; 327 328 data tips; 329 set tips; 330 rename total_bill=total_bill_2; 331 run; 332 333The same operations are expressed in pandas below. 334 335.. ipython:: python 336 337 # keep 338 tips[["sex", "total_bill", "tip"]].head() 339 340 # drop 341 tips.drop("sex", axis=1).head() 342 343 # rename 344 tips.rename(columns={"total_bill": "total_bill_2"}).head() 345 346 347Sorting by values 348~~~~~~~~~~~~~~~~~ 349 350Sorting in SAS is accomplished via ``PROC SORT`` 351 352.. code-block:: sas 353 354 proc sort data=tips; 355 by sex total_bill; 356 run; 357 358pandas objects have a :meth:`~DataFrame.sort_values` method, which 359takes a list of columns to sort by. 360 361.. ipython:: python 362 363 tips = tips.sort_values(["sex", "total_bill"]) 364 tips.head() 365 366 367String processing 368----------------- 369 370Length 371~~~~~~ 372 373SAS determines the length of a character string with the 374`LENGTHN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002284668.htm>`__ 375and `LENGTHC <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002283942.htm>`__ 376functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailing blanks. 377 378.. code-block:: sas 379 380 data _null_; 381 set tips; 382 put(LENGTHN(time)); 383 put(LENGTHC(time)); 384 run; 385 386Python determines the length of a character string with the ``len`` function. 387``len`` includes trailing blanks. Use ``len`` and ``rstrip`` to exclude 388trailing blanks. 389 390.. ipython:: python 391 392 tips["time"].str.len().head() 393 tips["time"].str.rstrip().str.len().head() 394 395 396Find 397~~~~ 398 399SAS determines the position of a character in a string with the 400`FINDW <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002978282.htm>`__ function. 401``FINDW`` takes the string defined by the first argument and searches for the first position of the substring 402you supply as the second argument. 403 404.. code-block:: sas 405 406 data _null_; 407 set tips; 408 put(FINDW(sex,'ale')); 409 run; 410 411Python determines the position of a character in a string with the 412``find`` function. ``find`` searches for the first position of the 413substring. If the substring is found, the function returns its 414position. Keep in mind that Python indexes are zero-based and 415the function will return -1 if it fails to find the substring. 416 417.. ipython:: python 418 419 tips["sex"].str.find("ale").head() 420 421 422Substring 423~~~~~~~~~ 424 425SAS extracts a substring from a string based on its position with the 426`SUBSTR <https://www2.sas.com/proceedings/sugi25/25/cc/25p088.pdf>`__ function. 427 428.. code-block:: sas 429 430 data _null_; 431 set tips; 432 put(substr(sex,1,1)); 433 run; 434 435With pandas you can use ``[]`` notation to extract a substring 436from a string by position locations. Keep in mind that Python 437indexes are zero-based. 438 439.. ipython:: python 440 441 tips["sex"].str[0:1].head() 442 443 444Scan 445~~~~ 446 447The SAS `SCAN <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm>`__ 448function returns the nth word from a string. The first argument is the string you want to parse and the 449second argument specifies which word you want to extract. 450 451.. code-block:: sas 452 453 data firstlast; 454 input String $60.; 455 First_Name = scan(string, 1); 456 Last_Name = scan(string, -1); 457 datalines2; 458 John Smith; 459 Jane Cook; 460 ;;; 461 run; 462 463Python extracts a substring from a string based on its text 464by using regular expressions. There are much more powerful 465approaches, but this just shows a simple approach. 466 467.. ipython:: python 468 469 firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]}) 470 firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0] 471 firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0] 472 firstlast 473 474 475Upcase, lowcase, and propcase 476~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 477 478The SAS `UPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245965.htm>`__ 479`LOWCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000245912.htm>`__ and 480`PROPCASE <https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/a002598106.htm>`__ 481functions change the case of the argument. 482 483.. code-block:: sas 484 485 data firstlast; 486 input String $60.; 487 string_up = UPCASE(string); 488 string_low = LOWCASE(string); 489 string_prop = PROPCASE(string); 490 datalines2; 491 John Smith; 492 Jane Cook; 493 ;;; 494 run; 495 496The equivalent Python functions are ``upper``, ``lower``, and ``title``. 497 498.. ipython:: python 499 500 firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]}) 501 firstlast["string_up"] = firstlast["String"].str.upper() 502 firstlast["string_low"] = firstlast["String"].str.lower() 503 firstlast["string_prop"] = firstlast["String"].str.title() 504 firstlast 505 506Merging 507------- 508 509The following tables will be used in the merge examples 510 511.. ipython:: python 512 513 df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)}) 514 df1 515 df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)}) 516 df2 517 518In SAS, data must be explicitly sorted before merging. Different 519types of joins are accomplished using the ``in=`` dummy 520variables to track whether a match was found in one or both 521input frames. 522 523.. code-block:: sas 524 525 proc sort data=df1; 526 by key; 527 run; 528 529 proc sort data=df2; 530 by key; 531 run; 532 533 data left_join inner_join right_join outer_join; 534 merge df1(in=a) df2(in=b); 535 536 if a and b then output inner_join; 537 if a then output left_join; 538 if b then output right_join; 539 if a or b then output outer_join; 540 run; 541 542pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides 543similar functionality. Note that the data does not have 544to be sorted ahead of time, and different join 545types are accomplished via the ``how`` keyword. 546 547.. ipython:: python 548 549 inner_join = df1.merge(df2, on=["key"], how="inner") 550 inner_join 551 552 left_join = df1.merge(df2, on=["key"], how="left") 553 left_join 554 555 right_join = df1.merge(df2, on=["key"], how="right") 556 right_join 557 558 outer_join = df1.merge(df2, on=["key"], how="outer") 559 outer_join 560 561 562Missing data 563------------ 564 565Like SAS, pandas has a representation for missing data - which is the 566special float value ``NaN`` (not a number). Many of the semantics 567are the same, for example missing data propagates through numeric 568operations, and is ignored by default for aggregations. 569 570.. ipython:: python 571 572 outer_join 573 outer_join["value_x"] + outer_join["value_y"] 574 outer_join["value_x"].sum() 575 576One difference is that missing data cannot be compared to its sentinel value. 577For example, in SAS you could do this to filter missing values. 578 579.. code-block:: sas 580 581 data outer_join_nulls; 582 set outer_join; 583 if value_x = .; 584 run; 585 586 data outer_join_no_nulls; 587 set outer_join; 588 if value_x ^= .; 589 run; 590 591Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions 592should be used for comparisons. 593 594.. ipython:: python 595 596 outer_join[pd.isna(outer_join["value_x"])] 597 outer_join[pd.notna(outer_join["value_x"])] 598 599pandas also provides a variety of methods to work with missing data - some of 600which would be challenging to express in SAS. For example, there are methods to 601drop all rows with any missing values, replacing missing values with a specified 602value, like the mean, or forward filling from previous rows. See the 603:ref:`missing data documentation<missing_data>` for more. 604 605.. ipython:: python 606 607 outer_join.dropna() 608 outer_join.fillna(method="ffill") 609 outer_join["value_x"].fillna(outer_join["value_x"].mean()) 610 611 612GroupBy 613------- 614 615Aggregation 616~~~~~~~~~~~ 617 618SAS's PROC SUMMARY can be used to group by one or 619more key variables and compute aggregations on 620numeric columns. 621 622.. code-block:: sas 623 624 proc summary data=tips nway; 625 class sex smoker; 626 var total_bill tip; 627 output out=tips_summed sum=; 628 run; 629 630pandas provides a flexible ``groupby`` mechanism that 631allows similar aggregations. See the :ref:`groupby documentation<groupby>` 632for more details and examples. 633 634.. ipython:: python 635 636 tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum() 637 tips_summed.head() 638 639 640Transformation 641~~~~~~~~~~~~~~ 642 643In SAS, if the group aggregations need to be used with 644the original frame, it must be merged back together. For 645example, to subtract the mean for each observation by smoker group. 646 647.. code-block:: sas 648 649 proc summary data=tips missing nway; 650 class smoker; 651 var total_bill; 652 output out=smoker_means mean(total_bill)=group_bill; 653 run; 654 655 proc sort data=tips; 656 by smoker; 657 run; 658 659 data tips; 660 merge tips(in=a) smoker_means(in=b); 661 by smoker; 662 adj_total_bill = total_bill - group_bill; 663 if a and b; 664 run; 665 666 667pandas ``groupby`` provides a ``transform`` mechanism that allows 668these type of operations to be succinctly expressed in one 669operation. 670 671.. ipython:: python 672 673 gb = tips.groupby("smoker")["total_bill"] 674 tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean") 675 tips.head() 676 677 678By group processing 679~~~~~~~~~~~~~~~~~~~ 680 681In addition to aggregation, pandas ``groupby`` can be used to 682replicate most other by group processing from SAS. For example, 683this ``DATA`` step reads the data by sex/smoker group and filters to 684the first entry for each. 685 686.. code-block:: sas 687 688 proc sort data=tips; 689 by sex smoker; 690 run; 691 692 data tips_first; 693 set tips; 694 by sex smoker; 695 if FIRST.sex or FIRST.smoker then output; 696 run; 697 698In pandas this would be written as: 699 700.. ipython:: python 701 702 tips.groupby(["sex", "smoker"]).first() 703 704 705Other considerations 706-------------------- 707 708Disk vs memory 709~~~~~~~~~~~~~~ 710 711pandas operates exclusively in memory, where a SAS data set exists on disk. 712This means that the size of data able to be loaded in pandas is limited by your 713machine's memory, but also that the operations on that data may be faster. 714 715If out of core processing is needed, one possibility is the 716`dask.dataframe <https://dask.pydata.org/en/latest/dataframe.html>`_ 717library (currently in development) which 718provides a subset of pandas functionality for an on-disk ``DataFrame`` 719 720Data interop 721~~~~~~~~~~~~ 722 723pandas provides a :func:`read_sas` method that can read SAS data saved in 724the XPORT or SAS7BDAT binary format. 725 726.. code-block:: sas 727 728 libname xportout xport 'transport-file.xpt'; 729 data xportout.tips; 730 set tips(rename=(total_bill=tbill)); 731 * xport variable names limited to 6 characters; 732 run; 733 734.. code-block:: python 735 736 df = pd.read_sas("transport-file.xpt") 737 df = pd.read_sas("binary-file.sas7bdat") 738 739You can also specify the file format directly. By default, pandas will try 740to infer the file format based on its extension. 741 742.. code-block:: python 743 744 df = pd.read_sas("transport-file.xpt", format="xport") 745 df = pd.read_sas("binary-file.sas7bdat", format="sas7bdat") 746 747XPORT is a relatively limited format and the parsing of it is not as 748optimized as some of the other pandas readers. An alternative way 749to interop data between SAS and pandas is to serialize to csv. 750 751.. code-block:: ipython 752 753 # version 0.17, 10M rows 754 755 In [8]: %time df = pd.read_sas('big.xpt') 756 Wall time: 14.6 s 757 758 In [9]: %time df = pd.read_csv('big.csv') 759 Wall time: 4.86 s 760