Unlocking the potential of your data begins with knowing how to import it into the right tools. Whether you work with SPSS, Stata or SAS, importing data from these platforms opens the door to advanced analysis, insightful discoveries and informed decision making. Before diving in, it is worth getting your workspace and datasets ready to ensure a smooth, efficient workflow. From preparing your files to executing the import, doing it correctly can save you time and avoid costly errors. As we walk through the steps, you will gain a clear understanding of how the process works and why these platforms excel at complex data operations. And there's more: once you are comfortable with the basics, you will be ready to explore the additional features these packages offer to supercharge your analysis. For more tips on optimizing your data workflows, be sure to check out the additional resources provided.
Table of Contents
- Importing SPSS, Stata and SAS Data
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Importing SPSS, Stata and SAS Data
We will use pyreadstat to read data from three popular statistical packages
into pandas. The key advantage of pyreadstat is that it allows data analysts
to import data from these packages without losing metadata, such as variable
and value labels. The SPSS, Stata and SAS data files we receive often arrive
with the typical issues of CSV files, Excel files and SQL databases already
resolved. We do not usually get the invalid column names, changes
in data types and unclear missing values common with CSV or Excel
files, nor do we usually get the detachment of data from business logic, such
as the meaning of data codes, that we often get with SQL data. When
someone or some organization shares a data file from one of these packages
with us, they have often added variable labels and value labels for categorical
data. For example, a hypothetical data column called presentsat has the
variable label overall satisfaction with presentation and value labels
1-5, with 1 being not at all satisfied and 5 being highly satisfied. The
challenge is retaining that metadata when importing data from those systems
into pandas. There is no precise equivalent to variable and value labels in
pandas, and pandas' built-in tools for importing SPSS, Stata and SAS data lose the
metadata. In this recipe, we will use pyreadstat to load variable and value
label information and use a couple of techniques for representing that
information in pandas.
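Before turning to pyreadstat, it helps to see what "representing that information in pandas" can look like. The following sketch uses a toy DataFrame standing in for the hypothetical presentsat column described above; it maps value codes to labels as an ordered category, and stashes the variable label in df.attrs (just one possible home for it, since pandas has no native variable-label concept):

```python
import pandas as pd

# Toy data standing in for the hypothetical presentsat column
df = pd.DataFrame({"presentsat": [1, 5, 3, 5, 2]})

# Value labels, as they might arrive from a statistical package
value_labels = {1: "Not at all satisfied", 2: "2", 3: "3",
                4: "4", 5: "Highly satisfied"}

# Map the codes to labels and store them as an ordered category,
# preserving the 1-5 ordering of the original scale
df["presentsat"] = pd.Categorical(
    df["presentsat"].map(value_labels),
    categories=list(value_labels.values()),
    ordered=True,
)

# The variable label has no pandas equivalent; df.attrs is one place to keep it
df.attrs["presentsat"] = "overall satisfaction with presentation"

print(df["presentsat"].dtype)  # category
```

The ordered category keeps both the readable labels and the underlying scale ordering, which is what the value-label metadata encodes.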
Getting Ready
This recipe assumes you have installed the pyreadstat package. If it is not
installed, you can install it with pip from the terminal (or PowerShell in
Windows):
pip install pyreadstat
You will need the SPSS, Stata and SAS data files to run the code in this recipe. We will work with data
from the United States National Longitudinal Survey of Youth.
The National Longitudinal Survey of Youth is conducted by the
United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997. Each survey respondent was high
school age when they first completed the survey, having been born
between 1980 and 1985. There were annual follow-up surveys
through 2017. For this recipe, I pulled 42 variables on grades,
employment, income and attitudes toward government from the
hundreds of data items on the survey. Separate files for SPSS,
Stata and SAS can be downloaded from the repository. NLS data
can be downloaded from the link below:
https://www.nlsinfo.org/investigator/pages/search.
How To Do It
We will import data from SPSS, Stata and SAS, retaining metadata such as
value labels.
- Import pandas, numpy and pyreadstat. This step assumes that you have installed pyreadstat.
import pandas as pd
import numpy as np
import pyreadstat
- Retrieve the SPSS data.
Pass a path and filename to the read_sav method of pyreadstat. Display
the first few rows and a frequency distribution. Notice that the column names
are non-descriptive, the values are unlabeled codes, and read_sav returns both a
pandas DataFrame and a meta object.
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
nls97spss.dtypes
Output:
R0000100 float64
R0536300 float64
R0536401 float64
...
U2962900 float64
U2963000 float64
Z9063900 float64
dtype: object
nls97spss.head()
Output:
R0000100 R0536300 ... U2963000 Z9063900
0 1 2 ... nan 52
1 2 1 ... 6 0
2 3 2 ... 6 0
3 4 2 ... 6 4
4 5 1 ... 5 12
[5 rows x 42 columns]
nls97spss['R0536300'].value_counts(normalize=True)
Output:
1.00 0.51
2.00 0.49
Name: R0536300, dtype: float64
- Grab the metadata to improve column labels and value labels.
The meta spss object created when we called read_sav has the column
labels and the value labels from the SPSS file. Use the
variable_value_labels dictionary to map values to value labels for one
column (R0536300). This does not change the data. It only improves our
display when we run value_counts. Use the set_value_labels method to
actually apply the value labels to the DataFrame.
metaspss.variable_value_labels['R0536300']
Output:
{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}
nls97spss['R0536300'].\
  map(metaspss.variable_value_labels['R0536300']).\
  value_counts(normalize=True)
Output:
Male 0.51
Female 0.49
Name: R0536300, dtype: float64
nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)
- Use column labels in the metadata to rename the columns.
To use the column labels from metaspss in our DataFrame, we can
simply assign the column labels in metaspss to our DataFrame’s
column names. Clean up the column names a bit by changing them to
lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters.
nls97spss.columns = metaspss.column_labels
nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)
Output:
Male 0.51
Female 0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64
nls97spss.dtypes
Output:
PUBID - YTH ID CODE 1997 float64
KEY!SEX (SYMBOL) 1997 category
KEY!BDATE M/Y (SYMBOL) 1997 float64
KEY!BDATE M/Y (SYMBOL) 1997 float64
CV_SAMPLE_TYPE 1997 category
KEY!RACE_ETHNICITY (SYMBOL) 1997 category
HRS/WK R WATCHES TELEVISION 2017 category
HRS/NIGHT R SLEEPS 2017 float64
CVC_WKSWK_YR_ALL L99 float64
dtype: object
nls97spss.columns = nls97spss.columns.\
  str.lower().\
  str.replace(' ','_').\
  str.replace('[^a-z0-9_]', '', regex=True)
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
- Simplify the process by applying the value labels from the beginning.
The data values can actually be applied in the initial call to read_sav by
setting apply_value_formats to True. This eliminates the need to call the
set_value_labels function later.
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav',
apply_value_formats=True, formats_as_category=True)
nls97spss.columns = metaspss.column_labels
nls97spss.columns = nls97spss.columns.\
str.lower().\
str.replace(' ','_').\
str.replace('[^a-z0-9_]', '', regex=True)
- Show the columns and a few rows.
nls97spss.dtypes
Output:
pubid__yth_id_code_1997 float64
keysex_symbol_1997 category
keybdate_my_symbol_1997 float64
keybdate_my_symbol_1997 float64
hrsnight_r_sleeps_2017 float64
cvc_wkswk_yr_all_l99 float64
dtype: object
nls97spss.head()
Output:
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 nan 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
- Run frequencies on one of the columns and set the index.
nls97spss.govt_responsibility__provide_jobs_2006.\
  value_counts(sort=False)
Output:
Definitely should be 454
Definitely should not be 300
Probably should be 617
Probably should not be 462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
- Import the Stata data, apply value labels and improve the column
headings.
Use the same methods for the Stata data that we use for the SPSS data.
nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
nls97stata.columns = metastata.column_labels
nls97stata.columns = nls97stata.columns.\
str.lower().\
str.replace(' ','_').\
str.replace('[^a-z0-9_]', '', regex=True)
nls97stata.dtypes
Output:
pubid__yth_id_code_1997 float64
keysex_symbol_1997 category
keybdate_my_symbol_1997 float64
keybdate_my_symbol_1997 float64
hrsnight_r_sleeps_2017 float64
cvc_wkswk_yr_all_l99 float64
dtype: object
- View a few rows of the data and run a frequency.
nls97stata.head()
Output:
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 -5 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
nls97stata.govt_responsibility__provide_jobs_2006.\
  value_counts(sort=False)
Output:
5.0 1425
-4.0 5665
-2.0 56
-1.0 5
Definitely should be 454
Definitely should not be 300
Probably should be 617
Probably should not be 462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
- Fix the logical missing values that show up with the Stata data and set
an index.
nls97stata.min()
Output:
pubid__yth_id_code_1997 1
keysex_symbol_1997 Female
keybdate_my_symbol_1997 1
keybdate_my_symbol_1997 1,980
cv_bio_child_hh_2017 -5
cv_bio_child_nr_2017 -5
hrsnight_r_sleeps_2017 -5
cvc_wkswk_yr_all_l99 -4
dtype: object
nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
nls97stata.min()
Output:
pubid__yth_id_code_1997 1
keysex_symbol_1997 Female
keybdate_my_symbol_1997 1
keybdate_my_symbol_1997 1,980
cv_bio_child_hh_2017 0
cv_bio_child_nr_2017 0
hrsnight_r_sleeps_2017 0
cvc_wkswk_yr_all_l99 0
dtype: object
nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)
- Retrieve the SAS data, using the SAS catalog file for value labels.
The value labels for SAS data are stored in a catalog file. Setting the catalog file
path and filename retrieves the value labels and applies them.
nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
nls97sas.columns = metasas.column_labels
nls97sas.columns = nls97sas.columns.\
str.lower().\
str.replace(' ','_').\
str.replace('[^a-z0-9_]', '', regex=True)
nls97sas.head()
Output:
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 nan 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
nls97sas.keysex_symbol_1997.value_counts()
Output:
Male 4599
Female 4385
Name: keysex_symbol_1997, dtype: int64
nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)
This demonstrates how to import SPSS, SAS and Stata data without losing
important metadata.
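The column-cleanup chain appears three times in the steps above, so it can be factored into a small helper if you prefer. This is just a sketch in pure pandas, with labels copied from the recipe's output for illustration:

```python
import pandas as pd

def clean_labels(labels):
    """Turn metadata column labels into snake_case column names,
    chaining the same string operations used in the recipe."""
    return (
        pd.Index(labels)
        .str.lower()                                  # lowercase
        .str.replace(" ", "_")                        # spaces -> underscores
        .str.replace("[^a-z0-9_]", "", regex=True)    # drop other characters
    )

# Two labels copied from the recipe's output
print(list(clean_labels(["KEY!SEX (SYMBOL) 1997",
                         "PUBID - YTH ID CODE 1997"])))
# ['keysex_symbol_1997', 'pubid__yth_id_code_1997']
```

A helper like this could then be applied identically to the SPSS, Stata and SAS DataFrames: nls97spss.columns = clean_labels(metaspss.column_labels), and so on.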
How It Works
The read_sav, read_dta and read_sas7bdat methods of pyreadstat, for SPSS, Stata and SAS data files respectively, work in a similar manner.
Value labels can be applied when reading in the data by setting
apply_value_formats to True for SPSS and Stata files (steps 5 and 8), or
by providing a catalog file path and filename for SAS (step 11). We can set
formats_as_category to True to change the data type to category for
those columns where the data values will change. The meta object has the
column names and the column labels from the statistical package, so
metadata column labels can be assigned to pandas DataFrame column names
at any point (nls97spss.columns = metaspss.column_labels). We can
even revert to the original column headings after assigning meta column
labels by setting pandas column names to the metadata column
names (nls97spss.columns = metaspss.column_names). In step 3, we read the SPSS data without applying value labels. We looked at the dictionary for
one variable (metaspss.variable_value_labels['R0536300']), but we
could have viewed it for all variables (metaspss.variable_value_labels).
When we are satisfied that the labels make sense, we can set them by calling
the set_value_labels function. This is a good approach when you do not
know the data well and want to inspect the labels before applying them. The
column labels from the meta object are often a better choice than the original
column headings. Column headings can be quite cryptic, particularly when
the SPSS, Stata or SAS file is based on a large survey, as in this example.
But the labels are not usually ideal for column headings either. They
sometimes have spaces, capitalization that is not helpful and non-alphanumeric characters. We chain some string operations to switch to
lowercase, replace spaces with underscores, and remove non-alphanumeric
characters. Handling missing values is not always straightforward with these
data files, since there are often many reasons why data is missing. If the file
is from a survey, a value may be missing because of a survey skip pattern,
because a respondent failed to respond, because the response was invalid, and so on. The
National Longitudinal Survey has nine possible missing-value codes, from -1 to
-9. The SPSS import automatically set those values to NaN, while the Stata
import retained the original values. We could have gotten the SPSS import
to retain those values by setting user_missing to True. For the Stata data,
we need to tell it to replace all values from -1 to -9 with NaN. We do this by
using the DataFrame's replace function and passing it a list of integers
from -9 to -1 (list(range(-9,0))).
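The Stata missing-value cleanup can be illustrated on toy data, without the NLS files. This sketch mimics the recipe's replace call on a hypothetical column containing NLS-style negative codes:

```python
import numpy as np
import pandas as pd

# Toy column with NLS-style negative missing-value codes (-1 through -9)
ser = pd.Series([6, -5, 7, -4, 8, -1], dtype="float64")

# Replace every code from -9 to -1 with NaN, exactly as in the recipe
cleaned = ser.replace(list(range(-9, 0)), np.nan)

print(cleaned.min())         # 6.0 once the codes are gone
print(cleaned.isna().sum())  # 3 codes became NaN
```

Checking min() before and after, as in step 10, is a quick way to confirm that no negative codes survived the replacement.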
There’s More
You may have noticed similarities between this recipe and the previous one
in terms of how value labels are set. The set_value_labels function is like
the DataFrame replace operation we used to set value labels in that recipe.
We passed a dictionary to replace that mapped columns to value labels.
The set_value_labels function in this recipe essentially does the same
thing using the variable_value_labels property of the meta object as the
dictionary. Data from statistical packages is often not as well structured as
SQL databases tend to be in one significant way. Since they are designed to
facilitate analysis, they often violate database normalization rules. There is
often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual-level and event-level data: a
person and their hospital visits, or a brown bear and each date it emerged from hibernation.
Often, this data will need to be reshaped for some aspects of the analysis.
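To see the parallel with the previous recipe, here is a small self-contained sketch (toy column names and labels, not the NLS data) showing DataFrame.replace doing the work of set_value_labels when given a variable_value_labels-style nested dictionary:

```python
import pandas as pd

# Toy DataFrame and a nested dict shaped like the meta object's
# variable_value_labels property: {column: {value: label}}
df = pd.DataFrame({"sex": [1.0, 2.0, 1.0],
                   "region": [1.0, 2.0, 2.0]})
labels = {
    "sex": {1.0: "Male", 2.0: "Female"},
    "region": {1.0: "North", 2.0: "South"},
}

# replace with a column -> {value: label} dict applies labels
# column by column, much as set_value_labels does
labeled = df.replace(labels)
print(labeled["sex"].tolist())  # ['Male', 'Female', 'Male']
```

Unlike set_value_labels, this plain replace does not convert the columns to category dtype; you would call astype('category') afterward to match the formats_as_category=True behavior.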
See Also
The pyreadstat package is nicely documented at: https://github.com/Roche/pyreadstat.
The package has many useful options
for selecting columns and handling missing data that space did not permit me
to demonstrate in this recipe. In section 8, Tidying and Reshaping Data, we
will examine how to normalize data that may have been flattened for
analytical purposes.
Conclusion
In conclusion, importing data from SPSS, Stata and SAS into your workflow opens up new possibilities for data analysis and decision making. Getting ready involves installing the right tools, ensuring you have access to clean, structured data and understanding how these formats fit into your system. By following the steps outlined in the "How to do it" section, you can efficiently bring in data from these statistical packages while preserving their metadata. The "How it works" section explains the intricacies of each format and how to streamline your workflow, while the "There's more" section and the "See also" links point to advanced techniques and further reading to deepen your mastery of importing SPSS, Stata and SAS data.