Unlocking the potential of your data begins with knowing how to import it into the right tools. Whether you work with SPSS, Stata or SAS, importing data from these platforms opens the door to advanced analysis, insightful discoveries and informed decision making. Before diving in, it is worth getting your workspace and datasets ready to ensure a smooth, efficient workflow. From preparing your files to executing the import, doing it correctly can save you time and avoid costly errors. As we walk through the steps, you will gain a clear understanding of how the process works and why these platforms excel at complex data operations. And there's more: once you are comfortable with the basics, you will be ready to explore the additional features these packages offer to supercharge your analysis. For more tips on optimizing your data workflows, be sure to check out the additional resources provided.
Table of Contents
- Importing SPSS, Stata and SAS Data
- Getting ready
- How to do it
- How it works
- There’s more
- See also
Importing SPSS, Stata and SAS Data
We will use pyreadstat to read data from three popular statistical packages
into pandas. The key advantage of pyreadstat is that it allows data analysts
to import data from these packages without losing metadata, such as variable
and value labels. The SPSS, Stata and SAS data files we receive often arrive
with the typical issues of CSV files, Excel files and SQL databases already
resolved. We do not usually get the invalid column names, changes
in data types and unclear missing values common with CSV or Excel
files, nor do we usually get the detachment of data from business logic, such
as the meaning of data codes, that we often get with SQL data. When
someone or some organization shares a data file from one of these packages
with us, they have often added variable labels and value labels for categorical
data. For example, a hypothetical data column called presentsat has the
variable label overall satisfaction with presentation and value labels
1-5, with 1 being not at all satisfied and 5 being highly satisfied. The
challenge is retaining that metadata when importing data from those systems
into pandas. There is no precise equivalent to variable and value labels in
pandas, and pandas' built-in tools for importing SPSS, Stata and SAS data lose the
metadata. In this recipe, we will use pyreadstat to load variable and value
label information and use a couple of techniques for representing that
information in pandas.
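Before turning to pyreadstat, it helps to see what "representing that information in pandas" can look like. The following sketch uses a toy DataFrame standing in for the hypothetical presentsat column described above; it maps value codes to labels as an ordered category, and stashes the variable label in df.attrs (just one possible home for it, since pandas has no native variable-label concept):

```python
import pandas as pd

# Toy data standing in for the hypothetical presentsat column
df = pd.DataFrame({"presentsat": [1, 5, 3, 5, 2]})

# Value labels, as they might arrive from a statistical package
value_labels = {1: "Not at all satisfied", 2: "2", 3: "3",
                4: "4", 5: "Highly satisfied"}

# Map the codes to labels and store them as an ordered category,
# preserving the 1-5 ordering of the original scale
df["presentsat"] = pd.Categorical(
    df["presentsat"].map(value_labels),
    categories=list(value_labels.values()),
    ordered=True,
)

# The variable label has no pandas equivalent; df.attrs is one place to keep it
df.attrs["presentsat"] = "overall satisfaction with presentation"

print(df["presentsat"].dtype)  # category
```

The ordered category keeps both the readable labels and the underlying scale ordering, which is what the value-label metadata encodes.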
Getting Ready
This recipe assumes you have installed the pyreadstat package. If it is not
installed, you can install it with pip from the terminal (or PowerShell in
Windows):
pip install pyreadstat
You will need the SPSS, Stata and SAS data files to run the code in this recipe. We will work with data
from the United States National Longitudinal Survey of Youth.
The National Longitudinal Survey of Youth is conducted by the
United States Bureau of Labor Statistics. This survey started with a cohort of individuals in 1997. Each survey respondent was high
school age when they first completed the survey, having been born
between 1980 and 1985. There were annual follow-up surveys
through 2017. For this recipe, I pulled 42 variables on grades,
employment, income and attitudes toward government from the
hundreds of data items on the survey. Separate files for SPSS,
Stata and SAS can be downloaded from the repository. NLS data
can be downloaded from the link below:
https://www.nlsinfo.org/investigator/pages/search.
How To Do It
We will import data from SPSS, Stata and SAS, retaining metadata such as
value labels.
- Import pandas, numpy and pyreadstat. This step assumes that you have installed pyreadstat.
import pandas as pd
import numpy as np
import pyreadstat
- Retrieve the SPSS data.
Pass a path and filename to the read_sav method of pyreadstat. Display
the first few rows and a frequency distribution. Notice that the column names
are non-descriptive, the values are unlabeled codes, and read_sav returns both a
pandas DataFrame and a meta object.
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav')
nls97spss.dtypes
Output:
R0000100 float64
R0536300 float64
R0536401 float64
...
U2962900 float64
U2963000 float64
Z9063900 float64
dtype: object
nls97spss.head()
Output:
R0000100 R0536300 ... U2963000 Z9063900
0 1 2 ... nan 52
1 2 1 ... 6 0
2 3 2 ... 6 0
3 4 2 ... 6 4
4 5 1 ... 5 12
[5 rows x 42 columns]
nls97spss['R0536300'].value_counts(normalize=True)
Output:
1.00 0.51
2.00 0.49
Name: R0536300, dtype: float64
- Grab the metadata to improve column labels and value labels.
The meta spss object created when we called read_sav has the column
labels and the value labels from the SPSS file. Use the
variable_value_labels dictionary to map values to value labels for one
column (R0536300). This does not change the data. It only improves our
display when we run value_counts. Use the set_value_labels method to
actually apply the value labels to the DataFrame.
metaspss.variable_value_labels['R0536300']
Output:
{0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}
nls97spss['R0536300'].\
  map(metaspss.variable_value_labels['R0536300']).\
  value_counts(normalize=True)
Output:
Male 0.51
Female 0.49
Name: R0536300, dtype: float64
nls97spss = pyreadstat.set_value_labels(nls97spss, metaspss, formats_as_category=True)
- Use column labels in the metadata to rename the columns.
To use the column labels from metaspss in our DataFrame, we can
simply assign the column labels in metaspss to our DataFrame’s
column names. Clean up the column names a bit by changing them to
lowercase, changing spaces to underscores, and removing all remaining non-alphanumeric characters.
nls97spss.columns = metaspss.column_labels
nls97spss['KEY!SEX (SYMBOL) 1997'].value_counts(normalize=True)
Output:
Male 0.51
Female 0.49
Name: KEY!SEX (SYMBOL) 1997, dtype: float64
nls97spss.dtypes
Output:
PUBID - YTH ID CODE 1997 float64
KEY!SEX (SYMBOL) 1997 category
KEY!BDATE M/Y (SYMBOL) 1997 float64
KEY!BDATE M/Y (SYMBOL) 1997 float64
CV_SAMPLE_TYPE 1997 category
KEY!RACE_ETHNICITY (SYMBOL) 1997 category
HRS/WK R WATCHES TELEVISION 2017 category
HRS/NIGHT R SLEEPS 2017 float64
CVC_WKSWK_YR_ALL L99 float64
dtype: object
nls97spss.columns = nls97spss.columns.\
  str.lower().\
  str.replace(' ','_').\
  str.replace('[^a-z0-9_]', '', regex=True)
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
- Simplify the process by applying the value labels from the beginning.
The data values can actually be applied in the initial call to read_sav by
setting apply_value_formats to True. This eliminates the need to call the
set_value_labels function later.
nls97spss, metaspss = pyreadstat.read_sav('data/nls97.sav',
apply_value_formats=True, formats_as_category=True)
nls97spss.columns = metaspss.column_labels
nls97spss.columns = nls97spss.columns.\
str.lower().\
str.replace(' ','_').\
str.replace('[^a-z0-9_]', '', regex=True)
- Show the columns and a few rows.
nls97spss.dtypes
Output:
pubid__yth_id_code_1997 float64
keysex_symbol_1997 category
keybdate_my_symbol_1997 float64
keybdate_my_symbol_1997 float64
hrsnight_r_sleeps_2017 float64
cvc_wkswk_yr_all_l99 float64
dtype: object
nls97spss.head()
Output:
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 nan 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
- Run frequencies on one of the columns and set the index.
nls97spss.govt_responsibility__provide_jobs_2006.\
  value_counts(sort=False)
Output:
Definitely should be 454
Definitely should not be 300
Probably should be 617
Probably should not be 462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
nls97spss.set_index('pubid__yth_id_code_1997', inplace=True)
- Import the Stata data, apply value labels and improve the column
headings.
Use the same methods for the Stata data that we use for the SPSS data.
nls97stata, metastata = pyreadstat.read_dta('data/nls97.dta', apply_value_formats=True, formats_as_category=True)
nls97stata.columns = metastata.column_labels
nls97stata.columns = nls97stata.columns.\
str.lower().\
str.replace(' ','_').\
str.replace('[^a-z0-9_]', '', regex=True)
nls97stata.dtypes
Output:
pubid__yth_id_code_1997 float64
keysex_symbol_1997 category
keybdate_my_symbol_1997 float64
keybdate_my_symbol_1997 float64
hrsnight_r_sleeps_2017 float64
cvc_wkswk_yr_all_l99 float64
dtype: object
- View a few rows of the data and run a frequency.
nls97stata.head()
Output:
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 -5 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
nls97stata.govt_responsibility__provide_jobs_2006.\
  value_counts(sort=False)
Output:
5.0 1425
-4.0 5665
-2.0 56
-1.0 5
Definitely should be 454
Definitely should not be 300
Probably should be 617
Probably should not be 462
Name: govt_responsibility__provide_jobs_2006, dtype: int64
- Fix the logical missing values that show up with the Stata data and set
an index.
nls97stata.min()
Output:
pubid__yth_id_code_1997 1
keysex_symbol_1997 Female
keybdate_my_symbol_1997 1
keybdate_my_symbol_1997 1,980
cv_bio_child_hh_2017 -5
cv_bio_child_nr_2017 -5
hrsnight_r_sleeps_2017 -5
cvc_wkswk_yr_all_l99 -4
dtype: object
nls97stata.replace(list(range(-9,0)), np.nan, inplace=True)
nls97stata.min()
Output:
pubid__yth_id_code_1997 1
keysex_symbol_1997 Female
keybdate_my_symbol_1997 1
keybdate_my_symbol_1997 1,980
cv_bio_child_hh_2017 0
cv_bio_child_nr_2017 0
hrsnight_r_sleeps_2017 0
cvc_wkswk_yr_all_l99 0
dtype: object
nls97stata.set_index('pubid__yth_id_code_1997', inplace=True)
- Retrieve the SAS data, using the SAS catalog file for value labels.
The value labels for SAS data are stored in a catalog file. Setting the catalog file
path and filename retrieves the value labels and applies them.
nls97sas, metasas = pyreadstat.read_sas7bdat('data/nls97.sas7bdat', catalog_file='data/nlsformats3.sas7bcat', formats_as_category=True)
nls97sas.columns = metasas.column_labels
nls97sas.columns = nls97sas.columns.\
str.lower().\
str.replace(' ','_').\
str.replace('[^a-z0-9_]', '', regex=True)
nls97sas.head()
Output:
pubid__yth_id_code_1997 keysex_symbol_1997 ... \
0 1 Female ...
1 2 Male ...
2 3 Female ...
3 4 Female ...
4 5 Male ...
hrsnight_r_sleeps_2017 cvc_wkswk_yr_all_l99
0 nan 52
1 6 0
2 6 0
3 6 4
4 5 12
[5 rows x 42 columns]
nls97sas.keysex_symbol_1997.value_counts()
Output:
Male 4599
Female 4385
Name: keysex_symbol_1997, dtype: int64
nls97sas.set_index('pubid__yth_id_code_1997', inplace=True)
This demonstrates how to import SPSS, SAS and Stata data without losing
important metadata.
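The column-cleanup chain appears three times in the steps above, so it can be factored into a small helper if you prefer. This is just a sketch in pure pandas, with labels copied from the recipe's output for illustration:

```python
import pandas as pd

def clean_labels(labels):
    """Turn metadata column labels into snake_case column names,
    chaining the same string operations used in the recipe."""
    return (
        pd.Index(labels)
        .str.lower()                                  # lowercase
        .str.replace(" ", "_")                        # spaces -> underscores
        .str.replace("[^a-z0-9_]", "", regex=True)    # drop other characters
    )

# Two labels copied from the recipe's output
print(list(clean_labels(["KEY!SEX (SYMBOL) 1997",
                         "PUBID - YTH ID CODE 1997"])))
# ['keysex_symbol_1997', 'pubid__yth_id_code_1997']
```

A helper like this could then be applied identically to the SPSS, Stata and SAS DataFrames: nls97spss.columns = clean_labels(metaspss.column_labels), and so on.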
How It Works
The read_sav, read_dta and read_sas7bdat methods of pyreadstat, for SPSS, Stata and SAS data files respectively, work in a similar manner.
Value labels can be applied when reading in the data by setting
apply_value_formats to True for SPSS and Stata files (steps 5 and 8), or
by providing a catalog file path and filename for SAS (step 11). We can set
formats_as_category to True to change the data type to category for
those columns where the data values will change. The meta object has the
column names and the column labels from the statistical package, so
metadata column labels can be assigned to pandas DataFrame column names
at any point (nls97spss.columns = metaspss.column_labels). We can
even revert to the original column headings after assigning meta column
labels by setting pandas column names to the metadata column
names (nls97spss.columns = metaspss.column_names). In step 3, we read the SPSS data without applying value labels. We looked at the dictionary for
one variable (metaspss.variable_value_labels['R0536300']), but we
could have viewed it for all variables (metaspss.variable_value_labels).
When we are satisfied that the labels make sense, we can set them by calling
the set_value_labels function. This is a good approach when you do not
know the data well and want to inspect the labels before applying them. The
column labels from the meta object are often a better choice than the original
column headings. Column headings can be quite cryptic, particularly when
the SPSS, Stata or SAS file is based on a large survey, as in this example.
But the labels are not usually ideal for column headings either. They
sometimes have spaces, capitalization that is not helpful and non-alphanumeric characters. We chain some string operations to switch to
lowercase, replace spaces with underscores, and remove non-alphanumeric
characters. Handling missing values is not always straightforward with these
data files, since there are often many reasons why data is missing. If the file
is from a survey, a value may be missing because of a survey skip pattern,
because a respondent failed to respond, because the response was invalid, and so on. The
National Longitudinal Survey has nine possible missing-value codes, from -1 to
-9. The SPSS import automatically set those values to NaN, while the Stata
import retained the original values. We could have gotten the SPSS import
to retain those values by setting user_missing to True. For the Stata data,
we need to tell it to replace all values from -1 to -9 with NaN. We do this by
using the DataFrame's replace function and passing it a list of integers
from -9 to -1 (list(range(-9,0))).
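The Stata missing-value cleanup can be illustrated on toy data, without the NLS files. This sketch mimics the recipe's replace call on a hypothetical column containing NLS-style negative codes:

```python
import numpy as np
import pandas as pd

# Toy column with NLS-style negative missing-value codes (-1 through -9)
ser = pd.Series([6, -5, 7, -4, 8, -1], dtype="float64")

# Replace every code from -9 to -1 with NaN, exactly as in the recipe
cleaned = ser.replace(list(range(-9, 0)), np.nan)

print(cleaned.min())         # 6.0 once the codes are gone
print(cleaned.isna().sum())  # 3 codes became NaN
```

Checking min() before and after, as in step 10, is a quick way to confirm that no negative codes survived the replacement.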
There’s More
You may have noticed similarities between this recipe and the previous one
in terms of how value labels are set. The set_value_labels function is like
the DataFrame replace operation we used to set value labels in that recipe.
We passed a dictionary to replace that mapped columns to value labels.
The set_value_labels function in this recipe essentially does the same
thing using the variable_value_labels property of the meta object as the
dictionary. Data from statistical packages is often not as well structured as
SQL databases tend to be in one significant way. Since they are designed to
facilitate analysis, they often violate database normalization rules. There is
often an implied relational structure that might have to be unflattened at some point. For example, the data may combine individual-level and event-level data: a
person and their hospital visits, or a brown bear and each date it emerged from hibernation.
Often, this data will need to be reshaped for some aspects of the analysis.
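To see the parallel with the previous recipe, here is a small self-contained sketch (toy column names and labels, not the NLS data) showing DataFrame.replace doing the work of set_value_labels when given a variable_value_labels-style nested dictionary:

```python
import pandas as pd

# Toy DataFrame and a nested dict shaped like the meta object's
# variable_value_labels property: {column: {value: label}}
df = pd.DataFrame({"sex": [1.0, 2.0, 1.0],
                   "region": [1.0, 2.0, 2.0]})
labels = {
    "sex": {1.0: "Male", 2.0: "Female"},
    "region": {1.0: "North", 2.0: "South"},
}

# replace with a column -> {value: label} dict applies labels
# column by column, much as set_value_labels does
labeled = df.replace(labels)
print(labeled["sex"].tolist())  # ['Male', 'Female', 'Male']
```

Unlike set_value_labels, this plain replace does not convert the columns to category dtype; you would call astype('category') afterward to match the formats_as_category=True behavior.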
See Also
The pyreadstat package is nicely documented at: https://github.com/Roche/pyreadstat.
The package has many useful options
for selecting columns and handling missing data that space did not permit me
to demonstrate in this recipe. In section 8, Tidying and Reshaping Data, we
will examine how to normalize data that may have been flattened for
analytical purposes.
Conclusion
In conclusion, importing data from SPSS, Stata and SAS into your workflow opens up new possibilities for data analysis and decision making. Getting ready involves installing the right tools, ensuring you have access to clean, structured data and understanding how these formats fit into your system. By following the steps outlined in the "How to do it" section, you can efficiently bring in data from these statistical packages while preserving their metadata. The "How it works" section explains the intricacies of each format and how to streamline your workflow, while the "There's more" section and the "See also" links point to advanced techniques and further reading to deepen your mastery of importing SPSS, Stata and SAS data.