5 — Pandas Data Cleaning: Importing R Data

A.I Hub
7 min read · Sep 9, 2024


In the world of data analysis, working with R opens doors to powerful insights, but before diving into sophisticated algorithms, it’s crucial to master the fundamentals. Importing R data into pandas is the first step, a gateway to using data prepared in R from Python. As you prepare to handle data, getting ready with the right tools and techniques ensures a smooth workflow. Knowing how to do it and understanding how it works sets the foundation for accurate analysis. But don’t stop there: there’s more. The better you understand how R data maps onto pandas, the more efficient and effective your data processes will become. And if you are eager for more knowledge, make sure to check out the See also section for additional resources.

Table of Contents

  • Importing R Data
  • Getting ready
  • How to do it
  • How it works
  • There’s more
  • See also

Importing R Data

We will use pyreadr to read an R data file into pandas. Since pyreadr cannot capture the metadata, we will write code to reconstruct value labels (analogous to R factors) and column headings. This is similar to what we did in the Importing data from SQL databases recipe.

The R statistical package is, in many ways, similar to the combination of Python and pandas, at least in its scope. Both have strong tools across a range of data preparation and data analysis tasks. Some data scientists work with both R and Python, perhaps doing data manipulation in Python and statistical analysis in R, or vice versa, depending on their preferred packages. But there is currently a scarcity of tools for reading data saved in R, as rds or rdata files, into Python. The analyst often saves the data as a CSV file first and then loads the CSV file into Python. We will use pyreadr, from the same author as pyreadstat, because it does not require an installation of R.

When we receive an R file, or work with one we have created ourselves, we can count on it being fairly well structured, at least compared to CSV or Excel files. Each column will have only one data type, column headings will have appropriate names for Python variables, and all rows will have the same structure. However, we may need to restore some of the coding logic, as we did when working with SQL data.

Getting Ready

This recipe assumes you have installed the pyreadr package. If it is not
installed, you can install it with pip from the terminal, or from PowerShell
on Windows; the command appears at the end of this section. We will again
work with the National Longitudinal Survey in this recipe. You will need to
download the rds file used in this recipe from the GitHub repository in order
to run the code.

pip install pyreadr

How To Do It

We will import data from R without losing important metadata.

  • Load pandas, numpy, pprint, and the pyreadr package.
import pandas as pd
import numpy as np
import pyreadr
import pprint
  • Get the R data

Pass the path and filename to the read_r function to retrieve the R data and
load it into memory as a pandas DataFrame. read_r can return one or more
objects. When reading an rds file, as opposed to an rdata file, it returns
one object with the key None. We index the result with None to get the
pandas DataFrame.

nls97r = pyreadr.read_r('data/nls97.rds')[None]
nls97r.dtypes

Output:

R0000100 int32
R0536300 int32
...
U2962800 int32
U2962900 int32
U2963000 int32
Z9063900 int32
dtype: object
nls97r.head(10)

Output:

   R0000100  R0536300  ...  U2963000  Z9063900
0         1         2  ...        -5        52
1         2         1  ...         6         0
2         3         2  ...         6         0
3         4         2  ...         6         4
4         5         1  ...         5        12
5         6         2  ...         6         6
6         7         1  ...        -5         0
7         8         2  ...        -5        39
8         9         1  ...         4         0
9        10         1  ...         6         0

[10 rows x 42 columns]
  • Set up dictionaries for value labels and column headings.

Load a dictionary that maps columns to the value labels and create a list of
preferred column names as follows.

with open('data/nlscodes.txt', 'r') as reader:
    # eval runs the file contents as Python code, so only use it on files you trust
    setvalues = eval(reader.read())

pprint.pprint(setvalues)

Output:

{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'},
 'R1235800': {0.0: 'Oversample', 1.0: 'Cross-sectional'},
 'S8646900': {1.0: '1. Definitely',
              2.0: '2. Probably ',
              3.0: '3. Probably not',
              4.0: '4. Definitely not'}}
newcols = ['personid','gender','birthmonth',
'birthyear','sampletype','category',
'satverbal','satmath','gpaoverall',
'gpaeng','gpamath','gpascience','govjobs',
'govprices','govhealth','goveld','govind',
'govunemp','govinc','govcollege',
'govhousing','govenvironment','bacredits',
'coltype1','coltype2','coltype3','coltype4',
'coltype5','coltype6','highestgrade',
'maritalstatus','childnumhome','childnumaway',
'degreecol1','degreecol2','degreecol3',
'degreecol4','wageincome','weeklyhrscomputer',
'weeklyhrstv','nightlyhrssleep',
'weeksworkedlastyear']
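
The dictionary file in this step is read with eval, which will execute whatever Python the file contains. When the file holds only a literal dictionary, as the output above suggests, ast.literal_eval from the standard library is a possible drop-in alternative. The sketch below is illustrative only: it writes a one-entry stand-in file to a temporary directory (the file name codes.txt is made up) rather than reading the real data/nlscodes.txt.

```python
import ast
import os
import tempfile

# Stand-in for the label file: a single literal dictionary, as in the recipe.
sample_text = "{'R0536300': {0.0: 'No Information', 1.0: 'Male', 2.0: 'Female'}}"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "codes.txt")  # hypothetical file name
    with open(path, "w") as writer:
        writer.write(sample_text)

    # literal_eval only accepts Python literals (dicts, lists, numbers, strings...)
    with open(path, "r") as reader:
        setvalues = ast.literal_eval(reader.read())

# Anything that is not a plain literal, such as a function call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(setvalues["R0536300"][1.0])
print(rejected)
```

If the label file ever needs expressions rather than plain literals, this approach no longer applies and a format such as JSON may be a better fit.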
  • Set value labels and missing values and change selected columns to
    category data type.

Use the setvalues dictionary to replace existing values with value labels.
Replace all values from -9 to -1 with NaN.

nls97r.replace(setvalues, inplace=True)
nls97r.head()

Output:

   R0000100  R0536300  ...  U2963000  Z9063900
0         1    Female  ...        -5        52
1         2      Male  ...         6         0
2         3    Female  ...         6         0
3         4    Female  ...         6         4
4         5      Male  ...         5        12

[5 rows x 42 columns]
nls97r.replace(list(range(-9,0)), np.nan, inplace=True)
for col in nls97r[[k for k in setvalues]].columns:
    nls97r[col] = nls97r[col].astype('category')

nls97r.dtypes

Output:

R0000100 int64
R0536300 category
R0536401 int64
R0536402 int64
R1235800 category
...
U2857300 category
U2962800 category
U2962900 category
U2963000 float64
Z9063900 float64
Length: 42, dtype: object
  • Set meaningful column headings.
nls97r.columns = newcols
nls97r.dtypes

Output:

personid int64
gender category
birthmonth int64
birthyear int64
sampletype category
...
wageincome category
weeklyhrscomputer category
weeklyhrstv category
nightlyhrssleep float64
weeksworkedlastyear float64
Length: 42, dtype: object

This shows how R data files can be imported into pandas and value labels
assigned.
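
To see the labeling, missing-value, and renaming steps in isolation, they can be tried on a toy DataFrame. This is a minimal sketch, not the NLS data: the four rows and the one-entry labels dictionary below are invented for illustration, and only the column codes are borrowed from the recipe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NLS data; the values are made up for illustration.
df = pd.DataFrame({
    "R0536300": [1, 2, 2, 1],    # coded gender
    "U2963000": [-5, 6, 6, 5],   # numeric column using a negative missing code
})
labels = {"R0536300": {1: "Male", 2: "Female"}}

# 1. A nested dictionary replaces codes with labels, column by column.
df = df.replace(labels)

# 2. Codes from -9 to -1 are treated as missing.
df = df.replace(list(range(-9, 0)), np.nan)

# 3. Labeled columns become categoricals.
for col in labels:
    df[col] = df[col].astype("category")

# 4. Rename last, because labels is keyed by the original column names.
df.columns = ["gender", "nightlyhrssleep"]

print(df.dtypes)
```

The numeric column ends up as a float type because NaN cannot be stored in a plain integer column.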

How It Works

Reading R data into pandas with pyreadr is fairly straightforward. Passing a filename to the read_r function is all that is required. Since read_r can return multiple objects with one call, we need to specify which object we want. When reading an rds file, as opposed to an rdata file, only one object is returned. It has the key None.

In step 3, we load a dictionary that maps our variables to value labels, and a list for our preferred column headings. In step 4, we apply the value labels. We also change the data type to category for the columns where we applied the values. We do this by generating a list of the keys of our setvalues dictionary with [k for k in setvalues] and then iterating over those columns. We change the column headings in step 5 to ones that are more intuitive. Note that the order matters here. We need to set the value labels before changing the column names, since the setvalues dictionary is based on the original column headings.

The main advantage of using pyreadr to read R files directly into pandas is that we do not have to convert the R data into a CSV file first. Once we have written our Python code to read the file, we can just rerun it whenever the R data changes. This is particularly helpful when we do not have R on the machine where we are working.
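
The order dependency noted above comes from the label dictionary being keyed by the original column names. One possible way around it, sketched here as an alternative rather than as part of the recipe, is to re-key the dictionary with the new names. The two-column oldcols and newcols lists are a shortened, illustrative excerpt, and rename_map and setvalues_renamed are names introduced for this example.

```python
# Shortened, illustrative excerpt of the original and preferred column names.
oldcols = ["R0000100", "R0536300"]
newcols = ["personid", "gender"]
setvalues = {"R0536300": {0.0: "No Information", 1.0: "Male", 2.0: "Female"}}

# Pair each original name with its replacement, position by position.
rename_map = dict(zip(oldcols, newcols))

# Re-key the label dictionary so it matches the renamed columns.
setvalues_renamed = {rename_map[col]: labels for col, labels in setvalues.items()}

print(setvalues_renamed)
```

With a re-keyed dictionary, the labels could be applied after renaming instead of before. The same rename_map could also be passed to the DataFrame rename method, which avoids relying on the list and the columns being in the same order.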

There’s More

The pyreadr package is able to return multiple data frames. This is useful
when we save several data objects in R as an rdata file. We can return all of
them with one call. The pprint module is a handy tool for improving the
display of Python dictionaries.

See Also

Clear instructions and examples for pyreadr are available at https://github.com/ofajardo/pyreadr. Feather files, a relatively new format, can be read by both R and Python. I discuss those files in the next recipe.

We could have used rpy2 instead of pyreadr to import R data. rpy2 requires that R also be installed, but it is more powerful than pyreadr. It will read R factors and automatically set them to pandas DataFrame values. See the code below.

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
readRDS = robjects.r['readRDS']
nls97withvalues = readRDS('data/nls97withvalues.rds')
nls97withvalues

Output:

      R0000100  R0536300  ...     U2963000  Z9063900
1            1    Female  ...  -2147483648        52
2            2      Male  ...            6         0
3            3    Female  ...            6         0
4            4    Female  ...            6         4
5            5      Male  ...            5        12
...        ...       ...  ...          ...       ...
8980      9018    Female  ...            4        49
8981      9019      Male  ...            6         0
8982      9020      Male  ...  -2147483648        15
8983      9021      Male  ...            7        50
8984      9022    Female  ...            7        20

[8984 rows x 42 columns]

This generates unusual -2147483648 values. This is how missing data in
numeric columns came through when the data was read with readRDS. After
confirming that -2147483648 is not a valid value in the data, a global
replace of that number with NaN would be a good next step.
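
The check-then-replace step can be sketched as follows. R represents a missing integer as the smallest 32-bit integer, which is where -2147483648 comes from. The small DataFrame below is a made-up stand-in for nls97withvalues, and R_NA_INTEGER, sentinel_counts, and cleaned are names introduced for this example.

```python
import numpy as np
import pandas as pd

# Smallest 32-bit integer, the value R uses internally for a missing integer.
R_NA_INTEGER = np.iinfo(np.int32).min

# Made-up stand-in for a frame converted by rpy2.
df = pd.DataFrame({
    "U2963000": [R_NA_INTEGER, 6, 6, 5],
    "Z9063900": [52, 0, 0, 4],
})

# First look at how often the sentinel appears in each column.
sentinel_counts = (df == R_NA_INTEGER).sum()
print(sentinel_counts)

# Then replace it globally with NaN.
cleaned = df.replace(R_NA_INTEGER, np.nan)
print(cleaned)
```

Reviewing the counts per column first makes it easier to spot a column where the number might, against expectations, be a legitimate value.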

Conclusion

In conclusion, importing R data is a crucial step in any analysis that spans R and Python, and mastering the process is key to streamlining your workflows. Once you have a clear understanding of your data sources and formats, the import becomes a repeatable task that saves you from converting files to CSV by hand. By following the steps on how to do it and understanding how it works, you will be able to read R datasets into pandas, restore their value labels and column headings, and prepare them for analysis with ease. But remember, there’s always more to explore. Other formats such as Feather and other libraries such as rpy2 open up further options. Be sure to explore related resources to deepen your understanding and stay ahead in the ever-evolving world of data science.
