When it comes to working with data, importing a CSV file is often the first step towards unlocking insights and driving decisions. Getting ready for this task involves setting up the right tools and ensuring your environment is primed for seamless data processing. But how do you actually do it? It is simpler than you might think: with just a few lines of code, your data can be loaded and ready for analysis. So how does it all work? Behind the scenes, your file is transformed into a structured format, ready for exploration and manipulation. And there's more: beyond the basics, there are countless ways to optimize and expand on your CSV handling techniques. For deeper knowledge, check the additional resources in the 'See also' section to continue your learning journey.
Table of Contents
- Importing CSV file
- Getting ready
- How it works
- There’s more
- See also
Importing CSV File
The read_csv method of the pandas library can be used to read a file with
comma separated values and load it into memory as a pandas
DataFrame. In this recipe, we read a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data. Raw data is often stored as CSV
files. These files have a carriage return at the end of each line of data to demarcate a row and a comma between each data value to delineate columns.
Something other than a comma can be used as the delimiter, such as a tab.
Quotation marks may be placed around values which can be helpful when
the delimiter occurs naturally within certain values, which sometimes
happens with commas. All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file in a text editor, presuming it is not too large. The pandas read_csv method will make an
educated guess about the data type of each column, but you will need to help
it along to ensure that these guesses are on the mark.
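The read_csv method's guesses can be nudged before the import even happens. Here is a minimal sketch, using a hypothetical file named measurements.csv with columns id, reading, and recorded, of how the dtype and parse_dates parameters can tell read_csv what each column should be rather than relying on its inference:

import pandas as pd
# hypothetical file and column names, for illustration only
df = pd.read_csv('data/measurements.csv',
    dtype={'id': 'object', 'reading': 'float64'},  # override the type guesses
    parse_dates=['recorded'])                      # convert text to datetime64[ns]
df.dtypes

We use the same idea in the recipe below, where parse_dates builds a date column during the import.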
Getting Ready
Create a folder for this section and create a new Python script or Jupyter
Notebook file in that folder. Create a data subfolder and place the
landtempssample.csv file in that subfolder. Alternatively, you could
retrieve all of the files from the GitHub repository. Here is a screenshot of the
beginning of the CSV file.
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration. This is just a 100,000-row sample of the full dataset, which is also available in the repository.
How To Do It
We will import a CSV file into pandas and take advantage of some very useful
read_csv options.
- Import the pandas library and set up the environment to make viewing the output easier.
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 85)
pd.set_option('display.max_columns', 8)
- Read the data file, set new names for the headings, and parse the date column.
Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False so that read_csv loads the file in a single pass, which gives more consistent data type inference.
landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp','latitude',
        'longitude','elevation','station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    low_memory=False)
type(landtemps)
Output:
<class 'pandas.core.frame.DataFrame'>
We have to use skiprows because we are passing a list of column names to read_csv. If we use the column names in the CSV file, we do not need to specify values for either names or skiprows.
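To make the contrast concrete, here is a minimal sketch of what the call could look like if the headings already in the file were acceptable; read_csv would then take the column names from the file's first row, and parse_dates would need to reference whatever names actually appear in that header row:

landtemps = pd.read_csv('data/landtempssample.csv', low_memory=False)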
- Get a quick glimpse of the data.
View the first few rows. Show the data type for all columns as well as
the number of rows and columns.
landtemps.head(7)
Output:
month_year stationid ... countryid \
0 2000-04-01 USS0010K01S ... US
1 1940-05-01 CI000085406 ... CI
2 2013-12-01 USC00036376 ... US
3 1963-02-01 ASN00024002 ... AS
4 2001-11-01 ASN00028007 ... AS
5 1991-04-01 USW00024151 ... US
6 1993-12-01 RSM00022641 ... RS
country
0 United States
1 Chile
2 United States
3 Australia
4 Australia
5 United States
6 Russia
[7 rows x 9 columns]
landtemps.dtypes
Output:
month_year datetime64[ns]
stationid object
avgtemp float64
latitude float64
longitude float64
elevation float64
station object
countryid object
country object
dtype: object
landtemps.shape
Output:
(100000, 9)
- Give the date column a better name and view the summary statistics for
average monthly temperature.
landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
landtemps.dtypes
Output:
measuredate datetime64[ns]
stationid object
avgtemp float64
latitude float64
longitude float64
elevation float64
station object
countryid object
country object
dtype: object
landtemps.avgtemp.describe()
Output:
count 85,554.00
mean 10.92
std 11.52
min -70.70
25% 3.46
50% 12.22
75% 19.57
max 39.95
Name: avgtemp, dtype: float64
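As an aside, describe also accepts a percentiles parameter if cut points other than the default quartiles are wanted. The following variation is not part of the recipe's steps, just an option worth knowing about:

landtemps.avgtemp.describe(percentiles=[0.1, 0.5, 0.9])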
- Look for missing values for each column.
Use isnull, which returns True for each value that is missing for each column and False when not missing. Chain this with sum to count the missing values for each column. When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There's more section of this recipe.
landtemps.isnull().sum()
Output:
measuredate 0
stationid 0
avgtemp 14446
latitude 0
longitude 0
elevation 0
station 0
countryid 0
country 5
dtype: int64
- Remove rows with missing data for avgtemp.
Use the subset parameter to tell dropna to drop rows when avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the DataFrame, but the changes we have made would not be retained. Use the shape attribute of the DataFrame to get the number of rows and columns.
landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.shape
Output:
(85554, 9)
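An equivalent way to get the same result, sketched here for comparison rather than as part of the recipe, is to select only the rows where avgtemp is not missing and assign the result back, avoiding inplace altogether:

landtemps = landtemps[landtemps.avgtemp.notna()]
landtemps.shape  # (85554, 9), the same result as dropna with subset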
That’s it! Importing CSV files into pandas is as simple as that.
How It Works
Almost all of the recipes in this section use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the terminal output wide enough to accommodate the number of variables.

Much of the work is done by the first line in step 2. We use read_csv to load a pandas DataFrame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data. read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.

Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first 7 rows. We use the dtypes attribute of the DataFrame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple whose first element is the number of rows in the DataFrame (100,000 in this case) and whose second element is the number of columns (9).

When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We use the rename method in step 4 to give that column a better name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column. Notice that the count for avgtemp indicates that there are 85,554 rows with valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame, as provided by the shape attribute. The listing of missing values for each column in step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.

Step 6 drops all rows where avgtemp is NaN. The NaN value (not a number) is the pandas representation of missing values. subset is used to indicate which column to check for missing values. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect given the previous count from describe.
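To see how these pieces fit together, here is a condensed sketch of the whole recipe as a single chained expression. It is an illustration of how the steps combine, not a replacement for working through them one at a time:

import pandas as pd
landtemps = (
    pd.read_csv('data/landtempssample.csv',
        names=['stationid','year','month','avgtemp','latitude',
            'longitude','elevation','station','countryid','country'],
        skiprows=1,
        parse_dates=[['month','year']],
        low_memory=False)
    .rename(columns={'month_year':'measuredate'})   # better name for the parsed date column
    .dropna(subset=['avgtemp'])                     # drop rows with missing temperatures
)
landtemps.shape  # (85554, 9)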
There’s More
If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv (see the sketch at the end of this section). When creating the pandas DataFrame, an index was also created. The numbers to the far left of the output when head was run are index values. Any number of rows can be specified for head or sample. The default value is 5. Setting low_memory to True, which is the default, causes read_csv to parse the data in chunks. This is easier on systems with lower memory when working with larger files; we set it to False here so that the file is read in a single pass, giving more consistent data type inference. Either way, the full DataFrame will still be loaded into memory once read_csv completes successfully. The landtemps.isnull().sum()
statement is an example of chaining methods. First, isnull returns a
DataFrame of True and False values, resulting from testing whether each
column value is null. sum takes that DataFrame and sums the True values
for each column, interpreting the True values as 1 and the False values as
0. We would have obtained the same result if we had used the following two
steps.
checknull = landtemps.isnull()
checknull.sum()
There is no hard and fast rule for when to chain methods and when not to do so. I find it helpful to chain when I think of what I am doing as conceptually a single step, even if it is two or more steps mechanically speaking. Chaining also has the side benefit of not creating extra objects that I might not need. The dataset used in this recipe is just a sample from the full land temperatures database, which has almost 17 million records. You can run the larger file, if your machine can handle it, with this code.
landtemps = pd.read_csv('data/landtemps.zip', compression='zip',
    names=['stationid','year','month','avgtemp','latitude','longitude',
        'elevation','station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    low_memory=False)
read_csv can read a compressed ZIP file. We get it to do this by passing the
name of the ZIP file and the type of compression.
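Similarly, if the sample had been distributed as a tab-delimited file, the sep parameter mentioned earlier would handle it. This is a sketch assuming a hypothetical file named landtempssample.tsv:

landtemps = pd.read_csv('data/landtempssample.tsv', sep='\t',
    names=['stationid','year','month','avgtemp','latitude','longitude',
        'elevation','station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']])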
See Also
Subsequent recipes, in this section and in other sections, set indexes to improve navigation over rows and merging. A significant amount of
reshaping of the Global Historical Climatology Network raw data was done
before using it in this recipe. We demonstrate this in part 8, Tidying and
Reshaping Data.
Conclusion
In conclusion, working with CSV files is an essential skill for handling structured data across various domains. Once you get everything in place, such as preparing your tools and setting up your environment, the process becomes straightforward. Understanding how to import a CSV file, and knowing the steps required to manipulate the data efficiently, is a crucial foundation. Once you grasp how it works, you can explore more advanced techniques to optimize your workflow. But remember, this is just the tip of the iceberg: there are countless methods and functions out there that can help you unlock even more powerful insights from your data.