Mesin Belajar: An Introduction to Scientific Python: Pandas

http://www.datadependence.com/2016/05/scientific-python-pandas

Pandas has got to be one of my most favourite libraries… Ever.
Pandas allows us to deal with data in a way that us humans can understand it; with labelled columns and indexes. It allows us to effortlessly import data from files such as csvs, allows us to quickly apply complex transformations and filters to our data and much more. It’s absolutely brilliant.
Along with Numpy and Matplotlib I feel it helps create a really strong base for data exploration and analysis in Python. Scipy (which will be covered in the next post), is of course a major component and another absolutely fantastic library, but I feel these three are the real pillars of scientific Python.
So without any ado, let’s get on with the third post in this series on scientific Python and take a look at Pandas. Don’t forget to check out the other posts if you haven’t yet!

IMPORTING PANDAS

First thing to do its to import the star of the show, Pandas.

import pandas as pd # This is the standard

1	import pandas as pd # This is the standard

This is the standard way to import Pandas. We don’t want to be writing ‘pandas’ all the time but it’s important to keep code concise and avoid naming clashes so we compromise with ‘pd’. If you look at other people’s code that uses Pandas you will see this import.

THE PANDAS DATA TYPES

Pandas is based around two data types, the series and the dataframe.
A series is a one-dimensional data type where each element is labelled. If you have read the post in this series on NumPy, you can think of it as a numpy array with labelled elements. Labels can be numeric or strings.
A dataframe is a two-dimensional, tabular data structure. The Pandas dataframe can store many different data types and each axis is labelled. You can think of it as sort of like a dictionary of series.

GETTING DATA INTO PANDAS

Before we can start wrangling, exploring and analysing, we first need data to wrangle, explore and analyse. Thanks to Pandas this is very easy, more so than NumPy.
Here I encourage you to find your own dataset, one that interests you and play around with that. Some good resources to find datasets are your country’s (or another’s) website. If you search for example UK government data or US government data, it will be one of the first results. Kaggle is another great source.
I will be using data on the UK’s rainfall that I found on the UK government’s website which can easily be downloaded and towards the end, some data that I got off a website about Japan’s rainfall.

# Reading a csv into Pandas.
df = pd.read_csv('notebook_playground/data/uk_rain_2014.csv', header=0)

1 2	# Reading a csv into Pandas. df = pd.read_csv('notebook_playground/data/uk_rain_2014.csv', header=0)

Here we get data from a csv file and store it in a dataframe. It’s as simple as calling read_csv and putting the path to your csv file as an argument. The header keyword argument tells Pandas if and where the column names of your data are. If there are no column names you can set it to None. Pandas is pretty clever so this can often be omitted.

GETTING YOUR DATA READY TO EXPLORE AND ANALYSE

Now we have our data in Pandas, we probably want to take a quick look at it and know some basic information about it to give us some direction before we really probe into it.
To take a quick look at the first x rows of the data.

# Getting first x rows.
df.head(5)

1 2	# Getting first x rows. df.head(5)

All we do is use the head() function and pass it the number of rows we want to retrieve.
You’ll end up with table looking like this:

Another thing you might want to do is get the last x rows.

# Getting last x rows.
df.tail(5)

1 2	# Getting last x rows. df.tail(5)

As with head all we do is call tail and pass it the number of rows we want to retrieve. Notice that it doesn’t start at the end of the dataframe and goes backwards. It gives you the rows in the order they are in in the dataframe.
You will end up with something that looks like this:

When referring to columns in Pandas you often refer to their names. This great and very easy to work with, but sometimes data has horribly long column names, such as whole questions from questionnaires. Life is much easier when you change them to make them shorter.

# Changing column labels.
df.columns = ['water_year','rain_octsep', 'outflow_octsep',
              'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']

df.head(5)

# Changing column labels.

df.columns = ['water_year','rain_octsep', 'outflow_octsep',

'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']

df.head(5)

One thing to note here is that I have purposely made all the column labels with no spaces and no dashes. If we name our variables like this, it saves us some typing as you will see later.
You will end up with the same data as before, but with different column names:

Another important thing about your data that you will usually want to know is how many entries you have. In Pandas one entry equates to one row, so we can take the len of the dataset, which returns the amount of rows there are.

# Finding out how many rows dataset has.
len(df)

1 2	# Finding out how many rows dataset has. len(df)

This will give you an integer telling you the number of rows, in my dataset I have 33.
One more thing that you might need to know is some basic statistics on your data, Pandas makes this delightfully simple.

# Finding out basic statistical information on your dataset.
pd.options.display.float_format = '{:,.3f}'.format # Limit output to 3 decimal places.
df.describe()

# Finding out basic statistical information on your dataset.

pd.options.display.float_format = '{:,.3f}'.format # Limit output to 3 decimal places.

df.describe()

This will return a table of various statistics such as count, mean, standard deviation and more that will look at bit like this:

FILTERING

When poking around in your dataset you will often want to take specific a sample of your data, for example if you had a questionnaire on job satisfaction, you might want to take all the people in a specific industry or age range.
Pandas gives us many ways to filter our data to extract the information we want.
Sometimes you’ll want to extract a whole column. Using column labels this is extremely easy.

# Getting a column by label
df['rain_octsep']

1 2	# Getting a column by label df['rain_octsep']

Note that when we extract a column, we are given back a series, not a dataframe. If you remember, you can think of a dataframe as a dictionary of series, so if we pull out a column, of course we get a series back.
Remember how I pointed out my choice in naming the column labels? Not using spaces or dashes etc. allows us to access columns the same way we can access object properties; using a dot.

# Getting a column by label using .
df.rain_octsep

1 2	# Getting a column by label using . df.rain_octsep

This will return exactly the same as the previous example. A series of the data in our chosen column.
If you have read the Numpy post in the series, you may remember a technique called ‘boolean masking’ and how we can get an array of boolean values running a conditional on an array. Well we can do this in Pandas too.

# Creating a series of booleans based on a conditional
df.rain_octsep < 1000 # Or df['rain_octsep] < 1000

1 2	# Creating a series of booleans based on a conditional df.rain_octsep < 1000 # Or df['rain_octsep] < 1000

The above code will return a dataframe of boolean values; ‘True’ if the rain in October-September what less than 1000mm and ‘False’ if not.
We can then use these conditional expressions to filter an existing dataframe.

# Using a series of booleans to filter
df[df.rain_octsep < 1000]

1 2	# Using a series of booleans to filter df[df.rain_octsep < 1000]

This will return a dataframe of only entries that had less than 1000mm of rain from October-September.

You can also filter by multiple conditional expressions.

# Filtering by multiple conditionals
df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)] # Can't use the keyword 'and'

1 2	# Filtering by multiple conditionals df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)] # Can't use the keyword 'and'

This will return only the entries that have a value of less than 1000 for rain_octsep and less than 4000 for outflow_octsep.
An important thing to note here is you cannot use the keyword ‘and’ here due to problems with the order of operations. You must use ‘&’ and brackets.

If you have strings in your data, then good news for you, you can also use string methods to filter with.

# Filtering by string methods
df[df.water_year.str.startswith('199')]

1 2	# Filtering by string methods df[df.water_year.str.startswith('199')]

Note that you have to use .str.[string method], you can’t just call a string method on it right away. This returns all entries in the 1990’s.

INDEXING

The previous section showed us how to get data based on operations done to the columns, but Pandas has labelled rows too. These row labels can be numerical or label based, and the method of retrieving a row differs depending on this label type.
If your rows have numerical indices, you can reference them using iloc.

# Getting a row via a numerical index
df.iloc[30]

1 2	# Getting a row via a numerical index df.iloc[30]

iloc will only work on numerical indices. It will return a series of that row. Each column of that row will be an element in the returned series.
Maybe in your dataset you have a column of years, or ages. Maybe you want to be able to reference rows using these years or ages. In this case we can set a new index (or multiple).

# Setting a new index from an existing column
df = df.set_index(['water_year'])
df.head(5)

# Setting a new index from an existing column

df = df.set_index(['water_year'])

df.head(5)

This will make the column ‘water_year’ an index. Notice that the column name is actually in a list, although the one above only has one element. If you wanted to have more than one index, this can be easily done by adding another column name to the list.

In the above example we set our index to be a column that is full of strings. This means that we now can’t reference then with iloc, so what do we use? We use loc.

# Getting a row via a label-based index
df.loc['2000/01']

1 2	# Getting a row via a label-based index df.loc['2000/01']

This, like iloc will return a series of the row you reference. The only difference is this time you are using label based referencing not numerical based.
There is another commonly used way to reference a row; ix. So if loc is label based and iloc is numerical based… What is ix? Well, ix is label based with a numerical index fallback.

# Getting a row via a label-based or numerical index
df.ix['1999/00'] # Label based with numerical index fallback *Not recommended

1 2	# Getting a row via a label-based or numerical index df.ix['1999/00'] # Label based with numerical index fallback *Not recommended

Just like loc and iloc this will return a series of the row you referenced.
So if ix does the job of both loc and iloc, why would you use anything else? The big reason is that it is slightly unpredictable. Remember how I said that it is label based with a numerical index fallback? Well this makes it do weird things sometimes like interpreting a number as a location. Using loc and iloc gives you safety, predictability, peace of mind. I should point out however, that ix is faster than both loc and iloc
It’s often useful to have indexes in order, we can do this in pandas via calling sort_index on our dataframe.

df.sort_index(ascending=False).head(5) #inplace=True to apple the sorting in place

1	df.sort_index(ascending=False).head(5) #inplace=True to apple the sorting in place

My index is already in order, so I set the keyword argument ‘ascending’ to False for demonstration purposes. This makes my data sort in descending order.

When you set a column of data to an index, it is no longer data per se. If you want to return the index to it’s original data form you just do the opposite of set_index… reset_index.

# Returning an index to data
df = df.reset_index('water_year')
df.head(5)

# Returning an index to data

df = df.reset_index('water_year')

df.head(5)

This will return your index to it’s original column form.

APPLYING FUNCTIONS TO DATASETS

Sometimes you will want to change or operate on the data in your dataset in some way. For example maybe you have a list of years and want to create a new column that gives the year’s decade. Pandas has two very useful functions for this, apply and applymap.

# Applying a function to a column
def base_year(year):
    base_year = year[:4]
    base_year= pd.to_datetime(base_year).year
    return base_year

df['year'] = df.water_year.apply(base_year)
df.head(5)

# Applying a function to a column

def base_year(year):

base_year = year[:4]

base_year= pd.to_datetime(base_year).year

return base_year

df['year'] = df.water_year.apply(base_year)

df.head(5)

This creates a new column called ‘year’ that is derived from the ‘water_year’ column and extracts just the main year. This is how to use apply, which is how you apply a function to a column. If you wanted to apply some function to the whole dataset you can use dataset.applymap().

MANIPULATING A DATASET’S STRUCTURE

Another common thing to do with dataframes is to restructure them in order to put them in a more convenient and/or useful form.
The easiest way to get to grips with these transformations is to see them happening, more than anything else in this post, the next few operations require some playing with to get your head around them.
First up, groupby…

#Manipulating structure (groupby, unstack, pivot)
# Grouby
df.groupby(df.year // 10 * 10).max()

#Manipulating structure (groupby, unstack, pivot)

# Grouby

df.groupby(df.year // 10 * 10).max()

What grouby does is form groups around the column you choose. Above groups by decade. This however doesn’t produce anything useful to us, we must then call something on it, such as max, min, mean, etc. Which will give us for example, the mean x in the 90’s.

You can also form groups with multiple columns.

# Grouping by multiple columns
decade_rain = df.groupby([df.year // 10 * 10, df.rain_octsep // 1000 * 1000])[['outflow_octsep',
                                                                              'outflow_decfeb',
                                                                              'outflow_junaug']].mean()
decade_rain

# Grouping by multiple columns

decade_rain = df.groupby([df.year // 10 * 10, df.rain_octsep // 1000 * 1000])[['outflow_octsep',

'outflow_decfeb',

'outflow_junaug']].mean()

decade_rain

Next up unstacking which can be a little confusing at first. What it does is push a column up to become column labels. It’s best to just see it in action…

# Unstacking
decade_rain.unstack(0)

1 2	# Unstacking decade_rain.unstack(0)

This transforms the dataframe that we produced in the section above into the dataframe below. It pushes up the 0th column, which is actually the index ‘year’, and turns it into column labels.

Let’s do one more for good measure. This time on the 1st column, which is the index ‘rain_octsep’.

# More unstacking
decade_rain.unstack(1)

1 2	# More unstacking decade_rain.unstack(1)

Now before our next operation, we will first create a new dataframe to play with.

# Create a new dataframe containing entries which 
# has rain_octsep values of greater than 1250
high_rain = df[df.rain_octsep > 1250]
high_rain

# Create a new dataframe containing entries which

# has rain_octsep values of greater than 1250

high_rain = df[df.rain_octsep > 1250]

high_rain

The above code gives us the below dataframe which we will perform pivoting on.

Pivoting is actually a combination of operations that we have already looked at in this post. First it sets a new index (set_index()), then it sorts that index (sort_index()) and finally it does an unstack on it. Together this is a pivot. See if you can visualise what is happening.

#Pivoting
#does set_index, sort_index and unstack in a row
high_rain.pivot('year', 'rain_octsep')[['outflow_octsep', 'outflow_decfeb', 'outflow_junaug']].fillna('')

#Pivoting

#does set_index, sort_index and unstack in a row

high_rain.pivot('year', 'rain_octsep')[['outflow_octsep', 'outflow_decfeb', 'outflow_junaug']].fillna('')

Notice at the end there is a .fillna(”). This pivot creates a lot of empty entries, NaN value entries. I personally find my dataframe being littered with NaNs distracting so I use fillna(”). You can enter whatever you like, for example a zero. We can also use the function dropna(how=’any’) to delete all rows with NaNs in them. In this case it would delete everything though, so we won’t do that.

The above dataframe shows us the outflow for all the years with rainfall over 1250. Admittedly this wasn’t the best example of a pivot in terms of practical use, but hopefully you get the idea. See what you can come up with in your dataset.

COMBINING DATASETS

Sometimes you will have two separate datasets that relate to each other that you want to compare them together or combine them. Well, no problem; Pandas makes this easy.

# Merging two datasets together
rain_jpn = pd.read_csv('notebook_playground/data/jpn_rain.csv')
rain_jpn.columns = ['year', 'jpn_rainfall']

uk_jpn_rain = df.merge(rain_jpn, on='year')
uk_jpn_rain.head(5)

# Merging two datasets together

rain_jpn = pd.read_csv('notebook_playground/data/jpn_rain.csv')

rain_jpn.columns = ['year', 'jpn_rainfall']

uk_jpn_rain = df.merge(rain_jpn, on='year')

uk_jpn_rain.head(5)

First off you need to have matching columns to merge on which you can then select via the ‘on’ keyword argument. You can often omit it and Pandas will work out one which columns to merge too.
As you can see below, the two datasets have been combined on the year category. The rain_jpn dataset only has the year and amount of rainfall and as we merged on the year column, only one column ‘jpn_rainfall’ has been merged with the columns from our UK rain dataset.

USING PANDAS TO PLOT GRAPHS QUICKLY

Matplotlib is great, but it takes a fair bit of code to create a half-way decent graph and sometimes you just want to quickly whip up a plot of your data for just your eyes to help you explore and make sense of it. Pandas answers this problem with plot.

# Using pandas to quickly plot graphs
uk_jpn_rain.plot(x='year', y=['rain_octsep', 'jpn_rainfall'])

1 2	# Using pandas to quickly plot graphs uk_jpn_rain.plot(x='year', y=['rain_octsep', 'jpn_rainfall'])

This creates a plot of your data using matplotlib, quickly and hassle free. From this you can then analyse your data visually to give yourself more direction when exploring. For example if you look at the plot of my data, you can see that maybe there was a drought in UK in 1995.

You can also see that UK’s rainfall is significantly less than Japan’s, and people say UK rains a lot!

SAVING YOUR DATASETS

After cleaning, reshaping and exploring your dataset, you often end up with something very different and much more useful than what you started with. You should also ways keep your original data, but also saving your newly polished dataset is a good idea too.

# Saving your data to a csv
df.to_csv('uk_rain.csv')

1 2	# Saving your data to a csv df.to_csv('uk_rain.csv')

The above code will save your data to a csv ready for next time.
So there we have it, and introduction to Pandas. As I said before, Pandas is really great and we have only scratched the surface here, but you should now know enough to get going and start cleaning and exploring data.
As usual I really urge you to go and play with this. Find a dataset or two that really interests you, sit down with a beer maybe and start probing it. It’s really the only way to get comfortable with Pandas and the other libraries introduced in this series. Plus you never know, you might find something interesting.

Mesin Belajar

Monday, July 4, 2016

An Introduction to Scientific Python: Pandas

IMPORTING PANDAS

THE PANDAS DATA TYPES

GETTING DATA INTO PANDAS

GETTING YOUR DATA READY TO EXPLORE AND ANALYSE

FILTERING

INDEXING

APPLYING FUNCTIONS TO DATASETS

MANIPULATING A DATASET’S STRUCTURE

COMBINING DATASETS

USING PANDAS TO PLOT GRAPHS QUICKLY

SAVING YOUR DATASETS

No comments:

Post a Comment