http://www.datadependence.com/2016/05/scientific-python-pandas
Pandas has got to be one of my favourite libraries… Ever.
Pandas allows us to deal with data in a way that we humans can understand: with labelled columns and indexes. It allows us to effortlessly import data from files such as CSVs, quickly apply complex transformations and filters to our data, and much more. It’s absolutely brilliant.
Along with NumPy and Matplotlib, I feel it helps create a really strong base for data exploration and analysis in Python. SciPy (which will be covered in the next post) is of course a major component and another absolutely fantastic library, but I feel these three are the real pillars of scientific Python.
So without further ado, let’s get on with the third post in this series on scientific Python and take a look at Pandas. Don’t forget to check out the other posts if you haven’t yet!
IMPORTING PANDAS
First thing to do is to import the star of the show, Pandas.

THE PANDAS DATA TYPES
Pandas is based around two data types, the series and the dataframe.

A series is a one-dimensional data type where each element is labelled. If you have read the post in this series on NumPy, you can think of it as a NumPy array with labelled elements. Labels can be numeric or strings.
A dataframe is a two-dimensional, tabular data structure. The Pandas dataframe can store many different data types and each axis is labelled. You can think of it as sort of like a dictionary of series.
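The original snippets aren’t shown here, but a minimal sketch of both types might look like this (the values and labels are made up for illustration):

```python
import pandas as pd

# A series: one-dimensional, every element labelled
s = pd.Series([14, 7, 3], index=['a', 'b', 'c'])
print(s['b'])  # → 7

# A dataframe: two-dimensional and tabular, like a dict of series
df = pd.DataFrame({'rainfall': [1023, 1184], 'outflow': [5024, 6011]},
                  index=['2010/11', '2011/12'])
print(df['rainfall']['2010/11'])  # → 1023
```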
GETTING DATA INTO PANDAS
Before we can start wrangling, exploring and analysing, we first need data to wrangle, explore and analyse. Thanks to Pandas this is very easy, even more so than with NumPy.

Here I encourage you to find your own dataset, one that interests you, and play around with that. A good place to find datasets is your country’s (or another country’s) government data portal; if you search for, say, UK government data or US government data, it will be one of the first results. Kaggle is another great source.
I will be using data on the UK’s rainfall that I found on the UK government’s website, which can easily be downloaded, and, towards the end, some data about Japan’s rainfall that I got off another website.
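Loading a CSV is a one-liner with read_csv. The filename below is hypothetical, so to keep this sketch self-contained it reads the same shape of data from an in-memory string instead:

```python
import io
import pandas as pd

# Stand-in for a downloaded file; for a real CSV on disk you would write
# df = pd.read_csv('uk_rain_2014.csv', header=0)  # filename hypothetical
csv_text = ("water_year,rain_octsep,outflow_octsep\n"
            "2010/11,1023,5024\n"
            "2011/12,1184,6011\n")
df = pd.read_csv(io.StringIO(csv_text), header=0)
print(len(df))  # → 2
```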
GETTING YOUR DATA READY TO EXPLORE AND ANALYSE
Now we have our data in Pandas, we probably want to take a quick look at it and learn some basic information about it to give us some direction before we really probe into it.

To take a quick look at the first x rows of the data, use head.
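A sketch with toy data (the column names are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'water_year': ['2010/11', '2011/12', '2012/13', '2013/14', '2014/15'],
    'rain_octsep': [1023, 1184, 1340, 1211, 1386],
})
# First 3 rows; with no argument, head() returns the first 5
print(df.head(3))
```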
You’ll end up with a table showing just those first rows of your data.
Another thing you might want to do is get the last x rows.
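That’s what tail is for; a sketch with the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'water_year': ['2010/11', '2011/12', '2012/13', '2013/14', '2014/15'],
    'rain_octsep': [1023, 1184, 1340, 1211, 1386],
})
# Last 2 rows; with no argument, tail() returns the last 5
print(df.tail(2))
```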
You will end up with a table showing just the last rows of your data.
When referring to columns in Pandas you often refer to their names. This is great and very easy to work with, but sometimes data has horribly long column names, such as whole questions from questionnaires. Life is much easier when you change them to make them shorter.
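One way to do this is to assign a whole new list of names to the columns attribute; a sketch with a made-up unwieldy name:

```python
import pandas as pd

# A horribly long (hypothetical) column name, shortened by reassigning
# the whole list of column names at once
df = pd.DataFrame({'Total rainfall October to September (mm)': [1023, 1184]})
df.columns = ['rain_octsep']
print(df.columns[0])  # → rain_octsep
```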
You will end up with the same data as before, but with shorter, more convenient column names.
Another important thing about your data that you will usually want to know is how many entries you have. In Pandas one entry equates to one row, so we can take the len of the dataset, which returns the number of rows there are.
One more thing that you might need to know is some basic statistics on your data; Pandas makes this delightfully simple.
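Both of these together in one sketch, using toy numbers:

```python
import pandas as pd

df = pd.DataFrame({'rain_octsep': [1023, 1184, 1340]})

print(len(df))  # → 3, the number of rows (entries)

# describe() gives count, mean, std, min, quartiles and max per column
stats = df.describe()
print(stats.loc['mean', 'rain_octsep'])
```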
FILTERING
When poking around in your dataset you will often want to take a specific sample of your data; for example, if you had a questionnaire on job satisfaction, you might want to take all the people in a specific industry or age range.

Pandas gives us many ways to filter our data to extract the information we want.
Sometimes you’ll want to extract a whole column. Using column labels this is extremely easy.
Remember how I pointed out my choice of column names? Not using spaces or dashes etc. allows us to access columns the same way we can access object properties: using a dot.
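Both access styles, sketched with toy data; they return the same series:

```python
import pandas as pd

df = pd.DataFrame({'rain_octsep': [1023, 1184],
                   'outflow_octsep': [5024, 6011]})

col = df['rain_octsep']   # bracket notation, works for any column name
same = df.rain_octsep     # dot notation, possible because there are no spaces
print(col.equals(same))   # → True
```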
If you have read the NumPy post in the series, you may remember a technique called ‘boolean masking’, where we can get an array of boolean values by running a conditional on an array. Well, we can do this in Pandas too.
We can then use these conditional expressions to filter an existing dataframe.
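A sketch of both steps, the mask and the filter, with toy numbers:

```python
import pandas as pd

df = pd.DataFrame({'water_year': ['2010/11', '2011/12', '2012/13'],
                   'rain_octsep': [1023, 1184, 1340]})

mask = df.rain_octsep > 1100   # a series of True/False values
wet_years = df[mask]           # keeps only the rows where the mask is True
print(len(wet_years))  # → 2
```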
You can also filter by multiple conditional expressions.
An important thing to note here is that you cannot use the keyword ‘and’ due to problems with operator precedence. You must use ‘&’ and put brackets around each condition.
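For example (toy numbers again):

```python
import pandas as pd

df = pd.DataFrame({'rain_octsep': [1023, 1184, 1340],
                   'outflow_octsep': [5024, 6011, 7395]})

# Each condition in its own brackets, joined with & (not the keyword 'and')
both = df[(df.rain_octsep > 1100) & (df.outflow_octsep < 7000)]
print(len(both))  # → 1
```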
If you have strings in your data, then good news for you: you can also use string methods to filter with.
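A sketch, filtering a hypothetical water_year column on its text:

```python
import pandas as pd

df = pd.DataFrame({'water_year': ['1999/00', '2000/01', '2001/02']})

# .str exposes Python's string methods, applied element-wise
twenties = df[df.water_year.str.startswith('20')]
print(len(twenties))  # → 2
```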
INDEXING
The previous section showed us how to get data based on operations done to the columns, but Pandas has labelled rows too. These row labels can be numerical or label based, and the method of retrieving a row differs depending on this label type.

If your rows have numerical indices, you can reference them using iloc.
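For example (toy data):

```python
import pandas as pd

df = pd.DataFrame({'rain_octsep': [1023, 1184, 1340]})

# iloc fetches rows by numerical position, including negative positions
print(df.iloc[0]['rain_octsep'])   # → 1023, the first row
print(df.iloc[-1]['rain_octsep'])  # → 1340, the last row
```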
Maybe in your dataset you have a column of years, or ages. Maybe you want to be able to reference rows using these years or ages. In this case we can set a new index (or multiple).
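Setting an index might look like this; here the (hypothetical) water_year column of strings becomes the row labels:

```python
import pandas as pd

df = pd.DataFrame({'water_year': ['2010/11', '2011/12'],
                   'rain_octsep': [1023, 1184]})

# The column's values become the row labels
df = df.set_index('water_year')
print(df.index[0])  # → 2010/11
```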
In the above example we set our index to be a column that is full of strings. This means that we now can’t reference them with iloc, so what do we use? We use loc.
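A sketch of loc on a string-labelled index:

```python
import pandas as pd

df = pd.DataFrame({'rain_octsep': [1023, 1184]},
                  index=['2010/11', '2011/12'])

# loc fetches rows by label rather than by position
print(df.loc['2011/12']['rain_octsep'])  # → 1184
```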
There is another commonly used way to reference a row; ix. So if loc is label based and iloc is numerical based… What is ix? Well, ix is label based with a numerical index fallback.
So if ix does the job of both loc and iloc, why would you use anything else? The big reason is that it is slightly unpredictable. Remember how I said that it is label based with a numerical index fallback? Well, this makes it do weird things sometimes, like interpreting a number as a location. Using loc and iloc gives you safety, predictability and peace of mind. I should point out, however, that ix is faster than both loc and iloc. (Note that ix has since been deprecated and later removed from Pandas, so in newer versions you have to stick with loc and iloc anyway.)
It’s often useful to have indexes in order; we can do this in Pandas by calling sort_index on our dataframe.
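For example, with labels that start out of order:

```python
import pandas as pd

df = pd.DataFrame({'rain_octsep': [1184, 1023]},
                  index=['2011/12', '2010/11'])

df = df.sort_index(ascending=True)   # sort rows by their labels
print(list(df.index))  # → ['2010/11', '2011/12']
```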
When you set a column of data to an index, it is no longer data per se. If you want to return the index to its original data form, you just do the opposite of set_index… reset_index.
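A sketch of undoing a (hypothetical) water_year index:

```python
import pandas as pd

df = pd.DataFrame({'rain_octsep': [1023, 1184]},
                  index=pd.Index(['2010/11', '2011/12'], name='water_year'))

df = df.reset_index()   # the index goes back to being an ordinary column
print(list(df.columns))  # → ['water_year', 'rain_octsep']
```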
APPLYING FUNCTIONS TO DATASETS
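Before going further, here is a quick sketch of the decade idea this section builds on, using a toy year column; apply works on a single column, while applymap (renamed DataFrame.map in recent Pandas versions) does the same across every cell of a dataframe:

```python
import pandas as pd

df = pd.DataFrame({'year': [1996, 2004, 2011]})

# apply runs a function over each element of a column (a series)
df['decade'] = df['year'].apply(lambda y: y // 10 * 10)
print(list(df['decade']))  # → [1990, 2000, 2010]
```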
Sometimes you will want to change or operate on the data in your dataset in some way. For example, maybe you have a list of years and want to create a new column that gives each year’s decade. Pandas has two very useful functions for this, apply and applymap.

MANIPULATING A DATASET’S STRUCTURE
Another common thing to do with dataframes is to restructure them in order to put them in a more convenient and/or useful form.

The easiest way to get to grips with these transformations is to see them happening. More than anything else in this post, the next few operations require some playing with to get your head around them.
First up, groupby…
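The original snippet isn’t shown, but a minimal groupby sketch with toy numbers might look like this:

```python
import pandas as pd

df = pd.DataFrame({'decade': [1990, 1990, 2000, 2000],
                   'rain_octsep': [1000, 1100, 1200, 1300]})

# Collapse the rows into one row per decade, averaging within each group
by_decade = df.groupby('decade').mean()
print(by_decade.loc[1990, 'rain_octsep'])  # → 1050.0
```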
You can also form groups with multiple columns.
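A sketch of that, with a made-up second column; a list of columns produces one group per unique combination:

```python
import pandas as pd

df = pd.DataFrame({'decade': [1990, 1990, 2000],
                   'station': ['A', 'B', 'A'],
                   'rain': [1000, 1100, 1200]})

# Grouping by two columns gives the result a two-level (Multi) index
grouped = df.groupby(['decade', 'station']).max()
print(grouped.loc[(1990, 'B'), 'rain'])  # → 1100
```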
Let’s do one more for good measure. This time we’ll group on the first column, which is the index ‘rain_octsep’.
Pivoting is actually a combination of operations that we have already looked at in this post. First it sets a new index (set_index()), then it sorts that index (sort_index()) and finally it does an unstack on it. Together this is a pivot. See if you can visualise what is happening.
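The original code isn’t shown here, but a small reconstruction with toy numbers might run like this, producing the kind of result described below:

```python
import pandas as pd

df = pd.DataFrame({'year': [2010, 2011, 2012],
                   'rain_octsep': [1023, 1280, 1340],
                   'outflow_octsep': [5024, 6011, 7395]})

# Keep the wet years, set a two-level index, sort it, then unstack the
# inner level out into columns: together, a pivot
high_rain = df[df.rain_octsep > 1250]
pivoted = (high_rain.set_index(['year', 'rain_octsep'])
                    .sort_index()['outflow_octsep']
                    .unstack())
print(pivoted.loc[2011, 1280])  # → 6011.0
```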
The above dataframe shows us the outflow for all the years with rainfall over 1250. Admittedly this wasn’t the best example of a pivot in terms of practical use, but hopefully you get the idea. See what you can come up with in your dataset.
COMBINING DATASETS
Sometimes you will have two separate datasets that relate to each other and that you want to compare or combine. Well, no problem; Pandas makes this easy.

As you can see below, the two datasets have been combined on the year category. The rain_jpn dataset only has the year and the amount of rainfall, and as we merged on the year column, only one column, ‘jpn_rainfall’, has been merged with the columns from our UK rain dataset.
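A reconstruction of that merge with toy numbers (the column names are illustrative):

```python
import pandas as pd

uk = pd.DataFrame({'year': [2013, 2014],
                   'uk_rainfall': [1200, 1250],
                   'uk_outflow': [5500, 5700]})
jpn = pd.DataFrame({'year': [2013, 2014],
                    'jpn_rainfall': [1600, 1700]})

# Merge on the shared year column; only jpn_rainfall gets added
combined = uk.merge(jpn, on='year')
print(list(combined.columns))
# → ['year', 'uk_rainfall', 'uk_outflow', 'jpn_rainfall']
```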
USING PANDAS TO PLOT GRAPHS QUICKLY
Matplotlib is great, but it takes a fair bit of code to create a halfway decent graph, and sometimes you just want to quickly whip up a plot of your data, for your eyes only, to help you explore and make sense of it. Pandas answers this problem with plot.

You can also see that the UK’s rainfall is significantly less than Japan’s, and people say it rains a lot in the UK!
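The plot behind that remark might have looked something like this (toy numbers; the Agg backend just lets it run without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs without a display
import pandas as pd

df = pd.DataFrame({'year': [2013, 2014, 2015],
                   'uk_rainfall': [1200, 1250, 1180],
                   'jpn_rainfall': [1600, 1700, 1650]})

# One call gives a quick line plot per column (pandas wraps Matplotlib)
ax = df.plot(x='year', y=['uk_rainfall', 'jpn_rainfall'])
ax.figure.savefig('rainfall.png')
```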
SAVING YOUR DATASETS
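Writing a dataframe back out is a one-liner with to_csv; a sketch with a hypothetical filename:

```python
import pandas as pd

df = pd.DataFrame({'water_year': ['2010/11', '2011/12'],
                   'rain_octsep': [1023, 1184]})

# Filename is hypothetical; index=False skips writing the row labels
df.to_csv('uk_rain_clean.csv', index=False)
```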
After cleaning, reshaping and exploring your dataset, you often end up with something very different and much more useful than what you started with. You should always keep your original data, but saving your newly polished dataset is a good idea too.

So there we have it, an introduction to Pandas. As I said before, Pandas is really great and we have only scratched the surface here, but you should now know enough to get going and start cleaning and exploring data.
As usual I really urge you to go and play with this. Find a dataset or two that really interests you, sit down with a beer maybe and start probing it. It’s really the only way to get comfortable with Pandas and the other libraries introduced in this series. Plus you never know, you might find something interesting.