Tuesday, July 26, 2016

Chasing Cats

http://myplace.frontier.com/~r.bond/cats/cats.htm

Thursday, July 21, 2016

Beginner's GUide to Understanding Convolutional Neural Network

https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

A Beginner's Guide To Understanding Convolutional Neural Networks

Introduction

Convolutional neural networks. Sounds like a weird combination of biology and math with a little CS sprinkled in, but these networks have been some of the most influential innovations in the field of computer vision. 2012 was the first year that neural nets grew to prominence as Alex Krizhevsky used them to win that year’s ImageNet competition (basically, the annual Olympics of computer vision), dropping the classification error record from 26% to 15%, an astounding improvement at the time.Ever since then, a host of companies have been using deep learning at the core of their services. Facebook uses neural nets for their automatic tagging algorithms, Google for their photo search, Amazon for their product recommendations, Pinterest for their home feed personalization, and Instagram for their search infrastructure.

However, the classic, and arguably most popular, use case of these networks is for image processing. Within image processing, let’s take a look at how to use these CNNs for image classification.

The Problem Space

Image classification is the task of taking an input image and outputting a class (a cat, dog, etc) or a probability of classes that best describes the image. For humans, this task of recognition is one of the first skills we learn from the moment we are born and is one that comes naturally and effortlessly as adults. Without even thinking twice, we’re able to quickly and seamlessly identify the environment we are in as well as the objects that surround us. When we see an image or just when we look at the world around us, most of the time we are able to immediately characterize the scene and give each object a label, all without even consciously noticing. These skills of being able to quickly recognize patterns, generalize from prior knowledge, and adapt to different image environments are ones that we do not share with our fellow machines.

Inputs and Outputs

When a computer sees an image (takes an image as input), it will see an array of pixel values. Depending on the resolution and size of the image, it will see a 32 x 32 x 3 array of numbers (The 3 refers to RGB values). Each of these numbers is given a value from 0 to 255 which describes the pixel intensity at that point. These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer. The idea is that you give the computer this array of numbers and it will output numbers that describe the probability of the image being a certain class (.80 for cat, .15 for dog, .05 for bird, etc).

What We Want the Computer to Do

Now that we know the problem as well as the inputs and outputs, let’s think about how to approach this. What we want the computer to do is to be able to differentiate between all the images it’s given and figure out the unique features that make a dog a dog or that make a cat a cat. This is the process that goes on in our minds subconsciously as well. When we look at a picture of a dog, we can classify it as such if the picture has identifiable features such as paws or 4 legs. In a similar way, the computer is able perform image classification by looking for low level features such as edges and curves, and then building up to more abstract concepts through a series of convolutional layers. This is a general overview of what a CNN does. Let’s get into the specifics.

Biological Connection

But first, a little background. When you first heard of the term convolutional neural networks, you may have thought of something related to neuroscience or biology, and you would be right. Sort of. CNNs do take a biological inspiration from the visual cortex. The visual cortex has small regions of cells that are sensitive to specific regions of the visual field. This idea was expanded upon by a fascinating experiment by Hubel and Wiesel in 1962 (Video) where they showed that some individual neuronal cells in the brain responded (or fired) only in the presence of edges of a certain orientation. For example, some neurons fired when exposed to vertical edges and some when shown horizontal or diagonal edges. Hubel and Wiesel found out that all of these neurons were organized in a columnar architecture and that together, they were able to produce visual perception. This idea of specialized components inside of a system having specific tasks (the neuronal cells in the visual cortex looking for specific characteristics) is one that machines use as well, and is the basis behind CNNs.

Structure

Back to the specifics. A more detailed overview of what CNNs do would be that you take the image, pass it through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers, and get an output. As we said earlier, the output can be a single class or a probability of classes that best describes the image. Now, the hard part is understanding what each of these layers do. So let’s get into the most important one.

First Layer – Math Part

The first layer in a CNN is always a Convolutional Layer. First thing to make sure you remember is what the input to this conv (I’ll be using that abbreviation a lot) layer is. Like we mentioned before, the input is a 32 x 32 x 3 array of pixel values. Now, the best way to explain a conv layer is to imagine a flashlight that is shining over the top left of the image. Let’s say that the light this flashlight shines covers a 5 x 5 area. And now, let’s imagine this flashlight sliding across all the areas of the input image. In machine learning terms, this flashlight is called a filter(or sometimes referred to as a neuron or a kernel) and the region that it is shining over is called the receptive field. Now this filter is also an array of numbers (the numbers are called weights or parameters). A very important note is that the depth of this filter has to be the same as the depth of the input (this makes sure that the math works out), so the dimensions of this filter is 5 x 5 x 3. Now, let’s take the first position the filter is in for example. It would be the top left corner. As the filter is sliding, or convolving, around the input image, it is multiplying the values in the filter with the original pixel values of the image (aka computing dot products). These multiplications are all summed up (mathematically speaking, this would be 75 multiplications in total). So now you have a single number. Remember, this number is just representative of when the filter is at the top left of the image. Now, we repeat this process for every location on the input volume. (Next step would be moving the filter to the right by 1 unit, then right again by 1, and so on). Every unique location on the input volume produces a number. After sliding the filter over all the locations, you will find out that what you’re left with is a 28 x 28 x 1 array of numbers, which we call an activation map or feature map. The reason you get a 28 x 28 array is that there are 784 different locations that a 5 x 5 filter can fit on a 32 x 32 input image. These 784 numbers are mapped to a 28 x 28 array.

Let’s say now we use two 5 x 5 x 3 filters instead of one. Then our output volume would be 28 x 28 x 2. By using more filters, we are able to preserve the spatial dimensions better. Mathematically, this is what’s going on in a convolutional layer.

First Layer – High Level Perspective

However, let’s talk about what this convolution is actually doing from a high level. Each of these filters can be thought of as feature identifiers. When I say features, I’m talking about things like straight edges, simple colors, and curves. Think about the simplest characteristics that all images have in common with each other. Let’s say our first filter is 7 x 7 x 3 and is going to be a curve detector. (In this section, let’s ignore the fact that the filter is 3 units deep and only consider the top depth slice of the filter and the image, for simplicity.)As a curve detector, the filter will have a pixel structure in which there will be higher numerical values along the area that is a shape of a curve (Remember, these filters that we’re talking about as just numbers!).

Now, let’s go back to visualizing this mathematically. When we have this filter at the top left corner of the input volume, it is computing dot products between the filter and pixel values at that region. Now let’s take an example of an image that we want to classify, and let’s put our filter at the top left corner.

Remember, what we have to do is multiply the values in the filter with the original pixel values of the image.

Basically, in the input image, if there is a shape that generally resembles the curve that this filter is representing, then all of the dot products summed together will result in a large value! Now let’s see what happens when we move our filter.

The value is much lower! This is because there wasn’t anything in the image section that responded to the curve detector filter. Remember, the output of this conv layer is an activation map. So, in the simple case of a one filter convolution (and if that filter is a curve detector), the activation map will show the areas in which there at mostly likely to be curves in the picture. In this example, the top left value of our 28 x 28 x 1 activation map will be 6600. This high value means that it is likely that there is some sort of curve in the input volume that caused the filter to activate. The top right value in our activation map will be 0 because there wasn’t anything in the input volume that caused the filter to activate (or more simply said, there wasn’t a curve in that region of the original image). Remember, this is just for one filter. This is just a filter that is going to detect lines that curve outward and to the right. We can have other filters for lines that curve to the left or for straight edges. The more filters, the greater the depth of the activation map, and the more information we have about the input volume.

Disclaimer: The filter I described in this section was simplistic for the main purpose of describing the math that goes on during a convolution. In the picture below, you’ll see some examples of actual visualizations of the filters of the first conv layer of a trained network. Nonetheless, the main argument remains the same. The filters on the first layer convolve around the input image and “activate” (or compute high values) when the specific feature it is looking for is in the input volume.

Going Deeper Through the Network

Now in a traditional convolutional neural network architecture, there are other layers that are interspersed between these conv layers. I’d strongly encourage those interested to read up on them and understand their function and effects, but in a general sense, they provide nonlinearities and preservation of dimension that help to improve the robustness of the network and control overfitting. A classic CNN architecture would look like this.

The last layer, however, is an important one and one that we will go into later on. Let’s just take a step back and review what we’ve learned so far. We talked about what the filters in the first conv layer are designed to detect. They detect low level features such as edges and curves. As one would imagine, in order to predict whether an image is a type of object, we need the network to be able to recognize higher level features such as hands or paws or ears. So let’s think about what the output of the network is after the first conv layer. It would be a 28 x 28 x 3 volume (assuming we use three 5 x 5 x 3 filters). When we go through another conv layer, the output of the first conv layer becomes the input of the 2^nd conv layer. Now, this is a little bit harder to visualize. When we were talking about the first layer, the input was just the original image. However, when we’re talking about the 2^nd conv layer, the input is the activation map(s) that result from the first layer. So each layer of the input is basically describing the locations in the original image for where certain low level features appear. Now when you apply a set of filters on top of that (pass it through the 2^nd conv layer), the output will be activations that represent higher level features. Types of these features could be semicircles (combination of a curve and straight edge) or squares (combination of several straight edges). As you go through the network and go through more conv layers, you get activation maps that represent more and more complex features. By the end of the network, you may have some filters that activate when there is handwriting in the image, filters that activate when they see pink objects, etc. If you want more information about visualizing filters in ConvNets, Matt Zeiler and Rob Fergus had an excellent research paper discussing the topic. Jason Yosinski also has a video on YouTube that provides a great visual representation. Another interesting thing to note is that as you go deeper into the network, the filters begin to have a larger and larger receptive field, which means that they are able to consider information from a larger area of the original input volume (another way of putting it is that they are more responsive to a larger region of pixel space).

Fully Connected Layer

Now that we can detect these high level features, the icing on the cake is attaching a fully connected layer to the end of the network. This layer basically takes an input volume (whatever the output is of the conv or ReLU or pool layer preceding it) and outputs an N dimensional vector where N is the number of classes that the program has to choose from. For example, if you wanted a digit classification program, N would be 10 since there are 10 digits. Each number in this N dimensional vector represents the probability of a certain class. For example, if the resulting vector for a digit classification program is [0 .1 .1 .75 0 0 0 0 0 .05], then this represents a 10% probability that the image is a 1, a 10% probability that the image is a 2, a 75% probability that the image is a 3, and a 5% probability that the image is a 9 (Side note: There are other ways that you can represent the output, but I am just showing the softmax approach). The way this fully connected layer works is that it looks at the output of the previous layer (which as we remember should represent the activation maps of high level features) and determines which features most correlate to a particular class. For example, if the program is predicting that some image is a dog, it will have high values in the activation maps that represent high level features like a paw or 4 legs, etc. Similarly, if the program is predicting that some image is a bird, it will have high values in the activation maps that represent high level features like wings or a beak, etc. Basically, a FC layer looks at what high level features most strongly correlate to a particular class and has particular weights so that when you compute the dot product between the weights and the previous layer, you get the correct probabilities for the different classes.

Training (AKA:What Makes this Stuff Work)

Now, this is the one aspect of neural networks that I purposely haven’t mentioned yet and it is probably the most important part. There may be a lot of questions you had while reading. How do the filters in the first conv layer know to look for edges and curves? How does the fully connected layer know what activation maps to look at? How do the filters in each layer know what values to have? The way the computer is able to adjust its filter values (or weights) is through a training process called backpropagation.

Before we get into backpropagation, we must first take a step back and talk about what a neural network needs in order to work. At the moment we all were born, our minds were fresh. We didn’t know what a cat or dog or bird was. In a similar sort of way, before the CNN starts, the weights or filter values are randomized. The filters don’t know to look for edges and curves. The filters in the higher layers don’t know to look for paws and beaks. As we grew older however, our parents and teachers showed us different pictures and images and gave us a corresponding label. This idea of being given an image and a label is the training process that CNNs go through. Before getting too into it, let’s just say that we have a training set that has thousands of images of dogs, cats, and birds and each of the images has a label of what animal that picture is. Back to backprop.

So backpropagation can be separated into 4 distinct sections, the forward pass, the loss function, the backward pass, and the weight update. During the forward pass, you take a training image which as we remember is a 32 x 32 x 3 array of numbers and pass it through the whole network. On our first training example, since all of the weights or filter values were randomly initialized, the output will probably be something like [.1 .1 .1 .1 .1 .1 .1 .1 .1 .1], basically an output that doesn’t give preference to any number in particular. The network, with its current weights, isn’t able to look for those low level features or thus isn’t able to make any reasonable conclusion about what the classification might be. This goes to the loss function part of backpropagation. Remember that what we are using right now is training data. This data has both an image and a label. Let’s say for example that the first training image inputted was a 3. The label for the image would be [0 0 0 1 0 0 0 0 0 0]. A loss function can be defined in many different ways but a common one is MSE (mean squared error), which is ½ times (actual - predicted) squared.

Let’s say the variable L is equal to that value. As you can imagine, the loss will be extremely high for the first couple of training images. Now, let’s just think about this intuitively. We want to get to a point where the predicted label (output of the ConvNet) is the same as the training label (This means that our network got its prediction right).In order to get there, we want to minimize the amount of loss we have. Visualizing this as just an optimization problem in calculus, we want to find out which inputs (weights in our case) most directly contributed to the loss (or error) of the network.

This is the mathematical equivalent of a dL/dW where W are the weights at a particular layer. Now, what we want to do is perform a backward pass through the network, which is determining which weights contributed most to the loss and finding ways to adjust them so that the loss decreases. Once we compute this derivative, we then go to the last step which is the weight update. This is where we take all the weights of the filters and update them so that they change in the direction of the gradient.

The learning rate is a parameter that is chosen by the programmer. A high learning rate means that bigger steps are taken in the weight updates and thus, it may take less time for the model to converge on an optimal set of weights. However, a learning rate that is too high could result in jumps that are too large and not precise enough to reach the optimal point.

The process of forward pass, loss function, backward pass, and parameter update is generally called one epoch. The program will repeat this process for a fixed number of epochs for each training image. Once you finish the parameter update on the last training example, hopefully the network should be trained well enough so that the weights of the layers are tuned correctly.

Testing

Finally, to see whether or not our CNN works, we have a different set of images and labels (can’t double dip between training and test!) and pass the images through the CNN. We compare the outputs to the ground truth and see if our network works!

How Companies Use CNNs

Data, data, data. The companies that have lots of this magic 4 letter word are the ones that have an inherent advantage over the rest of the competition. The more training data that you can give to a network, the more training iterations you can make, the more weight updates you can make, and the better tuned to the network is when it goes to production. Facebook (and Instagram) can use all the photos of the billion users it currently has, Pinterest can use information of the 50 billion pins that are on its site, Google can use search data, and Amazon can use data from the millions of products that are bought every day. And now you know the magic behind how they use it.

Disclaimer

While this post should be a good start to understanding CNNs, it is by no means a comprehensive overview. Things not discussed in this post include the nonlinear and pooling layers as well as hyperparameters of the network such as filter sizes, stride, and padding. Topics like network architecture, batch normalization, vanishing gradients, dropout, initialization techniques, non-convex optimization,biases, choices of loss functions, data augmentation,regularization methods, computational considerations, modifications of backpropagation, and more were also not discussed (yet ).

Sources

Data Science is Software Talk

http://isaacslavitt.com/2016/07/20/data-science-is-software-talk/

Last week, Peter and I gave a tutorial at SciPy 2016. In keeping with our recent thinking about reproducibility and organization, the idea was to look at how some of the hard-won lessons and best practices from modern software engineering can be applied to data science work.
We ran a little short on time near the end, partly because a few people ran into pretty wild open bugs and configuration issues during the virtual environment lab, and partly because we tried to fit a bunch of varied subjects into one tutorial. But overall, I think it ended up being a pretty good session.
The repo with slides, notebooks, and other materials is here. Here's the talk:

Monday, July 11, 2016

Caffee

http://dl.caffe.berkeleyvision.org/

Wednesday, July 6, 2016

Applying Machine Learning to InfoSec

http://conf.startup.ml/blog/infosec

Arshak Navruzyan

July 4, 2016

There seems to be very little overlap currently between the worlds of infosec and machine learning. If a data scientist attended Black Hat and a network security expert went to NIPS, they would be equally at a loss.
This is unfortunate because infosec can definitely benefit from a probabilistic approach but a significant amount of domain expertise is required in order to apply ML methods.
Machine learning practitioners face a few challenges for doing work in this domain including understanding the datasets, how to do feature engineering (in a generalizable way) and creation of labels.

Available DatasetS

A variety of datasets can be collected as a precursor to creating a training set for a machine learning model:

Log files from systems, firewalls, proxies, routers, switches that capture in semi-structured formats network activity and user behavior
Application level logging and diagnostics that record user/system access information and application usage
Monitoring tools, IDS systems and SIEMs
Network Packet Capture (PCAP) is a rather compute/storage intensive process of recording the raw ethernet frames

Some of these sources (like log formats) are readily available and fairly standardized while others will require extensive tooling and software modifications (e.g. application logging), and yet others will require a significant hardware footprint and a monitoring network that could rival the size of the real network.

Feature Engineering

Bearing in mind that the whole point of machine learning is generalization beyond the training set, thoughtful feature engineering is required to go from the identity information of IP addresses, hostnames and URLs to something that can turn into a useful representation within the machine learning model.
For example the following might be a useful feature space created from proxy logs (Franc)

length
digit ratio
lower case ratio
upper case ratio
vowel changes ratio
has repetition of '&' and '='
start with number
number of non-base64 characters
has a special character
max length of consonant stream
max length of vowel stream
max length of lower case stream
max length of upper case stream
max length of digit stream
ratio of a character with max occurrence

(session) duration
HTTP request status
is URL encrypted
is protocol HTTPS
number of bytes up
number of bytes down
is URL in ASCII
client port number
server port number
user agent length
MIME-Type length
number of '/' in path
number of '/' in query
number of '/' in referrer
is the second-level domain raw IP

getting labels

Once the input feature space has been established, getting the label is the next challenge. For each observation of the training set, we need to determine if it is a malicious or a benign pattern.
Having a network security expert create labels (on potentially millions of observations) will be expensive and time consuming. Instead we might rely on a semi-supervised approach by leveraging publicly available threat intelligence feeds. MLSec provides a set of tools and resources for gathering and processing various threat intelligence feeds.*
Thus labels are created by doing a join between public blacklists and the collected dataset. Matching can be done on fields like IP addresses, domains, agents, etc. Keep in mind that these identity elements are only used to produce the label and will not be part of the model input (Franc).

Modeling considerations

Finally once we have the input feature space and the labels, we are ready to train a model. We can anticipate certain characteristics of input space such as class imbalance, non-linearity and noisiness. A modern ensemble method like random forest or gradient boosted trees should overcome these issues with proper parameter tuning (Seni).
Bigger issue is that this is an adversarial use case and model decay will be a significant factor. Since there is an active adversary trying to avoid detection, attack patterns will constantly evolve which will cause data drift for the model.
Some possible solutions to the adversarial issue could be the use of a nonparametric method, using online/active learning (i.e. letting the model evolve on every new observation) or having rigorous tests to determine when the model should be retrained.

LEARNING MORE

To address some of the issues unique to adversarial machine learning, Startup.ML is organizing a one-day special conference on September 10th in San Francisco. Leading practitioners from Google, Coinbase, Ripple, Stripe, Square, etc. will cover their approaches to solving these problems in hands-on workshops and talks.
The conference will also include a hands-on, 90 minute tutorial on TensorFlow by Illia Polosukhin one of the most active contributors to Google's new deep learning library.

Reference

Franc, Vojtech, Michal Sofka, and Karel Bartos. "Learning detector of malicious network traffic from weak labels." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2015.
Seni, Giovanni, and John F. Elder. "Ensemble methods in data mining: improving accuracy through combining predictions." Synthesis Lectures on Data Mining and Knowledge Discovery 2.1 (2010): 1-126.

Notes

* This approach is limited to global IP addresses and domains and cannot be used for internal threats.

Monday, July 4, 2016

An Introduction to Scientific Python: Pandas

http://www.datadependence.com/2016/05/scientific-python-pandas

Pandas has got to be one of my most favourite libraries… Ever.
Pandas allows us to deal with data in a way that us humans can understand it; with labelled columns and indexes. It allows us to effortlessly import data from files such as csvs, allows us to quickly apply complex transformations and filters to our data and much more. It’s absolutely brilliant.
Along with Numpy and Matplotlib I feel it helps create a really strong base for data exploration and analysis in Python. Scipy (which will be covered in the next post), is of course a major component and another absolutely fantastic library, but I feel these three are the real pillars of scientific Python.
So without any ado, let’s get on with the third post in this series on scientific Python and take a look at Pandas. Don’t forget to check out the other posts if you haven’t yet!

IMPORTING PANDAS

First thing to do its to import the star of the show, Pandas.

import pandas as pd # This is the standard

1	import pandas as pd # This is the standard

This is the standard way to import Pandas. We don’t want to be writing ‘pandas’ all the time but it’s important to keep code concise and avoid naming clashes so we compromise with ‘pd’. If you look at other people’s code that uses Pandas you will see this import.

THE PANDAS DATA TYPES

Pandas is based around two data types, the series and the dataframe.
A series is a one-dimensional data type where each element is labelled. If you have read the post in this series on NumPy, you can think of it as a numpy array with labelled elements. Labels can be numeric or strings.
A dataframe is a two-dimensional, tabular data structure. The Pandas dataframe can store many different data types and each axis is labelled. You can think of it as sort of like a dictionary of series.

GETTING DATA INTO PANDAS

Before we can start wrangling, exploring and analysing, we first need data to wrangle, explore and analyse. Thanks to Pandas this is very easy, more so than NumPy.
Here I encourage you to find your own dataset, one that interests you and play around with that. Some good resources to find datasets are your country’s (or another’s) website. If you search for example UK government data or US government data, it will be one of the first results. Kaggle is another great source.
I will be using data on the UK’s rainfall that I found on the UK government’s website which can easily be downloaded and towards the end, some data that I got off a website about Japan’s rainfall.

# Reading a csv into Pandas.
df = pd.read_csv('notebook_playground/data/uk_rain_2014.csv', header=0)

1 2	# Reading a csv into Pandas. df = pd.read_csv('notebook_playground/data/uk_rain_2014.csv', header=0)

Here we get data from a csv file and store it in a dataframe. It’s as simple as calling read_csv and putting the path to your csv file as an argument. The header keyword argument tells Pandas if and where the column names of your data are. If there are no column names you can set it to None. Pandas is pretty clever so this can often be omitted.

GETTING YOUR DATA READY TO EXPLORE AND ANALYSE

Now we have our data in Pandas, we probably want to take a quick look at it and know some basic information about it to give us some direction before we really probe into it.
To take a quick look at the first x rows of the data.

# Getting first x rows.
df.head(5)

1 2	# Getting first x rows. df.head(5)

All we do is use the head() function and pass it the number of rows we want to retrieve.
You’ll end up with table looking like this:

Another thing you might want to do is get the last x rows.

# Getting last x rows.
df.tail(5)

1 2	# Getting last x rows. df.tail(5)

As with head all we do is call tail and pass it the number of rows we want to retrieve. Notice that it doesn’t start at the end of the dataframe and goes backwards. It gives you the rows in the order they are in in the dataframe.
You will end up with something that looks like this:

When referring to columns in Pandas you often refer to their names. This great and very easy to work with, but sometimes data has horribly long column names, such as whole questions from questionnaires. Life is much easier when you change them to make them shorter.

# Changing column labels.
df.columns = ['water_year','rain_octsep', 'outflow_octsep',
              'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']

df.head(5)

# Changing column labels.

df.columns = ['water_year','rain_octsep', 'outflow_octsep',

'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']

df.head(5)

One thing to note here is that I have purposely made all the column labels with no spaces and no dashes. If we name our variables like this, it saves us some typing as you will see later.
You will end up with the same data as before, but with different column names:

Another important thing about your data that you will usually want to know is how many entries you have. In Pandas one entry equates to one row, so we can take the len of the dataset, which returns the amount of rows there are.

# Finding out how many rows dataset has.
len(df)

1 2	# Finding out how many rows dataset has. len(df)

This will give you an integer telling you the number of rows, in my dataset I have 33.
One more thing that you might need to know is some basic statistics on your data, Pandas makes this delightfully simple.

# Finding out basic statistical information on your dataset.
pd.options.display.float_format = '{:,.3f}'.format # Limit output to 3 decimal places.
df.describe()

# Finding out basic statistical information on your dataset.

pd.options.display.float_format = '{:,.3f}'.format # Limit output to 3 decimal places.

df.describe()

This will return a table of various statistics such as count, mean, standard deviation and more that will look at bit like this:

FILTERING

When poking around in your dataset you will often want to take specific a sample of your data, for example if you had a questionnaire on job satisfaction, you might want to take all the people in a specific industry or age range.
Pandas gives us many ways to filter our data to extract the information we want.
Sometimes you’ll want to extract a whole column. Using column labels this is extremely easy.

# Getting a column by label
df['rain_octsep']

1 2	# Getting a column by label df['rain_octsep']

Note that when we extract a column, we are given back a series, not a dataframe. If you remember, you can think of a dataframe as a dictionary of series, so if we pull out a column, of course we get a series back.
Remember how I pointed out my choice in naming the column labels? Not using spaces or dashes etc. allows us to access columns the same way we can access object properties; using a dot.

# Getting a column by label using .
df.rain_octsep

1 2	# Getting a column by label using . df.rain_octsep

This will return exactly the same as the previous example. A series of the data in our chosen column.
If you have read the Numpy post in the series, you may remember a technique called ‘boolean masking’ and how we can get an array of boolean values running a conditional on an array. Well we can do this in Pandas too.

# Creating a series of booleans based on a conditional
df.rain_octsep < 1000 # Or df['rain_octsep] < 1000

1 2	# Creating a series of booleans based on a conditional df.rain_octsep < 1000 # Or df['rain_octsep] < 1000

The above code will return a dataframe of boolean values; ‘True’ if the rain in October-September what less than 1000mm and ‘False’ if not.
We can then use these conditional expressions to filter an existing dataframe.

# Using a series of booleans to filter
df[df.rain_octsep < 1000]

1 2	# Using a series of booleans to filter df[df.rain_octsep < 1000]

This will return a dataframe of only entries that had less than 1000mm of rain from October-September.

You can also filter by multiple conditional expressions.

# Filtering by multiple conditionals
df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)] # Can't use the keyword 'and'

1 2	# Filtering by multiple conditionals df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)] # Can't use the keyword 'and'

This will return only the entries that have a value of less than 1000 for rain_octsep and less than 4000 for outflow_octsep.
An important thing to note here is you cannot use the keyword ‘and’ here due to problems with the order of operations. You must use ‘&’ and brackets.

If you have strings in your data, then good news for you, you can also use string methods to filter with.

# Filtering by string methods
df[df.water_year.str.startswith('199')]

1 2	# Filtering by string methods df[df.water_year.str.startswith('199')]

Note that you have to use .str.[string method], you can’t just call a string method on it right away. This returns all entries in the 1990’s.

INDEXING

The previous section showed us how to get data based on operations done to the columns, but Pandas has labelled rows too. These row labels can be numerical or label based, and the method of retrieving a row differs depending on this label type.
If your rows have numerical indices, you can reference them using iloc.

# Getting a row via a numerical index
df.iloc[30]

1 2	# Getting a row via a numerical index df.iloc[30]

iloc will only work on numerical indices. It will return a series of that row. Each column of that row will be an element in the returned series.
Maybe in your dataset you have a column of years, or ages. Maybe you want to be able to reference rows using these years or ages. In this case we can set a new index (or multiple).

# Setting a new index from an existing column
df = df.set_index(['water_year'])
df.head(5)

# Setting a new index from an existing column

df = df.set_index(['water_year'])

df.head(5)

This will make the column ‘water_year’ an index. Notice that the column name is actually in a list, although the one above only has one element. If you wanted to have more than one index, this can be easily done by adding another column name to the list.

In the above example we set our index to be a column that is full of strings. This means that we now can’t reference then with iloc, so what do we use? We use loc.

# Getting a row via a label-based index
df.loc['2000/01']

1 2	# Getting a row via a label-based index df.loc['2000/01']

This, like iloc will return a series of the row you reference. The only difference is this time you are using label based referencing not numerical based.
There is another commonly used way to reference a row; ix. So if loc is label based and iloc is numerical based… What is ix? Well, ix is label based with a numerical index fallback.

# Getting a row via a label-based or numerical index
df.ix['1999/00'] # Label based with numerical index fallback *Not recommended

1 2	# Getting a row via a label-based or numerical index df.ix['1999/00'] # Label based with numerical index fallback *Not recommended

Just like loc and iloc this will return a series of the row you referenced.
So if ix does the job of both loc and iloc, why would you use anything else? The big reason is that it is slightly unpredictable. Remember how I said that it is label based with a numerical index fallback? Well this makes it do weird things sometimes like interpreting a number as a location. Using loc and iloc gives you safety, predictability, peace of mind. I should point out however, that ix is faster than both loc and iloc
It’s often useful to have indexes in order, we can do this in pandas via calling sort_index on our dataframe.

df.sort_index(ascending=False).head(5) #inplace=True to apple the sorting in place

1	df.sort_index(ascending=False).head(5) #inplace=True to apple the sorting in place

My index is already in order, so I set the keyword argument ‘ascending’ to False for demonstration purposes. This makes my data sort in descending order.

When you set a column of data to an index, it is no longer data per se. If you want to return the index to it’s original data form you just do the opposite of set_index… reset_index.

# Returning an index to data
df = df.reset_index('water_year')
df.head(5)

# Returning an index to data

df = df.reset_index('water_year')

df.head(5)

This will return your index to it’s original column form.

APPLYING FUNCTIONS TO DATASETS

Sometimes you will want to change or operate on the data in your dataset in some way. For example maybe you have a list of years and want to create a new column that gives the year’s decade. Pandas has two very useful functions for this, apply and applymap.

# Applying a function to a column
def base_year(year):
    base_year = year[:4]
    base_year= pd.to_datetime(base_year).year
    return base_year

df['year'] = df.water_year.apply(base_year)
df.head(5)

# Applying a function to a column

def base_year(year):

base_year = year[:4]

base_year= pd.to_datetime(base_year).year

return base_year

df['year'] = df.water_year.apply(base_year)

df.head(5)

This creates a new column called ‘year’ that is derived from the ‘water_year’ column and extracts just the main year. This is how to use apply, which is how you apply a function to a column. If you wanted to apply some function to the whole dataset you can use dataset.applymap().

MANIPULATING A DATASET’S STRUCTURE

Another common thing to do with dataframes is to restructure them in order to put them in a more convenient and/or useful form.
The easiest way to get to grips with these transformations is to see them happening, more than anything else in this post, the next few operations require some playing with to get your head around them.
First up, groupby…

#Manipulating structure (groupby, unstack, pivot)
# Grouby
df.groupby(df.year // 10 * 10).max()

#Manipulating structure (groupby, unstack, pivot)

# Grouby

df.groupby(df.year // 10 * 10).max()

What grouby does is form groups around the column you choose. Above groups by decade. This however doesn’t produce anything useful to us, we must then call something on it, such as max, min, mean, etc. Which will give us for example, the mean x in the 90’s.

You can also form groups with multiple columns.

# Grouping by multiple columns
decade_rain = df.groupby([df.year // 10 * 10, df.rain_octsep // 1000 * 1000])[['outflow_octsep',
                                                                              'outflow_decfeb',
                                                                              'outflow_junaug']].mean()
decade_rain

# Grouping by multiple columns

decade_rain = df.groupby([df.year // 10 * 10, df.rain_octsep // 1000 * 1000])[['outflow_octsep',

'outflow_decfeb',

'outflow_junaug']].mean()

decade_rain

Next up unstacking which can be a little confusing at first. What it does is push a column up to become column labels. It’s best to just see it in action…

# Unstacking
decade_rain.unstack(0)

1 2	# Unstacking decade_rain.unstack(0)

This transforms the dataframe that we produced in the section above into the dataframe below. It pushes up the 0th column, which is actually the index ‘year’, and turns it into column labels.

Let’s do one more for good measure. This time on the 1st column, which is the index ‘rain_octsep’.

# More unstacking
decade_rain.unstack(1)

1 2	# More unstacking decade_rain.unstack(1)

Now before our next operation, we will first create a new dataframe to play with.

# Create a new dataframe containing entries which 
# has rain_octsep values of greater than 1250
high_rain = df[df.rain_octsep > 1250]
high_rain

# Create a new dataframe containing entries which

# has rain_octsep values of greater than 1250

high_rain = df[df.rain_octsep > 1250]

high_rain

The above code gives us the below dataframe which we will perform pivoting on.

Pivoting is actually a combination of operations that we have already looked at in this post. First it sets a new index (set_index()), then it sorts that index (sort_index()) and finally it does an unstack on it. Together this is a pivot. See if you can visualise what is happening.

#Pivoting
#does set_index, sort_index and unstack in a row
high_rain.pivot('year', 'rain_octsep')[['outflow_octsep', 'outflow_decfeb', 'outflow_junaug']].fillna('')

#Pivoting

#does set_index, sort_index and unstack in a row

high_rain.pivot('year', 'rain_octsep')[['outflow_octsep', 'outflow_decfeb', 'outflow_junaug']].fillna('')

Notice at the end there is a .fillna(”). This pivot creates a lot of empty entries, NaN value entries. I personally find my dataframe being littered with NaNs distracting so I use fillna(”). You can enter whatever you like, for example a zero. We can also use the function dropna(how=’any’) to delete all rows with NaNs in them. In this case it would delete everything though, so we won’t do that.

The above dataframe shows us the outflow for all the years with rainfall over 1250. Admittedly this wasn’t the best example of a pivot in terms of practical use, but hopefully you get the idea. See what you can come up with in your dataset.

COMBINING DATASETS

Sometimes you will have two separate datasets that relate to each other that you want to compare them together or combine them. Well, no problem; Pandas makes this easy.

# Merging two datasets together
rain_jpn = pd.read_csv('notebook_playground/data/jpn_rain.csv')
rain_jpn.columns = ['year', 'jpn_rainfall']

uk_jpn_rain = df.merge(rain_jpn, on='year')
uk_jpn_rain.head(5)

# Merging two datasets together

rain_jpn = pd.read_csv('notebook_playground/data/jpn_rain.csv')

rain_jpn.columns = ['year', 'jpn_rainfall']

uk_jpn_rain = df.merge(rain_jpn, on='year')

uk_jpn_rain.head(5)

First off you need to have matching columns to merge on which you can then select via the ‘on’ keyword argument. You can often omit it and Pandas will work out one which columns to merge too.
As you can see below, the two datasets have been combined on the year category. The rain_jpn dataset only has the year and amount of rainfall and as we merged on the year column, only one column ‘jpn_rainfall’ has been merged with the columns from our UK rain dataset.

USING PANDAS TO PLOT GRAPHS QUICKLY

Matplotlib is great, but it takes a fair bit of code to create a half-way decent graph and sometimes you just want to quickly whip up a plot of your data for just your eyes to help you explore and make sense of it. Pandas answers this problem with plot.

# Using pandas to quickly plot graphs
uk_jpn_rain.plot(x='year', y=['rain_octsep', 'jpn_rainfall'])

1 2	# Using pandas to quickly plot graphs uk_jpn_rain.plot(x='year', y=['rain_octsep', 'jpn_rainfall'])

This creates a plot of your data using matplotlib, quickly and hassle free. From this you can then analyse your data visually to give yourself more direction when exploring. For example if you look at the plot of my data, you can see that maybe there was a drought in UK in 1995.

You can also see that UK’s rainfall is significantly less than Japan’s, and people say UK rains a lot!

SAVING YOUR DATASETS

After cleaning, reshaping and exploring your dataset, you often end up with something very different and much more useful than what you started with. You should also ways keep your original data, but also saving your newly polished dataset is a good idea too.

# Saving your data to a csv
df.to_csv('uk_rain.csv')

1 2	# Saving your data to a csv df.to_csv('uk_rain.csv')

The above code will save your data to a csv ready for next time.
So there we have it, and introduction to Pandas. As I said before, Pandas is really great and we have only scratched the surface here, but you should now know enough to get going and start cleaning and exploring data.
As usual I really urge you to go and play with this. Find a dataset or two that really interests you, sit down with a beer maybe and start probing it. It’s really the only way to get comfortable with Pandas and the other libraries introduced in this series. Plus you never know, you might find something interesting.