Practical Data Science in Python
This notebook accompanies my talk on "Data Science with Python".
The goal of this talk is to demonstrate some high level, introductory concepts behind (text) machine learning. The concepts are accompanied by concrete code examples in this notebook, which you can run yourself (after installing IPython, see below), on your own computer.
The talk audience is expected to have some basic programming knowledge (though not necessarily Python) and some basic introductory data mining background. This is not an "advanced talk" for machine learning experts.
The code examples build a working, executable prototype: an app to classify phone SMS messages in English (well, the "SMS kind" of English...) as either "spam" or "ham" (=not spam).
The goal of this talk is to demonstrate some high level, introductory concepts behind (text) machine learning. The concepts are accompanied by concrete code examples in this notebook, which you can run yourself (after installing IPython, see below), on your own computer.
The talk audience is expected to have some basic programming knowledge (though not necessarily Python) and some basic introductory data mining background. This is not an "advanced talk" for machine learning experts.
The code examples build a working, executable prototype: an app to classify phone SMS messages in English (well, the "SMS kind" of English...) as either "spam" or "ham" (=not spam).
The language used throughout will be Python,
a general purpose language helpful in all parts of the pipeline: I/O,
data wrangling and preprocessing, model training and evaluation. While
Python is by no means the only choice, it offers a unique combination of
flexibility, ease of development and performance, thanks to its mature
scientific computing ecosystem. Its vast, open source ecosystem also
avoids the lock-in (and associated bitrot) of any single specific
framework or library.
Python (and of most its libraries) is also platform independent, so you can run this notebook on Windows, Linux or OS X without a change.
One of the Python tools, the IPython notebook = interactive Python rendered as HTML, you're watching right now. We'll go over other practical tools, widely used in the data science industry, below.
Python (and of most its libraries) is also platform independent, so you can run this notebook on Windows, Linux or OS X without a change.
One of the Python tools, the IPython notebook = interactive Python rendered as HTML, you're watching right now. We'll go over other practical tools, widely used in the data science industry, below.
Want to run the examples below interactively? (optional)
- Install the (free) Anaconda Python distribution, including Python itself.
- Install the "natural language processing" TextBlob library: instructions here.
-
Download the source for this notebook to your computer: http://radimrehurek.com/data_science_python/data_science_python.ipynb and run it with:
$ ipython notebook data_science_python.ipynb
- Watch the IPython tutorial video for notebook navigation basics.
- Run the first code cell below; if it executes without errors, you're good to go!
End-to-end example: automated spam filtering
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import csv
from textblob import TextBlob
import pandas
import sklearn
import cPickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.learning_curve import learning_curve
Step 1: Load data, look around
Skipping the real first step (fleshing out specs, finding
out what is it we want to be doing -- often highly non-trivial in
practice!), let's download the dataset we'll be using in this demo. Go
to https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection and
download the zip file. Unzip it under
data
subdirectory. You should see a file called SMSSpamCollection
, about 0.5MB in size:$ ls -l data
total 1352
-rw-r--r--@ 1 kofola staff 477907 Mar 15 2011 SMSSpamCollection
-rw-r--r--@ 1 kofola staff 5868 Apr 18 2011 readme
-rw-r-----@ 1 kofola staff 203415 Dec 1 15:30 smsspamcollection.zip
This file contains a collection of more than 5 thousand SMS phone messages (see the
readme
file for more info):
In [2]:
messages = [line.rstrip() for line in open('./data/SMSSpamCollection')]
print len(messages)
A collection of texts is also sometimes called "corpus". Let's print the first ten messages in this SMS corpus:
In [3]:
for message_no, message in enumerate(messages[:10]):
print message_no, message
We see that this is a TSV
("tab separated values") file, where the first column is a label saying
whether the given message is a normal message ("ham") or "spam". The
second column is the message itself.
This corpus will be our labeled training set. Using these ham/spam examples, we'll train a machine learning model to learn to discriminate between ham/spam automatically. Then, with a trained model, we'll be able to classify arbitrary unlabeled messages as ham or spam.
This corpus will be our labeled training set. Using these ham/spam examples, we'll train a machine learning model to learn to discriminate between ham/spam automatically. Then, with a trained model, we'll be able to classify arbitrary unlabeled messages as ham or spam.
Instead of parsing TSV (or CSV, or Excel...) files by hand, we can use Python's
pandas
library to do the work for us:
In [4]:
messages = pandas.read_csv('./data/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE,
names=["label", "message"])
print messages
With
pandas
, we can also view aggregate statistics easily:
In [5]:
messages.groupby('label').describe()
Out[5]:
How long are the messages?
In [6]:
messages['length'] = messages['message'].map(lambda text: len(text))
print messages.head()
In [7]:
messages.length.plot(bins=20, kind='hist')
Out[7]:
In [8]:
messages.length.describe()
Out[8]:
What is that super long message?
In [9]:
print list(messages.message[messages.length > 900])
Is there any difference in message length between spam and ham?
In [10]:
messages.hist(column='length', by='label', bins=50)
Out[10]:
Good fun, but how do we make computer understand the plain text
messages themselves? Or can it under such malformed gibberish at all?
Step 2: Data preprocessing
In this section we'll massage the raw messages (sequence of characters) into vectors (sequences of numbers).
The mapping is not 1-to-1; we'll use the bag-of-words approach, where each unique word in a text will be represented by one number.
As a first step, let's write a function that will split a message into its individual words:
The mapping is not 1-to-1; we'll use the bag-of-words approach, where each unique word in a text will be represented by one number.
As a first step, let's write a function that will split a message into its individual words:
In [11]:
def split_into_tokens(message):
message = unicode(message, 'utf8') # convert bytes into proper unicode
return TextBlob(message).words
Here are some of the original texts again:
In [12]:
messages.message.head()
Out[12]:
...and here are the same messages, tokenized:
In [13]:
messages.message.head().apply(split_into_tokens)
Out[13]:
NLP questions:
With textblob, we'd detect part-of-speech (POS) tags with:
- Do capital letters carry information?
- Does distinguishing inflected form ("goes" vs. "go") carry information?
- Do interjections, determiners carry information?
With textblob, we'd detect part-of-speech (POS) tags with:
In [14]:
TextBlob("Hello world, how is it going?").tags # list of (word, POS) pairs
Out[14]:
and normalize words into their base form (lemmas) with:
In [15]:
def split_into_lemmas(message):
message = unicode(message, 'utf8').lower()
words = TextBlob(message).words
# for each word, take its "base form" = lemma
return [word.lemma for word in words]
messages.message.head().apply(split_into_lemmas)
Out[15]:
Better. You can probably think of many more ways to improve the preprocessing: decoding HTML entities (those
&
and <
we saw above); filtering out stop words (pronouns etc); adding more features, such as an word-in-all-caps indicator and so on.Step 3: Data to vectors
Now we'll convert each message, represented as a list of tokens
(lemmas) above, into a vector that machine learning models can
understand.
Doing that requires essentially three steps, in the bag-of-words model:
Doing that requires essentially three steps, in the bag-of-words model:
- counting how many times does a word occur in each message (term frequency)
- weighting the counts, so that frequent tokens get lower weight (inverse document frequency)
- normalizing the vectors to unit length, to abstract from the original text length (L2 norm)
Each vector has as many dimensions as there are unique words in the SMS corpus:
In [16]:
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(messages['message'])
print len(bow_transformer.vocabulary_)
Here we used
Let's take one text message and get its bag-of-words counts as a vector, putting to use our new
scikit-learn
(sklearn
), a powerful Python library for teaching machine learning. It contains a multitude of various methods and options.Let's take one text message and get its bag-of-words counts as a vector, putting to use our new
bow_transformer
:
In [17]:
message4 = messages['message'][3]
print message4
In [18]:
bow4 = bow_transformer.transform([message4])
print bow4
print bow4.shape
So, nine unique words in message nr. 4, two of them appear twice, the
rest only once. Sanity check: what are these words the appear twice?
In [19]:
print bow_transformer.get_feature_names()[6736]
print bow_transformer.get_feature_names()[8013]
The bag-of-words counts for the entire SMS corpus are a large, sparse matrix:
In [20]:
messages_bow = bow_transformer.transform(messages['message'])
print 'sparse matrix shape:', messages_bow.shape
print 'number of non-zeros:', messages_bow.nnz
print 'sparsity: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))
And finally, after the counting, the term weighting and normalization can be done with TF-IDF, using scikit-learn's
TfidfTransformer
:
In [21]:
tfidf_transformer = TfidfTransformer().fit(messages_bow)
tfidf4 = tfidf_transformer.transform(bow4)
print tfidf4
What is the IDF (inverse document frequency) of the word
"u"
? Of word "university"
?
In [22]:
print tfidf_transformer.idf_[bow_transformer.vocabulary_['u']]
print tfidf_transformer.idf_[bow_transformer.vocabulary_['university']]
To transform the entire bag-of-words corpus into TF-IDF corpus at once:
In [23]:
messages_tfidf = tfidf_transformer.transform(messages_bow)
print messages_tfidf.shape
There are a multitude of ways in which data can be proprocessed and
vectorized. These two steps, also called "feature engineering", are
typically the most time consuming and "unsexy" parts of building a
predictive pipeline, but they are very important and require some
experience. The trick is to evaluate constantly: analyze model for the
errors it makes, improve data cleaning & preprocessing, brainstorm
for new features, evaluate...
Step 4: Training a model, detecting spam
With messages represented as vectors, we can finally train our
spam/ham classifier. This part is pretty straightforward, and there are
many libraries that realize the training algorithms.
We'll be using scikit-learn here, choosing the Naive Bayes classifier to start with:
In [24]:
%time spam_detector = MultinomialNB().fit(messages_tfidf, messages['label'])
Let's try classifying our single random message:
In [25]:
print 'predicted:', spam_detector.predict(tfidf4)[0]
print 'expected:', messages.label[3]
Hooray! You can try it with your own texts, too.
A natural question is to ask, how many messages do we classify correctly overall?
A natural question is to ask, how many messages do we classify correctly overall?
In [26]:
all_predictions = spam_detector.predict(messages_tfidf)
print all_predictions
In [27]:
print 'accuracy', accuracy_score(messages['label'], all_predictions)
print 'confusion matrix\n', confusion_matrix(messages['label'], all_predictions)
print '(row=expected, col=predicted)'
In [28]:
plt.matshow(confusion_matrix(messages['label'], all_predictions), cmap=plt.cm.binary, interpolation='nearest')
plt.title('confusion matrix')
plt.colorbar()
plt.ylabel('expected label')
plt.xlabel('predicted label')
Out[28]:
From this confusion matrix, we can compute precision and recall, or their combination (harmonic mean) F1:
In [29]:
print classification_report(messages['label'], all_predictions)
There are quite a few possible metrics for evaluating model
performance. Which one is the most suitable depends on the task. For
example, the cost of mispredicting "spam" as "ham" is probably much
lower than mispredicting "ham" as "spam".
Step 5: How to run experiments?
In the above "evaluation", we committed a cardinal sin. For
simplicity of demonstration, we evaluated accuracy on the same data we
used for training. Never evaluate on the same dataset you train on! Bad! Incest!
Such evaluation tells us nothing about the true predictive power of our model. If we simply remembered each example during training, the accuracy on training data would trivially be 100%, even though we wouldn't be able to classify any new messages.
A proper way is to split the data into a training/test set, where the model only ever sees the training data during its model fitting and parameter tuning. The test data is never used in any way -- thanks to this process, we make sure we are not "cheating", and that our final evaluation on test data is representative of true predictive performance.
Such evaluation tells us nothing about the true predictive power of our model. If we simply remembered each example during training, the accuracy on training data would trivially be 100%, even though we wouldn't be able to classify any new messages.
A proper way is to split the data into a training/test set, where the model only ever sees the training data during its model fitting and parameter tuning. The test data is never used in any way -- thanks to this process, we make sure we are not "cheating", and that our final evaluation on test data is representative of true predictive performance.
In [30]:
msg_train, msg_test, label_train, label_test = \
train_test_split(messages['message'], messages['label'], test_size=0.2)
print len(msg_train), len(msg_test), len(msg_train) + len(msg_test)
So, as requested, the test size is 20% of the entire dataset (1115
messages out of total 5574), and the training is the rest (4459 out of
5574).
Let's recap the entire pipeline up to this point, putting the steps explicitly into scikit-learn's
Pipeline
:
In [31]:
def split_into_lemmas(message):
message = unicode(message, 'utf8').lower()
words = TextBlob(message).words
# for each word, take its "base form" = lemma
return [word.lemma for word in words]
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=split_into_lemmas)), # strings to token integer counts
('tfidf', TfidfTransformer()), # integer counts to weighted TF-IDF scores
('classifier', MultinomialNB()), # train on TF-IDF vectors w/ Naive Bayes classifier
])
A common practice is to partition the training set again, into
smaller subsets; for example, 5 equally sized subsets. Then we train the
model on four parts, and compute accuracy on the last part (called
"validation set"). Repeated five times (taking different part for
evaluation each time), we get a sense of model "stability". If the model
gives wildly different scores for different subsets, it's a sign
something is wrong (bad data, or bad model variance). Go back, analyze
errors, re-check input data for garbage, re-check data cleaning.
In our case, everything goes smoothly though:
In our case, everything goes smoothly though:
In [32]:
scores = cross_val_score(pipeline, # steps to convert raw messages into models
msg_train, # training data
label_train, # training labels
cv=10, # split data randomly into 10 parts: 9 for training, 1 for scoring
scoring='accuracy', # which scoring metric?
n_jobs=-1, # -1 = use all cores = faster
)
print scores
The scores are indeed a little bit worse than when we trained on the
entire dataset (5574 training examples, accuracy 0.97). They are fairly
stable though:
In [33]:
print scores.mean(), scores.std()
A natural question is, how can we improve this model? The scores are
already high here, but how would we go about improving a model in
general?
Naive Bayes is an example of a high bias - low variance classifier (aka simple and stable, not prone to overfitting). An example from the opposite side of the spectrum would be Nearest Neighbour (kNN) classifiers, or Decision Trees, with their low bias but high variance (easy to overfit). Bagging (Random Forests) as a way to lower variance, by training many (high-variance) models and averaging.
Naive Bayes is an example of a high bias - low variance classifier (aka simple and stable, not prone to overfitting). An example from the opposite side of the spectrum would be Nearest Neighbour (kNN) classifiers, or Decision Trees, with their low bias but high variance (easy to overfit). Bagging (Random Forests) as a way to lower variance, by training many (high-variance) models and averaging.
In other words:
- high bias = classifer is opinionated. Not as much room to change its mind with data, it has its own ideas. On the other hand, not as much room it can fool itself into overfitting either (picture on the left).
- low bias = classifier more obedient, but also more neurotic. Will do exactly what you ask it to do, which, as everybody knows, can be a real nuisance (picture on the right).
In [34]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
"""
Generate a simple plot of the test and traning learning curve.
Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.
title : string
Title for the chart.
X : array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
y : array-like, shape (n_samples) or (n_samples, n_features), optional
Target relative to X for classification or regression;
None for unsupervised learning.
ylim : tuple, shape (ymin, ymax), optional
Defines minimum and maximum yvalues plotted.
cv : integer, cross-validation generator, optional
If an integer is passed, it is the number of folds (defaults to 3).
Specific cross-validation objects can be passed, see
sklearn.cross_validation module for the list of possible objects
n_jobs : integer, optional
Number of jobs to run in parallel (default 1).
"""
plt.figure()
plt.title(title)
if ylim is not None:
plt.ylim(*ylim)
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
return plt
In [35]:
%time plot_learning_curve(pipeline, "accuracy vs. training set size", msg_train, label_train, cv=5)
Out[35]:
(We're effectively training on 64% of all available data: we reserved
20% for the test set above, and the 5-fold cross validation reserves
another 20% for validation sets =>
0.8*0.8*5574=3567
training examples left.)
Since performance keeps growing, both for training and cross
validation scores, we see our model is not complex/flexible enough to
capture all nuance, given little data. In this particular case, it's not
very pronounced, since the accuracies are high anyway.
At this point, we have two options:
Knowing how to evaluate models properly, we can now explore how different parameters affect the performace.
At this point, we have two options:
- use more training data, to overcome low model complexity
- use a more complex (lower bias) model to start with, to get more out of the existing data
Knowing how to evaluate models properly, we can now explore how different parameters affect the performace.
Step 6: How to tune parameters?
What we've seen so far is only a tip of the iceberg: there are many
other parameters to tune. One example is what algorithm to use for
training.
We've used Naive Bayes above, but scikit-learn supports many classifiers out of the box: Support Vector Machines, Nearest Neighbours, Decision Trees, Ensamble methods...
We've used Naive Bayes above, but scikit-learn supports many classifiers out of the box: Support Vector Machines, Nearest Neighbours, Decision Trees, Ensamble methods...
We can ask: What is the effect of IDF weighting on accuracy? Does the
extra processing cost of lemmatization (vs. just plain words) really
help?
Let's find out:
Let's find out:
In [37]:
params = {
'tfidf__use_idf': (True, False),
'bow__analyzer': (split_into_lemmas, split_into_tokens),
}
grid = GridSearchCV(
pipeline, # pipeline from above
params, # parameters to tune via cross validation
refit=True, # fit using all available data at the end, on the best found param combination
n_jobs=-1, # number of cores to use for parallelization; -1 for "all cores"
scoring='accuracy', # what score are we optimizing?
cv=StratifiedKFold(label_train, n_folds=5), # what type of cross validation to use
)
In [38]:
%time nb_detector = grid.fit(msg_train, label_train)
print nb_detector.grid_scores_
(best parameter combinations are displayed first: in this case,
A quick sanity check:
use_idf=True
and analyzer=split_into_lemmas
take the prize).A quick sanity check:
In [39]:
print nb_detector.predict_proba(["Hi mom, how are you?"])[0]
print nb_detector.predict_proba(["WINNER! Credit for free!"])[0]
The
predict_proba
returns the predicted probability for
each class (ham, spam). In the first case, the message is predicted to
be ham with > 99% probability, and spam with < 1%. So if forced to
choose, the model will say "ham":
In [40]:
print nb_detector.predict(["Hi mom, how are you?"])[0]
print nb_detector.predict(["WINNER! Credit for free!"])[0]
And overall scores on the test set, the one we haven't used at all during training:
In [41]:
predictions = nb_detector.predict(msg_test)
print confusion_matrix(label_test, predictions)
print classification_report(label_test, predictions)
This is then the realistic predictive performance we can expect from
our spam detection pipeline, when using lowercase with lemmatization,
TF-IDF and Naive Bayes for classifier.
Let's try with another classifier: Support Vector Machines (SVM).
SVMs are a great starting point when classifying text data, getting
state of the art results very quickly and with pleasantly little tuning
(although a bit more than Naive Bayes):
In [42]:
pipeline_svm = Pipeline([
('bow', CountVectorizer(analyzer=split_into_lemmas)),
('tfidf', TfidfTransformer()),
('classifier', SVC()), # <== change here
])
# pipeline parameters to automatically explore and tune
param_svm = [
{'classifier__C': [1, 10, 100, 1000], 'classifier__kernel': ['linear']},
{'classifier__C': [1, 10, 100, 1000], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]
grid_svm = GridSearchCV(
pipeline_svm, # pipeline from above
param_grid=param_svm, # parameters to tune via cross validation
refit=True, # fit using all data, on the best detected classifier
n_jobs=-1, # number of cores to use for parallelization; -1 for "all cores"
scoring='accuracy', # what score are we optimizing?
cv=StratifiedKFold(label_train, n_folds=5), # what type of cross validation to use
)
In [43]:
%time svm_detector = grid_svm.fit(msg_train, label_train) # find the best combination from param_svm
print svm_detector.grid_scores_
So apparently, linear kernel with
Sanity check again:
C=1
is the best parameter combination.Sanity check again:
In [44]:
print svm_detector.predict(["Hi mom, how are you?"])[0]
print svm_detector.predict(["WINNER! Credit for free!"])[0]
In [45]:
print confusion_matrix(label_test, svm_detector.predict(msg_test))
print classification_report(label_test, svm_detector.predict(msg_test))
This is then the realistic predictive performance we can expect from our spam detection pipeline, when using SVMs.
Step 7: Productionalizing a predictor
With basic analysis and tuning done, the real work (engineering) begins.
The final step for a production predictor would be training it on the entire dataset again, to make full use of all the data available. We'd use the best parameters found via cross validation above, of course. This is very similar to what we did in the beginning, but this time having insight into its behaviour and stability. Evaluation was done honestly, on distinct train/test subset splits.
The final predictor can be serialized to disk, so that the next time we want to use it, we can skip all training and use the trained model directly:
The final step for a production predictor would be training it on the entire dataset again, to make full use of all the data available. We'd use the best parameters found via cross validation above, of course. This is very similar to what we did in the beginning, but this time having insight into its behaviour and stability. Evaluation was done honestly, on distinct train/test subset splits.
The final predictor can be serialized to disk, so that the next time we want to use it, we can skip all training and use the trained model directly:
In [46]:
# store the spam detector to disk after training
with open('sms_spam_detector.pkl', 'wb') as fout:
cPickle.dump(svm_detector, fout)
# ...and load it back, whenever needed, possibly on a different machine
svm_detector_reloaded = cPickle.load(open('sms_spam_detector.pkl'))
The loaded result is an object that behaves identically to the original:
In [47]:
print 'before:', svm_detector.predict([message4])[0]
print 'after:', svm_detector_reloaded.predict([message4])[0]
Another important part of a production implementation is performance.
After a rapid, iterative model tuning and parameter search as shown
here, a well performing model can be translated into a different
language and optimized. Would trading a few accuracy points give us a
smaller, faster model? Is it worth optimizing memory usage, perhaps
using
Note that optimization is not always necessary; always start with actual profiling.
Other things to consider here, for a production pipeline, are robustness (service failover, redundancy, load balancing), monitoring (incl. auto-alerts on anomalies) and HR fungibility (avoiding "knowledge silos" of how things are done, arcane/lock-in technologies, black art of tuning results). These days, even the open source world can offer viable solutions in all of these areas. All the tool shown today are free for commercial use, under OSI-approved open source licenses.
mmap
to share memory across processes?Note that optimization is not always necessary; always start with actual profiling.
Other things to consider here, for a production pipeline, are robustness (service failover, redundancy, load balancing), monitoring (incl. auto-alerts on anomalies) and HR fungibility (avoiding "knowledge silos" of how things are done, arcane/lock-in technologies, black art of tuning results). These days, even the open source world can offer viable solutions in all of these areas. All the tool shown today are free for commercial use, under OSI-approved open source licenses.
Other practical concepts
data sparsity
online learning, data streams
scalability, distributed (cluster) processing
online learning, data streams
mmap
for memory sharing, system "cold-start" load timesscalability, distributed (cluster) processing
Unsupervised learning
Most data not structured. Gaining insight, no intrinsic evaluation possible (or else becomes supervised learning!).
How can we train anything without labels? What kind of sorcery is this?
Distributional hypothesis: "Words that occur in similar contexts tend to have similar meanings". Context = sentence, document, sliding window...
Check out this live demo of Google's word2vec for unsupervised learning. Simple model, large data (Google News, 100 billion words, no labels).
How can we train anything without labels? What kind of sorcery is this?
Distributional hypothesis: "Words that occur in similar contexts tend to have similar meanings". Context = sentence, document, sliding window...
Check out this live demo of Google's word2vec for unsupervised learning. Simple model, large data (Google News, 100 billion words, no labels).
Where next?
A static (non-interactive version) of this notebook rendered into HTML at http://radimrehurek.com/data_science_python (you're probably watching it right now, but just in case).
Interactive notebook source at http://radimrehurek.com/data_science_python/data_science_python.ipynb (see above for installation instructions).
My company lives in the exciting intersection of pragmatic, commercial system building and research. Interested in interning / collaboration? Get in touch.
Interactive notebook source at http://radimrehurek.com/data_science_python/data_science_python.ipynb (see above for installation instructions).
My company lives in the exciting intersection of pragmatic, commercial system building and research. Interested in interning / collaboration? Get in touch.
In []:
No comments:
Post a Comment