Text analysis is the automated process of obtaining information from text.
In today's information-saturated world, it's a challenge for
businesses to keep on top of all the tweets, emails, product feedback
and support tickets that pour in every day. Take Google, for example. On
average, the tech company processes over 40,000 search queries every
second, which is equal to over
3.5 billion searches per day!
So, how can text analysis help businesses deal with
information overload?
Below, we'll go into more detail about what text analysis is, how it
works, use cases and applications, as well as some resources and useful
tutorials to get your feet wet. Maybe you're new to artificial
intelligence and work in customer support, sales or product teams. You
might even be a data-savvy analyst or software developer. Either way,
this guide offers a comprehensive introduction to text analysis with
machine learning.
Read this guide in your spare time, bookmark it for later, or jump to the sections that pique your interest:
- Introduction to Text Analysis
- How does Text Analysis work?
- Use Cases and Applications
- Resources
Let's get down to the nitty-gritty!
Introduction to Text Analysis
What is Text Analysis?
In short, text analysis (a.k.a. text mining) is the automated process
that allows machines to extract and classify information from text,
such as tweets, emails, support tickets, product reviews, survey
responses, etc.
Businesses might want to extract specific information, like keywords,
names, or company information. They may even want to categorize text
with tags according to topic or viewpoint, or classify it as positive or
negative.
Either way, sorting through data is a repetitive, time-consuming and
expensive process if done by humans – just imagine if Walmart's
employees had to manually process the
one million customer transactions
they receive every day. It would take forever. Instead, if done by
machines, high volumes of text can automatically be analyzed, saving
time and money, providing more insights from business data and
automating processes.
To really understand what automated text analysis is, we need to
touch upon machine learning. Let's start with this definition from
Machine Learning by Tom Mitchell:
"A computer program is said to learn to perform a task T from experience E".
In other words, if we want text analysis software to perform desired
tasks, we need to teach machine learning algorithms how to analyze,
understand and derive meaning from text. But how? The simple answer is
by tagging examples of text. Once a machine has enough examples of
tagged text to work with, algorithms are able to start differentiating
and making associations between pieces of text, and can even begin to
make predictions.
It's very similar to the way humans learn how to differentiate
between topics, objects, and emotions. Let's say we have urgent and low
priority issues to deal with. We don't instinctively know the difference
between them – we learn gradually by associating urgency with certain
expressions. For example, when we want to identify urgent issues, we'd
look out for expressions like
'please help me ASAP!' or
'urgent: can't enter the platform, the system is DOWN!!'. On the other hand, when we want to identify low priority issues, we'd look out for more positive expressions like
'thanks for the help! Really appreciate it' or
'the new feature works like a dream'.
The number of texts needed for training a machine learning model will
depend on various factors such as the complexity of what you want to
achieve, and the number of tags involved.
For example:
- Topic detection needs around 250 examples per tag (topic) for good accuracy levels.
- Sentiment analysis needs at least 500 examples per tag (sentiment) to produce good results.
These are just two exciting capabilities of text analysis, which
we'll come back to later on. First, let's explore some definitions that
are closely linked to text analysis.
Text Analysis vs. Text Mining vs. Text Analytics
Firstly, let's dispel the myth that
text mining and
text analysis
are two different processes. The terms are often used interchangeably
to explain the same process of obtaining data through statistical
pattern learning. To avoid any confusion here, let's stick to text
analysis.
So,
text analytics vs.
text analysis: what's the difference?
Text analysis is qualitative and text analytics is quantitative. If a
machine performs text analysis, it identifies important information
within the text itself, but if it performs text analytics it reveals
patterns across thousands of texts, resulting in graphs, reports,
tables etc.
Here's an example:
Let's say a customer support manager wants to know the outcomes of
each support ticket handled by individual team members – was the result
positive or negative? By analyzing the text within each ticket, and
subsequent exchanges, customer support managers can see team members'
individual ticket resolution rates. However, it's likely that the
manager also wants to create a graph that visualizes how many tickets
were tagged as solved. They'd use text analytics.
Basically, the challenge in text analysis is decoding the ambiguity
of human language, while in text analytics it's detecting patterns and
trends from the results. As in the example above, text analytics and
text analysis are often used hand in hand.
Methods and Techniques
There are basic methods for text analysis and more advanced ones. First, let's start with the simpler techniques.
Basic Methods
Word Frequency
Word frequency can be used to list the most frequently occurring
words or concepts in a given text. This can be useful for a number of
use cases, for example, to analyze the words or expressions customers
use most frequently in support conversations, e.g. if the word
'delivery' appears most often, this might suggest there are issues with a
company's delivery service.
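To make this concrete, here's a minimal sketch of word frequency counting in Python using the standard library; the sample tickets are invented for illustration:

```python
from collections import Counter
import re

# Hypothetical support tickets, invented for illustration
tickets = [
    "My delivery is late and nobody responds",
    "The delivery arrived damaged, please help",
    "Great service and a fast delivery!",
]

# Lowercase each ticket, split it into words, and count occurrences
words = []
for text in tickets:
    words.extend(re.findall(r"[a-z']+", text.lower()))

print(Counter(words).most_common(3))
# [('delivery', 3), ('my', 1), ('is', 1)] -- 'delivery' dominates
```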
Collocation
Collocation helps identify words that commonly co-occur. For example,
in customer reviews on a hotel booking website, the words 'air' and
'conditioning' are more likely to co-occur rather than appear
individually. Bigrams (two adjacent words e.g. 'air conditioning' or
'customer support') and trigrams (three adjacent words e.g. 'out of
office' or 'to be continued') are the most common types of collocation
you'll need to look out for.
Collocation can be helpful to identify hidden semantic structures and
improve the granularity of the insights by counting bigrams and
trigrams as one word.
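Here's a minimal sketch of bigram collocation detection with NLTK's collocation finder; the review text is made up, and a real analysis would use a much larger corpus:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# A tiny invented corpus of hotel reviews
reviews = (
    "the air conditioning was noisy "
    "customer support fixed the air conditioning quickly "
    "air conditioning and customer support were both great"
)

finder = BigramCollocationFinder.from_words(reviews.split())

# Rank bigrams by pointwise mutual information (PMI) and keep the top 3
print(finder.nbest(BigramAssocMeasures.pmi, 3))
```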
Concordance
Concordance helps identify the context and instances of words or a
set of words. For example, the following is the concordance of the word
simple in a set of app reviews:
In this case, the concordance of the word
simple can give us
a quick grasp of how reviewers are using this word. It can also be used
to decode the ambiguity of the human language to a certain extent, by
looking at how words are used in different contexts, as well as being
able to analyze more complex phrases.
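NLTK also ships a ready-made concordance view. A minimal sketch, with invented review snippets standing in for the app reviews mentioned above:

```python
import nltk

# Invented app review snippets standing in for real data
reviews = (
    "the app is really simple and easy to use "
    "simple interface and great features "
    "not as simple as I expected but still good"
)

# nltk.Text wraps a token list and prints each hit with its context
text = nltk.Text(reviews.split())
text.concordance("simple", width=50)
```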
Advanced Methods
Now that we've touched upon the basic techniques of text analysis,
we'll introduce you to the more advanced methods: text classification
and text extraction.
Text Classification
Text classification is the process of assigning predefined tags or categories to unstructured text. It's considered one of the most useful
Natural Language Processing
(NLP) techniques because it's so versatile and can organize, structure
and categorize pretty much anything to deliver meaningful data and solve
problems.
Below, we're going to focus on some of the most common text
classification tasks, which include sentiment analysis, topic modeling,
language detection, and intent detection.
Sentiment Analysis
Emotions are essential to effective communication between humans, so
if we want machines to handle texts in the same way, we need to teach them
how to detect emotions and classify text as positive, negative or
neutral. That's where
sentiment analysis comes into play. It's the automated process of understanding an opinion about a given subject from written or spoken language.
For example, by using sentiment analysis companies are able to flag
complaints or urgent requests, so they can be dealt with immediately –
and perhaps
avert a PR crisis on social media.
Other uses of sentiment classifiers include assessing brand reputation,
carrying out market research, and improving products with customer
feedback.
Why not play around with our
pre-trained classifier using MonkeyLearn? Write something positive or negative, and see how this classifier makes a prediction:
For more accuracy, you can
make your own custom classifier for your specific use case and criteria. Check out these use
cases & applications to see how companies and organizations are already using sentiment analysis.
Topic Analysis
Another common example of text classification is topic analysis or,
more simply put, understanding what a given text is talking about. It's
often used for structuring and organizing data. For example:
“The app is really simple and easy to use”
This product feedback can be classified under
Ease of use.
Try out
this pre-trained classifier for categorizing
NPS responses for SaaS products. This model can categorize feedback into tags such as
Customer Support,
Ease of Use,
Features, and
Pricing:
Language Detection
Language detection allows text to be classified according to its
language and is often used for routing purposes, e.g. support tickets can
be automatically allocated to the right team depending on the language
detected, saving time and avoiding confusion among staff.
We've created a classifier that
detects 49 different languages in a text. Discover the language of the text below:
Intent Detection
Text classifiers can also be used to automatically detect the intent
within texts. For example, companies are able to better understand
customer feedback about a product if they know more about the purpose or
intentions behind the text. Anything from the intention to
complain about a product to the intention to
buy a product.
Check out the following classifier, trained for detecting the
intent from replies in outbound sales emails. We've used the tags
interested,
not interested,
unsubscribe,
wrong person,
email bounce, and
autoresponder to train this classifier:
Text Extraction
Text extraction is another widely used text analysis technique for getting insights
from data. It involves extracting pieces of data that already exist
within any given text, so if you wanted to extract important data such
as keywords, prices, company names, and product specifications, you'd
train an extraction model to automatically detect this information.
Then, you can organize the extracted data into spreadsheets, translate
into graphs and use it to resolve particular problems. And yes, all
without having to tediously sort through data, and input information
manually!
Text extraction is often used alongside text classification so that
businesses can categorize their data and extract information at the same
time.
There are different extraction models for different types of purposes,
which we'll go into more detail about below.
Keyword Extraction
Keywords are the most relevant terms within a text, terms that
summarize the contents of text in list form. Keyword extraction can be
used to index data to be searched and to generate tag clouds (a visual
representation of text data).
Use the following
keyword extractor to see how it works:
Remember, if you'd like more accurate results you're better off training your own
custom extractor and tailoring it to your particular needs and criteria.
Entity Recognition
A named entity recognition (NER) extractor finds entities, such as
people, companies, or locations, within text data. Results
are shown labeled with the corresponding entity label, like in this
pre-trained person extractor:
Summary Extraction
Summary extraction allows long texts to be summarized without losing
their meaning. Imagine if every customer support ticket could be read in
seconds instead of minutes – staff would save a lot of time, and
response times would be quicker leading to better customer experiences.
See for yourself how this
summary extractor handles a piece of text:
Above, we've mentioned the most common models of text analysis, but
there are many other useful methods. Below, we've listed two more
advanced techniques that we think deserve an honorable mention!
Word Sense Disambiguation
It's very common for a word to have more than one meaning, which is
why Word Sense Disambiguation is a major challenge of Natural Language
Processing. Its function is to identify the sense of a word within a
sentence when the word has more than one meaning. Easy for humans to
figure out, but difficult for machines.
Take the word
'light' for example. Is the text referring to
weight, color or an electrical appliance? Smart text analysis with Word
Sense Disambiguation can differentiate words that have more than one
meaning, but only if we teach models to do so.
Clustering
Text clustering is able to group vast quantities of
unstructured data. Although less accurate than classification
algorithms, clustering algorithms are faster to implement because you
don't need to tag examples to train models. That means these smart
algorithms mine information and make predictions without the use of
training data, otherwise known as unsupervised machine learning.
Google is a great example of how clustering works. When you search
for a term on Google, have you ever wondered how it takes just seconds
to pull up relevant results?
Google's algorithm breaks down unstructured data
from web pages and groups pages into clusters around a set of similar
words or n-grams (all possible combinations of adjacent words or letters
in a text). So, the pages from the cluster that contains a higher count
of words or n-grams relevant to the search query will appear first
within the results.
Text Analysis Scope
Text analysis can stretch its AI wings across a range of texts depending on the results you desire. It can be applied to:
- Whole documents: obtains information from a complete document or paragraph e.g. the overall sentiment of a customer review.
- Single sentences: obtains information from specific sentences e.g. more detailed sentiments of every sentence of a customer review.
- Sub-sentences: obtains information from sub-expressions within a sentence e.g. the underlying sentiments of every opinion unit of a customer review.
Why is Text Analysis Important?
Every minute of the day,
156 million emails and 456,000 tweets
are sent! That's a colossal amount of data to process, and impossible
for humans to handle alone. If machines are made solely responsible for
sorting through data using text analysis models, the benefits for
businesses will be huge.
Let's take a look at some of the advantages for businesses below:
Scalability
Text Analysis allows businesses to structure vast quantities of
information, like emails, chats, social media, support tickets,
documents and so on, in seconds rather than in days, and redirect extra
resources to more important business tasks.
Real-time Analysis
Businesses are inundated with information, making it harder to
resolve urgent queries and deal with negative reviews as and when they
arise! Text analysis is a game-changer when it comes to detecting urgent
matters, and the big advantage is that it can work in real-time 24/7.
By training these models to detect expressions and sentiments that imply
negativity or urgency, businesses can automatically flag tweets,
reviews, videos, tickets, and the like, and take action sooner rather
than later.
Consistent Criteria
Humans make errors. Fact. And the more tedious and time-consuming a
task is, the more errors that are made. By using automated text analysis
models that have been trained, algorithms are able to analyze,
understand, and sort through data more accurately than humans. We are
influenced by personal experiences, thoughts, and beliefs when reading
texts, whereas algorithms are influenced by the information they've
received. By applying the same criteria to analyze all data, algorithms
are able to deliver more consistent and reliable data.
Wrap-up
Throughout this section, we've covered what text analysis is, along with
different techniques and scopes. We went through the benefits of
automated text analysis and learned about text classification and text
extraction – models that companies can use and customize depending on their
business needs.
Organizations that adopt text analysis will have clear advantages
over those that don't, including automating business processes, getting
actionable insights for better decision making, and processing data at
scale and in a cost-effective way.
Now that you know the ins and outs of text analysis, check out our next chapter
How Text Analysis Works
to learn more about training text analysis models with machine
learning. Before you know it, you'll be creating your own text
classifiers and extractors!
How does Text Analysis work?
Machine learning for text analysis makes it possible to process huge amounts of
unstructured text data in a fast and simple way.
By creating customized models that learn from examples and improve
over time, businesses can automate daily tasks and save their teams
precious time, as well as gain relevant insights that enhance the
decision-making process.
But how easy is it to get started? Automated text analysis may sound far
too complex for someone with no programming skills, but that's not
always true. By using an AI platform like
MonkeyLearn, everyone is able to create
customized text analysis models or even use pre-trained models for specific purposes, without writing a single line of code!
However, to take full advantage of automated text analysis, it's important to understand how it works.
In this section, we'll dive deeper into how text analysis works.
First, we'll go through the internal and external sources you can use to
obtain data.
We'll also refer to the different natural language processing
techniques ― like parsing, tokenization, lemmatization, and stop-word
removal ― to
prepare the data for text analysis.
Then, we'll move onto the
data analysis process: how to analyze text data automatically. We'll focus on the two most common and useful text analysis models:
text classification and
text extraction.
Finally, we'll provide some insights on
data visualization and highlight the different tools you can use to show the results of text analysis.
Data Gathering
Before getting started with automated text analysis, we need to
gather text data. That could be any data we're interested in analyzing.
For example, let's imagine that we work for Slack and we want to
analyze online reviews to better understand what our customers like and
dislike about our platform. First, we'd need to automatically gather
reviews from sites like
Capterra and
G2Crowd.
Gathering data is important because it's used as training samples to
build either text classification or text extraction models with machine
learning. After training models, we'll be able to automatically analyze
text and make predictions on our data.
Sources for data gathering can either be internal or external:
Internal Data
It's the data that you already have in your company database, or that
you can obtain from the tools you use every day. Every business
generates a myriad of unstructured data, from emails and chats, to
reviews, customer queries, and support tickets.
In order to use this internal data, you will need to export it from your software or platform as a
CSV or Excel file, or retrieve the data with an
API.
Here are some examples of where you might get internal data:
- Customer Service Software: the software you use to communicate with customers, manage user queries coming from different channels and deal with customer support issues. Zendesk, Freshdesk, and Help Scout are a few examples of this type of software.
- CRM: it's the software that allows your company to keep track of all the interactions with clients or potential clients. It can involve different areas, from customer support to sales and marketing. Hubspot, Salesforce, and Pipedrive are some examples of CRMs.
- Chat: the apps that you use to communicate with the members of your team or your customers. Slack, Hipchat, Intercom, and Drift are four good examples.
- Email: the king of business communication, emails are still the most popular tool to manage conversations with customers and team members. Gmail is the obvious reference when it comes to email service providers, but there are other email management platforms like Front, which is focused on a shared email inbox for teams.
- Spreadsheets: they are widely used by companies to compile and analyze data, create reports and budgets, and organize tasks. Whether it's Google Sheets, Excel files or CSV documents, spreadsheets are a valuable source of information.
- Surveys: a great way to obtain data, generally used to gather customer service feedback, product feedback or to conduct market research. Some popular survey tools include Typeform, Google Forms, and SurveyMonkey.
- NPS (Net Promoter Score): one of the most popular metrics for customer experience in the world. Many companies use NPS tracking software to collect and analyze feedback from their customers. A few examples are Delighted, Promoter.io and Satismeter.
- Databases: a database is a collection of information. By using a database management system, a company can store, manage and analyze all sorts of data. Examples of databases include Postgres, MongoDB, and MySQL.
- Product Analytics: it's the product feedback and information about the interactions of a customer with your product or service. It's useful to understand the customer's journey and make data-driven decisions. ProductBoard and UserVoice are two tools you can use to process product analytics.
External Data
This refers to data that you can find anywhere on the Web. You can
use web scraping tools, APIs and open datasets to collect external data
from different websites and analyze it with a machine learning model.
Let's go into more detail below:
- Visual Web Scraping Tools: these allow you to easily build your own web scraper without needing any coding skills. Dexi.io, Portia and ParseHub are some tools that you can use to crawl a website and extract relevant data without using any code.
- Web Scraping Frameworks: if you are a seasoned coder, you can benefit from these tools to create your own scraper and efficiently obtain data from a website. Scrapy, for example, is an open source tool you can use with Python. Wombat is also a powerful scraping tool written in Ruby.
APIs
If you are a developer, you can use APIs to connect with different
websites and social media platforms and obtain useful data. Facebook,
Twitter, and Instagram, for example, have their own APIs and allow you
to extract data from their platforms. Major media outlets like the New
York Times or The Guardian also have their own APIs and you can use them
to search their archive or gather users' comments, among other things.
Open Data
This is free data from the Web, which can be used for any purpose.
You can search for data by topic on websites like Kaggle and Quandl.
Data Preparation
In order to automatically analyze text with machine learning, the
data needs to go through some preparation. Most of this is done
automatically, and you won't even notice it happening. However, it's
important to understand that automatic text analysis makes use of a
number of
Natural Language Processing (NLP) techniques
and processes that you might be unaware of. The output of these
processes will help build the input of the machine learning models you
create to analyze your data.
Tokenization, Part-of-speech Tagging, and Parsing
The first thing a text analysis system needs to take care of is recognizing what units it will analyze. This is known as
tokenization. In other words,
tokenization
refers to the process of breaking up a string of characters into
semantically meaningful parts that can be analyzed (e.g., words) while
discarding meaningless chunks (e.g. whitespaces).
The examples below show two different ways in which one could tokenize the string
'Analyzing text is not that hard'.
(Incorrect): Analyzing text is not that hard. = [“Analyz”, “ing text”, “is n”, “ot that”, “hard.”]
(Correct): Analyzing text is not that hard. = [“Analyzing”, “text”, “is”, “not”, “that”, “hard”, “.”]
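As a rough sketch, the correct tokenization above can be reproduced with a single regular expression; real tokenizers handle many more edge cases, like contractions and abbreviations:

```python
import re

sentence = "Analyzing text is not that hard."

# Grab runs of word characters, and keep punctuation as separate tokens
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)  # ['Analyzing', 'text', 'is', 'not', 'that', 'hard', '.']
```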
Once the tokens have been recognized, it's time to
categorize them.
Part-of-speech tagging refers to the process of assigning a grammatical category, such as noun, verb, etc. to the tokens that have been detected.
Here are the PoS tags of the tokens from the sentence above:
“Analyzing”: VERB, “text”: NOUN, “is”: VERB, “not”: ADV, “that”: ADV, “hard”: ADJ, “.”: PUNCT
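A quick way to reproduce this is with spaCy's small English model; note that the exact tags can differ slightly between taggers and model versions (a recent spaCy model labels 'is' as AUX rather than VERB, for instance):

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Analyzing text is not that hard.")

# Print each token with its coarse part-of-speech tag
for token in doc:
    print(f"{token.text}: {token.pos_}")
```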
With all the categorized tokens and a language model (i.e. a
grammar), the system can now create more complex representations of the
texts it will analyze. This process is known as
parsing. In other words,
parsing
refers to the process of determining the syntactic structure of a text.
To do this, the parsing algorithm makes use of a grammar of the
language the text has been written in. Different representations will
result from the parsing of the same text with different grammars.
The examples below show the dependency and constituency representations of the sentence
'Analyzing text is not that hard'.
Dependency Parsing
Dependency grammars can be defined as grammars that establish directed relations between the words of sentences.
Dependency parsing is the process of using a dependency grammar to determine the syntactic structure of a sentence:
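spaCy's parser exposes these directed relations directly: each token stores a labeled arc to its syntactic head. A minimal sketch:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Analyzing text is not that hard.")

# Each token points at its head through a labeled, directed dependency
for token in doc:
    print(f"{token.text} --{token.dep_}--> {token.head.text}")
```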
Constituency Parsing
Constituency Phrase Structure Grammars
model syntactic structures by making use of abstract nodes associated
with words and other abstract categories (depending on the type of
grammar) and undirected relations between them.
Constituency parsing refers to the process of using a constituency grammar to determine the syntactic structure of a sentence:
As you can see in the images above, the output of the parsing
algorithms contains a great deal of information which can help you
understand the syntactic (and some of the semantic) complexity of the
text you intend to analyze.
Depending on the problem at hand, you might want to try different
parsing strategies and techniques. However, at present, dependency
parsing seems to outperform other approaches.
Lemmatization and Stemming
Stemming and
Lemmatization
both refer to the process of removing all of the affixes (i.e.
suffixes, prefixes, etc.) attached to a word in order to keep its
lexical base, also known as a root or stem, or its dictionary form,
known as a lemma. The main difference between these two processes is that
stemming is usually based on rules that trim word beginnings and endings (and sometimes lead to somewhat weird results), whereas
lemmatization makes use of dictionaries and a much more complex morphological analysis.
The table below shows the output of
NLTK's Snowball Stemmer and
Spacy's lemmatizer for the tokens in the sentence
'Analyzing text is not that hard'. The differences in the output have been boldfaced:
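You can reproduce that comparison yourself; a minimal sketch using NLTK's Snowball stemmer next to spaCy's lemmatizer:

```python
import spacy
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
nlp = spacy.load("en_core_web_sm")

# Compare the trimmed stem against the dictionary-based lemma
for token in nlp("Analyzing text is not that hard"):
    print(f"{token.text}: stem={stemmer.stem(token.text)}, "
          f"lemma={token.lemma_}")
```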
Stopword Removal
To provide a more accurate automated analysis of the text, it is
important that we remove all the words that are very frequent
but provide very little semantic information or no meaning at all. These
words are also known as
stopwords.
There are many different lists of stopwords for every language.
However, it is important to understand that you might need to add words
to or remove words from those lists depending on the texts you would
like to analyze and the analyses you would like to perform.
You might want to do some kind of lexical analysis of the domain your
texts come from in order to determine the words that should be added to
the stopwords list. Focus particularly on those that might lead to:
- overfitting of your text classifiers or text extractors to a particular testing set, and
- a lack of generalization of the analyses your classifiers and extractors perform.
Depending on the problem at hand, sequences of numbers, URLs, and
some names, for example, might not be relevant for the detection of a
topic. If this happens to be the case, you should add those to the
stopword list. Look at the example below (from an IT support system):
'Please talk to Professor John Doe (555) 5555 5555 to discuss his
computer configuration. He would like to upgrade his HDD and RAM.
Thanks.'
Here, the phone number, Professor John Doe, and some words like
Please are likely to bear little or no relevance in the detection of the
topic of the text:
computer configuration. If you add those words to the stopword list, the classifier will not make predictions based on their appearances in texts.
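A minimal sketch of stopword removal on that ticket, starting from NLTK's English stopword list and extending it with the domain-specific words discussed above:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

# Extend the default English list with domain-specific noise words
stop = set(stopwords.words("english")) | {"please", "thanks", "professor"}

ticket = ("Please talk to Professor John Doe to discuss his "
          "computer configuration Thanks")
filtered = [w for w in ticket.lower().split() if w not in stop]
print(filtered)
# ['talk', 'john', 'doe', 'discuss', 'computer', 'configuration']
```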
Data Analysis
Now that you know the basics of data preparation, let's delve deeper into the fun part: data analysis!
Unstructured text is everywhere, and gathering it can be as simple as
scraping websites, downloading emails, or exporting tickets.
Now, how do you analyze all of this text?
Well, the analysis of unstructured text is not
straightforward.
There are countless ways of analyzing text to choose from and a great
many more you can create yourself. In this section, we will introduce
two ways in which you will be able to get insights from your texts,
namely,
text classification and
text extraction.
Text Classification
Text classification (also known as
text categorization or
text tagging) refers to the process of assigning tags to texts based on their content.
In the past, text classification was done manually, which was
time-consuming, inefficient, and inaccurate. At present, we can perform
automated text analysis of our data in very little time and get really
good results.
Typical text classification tasks include
sentiment analysis
(i.e. detecting when a text says something positive or negative about a
given topic), topic detection (i.e. determining what topics a text
talks about), and intent detection (i.e. detecting the purpose or
underlying intent of the text), among others, but there are a great many
more
applications you might be interested in.
Rule-based Systems
In text classification, a
rule is essentially a human-made
association between a linguistic pattern that can be found in a text and
a tag. Rule-based systems detect these handcrafted
linguistic patterns in texts and assign the corresponding tags based
on the results of the detections. Usually, rules consist of references
to morphological, lexical, or syntactic patterns, but they can also
contain references to other components of language, such as semantics or
phonology.
Here's an example of a very simple rule for classifying product
descriptions according to the type of product described in the text:
(HDD|RAM|SSD|Memory) → Hardware
In this case, the system will assign the
Hardware tag to those texts that contain the words
HDD,
RAM,
SSD, or
Memory.
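A minimal sketch of how such a rule could be implemented in Python; the rule itself is the one above, and everything else is illustrative scaffolding:

```python
import re

# The (HDD|RAM|SSD|Memory) -> Hardware rule, plus room for more rules
RULES = [
    (re.compile(r"\b(HDD|RAM|SSD|Memory)\b", re.IGNORECASE), "Hardware"),
]

def classify(text):
    """Return every tag whose pattern matches the text."""
    return [tag for pattern, tag in RULES if pattern.search(text)]

print(classify("Ships with 16GB RAM and a 512GB SSD"))  # ['Hardware']
```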
The most obvious advantage of rule-based systems is that they are
easily understandable by humans. However, creating complex rule-based
systems takes a lot of time and a good deal of knowledge of both
linguistics and the topics being dealt with in the texts the system is
supposed to analyze.
On top of that, rule-based systems are difficult to scale and
maintain because adding new rules or modifying the existing ones
requires a lot of analysis and testing of the impact of these changes on
the results of the predictions.
Machine Learning-based Systems
Machine Learning based systems can make predictions based on what
they learn from past observations. In other words, these systems need
to be fed many examples of texts along with the expected
predictions (tags) for each of them.
The more consistent and accurate the samples you feed the
classifier, the better its predictions will be. The set of
tagged texts you feed your system so that it can learn from them
is called
training data.
When you train a machine learning-based classifier, training data has
to be transformed into something a machine can understand, that is,
vectors
(i.e. lists of numbers which encode some information). By using
vectors, the system can extract relevant features (pieces of
information) which will help it learn from the existing data and make
predictions about the texts to come.
There are a number of ways of doing this, but one of the most frequently used is known as the
bag of words vectorization. You can learn more about vectorization
here.
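A minimal sketch of bag-of-words vectorization, using scikit-learn's CountVectorizer as one common choice (the two example texts are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love this product", "This product is terrible"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())  # one row of word counts per text
```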
Once the texts have been transformed into vectors, they are fed into a
machine learning algorithm together with their expected output to
create a classification model that can choose what features best
represent the texts and make predictions about unseen texts:
The trained model will transform unseen text into a vector, extract its relevant features, and make a prediction:
Machine Learning Algorithms
There are many machine learning algorithms used in text classification. The most frequently used are the
Naive Bayes family of algorithms (NB),
Support Vector Machines (SVM), and
deep learning algorithms.
The
Naive Bayes
family of algorithms is based on Bayes's Theorem and the conditional
probabilities of occurrence of the words of a sample text within the
words of a set of texts that belong to a given tag. Vectors that
represent texts encode information about how likely it is for the words
in the text to occur in the texts of a given tag. With this information,
the probability of a text's belonging to any given tag in the model can
be computed. Once all of the probabilities have been computed for an
input text, the classification model will return the tag with the
highest probability as the output for that input.
One of the main advantages of this algorithm is that it delivers pretty good results even when there is little training data.
Support Vector Machines
(SVM) is an algorithm that can divide a vector space of tagged texts
into two subspaces: one space that contains most of the vectors that
belong to a given tag and another subspace that contains most of the
vectors that do not belong to that one tag.
Classification models that use SVM at their core will transform texts
into vectors and will determine what side of the boundary that divides
the vector space for a given tag those vectors belong to. Based on where
they land, the model will know if they belong to a given tag or not.
The most important advantage of using SVM is that results are usually
better than those obtained if Naive Bayes is used. However, more
computational resources are needed in order to use SVM.
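To make the comparison concrete, here's a hedged sketch that trains both a Naive Bayes and an SVM classifier on a toy urgency dataset with scikit-learn; with only four training texts, it is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration
train_texts = ["please help me asap", "urgent: the system is down",
               "thanks for the help", "the new feature works like a dream"]
train_tags = ["Urgent", "Urgent", "Low Priority", "Low Priority"]

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_tags)
    print(type(clf).__name__, model.predict(["everything is down, help!"]))
```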
Deep Learning
is a set of algorithms and techniques inspired by how the human brain
works. These algorithms use huge amounts of training data (millions of
examples) to generate semantically rich representations of texts which
can then be fed into machine learning-based models of different kinds
that will make much more accurate predictions than traditional machine
learning models.
Hybrid Systems
Hybrid systems usually contain machine learning-based systems at
their cores and rule-based systems that are used to further improve the
predictions.
Evaluation
Classifier performance is usually evaluated through standard metrics used in the machine learning field. These metrics are
accuracy,
precision,
recall, and
F1 score. Understanding what they mean will give you a clearer idea of how good your classifiers are at analyzing your texts.
It is also important to understand that evaluation can be performed over a
fixed testing set
(i.e. a set of texts for which we know the expected output tags) or by
using cross-validation (i.e. a method that splits your training data
into different folds so that you can use some subsets of your data for
training purposes and some for testing purposes,
see below).
In this section, we will introduce you to the standard performance
metrics and the cross-validation method so that you can better
understand how the performance of your classifiers is evaluated.
Accuracy, Precision, Recall, and F1 score
Accuracy is the number of correct predictions the classifier
has made for a given tag divided by the total number of predictions. In
general, accuracy alone is not a good indicator of performance. For
example, when categories are imbalanced, that is, when there is one
category that contains many more examples than all of the others,
predicting all texts as belonging to that category will return high
accuracy levels. This is known as the
accuracy paradox. To get a better idea of the performance of a classifier, you might want to consider precision and recall instead.
Precision states how many texts were predicted correctly out
of the ones that were predicted as belonging to a given tag. In other
words, precision takes the number of texts that were correctly predicted
as positive for a given tag and divides it by the number of texts that
were predicted (correctly and incorrectly) as belonging to the tag.
We have to bear in mind that precision only gives information about
the cases where the classifier predicts that the text belongs to a given
tag. This might be particularly important, for example, if you would
like to generate automated responses for user messages. In this case,
before you send an automated response you want to know for sure you will
be sending the right response, right? In other words, if your
classifier says the user message belongs to a certain type of message,
you would like the classifier to make the right guess. This means you
would like high precision for that type of message.
Recall states how many texts were predicted correctly out of
the ones that should have been predicted as belonging to a given tag.
In other words, recall takes the number of texts that were correctly
predicted as positive for a given tag and divides it by the number of
texts that were either predicted correctly as belonging to the tag or
that were incorrectly predicted as not belonging to the tag.
Recall might prove useful when routing support tickets to the
appropriate team, for example. It might be desired for an automated
system to detect as many tickets as possible for a critical tag (for
example tickets about
'Outages / Downtime') at the expense of
making some incorrect predictions along the way. In this case, making a
prediction will help perform the initial routing and solve most of these
critical issues ASAP. If the prediction is incorrect, the ticket will
get rerouted by a member of the team. When processing thousands of
tickets per week, high recall (with good levels of precision as well, of
course) can save support teams a good deal of time and enable them to
solve critical issues faster.
The
F1 score is the harmonic mean of precision and recall.
It tells you how well your classifier performs if equal importance is
given to precision and recall. In general, F1 score is a much better
indicator of classifier performance than accuracy is.
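scikit-learn implements all four metrics; a minimal sketch on invented predictions for an Urgent tag:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Invented gold labels and predictions for a binary Urgent/Low task
y_true = ["Urgent", "Urgent", "Low", "Low", "Urgent", "Low"]
y_pred = ["Urgent", "Low",    "Low", "Low", "Urgent", "Low"]

print(accuracy_score(y_true, y_pred))                       # 0.83
print(precision_score(y_true, y_pred, pos_label="Urgent"))  # 1.0
print(recall_score(y_true, y_pred, pos_label="Urgent"))     # 0.67
print(f1_score(y_true, y_pred, pos_label="Urgent"))         # 0.8
```

Here the classifier never predicts Urgent wrongly (perfect precision) but misses one urgent text (lower recall), which is exactly the trade-off the F1 score balances.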
Cross-validation
Cross-validation is quite frequently used to evaluate the performance
of text classifiers. The method is simple. First of all, the training
dataset is randomly split into a number of equal-length subsets (e.g. 4
subsets with 25% of the original data each). Then, all the subsets
except for one are used to train a classifier (in this case, 3 subsets
with 75% of the original data) and this classifier is used to predict
the texts in the remaining subset. Next, all the performance metrics are
computed (i.e. accuracy, precision, recall, f1, etc.). Finally, the
process is repeated with a new testing fold until all the folds have
been used for testing purposes.
Once all folds have been used, the average performance metrics are computed and the evaluation process is finished.
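scikit-learn wraps this whole procedure in a single helper; a minimal sketch with 4 folds over a toy dataset (eight invented texts, so the scores themselves mean little):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["help asap", "system down", "all good", "love it",
         "urgent issue", "crashed again", "works great", "very happy"]
tags = ["Urgent", "Urgent", "Low", "Low",
        "Urgent", "Urgent", "Low", "Low"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 4 folds: train on 75% of the data, test on the remaining 25%, 4 times
scores = cross_val_score(model, texts, tags, cv=4, scoring="f1_macro")
print(scores, scores.mean())
```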
Text Extraction
Text extraction refers to the process of recognizing structured pieces of information from unstructured text.
For example, it might be useful to automatically detect the most
relevant keywords from a piece of text, identify names of companies in a
news article, detect lessors and lessees in a financial contract, or
identify prices on product descriptions. Just like text classification,
text extraction can be performed automatically or manually, with the
latter being tremendously more time-consuming and inefficient.
There are different ways in which automated text extraction can be
implemented. In this section, we introduce some approaches which are
widely accepted and return very good results.
Regular Expressions
Regular Expressions
(a.k.a. regexes) work as the equivalent of the rules defined in
classification tasks. In this case, a regular expression defines a
pattern of characters that will be associated with a tag.
For example, the pattern below will detect most email addresses in a text if they are preceded and followed by spaces:
(?i)\b(?:[a-zA-Z0-9_\-\.]+)@(?:(?:\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(?:(?:[a-zA-Z0-9\-]+\.)+))(?:[a-zA-Z]{2,4}|[0-9]{1,3})(?:\]?)\b
By detecting this match in texts and assigning it the
email tag, we can create a rudimentary, but functional, email address extractor.
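In practice you'd wire such a pattern into a few lines of Python. The sketch below uses a simplified pattern, with the assumption that it only needs to cover common address formats, not every RFC-valid email:

```python
import re

# Simplified email pattern; real-world addresses have more edge cases
EMAIL = re.compile(r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b")

text = "Contact jane.doe@example.com or support@my-company.org for details."
print(EMAIL.findall(text))
# ['jane.doe@example.com', 'support@my-company.org']
```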
There are obvious pros and cons of this approach. On the plus side,
you can create text extractors quickly and the results obtained can be
good, provided you can find the right patterns for the type of
information you would like to detect. On the minus side, regular
expressions can get extremely complex and might be really difficult to
maintain and scale, particularly when many expressions are needed in
order to extract the desired patterns.
Conditional Random Fields
Conditional Random Fields
(CRF) is a statistical approach often used in machine-learning-based
text extraction. This approach learns the patterns to be extracted by
weighing a set of features of the sequences of words that appear in a
text. Through the use of CRFs, we can add multiple variables which
depend on each other to the patterns we use to detect information in
texts, such as syntactic or semantic information.
This usually generates much richer and more complex patterns than
regular expressions and can potentially encode much more information.
However, more computational resources are needed in order to implement
it since all the features have to be calculated for all the sequences to
be considered and all of the weights assigned to those features have to
be learned before determining whether a sequence should belong to a tag
or not.
One of the main advantages of the CRF approach is its generalization
capacity. Once an extractor has been trained using the CRF approach over
texts of a specific domain, it will have the ability to generalize what
it has learned to other domains reasonably well.
Evaluation
Extractors are sometimes evaluated by calculating the same standard
performance metrics we have explained above for text classification,
namely,
accuracy,
precision,
recall, and
f1 score.
However, these metrics do not account for partial matches of patterns.
In order for an extracted segment to be a true positive for a tag, it
has to be a perfect match with the segment that was supposed to be
extracted.
Consider the following example:
'Your flight will depart on June 13, 2019 at 03:30 PM from SFO.'
If we created a date extractor, we would expect it to return June 13,
2019 as a date from the text above, right? So, if the output of the
extractor were June 13, 2019, we would count it as a true positive for
the tag
DATE.
But, what if the output of the extractor were June 13? Would you say
the extraction was bad? Would you say it was a false positive for the
tag
DATE? To capture partial matches like this one, some other
performance metrics can be used to evaluate the performance of
extractors. One example of this is the
ROUGE family of metrics.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is
a family of metrics used in the fields of machine translation and
automatic summarization that can also be used to assess the performance
of text extractors. These metrics basically compute the lengths and
number of sequences that overlap between the source text (in this case,
our original text) and the translated or summarized text (in this case,
our extraction).
Depending on the length of the units whose overlap you would like to
compare, you can define ROUGE-n metrics (for units of length
n) or you can define the ROUGE-LCS or ROUGE-L metric if you intend to compare the longest common sequence (LCS).
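As a back-of-the-envelope illustration, ROUGE-1 recall can be computed from unigram overlap alone; this hand-rolled sketch ignores the refinements of the full metric family:

```python
import re
from collections import Counter

def rouge_1_recall(reference, extraction):
    """Overlapping unigrams divided by the reference's unigram count."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    ext = Counter(re.findall(r"\w+", extraction.lower()))
    return sum((ref & ext).values()) / sum(ref.values())

# The partial extraction 'June 13' still gets credit vs 'June 13, 2019'
print(rouge_1_recall("June 13, 2019", "June 13"))  # 0.666...
```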
Data Visualization
So, you're getting up to speed with automatic text analysis and now
you're able to process complex data in a fast and effective way using
AI. What's next? The answer is simple: data visualization.
When you use text analysis with machine learning to classify or extract specific text data, the
outcome is either:
- An API response in JSON format (if you are a coder).
- A CSV or an Excel file (if you can't code).
But, what can you do with these results?
Let's imagine you've automatically analyzed 10,000 responses from
open-ended questions in a customer feedback survey, and now you're going
to present the results within your company. What if you could transform
the raw results into well-designed data graphics and display the
content in a visual, impactful and engaging way?
Or, say you have to analyze thousands of product reviews. Wouldn't it
be easier to just look at charts and graphs to detect trends and find
valuable insights for your reports, instead of staring at a boring Excel
spreadsheet?
Data visualization boosts the value of the results obtained with
text mining. By using business intelligence tools, you can transform
complex concepts into compelling and simple information. Data graphics
also make it easier to establish relationships and observe patterns.
It's all about high-quality insights that lead to smart data-oriented
business decisions!
There are several data visualization tools that you can use to
process text analysis results. Let's take a closer look at the three
most popular:
Google Data Studio
Google Data Studio is Google's free and easy-to-use visualization tool that allows you to
create interactive reports
using a wide variety of data. To get started, you'll need to connect
your data to the platform (you can connect to more than 100 different
sources!).
Once you've imported your data you can use different tools to design
your report and turn your data into an attractive story. Finally, you
can share the results with individuals or teams, publish them on the Web
or embed them on your website.
Looker
Looker is a
business data analytics platform designed to direct meaningful data to
everyone in a company. The idea is to allow teams to have a 'bigger
picture' about what's happening in their company.
This platform connects to different databases and automatically
creates a data model, which can be fully customized to meet specific
needs. You can use data to build your own friendly dashboards and
reports, interact with data in real-time and share it with your team
members.
Here are
some instructions to get started.
Tableau
Tableau is a
business intelligence and data visualization tool, which makes it easy
to work with data and create intuitive visual analytics. With a very
user-friendly approach (there are no technical skills required), Tableau
allows organizations to work with almost any existing data source and
provides powerful visualization options.
The Tableau suite offers different products; some of them are
developer tools, while others are sharing tools for non-coders. There's a
trial version available for anyone wanting to give it a go. Check out
this video if you want to learn how to get started with Tableau.
Other data visualization tools you might consider to create customized dashboards and reports from text mining results are
Klipfolio and
Mode Analytics.
Wrap-Up
Businesses generate lots of information every day and most of it is
unstructured text data. Whether you work in sales, customer support,
marketing or product, analyzing raw text data is extremely important.
When done manually, the tagging process is time-consuming, repetitive
and inaccurate. But above all, it leaves employees with less time to
invest in other tasks, the ones where you can really make a difference
and set the bar high for a successful customer experience.
Artificial intelligence, text analysis, machine learning: all these
terms seem too complex, broad and difficult to grasp. But what if you
could benefit from these technologies without needing to actually invest
a lot of time in learning or building tools from scratch? What if you
didn't need to be a developer or a machine learning expert to get
started with text analysis?
Now that you've read this section, you probably have a few ideas of
data sources you can use for automated text analysis. Also, you already
know what a text classifier and a text extractor are, and how they can
be trained with machine learning. Finally, you've learned how data
visualization tools can help you to make your results more appealing.
So what's next? The following section will go into depth about the
different applications of text analysis and the impact this can have on
your business. We'll provide case studies and examples to explain how
you can use text analysis in a wide number of areas within your company,
like social media monitoring, brand monitoring, customer support, voice
of customer (VoC), business intelligence, sales marketing, product
analytics, and knowledge management.
Use Cases and Applications
Did you know that
80 percent of business data is text?
Text is present in every major business process, from support tickets
to product feedback and customer interactions. But analyzing heaps of
text can be quite daunting. That's why text analysis is growing in
popularity, helping more companies to automate different tasks and
processes with machine learning.
Text analysis has a broad range of business applications and use
cases. Some businesses use this technology to maximize efficiency and
reduce the time employees spend doing repetitive tasks that can
potentially have a high turnover impact. Others are hoping to better
understand customer insights without having to sort through millions of
social media posts, online reviews, and survey responses.
If you work in customer experience, product, marketing or sales,
there are quite a few text analysis applications that can help you
automate manual processes and get better insights that won't require you
to be tech savvy. In this section, we'll cover the following use cases
and applications:
- Social Media Monitoring
- Brand Monitoring
- Customer Service
- Voice of Customer & Customer Feedback
- Business Intelligence
- Sales and Marketing
- Product Analytics
- Knowledge Management
Let's dive in!
Social Media Monitoring
Let's say you work for Uber and you want to know what users are
saying about the brand. You've read some positive and negative feedback
on Twitter and Facebook. But
500 million tweets are sent each day,
and Uber has thousands of mentions on social media every month. Can you
imagine analyzing all of them manually? Where do you start? All this
raw text data needs to be converted into numbers. This is where text
analysis with machine learning plays a crucial role.
A useful option is
sentiment analysis, which analyzes the opinion about a given subject within a text. By analyzing your social media mentions with a
sentiment analysis model, you can automatically categorize them into
Positive,
Neutral or
Negative. If you also analyze these mentions with a topic classifier, you can understand what they are talking about. By running
aspect-based sentiment analysis, you can automatically pinpoint the reasons behind positive or negative mentions and get insights such as:
- What is the top complaint about Uber on social media?
- How successful is Uber's customer service - are people happy or annoyed with it?
- What do Uber users like about the service when they mention Uber in a positive way?
Now, let's say you've just added a new service to Uber. For example,
Uber Eats. It's a crucial moment, and your company wants to know what
people are saying about Uber Eats so that you can fix any glitches as
soon as possible, and polish the best features. You can also use
aspect-based sentiment analysis on your Facebook, Instagram and Twitter
profiles for any Uber Eats mentions and discover things such as:
- Are people happy with Uber Eats so far?
- What is the most urgent issue to fix?
- How can we incorporate positive stories into our marketing and PR communication?
Not only can you use text analysis to keep tabs on your brand's
social media mentions, but you can also use it to monitor your
competitors' mentions as well. Is a client complaining about a
competitor's service? That gives you a chance to attract potential
customers and show them how much better your brand is.
Brand Monitoring
Have you ever had to deal with a barrage of negative comments on the
internet? It's quite a stressful task, to say the least, so spotting
them on social media, forums, blogs and review sites as soon as possible
could reduce the negative impact on your brand.
The power of negative reviews is quite strong:
40% of consumers are put off from buying if a business has negative reviews.
An angry customer complaining about poor customer service can spread
like wildfire within minutes: a friend shares it, then another, then
another… And before you know it, the negative comments have gone viral.
So, where do you start? First, you can do some
web scraping to automatically collect reviews or news articles. Here are different tools to get this done:
- Visual web scrapers that are super easy for everyone to use and don't require a single line of code, such as Dexi.io (integrated with MonkeyLearn), Parsehub, and Portia.
- Web scraping frameworks for programmers (for example, Scrapy for Python, or Upton for Ruby).
The next step will be running text analysis models on the scraped
data to get useful insights about online conversations that mention your
brand. This can help you:
- Understand how your brand reputation evolves over time.
- Compare your brand reputation to your competitor's.
- Identify which aspects are damaging your reputation.
- Pinpoint which elements are boosting your brand reputation on online media.
- Identify potential PR crises so you can deal with them ASAP.
- Tune into data from a specific moment, like the day of a new product
launch or IPO filing. Just run a sentiment analysis on social media and
press mentions on that day, to find out what people said about your
brand.
Customer Service
Despite many people's fears and expectations, text analysis doesn't
mean that customer service will be entirely machine-powered. It just
means that businesses will have more seamless processes so that teams
can spend more time solving problems that require human interaction.
That way businesses will be able to increase retention, given that
89 percent of customers change brands because of poor customer service. But how can text analysis assist your company's customer service?
Ticket Tagging: Make the Most out of Ticket Tagging
When a customer complaint, comment, or request arises, customer
service representatives have to take the time to categorize each ticket
before addressing it. It's a fairly simple task, but it's a boring and
time-consuming process. Each tag takes at least two clicks to add. And
even if a support agent takes the time to read each ticket and choose
the most appropriate tags, they're not always consistent.
That's why we need to start delegating the task to
machines.
Text analysis automatically identifies topics and tags each ticket. With a bit of previous training, yes, but without customer service agents becoming frustrated. Here's how it works:
- The model analyzes the language and expressions a customer uses, for example, 'I didn't get the right order'.
- Then, it compares it to other similar conversations.
- Finally, it finds a match and tags the ticket automatically. In this case, it could be under a Shipping Problems tag.
This process repeats itself every time a new ticket comes in, freeing customer agents to focus on more important tasks.
Ticket Routing & Triage: Find the Right Person for the Job
With text analysis, team members can save time they'd have otherwise
spent reading tickets and manually assigning them to the qualified rep.
This can now be done automatically and in real-time with a text analysis
model that can pinpoint what each ticket is about, and then
route it accordingly to the most appropriate person, whether it's because of their skills, appointed tasks, or native language.
For example, for a SaaS company that receives a customer ticket
asking for a refund, the text mining system will identify which team
usually handles billing issues and send the ticket to them. If a ticket
says something like
'How can I integrate your API with python?', it would go straight to the team in charge of helping with
Integrations.
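Once a ticket has a predicted tag, routing is little more than a lookup. A small sketch, reusing a classifier like the one above (the team addresses are placeholders):

```python
# Route a ticket to a team queue based on its predicted tag.
# The routing table entries are illustrative placeholders.
ROUTING_TABLE = {
    "Billing": "billing-team@example.com",
    "Integrations": "integrations-team@example.com",
    "Shipping Problems": "logistics-team@example.com",
}

def route_ticket(ticket: str, classifier) -> str:
    """Predict the ticket's topic and return the queue it should go to."""
    tag = classifier.predict([ticket])[0]
    return ROUTING_TABLE.get(tag, "support@example.com")  # general fallback
```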
Ticket Analytics: Learn More from your Customers
What is commonly assessed to determine the performance of a customer service team? Common KPIs are
first response time, average
time to resolution (i.e. how long it takes your team to resolve issues), and
customer satisfaction (CSAT). And, let's face it, overall client satisfaction has a lot to do with the first two metrics.
But how do we get actual CSAT insights from customer conversations?
How can we identify if a customer is happy with the way an issue was
solved? Or if they have expressed frustration with the handling of the
issue?
In this situation, aspect-based sentiment analysis could be used. This type of text analysis delves into the feelings and topics behind the words on different support channels, such as support tickets, chat conversations, emails, and CSAT surveys. A text analysis model can understand words or expressions to define the support interaction as Positive, Negative, or Neutral, understand what was mentioned (e.g. Service or UI/UX), and even determine the sentiments behind the words (e.g. Sadness, Anger, etc.).
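To make the output tangible, here's roughly what aspect-based results look like; the structure and values below are illustrative, not any specific tool's format:

```python
# Illustrative aspect-based sentiment results: each opinion unit gets an
# aspect plus a polarity. The records below are invented examples.
results = [
    {"text": "The agent resolved my issue fast", "aspect": "Service", "sentiment": "Positive"},
    {"text": "The settings menu is a maze", "aspect": "UI/UX", "sentiment": "Negative"},
]
for r in results:
    print(f"{r['aspect']:<8} {r['sentiment']:<8} {r['text']}")
```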
Urgency Detection: Prioritize Urgent Tickets
“Where do I start?” is a question customer service representatives often ask themselves. Urgency is definitely a good starting point, but how do we define the level of urgency without wasting valuable time deliberating?
A text mining model can define the urgency level of a customer ticket
and tag it accordingly. Support tickets with words and expressions that
denote urgency, such as
'as soon as possible' or
'right away', are duly tagged as
Priority.
To see how text analysis works to detect urgency, check out this MonkeyLearn
urgency detection demo model.
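As a toy illustration, a rule-based version of the idea fits in a few lines. A trained model learns these cues from tagged examples instead of relying on a hand-written list like the one assumed here:

```python
# A toy rule-based urgency tagger. Real models learn such cues from
# tagged tickets; this hard-coded phrase list is just an assumption.
URGENT_PHRASES = ("asap", "as soon as possible", "right away", "urgent", "system is down")

def urgency_tag(ticket: str) -> str:
    text = ticket.lower()
    return "Priority" if any(phrase in text for phrase in URGENT_PHRASES) else "Normal"

print(urgency_tag("Can't enter the platform, please help ASAP!"))  # Priority
```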
Voice of Customer & Customer Feedback
Once you get a customer, retention is key, since
acquiring new clients is five to 25 times more expensive than retaining the ones you already have.
That's why paying close attention to the voice of the customer can give
your company a clear picture of the level of client satisfaction and,
consequently, of client retention. Also, it can give you actionable
insights to prioritize the product roadmap from a customer's
perspective.
The thing is that most of your VoC is probably written text, so how can you gain knowledge from thousands of feedback entries?
Analyzing NPS Responses
Maybe your brand already has a customer satisfaction survey in place, the most common one being the
Net Promoter Score (NPS). This survey asks the question
'How likely is it that you would recommend [brand] to a friend or colleague?'. The answer is a score from 0-10 and the result is divided into three groups: the
promoters, the
passives, and the
detractors.
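By convention, promoters score 9-10, passives 7-8, and detractors 0-6, and the NPS itself is the percentage of promoters minus the percentage of detractors. A quick sketch with made-up scores:

```python
# NPS from raw 0-10 scores: promoters (9-10) minus detractors (0-6),
# as a percentage of all respondents. The scores below are invented.
scores = [10, 9, 7, 4, 8, 10, 3, 9]

promoters = sum(1 for s in scores if s >= 9)
detractors = sum(1 for s in scores if s <= 6)
nps = 100 * (promoters - detractors) / len(scores)
print(round(nps))  # 4 promoters, 2 detractors out of 8 -> 25
```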
But here comes the tricky part: there's an open-ended follow-up question at the end: 'Why did you choose X score?'
The answer can provide your company with invaluable insights. Without
the text, you're left guessing what went wrong. The problem is going
through these open-ended responses is
hard:
- Customer reps have to sort through and categorize every feedback entry manually. This means that hours are lost solely on reading.
- They have to tag feedback in a consistent way, which is hard for one employee, let alone a whole team.
To solve these problems and get the evidence to surface, text analysis can be applied in different ways.
You can do what
Promoter.io did:
extract the main keywords of your customers' feedback to understand what's being praised or criticized about your product. Is the keyword
'Product'
mentioned mostly by promoters or detractors? With this info, you'll be
able to use your time to get the most out of NPS responses and start
taking action.
Another option is to follow in Retently's footsteps: they used text analysis to classify feedback into different topics, such as 'Customer Support', 'Product Design', and 'Product Features', and then analyzed each tag with sentiment analysis to see how positively or negatively clients feel about each topic. Now they know they're on the right track with product design, but still have to work on product features.
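A bare-bones version of the keyword idea can be sketched with TF-IDF. The two feedback snippets below are invented, and a real keyword extractor would be considerably more sophisticated:

```python
# A bare-bones keyword sketch: rank each group's words by TF-IDF weight.
# The two feedback snippets are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer

feedback = {
    "promoters": "love the product, support team is great and fast",
    "detractors": "product keeps crashing and support never answers",
}
vec = TfidfVectorizer(stop_words="english")
weights = vec.fit_transform(feedback.values()).toarray()

for (group, _), row in zip(feedback.items(), weights):
    top = sorted(zip(vec.get_feature_names_out(), row), key=lambda t: -t[1])[:3]
    print(group, [word for word, _ in top])
```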
Analyzing Customer Surveys
Does your company have another customer survey system? If it's a
scoring system or closed-ended questions, it'll be a piece of cake to
analyze the responses: just crunch the numbers.
However, if you have an open-text survey, whether it's provided via
email or it's an online form, you can stop manually tagging every single
response by letting text analysis do the job for you. Besides saving
time, you can also have consistent tagging criteria without errors,
24/7.
Business Intelligence
Data analysis is at the core of every business intelligence
operation. Now, what can a company do to understand, for instance, sales
trends and performance over time? With numeric data, a BI team can
identify what's happening (such as sales of X are decreasing) – but not
why.
Numbers are easy to analyze, but they are also somewhat limited. Text
data, on the other hand, is the most widespread format of business
information and can provide your organization with valuable insight into
your operations. The thing is… analyzing text manually is frustrating
😫!
Let's say you work for a SaaS startup. You're part of a small but
mighty team working on a great new product that solves a real problem.
After months of hard work, you successfully launch your product, the
first sales are made, and the business grows. Your customers are leaving
positive feedback, the team starts expanding, and your startup
experiences the so-called hockey-stick growth. Then, after a couple of
high-growth years, somehow, business starts slowing down. The flow of
sales prospects is still good, but you start to have high customer
churn. It's harder and harder for your sales reps to close new deals and
the future growth prospects begin to stall. What changed?
Your team can lean on text analysis to find out what is going on with
your company, offering concrete insights for key decision-making.
For example, you can run keyword extraction and sentiment analysis on
your social media mentions to understand what people are complaining
about regarding your brand.
You can also run
aspect-based sentiment analysis on customer reviews that mention poor customer experiences. After all,
67% of consumers list bad customer experience as one of the primary reasons for churning.
Maybe it's bad support, a faulty feature, unexpected downtime, or a
sudden price change. Analyzing customer feedback can shed a light on the
details, and the team can take action accordingly.
But let's not limit this to internal data… What about the competitors? What are their reviews saying? Run them through your text analysis model to see what they're doing right and wrong, and improve your decision-making.
Sales and Marketing
Prospecting is the most difficult part of the sales process, and it's getting harder and harder. Sales teams always want to close deals, which requires making the sales process more efficient. But 27% of sales agents spend over an hour a day on data entry instead of selling, meaning critical time is lost to administrative work rather than closing deals.
Text analysis takes the heavy lifting out of manual sales tasks, including:
- Updating the deal status as 'Not interested' in your CRM.
- Qualifying your leads based upon company descriptions.
- Identifying leads on social media that express buying intent.
GlassDollar, a
company that links founders to potential investors, is using text
analysis to find the best quality matches. How? They use text analysis
to classify companies using their company descriptions. The results?
They saved themselves days of manual work, and predictions were 90%
accurate after training a text classification model. You can learn more
about their experience with MonkeyLearn
here.
Not only can text analysis automate manual and tedious tasks, but it
can also improve your analytics to make the sales and marketing funnels
more efficient. For example, you can automatically analyze the responses
from your sales emails and conversations to understand, let's say, a
drop in sales:
- What are the blocks to completing a deal?
- What sparks a customer's interest?
- What are customer concerns?
Now, imagine that your sales team's goal is to target a new segment for your SaaS: people over 40. The first impression is that they don't like the product, but why? Just filter that age group's sales conversations and run them through your text analysis model. Sales teams can then make better decisions using in-depth text analysis of customer conversations.
Finally, you can use machine learning and text analysis to provide a better experience overall within your sales process. For example, Drift, a conversational marketing platform, integrated the MonkeyLearn API to let recipients automatically opt out of sales emails based on how they reply.
It's time to boost sales and stop wasting valuable time with leads that don't go anywhere. Xeneta, a sea freight company,
developed a machine learning algorithm and trained it to identify which companies were potential customers, based on the company descriptions gathered through
FullContact (a SaaS company that has descriptions of millions of companies).
You can do the same or target users that visit your website to:
- Get information about where potential customers work using a service like Clearbit and classify the company according to its type of business to see if it's a possible lead.
- Extract information to easily learn the user's job position, the
company they work for, its type of business and other relevant
information.
- Hone in on the most qualified leads and save time actually looking for them: sales reps will receive the information automatically and can start targeting potential customers right away (see the sketch below).
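In the spirit of the GlassDollar and Xeneta examples, here's a hedged sketch of qualifying leads from company descriptions; the training descriptions and labels are invented:

```python
# A hedged lead-qualification sketch: classify company descriptions.
# The training descriptions and labels below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

descriptions = [
    "Global sea freight and logistics provider",
    "Family-owned bakery serving the local community",
    "Container shipping analytics platform for enterprises",
    "Independent bookshop and cafe in the city centre",
]
labels = ["lead", "not a lead", "lead", "not a lead"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(descriptions, labels)
print(clf.predict(["Freight forwarding services for ocean cargo"]))  # likely ['lead']
```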
Product Analytics
We've covered the importance of text analysis applications for
customer feedback, but we haven't delved into the impact this machine
learning tool can have on a brand's product analytics. For argument's
sake, let's imagine your startup has an app on the Google Play store.
You're receiving some unusually negative comments. What's going on?
It can be daunting to locate and tackle a problem if you first have
to read through a thousand reviews. Alternatively, you can discover
what's going on within minutes by using a text analysis model that
groups reviews into different tags like
'Ease of Use' and
'Integrations'.
Then, you can run them on the sentiment analysis model to find out
whether customers are talking about products positively or negatively.
Finally, graphs and reports can be created to visualize and prioritize
product problems. We did this with reviews for
Slack from the product review site
Capterra and got some
pretty interesting insights. Here's how:
-
We analyzed reviews with aspect-based sentiment analysis and categorized them into main topics and sentiment.
-
We extracted keywords with the keyword extractor to get some insights into why reviews that are tagged under 'Performance-Quality-Reliability' tend to be negative.
A quick example: your social media app targets young people
perfectly, reviews are mostly positive, but you want to expand and also
target an older demographic. With aspect-based sentiment analysis, your
product team may learn that many reviews under the
'Ease of Use'
tag shed light on this demographic: some features might need to be
tailored to non-digital natives. But which features? By running the
negative reviews on a keyword extractor you'll track them down within
seconds.
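Here's a quick sketch of that last step. The reviews, tags, and sentiment labels are invented, and a real keyword extractor would go beyond raw word counts:

```python
# Count the most frequent words in negative 'Ease of Use' reviews.
# Reviews, tags, and sentiment labels are invented examples.
from collections import Counter
import re

reviews = [
    {"text": "Menus are confusing and the font is tiny", "tag": "Ease of Use", "sentiment": "Negative"},
    {"text": "The setup wizard keeps crashing on step two", "tag": "Ease of Use", "sentiment": "Negative"},
    {"text": "The Slack integration works perfectly", "tag": "Integrations", "sentiment": "Positive"},
]

words = Counter()
for review in reviews:
    if review["tag"] == "Ease of Use" and review["sentiment"] == "Negative":
        words.update(re.findall(r"[a-z']+", review["text"].lower()))

print(words.most_common(5))
```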
Knowledge Management
CEOs value the practice of sharing knowledge, but often find it a struggle to encourage teams to upload information to the knowledge management system in place (if there is one)... or simply to Google Drive. After all, communication is key to any relationship. Knowledge management traditionally requires teams to manually upload text-form data to a platform or database; with text analysis, this task can be automated.
For example, HR is tasked with reading through and classifying CVs.
If it's a big company, the HR department probably receives hundreds of
CVs per day. Whether the company is looking to fill a new job position
or not, wouldn't it be helpful to have resumés classified by background,
area or qualifications? Plus, reading through CVs takes up valuable
time, so a promising candidate may get lost amid all the mediocre
resumés. To make things easier, HR could:
- Apply a text classifier that automatically tags CVs to find candidates more easily.
- Use a text extractor to identify and extract main aspects of the CV, including 'Full Name', 'Degree', 'Job Experience', and 'Skills' (a toy sketch follows this list).
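As promised, a toy illustration of the extraction idea. Real extractors are trained models rather than regular expressions, and the conveniently structured resume text below is invented:

```python
# A toy regex-based field extractor for a (conveniently structured) CV.
# Real extractors are trained models; this sample text is invented.
import re

cv_text = """Full Name: Jane Doe
Degree: BSc Computer Science
Job Experience: 5 years as a data analyst
Skills: Python, SQL, NLP"""

fields = {}
for key in ("Full Name", "Degree", "Job Experience", "Skills"):
    match = re.search(rf"{key}:\s*(.+)", cv_text)
    if match:
        fields[key] = match.group(1).strip()

print(fields)
```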
Of course, text analysis tools can be used for other purposes, like
extracting keywords or creating a
summary of long documents, such as contracts, so that they can be read in a few minutes to get a general idea of the topic.
We've all heard the saying that ‘knowledge is power', and we couldn't
agree more. Unstructured text data can be transformed into useful data
that gives a business knowledge about how to boost its sales, reduce
turnover or even understand its clients. And all this is made possible
thanks to text analysis.
Resources
Text analysis is the process of obtaining structured knowledge from
natural language text. It's the foundation of Natural Language
Processing (NLP) and as such, if you're building an NLP solution, you'll
inevitably have to learn text analysis.
Luckily, it's easy to get started with text analysis: there are a lot
of useful resources available that can help you get your feet wet. In
this section, we'll cover the following resources:
-
APIs
-
Training Datasets
-
Tutorials
Let's get started!
Text Analysis APIs
In this section, we'll go over both the
open-source libraries and
SaaS APIs you can use to build a text analysis solution that fits your needs.
Open Source Libraries
If a commercial off-the-shelf solution for your text analysis needs
is not available, you can build your own using widely-available open
source solutions.
Python
Python is arguably the most widely used language in scientific computing. Tools like NumPy and SciPy have established it as a fast, dynamic language that calls C and Fortran libraries where performance is needed.
This, combined with a thriving community and a diverse set of libraries for implementing NLP models, has made Python one of the preferred programming languages for text analysis.
NLTK
NLTK, the Natural Language Toolkit, is a best-in-class library for text analysis tasks.
NLTK is used in many university courses, so there's plenty of code
written with it and no shortage of users familiar with both the library
and the theory of NLP who can help answer your questions.
SpaCy
SpaCy is an industrial-strength statistical NLP library. Aside from the usual features, it adds
deep learning integration and
convolutional neural network models for multiple languages.
Unlike NLTK, which is a research library, SpaCy aims to be a battle-tested, production-grade library for text analysis.
Scikit-learn
Scikit-learn
is a complete and mature machine learning toolkit for Python built on
top of NumPy, SciPy, and matplotlib, which gives it stellar performance
and flexibility for building text analysis models.
TensorFlow
Developed by Google, TensorFlow is by far the most widely used library for distributed deep learning. This popularity has effects further down the line: most libraries, ready-made models, and notebooks are built with TensorFlow in mind. So, knowing how to use TensorFlow is important if you're interested in getting into machine learning and text analysis.
PyTorch
PyTorch is a deep learning platform built by Facebook. It's a Python-centric library that lets you define much of your neural network architecture in Python code, handling the lower-level high-performance pieces internally.
Keras
Keras is a widely-used deep learning library written in Python.
It's designed to enable rapid iteration and experimentation with deep
neural networks, and as a Python library, it's uniquely user-friendly.
An important feature of Keras is that it provides what is essentially an abstract interface to deep neural networks. The actual networks can run on top of TensorFlow, Theano, or other backends. This backend independence makes Keras an attractive option in terms of its long-term viability.
The permissive MIT license makes it attractive to businesses looking to develop proprietary models.
R
R is the pre-eminent language for any statistical task. Its collection of libraries (13,711 on CRAN at the time of writing) far surpasses what other programming languages offer for statistical computing. In short, if you choose R for anything statistics-related, you won't find yourself in a situation where you have to reinvent the wheel, let alone the whole stack.
Caret
caret is an R package designed to build complete machine learning pipelines, with tools for everything from data ingestion and preprocessing to feature selection and automatic model tuning.
mlr
The
Machine Learning in R
project (mlr for short) provides a complete machine learning toolkit
for the R programming language that's frequently used for text analysis.
Java
Java needs no introduction. The language boasts an impressive ecosystem that stretches beyond Java itself and includes the libraries of other JVM languages such as Scala and Clojure. Beyond that, the JVM is battle-tested and has had thousands of person-years of development and performance tuning, so Java is likely to give you best-in-class performance for all your text analysis and NLP work.
CoreNLP
Stanford's
CoreNLP
project provides a battle-tested, actively maintained NLP toolkit.
While it's written in Java, it has APIs for all major languages,
including Python, R, and Go.
OpenNLP
The
Apache OpenNLP project is another machine learning toolkit for NLP. It can be used from any language on the JVM platform.
Weka
Weka
is a GPL-licensed Java library for machine learning, developed at the
University of Waikato in New Zealand. In addition to a comprehensive
collection of machine learning APIs, Weka has a graphical user interface
called the
Explorer, which allows users to interactively develop and study their models.
Weka supports extracting data from SQL databases directly, as well as deep learning through the
deeplearning4j framework.
SaaS APIs
Using a SaaS API for text analysis has a lot of advantages. To see why, consider what using a library involves: a tool implies the rest of its ecosystem. Getting started with a library for Python, for example, means setting up a Python installation and getting the Python package manager running.
As Neal Stephenson remarked, "every broken python installation is
broken in its own way". Getting all the pieces working can be a painful
process, and doubly so if you're working on Windows.
And that's just to get the programming language and its tooling
working. Libraries might involve specific set up steps, like installing
system libraries for linear algebra, getting GPU drivers, etc.
SaaS APIs provide ready-to-use solutions. You give them data and they return the analysis. Every other concern (performance, scalability, logging, architecture, tools, etc.) is offloaded to the party responsible for maintaining the API.
The only development effort required is integrating the API with your codebase: you just need to write code to call the API and get the results back.
SaaS APIs for text analysis usually provide ready-made client libraries for a number of programming languages, which simplifies development further: instead of writing your own client for a REST API, you just call a library.
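Under the hood, that client code boils down to a single HTTP call. Here's a generic sketch with the requests library; the endpoint, auth header, and payload shape are hypothetical, not any particular vendor's API:

```python
# A generic sketch of calling a SaaS text analysis endpoint over HTTP.
# The URL, auth header, and payload shape below are hypothetical.
import requests

response = requests.post(
    "https://api.example-text-analysis.com/v1/sentiment",  # hypothetical
    headers={"Authorization": "Token <YOUR_API_KEY>"},
    json={"texts": ["the new feature works like a dream"]},
    timeout=30,
)
print(response.json())
```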
SaaS APIs usually provide ready-made integrations for tools like
Zapier or
Google Sheets.
This will allow you to build a truly no-code solution: you can arrange
integrations to feed data from a cloud source directly into the SaaS API
and store the results back in your cloud. All of this with zero lines
of code and without the need to have a programming background.
Some of the most well-known SaaS solutions and APIs for text analysis include:
- MonkeyLearn
- Google Cloud NLP
- IBM Watson
- Lexalytics
- MeaningCloud
- Amazon Comprehend
- Aylien
Training Datasets
If you talk to any data science professional, they'll tell you that
the true bottleneck to building better models is not new and better
algorithms, but more data.
Indeed, in machine learning, data is king: a simple model, given tons
of data, is likely to outperform one that uses every trick in the book
to turn every bit of training data into a meaningful response.
So, here are some high-quality datasets you can use to get started:
Topic Classification
-
Reuters news dataset: one of the most popular datasets for text classification; it has thousands of articles from Reuters tagged with 135 categories according to their topics, such as Politics, Economics, Sports, and Business.
-
20 Newsgroups: a very well-known dataset that has more than 20k documents across 20 different topics (scikit-learn can fetch it for you; see the snippet below).
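For instance, scikit-learn can fetch 20 Newsgroups for you, which makes it a convenient first dataset to experiment with:

```python
# scikit-learn downloads and caches 20 Newsgroups on first use.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
print(len(train.data))         # ~11,000 training documents
print(train.target_names[:5])  # a few of the 20 topic labels
```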
Sentiment Analysis
-
Product reviews:
a pretty big dataset with millions of customer reviews from products on
Amazon. Besides the reviews, it provides useful metadata such as star
ratings, super useful for training a sentiment analysis model.
-
Twitter airline sentiment on Kaggle:
another widely used dataset for getting started with sentiment
analysis. It contains more than 15k tweets about airlines (tagged as
positive, neutral, or negative).
-
First GOP Debate Twitter Sentiment: another useful dataset with more than 14,000 labeled tweets (positive, neutral, and negative) from the first GOP debate of the 2016 U.S. presidential election cycle.
Other Popular Datasets
-
Spambase: this dataset contains 4,601 emails tagged as spam and not spam.
-
SMS Spam Collection: another dataset for spam detection. It has more than 5k SMS messages tagged as spam and not spam.
-
Hate speech and offensive language: a dataset with more than 24k tagged tweets grouped into three tags: clean, hate speech, and offensive language.
Finding high-volume and high-quality training datasets is the most important part of text analysis, more important than the choice of programming language or tools for creating the models. Remember, the best-architected machine-learning pipeline is worthless if its models are backed by unsound data.
Text Analysis Tutorials
The best way to learn is by doing. Theory comes later.
First, we'll go through programming-language-specific tutorials using
open-source tools for text analysis. These will help you deepen your
understanding of the available tools for your platform of choice.
Then, we'll take a step-by-step tutorial of MonkeyLearn so you can get started with text analysis right away.
Tutorials Using Open Source Libraries
In this section, we'll look at various tutorials for text analysis in
the main programming languages for machine learning that we listed
above.
Python
NLTK
The official
NLTK book is a complete resource that teaches you NLTK from beginning to end. In addition, the
reference documentation is a useful resource to consult during development.
Other useful tutorials include:
SpaCy
spaCy 101: Everything you need to know: part of the official documentation, this tutorial shows you everything you need to know to get started using SpaCy.
This tutorial shows you how to build a WordNet pipeline with SpaCy.
Furthermore, there's the
official API documentation, which explains the architecture and API of SpaCy.
If you prefer long-form text, there are a number of books about or featuring SpaCy:
Scikit-learn
The official
scikit-learn documentation contains a number of tutorials on the basic usage of scikit-learn, building pipelines, and evaluating estimators.
Scikit-learn Tutorial: Machine Learning in Python shows you how to use scikit-learn and Pandas to explore a dataset, visualize it, and train a model.
For readers who prefer books, there are a couple of choices:
Keras
The official
Keras website has extensive API as well as tutorial documentation. For readers who prefer long-form text, the
Deep Learning with Keras book is the go-to resource. The book uses real-world examples to give you a strong grasp of Keras.
Other tutorials:
-
Practical Text Classification With Python and Keras:
this tutorial implements a sentiment analysis model using Keras, and
teaches you how to train, evaluate, and improve that model.
-
Text Classification in Keras:
this article builds a simple text classifier on the Reuters news
dataset. It classifies the text of an article into a number of
categories such as sports, entertainment, and technology.
TensorFlow
TensorFlow Tutorial For Beginners
introduces the mathematics behind TensorFlow and includes code examples
that run in the browser, ideal for exploration and learning. The goal
of the tutorial is to classify street signs.
The book
Hands-On Machine Learning with Scikit-Learn and TensorFlow helps you build an intuitive understanding of machine learning using TensorFlow and scikit-learn.
Finally, there's the official
Get Started with TensorFlow guide.
PyTorch
The official Get Started Guide from PyTorch shows you the basics of PyTorch. If you're interested in something more practical, check out this chatbot tutorial, which shows you how to build a chatbot using PyTorch.
The
Deep Learning for NLP with PyTorch tutorial is a gentle introduction to the ideas behind deep learning and how they are applied in PyTorch.
Finally, the official
API reference explains the functioning of each individual component.
R
Caret
A Short Introduction to the Caret Package shows you how to train and visualize a simple model.
A Practical Guide to Machine Learning in R shows you how to prepare data, build and train a model, and evaluate its results. Finally, the official documentation is super useful for getting started with caret.
mlr
For those who prefer long-form text, there's an extensive mlr tutorial paper on arXiv. It's closer to a book than a paper, with extensive and thorough code samples for using mlr. There's also the official mlr cheatsheet, a handy resource to have when debugging.
Java
CoreNLP
If you're interested in learning about CoreNLP, you should check out Linguisticsweb.org's tutorial, which explains how to quickly get started and perform a number of simple NLP tasks from the command line. Moreover, this CloudAcademy tutorial shows you how to use CoreNLP and visualize its results. You can also check out this tutorial specifically about sentiment analysis with CoreNLP. Finally, there's this tutorial on using CoreNLP with Python, which is useful for getting started with the framework.
OpenNLP
First things first: the official
Apache OpenNLP Manual should be the
starting point. The book
Taming Text was written by an OpenNLP developer and uses the framework to show the reader how to implement text analysis. Moreover,
this tutorial takes you on a complete tour of OpenNLP, including tokenization, part of speech tagging, parsing sentences, and chunking.
Weka
The Weka library has an official book, Data Mining: Practical Machine Learning Tools and Techniques, that comes in handy for getting your feet wet with Weka.
If you prefer videos to text, there are also a number of MOOCs using Weka:
Practical Tutorial with MonkeyLearn
Let's imagine that you're concerned about the productivity of your
customer support team. They tell you that they lose hours every day
sorting through incoming tickets. Manually classifying topics and
sending each ticket to the appropriate department is making their job
very difficult.
You need a solution that will help your customer support team sort
through support tickets, and you'd like to know how machine learning can
do the heavy lifting. Your dev team just doesn't have the time or
experience in machine learning, and you don't want to hire new staff.
The good news is that
MonkeyLearn
offers a simple and easy-to-use platform to build your own text
analysis models, without needing to code or having prior machine
learning knowledge or experience.
But how does MonkeyLearn work? Here's a short step-by-step tutorial
on how to build and train two different text analysis models: a text
classifier and a text extractor.
Let's start with the classifier!
Create Your Own Text Classifier
In this six-step tutorial, we'll cover the basics of creating a text classifier model and putting it to the test. Let's get going.
1. Create a new model
First, go to your
dashboard and click on
'Create a Model', then select 'Classifier':
2. Now choose 'Topic Classification'
3. Upload your data for training the classifier
The first thing your new model needs is data. You can import data
from CSV or Excel files, and from Twitter, Gmail, Zendesk, Freshdesk
and other third-party integrations offered by MonkeyLearn:
Once you've uploaded your data, you'll start training your model.
4. Define tags
The next step is to determine the tags you want your text classifier to use when sorting data:
5. Start training your model by tagging examples
For your text classifier to understand your criteria, and to
automatically tag the data you feed it, you'll need to manually input
some examples. Around four samples per tag should be enough for your
model to have a very basic understanding of the information it needs to
classify. Nevertheless, the more time you spend training it, the more
accurate your model will be.
Training your model is super easy. Just assign the appropriate tag to each piece of text, like in the example below:
If you start noticing that some examples are already tagged, that's
machine learning getting the job done! Remember that these are the
model's first steps with your data, so it may make some mistakes along
the way. Just retag if something's off and you'll see more accurate
results in no time.
6. Test your classifier
The last step is to test your text classifier. Once you have finished
tagging the initial set of examples, head over to the 'Run' tab and add
some more samples into the text box for your model to analyze and make
predictions:
If you still see mistakes, you can always return to the 'Build' tab for further training.
Put Your New Text Classifier To Work
Now that your new topic classifier is up and running, all you need to
do is upload new data and let the model do its thing. With MonkeyLearn
you can simply upload more
CSV and Excel files to analyze new data or use one of our
integrations:
If you know how to code, you can also use
MonkeyLearn's API to run your model using Python, Ruby, PHP, Javascript, or Java:
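A rough sketch of the Python version (the API key and model ID are placeholders you'd replace with your own):

```python
# Rough sketch of classifying new texts with MonkeyLearn's Python client.
# Replace the API key and model ID placeholders with your own values.
from monkeylearn import MonkeyLearn

ml = MonkeyLearn("<YOUR_API_KEY>")
response = ml.classifiers.classify(
    model_id="<YOUR_CLASSIFIER_ID>",
    data=["The new feature works like a dream"],
)
print(response.body)  # predicted tags with confidence scores
```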
So now onto text extraction. Let's say that the results from your
latest customer satisfaction survey have just arrived and you need to
find a way of sifting through open answers to replace time-wasting
manual procedures. These answers contain important information about how
customers view your service. Your aim is to extract the most relevant
keywords from these answers and gain some useful insights.
Via MonkeyLearn you can leave all this work to machine learning models, allowing you to quickly and simply extract keywords and expressions from the text provided by your customers. The result? A quick and effective way of finding out the most common topics your customers talk about when asked about your services.
Mix this with a
sentiment analysis check and you'll also know how they feel towards those topics, for example:
positively,
negatively or
neutrally.
So, let's dive into a step-by-step tutorial that'll teach you how to
build your own text extractor with MonkeyLearn. You'll find a quick
rundown on how to easily set up your first text extraction model, train
it and test it in just a couple of minutes.
1. Create a new extractor
Sign in to MonkeyLearn and go to your
dashboard, click on
'Create a model' and then 'Extractor':
2. Select how you want to feed the model data
Now you can upload data from platforms like Twitter or Gmail, or
customer support services like Front, Zendesk and Freshdesk, as well as
RSS and either CSV or Excel files. Once you've done this, the system
will automatically import your data and allow you to begin training your
new text extraction model:
3. Create tags for your model to extract
You just need a minimum of two tags to start training your text
extraction model, but you can always add more along the way if you feel
that there's important data missing:
In the example above, we're choosing tags that will give us insights into how customers view three important aspects of our service. We've extracted data from our latest satisfaction survey using the tags Support, UX/UI, and Pricing.
4. Train your new text extractor
Once you've told your extractor what tags to look out for, it's time
for you to manually tag some examples by highlighting words and
expressions related to them. After just a little training, you'll start
to notice that the extractor begins predicting tags based on your
criteria:
It's always a good idea to go back to the 'Build' tab and continue
training your model so that it's able to detect language variations you
want it to recognize.
What's next? You're ready to use your new text extraction model. You can feed it new data by uploading new
CSV and Excel files, or using one of MonkeyLearn's
integrations:
As we mentioned earlier, MonkeyLearn also allows you to integrate your extractor with its
API, using Python, Ruby, PHP, Javascript, or Java:
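A rough sketch of what that looks like with the Python client; again, the API key and extractor ID are placeholders:

```python
# Rough sketch of running a custom extractor via MonkeyLearn's Python client.
# Replace the API key and extractor ID placeholders with your own values.
from monkeylearn import MonkeyLearn

ml = MonkeyLearn("<YOUR_API_KEY>")
response = ml.extractors.extract(
    model_id="<YOUR_EXTRACTOR_ID>",
    data=["Support was great but the pricing page is confusing"],
)
print(response.body)  # extracted snippets grouped by tag
```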
Takeaway
Text analysis is no longer an exclusive topic reserved for software engineers with machine learning experience. It has become a powerful tool that helps businesses across every industry gain useful, actionable insights from their text data. Saving time, automating tasks, and increasing productivity has never been easier, allowing businesses to offload cumbersome work and help their teams provide better service to their customers.
If you would like to give text analysis a go, sign up to
MonkeyLearn
for free and begin training your very own text classifiers and
extractors – no coding needed thanks to our user-friendly interface and
integrations.
Reach out to our team if you have any doubts or questions about text analysis and machine learning, and we'll help you get started!