We recently had a client that wanted to incorporate a Machine Learning Algorithm into their business infrastructure, to help deliver Artificial Intelligence (AI) capabilities to provide a better service to their customers.

We are well aware here at Sheda that AI will be a key advantage for many businesses, so I set out to investigate.

The problem was to automatically classify information received into different categories, a problem that was very similar to a spam filtering service, so I was tasked with putting together a proof-of-concept for this backend system.

One of the challenges that I found while trying to build this system was that while there were some great tutorials to explain how to set up certain parts of the pipeline needed to deliver the service, actually integrating and plugging all the different parts together was largely undocumented.

What I ended up doing was putting together two pipelines. The first builds the Machine Learning Algorithm that creates our predictions for each email.

The second connects this pipeline to an actual server so it can be used within a production environment.

Let's focus first on creating the Machine Learning Pipeline.

The entire integration for this system has been split up into two blog posts:

Part 1 - Creating a Machine Learning Algorithm using Databricks.

Part 2 - Connecting a Node.js API Service to retrieve AI Predictions using the Databricks API (to come).

The first part of this blog series is very much a beginner's guide to using Databricks. If you’re already familiar with it, you might want to read the section on Data extraction and cleanup, which outlines how to prepare your data for analysis.

Then skip the Getting started with Databricks section and go straight to Creating a table in Databricks, which is the beginning of the integration in Databricks.

Sheda uses design and emerging technologies like AI and blockchain to solve complex problems that make and accelerate impact and bring about positive social change.

Machine Learning Pipeline

The problem of spam detection is a classic example of a Binary Classification problem. These problems are defined as ones with two mutually exclusive outcomes.

Some concrete examples include predicting if a tumour is benign or malignant, or if a team will win or lose its next match.

Check out a very cool visual resource explaining the basics of Machine Learning put together by the crew at r2d3 for further reading.

Here is the architecture overview for the system we will be building in Part 1:

Architecture overview. Image: Timon Sotiropoulos, Sheda.

The implementation we will be using to solve our problem in this tutorial is the Naive Bayes Classifier.

This algorithm will take in pre-labeled data and build a huge network of probability values for each word that it encounters.

The more times a single word is found in spam labeled emails, the more weight that word will have towards the algorithm predicting that future emails containing that word will also be a spam email.

So by feeding our algorithm thousands of emails, it starts to get a clearer picture of which words are common in spam and which aren’t.
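To make the idea concrete, here is a minimal sketch of a word-count Naive Bayes classifier in plain Python. It is not the Spark implementation used later in this tutorial; the function names, the tiny training set and the smoothing value of 1.0 are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(emails, smoothing=1.0):
    """Count words per label. `emails` is a list of (label, text) pairs."""
    word_counts = {}        # label -> Counter of word occurrences
    doc_counts = Counter()  # label -> number of emails seen with that label
    vocabulary = set()
    for label, text in emails:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "vocabulary": vocabulary, "smoothing": smoothing}

def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, float("-inf")
    for label, counts in model["word_counts"].items():
        # Start from the prior: how common this label is overall.
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Smoothing stops unseen words from zeroing out the probability.
            score += math.log((counts[word] + model["smoothing"]) /
                              (total_words + model["smoothing"] * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, far smaller than anything you would use in practice.
training = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("not_spam", "meeting agenda for monday"),
    ("not_spam", "lunch on monday"),
]
model = train(training)
```

Calling `predict(model, "free money")` picks the spam label here because "free" and "money" only ever appeared in the spam examples, which is the weighting effect described above.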

I have broken this section of the problem down into the following steps:

  • Data extraction and cleanup;
  • Text normalisation;
  • Indexing and feature extraction; and
  • Training, predicting and analysing.

Remember that this process can be completed for just about any binary classification problem based around emails.

If you wanted your algorithm to predict whether or not an email was social, you would simply use training data labelled as social or not social, with all the following steps remaining roughly the same.

Data extraction and cleanup

The first step was to gather a large amount of emails to use as our training data for our algorithm.

I found the easiest way to do this was to export a large number of emails from Gmail, using their archiving service.

In Gmail you can export specific labels from your email account, so I created two labels, one named Spam and one named NotSpam and used Gmail’s filtering service to assign emails to these two labels.

To train a production-ready algorithm, I used about 5000 emails, about 500 of which were spam.

Remember, when picking training data, the more broad the range of emails, the ‘smarter’ your algorithm will become.

Once you have requested your archive from Gmail, they will send you an mbox file that contains all your email data.

Note: I do advise while following this tutorial to use a smaller subset of your data, just to make sure the processes are working 100 per cent as on large datasets different processes can take quite a lot of time and computing power.

The next step was to write a little Python script to get the data ready for loading into Databricks.

The plan here is to create some initially clean and consistent data from our emails, where each email is reduced down to its basic text.

To run the script you will need to install Python 2.7 and the Python package manager. The script also uses the Beautiful Soup package, which we use to clear out any HTML tags present in our emails.

Your script will need to do the following things:

  • Read and locate the body data from all emails in your provided mbox files;
  • Strip out all HTML, CSS and JavaScript tags and code (I used Beautiful Soup to do this); and
  • Write the file out to a CSV file.
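The three steps above can be sketched roughly as follows. This is not the script from the repository: it assumes Python 3 and swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra packages. The function names and the two-column layout (label, then email text) are assumptions.

```python
import csv
import mailbox
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Remove tags, scripts and styles, then collapse runs of whitespace."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

def body_text(message):
    """Return the cleaned text of the first text/plain or text/html part."""
    for part in message.walk():
        if part.get_content_type() in ("text/plain", "text/html"):
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:  # unknown charset name in the email header
                text = payload.decode("utf-8", errors="replace")
            return strip_html(text)
    return ""

def mbox_to_csv(mbox_path, label, csv_path):
    """Write one (label, body text) row per email in the mbox file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for message in mailbox.mbox(mbox_path):
            writer.writerow([label, body_text(message)])
```

You would call `mbox_to_csv` once per exported label, for example once for the Spam export and once for the NotSpam export, and then combine the resulting files.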

My complete Python script can be found under the Email-Extractor folder inside the complete GitHub repo for this tutorial. Usage instructions are available there too.

Getting started with Databricks

The next part of this tutorial will be where we actually write our code to create and train our Machine Learning Algorithm.

To do this we are going to use a service called Databricks.

This is a tool that creates a distributed computing network on cloud infrastructure to run Apache Spark, a cluster-computing engine that ships with its own machine learning library.

It automatically connects up different virtual machines and allows you to run large computing processes across this network, meaning much faster computational speeds.

Overall, it does all the nitty-gritty setup so we can just get into writing our classifier.

While Databricks will automatically create your clusters for running your Machine Learning operations, you will need to have an Amazon Web Services (AWS) account to actually host your virtual machines.

This is connected up when you create your Databricks account, but be aware the smallest nodes you can run when creating your clusters are c3.2xlarge AWS instances.

For the full list of instance sizes supported by Databricks check this out.

There is a Community Edition of Databricks that allows you to test and run your notebooks on their platform. However, you are limited to one driver and one worker node, both of which are quite small, so depending on the amount of data you are processing you may need to use the full version.

Starting our first cluster and running our first task

The first step is creating a Databricks cluster and running the task to make sure that we have managed to get everything configured correctly.

So after creating a new account, you will want to navigate to the cluster page and create your first cluster.

Clusters are made up of AWS instances and the smallest one you can create through Databricks is still quite large, so I would suggest starting with only one driver and one worker node and watching your usage on AWS.

If you start to get above the 50 per cent usage range it might be worth starting up another worker node.

For this tutorial we are going to be using Apache Spark 2.0.0 and Scala 2.10. You can also write your notebooks for Databricks in Python if you prefer.

If you are using the full version of Databricks then you will have a few more options than above, including the ability to run multiple worker nodes if your processes require it.

Once your cluster is running, return to the homepage and create a new notebook.

This will prompt you to give it a name and set the language for the notebook.

Give it an appropriate name and then set the language to Scala.

This will take you to your new notebook.

The next step is to attach your notebook to a cluster.

If you try to run a command now then you should be prompted to attach your notebook to a cluster.

Alternatively, you can click the 'Detached' button and select one of your running clusters to get your notebook ready to run.

Next, add the following line of code into the first section of your notebook and hit Shift + Enter to run the cell.

All going well you should see something like the following:

A note about Notebooks

Notebooks are the way that Databricks executes your code on its clusters.

A notebook runs each command in turn, and variables that you set in one cell will be available for use in subsequent cells.

The benefit of this process is that we can cache our work and long running processes as we move through our pipeline, so if we want to tweak certain parts of our algorithm for example, we don’t have to rerun the entire pipeline.

It's also worth noting that while each Databricks notebook has a main programming language, any cell can be switched to another language that Databricks supports by starting the cell with a % sign followed by the name of the language.

So if we wanted to run something in SQL, we could write it as shown here:

Let’s talk about Dataframes

One of the core aspects to understand with Databricks is their DataFrames API.

While I strongly suggest you read this blog post from Databricks, which gives an in-depth explanation, DataFrames are similar to a relational database table optimised to work in a distributed computing environment.

Alright, now that we are all set up on Databricks, let us get to building our Machine Learning Algorithm!


Creating a table in Databricks

The first thing we need to do is create a table from our CSV data on Databricks.

Click on the 'Create Table' navigation button on the Databricks sidebar.

Set the 'Data Source' as a file and then upload the CSV file you created with your Python script from earlier.

When it finishes uploading your screen should look something like this:

What this has done is upload a copy of the file to your Databricks File System, making it available for use within your notebooks.

Copy down the file location under the Uploaded to DBFS line, as we will need this in our notebook.

Now we need to load the SQL context (sqlContext) and the Spark context.

These are created to connect the execution environment with the Spark environment, allowing you to get access to all the Spark APIs when you need them.

We also take this opportunity to set our FileStore to a variable for later use.

Next we need to define the schema that our DataFrame is going to contain.

Our schema is just the two columns of our CSV, so we will create a struct in Scala to contain these.

We then set the options of our sqlContext to read in the data from our CSV. 

Passing a format allows it to load the file faster. We then tell it that the file is comma separated, that it has no header row, and that it should read the data in PERMISSIVE mode, which simply means inserting nulls for missing values and trying to read every line.
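If the idea of a permissive read is unfamiliar, the behaviour can be approximated in plain Python. This is only a sketch of the concept and not Spark's CSV reader: rows with missing values are padded with `None` rather than rejected. The `label` column name is an assumption; `emailText` is the column name used later in this tutorial.

```python
import csv
import io

COLUMNS = ("label", "emailText")  # "label" is a hypothetical column name

def read_permissive(handle, columns=COLUMNS):
    """Read headerless CSV rows, filling missing trailing values with None."""
    rows = []
    for record in csv.reader(handle):
        if not record:  # skip completely blank lines
            continue
        # Keep at most len(columns) values and pad short rows with None.
        padded = record[:len(columns)] + [None] * (len(columns) - len(record))
        rows.append(dict(zip(columns, padded)))
    return rows

# Second row is missing its email text on purpose.
sample = io.StringIO('spam,"Win money, now"\nnot_spam\n')
rows = read_permissive(sample)
```

The second row survives the read with an empty `emailText`, so you can decide later whether to filter such rows out instead of having the whole load fail.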

Finally, before we finish, we will cache the DataFrame so that on subsequent runs it will not take as long to process the data, then we create a temporary table so we can run an SQL query on the data table.

It's a good habit to start checking your DataFrame as you progress through this tutorial.

This can be quickly done by calling the show function or alternatively if you register a temporary view as we have above, we can query this table using regular old SQL.

Remember, the name of the temp table will be the string you pass to the createOrReplaceTempView function.

To do so we set the cell of the notebook to SQL, then use a select query to get the information stored in the DataFrame.

Refer back to this cell and query your DataFrames often to check the changes you are making in each of your cells.

Normalising our email text

The next step in creating our Machine Learning Pipeline is to clean up and normalise our text.

We want to get it ready so that the text can be used as features for our Machine Learning Algorithm, so this means grouping common attributes together.

This can include normalising data such as currency symbols, email addresses, numbers and more.

To do this normalisation, we will create special User Defined Functions that allow us to run each of the rows of our DataFrame through the regex for that particular normalisation function.

Remember, the aim here is to clean out as much of the noise in your dataset as possible.

We want to refine what we are looking at so that the Machine Learning Algorithm can accurately distinguish important keywords from unimportant ones when predicting if we have spam or not.

Here is an example of the normalizeURL function. It will regex match any URL and then replace that with the text string normalizedurl.

This groups all this information into something that will be more useful come analysis time.

We have a number of other normalisation steps that we take as well, such as setting all text to lower case and removing any punctuation.

The order of these steps is also important. For example, if we remove all punctuation first, our regex to find URLs won’t find any matches!
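The effect of ordering can be seen in a small plain-Python sketch. The function names and the URL regex below are assumptions for illustration and are simpler than the ones in the notebook; only the replacement string normalizedurl comes from this tutorial.

```python
import re
import string

# Deliberately simple pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def normalize_url(text):
    """Replace every URL with the placeholder token normalizedurl."""
    return URL_PATTERN.sub("normalizedurl", text)

def to_lower(text):
    return text.lower()

def remove_punctuation(text):
    return text.translate(PUNCTUATION_TABLE)

email = "Claim your prize at http://example.com/win?id=42 TODAY!"

# Nested calls run from the inside out: lower-case, then URLs, then punctuation.
right_order = remove_punctuation(normalize_url(to_lower(email)))

# Stripping punctuation first destroys the "://" the URL regex relies on.
wrong_order = normalize_url(remove_punctuation(to_lower(email)))
```

With the right order the URL collapses into the single token normalizedurl. With the wrong order it is left behind as a one-off string of letters and digits that no other email will share, which is exactly the kind of noise we are trying to remove.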

All normalisation steps can be found in the full notebook for this tutorial.

The next step is to run each row of our emailText column through all of these normalisation functions.

The way that the DataFrame works is we take our current table and then return a new one with an extra column containing all of our changes.

Remember, the code works in reverse order, passing each return string up the ladder, so the functions you want to run first are actually the last ones in the row.

Here is the code snippet:

This is a good opportunity to show your results by printing out the SQL Table.

If you have made any changes to your normalisation functions, it’s a good idea to test them on a small subset of your data and make sure they don’t have any unexpected side effects.

Indexing our labels and extracting our features

Now that we have normalised our data, we have to convert those words into language that the Machine Learning Algorithm can understand.

The first step of this is to index our labels, so that instead of the algorithm seeing the words 'spam' and 'non_spam', it instead sees values of 0 or 1.

Thankfully, the Apache Spark machine learning library already has a StringIndexer that will handle this operation. There is also IndexToString for when you want to convert the indices back into the original labels.
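As a rough picture of what label indexing does, here is a plain-Python sketch. It mimics StringIndexer's default of giving the most frequent label index 0, but the function names are assumptions and this is not the Spark API.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

def index_to_string(mapping):
    """Invert the mapping so predictions can be turned back into labels."""
    return {index: label for label, index in mapping.items()}

labels = ["non_spam", "spam", "non_spam", "non_spam", "spam"]
label_index = fit_label_index(labels)
indexed = [label_index[label] for label in labels]
```

Because non-spam emails usually outnumber spam, 'non_spam' ends up as 0.0 and 'spam' as 1.0 in this sketch. It is worth checking which way round your own labels were indexed before reading any results.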

To extract our features we are going to split our emails into individual words.

This can be done using the Tokenizer from the Apache Spark library.

Next we will create a Term Frequency Matrix, which is a fancy way of saying a table that contains the number of times a word appears across a range of documents (or in our case emails).

If that seems a little confusing, another similar process is a Bag of Words Model, where one finds all the words that appear in all the emails and adds up the number of times they occur.

To create a simple Machine Learning Algorithm based on the bag of words model, you could do this for only the spam emails, find the most common words and then check if new emails contained these words to get your prediction.

The term frequency matrix works very well with the Naive Bayes model we are going to use as our classifier. As you may remember, it assigns each word that appears in our emails a probability indicating how likely that word is to signal spam.

Again, the Apache Spark machine learning library has a handy tool for creating these term frequency matrices called HashingTF.

This object allows us to set the number of features we would like our algorithm to have, an important variable that will assist with tweaking our algorithm to fine tune its accuracy on our dataset.
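The tokenising and hashing steps can be sketched in plain Python as follows. This illustrates the hashing trick in general and is not Spark's implementation: Spark uses a different hash function, and `zlib.crc32` is swapped in here only because it is available in the standard library. The function names and the feature count of 16 are assumptions.

```python
import zlib

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(tokens, num_features=16):
    """Build a fixed-length term frequency vector using the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        # Each word is hashed to a bucket; the bucket counts its occurrences.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

tokens = tokenize("Free money free prize")
features = hashing_tf(tokens, num_features=16)
```

A small number of features means different words are more likely to collide in the same bucket, while a large number produces longer, sparser vectors. That trade-off is why the feature count is worth tuning.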

Building the Machine Learning Pipeline, creating our Machine Learning Model

Apache Spark has a concept called Pipelines which allow us to chain together the common operations that we would complete each time we run our Machine Learning Algorithm. 

Our Pipeline is going to consist of our StringIndexer, Tokenizer, Term Frequency Matrix and Naive Bayes Classifier.

We will combine these into a Pipeline that will produce a Model. The Model will be responsible for making predictions on unlabelled data that we present to it, based on the training data we used to create it.

The steps involve:

  • Splitting our data into a random assortment of trainingData and testData;
  • Defining the input and output columns for the sections of our Pipeline;
  • Setting the Machine Learning Variables to tweak our Pipeline; and
  • Creating the Model then checking its accuracy on the testData.
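The first of these steps, splitting the data, can be pictured with a short plain-Python sketch. It shuffles and slices the rows, which differs from how Spark samples rows across a cluster. The 70/30 ratio, the seed and the function name are assumptions.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows with a fixed seed and cut them into two lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset: 100 emails, roughly one in ten labelled as spam.
rows = [("spam" if i % 10 == 0 else "non_spam", "email %d" % i)
        for i in range(100)]
training_data, test_data = random_split(rows)
```

Fixing the seed keeps the split reproducible, so a change in accuracy between runs comes from your tweaks rather than from a different random split.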

Here is the code example from our notebook:

Evaluating and testing our Machine Learning Algorithm

The final step of the process is to test our model and create some analytics to see how accurate our model is at predicting.

The simplest way to evaluate the success of your Machine Learning Algorithm is to create a Confusion Matrix that maps the input label of the test email and then categorises it against the predicted label of the algorithm.

You then end up with columns denoting:

  • True Positives: Spam emails the algorithm predicted as spam;
  • False Positives: Non-Spam emails the algorithm predicted as spam;
  • True Negatives: Non-Spam emails the algorithm predicted as non-spam; and
  • False Negatives: Spam emails the algorithm predicted as non-spam.

This table forms the basis for a large number of other metrics. If you are familiar with these, you can calculate them using the BinaryClassificationMetrics class when you need a more detailed analysis of the results.

So our first step in this process is to use our testData and run it through our model to create our predictions.

We can then use a few SQL queries on the resulting DataFrame to create our confusion matrix.
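For reference, the same counts can be worked out in plain Python. This sketch is an alternative to the SQL approach used in the notebook, not a copy of it; the function names and sample labels are assumptions, and precision and recall are included as two examples of metrics derived from the matrix.

```python
def confusion_matrix(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def precision(counts):
    """Of the emails flagged as spam, what fraction really were spam?"""
    flagged = counts["tp"] + counts["fp"]
    return counts["tp"] / flagged if flagged else 0.0

def recall(counts):
    """Of the real spam emails, what fraction did we catch?"""
    spam_total = counts["tp"] + counts["fn"]
    return counts["tp"] / spam_total if spam_total else 0.0

# Hypothetical test labels and model predictions.
actual = ["spam", "spam", "non_spam", "non_spam", "non_spam", "spam"]
predicted = ["spam", "non_spam", "non_spam", "spam", "non_spam", "spam"]
matrix = confusion_matrix(actual, predicted)
```

For a spam filter, false positives are usually the more costly mistake, since a real email ends up hidden from the user, so precision is often the number to watch first.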

Now that we have a way of evaluating our Model and seeing how well it is doing, we can start making tweaks to the number of features and the smoothing of the algorithm to see if we can get the prediction levels a bit higher.

Remember, if your training data isn’t sufficient or broad enough, too much tweaking can cause Overfitting to occur, so try to avoid this!

Now, the last thing we need to do is save out our model to the Databricks Filesystem so that we can load it back up when we want to use it in our Production environment.

Congratulations, you have your own custom-built Machine Learning Algorithm!

Further algorithm reading

Lastly, before we move onto Part 2 of this series, on using the Databricks API to return predictions on new emails, I just want to cover a few terms and additional things that might be useful in trying to improve your algorithm.

Overfitting: This is when the algorithm performs really well on the data it was trained and tuned on, but really poorly on new data in the real world. This can often be resolved by reducing the number of features you are passing to your algorithm or broadening your training data.

Cross Validation: This is another technique for helping to avoid Overfitting, mentioned above. It works by performing the training of your algorithm multiple times on different subsets of your data and then aggregating the result.
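A minimal plain-Python sketch of k-fold cross validation is shown below. The function names are assumptions, and `train_and_score` stands in for whatever code trains a model on one subset and returns its accuracy on another.

```python
def k_fold_indices(n_rows, k):
    """Yield (training, validation) index lists for each of the k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation

def cross_validate(rows, k, train_and_score):
    """Average the score returned by train_and_score over all k folds."""
    scores = []
    for training_idx, validation_idx in k_fold_indices(len(rows), k):
        training = [rows[i] for i in training_idx]
        validation = [rows[i] for i in validation_idx]
        scores.append(train_and_score(training, validation))
    return sum(scores) / len(scores)

splits = list(k_fold_indices(10, 5))
```

Every row is used for validation exactly once and for training in all the other rounds, so the averaged score is less dependent on one lucky or unlucky split.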

Full Tutorial Github Repository.

Complete Notebook Link on Databricks.

Spam Classification Using Sparks Dataframes ML Zeppelin.



Oyem Ebinum

Founder & Director
@ Sheda
As Sheda's Founder, Oyem (Mike) ensures that we are applying the best technology for our customer's needs on both a business and technical level. A guest lecturer and tutor at the University of Melbourne and RMIT University, his interests include applying design & technology to problems in healthcare.
