DISCLAIMER: This is a post for beginners who want to get started with Machine Learning without writing any code. If you already know how to code and use Machine Learning libraries, you might not get much out of it.
Machine Learning at its core is function approximation. Given a set of input variables, Machine Learning algorithms try to come up with a function that can accurately predict the output.
There are various algorithms to come up with this function approximation, also called a model. The one that gets the most press is Neural Networks, also known as Deep Learning, which is what we will cover in this post.
There are three main steps to a Machine Learning process:
- Data Source & Prep
- Model Training
- Prediction Output
We will fully cover the first two steps in this tutorial while only briefly touching on the last step.
Data Source & Prep
Data sourcing, cleaning and transformation is the most time-consuming, and also the most impactful and important, part of any Machine Learning project. Garbage in, garbage out holds true in most programming, but it holds doubly true in Machine Learning.
When you read news stories about image recognition gone astray or health care AI not working correctly, the root cause is most likely a data problem: either not gathering enough data, or gathering data that does not represent the true population.
In a real-world Machine Learning project, finding the data you need, cleaning it and transforming it will take a lot of time and cause plenty of headaches. For learning purposes, however, we can leverage datasets that have already been curated and organized for Machine Learning: the UCI repository, Kaggle datasets and ML Data.
For this post we will use a dataset that contains the votes of members of Congress on 16 different issues in 1984. Using these votes we will try to predict whether a member is a Democrat or a Republican.
The values for the votes are y, n and ?, so we do not need to clean anything up. However, suppose the values were Yes, y, Y, no, No, ?? and blank. Then we would need to clean up the inconsistent values and make sure that the yes, no and missing votes each map to a single value.
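This tutorial is code-free, but for readers curious what that kind of cleanup looks like in practice, here is a minimal Python sketch. The `clean_vote` helper and the raw values are hypothetical, chosen just to mirror the messy values described above:

```python
def clean_vote(raw):
    """Map messy vote values onto the canonical y / n / ? labels."""
    normalized = raw.strip().lower()
    if normalized in ("y", "yes"):
        return "y"
    if normalized in ("n", "no"):
        return "n"
    # Treat '??', blanks, and anything else unrecognized as missing.
    return "?"

raw_votes = ["Yes", "y", "Y", "no", "No", "??", ""]
cleaned = [clean_vote(v) for v in raw_votes]
print(cleaned)  # ['y', 'y', 'y', 'n', 'n', '?', '?']
```

Real-world cleanup is rarely this tidy, but the idea is the same: decide on one canonical value for each category and map every variant onto it.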
In order to train the model, we need to understand the concept of training and test data.
Machine Learning is an inductive way to estimate a function, compared to a deductive or rules-based approach. Instead of using logic and theory to derive the function, you feed data to an algorithm and let it infer the function from the data.
If you were to train your model and then test its accuracy using all of the same data, the reported accuracy would be misleadingly high, because the model is never tested on anything it hasn't already seen. It would be as if you practiced a set of 100 math problems and your math exam had those exact 100 problems. Your score on that exam would probably be really high, but it would say little about how well you had learned the material.
In order to avoid this problem, we break up our dataset into a test set and a training set. We will use 70% of the data points to train the model and the other 30% to test the model’s accuracy. This way we will have a better idea on how accurate our model is.
We can do the training and test split at the onset but thankfully the software we will use to train the model will do that for us.
We will be using a free program called Weka, which requires Java to be installed. Weka can be downloaded here: https://sourceforge.net/projects/weka/.
NOTE: I am going to be using Weka 3.8, so there might be some differences if you use another version.
After you have downloaded the dataset and installed Weka, fire it up. On the main screen click Open File, go to the folder where you downloaded the dataset, select CSV files under Files of Type, and open the file.
After you have opened it, you will see all the different columns, also known as attributes, and their values, both visualized and in table form. The class dropdown should be set to the output attribute, i.e. what we are trying to predict. In this case it is the political_party attribute, split between 267 Democrats and 168 Republicans.
Now that our data is imported, it is time to train our model. Woot!
Click on the Classify tab at the top toolbar. Then click on the big Choose button right underneath. What pops up is the list of all algorithms that Weka has for you to train your model.
We are going to choose the Neural Network algorithm, which is under Functions -> MultilayerPerceptron.
Now we are going to do two crucial things:
- Specify how much of our data we want to use for the training and test set
- Specify values for our hyper-parameters that we want the algorithm to use
Training and Test Set
Under Test Options, select Percentage Split and specify 70%. That means we will use 70% of the data to train and the remaining 30% to test. There is no exact way to know what the best split should be, and this is part of what makes Machine Learning more of an experiment than other kinds of programming. The general standard is to use 65–80% of the dataset to train.
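Under the hood, a percentage split is conceptually just a shuffle and a cut. A minimal Python sketch of the idea, using the fact that this dataset has 435 records (267 Democrats plus 168 Republicans):

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into a training and a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(435))  # stand-ins for the 435 voting records
train, test = train_test_split(rows)
print(len(train), len(test))  # 304 131
```

The shuffle matters: if the file happens to be sorted by party, a straight cut at 70% would give you a training set full of one party and a test set full of the other.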
Hyper-parameters
This is another area of experimentation in Machine Learning. Hyper-parameters, the tunable settings of an algorithm, differ from algorithm to algorithm, and changing them can greatly affect the accuracy of the model.
What the hyper-parameters mean for different algorithms, and how to go about finding their ideal values, is well out of scope for this tutorial. You will likely need to take a Machine Learning course or watch several tutorials on those specific topics to understand them.
There is also a newer technology called AutoML that has started to gain adoption, in which software cycles through different values for the hyper-parameters programmatically and determines what the ideal values should be.
For this tutorial we will take the trial-and-error approach and see which hyper-parameter values give us the best result.
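The trial-and-error loop that AutoML automates (and that we will do by hand in Weka) can be sketched as follows. The `evaluate` function here is a dummy stand-in for "train the model and measure test accuracy", and the candidate values are made up; it is only meant to show the shape of the search:

```python
def evaluate(learning_rate, hidden_layers):
    """Dummy score standing in for a real train-and-test run."""
    return 1.0 - abs(learning_rate - 0.3) - 0.01 * abs(hidden_layers - 4)

# Try every combination of candidate values and keep the best-scoring one.
best = max(
    ((lr, h) for lr in (0.1, 0.3, 0.5) for h in (2, 4, 8)),
    key=lambda params: evaluate(*params),
)
print(best)  # (0.3, 4)
```

In practice each `evaluate` call is a full training run, which is why this brute-force search gets expensive quickly and why smarter AutoML strategies exist.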
To select the values, click directly on the line reading MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a, right next to the Choose button. This should pop open a window as seen below.
The two hyper-parameters we will be tinkering with today are hiddenLayers and learningRate. These control the number of hidden layers the Neural Network should have, and how large a step the algorithm takes each time it updates the network's weights.
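To build some intuition for learningRate, here is a simplified single-weight update step. This is an illustration of plain gradient descent, not Weka's exact update rule, which also involves a momentum term (the -M flag):

```python
def update_weight(weight, gradient, learning_rate=0.3):
    """One gradient-descent step: move the weight against the error gradient.

    The learning rate (Weka's -L flag, 0.3 by default) scales how far each
    update moves the weight. Too large and training can overshoot and
    oscillate; too small and training crawls.
    """
    return weight - learning_rate * gradient

w = 1.0
print(update_weight(w, 0.5))        # 0.85  (default 0.3 step)
print(update_weight(w, 0.5, 0.01))  # 0.995 (tiny step)
```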
Let’s try to run with the provided learning rate and hidden layers and see our results. Ensure that the class is political_party and click on Start.
After running the model, you will get a plethora of results. The key result for this dataset and experiment is the number of correctly classified instances, and voila! Our model achieves 96% accuracy. Just a heads up: depending on the dataset or the model, other metrics might be more appropriate.
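The "correctly classified instances" figure is simply the fraction of test rows whose predicted party matches the true party. A quick sketch with made-up labels, not the actual Weka output:

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted = ["democrat", "democrat", "republican", "democrat"]
actual    = ["democrat", "republican", "republican", "democrat"]
print(accuracy(predicted, actual))  # 0.75
```

On a roughly 60/40 dataset like this one, accuracy is a reasonable metric; on a heavily imbalanced dataset (say 99% one class) it can be misleading, which is why other metrics sometimes matter more.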
Since the accuracy is already so good with the default parameters, I won't experiment further with the hyper-parameters, but you should definitely give it a whirl. Experimenting with different hyper-parameter values, datasets and train-test splits will give you better intuition about what works and make you a better Machine Learning practitioner.
But that’s it! Congrats! You just did deep learning with a file and without writing code!
Now that we have our model, the fun part is predicting whether a member of Congress is a Democrat or a Republican given only how they voted on these issues.
First we need to save our model. Right-click on the model in the result list in the bottom-left pane, click Save Model, and choose where to save it.
To go through the rest of the prediction process you can read more here. Or stay tuned for another tutorial from us.
Happy Machine Learning :)