Pre-GSoC Experience @ mlpack

Learn more about my project from my previous blog
Prologue
On the very first day of drafting my GSoC proposal, I also began experimenting with ideas for a pre-GSoC contribution, because I wanted to get familiar with mlpack’s code base, documentation, APIs, and best practices for contributing.
So I started with a small task: predicting salary from experience using the LinearRegression API provided by mlpack.
I started this task in early April. During this period I had a tough time setting up my local environment for interactive C++ using xeus-cling. I’ll soon write a blog post on how to set up a local environment for experimenting with mlpack.

Objective
Predict the salary of an employee given how many years of experience they have. We will train a linear regression model to learn the correlation between each employee’s years of experience and their salary.
I started with the Python notebook and opened a PR as soon as I completed it. I received many helpful suggestions from my mentors and reviewers. Most of my mistakes were violations of the style guidelines, which I got the hang of during this period.
Later I moved on to the C++ side. It started out as a standalone C++ program and later grew into an interactive C++ notebook, one capable of telling readers a story.
I started with exploratory data analysis on the data. Here is a scatter plot from the C++ notebook, drawn with matplotlibcpp, a header-only C++ wrapper for Python’s matplotlib plotting library.

Linear Regression
Regression analysis is one of the most widely used methods of prediction. Linear regression is used when the dataset shows a linear correlation. As the name suggests, simple linear regression has one independent variable (predictor) and one dependent variable (response).
The simple linear regression equation is y = a + bx, where x is the explanatory variable, y is the dependent variable, b is the slope coefficient, and a is the intercept.
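To make the equation above concrete, here is a minimal sketch of how a and b can be estimated with the textbook least-squares formulas. It uses only the Python standard library. The function name and the toy data are made up for illustration; this is not the code from the notebooks and not how mlpack implements it internally.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) that minimize the squared error of y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Spread of x around its mean; zero means every x is identical.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")

    # Co-variation of x and y around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Made-up toy data, NOT the dataset used in the notebooks.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35000.0, 40000.0, 45000.0, 50000.0, 55000.0]

a, b = fit_simple_linear_regression(years, salary)
predicted_salary = a + b * 6.0  # prediction for 6 years of experience
```

On this toy data the points lie exactly on a line, so the fit recovers an intercept of 30000 and a slope of 5000 per year of experience.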
To perform linear regression I used the LinearRegression() API from mlpack.
Here’s the plot of the best-fit line predicted by the trained model.

Finally, I used evaluation metrics such as MAE (mean absolute error), MSE (mean squared error), and RMSE (root mean squared error) to quantify how well the trained model performed on unseen data.
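For readers unfamiliar with these metrics, here is a small sketch of how MAE, MSE, and RMSE are defined, written with only the Python standard library. The function names and the numbers are hypothetical and chosen for illustration; they are not results from the notebooks.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences; penalizes large errors more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Made-up values for illustration only.
actual = [40000.0, 50000.0, 60000.0]
predicted = [42000.0, 49000.0, 57000.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

MAE and RMSE are in the same unit as the salary, which makes them easier to interpret than MSE.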
Epilogue
I thought of writing a verbose explanation of the approach I followed in both notebooks in this blog. Later I decided to let the notebooks narrate the story and the approach by themselves, instead of me sprinkling some code here and there. Make sure to take a look at the salary prediction notebooks using mlpack by visiting our repo or on Binder.
That’s all for today. I will write another one this weekend for this week’s progress. Stay tuned!