GSoC 2021 @ mlpack Week 1 Progress

David Port Louis
Jun 26, 2021

Fast forward 20 days from the student project announcement, and here I am in week 1 of the coding period. My GSoC project is “Example Zoo”, which consists of 7 sub-modules.

You can learn more about my project and pre-GSoC experience from my previous blog posts.

For the first few days, I concentrated on getting my PR from the pre-GSoC period merged: fixing styling issues and addressing suggestions and change requests from my mentors and reviewers. Concurrently, I started work on my first sub-project, “Predicting Avocado Price”, using linear regression.

Fortunately, I had set up my local environment beforehand, so I was able to start the implementation quickly.

As usual, I started with the Python notebook implementation and completed it swiftly before moving on to the C++ implementation. During this period I had my first meeting with my mentors, Marcus Edel and Kartik Dutt, where we discussed week 1’s targets and blockers. I showcased my Python notebook and set replicating it in C++ as my target for the week.

I made some visualizations in Python, but ran into the issue that C++ lacks good visualization libraries. In the meantime I met Roshan, a fellow GSoC student who was facing the same problem with visualization libraries, and we had a productive discussion. I then approached my mentor for a solution; he suggested using the CPython API to generate the plots in Python by invoking a plotting function from the C++ notebook, then importing the saved figures back into the C++ notebook.
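To illustrate the idea, here is a minimal sketch of what the Python-side plotting helper could look like. The function name, column labels, and output path are my own assumptions, not the project’s actual code; the C++ notebook would invoke something like this through the CPython embedding API (e.g. `PyRun_SimpleString`), and the figure is saved to disk so it can be loaded back as an image widget.

```python
# Sketch of a Python plotting helper callable from a C++ notebook via the
# CPython API. All names here (scatter_plot, axis labels, path) are assumed.
import matplotlib
matplotlib.use("Agg")  # headless backend: render straight to file, no display
import matplotlib.pyplot as plt

def scatter_plot(x, y, path="./plot.png"):
    """Draw a scatter plot of y against x and save it to `path`."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(x, y, s=8)
    ax.set_xlabel("Date")
    ax.set_ylabel("Average Price")
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
    return path

# Example invocation with dummy data (the C++ side would pass real columns):
saved = scatter_plot(range(10),
                     [0.9, 1.0, 1.1, 1.2, 1.1, 1.0, 1.3, 1.4, 1.2, 1.1])
```

Saving to a file rather than showing the plot is what makes the round trip work: the C++ notebook only needs to read an image back, not talk to a GUI backend.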

Some Cool Visualizations from the notebooks

Scatter plot of Avg. Price of Conventional Avocados over time

This dataset had many important categorical features, which I one-hot encoded easily in Python. On the contrary, I encountered some issues with one-hot encoding in C++; after discussing with my mentors, I learned that the problem was caused by the presence of the “Date” column.
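The Python side of this step can be sketched as follows. The toy rows and column names below are assumptions based on the public avocado dataset, not the notebook’s actual code; the key point is that “Date” is not a categorical feature and must be converted or dropped before encoding, otherwise every unique date becomes its own column.

```python
# Hypothetical mini-version of the avocado data (the real dataset is larger).
import pandas as pd

df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-27"],
    "AveragePrice": [1.33, 1.35, 0.93],
    "type": ["conventional", "conventional", "organic"],
    "region": ["Albany", "Albany", "Boston"],
})

# "Date" is not categorical: derive a numeric feature from it, then drop it,
# so the encoder only sees the true categorical columns.
df["Date"] = pd.to_datetime(df["Date"])
df["Month"] = df["Date"].dt.month
df = df.drop(columns=["Date"])

# One-hot encode the categorical columns.
encoded = pd.get_dummies(df, columns=["type", "region"])
print(sorted(encoded.columns))
```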

Later I successfully encoded the categorical features and proceeded to implement multivariate linear regression.
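For readers unfamiliar with the model, multivariate linear regression reduces to an ordinary least-squares solve. Below is a self-contained sketch on synthetic data (the feature matrix and weights are made up for illustration; the actual notebook trains on the encoded avocado features):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the encoded features: 100 samples, 3 features.
X = rng.normal(size=(100, 3))
true_w = np.array([0.5, -1.2, 2.0])
y = X @ true_w + 0.7  # known weights plus an intercept term, no noise

# Append a column of ones so the intercept is learned as an extra weight.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(w)  # recovers approximately [0.5, -1.2, 2.0, 0.7]
```

mlpack’s `LinearRegression` class solves the same least-squares problem in C++, which is what makes the Python-to-C++ translation fairly direct.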

Here’s the plot of best fit line predicted by the trained model.

All the above plots were generated using seaborn in Python and embedded in the C++ notebook using xwidgets and custom CPython API functions for generating the plots.

Finally, I used various evaluation metrics such as MAE, MSE, and RMSE to quantify how well the trained model performed on unseen data.

Make sure to check out the PR for insights into the implementation of the above notebooks. Notebooks always narrate the story better than I do!

That’s all for last week. I will write another post next week covering this week’s progress.


David Port Louis

Junior Majoring in CS | Deep Learning and Machine Learning Enthusiast | Loves to explore new technologies