This Is What A Machine Learning Model Looks Like

Visualising How Machine Learning Makes Decisions and Predictions.

Brunna Torino
Towards Data Science


How can machine-learning algorithms understand complex relationships between variables that sometimes not even field experts fully understand? In my article, Machine Learning and Real Estate, I collected data from real estate listings in Amsterdam to understand how rent prices are determined in this (very overpriced) city.

After all the transformations, my dataset ended up with a staggering 327 columns and almost 4,000 rows. It would be nearly impossible for a human to look at all this data and try to understand what is happening in the real estate market. But the machine-learning model took only three minutes to train itself, test its assumptions, and tell me how well it did, ten times over.
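For readers who want to reproduce this kind of workflow, here is a minimal sketch of training a random forest and scoring it "ten times over" via 10-fold cross-validation. It assumes scikit-learn (a common choice for this kind of model, though the article doesn't name its library) and uses small synthetic stand-in data, since the original Amsterdam dataset isn't included here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the listings dataset (the real one had
# ~4,000 rows and 327 mostly one-hot columns).
rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = 1000 + 1500 * X[:, 0] + 200 * X[:, 1] + rng.normal(0, 50, 500)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# 10-fold cross-validation: train and test the model ten times,
# each time holding out a different tenth of the data.
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(f"mean R^2 over 10 folds: {scores.mean():.3f}")
```

Each of the ten scores comes from a model that never saw its test fold during training, which is what makes the accuracy estimate trustworthy.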

The details of how a Random Forest algorithm makes decisions can be highly technical for a beginner. How does it know that adding a square meter to an apartment in a specific block, with a certain number of bedrooms, can add x amount of euros to your monthly rent? And, even more impressively, how can it estimate with 98% accuracy the rent price for every apartment, with distinct features, all over the city? And all that within three minutes?

To make it easier to understand how the model works, we can visualize the model’s decision tree: the collection of the decisions it made about the data, based on how each split performed inside the model. Here’s the figure I included in my original article:
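A figure like this can be generated directly from a fitted forest. Below is a sketch, assuming scikit-learn and toy data with two made-up features ("surface" and an agency dummy): `export_text` prints one tree of the forest as nested if/else rules, while `sklearn.tree.plot_tree` or `export_graphviz` produce the graphical version.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Toy data: apartment surface in sqm and a 0/1 agency dummy.
rng = np.random.default_rng(1)
surface = rng.uniform(20, 200, 300)
agency = rng.integers(0, 2, 300).astype(float)
X = np.column_stack([surface, agency])
y = 8 * surface + 300 * agency + rng.normal(0, 30, 300)

forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# A forest is a collection of trees; inspect the first one as text.
rules = export_text(forest.estimators_[0],
                    feature_names=["surface", "agency_dummy"])
print(rules)
```

Keeping `max_depth` small here is deliberate: a full-depth tree from 327 columns, like the one in the article, is far too large to read.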

This only shows that our model was probably working very hard. But it’s not that useful beyond that. Let’s look at this decision tree more closely:

If we go to the absolute top of the tree, we can see where everything started. The model made its first split on whether the surface of the apartment is less than or equal to 156 square meters. (A Random Forest considers a random subset of the features at each split, but it picks the threshold that best reduces the prediction error.) Zooming out, we can see that this was the “mother” split of the tree:
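To see how a threshold like "156 square meters" gets chosen, here is a toy, from-scratch illustration (not the article's model, and the numbers are invented): the tree tries candidate thresholds and keeps the one that minimizes the squared error of predicting each side with its own mean rent.

```python
import numpy as np

# Invented mini-dataset: surface in sqm and monthly rent in euros.
surface = np.array([30.0, 45.0, 60.0, 90.0, 140.0, 180.0])
rent = np.array([900.0, 1100.0, 1300.0, 1800.0, 2600.0, 3200.0])

def split_error(threshold):
    """Total squared error when each side is predicted by its mean."""
    left, right = rent[surface <= threshold], rent[surface > threshold]
    return sum(((side - side.mean()) ** 2).sum()
               for side in (left, right) if len(side))

# Candidate thresholds: midpoints between consecutive sorted surfaces.
candidates = (surface[:-1] + surface[1:]) / 2
best = min(candidates, key=split_error)
print(f"best first split: surface <= {best}")  # → surface <= 115.0
```

Real implementations do this search over every feature at every node, which is why the resulting tree can encode relationships no one spelled out in advance.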

Let’s take a closer look at the following section (branch) of the tree:

Here, we start off by splitting the apartments according to whether or not they were listed by the real estate agency Vesteda Noord West (since it’s a binary variable, a value below 0.5 means 0, and a value above 0.5 means 1). On the right split, we have the apartments that were not listed by that agency, and on the left split, those that were. If you’re confused, don’t worry: think of it as the model asking a question; if the answer is yes, the data points go to the right, and if the answer is no, they go to the left. Here we are asking a negative question: was this apartment not listed by this agency?
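The 0.5 threshold makes sense once you see how a categorical column becomes binary. A sketch with pandas `get_dummies` (the column names here are illustrative, not the article's actual ones): each agency name turns into a 0/1 column, so "agency_Vesteda Noord West <= 0.5" is just a yes/no question about that listing.

```python
import pandas as pd

# Tiny illustrative listings table.
listings = pd.DataFrame({
    "agency": ["Vesteda Noord West", "OtherAgency", "Vesteda Noord West"],
    "rent": [2300, 1400, 2100],
})

# One-hot encoding: the text column becomes one 0/1 dummy per agency,
# which is what lets a tree split on "<= 0.5".
dummies = pd.get_dummies(listings, columns=["agency"])
print(dummies)
```

This is also how a 327-column dataset arises from far fewer raw fields: every category of every text column gets its own dummy.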

Let’s take the apartments that were listed by that agency (on the left split). We further split them by whether their latitude is smaller than 52.4; if it is not (left), we split by whether their longitude is smaller than 4.9, and if it is not (left again), we end up with only two samples, zero estimation error, and a rent estimate of €2,300 a month.

For every other split, we can do the same exercise. The terminal points, where the model does not split any further, contain the model’s estimates for the price of the apartments; these terminal points are also called the leaves of the tree.
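You can also make the model do this tracing for you. A sketch, assuming scikit-learn and invented features: `apply()` returns the leaf a sample lands in, and the tree's prediction is exactly that leaf's stored estimate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Invented features: [surface_sqm, latitude, longitude].
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.uniform(20, 200, 200),     # surface
    rng.uniform(52.3, 52.5, 200),  # latitude (Amsterdam-ish range)
    rng.uniform(4.8, 5.0, 200),    # longitude
])
y = 10 * X[:, 0] + rng.normal(0, 40, 200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

sample = X[:1]
leaf_id = tree.apply(sample)[0]    # index of the leaf this sample reaches
estimate = tree.predict(sample)[0]  # the leaf's stored rent estimate
print(leaf_id, round(estimate, 1))
```

In other words, predicting is just walking the question-and-answer path down to a leaf and reading off its value.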

I hope this helped you understand the fascinating power of machine learning. Thank you for reading!
