Machine learning#

This week we will be learning how to develop and use machine learning models. We only have a week, so the aim is to briefly review the basics of machine learning and provide a hands-on demo with a geospatial dataset. Students who are interested are encouraged to pursue some of the great online resources that are now available to learn more about this extensive (and rapidly growing) field.

What is machine learning?#

  • The goal of machine learning is to use input data to make useful predictions on never-before-seen data

  • Machine learning is part of artificial intelligence, but not the only part

../../_images/ml_schematic.jpg

Input data#

  • Machine learning starts with a labelled dataset

  • A label (or target variable) is the thing we’re predicting (e.g. y variable in linear regression)

  • For example, house price, river discharge, land cover etc.

  • Collecting a good labeled dataset is tough (time-consuming, expensive)

  • These datasets are therefore extremely valuable

../../_images/captcha.jpeg

Features#

  • An input variable (e.g. the x variable in linear regression)

  • A simple dataset might use one or two features, while a more complex dataset could have thousands of features

  • In our river discharge example, features could include precipitation, snow depth, and soil moisture

Algorithms#

  • There are many (e.g. Naive Bayes, decision trees, neural networks)

  • The performance of an algorithm depends on the type of problem

  • Just remember: garbage in, garbage out

../../_images/ml_types.jpg

Supervised learning#

  • Training data is already labeled and we teach the machine to learn from these examples

  • Supervised learning can be used to predict a category (classification) or a number (regression)

Classification#

  • “Split things into groups based on their features”

  • Examples include:

    • Land cover

    • Flood risk

    • Sentiment analysis

  • Popular algorithms include (a short sketch follows the figure below):

    • Naive Bayes

    • Decision Trees

    • K-Nearest Neighbours

    • Support Vector Machine

../../_images/classification.jpg
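
To make this concrete, here is a minimal classification sketch (not part of this demo; it uses scikit-learn’s built-in iris dataset as a convenient stand-in) showing K-Nearest Neighbours in action:

# Minimal classification sketch: K-Nearest Neighbours on the built-in iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features are flower measurements; labels are species categories
X, y = load_iris(return_X_y=True)

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the classifier and score it on unseen data
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of correct category predictions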

Regression#

  • “Draw a line through these dots”

  • Used for predicting continuous variables:

    • River discharge

    • House prices

    • Weather forecasting

  • Popular algorithms include linear regression and polynomial regression, among others

../../_images/regression.jpeg

Unsupervised learning#

  • Labeled data is a luxury; sometimes we don’t have it

  • Sometimes we have no idea what the labels could be

  • Used much less in geospatial data science, but sometimes useful for exploratory analysis

Clustering#

  • “Divide data into groups, but the machine chooses the best way”

  • Common uses include:

    • image compression

    • labeling training data (i.e. for supervised learning)

    • detecting abnormal behavior

  • Popular algorithms (a quick k-means sketch follows the figure below):

    • K-means clustering

    • Mean-Shift

    • DBSCAN

../../_images/clustering.jpeg
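
As a quick illustration of how little code this takes (a sketch with made-up data, not from this lesson), k-means can be run in a few lines with scikit-learn:

# Minimal clustering sketch: k-means on made-up 2D points
import numpy as np
from sklearn.cluster import KMeans

# Generate some unlabeled data
rng = np.random.default_rng(42)
points = rng.normal(size=(100, 2))

# The machine divides the data into 3 groups on its own
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)
print(kmeans.labels_[:10])  # cluster assignments of the first ten points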

Ensemble methods#

  • “Multiple learning algorithms learning to correct each other’s errors”

  • Often used to improve the accuracy over what could be achieved with a single classical machine learning model.

  • Popular algorithms:

    • Random Forest

    • XGBoost

../../_images/ensemble.jpeg

Predicting house prices using scikit-learn#

  • In our demo we will build a machine learning model to predict house prices in California.

  • The California house price dataset is a classic example dataset derived from the 1990 U.S. Census. A description of the dataset can be found here and the files can be found here.

  • Labels = house prices at the block group level (a block group typically has a population of 600 to 3,000 people)

  • Features = longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Read dataset
df = pd.read_csv('data/california_house_prices.csv')
df.head()
housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 51 1905 291 707 284 6.2561 431000
1 51 1616 374 608 302 3.1932 400000
2 51 2413 431 1095 437 4.0089 357000
3 51 1502 243 586 231 4.3750 332400
4 51 2399 516 1160 514 3.8456 318900

Check data#

# Summary of columns, values, and data types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15267 entries, 0 to 15266
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   housing_median_age  15267 non-null  int64  
 1   total_rooms         15267 non-null  int64  
 2   total_bedrooms      15267 non-null  int64  
 3   population          15267 non-null  int64  
 4   households          15267 non-null  int64  
 5   median_income       15267 non-null  float64
 6   median_house_value  15267 non-null  int64  
dtypes: float64(1), int64(6)
memory usage: 835.0 KB
../../_images/columns.png
# Summary statistics
df.describe()
housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 15267.000000 15267.000000 15267.000000 15267.000000 15267.000000 15267.00000 15267.000000
mean 26.924805 2678.617214 549.977075 1476.071199 510.814371 3.70112 189416.617345
std 11.426288 2225.565198 430.964703 1180.890618 393.232807 1.57686 95681.349164
min 1.000000 2.000000 2.000000 3.000000 2.000000 0.49990 14999.000000
25% 17.000000 1469.000000 302.000000 814.000000 286.000000 2.53665 115400.000000
50% 27.000000 2142.000000 441.000000 1203.000000 415.000000 3.47840 171200.000000
75% 36.000000 3187.000000 661.000000 1780.000000 614.500000 4.62635 243050.000000
max 51.000000 37937.000000 6445.000000 35682.000000 6082.000000 15.00010 499100.000000

Visualize data#

# Plot histogram
_ = df.hist(bins=50, figsize=(20, 10))
../../_images/a3b7ff277c1222d9f07d9a6b1bdcc84c58d2b2166e29fd1aaf9d61c4b52dec44.png

Correlation analysis#

It is always useful to compute correlation coefficients (e.g. Pearson’s r) between the labels (i.e. median_house_value) and features.

# Compute correlation matrix
corr_matrix = df.corr()

# Display just house value correlations
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.668566
total_rooms           0.152923
households            0.098525
total_bedrooms        0.079023
population            0.020930
housing_median_age    0.014355
Name: median_house_value, dtype: float64

Warning about correlation coefficients#

  • Just remember that correlation coefficients only measure linear correlations (“if x goes up, then y generally goes up/down”).

  • They may completely miss nonlinear relationships (e.g., “if x is close to zero then y generally goes up”); a rank-based alternative is sketched after the figure below

../../_images/correlations.png
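
A partial workaround (a sketch below, not used elsewhere in this demo) is to also compute a rank-based coefficient such as Spearman’s rho, which captures monotonic relationships even when they are nonlinear:

# Spearman rank correlation captures monotonic relationships that
# Pearson's r can miss (sketch; uses the same df as above)
spearman_matrix = df.corr(method='spearman')
spearman_matrix["median_house_value"].sort_values(ascending=False)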

Feature scaling#

  • Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.

  • We often scale (or normalize) our features before training the model (e.g. min-max scaling or standardization).

  • The min-max method scales values so that they end up ranging from 0 to 1

  • Standardization scales values so that they have a mean of 0 and unit variance (a short comparison sketch follows the figure below)

../../_images/scaling.png
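
To make the difference concrete, here is a small comparison sketch (toy numbers, not our dataset):

# Compare min-max scaling and standardization on a toy column (sketch)
from sklearn.preprocessing import MinMaxScaler, StandardScaler

toy = np.array([[1.0], [2.0], [3.0], [10.0]])

# Min-max: (x - min) / (max - min), so values end up between 0 and 1
print(MinMaxScaler().fit_transform(toy).ravel())

# Standardization: (x - mean) / std, so values end up with mean 0 and unit variance
print(StandardScaler().fit_transform(toy).ravel())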
# Import library
from sklearn.preprocessing import StandardScaler

# Define feature list
feature_list =  ['housing_median_age', 'total_rooms', 'total_bedrooms', 
                 'population', 'households', 'median_income']

# Define features and labels 
X = df[feature_list]
y = df['median_house_value']
# Standardize data
scaler = StandardScaler()  
X_scaled = scaler.fit_transform(X)
df_scaled = pd.DataFrame(X_scaled, columns=feature_list)
df_scaled
housing_median_age total_rooms total_bedrooms population households median_income
0 2.107070 -0.347616 -0.600944 -0.651285 -0.576813 1.620348
1 2.107070 -0.477475 -0.408346 -0.735123 -0.531037 -0.322119
2 2.107070 -0.119352 -0.276081 -0.322709 -0.187718 0.195192
3 2.107070 -0.528700 -0.712325 -0.753753 -0.711598 0.427369
4 2.107070 -0.125643 -0.078842 -0.267664 0.008101 0.091628
... ... ... ... ... ... ...
15262 -2.181428 -0.563748 -0.693762 -0.673303 -0.810779 0.242375
15263 -2.181428 0.618462 0.429337 0.215039 0.364136 0.324757
15264 -2.181428 -1.160470 -1.234427 -1.224600 -1.258365 1.037716
15265 -2.268948 -0.277070 -0.480280 -0.511555 -0.617503 0.990913
15266 -2.268948 -0.190797 -0.515087 -0.909573 -1.014227 0.348095

15267 rows × 6 columns

Split data in training and testing subsets#

If we want to properly evaluate our machine learning model, we should not use all the data for training. Instead, we train the model on a subset of the data, retaining another subset (that the model has not “seen”) to evaluate the model.

from sklearn.model_selection import train_test_split

# Split data 
X_train, X_test, y_train, y_test = train_test_split(df_scaled, y, test_size=0.2, random_state=42)

Multiple linear regression#

We will first experiment with a very simple supervised algorithm that fits a linear model to our data using a least squares approach.

from sklearn.linear_model import LinearRegression

# Define model
lin_reg = LinearRegression()

# Fit model to data
lin_reg.fit(X_train, y_train)
LinearRegression()
from sklearn.metrics import mean_squared_error

# Predict test labels
predictions = lin_reg.predict(X_test)

# Compute mean-squared-error
lin_mse = mean_squared_error(y_test, predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
np.float64(64374.61617631896)
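
Since a linear model is just a weighted sum of the features plus an intercept, it can be instructive to inspect the fitted weights (a sketch using the lin_reg and feature_list defined above):

# Inspect the fitted model: one weight per feature, plus an intercept (sketch)
print(pd.Series(lin_reg.coef_, index=feature_list).sort_values(ascending=False))
print(lin_reg.intercept_)
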
# Plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(y_test, predictions, alpha=0.1, s=50, zorder=2)
ax.plot([0,500000], [0, 500000], color='k', lw=1, zorder=3)
ax.set_ylabel('Predicted house price ($)', fontsize=14)
ax.set_xlabel('Observed house price ($)', fontsize=14)
ax.tick_params(axis='both', which='major', labelsize=13)
ax.grid(ls='dashed', lw=1, zorder=1)
ax.set_ylim(0,500000)
ax.set_xlim(0,500000)
(0.0, 500000.0)
../../_images/b9cc55a814b66b96789ad4eb9cb96cc58029264c54b7ef9d0ab2e3881c79aaa5.png

Decision Tree#

Another supervised machine learning algorithm, which predicts a target variable by learning simple decision rules from the features.

from sklearn.tree import DecisionTreeRegressor

# Define model
tree_reg = DecisionTreeRegressor()

# Fit model
tree_reg.fit(X_train, y_train)
DecisionTreeRegressor()
# Predict test labels
predictions = tree_reg.predict(X_test)

# Compute mean-squared-error
tree_mse = mean_squared_error(y_test, predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
np.float64(82999.62109649116)
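
If we want to peek at the rules the tree actually learned, scikit-learn can print the top of the tree as text (a sketch, truncated to a depth of 2 for readability):

# Print the first two levels of the learned decision rules (sketch)
from sklearn.tree import export_text
print(export_text(tree_reg, feature_names=feature_list, max_depth=2))
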
# Plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(y_test, predictions, alpha=0.1, s=50, zorder=2)
ax.plot([0,500000], [0, 500000], color='k', lw=1, zorder=3)
ax.set_ylabel('Predicted house price ($)', fontsize=14)
ax.set_xlabel('Observed house price ($)', fontsize=14)
ax.tick_params(axis='both', which='major', labelsize=13)
ax.grid(ls='dashed', lw=1, zorder=1)
ax.set_ylim(0,500000)
ax.set_xlim(0,500000)
(0.0, 500000.0)
../../_images/2e59ed14d686c08b4535001a9511ad12289bfea90a16679a384a2fc3f7ba2ac4.png

RandomForests#

A popular ensemble algorithm that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.

from sklearn.ensemble import RandomForestRegressor

# Define model
forest_reg = RandomForestRegressor(n_estimators = 30)

# Fit model
forest_reg.fit(X_train, y_train)
RandomForestRegressor(n_estimators=30)
# Predict test labels
predictions = forest_reg.predict(X_test)

# Compute mean-squared-error
final_mse = mean_squared_error(y_test, predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
np.float64(60039.69161883922)
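
Random Forests also provide a built-in estimate of how much each feature contributed to the predictions, which makes a quick sanity check (a sketch using the fitted forest_reg):

# Rank features by their importance in the fitted forest (sketch)
print(pd.Series(forest_reg.feature_importances_, index=feature_list).sort_values(ascending=False))
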
# Plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(y_test, predictions, alpha=0.1, s=50, zorder=2)
ax.plot([0,500000], [0, 500000], color='k', lw=1, zorder=3)
ax.set_ylabel('Predicted house price ($)', fontsize=14)
ax.set_xlabel('Observed house price ($)', fontsize=14)
ax.tick_params(axis='both', which='major', labelsize=13)
ax.grid(ls='dashed', lw=1, zorder=1)
ax.set_ylim(0,500000)
ax.set_xlim(0,500000)
(0.0, 500000.0)
../../_images/2f1029aafe3986635d550cadfbe4633ace92f75f16757c253949690e26a743a7.png

Feature engineering#

Generating new features that have predictive power is one of the most important aspects of machine learning. Until now, we just used the default variables for predicting house prices but there are likely other factors that may be useful.

For example, we have geolocation data, which could be very useful. In the next part of this demo, we will engineer some new features to improve the accuracy of our house price prediction model.

As a recap, these were the root-mean-squared errors (RMSE) from the three models:

  • Multiple linear regression: $64,375

  • Decision Tree: $83,000

  • RandomForests: $60,040

# Import package
import geopandas as gpd

# Read datasets
df = pd.read_csv('data/california_house_prices_w_geometry.csv')
coast = gpd.read_file('data/california_coastline.shp')
# Convert DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['longitude'], df['latitude']))
gdf = gdf.set_crs(4326, allow_override=True)

# Reproject everything to UTM 10N (EPSG:32610)
gdf_utm = gdf.to_crs('EPSG:32610')
coast_utm = coast.to_crs('EPSG:32610')
# Compute distance to coast
distance_to_coast = []
for i in range(gdf_utm.shape[0]):
    distance_to_coast.append(coast_utm.distance(gdf_utm['geometry'].iloc[i]).min())
    
# Add to DataFrame
gdf_utm['distance_to_coast'] = distance_to_coast
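
The loop above works, but it is slow for large datasets because it computes the distance to every coastline segment for every point. A vectorized alternative (a sketch, assuming geopandas 1.0+ for union_all; older versions use the unary_union property) is to merge the coastline into a single geometry first:

# Faster alternative (sketch): merge the coastline into one geometry,
# then let geopandas compute all point-to-coast distances at once
coast_geom = coast_utm.union_all()  # on older geopandas: coast_utm.unary_union
gdf_utm['distance_to_coast'] = gdf_utm.distance(coast_geom)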
# Quickly check that it worked!
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(gdf_utm['geometry'].x, gdf_utm['geometry'].y, c=gdf_utm['distance_to_coast'])
<matplotlib.collections.PathCollection at 0x156d91f10>
../../_images/919e2d31bcd05ccd25833d466101aea1e4b5b84063f4dac29dac28630903fd1a.png

Correlation matrix#

We will compute another correlation matrix to see whether distance_to_coast is a useful predictor of median_house_value.

# Compute correlation matrix (numeric_only=True skips the geometry column)
corr_matrix = gdf_utm.corr(numeric_only=True)

# Display just house value correlations
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.668566
total_rooms           0.152923
households            0.098525
total_bedrooms        0.079023
population            0.020930
housing_median_age    0.014355
longitude            -0.020092
latitude             -0.173908
distance_to_coast    -0.505078
Name: median_house_value, dtype: float64

distance_to_coast is the second most strongly correlated variable with median_house_value (in absolute terms), excellent!

There are still some features that could be improved. For example, total_rooms and total_bedrooms do not mean much on their own because they just depend on the number of houses in the block group. More useful metrics would be rooms per house or bedrooms per room.

We can add those columns pretty simply:

# Rooms per house
gdf_utm['rooms_per_house'] = gdf_utm['total_rooms'] / gdf_utm['households']

# Bedrooms per house
gdf_utm['bedrooms_per_room'] = gdf_utm['total_bedrooms'] / gdf_utm['total_rooms']
# Compute correlation matrix (again skipping the geometry column)
corr_matrix = gdf_utm.corr(numeric_only=True)

# Display just house value correlations
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.668566
total_rooms           0.152923
rooms_per_house       0.113277
households            0.098525
total_bedrooms        0.079023
population            0.020930
housing_median_age    0.014355
longitude            -0.020092
latitude             -0.173908
bedrooms_per_room    -0.233964
distance_to_coast    -0.505078
Name: median_house_value, dtype: float64

The new bedrooms_per_room attribute is more correlated with median_house_value than either total_rooms or total_bedrooms. Apparently, houses with a lower bedroom/room ratio tend to be more expensive.

Surprisingly, rooms_per_house is not that well correlated with median_house_value. Location is clearly the most important thing in the California housing market!

Fit a model to data#

# Define feature list
feature_list =  ['median_income', 'distance_to_coast', 'bedrooms_per_room', 
                 'total_rooms', 'rooms_per_house', 'total_bedrooms', 'households']

# Define features and labels 
X = gdf_utm[feature_list]
y = gdf_utm['median_house_value']

# Standardize data
scaler = StandardScaler()  
X_scaled = scaler.fit_transform(X)
# Split data 
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Define model
forest_reg = RandomForestRegressor(n_estimators = 30)

# Fit model
forest_reg.fit(X_train, y_train)
RandomForestRegressor(n_estimators=30)

Evaluate model#

# Predict test labels
predictions = forest_reg.predict(X_test)

# Compute mean-squared-error
final_mse = mean_squared_error(y_test, predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
55770.43288157548
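
RMSE penalizes large errors heavily; as a complementary check (a sketch, not part of the original demo), we could also report the mean absolute error:

# Complementary metric (sketch): mean absolute error of the final model
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_test, predictions))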

So we improved our model by about $4,270 (from $60,040 to $55,770). This might not sound like much, but it could make a big difference to someone and might be the difference between winning a Kaggle competition or not.