Assignment 5


It’s competition time!

In this week’s lab we will hold a Kaggle-style competition. In these competitions, a host prepares a dataset and people from around the world compete to build the best machine learning model. Submitted models are scored on their predictive accuracy against a hidden solution file.

We will try to build an accurate model for predicting house prices. The data cover all houses sold in King County, WA between May 2014 and May 2015 and can be downloaded from here.

Question 1 (10 points):

Start by reading seattle-house-prices.csv and answer the following questions.

How many houses are in this dataset?
How many features are there for predicting house price?
Are there any null values in this dataset?
Which three variables are most strongly correlated with house price (include the correlation coefficients)?
Which three variables are least strongly correlated with house price (include the correlation coefficients)?
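
One possible way to approach these questions with pandas is sketched below. It is not the expected solution: the DataFrame is synthetic so the snippet runs on its own, the column names (including the target name `price`) are assumptions, and "least correlated" is interpreted here as smallest absolute correlation. With the real data, the DataFrame would instead come from `pd.read_csv("seattle-house-prices.csv")`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real file (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(500, 4000, n)
df = pd.DataFrame({
    "price": 150 * sqft + rng.normal(0, 20000, n),
    "sqft_living": sqft,
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(0, 1, n),
})

target = "price"  # assumed name of the target column

n_houses = len(df)                      # one row per house
n_features = df.shape[1] - 1            # every column except the target
n_nulls = int(df.isnull().sum().sum())  # total count of missing values

# Pearson correlation of each numeric column with the target,
# ordered from strongest to weakest by absolute value.
corr = df.select_dtypes("number").corr()[target].drop(target)
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)

print(n_houses, n_features, n_nulls)
print(ranked.head(3))  # most strongly correlated
print(ranked.tail(3))  # least strongly correlated
```

Ranking by absolute value keeps strongly negative correlations near the top rather than mistaking them for weak ones.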

Question 2 (30 points):

Produce a model to predict house prices. You are welcome to generate new features, scale the data, and split the data into training and testing sets (e.g. with train_test_split) in any way you like. You are also welcome to use the datasets contained in the data folder or other datasets that you find on the internet.
Evaluate your model’s accuracy by predicting on a test dataset, for example:

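import numpy as np
from sklearn.metrics import mean_squared_error

# forest_reg is your fitted model; X_test and y_test are the held-out data
predictions = forest_reg.predict(X_test)
final_mse = mean_squared_error(y_test, predictions)
final_rmse = np.sqrt(final_mse)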
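
For context, the evaluation above could sit at the end of a workflow like the following minimal sketch. The data here are synthetic, and the choice of `RandomForestRegressor` is an assumption based only on the variable name `forest_reg`; the split fraction and hyperparameters are illustrative, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features (X) and prices (y) standing in for the real dataset.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(500, 4))
y = 300_000 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 10_000, 500)

# Hold out 20% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only.
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(X_train, y_train)

# Score on the held-out rows.
predictions = forest_reg.predict(X_test)
final_mse = mean_squared_error(y_test, predictions)
final_rmse = np.sqrt(final_mse)
print(f"Test RMSE: {final_rmse:.0f}")
```

Fixing `random_state` makes the split and the forest reproducible, which helps when comparing feature-engineering choices between runs.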

On Monday the instructor and TA will provide an unseen set of houses which students will use to repeat their accuracy evaluation. The best models (i.e. lowest RMSE) will win prizes.
We will evaluate the models using the root-mean-squared error (RMSE), computed as follows:

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
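
To make the metric concrete, the short check below computes the RMSE both through scikit-learn and by hand, as the square root of the mean squared difference between true and predicted prices. The three prices are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices (illustration only).
y_test = np.array([300_000.0, 450_000.0, 600_000.0])
predictions = np.array([310_000.0, 430_000.0, 600_000.0])

# RMSE via scikit-learn's mean squared error.
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)

# The same quantity written out: sqrt(mean((y - y_hat)^2)).
rmse_by_hand = np.sqrt(np.mean((y_test - predictions) ** 2))

print(rmse, rmse_by_hand)
```

Because of the square root, the RMSE is expressed in the same units as the prices themselves, which makes it easier to interpret than the MSE.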

Important

Save your notebook locally in both .ipynb and .pdf formats, but submit only the PDF to Canvas.