Wine activity
Background
In this activity, we will practice some machine learning using the wine dataset. Download the data from here.
The data provide 13 compositional features for 158 wines grown in the same region of Italy by three different cultivators. The goal is to use these features to determine which of the three cultivators each wine belongs to (class 0, 1, or 2).
Activate the .gds Python environment by opening an Anaconda Prompt (miniconda3) (Windows) or Terminal (macOS). Then, on Windows:
.gds\Scripts\activate
Or, on macOS:
source .gds/bin/activate
Note
Make sure you run this command from the same directory as the .gds environment folder.
Open a Jupyter Notebook by running:
jupyter notebook
Tip
If you run this command from your course folder, your .ipynb assignment will automatically be saved there.
Task 1 (5 points)
Import the wine features and target labels, then make a multi-panel figure showing a histogram for each feature (separated by the three wine classes).
Which features are most useful for distinguishing between:
Class 0 and 1?
Class 1 and 2?
Class 0 and 2?
Standardize the features in the dataset and split into training (80%) and testing (20%) datasets.
What are the mean and standard deviation of the features in the standardized dataset?
How many wines are in the training and testing datasets?
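The standardize-and-split step above could be sketched as follows. This is a minimal example under stated assumptions, not the expected solution: it uses scikit-learn's bundled copy of the wine data (`load_wine`) as a stand-in for the course file, so the number of wines may differ from the downloaded dataset, and the `random_state` value is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the course file.
X, y = load_wine(return_X_y=True)

# Rescale every feature to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X)

# Hold out 20% of the wines for testing (random_state is an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)

print("feature means:", np.round(X_scaled.mean(axis=0), 3))
print("feature stds: ", np.round(X_scaled.std(axis=0), 3))
print("train/test sizes:", len(X_train), len(X_test))
```

To use the course data instead, replace the `load_wine` call with whatever loads your downloaded file into a feature array and a label array.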
Task 2 (5 points)
Train a Decision Tree on the training dataset with max_depth=1. Then use this model to classify the wines in the testing dataset.
What is the mean testing accuracy of this classifier?
Plot a confusion matrix showing the results of the classification for the test dataset.
Which classes have the highest and lowest accuracy?
For the class with the lowest accuracy, explain why.
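One possible way to set up the classification in this task is sketched below. It again assumes scikit-learn's bundled wine data as a stand-in for the course file. The stratified split and `random_state` values are illustrative choices, and treating "per-class accuracy" as the fraction of each true class that is predicted correctly is one interpretation, not the only one.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled wine data stands in for the course file.
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# stratify=y keeps all three classes represented in the test set
# (an illustrative choice, not required by the task).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0, stratify=y
)

# Fit a depth-1 tree and classify the held-out wines.
clf = DecisionTreeClassifier(max_depth=1, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Overall accuracy and the confusion matrix (rows: true class, columns: predicted).
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

# Fraction of each true class that was classified correctly.
per_class = cm.diagonal() / cm.sum(axis=1)

print("test accuracy:", accuracy)
print(cm)
print("per-class accuracy:", per_class)
```

For the plot itself, `sklearn.metrics.ConfusionMatrixDisplay` can draw the matrix `cm` in a notebook.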
Task 3 (5 points)
Plot the Decision Tree using the plot_tree function which can be imported using:
from sklearn.tree import plot_tree
Which feature did the Decision Tree use to split the data?
What is the Gini index of the two groups that are split? Are high values good or bad?
Make a plot showing how the testing and training errors change as max_depth is increased from 1 to 5.
At what max_depth does training error become 0?
At what max_depth does testing error start to plateau?
Which model do you think is best? Explain why.
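The two computations in this task, the Gini index of a group and the error as a function of max_depth, could be sketched roughly as below. The helper `gini` is a hypothetical name introduced for illustration. The data again come from scikit-learn's bundled wine dataset rather than the course file, and error is taken to mean 1 minus accuracy.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a group of class labels (hypothetical helper)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


# Assumption: scikit-learn's bundled wine data stands in for the course file.
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)

# A pure group has impurity 0; two equally mixed classes give 0.5.
print(gini([0, 0, 0]), gini([0, 0, 1, 1]))

# Recompute the two groups created by the depth-1 split and their impurities.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
feature, threshold = stump.tree_.feature[0], stump.tree_.threshold[0]
left = y_train[X_train[:, feature] <= threshold]
right = y_train[X_train[:, feature] > threshold]
print("split feature index:", feature)
print("Gini of the two groups:", gini(left), gini(right))

# Training and testing error (1 - accuracy) for max_depth = 1..5.
depths = list(range(1, 6))
train_errors, test_errors = [], []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_errors.append(1.0 - model.score(X_train, y_train))
    test_errors.append(1.0 - model.score(X_test, y_test))

for depth, tr, te in zip(depths, train_errors, test_errors):
    print(f"max_depth={depth}: train error={tr:.3f}, test error={te:.3f}")
```

The lists `depths`, `train_errors`, and `test_errors` can be passed directly to a plotting call such as matplotlib's `plt.plot` to produce the requested figure.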
Important
Save your notebook locally in both .ipynb and .pdf formats, but submit only the PDF to Canvas.