Demo 2#

In the previous week, we computed some band indices and learnt that positive NDWI values were likely to be associated with water (e.g lakes, rivers, ocean). But what if someone wanted to know how much water was in the image? To do this, we would first have to classify pixels that are likely to be water. In this week, we will learn how to do this, and other quantitative analysis, by combining raster and vector data.

# Import packages
import glob
import numpy as np
import rasterio
import matplotlib.pyplot as plt

# Define list of Landsat bands
files = sorted(glob.glob('data/landsat/*.tif'))
files
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 4
      2 import glob
      3 import numpy as np
----> 4 import rasterio
      5 import matplotlib.pyplot as plt
      7 # Define list of Landsat bands

ModuleNotFoundError: No module named 'rasterio'

Preparing data#

We first need to produce a true color image so that we can identify our classes. Below is an example of exporting a true color image as a GeoTIFF that can be imported into QGIS. Although note that any combination of bands could be exported (e.g. color infrared) depending on what features we would like to identify.

# Define paths to data
src_2 = rasterio.open(files[1])
band_2 = src_2.read(1)
src_3 = rasterio.open(files[2])
band_3 = src_3.read(1)
src_4 = rasterio.open(files[3])
band_4 = src_4.read(1)

# Write true color image
meta = {'driver': "GTiff",
        'height': band_2.shape[0],
        'width': band_2.shape[1],
        'count': 3,
        'dtype': band_2.dtype,
        'crs': src_2.crs,
        'transform':src_2.transform}
        
with rasterio.open("data/true-color.tif", 'w', **meta) as dst:
    dst.write(band_4, 1)
    dst.write(band_3, 2)
    dst.write(band_2, 3)

Produce training dataset#

Next, we will import the GeoTIFF into QGIS and generate a training dataset that we can use for our supervised classification. The easiest way to do this is to create polygons that contain our features of interest and save it as a shapefile layer.

create shapefile layer

In this example, we will make a new shapefile called lake.shp, making sure that Geometry type is set to Polygon and the CRS is set to the same as our Lansdat true color image. In our case, this is WGS 84 / UTM Zone 10N.

create shapefile layer

Once we have created a new shapefile, we can edit it by selecting it in the Layers panel (i.e. by left-clicking it once) and clicking Toggle Editing (the little pencil button).

toggle editing

Now we can click the Add Polygon Feature button to the right of the Toggle Editing button to make a new polyon. Digitize a small area within the feature of interest (in our case, some pixels in the lake), set the id to 1 (although it doesn’t matter which number we choose), and click OK. Repeat this step, to sample several lakes in our image.

digitize

When we feel good about our training dataset, we can click the Toggle Editing button again to save our shapefile. We can then repeat this procedure for every feature that we want to classify. In our example, we have digitized seven features that include: lake, forest, cleared, river, sand, ocean, and grass.

digitize

Masking raster data using shapefile#

We now should have several shapefiles containing polygons that outline our features of interest. We can read them using Python using the GeoPandas package.

import geopandas as gpd
lake = gpd.read_file('data/lake.shp')
lake
id geometry
0 1 POLYGON ((411681.33 4858920.1, 412387.6 485888...
1 2 POLYGON ((409741.747 4847556.896, 409822.866 4...
2 3 POLYGON ((409904.889 4851822.25, 410222.19 485...

Our GeoDataFrame contains three rows, each of which is a different polygon. The geometry column contains the coordinates that represent the polygons.

Rasterio provides a function called mask.mask for masking raster data using these polygons. This function has a few arguments. The first is the raster dataset (i.e. src in our case). The second argument is the shape to mask. In the docs, it notes that this shape shold be a format that implements the Python geo interface protocol (such as a Shapely Polygon).

So we will have to specify only the geometry of our GeoPandas DataFrame, not the DataFrame object itself. The Invert argument determines whether pixels inside or outside shapes will be masked. The nodata argument allows us to set the no data value (i.e. values of pixels outside the mask).

import rasterio.mask
lake_mask, transform = rasterio.mask.mask(src_2, lake.geometry, invert=False, nodata=0)
lake_mask
array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]], shape=(1, 1422, 1208), dtype=uint16)

This function produced an arry called lake_mask with the same dimensions as the original Landsat band except that all values outside of the lake polygons were classified as zeros. Therefore any value higher than zero is a pixel from the Landsat band that is within the lake polygons. We can subset these values by masking the array like so:

lake_mask_values = lake_mask[lake_mask != 0]

Format training dataset#

To classify our image using scikit-learn, we need to format these values as a labels and features. It’s easiest to do this by constructing a pandas DataFrame.

import pandas as pd
df = pd.DataFrame(list(zip(np.full(lake_mask_values.shape, 1), lake_mask_values)), columns=['label', 'band2'])
df.head()
label band2
0 1 7789
1 1 7789
2 1 7831
3 1 7872
4 1 7896

There are a couple of things going on here. We first defined a list of ones using np.full which we zipped with the Band 2 values that represent lake pixels. Then we converted this to a list and then to a pandas DataFrame with the column names label and band2.

Since our classification will depend on several bands, we can make a DataFrame that represent the pixel values of every band using a for loop.

values = []
for file in files:
    # Open file
    src = rasterio.open(file)
    
    # Mask
    lake_mask, transform = rasterio.mask.mask(src, lake.geometry, invert=False, nodata=0)
    
    # Append to empty list
    values.append(lake_mask[lake_mask != 0])

# Make a DataFrame
df = pd.DataFrame(values).T

# Name columns
df.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']

# Define labels column
df['labels'] = np.full(df.shape[0], 1)
df.head()
band1 band2 band3 band4 band5 band6 band7 labels
0 7285 7789 9017 8582 7863 8011 7970 1
1 7323 7789 9020 8584 7851 7982 7942 1
2 7363 7831 9026 8595 7849 7968 7928 1
3 7414 7872 9051 8615 7860 7959 7916 1
4 7412 7896 9020 8591 7860 7942 7894 1

We actually have seven features so we need another for loop that iterates over every feature that we digitized.

# Define list of Landsat bands
shapefiles = sorted(glob.glob('data/*.shp'))

final_values = []
for i in range(len(shapefiles)):
    # Read shapefile
    feature = gpd.read_file(shapefiles[i])
    
    values = []
    for file in files:
        # Open file
        src = rasterio.open(file)
        
        # Mask
        feature_mask, transform = rasterio.mask.mask(src, feature.geometry, invert=False, nodata=0)
        
        # Append to band list
        values.append(feature_mask[feature_mask != 0])

    # Append to feature list
    final_values.append(values)
df1 = pd.DataFrame(final_values[0]).T
df1.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']
df1['labels'] = np.full(df1.shape[0], 1)
df2 = pd.DataFrame(final_values[1]).T
df2.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']
df2['labels'] = np.full(df2.shape[0], 2)
df3 = pd.DataFrame(final_values[2]).T
df3.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']
df3['labels'] = np.full(df3.shape[0], 3)
df4 = pd.DataFrame(final_values[3]).T
df4.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']
df4['labels'] = np.full(df4.shape[0], 4)
df5 = pd.DataFrame(final_values[4]).T
df5.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']
df5['labels'] = np.full(df5.shape[0], 5)
df6 = pd.DataFrame(final_values[5]).T
df6.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']
df6['labels'] = np.full(df6.shape[0], 6)
df7 = pd.DataFrame(final_values[6]).T
df7.columns = ['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']
df7['labels'] = np.full(df7.shape[0], 7)

training_data = pd.concat([df1, df2, df3, df4, df5, df6, df7])
training_data.shape
(6862, 8)

Sorry, I was in a rush (extra credit if you can make that more concise!).

What we have now is a training dataset with several thousand rows and eight columns. Each row is a pixel that we know is contained in the associated label class. Our label column numerically represents the different features (i.e. 1 corresponds to cleared, 2 corresponds to forest).

shapefiles
['data/cleared.shp',
 'data/forest.shp',
 'data/grass.shp',
 'data/lake.shp',
 'data/ocean.shp',
 'data/river.shp',
 'data/sand.shp']

Summary statistics for the training dataset#

training_data.groupby('labels').mean()
band1 band2 band3 band4 band5 band6 band7
labels
1 8161.185504 8448.916462 9352.192875 9788.692875 15444.443489 16229.616708 12594.883292
2 7416.639942 7487.944606 8048.635569 7681.294461 15653.613703 10159.087464 8349.674927
3 7820.300981 8053.422028 9506.115594 8962.172301 23864.091603 15648.476554 11192.743730
4 7384.654624 7691.852601 8486.945087 8166.203757 7728.687861 7743.832370 7699.452312
5 7313.515112 7540.395532 7787.961235 7547.877464 7602.442838 7799.425427 7775.527595
6 7331.883721 7511.483721 7940.525581 7603.255814 7509.227907 7500.590698 7466.423256
7 12785.408602 13732.512545 16671.784946 18257.075269 19734.172043 22963.752688 22489.347670
training_data.groupby('labels').count()
band1 band2 band3 band4 band5 band6 band7
labels
1 814 814 814 814 814 814 814
2 686 686 686 686 686 686 686
3 917 917 917 917 917 917 917
4 692 692 692 692 692 692 692
5 3044 3044 3044 3044 3044 3044 3044
6 430 430 430 430 430 430 430
7 279 279 279 279 279 279 279

Write training dataset to file#

We can save the the training dataset as a csv file using to_csv function. Now we don’t have to run the previous lines of code every time we need to generate the training dataset - we can just import the csv file using read_csv.

# Export csv file
training_data.to_csv('data/training_dataset.csv', index=False)

# Import csv file
training_data = pd.read_csv('data/training_dataset.csv')

Separability testing#

Before we begin our classification, we should check the separability of our training data. We can do this using a bi-spectral plot.

# Define the bands to plot (you can choose any two)
x_band = 'band6'
y_band = 'band5'

# Define colors for each label using a colormap
cmap = plt.get_cmap('tab10', 7)  # 7 classes in total

# Create the plot
plt.figure(figsize=(6, 4))

# Plot data for each class label
for label in sorted(training_data['labels'].unique()):
    class_data = training_data[training_data['labels'] == label]
    plt.scatter(
        class_data[x_band], 
        class_data[y_band], 
        label=f'Class {label}', 
        color=cmap(label - 1),
        s=40)

# Customize the plot
plt.xlabel(f'{x_band.capitalize()}', fontsize=11)
plt.ylabel(f'{y_band.capitalize()}', fontsize=11)
plt.legend(loc=0)
plt.tight_layout()

# Show the plot
plt.show()
../_images/fae7a9b8248e60beeebaadd5ee15006aef331f7102747626f8a8500357632539.png

This is really useful. It looks like we should have no problem distinguishing between Class 1 (cleared), Class 2 (forest), Class 3 (grass), and Class 7 (sand). However, it might be difficult distinguishing between our different water classes 4, 5, and 6 (i.e. lake, ocean, and river).

M-statistic#

The M-statistic is a measure of class separability, often used in remote sensing and classification analysis. It evaluates how distinct two classes are by comparing the distance between their mean values relative to their standard deviations.

Below is a function to compute the M-statistic for two classes at a time.

def m_statistic(c1, c2):
    # Extract data for Class 1 and Class 2
    class_1 = training_data[training_data['labels'] == c1][['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']]
    class_2 = training_data[training_data['labels'] == c2][['band1', 'band2', 'band3', 'band4', 'band5', 'band6', 'band7']]
    
    # Compute means and standard deviations for each band
    mean_1 = class_1.mean()
    mean_2 = class_2.mean()
    
    std_1 = class_1.std()
    std_2 = class_2.std()
    
    # Compute M-statistic for each band
    M_statistic = (abs(mean_1 - mean_2)) / np.sqrt(std_1**2 + std_2**2)
    
    # Display results
    return(M_statistic)
m_statistic(1, 2)
band1    2.906937
band2    3.384646
band3    2.807394
band4    3.653126
band5    0.064677
band6    3.030287
band7    3.367483
dtype: float64

The M-Statistic confirms that there should be good separability between Classes 1 (cleared) and 2 (forest) for all the bands except Band 5.

m_statistic(4, 5)
band1    0.471940
band2    0.810753
band3    1.106840
band4    1.339791
band5    0.709128
band6    0.201126
band7    0.282104
dtype: float64

In contrast, the M-Statistic shows that Classes 4 (lake) and 5 (ocean) are going to be difficult to separate. If we are, we might have best success using Band 3 and 4.