rm(list=ls())
#major required packages:
require(devtools)
::install_github("HannaMeyer/CAST")
devtools
library(terra)
library(caret)
library(randomForest)
library(mapview)
library(sf)
library(CAST)
library(tmap)
library(rprojroot)
# create a string containing the current working directory
= paste0(find_rstudio_root_file(),"/ml_session/data/") wd
Machine learning for remote sensing applications
Mapping the MarburgOpenForest
Introduction
In this tutorial we will go through the basic workflow of training machine learning models for spatial mapping based on remote sensing. To do this we will look at two case studies located in the MarburgOpenForest in Germany: one has the aim to produce a land cover map including different tree species; the other aims at producing a map of Leaf Area Index.
Based on “default” models, we will further discuss the relevance of different validation strategies and the area of applicability.
How to start
For this tutorial we need the terra package for processing of the satellite data as well as the caret package as a wrapper for machine learning (here: randomForest) algorithms. Sf is used for handling of the training data available as vector data (polygons). Mapview is used for spatial visualization of the data. CAST will be used to account for spatial dependencies during model validation as well as for the estimation of the AOA.
Case study 1: land cover classification
Data preparation
To start with, let’s load and explore the remote sensing raster data as well as the vector data that include the training sites.
Raster data (predictor variables)
<- rast(paste0(wd,"sentinel_uniwald.grd"))
mof_sen print(mof_sen)
class : SpatRaster
dimensions : 522, 588, 10 (nrow, ncol, nlyr)
resolution : 10, 10 (x, y)
extent : 474200, 480080, 5629540, 5634760 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs
source : sentinel_uniwald.grd
names : T32UM~1_B02, T32UM~1_B03, T32UM~1_B04, T32UM~1_B05, T32UM~1_B06, T32UM~1_B07, ...
min values : 723, 514, 294, 341.8125, 396.9375, 440.8125, ...
max values : 8325, 9087, 13810, 7368.7500, 8683.8125, 9602.3125, ...
The raster data contain a subset of the optical data from Sentinel-2 (see band information here: https://en.wikipedia.org/wiki/Sentinel-2) given in scaled reflectances (B02-B11). In addition,the NDVI was calculated. Let’s plot the data to get an idea how the variables look like.
plot(mof_sen)
plotRGB(mof_sen,r=3,g=2,b=1,stretch="lin")
Vector data (Response variable)
The vector file is read as sf object. It contains the training sites that will be regarded here as a ground truth for the land cover classification.
<- read_sf(paste0(wd,"trainingsites_LUC.gpkg")) trainSites
Using mapview we can visualize the aerial image channels in the geographical context and overlay it with the polygons. Click on the polygons to see which land cover class is assigned to a respective polygon.
mapview(mof_sen[[1]], map.types = "Esri.WorldImagery") +
mapview(trainSites)
Draw training samples and extract raster information
In order to train a machine learning model between the spectral properties and the land cover class, we first need to create a data frame that contains the predictor variables at the location of the training sites as well as the corresponding class information. However, using each pixel overlapped by a polygon would lead to a overly huge dataset, therefore, we first draw training samples from the polygon. Let’s use 1000 randomly sampled (within the polygons) pixels as training data set.
<- st_sample(trainSites,1000)
trainlocations <- st_join(st_sf(trainlocations), trainSites)
trainlocations mapview(trainlocations)
Next, we can extract the raster values for these locations. The resulting data frame contains the predictor variables for each training location that we can merged with the information on the land cover class from the sf object.
<- extract(mof_sen, trainlocations, df=TRUE)
trainDat <- data.frame(trainDat, trainlocations)
trainDat head(trainDat)
ID T32UMB_20170510T103031_B02 T32UMB_20170510T103031_B03
1 1 839 769
2 2 920 801
3 3 865 764
4 4 943 923
5 5 853 832
6 6 850 921
T32UMB_20170510T103031_B04 T32UMB_20170510T103031_B05
1 508 915.3125
2 573 935.5000
3 502 848.5000
4 695 1204.1875
5 490 987.3125
6 524 1220.8125
T32UMB_20170510T103031_B06 T32UMB_20170510T103031_B07
1 1615.625 1864.812
2 2702.750 3693.938
3 2459.188 3228.000
4 2472.188 2900.500
5 1765.438 2037.375
6 2217.875 2491.062
T32UMB_20170510T103031_B08 T32UMB_20170510T103031_B11
1 1957 1425.438
2 3426 1308.750
3 3114 1142.625
4 2787 2058.312
5 2323 1530.188
6 2842 1953.438
T32UMB_20170510T103031_B12 NDVI id LN Type trainlocations
1 759.5625 0.5878296 NA 1 Eiche POINT (476573.9 5632268)
2 674.9375 0.7134284 22 403 Strasse POINT (478543.3 5632763)
3 551.6875 0.7223451 NA 102 Felder POINT (478118 5631136)
4 1091.0000 0.6008042 12 303 Wiese POINT (478396.1 5632345)
5 821.3750 0.6516175 NA 2 Buche POINT (477062.7 5632419)
6 1014.6250 0.6886512 NA 2 Buche POINT (477609 5631575)
Model training
Predictors and response
For model training we need to define the predictor and response variables. As predictors we can use basically all information from the raster stack as we might assume they could all be meaningful for the differentiation between the land cover classes. As response variable we use the “Label” column of the data frame.
<- names(mof_sen)
predictors <- "Type" response
A first “default” model
We then train a Random Forest model to lean how the classes can be distinguished based on the predictors (note: other algorithms would work as well. See https://topepo.github.io/caret/available-models.html for a list of algorithms available in caret). Caret’s train function is doing this job.
So let’s see how we can then train a “default” random forest model. We specify “rf” as method, indicating that a Random Forest is applied. We reduce the number of trees (ntree) to 75 to speed things up. Note that usually a larger number (>250) is appropriate.
<- train(trainDat[,predictors],
model
trainDat[,response],method="rf",
ntree=75)
model
Random Forest
1000 samples
10 predictor
10 classes: 'Buche', 'Duglasie', 'Eiche', 'Felder', 'Fichte', 'Laerche', 'Siedlung', 'Strasse', 'Wasser', 'Wiese'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8327506 0.7927715
6 0.8317421 0.7918213
10 0.8261596 0.7851166
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
To perform the classification we can then use the trained model and apply it to each pixel of the raster stack using the predict function.
<- predict(mof_sen,model) prediction
Then we can then create a map with meaningful colors of the predicted land cover using the tmap package.
<- rev(c("palegreen", "blue", "grey", "red", "lightgreen", "forestgreen", "beige","brown","darkgreen","yellowgreen"))
cols
tm_shape(prediction) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
Based on this we can now discuss more advanced aspects of cross-validation for performance assessment as well as spatial variable selection strategies.
Model training with spatial CV and variable selection
Before starting model training we can specify some control settings using trainControl. For hyperparameter tuning (mtry) as well as for error assessment we use a spatial cross-validation. Here, the training data are split into 5 folds by trying to resemble the geographic distance distribution required when predicting the entire area from the trainign data,
## define prediction area:
<- as.polygons(mof_sen, values = FALSE, na.all = TRUE) |>
studyArea st_as_sf() |>
st_transform(st_crs(trainlocations))|>
st_union()
mapview(studyArea)
<- knndm(trainlocations,studyArea,k=5)
indices <- geodist(trainlocations,studyArea,cvfolds = indices$indx_train )
gd plot(gd)+ scale_x_log10(labels=round)
<- trainControl(method="cv",
ctrl index = indices$indx_train,
indexOut = indices$indx_test,
savePredictions = TRUE)
Model training is then again performed using caret’s train function. However we use a wrapper around it that is selecting the predictor variables which are relevant for making predictions to new spatial locations (forward feature selection, fss). We use the Kappa index as metric to select the best model.
# train the model
set.seed(100)
<- ffs(trainDat[,predictors],
model
trainDat[,response],method="rf",
metric="Kappa",
trControl=ctrl,
importance=TRUE,
ntree=100,
verbose=FALSE)
print(model)
Selected Variables:
T32UMB_20170510T103031_B05 T32UMB_20170510T103031_B06 T32UMB_20170510T103031_B02 T32UMB_20170510T103031_B12
---
Random Forest
1000 samples
4 predictor
10 classes: 'Buche', 'Duglasie', 'Eiche', 'Felder', 'Fichte', 'Laerche', 'Siedlung', 'Strasse', 'Wasser', 'Wiese'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 816, 803, 830, 661, 890
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.7812635 0.5674632
3 0.7448999 0.5188131
4 0.7317119 0.5178565
Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
plot(varImp(model))
Model validation
When we print the model (see above) we get a summary of the prediction performance as the average Kappa and Accuracy of the three spatial folds. Looking at all cross-validated predictions together we can get the “global” model performance.
# get all cross-validated predictions:
<- model$pred[model$pred$mtry==model$bestTune$mtry,]
cvPredictions # calculate cross table:
table(cvPredictions$pred,cvPredictions$obs)
Buche Duglasie Eiche Felder Fichte Laerche Siedlung Strasse Wasser
Buche 4 0 6 0 7 0 0 0 0
Duglasie 0 2 0 0 1 0 0 0 0
Eiche 3 0 0 1 2 0 0 0 0
Felder 3 0 0 260 0 0 0 11 0
Fichte 0 0 0 6 1 0 0 0 0
Laerche 0 0 0 0 0 0 0 0 0
Siedlung 0 0 0 0 0 0 0 0 0
Strasse 0 0 0 14 0 0 0 35 0
Wasser 0 0 0 1 0 0 0 1 0
Wiese 0 0 0 21 0 0 0 5 0
Wiese
Buche 0
Duglasie 0
Eiche 1
Felder 15
Fichte 0
Laerche 0
Siedlung 0
Strasse 7
Wasser 1
Wiese 69
Visualize the final model predictions
<- predict(mof_sen,model)
prediction <- rev(c("palegreen", "blue", "grey", "red", "lightgreen", "forestgreen", "beige","brown","darkgreen","yellowgreen"))
cols
tm_shape(prediction) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
Area of Applicability
We have seen that technically, the trained model can be applied to the entire area of interest (and beyond…as long as the sentinel predictors are available which they are, even globally). But we should assess if we SHOULD apply our model to the entire area. The model should only be applied to locations that feature predictor properties that are comparable to those of the training data. If dissimilarity to the training data is larger than the dissimmilarity within the training data, the model should not be applied to this location.
<- aoa(mof_sen,model,LPD=TRUE, verbose=FALSE)
AOA plot(AOA$AOA)
The result of the aoa function has two layers: the dissimilarity index (DI) and the area of applicability (AOA). The DI can take values from 0 to Inf, where 0 means that a location has predictor properties that are identical to properties observed in the training data. With increasing values the dissimilarity increases. The AOA has only two values: 0 and 1. 0 means that a location is outside the area of applicability, 1 means that the model is inside the area of applicability. As an option, we cal also calculate the Local Point Density (LPD), which tells us, for a prediction location, how MANY similar training data points were used during modle training.
Error profiles
Let’s assume there is a relationship between the density of training data points in the predictor space (LPD) and the model performance. Let’s analyze that and use that to predict the prediction performance.
plot(AOA$LPD)
<- errorProfiles(model,AOA,variable="LPD")
ep plot(ep)
plot(predict(AOA$LPD,ep))
Case Study 2: Modelling the Leaf Area Index
In the second example we will look at a regression task: We have point measurements of Leaf area index (LAI), and, based in this, we would like to make predictions for the entire forest. Again, we will use the Sentinel data as potnetial predictors.
Prepare data
<- rast(paste0(wd,"sentinel_uniwald.grd"))
mof_sen <- st_read(paste0(wd,"trainingsites_LAI.gpkg")) LAIdat
Reading layer `trainingsites_LAI' from data source
`/home/hanna/Documents/Github/HannaMeyer/EON2024/ml_session/data/trainingsites_LAI.gpkg'
using driver `GPKG'
Simple feature collection with 67 features and 10 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 476350 ymin: 5631537 xmax: 478075 ymax: 5632765
Projected CRS: WGS 84 / UTM zone 32N
<- extract(mof_sen,LAIdat,na.rm=TRUE)
trainDat $LAI <- LAIdat$LAI
trainDat
<- mof_sen[[1]]
meanmodel values(meanmodel) <- mean(trainDat$LAI)
plot(meanmodel)
<- mof_sen[[1]]
randommodel values(randommodel)<- runif(ncell(randommodel),min = 0,4)
plot(randommodel)
A simple linear model
As a simple first approach we might develop a linear model. Let’s assume a linear relationship between the NDVI and the LAI
plot(trainDat$NDVI,trainDat$LAI)
<- lm(LAI~NDVI,data=trainDat)
model_lm summary(model_lm)
Call:
lm(formula = LAI ~ NDVI, data = trainDat)
Residuals:
Min 1Q Median 3Q Max
-1.87314 -0.52143 -0.03363 0.63668 2.25252
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.8518 1.4732 -0.578 0.56515
NDVI 6.8433 2.3160 2.955 0.00435 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8887 on 65 degrees of freedom
Multiple R-squared: 0.1184, Adjusted R-squared: 0.1049
F-statistic: 8.731 on 1 and 65 DF, p-value: 0.004354
abline(model_lm,col="red")
<- predict(mof_sen,model_lm,na.rm=T)
prediction_LAI plot(prediction_LAI)
<- -0.8518+mof_sen$NDVI*6.8433
limodelpred mapview(limodelpred)
The machine learning way
Define CV folds
Let’s use the NNDM cross-validation approach.
<- as.polygons(mof_sen, values = FALSE, na.all = TRUE) |>
studyArea st_as_sf() |>
st_transform(st_crs(LAIdat))|>
st_union()
<- knndm(LAIdat,studyArea,k=3) nndm_folds
Let’s explore the geodistance
<- geodist(LAIdat,studyArea,cvfolds = nndm_folds$indx_test)
gd plot(gd)
Model training
<- trainControl(method="cv",
ctrl index=nndm_folds$indx_train,
indexOut = nndm_folds$indx_test,
savePredictions = "all")
<- ffs(trainDat[,predictors],
model $LAI,
trainDatmethod="rf",
trControl = ctrl,
importance=TRUE,
verbose=FALSE)
model
Selected Variables:
T32UMB_20170510T103031_B07 T32UMB_20170510T103031_B08 NDVI
---
Random Forest
67 samples
3 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 41, 44, 49
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.8488074 0.2173559 0.7078165
3 0.8547139 0.2092237 0.7142045
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
LAI prediction
Let’s then use the trained model for prediction.
<- predict(mof_sen,model)
LAIprediction plot(LAIprediction)
Question?! Why does it look so different than the linear model?
AOA estimation
<- aoa(mof_sen,model,LPD = TRUE, verbose=FALSE)
AOA plot(AOA$AOA)
plot(AOA$LPD)
Error profiles
Let’s assume there is a relationship between the density of training data points in the predictor space (LPD) and the model performance. Let’s analyze that and use that to predict the prediction performance.
<- errorProfiles(model,AOA,variable="DI")
ep plot(ep)
plot(predict(AOA$DI,ep))