Chris Reudenbach, Philipps University Marburg (PUM)
Published
August 30, 2025
The PipeModel is a deliberately idealized yet physically plausible valley scenario. It distills terrain to the essentials (parabolic cross-valley profile) and optional features (left-side hill, right-side pond or hollow), so that dominant microclimate drivers become visible and testable:
Radiation via terrain exposure cos(i) from slope & aspect
Cold-air pooling along the valley axis (Gaussian trough)
Surface type / land-cover (grass / forest / water / bare soil / maize) alters heating, shading, roughness and nocturnal behaviour
You can sample synthetic stations, train interpolators (IDW, Kriging variants, RF, GAM), and assess them with spatial LBO-CV.
What the pipemodel is
A didactic, reproducible pipeline for micro-scale spatial prediction with process-aware features and scale tuning. It simulates or ingests a domain (“scenario”), learns from station points, validates with leave-block-out CV, infers a characteristic scale (R*) from data, re-trains at that scale, and produces tuned maps, diagnostics, and optional exports.
Architecture
Main (orchestrator):main_ultra.R is intentionally thin: it wires pieces together, runs the stages in order, shows live previews, and—optionally—saves outputs at the end. No heavy lifting.
Code
```{r}#| eval: false# =====================================================================# main_ultra.R — minimal: run + (optional) save-at-end## Purpose:# 1) Source packages + your function library + scenario registry# 2) Build scenario + station sets from a chosen scenario "make()" factory# 3) Live preview: domain/land-cover/terrain + 2x2 overview + scenario preview# 4) Baseline: leave-block-out CV (LBO-CV) and prediction maps (T14, T05)# 5) Scale inference: empirical variogram -> (L50,L95) -> U-curve -> R*# 6) Tuned CV & maps at R*# 7) OPTIONAL: save all plots/tables/rasters at the end## Design notes:# - This script stays "thin": all heavy lifting lives in fun_pipemodel.R# and the scenario files. This keeps the main pipe reproducible and# testable.# - Keep side effects (saving files) to the very end; set `export <- FALSE`# if you just want to run and eyeball plots interactively.# =====================================================================message("=== pipe main (ultra) ===")# Toggle: set to FALSE to only run/plot without saving anything export <-TRUE# ---------------------------------------------------------------------# 0) Setup & packages (centralized in your packages.R)# - Loads CRAN pkgs (terra, sf, ggplot2, mgcv, gstat, suncalc, ...)# - Sets knitr options (if used in a notebook context)# - We also disable spherical geometry in sf to keep planar ops robust# for projected domains (UTM in our scenarios).# ---------------------------------------------------------------------source(here::here("block4_5/src/packages.R")) sf::sf_use_s2(FALSE)set.seed(42) # one seed here (scenarios may set more where needed)# ---------------------------------------------------------------------# 1) Functions + Scenario registry# - fun_pipemodel.R: your full function library (NO side effects)# - registry.R: maps scenario names to files; exposes source_scenario()# which returns a `make()` function to build the object.# ---------------------------------------------------------------------source(here::here("block4_5/src/fun_pipemodel.R"))source(here::here("block4_5/src/fun_learn_predict_core.R"))source(here::here("block4_5/scenarios/registry.R"))# ---------------------------------------------------------------------# 2) Pick a scenario# - Choose via env var SCEN (e.g., export SCEN=scen_scaled_demo)# - Defaults to "lake_bump_dense" which is a realistic didactic scene# (valley, lake, bump hill, dense micro-relief, LC = forest/water/bare/baseline meadow).# ---------------------------------------------------------------------#scen_name <- Sys.getenv("SCEN", "lake_bump_dense") scen_name <-Sys.getenv("SCEN", "scen_scaled_demo") make_fun <-source_scenario(scen_name) # returns a function make(overrides=list(), do_cv=FALSE)# ---------------------------------------------------------------------# 3) Build the object (domain + scenario + stations + params)# - `obj` is a list with a stable contract used downstream:# scen: list of rasters (E, R14, R05, I14, I05, lc, sun, ...)# stn_sf_14 / stn_sf_05: station sf at 14/05 UTC (features + temp)# block_size: integer (meters) used by LBO-CV# params$models: character vector of model names to run# --------------------------------------------------------------------- obj <-make_fun() scen <- obj$scen st14 <- obj$stn_sf_14 st05 <- obj$stn_sf_05 bs <- obj$block_size mods <- obj$params$models# --- Safety checks that catch common wiring issues early ----------------------stopifnot(inherits(st14, "sf"), inherits(st05, "sf"))stopifnot(all(c("E","R14","R05") %in%names(scen)))# Sun geometry must be present for the R*-tuning (cosine-of-incidence features)if (is.null(scen$sun) ||is.null(scen$sun$T14) ||is.null(scen$sun$T05)) {stop("Scenario did not populate scen$sun$T14 / scen$sun$T05 (alt/az). ","Fix in the scenario builder (build_scenario) before tuning.") }if (any(is.null(c(scen$sun$T14$alt, scen$sun$T14$az, scen$sun$T05$alt, scen$sun$T05$az)))) {stop("Sun angles (alt/az) are NULL. Scenario must supply numeric alt/az for T14/T05.") }# ---------------------------------------------------------------------# 4) Live preview: plots during the run (no side effects)# Why plot first?# - Instant sanity checks: station placement, LC, illumination maps# - Early visual cues if something is off (e.g., CRS mismatch)# ---------------------------------------------------------------------print(plot_landcover_terrain(scen, stations = st14, layout ="vertical"))print(plot_block_overview_2x2_en(scen, pts_sf = st14))# preview_scenario() may show truth fields, histograms, thumbnails, etc.print(preview_scenario(obj)) # accepts obj or obj$scen in the robust version# ---------------------------------------------------------------------# 5) Baseline LBO-CV & prediction maps# - For each time slice (T14 day / T05 pre-dawn):# 1) Run leave-block-out CV on station data# 2) Produce truth vs predicted raster maps (model ensemble)# - Diagnostics printed/printed:# * Metrics table (RMSE/MAE/R2 per model)# * Blocks plot (spatial CV blocks)# * Pred-vs-obs scatter plot# * Truth and prediction maps# --------------------------------------------------------------------- run14 <-run_for_time(st14, scen$R14, "T14", scen_local = scen, block_m = bs, models = mods) run05 <-run_for_time(st05, scen$R05, "T05", scen_local = scen, block_m = bs, models = mods)cat("\n== Metrics T14 ==\n"); print(run14$res$metrics)cat("\n== Metrics T05 ==\n"); print(run05$res$metrics)print(run14$res$blocks_plot); print(run14$res$diag_plot)print(run05$res$blocks_plot); print(run05$res$diag_plot)print(run14$maps$p_truth); print(run14$maps$p_pred)print(run05$maps$p_truth); print(run05$maps$p_pred)# =====================================================================# 6) SCALE → R* tuning → tuned CV + maps## Pipeline rationale:# (a) Variogram reveals correlation ranges in the point field.# (b) U-curve scans DEM-smoothing radii ~ [L50, L95] to find data-driven R*.# Each radius implies a different "macro-signal" (E*, slope*, cosi*).# We refit CV at each R and pick the RMSE-minimizer (R*).# (c) With R* in hand, we derive feature rasters at that scale: E*, slp*, cosi*.# (d) Re-extract station features at R* to ensure training/prediction consistency.# (e) Run LBO-CV again (tuned) using E* as the reference raster for blocks/domain.# (f) Predict tuned maps by injecting the feature rasters.# (g) Build compact multi-model panels with residual diagnostics.## Performance tips if tuning feels slow:# - Reduce n_grid in the U-curve (e.g., 5 instead of 9).# n_grid sets how many candidate smoothing radii are tested in the U-curve search for R*# - Trim `mods` to a smaller set while teaching the concept.# - Increase block size slightly (fewer blocks → fewer CV folds).# =====================================================================# --- (a) Variogram → L50/L95 ------------------------------------------------- Ls14 <-compute_Ls_from_points(st14, value_col ="temp") Ls05 <-compute_Ls_from_points(st05, value_col ="temp") p_vg14 <-plot_variogram_with_scales(Ls14$vg, Ls14$L50, Ls14$L95, Ls14$sill,"T14 — empirical variogram") p_vg05 <-plot_variogram_with_scales(Ls05$vg, Ls05$L50, Ls05$L95, Ls05$sill,"T05 — empirical variogram")print(p_vg14); print(p_vg05)# --- (b) U-curve → R* --------------------------------------------------------# We pass *explicit* sun angles so tune_Rstar_ucurve() can build cosi@R# consistently with the scenario's solar geometry. tune14 <-tune_Rstar_ucurve(stn_sf = st14,E = scen$E,alt = scen$sun$T14$alt,az = scen$sun$T14$az,L50 = Ls14$L50,L95 = Ls14$L95,block_fallback = bs,n_grid =6 ) tune05 <-tune_Rstar_ucurve(stn_sf = st05,E = scen$E,alt = scen$sun$T05$alt,az = scen$sun$T05$az,L50 = Ls05$L50,L95 = Ls05$L95,block_fallback = bs,n_grid =6 )# Plot the U-curves and report chosen R* (rounded for readability). p_uc14 <-plot_ucurve(tune14$grid, tune14$R_star, "T14 — U-curve") p_uc05 <-plot_ucurve(tune05$grid, tune05$R_star, "T05 — U-curve")print(p_uc14); print(p_uc05)# IMPORTANT: use %.0f (not %d) because R* is numeric (may be non-integer).message(sprintf("Chosen R* — T14: %.0f m | blocks ≈ %d m", tune14$R_star, tune14$block_m))message(sprintf("Chosen R* — T05: %.0f m | blocks ≈ %d m", tune05$R_star, tune05$block_m))# --- (c) Feature rasters @R* -------------------------------------------------# Smooth DEM at R* and derive slope/incident-cosine given the scenario sun angles. fr14 <-smooth_dem_and_derive( scen$E, scen$sun$T14$alt, scen$sun$T14$az, radius_m = tune14$R_star ) fr05 <-smooth_dem_and_derive( scen$E, scen$sun$T05$alt, scen$sun$T05$az, radius_m = tune05$R_star )# --- (d) Station features @R* ------------------------------------------------# Re-extract E*, slope*, cosi* (plus consistent LC factors) at station points.# This keeps training features aligned with the tuned raster features. st14_R <-add_drifts_at_R( st14, scen$E, scen$sun$T14$alt, scen$sun$T14$az, tune14$R_star,lc = scen$lc, lc_levels = scen$lc_levels,na_action ="fill"# or "drop" if you prefer to omit affected stations ) st05_R <-add_drifts_at_R( st05, scen$E, scen$sun$T05$alt, scen$sun$T05$az, tune05$R_star,lc = scen$lc, lc_levels = scen$lc_levels,na_action ="fill"# or "drop" if you prefer to omit affected stations )# --- (e) LBO-CV @R* ----------------------------------------------------------# Use the tuned smoothed DEM (E*) as the reference for CV blocks and domain# geometry so the CV respects the working resolution/scale of the model. bench14 <-run_lbo_cv(st14_R, E = scen$E, block_size = bs, models = mods) bench05 <-run_lbo_cv(st05_R, E = scen$E, block_size = bs, models = mods)print(bench14$metrics); print(bench05$metrics)# --- (f) Tuned maps ----------------------------------------------------------# Inject the tuned feature rasters so model predictions operate at R* scale. maps14_tuned <-predict_maps(stn_sf = st14_R, truth_raster = scen$R14, which_time ="T14",scen = scen, models = mods, lc_levels = scen$lc_levels,feature_rasters =list(E = fr14$Es, slp = fr14$slp, cosi = fr14$cosi) ) maps05_tuned <-predict_maps(stn_sf = st05_R, truth_raster = scen$R05, which_time ="T05",scen = scen, models = mods, lc_levels = scen$lc_levels,feature_rasters =list(E = fr05$Es, slp = fr05$slp, cosi = fr05$cosi) )# --- (g) Panels: truth | predictions | residual diagnostics ------------------ panel_T14 <-build_panels_truth_preds_errors_paged(maps = maps14_tuned, truth_raster = scen$R14, cv_tbl = bench14$cv,which_time ="T14", models_per_page =7, scatter_next_to_truth =TRUE ) panel_T05 <-build_panels_truth_preds_errors_paged(maps = maps05_tuned, truth_raster = scen$R05, cv_tbl = bench05$cv,which_time ="T05", models_per_page =7, scatter_next_to_truth =TRUE )print(panel_T14[[1]]); print(panel_T05[[1]])# Sensor noise (°C) – from specs or co-location sigma_inst <-0.5# α from residual variogram (microscale share) – T14 and T05 alpha14 <-nugget_fraction_from_cv(bench14$cv, model ="RF", crs_ref = st14) alpha05 <-nugget_fraction_from_cv(bench05$cv, model ="RF", crs_ref = st05)# Fallbacks if the fit failsif (!is.finite(alpha14)) alpha14 <-0.6if (!is.finite(alpha05)) alpha05 <-0.6# Fehlerbudgets berechnen (Base = runXX$res, Tuned = benchXX) eb14_base <-simple_error_budget(run14$res, sigma_inst, alpha14) |> dplyr::mutate(Time ="T14", Mode ="Base") eb05_base <-simple_error_budget(run05$res, sigma_inst, alpha05) |> dplyr::mutate(Time ="T05", Mode ="Base") eb14_tuned <-simple_error_budget(bench14, sigma_inst, alpha14) |> dplyr::mutate(Time ="T14", Mode ="Tuned") eb05_tuned <-simple_error_budget(bench05, sigma_inst, alpha05) |> dplyr::mutate(Time ="T05", Mode ="Tuned") eb_all <- dplyr::bind_rows(eb14_base, eb05_base, eb14_tuned, eb05_tuned) |> dplyr::relocate(Time, Mode)print(eb_all)# einfache Stacked-Bar-Plot-Funktion plot_error_budget <-function(df) { d <- df |> dplyr::filter(Component %in%c("Instrument var","Microscale var","Mesoscale var")) ggplot2::ggplot(d, ggplot2::aes(x =interaction(Time, Mode, sep =" "), y = Value, fill = Component) ) + ggplot2::geom_col(position ="stack") + ggplot2::theme_minimal() + ggplot2::labs(x =NULL, y ="Variance (°C²)", title ="Error budget by time & mode") } p_eb <-plot_error_budget(eb_all)print(p_eb)# =====================================================================# 7) Optional: save everything at the end (plots + tables + rasters)# - Change `export <- FALSE` at the top to only run/plot interactively# - We wrap saves in try() so a single failed save does not abort the run.# =====================================================================if (export) {# ---------- Ausgabe-Verzeichnis: results_<scen-name> ---------- out_root <- here::here("block4_5") out_dir <-file.path(out_root, sprintf("results_%s", scen_name)) fig_dir <-file.path(out_dir, "fig") tab_dir <-file.path(out_dir, "tab") ras_dir <-file.path(out_dir, "ras")# ohne Rückfrage, rekursiv, ohne Warnungendir.create(fig_dir, recursive =TRUE, showWarnings =FALSE)dir.create(tab_dir, recursive =TRUE, showWarnings =FALSE)dir.create(ras_dir, recursive =TRUE, showWarnings =FALSE)# ---------- Baseline: Previews & CV-Plots ----------------------save_plot_min(plot_landcover_terrain(scen, stations = st14, layout ="vertical"),fn_fig("landcover_terrain"))save_plot_min(plot_block_overview_2x2_en(scen, pts_sf = st14), fn_fig("overview_2x2"))save_plot_min(run14$res$blocks_plot, fn_fig("T14_blocks"))save_plot_min(run14$res$diag_plot, fn_fig("T14_diag"))save_plot_min(run05$res$blocks_plot, fn_fig("T05_blocks"))save_plot_min(run05$res$diag_plot, fn_fig("T05_diag"))save_plot_min(run14$maps$p_truth, fn_fig("T14_truth"))save_plot_min(run14$maps$p_pred, fn_fig("T14_pred"))save_plot_min(run05$maps$p_truth, fn_fig("T05_truth"))save_plot_min(run05$maps$p_pred, fn_fig("T05_pred"))# --- Tuned Panels ---save_plot_min(panel_T14[[1]], fn_fig("T14_panel_tuned"))save_plot_min(panel_T05[[1]], fn_fig("T05_panel_tuned"))# --- Raster ---save_raster_min(scen$E, fn_ras("E_dem"))save_raster_min(scen$R14, fn_ras("R14_truth"))save_raster_min(scen$R05, fn_ras("R05_truth"))if ("lc"%in%names(scen)) save_raster_min(scen$lc, fn_ras("landcover"))# ---------- Scale inference + tuned panels ----------------------safe_save_plot(p_vg14, fn_fig("T14_variogram"))safe_save_plot(p_vg05, fn_fig("T05_variogram"))safe_save_plot(p_uc14, fn_fig("T14_ucurve"))safe_save_plot(p_uc05, fn_fig("T05_ucurve"))safe_save_plot(panel_T14[[1]], fn_fig("T14_panel_tuned"))safe_save_plot(panel_T05[[1]], fn_fig("T05_panel_tuned"))save_table_readable(bench14$metrics, "metrics_T14_tuned", "Tuned metrics — T14")save_table_readable(bench05$metrics, "metrics_T05_tuned", "Tuned metrics — T05")save_table_readable(tune14$grid, "Ucurve_T14", "U-curve grid — T14")save_table_readable(tune05$grid, "Ucurve_T05", "U-curve grid — T05")save_table_readable(data.frame(L50 = Ls14$L50, L95 = Ls14$L95, R_star = tune14$R_star),"scales_T14", "Scales — T14 (L50/L95/R*)")save_table_readable(data.frame(L50 = Ls05$L50, L95 = Ls05$L95, R_star = tune05$R_star),"scales_T05", "Scales — T05 (L50/L95/R*)")save_table_readable(run14$res$metrics, file.path(tab_dir, sprintf("metrics_T14_%s", scen_name)))save_table_readable(run05$res$metrics, file.path(tab_dir, sprintf("metrics_T05_%s", scen_name)))save_table_readable(bench14$metrics, file.path(tab_dir, sprintf("metrics_T14_tuned_%s", scen_name)))save_table_readable(bench05$metrics, file.path(tab_dir, sprintf("metrics_T05_tuned_%s", scen_name)))save_table_readable(eb_all, file.path(tab_dir, sprintf("error_budget_%s", scen_name)))## ---------- Raster mit Szenario-Präfix --------------------------try(terra::writeRaster(scen$E, fn_ras("E_dem")), silent =TRUE)try(terra::writeRaster(scen$R14, fn_ras("R14_truth")), silent =TRUE)try(terra::writeRaster(scen$R05, fn_ras("R05_truth")), silent =TRUE)if ("lc"%in%names(scen))try(terra::writeRaster(scen$lc, fn_ras("landcover")), silent =TRUE)# ---------- Sessioninfo -----------------------------------------try(saveRDS(sessionInfo(), file.path(out_dir, sprintf("%s_sessionInfo.rds", scen_name))),silent =TRUE)message("✔ Exports written to: ", normalizePath(out_dir, winslash ="/")) }```
Helpers / Core library:
packages.R: centralized package loading and global run options (e.g., sf_use_s2(FALSE), seeds).
```{r}#| eval: false#' Geostatistical Learners & Map Predictor (Core Only)#'#' @title Learners and Raster Predictor (no helpers inside)#' @description#' A compact set of **model-specific predictors** used in your teaching/#' pipeline code, plus a high-level `predict_maps()` convenience that#' evaluates multiple learners on a full grid. #'#' This file intentionally contains **no helpers**. It assumes that common#' utilities and constants are sourced from your *helpers* module, including:#' - `%||%` — null-coalescing helper#' - `.default_vgm()` — conservative variogram fallback#' - `.align_factor_to_model()` — align factor levels at predict time#' - `safe_gam_formula()` — guarded GAM formula constructor#' - `lc_levels_default` — global land-cover levels#' - `temp_palette`, `stretch_q` — visualization defaults#'#' @details#' **Contract expected by all learners**:#' - `train_sf`, `test_sf` are `sf` objects with at least:#' - `temp` (numeric): the response variable to be learned#' - `x`, `y` (numeric): planar coordinates (will be derived from geometry#' if absent)#' - Drift/covariate columns depending on the learner (see each function)#' - Each learner returns a numeric vector of predictions aligned with#' `nrow(test_sf)`.#'#' **Coordinate Reference System**: all learners assume that `x` and `y`#' are in a **projected CRS** with meter-like units (e.g., UTM).#'#' **Error handling**:#' - Learners are defensive; if inputs are insufficient (e.g., too few rows,#' missing drift columns), they return `NA_real_` predictions of the correct#' length instead of failing hard (except where a *hard requirement* is unmet#' such as missing KED drifts in training).#'#' @section Dependencies:#' - **Packages**: `sf`, `gstat`, `mgcv`, `randomForest`, `terra`, `ggplot2`,#' `tibble`, `dplyr`, `stats`, `scales`#' - **Helpers (sourced elsewhere)**: `%||%`, `.default_vgm`, `.align_factor_to_model`,#' `safe_gam_formula`, `lc_levels_default`, `temp_palette`, `stretch_q`#'#' @seealso#' - Your helpers/utilities module for the functions noted above.#' - `gstat::krige`, `gstat::idw`, `gstat::variogram`, `gstat::fit.variogram`#' - `mgcv::gam`, `randomForest::randomForest`#'#' @keywords geostatistics interpolation kriging regression GAM randomForest#' @family learners#' Voronoi / Nearest-Station Predictor#'#' @description#' Assigns each prediction point the observed value from the **nearest**#' training station (a fast proxy for Voronoi interpolation).#'#' @param train_sf `sf` with at least `temp` and geometry.#' @param test_sf `sf` with geometry to predict for.#'#' @return Numeric vector `length(nrow(test_sf))` with nearest-neighbor temps.#' @examples#' # y_hat <- pred_Voronoi(train_sf, grid_sf)pred_Voronoi <-function(train_sf, test_sf) { idx <- sf::st_nearest_feature(test_sf, train_sf)as.numeric(train_sf$temp)[idx]}#' Inverse Distance Weighting (IDW)#'#' @description#' Classic **IDW** using `gstat::idw`, predicting from training points to#' the test geometry.#'#' @param train_sf `sf` with `temp` and geometry.#' @param test_sf `sf` with geometry.#' @param idp Inverse distance power (default `2`).#'#' @return Numeric vector of predictions for `test_sf`.#' @examples#' # y_hat <- pred_IDW(train_sf, grid_sf, idp = 2)pred_IDW <-function(train_sf, test_sf, idp =2) { pr <-suppressWarnings(gstat::idw(temp ~1, locations = train_sf, newdata = test_sf, idp = idp))as.numeric(pr$var1.pred)}#' Ordinary Kriging (OK)#'#' @description#' Univariate **OK** with an automatically fitted **exponential** variogram.#' Falls back to `.default_vgm()` if fitting fails (e.g., too few points).#'#' @param train_sf `sf` with `temp` and geometry.#' @param test_sf `sf` with geometry.#'#' @return Numeric vector of kriged predictions.#' @examples#' # y_hat <- pred_OK(train_sf, grid_sf)pred_OK <-function(train_sf, test_sf) { vg <-suppressWarnings(gstat::variogram(temp ~1, data = train_sf)) vgm_fit <-try(suppressWarnings(gstat::fit.variogram(vg, gstat::vgm("Exp"))), silent =TRUE)if (inherits(vgm_fit, "try-error")) vgm_fit <-.default_vgm(train_sf$temp) pr <-suppressWarnings(gstat::krige(temp ~1, locations = train_sf, newdata = test_sf, model = vgm_fit))as.numeric(pr$var1.pred)}#' Kriging with External Drift (KED)#'#' @description#' **KED** with additive drift terms. Requires drifts in *training*, fills#' non-finite values in *test* by median of training. If `lc` is present in#' both sets, it is included as a categorical drift with aligned levels.#'#' @details#' **Required drift columns** in `train_sf`: `z_surf`, `slp`, `cosi`. #' If any are missing in training, this function errors (by design).#'#' @param train_sf `sf`; must contain `temp`, `z_surf`, `slp`, `cosi`, geometry,#' and optionally `lc`.#' @param test_sf `sf` with geometry and preferably the same drift columns#' (non-finite values are median-filled).#' @param ... Unused (placeholder for compatibility).#'#' @return Numeric vector of KED predictions, `length(nrow(test_sf))`.#' @examples#' # y_hat <- pred_KED(train_sf, grid_sf)pred_KED <-function(train_sf, test_sf, ...) { need <-c("z_surf","slp","cosi") miss <-setdiff(need, names(train_sf))if (length(miss)) stop("pred_KED(): missing drifts in training: ", paste(miss, collapse =", ")) use_lc <-"lc"%in%names(train_sf) &&"lc"%in%names(test_sf) tr <- train_sf; te <- test_sfif (use_lc) { tr$lc <-droplevels(factor(tr$lc)) te$lc <-factor(as.character(te$lc), levels =levels(tr$lc)) te$lc[is.na(te$lc)] <-levels(tr$lc)[1] }for (nm in need) { m <- stats::median(tr[[nm]][is.finite(tr[[nm]])], na.rm =TRUE) te[[nm]][!is.finite(te[[nm]])] <- m } keep_tr <-c("temp", need, if (use_lc) "lc") dtr <- sf::st_drop_geometry(tr)[, keep_tr, drop =FALSE] ok <- stats::complete.cases(dtr); tr <- tr[ok, ]if (nrow(tr) <5) return(rep(NA_real_, nrow(te))) form <- stats::as.formula(paste("temp ~", paste(c(need, if (use_lc) "lc"), collapse =" + "))) vg <-suppressWarnings(gstat::variogram(form, data = tr)) vgm_fit <-try(suppressWarnings(gstat::fit.variogram(vg, gstat::vgm("Exp"))), silent =TRUE)if (inherits(vgm_fit, "try-error")) { ps <- stats::var(sf::st_drop_geometry(tr)$temp, na.rm =TRUE) vgm_fit <- gstat::vgm(psill = ps, model ="Exp", range =max(vg$dist, na.rm =TRUE)/3, nugget =0.1*ps) } pr <-suppressWarnings(gstat::krige(form, locations = tr, newdata = te, model = vgm_fit))as.numeric(pr$var1.pred)}#' Random Forest Regressor (RF)#'#' @description#' A **RandomForest** on spatial and drift features. If `lc` is absent, a#' harmless single-level factor is injected (levels provided by#' `lc_levels_default`). At prediction, factor levels are aligned using#' `.align_factor_to_model()`.#'#' @param train_sf `sf` with `temp`, `x`, `y`, `z_surf`, `slp`, `cosi`,#' optionally `lc` (factor), and geometry.#' @param test_sf `sf` with the same predictors (geometry required).#'#' @return Numeric vector of RF predictions.#' @examples#' # y_hat <- pred_RF(train_sf, grid_sf)pred_RF <-function(train_sf, test_sf) { dtr <- sf::st_drop_geometry(train_sf)if (!("lc"%in%names(dtr))) dtr$lc <-factor(lc_levels_default[1], levels = lc_levels_default) dtr$lc <-droplevels(factor(as.character(dtr$lc), levels = lc_levels_default)) dtr <- stats::na.omit(dtr)if (nrow(dtr) <5) return(rep(NA_real_, nrow(test_sf))) rf <- randomForest::randomForest(temp ~ x + y + z_surf + slp + cosi + lc, data = dtr, na.action = na.omit) dte <- sf::st_drop_geometry(test_sf)if (!("lc"%in%names(dte))) dte$lc <-factor(lc_levels_default[1], levels = lc_levels_default) lev <-levels(dtr$lc) dte$lc <-.align_factor_to_model(dte$lc, lev) good <- stats::complete.cases(dte[, c("x","y","z_surf","slp","cosi","lc")]) out <-rep(NA_real_, nrow(dte)); if (any(good)) out[good] <- stats::predict(rf, dte[good, ]) out}#' Generalized Additive Model (GAM)#'#' @description#' A **GAM** (thin-plate splines) built with a protective formula from#' `safe_gam_formula()` that caps basis sizes and includes `lc` only if#' useful. Requires a minimal number of complete rows.#'#' @param train_sf `sf` with `temp`, `x`, `y`, `z_surf`, `slp`, `cosi`,#' optionally `lc` (factor).#' @param test_sf `sf` with matching predictors.#'#' @return Numeric vector of GAM predictions; `NA_real_` if the model could#' not be trained.#' @examples#' # y_hat <- pred_GAM(train_sf, grid_sf)pred_GAM <-function(train_sf, test_sf) { dtr <- sf::st_drop_geometry(train_sf) keep <-intersect(c("temp","x","y","z_surf","slp","cosi","lc"), names(dtr)) dtr <- dtr[stats::complete.cases(dtr[, keep, drop =FALSE]), keep, drop =FALSE]if (!nrow(dtr)) return(rep(NA_real_, nrow(test_sf)))if ("lc"%in%names(dtr)) dtr$lc <-droplevels(factor(dtr$lc)) inc_lc <-"lc"%in%names(dtr) &&nlevels(dtr$lc) >=2if (nrow(dtr) <10) return(rep(NA_real_, nrow(test_sf))) gm <- mgcv::gam(formula =safe_gam_formula(dtr, include_lc = inc_lc), data = dtr, method ="REML", select =TRUE) dte <- sf::st_drop_geometry(test_sf) vars <-c("x","y","z_surf","slp","cosi", if (inc_lc) "lc"); vars <-intersect(vars, names(dte))if (inc_lc) { lev <-levels(model.frame(gm)$lc)if (!("lc"%in%names(dte))) dte$lc <- lev[1] dte$lc <-.align_factor_to_model(dte$lc, lev) } good <- stats::complete.cases(dte[, vars, drop =FALSE]) out <-rep(NA_real_, nrow(dte)); if (any(good)) out[good] <- stats::predict(gm, dte[good, vars, drop =FALSE], type ="response") out}#' Predict on a Raster Grid with Multiple Learners + Pretty Plots#'#' @description#' High-level utility that:#' 1. Ensures station covariates exist (E, slope, cos(i), optional LC).#' 2. Builds a **full-grid** data frame of covariates from rasters.#' 3. Runs selected learners (`Voronoi`, `IDW`, `OK`, `KED`, `RF`, `GAM`).#' 4. Returns both **prediction rasters** and **ggplot** panels.#'#' @param stn_sf `sf` training stations; must have `temp` and (if missing)#' this function will derive `x`, `y` and extract missing covariates from#' rasters.#' @param truth_raster `SpatRaster` (single-layer) used only for common#' color scaling in plots (and optional “truth” visualization).#' @param which_time Character; `"T14"` or `"T05"` (plot titles only).#' @param scen A scenario list containing at least: `E`, `slp`, and either#' `I14` or `I05` (for cos(i)) and optionally `lc` + `lc_levels`.#' @param models Character vector of learners to run.#' @param lc_levels Optional character vector of LC levels (defaults to#' `scen$lc_levels`).#' @param feature_rasters Optional list with named rasters `E`, `slp`, `cosi`#' to **override** the scenario’s baseline (e.g., when using tuned R*).#'#' @return A list with:#' \describe{#' \item{pred_df}{Tidy `tibble` of predictions for all models & grid cells}#' \item{pred_rasters}{`list` of `SpatRaster` predictions, one per model}#' \item{p_pred}{`ggplot` facet showing all model maps}#' \item{p_truth}{`ggplot` of the truth raster (for reference)}#' }#'#' @note#' Requires helpers/constants: `%||%`, `temp_palette`, `stretch_q`, plus#' land-cover level alignment utilities.#'#' @examples#' # out <- predict_maps(stn_sf, scen$R14, which_time = "T14", scen = scen)#' # print(out$p_truth); print(out$p_pred)predict_maps <-function(stn_sf, truth_raster,which_time =c("T14","T05"), scen, models =c("Voronoi","IDW","OK","KED","RF","GAM"),lc_levels =NULL,feature_rasters =NULL) { which_time <-match.arg(which_time) lc_levels <- lc_levels %||% scen$lc_levels E <- feature_rasters$E %||% scen$E slp_r <- feature_rasters$slp %||% scen$slp cosi_r <- feature_rasters$cosi %||%if (which_time =="T14") scen$I14 else scen$I05 has_lc <- ("lc"%in%names(scen)) &&!is.null(scen$lc) lc_r <-if (has_lc) scen$lc elseNULL train_sf <- stn_sfif (!all(c("x","y") %in%names(train_sf))) { xy <- sf::st_coordinates(train_sf); train_sf$x <- xy[,1]; train_sf$y <- xy[,2] }if (!("z_surf"%in%names(train_sf))) train_sf$z_surf <-as.numeric(terra::extract(E, sf::st_coordinates(train_sf))[,1])if (!("slp"%in%names(train_sf))) train_sf$slp <-as.numeric(terra::extract(slp_r, sf::st_coordinates(train_sf))[,1])if (!("cosi"%in%names(train_sf))) train_sf$cosi <-as.numeric(terra::extract(cosi_r, sf::st_coordinates(train_sf))[,1])if (has_lc &&!("lc"%in%names(train_sf))) { lc_codes <-as.integer(terra::extract(lc_r, sf::st_coordinates(train_sf))[,1]) lc_codes[is.na(lc_codes)] <-1L lc_codes <-pmax(1L, pmin(lc_codes, length(lc_levels))) train_sf$lc <-factor(lc_levels[lc_codes], levels = lc_levels) } xy <-as.data.frame(terra::xyFromCell(E, 1:terra::ncell(E))); names(xy) <-c("x","y") grid_df <- xy grid_df$z_surf <-as.numeric(terra::values(E)) grid_df$slp <-as.numeric(terra::values(slp_r)) grid_df$cosi <-as.numeric(terra::values(cosi_r))if (has_lc) { lc_codes <-as.integer(terra::values(lc_r)) lc_codes[!is.finite(lc_codes)] <-1L lc_codes <-pmax(1L, pmin(lc_codes, length(lc_levels))) grid_df$lc <-factor(lc_levels[lc_codes], levels = lc_levels) } grid_sf <- sf::st_as_sf(grid_df, coords =c("x","y"),crs = sf::st_crs(train_sf), remove =FALSE) use_lc <- has_lc && ("lc"%in%names(train_sf)) && ("lc"%in%names(grid_sf))if (use_lc) { lev <-levels(droplevels(factor(train_sf$lc))) train_sf$lc <-factor(as.character(train_sf$lc), levels = lev) grid_sf$lc <-factor(as.character(grid_sf$lc), levels = lev)if (anyNA(train_sf$lc) ||anyNA(grid_sf$lc)) { use_lc <-FALSE; train_sf$lc <-NULL; grid_sf$lc <-NULL } } pred_list <-list()if ("Voronoi"%in% models) pred_list$Voronoi <-pred_Voronoi(train_sf, grid_sf)if ("IDW"%in% models) pred_list$IDW <-pred_IDW (train_sf, grid_sf, idp =2)if ("OK"%in% models) pred_list$OK <-pred_OK (train_sf, grid_sf)if ("KED"%in% models) pred_list$KED <-pred_KED (train_sf, grid_sf)if ("RF"%in% models) { dtr <- sf::st_drop_geometry(train_sf) rf_vars <-c("x","y","z_surf","slp","cosi", if (use_lc) "lc") dtr <- stats::na.omit(dtr[, c("temp", rf_vars), drop =FALSE]) pred_list$RF <-if (nrow(dtr) >=5) { rf <- randomForest::randomForest(stats::as.formula(paste("temp ~", paste(rf_vars, collapse =" + "))),data = dtr, na.action = na.omit)as.numeric(stats::predict(rf, sf::st_drop_geometry(grid_sf)[, rf_vars, drop =FALSE])) } elserep(NA_real_, nrow(grid_sf)) }if ("GAM"%in% models) pred_list$GAM <-pred_GAM (train_sf, grid_sf) pred_df <- dplyr::bind_rows(lapply(names(pred_list), function(nm) { tibble::tibble(model = nm, x = grid_df$x, y = grid_df$y, pred = pred_list[[nm]]) })) make_r <-function(vals, template = E) { r <- terra::rast(template); terra::values(r) <-as.numeric(vals); r } pred_rasters <-lapply(pred_list, make_r) truth_df <-as.data.frame(truth_raster, xy =TRUE, na.rm =FALSE)names(truth_df) <-c("x","y","truth") lims <- stats::quantile(truth_df$truth, probs = stretch_q, na.rm =TRUE) p_pred <- ggplot2::ggplot(pred_df, ggplot2::aes(x, y, fill = pred)) + ggplot2::geom_raster() + ggplot2::scale_fill_gradientn(colors =temp_palette(256), limits = lims,oob = scales::squish, name ="Temp") + ggplot2::coord_equal() + ggplot2::theme_minimal() + ggplot2::labs(title =sprintf("%s — Predictions by model", which_time),x ="Easting", y ="Northing") + ggplot2::facet_wrap(~ model, ncol =3) p_truth <- ggplot2::ggplot(truth_df, ggplot2::aes(x, y, fill = truth)) + ggplot2::geom_raster() + ggplot2::scale_fill_gradientn(colors =temp_palette(256), limits = lims,oob = scales::squish, name ="Temp") + ggplot2::coord_equal() + ggplot2::theme_minimal() + ggplot2::labs(title =sprintf("%s — Truth raster", which_time),x ="Easting", y ="Northing")list(pred_df = pred_df, pred_rasters = pred_rasters, p_pred = p_pred, p_truth = p_truth)}```
Scenarios (data & geometry):scenarios/ + registry.R. Each scenario provides a make() factory returning a standard object contract (see below). registry.R maps a name to its builder; source_scenario(name) returns the make() function. Scenarios control domain rasters, station sets, and recommended block size.
Produces: CV metrics (MAE/RMSE/R²), blocks plot (spatial CV folds), diag plot (pred vs obs), truth & prediction maps.
What do we see?
The pipeline is behaving sensibly: block CV exposes how much of the error comes from sensors versus spatial structure. At daytime (T14), most error is already “accounted for” by instrumentation; what remains splits between fine-scale heterogeneity and broader structure. At pre-dawn (T05), fields are smoother at the choosen station spacing: after peeling off instrument noise, almost all leftover error is mesoscale—i.e., large-scale processes your mean/residual model doesn’t yet capture.
Daytime (T14)
What the errors mean: Sensors do a lot of the damage; once you discount that, the remainder is balanced: some is truly microscale (representativeness—canopy edges, roughness, small topographic breaks), and some is mesoscale (gentle gradients / structure not fully in the drift).
What you likely see on maps: Predictions look coherent, but local pockets near environmental transitions (edge effects, slope/aspect shifts) show residual texture. Scatter plots are tight with mild spread; block-wise boxplots show similar medians across your better models.
Model behavior: Simple geometric baselines (Voronoi/IDW) are adequate as references but can over- or under-smooth. KED/GAM/RF generally capture the daytime drift better; their residuals still show a bit of spatial correlation, which is consistent with the micro+meso split.
Implication: Improvements now come from targeted features (multi-scale terrain/roughness/canopy, radiation states) and, if residuals are directional, anisotropic/cost-aware residual modeling. More raw smoothing won’t help much.
Pre-dawn (T05)
What the errors mean: After instrument noise, the leftover error is overwhelmingly mesoscale. That points to organised nocturnal processes—cold-air drainage, pooling, gentle advection—operating at scales larger than your microstructure and not fully encoded by the current drift.
What you likely see on maps: Smooth, broad residual swaths aligned with terrain and flow paths rather than speckled local noise. Scatter remains fair, but block-wise errors can vary between blocks that sit on/away from drainage lines.
Model behavior: Purely geometric methods under-represent the organised night-time structure; RF/GAM help if fed flow-aware features, but without those drivers their residuals still carry structure. Kriging alone won’t fix it unless the metric (anisotropy, barriers, cost distance) matches the physics.
Implication: Focus on process-scale cues: flow-aligned neighborhoods, barriers across ridgelines, cost/geodesic distances that penalise uphill motion, and drift terms that proxy stratification (e.g., cold-air pathways, sky-view, topographic indices at appropriate scales). If residual correlation persists after that, add residual kriging with an anisotropic/cost metric.
Practical takeaways
You’re not bottlenecked by algorithm choice; you’re bottlenecked by process representation:
T14: refine multi-scale features (terrain filters, canopy/roughness, radiation states) and allow directional residuals where wind/valley orientation matters.
T05: elevate flow physics in the mean and residual metric; think valley graphs, cost distance, barriers.
Sampling/design: If T14 micro noise bothers you, micro-targeted placements (edges, small basins) or denser spacing help. For T05 meso, ensure transects across valley axes and basin outlets so the model “sees” the gradients it must learn.
QC nudge: If any model shows physically implausible tails (e.g., extreme GAM values at night), tighten smoothness or clip to plausible ranges and revisit feature scaling.
Bottom line: baseline runs already tell a coherent story—daytime limits are representativeness + mild missing drift; night-time limits are larger-scale drainage/advection. Aim your next changes at process-aware features and matching the residual metric to the physics rather than chasing extra smoothing.
U-curve → R*:tune_Rstar_ucurve(st, E, alt, az, L50, L95, block_fallback, n_grid) scans smoothing radii around [L50, L95] to find the RMSE-minimizer R* via spatial CV.
Features @ R*:smooth_dem_and_derive(E, alt, az, radius_m=R*) → Es, slp, cosi.
Re-extract station features @ R*:add_drifts_at_R(st, E, alt, az, R*, lc=..., lc_levels=...) aligns training features with tuned rasters.
Accuracy: Day (T14) ~0.45 °C RMSE, Pre-dawn (T05) ~0.60 °C RMSE — similar before/after tuning.
Error structure:
T14: sensors dominate; leftover split between micro and meso.
T05: sensors ~half; remainder is mostly mesoscale (smooth nocturnal structure).
Daytime (T14)
Baseline (LBO-CV)
RMSE:0.452 °C (Total var 0.206 °C²)
Instrument:0.116 °C² (~56%)
Microscale:0.0496 °C² (~24%)
Mesoscale:0.0401 °C² (~20%) Reading: After removing sensor noise, remaining error is fairly balanced between sub-grid heterogeneity (representativeness) and broader, unmodelled structure.
Tuned (R* features)
RMSE: essentially unchanged (≈ 0.45 °C).
Budget: slices stay close to baseline (minor shifts only). Reading: Tuning mainly harmonizes feature scale (R*) and improves map coherence; it doesn’t move headline error much.
Pre-dawn (T05)
Baseline (LBO-CV)
RMSE:0.600 °C (Total var 0.362 °C²)
Instrument:0.185 °C² (~51%)
Microscale:≈ 0 °C² (~0%)
Mesoscale:0.177 °C² (~49%) Reading: Nearly all non-instrument error is mesoscale — consistent with smooth nocturnal fields (drainage/advection, stratification) at your station spacing.
Tuned (R* features)
RMSE: essentially unchanged (≈ 0.60 °C).
Budget: instrument remains ~51%; remainder still meso-heavy. Reading: Tuning doesn’t change the picture: add process-scale structure rather than more smoothing.
What to do next (precise & minimal)
Sensors (both times): sanity-check σ_inst using co-location/specs; improve shielding/siting if instrument slice looks high.
Meso: add/strengthen process covariates (radiation states, exposure), and allow anisotropy/cost in residuals where flow/wind channel influence.
T05 (meso-dominated): favor flow-aligned neighborhoods, barriers across ridgelines, and advection/drainage proxies; if RF/GAM residuals remain correlated, add residual kriging (possibly anisotropic/cost-based).
How to present (clear story)
Per time slice: metrics table → obs-pred scatter → block-wise error boxplots → truth vs prediction maps → stacked error-budget bars (Instrument / Micro / Meso).
Then show the same row for tuned to highlight that scale harmonization improved map consistency while headline RMSE stayed stable.
Bottom line: Your models are already accurate at T14 and reasonable at T05. The limiting factor during the day is representativeness + mild missing structure; at night it’s large-scale process not yet in the mean/residual model. Focus on process-aware features and anisotropic/cost residuals, not more smoothing.
Error budget — idea, concept, and purpose
Idea. Turn the overall CV error (\(\sigma_{\text{cv}}^2 \approx \text{RMSE}^2\)) into a story of where it comes from:
Instrument (sensor noise you assume),
Microscale (sub-grid, representativeness),
Mesoscale (larger-scale, unmodelled structure).
Concept.
First peel off a fixed instrument part using your sensor SD (sigma_inst).
Split the leftover between micro and meso with α.
Prefer α from the residual variogram’s nugget fraction (on CV residuals): nugget ≈ micro.
Otherwise use a site heuristic (open 0.2–0.4; mixed 0.4–0.6; complex 0.6–0.8).
What for.
Diagnose limits: Is error dominated by instrument, micro (representativeness), or meso (missing process/scale)?
Guide action:
Big instrument → sensor QA, shielding, calibration.
Big meso → add process covariates (drift), anisotropy/cost metrics, rethink neighborhood/blocks.
Compare models/scenarios: Same RMSE can hide very different error structures.
Communicate uncertainty: Bars in °C² (optionally show SD by √).
Why σ_inst and α.
σ_inst externalizes known noise (you choose it from specs/co-location).
α makes the remaining variance interpretable by scale: micro (< ~R*/2) vs meso (≳ ~R*).
Practical workflow.
Compute CV errors → overall variance.
Set σ_inst.
Get α from residual variogram nugget fraction (per time/model), fallback to heuristic.
Report instrument / micro / meso components; act on the largest slice.
Here’s the plain-English read of your results.
T14 (daytime)
Total variance: 0.206 °C² (RMSE ≈ 0.452 °C).
Instrument: 0.116 °C² → ~56% of total (SD ≈ 0.34 °C).
Microscale: 0.0496 °C² → ~24% (SD ≈ 0.22 °C).
Mesoscale: 0.0401 °C² → ~20% (SD ≈ 0.20 °C).
Interpretation: Most error is explained by sensor noise. The remaining ~44% splits fairly evenly between sub-grid/representativeness (micro) and larger-scale unmodeled structure (meso). Both small-scale heterogeneity and some broader process signal are still in play.
T05 (pre-dawn)
Total variance: 0.362 °C² (RMSE ≈ 0.600 °C).
Instrument: 0.185 °C² → ~51% of total (SD ≈ 0.43 °C).
Microscale:≈ 0.
Mesoscale: 0.177 °C² → ~49% (SD ≈ 0.42 °C).
Interpretation: About half the error is sensor noise. Nearly all of the leftover is mesoscale, i.e., smoother, broader structure that the mean/residual model hasn’t captured (e.g., nocturnal drainage patterns, advection, or missing covariates). A near-zero micro slice is plausible at night when fields are smoother at the station spacing.
Net takeaway
T14: Sensor noise dominates; the rest is a balanced mix of micro and meso → both better micro-scale features (or finer R*) and some added process covariates could help.
T05: Sensor ~half; remainder is meso-dominated → focus on process/scale (e.g., flow-aligned/anisotropic metrics, cost distance, additional nocturnal drivers).