library(spopt)
library(tidycensus)
library(tidyverse)
library(sf)
library(mapgl)
dallas <- get_acs(
geography = "tract",
variables = c(
pop = "B01003_001",
income = "B19013_001",
bachelors = "DP02_0068P"
),
state = "TX",
county = "Dallas",
geometry = TRUE,
year = 2023,
output = "wide"
) |>
filter(!is.na(incomeE), !is.na(bachelorsE))Regionalization refers to the process of grouping smaller geographic units into larger, spatially contiguous regions. Unlike standard clustering methods, regionalization algorithms enforce that resulting regions form connected geographic areas - you can’t have a region with disconnected pieces scattered across the map.
This vignette walks through spopt’s regionalization algorithms using Census tract data from Dallas, Texas. We’ll explore how to build regions that minimize internal heterogeneity while maintaining spatial contiguity and meeting population thresholds.
When would you use regionalization?
Regionalization solves problems across many fields:
- Political redistricting: Building compact, contiguous districts that balance population
- Market segmentation: Creating sales territories with similar customer characteristics
- Health planning: Aggregating small-area data while preserving spatial relationships
- Urban planning: Delineating neighborhoods based on socioeconomic similarity
- Census data analysis: Addressing differential privacy concerns by aggregating blocks into larger areas
Getting Census data
Let’s start by pulling some demographic data for Census tracts in Dallas County, Texas. We’ll use the tidycensus package to get population, median household income, and percentage with a bachelor’s degree - variables that might define meaningful neighborhood clusters.
We now have 642 Census tracts with population, income, and education data. Let’s take a quick look at the geographic distribution of median household income:
maplibre_view(dallas, column = "incomeE")The map reveals the familiar spatial pattern of income inequality in Dallas - higher incomes concentrated in the Park Cities north of downtown, with lower incomes in the southern part of the county.
Max-P regionalization
The Max-P algorithm (Duque, Anselin, and Rey 2012) finds the maximum number of regions such that each region exceeds a specified threshold while minimizing within-region heterogeneity. This is particularly useful when you need regions that meet minimum population requirements for statistical reliability. Recent extensions support compactness constraints (Feng, Rey, and Wei 2022) and improved efficiency (Wei, Rey, and Knaap 2021).
Let’s create regions where each must contain at least 50,000 people:
maxp_result <- max_p_regions(
dallas,
attrs = c("incomeE", "bachelorsE"),
threshold_var = "popE",
threshold = 50000,
n_iterations = 100,
seed = 1983
)
maplibre_view(maxp_result, column = ".region", legend = FALSE)Let’s step through the key parameters:
-
attrs: The variables used to measure similarity. Tracts with similar income and education levels will be grouped together. -
threshold_var: The variable that must meet the minimum threshold (population in this case). -
threshold: Each region must have at least this many people. -
n_iterations: The algorithm uses a tabu search heuristic; more iterations generally yield better solutions. -
seed: For reproducibility, since the algorithm has stochastic elements.
The result is an sf object with a new .region column indicating each tract’s assigned region. The algorithm found 41 regions, each with at least 50,000 residents.
You can access metadata about the solution through the spopt attribute:
attr(maxp_result, "spopt")$algorithm
[1] "max_p"
$n_regions
[1] 41
$objective
[1] 765.0026
$threshold_var
[1] "popE"
$threshold
[1] 50000
$region_stats
region n_areas threshold_sum meets_threshold
1 9 19 57342 TRUE
2 28 20 52429 TRUE
3 15 28 82128 TRUE
4 23 23 78317 TRUE
5 3 19 78368 TRUE
6 29 23 75166 TRUE
7 7 19 75973 TRUE
8 35 23 99588 TRUE
9 11 17 65379 TRUE
10 24 14 59220 TRUE
11 19 19 66522 TRUE
12 39 16 59380 TRUE
13 34 13 54521 TRUE
14 5 12 50064 TRUE
15 36 18 71518 TRUE
16 16 13 57744 TRUE
17 27 11 50386 TRUE
18 33 10 51744 TRUE
19 13 13 58900 TRUE
20 8 13 54524 TRUE
21 30 13 67132 TRUE
22 22 11 59834 TRUE
23 32 15 68631 TRUE
24 41 11 57571 TRUE
25 17 16 53816 TRUE
26 31 16 65380 TRUE
27 37 12 53023 TRUE
28 20 20 76831 TRUE
29 18 15 54725 TRUE
30 14 16 57937 TRUE
31 2 18 62688 TRUE
32 21 18 72625 TRUE
33 38 13 56932 TRUE
34 6 12 52019 TRUE
35 26 10 51164 TRUE
36 1 16 79084 TRUE
37 25 16 67638 TRUE
38 4 11 54192 TRUE
39 40 10 54831 TRUE
40 12 13 63749 TRUE
41 10 17 69841 TRUE
$solve_time
[1] 0.07716489
$scaled
[1] TRUE
$n_iterations
[1] 100
$n_sa_iterations
[1] 100
$compact
[1] FALSE
$compact_weight
[1] 0.5
$homogeneous
[1] TRUE
$mean_compactness
NULL
$region_compactness
NULL
Spatial weights
By default, all regionalization functions use queen contiguity - two tracts are neighbors if they share any boundary point (including corners). You can also use rook contiguity, where tracts must share an edge to be neighbors:
maxp_rook <- max_p_regions(
dallas,
attrs = c("incomeE", "bachelorsE"),
threshold_var = "popE",
threshold = 50000,
weights = "rook",
n_iterations = 100,
seed = 1983
)For more control, you can specify weights as a list:
-
list(type = "knn", k = 6): K-nearest neighbors (useful for point data or ensuring connectivity) -
list(type = "distance", d = 5000): Distance-based weights (units match your CRS)
You can also pass an nb object created with spdep or spopt’s sp_weights() function.
Compact regions
For applications like sales territories or electoral districts, you may want regions with compact, regular shapes. The compact parameter optimizes for compactness in addition to attribute homogeneity:
maxp_compact <- max_p_regions(
dallas,
attrs = c("incomeE", "bachelorsE"),
threshold_var = "popE",
threshold = 50000,
weights = "rook",
compact = TRUE,
compact_weight = 0.5,
n_iterations = 100,
seed = 1983
)
maplibre_view(maxp_compact, column = ".region", legend = FALSE)The compact_weight parameter (0 to 1) controls the trade-off between attribute homogeneity and geometric compactness. Higher values prioritize compact shapes. The parameter compact_metric provides a choice between two compactness metrics. The default, “centroid dispersion”, is appropriate for both polygons (e.g., state borders) and point geometries (e.g., store locations). The alternative option, “NMI” (normalized moment of inertia), is the original metric proposed by Feng, Rey, and Wei (2022), and is appropriate only for polygons.
SKATER algorithm
SKATER (Spatial K’luster Analysis by Tree Edge Removal) (Assunção et al. 2006) takes a different approach. It first builds a minimum spanning tree connecting all tracts based on their attribute similarity, then iteratively removes edges to create clusters. The algorithm is fast and produces spatially coherent regions.
skater_result <- skater(
dallas,
attrs = c("incomeE", "bachelorsE"),
n_regions = 6,
seed = 1983
)
maplibre_view(skater_result, column = ".region", legend = FALSE)SKATER supports a floor and floor_value parameter if you need minimum population constraints:
AZP: Automatic Zoning Procedure
The Automatic Zoning Procedure (AZP) (Openshaw 1977; Openshaw and Rao 1995) uses local search optimization with three algorithm variants: basic (greedy), tabu search, and simulated annealing.
azp_result <- azp(
dallas,
attrs = c("incomeE", "bachelorsE"),
n_regions = 20,
method = "tabu",
tabu_length = 10,
max_iterations = 100,
seed = 1983
)
maplibre_view(azp_result, column = ".region", legend = FALSE)The method parameter controls which algorithm variant to use:
-
"basic": Simple greedy local search (fastest) -
"tabu": Tabu search, which maintains a list of recent moves to avoid getting stuck in local optima -
"sa": Simulated annealing, which accepts some worse solutions early to explore more of the solution space
For large problems, you may also want to use the simulated annealing variant:
SPENC: Spatially-Encouraged Spectral Clustering
SPENC (Wolf 2021) combines spectral clustering with spatial constraints. It uses a radial basis function (RBF) kernel to measure attribute similarity and incorporates spatial connectivity into the spectral embedding. This approach can find clusters with complex, non-convex shapes that other methods might miss.
spenc_result <- spenc(
dallas,
attrs = c("incomeE", "bachelorsE"),
n_regions = 15,
gamma = 1.0,
seed = 1983
)
maplibre_view(spenc_result, column = ".region", legend = FALSE)The gamma parameter controls the RBF kernel bandwidth - higher values create “tighter” clusters in attribute space.
Ward spatial clustering
Spatially-constrained Ward clustering is a hierarchical method that only allows merging adjacent clusters. At each step, it merges the pair of adjacent clusters that minimizes the increase in total within-cluster variance.
ward_result <- ward_spatial(
dallas,
attrs = c("incomeE", "bachelorsE"),
n_regions = 15
)
maplibre_view(ward_result, column = ".region", legend = FALSE)Ward clustering is deterministic (no random seed needed) and tends to produce compact, roughly equal-sized regions.
Choosing an algorithm
Each regionalization algorithm has strengths for different scenarios:
| Algorithm | Best for | Key features |
|---|---|---|
| Max-P | Population thresholds | Maximizes number of regions meeting constraints |
| SKATER | Fast, interpretable results | Tree-based, good for large datasets |
| AZP | High-quality solutions | Multiple optimization variants |
| SPENC | Complex cluster shapes | Spectral embedding with spatial constraints |
| Ward | Deterministic, balanced regions | Hierarchical, no tuning required |
For most applications, I’d recommend starting with Max-P if you have population constraints, or SKATER for a quick first pass. If you want to explore the solution space more thoroughly, try AZP with tabu search or simulated annealing.
Next steps
- Facility Location - Solve location-allocation problems
- Huff Model - Model market share and retail competition
- Travel-Time Cost Matrices - Use real-world travel times
