```r
library(spopt)
library(tidycensus)
library(tidyverse)
library(sf)
library(mapgl)

dallas <- get_acs(
  geography = "tract",
  variables = c(
    pop = "B01003_001",
    income = "B19013_001",
    bachelors = "DP02_0068P"
  ),
  state = "TX",
  county = "Dallas",
  geometry = TRUE,
  year = 2023,
  output = "wide"
) |>
  filter(!is.na(incomeE), !is.na(bachelorsE))
```

Regionalization refers to the process of grouping smaller geographic units into larger, spatially contiguous regions. Unlike standard clustering methods, regionalization algorithms enforce that the resulting regions form connected geographic areas - you can't have a region with disconnected pieces scattered across the map.
This vignette walks through spopt’s regionalization algorithms using Census tract data from Dallas, Texas. We’ll explore how to build regions that minimize internal heterogeneity while maintaining spatial contiguity and meeting population thresholds.
When would you use regionalization?
Regionalization solves problems across many fields:
- Political redistricting: Building compact, contiguous districts that balance population
- Market segmentation: Creating sales territories with similar customer characteristics
- Health planning: Aggregating small-area data while preserving spatial relationships
- Urban planning: Delineating neighborhoods based on socioeconomic similarity
- Census data analysis: Addressing differential privacy concerns by aggregating blocks into larger areas
Getting Census data
Let’s start by pulling some demographic data for Census tracts in Dallas County, Texas. We’ll use the tidycensus package to get population, median household income, and percentage with a bachelor’s degree - variables that might define meaningful neighborhood clusters.
We now have 642 Census tracts with population, income, and education data. Let’s take a quick look at the geographic distribution of median household income:
```r
maplibre_view(dallas, column = "incomeE")
```

The map reveals the familiar spatial pattern of income inequality in Dallas - higher incomes concentrated in the Park Cities north of downtown, with lower incomes in the southern part of the county.
Max-P regionalization
The Max-P algorithm (Duque, Anselin, and Rey 2012) finds the maximum number of regions such that each region exceeds a specified threshold while minimizing within-region heterogeneity. This is particularly useful when you need regions that meet minimum population requirements for statistical reliability. Recent extensions support compactness constraints (Feng, Rey, and Wei 2022) and improved efficiency (Wei, Rey, and Knaap 2021).
Let’s create regions where each must contain at least 50,000 people:
```r
maxp_result <- max_p_regions(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  threshold_var = "popE",
  threshold = 50000,
  n_iterations = 100,
  seed = 1983
)

maplibre_view(maxp_result, column = ".region", legend = FALSE)
```

Let's step through the key parameters:
- attrs: The variables used to measure similarity. Tracts with similar income and education levels will be grouped together.
- threshold_var: The variable that must meet the minimum threshold (population in this case).
- threshold: Each region must have at least this many people.
- n_iterations: The algorithm uses a tabu search heuristic; more iterations generally yield better solutions.
- seed: For reproducibility, since the algorithm has stochastic elements.
The result is an sf object with a new .region column indicating each tract’s assigned region. The algorithm found 41 regions, each with at least 50,000 residents.
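Because the result is a standard sf object, you can dissolve the tracts into region-level geometries and summaries with familiar dplyr verbs. A sketch along these lines (the population-weighted means are my own choice of summary, not something spopt computes for you):

```r
library(dplyr)
library(sf)

# Dissolve tracts into regions: sf's summarize() unions the
# geometries within each group by default
regions <- maxp_result |>
  group_by(.region) |>
  summarize(
    total_pop = sum(popE, na.rm = TRUE),
    mean_income = weighted.mean(incomeE, w = popE, na.rm = TRUE),
    mean_bachelors = weighted.mean(bachelorsE, w = popE, na.rm = TRUE)
  )
```

The dissolved regions object can then be mapped or joined to other data just like the original tracts.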
You can access metadata about the solution through the spopt attribute:
```r
attr(maxp_result, "spopt")
```

```
$algorithm
[1] "max_p"

$n_regions
[1] 41

$objective
[1] 590.8766

$threshold_var
[1] "popE"

$threshold
[1] 50000

$region_stats
   region n_areas threshold_sum meets_threshold
1      32      21         72029            TRUE
2       3      15         53946            TRUE
3      23      21         63065            TRUE
4      30      17         62150            TRUE
5      24      26         71536            TRUE
6      37      21         83805            TRUE
7      38      18         50354            TRUE
8      29      15         66966            TRUE
9      35      15         56258            TRUE
10     16      11         53648            TRUE
11      1      14         56269            TRUE
12      8      18         52794            TRUE
13     28      15         55143            TRUE
14     22      15         57214            TRUE
15     19      13         54953            TRUE
16      9      13         57807            TRUE
17     27      13         59039            TRUE
18     12      15         76933            TRUE
19      2      12         75397            TRUE
20      7      14         56740            TRUE
21      6      16         61365            TRUE
22     20      12         55996            TRUE
23     39      15         54040            TRUE
24     25      13         56181            TRUE
25     36      15         66457            TRUE
26     21      12         59392            TRUE
27     40      16         59179            TRUE
28     31      24        102478            TRUE
29     34      19         61859            TRUE
30     17      19         61110            TRUE
31     14      18         71146            TRUE
32     26      13         56329            TRUE
33     10      14         62108            TRUE
34     15      14         54934            TRUE
35      4      13         71990            TRUE
36      5      18         75197            TRUE
37     13      14         70617            TRUE
38     41      12         64772            TRUE
39     11      12         54340            TRUE
40     33      18         78145            TRUE
41     18      13         65175            TRUE

$solve_time
[1] 0.07347894

$scaled
[1] TRUE

$n_iterations
[1] 100

$n_sa_iterations
[1] 100

$compact
[1] FALSE

$compact_weight
[1] 0.5

$mean_compactness
NULL

$region_compactness
NULL
```
Spatial weights
By default, all regionalization functions use queen contiguity - two tracts are neighbors if they share any boundary point (including corners). You can also use rook contiguity, where tracts must share an edge to be neighbors:
```r
maxp_rook <- max_p_regions(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  threshold_var = "popE",
  threshold = 50000,
  weights = "rook",
  n_iterations = 100,
  seed = 1983
)
```

For more control, you can specify weights as a list:
- list(type = "knn", k = 6): K-nearest neighbors (useful for point data or ensuring connectivity)
- list(type = "distance", d = 5000): Distance-based weights (units match your CRS)
You can also pass an nb object created with spdep or spopt’s sp_weights() function.
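For example, a neighbor list built with spdep can be passed directly. A sketch, assuming the weights argument accepts the nb object as described above:

```r
library(spdep)

# Build a rook contiguity neighbor list from the tract geometries
nb <- poly2nb(dallas, queen = FALSE)

# Pass the nb object directly as the spatial weights
maxp_nb <- max_p_regions(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  threshold_var = "popE",
  threshold = 50000,
  weights = nb,
  n_iterations = 100,
  seed = 1983
)
```

This is useful when you want to inspect or edit the neighbor structure first, for example to bridge islands that have no contiguous neighbors.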
Compact regions
For applications like sales territories or electoral districts, you may want regions with compact, regular shapes. The compact parameter optimizes for compactness in addition to attribute homogeneity:
```r
maxp_compact <- max_p_regions(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  threshold_var = "popE",
  threshold = 50000,
  weights = "rook",
  compact = TRUE,
  compact_weight = 0.5,
  n_iterations = 100,
  seed = 1983
)

maplibre_view(maxp_compact, column = ".region", legend = FALSE)
```

The compact_weight parameter (0 to 1) controls the trade-off between attribute homogeneity and geometric compactness. Higher values prioritize compact shapes.
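With compact = TRUE, the mean_compactness and region_compactness entries of the solution metadata should be populated rather than NULL, so you can compare candidate solutions quantitatively (a sketch; the field names follow the metadata listing shown earlier):

```r
# Overall compactness of the solution (NULL when compact = FALSE)
attr(maxp_compact, "spopt")$mean_compactness

# Per-region compactness scores
attr(maxp_compact, "spopt")$region_compactness
```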
SKATER algorithm
SKATER (Spatial K’luster Analysis by Tree Edge Removal) (Assunção et al. 2006) takes a different approach. It first builds a minimum spanning tree connecting all tracts based on their attribute similarity, then iteratively removes edges to create clusters. The algorithm is fast and produces spatially coherent regions.
```r
skater_result <- skater(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  n_regions = 6,
  seed = 1983
)

maplibre_view(skater_result, column = ".region", legend = FALSE)
```

SKATER supports floor and floor_value parameters if you need minimum population constraints:
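A sketch of what that might look like; the exact argument forms are an assumption here (floor naming the constraint variable and floor_value giving the per-region minimum):

```r
# Hypothetical call: each region must reach 50,000 residents
skater_floor <- skater(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  n_regions = 6,
  floor = "popE",
  floor_value = 50000,
  seed = 1983
)
```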
AZP: Automatic Zoning Procedure
The Automatic Zoning Procedure (AZP) (Openshaw 1977; Openshaw and Rao 1995) uses local search optimization with three algorithm variants: basic (greedy), tabu search, and simulated annealing.
```r
azp_result <- azp(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  n_regions = 20,
  method = "tabu",
  tabu_length = 10,
  max_iterations = 100,
  seed = 1983
)

maplibre_view(azp_result, column = ".region", legend = FALSE)
```

The method parameter controls which algorithm variant to use:
- "basic": Simple greedy local search (fastest)
- "tabu": Tabu search, which maintains a list of recent moves to avoid getting stuck in local optima
- "sa": Simulated annealing, which accepts some worse solutions early to explore more of the solution space
For large problems, you may also want to use the simulated annealing variant:
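For instance, swapping method = "sa" into the earlier call (a sketch; any annealing-specific tuning arguments are left at their defaults here):

```r
# Same problem as above, solved with simulated annealing
azp_sa <- azp(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  n_regions = 20,
  method = "sa",
  max_iterations = 100,
  seed = 1983
)
```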
SPENC: Spatially-Encouraged Spectral Clustering
SPENC (Wolf 2021) combines spectral clustering with spatial constraints. It uses a radial basis function (RBF) kernel to measure attribute similarity and incorporates spatial connectivity into the spectral embedding. This approach can find clusters with complex, non-convex shapes that other methods might miss.
```r
spenc_result <- spenc(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  n_regions = 15,
  gamma = 1.0,
  seed = 1983
)

maplibre_view(spenc_result, column = ".region", legend = FALSE)
```

The gamma parameter controls the RBF kernel bandwidth - higher values create "tighter" clusters in attribute space.
Ward spatial clustering
Spatially-constrained Ward clustering is a hierarchical method that only allows merging adjacent clusters. At each step, it merges the pair of adjacent clusters that minimizes the increase in total within-cluster variance.
```r
ward_result <- ward_spatial(
  dallas,
  attrs = c("incomeE", "bachelorsE"),
  n_regions = 15
)

maplibre_view(ward_result, column = ".region", legend = FALSE)
```

Ward clustering is deterministic (no random seed needed) and tends to produce compact, roughly equal-sized regions.
Choosing an algorithm
Each regionalization algorithm has strengths for different scenarios:
| Algorithm | Best for | Key features |
|---|---|---|
| Max-P | Population thresholds | Maximizes number of regions meeting constraints |
| SKATER | Fast, interpretable results | Tree-based, good for large datasets |
| AZP | High-quality solutions | Multiple optimization variants |
| SPENC | Complex cluster shapes | Spectral embedding with spatial constraints |
| Ward | Deterministic, balanced regions | Hierarchical, no tuning required |
For most applications, I’d recommend starting with Max-P if you have population constraints, or SKATER for a quick first pass. If you want to explore the solution space more thoroughly, try AZP with tabu search or simulated annealing.
Next steps
- Facility Location - Solve location-allocation problems
- Huff Model - Model market share and retail competition
- Travel-Time Cost Matrices - Use real-world travel times
