library(spopt)
library(tidycensus)
library(tidyverse)
library(sf)
library(mapgl)
# Get tract-level population data
tarrant <- get_acs(
geography = "tract",
variables = "B01003_001",
state = "TX",
county = "Tarrant",
geometry = TRUE,
year = 2023
) |>
filter(estimate > 0) |>
rename(population = estimate)
# Demand points: all tract centroids
demand_pts <- tarrant |>
st_centroid()
# Candidate facilities: sample 30 locations across the county
# In practice, these might be specific parcels, existing buildings, or zoned commercial sites
set.seed(1983)
n_candidates <- 30
county_boundary <- tarrant |> st_union()
candidate_pts <- st_sample(county_boundary, n_candidates) |>
st_as_sf() |>
mutate(id = row_number())Facility location problems ask: given a set of demand points and candidate facility sites, where should we place facilities to best serve demand? These problems arise across industries - siting fire stations to minimize emergency response times, placing retail stores to maximize customer coverage, or locating warehouses to minimize shipping distances.
spopt implements several classic facility location algorithms, each optimizing a different objective. This vignette demonstrates these algorithms using Census data from Tarrant County, Texas (home of Fort Worth), showing how to select optimal locations for public services.
Setting up the problem
Facility location problems have three core components:
- Demand points: Locations that need to be served (e.g., population centers, customer addresses)
- Candidate facilities: Potential sites where facilities could be built
- Cost matrix: Distance or travel time between each demand point and candidate facility
A key distinction: in practice, you typically have many demand points but few candidate facility sites. For example, you might need to serve 500 Census tracts but only have 25 potential building sites to choose from. This asymmetry is what makes the problem tractable - the solver selects from a limited set of candidates rather than considering every possible location.
Let’s set up a realistic scenario: we want to place community health centers to serve the population of Tarrant County. We’ll use Census tract centroids as demand points, and sample 30 candidate locations from across the county to represent potential facility sites.
We now have 448 demand points (tract centroids) and 30 candidate facility locations. This setup mirrors real-world planning where you’re evaluating a shortlist of potential sites.
# Visualize the setup
maplibre(bounds = tarrant) |>
add_fill_layer(
id = "tracts",
source = tarrant,
fill_color = "lightgray",
fill_opacity = 0.3
) |>
add_circle_layer(
id = "demand",
source = demand_pts,
circle_color = "steelblue",
circle_radius = 3,
circle_opacity = 0.5
) |>
add_circle_layer(
id = "candidates",
source = candidate_pts,
circle_color = "black",
circle_radius = 6,
circle_stroke_color = "white",
circle_stroke_width = 2
)P-Median: Minimizing total distance
The P-Median problem (Hakimi 1964) minimizes the total weighted distance from demand points to their assigned facilities. This is the classic efficiency-focused location model - it finds locations that minimize how far people, on average, must travel.
result_pmedian <- p_median(
demand = demand_pts,
facilities = candidate_pts,
n_facilities = 5,
weight_col = "population"
)The solver runs quickly with 30 candidates - the optimization scales with the number of candidate sites, not demand points. Let’s visualize the results:
# Get selected facility locations
selected <- result_pmedian$facilities |>
filter(.selected) |>
mutate(id = as.character(id))
# Color demand points by their assigned facility
demand_colored <- result_pmedian$demand |>
mutate(.facility = as.character(.facility))
# Map the results
maplibre(bounds = tarrant) |>
add_fill_layer(
id = "tracts",
source = tarrant,
fill_color = "lightgray",
fill_opacity = 0.3
) |>
add_circle_layer(
id = "demand",
source = demand_colored,
circle_color = match_expr(
column = ".facility",
values = selected$id,
stops = c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00")
),
circle_radius = 4,
circle_opacity = 0.7
) |>
add_circle_layer(
id = "facilities",
source = selected,
circle_color = match_expr(
column = "id",
values = selected$id,
stops = c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00")
),
circle_radius = 10,
circle_stroke_color = "white",
circle_stroke_width = 2
)Each demand point is colored by its assigned facility, and the black markers show the selected facility locations. The solution minimizes the total population-weighted distance.
You can access solution metadata through the spopt attribute:
attr(result_pmedian, "spopt")$algorithm
[1] "p_median"
$n_selected
[1] 5
$n_facilities
[1] 5
$objective
[1] 16669528322
$mean_distance
[1] 7805.025
$solve_time
[1] 0.2938151
The objective value represents the total weighted distance - lower is better.
P-Center: Minimizing maximum distance
While P-Median optimizes overall accessibility, the P-Center problem (Hakimi 1965) focuses on equity - it minimizes the maximum distance any demand point must travel. This is critical for emergency services where we need to guarantee that everyone is within a reasonable distance.
result_pcenter <- p_center(
demand = demand_pts,
facilities = candidate_pts,
n_facilities = 5
)
selected_pcenter <- result_pcenter$facilities |>
filter(.selected)
# Compare to P-Median locations
maplibre(bounds = tarrant) |>
add_fill_layer(
id = "tracts",
source = tarrant,
fill_color = "lightgray",
fill_opacity = 0.3
) |>
add_circle_layer(
id = "pmedian",
source = selected,
circle_color = "#3498db",
circle_radius = 10,
circle_stroke_color = "white",
circle_stroke_width = 2
) |>
add_circle_layer(
id = "pcenter",
source = selected_pcenter,
circle_color = "#e74c3c",
circle_radius = 5,
circle_stroke_color = "white",
circle_stroke_width = 2
)Notice how the P-Center solution (red) pushes facilities toward the edges of the county to ensure no one is too far away, while P-Median (blue) concentrates facilities where population is densest. Despite this, two common facilities are selected between the solutions.
MCLP: Maximum coverage with limited facilities
The Maximum Coverage Location Problem (MCLP) (Church and ReVelle 1974) maximizes the demand covered within a service radius when you can only build a fixed number of facilities. This is useful when you have budget constraints but want to cover as many people as possible.
result_mclp <- mclp(
demand = demand_pts,
facilities = candidate_pts,
n_facilities = 5,
service_radius = 5000, # 5 km
weight_col = "population"
)
# Calculate coverage
covered_pop <- result_mclp$demand |>
filter(.covered) |>
pull(population) |>
sum()
total_pop <- sum(demand_pts$population, na.rm = TRUE)
cat(sprintf("Coverage: %s of %s (%.1f%%)",
format(covered_pop, big.mark = ","),
format(total_pop, big.mark = ","),
100 * covered_pop / total_pop))Coverage: 520,623 of 2,135,743 (24.4%)
The service_radius parameter defines what “covered” means - any demand point within this distance of a selected facility is considered covered. The algorithm then selects facilities to maximize the total covered population.
LSCP: Minimum facilities for full coverage
The Location Set Covering Problem (LSCP) (Toregas et al. 1971) asks the opposite question from MCLP: what’s the minimum number of facilities needed to cover all demand within a service radius?
result_lscp <- lscp(
demand = demand_pts,
facilities = candidate_pts,
service_radius = 8000 # 8 km
)
n_selected <- sum(result_lscp$facilities$.selected)
cat(sprintf("Minimum facilities needed for full coverage: %d", n_selected))Minimum facilities needed for full coverage: 19
The solver finds 19 facilities in this example. LSCP is particularly useful for planning emergency services where coverage is mandatory - every resident must be within a certain response time of a fire station or hospital.
P-Dispersion: Spreading facilities apart
Most facility location problems assume demand points want to be close to facilities. But some facilities are obnoxious - landfills, prisons, or polluting industries that communities want far away. The P-Dispersion problem (Kuby 1987) maximizes the minimum distance between facilities.
P-Dispersion is also useful for environmental monitoring networks or cell tower placement where you want sensors spread across a region.
result_pdispersion <- p_dispersion(
facilities = candidate_pts,
n_facilities = 10
)
selected_disp <- result_pdispersion |>
filter(.selected)
maplibre(bounds = tarrant) |>
add_fill_layer(
id = "tracts",
source = tarrant,
fill_color = "lightgray",
fill_opacity = 0.3
) |>
add_circle_layer(
id = "facilities",
source = selected_disp,
circle_color = "#2ecc71",
circle_radius = 10,
circle_stroke_color = "white",
circle_stroke_width = 2
)Notice how the facilities are spread around the county’s perimeter and interior - maximizing the minimum inter-facility distance.
CFLP: Capacitated facility location
Real facilities have capacity limits - a clinic can only see so many patients per day, a warehouse can only store so much inventory. The Capacitated Facility Location Problem (CFLP) (Daskin 2013; Sridharan 1995) adds capacity constraints and allows demand to be split across multiple facilities.
Varying capacity limits
In practice, candidate sites often have different capacities based on lot size, zoning, or building constraints. Let’s simulate a realistic scenario with small, medium, and large sites:
# Create candidates with varying capacities
set.seed(1983)
candidate_facilities <- candidate_pts |>
mutate(
# Assign site sizes: small (200k), medium (400k), large (800k)
site_type = sample(c("small", "medium", "large"), n(), replace = TRUE),
capacity = case_when(
site_type == "small" ~ 200000,
site_type == "medium" ~ 400000,
site_type == "large" ~ 800000
)
)
result_cflp <- cflp(
demand = demand_pts,
facilities = candidate_facilities,
n_facilities = 5,
weight_col = "population",
capacity_col = "capacity"
)
# Check which sites were selected and their utilization
result_cflp$facilities |>
filter(.selected) |>
st_drop_geometry() |>
select(id, site_type, capacity, .utilization) id site_type capacity .utilization
1 5 medium 4e+05 0.9712625
2 7 medium 4e+05 1.0000000
3 12 medium 4e+05 1.0000000
4 21 large 8e+05 0.6974662
5 27 medium 4e+05 0.9731625
The solver selects facilities that together can serve all demand while minimizing total distance. Notice that .utilization shows what fraction of each facility’s capacity is used.
When demand exceeds capacity at the nearest facility, the solver splits demand across multiple facilities:
Incorporating facility costs
What if different sites have different costs - perhaps due to real estate prices, construction costs, or lease rates? CFLP can incorporate these costs to find the economically optimal solution.
When you provide a facility_cost_col and set n_facilities = 0, the solver determines the optimal number of facilities by balancing fixed costs (opening facilities) against variable costs (transportation distance). The key is scaling costs appropriately - fixed costs should be comparable to total transportation costs:
# Add costs based on site size
# Scale to be comparable with total transport costs (population * distance)
candidate_with_costs <- candidate_facilities |>
mutate(
# Fixed cost to open each facility (scaled to match transport cost units)
fixed_cost = case_when(
site_type == "small" ~ 5e8, # Higher cost per unit capacity
site_type == "medium" ~ 8e8,
site_type == "large" ~ 1e9
)
)
result_with_costs <- cflp(
demand = demand_pts,
facilities = candidate_with_costs,
n_facilities = 0, # Let solver determine optimal number
weight_col = "population",
capacity_col = "capacity",
facility_cost_col = "fixed_cost"
)
# How many facilities does the cost-optimized solution select?
cost_meta <- attr(result_with_costs, "spopt")
cat(sprintf("Optimal number of facilities: %d\n", cost_meta$n_selected))Optimal number of facilities: 9
This mode is powerful for budget planning: instead of arbitrarily choosing 5 facilities, let the optimization determine the cost-minimizing configuration.
Real estate costs example
Real estate prices typically vary spatially - central locations cost more than peripheral ones. Let’s create a realistic scenario where costs increase toward the county centroid:
# Calculate distance from county centroid (proxy for "centrality")
county_centroid <- st_centroid(county_boundary)
# Compute distances to center
dist_to_center <- as.numeric(st_distance(candidate_facilities, county_centroid))
candidate_with_realestate <- candidate_facilities |>
mutate(
dist_to_center = dist_to_center,
# Costs higher near center, lower at periphery
# Normalize to 0-1 range and invert (closer = higher cost)
centrality = 1 - (dist_to_center - min(dist_to_center)) /
(max(dist_to_center) - min(dist_to_center)),
# Cost ranges from $200-$500 per sqft based on location
cost_per_sqft = 200 + centrality * 300,
sqft = capacity / 10, # Assume 10 people per sqft capacity
# Scale fixed cost to be comparable with transport costs
fixed_cost = sqft * cost_per_sqft * 50
)Let’s visualize how costs vary across the county:
maplibre(bounds = tarrant) |>
add_fill_layer(
id = "tracts",
source = tarrant,
fill_color = "lightgray",
fill_opacity = 0.3
) |>
add_circle_layer(
id = "candidates",
source = candidate_with_realestate,
circle_color = interpolate(
column = "cost_per_sqft",
values = c(200, 350, 500),
stops = c("#2166ac", "#f7f7f7", "#b2182b")
),
circle_radius = interpolate(
column = "capacity",
values = c(200000, 500000, 800000),
stops = c(6, 10, 14)
),
circle_stroke_color = "white",
circle_stroke_width = 1
)Red sites are expensive (central), blue sites are cheaper (peripheral). Larger circles indicate higher capacity.
Now let’s compare two solutions: one ignoring costs (P-Median) and one incorporating real estate costs:
# Solution ignoring costs (just minimize distance)
result_no_cost <- p_median(
demand = demand_pts,
facilities = candidate_with_realestate,
n_facilities = 5,
weight_col = "population"
)
# Solution with real estate costs
result_with_realestate <- cflp(
demand = demand_pts,
facilities = candidate_with_realestate,
n_facilities = 5,
weight_col = "population",
capacity_col = "capacity",
facility_cost_col = "fixed_cost"
)
# Compare selections
selected_no_cost <- result_no_cost$facilities |>
filter(.selected) |>
mutate(method = "Distance only")
selected_with_cost <- result_with_realestate$facilities |>
filter(.selected) |>
mutate(method = "With real estate costs")
# Side-by-side comparison
maplibre(bounds = tarrant) |>
add_fill_layer(
id = "tracts",
source = tarrant,
fill_color = "lightgray",
fill_opacity = 0.3
) |>
add_circle_layer(
id = "no_cost",
source = selected_no_cost,
circle_color = "#e41a1c",
circle_radius = 12,
circle_stroke_color = "white",
circle_stroke_width = 2
) |>
add_circle_layer(
id = "with_cost",
source = selected_with_cost,
circle_color = "#377eb8",
circle_radius = 7,
circle_stroke_color = "white",
circle_stroke_width = 2
)Red circles show the distance-only solution (P-Median), blue circles show the cost-aware solution. Notice how the cost-aware solution may shift toward cheaper peripheral sites, trading some accessibility for lower real estate costs.
Comparing algorithms
Here’s a quick reference for choosing the right algorithm:
| Algorithm | Objective | Use case |
|---|---|---|
| P-Median | Minimize total weighted distance | General efficiency; warehouses, retail |
| P-Center | Minimize maximum distance | Equity-focused; emergency services |
| MCLP | Maximize covered demand | Fixed budget; expand coverage |
| LSCP | Minimize facilities for full coverage | Mandatory coverage; emergency planning |
| P-Dispersion | Maximize minimum inter-facility dist. | Obnoxious facilities; monitoring networks |
| CFLP | Minimize cost with capacity limits | Logistics; realistic capacity constraints |
For most public service planning, start with P-Median for an efficiency baseline, then compare to P-Center to see the equity trade-offs. If you have a hard budget constraint, MCLP helps understand what coverage is achievable with limited facilities.
Candidate site selection strategies
The examples above used randomly sampled candidate locations. In practice, you’d generate candidates more thoughtfully:
- Existing infrastructure: Current facility locations that could be expanded
- Zoning-based: Sites zoned for commercial or institutional use
- Network nodes: Major intersections or highway interchanges
- Population centers: Centroids of high-density areas
- Grid sampling: Regular grid across the study area for exploratory analysis
You can also combine strategies - start with a coarse grid of candidates for initial analysis, then refine to specific parcels for detailed planning.
Using custom cost matrices
By default, spopt calculates Euclidean distances between points. But real-world accessibility depends on road networks and travel times, not straight-line distance. All facility location functions accept a cost_matrix parameter for custom distances.
See the Travel-Time Cost Matrices vignette for how to generate travel time matrices using r5r, and then pass them to these functions.
# Example with custom cost matrix
cost_mat <- my_travel_time_matrix # Generated from r5r or similar
result <- p_median(
demand = demand_pts,
facilities = candidate_pts,
n_facilities = 5,
weight_col = "population",
cost_matrix = cost_mat
)Next steps
- Regionalization - Build spatially-contiguous regions
- Huff Model - Model market share and retail competition
- Travel-Time Cost Matrices - Use real-world travel times with r5r
