# 7 Spatial analysis with US Census data

A very common use-case of spatial data from the US Census Bureau is spatial analysis. Spatial analysis refers to the performance of analytic tasks that explicitly incorporate the spatial properties of a dataset. Principles in spatial analysis are closely related to the field of geographic information science, which incorporates both theoretical perspectives and methodological insights with regards to the use of geographic data.

Traditionally, geographic information science has been performed in a geographic information system, or “GIS,” which refers to an integrated software platform for the management, processing, analysis, and visualization of geographic data. As evidenced in this book already, R packages exist for handling these tasks, allowing R to function as a capable substitute for desktop GIS software like ArcGIS or QGIS.

Traditionally, spatial analytic tasks in R have been handled by the sp package and allied packages such as rgeos. In recent years, however, the sf package has emerged as the next-generation alternative to sp for spatial data analysis in R. In addition to the simpler representation of vector spatial data in R, as discussed in previous chapters, sf also includes significant functionality for spatial data analysis that integrates seamlessly with tidyverse tools.

This chapter covers how to perform common spatial analysis tasks with Census data using the sf package. As with previous chapters, the examples will focus on data acquired with the tidycensus and tigris packages, and will cover common workflows of use to practitioners who work with US Census Bureau data.

## 7.1 Spatial overlay

Spatial data analysis allows practitioners to consider how geographic datasets interrelate in geographic space. This analytic functionality facilitates answering a wide range of research and analytic questions that would otherwise prove difficult without reference to a dataset’s geographic properties.

One common use-case employed by the geospatial analyst is spatial overlay. Key to the concept of spatial overlay is the representation of geographic datasets as layers in a GIS. This representation is exemplified by the graphic below (credit to Rafael Pereira for the implementation).

In this representation, different components of the landscape that interact in the real world are abstracted out into different layers, represented by different geometries. For example, Census tracts might be represented as polygons; customers as points; and roads as linestrings. Separating out these components has significant utility for the geospatial analyst, however. By using spatial analytic tools, a researcher could answer questions like “How many customers live within a given Census tract?” or “Which roads intersect a given Census tract?”

### 7.1.1 Note: aligning coordinate reference systems

Section 5.4 covered coordinate reference systems in R, their importance to spatial data, and how to select appropriate projected coordinate reference systems using the crsuggest package. In any workflow using spatial overlay, including all of the methods discussed in this chapter, it is essential that all layers share the same CRS for overlay methods to work.

Spatial datasets obtained with tigris or tidycensus will by default share the same geographic CRS, NAD 1983. For geographic coordinate reference systems, the sf package uses the s2 spherical geometry library to compute three-dimensional overlay rather than assuming planar geometries for geographic coordinates. This represents a significant technical advancement; however I have found that it can be much slower to compute spatial overlay operations in this way than if the same workflow were using a projected coordinate reference system.

In turn, a recommended spatial analysis data preparation workflow is as follows:

2. Use suggest_crs() in the crsuggest package to identify an appropriate projected CRS for your layers;
3. Transform your data to the projected CRS using st_transform();
4. Compute the spatial overlay operation.

To avoid redundancy, step 2 is implied in the examples in this chapter and an appropriate projected coordinate reference system has been pre-selected for all sections.

### 7.1.2 Identifying geometries within a metropolitan area

One example of the utility of spatial overlay for the Census data analyst is the use of overlay techniques to find out which geographies lie within a given metropolitan area. Core-based statistical areas, also known as metropolitan or micropolitan areas, are common geographies defined by the US Census Bureau for regional analysis. Core-based statistical areas are defined as agglomerations of counties that are oriented around a central core or cores, and have a significant degree of population interaction as measured through commuting patterns. Metropolitan areas are those core-based statistical areas that have a population exceeding 50,000.

A Census data analyst in the United States will often need to know which Census geographies, such as Census tracts or block groups, fall within a given metropolitan area. However, these geographies are only organized by state and county, and don’t have metropolitan area identification included by default. Given that Census spatial datasets are designed to align with one another, spatial overlay can be used to identify geographic features that fall within a given metropolitan area and extract those features.

Let’s use the example of the Kansas City metropolitan area, which includes Census tracts in both Kansas and Missouri. We’ll first use tigris to acquire 2020 Census tracts for the two states that comprise the Kansas City region as well as the boundary of the Kansas City metropolitan area.

library(tigris)
library(tidyverse)
library(sf)
options(tigris_use_cache = TRUE)

# CRS used: NAD83(2011) Kansas Regional Coordinate System
# Zone 11 (for Kansas City)
ks_mo_tracts <- map_dfr(c("KS", "MO"), ~{
tracts(.x, cb = TRUE, year = 2020)
}) %>%
st_transform(8528)

kc_metro <- core_based_statistical_areas(cb = TRUE, year = 2020) %>%
filter(str_detect(NAME, "Kansas City")) %>%
st_transform(8528)

ggplot() +
geom_sf(data = ks_mo_tracts, fill = "white", color = "grey") +
geom_sf(data = kc_metro, fill = NA, color = "red") +
theme_void()

We can see visually from the plot which Census tracts are within the Kansas City metropolitan area, and which lay outside. This spatial relationship represented in the image can be expressed through code using spatial subsetting, enabled by functionality in the sf package.

### 7.1.3 Spatial subsets and spatial predicates

Spatial subsetting uses the extent of one spatial dataset to extract features from another spatial dataset based on co-location, defined by a spatial predicate. Spatial subsets can be expressed through base R indexing notation:

kc_tracts <- ks_mo_tracts[kc_metro, ]

ggplot() +
geom_sf(data = kc_tracts, fill = "white", color = "grey") +
geom_sf(data = kc_metro, fill = NA, color = "red") +
theme_void()

The spatial subsetting operation returns all the Census tracts that intersect the extent of the Kansas City metropolitan area, using the default spatial predicate, st_intersects(). This gives us back tracts that fall within the metro area’s boundary and those that cross or touch the boundary. For many analysts, however, this will be insufficient as they will want to tabulate statistics exclusively for tracts that fall within the metropolitan area’s boundaries. In this case, a different spatial predicate can be used with the op argument.

Generally, Census analysts will want to use the st_within() spatial predicate to return tracts within a given metropolitan area. As long as objects within the core Census hierarchy are obtained for the same year from tigris, the st_within() spatial predicate will cleanly return geographies that fall within the larger geography when requested. The example below illustrates the same process using the st_filter() function in sf, which allows spatial subsetting to be used cleanly within a tidyverse-style pipeline. The key difference between these two approaches to spatial subsetting is the argument name for the spatial predicate (op vs. .predicate).

kc_tracts_within <- ks_mo_tracts %>%
st_filter(kc_metro, .predicate = st_within)

# Equivalent syntax:
# kc_metro2 <- kc_tracts[kc_metro, op = st_within]

ggplot() +
geom_sf(data = kc_tracts_within, fill = "white", color = "grey") +
geom_sf(data = kc_metro, fill = NA, color = "red") +
theme_void()

## 7.2 Spatial joins

Spatial joins extend the aforementioned concepts in spatial overlay by transferring attributes between spatial layers. Conceptually, spatial joins can be thought of like the table joins covered in Section 6.4.1 where the equivalent of a “key field” used to match rows is a spatial relationship defined by a spatial predicate. Spatial joins in R are implemented in sf’s st_join() function. This section covers two common use cases for spatial joins with Census data. The first topic is the point-in-polygon spatial join, where a table of coordinates is matched to Census polygons to determine demographic characteristics around those locations. The second topic covers polygon-in-polygon spatial joins, where smaller Census shapes are matched to larger shapes.

### 7.2.1 Point-in-polygon spatial joins

Analysts are commonly tasked with matching point-level data to Census shapes in order to study demographic differences. For example, a marketing analyst may have a dataset of customers and needs to understand the characteristics of those customers’ neighborhoods in order to target products efficiently. Similarly, a health data analyst may need to match neighborhood demographic data to patient information to understand inequalities in patient outcomes. This scenario is explored in this section.

Let’s consider a hypothetical task where a health data analyst in Gainesville, Florida needs to determine the percentage of residents age 65 and up who lack health insurance in patients’ neighborhoods. The analyst has a dataset of patients with patient ID along with longitude and latitude information.

library(tidyverse)
library(sf)
library(tidycensus)
library(mapview)

gainesville_patients <- tibble(
patient_id = 1:10,
longitude = c(-82.308131, -82.311972, -82.361748, -82.374377,
-82.38177, -82.259461, -82.367436, -82.404031,
-82.43289, -82.461844),
latitude = c(29.645933, 29.655195, 29.621759, 29.653576,
29.677201, 29.674923, 29.71099, 29.711587,
29.648227, 29.624037)
)
Table 7.1: Hypothetical dataset of patients in Gainesville, Florida
patient_id longitude latitude
1 -82.30813 29.64593
2 -82.31197 29.65519
3 -82.36175 29.62176
4 -82.37438 29.65358
5 -82.38177 29.67720
6 -82.25946 29.67492
7 -82.36744 29.71099
8 -82.40403 29.71159
9 -82.43289 29.64823
10 -82.46184 29.62404

Whereas the spatial overlay example in the previous section used spatial datasets from tigris that already include geographic information, this dataset needs to be converted to a simple features object. The st_as_sf() function in the sf package can take an R data frame or tibble with longitude and latitude columns like this and create a dataset of geometry type POINT. By convention, the coordinate reference system used for longitude / latitude data is WGS 1984, represented with the EPSG code 4326. We’ll need to specify this CRS in st_as_sf() so that sf can locate the points correctly before we transform to an appropriate projected coordinate reference system with st_transform().

# CRS: NAD83(2011) / Florida North
gainesville_sf <- gainesville_patients %>%
st_as_sf(coords = c("longitude", "latitude"),
crs = 4326) %>%
st_transform(6440)

Once prepared as a spatial dataset, the patient information can be mapped.

mapview(
gainesville_sf,
col.regions = "red",
legend = FALSE
)

Figure 7.5: Map of hypothetical patient locations in Gainesville, Florida

As the patient data are now formatted as a simple features object, the next step is to acquire data on health insurance from the American Community Survey. A pre-computed percentage from the ACS Data Profile is available at the Census tract level, which will be used in the example below. Users who require a more granular geography can construct this information from the ACS Detailed Tables at the block group level using table B27001 and techniques learned in Section 3.3.2. As Gainesville is contained within Alachua County, Florida, we can obtain data from the 2015-2019 5-year ACS accordingly.

alachua_insurance <- get_acs(
geography = "tract",
variables = "DP03_0096P",
state = "FL",
county = "Alachua",
year = 2019,
geometry = TRUE
) %>%
select(GEOID, pct_insured = estimate,
pct_insured_moe = moe) %>%
st_transform(6440)

After obtaining the spatial & demographic data with get_acs() and the geometry = TRUE argument, two additional commands help pre-process the data for the spatial join. The call to select() retains three non-geometry columns in the simple features object: GEOID, which is the Census tract ID, and the ACS estimate and MOE renamed to pct_insured and pct_insured_moe, respectively. This formats the information that will be appended to the patient data in the spatial join. The st_transform() command then aligns the coordinate reference system of the Census tracts with the CRS used by the patient dataset.

Before computing the spatial join, the spatial relationships between patient points and Census tract demographics can be visualized interactively with mapview(), layering two interactive views with the + operator.

mapview(
alachua_insurance,
zcol = "pct_insured",
layer.name = "% with health<br/>insurance"
) +
mapview(
gainesville_sf,
col.regions = "red",
legend = FALSE
)

Figure 7.6: Layered interactive view of patients and Census tracts in Gainesville

The interrelationships between patient points and tract neighborhoods can be explored on the map. These relationships can be formalized with a spatial join, implemented with the st_join() function in the sf package. st_join() returns a new simple features object that inherits geometry and attributes from a first dataset x with attributes from a second dataset y appended. Rows in x are matched to rows in y based on a spatial relationship defined by a spatial predicate, which defaults in st_join() to st_intersects(). For point-in-polygon spatial joins, this default will be sufficient in most cases unless a point falls directly on the boundary between polygons (which is not true in this example).

patients_joined <- st_join(
gainesville_sf,
alachua_insurance
)
Table 7.2: Patients dataset after spatial join to Census tracts
patient_id geometry GEOID pct_insured pct_insured_moe
1 POINT (812216.5 73640.54) 12001000700 81.6 7.0
2 POINT (811825 74659.85) 12001000500 91.0 5.1
3 POINT (807076.2 70862.84) 12001001515 85.2 6.2
4 POINT (805787.6 74366.12) 12001001603 88.3 5.1
5 POINT (805023.3 76971.06) 12001001100 96.2 2.7
6 POINT (816865 76944.93) 12001001902 86.0 5.9
7 POINT (806340.4 80741.63) 12001001803 92.3 4.0
8 POINT (802798.8 80742.12) 12001001813 97.9 1.4
9 POINT (800134.1 73669.13) 12001002207 95.7 2.4
10 POINT (797379.1 70937.74) 12001002205 96.5 1.6

The output dataset includes the patient ID and the original POINT feature geometry, but also now includes GEOID information from the Census tract dataset along with neighborhood demographic information from the ACS. This workflow can be used for analyses of neighborhood characteristics in a wide variety of applications and to generate data suitable for hierarchical modeling.

An issue to avoid when interpreting the results of point-in-polygon spatial joins is the ecological fallacy, where individual-level characteristics are inferred from that of the neighborhood. While neighborhood demographics are useful for inferring the characteristics of the environment in which an observation is located, they do not necessarily provide information about the demographics of the observation itself - particularly important when the observations represent people.

### 7.2.2 Spatial joins and group-wise spatial analysis

Spatial data operations can also be embedded in workflows where analysts are interested in understanding how characteristics vary by group. For example, while demographic data for metropolitan areas can be readily acquired using tidycensus functions, we might also be interested in learning about how demographic characteristics of neighborhoods within metropolitan areas vary across the United States. The example below illustrates this with some important new concepts for spatial data analysts. It involves a polygon-on-polygon spatial join in which attention to the spatial predicate used will be very important. Additionally, as all polygons involved are acquired with tidycensus and get_acs(), the section will show how st_join() handles column names that are duplicated between datasets.

#### 7.2.2.1 Spatial join data setup

Let’s say that we are interested in analyzing the distributions of neighborhoods (defined here as Census tracts) by Hispanic population for the four largest metropolitan areas in Texas. We’ll use the variable B01003_001 from the 2019 1-year ACS to acquire population data by core-based statistical area (CBSA) along with simple feature geometry which will eventually be used for the spatial join.

library(tidycensus)
library(tidyverse)
library(sf)

# CRS: NAD83(2011) / Texas Centric Albers Equal Area
tx_cbsa <- get_acs(
geography = "cbsa",
variables = "B01003_001",
year = 2019,
survey = "acs1",
geometry = TRUE
) %>%
filter(str_detect(NAME, "TX")) %>%
slice_max(estimate, n = 4) %>%
st_transform(6579)
Table 7.3: Large CBSAs in Texas
GEOID NAME variable estimate moe geometry
19100 Dallas-Fort Worth-Arlington, TX Metro Area B01003_001 7573136 NA MULTIPOLYGON (((1681247 760…
26420 Houston-The Woodlands-Sugar Land, TX Metro Area B01003_001 7066140 NA MULTIPOLYGON (((2009903 730…
41700 San Antonio-New Braunfels, TX Metro Area B01003_001 2550960 NA MULTIPOLYGON (((1538306 729…
12420 Austin-Round Rock-Georgetown, TX Metro Area B01003_001 2227083 NA MULTIPOLYGON (((1664195 732…

The filtering steps used merit some additional explanation. The expression filter(str_detect(NAME, "TX")) first subsets the core-based statistical area data for only those metropolitan or micropolitan areas in (or partially in) Texas. Given that string matching in str_detect() is case-sensitive, using "TX" as the search string will match rows correctly. slice_max(), introduced in Section 4.1, then retains the four rows with the largest population values, found in the estimate column. Finally, the spatial dataset is transformed to an appropriate projected coordinate reference system for the state of Texas.

Given that all four of these metropolitan areas are completely contained within the state of Texas, we can obtain data on percent Hispanic by tract from the ACS Data Profile for 2015-2019.

pct_hispanic <- get_acs(
geography = "tract",
variables = "DP05_0071P",
state = "TX",
year = 2019,
geometry = TRUE
) %>%
st_transform(6579)
Table 7.4: Percent Hispanic by Census tract in Texas
GEOID NAME variable estimate moe geometry
48113019204 Census Tract 192.04, Dallas County, Texas DP05_0071P 58.1 5.7 MULTIPOLYGON (((1801563 765…
48377950200 Census Tract 9502, Presidio County, Texas DP05_0071P 95.6 3.7 MULTIPOLYGON (((1060841 729…
48029190601 Census Tract 1906.01, Bexar County, Texas DP05_0071P 86.2 6.7 MULTIPOLYGON (((1642736 726…
48355002301 Census Tract 23.01, Nueces County, Texas DP05_0071P 80.6 3.9 MULTIPOLYGON (((1755212 707…
48441012300 Census Tract 123, Taylor County, Texas DP05_0071P 26.4 6.3 MULTIPOLYGON (((1522575 758…
48051970500 Census Tract 9705, Burleson County, Texas DP05_0071P 20.4 6.1 MULTIPOLYGON (((1812744 736…
48441010400 Census Tract 104, Taylor County, Texas DP05_0071P 64.4 7.7 MULTIPOLYGON (((1522727 759…
48201311900 Census Tract 3119, Harris County, Texas DP05_0071P 84.9 5.8 MULTIPOLYGON (((1950592 729…
48113015500 Census Tract 155, Dallas County, Texas DP05_0071P 56.6 7.5 MULTIPOLYGON (((1778712 763…
48217960500 Census Tract 9605, Hill County, Texas DP05_0071P 9.3 4.8 MULTIPOLYGON (((1741591 753…

The returned dataset covers Census tracts in the entirety of the state of Texas; however we only need to retain those tracts that fall within our four metropolitan areas of interest. We can accomplish this with a spatial join using st_join().

#### 7.2.2.2 Computing and visualizing the spatial join

We know that in st_join(), we request that a given spatial dataset x, for which geometry will be retained, gains attributes from a second spatial dataset y based on their spatial relationship. This spatial relationship, as in the above examples, will be defined by a spatial predicate passed to the join parameter. The argument suffix defines the suffixes to be used for columns that share the same names, which will be important given that both datasets came from tidycensus. The argument left = FALSE requests an inner spatial join, returning only those tracts that fall within the four metropolitan areas.

hispanic_by_metro <- st_join(
pct_hispanic,
tx_cbsa,
join = st_within,
suffix = c("_tracts", "_metro"),
left = FALSE
) 
Table 7.5: Census tracts after spatial join operation
GEOID_tracts NAME_tracts variable_tracts estimate_tracts moe_tracts GEOID_metro NAME_metro variable_metro estimate_metro moe_metro geometry
1 48113019204 Census Tract 192.04, Dallas County, Texas DP05_0071P 58.1 5.7 19100 Dallas-Fort Worth-Arlington, TX Metro Area B01003_001 7573136 NA MULTIPOLYGON (((1801563 765…
3 48029190601 Census Tract 1906.01, Bexar County, Texas DP05_0071P 86.2 6.7 41700 San Antonio-New Braunfels, TX Metro Area B01003_001 2550960 NA MULTIPOLYGON (((1642736 726…
8 48201311900 Census Tract 3119, Harris County, Texas DP05_0071P 84.9 5.8 26420 Houston-The Woodlands-Sugar Land, TX Metro Area B01003_001 7066140 NA MULTIPOLYGON (((1950592 729…
9 48113015500 Census Tract 155, Dallas County, Texas DP05_0071P 56.6 7.5 19100 Dallas-Fort Worth-Arlington, TX Metro Area B01003_001 7573136 NA MULTIPOLYGON (((1778712 763…
11 48439102000 Census Tract 1020, Tarrant County, Texas DP05_0071P 15.7 6.4 19100 Dallas-Fort Worth-Arlington, TX Metro Area B01003_001 7573136 NA MULTIPOLYGON (((1746016 762…
13 48201450200 Census Tract 4502, Harris County, Texas DP05_0071P 10.7 3.9 26420 Houston-The Woodlands-Sugar Land, TX Metro Area B01003_001 7066140 NA MULTIPOLYGON (((1925521 730…
14 48201450400 Census Tract 4504, Harris County, Texas DP05_0071P 26.8 13.1 26420 Houston-The Woodlands-Sugar Land, TX Metro Area B01003_001 7066140 NA MULTIPOLYGON (((1922292 730…
22 48157670300 Census Tract 6703, Fort Bend County, Texas DP05_0071P 27.0 5.8 26420 Houston-The Woodlands-Sugar Land, TX Metro Area B01003_001 7066140 NA MULTIPOLYGON (((1935603 728…
25 48201554502 Census Tract 5545.02, Harris County, Texas DP05_0071P 13.4 3.0 26420 Houston-The Woodlands-Sugar Land, TX Metro Area B01003_001 7066140 NA MULTIPOLYGON (((1918552 732…
27 48121020503 Census Tract 205.03, Denton County, Texas DP05_0071P 35.3 8.7 19100 Dallas-Fort Worth-Arlington, TX Metro Area B01003_001 7573136 NA MULTIPOLYGON (((1766859 768…

The output dataset has been reduced from 5,265 Census tracts to 3,201 as a result of the inner spatial join. Notably, the output dataset now includes information for each Census tract about the metropolitan area that it falls within. This enables group-wise data visualization and analysis across metro areas such as a faceted plot:

hispanic_by_metro %>%
mutate(NAME_metro = str_replace(NAME_metro, ", TX Metro Area", "")) %>%
ggplot() +
geom_density(aes(x = estimate_tracts), color = "navy", fill = "navy",
alpha = 0.4) +
theme_minimal() +
facet_wrap(~NAME_metro) +
labs(title = "Distribution of Hispanic/Latino population by Census tract",
subtitle = "Largest metropolitan areas in Texas",
y = "Kernel density estimate",
x = "Percent Hispanic/Latino in Census tract")

Output from a spatial join operation can also be “rolled up” to a larger geography through group-wise data analysis. For example, let’s say we want to know the median value of the four distributions visualized in the plot above. As explained in Section 3.3, we can accomplish this by grouping our dataset by metro area then summarizing using the median() function.

median_by_metro <- hispanic_by_metro %>%
group_by(NAME_metro) %>%
summarize(median_hispanic = median(estimate_tracts, na.rm = TRUE))
Table 7.6: Summarized median Hispanic population by metro
NAME_metro median_hispanic geometry
Austin-Round Rock-Georgetown, TX Metro Area 25.9 POLYGON ((1732228 7281497, …
Dallas-Fort Worth-Arlington, TX Metro Area 22.6 POLYGON ((1868068 7602006, …
Houston-The Woodlands-Sugar Land, TX Metro Area 32.4 MULTIPOLYGON (((1957796 721…
San Antonio-New Braunfels, TX Metro Area 53.5 POLYGON ((1699769 7217758, …

The grouping column (NAME_metro) and the output of summarize() (median_hispanic) are returned as expected. However, the group_by() %>% summarize() operations also return the dataset as a simple features object with geometry, but in this case with only 4 rows. Let’s take a look at the output geometry:

plot(median_by_metro[1,]$geometry) The returned geometry represents the extent of the given metropolitan area (in the above example, Austin-Round Rock). The analytic process we carried out not only summarized the data by group, it also summarized the geometry by group. The typical name for this geometric process in geographic information systems is a dissolve operation, where geometries are identified by group and combined to return a single larger geometry. In this case, the Census tracts are dissolved by metropolitan area, returning metropolitan area geometries. This type of process is extremely useful when creating custom geographies (e.g. sales territories) from Census geometry building blocks that may belong to the same group. ## 7.3 Distance and proximity analysis A common use case for spatially-referenced demographic data is the analysis of accessibility. This might include studying the relative accessibility of different demographic groups to resources within a given region, or analyzing the characteristics of potential customers who live within a given distance of a store. Conceptually, there are a variety of ways to measure accessibility. The most straightforward method, computationally, is using straight-line (Euclidean) distances over geographic data in a projected coordinate system. A more computationally complex - but potentially more accurate - method involves the use of transportation networks to model accessibility, where proximity is measured not based on distance from a given location but instead based on travel times for a given transit mode, such as walking, cycling, or driving. This section will illustrate both types of approaches. Let’s consider the topic of accessibility to Level I and Level II trauma hospitals by Census tract in the state of Iowa. 2019 Census tract boundaries are acquired from tigris, and we use st_read() to read in a shapefile of hospital locations acquired from the US Department of Homeland Security. library(tigris) library(sf) library(tidyverse) options(tigris_use_cache = TRUE) # CRS: NAD83 / Iowa North ia_tracts <- tracts("IA", cb = TRUE, year = 2019) %>% st_transform(26975) hospital_url <- "https://opendata.arcgis.com/api/v3/datasets/6ac5e325468c4cb9b905f1728d6fbf0f_0/downloads/data?format=geojson&spatialRefId=4326" trauma <- st_read(hospital_url) %>% filter(str_detect(TRAUMA, "LEVEL I\\b|LEVEL II\\b|RTH|RTC")) %>% st_transform(26975) ## Reading layer Hospitals' from data source ## https://opendata.arcgis.com/api/v3/datasets/6ac5e325468c4cb9b905f1728d6fbf0f_0/downloads/data?format=geojson&spatialRefId=4326' ## using driver GeoJSON' ## Simple feature collection with 7596 features and 32 fields ## Geometry type: POINT ## Dimension: XY ## Bounding box: xmin: -176.6403 ymin: -14.29024 xmax: 145.7245 ymax: 71.29773 ## Geodetic CRS: WGS 84 names(trauma) ## [1] "OBJECTID" "ID" "NAME" "ADDRESS" "CITY" ## [6] "STATE" "ZIP" "ZIP4" "TELEPHONE" "TYPE" ## [11] "STATUS" "POPULATION" "COUNTY" "COUNTYFIPS" "COUNTRY" ## [16] "LATITUDE" "LONGITUDE" "NAICS_CODE" "NAICS_DESC" "SOURCE" ## [21] "SOURCEDATE" "VAL_METHOD" "VAL_DATE" "WEBSITE" "STATE_ID" ## [26] "ALT_NAME" "ST_FIPS" "OWNER" "TTL_STAFF" "BEDS" ## [31] "TRAUMA" "HELIPAD" "geometry" ### 7.3.1 Calculating distances To determine accessibility of Iowa Census tracts to Level I or II trauma centers, we need to identify not only those hospitals that are located in Iowa, but also those in other states near to the Iowa border, such as in Omaha, Nebraska and Rock Island, Illinois. We can accomplish this by applying a distance threshold in st_filter(). In this example, we use the spatial predicate st_is_within_distance, and set a 100km distance threshold with the dist = 100000 argument (specified in meters, the base measurement unit of our coordinate system used). ia_trauma <- trauma %>% st_filter(ia_tracts, .predicate = st_is_within_distance, dist = 100000) ggplot() + geom_sf(data = ia_tracts, color = "NA", fill = "grey50") + geom_sf(data = ia_trauma, color = "red") + theme_void() As illustrated in the visualization, the st_filter() operation has retained Level I and II trauma centers within the state of Iowa, but also within the 100km threshold beyond the state’s borders. With the Census tract and hospital data in hand, we can calculate distances from Census tracts to trauma centers by using the st_distance() function in the sf package. st_distance(x, y) by default returns the dense matrix of distances computed from the geometries in x to the geometries in y. In this example, we will calculate the distances from the centroids of Iowa Census tracts (reflecting the center points of each tract geometry) to each trauma center. dist <- ia_tracts %>% st_centroid() %>% st_distance(ia_trauma) dist[1:5, 1:5] ## Units: [m] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 279570.18 279188.81 385140.7 383863.7 257745.5 ## [2,] 298851.01 298409.46 400428.5 399022.9 276955.8 ## [3,] 350121.53 347800.57 353428.8 350263.3 404616.0 ## [4,] 361742.20 360450.59 421691.6 419364.9 369415.7 ## [5,] 66762.19 63479.67 143552.1 143935.5 194001.0 A glimpse at the matrix shows distances (in meters) between the first five Census tracts in the dataset and the first five hospitals. When considering accessibility, we may be interested in the distance to the nearest hospital to each Census tract. The code below extracts the minimum distance from the matrix for each row, converts to a vector, and divides each value by 1000 to convert values to kilometers. A quick histogram visualizes the distribution of minimum distances. min_dist <- dist %>% apply(1, min) %>% as.vector() %>% magrittr::divide_by(1000) hist(min_dist) While many tracts are within 10km of a trauma center, around 16 percent of Iowa Census tracts are beyond 100km from a Level I or II trauma center, suggesting significant accessibility issues for these areas. ### 7.3.2 Calculating travel times An alternative way to model accessibility to hospitals is through travel times rather than distance, as the way that people experience access to locations is through time expended given a transportation network. While network-based accessibility may be a more accurate representation of people’s lived experiences, it is more computationally complex and requires additional tools. To perform spatial network analyses, R users will either need to obtain network data (like roadways) and use appropriate tools that can model the network; set up a routing engine that R can connect to; or connect to a hosted routing engine via a web API. In this example, we’ll use the mapboxapi R package to perform network analysis using Mapbox’s travel-time Matrix API. The function mb_matrix() in mapboxapi works much like st_distance() in that it only requires arguments for origins and destinations, and will return the dense matrix of travel times by default. In turn, much of the computational complexity of routing is abstracted away by the function. However, as routes will be computed across the state of Iowa and API usage is subject to rate-limitations, the function can take several minutes to compute for larger matrices like this one. If you are using mapboxapi for the first time, visit mapbox.com, register for an account, and obtain an access token. The function mb_access_token() installs this token in your .Renviron for future use. library(mapboxapi) # mb_access_token("pk.eybcasq..., install = TRUE) times <- mb_matrix(ia_tracts, ia_trauma) times[1:5, 1:5] ## [,1] [,2] [,3] [,4] [,5] ## [1,] 211.43833 212.50667 278.5733 284.9717 212.1050 ## [2,] 218.15167 214.06000 280.1267 286.5250 226.7733 ## [3,] 274.84500 270.75333 291.8367 290.7117 292.2767 ## [4,] 274.58333 270.49167 291.5750 290.4500 292.0150 ## [5,] 56.80333 52.71167 122.3017 128.7000 161.2617 A glimpse at the travel-time matrix shows a similar format to the distance matrix, but with travel times in minutes used instead of meters. As with the distance-based example, we can determine the minimum travel time from each tract to a Level I or Level II trauma center. In this instance, we will visualize the result on a map. min_time <- apply(times, 1, min) ia_tracts$time <- min_time

ggplot(ia_tracts, aes(fill = time)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "magma") +
theme_void() +
labs(fill = "Time (minutes)",
title = "Travel time to nearest Level I or Level II trauma hospital",
subtitle = "Census tracts in Iowa",
caption = "Data sources: US Census Bureau, US DHS, Mapbox")

The map illustrates considerable accessibility gaps to trauma centers across the state. Whereas urban residents typically live within 20 minutes of a trauma center, travel times in rural Iowa can exceed two hours.

An advantage to using a package like mapboxapi for routing and travel times is that users can connect directly to a hosted routing engine using an API. Due to rate limitations, however, web APIs are likely inadequate for more advanced users who need to compute travel times at scale. There are several R packages that can connect to user-hosted routing engines which may be better-suited to such tasks. These packages include osrm for the Open Source Routing Machine; opentripplanner for OpenTripPlanner; and r5r for R5.

### 7.3.3 Catchment areas with buffers and isochrones

The above example considers a broader accessibility analysis across the state of Iowa. In many cases, however, you’ll want to analyze accessibility in a more local way. A common use case might involve a study of the demographic characteristics of a hospital catchment area, defined as the area around a hospital from which patients will likely come.

As with the matrix-based accessibility approach outlined above, catchment area-based proximity can be modeled with either Euclidean distances or network travel times as well. Let’s consider the example of Iowa Methodist Medical Center in Des Moines, one of two Level I trauma centers in the state of Iowa.

The example below illustrates the distance-based approach using a buffer, implemented with the st_buffer() function in sf. A buffer is a common GIS operation that represents the area within a given distance of a location. The code below creates a 5km buffer around Iowa Methodist Medical Center by using the argument dist = 5000.

iowa_methodist <- filter(ia_trauma, NAME == "IOWA METHODIST MEDICAL CENTER")

buf5km <- st_buffer(iowa_methodist, dist = 5000) 

An alternative option is to create network-based isochrones, which are polygons that represent the accessible area around a given location within a given travel time for a given travel mode. Isochrones are implemented in the mapboxapi package with the mb_isochrone() function. The example below draws a 10-minute driving isochrone around Iowa Methodist.

iso10min <- mb_isochrone(iowa_methodist, time = 10)

We can visualize the comparative extents of these two methods in Des Moines. Run the code on your own computer to get a synced interactive map showing the two methods.

library(leaflet)
library(leafsync)

hospital_icon <- makeAwesomeIcon(icon = "ios-medical",
markerColor = "red",
library = "ion")

# The Leaflet package requires data be in CRS 4326
map1 <- leaflet() %>%
icon = hospital_icon)

map2 <- leaflet() %>%
icon = hospital_icon)

sync(map1, map2)

The comparative maps illustrate the differences between the two methods quite clearly. Many areas of equal distance to the hospital do not have the same level of access; this is particularly true of areas to the south of the Raccoon/Des Moines River. Conversely, due to the location of highways, there are some areas outside the 5km buffer area that can reach the hospital within 10 minutes.

### 7.3.4 Computing demographic estimates for zones with areal interpolation

Common to both methods, however, is a mis-alignment between their geometries and those of any Census geographies we may use to infer catchment area demographics. As opposed to the spatial overlay analysis matching Census tracts to metropolitan areas earlier in this chapter, Census tracts or block groups on the edge of the catchment area will only be partially included in the catchment. A common method for resolving this issue in geographic information science is areal interpolation. Areal interpolation refers to the allocation of data from one set of zones to a second overlapping set of zones that may or may not perfectly align spatially. In cases of mis-alignment, some type of weighting scheme needs to be specified to determine how to allocate partial data in areas of overlap.

Let’s produce interpolated estimates of the percentage of population in poverty for both catchment area definitions. This will require obtaining block-group level poverty information from the ACS for Polk County, Iowa, which encompasses both the buffer and the isochrone. The variables requested from the ACS include the number of family households with incomes below the poverty line along with total number of family households to serve as a denominator.

polk_poverty <- get_acs(
geography = "block group",
variables = c(poverty_denom = "B17010_001",
poverty_num = "B17010_002"),
state = "IA",
county = "Polk",
geometry = TRUE,
output = "wide",
year = 2019
) %>%
select(poverty_denomE, poverty_numE) %>%
st_transform(26975)

Area-weighted areal interpolation is implemented in sf with the st_interpolate_aw() function. This method uses the area of overlap of geometries as the interpolation weights. This means that, in this example, Census tracts that fall entirely within the 5km buffer polygon will receive a weight of 1. Census tracts that overlap the edge of the buffer area will receive a weight commensurate with the proportion of their area that falls within the buffer area. Those weights are applied to target variables (in this case, the poverty information) in accordance with the value of the extensive argument. If extensive = TRUE, as used below, weighted sums will be computed. Alternatively, if extensive = FALSE, the function returns weighted means.

library(glue)

buffer_pov <- st_interpolate_aw(
polk_poverty,
st_transform(buf5km, 26975),
extensive = TRUE
) %>%
mutate(pct_poverty = 100 * (poverty_numE / poverty_denomE))

iso_pov <- st_interpolate_aw(
polk_poverty,
st_transform(iso10min, 26975),
extensive = TRUE
) %>%
mutate(pct_poverty = 100 * (poverty_numE / poverty_denomE))

print(glue("Family poverty (5km buffer method): {round(buffer_pov$pct_poverty, 1)}%")) ## Family poverty (5km buffer method): 13.5% print(glue("Family poverty (10min isochrone method): {round(iso_pov$pct_poverty, 1)}%"))
## Family poverty (10min isochrone method): 13.9%

The two methods return slightly different results, illustrating how the definition of catchment area impacts downstream analyses.

Following up on Section 5.3.3, areal interpolation is also a common method used to adjust population data aggregated to inconsistent Census boundaries over time. However, it can introduce inaccuracies in sparsely populated areas, which can be improved with a population-weighted interpolation approach. While this approach is not covered here, provide a good discussion of the method.

## 7.4 Better cartography with spatial overlay

As discussed in Section 6.1, one of the major benefits of working with the tidycensus package to get Census data in R is its ability to retrieve pre-joined feature geometry for Census geographies with the argument geometry = TRUE. tidycensus uses the tigris package to fetch these geometries, which default to the Census Bureau’s cartographic boundary shapefiles. Cartographic boundary shapefiles are preferred to the core TIGER/Line shapefiles in tidycensus as their smaller size speeds up processing and because they are pre-clipped to the US coastline.

However, there may be circumstances in which your mapping requires more detail. A good example of this would be maps of New York City, in which even the cartographic boundary shapefiles include water area. For example, take this example of median household income by Census tract in Manhattan (New York County), NY:

library(tidycensus)
library(tidyverse)
options(tigris_use_cache = TRUE)

ny <- get_acs(
geography = "tract",
variables = "B19013_001",
state = "NY",
county = "New York",
year = 2019,
geometry = TRUE
)

ggplot(ny) +
geom_sf(aes(fill = estimate)) +
scale_fill_viridis_c(labels = scales::dollar) +
theme_void() +
labs(fill = "Median household\nincome")

As illustrated in the graphic, the boundaries of Manhattan include water boundaries - stretching into the Hudson and East Rivers. In turn, a more accurate representation of Manhattan’s land area might be desired. To accomplish this, a tidycensus user can use the core TIGER/Line shapefiles instead, then erase water area from Manhattan’s geometry.

### 7.4.1 “Erasing” areas from Census polygons

tidycensus allows users to get TIGER/Line instead of cartographic boundary shapefiles with the keyword argument cb = FALSE. This argument will be familiar to users of the tigris package, as it is used by tigris to distinguish between cartographic boundary and TIGER/Line shapefiles in the package.

# CRS: NAD83(2011) / New York Long Island
ny2 <- get_acs(
geography = "tract",
variables = "B19013_001",
state = "NY",
county = "New York",
geometry = TRUE,
year = 2019,
cb = FALSE
) %>%
st_transform(6538)

Next, tools in the tigris and sf package can be used to remove the water area from Manhattan’s Census tracts. sf allows users to “erase” one geometry from another, akin to tools available in desktop GIS software. The st_erase() function defined below is not exported by the package, but is defined in the documentation for st_difference().

The geometry used to “erase” water area from the tract polygons is obtained by the area_water() function in tigris.

library(sf)
library(tigris)

st_erase <- function(x, y) {
st_difference(x, st_union(y))
}

ny_water <- area_water("NY", "New York", year = 2019) %>%
st_transform(6538)

ny_erase <- st_erase(ny2, ny_water)

After performing this operation, we can map the result:

ggplot(ny_erase) +
geom_sf(aes(fill = estimate)) +
scale_fill_viridis_c(labels = scales::dollar) +
theme_void() +
labs(fill = "Median household\nincome")

The map appears as before, but now with a more familiar representation of the extent of Manhattan.

## 7.5 Spatial neighborhoods and spatial weights matrices

The spatial capabilities of tidycensus also allow for exploratory spatial data analysis (ESDA) within R. ESDA refers to the use of datasets’ spatial properties in addition to their attributes to explore patterns and relationships. This may involve exploration of spatial patterns in datasets or identification of spatial clustering of a given demographic attribute.

To illustrate how an analyst can apply ESDA to Census data, let’s acquire a dataset on median age by Census tract in the Dallas-Fort Worth, TX metropolitan area. Census tracts in the metro area will be identified using methods introduced earlier in this chapter.

library(tidycensus)
library(tidyverse)
library(tigris)
library(sf)
library(spdep)
options(tigris_use_cache = TRUE)

# CRS: NAD83 / Texas North Central
dfw <- core_based_statistical_areas(cb = TRUE, year = 2019) %>%
filter(str_detect(NAME, "Dallas")) %>%
st_transform(32138)

dfw_tracts <- get_acs(
geography = "tract",
variables = "B01002_001",
state = "TX",
year = 2019,
geometry = TRUE
) %>%
st_transform(32138) %>%
st_filter(dfw, .predicate = st_within) %>%
na.omit()

ggplot(dfw_tracts) +
geom_sf(aes(fill = estimate), color = NA) +
scale_fill_viridis_c() +
theme_void()

### 7.5.1 Understanding spatial neighborhoods

Exploratory spatial data analysis relies on the concept of a neighborhood, which is a representation of how a given geographic feature interrelates with other features nearby. The workhorse package for exploratory spatial data analysis in R is spdep, which includes a wide range of tools for exploring and modeling spatial data. As part of this framework, spdep supports a variety of neighborhood definitions. These definitions include:

• Proximity-based neighbors, where neighboring features are identified based on some measure of distance. Neighbors might be defined as those that fall within a given distance threshold (e.g. all features within 2km of a given feature) or as k-nearest neighbors (e.g. the nearest eight features to a given feature).
• Graph-based neighbors, where neighbors are defined through network relationships (e.g. along a street network).
• Contiguity-based neighbors, used when geographic features are polygons. Options for contiguity-based spatial relationships include queen’s case neighbors, where all polygons that share at least one vertex are considered neighbors; and rook’s case neighbors, where polygons must share at least one line segment to be considered neighbors.

In this example, we’ll choose a queen’s case contiguity-based neighborhood definition for our Census tracts. We implement this with the function poly2nb(), which can take an sf object as an argument and produce a neighbors list object. We use the argument queen = TRUE to request queen’s case neighbors explicitly (though this is the function default).

neighbors <- poly2nb(dfw_tracts, queen = TRUE)

summary(neighbors)
## Neighbour list object:
## Number of regions: 1310
## Number of nonzero links: 8446
## Percentage nonzero weights: 0.4921625
## Average number of links: 6.447328
##
##   2   3   4   5   6   7   8   9  10  11  12  13  16
##   6  43 111 254 286 291 159  94  30  20  11   4   1
## 6 least connected regions:
## 15 626 663 768 823 824 with 2 links
## 1 most connected region:
## 924 with 16 links

On average, the Census tracts in the Dallas-Fort Worth metropolitan area have 6.44 neighbors. The minimum number of neighbors in the dataset is 2 (there are six such tracts), and the maximum number of neighbors is 16 (the tract at row index 924). An important caveat to keep in mind here is that tracts with few neighbors may actually have more neighbors than listed here given that we have restricted the tract dataset to those tracts within the Dallas-Fort Worth metropolitan area. In turn, our analysis will be influenced by edge effects as neighborhoods on the edge of the metropolitan area are artificially restricted.

Neighborhood relationships can be visualized using plotting functionality in spdep:

dfw_coords <- dfw_tracts %>%
st_centroid() %>%
st_coordinates()

plot(dfw_tracts$geometry) plot(neighbors, coords = dfw_coords, add = TRUE, col = "blue", points = FALSE) Additionally, row indices for the neighbors of a given feature can be readily extracted from the neighbors list object. # Get the row indices of the neighbors of the Census tract at row index 1 neighbors[[1]] ## [1] 47 153 227 246 353 613 660 751 942 1025 1208 ### 7.5.2 Generating the spatial weights matrix To perform exploratory spatial data analysis, we can convert the neighbors list object into spatial weights. Spatial weights define how metrics associated with a feature’s neighbors should be weighted. Weight generation is implemented in the nb2listw() function, to which we pass the neighbors object and specify a style of weights. The default, style = "W", produces a row-standardized weights object where the weights for all neighbors of a given feature sum to 1. This is the option you would choose when analyzing neighborhood means. An alternative option, style = "B", produces binary weights where neighbors are given the weight of 1 and non-neighbors take the weight of 0. This style of weights is useful for producing neighborhood sums. In the example below, we create row-standardized spatial weights for the Dallas-Fort Worth Census tracts and check their values for the feature at row index 1. weights <- nb2listw(neighbors, style = "W") weights$weights[[1]]
##  [1] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909
##  [7] 0.09090909 0.09090909 0.09090909 0.09090909 0.09090909

Given that the Census tract at row index 1 has eleven neighbors, each neighbor is assigned the weight 0.0909.

## 7.6 Global and local spatial autocorrelation

The row-standardized spatial weights object named weights provides the needed information to perform exploratory spatial data analysis of educational attainment in the Dallas-Fort Worth metropolitan area. In many cases, an analyst may be interested in understanding how the attributes of geographic features relate to those of their neighbors. Formally, this concept is called spatial autocorrelation. The concept of spatial autocorrelation relates to Waldo Tobler’s famous “first law of geography,” which reads :

Everything is related to everything else, but near things are more related than distant things.

This formulation informs much of the theory behind spatial data science and geographical inquiry more broadly. With respect to the exploratory spatial analysis of Census data, we might be interested in the degree to which a given attribute clusters spatially, and subsequently where those clusters are found. One such way to assess clustering is to assess the degree to which attribute values are similar to or differ from those of their neighbors as defined by a weights matrix. Patterns can in turn be explained as follows:

• Spatial clustering: data values tend to be similar to neighboring data values;
• Spatial uniformity: data values tend to differ from neighboring data values;
• Spatial randomness: there is no apparent relationship between data values and those of their neighbors.

Given Tobler’s first law of geography, we tend to expect that most geographic phenomena exhibit some degree of spatial clustering. This section introduces a variety of methods available in R to evaluate spatial clustering using ESDA and the spdep package .

### 7.6.1 Spatial lags and Moran’s I

Spatial weights matrices can be used to calculate the spatial lag of a given attribute for each observation in a dataset. The spatial lag refers to the neighboring values of an observation given a spatial weights matrix. As discussed above, row-standardized weights matrices will produce lagged means, and binary weights matrices will produce lagged sums. Spatial lag calculations are implemented in the function lag.listw(), which requires a spatial weights list object and a numeric vector from which to compute the lag.

dfw_tracts$lag_estimate <- lag.listw(weights, dfw_tracts$estimate)

The code above creates a new column in dfw_tracts, lag_estimate, that represents the mean percentage of bachelor’s degree holders for the neighbors of each Census tract in the Dallas-Fort Worth metropolitan area. Using this information, we can draw a scatterplot of the ACS estimate vs. its lagged mean to do a preliminary assessment of spatial clustering in the data.

ggplot(dfw_tracts, aes(x = estimate, y = lag_estimate)) +
geom_point(alpha = 0.3) +
geom_abline(color = "red") +
theme_minimal() +
labs(title = "Median age by Census tract, Dallas-Fort Worth TX",
x = "Median age",
y = "Spatial lag, median age",
caption = "Data source: 2015-2019 ACS via the tidycensus R package.\nSpatial relationships based on queens-case polygon contiguity.")

The scatterplot suggests a positive correlation between the ACS estimate and its spatial lag, representative of spatial autocorrelation in the data. This relationship can be evaluated further by using a test of global spatial autocorrelation. The most common method used for spatial autocorrelation evaluation is Moran’s $$I$$, which can be interpreted similar to a correlation coefficient but for the relationship between observations and their neighbors. The statistic is computed as:

$I = \frac{N}{W}\frac{\sum_i\sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i(x_i - \bar{x})^2}$

where $$w_{ij}$$ represents the spatial weights matrix, $$N$$ is the number of spatial units denoted by $$i$$ and $$j$$, and $$W$$ is the sum of the spatial weights.

Moran’s $$I$$ is implemented in spdep with the moran.test() function, which requires a numeric vector and a spatial weights list object.

moran.test(dfw_tracts$estimate, weights) ## ## Moran I test under randomisation ## ## data: dfw_tracts$estimate
## weights: weights
##
## Moran I statistic standard deviate = 22.979, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance
##      0.3593580070     -0.0007639419      0.0002456130

The Moran’s $$I$$ statistic of 0.359 is positive, and the small p-value suggests that we reject the null hypothesis of spatial randomness in our dataset. (See Section 8.2.4 for additional discussion of p-values). In a practical sense, this means that Census tracts with older populations tend to be located near one another, and Census tracts with younger populations also tend to be found in the same areas.

### 7.6.2 Local spatial autocorrelation

We can explore this further with local spatial autocorrelation analysis. Local measures of spatial autocorrelation disaggregate global results to identify “hot spots” of similar values within a given spatial dataset. One such example is the Getis-Ord local G statistic , which is computed as follows:

$G_{i} = \dfrac{\sum\limits_{j}w_{ij}x_j}{\sum\limits_{j=1}^{n}x_j} \text{ for all }i \neq j$

In summary, the equation computes a ratio of the weighted average of the neighborhood values to the total sum of values for the dataset. While the default version of the local G (represented in the equation above) omits the location $$i$$ from its calculation, a variant of the local G statistic, $$G_i*$$, includes this location. Results are returned as z-scores, and implemented in the localG() function in spdep.

The code below calculates the local G variant $$G_i*$$ by re-generating the weights matrix with include.self(), then passing this weights matrix to the localG() function.

# For Gi*, re-compute the weights with include.self()
localg_weights <- nb2listw(include.self(neighbors))

dfw_tracts$localG <- localG(dfw_tracts$estimate, localg_weights)

ggplot(dfw_tracts) +
geom_sf(aes(fill = localG), color = NA) +
scale_fill_distiller(palette = "RdYlBu") +
theme_void() +
labs(fill = "Local Gi* statistic")

Given that the returned results are z-scores, an analyst can choose hot spot thresholds in the statistic, calculate them with case_when(), then plot them accordingly.

dfw_tracts <- dfw_tracts %>%
mutate(hotspot = case_when(
localG >= 2.56 ~ "High cluster",
localG <= -2.56 ~ "Low cluster",
TRUE ~ "Not significant"
))

ggplot(dfw_tracts) +
geom_sf(aes(fill = hotspot), color = "grey90", size = 0.1) +
scale_fill_manual(values = c("red", "blue", "grey")) +
theme_void()

### 7.6.3 Identifying clusters and spatial outliers with local indicators of spatial association (LISA)

An alternative method for the calculation of local spatial autocorrelation is the local indicators of spatial association statistic, commonly referred to as LISA or the local form of Moran’s $$I$$ . As an extension of the Global Moran’s $$I$$ statistic, the local statistic $$I_i$$ for a given local feature $$i$$ with neighbors $$j$$ is computed as follows:

$I_i = z_i \sum_j w_{ij} z_j,$

where $$z_i$$ and $$z_j$$ are expressed as deviations from the mean.

LISA is a popular method for exploratory spatial data analysis in the spatial social sciences implemented in a variety of software packages. ArcGIS implements LISA in its Cluster and Outlier Analysis geoprocessing tool; Anselin’s open-source GeoDa software has a graphical interface for calculating LISA statistics; and Python users can compute LISA using the PySAL library.

In R, LISA can be computed using the localmoran() family of functions in the spdep package. For users familiar with using LISA in other software packages, the localmoran_perm() function implements LISA where statistical significance is calculated based on a conditional permutation-based approach.

The example below calculates local Moran’s $$I$$ statistics in a way that resembles the output from GeoDa, which returns a cluster map and a Moran scatterplot. One of the major benefits of using LISA for exploratory analysis is its ability to identify both spatial clusters, where observations are surrounded by similar values, and spatial outliers, where observations are surrounded by dissimilar values. We’ll use this method to explore clustering and the possible presence of spatial outliers with respect to Census tract median age in Dallas-Fort Worth.

set.seed(1983)

dfw_tracts$scaled_estimate <- as.numeric(scale(dfw_tracts$estimate))

dfw_lisa <- localmoran_perm(
dfw_tracts\$scaled_estimate,
weights,
nsim = 999L,
alternative = "two.sided"
) %>%
as_tibble() %>%
set_names(c("local_i", "exp_i", "var_i", "z_i", "p_i",
"p_i_sim", "pi_sim_folded", "skewness", "kurtosis"))

dfw_lisa_df <- dfw_tracts %>%
select(GEOID, scaled_estimate) %>%
mutate(lagged_estimate = lag.listw(weights, scaled_estimate)) %>%
bind_cols(dfw_lisa)

The above code uses the following steps:

1. First, a random number seed is set given that we are using the conditional permutation approach to calculating statistical significance. This will ensure reproducibility of results when the process is re-run.
2. The ACS estimate for median age is converted to a z-score using scale(), which subtracts the mean from the estimate then divides by its standard deviation. This follows convention from GeoDa.
3. LISA is computed with localmoran_perm() for the scaled value for median age, using the contiguity-based spatial weights matrix. 999 conditional permutation simulations are used to calculate statistical significance, and the argument alternative = "two.sided" will identify both statistically significant clusters and statistically significant spatial outliers.
4. The LISA data frame is attached to the Census tract shapes after computing the lagged value for median age.

Our result appears as follows:

Table 7.7: Local Moran’s I results
GEOID scaled_estimate lagged_estimate local_i exp_i var_i z_i p_i p_i_sim pi_sim_folded skewness kurtosis geometry
48113019204 -0.9083914 -0.3654568 0.3322315 -0.0019753 0.0767748 1.2061629 0.2277547 0.234 0.117 -0.1346414 -0.0136394 MULTIPOLYGON (((761835.4 21…
48113015500 -1.0195775 -0.7515397 0.7668383 -0.0115062 0.1304586 2.1549417 0.0311664 0.022 0.011 -0.2059971 -0.0706536 MULTIPOLYGON (((738672.9 21…
48439102000 -0.3365773 -0.1822783 0.0613976 0.0029962 0.0163324 0.4569814 0.6476844 0.670 0.335 -0.1938830 -0.0417957 MULTIPOLYGON (((705890.8 21…
48121020503 -1.0195775 0.3662775 -0.3737336 0.0185373 0.2419147 -0.7975446 0.4251348 0.390 0.195 -0.2628396 0.1497801 MULTIPOLYGON (((727489.9 21…
48439113710 -0.2730424 0.6588028 -0.1800185 0.0004427 0.0244266 -1.1546556 0.2482315 0.272 0.136 -0.2869858 0.0021632 MULTIPOLYGON (((730154.3 21…

The information returned by localmoran_perm() can be used to compute both a GeoDa-style LISA quadrant plot as well as a cluster map. The LISA quadrant plot is similar to a Moran scatterplot, but also identifies “quadrants” of observations with respect to the spatial relationships identified by LISA. The code below uses case_when() to recode the data into appropriate categories for the LISA quadrant plot, using a significance level of p = 0.05.

dfw_lisa_clusters <- dfw_lisa_df %>%
mutate(lisa_cluster = case_when(
p_i >= 0.05 ~ "Not significant",
scaled_estimate > 0 & local_i > 0 ~ "High-high",
scaled_estimate > 0 & local_i < 0 ~ "High-low",
scaled_estimate < 0 & local_i > 0 ~ "Low-low",
scaled_estimate < 0 & local_i < 0 ~ "Low-high"
))

The LISA quadrant plot then appears as follow:

color_values <- c(High-high = "red",
High-low = "pink",
Low-low = "blue",
Low-high = "lightblue",
Not significant = "white")

ggplot(dfw_lisa_clusters, aes(x = scaled_estimate,
y = lagged_estimate,
fill = lisa_cluster)) +
geom_point(color = "black", shape = 21, size = 2) +
theme_minimal() +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
scale_fill_manual(values = color_values) +
labs(x = "Median age (z-score)",
y = "Spatial lag of median age (z-score)",
fill = "Cluster type")

Observations falling in the top-right quadrant represent “high-high” clusters, where older Census tracts are also surrounded by older tracts in their spatial neighborhoods. Statistically significant clusters - those with a p-value less than or equal to 0.05 - are colored red on the chart. The bottom-left quadrant also represents spatial clusters, but instead includes younger tracts that are also surrounded by younger tracts. The top-left and bottom-right quadrants are home to the spatial outliers, where values are dissimilar from their neighbors.

GeoDa also implements a “cluster map” where observations are visualized in relationship to their cluster membership and statistical significance. The code below reproduces the GeoDa cluster map using ggplot2 and geom_sf().

ggplot(dfw_lisa_clusters, aes(fill = lisa_cluster)) +
geom_sf(size = 0.1) +
theme_void() +
scale_fill_manual(values = color_values) +
labs(fill = "Cluster type")

The map illustrates distinctive patterns of spatial clustering by age in the Dallas-Fort Worth region. Older clusters are colored red; this includes areas like the wealthy Highland Park community north of downtown Dallas. Younger clusters are colored dark blue, and found in areas like east Fort Worth, east Dallas, and Arlington in the center of the metropolitan area. Spatial outliers appear scattered throughout the map as well; in the Dallas area, low-high clusters are Census tracts with large quantities of multifamily housing that are adjacent to predominantly single-family neighborhoods.

One very useful feature of GeoDa for exploratory spatial data analysis is the ability to perform linked brushing between the LISA quadrant plot and cluster map. This allows users to click and drag on either plot and highlight the corresponding observations on the other plot. Building on the chart linking example using ggiraph introduced in Section 6.6.2, a linked brushing approach similar to GeoDa can be implemented in Shiny, and is represented in the image below and available at https://walkerke.shinyapps.io/linked-brushing/.

Using the lasso select tool, you can click and drag on either the scatterplot or the map and view the corresponding observations highlighted on the other chart panel. Code to reproduce this Shiny app is available in scripts/linked_brushing in the book’s GitHub repository.

## 7.7 Exercises

1. Identify a different core-based statistical area of interest and use the methods introduced in this chapter to extract Census tracts or block groups for that CBSA.
2. Replicate the st_erase() cartographic workflow for a different county with significant water area; a good choice is King County, Washington. Be sure to transform your data to an appropriate projected coordinate system (selected with suggest_crs()) first. If the operation creates too many “holes” for water areas, try filtering down the water dataset by its AWATER column and retaining only the largest water areas, then re-run and see what you get.
3. Acquire a spatial dataset with tidycensus for a region of interest and a variable of interest to you. Follow the instructions in this chapter to generate a spatial weights matrix, then compute a hot-spot analysis with localG()`.