class: center, middle, inverse, title-slide # Spatial Analysis of US Census Data in R ## Geometries, maps, and methods ### Kyle Walker ### March 11, 2021 --- ## About me * Associate Professor of Geography at TCU * Spatial data science researcher and consultant * R package developer: tidycensus, tigris, mapboxapi * Book coming this year: _Analyzing the US Census with R_ - These workshops are a sneak preview of the book's content! --- ## SSDAN workshop series * Today: spatial analysis and mapping in R * Last Thursday (March 4): an introduction to analyzing US Census data with tidycensus ([code](https://github.com/walkerke/umich-workshop/blob/main/census-data-in-r/code/part-1-code.R) | [video recording](https://www.youtube.com/watch?v=PnFJfuJ83NI)) * Thursday, March 25: working with US Census microdata (PUMS) with R and tidycensus --- ## Today's agenda * Hour 1: An introduction to Census geometries and the tigris package * Hour 2: Mapping Census data in R * Hour 3: Spatial analysis of Census data with the sf package --- class: middle, center, inverse ## Part 1: An introduction to Census geometries and the tigris package --- ## US Census Geography <img src=img/census_diagram.png style="width: 500px"> .footnote[Source: [US Census Bureau](https://www2.census.gov/geo/pdfs/reference/geodiagram.pdf)] --- ## Census TIGER/Line shapefiles .pull-left[ <img src=img/tiger_logo.png style="width: 350px"> ] .pull-right[ * TIGER: Topologically Integrated Geographic Encoding and Referencing database * High-quality series of geographic datasets released by the US Census Bureau * Distributed as _shapefiles_, a common GIS data format comprised of several related files ] .footnote[Image source: [US Census Bureau](https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2020/TGRSHP2020_TechDoc.pdf)] --- ## A typical GIS workflow <img src="img/co_counties.png" style="width: 800px"> --- ## The __tigris__ R package .pull-left[ <img src="https://raw.githubusercontent.com/walkerke/tigris/master/tools/readme/tigris_sticker.png" style="width: 400px"> ] .pull-right[ * R interface to the US Census Bureau's TIGER/Line shapefile FTP server * No API key necessary - just install the package and start using Census shapefiles in R! ] --- ## Basic usage of tigris * To use tigris, call a function that corresponds to the Census geography you want, optionally by `state` or `county`, when appropriate ```r or_counties <- counties(state = "OR") or_counties ``` ``` ## Simple feature collection with 36 features and 17 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: -124.7035 ymin: 41.99208 xmax: -116.4633 ymax: 46.2991 ## geographic CRS: NAD83 ## First 10 features: ## STATEFP COUNTYFP COUNTYNS GEOID NAME NAMELSAD LSAD CLASSFP ## 19 41 063 01155135 41063 Wallowa Wallowa County 06 H1 ## 88 41 013 01155128 41013 Crook Crook County 06 H1 ## 125 41 005 01155127 41005 Clackamas Clackamas County 06 H1 ## 173 41 007 01135846 41007 Clatsop Clatsop County 06 H1 ## 204 41 035 01155134 41035 Klamath Klamath County 06 H1 ## 222 41 049 01135860 41049 Morrow Morrow County 06 H1 ## 279 41 029 01135853 41029 Jackson Jackson County 06 H1 ## 320 41 051 01135861 41051 Multnomah Multnomah County 06 H1 ## 356 41 001 01135845 41001 Baker Baker County 06 H1 ## 500 41 003 01155126 41003 Benton Benton County 06 H1 ## MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT ALAND AWATER INTPTLAT ## 19 G4020 <NA> <NA> <NA> A 8147835329 14191752 +45.5937530 ## 88 G4020 140 39260 <NA> A 7715390842 21004473 +44.1630537 ## 125 G4020 440 38900 <NA> A 4845035564 31872078 +45.1604934 ## 173 G4020 <NA> 11820 <NA> A 2144872198 661629272 +46.0245094 ## 204 G4020 <NA> 28900 <NA> A 15391639810 503686589 +42.6837613 ## 222 G4020 <NA> 25840 <NA> A 5259046542 44372276 +45.4254956 ## 279 G4020 366 32780 <NA> A 7208597903 46766412 +42.4116214 ## 320 G4020 440 38900 <NA> A 1116560269 88220750 +45.5477107 ## 356 G4020 <NA> <NA> <NA> A 7945996793 51832631 +44.7034268 ## 500 G4020 440 18700 <NA> A 1748712549 7740248 +44.4938816 ## INTPTLON geometry ## 19 -117.1855796 MULTIPOLYGON (((-117.7475 4... ## 88 -120.3715849 MULTIPOLYGON (((-121.0145 4... ## 125 -122.1951274 MULTIPOLYGON (((-122.7437 4... ## 173 -123.7050366 MULTIPOLYGON (((-124.0081 4... ## 204 -121.6461682 MULTIPOLYGON (((-121.3165 4... ## 222 -119.6023111 MULTIPOLYGON (((-119.9982 4... ## 279 -122.6756106 MULTIPOLYGON (((-122.9512 4... ## 320 -122.4173625 MULTIPOLYGON (((-122.8675 4... ## 356 -117.6919334 MULTIPOLYGON (((-117.9462 4... ## 500 -123.4246641 MULTIPOLYGON (((-123.182 44... ``` --- ```r plot(or_counties$geometry) ``` ![](index_files/figure-html/basic-plot-1.png)<!-- --> --- ## The __sf__ package and simple feature geometry .pull-left[ <img src="https://user-images.githubusercontent.com/520851/34887433-ce1d130e-f7c6-11e7-83fc-d60ad4fae6bd.gif" style="width: 400px"> ] .pull-right[ * The sf package implements a _simple features data model_ for vector spatial data in R * Vector geometries: _points_, _lines_, and _polygons_ stored in a list-column of a data frame * Allows for tidy spatial data analysis (coming in hour 3!) ] --- ## Datasets available in tigris * __Legal entities__: units that have legal significance in the US (e.g. states, counties) * __Statistical entities__: units that are used to tabulate Census data but do not have legal standing (e.g. Census tracts or block groups) * __Geographic features__: other geographic datasets provided by the Census Bureau that are not used for demographic tabulation (e.g. roads, water) --- ## Example: statistical entities ```r benton_tracts <- tracts(state = "OR", county = "Benton") plot(benton_tracts$geometry) ``` ![](index_files/figure-html/benton-tracts-1.png)<!-- --> --- ## Example: geographic features ```r benton_roads <- roads(state = "OR", county = "Benton") plot(benton_roads$geometry) ``` ![](index_files/figure-html/benton-roads-1.png)<!-- --> --- ## Example: geographic features ```r dc_landmarks <- landmarks("DC", type = "point") plot(dc_landmarks$geometry) ``` ![](index_files/figure-html/dc-landmarks-1.png)<!-- --> --- ## How tigris works When you call a tigris function, it does the following: * _Downloads_ your data from the US Census Bureau website; * _Stores_ your data in a temporary directory by default; * _Loads_ your data into R as a simple features object using `sf::st_read()` * Recommended option: use `options(tigris_use_cache = TRUE)` to cache downloaded shapefiles and prevent having to re-download every time you use them --- class: middle, center, inverse ## tigris features and options --- ## Cartographic boundary shapefiles .pull-left[ * Question I've received over the years: "Why does Michigan look so weird?" * The core TIGER/Line shapefiles include _water area_ that belongs to US states and counties ] .pull-right[ ```r mi_counties <- counties("MI") plot(mi_counties$geometry) ``` ![](index_files/figure-html/michigan-tiger-1.png)<!-- --> ] --- ## Cartographic boundary shapefiles .pull-left[ * Use the argument `cb = TRUE` to obtain a _cartographic boundary shapefile_ pre-clipped to the US shoreline * For some geographies, highly generalized (1:5 million and 1:20 million) shapefiles are available with the `resolution` argument ] .pull-right[ ```r mi_counties_cb <- counties("MI", cb = TRUE) plot(mi_counties_cb$geometry) ``` ![](index_files/figure-html/michigan-cb-1.png)<!-- --> ] --- ## Understanding yearly differences in TIGER/Line files * Whereas legal entities change shape very rarely (but they do change!), statistical entities change with every decennial Census * tigris fetches Census shapefiles from 1990 up through 2020 ```r tarrant90 <- tracts("TX", "Tarrant", cb = TRUE, year = 1990) tarrant00 <- tracts("TX", "Tarrant", cb = TRUE, year = 2000) tarrant10 <- tracts("TX", "Tarrant", cb = TRUE, year = 2010) # Cartographic boundary files not yet released for 2020 tarrant20 <- tracts("TX", "Tarrant", year = 2020) ``` --- ```r par(mfrow = c(2, 2)) plot(tarrant90$geometry, main = "1990") plot(tarrant00$geometry, main = "2000") plot(tarrant10$geometry, main = "2010") plot(tarrant20$geometry, main = "2020") ``` ![](index_files/figure-html/plot-yearly-data-1.png)<!-- --> --- ## Interactive viewing of data with __mapview__ * The mapview package brings interactive spatial data viewing to R: ```r library(mapview) mapview(tarrant20) ``` * As an extension, use the leafsync package to interactively compare two or more maps ```r library(leafsync) sync(mapview(tarrant90), mapview(tarrant20)) ``` --- ## Combining spatial datasets * The `map_*()` family of functions in the tidyverse's purrr package allows for iteration over values/datasets, which works well for assembly of regional or national objects ```r library(tidyverse) state_codes <- c(state.abb, "DC") us_bgs <- map_df(state_codes, ~block_groups(state = .x, cb = TRUE)) glimpse(us_bgs) ``` ``` ## Rows: 217,453 ## Columns: 11 ## $ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", … ## $ COUNTYFP <chr> "073", "025", "077", "055", "131", "055", "055", "015", "073… ## $ TRACTCE <chr> "012910", "957602", "011802", "010200", "034700", "010601", … ## $ BLKGRPCE <chr> "1", "5", "2", "5", "1", "1", "2", "1", "1", "2", "2", "1", … ## $ AFFGEOID <chr> "1500000US010730129101", "1500000US010259576025", "1500000US… ## $ GEOID <chr> "010730129101", "010259576025", "010770118022", "01055010200… ## $ NAME <chr> "1", "5", "2", "5", "1", "1", "2", "1", "1", "2", "2", "1", … ## $ LSAD <chr> "BG", "BG", "BG", "BG", "BG", "BG", "BG", "BG", "BG", "BG", … ## $ ALAND <dbl> 3671177, 5858511, 4693254, 2359614, 402854096, 11699694, 418… ## $ AWATER <dbl> 150769, 0, 4993, 0, 19530524, 1158709, 0, 6370, 0, 0, 25350,… ## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-86.75138 3..., MULTIPOLYGON ((… ``` --- ## US Census shapefiles and coordinate reference systems * _Coordinate reference system_ (CRS): how coordinates in your spatial data are referenced to the Earth's surface * Distinction: _geographic coordinate systems_ (longitude/latitude) and _projected coordinate systems_ (planar, commonly measured in meters or US feet) * CRS used by Census shapefiles: North American Datum of 1983 (EPSG code 4269) --- ```r library(sf) fl_counties <- counties("FL", cb = TRUE) st_crs(fl_counties) ``` ``` ## Coordinate Reference System: ## User input: NAD83 ## wkt: ## GEOGCRS["NAD83", ## DATUM["North American Datum 1983", ## ELLIPSOID["GRS 1980",6378137,298.257222101, ## LENGTHUNIT["metre",1]]], ## PRIMEM["Greenwich",0, ## ANGLEUNIT["degree",0.0174532925199433]], ## CS[ellipsoidal,2], ## AXIS["latitude",north, ## ORDER[1], ## ANGLEUNIT["degree",0.0174532925199433]], ## AXIS["longitude",east, ## ORDER[2], ## ANGLEUNIT["degree",0.0174532925199433]], ## ID["EPSG",4269]] ``` --- ## Choosing a coordinate reference system * __crsuggest__ my _developmental_ R package for helping choose the right coordinate system transformation * To install: `remotes::install_github("walkerke/crsuggest")` * Two core functions: `suggest_crs()` for a table of CRS recommendations; `suggest_top_crs()` for a quick best match --- ## Choosing a coordinate reference system ```r library(crsuggest) fl_crs <- suggest_crs(fl_counties) glimpse(fl_crs) ``` ``` ## Rows: 10 ## Columns: 7 ## $ area_code <chr> "1379", "1379", "1379", "1379", "2188", "2188", "2188", "21… ## $ crs_type <chr> "projected", "projected", "projected", "projected", "projec… ## $ crs_code <chr> "3086", "3087", "3513", "6439", "2237", "2778", "2882", "35… ## $ crs_name <chr> "NAD83 / Florida GDL Albers", "NAD83(HARN) / Florida GDL Al… ## $ crs_gcs <chr> "4269", "4152", "4759", "6318", "4269", "4152", "4152", "47… ## $ crs_units <chr> "m", "m", "m", "m", "us-ft", "m", "us-ft", "m", "us-ft", "m" ## $ crs_proj4 <chr> "+proj=aea +lat_0=24 +lon_0=-84 +lat_1=24 +lat_2=31.5 +x_0=… ``` --- ## CRS transformations * Use the `st_transform()` function in the sf package to perform coordinate reference system transformations with the appropriate EPSG code ```r fl_projected <- st_transform(fl_counties, crs = 3086) head(fl_projected) ``` ``` ## Simple feature collection with 6 features and 9 fields ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: 83061.43 ymin: 130374.3 xmax: 788415.3 ymax: 780618.8 ## projected CRS: NAD83 / Florida GDL Albers ## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND ## 24 12 075 00295724 0500000US12075 12075 Levy 06 2896183010 ## 96 12 086 00295755 0500000US12086 12086 Miami-Dade 06 4920565755 ## 97 12 073 00306916 0500000US12073 12073 Leon 06 1727237331 ## 98 12 057 00295757 0500000US12057 12057 Hillsborough 06 2646772012 ## 99 12 083 00306922 0500000US12083 12083 Marion 06 4113978836 ## 100 12 113 00306914 0500000US12113 12113 Santa Rosa 06 2622050628 ## AWATER geometry ## 24 762935040 MULTIPOLYGON (((493790.2 56... ## 96 1376144237 MULTIPOLYGON (((783767.2 17... ## 97 90397079 MULTIPOLYGON (((331309.3 70... ## 98 631505816 MULTIPOLYGON (((555139.1 42... ## 99 192297049 MULTIPOLYGON (((542215.2 56... ## 100 418020790 MULTIPOLYGON (((83267.92 76... ``` --- ## Part 1 exercises * Give tigris a try for yourselves! [Explore the available geographies in the tigris documentation](https://github.com/walkerke/tigris) and fetch data for a state and/or county of your choosing. Plot the result with `plot()` or with `mapview()`. --- class: middle, center, inverse ## Part 2: Mapping US Census data in R --- ## Typical Census GIS workflows Traditionally, getting "spatial" Census data requires: -- * Fetching shapefiles from the Census website; -- * Downloading a CSV of data, cleaning/formatting it; -- * Loading geometries and data into your GIS of choice; -- * Aligning key fields in your GIS and joining your data --- ## Geometry in tidycensus * tidycensus takes care of this entire process with the argument `geometry = TRUE` ```r library(tidycensus) options(tigris_use_cache = TRUE) dc_income <- get_acs(geography = "tract", variables = c(hhincome = "B19013_001"), state = "DC", geometry = TRUE) ``` --- ```r dc_income ``` ``` ## Simple feature collection with 179 features and 5 fields ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: -77.11976 ymin: 38.79164 xmax: -76.9094 ymax: 38.99511 ## geographic CRS: NAD83 ## First 10 features: ## GEOID NAME ## 1 11001009509 Census Tract 95.09, District of Columbia, District of Columbia ## 2 11001010100 Census Tract 101, District of Columbia, District of Columbia ## 3 11001008301 Census Tract 83.01, District of Columbia, District of Columbia ## 4 11001002101 Census Tract 21.01, District of Columbia, District of Columbia ## 5 11001004100 Census Tract 41, District of Columbia, District of Columbia ## 6 11001008001 Census Tract 80.01, District of Columbia, District of Columbia ## 7 11001002900 Census Tract 29, District of Columbia, District of Columbia ## 8 11001005600 Census Tract 56, District of Columbia, District of Columbia ## 9 11001007605 Census Tract 76.05, District of Columbia, District of Columbia ## 10 11001005900 Census Tract 59, District of Columbia, District of Columbia ## variable estimate moe geometry ## 1 hhincome 75515 19621 POLYGON ((-77.00201 38.9510... ## 2 hhincome 94861 16089 POLYGON ((-77.03653 38.9056... ## 3 hhincome 138487 30838 POLYGON ((-77.00352 38.9000... ## 4 hhincome 67984 11327 POLYGON ((-77.02803 38.9610... ## 5 hhincome 156625 27218 POLYGON ((-77.05832 38.9177... ## 6 hhincome 154423 28910 POLYGON ((-76.99025 38.8973... ## 7 hhincome 116875 20769 POLYGON ((-77.03273 38.9341... ## 8 hhincome 79357 16304 POLYGON ((-77.05779 38.9025... ## 9 hhincome 38659 7543 POLYGON ((-76.98436 38.8666... ## 10 hhincome 98750 21107 POLYGON ((-77.01901 38.8946... ``` --- ## Basic mapping with base plotting ```r plot(dc_income["estimate"]) ``` ![](index_files/figure-html/plot-geometry-1.png)<!-- --> --- ## Basic mapping with ggplot2 * `geom_sf()`: ggplot2 method to use simple features in your data visualization workflows ```r library(tidyverse) dc_map <- ggplot(dc_income, aes(fill = estimate)) + geom_sf() ``` --- ```r dc_map ``` ![](index_files/figure-html/plot-geom-sf-1.png)<!-- --> --- class: middle, center, inverse ## Mapping Census data with tmap --- ## The tmap package .pull-left[ <img src="https://user-images.githubusercontent.com/2444081/106523217-12c12880-64e1-11eb-8d55-01442d535400.png" style="width: 350px"> ] .pull-right[ * Comprehensive package for thematic mapping in R * ggplot2-like syntax, but designed in a way to feel friendly to GIS cartographers coming to R for mapping ] --- ## Example data * Our example: comparing the distributions of racial and ethnic groups in Hennepin County, Minnesota (Minneapolis) ```r hennepin_race <- get_acs( geography = "tract", state = "MN", county = "Hennepin", variables = c(White = "B03002_003", Black = "B03002_004", Native = "B03002_005", Asian = "B03002_006", Hispanic = "B03002_012"), summary_var = "B03002_001", geometry = TRUE ) %>% mutate(percent = 100 * (estimate / summary_est)) ``` --- ```r glimpse(hennepin_race) ``` ``` ## Rows: 1,495 ## Columns: 9 ## $ GEOID <chr> "27053103900", "27053103900", "27053103900", "27053103900… ## $ NAME <chr> "Census Tract 1039, Hennepin County, Minnesota", "Census … ## $ variable <chr> "White", "Black", "Native", "Asian", "Hispanic", "White",… ## $ estimate <dbl> 2940, 372, 10, 504, 214, 4951, 175, 77, 24, 250, 4757, 75… ## $ moe <dbl> 421, 276, 15, 153, 76, 402, 277, 85, 30, 175, 327, 57, 34… ## $ summary_est <dbl> 4221, 4221, 4221, 4221, 4221, 5753, 5753, 5753, 5753, 575… ## $ summary_moe <dbl> 447, 447, 447, 447, 447, 316, 316, 316, 316, 316, 286, 28… ## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-93.23693 4..., MULTIPOLYGON… ## $ percent <dbl> 69.6517413, 8.8130775, 0.2369107, 11.9402985, 5.0698887, … ``` --- ## Basic plotting with tmap .pull-left[ * `tm_shape()` initializes the shape; `tm_polygons()` shows the polygons for quick display ```r library(tmap) hennepin_black <- filter(hennepin_race, variable == "Black") tm_shape(hennepin_black) + tm_polygons() ``` ] .pull-right[ ![](index_files/figure-html/polygons-map-1.png)<!-- --> ] --- ## Choropleth mapping with tmap .pull-left[ * _Choropleth maps_ show statistical variation through color or shading of areas * They generally should be used with _normalized data_ such as rates or percentages, not counts ```r tm_shape(hennepin_black) + tm_polygons(col = "percent") ``` ] .pull-right[ ![](index_files/figure-html/choropleth-show-1.png)<!-- --> ] --- ## Modifying choropleth options * Color palettes can be modified with the `palette` parameter, which accepts ColorBrewer and viridis palettes * If you've mapped with GIS software before, the `style` parameter implements various breaks methods, including `"equal"`, `"quantile"` and `"jenks"` --- .pull-left[ ```r tm_shape(hennepin_black, projection = sf::st_crs(26915)) + tm_polygons(col = "percent", style = "quantile", n = 7, palette = "Purples", title = "ACS estimate") + tm_layout(title = "Percent Black\nby Census tract", frame = FALSE, legend.outside = TRUE) ``` ] .pull-right[ ![](index_files/figure-html/custom-choropleth-show-1.png)<!-- --> ] --- ## tmap choropleth tips and tricks * Use `tmaptools::palette_explorer()` to interactively browse color options * Use the option `tm_layout(legend.hist = TRUE)` to display the distribution of data values among classes --- .pull-left[ ```r tm_shape(hennepin_black, projection = sf::st_crs(26915)) + tm_polygons(col = "percent", style = "jenks", n = 7, palette = "viridis", title = "ACS estimate", legend.hist = TRUE) + tm_layout(title = "Percent Black population\nby Census tract", frame = FALSE, legend.outside = TRUE) ``` ] .pull-right[ ![](index_files/figure-html/jenks-show-1.png)<!-- --> ] --- ## Graduated symbol maps * Graduated symbols: using _size_ of a symbol to represent statistical variation on a map * Implemented in tmap with `tm_bubbles()` ```r symbol_map <- tm_shape(hennepin_black) + tm_polygons() + tm_bubbles(size = "estimate", alpha = 0.5, col = "navy") ``` --- ```r symbol_map ``` ![](index_files/figure-html/bubbles-map-1.png)<!-- --> --- ## Faceted mapping * `tm_facets()` allows for comparative small multiples maps. It works well with long-form spatial data returned by tidycensus ```r facet_map <- tm_shape(hennepin_race, projection = sf::st_crs(26915)) + tm_facets(by = "variable", scale.factor = 4) + tm_fill(col = "percent", style = "quantile", n = 7, palette = "Blues") ``` --- ```r facet_map ``` ![](index_files/figure-html/facet-map-1.png)<!-- --> --- ## Advanced example: dot-density mapping .pull-left[ * Dot-density maps scatter dots relative to data values, and are good for showing within-polygon diversity * To generate points for dot-density mapping, points should be proportionally sampled in polygons relative to data values then randomized with `slice_sample()` ] .pull-right[ ```r groups <- unique(hennepin_race$variable) hennepin_dots <- map_df(groups, ~{ hennepin_race %>% filter(variable == .x) %>% st_transform(26915) %>% mutate(est50 = as.integer(estimate / 50)) %>% st_sample(size = .$est50, exact = TRUE) %>% st_sf() %>% mutate(group = .x) }) %>% slice_sample(prop = 1) ``` ] --- ```r tm_shape(hennepin_dots) + tm_dots(col = "group", palette = "Set1", size = 0.005) ``` <img src=img/hennepin_dots.png style="width: 600px"> --- ## Mapping the entire US with `shift_geo` * The `shift_geo` parameter uses Bob Rudis's [albersusa geometries](https://github.com/hrbrmstr/albersusa) to shift and re-scale Alaska and Hawaii for thematic mapping ```r us_median_age <- get_acs(geography = "state", variables = "B01002_001", year = 2019, survey = "acs1", geometry = TRUE, shift_geo = TRUE) ``` --- ```r tm_shape(us_median_age) + tm_polygons() ``` ![](index_files/figure-html/show-shift-geo-1.png)<!-- --> --- ```r tm_shape(us_median_age) + tm_polygons(col = "estimate", palette = "RdPu", title = "Median age") + tm_layout(legend.outside = TRUE) ``` ![](index_files/figure-html/style-shift-geo-1.png)<!-- --> --- ## Interactive mapping * For quick interactive mapping, use the `zcol` parameter in `mapview()` or `tmap_mode("view")` in tmap * For customizable interactive mapping, check out the leaflet package and its integration with the Shiny framework for data dashboards ```r library(mapview) mapview(dc_income, zcol = "estimate") ``` --- ## What if I still want to use a GIS? * No problem! Write out your data to a shapefile/GeoJSON/GeoPackage with `sf::st_write()` and load into your GIS of choice * Recommendation: use `output = "wide"` for multi-variable datasets (easier to use in a desktop GIS) ```r library(sf) st_write(dc_income, "data/dc_income.shp") ``` --- ## Part 2 exercises Try making your own map with tmap! * If you are just getting started with tidycensus/the tidyverse, make a race/ethnicity map by adapting the code provided in this section but for a different county. * If you are comfortable with tidycensus at this stage, pick a different variable to map instead! --- class: middle, center, inverse ## Spatial analysis of US Census data with the sf package --- ## Spatial analysis in a GIS .pull-left[ Some typical terminology for spatial analysis in a desktop GIS context: * Select by Attributes * Select by Location * Spatial Join * Geoprocessing (Clip, Dissolve, Union, etc.) ] .pull-right[ <img src="img/co_dissolve.png" style="width: 500px"> ] --- ## sf: a tidy data model for spatial analysis * sf's data model is designed for integration within tidyverse data wrangling workflows * Many tidyverse methods (including those presented last week) work on sf objects * Geometries are "sticky" and not lost after performing data wrangling operations * The `st_*()` family of functions help you complete many common GIS tasks --- ## Topic: COVID-19 vaccinations in Texas * Of interest: how do COVID-19 vaccination rates vary geographically in Texas? * The most granular geography available is the zip code, which we approximate with the Zip Code Tabulation Area (ZCTA) * Data should be normalized by population given wide variations in ZCTA populations --- ## Data setup * [Up-to-date vaccination data available from the Texas Department of State Health Services](https://dshs.texas.gov/coronavirus/AdditionalData.aspx); I've cleaned up the Excel spreadsheet for us ```r library(tidyverse) tx_vaccinations <- read_rds("spatial-analysis/data/tx_vaccinations.rds") glimpse(tx_vaccinations) ``` ``` ## Rows: 2,487 ## Columns: 4 ## $ zip_code <chr> "73301", "73344", "73960", "75001", "75002", "7500… ## $ total_vaccinations <dbl> 102, NA, 13, 3526, 18830, 9806, 13794, 4103, 8983,… ## $ first_dose <dbl> 55, NA, 7, 2328, 11693, 6487, 9163, 2545, 5774, 64… ## $ fully_vaccinated <dbl> 47, NA, NA, 1202, 7155, 3333, 4657, 1561, 3221, 33… ``` --- ## Data setup .pull-left[ * Given that vaccines are only administered to patients age 16 and up, we can acquire appropriate normalization data from the ACS Data Profile ```r library(tidycensus) options(tigris_use_cache = TRUE) pop16up <- get_acs( geography = "zcta", variables = "DP03_0001", geometry = TRUE, state = "TX" ) ``` ] .pull-right[ ```r library(tmap) tm_shape(pop16up) + tm_fill(col = "estimate") ``` ![](index_files/figure-html/map-normalization-pop-1.png)<!-- --> ] --- ## "Joining" geometries and attributes * dplyr's `*_join()` family of functions help analysts merge tabular data to sf objects ```r tx_vacc_rate <- pop16up %>% left_join(tx_vaccinations, by = c("GEOID" = "zip_code")) %>% mutate(pct_vaccinated = 100 * (total_vaccinations / estimate)) %>% mutate(pct_vaccinated = ifelse(pct_vaccinated > 100, NA, pct_vaccinated)) ``` --- ```r tm_shape(tx_vacc_rate) + tm_fill(col = "pct_vaccinated", palette = "Reds") ``` ![](index_files/figure-html/map-rates-1.png)<!-- --> --- ## Selecting data by attributes * dplyr's `filter()` works on spatial data and allows for targeted visualization by attribute values ```r dallas_area <- tx_vacc_rate %>% filter(str_sub(GEOID, 1, 2) == "75") above_40 <- dallas_area %>% filter(pct_vaccinated >= 40) ``` --- ```r tm_shape(dallas_area) + tm_polygons() + tm_shape(above_40) + tm_polygons(col = "navy") ``` ![](index_files/figure-html/above-40-1.png)<!-- --> --- class: middle, center, inverse ## Spatial overlay --- ## Spatial overlay * Common question: how do local characteristics vary within and between metropolitan areas? * Methods: _spatial filtering_ ("select by location" in a GIS context) and _spatial joins_ * tigris geometries work well for a wide range of Census overlays ```r dfw_metro <- core_based_statistical_areas(cb = TRUE) %>% filter(str_detect(NAME, "Dallas")) ``` --- ## Spatial filters * The sf package implements spatial filters using both base R notation and the tidyverse-style `st_filter()` function For example, ```r library(sf) dfw_zips <- tx_vacc_rate[dfw_metro, ] ``` is equivalent to: ```r dfw_zips <- st_filter(tx_vacc_rate, dfw_metro) ``` --- ```r tm_shape(dfw_zips) + tm_polygons(col = "pct_vaccinated", palette = "Reds") + tm_layout(legend.outside = TRUE) ``` ![](index_files/figure-html/plot-filter-1.png)<!-- --> --- ## Spatial predicates * The default _spatial predicate_, `st_intersects()`, defines the spatial relationship as any overlap between the two layers * Many other spatial predicates are available ([see the sf documentation for more examples](https://r-spatial.github.io/sf/reference/geos_binary_pred.html)) * Example: ZCTAs that fall entirely within the Dallas-Fort Worth metro boundary (do not cross the border) ```r dfw_zips_within <- st_filter(tx_vacc_rate, dfw_metro, .predicate = st_within) ``` --- ```r tm_shape(dfw_zips_within) + tm_polygons(col = "pct_vaccinated", palette = "Reds") + tm_layout(legend.outside = TRUE) ``` ![](index_files/figure-html/plot-within-1.png)<!-- --> --- ## Spatial joins .pull-left[ * Spatial joins transfer attributes from one spatial dataset to another based on shared spatial relationships * Like spatial filters, spatial joins rely on spatial predicates to determine spatial relationships between features * Implemented in the sf package with `st_join()` ] .pull-right[ ```r tx_metros <- core_based_statistical_areas(cb = TRUE) %>% filter(GEOID %in% c("19100", "26420", "41700", "12420")) %>% select(metro_name = NAME) zips_by_metro <- tx_vacc_rate %>% st_join(tx_metros, left = FALSE) ``` ] --- ```r glimpse(zips_by_metro) ``` ``` ## Rows: 827 ## Columns: 11 ## $ GEOID <chr> "77414", "75287", "78621", "76201", "78150", "7861… ## $ NAME <chr> "ZCTA5 77414", "ZCTA5 75287", "ZCTA5 78621", "ZCTA… ## $ variable <chr> "DP03_0001", "DP03_0001", "DP03_0001", "DP03_0001"… ## $ estimate <dbl> 18595, 45980, 18275, 27079, 37, 9643, 1121, 12809,… ## $ moe <dbl> 626, 883, 1288, 1116, 24, 1398, 271, 649, 973, 406… ## $ total_vaccinations <dbl> 3531, 9042, 4069, 4429, NA, 1953, 60, 4447, 9240, … ## $ first_dose <dbl> 2433, 5745, 2984, 2860, NA, 1444, 40, 2970, 5830, … ## $ fully_vaccinated <dbl> 1099, 3307, 1088, 1575, NA, 510, 20, 1486, 3416, 4… ## $ pct_vaccinated <dbl> 18.988976, 19.665072, 22.265390, 16.355848, NA, 20… ## $ metro_name <chr> "Houston-The Woodlands-Sugar Land, TX", "Dallas-Fo… ## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-96.07715 2..., MULTI… ``` --- ## Comparative visualization ```r ggplot(zips_by_metro) + geom_density(aes(x = pct_vaccinated), color = "navy", fill = "navy", alpha = 0.4) + theme_minimal() + facet_wrap(~metro_name) + labs(title = "Percent of population age 16+ vaccinated for COVID-19", subtitle = "ZCTAs, largest metropolitan areas in Texas", y = "Kernel density estimate", x = "Percent receiving at least one vaccine dose") ``` --- <img src=img/compare_by_metro.png style="width: 800px"> --- class: middle, center, inverse ## Advanced application: spatial clustering and group-wise spatial data analysis --- ## The spdep package .pull-left[ * spdep: the workhorse R package for both exploratory spatial data analysis and spatial modeling * Key concept: _spatial neighbors_ * spdep helps model spatial relationships for clustering analysis, spatial lags & smoothing, and spatially-aware modeling ] .pull-right[ <img src=img/neighbors.png style="width: 450px"> ] --- ## Data setup * Research question: can we identify contiguous "regions" in Wayne County, Michigan by age, income, and education? ```r input_tracts <- get_acs( geography = "tract", variables = c(median_age = "B01002_001", median_income = "B19013_001", pct_college = "DP02_0068P"), state = "MI", county = "Wayne", geometry = TRUE, output = "wide" ) %>% select(-ends_with("M")) %>% na.omit() ``` --- ## The SKATER algorithm .pull-left[ * General concept: unsupervised clustering with spatial constraints * SKATER: Spatial "Kluster" Analysis by Tree Edge Removal * Uses minimum spanning trees to minimize within-group dissimilarity while ensuring that groups are spatially contiguous ] .pull-right[ ```r library(spdep) input_vars <- input_tracts %>% select(median_ageE:pct_collegeE) %>% st_drop_geometry() %>% scale() %>% as.data.frame() nb <- poly2nb(input_tracts) costs <- nbcosts(nb, input_vars) weights <- nb2listw(nb, costs, style = "B") ``` ] --- ## Regionalization with SKATER ```r mst <- mstree(weights) regions <- skater(mst[,1:2], input_vars, ncuts = 6) input_tracts$group <- as.character(regions$group) ``` --- ```r tm_shape(input_tracts) + tm_polygons("group", palette = "Set1") ``` ![](index_files/figure-html/plot-groups-1.png)<!-- --> --- ## Group-wise operations on spatial data * dplyr's `group_by()` and `summarize()`, used together on a spatial object, implements an analogous operation to a _Dissolve_ in a GIS ```r wayne_regions <- input_tracts %>% group_by(group) %>% summarize(across(.cols = where(is.numeric), .fns = mean)) ``` --- ```r tm_shape(wayne_regions) + tm_polygons(col = "group", palette = "Set1") ``` ![](index_files/figure-html/wayne-regions-map-1.png)<!-- --> --- ```r wayne_regions ``` ``` ## Simple feature collection with 7 features and 4 fields ## geometry type: GEOMETRY ## dimension: XY ## bbox: xmin: -83.55191 ymin: 42.03334 xmax: -82.87023 ymax: 42.45101 ## geographic CRS: NAD83 ## # A tibble: 7 x 5 ## group median_ageE median_incomeE pct_collegeE geometry ## <chr> <dbl> <dbl> <dbl> <GEOMETRY [°]> ## 1 1 35.9 30436. 12.7 POLYGON ((-83.14968 42.24173, -… ## 2 2 40.5 57190. 19.7 MULTIPOLYGON (((-83.19159 42.03… ## 3 3 46.2 127539. 67.7 POLYGON ((-82.91176 42.3793, -8… ## 4 4 40.2 61136. 28.6 MULTIPOLYGON (((-83.40908 42.33… ## 5 5 33.2 47426. 57.1 MULTIPOLYGON (((-83.06632 42.32… ## 6 6 42.1 102535. 55.8 POLYGON ((-83.44288 42.3086, -8… ## 7 7 48.8 102002. 50.8 POLYGON ((-83.47005 42.39107, -… ``` --- ## Part 3 exercises Try putting together a spatial analysis workflow yourself! Acquire data on median household income by Census tract for Texas: ```r texas_income <- get_acs( geography = "tract", variables = "B19013_001", state = "TX", geometry = TRUE ) ``` Answer the following: * How does median household income vary geographically in the Austin metropolitan area? Show this on a map. * How does the distribution of median household incomes by Census tract vary among the four largest metro areas in Texas? --- class: middle, center, inverse ## Thank you!