If requested, tidycensus can return simple feature
geometry for geographic units along with variables from the decennial US
Census or American Community survey. By setting
geometry = TRUE
in a tidycensus function
call, tidycensus will use the tigris
package to retrieve the corresponding geographic dataset from the US
Census Bureau and pre-merge it with the tabular data obtained from the
Census API.
The following example shows median household income from the 2016-2020 ACS for Census tracts in Orange County, California:
library(tidycensus)
library(tidyverse)
options(tigris_use_cache = TRUE)
orange <- get_acs(
state = "CA",
county = "Orange",
geography = "tract",
variables = "B19013_001",
geometry = TRUE,
year = 2020
)
head(orange)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -118.0369 ymin: 33.69354 xmax: -117.7822 ymax: 33.85749
## Geodetic CRS: NAD83
## GEOID NAME variable
## 1 06059086701 Census Tract 867.01, Orange County, California B19013_001
## 2 06059075901 Census Tract 759.01, Orange County, California B19013_001
## 3 06059075303 Census Tract 753.03, Orange County, California B19013_001
## 4 06059052527 Census Tract 525.27, Orange County, California B19013_001
## 5 06059110109 Census Tract 1101.09, Orange County, California B19013_001
## 6 06059087106 Census Tract 871.06, Orange County, California B19013_001
## estimate moe geometry
## 1 86922 11391 MULTIPOLYGON (((-117.9762 3...
## 2 78846 10972 MULTIPOLYGON (((-117.8618 3...
## 3 123654 21900 MULTIPOLYGON (((-117.8824 3...
## 4 135097 10971 MULTIPOLYGON (((-117.8035 3...
## 5 107463 12665 MULTIPOLYGON (((-118.0369 3...
## 6 45327 8700 MULTIPOLYGON (((-117.9414 3...
Our object orange
looks much like the basic
tidycensus output, but with a geometry
list-column describing the geometry of each feature, using the
geographic coordinate system NAD 1983 (EPSG: 4269) which is the default
for Census shapefiles. tidycensus uses the Census cartographic
boundary shapefiles for faster processing; if you prefer the
TIGER/Line shapefiles, set cb = FALSE
in the function
call.
As the dataset is in a tidy format, it can be quickly visualized with
the geom_sf
functionality currently in the development
version of ggplot2:
orange %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "magma")
Please note that the UTM Zone 11N coordinate system
(26911
) is appropriate for Southern California but may not
be for your area of interest. For help identifying an appropriate
projected coordinate system for your data, take a look at the {crsuggest} R
package.
One of the most powerful features of ggplot2 is its
support for small multiples, which works very well with the tidy data
format returned by tidycensus. Many Census and ACS
variables return counts, however, which are generally
inappropriate for choropleth mapping. In turn,
get_decennial
and get_acs
have an optional
argument, summary_var
, that can work as a multi-group
denominator when appropriate. Let’s use the following example of the
racial geography of Harris County, Texas. First, we’ll request data for
non-Hispanic whites, non-Hispanic blacks, non-Hispanic Asians, and
Hispanics by Census tract for the 2020 Census, using the PL-94171
summary file.
racevars <- c(White = "P2_005N",
Black = "P2_006N",
Asian = "P2_008N",
Hispanic = "P2_002N")
harris <- get_decennial(
geography = "tract",
variables = racevars,
state = "TX",
county = "Harris County",
geometry = TRUE,
summary_var = "P2_001N",
year = 2020,
sumfile = "pl"
)
head(harris)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -95.46502 ymin: 29.53424 xmax: -95.09005 ymax: 29.96492
## Geodetic CRS: NAD83
## # A tibble: 6 × 6
## GEOID NAME variable value summary_value geometry
## <chr> <chr> <chr> <dbl> <dbl> <MULTIPOLYGON [°]>
## 1 48201341203 Census Tra… White 1503 2355 (((-95.10641 29.54594, -…
## 2 48201341203 Census Tra… Black 177 2355 (((-95.10641 29.54594, -…
## 3 48201341203 Census Tra… Asian 54 2355 (((-95.10641 29.54594, -…
## 4 48201341203 Census Tra… Hispanic 492 2355 (((-95.10641 29.54594, -…
## 5 48201550601 Census Tra… White 265 6673 (((-95.46502 29.96456, -…
## 6 48201550601 Census Tra… Black 2156 6673 (((-95.46502 29.96456, -…
We notice that there are four entries for each Census tract, with
each entry representing one of our requested variables. The
summary_value
column represents the value of the summary
variable, which is total population in this instance. When a summary
variable is specified in get_acs
, both
summary_est
and summary_moe
columns will be
returned.
With this information, we can set up an analysis pipeline in which we calculate a new percent-of-total column and visualize the result for each group in a faceted plot.
harris %>%
mutate(percent = 100 * (value / summary_value)) %>%
ggplot(aes(fill = percent)) +
facet_wrap(~variable) +
geom_sf(color = NA) +
theme_void() +
scale_fill_viridis_c() +
labs(fill = "% of population\n(2020 Census)")
Geometries in tidycensus default to the Census Bureau’s cartographic boundary shapefiles. Cartographic boundary shapefiles are preferred to the core TIGER/Line shapefiles in tidycensus as their smaller size speeds up processing and because they are pre-clipped to the US coastline.
However, there may be circumstances in which your mapping requires more detail. A good example of this would be maps of New York City, in which even the cartographic boundary shapefiles include water area. For example, take this example of median household income by Census tract in Manhattan (New York County), NY:
library(tidycensus)
library(tidyverse)
options(tigris_use_cache = TRUE)
ny <- get_acs(geography = "tract",
variables = "B19013_001",
state = "NY",
county = "New York",
year = 2020,
geometry = TRUE)
ggplot(ny, aes(fill = estimate)) +
geom_sf() +
theme_void() +
scale_fill_viridis_c(labels = scales::dollar)
As illustrated in the graphic, the boundaries of Manhattan include
water boundaries - stretching into the Hudson and East Rivers. In turn,
a more accurate representation of Manhattan’s land area might be
desired. To accomplish this, a tidycensus user can use the core
TIGER/Line shapefiles instead with the argument cb = FALSE
,
then erase water area from Manhattan’s geometry. The
erase_water()
function in the tigris R
package will automatically remove proximate water areas from Census
polygons, improving cartographic display. The
area_threshold
argument determines the percentile ranking
of the water areas by size in the data’s proximity to retain; the
default, 0.75, will keep the largest 25 percent of areas. Data should be
first transformed to a projected coordinate reference system to improve
performance.
library(tigris)
library(sf)
ny_erase <- get_acs(
geography = "tract",
variables = "B19013_001",
state = "NY",
county = "New York",
year = 2020,
geometry = TRUE,
cb = FALSE
) %>%
st_transform(26918) %>%
erase_water(year = 2020)
ggplot(ny_erase, aes(fill = estimate)) +
geom_sf() +
theme_void() +
scale_fill_viridis_c(labels = scales::dollar)
The map appears as before, but instead the polygons now hug the
shoreline of Manhattan. Setting the same year
in
erase_water()
as your input data is recommended to avoid
sliver polygons, which are small polygons that can appear as a
result of misaligned overlay operations.
Beyond this, you might be interested in writing your dataset to a
shapefile or GeoJSON for use in external GIS or visualization
applications. You can accomplish this with the st_write
function in the sf package:
Your tidycensus-obtained dataset can now be used in ArcGIS, QGIS, Tableau, or any other application that reads shapefiles.
There is a lot more you can do with the spatial functionality in tidycensus, including more sophisticated visualization and spatial analysis; look for updates on my blog and in this space.