The main intent of the tidycensus package is to return population characteristics of the United States in tidy format allowing for integration with simple feature geometries. Its intent is not, and has never been, to wrap the universe of APIs and datasets available from the US Census Bureau. For datasets not included in tidycensus, I recommend Hannah Recht’s censusapi package (https://github.com/hrecht/censusapi), which allows R users to access all Census APIs, and packages such as Jamaal Green’s lehdr package (https://github.com/jamgreen/lehdr) which grants R users access to Census Bureau LODES data.
However, tidycensus incorporates a select number of Census Bureau datasets outside the decennial Census and ACS that are aligned with the basic goals of the package. One such dataset is the Population Estimates API, which includes information on a wide variety of population characteristics that is updated annually. Also available through tidycensus is the Migration Flows API, which estimates the number of people that have moved between pairs of places in a given year.
Population estimates are available in tidycensus through the get_estimates()
function. Estimates are organized into products, which in tidycensus include "population"
, "components"
, "housing"
, and "characteristics"
. The population and housing products contain population/density and housing unit estimates, respectively. The components of change and characteristics products, in contrast, include a wider range of possible variables.
By default, specifying "population"
, "components"
, or "housing"
as the product
in get_estimates()
returns all variables associated with that component. For example, we can request all components of change variables for US states in 2019:
library(tidycensus)
library(tidyverse)
options(tigris_use_cache = TRUE)
us_components <- get_estimates(geography = "state", product = "components")
us_components
## # A tibble: 624 x 4
## NAME GEOID variable value
## <chr> <chr> <chr> <dbl>
## 1 Mississippi 28 BIRTHS 35978
## 2 Missouri 29 BIRTHS 71297
## 3 Montana 30 BIRTHS 11618
## 4 Nebraska 31 BIRTHS 25343
## 5 Nevada 32 BIRTHS 35932
## 6 New Hampshire 33 BIRTHS 12004
## 7 New Jersey 34 BIRTHS 99501
## 8 New Mexico 35 BIRTHS 23125
## 9 New York 36 BIRTHS 222924
## 10 North Carolina 37 BIRTHS 119203
## # … with 614 more rows
The variables included in the components of change product consist of both estimates of counts and rates. Rates are preceded by an R
in the variable name and are calculated per 1000 residents.
unique(us_components$variable)
## [1] "BIRTHS" "DEATHS" "DOMESTICMIG"
## [4] "INTERNATIONALMIG" "NATURALINC" "NETMIG"
## [7] "RBIRTH" "RDEATH" "RDOMESTICMIG"
## [10] "RINTERNATIONALMIG" "RNATURALINC" "RNETMIG"
Available geographies include "us"
, "state"
, "county"
, "metropolitan statistical area/micropolitan statistical area"
, and "combined statistical area"
.
If desired, users can request a specific component or components by supplying a character vector to the variables
parameter, as in other tidycensus functions. get_estimates()
also supports simple feature geometry integration to facilitate mapping. In the example below, we’ll acquire data on the net migration rate between 2018 and 2019 for all counties in the United States, and request shifted and re-scaled feature geometry for Alaska and Hawaii to facilitate national mapping.
net_migration <- get_estimates(geography = "county",
variables = "RNETMIG",
geometry = TRUE,
shift_geo = TRUE)
net_migration
## Simple feature collection with 3142 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -2100000 ymin: -2500000 xmax: 2516374 ymax: 732103.3
## CRS: +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
## # A tibble: 3,142 x 5
## GEOID NAME variable value geometry
## <chr> <chr> <chr> <dbl> <MULTIPOLYGON [m]>
## 1 01001 Autauga Cou… RNETMIG 4.56 (((1269841 -1303980, 1248372 -1300830, 124…
## 2 01009 Blount Coun… RNETMIG 1.12 (((1240383 -1149119, 1222632 -1143475, 120…
## 3 01017 Chambers Co… RNETMIG -7.75 (((1382944 -1225846, 1390214 -1235634, 138…
## 4 01021 Chilton Cou… RNETMIG 4.94 (((1257515 -1230045, 1259055 -1240041, 126…
## 5 01033 Colbert Cou… RNETMIG 5.97 (((1085910 -1080751, 1085892 -1080071, 108…
## 6 01045 Dale County… RNETMIG -1.44 (((1382203 -1366760, 1387076 -1400145, 138…
## 7 01051 Elmore Coun… RNETMIG 1.13 (((1278144 -1255151, 1279961 -1256403, 130…
## 8 01065 Hale County… RNETMIG -7.21 (((1176099 -1258997, 1172005 -1264523, 117…
## 9 01079 Lawrence Co… RNETMIG 2.00 (((1178216 -1055420, 1179636 -1066254, 118…
## 10 01083 Limestone C… RNETMIG 26.4 (((1197770 -1018013, 1199180 -1017791, 120…
## # … with 3,132 more rows
We’ll next use tidyverse tools to generate a groups
column that bins the net migration rates into comprehensible categories, and plot the result using geom_sf()
and ggplot2.
order = c("-15 and below", "-15 to -5", "-5 to +5", "+5 to +15", "+15 and up")
net_migration <- net_migration %>%
mutate(groups = case_when(
value > 15 ~ "+15 and up",
value > 5 ~ "+5 to +15",
value > -5 ~ "-5 to +5",
value > -15 ~ "-15 to -5",
TRUE ~ "-15 and below"
)) %>%
mutate(groups = factor(groups, levels = order))
ggplot() +
geom_sf(data = net_migration, aes(fill = groups, color = groups), lwd = 0.1) +
geom_sf(data = tidycensus::state_laea, fill = NA, color = "black", lwd = 0.1) +
scale_fill_brewer(palette = "PuOr", direction = -1) +
scale_color_brewer(palette = "PuOr", direction = -1, guide = FALSE) +
coord_sf(datum = NA) +
theme_minimal(base_family = "Roboto") +
labs(title = "Net migration per 1000 residents by county",
subtitle = "US Census Bureau 2019 Population Estimates",
fill = "Rate",
caption = "Data acquired with the R tidycensus package | @kyle_e_walker")
The fourth population estimates product available in get_estimates()
, "characteristics"
, is formatted differently than the other three. It returns population estimates broken down by categories of AGEGROUP
, SEX
, RACE
, and HISP
, for Hispanic origin. Requested breakdowns should be specified as a character vector supplied to the breakdown
parameter when the product
is set to "characteristics"
.
By default, the returned categories are formatted as integers that map onto the Census Bureau definitions explained here: https://www.census.gov/data/developers/data-sets/popest-popproj/popest/popest-vars/2017.html. However, by specifying breakdown_labels = TRUE
, the function will return the appropriate labels instead. For example:
la_age_hisp <- get_estimates(geography = "county",
product = "characteristics",
breakdown = c("SEX", "AGEGROUP", "HISP"),
breakdown_labels = TRUE,
state = "CA",
county = "Los Angeles")
la_age_hisp
## # A tibble: 210 x 6
## GEOID NAME value SEX AGEGROUP HISP
## <chr> <chr> <dbl> <chr> <fct> <chr>
## 1 06037 Los Angeles County, C… 10039107 Both se… All ages Both Hispanic …
## 2 06037 Los Angeles County, C… 5157137 Both se… All ages Non-Hispanic
## 3 06037 Los Angeles County, C… 4881970 Both se… All ages Hispanic
## 4 06037 Los Angeles County, C… 579856 Both se… Age 0 to 4 ye… Both Hispanic …
## 5 06037 Los Angeles County, C… 236459 Both se… Age 0 to 4 ye… Non-Hispanic
## 6 06037 Los Angeles County, C… 343397 Both se… Age 0 to 4 ye… Hispanic
## 7 06037 Los Angeles County, C… 600191 Both se… Age 5 to 9 ye… Both Hispanic …
## 8 06037 Los Angeles County, C… 229438 Both se… Age 5 to 9 ye… Non-Hispanic
## 9 06037 Los Angeles County, C… 370753 Both se… Age 5 to 9 ye… Hispanic
## 10 06037 Los Angeles County, C… 373670 Both se… Age 15 to 19 … Hispanic
## # … with 200 more rows
With some additional data wrangling, the returned format facilitates analysis and visualization. For example, we can compare population pyramids for Hispanic and non-Hispanic populations in Los Angeles County:
compare <- filter(la_age_hisp, str_detect(AGEGROUP, "^Age"),
HISP != "Both Hispanic Origins",
SEX != "Both sexes") %>%
mutate(value = ifelse(SEX == "Male", -value, value))
ggplot(compare, aes(x = AGEGROUP, y = value, fill = SEX)) +
geom_bar(stat = "identity", width = 1) +
theme_minimal(base_family = "Roboto") +
scale_y_continuous(labels = function(y) paste0(abs(y / 1000), "k")) +
scale_x_discrete(labels = function(x) gsub("Age | years", "", x)) +
scale_fill_manual(values = c("darkred", "navy")) +
coord_flip() +
facet_wrap(~HISP) +
labs(x = "",
y = "2019 Census Bureau population estimate",
title = "Population structure by Hispanic origin",
subtitle = "Los Angeles County, California",
fill = "",
caption = "Data source: US Census Bureau population estimates & tidycensus R package")
The American Community Survey Migration Flows dataset estimates the number of people that have moved between pairs of places. The estimates are calculated based on where a person lived when surveyed and where they lived one year prior to being surveyed. The data is available at three geographic levels: county, county subdivision (minor civil division), and metropolitan statistical area (MSA). Because the number of movers may be small for some pairs of counties, the data is aggregated over a five-year period. The estimates for each five-year period represent the number of people that moved between places each year during that period.
The data is set up such that for each county, you can find the number of people that moved to that county from each of the other counties in the US as well as the number of people that moved from that county to each of the other counties. The net migration for each pair of counties is also provided (although this is simply the moved to minus moved from).
get_flows()
The get_flows()
function from tidycensus provides access to these estimates. The only required argument for get_flows()
is geography
, which can be set to "county"
, "county subdivision"
, or "metropolitan statistical area"
. If geography is set to "county"
and no other arguments are set, data for all pairs of counties is pulled from the Census API. This is a large data request as it will get all combinations of counties that had movers. More commonly, you might be interested in knowing the flows in and out of one county, county subdivision, or MSA. In this case, you can specify the state and county or MSA.
Here we get county-to-county flow data for Westchester County, NY:
wch_flows <- get_flows(
geography = "county",
state = "NY",
county = "Westchester",
year = 2018
)
wch_flows %>%
filter(!is.na(GEOID2)) %>%
head()
## # A tibble: 6 x 7
## GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 36119 01089 Westchester County, … Madison County, A… MOVEDIN 0 28
## 2 36119 01089 Westchester County, … Madison County, A… MOVEDOUT 26 41
## 3 36119 01089 Westchester County, … Madison County, A… MOVEDNET -26 41
## 4 36119 01095 Westchester County, … Marshall County, … MOVEDIN 0 28
## 5 36119 01095 Westchester County, … Marshall County, … MOVEDOUT 35 55
## 6 36119 01095 Westchester County, … Marshall County, … MOVEDNET -35 55
With the default setting of get_flows()
, data is returned in a “tidy” or long format. Notice that for each pair of places, there are three rows returned with one row for each variable (MOVEDIN
, MOVEDOUT
, and MOVEDNET
) and the the estimate and margin of error for these variables are in columns. GEOID1
and FULL1_NAME
are the FIPS code and name of the origin county. In this case, it will also be Westchester County since that is the only county we requested. GEOID2
and FULL2_NAME
are the FIPS code and name of the destination county.
One simple question we can answer with this data is, to which county did the most people move from Westchester?
## # A tibble: 6 x 7
## GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 36119 09001 Westchester County,… Fairfield County, … MOVEDOUT 3916 778
## 2 36119 36061 Westchester County,… New York County, N… MOVEDOUT 3328 596
## 3 36119 36005 Westchester County,… Bronx County, New … MOVEDOUT 2063 418
## 4 36119 36027 Westchester County,… Dutchess County, N… MOVEDOUT 1870 454
## 5 36119 36079 Westchester County,… Putnam County, New… MOVEDOUT 1318 324
## 6 36119 36081 Westchester County,… Queens County, New… MOVEDOUT 1082 240
The MOVEDOUT
variable only estimates the number of people that moved out of Westchester County and doesn’t account for the number of people that moved in to Westchester from each county. If you are interested in net migration (moved in - moved out), you can use the MOVEDNET
variable.
## # A tibble: 6 x 7
## GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 36119 09001 Westchester County… Fairfield County, C… MOVEDNET -1768 958
## 2 36119 36027 Westchester County… Dutchess County, Ne… MOVEDNET -1119 497
## 3 36119 06037 Westchester County… Los Angeles County,… MOVEDNET -486 339
## 4 36119 12099 Westchester County… Palm Beach County, … MOVEDNET -450 182
## 5 36119 25021 Westchester County… Norfolk County, Mas… MOVEDNET -358 351
## 6 36119 36079 Westchester County… Putnam County, New … MOVEDNET -340 407
You may have noticed that there are some destination geographies that are not other counties. For people that moved into to Westchester from outside the United States, the Migration Flows data reports the region that they moved from, such as Africa or Asia. Since this dataset is based on the American Community Survey, there is no way of knowing how many people moved out of the United States, so for all pairs of US to non-US places, the value of MOVEDOUT
and MOVEDNET
is NA
. The GEOID
of non-US places is also NA
.
## # A tibble: 6 x 7
## GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 36119 <NA> Westchester County, New York Africa MOVEDIN 419 411
## 2 36119 <NA> Westchester County, New York Africa MOVEDOUT NA NA
## 3 36119 <NA> Westchester County, New York Africa MOVEDNET NA NA
## 4 36119 <NA> Westchester County, New York Asia MOVEDIN 2267 436
## 5 36119 <NA> Westchester County, New York Asia MOVEDOUT NA NA
## 6 36119 <NA> Westchester County, New York Asia MOVEDNET NA NA
Datasets between 2006-2010 and 2011-2015 have the ability to cross flow data with selected demographic characteristics such as age, race, employment status. For instance, the following call will get the number of movers to and from the Los Angeles-Long Beach Metro Area by race.
la_flows <- get_flows(
geography = "metropolitan statistical area",
breakdown = "RACE",
breakdown_labels = TRUE,
msa = 31080, # los angeles msa fips code
year = 2015
)
# net migration between la and san francisco
la_flows %>%
filter(str_detect(FULL2_NAME, "San Fran"), variable == "MOVEDNET")
## # A tibble: 5 x 9
## GEOID1 GEOID2 FULL1_NAME FULL2_NAME RACE RACE_label variable estimate moe
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 31080 41860 Los Angele… San Franci… 00 All races MOVEDNET -2433 1585
## 2 31080 41860 Los Angele… San Franci… 01 White alo… MOVEDNET -1077 1096
## 3 31080 41860 Los Angele… San Franci… 02 Black or … MOVEDNET 98 378
## 4 31080 41860 Los Angele… San Franci… 03 Asian alo… MOVEDNET -580 778
## 5 31080 41860 Los Angele… San Franci… 04 Other rac… MOVEDNET -874 549
Note that the demographic characteristics must be specified in the breakdown
argument of get_flows()
(not the variable
argument). For each dataset there are three or four demographic characteristics to choose from. For more information and to see which characteristics are available in each year, visit the Census Migration Flows documentation.
An additional feature of get_flows()
is an option to return spatial data associated with each place. In contrast to other spatial data available through tidycensus, get_flows()
returns point rather than polygon geometry. To get geometry with the flows data, set geometry = TRUE
. The return of get_flows()
will now be an sf object with the centroids of both origin and destination as sfc_POINT
columns. The origin point feature is returned in a column named centroid1
and is the active geometry column in the sf object. The destination point feature is returned in the centroid2
column.
phx_flows <- get_flows(
geography = "metropolitan statistical area",
msa = 38060,
year = 2018,
geometry = TRUE
)
phx_flows %>%
head()
## Simple feature collection with 6 features and 7 fields
## Active geometry column: centroid1
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -112.0705 ymin: 33.18571 xmax: -112.0705 ymax: 33.18571
## Geodetic CRS: NAD83
## # A tibble: 6 x 9
## GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 38060 <NA> Phoenix-Mesa-Scott… Outside Metro Area … MOVEDIN 21602 1464
## 2 38060 <NA> Phoenix-Mesa-Scott… Outside Metro Area … MOVEDOUT 21192 1559
## 3 38060 <NA> Phoenix-Mesa-Scott… Outside Metro Area … MOVEDNET 410 2186
## 4 38060 <NA> Phoenix-Mesa-Scott… Africa MOVEDIN 1078 385
## 5 38060 <NA> Phoenix-Mesa-Scott… Africa MOVEDOUT NA NA
## 6 38060 <NA> Phoenix-Mesa-Scott… Africa MOVEDNET NA NA
## # … with 2 more variables: centroid1 <POINT [°]>, centroid2 <POINT [°]>
With the centroids attached to each pair of places, it is straightforward to map the migration flows. Here, we look at the most common origin MSAs for people moving to Phoenix-Mesa-Scottsdale, AZ. To make an interactive map of the flows, we’ll use the excellent mapdeck package. To use mapdeck, you’ll need a Mapbox account and access token.
library(mapdeck)
top_move_in <- phx_flows %>%
filter(!is.na(GEOID2), variable == "MOVEDIN") %>%
slice_max(n = 25, order_by = estimate) %>%
mutate(
width = estimate / 500,
tooltip = paste0(
scales::comma(estimate * 5, 1),
" people moved from ", str_remove(FULL2_NAME, "Metro Area"),
" to ", str_remove(FULL1_NAME, "Metro Area"), " between 2014 and 2018"
)
)
top_move_in %>%
mapdeck(style = mapdeck_style("dark"), pitch = 45) %>%
add_arc(
origin = "centroid1",
destination = "centroid2",
stroke_width = "width",
auto_highlight = TRUE,
highlight_colour = "#8c43facc",
tooltip = "tooltip"
)