The main intent of the tidycensus package is to return population characteristics of the United States in tidy format, allowing for integration with simple feature geometries. Its intent is not, and has never been, to wrap the universe of APIs and datasets available from the US Census Bureau. For datasets not included in tidycensus, I recommend Hannah Recht’s censusapi package (https://github.com/hrecht/censusapi), which allows R users to access all Census APIs, and packages such as Jamaal Green’s lehdr package (https://github.com/jamgreen/lehdr), which gives R users access to Census Bureau LODES data.

However, tidycensus will ultimately incorporate a select number of Census Bureau datasets outside the decennial Census and ACS that are aligned with the basic goals of the package. One such dataset is the Population Estimates API, which includes information on a wide variety of population characteristics that is updated annually.

Population estimates are available in tidycensus through the get_estimates() function. Estimates are organized into products, which in tidycensus include "population", "components", "housing", and "characteristics". The population and housing products contain population/density and housing unit estimates, respectively. The components of change and characteristics products, in contrast, include a wider range of possible variables.

Components of change population estimates

By default, specifying "population", "components", or "housing" as the product in get_estimates() returns all variables associated with that product. For example, we can request all components of change variables for US states in 2017:

library(tidycensus)
library(tidyverse)
options(tigris_use_cache = TRUE)

us_components <- get_estimates(geography = "state", product = "components")

us_components
## # A tibble: 624 x 4
##    NAME                 GEOID variable  value
##    <chr>                <chr> <chr>     <dbl>
##  1 Alabama              01    BIRTHS    59637
##  2 Alaska               02    BIRTHS    11335
##  3 Arizona              04    BIRTHS    86765
##  4 Arkansas             05    BIRTHS    38779
##  5 California           06    BIRTHS   500353
##  6 Colorado             08    BIRTHS    66345
##  7 Connecticut          09    BIRTHS    36319
##  8 Delaware             10    BIRTHS    11026
##  9 District of Columbia 11    BIRTHS     9652
## 10 Florida              12    BIRTHS   221755
## # … with 614 more rows

The components of change product includes both count estimates and rates. Rate variables are prefixed with an R in the variable name and are expressed per 1000 residents.

unique(us_components$variable)
##  [1] "BIRTHS"            "DEATHS"            "DOMESTICMIG"      
##  [4] "INTERNATIONALMIG"  "NATURALINC"        "NETMIG"           
##  [7] "RBIRTH"            "RDEATH"            "RDOMESTICMIG"     
## [10] "RINTERNATIONALMIG" "RNATURALINC"       "RNETMIG"
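
Because the R prefix is the only thing separating rates from counts, the two groups can be split with a simple prefix test. The sketch below uses only the variable names printed above; the object names component_vars, rate_vars, and count_vars are illustrative and are not part of tidycensus.

```r
# Variable names as printed by unique(us_components$variable) above
component_vars <- c("BIRTHS", "DEATHS", "DOMESTICMIG", "INTERNATIONALMIG",
                    "NATURALINC", "NETMIG", "RBIRTH", "RDEATH", "RDOMESTICMIG",
                    "RINTERNATIONALMIG", "RNATURALINC", "RNETMIG")

# Rates carry the "R" prefix; everything else is a count.
# This prefix test only works because no count variable in this list starts with "R".
rate_vars <- component_vars[startsWith(component_vars, "R")]
count_vars <- setdiff(component_vars, rate_vars)

rate_vars
count_vars
```

Either vector could then be used to subset the tidy output, for example with filter(us_components, variable %in% rate_vars).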

Available geographies include "us", "state", "county", "metropolitan statistical area/micropolitan statistical area", and "combined statistical area".

If desired, users can request a specific component or components by supplying a character vector to the variables parameter, as in other tidycensus functions. get_estimates() also supports simple feature geometry integration to facilitate mapping. In the example below, we’ll acquire data on the net migration rate between 2016 and 2017 for all counties in the United States, and request shifted and re-scaled feature geometry for Alaska and Hawaii to facilitate national mapping.

net_migration <- get_estimates(geography = "county",
                               variables = "RNETMIG",
                               geometry = TRUE,
                               shift_geo = TRUE)

net_migration
## Simple feature collection with 3142 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -2100000 ymin: -2500000 xmax: 2516374 ymax: 732103.3
## CRS:            +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs
## # A tibble: 3,142 x 5
##    GEOID NAME         variable  value                                   geometry
##    <chr> <chr>        <chr>     <dbl>                         <MULTIPOLYGON [m]>
##  1 01001 Autauga Cou… RNETMIG  -1.86  (((1269841 -1303980, 1248372 -1300830, 12…
##  2 01009 Blount Coun… RNETMIG  -1.29  (((1240383 -1149119, 1222632 -1143475, 12…
##  3 01017 Chambers Co… RNETMIG   1.59  (((1382944 -1225846, 1390214 -1235634, 13…
##  4 01021 Chilton Cou… RNETMIG  -1.78  (((1257515 -1230045, 1259055 -1240041, 12…
##  5 01033 Colbert Cou… RNETMIG   0.919 (((1085910 -1080751, 1085892 -1080071, 10…
##  6 01045 Dale County… RNETMIG  -3.48  (((1382203 -1366760, 1387076 -1400145, 13…
##  7 01051 Elmore Coun… RNETMIG   2.51  (((1278144 -1255151, 1279961 -1256403, 13…
##  8 01065 Hale County… RNETMIG  -4.12  (((1176099 -1258997, 1172005 -1264523, 11…
##  9 01079 Lawrence Co… RNETMIG  -8.08  (((1178216 -1055420, 1179636 -1066254, 11…
## 10 01083 Limestone C… RNETMIG   7.69  (((1197770 -1018013, 1199180 -1017791, 12…
## # … with 3,132 more rows

We’ll next use tidyverse tools to generate a groups column that bins the net migration rates into comprehensible categories, and plot the result using geom_sf() and ggplot2.

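# Bin labels, listed in the order they should appear in the legend
group_levels <- c("-15 and below", "-15 to -5", "-5 to +5", "+5 to +15", "+15 and up")

net_migration <- net_migration %>%
  # case_when() checks conditions top to bottom, so each rate lands in the first bin it exceeds
  mutate(groups = case_when(
    value > 15 ~ "+15 and up",
    value > 5 ~ "+5 to +15",
    value > -5 ~ "-5 to +5",
    value > -15 ~ "-15 to -5",
    TRUE ~ "-15 and below"
  )) %>%
  # Convert to a factor so the legend follows the bin order rather than alphabetical order
  mutate(groups = factor(groups, levels = group_levels))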

ggplot() +
  geom_sf(data = net_migration, aes(fill = groups, color = groups), lwd = 0.1) +
  geom_sf(data = tidycensus::state_laea, fill = NA, color = "black", lwd = 0.1) +
  scale_fill_brewer(palette = "PuOr", direction = -1) +
  scale_color_brewer(palette = "PuOr", direction = -1, guide = "none") +
  coord_sf(datum = NA) +
  theme_minimal(base_family = "Roboto") +
  labs(title = "Net migration per 1000 residents by county",
       subtitle = "US Census Bureau 2017 Population Estimates",
       fill = "Rate",
       caption = "Data acquired with the R tidycensus package | @kyle_e_walker")
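
The binning step can also be written with base R's cut(), which makes the interval boundaries explicit. This is a minimal alternative sketch, not the approach used above; it runs on a few rates taken from the printed output plus some boundary values chosen for illustration, and the names rates and bin_labels are hypothetical.

```r
# Bin labels in display order (the same labels used in the case_when() call above)
bin_labels <- c("-15 and below", "-15 to -5", "-5 to +5", "+5 to +15", "+15 and up")

# Three rates from the printed output, then boundary values and a missing value
rates <- c(-1.86, 7.69, -8.08, 15, -15, 20, NA)

# right = TRUE makes each interval (lower, upper], which matches the
# strict `>` comparisons in the case_when() version
groups <- cut(rates,
              breaks = c(-Inf, -15, -5, 5, 15, Inf),
              labels = bin_labels,
              right = TRUE)

data.frame(rates, groups)
```

One difference worth noting: cut() leaves a missing rate as NA, whereas the TRUE fallback in case_when() would place it in the "-15 and below" bin.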

Estimates of population characteristics

The fourth population estimates product available in get_estimates(), "characteristics", is formatted differently from the other three. It returns population estimates broken down by categories of AGEGROUP, SEX, RACE, and HISP (Hispanic origin). Requested breakdowns should be specified as a character vector supplied to the breakdown parameter when the product is set to "characteristics".

By default, the returned categories are formatted as integers that map onto the Census Bureau definitions explained here: https://www.census.gov/data/developers/data-sets/popest-popproj/popest/popest-vars/2017.html. However, by specifying breakdown_labels = TRUE, the function will return the appropriate labels instead. For example:

la_age_hisp <- get_estimates(geography = "county",
                             product = "characteristics",
                             breakdown = c("SEX", "AGEGROUP", "HISP"),
                             breakdown_labels = TRUE,
                             state = "CA",
                             county = "Los Angeles")

la_age_hisp
## # A tibble: 210 x 6
##    GEOID NAME                      value SEX       AGEGROUP      HISP           
##    <chr> <chr>                     <dbl> <chr>     <fct>         <chr>          
##  1 06037 Los Angeles County, C… 10105518 Both sex… All ages      Both Hispanic …
##  2 06037 Los Angeles County, C…  5190231 Both sex… All ages      Non-Hispanic   
##  3 06037 Los Angeles County, C…  4915287 Both sex… All ages      Hispanic       
##  4 06037 Los Angeles County, C…  4981895 Male      All ages      Both Hispanic …
##  5 06037 Los Angeles County, C…  2529798 Male      All ages      Non-Hispanic   
##  6 06037 Los Angeles County, C…  2452097 Male      All ages      Hispanic       
##  7 06037 Los Angeles County, C…  5123623 Female    All ages      Both Hispanic …
##  8 06037 Los Angeles County, C…  2660433 Female    All ages      Non-Hispanic   
##  9 06037 Los Angeles County, C…  2463190 Female    All ages      Hispanic       
## 10 06037 Los Angeles County, C…   603555 Both sex… Age 0 to 4 y… Both Hispanic …
## # … with 200 more rows
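
Because the result includes total rows ("Both sexes", "Both Hispanic Origins", "All ages") alongside the individual categories, it is easy to check that the pieces add up and to compute simple shares. The sketch below rebuilds the nine "All ages" rows by hand from the output above so that it runs without an API call; la_totals and total_for() are illustrative names, not tidycensus objects.

```r
# Values copied from the first nine rows of the la_age_hisp output above
la_totals <- data.frame(
  SEX = rep(c("Both sexes", "Male", "Female"), each = 3),
  HISP = rep(c("Both Hispanic Origins", "Non-Hispanic", "Hispanic"), times = 3),
  value = c(10105518, 5190231, 4915287,
            4981895, 2529798, 2452097,
            5123623, 2660433, 2463190)
)

# Hypothetical helper: look up the total for one SEX / HISP combination
total_for <- function(sex, hisp) {
  la_totals$value[la_totals$SEX == sex & la_totals$HISP == hisp]
}

# Male and female totals should add up to the "Both sexes" total
sexes_add_up <- total_for("Male", "Both Hispanic Origins") +
  total_for("Female", "Both Hispanic Origins") ==
  total_for("Both sexes", "Both Hispanic Origins")
sexes_add_up

# Hispanic share of the county population, in percent
hispanic_share <- 100 * total_for("Both sexes", "Hispanic") /
  total_for("Both sexes", "Both Hispanic Origins")
round(hispanic_share, 1)
```

The same check carries over to the full la_age_hisp tibble by filtering on AGEGROUP == "All ages" first.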

With some additional data wrangling, the returned format facilitates analysis and visualization. For example, we can compare population pyramids for Hispanic and non-Hispanic populations in Los Angeles County:

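# Keep rows whose age-group label starts with "Age", and drop the combined
# "Both Hispanic Origins" and "Both sexes" totals
compare <- filter(la_age_hisp, str_detect(AGEGROUP, "^Age"),
                  HISP != "Both Hispanic Origins",
                  SEX != "Both sexes") %>%
  # Negate male values so their bars extend to the left of zero in the pyramid
  mutate(value = ifelse(SEX == "Male", -value, value))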

ggplot(compare, aes(x = AGEGROUP, y = value, fill = SEX)) +
  geom_bar(stat = "identity", width = 1) +
  theme_minimal(base_family = "Roboto") +
  scale_y_continuous(labels = function(y) paste0(abs(y / 1000), "k")) +
  scale_x_discrete(labels = function(x) gsub("Age | years", "", x)) +
  scale_fill_manual(values = c("darkred", "navy")) +
  coord_flip() +
  facet_wrap(~HISP) +
  labs(x = "",
       y = "2017 Census Bureau population estimate",
       title = "Population structure by Hispanic origin",
       subtitle = "Los Angeles County, California",
       fill = "",
       caption = "Data source: US Census Bureau population estimates & tidycensus R package")
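
The two anonymous label functions in the plot are what turn the negated male values back into readable axis text. To see what they do, they can be pulled out and run on their own. In this sketch the names format_thousands and strip_age_text are hypothetical, and the full label "Age 0 to 4 years" is assumed from the pattern used in the gsub() call, since the printed output truncates it.

```r
# Axis label helpers copied from the scale_y_continuous() and scale_x_discrete() calls above
format_thousands <- function(y) paste0(abs(y / 1000), "k")
strip_age_text <- function(x) gsub("Age | years", "", x)

# Negative (male) and positive (female) values receive the same unsigned label
format_thousands(c(-400000, 0, 400000))
# "400k" "0k" "400k"

# The leading "Age " and the " years" suffix are removed from each label
strip_age_text("Age 0 to 4 years")
# "0 to 4"
```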