class: center, middle, inverse, title-slide .title[ # Working with the 2021 American Community Survey with R and tidycensus ] .subtitle[ ## SSDAN Workshop Series ] .author[ ### Kyle Walker ] .date[ ### February 8, 2023 ] --- ## About me .pull-left[ * Associate Professor of Geography at TCU * Spatial data science researcher and consultant * Package developer: __tidycensus__, __tigris__, __mapboxapi__, __crsuggest__, __idbr__ (R), __pygris__ (Python) * Book: [_Analyzing US Census Data: Methods, Maps and Models in R_](https://walker-data.com/census-r/) - Print release date February 16 - To support these workshops: [buy on Amazon](https://www.amazon.com/Analyzing-US-Census-Data-Methods/dp/1032366443) or [direct from CRC Press](https://www.routledge.com/Analyzing-US-Census-Data-Methods-Maps-and-Models-in-R/Walker/p/book/9781032366449) ] .pull-right[ <img src=img/walker_crop.jpg> ] --- ## SSDAN workshop series * Today: Working with the 2021 American Community Survey with R and tidycensus * Next Wednesday (February 15): Mapping and spatial analysis with ACS data in R * Wednesday, February 22: Working with geographic data and making maps in Python --- ## Today's agenda * Hour 1: The American Community Survey, R, and tidycensus * Hour 2: Analysis and visualization of ACS data with tidyverse tools * Hour 3: An introduction to ACS microdata --- class: middle, center, inverse ## Part 1: The American Community Survey, R, and tidycensus --- ## The American Community Survey * Annual survey of 3.5 million US households * Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics) * Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group) - 2020 1-year data only available as [experimental estimates](https://www.census.gov/programs-surveys/acs/data/experimental-data.html) * Data delivered as _estimates_ characterized by _margins of error_ --- ## How to get ACS data * [data.census.gov](https://data.census.gov) is the main, revamped interactive data portal for browsing and downloading Census datasets, including the ACS * [The US Census __A__pplication __P__rogramming __Interface__ (API)](https://www.census.gov/data/developers/data-sets.html) allows developers to access Census data resources programmatically --- ## tidycensus * R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Series APIs * Key features: - Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested); - Automatically downloads and merges Census geometries to data for mapping (next week's workshop!); - Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS; - States and counties can be requested by name (no more looking up FIPS codes!) --- ## R and RStudio * R: programming language and software environment for data analysis (and wherever else your imagination can take you!) * RStudio: integrated development environment (IDE) for R developed by [Posit](https://posit.co/) * Posit Cloud: run RStudio with today's workshop pre-configured at https://posit.cloud/content/5377428 --- ## Getting started with tidycensus * To get started, install the packages you'll need for today's workshop * If you are using the Posit Cloud environment, these packages are already installed for you ```r install.packages(c("tidycensus", "tidyverse")) ``` * Optional, to run advanced examples: ```r install.packages(c("mapview", "plotly", "ggiraph", "survey", "srvyr")) ``` --- ## Optional: your Census API key * tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day * Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email. * Once activated, use the `census_api_key()` function to set your key as an environment variable ```r library(tidycensus) census_api_key("YOUR KEY GOES HERE", install = TRUE) ``` --- class: middle, center, inverse ## Getting started with ACS data in tidycensus --- ## Using the `get_acs()` function * The `get_acs()` function is your portal to access ACS data using tidycensus * The two required arguments are `geography` and `variables`. The function defaults to the 2017-2021 5-year ACS ```r library(tidycensus) median_income <- get_acs( geography = "county", variables = "B19013_001", year = 2021 ) ``` --- * ACS data are returned with five columns: `GEOID`, `NAME`, `variable`, `estimate`, and `moe` ```r median_income ``` ``` ## # A tibble: 3,221 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 01001 Autauga County, Alabama B19013_001 62660 4834 ## 2 01003 Baldwin County, Alabama B19013_001 64346 2377 ## 3 01005 Barbour County, Alabama B19013_001 36422 3282 ## 4 01007 Bibb County, Alabama B19013_001 54277 7325 ## 5 01009 Blount County, Alabama B19013_001 52830 3197 ## 6 01011 Bullock County, Alabama B19013_001 29063 7174 ## 7 01013 Butler County, Alabama B19013_001 45236 3880 ## 8 01015 Calhoun County, Alabama B19013_001 50977 1889 ## 9 01017 Chambers County, Alabama B19013_001 47232 5252 ## 10 01019 Cherokee County, Alabama B19013_001 43475 3643 ## # … with 3,211 more rows ``` --- ## 1-year ACS data * 1-year ACS data are more current, but are only available for geographies of population 65,000 and greater * Access 1-year ACS data with the argument `survey = "acs1"`; defaults to `"acs5"` ```r median_income_1yr <- get_acs( geography = "county", variables = "B19013_001", year = 2021, * survey = "acs1" ) ``` --- * ACS data are returned with five columns: `GEOID`, `NAME`, `variable`, `estimate`, and `moe` ```r median_income_1yr ``` ``` ## # A tibble: 841 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 01003 Baldwin County, Alabama B19013_001 63866 5457 ## 2 01015 Calhoun County, Alabama B19013_001 46524 6711 ## 3 01043 Cullman County, Alabama B19013_001 55517 6722 ## 4 01049 DeKalb County, Alabama B19013_001 41800 6689 ## 5 01051 Elmore County, Alabama B19013_001 59032 8600 ## 6 01055 Etowah County, Alabama B19013_001 45298 4507 ## 7 01069 Houston County, Alabama B19013_001 46931 3484 ## 8 01073 Jefferson County, Alabama B19013_001 55006 1969 ## 9 01077 Lauderdale County, Alabama B19013_001 50218 7743 ## 10 01081 Lee County, Alabama B19013_001 52424 4686 ## # … with 831 more rows ``` --- ## Requesting tables of variables * The `table` parameter can be used to obtain all related variables in a "table" at once ```r income_table <- get_acs( geography = "county", * table = "B19001", year = 2021 ) ``` --- ```r income_table ``` ``` ## # A tibble: 54,757 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 01001 Autauga County, Alabama B19001_001 21856 424 ## 2 01001 Autauga County, Alabama B19001_002 1213 287 ## 3 01001 Autauga County, Alabama B19001_003 1051 313 ## 4 01001 Autauga County, Alabama B19001_004 1062 309 ## 5 01001 Autauga County, Alabama B19001_005 1324 322 ## 6 01001 Autauga County, Alabama B19001_006 1048 266 ## 7 01001 Autauga County, Alabama B19001_007 723 258 ## 8 01001 Autauga County, Alabama B19001_008 672 188 ## 9 01001 Autauga County, Alabama B19001_009 1240 342 ## 10 01001 Autauga County, Alabama B19001_010 731 237 ## # … with 54,747 more rows ``` --- class: middle, center, inverse ## Understanding geography and variables in tidycensus --- ## US Census Geography <img src=img/census_diagram.png style="width: 500px"> .footnote[Source: [US Census Bureau](https://www2.census.gov/geo/pdfs/reference/geodiagram.pdf)] --- ## Geography in tidycensus * Information on available geographies, and how to specify them, can be found [in the tidycensus documentation](https://walker-data.com/tidycensus/articles/basic-usage.html#geography-in-tidycensus-1) <img src=img/tidycensus_geographies.png style="width: 650px"> --- ## Querying by state .pull-left[ * For geographies available below the state level, the `state` parameter allows you to query data for a specific state * For smaller geographies (Census tracts, block groups), a `county` can also be requested * __tidycensus__ translates state names and postal abbreviations internally, so you don't need to remember the FIPS codes! * Example: data on median household income in Minnesota by county ] .pull-right[ ```r mn_income <- get_acs( geography = "county", variables = "B19013_001", * state = "MN", year = 2021 ) ``` ] --- ```r mn_income ``` ``` ## # A tibble: 87 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 27001 Aitkin County, Minnesota B19013_001 50900 2019 ## 2 27003 Anoka County, Minnesota B19013_001 88680 1163 ## 3 27005 Becker County, Minnesota B19013_001 64296 2108 ## 4 27007 Beltrami County, Minnesota B19013_001 55910 2933 ## 5 27009 Benton County, Minnesota B19013_001 65529 2225 ## 6 27011 Big Stone County, Minnesota B19013_001 58095 4005 ## 7 27013 Blue Earth County, Minnesota B19013_001 65000 2446 ## 8 27015 Brown County, Minnesota B19013_001 62999 2334 ## 9 27017 Carlton County, Minnesota B19013_001 68579 2910 ## 10 27019 Carver County, Minnesota B19013_001 107890 3192 ## # … with 77 more rows ``` --- ## Searching for variables * To search for variables, use the `load_variables()` function along with a year and dataset * The `View()` function in RStudio allows for interactive browsing and filtering ```r vars <- load_variables(2021, "acs5") View(vars) ``` --- ## Available ACS datasets in tidycensus * Detailed Tables * Data Profile (add `"/profile"` for variable lookup) * Subject Tables (add `"/subject"`) * Comparison Profile (add `"/cprofile"`) * Supplemental Estimates (use `"acsse"`) * Migration Flows (access with `get_flows()`) --- class: middle, center, inverse ## Data structure in tidycensus --- ## "Tidy" or long-form data .pull-left[ * The default data structure returned by __tidycensus__ is "tidy" or long-form data, with variables by geography stacked by row ] .pull-right[ ```r age_sex_table <- get_acs( geography = "state", table = "B01001", year = 2021, survey = "acs1", ) ``` ] --- ```r age_sex_table ``` ``` ## # A tibble: 2,548 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 01 Alabama B01001_001 5039877 NA ## 2 01 Alabama B01001_002 2445896 5868 ## 3 01 Alabama B01001_003 149482 2753 ## 4 01 Alabama B01001_004 154786 5044 ## 5 01 Alabama B01001_005 168349 5077 ## 6 01 Alabama B01001_006 97886 2970 ## 7 01 Alabama B01001_007 70364 3696 ## 8 01 Alabama B01001_008 38216 3736 ## 9 01 Alabama B01001_009 32896 3278 ## 10 01 Alabama B01001_010 90846 4401 ## # … with 2,538 more rows ``` --- ## "Wide" data .pull-left[ * The argument `output = "wide"` spreads Census variables across the columns, returning one row per geographic unit and one column per variable ] .pull-right[ ```r age_sex_table_wide <- get_acs( geography = "state", table = "B01001", year = 2021, survey = "acs1", * output = "wide" ) ``` ] --- ```r age_sex_table_wide ``` ``` ## # A tibble: 52 × 100 ## GEOID NAME B0100…¹ B0100…² B0100…³ B0100…⁴ B0100…⁵ B0100…⁶ B0100…⁷ B0100…⁸ ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 01 Alabama 5.04e6 NA 2.45e6 5868 149482 2753 154786 5044 ## 2 72 Puerto… 3.26e6 NA 1.54e6 3969 49739 2447 77392 4089 ## 3 04 Arizona 7.28e6 NA 3.63e6 2852 203199 1688 232179 6421 ## 4 05 Arkans… 3.03e6 NA 1.49e6 4590 91430 2900 103629 5153 ## 5 06 Califo… 3.92e7 NA 1.96e7 6432 1129355 3475 1210580 14702 ## 6 08 Colora… 5.81e6 NA 2.94e6 3980 158361 2228 171525 5655 ## 7 09 Connec… 3.61e6 NA 1.77e6 2237 90173 1059 98412 4130 ## 8 10 Delawa… 1.00e6 NA 4.86e5 1201 27366 480 30336 2696 ## 9 11 Distri… 6.70e5 NA 3.19e5 601 20882 56 16256 2122 ## 10 12 Florida 2.18e7 NA 1.07e7 8241 558584 3735 584591 12459 ## # … with 42 more rows, 90 more variables: B01001_005E <dbl>, B01001_005M <dbl>, ## # B01001_006E <dbl>, B01001_006M <dbl>, B01001_007E <dbl>, B01001_007M <dbl>, ## # B01001_008E <dbl>, B01001_008M <dbl>, B01001_009E <dbl>, B01001_009M <dbl>, ## # B01001_010E <dbl>, B01001_010M <dbl>, B01001_011E <dbl>, B01001_011M <dbl>, ## # B01001_012E <dbl>, B01001_012M <dbl>, B01001_013E <dbl>, B01001_013M <dbl>, ## # B01001_014E <dbl>, B01001_014M <dbl>, B01001_015E <dbl>, B01001_015M <dbl>, ## # B01001_016E <dbl>, B01001_016M <dbl>, B01001_017E <dbl>, … ``` --- ## Using named vectors of variables .pull-left[ * Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input * In long form, these custom inputs will populate the `variable` column; in wide form, they will replace the column names ] .pull-right[ ```r ca_education <- get_acs( geography = "county", state = "CA", * variables = c(percent_high_school = "DP02_0062P", percent_bachelors = "DP02_0065P", * percent_graduate = "DP02_0066P"), year = 2021 ) ``` ] --- ```r ca_education ``` ``` ## # A tibble: 174 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 06001 Alameda County, California percent_high_school 16.7 0.4 ## 2 06001 Alameda County, California percent_bachelors 28.3 0.3 ## 3 06001 Alameda County, California percent_graduate 21.3 0.3 ## 4 06003 Alpine County, California percent_high_school 25.7 7.5 ## 5 06003 Alpine County, California percent_bachelors 20.6 7.5 ## 6 06003 Alpine County, California percent_graduate 18.7 8.5 ## 7 06005 Amador County, California percent_high_school 30.7 2.2 ## 8 06005 Amador County, California percent_bachelors 13.6 1.8 ## 9 06005 Amador County, California percent_graduate 5.9 1.1 ## 10 06007 Butte County, California percent_high_school 22.3 0.9 ## # … with 164 more rows ``` --- ## Part 1 exercises 1. Use the `load_variables()` function to find a variable that interests you that we haven't used yet. 2. Use `get_acs()` to fetch data on that variable from the ACS for counties, similar to how we did for median household income. --- class: middle, center, inverse ## Part 2: Analyzing and visualizing ACS data --- ## The tidyverse ```r library(tidyverse) tidyverse_logo() ``` ``` ## ⬢ __ _ __ . ⬡ ⬢ . ## / /_(_)__/ /_ ___ _____ _______ ___ ## / __/ / _ / // / |/ / -_) __(_-</ -_) ## \__/_/\_,_/\_, /|___/\__/_/ /___/\__/ ## ⬢ . /___/ ⬡ . ⬢ ``` * The [tidyverse](https://tidyverse.tidyverse.org/index.html): an integrated set of packages developed primarily by Hadley Wickham and the RStudio team --- ## tidycensus and the tidyverse * Census data are commonly used in _wide_ format, with categories spread across the columns * tidyverse tools work better with [data that are in "tidy", or _long_ format](https://vita.had.co.nz/papers/tidy-data.pdf); this format is returned by tidycensus by default * Goal: return data "ready to go" for use with tidyverse tools --- class: middle, center, inverse ## Exploring ACS data with tidyverse tools --- ## Finding the largest values * dplyr's `arrange()` function sorts data based on values in one or more columns, and `filter()` helps you query data based on column values * Example: what counties in the US have the lowest and highest median household incomes? ```r library(tidyverse) arrange(median_income, estimate) ``` ``` ## # A tibble: 3,221 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 72055 Guánica Municipio, Puerto Rico B19013_001 12856 1386 ## 2 72045 Comerío Municipio, Puerto Rico B19013_001 14666 1925 ## 3 72015 Arroyo Municipio, Puerto Rico B19013_001 14933 1660 ## 4 72147 Vieques Municipio, Puerto Rico B19013_001 14942 2258 ## 5 72083 Las Marías Municipio, Puerto Rico B19013_001 15353 3090 ## 6 72001 Adjuntas Municipio, Puerto Rico B19013_001 15375 1242 ## 7 72097 Mayagüez Municipio, Puerto Rico B19013_001 15501 776 ## 8 72079 Lajas Municipio, Puerto Rico B19013_001 15593 1764 ## 9 72107 Orocovis Municipio, Puerto Rico B19013_001 15637 1796 ## 10 72141 Utuado Municipio, Puerto Rico B19013_001 15812 1182 ## # … with 3,211 more rows ``` --- ```r arrange(median_income, desc(estimate)) ``` ``` ## # A tibble: 3,221 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 51107 Loudoun County, Virginia B19013_001 156821 2662 ## 2 51610 Falls Church city, Virginia B19013_001 155071 17392 ## 3 06085 Santa Clara County, California B19013_001 140258 1577 ## 4 06081 San Mateo County, California B19013_001 136837 2151 ## 5 51059 Fairfax County, Virginia B19013_001 133974 1433 ## 6 06041 Marin County, California B19013_001 131008 4163 ## 7 24027 Howard County, Maryland B19013_001 129549 2980 ## 8 51013 Arlington County, Virginia B19013_001 128145 2836 ## 9 08035 Douglas County, Colorado B19013_001 127443 2115 ## 10 36059 Nassau County, New York B19013_001 126576 1363 ## # … with 3,211 more rows ``` --- ## Refining your analysis with `filter()` * The `filter()` function subsets data according to a specified condition, much like a SQL query ```r # Remove Puerto Rico from the median income dataset income_states_dc <- filter(median_income, !str_detect(NAME, "Puerto Rico")) arrange(income_states_dc, estimate) ``` ``` ## # A tibble: 3,143 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 28055 Issaquena County, Mississippi B19013_001 17109 10468 ## 2 21237 Wolfe County, Kentucky B19013_001 24349 4388 ## 3 28051 Holmes County, Mississippi B19013_001 24958 2750 ## 4 48127 Dimmit County, Texas B19013_001 25000 8169 ## 5 22035 East Carroll Parish, Louisiana B19013_001 25049 6945 ## 6 46071 Jackson County, South Dakota B19013_001 25357 7628 ## 7 28119 Quitman County, Mississippi B19013_001 25783 3336 ## 8 46121 Todd County, South Dakota B19013_001 26250 4499 ## 9 48377 Presidio County, Texas B19013_001 26395 4952 ## 10 13309 Wheeler County, Georgia B19013_001 26776 3605 ## # … with 3,133 more rows ``` --- class: middle, center, inverse ## Group-wise Census data analysis --- ## Group-wise Census data analysis * The `group_by()` and `summarize()` functions in dplyr are used to implement the split-apply-combine method of data analysis * The default "tidy" format returned by tidycensus is designed to work well with group-wise Census data analysis workflows --- ## What is the highest-income county in each state? ```r highest_incomes <- median_income %>% * separate(NAME, into = c("county", "state"), sep = ", ") %>% * group_by(state) %>% * filter(estimate == max(estimate, na.rm = TRUE)) ``` --- ```r highest_incomes ``` ``` ## # A tibble: 52 × 6 ## # Groups: state [52] ## GEOID county state variable estim…¹ moe ## <chr> <chr> <chr> <chr> <dbl> <dbl> ## 1 01117 Shelby County Alabama B19013_0… 82592 2846 ## 2 02016 Aleutians West Census Area Alaska B19013_0… 90708 2916 ## 3 04013 Maricopa County Arizona B19013_0… 72944 519 ## 4 05007 Benton County Arkansas B19013_0… 76887 2240 ## 5 06085 Santa Clara County California B19013_0… 140258 1577 ## 6 08035 Douglas County Colorado B19013_0… 127443 2115 ## 7 09001 Fairfield County Connecticut B19013_0… 101194 1352 ## 8 10003 New Castle County Delaware B19013_0… 78428 1622 ## 9 11001 District of Columbia District of Columbia B19013_0… 93547 1718 ## 10 12109 St. Johns County Florida B19013_0… 88794 2505 ## # … with 42 more rows, and abbreviated variable name ¹estimate ``` --- class: middle, center, inverse ## Visualizing ACS data --- ## Visualizing ACS estimates * As opposed to decennial US Census data, ACS estimates include information on uncertainty, represented by the _margin of error_ in the `moe` column * This means that in some cases, visualization of estimates without reference to the margin of error can be misleading * Walkthrough: building a margin of error visualization with __ggplot2__ --- ## Visualizing ACS estimates * Let's get some data on median gross rent by county in Maryland ```r md_rent <- get_acs( geography = "county", variables = "B25031_001", state = "MD", year = 2021 ) ``` --- ## A basic plot * To visualize a dataset with __ggplot2__, we define an _aesthetic_ and a _geom_ ```r ggplot(md_rent, aes(x = estimate, y = NAME)) + geom_point() ``` --- <img src="img/first-plot.png" width="800px" /> --- ## Problems with our basic plot * The data are not sorted by value, making comparisons difficult * The axis and tick labels are not intuitive * The Y-axis labels contain repetitive information (" County, Maryland") * We've made no attempt to customize the styling --- ## Sorting by value * We use `reorder()` to sort counties by the value of their ACS estimates, improving legibility ```r md_plot <- ggplot(md_rent, aes(x = estimate, * y = reorder(NAME, estimate))) + geom_point(color = "darkred", size = 2) ``` --- <img src="img/second-plot.png" width="800px" /> --- ## Cleaning up tick labels * Using a combination of functions in the __scales__ package and custom-defined functions, tick labels can be formatted any way you want ```r library(scales) md_plot <- md_plot + * scale_x_continuous(labels = label_dollar()) + * scale_y_discrete(labels = function(x) str_remove(x, " County, Maryland|, Maryland")) ``` --- <img src="img/third-plot.png" width="800px" /> --- ## Improving formatting and theming * Use `labs()` to label the plot and its axes, and change the theme to one of several built-in options ```r md_plot <- md_plot + labs(title = "Median gross rent, 2017-2021 ACS", subtitle = "Counties in Maryland", caption = "Data acquired with R and tidycensus", x = "ACS estimate", y = "") + theme_minimal(base_size = 12) ``` --- <img src="img/fourth-plot.png" width="800px" /> --- ## Problem: comparing ACS estimates * The chart suggests that Calvert County has lower rent than Prince George's County and higher rent than St. Mary's, but its margin of error is quite large ```r md_rent %>% arrange(desc(estimate)) %>% slice(5:9) ``` ``` ## # A tibble: 5 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 24033 Prince George's County, Maryland B25031_001 1593 15 ## 2 24035 Queen Anne's County, Maryland B25031_001 1550 73 ## 3 24009 Calvert County, Maryland B25031_001 1526 113 ## 4 24021 Frederick County, Maryland B25031_001 1503 38 ## 5 24037 St. Mary's County, Maryland B25031_001 1492 51 ``` * How to visualize uncertainty in an intuitive way? --- ## Visualizing margins of error ```r md_plot_errorbar <- ggplot(md_rent, aes(x = estimate, y = reorder(NAME, estimate))) + * geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), * width = 0.5, linewidth = 0.5) + geom_point(color = "darkred", size = 2) + scale_x_continuous(labels = label_dollar()) + scale_y_discrete(labels = function(x) str_remove(x, " County, Maryland|, Maryland")) + labs(title = "Median gross rent, 2017-2021 ACS", subtitle = "Counties in Maryland", caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.", x = "ACS estimate", y = "") + theme_minimal(base_size = 12) ``` --- <img src="img/error-plot.png" width="800px" /> --- class: middle, center, inverse ## Making plots interactive --- ## Quick interactivity with `ggplotly()` * The __plotly__ R package is an interface to the Plotly JavaScript library for full-featured interactive plotting * Resource: [_Interactive web-based data visualization with R, plotly, and Shiny](https://plotly-r.com/) * `ggplotly()` automatically converts __ggplot2__ graphics to interactive charts ```r library(plotly) ggplotly(md_plot_errorbar, tooltip = "x") ``` --- <iframe src="img/md_plotly.html" width="800px" height="400px" data-external="1"></iframe> --- ## Interactivity with __ggiraph__ * __ggiraph__: Alternative approach for making __ggplot2__ graphics interactive * Includes `*_interactive()` versions of __ggplot2__ geoms that can bring chart elements to life * Next week: we'll use __ggiraph__ for interactive mapping! --- ## __ggiraph__ example ```r library(ggiraph) md_plot_ggiraph <- ggplot(md_rent, aes(x = estimate, y = reorder(NAME, estimate), * tooltip = estimate, * data_id = GEOID)) + geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), width = 0.5, size = 0.5) + * geom_point_interactive(color = "darkred", size = 2) + scale_x_continuous(labels = label_dollar()) + scale_y_discrete(labels = function(x) str_remove(x, " County, Maryland|, Maryland")) + labs(title = "Median gross rent, 2017-2021 ACS", subtitle = "Counties in Maryland", caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.", x = "ACS estimate", y = "") + theme_minimal(base_size = 12) *girafe(ggobj = md_plot_ggiraph) %>% * girafe_options(opts_hover(css = "fill:cyan;")) ``` --- ## Saving interactive plots .pull-left[ * To save an interactive plot to a standalone HTML file for display on your website, use the `saveWidget()` function in the __htmlwidgets__ package ] .pull-right[ ```r library(htmlwidgets) plotly_plot <- ggplotly(md_plot_errorbar, tooltip = "x") saveWidget(plotly_plot, file = "md_plotly.html") ``` ] --- ## Part 2 exercises Swap in a variable from Part 1, `"DP02_0066P"` (percent with a graduate degree) for the analysis in this section. Find out answers to the following questions: * Which counties in the US have the largest percentages of graduate degree holders? * Which have the smallest percentages? * For a state of your choosing, which county in your state has the highest percentage? __Challenge, if time__: make a margin of error plot for the state you've chosen. Does this method work well for your state? --- class: middle, center, inverse ## Part 3: Working with ACS microdata --- ## What is "microdata?" * __Microdata__: individual-level survey responses made available to researchers * [The ACS Public Use Microdata Series (PUMS)](https://www.census.gov/programs-surveys/acs/microdata.html) allows for detailed cross-tabulations not available in aggregated data * The 1-year PUMS covers about 1 percent of the US population; the 5-year PUMS covers about 5 percent (so, not the full ACS) * Data downloads available [in bulk from the Census FTP server](https://www2.census.gov/programs-surveys/acs/data/pums/2019/5-Year/) or [from data.census.gov's MDAT tool](https://data.census.gov/mdat/#/search?ds=ACSPUMS5Y2019) * Other resource for cleaned, time-series microdata: IPUMS --- class: middle, center, inverse ## Using microdata in tidycensus --- ## Basic usage of `get_pums()` .pull-left[ * `get_pums()` requires specifying one or more variables and the state for which you'd like to request data. `state = 'all'` _can_ get data for the entire USA, but it takes a while! * The function defaults to the 5-year ACS with `survey = "acs5"`; 1-year ACS data is available with `survey = "acs1"`. * The default year is 2021 in the latest version of tidycensus; data are available back to 2005 (1-year ACS) and 2005-2009 (5-year ACS). 2020 1-year data are not available. ] .pull-right[ ```r library(tidycensus) hi_pums <- get_pums( variables = c("SEX", "AGEP", "HHT"), state = "HI", survey = "acs1", year = 2021 ) ``` ] --- ```r hi_pums ``` ``` ## # A tibble: 14,959 × 8 ## SERIALNO SPORDER WGTP PWGTP AGEP ST HHT SEX ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> ## 1 2021GQ0000142 1 0 65 22 15 b 2 ## 2 2021GQ0000336 1 0 15 42 15 b 1 ## 3 2021GQ0000571 1 0 46 21 15 b 1 ## 4 2021GQ0001532 1 0 15 9 15 b 2 ## 5 2021GQ0001569 1 0 90 21 15 b 1 ## 6 2021GQ0001907 1 0 27 20 15 b 1 ## 7 2021GQ0001922 1 0 98 22 15 b 1 ## 8 2021GQ0001975 1 0 114 22 15 b 1 ## 9 2021GQ0002367 1 0 144 22 15 b 1 ## 10 2021GQ0002698 1 0 92 19 15 b 1 ## # … with 14,949 more rows ``` --- ## Understanding default data from `get_pums()` `get_pums()` returns some technical variables by default without the user needing to request them specifically. These include: * `SERIALNO`: a serial number that uniquely identifies households in the sample; * `SPORDER`: the order of the person in the household; when combined with `SERIALNO`, uniquely identifies a person; * `WGTP`: the household weight; * `PWGTP`: the person weight --- ## Weights and ACS microdata * Given that PUMS data are a _sample_ of the US population, the weights columns must be used for analysis ```r library(tidyverse) hi_age_39 <- filter(hi_pums, AGEP == 39) print(sum(hi_pums$PWGTP)) ``` ``` ## [1] 1441553 ``` ```r print(sum(hi_age_39$PWGTP)) ``` ``` ## [1] 17381 ``` --- ## Are these estimates accurate? * PUMS weights are calibrated to population and household totals, so larger tabulations should align with published estimates ```r get_acs("state", "B01003_001", state = "HI", survey = "acs1", year = 2021) ``` ``` ## # A tibble: 1 × 5 ## GEOID NAME variable estimate moe ## <chr> <chr> <chr> <dbl> <dbl> ## 1 15 Hawaii B01003_001 1441553 NA ``` * Smaller tabulations will be characterized by more uncertainty, and may deviate from published estimates --- class: middle, center, inverse ## Working with PUMS variables --- ## Variables available in the ACS PUMS ```r View(pums_variables) ``` * The `pums_variables` dataset is your one-stop shop for browsing variables in the ACS PUMS * It is a long-form dataset that organizes specific _value codes_ by variable so you know what you can get. You'll use information in the `var_code` column to fetch variables, but pay attention to the `var_label`, `val_code`, `val_label`, and `data_type` columns --- ## Recoding PUMS variables .pull-left[ * The `recode = TRUE` argument in `get_pums()` appends recoded columns to your returned dataset based on information available in `pums_variables` ] .pull-right[ ```r hi_pums_recoded <- get_pums( variables = c("SEX", "AGEP", "HHT"), state = "HI", survey = "acs1", year = 2021, recode = TRUE ) ``` ] --- ```r hi_pums_recoded ``` ``` ## # A tibble: 14,959 × 11 ## SERIALNO SPORDER WGTP PWGTP AGEP ST HHT SEX ST_la…¹ HHT_l…² SEX_l…³ ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <ord> <ord> <ord> ## 1 2021GQ00… 1 0 65 22 15 b 2 Hawaii… N/A (G… Female ## 2 2021GQ00… 1 0 15 42 15 b 1 Hawaii… N/A (G… Male ## 3 2021GQ00… 1 0 46 21 15 b 1 Hawaii… N/A (G… Male ## 4 2021GQ00… 1 0 15 9 15 b 2 Hawaii… N/A (G… Female ## 5 2021GQ00… 1 0 90 21 15 b 1 Hawaii… N/A (G… Male ## 6 2021GQ00… 1 0 27 20 15 b 1 Hawaii… N/A (G… Male ## 7 2021GQ00… 1 0 98 22 15 b 1 Hawaii… N/A (G… Male ## 8 2021GQ00… 1 0 114 22 15 b 1 Hawaii… N/A (G… Male ## 9 2021GQ00… 1 0 144 22 15 b 1 Hawaii… N/A (G… Male ## 10 2021GQ00… 1 0 92 19 15 b 1 Hawaii… N/A (G… Male ## # … with 14,949 more rows, and abbreviated variable names ¹ST_label, ## # ²HHT_label, ³SEX_label ``` --- ## Using variables filters .pull-left[ * PUMS datasets - especially from the 5-year ACS - can get quite large. The `variables_filter` argument can return a subset of data from the API, reducing long download times ] .pull-right[ ```r hi_pums_filtered <- get_pums( variables = c("SEX", "AGEP", "HHT"), state = "HI", survey = "acs5", variables_filter = list( SEX = 2, AGEP = 30:49 ), year = 2021 ) ``` ] --- ```r hi_pums_filtered ``` ``` ## # A tibble: 8,532 × 8 ## SERIALNO SPORDER WGTP PWGTP AGEP ST HHT SEX ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> ## 1 2020HU0245490 3 4 4 34 15 1 2 ## 2 2020HU0245555 1 8 8 49 15 3 2 ## 3 2020HU0245753 2 25 27 38 15 1 2 ## 4 2020HU0245942 2 83 67 38 15 1 2 ## 5 2020HU0247148 4 23 49 47 15 1 2 ## 6 2020HU0247499 2 24 23 49 15 1 2 ## 7 2020HU0247555 3 1 3 49 15 1 2 ## 8 2020HU0247818 1 68 68 36 15 1 2 ## 9 2020HU0250570 2 19 18 38 15 1 2 ## 10 2020HU0250646 2 92 38 46 15 1 2 ## # … with 8,522 more rows ``` --- class: middle, center, inverse ## Public Use Microdata Areas (PUMAs) --- ## What is a PUMA? * Public Use Microdata Areas (PUMAs) are the smallest available geographies at which records are identifiable in the PUMS datasets * PUMAs are redrawn with each decennial US Census, and typically are home to 100,000-200,000 people. The 2021 ACS uses 2010 PUMAs; the 2022 ACS will align with the new 2020 PUMAs * In large cities, a PUMA will represent a collection of nearby neighborhoods; in rural areas, it might represent several counties across a large area of a state * Let's preview some of next week's spatial tools to understand PUMA geography in Hawaii --- ```r library(tigris) library(mapview) options(tigris_use_cache = TRUE) # Get the latest version of 2010 PUMAs hi_pumas <- pumas(state = "HI", cb = TRUE, year = 2019) hi_puma_map <- mapview(hi_pumas) ``` --- ```r hi_puma_map ```
--- ## Working with PUMAs in PUMS data * To get PUMA information in your output data, use the variable code `PUMA` ```r hi_age_by_puma <- get_pums( variables = c("PUMA", "AGEP"), state = "HI", survey = "acs5" ) ``` --- ```r hi_age_by_puma ``` ``` ## # A tibble: 72,790 × 7 ## SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2018HU0841283 1 17 17 58 00302 15 ## 2 2018HU0841283 2 17 18 57 00302 15 ## 3 2018HU0841283 3 17 23 25 00302 15 ## 4 2018HU0841330 1 14 14 78 00306 15 ## 5 2018HU0841330 2 14 17 51 00306 15 ## 6 2018HU0841330 3 14 15 18 00306 15 ## 7 2018HU0841330 4 14 12 15 00306 15 ## 8 2018HU0841330 5 14 16 52 00306 15 ## 9 2018HU0841347 1 14 13 55 00307 15 ## 10 2018HU0841347 2 14 15 37 00307 15 ## # … with 72,780 more rows ``` --- class: middle, center, inverse ## Handling uncertainty in tabulated PUMS estimates --- ## Uncertainty in PUMS data * PUMS data represent a smaller sample than the regular ACS, so understanding error around tabulated estimates is critical * [The Census Bureau recommends using _successive difference replication_](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/ACS2015_2019_PUMS_README.pdf) to calculate standard errors, and provides _replicate weights_ to do this * __tidycensus__ includes tools to help you get replicate weights and format your data for appropriate survey-weighted analysis --- ## Getting replicate weights * We can acquire either housing or person replicate weights with the `rep_weights` argument ```r hi_pums_replicate <- get_pums( variables = c("AGEP", "PUMA"), state = "HI", survey = "acs1", year = 2021, * rep_weights = "person" ) ``` --- ```r hi_pums_replicate ``` ``` ## # A tibble: 14,959 × 87 ## SERIALNO SPORDER AGEP PUMA ST WGTP PWGTP PWGTP1 PWGTP2 PWGTP3 PWGTP4 ## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2021HU0920… 1 80 00100 15 22 22 33 37 5 36 ## 2 2021HU0920… 1 89 00304 15 82 83 88 153 138 82 ## 3 2021HU0920… 1 82 00302 15 20 20 20 22 22 7 ## 4 2021HU0920… 2 88 00302 15 20 23 21 25 26 7 ## 5 2021HU0921… 1 37 00308 15 227 227 239 220 278 383 ## 6 2021HU0921… 2 36 00308 15 227 244 283 237 264 390 ## 7 2021HU0921… 3 6 00308 15 227 218 221 207 231 344 ## 8 2021HU0921… 4 4 00308 15 227 265 289 252 267 394 ## 9 2021HU0921… 5 0 00308 15 227 263 290 253 267 393 ## 10 2021HU0921… 1 36 00308 15 84 84 153 94 113 26 ## # … with 14,949 more rows, and 76 more variables: PWGTP5 <dbl>, PWGTP6 <dbl>, ## # PWGTP7 <dbl>, PWGTP8 <dbl>, PWGTP9 <dbl>, PWGTP10 <dbl>, PWGTP11 <dbl>, ## # PWGTP12 <dbl>, PWGTP13 <dbl>, PWGTP14 <dbl>, PWGTP15 <dbl>, PWGTP16 <dbl>, ## # PWGTP17 <dbl>, PWGTP18 <dbl>, PWGTP19 <dbl>, PWGTP20 <dbl>, PWGTP21 <dbl>, ## # PWGTP22 <dbl>, PWGTP23 <dbl>, PWGTP24 <dbl>, PWGTP25 <dbl>, PWGTP26 <dbl>, ## # PWGTP27 <dbl>, PWGTP28 <dbl>, PWGTP29 <dbl>, PWGTP30 <dbl>, PWGTP31 <dbl>, ## # PWGTP32 <dbl>, PWGTP33 <dbl>, PWGTP34 <dbl>, PWGTP35 <dbl>, … ``` --- ## Handling complex survey samples * __tidycensus__ links to the __survey__ and __srvyr__ packages for managing PUMS data as complex survey samples * The `to_survey()` function will format your data with replicate weights for correct survey-weighted estimation ```r hi_survey <- to_survey( hi_pums_replicate, type = "person" ) class(hi_survey) ``` ``` ## [1] "tbl_svy" "svyrep.design" ``` --- ## Survey-weighted tabulations * __srvyr__ conveniently links R's survey infrastructure to familiar tidyverse-style workflows * Standard errors can be multiplied by 1.645 to get familiar 90% confidence level margins of error ```r library(srvyr) hi_survey %>% filter(AGEP == 39) %>% survey_count() %>% mutate(n_moe = n_se * 1.645) ``` ``` ## # A tibble: 1 × 3 ## n n_se n_moe ## <dbl> <dbl> <dbl> ## 1 17381 1688. 2776. ``` --- ## Group-wise survey data analysis * A familiar group-wise tidyverse workflow can be applied correctly by __srvyr__ for the calculation of medians and other summary statistics ```r hi_survey %>% group_by(PUMA) %>% summarize(median_age = survey_median(AGEP)) %>% mutate(median_age_moe = median_age_se * 1.645) ``` ``` ## # A tibble: 10 × 4 ## PUMA median_age median_age_se median_age_moe ## <chr> <dbl> <dbl> <dbl> ## 1 00100 42 0.251 0.413 ## 2 00200 44 0.502 0.826 ## 3 00301 29 1.00 1.65 ## 4 00302 41 1.51 2.48 ## 5 00303 49 1.26 2.07 ## 6 00304 45 1.26 2.07 ## 7 00305 44 1.26 2.07 ## 8 00306 36 1.00 1.65 ## 9 00307 37 1.26 2.07 ## 10 00308 34 1.00 1.65 ``` --- ## Checking our answers * Tabulated median ages are not identical to published estimates, but are very close * Use published estimates if available; use PUMS data to generate estimates that aren't available in the published tables ```r hi_age_puma <- get_acs( geography = "puma", variables = "B01002_001", state = "HI", year = 2021, survey = "acs1" ) ``` --- class: middle, center, inverse ## Hope to see you next week!