class: center, middle, inverse, title-slide # ACS Microdata and International Census Data ### Kyle Walker ### March 25, 2021 --- ## About me * Associate Professor of Geography at TCU * Spatial data science researcher and consultant * R package developer: tidycensus, tigris, mapboxapi * Book coming this year: _Analyzing the US Census with R_ - These workshops are a sneak preview of the book's content! --- ## SSDAN workshop series * Today: US Census microdata and international Census data in R * March 4: an introduction to analyzing US Census data with tidycensus ([code](https://github.com/walkerke/umich-workshop/blob/main/census-data-in-r/code/part-1-code.R) | [video recording](https://www.youtube.com/watch?v=PnFJfuJ83NI)) * March 11: spatial analysis of US Census data in R ([code](https://github.com/walkerke/umich-workshop/blob/main/spatial-analysis/code/part-2-code.R) | [video recording](https://www.youtube.com/watch?v=GqC1HjAKui4)) --- ## Today's agenda * Hour 1: Using ACS microdata with tidycensus * Hour 2: Analyzing and modeling microdata * Hour 3: Bonus: international Census data resources in R --- ## Optional prep To run _all_ the examples in Part 3, you'll need API keys * Canada: https://censusmapper.ca/users/sign_up * Mexico: http://www3.inegi.org.mx//sistemas/api/indicadores/v1/tokenVerify.aspx --- class: middle, center, inverse ## Part 1: Using ACS microdata with tidycensus --- ## tidycensus overview and review * tidycensus: an R package to retrieve demographic data, optionally with geographic information, from the US Census Bureau's decennial Census, American Community Survey, and Population Estimates APIs * 2020: new functionality to retrieve ACS _microdata_ from the API * Coming soon: migration flows data! --- ## What is "microdata?" * __Microdata__: individual-level survey responses made available to researchers * [The ACS Public Use Microdata Series (PUMS)](https://www.census.gov/programs-surveys/acs/microdata.html) allows for detailed cross-tabulations not available in aggregated data * The 1-year PUMS covers about 1 percent of the US population; the 5-year PUMS covers about 5 percent (so, not the full ACS) * Data downloads available [in bulk from the Census FTP server](https://www2.census.gov/programs-surveys/acs/data/pums/2019/5-Year/) or [from data.census.gov's MDAT tool](https://data.census.gov/mdat/#/search?ds=ACSPUMS5Y2019) --- ## Microdata resources: IPUMS & ipumsr <img src=img/ipums.png style="width: 800px"> --- ## Microdata and the Census API <img src=img/micro_api.png style="width: 800px"> --- class: middle, center, inverse ## Using microdata in tidycensus --- ## Setting up tidycensus * To use tidycensus, you will need a Census API key. Visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email. * Once activated, use the `census_api_key()` function to set your key as an environment variable ```r library(tidycensus) census_api_key("YOUR KEY GOES HERE", install = TRUE) ``` --- ## Basic usage of `get_pums()` .pull-left[ * `get_pums()` requires specifying one or more variables and the state for which you'd like to request data. `state = 'all'` _can_ get data for the entire USA, but it takes a while! * The function defaults to the 5-year ACS with `survey = "acs5"`; 1-year ACS data is available with `survey = "acs1"`. * The default year is 2019; data are available back to 2006 (1-year ACS) and 2005-2009 (5-year ACS) ] .pull-right[ ```r library(tidycensus) wy_pums <- get_pums( variables = c("SEX", "AGEP", "SCHL"), state = "WY", survey = "acs1", year = 2019 ) ``` ] --- ```r wy_pums ``` ``` ## # A tibble: 5,967 x 8 ## SERIALNO SPORDER WGTP PWGTP AGEP ST SCHL SEX ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> ## 1 2019GQ0000335 1 0 206 17 56 16 2 ## 2 2019GQ0000958 1 0 50 37 56 18 1 ## 3 2019GQ0009156 1 0 33 75 56 19 2 ## 4 2019GQ0012426 1 0 23 21 56 19 2 ## 5 2019GQ0015243 1 0 209 94 56 16 1 ## 6 2019GQ0018773 1 0 273 26 56 16 1 ## 7 2019GQ0019978 1 0 6 16 56 12 1 ## 8 2019GQ0022931 1 0 67 18 56 18 2 ## 9 2019GQ0032421 1 0 113 94 56 21 2 ## 10 2019GQ0033045 1 0 10 44 56 13 1 ## # … with 5,957 more rows ``` --- ## Understanding default data from `get_pums()` `get_pums()` returns some technical variables by default without the user needing to request them specifically. These include: * `SERIALNO`: a serial number that uniquely identifies households in the sample; * `SPORDER`: the order of the person in the household; when combined with `SERIALNO`, uniquely identifies a person; * `WGTP`: the household weight; * `PWGTP`: the person weight --- ## Weights and ACS microdata * Given that PUMS data are a _sample_ of the US population, the weights columns must be used for analysis ```r library(tidyverse) wy_age_50 <- filter(wy_pums, AGEP == 50) print(sum(wy_pums$PWGTP)) ``` ``` ## [1] 578759 ``` ```r print(sum(wy_age_50$PWGTP)) ``` ``` ## [1] 4756 ``` --- class: middle, center, inverse ## Working with PUMS variables --- ## Variables available in the ACS PUMS ```r View(pums_variables) ``` * The `pums_variables` dataset is your one-stop shop for browsing variables in the ACS PUMS * It is a long-form dataset that organizes specific _value codes_ by variable so you know what you can get. You'll use information in the `var_code` column to fetch variables, but pay attention to the `var_label`, `val_code`, `val_label`, and `data_type` columns --- ## Recoding PUMS variables .pull-left[ * The `recode = TRUE` argument in `get_pums()` appends recoded columns to your returned dataset based on information available in `pums_variables` ] .pull-right[ ```r wy_pums_recoded <- get_pums( variables = c("SEX", "AGEP", "SCHL"), state = "WY", survey = "acs1", year = 2019, recode = TRUE ) ``` ] --- ```r wy_pums_recoded ``` ``` ## # A tibble: 5,967 x 11 ## SERIALNO SPORDER WGTP PWGTP AGEP ST SCHL SEX ST_label SCHL_label ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <ord> <ord> ## 1 2019GQ00… 1 0 14 16 56 13 1 Wyoming… Grade 10 ## 2 2019GQ00… 1 0 58 36 56 15 1 Wyoming… 12th grade - … ## 3 2019GQ00… 1 0 72 82 56 16 2 Wyoming… Regular high … ## 4 2019GQ00… 1 0 68 18 56 16 2 Wyoming… Regular high … ## 5 2019GQ00… 1 0 68 35 56 16 1 Wyoming… Regular high … ## 6 2019GQ00… 1 0 11 54 56 16 2 Wyoming… Regular high … ## 7 2019GQ00… 1 0 49 26 56 16 1 Wyoming… Regular high … ## 8 2019GQ00… 1 0 61 18 56 18 1 Wyoming… Some college,… ## 9 2019GQ00… 1 0 140 35 56 19 1 Wyoming… 1 or more yea… ## 10 2019GQ00… 1 0 75 18 56 20 1 Wyoming… Associate's d… ## # … with 5,957 more rows, and 1 more variable: SEX_label <ord> ``` --- ## Using variables filters .pull-left[ * PUMS datasets - especially from the 5-year ACS - can get quite large. The `variables_filter` argument can return a subset of data from the API, reducing long download times * This feature is currently only available in the development (GitHub) version of tidycensus ] .pull-right[ ```r wy_pums_filtered <- get_pums( variables = c("SEX", "AGEP", "SCHL"), state = "WY", survey = "acs5", variables_filter = list( SEX = 2, SCHL = 16:20 ), year = 2019, recode = TRUE ) ``` ] --- ```r wy_pums_filtered ``` ``` ## # A tibble: 7,658 x 11 ## SERIALNO SPORDER WGTP PWGTP AGEP ST SCHL SEX ST_label SCHL_label ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <ord> <ord> ## 1 20150000… 1 27 28 74 56 16 2 Wyoming… Regular high … ## 2 20150000… 1 13 13 51 56 20 2 Wyoming… Associate's d… ## 3 20150000… 1 18 19 44 56 20 2 Wyoming… Associate's d… ## 4 20150000… 1 16 16 57 56 16 2 Wyoming… Regular high … ## 5 20150000… 1 14 14 93 56 18 2 Wyoming… Some college,… ## 6 20150000… 2 11 12 66 56 18 2 Wyoming… Some college,… ## 7 20150000… 2 17 13 19 56 19 2 Wyoming… 1 or more yea… ## 8 20150000… 2 25 24 51 56 18 2 Wyoming… Some college,… ## 9 20150000… 2 26 22 57 56 19 2 Wyoming… 1 or more yea… ## 10 20150000… 2 14 13 66 56 16 2 Wyoming… Regular high … ## # … with 7,648 more rows, and 1 more variable: SEX_label <ord> ``` --- class: middle, center, inverse ## Public Use Microdata Areas (PUMAs) --- ## What is a PUMA? .pull-left[ * Public Use Microdata Areas (PUMAs) are the smallest available geographies at which records are identifiable in the PUMS datasets * PUMAs are redrawn with each decennial US Census, and typically are home to 100,000-200,000 people * In large cities, a PUMA will represent a collection of nearby neighborhoods; in rural areas, it might represent several counties across a large area of a state ] .pull-right[ ```r library(tigris) options(tigris_use_cache = TRUE) wy_pumas <- pumas(state = "WY", cb = TRUE) plot(wy_pumas$geometry) ``` ![](index_files/figure-html/plot-pumas-1.png)<!-- --> ] --- ## Working with PUMAs in PUMS data * To get PUMA information in your output data, use the variable code `PUMA` ```r wy_age_by_puma <- get_pums( variables = c("PUMA", "AGEP"), state = "WY", survey = "acs5" ) ``` --- ```r wy_age_by_puma ``` ``` ## # A tibble: 29,238 x 7 ## SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2015000001990 1 27 26 67 00300 56 ## 2 2015000006752 1 5 5 86 00100 56 ## 3 2015000006752 2 5 6 83 00100 56 ## 4 2015000006847 1 29 29 61 00300 56 ## 5 2015000007045 1 23 23 62 00300 56 ## 6 2015000007045 2 23 19 66 00300 56 ## 7 2015000007045 3 23 34 35 00300 56 ## 8 2015000007382 1 17 17 33 00500 56 ## 9 2015000010448 1 9 9 69 00100 56 ## 10 2015000010448 2 9 9 77 00100 56 ## # … with 29,228 more rows ``` --- ## Custom queries by PUMA * The `puma` argument in `get_pums()` can be used to obtain data for a specific PUMA or multiple PUMAs - again reducing long download times if specific geographies are required ```r wy_puma_subset <- get_pums( variables = "AGEP", state = "WY", survey = "acs5", puma = "00500" ) ``` --- ```r wy_puma_subset ``` ``` ## # A tibble: 7,176 x 7 ## SERIALNO SPORDER WGTP PWGTP AGEP PUMA ST ## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 2015000010667 1 11 12 38 00500 56 ## 2 2015000010667 2 11 14 36 00500 56 ## 3 2015000010667 3 11 14 5 00500 56 ## 4 2015000018572 1 8 8 80 00500 56 ## 5 2015000040346 1 24 23 54 00500 56 ## 6 2015000040346 2 24 31 55 00500 56 ## 7 2015000042859 1 7 8 60 00500 56 ## 8 2015000042859 2 7 4 53 00500 56 ## 9 2015000047414 1 17 16 60 00500 56 ## 10 2015000047414 2 17 35 72 00500 56 ## # … with 7,166 more rows ``` --- ## Part 1 exercises * Try requesting PUMS data using `get_pums()` yourselves, but for a state other than Wyoming. * Use the `pums_variables` dataset to browse the available variables in the PUMS. Create a custom query with `get_pums()` to request data for variables other than those we've used in the above examples. --- class: middle, center, inverse ## Analyzing and modeling PUMS data in R --- ## Considerations when working with PUMS data * The individual-level microdata returned by `get_pums()` can be used to produce detailed cross-tabulations not available in the aggregate ACS, e.g. from `get_acs()` * tidyverse tools like dplyr are excellent for producing these tabulations and handling survey weights * Because data acquired with `get_pums()` are drawn from the ACS, any estimates produced in your analysis will be characterized by error - which can be substantial when cross-tabulations are highly specific --- ## PUMS data and the tidyverse ```r ms_pums <- get_pums( variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "MS", survey = "acs5", year = 2019, recode = TRUE ) ms_age_sex <- ms_pums %>% count(SEX_label, AGEP, wt = PWGTP) ``` --- ```r ms_age_sex ``` ``` ## # A tibble: 186 x 3 ## SEX_label AGEP n ## <ord> <dbl> <dbl> ## 1 Male 0 18232 ## 2 Male 1 19104 ## 3 Male 2 18399 ## 4 Male 3 18810 ## 5 Male 4 20597 ## 6 Male 5 18066 ## 7 Male 6 20756 ## 8 Male 7 20989 ## 9 Male 8 20548 ## 10 Male 9 21852 ## # … with 176 more rows ``` --- ## Group-wise data tabulation ```r ms_pums_summary <- ms_pums %>% mutate(ba_above = SCHL %in% c("21", "22", "23", "24")) %>% group_by(PUMA, SEX_label) %>% summarize( total_pop = sum(PWGTP), mean_age = weighted.mean(AGEP, PWGTP), ba_above = sum(PWGTP[ba_above == TRUE & AGEP >= 25]), ba_above_pct = 100 * (ba_above / sum(PWGTP[AGEP >= 25])) ) ``` --- ```r ms_pums_summary ``` ``` ## # A tibble: 42 x 6 ## # Groups: PUMA [21] ## PUMA SEX_label total_pop mean_age ba_above ba_above_pct ## <chr> <ord> <dbl> <dbl> <dbl> <dbl> ## 1 00100 Male 86297 35.6 12061 22.1 ## 2 00100 Female 92713 37.7 17297 27.8 ## 3 00200 Male 71887 38.4 5933 12.4 ## 4 00200 Female 75595 41.1 8315 15.8 ## 5 00300 Male 56603 36.4 4498 12.4 ## 6 00300 Female 60428 38.9 6641 16.3 ## 7 00400 Male 69837 37.0 10173 23.0 ## 8 00400 Female 74481 39.1 11907 24.4 ## 9 00500 Male 69954 36.6 8238 18.2 ## 10 00500 Female 75230 38.7 11162 22.0 ## # … with 32 more rows ``` --- ## Advanced example: mapping PUMS data * Use the tigris package to obtain PUMA geometries and join data based on a common identifier for mapping ```r library(tigris) library(tmap) options(tigris_use_cache = TRUE) ms_pumas <- pumas("MS", cb = TRUE) joined_pumas <- ms_pumas %>% left_join(ms_pums_summary, by = c("PUMACE10" = "PUMA")) %>% filter(SEX_label == "Female") ``` --- .pull-left[ ```r ms_map <- tm_shape(joined_pumas) + tm_polygons(col = "ba_above_pct", palette = "Greens", title = "% of women with\ncollege degree") + tm_layout(legend.outside = TRUE, legend.outside.position = "right") ``` ] .pull-right[ ```r ms_map ``` ![](index_files/figure-html/show-ms-map-1.png)<!-- --> ] --- class: middle, center, inverse ## Calculating errors around tabulated estimates --- ## Survey design and the ACS PUMS * As we covered in the first workshop, the American Community Survey is based on a _sample_ of the US population and in turn subject to sampling error * The complex sample design of the ACS further requires appropriate specification when fitting statistical models using PUMS data * tidycensus, with help from the survey/srvyr packages, includes tools to assist with these tasks --- ## Getting replicate weights .pull-left[ * The Census Bureau publishes 80 "replicate weights" that can be used to calculate standard errors around tabulated estimates from microdata * The `rep_weights` parameter in `get_pums()` makes it easier for users to retrieve all of these variables without having to request all 80 directly. Choose `rep_weights = "person"` for person-weights or `"household"` for household weights ] .pull-right[ ```r ms_pums_replicate <- get_pums( variables = c("PUMA", "SEX", "AGEP", "SCHL"), state = "MS", survey = "acs5", year = 2019, recode = TRUE, rep_weights = "person" ) ``` ] --- ```r ms_pums_replicate ``` ``` ## # A tibble: 146,023 x 92 ## SERIALNO SPORDER AGEP PUMA ST SCHL SEX ST_label SCHL_label SEX_label ## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <ord> <ord> <ord> ## 1 20150000… 1 53 01100 28 16 2 Mississ… Regular h… Female ## 2 20150000… 2 33 01100 28 13 1 Mississ… Grade 10 Male ## 3 20150000… 3 32 01100 28 17 1 Mississ… GED or al… Male ## 4 20150000… 1 42 00500 28 21 1 Mississ… Bachelor'… Male ## 5 20150000… 2 37 00500 28 21 2 Mississ… Bachelor'… Female ## 6 20150000… 3 8 00500 28 05 1 Mississ… Grade 2 Male ## 7 20150000… 4 7 00500 28 04 1 Mississ… Grade 1 Male ## 8 20150000… 1 33 01300 28 18 2 Mississ… Some coll… Female ## 9 20150000… 2 15 01300 28 11 1 Mississ… Grade 8 Male ## 10 20150000… 3 5 01300 28 02 2 Mississ… Nursery s… Female ## # … with 146,013 more rows, and 82 more variables: WGTP <dbl>, PWGTP <dbl>, ## # PWGTP1 <dbl>, PWGTP2 <dbl>, PWGTP3 <dbl>, PWGTP4 <dbl>, PWGTP5 <dbl>, ## # PWGTP6 <dbl>, PWGTP7 <dbl>, PWGTP8 <dbl>, PWGTP9 <dbl>, PWGTP10 <dbl>, ## # PWGTP11 <dbl>, PWGTP12 <dbl>, PWGTP13 <dbl>, PWGTP14 <dbl>, PWGTP15 <dbl>, ## # PWGTP16 <dbl>, PWGTP17 <dbl>, PWGTP18 <dbl>, PWGTP19 <dbl>, PWGTP20 <dbl>, ## # PWGTP21 <dbl>, PWGTP22 <dbl>, PWGTP23 <dbl>, PWGTP24 <dbl>, PWGTP25 <dbl>, ## # PWGTP26 <dbl>, PWGTP27 <dbl>, PWGTP28 <dbl>, PWGTP29 <dbl>, PWGTP30 <dbl>, ## # PWGTP31 <dbl>, PWGTP32 <dbl>, PWGTP33 <dbl>, PWGTP34 <dbl>, PWGTP35 <dbl>, ## # PWGTP36 <dbl>, PWGTP37 <dbl>, PWGTP38 <dbl>, PWGTP39 <dbl>, PWGTP40 <dbl>, ## # PWGTP41 <dbl>, PWGTP42 <dbl>, PWGTP43 <dbl>, PWGTP44 <dbl>, PWGTP45 <dbl>, ## # PWGTP46 <dbl>, PWGTP47 <dbl>, PWGTP48 <dbl>, PWGTP49 <dbl>, PWGTP50 <dbl>, ## # PWGTP51 <dbl>, PWGTP52 <dbl>, PWGTP53 <dbl>, PWGTP54 <dbl>, PWGTP55 <dbl>, ## # PWGTP56 <dbl>, PWGTP57 <dbl>, PWGTP58 <dbl>, PWGTP59 <dbl>, PWGTP60 <dbl>, ## # PWGTP61 <dbl>, PWGTP62 <dbl>, PWGTP63 <dbl>, PWGTP64 <dbl>, PWGTP65 <dbl>, ## # PWGTP66 <dbl>, PWGTP67 <dbl>, PWGTP68 <dbl>, PWGTP69 <dbl>, PWGTP70 <dbl>, ## # PWGTP71 <dbl>, PWGTP72 <dbl>, PWGTP73 <dbl>, PWGTP74 <dbl>, PWGTP75 <dbl>, ## # PWGTP76 <dbl>, PWGTP77 <dbl>, PWGTP78 <dbl>, PWGTP79 <dbl>, PWGTP80 <dbl> ``` --- ## Creating a `survey` object .pull-left[ * The __survey__ package is the standard for handling complex survey samples in R * The more recent __srvyr__ package wraps __survey__ to allow the use of tidyverse functions on survey objects * tidycensus includes a function, `to_survey()`, to convert ACS microdata to survey/srvyr objects ] .pull-right[ ```r library(survey) library(srvyr) ms_pums_svy <- ms_pums_replicate %>% to_survey(type = "person", design = "rep_weights") class(ms_pums_svy) ``` ``` ## [1] "tbl_svy" "svyrep.design" ``` ] --- ## Calculating estimates and errors with __srvyr__ * srvyr's `survey_*()` family of functions automatically calculates standard errors around tabulated estimates for you ```r ms_pums_svy %>% survey_count(PUMA, SEX_label) ``` ``` ## # A tibble: 42 x 4 ## PUMA SEX_label n n_se ## <chr> <ord> <dbl> <dbl> ## 1 00100 Male 86297 272. ## 2 00100 Female 92713 119. ## 3 00200 Male 71887 402. ## 4 00200 Female 75595 333. ## 5 00300 Male 56603 715. ## 6 00300 Female 60428 293. ## 7 00400 Male 69837 652. ## 8 00400 Female 74481 458. ## 9 00500 Male 69954 233. ## 10 00500 Female 75230 258. ## # … with 32 more rows ``` --- ## Complex tabulations with __srvyr__ ```r ms_svy_summary <- ms_pums_svy %>% mutate(ba_above = SCHL %in% c("21", "22", "23", "24")) %>% filter(AGEP >= 25) %>% group_by(PUMA, SEX_label) %>% summarize( age_25_up = survey_total(vartype = "se"), ba_above_n = survey_total(ba_above, vartype = "se"), ba_above_prop = survey_mean(ba_above, vartype = "se") ) glimpse(ms_svy_summary) ``` ``` ## Rows: 42 ## Columns: 8 ## Groups: PUMA [21] ## $ PUMA <chr> "00100", "00100", "00200", "00200", "00300", "00300",… ## $ SEX_label <ord> Male, Female, Male, Female, Male, Female, Male, Femal… ## $ age_25_up <dbl> 54604, 62248, 47679, 52649, 36400, 40802, 44281, 4889… ## $ age_25_up_se <dbl> 240.3236, 107.8895, 350.4173, 272.1924, 468.8363, 198… ## $ ba_above_n <dbl> 12061, 17297, 5933, 8315, 4498, 6641, 10173, 11907, 8… ## $ ba_above_n_se <dbl> 568.4099, 607.9556, 414.7287, 463.0930, 335.5505, 400… ## $ ba_above_prop <dbl> 0.2208813, 0.2778724, 0.1244363, 0.1579327, 0.1235714… ## $ ba_above_prop_se <dbl> 0.010308248, 0.009783897, 0.008779868, 0.008863327, 0… ``` --- ## Converting standard errors to MOEs * To convert standard errors to the familiar margins of error at a 90 percent confidence level, multiply them by 1.645 ```r ms_svy_summary_moe <- ms_svy_summary %>% mutate(ba_above_prop_moe = ba_above_prop_se * 1.645) %>% filter(SEX_label == "Female") ggplot(ms_svy_summary_moe, aes(x = ba_above_prop, y = reorder(PUMA, ba_above_prop))) + geom_errorbarh(aes(xmin = ba_above_prop - ba_above_prop_moe, xmax = ba_above_prop + ba_above_prop_moe)) + geom_point(size = 3, color = "navy") + labs(title = "Percent of women age 25+ with 4-year college degree", subtitle = "PUMAs in Missisippi. Error bars represent margin of error at 90 percent confidence level.", x = "2015-2019 ACS estimate (from PUMS data)", y = "") + scale_x_continuous(labels = scales::percent) + theme_grey(base_size = 14) ``` --- <img src=img/errorbars.png style="width: 800px"> --- class: middle, center, inverse ## Advanced example: modeling with PUMS data --- ## Data setup * Question (adapted from the tidycensus docs by Matt Herman): what is the relationship between wages/class of worker and commute times in Rhode Island? ```r ri_pums_to_model <- get_pums( variables = c("PUMA", "WAGP", "JWMNP", "JWTRNS", "COW", "ESR"), state = "RI", survey = "acs5", year = 2019, rep_weights = "person" ) ``` --- ## Data preparation .pull-left[ * Appropriate estimation of standard errors requires accounting for the complex sample design of the PUMS and using replicate weights * For analysis of subpopulations, `srvyr::filter()` works like `survey::subset()` for appropriate standard error estimation ] .pull-right[ ```r ri_model_sd <- ri_pums_to_model %>% mutate( emp_type = case_when( COW %in% c("1", "2") ~ "private", COW %in% c("3", "4", "5") ~ "public", TRUE ~ "self" ) ) %>% to_survey() %>% filter( ESR == 1, # civilian employed JWTRNS != 11, # does not work at home WAGP > 0, # earned wages last year JWMNP > 0 # commute more than zero min ) ``` ] --- ## Fitting a model * The family of modeling functions in the __survey__ package should be used for modeling data in survey design objects, e.g. `survey::svyglm()` ```r model <- survey::svyglm(log(JWMNP) ~ log(WAGP) + emp_type + PUMA, design = ri_model_sd) ``` --- ## Evaluating the model ```r summary(model) ``` ``` ## ## Call: ## survey::svyglm(formula = log(JWMNP) ~ log(WAGP) + emp_type + ## PUMA, design = ri_model_sd) ## ## Survey design: ## Called via srvyr ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.725934 0.065017 26.546 < 2e-16 *** ## log(WAGP) 0.128040 0.006113 20.945 < 2e-16 *** ## emp_typepublic -0.102069 0.017574 -5.808 1.70e-07 *** ## emp_typeself -0.206041 0.032482 -6.343 1.92e-08 *** ## PUMA00102 -0.055908 0.019301 -2.897 0.00503 ** ## PUMA00103 -0.171204 0.022522 -7.602 9.88e-11 *** ## PUMA00104 -0.139590 0.025732 -5.425 7.84e-07 *** ## PUMA00201 -0.037179 0.022801 -1.631 0.10747 ## PUMA00300 -0.159596 0.025680 -6.215 3.26e-08 *** ## PUMA00400 -0.051718 0.022139 -2.336 0.02236 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 13447.08) ## ## Number of Fisher Scoring iterations: 2 ``` --- ## Part 2 exercises * Using the dataset you acquired from the exercises in Part 1 (or the example Wyoming dataset), tabulate a group-wise summary using the PWGTP column and dplyr functions as you've learned in this section. * Advanced follow-up: using `get_acs()`, attempt to acquire the same aggregated data from the ACS. Compare your tabulated estimate with the ACS estimate. * Second advanced follow-up: request the same data as before, but this time with replicate weights. Calculate the margin of error as you've learned in this section - and if you have time, compare with the posted ACS margin of error! --- class: middle, center, inverse ## International Census data resources in R --- ## International Census data * In the workshops to this point, we've focused on United States Census data. Demographic data and Censuses are international, however, and so are R resources! * Options for non-US demographic data are widespread in R, and include: -- - Global data from the US Census Bureau's International Data Base, and -- - Country-specific Census data packages from around the world. We'll review a sampling in a moment! --- ## The US Census International Data Base * The US Census Bureau International Database includes both historical data on demographic components (population, fertility, mortality, migration) and population projections to 2100 * idbr: my R package to query data from the IDB API. Not as smooth as tidycensus, but it works! * Functions: `idb1()` for the 1-year API and `idb5()` for the 5-year API (a bit of a misnomer, as it includes a wide range of yearly demographic variables) --- ## International data with idbr * Example: how has life expectancy at birth changed in Southeast Asia? ```r library(idbr) library(tidyverse) # Not necessary if you have the development version of idbr idb_api_key(Sys.getenv("CENSUS_API_KEY")) se_asia_lex <- idb5( country = c("Laos", "Vietnam", "Cambodia", "Thailand"), year = 1995:2021, variable = "E0", country_name = TRUE ) ggplot(se_asia_lex, aes(x = time, y = E0, color = NAME)) + theme_minimal(base_size = 18) + geom_line(size = 2) + labs(title = "Change in life expectancy, 1995-2021", y = "Life expectancy at birth", x = "Year") ``` --- <img src=img/lex_plot.png style="width: 850px"> --- ## International data with idbr .pull-left[ * Example: what 10 countries have the highest total fertility rates in 2021? ```r top_tfr <- idb5( country = "all", year = 2021, variable = "TFR", country_name = TRUE ) %>% slice_max(TFR, n = 10) ``` ] .pull-right[ ```r top_tfr ``` ``` ## # A tibble: 10 x 4 ## TFR NAME FIPS time ## <dbl> <chr> <chr> <dbl> ## 1 6.91 Niger NG 2021 ## 2 5.90 Angola AO 2021 ## 3 5.7 Congo (Kinshasa) CG 2021 ## 4 5.63 Mali ML 2021 ## 5 5.57 Chad CD 2021 ## 6 5.47 Benin BN 2021 ## 7 5.45 Uganda UG 2021 ## 8 5.43 South Sudan OD 2021 ## 9 5.41 Somalia SO 2021 ## 10 5.10 Burundi BY 2021 ``` ] --- ## International population pyramids <img src=img/nigeria_pyramid-1.gif style="width: 500px"> .footnote[See more at [Ari Lamstein's blog](https://arilamstein.com/blog/2016/06/06/idbr-access-us-census-bureau-international-data-base-r/)] --- ## Canada: cancensus .pull-left[ <img src="https://raw.githubusercontent.com/mountainMath/cancensus/master/images/cancensus-sticker.png" style="width: 400px"> ] .pull-right[ * [The cancensus R package](https://mountainmath.github.io/cancensus/index.html) grants comprehensive access to Canadian Census data through the [CensusMapper APIs](https://censusmapper.ca/) ```r library(cancensus) library(tidyverse) library(sf) options(cancensus.api_key = Sys.getenv("CANCENSUS_API_KEY")) # View(list_census_vectors("CA16")) ``` ] --- ## Canada: cancensus * cancensus allows for different datasets, regions, and geographic levels to be queried * Simple feature geometry is available if requested ```r montreal_english <- get_census( dataset = "CA16", regions = list(CMA = "24462"), vectors = "v_CA16_1364", level = "CT", geo_format = "sf", labels = "short" ) ``` --- ## Mapping Canadian Census data ```r tmap_mode("view") montreal_pct <- montreal_english %>% mutate(pct_english = 100 * (v_CA16_1364 / Population)) tm_shape(montreal_pct) + tm_polygons(col = "pct_english", alpha = 0.5, palette = "viridis", style = "jenks", n = 7, title = "Percent speaking<br/>English at home") ``` --- <img src=img/montreal.png style="width: 800px"> --- ## Mexico: mxmaps/inegiR * Used together, the [mxmaps](https://www.diegovalle.net/mxmaps/) and [inegiR](https://github.com/Eflores89/inegiR) R packages allow for geographic analysis and visualization of Mexican Census data * Get an API token at http://www3.inegi.org.mx//sistemas/api/indicadores/v1/tokenVerify.aspx --- ## Mexico: mxmaps/inegiR ```r # remotes::install_github("diegovalle/mxmaps") library(mxmaps) token_inegi <- Sys.getenv("INEGI_API_KEY") state_regions <- df_mxstate_2020$region choropleth_inegi(token_inegi, state_regions, indicator = "3104003001", legend = "%", title = "Percentage born outside\nstate of residence") + theme_void(base_size = 18) ``` --- <img src=img/mx_map.png style="width: 850px"> --- .pull-left[ ```r library(inegiR) token_inegi <- Sys.getenv("INEGI_API_KEY") state_regions <- mxmaps::df_mxstate_2020$region pct_migrants <- map_df(state_regions, ~{ inegi_series(series_id = "3104003001", token = token_inegi, geography = .x, database = "BISE") %>% mutate(state_code = .x) }) ``` ] .pull-right[ ```r head(pct_migrants) ``` ``` ## date date_shortcut values notes state_code ## 1 <NA> Q1 6.182882 01 ## 2 <NA> Q1 7.725998 02 ## 3 <NA> Q1 11.723691 03 ## 4 <NA> Q1 4.443911 04 ## 5 <NA> Q1 4.581990 05 ## 6 <NA> Q1 8.135314 06 ``` ] --- ## Brazil: geobr .pull-left[ <img src="https://raw.githubusercontent.com/ipeaGIT/geobr/master/r-package/man/figures/geobr_logo_b.png" style="width: 400px"> ] .pull-right[ * Package from [IPEA](https://www.ipea.gov.br/portal/) that helps you load shapes for many Brazilian geographies, including Census boundaries ```r library(geobr) rj_tracts <- read_census_tract(code_tract = 3304557) ``` ] --- ```r mapview::mapview(rj_tracts) ``` <img src=img/rj_tracts.png style="width: 800px"> --- ## Kenya: RKenyaCensus .pull-left[ * The [rKenyaCensus package](https://github.com/Shelmith-Kariuki/rKenyaCensus) by [Shel Kariuki](https://shelkariuki.netlify.app/) makes indicators from the 2019 Kenya Population and Housing Census * Tables install directly with the package and can be accessed by name ] .pull-right[ ```r # remotes::install_github("Shelmith-Kariuki/rKenyaCensus") library(rKenyaCensus) library(tidyverse) religion <- V4_T2.30 %>% filter(County != "KENYA") %>% mutate(county_title = str_to_title(County), prop_islam = Islam / Total) ``` ] --- ## Visualizing Kenyan Census data ```r ggplot(religion, aes(x = prop_islam, y = reorder(county_title, prop_islam))) + geom_col(fill = "navy", alpha = 0.6) + theme_minimal(base_size = 12.5) + scale_x_continuous(labels = scales::percent) + labs(title = "Percent Muslim by County in Kenya", subtitle = "2019 Kenyan Census", x = "", y = "", caption = "Data source: rKenyaCensus R package") ``` --- <img src=img/kenya_plot.png style="width: 900px"> --- ## Mapping Kenyan Census data * The rKenyaCensus package includes county boundaries to facilitate mapping of the various indicators in the Census ```r kenya_islam_map <- KenyaCounties_SHP %>% sf::st_as_sf() %>% mutate(County = str_remove(County, " CITY")) %>% left_join(religion, by = "County") %>% mutate(pct_islam = 100 * prop_islam) tm_shape(kenya_islam_map) + tm_polygons(col = "pct_islam", palette = "viridis", style = "jenks", n = 7, title = "% Muslim", alpha = 0.6) ``` --- <img src=img/kenya_map.png style="width: 700px"> --- ## Other data resources * Many other international data resources in R exist, e.g. [nomisr](https://github.com/ropensci/nomisr) for UK data and [afrimapr](https://afrimapr.github.io/afrimapr.website/) for Africa-wide datasets * The [IPUMS International project](https://international.ipums.org/international/) has microdata for over 100 countries; use IPUMS data with the [ipumsr R package](https://github.com/mnpopcenter/ipumsr) --- ## Part 3 exercises * Choose a country from the examples above and explore further. Try making a map or a chart for a new variable for a country of your choice! --- class: middle, center, inverse ## Thank you!