Conclusion

As its title suggests, this book is an overview of methods, maps, and models that can be used to complete applied social research projects with R and Census data. As the breadth of approaches and packages covered in the book suggests, R’s ecosystems for working on these topics are large and constantly changing. While I have chosen to focus on packages that I have developed (tidycensus, tigris, mapboxapi, crsuggest, and idbr) and frameworks like tidyverse and sf that integrate with those packages, there are many other directions you could go to learn more.

Census data analysts may be interested in delving deeper into the packages introduced in Chapter 11, such as censusapi, ipumsr, and lehdr. Geospatial analysts will want to read Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019) to gain a strong command of R’s capabilities for spatial data. Those interested in machine learning and prediction, which is not a major focus of this book, should read Ken Steif’s Public Policy Analytics (Steif 2021), which uses some of the skills learned in this book in applied machine learning workflows for public policy. Readers who want to learn more about modeling, especially within a tidyverse framework, should keep an eye on Tidy Modeling with R (Kuhn and Silge 2021), which is not yet complete at the time of this writing but when finished will offer a comprehensive overview of how to model data in a tidyverse-friendly way.

My own packages outlined in this book will also be updated as new data are made available. The top development priority for tidycensus is making access to 2020 Census data seamless as they are released to data.census.gov and the Census API, allowing users to work with 2020 Census data in the exact same way they’ve been working with other datasets in the package. Other feature suggestions are welcome on the tidycensus GitHub issues page, as are user contributions!

My future development work on these packages will certainly be done with recent threats to Census data quality in mind. While we are very fortunate in the R community to have such a breadth of data resources at our fingertips, it is important to be aware of how ephemeral those resources can be. The Census Bureau did not release its typical 2020 1-year ACS estimates due to data collection problems during the COVID-19 pandemic, breaking time series workflows that use 1-year ACS data. While the 5-year 2016-2020 ACS dataset was still released, larger margins of error due to COVID-19 data collection problems will be present in ACS samples for the next several years.

The Census data community has also been grappling with the tension between preserving respondent privacy and maintaining data quality. The 2020 decennial Census data are released using differential privacy, an approach that introduces errors into datasets in an attempt to preserve overall population characteristics while limiting the possibility of re-identifying individuals in the data (Abowd 2018). Advocates for differential privacy argue that disclosure avoidance techniques are necessary to prevent re-identification attacks, a significant concern for the Census Bureau given modern database reconstruction technologies (Hawes 2020). Critics of this approach, however, have argued that differential privacy will have a disastrous impact on Census data quality (Ruggles et al. 2019), potentially making microdata and block-level aggregate data unusable. The loss in data quality also threatens to disproportionately impact rural populations and racial and ethnic minorities. This will make it difficult to evaluate racial health disparities (Santos-Lozada, Howard, and Verdery 2020), accurately estimate COVID-19 mortality rates (Hauer and Santos-Lozada 2021), and complete countless other analyses that require access to high-quality small-area Census data. More broadly, Ruggles and Van Riper (2021) argue that the results of re-identification attacks based on database reconstruction are little different from what a random number generator would produce, and that the information at risk from Census data re-identification is already publicly available on the internet.

Census data have also faced threats from higher levels of authority. A major initiative of the Trump Administration was to introduce a citizenship question on the decennial Census; that question is currently asked only on the American Community Survey. This move, which was ultimately rejected by the United States Supreme Court, was widely interpreted as an effort to suppress responses in nonwhite and immigrant communities (Frey 2019). It also may have been a precursor to an effort to make Congressional apportionment contingent only on the citizen population rather than the entire population as is mandated in the US Constitution. Although this effort did not succeed, the Trump Administration engaged in other efforts to underfund and undermine the success of the Census, problems that were exacerbated by data collection difficulties during the COVID-19 pandemic (Bahrampour, Rabinowitz, and Mellnik 2021).

This discussion is not intended to conclude the book on a cynical note. Rather, it is to reinforce one of the book’s central takeaways: the critical importance of high-quality, open data and free and open source software to analyze those data. Census data are a democratizing force in many ways. They allow communities to analyze their own characteristics and use that information to solve problems that may otherwise be overlooked by higher levels of government. They allow analysts to independently evaluate redistricting scenarios and call them out if they are disenfranchising local residents. They help us “see” latent inequalities that manifest themselves within societies and propose solutions to rectify them.

These initiatives are facilitated not only by open data, but by open tools to analyze them. As discussed earlier in this book, even open government datasets have traditionally been hard to work with. They may have been stored in bulky datasets that required navigation of complex file systems, and required expensive commercial software platforms to process and analyze them. In contrast, resources like R and the Census API put Census data queries at the fingertips of the user, and integrate with data analysis tools in ways that help analysts get to insights faster. This is not to dismiss the learning curve of R and the methods and skills necessary to generate and understand those insights. Rather, it is to stress that open data and open source software reduce financial and logistical barriers to access, creating a larger and more diverse user community that can generate unique insights and help solve problems. Contributing to this effort is one of my main reasons for writing this book and publishing it as an open website, and I very much appreciate all of you who have taken the time to read through it.
