DON’T use data for individual blocks. Instead, aggregate data into larger areas, or use statistical models that combine data from many blocks. Block data are published to permit the analysis of user-constructed geographic areas composed of multiple blocks, for example, new voting districts that consist of collections of blocks within a politically defined geography.
This isn’t likely to be satisfactory advice for analysts for a couple key reasons. First, analysts working with Census data in rural areas often need block-level data to understand demographic trends, as block groups (the next level up in the Census hierarchy) may be too large in sparsely-populated areas. Second, “aggregating data” is not as simple as it sounds in the quote. Creating data aggregations requires an understanding of techniques in GIS and data science that may be beyond the knowledge of the average Census data user.
In this post, I’ll illustrate a technique for creating custom regions from Census block data. We’ll be using the pygeoda package for this task, a Python wrapper of the C++ library that powers GeoDa, a GUI tool for exploratory spatial data analysis and spatial modeling. Working with GeoDa in this way is particularly fun for me. I was a qualitative geographer in graduate school before encountering GeoDa. GeoDa was the tool that sparked an interested in spatial data science for me and in many ways motivated my eventual career path.
Let’s grab some block data using pygris for Delta County, Texas, a rural county of about 5,000 residents northeast of the Dallas-Fort Worth metro area. If you haven’t previously cached the Texas block shapefile, this will take a few minutes to download.
import geopandas as gpimport pygeodafrom pygris import blocks, block_groupsfrom pygris.data import get_census# Get the block data for a county in Texasdelta_blocks = blocks(state ="TX", county ="Delta", year =2020, cache =True)
Using FIPS code '48' for input 'TX'
Using FIPS code '119' for input 'Delta'
Given that Delta County is fairly small, we can use .explore() to make a performant interactive map of the 571 Census blocks.
delta_blocks.explore(tooltip =False, popup =True)
Make this Notebook Trusted to load map: File -> Trust Notebook
There is one town of reasonable size in Delta County, Cooper. However, Census geography above the block level makes any sort of demographic analysis tricky. For example, we can briefly review block groups in Delta County:
delta_bgs = block_groups(state ="TX", county ="Delta", year =2020, cache =True)delta_bgs.explore(tooltip =False, popup =True)
Using FIPS code '48' for input 'TX'
Using FIPS code '119' for input 'Delta'
Make this Notebook Trusted to load map: File -> Trust Notebook
We see that the 571 blocks in Delta County are organized into only 4 block groups. Cooper is bisected by two block groups, both of which include area outside the built-up area of the town. This means that we can’t really use block groups to do more detailed analysis of Cooper’s in-town demographics, and any other settlements in the county (like Pecan Gap in the northwest) are subsumed by much larger block groups.
The solution we’ll use is regionalization. Regionalization is the process of building larger, aggregated areas from smaller geographies in ways that are spatially coherent and account for the characteristics of those small areas. To get started, let’s grab some demographic data from the new Demographic and Housing Characteristics file. While there are many ways to get Census data in Python, pygris has a lower-level function, get_census(), to help you grab data from the Census API to merge to your Census shapes.
We can then merge our block-level Census data to our block geometries and calculate some derived columns. As the block shapes acquired with blocks() have a column on land area, ALAND20, we can calculate population density; we’ll also calculate the percentage of the block population that is non-Hispanic white. These columns will be used in the regionalization algorithm to cluster together demographically similar blocks.
After preparing our Census data, we can now move to spatial analysis. pygeoda.open() convers a GeoPandas GeoDataFrame to an object suitable for use with pygeoda. Next, we’ll create spatial weights to represent spatial relationships between Census blocks. We’ll use rook weights, which means that blocks are considered to be neighbors if they share at least one line segment between them. This step is critical for regionalization as we want to ensure that our regions are spatially coherent.
Weights Meta-data:
number of observations: 571
is symmetric: True
sparsity: 0.008244361905404535
# min neighbors: 1
# max neighbors: 22
# mean neighbors: 4.7075306479859895
# median neighbors: 4.0
has isolates: False
We get some basic information about the weights object. The least-connected block has 1 neighbor, and the most-connected block has 22; the median number of neighbors is 4.
With this information in hand, we can run the regionalization algorithm. The algorithm we’ll choose is Max-p. Max-P regionalization attempts to find the maximum number of clusters that are spatially contiguous and exceed a given size threshold while maximizing within-cluster homogeneity. We’ll use population density and percent non-Hispanic white as our clustering variables, and total population as our bounding variable. Setting min_bound to 100 tells the algorithm that each derived region must have at least 100 people in it. A couple other technical details: max-p is highly sensitive to the algorithm’s starting point, so it is recommended to set a random-number seed (and potentially evaluate results among multiple seeds). For more stable performance, cpu_threads = 1 should also be used.
Thanks to the C++ back-end of pygeoda, the function is lightning-fast. The function returns a dict object with information about the regionalization solution; we’ll pluck the clusters out from it and assign it to our original GeoDataFrame as a column. We’ll then use a dissolve() operation to build new geographies from our regionalization solution, and calculate some derived statistics for those regions.
We see that the algorithm built 39 regions in Delta County. The minimum population of any region is 101, which indicates that our minimum population threshold specification was satisfied. Let’s take a look at the derived regions:
Make this Notebook Trusted to load map: File -> Trust Notebook
Zooming into Cooper shows that we have several new geographies built from Census blocks in the town, allowing us to do more detailed demographic analysis. Additionally, Pecan Gap gets its own aggregated geography in this solution. We can also use these regions in future data work, as block IDs are mapped to regions in the delta_geo object we created.
Regionalization is a powerful tool in spatial data science, and it can be used to solve problems in a wide range of fields. If you want to learn more, check out my workshops or send me a note to discuss further!