Geocoding and custom subsets¶
pygris offers an interface to the US Census Bureau's Geocoding and GeoLookup APIs. The Census geocoder is typically not as accurate as commercial alternatives as it is based on street centerlines geocoding referenced to the TIGER/Line database; see Prener and Fox (2021) for more details. For analysts who need to do geocoding in the United States, however, it is a useful tool for a few reasons:
- It is free to use without an API key;
- It offers both single-address and batch geocoding options;
- It can return the GEOID of the Census geographic unit in which a given address or coordinate pair falls.
Geocoding functionality in pygris is found in the geocode
module. The geocode()
function takes either a single-line address or an address specified as arguments and returns a Pandas dataframe of possible matches, subject to the value of the limit
argument.
from pygris.geocode import geocode
geocode(address = "1600 Pennsylvania Ave, Washington DC")
longitude | latitude | GEOID | address | |
---|---|---|---|---|
0 | -77.036541 | 38.898744 | 110019800001034 | 1600 Pennsylvania Ave, Washington DC |
The estimated longitude and latitude of the location, along with the estimated Census block GEOID, are returned. As a convenience, users can also specify the option as_gdf = True
to return a GeoDataFrame for mapping or analysis.
geocode(address = "1600 Pennsylvania Ave, Washington DC", as_gdf = True).explore(marker_type = "marker")
Given that the Census geocoder uses street centerlines geocoding, not rooftop geocoding, the point will be placed on the street near the location, not necessarily right on top of it.
The geolookup()
function can be used to find the Census GEOID of a location defined by X and Y coordinates.
from pygris.geocode import geolookup
geolookup(longitude = -98.90629, latitude= 32.75639)
GEOID | longitude | latitude | |
---|---|---|---|
0 | 484299502001041 | -98.90629 | 32.75639 |
Batch geocoding of addresses in a Pandas DataFrame can be accomplished with the batch_geocode
function. batch_geocode()
takes columns describing addresses and sends them to the Census Bureau's batch geocoding API. The object returned includes estimated longitude and latitude coordinates for each address along with contextual information about the Census geographies that the addresses fall within.
Let's take a look at some sample addresses from New York City.
import pandas as pd
from pygris.geocode import batch_geocode
my_addresses = pd.DataFrame(
{"building": ["Chrysler Building", "Empire State Building", "Flatiron Building"],
"address": ["405 Lexington Ave", "20 W 34th St", "175 5th Ave"],
"city": "New York",
"state": "New York",
"zip": ["10174", "10018", "10010"]}
)
my_addresses
building | address | city | state | zip | |
---|---|---|---|---|---|
0 | Chrysler Building | 405 Lexington Ave | New York | New York | 10174 |
1 | Empire State Building | 20 W 34th St | New York | New York | 10018 |
2 | Flatiron Building | 175 5th Ave | New York | New York | 10010 |
We can geocode the addresses by passing the dataframe to the batch_geocode()
function and mapping the columns as arguments to the function appropriately.
my_points = batch_geocode(my_addresses, id_column = "building",
address = "address", city = "city", state = "state",
zip = "zip", as_gdf = True)
my_points
id | address | status | match_quality | matched_address | tiger_line_id | tiger_side | state | county | tract | block | longitude | latitude | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Flatiron Building | 175 5th Ave, New York, New York, 10010 | Match | Exact | 175 5TH AVE, NEW YORK, NY, 10010 | 59653610 | R | 36 | 61 | 5600 | 2002 | -73.99002339099997 | 40.74093396000006 | POINT (-73.99002 40.74093) |
1 | Empire State Building | 20 W 34th St, New York, New York, 10018 | Match | Non_Exact | 20 W 34TH ST, NEW YORK, NY, 10118 | 59653429 | L | 36 | 61 | 7600 | 1001 | -73.98533698799997 | 40.748757279000074 | POINT (-73.98534 40.74876) |
2 | Chrysler Building | 405 Lexington Ave, New York, New York, 10174 | Match | Exact | 405 LEXINGTON AVE, NEW YORK, NY, 10174 | 59658691 | R | 36 | 61 | 9200 | 1003 | -73.97592741299997 | 40.75159803200006 | POINT (-73.97593 40.75160) |
An interactive map helps us audit the geocoding result:
my_points.explore(marker_type = "marker")
While the geocoded points for the Chrysler Building and the Empire State Building are right next to the actual buildings, the Flatiron Building is geocoded one block away. This underscores the point that users who need very high accuracy should probably look to other geocoding resources. However, if these results are tolerable, the Census geocoder offers a compelling option for users who require a free batch geocoder.
Custom subsets of Census shapefiles¶
Users of pygris will typically request data by state or county, which are the geographies at which Census shapefiles are usually available. However, there may be instances where more custom subsets are required. As pygris relies on geopandas.read_file()
to read in shapefiles, it benefits from the options available to read in specific rows from those shapefiles only.
The parameter subset_by
will return a custom subset of a Census shapefile depending on the type of object passed as an argument. Options are as follows:
- If a user supplies a tuple of format (minx, miny, maxx, maxy), it will be interpreted as a bounding box and rows will be returned that intersect that bounding box;
- If a user supplies a integer or a slice object, the first n rows (or the rows defined by the slice object) will be returned;
- If a user supplies an object of type geopandas.GeoDataFrame or of type geopandas.GeoSeries, rows that intersect the input object will be returned. CRS misalignment will be resolved internally;
- Users can even leverage the geocoding functionality in pygris to pass a dict object of format
{"address": "buffer_dist"}
, which will return geographies that intersect a given buffer distance (specified in meters) of a geocoded address.
Let's take a look at how this works. Say we want to retrieve Census tracts within 5km of the Texas State Capitol in Austin. This is straightforward with subset_by
:
from pygris import tracts
capitol_tracts = tracts(state = "TX", cb = True,
subset_by = {"1100 Congress Ave., Austin, TX 78701": 5000})
capitol_tracts.explore()
Using the default year of 2021 Using FIPS code '48' for input 'TX'
Getting data for multiple states is not quite as straightforward. Let's say you want all the tracts within 10km of Subaru Park, home of the Philadelphia Union soccer team in Chester, PA. This buffer will cross into New Jersey and into Delaware from Pennsylvania.
One method is to use a list comprehension to store state-wise tracts in a list, then use pandas.concat()
to assemble the result:
import pandas as pd
union_tracts_list = [tracts(cb = True, state = x, subset_by = {"2501 Seaport Dr, Chester, PA 19013": 10000}) for x in ['DE', 'PA', 'NJ']]
union_tracts = pd.concat(union_tracts_list)
union_tracts.explore()
Using the default year of 2021 Using FIPS code '10' for input 'DE' Using the default year of 2021 Using FIPS code '42' for input 'PA' Using the default year of 2021 Using FIPS code '34' for input 'NJ'
The other alternative is to pull down a national tracts shapefile by omitting the state and subsetting with subset_by
. Given that the file will first have to be downloaded, it is recommended that you use cache = True
to store this file in a local cache. Once you have it, however, you can subset for whatever locations you need, when you need!