12 Working with Census data outside the United States

Although the methods presented in this book are generalizable, the examples given in the book to this point have focused on the United States. Readers of this book are most certainly not limited to the United States in their projects, and will be interested in how to apply the methods presented to examples around the world. International Census & demographic data is the focus of this chapter. The first section focuses on global demographic indicators for between-country comparisons and gives an overview of the idbr R package for accessing these indicators through the US Census Bureau’s International Data Base. The sections that follow cover country-specific packages from around the world that deliver Census data to R users in similar ways to the US-focused packages covered earlier in this book. Examples from Canada, Mexico, Brazil, and Kenya illustrate how to apply the methods readers have learned in this book to Census data analyses in other parts of the world.

12.1 The International Data Base and the idbr R package

The US Census Bureau’s International Database (IDB) is a repository of dozens of demographic indicators for over 200 countries around the world. The IDB includes both historical information by year for most countries as well as population projections to 2100. In-place demographic characteristics are derived from a wide range of international Censuses and surveys, and future projections are estimated with the cohort-component method.

The Census Bureau makes these datasets available to researchers through an interactive data tool and also through its API, allowing for programmatic data access. The idbr R package (K. Walker 2021b), first released in 2016 and updated to version 1.0 in 2021, uses a simple R interface to help users gain access to and analyze data from the IDB.

To get started with idbr, users can install the package from CRAN and set their Census API keys with the idb_api_key() function. If a user’s Census API key is already installed using tidycensus, idbr will pick it up making this step unnecessary.

library(idbr)
# Unnecessary if API key is installed with tidycensus
idb_api_key("YOUR KEY GOES HERE")

The core function implemented in idbr is get_idb(), which can access all datasets available in the IDB. There are two main datasets available: the 1-year-of-age population dataset, which allows for population data by country to be downloaded broken down by age and sex; and the 5-year-of-age dataset, which gives access to population by 5-year age bands but also many other fertility, mortality, and migration indicators that are not organized by age bands.

A basic IDB query uses a country, specified as either an ISO-2 code or a country name in English, a vector of one or more variables, and a vector of one or more years. For example, we can fetch historical and projected population for Nigeria from 1990 to 2100:

nigeria_pop <- get_idb(
  country = "Nigeria",
  variables = "pop",
  year = 1990:2100
)
Table 12.1: Nigeria population data acquired with idbr
code year name pop
NG 1990 Nigeria 97041753
NG 1991 Nigeria 99406734
NG 1992 Nigeria 101837787
NG 1993 Nigeria 104387403
NG 1994 Nigeria 107053762
NG 1995 Nigeria 109838554
NG 1996 Nigeria 112744016
NG 1997 Nigeria 115773046
NG 1998 Nigeria 118927944
NG 1999 Nigeria 122200766

The result can then be visualized as a time series using ggplot2.

library(tidyverse)

ggplot(nigeria_pop, aes(x = year, y = pop)) + 
  geom_line(color = "darkgreen") + 
  theme_minimal() + 
  scale_y_continuous(labels = scales::label_number_si()) + 
  labs(title = "Population of Nigeria",
       subtitle = "1990 to 2100 (projected)",
       x = "Year",
       y = "Population at midyear")
Historical and projected population of Nigeria

Figure 12.1: Historical and projected population of Nigeria

idbr includes functionality to help users look up variable codes for use in their function calls. The function idb_variables() returns a data frame with all variables available in the IDB, along with an informative label; the idb_concepts() function prints out a list of concepts that can be supplied to the concept parameter and will return a group of variables that belong to the same concept (e.g. mortality rates, components of population growth).

For worldwide analyses, users can specify the argument country = "all" to return all available countries in the IDB. In concert with tidyverse tools, this makes global comparative analyses straightforward. In the example below, we get data on life expectancy at birth for all countries in 2021, then view the top 10 and bottom 10 countries.

world_lex <- get_idb(
  country = "all",
  variables = "e0",
  year = 2021
)

world_lex %>% 
  slice_max(e0, n = 10)
Table 12.2: Countries with the longest life expectancies at birth, 2021
code year name e0
MC 2021 Monaco 89.40
SG 2021 Singapore 86.19
MO 2021 Macau 84.81
JP 2021 Japan 84.65
SM 2021 San Marino 83.68
CA 2021 Canada 83.62
IS 2021 Iceland 83.45
HK 2021 Hong Kong 83.41
AD 2021 Andorra 83.23
IL 2021 Israel 83.15
world_lex %>% 
  slice_min(e0, n = 10)
Table 12.3: Countries with the shortest life expectancies at birth, 2021
code year name e0
AF 2021 Afghanistan 53.25
CF 2021 Central African Republic 55.07
SO 2021 Somalia 55.32
MZ 2021 Mozambique 56.49
SL 2021 Sierra Leone 58.45
SS 2021 South Sudan 58.60
TD 2021 Chad 58.73
LS 2021 Lesotho 58.90
SZ 2021 Eswatini 59.13
NE 2021 Niger 59.70

12.1.1 Visualizing IDB data

idbr aims to return data from the IDB in a structure suitable for creative data visualizations. This includes population pyramids (like those introduced in Chapter 4) and world maps.

As mentioned earlier in this chapter, get_idb() can retrieve population data by age and sex in one-year age bands if the age or sex arguments are supplied but variables is left blank. An illustration of this is found in the example below, where get_idb() is used to get 1-year-of-age population data for 2021 in Japan, broken down by sex.

japan_data <- get_idb(
  country = "Japan",
  year = 2021,
  age = 0:100,
  sex = c("male", "female")
) 
Table 12.4: 1-year age band population data for Japan in 2021
code year name pop sex age
JP 2021 Japan 450869 Male 0
JP 2021 Japan 458588 Male 1
JP 2021 Japan 478968 Male 2
JP 2021 Japan 503320 Male 3
JP 2021 Japan 519507 Male 4
JP 2021 Japan 535623 Male 5
JP 2021 Japan 544629 Male 6
JP 2021 Japan 553183 Male 7
JP 2021 Japan 563710 Male 8
JP 2021 Japan 567511 Male 9

Following the example illustrated in Section 4.5.2, a population pyramid can be created by changing data values for one sex to negative, then plotting the data as back-to-back horizontal bar charts.

japan_data %>%
  mutate(pop = ifelse(sex == "Male", pop * -1, pop)) %>%
  ggplot(aes(x = pop, y = as.factor(age), fill = sex)) + 
  geom_col(width = 1) + 
  theme_minimal(base_size = 12) + 
  scale_x_continuous(labels = function(x) paste0(abs(x / 1000000), "m")) + 
  scale_y_discrete(breaks = scales::pretty_breaks(n = 10)) + 
  scale_fill_manual(values = c("darkred", "navy")) + 
  labs(title = "Population structure of Japan in 2021",
       x = "Population",
       y = "Age",
       fill = "")
Population pyramid for Japan

Figure 12.2: Population pyramid for Japan

The aging structure of Japan’s population is notable, especially the number of female centenarians (women aged 100 and up) that stand out on the chart, a number that is likely to grow in the years ahead.

get_idb() also includes a geometry parameter to help users make world maps of demographic indicators. By setting geometry = TRUE, get_idb() returns a simple features object with a geometry column attached to the demographic data. Here, we fetch data on total fertility rate by country in 2021.

library(idbr)
library(tidyverse)

fertility_data <- get_idb(
  country = "all",
  year = 2021,
  variables = "tfr",
  geometry = TRUE,
)

There are a couple points of note with respect to the data returned. The dataset leaves out a number of small countries that would not appear on a zoomed-out world map. To ensure that they are retained, use the argument resolution = "high" for a higher-resolution dataset (which you may also want for regional mapping). There are also rows returned as NA in the dataset; this is to fill in areas that should appear on the map (like Antarctica) but do not have data in the IDB.

The data returned can be mapped with geom_sf() from ggplot2 or any other mapping tool introduced in this book. Specifying a suitable map projection with coord_sf() for the whole world - like the Robinson projection used below - is recommended.

ggplot(fertility_data, aes(fill = tfr)) + 
  theme_bw() + 
  geom_sf() + 
  coord_sf(crs = 'ESRI:54030') + 
  scale_fill_viridis_c() + 
  labs(fill = "Total fertility\nrate (2021)")
Map of total fertility rates by country in 2021

Figure 12.3: Map of total fertility rates by country in 2021

12.1.2 Interactive and animated visualization of global demographic data

In Chapter 6, readers learned how to create interactive maps of US Census data with the tmap and leaflet packages. While the same methods can be applied to the global data returned by get_idb(), the Web Mercator projection used by the Leaflet library is not ideal for world maps as it inflates the area of countries near the poles relative to countries near the Equator. This problem with Web Mercator was covered in Section 6.5.2 and is particularly important for global maps. In turn, it is preferable to turn to other interactive charting libraries such as ggiraph.

To get started, let’s return to our global fertility data and create a new column that will store information we want to include in our tooltip that appears when a viewer hovers over a country.

fertility_data$tooltip <- paste0(fertility_data$name,
                                 ": ",
                                 round(fertility_data$tfr, 2))

We now set up the visualization. Note that instead of using geom_sf(), we use ggiraph’s geom_sf_interactive(), which includes some new aesthetics. We map our new tooltip column to the tooltip aesthetic, which will generate a hover tooltip, and we map the code column to the data_id aesthetic, which will change the country’s color when hovered over. The girafe() function then renders the interactive graphic; we can specify that we want map zooming with girafe_options().

library(ggiraph)

fertility_map <- ggplot(fertility_data, aes(fill = tfr)) + 
  theme_bw() + 
  geom_sf_interactive(aes(tooltip = tooltip, data_id = code), size = 0.1) + 
  coord_sf(crs = 'ESRI:54030') + 
  scale_fill_viridis_c() + 
  labs(fill = "Total fertility\nrate (2021)")

girafe(ggobj = fertility_map) %>%
  girafe_options(opts_zoom(max = 10))

Figure 12.4: Interactive world map with ggiraph

The time-series data availability in the IDB also works well for creating animated time-series graphics with help from the gganimate package (Pedersen and Robinson 2020). gganimate extends ggplot2 to animate sequences of plots using intuitive syntax to ggplot2 users. Earlier in this chapter, we learned how to create a population pyramid from IDB data for a single year (2021). If data are requested instead for multiple years, gganimate can be used to transition from year to year in an animated population pyramid, showing how the population structure of a country has evolved (and is projected to evolve) over time.

library(idbr)
library(tidyverse)
library(gganimate)

mx_pyramid_data <- get_idb(
  country = "Mexico",
  sex = c("male", "female"),
  age = 0:100,
  year = 1990:2050
) %>%
  mutate(pop = ifelse(sex == "Male", pop * -1, pop))

mx_animation <- ggplot(mx_pyramid_data, 
                       aes(x = pop, 
                           y = as.factor(age), 
                           fill = sex)) + 
  geom_col(width = 1) + 
  theme_minimal(base_size = 16) + 
  scale_x_continuous(labels = function(x) paste0(abs(x / 1000000), 
                                                 "m")) + 
  scale_y_discrete(breaks = scales::pretty_breaks(n = 10)) + 
  scale_fill_manual(values = c("darkred", "darkgreen")) + 
  transition_states(year) + 
  labs(title = "Population structure of Mexico in {closest_state}",
       x = "Population",
       y = "Age",
       fill = "")

animate(mx_animation, height = 500, width = 800)
Animated time-series population pyramid of Mexico

Figure 12.5: Animated time-series population pyramid of Mexico

The animation illustrates the aging of Mexico’s population, both observed and projected, during the specified time period.

12.2 Country-specific Census data packages

Analysts around the world need access to high-quality tabular and spatial demographic data to make decisions much like the data that are available from the US Census Bureau. While Censuses are conducted by countries around the world, few countries expose their data by API in the way the US Census Bureau does. Nonetheless, there are a number of R packages that help analysts acquire and work with country-specific Census data. This section of the book covers a sampling of those packages, with examples given from Canada, Mexico, Kenya, and Brazil. These examples will illustrate how to apply many of the techniques introduced in this book to a variety of non-US examples.

12.2.1 Canada: cancensus

The cancensus R package (von Bergmann, Shkolnik, and Jacobs 2021) grants comprehensive access to Canadian Census data through the CensusMapper APIs. While Statistics Canada, the Canadian agency that distributes Canadian demographic data, does not maintain these APIs, package author Jens von Bergmann’s CensusMapper product makes Canadian data accessible via a web interface and data API. Working with cancensus will feel familiar to tidycensus users, as cancensus integrates well within the tidyverse and includes an option to return feature geometry along with demographic data for a wide range of Canadian geographies.

To get started with cancensus, obtain an API key from the CensusMapper website and store it in your .Renviron file as discussed in Section 11. Supplying this key to the cancensus.api_key option will allow the key to be picked up automatically by cancensus functions.

The list_census_vectors() function operates in a similar way to load_variables() in tidycensus. Specify a dataset to generate a browsable data frame of variable IDs which can be used to query data from the CensusMapper API.

library(cancensus)
library(tidyverse)
options(cancensus.api_key = Sys.getenv("CANCENSUS_API_KEY"))

var_list <- list_census_vectors("CA16")
Table 12.5: List of Census vectors in cancensus
vector type label units parent_vector aggregation details
v_CA16_401 Total Population, 2016 Number NA Additive CA 2016 Census; Population and Dwellings; Population, 2016
v_CA16_402 Total Population, 2011 Number NA Additive CA 2016 Census; Population and Dwellings; Population, 2011
v_CA16_403 Total Population percentage change, 2011 to 2016 Number NA Average of v_CA16_402 CA 2016 Census; Population and Dwellings; Population percentage change, 2011 to 2016
v_CA16_404 Total Total private dwellings Number NA Additive CA 2016 Census; Population and Dwellings; Total private dwellings
v_CA16_405 Total Private dwellings occupied by usual residents Number v_CA16_404 Additive CA 2016 Census; Population and Dwellings; Total private dwellings; Private dwellings occupied by usual residents

The data can be browsed with the View() function in RStudio much like it is recommended to do with tidycensus in Section 2.3. Two companion functions, list_census_regions() and list_census_datasets(), should also be used to identify region codes and dataset codes for which you’d like to request data.

Once determined, the appropriate codes can be passed as arguments to various parameters in the get_census() function, the main data access function used in cancensus. The example below fetches data on English speakers by Census tract in the Montreal area in this way.

montreal_english <- get_census(
  dataset = "CA16",
  regions = list(CMA = "24462"),
  vectors = "v_CA16_1364",
  level = "CT",
  geo_format = "sf",
  labels = "short"
) 

The example above constructs the query to the CensusMapper API using the following arguments:

  • dataset = "CA16" requests data from the 2016 Canadian Census;

  • regions = list(CMA = "24462") gets data for the Montreal metropolitan area by using a named list, matching the region type CMA with the specific ID for the Montreal area;

  • vectors = "v_CA16_1364" fetches data for the specific Census vector that represents English language speaking, as identified using list_census_vectors();

  • level = "CT" gets data at the Census tract level;

  • geo_format = "sf" instructs get_census() to return simple feature geometry for Census tracts along with the Census data;

  • labels = "short" returns simpler labels in the output dataset.

Once queried, our data appear as follows:

Table 12.6: Data on English speakers in Montreal by Census tract
Population Households GeoUID Type PR_UID Shape Area Dwellings Adjusted Population (previous Census) CMA_UID CSD_UID CD_UID Region Name Area (sq km) v_CA16_1364 geometry
2638 1328 4620001.00 CT 24 0.46188 1452 2608 24462 2466023 2466 Montréal 0.46188 60 MULTIPOLYGON (((-73.50938 4…
3516 1762 4620002.00 CT 24 0.38927 1902 3039 24462 2466023 2466 Montréal 0.38927 95 MULTIPOLYGON (((-73.51732 4…
6373 2937 4620003.00 CT 24 0.74010 3103 6238 24462 2466023 2466 Montréal 0.74010 370 MULTIPOLYGON (((-73.52152 4…
3176 1603 4620004.00 CT 24 0.44828 1704 3261 24462 2466023 2466 Montréal 0.44828 80 MULTIPOLYGON (((-73.51911 4…
3060 1593 4620005.00 CT 24 0.56464 1749 3163 24462 2466023 2466 Montréal 0.56464 60 MULTIPOLYGON (((-73.50938 4…
4467 1993 4620006.00 CT 24 0.64815 2149 4458 24462 2466023 2466 Montréal 0.64815 105 MULTIPOLYGON (((-73.50779 4…

By default, cancensus returns a number of contextual variables along with the requested Census vector; total population, total households, total dwellings, and area can be used for normalization automatically. This information can be used to normalize data on English speakers by calculating a new column with mutate() named pct_english; familiar mapping tools from Chapter 6 are then used to visualize the data.

library(tmap)
tmap_mode("view")

montreal_pct <- montreal_english %>%
  mutate(pct_english = 100 * (v_CA16_1364 / Population))

tm_shape(montreal_pct) + 
  tm_polygons(
    col = "pct_english", 
    alpha = 0.5, 
    palette = "viridis",
    style = "jenks", 
    n = 7, 
    title = "Percent speaking<br/>English at home"
  )

Figure 12.6: Interactive map of the percentage English language speakers at home in the Montreal area

Exploring the data interactively reveals distinctive patterns of English language speaking in Montreal. While English is the most common language spoken at home in some central Census tracts and the southwestern portion of the Island of Montreal, percentages are much lower in suburban areas and the outer portions of the metropolitan region.

12.2.2 Kenya: rKenyaCensus

The rKenyaCensus package (Kariuki 2020) makes indicators from the 2019 Kenya Population and Housing Census available to R users. This package is the result of a painstaking effort by the package author Shel Kariuki to scrape data from PDFs containing the Census results - which is the way the original data were distributed - and convert them to R data frames. Kenyan Census tables install directly with the package and can be accessed by name. Let’s take a look at table V4_T2.30, which contains information on religion by county in Kenya. We will assign this table to the variable religion for ease of reference.

# remotes::install_github("Shelmith-Kariuki/rKenyaCensus")
library(rKenyaCensus)
library(tidyverse)

religion <- V4_T2.30
Table 12.7: Religion data from the 2019 Kenyan Census
County Total Catholic Protestant Evangelical AfricanInstituted Orthodox OtherChristian Islam Hindu Traditionists OtherReligion NoReligion/Atheists Don’t Know NotStated
KENYA 47213282 9726169 15777473 9648690 3292573 201263 1732911 5152194 60287 318727 467083 755750 73253 6909
MOMBASA 1190987 170797 241554 151939 46094 3104 100568 450740 6136 1040 6637 11148 1018 212
KWALE 858748 43624 116453 82176 22193 720 47892 520160 332 7121 4703 12551 715 108
KILIFI 1440958 95148 388893 307345 54980 2858 128298 253966 518 32605 27502 146669 1989 187
TANA RIVER 314710 11306 22866 6791 4015 176 11148 256422 115 226 453 1044 80 68

The data are organized by county in Kenya with religious affiliations arranged by column. The first row represents data for the entirety of Kenya. Some additional data wrangling steps allow for the data to be readily plotted using familiar tooling. For example, the inclusion of the Total column makes calculation of proportions straightforward. In the example below, a new column prop_islam is computed and passed to a ggplot2 bar chart.

religion_prop <- religion %>%
  filter(County != "KENYA") %>%
  mutate(county_title = str_to_title(County), 
         prop_islam = Islam / Total)

ggplot(religion_prop, aes(x = prop_islam, 
                          y = reorder(county_title, prop_islam))) + 
  geom_col(fill = "navy", alpha = 0.6) + 
  theme_minimal(base_size = 12.5) + 
  scale_x_continuous(labels = scales::percent) + 
  labs(title = "Percent Muslim by County in Kenya",
       subtitle = "2019 Kenyan Census", 
       x = "",
       y = "",
       caption = "Data source: rKenyaCensus R package")
Bar chart of Islam prevalence by county in Kenya

Figure 12.7: Bar chart of Islam prevalence by county in Kenya

The visualization illustrates that most Kenyan counties have very small Muslim populations; however, a smaller number of counties have much larger Muslim populations, with percentages in some cases close to 100%. This raises related questions about the geography of these patterns, which can be mapped using additional functionality in rKenyaCensus.

The rKenyaCensus package includes a built-in county boundaries dataset to facilitate mapping of the various indicators in the Census, KenyaCounties_SHP. This object is of class SpatialPolygonsDataFrame, which will need to be converted to a simple features object with sf::st_as_sf()

library(sf)

kenya_counties_sf <- st_as_sf(KenyaCounties_SHP)

ggplot(kenya_counties_sf) + 
  geom_sf() + 
  theme_void()
ggplot2 map of Kenya county boundaries

Figure 12.8: ggplot2 map of Kenya county boundaries

With a little additional cleaning, the religion data can be joined to the county boundaries dataset, facilitating choropleth mapping with ggplot2 and geom_sf().

kenya_islam_map <- kenya_counties_sf %>%
  mutate(County = str_remove(County, " CITY")) %>%
  left_join(religion_prop, by = "County") 

ggplot(kenya_islam_map, aes(fill = prop_islam)) + 
  geom_sf() + 
  scale_fill_viridis_c(labels = scales::percent) + 
  theme_void() + 
  labs(fill = "% Muslim",
       title = "Percent Muslim by County in Kenya",
       subtitle = "2019 Kenyan Census",
       caption = "Data acquired with the rKenyaCensus R package")
Choropleth map of Islam in Kenya

Figure 12.9: Choropleth map of Islam in Kenya

The map shows how the distinct religious divides in Kenya play out geographically in the country, with proportionally few Muslim residents in the western part of Kenya and much higher percentages in the eastern part of the country.

12.2.3 Mexico: combining mxmaps and inegiR

Mexico’s national statistics office, the Instituto Nacional de Estadística y Geografía (INEGI), offers an API for access to many of its datasets. Used together, the mxmaps (Valle-Jones 2021) and inegiR (Flores 2019) R packages allow for geographic analysis and visualization of Mexican Census data obtained from INEGI.

Like other APIs introduced in this book, the INEGI API requires an API token. This token can be requested from the INEGI website. Once acquired, it can be used by both mxmaps and inegiR to request data; I’m saving my key as the environment variable INEGI_API_KEY.

The mxmaps package is based on the choroplethr R package (Lamstein 2020), which offers a convenient interface for visualizing data from the US ACS. mxmaps wraps inegiR’s data access functions in the choropleth_inegi() function, which can request data for indicators from INEGI’s data bank and visualize them. The map below shows the percentage of residents by state born in Mexico but outside their current state of residence using indicator code 304003001. For full functionality, the inegiR and mxmaps packages must both be installed from GitHub with remotes::install_github().

# remotes::install_github("Eflores89/inegiR")
# remotes::install_github("diegovalle/mxmaps")
library(mxmaps)
token_inegi <- Sys.getenv("INEGI_API_KEY")

state_regions <- df_mxstate_2020$region
choropleth_inegi(token_inegi, state_regions, 
                 indicator = "3104003001",
                 legend = "%",
                 title = "Percentage born outside\nstate of residence") + 
  theme_void(base_size = 14) + 
  labs(caption = "Data sources: INEGI, mxmaps R package")
## 
Downloading data from the INEGI API
Choropleth map of internal migration in Mexico by state

Figure 12.10: Choropleth map of internal migration in Mexico by state

The map illustrates that the states with the most internal migrants in Mexico include Baja California Sur, Quintana Roo, and Nuevo Leon. This information can also be accessed directly using the inegiR package. The inegi_series() function returns one state at a time, which can be combined into a country-wide dataset with map_dfr().

library(inegiR)
token_inegi <- Sys.getenv("INEGI_API_KEY")
state_regions <- mxmaps::df_mxstate_2020$region

pct_migrants <- map_dfr(state_regions, ~{
  inegi_series(series_id = "3104003001", 
               token = token_inegi, 
               geography = .x,
               database = "BISE") %>%
    mutate(state_code = .x)
})
Table 12.8: Illustration of data acquired with inegiR
date date_shortcut values notes state_code
NA Q1 4.378360 01
NA Q1 6.159140 02
NA Q1 9.993685 03
NA Q1 3.309146 04
NA Q1 2.936635 05
NA Q1 5.245685 06
NA Q1 1.152171 07
NA Q1 2.638413 08
NA Q1 3.574361 09
NA Q1 2.385331 10

12.2.4 Brazil: aligning the geobr R package with raw Census data files for spatial analysis

The three country-specific examples shown above - Canada, Kenya, and Mexico - illustrate workflows for obtaining data directly from an R package and mapping the results. In many cases, however, data do not come directly with an R package, requiring analysts to research the raw data and work with raw files. The example below illustrates how to accomplish this for Brazil, where spatial data are found in a well-documented and widely used R package geobr, but demographic data will be downloaded and processed directly from the Instituto Brasileiro de Geografia e Estatística website using R tooling.

12.2.4.1 Spatial data for Brazil with geobr

The geobr (Pereira and Goncalves 2021) package from IPEA helps you load spatial datasets for many Brazilian geographies, including Census boundaries. These datasets are generously hosted by IPEA and downloaded into a user’s R session when the corresponding function is called. The code_tract parameter can accept a 7-digit municipality code; below, the code for Rio de Janeiro is used. Municipality codes can be looked up with the corresponding read_municipality() function.

# Install first from GitHub:
# remotes::install_github("ipeaGIT/geobr", subdir = "r-package")
library(geobr)

rj_tracts <- read_census_tract(code_tract = 3304557)

By default, geobr retrieves a Census tract dataset with simplified geometries that is appropriate for small-scale demographic visualization but inappropriate for spatial analysis or large-scale visualization. A quick visualization shows the Census tract geometries for Rio de Janeiro:

ggplot(rj_tracts) + 
  geom_sf(lwd = 0.1) + 
  theme_void()
Basic ggplot2 map of Census tracts in Rio de Janeiro

Figure 12.11: Basic ggplot2 map of Census tracts in Rio de Janeiro

12.2.4.2 Identifying demographic data resources for Brazil

For many demographic applications, users will want to identify Brazilian data that can be joined to the Census tract shapes. This makes for a useful exercise to illustrate an applied workflow for working with the raw tables. The example below will get data from the 2010 Brazilian Census and merge it to the geometries obtained with geobr for analysis.

Census data tables can be downloaded from the Instituto Brasileiro de Geografia e Estatística (IBGE) website’s FTP page, which provides aggregated data at the Census tract level. These datasets are organized by state, so we’ll need to identify the appropriate zipped folder to download and bring in the data we need into R. R’s web access tools, discussed in Section 11, can be used to download this data. Our analysis will focus on Brasilia, the capital city of Brazil; in turn, we’ll need data for the Distrito Federal (Federal District) in which Brasilia is located.

library(httr)

dir.create("brazil")

data_url <- "https://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Resultados_do_Universo/Agregados_por_Setores_Censitarios/DF_20171016.zip"

zip_name <- basename(data_url)

out_file <- file.path("brazil", zip_name)

GET(data_url, write_disk(out_file, overwrite = TRUE))

unzip(out_file, exdir = "brazil/brasilia_data")

The workflow above operates as follows:

  • After loading in httr, we create a new subdirectory named brazil in our working directory. This will store the data we download from IBGE.

  • We then identify the URL from which we’ll be requesting data, and extract the basename from the URL with basename(); this will be the name of the file on disk.

  • file.path() constructs the output location for the downloaded file, then GET() from httr makes the HTTP request, using write_disk() to write the downloaded file to the output location.

  • Finally, unzip() unzips the folder. We can take a look at what we got back:

data_path <- "brazil/brasilia_data/DF/Base informa\x87oes setores2010 universo DF/CSV"

list.files(data_path)
##  [1] "Basico_DF.csv"           "Domicilio01_DF.csv"     
##  [3] "Domicilio02_DF.csv"      "DomicilioRenda_DF.csv"  
##  [5] "Entorno01_DF.csv"        "Entorno02_DF.csv"       
##  [7] "Entorno03_DF.csv"        "Entorno04_DF.csv"       
##  [9] "Entorno05_DF.csv"        "Pessoa01_DF.csv"        
## [11] "Pessoa02_DF.csv"         "Pessoa03_DF.csv"        
## [13] "Pessoa04_DF.csv"         "Pessoa05_DF.csv"        
## [15] "Pessoa06_DF.csv"         "Pessoa07_DF.csv"        
## [17] "Pessoa08_DF.csv"         "Pessoa09_DF.csv"        
## [19] "Pessoa10_DF.csv"         "Pessoa11_DF.csv"        
## [21] "Pessoa12_DF.csv"         "Pessoa13_DF.csv"        
## [23] "PessoaRenda_DF.csv"      "Responsavel01_DF.csv"   
## [25] "Responsavel02_DF.csv"    "ResponsavelRenda_DF.csv"

The actual path, given that a subdirectory includes an accented character, may differ based on your operating system and/or your locale. Finally, we’ll copy all the files from the nested subdirectory to the brasilia_data directory.

file.copy(data_path, "brazil/brasilia_data", recursive = TRUE)

12.2.4.3 Working with Brazilian demographic data

Our files include a series of CSV files that represent characteristics of Census tracts in the Federal District. Let’s read in the first file, Basico_DF.csv, and see what we get back. An important step here is to use read_csv2() rather than read_csv(); this is because the file is semicolon-delimited and uses commas for decimal places, meaning that read_csv() will interpret it incorrectly. The use of accented characters in the raw data file also requires setting the encoding to Latin-1.

basico <- read_csv2("brazil/brasilia_data/CSV/Basico_DF.csv", 
                     locale = locale(encoding = "latin1"))
Table 12.9: Data from the 2010 Brazilian Census
Cod_setor Cod_Grandes Regiões Nome_Grande_Regiao Cod_UF Nome_da_UF Cod_meso Nome_da_meso Cod_micro Nome_da_micro Cod_RM Nome_da_RM Cod_municipio Nome_do_municipio Cod_distrito Nome_do_distrito Cod_subdistrito Nome_do_subdistrito Cod_bairro Nome_do_bairro Situacao_setor Tipo_setor V001 V002 V003 V004 V005 V006 V007 V008 V009 V010 V011 V012 …34
5.300108e+14 5 Região Centro-Oeste 53 Distrito Federal 5301 Distrito Federal 53001 Brasília 40 RIDE - Região Integrada de Desenvolvimento do Distrito Federal e Entorno 5300108 BRASÍLIA 530010805 BRASÍLIA 53001080506 BRASÍLIA 5300108000 BRASÍLIA (todos os setores) 1 0 NA NA NA NA NA NA NA NA 9506.63 138395783 11709.39 144747623 NA
5.300108e+14 5 Região Centro-Oeste 53 Distrito Federal 5301 Distrito Federal 53001 Brasília 40 RIDE - Região Integrada de Desenvolvimento do Distrito Federal e Entorno 5300108 BRASÍLIA 530010805 BRASÍLIA 53001080506 BRASÍLIA 5300108000 BRASÍLIA (todos os setores) 1 0 NA NA NA NA NA NA NA NA 7806.67 22289238 7806.67 22289238 NA
5.300108e+14 5 Região Centro-Oeste 53 Distrito Federal 5301 Distrito Federal 53001 Brasília 40 RIDE - Região Integrada de Desenvolvimento do Distrito Federal e Entorno 5300108 BRASÍLIA 530010805 BRASÍLIA 53001080506 BRASÍLIA 5300108000 BRASÍLIA (todos os setores) 1 0 NA NA NA NA NA NA NA NA 10939.03 192793982 12006.26 198863824 NA
5.300108e+14 5 Região Centro-Oeste 53 Distrito Federal 5301 Distrito Federal 53001 Brasília 40 RIDE - Região Integrada de Desenvolvimento do Distrito Federal e Entorno 5300108 BRASÍLIA 530010805 BRASÍLIA 53001080506 BRASÍLIA 5300108000 BRASÍLIA (todos os setores) 1 0 22 49 2.23 1.71 2502.36 6401544 2502.36 6401544 1453.43 3813162 1975.18 4160350 NA
5.300108e+14 5 Região Centro-Oeste 53 Distrito Federal 5301 Distrito Federal 53001 Brasília 40 RIDE - Região Integrada de Desenvolvimento do Distrito Federal e Entorno 5300108 BRASÍLIA 530010805 BRASÍLIA 53001080506 BRASÍLIA 5300108000 BRASÍLIA (todos os setores) 1 0 186 676 3.63 1.15 6932.78 6092028 6970.25 5862510 2972.86 12913471 5053.02 11433950 NA

The file includes a series of ID codes then a series of variables associated with each Census tract. The meaning of the variables in the dataset can be looked up from a PDF data dictionary available for download from the IBGE website. We’ll focus here on variable V007, which represents the average monthly income in each Census tract for income-earning households.

brasilia_income <- basico %>%
  mutate(code_tract = as.character(Cod_setor)) %>%
  select(code_tract, monthly_income = V007)

ggplot(brasilia_income, 
       aes(x = monthly_income)) + 
  geom_histogram(bins = 100, 
                 alpha = 0.5, 
                 color = "navy", 
                 fill = "navy") + 
  theme_minimal()
ggplot2 histogram of median monthly income by Census tract in Brasilia

Figure 12.12: ggplot2 histogram of median monthly income by Census tract in Brasilia

Typical of neighborhood income distributions, the histogram is heavily right-skewed with a large cluster of Census tracts with low earnings and a long tail of wealthier areas.

12.2.4.4 Spatial analysis of Brazilian Census data

This information can be used in concert with the geobr package to analyze the geographic distribution of earnings using techniques covered elsewhere in this book. The first step of the analysis involves returning to geobr to get Census tracts for Brasilia/the Federal District using Brasilia’s municipality code. The argument simplified = FALSE is used given the need for higher-quality tract boundaries for spatial analysis. Next, the income data is merged to the tract shapes, then transformed to an appropriate coordinate reference system for Brasilia.

library(sf)
library(geobr)

brasilia_tracts <- read_census_tract(
  code_tract = 5300108,
  simplified = FALSE
) %>%
  select(code_tract)

brasilia_income_geo <- brasilia_tracts %>%
  left_join(brasilia_income, by = "code_tract") %>%
  st_transform(22523)

After joining the data, the income information can be visualized on a map using familiar mapping tools like geom_sf().

ggplot(brasilia_income_geo, aes(fill = monthly_income)) + 
  geom_sf(color = NA) + 
  theme_void() + 
  scale_fill_viridis_c()
Choropleth map of median monthly income in Brasilia

Figure 12.13: Choropleth map of median monthly income in Brasilia

The map highlights income disparities between central Brasilia and the more rural portions of the Federal District, and identifies the Lago Sul area as home to the highest-earning households, on average.

As the spatial data have been prepared appropriately, methods from Section 7.7 can be used to explore patterns of spatial autocorrelation by monthly income in the Brasilia area. The example below uses the local form of the Moran’s \(I\) statistic, described in more detail in Section 7.7.3. The steps used are described in brief below:

  • A queens-case contiguity-based spatial weights matrix is created with functions from the spdep package, using the argument zero.policy = TRUE as a small number of tracts in the dataset have no neighbors after omitting NA values;

  • The monthly_income column is scaled to a z-score;

  • The LISA statistic is computed with the localmoran_perm() function, using the permutation-based method to compute statistical significance;

  • The LISA results are attached to the spatial dataset and recoded into cluster groups for visualization.

library(spdep)
set.seed(1983)

# Omit NAs and scale the income variable
brasilia_input <- brasilia_income_geo %>%
  mutate(scaled_income = as.numeric(scale(monthly_income))) %>%
  na.omit() 

# Generate contiguity-based weights with zero.policy = TRUE
brasilia_weights <- brasilia_input %>%
  poly2nb() %>%
  nb2listw(zero.policy = TRUE)

# Compute the LISA and convert to a tibble
brasilia_lisa <- localmoran_perm(brasilia_input$scaled_income, 
                                 brasilia_weights,
                                 nsim = 999L, 
                                 alternative = "two.sided",
                                 na.action = na.pass) %>%
  as_tibble() %>%
  set_names(c("local_i", "exp_i", "var_i", "z_i", "p_i",
              "p_i_sim", "pi_sim_folded", "skewness", "kurtosis"))

# Add the LISA columns to the spatial dataset 
# and recode them into cluster values
brasilia_lisa_clusters <- brasilia_input %>%
  select(code_tract, scaled_income) %>%
  bind_cols(brasilia_lisa) %>%
  mutate(lisa_cluster = case_when(
    p_i >= 0.05 ~ "Not significant",
    scaled_income > 0 & local_i > 0 ~ "High-high",
    scaled_income > 0 & local_i < 0 ~ "High-low",
    scaled_income < 0 & local_i > 0 ~ "Low-low",
    scaled_income < 0 & local_i < 0 ~ "Low-high"
  ))

Once computed, the results can be visualized on a map.

color_values <- c(`High-high` = "red", 
                  `High-low` = "pink", 
                  `Low-low` = "blue", 
                  `Low-high` = "lightblue", 
                  `Not significant` = "white")

ggplot(brasilia_lisa_clusters, aes(fill = lisa_cluster)) + 
  geom_sf(size = 0.1) + 
  theme_void(base_family = "Verdana") + 
  scale_fill_manual(values = color_values) + 
  labs(title = "LISA clusters of median monthly income",
       subtitle = "Census tracts, Brasilia/Federal District",
       fill = "Cluster type",
       caption = "Data sources: 2010 Brazilian Census via IBGE; geobr R package")
LISA cluster map of median monthly income by Census tract in Brasilia

Figure 12.14: LISA cluster map of median monthly income by Census tract in Brasilia

The map illustrates strong clustering of high monthly incomes in central Brasilia with low-high spatial outliers scattered among the higher-earning areas. Low-low cluster census tracts tend to be found on the edges of the urbanized portions of the Federal District, representing more sparsely populated rural areas.

12.3 Other international data resources

Many other international data resources in R exist, and a robust R ecosystem is emerging to standardize these data source. In Europe, the nomisr R package (Odell 2018) provides an interface to official statistics for the United Kingdom, and the insee package (Leclerc 2021) does the same for France. Europe-wide data can be acquired using the eurostat package (Lahti et al. 2017), which interacts with European Union data sources. In Africa, the afrimapr project is an effort to disseminate Africa-wide datasets to R users. It includes a wide variety of spatial datasets for countries across Africa.

Microdata users will be interested in the IPUMS International project, which makes historical microdata for over 100 countries available to researchers. Like US IPUMS data, the ipumsr R package can help users work with international microdata in R. A companion project from IPUMS is the International Historical Geographic Information System (IHGIS) project, which aims to disseminate international spatial and demographic data in much the same way that the NHGIS project (covered in Section 11.1.1 does for historical Census data for the US.

12.4 Exercises

  • Use idbr to create a map of a global demographic indicator as shown above. Use ggiraph and ggplot2 to make both static and interactive versions of the map.
  • Choose one of the four countries highlighted in this chapter (Canada, Kenya, Mexico, or Brazil) and explore their corresponding R packages further. Explore the available demographic indicators for those countries, and try making a map or a chart for that indicator.