Skip to contents

ColOpenData can be used to access open demographic data from Colombia. This demographic data is retrieved from the National Administrative Department of Statistics (DANE). The demographic module allows you to consult demographic data from the National Population and Dwelling Census (CNPV) of 2018 and Population Projections.

The available CNPV information is divided in four categories: households, persons demographic, persons social and dwellings. The population projections information presents data from 1950 to 2070 for a national level, from 1985 to 2050 for a departmental level and from 1985 to 2035 for a municipal level. All data documentation can be accessed as explained at Documentation and Dictionaries.

In order to access its documentation we need to use the function list_datasets and indicate as a parameter the module we are interested in. It is important to take a good look at this to have a clearer understanding of what we count with, before just throwing ourselves to work with the data. Now, we should start by loading all necessary libraries.

In this vignette you will learn: 1. How to download demographic data using ColOpenData 2. How to filter, group, mutate and aggregate demographic data including census and population projections 3. How to visualize data using ggplot2

As the goal of this vignette is to show some examples on how to use the data, we will load some specific libraries but that does not mean they are required to use the data in all cases.

Disclaimer: all data is loaded to the environment in the user’s R session, but is not downloaded to user’s computer.

Initial Exploration: Basic Data Handling with ColOpenData

Documentation access

First, we have to access the demographic documentation, to check available datasets.

datasets_dem <- list_datasets("demographic")
head(datasets_dem)
#>                   name       group source year        level   category
#> 11 DANE_CNPVH_2018_1HD demographic   DANE 2018   department households
#> 12 DANE_CNPVH_2018_1HM demographic   DANE 2018 municipality households
#> 13 DANE_CNPVH_2018_2HD demographic   DANE 2018   department households
#> 14 DANE_CNPVH_2018_2HM demographic   DANE 2018 municipality households
#> 15 DANE_CNPVH_2018_3HD demographic   DANE 2018   department households
#> 16 DANE_CNPVH_2018_3HM demographic   DANE 2018 municipality households
#>                                                                                                                      description
#> 11   Number of households with persons under 15 years of age and number of persons under 15 years of age, by department and area
#> 12 Number of households with persons under 15 years of age and number of persons under 15 years of age, by municipality and area
#> 13                      Number of households with senior citizens and number of persons aged 60 and over, by department and area
#> 14                    Number of households with senior citizens and number of persons aged 60 and over, by municipality and area
#> 15                                                                                Households by headship, by department and area
#> 16                                                                              Households by headship, by municipality and area

After checking the documentation, we can load the data we want to work with. To do this, we will use the download_demographic function that takes by parameter the data name, presented in the documentation. For this first example we will focus on a CNPV dataset.

Data load

# Load a demographic dataset
public_services_d <- download_demographic("DANE_CNPVV_2018_8VD")
#> Original data is provided by the National Administrative Department of
#> Statistics (DANE).
#> Reformatted by the package authors.
#> Stored and redistributed by Universidad de Los Andes under the Epiverse
#> TRACE iniative.
head(public_services_d)
#>   codigo_departamento   departamento     area  servicio_publico      disponible
#> 1                  00 Total nacional    total energia_electrica              si
#> 2                  00 Total nacional    total energia_electrica              no
#> 3                  00 Total nacional    total energia_electrica sin_informacion
#> 4                  00 Total nacional cabecera energia_electrica              si
#> 5                  00 Total nacional cabecera energia_electrica              no
#> 6                  00 Total nacional cabecera energia_electrica sin_informacion
#>      total
#> 1 12984126
#> 2   496603
#> 3        0
#> 4 10485896
#> 5    81579
#> 6        0

As it can be seen above, public_services_d presents information regarding availability of public services in the country in 2018, at the department level. Now, with this data we could, for example, find the proportion of dwellings that have access to a water supply system (WSS) by department and plot it.

Data filter and plot

# Subset data so it presents the information regarding the WSS, by department
wss <- public_services_d %>%
  subset(area == "total_departamental" & servicio_publico == "acueducto") %>%
  select(departamento, disponible, total)

# Calculate the total counts for each department (for all situations)
total_counts <- wss %>%
  group_by(departamento) %>%
  dplyr::summarise(total_all = sum(total)) %>%
  ungroup()

# Calculate the proportion of counts of "si" by department
proportions_wss <- wss %>%
  filter(disponible == "si") %>%
  left_join(total_counts, by = "departamento") %>%
  mutate(proportion_si = total / total_all)

# Change the value of cell San Andres
proportions_wss[28, "departamento"] <- "SAPSC"

# Plot proportions_wss of dwellings with access to WSS by deparment
ggplot(proportions_wss, aes(
  x = reorder(departamento, -proportion_si),
  y = proportion_si
)) +
  geom_bar(stat = "identity", fill = "#10bed2", color = "black", width = 0.6) +
  labs(
    title = "Proportion of dwellings with access to WSS, by department",
    x = "Department",
    y = "Proportion"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )

Data Integration: Combining Datasets for Enhanced Analysis

Also, with the CNPV data, we could merge data from the different categories and compare information. For example, we would download a dataset from the persons demographic category and another one from persons social. The first one presents information regarding population’s literacy and the second one regarding its marital status.

Data load

# Load data
message("Loading data, this can take a while")
#> Loading data, this can take a while
literacy <- download_demographic("DANE_CNPVPD_2018_16PM")
#> Original data is provided by the National Administrative Department of
#> Statistics (DANE).
#> Reformatted by the package authors.
#> Stored and redistributed by Universidad de Los Andes under the Epiverse
#> TRACE iniative.
marital_status <- download_demographic("DANE_CNPVPS_2018_7PM")
#> Original data is provided by the National Administrative Department of
#> Statistics (DANE).
#> Reformatted by the package authors.
#> Stored and redistributed by Universidad de Los Andes under the Epiverse
#> TRACE iniative.

For this example, we will focus on a specific municipality, examining the total area and specifically concentrating on the age group between 15 and 24 years old. Since we want to compare both datasets simultaneously, it’s important to first ensure they have the same population count under identical conditions. Most columns in the demographic datasets contain a total value, which aggregates data across all conditions. For instance, the ‘sex’ column includes ‘male’, ‘female’, and total, representing both genders combined. Similarly, the ‘marital status’ column’s total value encompasses the entire population, irrespective of their marital status.

Data check

# Literacy data filter to have a population's count under specific conditions
lit <- literacy %>%
  subset(municipio == "Villapinzón" & grupo_de_edad == "entre_15_y_24_anos" &
    area == "total" & sabe_leer_y_escribir == "total" & sexo == "total") %>%
  select(municipio, area, grupo_de_edad, sexo, total)

# Marital status data filter to find population's count under certain conditions
ms_15_19 <- marital_status %>%
  subset(municipio == "Villapinzón" & area == "total" &
    grupo_de_edad == "15_a_19" & estado_civil == "total" &
    sexo == "total") %>%
  select(grupo_de_edad, estado_civil, sexo, total)

ms_20_24 <- marital_status %>%
  subset(municipio == "Villapinzón" & area == "total" &
    grupo_de_edad == "20_a_24" & estado_civil == "total" &
    sexo == "total") %>%
  select(grupo_de_edad, estado_civil, sexo, total)

ms <- rbind(ms_15_19, ms_20_24)

# Sum of counts of both age groups
ms_count <- ms %>%
  group_by(estado_civil, sexo) %>%
  dplyr::summarise(total = sum(total)) %>%
  ungroup()
count_total <- toString(ms_count$total)

# New row with summed information
new_row <- data.frame(
  grupo_de_edad = "15_a_24", estado_civil = "total",
  sexo = "total", total = count_total,
  stringsAsFactors = FALSE
)
# Final dataframe with counts for each group independently and together, with
# rows names modification

ms_f <- rbind(ms, new_row)
rownames(ms_f) <- NULL
head(ms_f)

Now that we know that both data sets contain the same counts, we can start to look at the specific information we want. For the literacy data set, we would focus on the people that do not know how to read neither write.

Data filter and plot

# Filter of literacy data to get only the counts for specific conditions
lit_filt <- literacy %>%
  subset(municipio == "Villapinzón" & sabe_leer_y_escribir == "no" &
    grupo_de_edad == "entre_15_y_24_anos" & area == "total") %>%
  select(sabe_leer_y_escribir, grupo_de_edad, sexo, total) %>%
  mutate(estado = "no_sabe_leer_y_escribir") %>%
  select(-sabe_leer_y_escribir) %>%
  mutate(proportion = total / lit$total)

# Filter of marital status data to get only the counts for specific conditions
ms_filt15 <- marital_status %>%
  subset(municipio == "Villapinzón" & area == "total" &
    grupo_de_edad == "15_a_19") %>%
  select(grupo_de_edad, estado_civil, sexo, total)

# Filter of marital status data to get only the counts for specific conditions
ms_filt20 <- marital_status %>%
  subset(municipio == "Villapinzón" & area == "total" &
    grupo_de_edad == "20_a_24") %>%
  select(grupo_de_edad, estado_civil, sexo, total)

# Merge of both marital status dataframes
ms_filt <- rbind(ms_filt15, ms_filt20)

# Data frame with marital status info and proportions based on columns
ms_final <- ms_filt %>%
  group_by(estado_civil, sexo) %>%
  dplyr::summarise(total = sum(total)) %>%
  ungroup() %>%
  mutate(grupo_de_edad = "entre_15_y_24_anos") %>%
  rename(estado = estado_civil) %>%
  mutate(proportion = total / lit$total)
#> `summarise()` has grouped output by 'estado_civil'. You can override using the
#> `.groups` argument.

# Merged marital status and literacy final df, with subsets to a better graph
merged_df <- bind_rows(lit_filt, ms_final) %>%
  subset(sexo != "total") %>%
  subset(estado != "total") %>%
  subset(estado != "soltero_a")

# Plot
ggplot(merged_df, aes(x = estado, y = proportion, fill = sexo)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(
    values = c(mujer = "#f04a4c", hombre = "#aec800"),
    labels = c("Female", "Male")
  ) +
  labs(
    title = "Proportion of people in Villapinzón by marital status and
    iliteracy by sex",
    x = "Status",
    y = "Count",
    fill = "Sex"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )

Forecasting the Future: Analyzing Population Projections

Now, as a third example, we will take a look at a population projections dataset presenting projections and back projections of national population by area, sex and age for the period from 1950 to 2070.

First, we will observe the variation on the female and male population within the years, covering the period from 2024 to 2034.

Data load, filter and plot

# Download data
asen <- download_demographic("DANE_PP_1950_2070_ASEN")
#> Original data is provided by the National Administrative Department of
#> Statistics (DANE).
#> Reformatted by the package authors.
#> Stored and redistributed by Universidad de Los Andes under the Epiverse
#> TRACE iniative.

# Subset data
male <- asen %>%
  subset(area == "total" & sexo == "total_hombres" & ano > 2023 & ano < 2035)

female <- asen %>%
  subset(area == "total" & sexo == "total_mujeres" & ano > 2023 & ano < 2035)

mandf <- bind_rows(male, female)

# Plot of population in a national level by year and sex
ggplot(mandf, aes(x = ano, y = total, fill = sexo)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) +
  scale_fill_manual(
    values = c(total_mujeres = "#f04a4c", total_hombres = "#aec800"),
    labels = c("Female", "Male")
  ) +
  labs(
    title = "Population in Colombia from 2024 to 2034 by sex",
    x = "Year",
    y = "Population",
    fill = "Sex"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_x_continuous(breaks = unique(mandf$ano))

In the graph presented above, we observe that the male population slightly exceeds the female population for the selected period. As a second step in our analysis of the dataset, we will focus on the male population by age in the year with the highest male population value from 2024 to 2034, which is the year 2034.

Data load, filter and plot

# Subset data
male_2034_99 <- asen %>%
  subset(area == "total" & ano == 2034 & sexo == "hombre" &
    edad != "100_y_mas") %>%
  mutate(edad = as.numeric(edad))

male_2034_100 <- asen %>%
  subset(area == "total" & ano == 2034 & sexo == "hombre" &
    edad == "100_y_mas") %>%
  rename(age_group = edad) %>%
  rename(total_sum = total) %>%
  select(age_group, total_sum)

# Define age groups
age_groups <- cut(male_2034_99$edad,
  breaks = c(-1, 2, 12, 19, 29, 39, 49, 59, 69, 79, 89, 99),
  labels = c(
    "0-2", "3-12", "13-19", "20-29", "30-39", "40-49",
    "50-59", "60-69", "70-79", "80-89", "90-99"
  )
)

# Add age groups to the data frame
male_agegroups <- mutate(male_2034_99, age_group = age_groups)

# Calculate new totals for age groups
summarized_data <- male_agegroups %>%
  group_by(age_group) %>%
  summarize(total_sum = sum(total))

# Bind dataframes
merged <- bind_rows(summarized_data, male_2034_100)

# Fix the order of ages appearance in the plot
merged$age_group <- factor(merged$age_group, levels = unique(merged$age_group))

# Plot of male population in a national level by age group in 2052
ggplot(merged, aes(
  x = age_group,
  y = total_sum
)) +
  geom_bar(stat = "identity", fill = "#f04a4c", color = "black", width = 0.6) +
  labs(
    title = "Male population counts in Colombia by age group in 2034",
    x = "Age group",
    y = "Male population"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )