Skip to contents

ColOpenData can be used to access open demographic data from Colombia. This demographic data is retrieved from the National Administrative Department of Statistics (DANE). The demographic module allows you to consult demographic data from the National Population and Dwelling Census (CNPV) of 2018 and Population Projections.

The available CNPV information is divided in four categories: households, persons demographic, persons social and dwellings. The population projections information presents data from 1950 to 2070 for a national level, from 1985 to 2050 for a departmental level and from 1985 to 2035 for a municipal level. All data documentation can be accessed as explained at Documentation and Dictionaries.

In this vignette you will learn:

  1. How to download demographic data using ColOpenData
  2. How to filter, group, mutate and aggregate demographic data
  3. How to visualize data using ggplot2

As the goal of this vignette is to show some examples on how to use the data, we will load some specific libraries but that does not mean they are required to use the data in all cases.

In order to access its documentation we need to use the function list_datasets and indicate as a parameter the module we are interested in. It is important to take a good look at this to have a clearer understanding of what we count with, before just throwing ourselves to work with the data. Now, we should start by loading all necessary libraries.

Disclaimer: all data is loaded to the environment in the user’s R session, but is not downloaded to user’s computer.

Initial Exploration: Basic Data Handling with ColOpenData

Documentation access

First, we have to access the demographic documentation, to check available datasets.

datasets_dem <- list_datasets("demographic")

head(datasets_dem)
#>                   name       group source year        level   category
#> 11 DANE_CNPVH_2018_1HD demographic   DANE 2018   department households
#> 12 DANE_CNPVH_2018_1HM demographic   DANE 2018 municipality households
#> 13 DANE_CNPVH_2018_2HD demographic   DANE 2018   department households
#> 14 DANE_CNPVH_2018_2HM demographic   DANE 2018 municipality households
#> 15 DANE_CNPVH_2018_3HD demographic   DANE 2018   department households
#> 16 DANE_CNPVH_2018_3HM demographic   DANE 2018 municipality households
#>                                                                                                                      description
#> 11   Number of households with persons under 15 years of age and number of persons under 15 years of age, by department and area
#> 12 Number of households with persons under 15 years of age and number of persons under 15 years of age, by municipality and area
#> 13                      Number of households with senior citizens and number of persons aged 60 and over, by department and area
#> 14                    Number of households with senior citizens and number of persons aged 60 and over, by municipality and area
#> 15                                                                                Households by headship, by department and area
#> 16                                                                              Households by headship, by municipality and area

After checking the documentation, we can load the data we want to work with. To do this, we will use the download_demographic function that takes by parameter the dataset name, presented in the documentation. For this first example we will focus on a CNPV dataset.

Data load

public_services_d <- download_demographic("DANE_CNPVV_2018_8VD")
#> Original data is retrieved from the National Administrative Department
#> of Statistics (Departamento Administrativo Nacional de Estadística -
#> DANE).
#> Reformatted by package authors.
#> Stored by Universidad de Los Andes under the Epiverse TRACE iniative.

head(public_services_d)
#>   codigo_departamento   departamento     area  servicio_publico      disponible
#> 1                  00 Total nacional    total energia_electrica              si
#> 2                  00 Total nacional    total energia_electrica              no
#> 3                  00 Total nacional    total energia_electrica sin_informacion
#> 4                  00 Total nacional cabecera energia_electrica              si
#> 5                  00 Total nacional cabecera energia_electrica              no
#> 6                  00 Total nacional cabecera energia_electrica sin_informacion
#>      total
#> 1 12984126
#> 2   496603
#> 3        0
#> 4 10485896
#> 5    81579
#> 6        0

As it can be seen above, public_services_d presents information regarding availability of public services in the country at the department level. Now, with this data we could, for example, find the proportion of dwellings that have access to a water supply system (WSS) by department and plot it.

Data filter and plot

First we will subset the data so it presents the information regarding the WSS by department.

wss <- public_services_d %>%
  filter(
    area == "total_departamental",
    servicio_publico == "acueducto"
  ) %>%
  select(departamento, disponible, total)

With the subset, we can calculate the total counts by department.

total_counts <- wss %>%
  group_by(departamento) %>%
  summarise(total_all = sum(total)) %>%
  ungroup()

Then, we can calculate the proportions of “yes” (“si”) by department.

proportions_wss <- wss %>%
  filter(disponible == "si") %>%
  left_join(total_counts, by = "departamento") %>%
  mutate(proportion_si = total / total_all)

For plotting purposes, we will change the name of “San Andrés”, since the complete name is too long.

proportions_wss[28, "departamento"] <- "SAPSC"

Finally, we can plot the results

ggplot(proportions_wss, aes(
  x = reorder(departamento, -proportion_si),
  y = proportion_si
)) +
  geom_bar(stat = "identity", fill = "#10bed2", color = "black", width = 0.6) +
  labs(
    title = "Proportion of dwellings with access to WSS by department",
    x = "Department",
    y = "Proportion"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )

Data Integration: Combining Datasets for Enhanced Analysis

Also, with the CNPV data, we could merge data from the different categories and compare information. For example, we would download a dataset from the persons demographic category and another one from persons social. The first one presents information regarding population’s literacy and the second one regarding its marital status. The following datasets might take a while to download.

Data load

literacy <- download_demographic("DANE_CNPVPD_2018_16PM")
#> Original data is retrieved from the National Administrative Department
#> of Statistics (Departamento Administrativo Nacional de Estadística -
#> DANE).
#> Reformatted by package authors.
#> Stored by Universidad de Los Andes under the Epiverse TRACE iniative.
marital_status <- download_demographic("DANE_CNPVPS_2018_7PM")
#> Original data is retrieved from the National Administrative Department
#> of Statistics (Departamento Administrativo Nacional de Estadística -
#> DANE).
#> Reformatted by package authors.
#> Stored by Universidad de Los Andes under the Epiverse TRACE iniative.

For this example, we will focus on a specific municipality, examining the total area and specifically concentrating on the age group between 15 and 24 years old. Since we want to compare both datasets simultaneously, it’s important to first ensure they have the same population count under identical conditions. Most columns in the demographic datasets contain a total value, which aggregates data across all conditions. For instance, the “sex” column includes “male”, “female”, and “total”, representing both genders combined. Similarly, the “marital status” column’s ‘total’ value encompasses the entire population, irrespective of their marital status.

Data check

Since the age group in literacy is complete, we can just filter using the desired age range and the total counts for other variables, including area, sex and ability to read.

lit <- literacy %>%
  filter(
    municipio == "Villapinzón",
    grupo_de_edad == "entre_15_y_24_anos",
    area == "total",
    sabe_leer_y_escribir == "total",
    sexo == "total"
  ) %>%
  select(sexo, total)

head(lit)
#>    sexo total
#> 1 total  2911

We will now filter marital status for people between 15 and 24 years. This dataset age groups are divided from 15 to 19 and from 20 to 24 years old, so we will need to merge the two groups. To do this, we can group by a common variable, like sex and add as follows:

ms_15_24 <- marital_status %>%
  filter(
    municipio == "Villapinzón",
    area == "total",
    grupo_de_edad %in% c("15_a_19", "20_a_24"),
    estado_civil == "total",
    sexo == "total"
  ) %>%
  group_by(sexo) %>%
  summarise(total = sum(total)) %>%
  select(sexo, total)

head(ms_15_24)
#> # A tibble: 1 × 2
#>   sexo  total
#>   <chr> <dbl>
#> 1 total  2911

Now that we know that both data sets contain the same counts, we can start to look at the specific information we want. For the literacy data set, we would focus on the people that do not know how to read neither write.

Data filter and plot

We can filter the literacy data. For this purpose we will filter the column “sabe_leer_y_escribir” for the value “no”. We will also create a column containing the proportion of people that cannot read compared to the total population within the age range.

lit_filt <- literacy %>%
  filter(
    municipio == "Villapinzón",
    sabe_leer_y_escribir == "no",
    grupo_de_edad == "entre_15_y_24_anos",
    area == "total"
  ) %>%
  mutate(
    estado = "no_sabe_leer_y_escribir",
    proporcion = total / lit$total
  ) %>%
  select(estado, sexo, proporcion)

To filter the marital status dataset we will filter by the two age groups again. We will group by marital status and sex, sum by the total counts and create a column containing the proportion of each marital status compared to the population within the age group.

ms_15_24_filt <- marital_status %>%
  filter(
    municipio == "Villapinzón",
    area == "total",
    grupo_de_edad %in% c("15_a_19", "20_a_24")
  ) %>%
  group_by(estado_civil, sexo) %>%
  summarise(total = sum(total)) %>%
  mutate(proporcion = total / ms_15_24$total) %>%
  select(estado_civil, sexo, proporcion) %>%
  rename(estado = estado_civil)
#> `summarise()` has grouped output by 'estado_civil'. You can override using the
#> `.groups` argument.

Finally, we can merge the two datasets and plot the outcome.

merged_df <- bind_rows(lit_filt, ms_15_24_filt) %>%
  filter(
    sexo != "total",
    estado != "total"
  )

ggplot(merged_df, aes(x = estado, y = proporcion, fill = sexo)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(
    values = c(mujer = "#f04a4c", hombre = "#aec800"),
    labels = c("Female", "Male")
  ) +
  labs(
    title = "Proportion of people in Villapinzón by marital status and
    iliteracy by sex",
    x = "Status",
    y = "Proportion",
    fill = "Sex"
  ) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  )