Read case data
Last updated on 2025-03-24
Overview
Questions
- Where do you usually store your outbreak data?
- How many different data formats can you use for analysis?
- Can you import data from databases and health APIs?
Objectives
- Explain how to import outbreak data from different sources into the R environment.
Prerequisites
This episode requires you to be familiar with:
Data science: Basic tasks with R.
Introduction
The initial step in outbreak analysis typically involves importing the target dataset into the R environment from either a local source (like a file on your computer) or an external source (like a database). Outbreak data can be stored in diverse file formats, relational database management systems (RDBMS), or health information systems (HIS), such as REDCap and DHIS2, which provide application programming interfaces (APIs) to the underlying database systems so that verified users can easily add and access data entries. The latter option is particularly well-suited for collecting and storing large-scale institutional health data. This episode will explain the process of reading cases from these sources.
Let’s start by loading the package rio to read data and the package here to easily find a file path within your RStudio project. We’ll use the pipe %>% to connect some of their functions, including functions from the data formatting package dplyr. We’ll therefore call the tidyverse package, which includes both the pipe and dplyr:
R
# Load packages
library(tidyverse) # for {dplyr} functions and the pipe %>%
library(rio) # for importing data
library(here) # for easy file referencing
The double-colon
The double-colon :: in R lets you call a specific function from a package without loading the entire package into the current environment.
For example, dplyr::filter(data, condition) uses filter() from the dplyr package, without having to use library(dplyr) at the start of a script.
This helps us remember package functions and avoid namespace conflicts (i.e. when two different packages include functions with the same name, so R does not know which to use).
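For instance, filter() exists in both the base stats package and in dplyr. Here is a small sketch of how the double-colon removes that ambiguity:
R
# filter() exists in both {stats} and {dplyr}; the double-colon
# makes explicit which one we mean
stats::filter(x = 1:10, filter = rep(1/3, 3)) # moving-average filter of a numeric series
dplyr::filter(.data = mtcars, cyl == 6)       # keep data frame rows matching a condition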
Set up a project and folder
- Create an RStudio project. If needed, follow this how-to guide on “Hello RStudio Projects” to create one.
- Inside the RStudio project, create a data/ folder.
- Inside the data/ folder, save the ebola_cases_2.csv and Marburg.zip files.
Reading from files
Several packages are available for importing outbreak data stored in individual files into R. These include rio, readr from the tidyverse, io, ImportExport, and data.table. Together, these packages offer methods to read single or multiple files in a wide range of formats.
The example below shows how to import a csv file into the R environment using the rio package. We use the here package to tell R to look for the file in the data/ folder of your project, and as_tibble() to convert the result into a tibble, a tidier data frame format for subsequent analysis in R.
R
# read data
# e.g., if the path to our file is data/ebola_cases_2.csv then:
ebola_confirmed <- rio::import(
here::here("data", "ebola_cases_2.csv")
) %>%
dplyr::as_tibble() # for a simple data frame output
# preview data
ebola_confirmed
OUTPUT
# A tibble: 120 × 4
year month day confirm
<int> <int> <int> <int>
1 2014 5 18 1
2 2014 5 20 2
3 2014 5 21 4
4 2014 5 22 6
5 2014 5 23 1
6 2014 5 24 2
7 2014 5 26 10
8 2014 5 27 8
9 2014 5 28 2
10 2014 5 29 12
# ℹ 110 more rows
Similarly, you can import files of other formats such as tsv, xlsx, etc., as sketched below.
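For example, assuming you had files with those extensions saved in your data/ folder (the file names below are hypothetical), the call would look the same apart from the extension:
R
# {rio} chooses the reader based on the file extension
# (hypothetical file names, for illustration only)
rio::import(here::here("data", "ebola_cases.tsv"))
rio::import(here::here("data", "ebola_cases.xlsx"))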
Why should we use the {here} package?
The here package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. The main reason to use it is cross-environment compatibility: it works across different operating systems (Windows, macOS, Linux) without needing to adjust file paths, as the sketch after the list below shows.
- On Windows, paths are written using backslashes (\) as the separator between folder names: "data\raw-data\file.csv"
- On Unix-based operating systems such as macOS or Linux, the forward slash (/) is used as the path separator: "data/raw-data/file.csv"
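As a quick sketch, here::here() builds the full path from the project root for you, so the same line of code works on any of these operating systems (the example paths in the comments are illustrative):
R
# Build a path relative to the project root, whatever the operating system
here::here("data", "raw-data", "file.csv")
# e.g. "C:/Users/you/my-project/data/raw-data/file.csv" on Windows
# e.g. "/home/you/my-project/data/raw-data/file.csv" on Linux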
The here package is ideal for adding one more layer of reproducibility to your work. If you are interested in reproducibility, we invite you to read this tutorial to increase the openness, sustainability, and reproducibility of your epidemic analysis with R.
Reading compressed data
Take 1 minute: Can you read data from a compressed file in R? Download this zip file containing data for the Marburg outbreak and then import it into your working environment.
You can check the full list of supported file formats in the rio package on the package website. To expand {rio}'s support to the full range of import and export formats, run:
R
rio::install_formats()
You can use this template to read the file:
rio::import(here::here("some", "where", "downto", "path", "file_name.zip"))
R
rio::import(here::here("data", "Marburg.zip"))
Reading from databases
The DBI package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
When to read directly from a database?
We can use database interface packages to optimize memory usage. If we process the database with “queries” (e.g., select, filter, summarise) before extraction, we can reduce the memory load in our RStudio session. Conversely, conducting all data manipulation outside the database management system by loading the full dataset into R can use up much more computer memory (i.e. RAM) than is feasible on a local machine, which can lead RStudio to slow down or even freeze.
External relational database management systems (RDBMS) also have the advantage that multiple users can access, store and analyse parts of the dataset simultaneously, without having to transfer individual files, which would make it very difficult to track which version is up-to-date.
The following code chunk demonstrates, in four steps, how to create a temporary SQLite database in memory, store the ebola_confirmed data frame as a table in it, and subsequently read from it:
1. Connect with a database
First, we establish a connection to an SQLite database created on our machine and stored in its local memory with DBI::dbConnect().
R
library(DBI)
library(RSQLite)
# Create a temporary SQLite database in memory
db_connection <- DBI::dbConnect(
drv = RSQLite::SQLite(),
dbname = ":memory:"
)
Callout
A real-life connection to an external SQLite database would look like this:
R
# in real-life
db_connection <- DBI::dbConnect(
RSQLite::SQLite(),
host = "database.epiversetrace.com",
user = "juanito",
password = rstudioapi::askForPassword("Database password")
)
2. Write a local data frame as a table in a database
Then, we can write the ebola_confirmed data frame into a table named cases within the database using the DBI::dbWriteTable() function.
R
# Store the 'ebola_confirmed' dataframe as a table named 'cases'
# in the SQLite database
DBI::dbWriteTable(
conn = db_connection,
name = "cases",
value = ebola_confirmed
)
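If you want to check that the table was created, you can list the tables available in the database connection:
R
# List the tables stored in the database; at this point it returns "cases"
DBI::dbListTables(conn = db_connection)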
In a database framework, you can have more than one table. Each table can belong to a specific entity (e.g., patients, care units, jobs). All tables will be related by a common ID or primary key.
3. Read data from a table in a database
Subsequently, we read the data from the cases table using dplyr::tbl().
R
# Read one table from the database
mytable_db <- dplyr::tbl(src = db_connection, "cases")
If we apply dplyr verbs to this SQLite database table, the verbs will be translated into SQL queries.
R
# Show the SQL queries translated
mytable_db %>%
dplyr::filter(confirm > 50) %>%
dplyr::arrange(desc(confirm)) %>%
dplyr::show_query()
OUTPUT
<SQL>
SELECT `cases`.*
FROM `cases`
WHERE (`confirm` > 50.0)
ORDER BY `confirm` DESC
4. Extract data from the database
Use dplyr::collect() to force computation of the database query and extract the output to your local computer.
R
# Pull all data down to a local tibble
extracted_data <- mytable_db %>%
dplyr::filter(confirm > 50) %>%
dplyr::arrange(desc(confirm)) %>%
dplyr::collect()
The extracted_data object contains the extracted data, ideally after specifying queries that reduce its size.
R
# View the extracted_data
extracted_data
OUTPUT
# A tibble: 3 × 4
year month day confirm
<int> <int> <int> <int>
1 2014 9 16 84
2 2014 9 15 68
3 2014 9 17 56
Run SQL queries in R using dbplyr
Practice how to make relational database SQL queries using multiple dplyr verbs like dplyr::left_join() among tables before pulling down data to your local session with dplyr::collect()!
You can also review the dbplyr R package. For a step-by-step tutorial about SQL, we recommend this tutorial on data management with SQL for Ecologists. You will find its syntax close to dplyr!
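As a minimal sketch of what such a join query could look like, suppose we add a second, hypothetical lookup table to the same database (invented here purely for illustration) and join it to cases before collecting:
R
# A hypothetical monthly bed-capacity table, invented for illustration only
capacity <- data.frame(
  year  = 2014,
  month = 5:9,
  beds  = c(10, 20, 40, 80, 100)
)
DBI::dbWriteTable(conn = db_connection, name = "capacity", value = capacity)

# Join the two database tables with dplyr verbs; the pipeline is
# translated to SQL and only runs when we call collect()
dplyr::tbl(src = db_connection, "cases") %>%
  dplyr::left_join(dplyr::tbl(src = db_connection, "capacity"),
                   by = c("year", "month")) %>%
  dplyr::filter(confirm > beds) %>%
  dplyr::collect()

# Close the connection once you are done with the database
DBI::dbDisconnect(conn = db_connection)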
Reading from HIS APIs
Health-related data are also increasingly stored in specialized health information systems (HIS) such as Fingertips, GoData, REDCap, and DHIS2, which provide APIs. In such cases, you can resort to the readepi package, which enables reading data from HIS APIs.
-[TBC]
Key Points
- Use {rio}, {io}, {readr} and {ImportExport} to read data from individual files.
- Use {readepi} to read data from HIS APIs and RDBMS.