Read case data
Last updated on 2024-04-29
Estimated time: 30 minutes
Overview
Questions
- Where do you usually store your outbreak data?
- How many different data formats can I read?
- Is it possible to import data from databases and health APIs?
Objectives
- Explain how to import outbreak data from different sources into the R
environment for analysis.
Introduction
The initial step in outbreak analysis involves importing the target
dataset into the R environment from various sources.
Outbreak data is typically stored in files of diverse formats,
relational database management systems (RDBMS), or health information
system (HIS) application program interfaces (APIs) such as REDCap, DHIS2, etc. The latter option is
particularly well-suited for storing institutional health data. This
episode will elucidate the process of reading cases from these
sources.
Reading from files
Several packages are available for importing outbreak data stored in
individual files into R. These include rio, readr from the tidyverse,
io, ImportExport, and data.table.
Together, these packages offer methods to read single or multiple files
in a wide range of formats.
The example below shows how to import a csv file into the R environment
using the rio package.
R
library("rio")
library("here")
# read data
# e.g.: if path to file is data/raw-data/ebola_cases.csv then:
ebola_confirmed <- rio::import(here::here("data", "raw-data", "ebola_cases.csv"))
# preview data
head(ebola_confirmed, 5)
OUTPUT
date confirm
1 2014-05-18 1
2 2014-05-20 2
3 2014-05-21 4
4 2014-05-22 6
5 2014-05-23 1
Similarly, you can import files of other formats such as tsv, xlsx, etc.
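As a minimal sketch of importing another format, the snippet below writes a small tab-separated file to a temporary folder and reads it back with rio::import(), which infers the format from the file extension. The file name and its location are hypothetical, and the three rows are copied from the preview above purely for illustration.

```r
library("rio")

# Create a small tab-separated file in a temporary folder
# (hypothetical file name; rows copied from the preview above)
tsv_path <- file.path(tempdir(), "ebola_cases.tsv")
writeLines(
  c("date\tconfirm", "2014-05-18\t1", "2014-05-20\t2", "2014-05-21\t4"),
  tsv_path
)

# rio::import() picks the reader based on the ".tsv" extension
ebola_tsv <- rio::import(tsv_path)

# preview data
head(ebola_tsv)
```

The same call should work for other supported formats by pointing it at a file with the corresponding extension.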
You can check the full list of supported file formats in the rio package on the package website. Some formats depend on optional packages, which you can install with:
R
rio::install_formats()
R
rio::import(here::here("some", "where", "downto", "path", "file_name.zip"))
Download the zip file containing data for the Marburg outbreak, then import it into your working environment.
Reading from databases
The DBI package serves as a versatile interface for interacting with database management systems (DBMS) across different back-ends or servers. It offers a uniform method for accessing and retrieving data from various database systems.
The following code chunk demonstrates how to create a temporary
SQLite database in memory, store the case_data data frame as a table
within it, and subsequently read from it:
R
library("DBI")
library("RSQLite")
# Create a temporary SQLite database in memory
db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# Store the 'case_data' dataframe as a table named 'cases'
# in the SQLite database
DBI::dbWriteTable(db_con, "cases", case_data)
# Read data from the 'cases' table
result <- DBI::dbReadTable(db_con, "cases")
# Close the database connection
DBI::dbDisconnect(db_con)
# View the result
base::print(utils::head(result))
OUTPUT
date confirm
1 16208 1
2 16210 2
3 16211 4
4 16212 6
5 16213 1
6 16214 2
This code first establishes a connection to an SQLite database
created in memory using dbConnect(). Then, it writes the case_data
into a table named 'cases' within the database using the
dbWriteTable() function. Subsequently, it reads the data from the
'cases' table using dbReadTable(). Finally, it closes the database
connection with dbDisconnect(). Read this tutorial episode on SQL
databases and R for more examples.
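Note that the date column in the output above comes back as numbers rather than dates. This is most likely because SQLite has no dedicated date type, so the values were stored as the number of days since 1970-01-01. The sketch below shows one way to convert them back in base R. To keep it self-contained, the data frame is rebuilt from the values printed in the output above rather than read from the database.

```r
# Values copied from the output above:
# 'date' holds the number of days since 1970-01-01
result <- data.frame(
  date = c(16208, 16210, 16211, 16212, 16213, 16214),
  confirm = c(1, 2, 4, 6, 1, 2)
)

# Convert the numeric column back to the Date class
result$date <- as.Date(result$date, origin = "1970-01-01")

# View the result
head(result)
```

After the conversion, the first rows match the dates shown when the csv file was imported directly.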
Run SQL queries in R using dbplyr
A database interface package optimizes memory usage by processing the data inside the database before extraction, which reduces the memory load. Conversely, conducting all data manipulation outside the database (e.g., in our local RStudio session) can lead to inefficient memory usage and strained system resources.
Read the Introduction to dbplyr vignette to learn how to generate your own queries!
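The following is a minimal sketch of this idea, assuming the dplyr, dbplyr, DBI, and RSQLite packages are installed. It builds a small in-memory SQLite table from the rows shown in the csv preview earlier, filters it inside the database, and only then brings the result into R. The threshold of more than one confirmed case is an arbitrary choice for illustration.

```r
library("DBI")
library("dplyr")
library("dbplyr")

# Small example table (rows copied from the csv preview earlier)
case_data <- data.frame(
  date = as.Date(c("2014-05-18", "2014-05-20", "2014-05-21",
                   "2014-05-22", "2014-05-23")),
  confirm = c(1, 2, 4, 6, 1)
)

# Create a temporary SQLite database in memory and store the table
db_con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(db_con, "cases", case_data)

# Reference the remote table without loading it into R
cases_tbl <- dplyr::tbl(db_con, "cases")

# Build the query lazily; nothing is executed yet
query <- cases_tbl %>%
  dplyr::filter(confirm > 1) %>%
  dplyr::select(date, confirm)

# Inspect the SQL that dbplyr generates for this pipeline
dplyr::show_query(query)

# Run the query in the database and bring only the result into R
high_counts <- dplyr::collect(query)

# Close the database connection
DBI::dbDisconnect(db_con)
```

Because the filter runs in the database, only the matching rows are transferred to the R session when collect() is called.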
Reading from HIS APIs
Health-related data are also increasingly stored in specialized HIS
APIs like Fingertips, GoData, REDCap, and DHIS2. In such cases, one
can use the readepi package, which enables reading data from HIS APIs.
-[TBC]