intervalaverage-intro.Rmd
Note: This package and vignette make extensive use of data.table. If you’re unfamiliar with the data.table syntax, a brief review of that package’s introductory vignette may be useful.
What does intervalaverage do?
The intervalaverage package is intended to do one specific job very efficiently: averaging data measured non-continuously over arbitrary intervals. It is implemented in C++ and data.table to make this job as fast and as memory-efficient as possible.
The motivation for this package was to efficiently create time- and location-weighted averages of air pollution exposure for participants in epidemiological studies. In these studies, air pollution exposure is modeled on a weekly, monthly, or annual basis for study participants’ home addresses. In order to create a long-term average of each study participant’s air pollution exposure, it is necessary to calculate an average of the modeled exposures for each address for the period during which they lived there, which may not align nicely with the exposure periods produced by the models.
There are likely many other applications for averaging non-continuous measurements over arbitrary intervals, but this documentation will demonstrate the application of calculating long-term air pollution exposures for individuals over multiple locations.
We will use two data sets (included in this package) to demonstrate intervalaverage’s main functionality.
no2: Nitrogen dioxide (NO2) modeled annually at 62 home addresses (identified by location_id).
address_history: The time periods during which 25 people (identified by person_id) lived at each of these home addresses.
library(intervalaverage)
data("no2")
data("address_history")
setDT(no2)
setDT(address_history)
head(no2[])
#> no2 start_date end_date location_id
#> 1: 24.05 1990-01-01 1990-12-31 1
#> 2: 22.99 1991-01-01 1991-12-31 1
#> 3: 24.90 1992-01-01 1992-12-31 1
#> 4: 25.88 1993-01-01 1993-12-31 1
#> 5: 25.99 1994-01-01 1994-12-31 1
#> 6: 25.56 1995-01-01 1995-12-31 1
head(address_history[])
#> start_date end_date location_id person_id
#> 1: 1989-06-01 2021-12-31 43 1
#> 2: 1989-06-01 2001-10-31 18 2
#> 3: 2001-11-01 2007-04-30 17 2
#> 4: 2007-05-01 2021-12-31 16 2
#> 5: 1989-06-01 1996-07-05 56 3
#> 6: 1989-06-01 1992-06-30 9 4
Note that each of these data sets is already a data.table and the date columns are IDates. However, due to a quirk in data.table, we need to run setDT on data loaded with the data function.
Inspecting the no2 data set above, we see that each location has a single NO2 measurement for each calendar year. It would be simple to find the average measurement over, for example, 2002-01-01 through 2005-12-31, because that period aligns with our measurements. However, what if we wanted to find the average NO2 level for each of the periods in our address_history? Not only does each address have a different date range to average over, but that range doesn’t align nicely with the measurements! For each address, we need to take a mean of the measurements, weighted by how much of each measurement period is overlapped by the address period.
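As a rough illustration of this weighting, the sketch below does the calculation by hand in base R. It is not the package's implementation. It averages the first three annual values shown for location 1 over a hypothetical period, 1990-07-01 through 1992-06-30, which is an assumption for illustration and is not taken from address_history. Each value is weighted by the number of days its year overlaps the period.

```r
# Annual NO2 values for location 1, copied from head(no2) above
measurements <- data.frame(
  start_date = as.Date(c("1990-01-01", "1991-01-01", "1992-01-01")),
  end_date   = as.Date(c("1990-12-31", "1991-12-31", "1992-12-31")),
  no2        = c(24.05, 22.99, 24.90)
)

# Hypothetical averaging period (an assumption for illustration only)
period_start <- as.Date("1990-07-01")
period_end   <- as.Date("1992-06-30")

# Clip each measurement interval to the averaging period
overlap_start <- pmax(measurements$start_date, period_start)
overlap_end   <- pmin(measurements$end_date, period_end)

# Number of overlapping days, counting both endpoints; 0 if there is no overlap
overlap_days <- pmax(as.integer(overlap_end - overlap_start) + 1L, 0L)

# Mean of the measurements, weighted by overlapping days
weighted_mean <- sum(measurements$no2 * overlap_days) / sum(overlap_days)

overlap_days   # 184 365 182
weighted_mean
```

The full year 1991 contributes about twice the weight of either partial year, so the result sits closer to the 1991 value than a simple mean of the three values would.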
This is straightforward with the intervalaverage function!
The y argument of intervalaverage tells it what date range to average for every group in the values data set (in this case, every location_id in no2). For now, we don’t care about who lived at each address, only the unique addresses and date ranges. Since some people in our address history shared homes, we begin by creating a set of unique addresses and their associated dates.
unique_addresses <- unique(address_history[ , .(location_id, start_date, end_date)])
Now we can use the intervalaverage function:
averaged_exposures <- intervalaverage(
  x = no2,
  y = unique_addresses,
  interval_vars = c("start_date", "end_date"),
  value_vars = "no2",
  group_vars = "location_id"
)
head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
#> location_id start_date end_date no2
#> 1: 1 1990-07-01 2005-06-30 24.83144
#> 2: 2 2005-07-01 2021-12-31 NA
#> 3: 6 1989-06-01 1992-09-30 NA
#> 4: 7 1989-06-01 1990-06-30 NA
#> 5: 8 1989-06-01 2021-12-31 NA
#> 6: 9 1989-06-01 1992-06-30 NA
You’ll notice that there are many missing values. By default, intervalaverage returns NA unless the interval specified in y is completely covered by the values in x, and many of our addresses were occupied outside the time period covered by the NO2 model. To relax this requirement and return a value even when not all of the interval has NO2 values, use the required_percentage argument. For example, here we accept an average if at least 80% of the interval has values.
averaged_exposures <- intervalaverage(
  x = no2,
  y = unique_addresses,
  interval_vars = c("start_date", "end_date"),
  value_vars = "no2",
  group_vars = "location_id",
  required_percentage = 80
)
head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
#> location_id start_date end_date no2
#> 1: 1 1990-07-01 2005-06-30 24.83144
#> 2: 2 2005-07-01 2021-12-31 NA
#> 3: 6 1989-06-01 1992-09-30 21.85671
#> 4: 7 1989-06-01 1990-06-30 NA
#> 5: 8 1989-06-01 2021-12-31 NA
#> 6: 9 1989-06-01 1992-06-30 17.59011
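To see why locations 6 and 9 now return a value while location 7 still does not, the base R sketch below estimates the share of each interval that has NO2 values. It rests on an assumption that is not confirmed above: that the NO2 series begins on 1990-01-01 at every location (as it does for location 1) and runs at least through the end of each interval shown. The diagnostic columns returned by intervalaverage are the authoritative source for these figures.

```r
# Three address intervals copied from the output above
intervals <- data.frame(
  location_id = c(6, 7, 9),
  start_date  = as.Date(c("1989-06-01", "1989-06-01", "1989-06-01")),
  end_date    = as.Date(c("1992-09-30", "1990-06-30", "1992-06-30"))
)

# ASSUMPTION: modeled NO2 starts on this date for every location
model_start <- as.Date("1990-01-01")

# Length of each interval in days, counting both endpoints
total_days <- as.integer(intervals$end_date - intervals$start_date) + 1L

# Days of each interval on or after the assumed start of the NO2 series
covered_from <- pmax(intervals$start_date, model_start)
covered_days <- pmax(as.integer(intervals$end_date - covered_from) + 1L, 0L)

intervals$percent_covered <- 100 * covered_days / total_days
intervals$meets_80 <- intervals$percent_covered >= 80

intervals
```

Under that assumption, locations 6 and 9 come out at roughly 82% and 81% coverage, and location 7 at roughly 46%, which is consistent with the NA pattern in the output above.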
intervalaverage also includes some diagnostic columns in its output that can be useful in determining why NAs are being returned (this is why we selected only location_id, start_date, end_date, and no2 to display above). See the “Value” section of help(intervalaverage) for an explanation of what these columns mean.
If you look closely at our address history, you’ll see that some locations and dates are shared among multiple person_ids. This makes sense, since more than one person often lives in the same house! You’ll also notice that people move (so each person_id may be associated with multiple location_ids).
To create an average for each person (rather than location), we’ll need to link the no2 measurement at each location to the person who lived there at the time. Remember, multiple people might have lived at the same location at the same or different times.
The intervalaverage package does this with the intervalintersect function. It looks very similar to the intervalaverage function:
no2_by_person <- intervalintersect(
  x = no2,
  y = address_history,
  interval_vars = c('start_date', 'end_date'),
  group_vars = 'location_id'
)
head(no2_by_person)
#> location_id start end no2 person_id
#> 1: 1 1990-07-01 1990-12-31 24.05 6
#> 2: 1 1991-01-01 1991-12-31 22.99 6
#> 3: 1 1992-01-01 1992-12-31 24.90 6
#> 4: 1 1993-01-01 1993-12-31 25.88 6
#> 5: 1 1994-01-01 1994-12-31 25.99 6
#> 6: 1 1995-01-01 1995-12-31 25.56 6
To understand what’s happening here, let’s look at person 5.
This person moved from location 40 to location 41 at the end of November 1996.
address_history[person_id == 5]
#> start_date end_date location_id person_id
#> 1: 1989-06-01 1996-11-30 40 5
#> 2: 1996-12-01 2021-12-31 41 5
If you look at the result of intervalintersect, you’ll see that person 5 has two no2 values for 1996, and that each one is associated with the part of the year when they lived where that value applied.
no2_by_person[person_id == 5 & year(start) %in% 1995:1997]
#> location_id start end no2 person_id
#> 1: 40 1995-01-01 1995-12-31 44.08 5
#> 2: 40 1996-01-01 1996-11-30 37.17 5
#> 3: 41 1996-12-01 1996-12-31 28.02 5
#> 4: 41 1997-01-01 1997-12-31 25.00 5
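The two 1996 rows can be reproduced by hand. The base R sketch below intersects the 1996 measurement year with each of person 5's address periods by taking the later of the two start dates and the earlier of the two end dates. It only illustrates the idea of an interval intersection and makes no claim about how intervalintersect is implemented.

```r
# Person 5's address history, copied from the output above
addresses <- data.frame(
  location_id = c(40, 41),
  addr_start  = as.Date(c("1989-06-01", "1996-12-01")),
  addr_end    = as.Date(c("1996-11-30", "2021-12-31"))
)

# The 1996 measurement year
year_start <- as.Date("1996-01-01")
year_end   <- as.Date("1996-12-31")

# Intersection of two intervals: the later start and the earlier end
addresses$start <- pmax(addresses$addr_start, year_start)
addresses$end   <- pmin(addresses$addr_end, year_end)

# Keep only the rows where the two intervals actually overlap
pieces <- addresses[addresses$start <= addresses$end,
                    c("location_id", "start", "end")]
pieces
```

The result has one row per address, with the start and end dates matching rows 2 and 3 of the output above.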
Now that we have a set of NO2 values associated with the person_id rather than location_id, we can use intervalaverage to find the average for any period of time. That could be a period that is unique to each participant. Or it could be consistent for all people in the data set, such as a yearly average (where each year is a time-weighted average of the places the person lived that year).
We will work through both, and introduce one more function along the way.
The package includes another data.table with each person’s enrollment date in our study. We will use it to calculate the average exposure for each participant in the year before they joined our study.
data(enrollment)
setDT(enrollment)
head(enrollment)
#> person_id enroll_date
#> 1: 1 2016-01-08
#> 2: 2 2016-08-18
#> 3: 3 2015-08-12
#> 4: 4 2016-07-18
#> 5: 5 2016-07-11
#> 6: 6 2015-07-26
Again, the y argument of intervalaverage tells it what date range to average for every group in the values data set (x). So we first need to create a data.table that contains the beginning and end of our desired averaging period for each participant from the enrollment data set. We’ll define the “year prior” to enrollment in our study as the 365 days preceding the enrollment date (i.e. the period from 365 days before enrollment to 1 day before enrollment).
pre_enrollment <- enrollment[ , .(person_id, start = enroll_date - 365L, end = enroll_date - 1L)]
head(pre_enrollment)
#> person_id start end
#> 1: 1 2015-01-08 2016-01-07
#> 2: 2 2015-08-19 2016-08-17
#> 3: 3 2014-08-12 2015-08-11
#> 4: 4 2015-07-19 2016-07-17
#> 5: 5 2015-07-12 2016-07-10
#> 6: 6 2014-07-26 2015-07-25
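As a quick sanity check on this definition, the base R sketch below recomputes the interval for the first two enrollment dates and confirms that each one spans exactly 365 days when both endpoints are counted. It uses Date rather than IDate so that it runs without data.table; the date arithmetic shown is the same for these values.

```r
# Two enrollment dates copied from head(enrollment) above
enroll_date <- as.Date(c("2016-01-08", "2016-08-18"))

# Same definition as in pre_enrollment
start <- enroll_date - 365L
end   <- enroll_date - 1L

# Inclusive length of each interval in days
interval_days <- as.integer(end - start) + 1L

data.frame(enroll_date, start, end, interval_days)
```

Because the window is a fixed 365 days rather than a calendar year, the second interval starts on 2015-08-19 rather than 2015-08-18: it contains 2016-02-29.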
You may notice I’ve used start and end here rather than start_date and end_date as I did previously. This is because of a quirk of intervalintersect where it returns start and end as the interval column names regardless of the input names, and in the next step, the interval column names need to match. This will be fixed in future versions of the package.
Now we simply apply intervalaverage, again allowing for 20% missingness in no2 values:
no2_pre_enrollment <- intervalaverage(
  x = no2_by_person,
  y = pre_enrollment,
  interval_vars = c('start', 'end'),
  group_vars = 'person_id',
  value_vars = 'no2',
  required_percentage = 80
)
head(no2_pre_enrollment[, .(person_id, start, end, no2)])
#> person_id start end no2
#> 1: 1 2015-01-08 2016-01-07 NA
#> 2: 2 2015-08-19 2016-08-17 NA
#> 3: 3 2014-08-12 2015-08-11 NA
#> 4: 4 2015-07-19 2016-07-17 NA
#> 5: 5 2015-07-12 2016-07-10 NA
#> 6: 6 2014-07-26 2015-07-25 NA
Finally, let’s calculate annual (calendar-year) averages for 2000-2004. We’ll need to build a y argument for the intervalaverage function that has a start date and an end date for each year. However, even though the periods we are averaging over are the same for each participant, we also need to repeat each year once for every participant. This is because y must contain the same grouping variable(s) as x.
This gives us an opportunity to try a convenient helper function in intervalaverage: CJ.dt. This is a more convenient version of data.table::CJ (CJ as in “Cross Join”) that can take data.tables as arguments instead of just vectors. By cross joining our unique person_ids with the years 2000-2004, we can construct a y argument that will work in the intervalaverage function.
# Build a data.table with start and end dates for each year
years <- data.table(year = 2000:2004)
years[ , start := as.IDate(paste(year, "01", "01", sep = "-"))]
years[ , end := as.IDate(paste(year, "12", "31", sep = "-"))]
years[]
#> year start end
#> 1: 2000 2000-01-01 2000-12-31
#> 2: 2001 2001-01-01 2001-12-31
#> 3: 2002 2002-01-01 2002-12-31
#> 4: 2003 2003-01-01 2003-12-31
#> 5: 2004 2004-01-01 2004-12-31
# Cross join with the unique person_ids in our data
years <- CJ.dt(years, unique(address_history[ , .(person_id)]))
head(years[order(year, person_id)])
#> year start end person_id
#> 1: 2000 2000-01-01 2000-12-31 1
#> 2: 2000 2000-01-01 2000-12-31 2
#> 3: 2000 2000-01-01 2000-12-31 3
#> 4: 2000 2000-01-01 2000-12-31 4
#> 5: 2000 2000-01-01 2000-12-31 5
#> 6: 2000 2000-01-01 2000-12-31 6
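For readers who have not used a cross join before, the base R sketch below produces the same kind of result on plain data frames with merge(by = NULL), which returns every combination of rows from its two inputs. This is an analogy only and says nothing about how CJ.dt is implemented. The three person_ids are a hypothetical subset chosen to keep the output small.

```r
# One row per calendar year, with its start and end dates
years <- data.frame(
  year  = 2000:2004,
  start = as.Date(paste(2000:2004, "01", "01", sep = "-")),
  end   = as.Date(paste(2000:2004, "12", "31", sep = "-"))
)

# Hypothetical subset of participants (an assumption for illustration)
people <- data.frame(person_id = 1:3)

# by = NULL asks merge() for the Cartesian product of the two data frames
crossed <- merge(years, people, by = NULL)

nrow(crossed)   # 15: 5 years x 3 people
head(crossed[order(crossed$year, crossed$person_id), ])
```

Each year appears once per person and each person appears once per year, which is the shape the y argument needs when person_id is the grouping variable.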
Now we can apply the intervalaverage function again to get a yearly average for each person. Recall that the NO2 data is also annual, so the only rows that will actually be averaged will be those for years where a person moved (since a person will have had an NO2 value for each address they lived in that year).
no2_annual <- intervalaverage(
  x = no2_by_person,
  y = years,
  interval_vars = c('start', 'end'),
  group_vars = 'person_id',
  value_vars = 'no2',
  required_percentage = 80
)
head(no2_annual[, .(person_id, start, end, no2)])
#> person_id start end no2
#> 1: 1 2000-01-01 2000-12-31 35.14
#> 2: 1 2001-01-01 2001-12-31 40.35
#> 3: 1 2002-01-01 2002-12-31 32.97
#> 4: 1 2003-01-01 2003-12-31 31.82
#> 5: 1 2004-01-01 2004-12-31 30.72
#> 6: 2 2000-01-01 2000-12-31 17.13
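To make the "years where a person moved" case concrete, the base R sketch below works through person 5 in 1996, using the two rows from the intervalintersect output shown earlier. The year 1996 falls outside the 2000-2004 range computed above, so this is an illustration of the arithmetic by hand rather than a row you will find in no2_annual.

```r
# Person 5's two 1996 rows, copied from the intervalintersect output
pieces <- data.frame(
  start = as.Date(c("1996-01-01", "1996-12-01")),
  end   = as.Date(c("1996-11-30", "1996-12-31")),
  no2   = c(37.17, 28.02)
)

# Days spent at each address during 1996, counting both endpoints
days <- as.integer(pieces$end - pieces$start) + 1L

# Time-weighted annual average across the two addresses
annual_no2 <- sum(pieces$no2 * days) / sum(days)

days         # 335 31
annual_no2
```

The result lies much closer to 37.17 than to 28.02, because person 5 spent 335 of the 366 days in 1996 at location 40.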
For a more detailed and advanced tutorial on how to use the intervalaverage package, see vignette("intervalaverage-advanced").
For information about the inner workings of the package (intended for maintainers), see vignette("intervalaverage-technicaloverview").