Note: This package and vignette make extensive use of
data.table. If you’re unfamiliar with the
data.table syntax, a brief review of that package’s introductory
vignette may be useful.
What does intervalaverage do?
The intervalaverage package is intended to do a specific
job very efficiently. Namely, it averages data measured non-continuously
over arbitrary intervals. It is implemented in C++ and
data.table in order to do this job as fast and in as
memory-efficient a way as possible.
The motivation for this package was to efficiently create time- and location-weighted averages of air pollution exposure for participants in epidemiological studies. In these studies, air pollution exposure is modeled on a weekly, monthly, or annual basis for study participants’ home addresses. In order to create a long-term average of each study participant’s air pollution exposure, it is necessary to calculate an average of the modeled exposures for each address for the period during which they lived there, which may not align nicely with the exposure periods produced by the models.
There are likely many other applications for averaging non-continuous measurements over arbitrary intervals, but this documentation will demonstrate the application of calculating long-term air pollution exposures for individuals over multiple locations.
We will use two data sets (included in this package) to demonstrate
intervalaverage’s main functionality.
no2: Nitrogen dioxide (NO2) modeled
annually at 62 home addresses (identified by
location_id)
address_history: The time periods for which 25 people
(identified by person_id) lived at each of these home
addresses.
library(intervalaverage)
data("no2")
data("address_history")
setDT(no2)
setDT(address_history)
head(no2[])
#> no2 start_date end_date location_id
#> 1: 24.05 1990-01-01 1990-12-31 1
#> 2: 22.99 1991-01-01 1991-12-31 1
#> 3: 24.90 1992-01-01 1992-12-31 1
#> 4: 25.88 1993-01-01 1993-12-31 1
#> 5: 25.99 1994-01-01 1994-12-31 1
#> 6: 25.56 1995-01-01 1995-12-31 1
head(address_history[])
#> start_date end_date location_id person_id
#> 1: 1989-06-01 2021-12-31 43 1
#> 2: 1989-06-01 2001-10-31 18 2
#> 3: 2001-11-01 2007-04-30 17 2
#> 4: 2007-05-01 2021-12-31 16 2
#> 5: 1989-06-01 1996-07-05 56 3
#> 6: 1989-06-01 1992-06-30 9 4
Note that each of these data sets is already a
data.table and the date columns are IDates.
However, due to a quirk in data.table, we need to run
setDT on data loaded with the data
function.
Inspecting the no2 data set above, we see that each
location has a single NO2 measurement for each calendar year.
It would be simple to find the average measurement over, for example,
2002-01-01 through 2005-12-31, because that period aligns with our
measurements. However, what if we wanted to find the average
NO2 level at each of the periods in our
address_history? Not only does each address have a
different date range to average over, but that range doesn’t align
nicely with the measurements! For each address, we need to take a mean
of the measurements, weighted by how much of each measurement period
overlaps the period in the address history.
This is straightforward with intervalaverage’s
intervalaverage function!
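To make the weighting concrete, here is a minimal by-hand sketch for a single location. It uses the first two NO2 values for location 1 shown above and a hypothetical averaging window (1990-07-01 through 1991-06-30) chosen only for illustration. It assumes both interval endpoints are inclusive, and it is not how the package computes the result internally.

```r
library(data.table)

# Two annual NO2 values for location 1, copied from head(no2) above
measurements <- data.table(
  start_date = as.IDate(c("1990-01-01", "1991-01-01")),
  end_date   = as.IDate(c("1990-12-31", "1991-12-31")),
  no2        = c(24.05, 22.99)
)

# HYPOTHETICAL averaging window, chosen only for illustration
window_start <- as.IDate("1990-07-01")
window_end   <- as.IDate("1991-06-30")

# Clip each measurement period to the window and count the overlapping days.
# Dates are converted to integer day counts; both endpoints are inclusive (+ 1).
measurements[, overlap_days :=
  pmin(as.integer(end_date), as.integer(window_end)) -
  pmax(as.integer(start_date), as.integer(window_start)) + 1L]

# Weight each measurement by the number of days it overlaps the window
by_hand <- measurements[, weighted.mean(no2, overlap_days)]
by_hand
```

The 1990 value overlaps the window for 184 days and the 1991 value for 181 days, so the result (about 23.52) sits slightly closer to the 1990 measurement.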
The y argument of intervalaverage tells it
what date range to average for every group in the values data set (in
this case, every location_id in no2). For now,
we don’t care about who lived at each address, only the unique addresses
and date ranges. Since some people in our address history shared homes,
we begin by creating a set of unique addresses and their associated
dates.
unique_addresses <- unique(address_history[ , .(location_id, start_date, end_date)])
Now we can use the intervalaverage function:
averaged_exposures <- intervalaverage(
x = no2,
y = unique_addresses,
interval_vars = c("start_date", "end_date"),
value_vars = "no2",
group_vars = "location_id"
)
head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
#> location_id start_date end_date no2
#> 1: 1 1990-07-01 2005-06-30 24.83144
#> 2: 2 2005-07-01 2021-12-31 NA
#> 3: 6 1989-06-01 1992-09-30 NA
#> 4: 7 1989-06-01 1990-06-30 NA
#> 5: 8 1989-06-01 2021-12-31 NA
#> 6: 9 1989-06-01 1992-06-30 NA
You’ll notice that there are many missing values. By default,
intervalaverage returns NA unless the interval
specified in y is completely covered by the values in
x, and many of our addresses were occupied outside the time
period covered by the NO2 model. To relax this requirement
and return a value even when not all of the interval had NO2
values, use the required_percentage argument. For example,
here we accept an average if at least 80% of the interval has
values.
averaged_exposures <- intervalaverage(
x = no2,
y = unique_addresses,
interval_vars = c("start_date", "end_date"),
value_vars = "no2",
group_vars = "location_id",
required_percentage = 80
)
head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
#> location_id start_date end_date no2
#> 1: 1 1990-07-01 2005-06-30 24.83144
#> 2: 2 2005-07-01 2021-12-31 NA
#> 3: 6 1989-06-01 1992-09-30 21.85671
#> 4: 7 1989-06-01 1990-06-30 NA
#> 5: 8 1989-06-01 2021-12-31 NA
#> 6: 9 1989-06-01 1992-06-30 17.59011
intervalaverage also includes some diagnostic columns in
its output that can be useful in determining why NAs are
being returned (this is why we selected only location_id,
start_date, end_date, and no2 to
display above). See the “Value” section of
help(intervalaverage) for an explanation of what these
columns mean.
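To see why location 6 now gets a value while location 7 still returns NA, the coverage can be approximated by hand. This is a rough sketch in base R, not the package's own calculation. The helper `coverage_pct` is hypothetical, and it assumes that the modeled NO2 for these locations begins on 1990-01-01 (as it does for location 1 above; the output does not show this for locations 6 and 7), that the model has no gaps through the end of each interval, and that interval endpoints are inclusive.

```r
# HYPOTHETICAL helper: percentage of an interval's days that fall on or after
# the first modeled day. Assumes the model covers every day from model_start
# through interval_end, and that both interval endpoints are inclusive.
coverage_pct <- function(interval_start, interval_end, model_start) {
  interval_start <- as.Date(interval_start)
  interval_end   <- as.Date(interval_end)
  model_start    <- as.Date(model_start)

  total_days   <- as.integer(interval_end - interval_start) + 1L
  covered_from <- max(interval_start, model_start)
  covered_days <- max(0L, as.integer(interval_end - covered_from) + 1L)

  100 * covered_days / total_days
}

# ASSUMPTION: model coverage starts 1990-01-01 for both locations
coverage_pct("1989-06-01", "1992-09-30", "1990-01-01")  # location 6
coverage_pct("1989-06-01", "1990-06-30", "1990-01-01")  # location 7
```

Under these assumptions, location 6 has roughly 82% of its interval covered, which clears the 80% threshold, while location 7 has roughly 46% and so remains NA.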
If you look closely at our address history, you’ll see that some
locations and dates are shared among multiple person_ids.
This makes sense, since more than one person often lives in the same
house! You’ll also notice that people move (so each
person_id may be associated with multiple
location_ids).
To create an average for each person (rather than location), we’ll
need to link the no2 measurement at each location to the
person who lived there at the time. Remember, multiple people might have
lived at the same location at the same or different times.
The intervalaverage package does this with the
intervalintersect function. It looks very similar to the
intervalaverage function:
no2_by_person <- intervalintersect(
x = no2,
y = address_history,
interval_vars = c('start_date', 'end_date'),
group_vars = 'location_id'
)
head(no2_by_person)
#> location_id start end no2 person_id
#> 1: 1 1990-07-01 1990-12-31 24.05 6
#> 2: 1 1991-01-01 1991-12-31 22.99 6
#> 3: 1 1992-01-01 1992-12-31 24.90 6
#> 4: 1 1993-01-01 1993-12-31 25.88 6
#> 5: 1 1994-01-01 1994-12-31 25.99 6
#> 6: 1 1995-01-01 1995-12-31 25.56 6
To understand what’s happening here, let’s look at person 5.
This person moved from location 40 to location 41 in November of 1996.
address_history[person_id == 5]
#> start_date end_date location_id person_id
#> 1: 1989-06-01 1996-11-30 40 5
#> 2: 1996-12-01 2021-12-31 41 5
If you look at the result of intervalintersect, you’ll
see that person 5 has two no2 values for 1996, and that
each one is associated with the part of the year when they lived where
that value applied.
no2_by_person[person_id == 5 & year(start) %in% 1995:1997]
#> location_id start end no2 person_id
#> 1: 40 1995-01-01 1995-12-31 44.08 5
#> 2: 40 1996-01-01 1996-11-30 37.17 5
#> 3: 41 1996-12-01 1996-12-31 28.02 5
#> 4: 41 1997-01-01 1997-12-31 25.00 5
Now that we have a set of NO2 values associated with the
person_id rather than location_id, we can use
intervalaverage to find the average for any period of time.
That could be a period that is unique to each participant. Or it could
be consistent for all people in the data set, such as a yearly average
(where each year is a time-weighted average of the places the person
lived that year).
We will work through both, and introduce one more function along the way.
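As a quick illustration of what a time-weighted yearly average means, person 5's 1996 value can be worked out by hand from the two 1996 rows shown above. This is only a sanity-check sketch in base R with inclusive endpoints. The variable names are made up for the example, and one would expect intervalaverage to give a matching figure for a 1996 calendar-year interval, though that call is not shown here.

```r
# Rows for person 5 in 1996, copied from the no2_by_person output above
period_start <- as.Date(c("1996-01-01", "1996-12-01"))
period_end   <- as.Date(c("1996-11-30", "1996-12-31"))
no2_values   <- c(37.17, 28.02)

# Days spent at each address in 1996 (endpoints inclusive; 1996 is a leap year)
days_at_address <- as.integer(period_end - period_start) + 1L

# Time-weighted mean across the two addresses
person5_1996 <- weighted.mean(no2_values, days_at_address)
person5_1996
```

Because person 5 spent 335 of the 366 days at location 40, the weighted value (about 36.4) is much closer to 37.17 than to 28.02.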
The package includes another data.table with each person’s enrollment date in our study. We will use it to calculate the average exposure for each participant in the year before they joined our study.
data(enrollment)
setDT(enrollment)
head(enrollment)
#> person_id enroll_date
#> 1: 1 2016-01-08
#> 2: 2 2016-08-18
#> 3: 3 2015-08-12
#> 4: 4 2016-07-18
#> 5: 5 2016-07-11
#> 6: 6 2015-07-26
Again, the y argument of intervalaverage
tells it what date range to average for every group in the values data
set (x). So we first need to create a
data.table that contains the beginning and end of our
desired averaging period for each participant from the
enrollment data set. We’ll define the “year prior” to
enrollment in our study as the 365 days preceding the enrollment date
(i.e. the period from 365 days before enrollment to 1 day before
enrollment).
pre_enrollment <- enrollment[ , .(person_id, start = enroll_date - 365L, end = enroll_date - 1L)]
head(pre_enrollment)
#> person_id start end
#> 1: 1 2015-01-08 2016-01-07
#> 2: 2 2015-08-19 2016-08-17
#> 3: 3 2014-08-12 2015-08-11
#> 4: 4 2015-07-19 2016-07-17
#> 5: 5 2015-07-12 2016-07-10
#> 6: 6 2014-07-26 2015-07-25
You may notice I’ve used start and end
here rather than start_date and end_date as I
did previously. This is because of a quirk of
intervalintersect where it returns start and
end as the interval column names regardless of the input
names, and in the next step, the interval column names need to match.
This will be fixed in future versions of the package.
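If you would rather keep the start_date and end_date names throughout, one alternative is to rename the interval columns of the intersected table instead. The sketch below uses data.table's setnames on a single row copied from the no2_by_person output above, standing in for the full table; the object name `intersected` is made up for the example.

```r
library(data.table)

# Stand-in for no2_by_person: one row copied from the output above,
# with the column names that intervalintersect returned
intersected <- data.table(
  location_id = 1L,
  start       = as.IDate("1990-07-01"),
  end         = as.IDate("1990-12-31"),
  no2         = 24.05,
  person_id   = 6L
)

# Rename by reference so the interval columns match start_date / end_date again
setnames(intersected, old = c("start", "end"), new = c("start_date", "end_date"))
names(intersected)
```

With that approach, the y table and the interval_vars argument in the next step would use start_date and end_date as well. The rest of this vignette keeps start and end.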
Now we simply apply intervalaverage, again allowing for
20% missingness in no2 values:
no2_pre_enrollment <- intervalaverage(
x = no2_by_person,
y = pre_enrollment,
interval_vars = c('start', 'end'),
group_vars = 'person_id',
value_vars = 'no2',
required_percentage = 80
)
head(no2_pre_enrollment[, .(person_id, start, end, no2)])
#> person_id start end no2
#> 1: 1 2015-01-08 2016-01-07 NA
#> 2: 2 2015-08-19 2016-08-17 NA
#> 3: 3 2014-08-12 2015-08-11 NA
#> 4: 4 2015-07-19 2016-07-17 NA
#> 5: 5 2015-07-12 2016-07-10 NA
#> 6: 6 2014-07-26 2015-07-25 NA
Finally, let’s calculate annual (calendar-year) averages for
2000-2004. We’ll need to build a y argument for the
intervalaverage function that has a start date and an end date
for each year. However, even though the periods we are averaging over
are the same for each participant, we also need to repeat each
year once for every participant. This is because y
must contain the same grouping variable(s) as x.
This gives us an opportunity to try a helper function in
intervalaverage: CJ.dt. This is a more
convenient version of data.table::CJ (CJ as in
“Cross Join”) that can take data.tables as arguments
instead of just vectors. By cross joining our unique
person_ids with the years 2000-2004, we can construct a
y argument that will work in the
intervalaverage function.
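For readers new to cross joins, the sketch below shows what one produces, using only data.table and two tiny made-up tables. It merges on a constant dummy key, which is one common way to cross join two tables; it is an illustration of the idea and not a description of how CJ.dt is implemented.

```r
library(data.table)

# Tiny made-up tables, for illustration only
years_small   <- data.table(year = 2000:2001)
persons_small <- data.table(person_id = 1:3)

# Add a constant dummy key (k) to a copy of each table, then merge on it.
# allow.cartesian = TRUE permits the many-to-many match that a cross join is.
crossed <- merge(
  years_small[, .(year, k = 1L)],
  persons_small[, .(person_id, k = 1L)],
  by = "k",
  allow.cartesian = TRUE
)
crossed[, k := NULL]  # drop the dummy key

nrow(crossed)  # 2 years x 3 people = 6 rows
```

Every year is paired with every person, which is exactly the shape the y argument needs when the grouping variable is person_id.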
# Build a data.table with start and end dates for each year
years <- data.table(year = 2000:2004)
years[ , start := as.IDate(paste(year, "01", "01", sep = "-"))]
years[ , end := as.IDate(paste(year, "12", "31", sep = "-"))]
years[]
#> year start end
#> 1: 2000 2000-01-01 2000-12-31
#> 2: 2001 2001-01-01 2001-12-31
#> 3: 2002 2002-01-01 2002-12-31
#> 4: 2003 2003-01-01 2003-12-31
#> 5: 2004 2004-01-01 2004-12-31
# Cross join with the unique person_ids in our data
years <- CJ.dt(years, unique(address_history[ , .(person_id)]))
head(years[order(year, person_id)])
#> year start end person_id
#> 1: 2000 2000-01-01 2000-12-31 1
#> 2: 2000 2000-01-01 2000-12-31 2
#> 3: 2000 2000-01-01 2000-12-31 3
#> 4: 2000 2000-01-01 2000-12-31 4
#> 5: 2000 2000-01-01 2000-12-31 5
#> 6: 2000 2000-01-01 2000-12-31 6
Now we can apply the intervalaverage function again to
get a yearly average for each person. Recall that the NO2
data is also annual, so the only rows that will actually be averaged
will be those for years where a person moved (since a person will have
had an NO2 value for each address they lived in that
year).
no2_annual <- intervalaverage(
x = no2_by_person,
y = years,
interval_vars = c('start', 'end'),
group_vars = 'person_id',
value_vars = 'no2',
required_percentage = 80
)
head(no2_annual[, .(person_id, start, end, no2)])
#> person_id start end no2
#> 1: 1 2000-01-01 2000-12-31 35.14
#> 2: 1 2001-01-01 2001-12-31 40.35
#> 3: 1 2002-01-01 2002-12-31 32.97
#> 4: 1 2003-01-01 2003-12-31 31.82
#> 5: 1 2004-01-01 2004-12-31 30.72
#> 6: 2 2000-01-01 2000-12-31 17.13
For a more detailed and advanced tutorial of how to use the
intervalaverage package, see
vignette("intervalaverage-advanced").
For information about the inner workings of the package (intended for
maintainers), see
vignette("intervalaverage-technicaloverview").