Note: This package and vignette make extensive use of data.table. If you’re unfamiliar with the data.table syntax, a brief review of that package’s introductory vignette may be useful.

What does intervalaverage do?

The intervalaverage package is intended to do a specific job very efficiently. Namely, it averages data measured non-continuously over arbitrary intervals. It is implemented in C++ and data.table in order to do this job as fast and in as memory-efficient a way as possible.

The motivation for this package was to efficiently create time- and location-weighted averages of air pollution exposure for participants in epidemiological studies. In these studies, air pollution exposure is modeled on a weekly, monthly, or annual basis for study participants’ home addresses. In order to create a long-term average of each study participant’s air pollution exposure, it is necessary to calculate an average of the modeled exposures for each address for the period during which they lived there, which may not align nicely with the exposure periods produced by the models.

There are likely many other applications for averaging non-continuous measurements over arbitrary intervals, but this documentation will demonstrate the application of calculating long-term air pollution exposures for individuals over multiple locations.

Data

We will use two data sets (included in this package) to demonstrate intervalaverage’s main functionality.

  • no2: Nitrogen dioxide (NO2) modeled annually at 62 home addresses (identified by location_id).

  • address_history: The time periods for which 25 people (identified by person_id) lived at each of these home addresses.

library(data.table)  # provides setDT() and the [ ] syntax used throughout
library(intervalaverage)
data("no2")
data("address_history")
setDT(no2) 
setDT(address_history)
head(no2[])
#>      no2 start_date   end_date location_id
#> 1: 24.05 1990-01-01 1990-12-31           1
#> 2: 22.99 1991-01-01 1991-12-31           1
#> 3: 24.90 1992-01-01 1992-12-31           1
#> 4: 25.88 1993-01-01 1993-12-31           1
#> 5: 25.99 1994-01-01 1994-12-31           1
#> 6: 25.56 1995-01-01 1995-12-31           1
head(address_history[])
#>    start_date   end_date location_id person_id
#> 1: 1989-06-01 2021-12-31          43         1
#> 2: 1989-06-01 2001-10-31          18         2
#> 3: 2001-11-01 2007-04-30          17         2
#> 4: 2007-05-01 2021-12-31          16         2
#> 5: 1989-06-01 1996-07-05          56         3
#> 6: 1989-06-01 1992-06-30           9         4

Note that each of these data sets is already a data.table and the date columns are IDates. However, due to a quirk in data.table, we need to run setDT on data loaded with the data function.

Averaging Measurements over Arbitrary Intervals

Inspecting the no2 data set above, we see that each location has a single NO2 measurement for each calendar year. It would be simple to find the average measurement over, for example, 2002-01-01 through 2005-12-31, because that period aligns with our measurements. However, what if we wanted to find the average NO2 level for each of the periods in our address_history? Not only does each address have a different date range to average over, but that range doesn’t align nicely with the measurements! For each address, we need to take a mean of the measurements, weighted by how much of each measurement period overlaps that address’s date range.
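
To make the weighting concrete, here is a minimal base-R sketch of a time-weighted mean for a single, hypothetical averaging period (1990-07-01 through 1991-12-31) at location 1. It is not the package's implementation. It uses the two annual values shown in the output above and assumes that intervals include both endpoints; the helper name overlap_days is made up for this illustration.

```r
# Annual NO2 values for location 1, copied from the head(no2) output above
x <- data.frame(
  start = as.Date(c("1990-01-01", "1991-01-01")),
  end   = as.Date(c("1990-12-31", "1991-12-31")),
  no2   = c(24.05, 22.99)
)

# Hypothetical averaging period (chosen for illustration only)
y_start <- as.Date("1990-07-01")
y_end   <- as.Date("1991-12-31")

# Days of each measurement period that fall inside the averaging period,
# counting both endpoints (assumed closed intervals); 0 if there is no overlap
overlap_days <- function(x_start, x_end, y_start, y_end) {
  pmax(0, as.numeric(pmin(x_end, y_end)) - as.numeric(pmax(x_start, y_start)) + 1)
}

w <- overlap_days(x$start, x$end, y_start, y_end)
w
#> [1] 184 365

# Mean of the two annual values, weighted by days of overlap
weighted.mean(x$no2, w)
#> [1] 23.34526
```

Only the second half of 1990 falls inside the averaging period, so the 1990 value receives roughly half the weight of the 1991 value.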

This is straightforward with intervalaverage’s intervalaverage function!

The y argument of intervalaverage tells it what date range to average for every group in the values data set (in this case, every location_id in no2). For now, we don’t care about who lived at each address, only the unique addresses and date ranges. Since some people in our address history shared homes, we begin by creating a set of unique addresses and their associated dates.

unique_addresses <- unique(address_history[ , .(location_id, start_date, end_date)])

Now we can use the intervalaverage function:

averaged_exposures <- intervalaverage(
  x = no2,
  y = unique_addresses,
  interval_vars = c("start_date", "end_date"),
  value_vars = "no2", 
  group_vars = "location_id"
)
head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
#>    location_id start_date   end_date      no2
#> 1:           1 1990-07-01 2005-06-30 24.83144
#> 2:           2 2005-07-01 2021-12-31       NA
#> 3:           6 1989-06-01 1992-09-30       NA
#> 4:           7 1989-06-01 1990-06-30       NA
#> 5:           8 1989-06-01 2021-12-31       NA
#> 6:           9 1989-06-01 1992-06-30       NA

You’ll notice that there are many missing values. By default, intervalaverage returns NA unless the interval specified in y is completely covered by the values in x, and many of our addresses were occupied outside the time period covered by the NO2 model. To relax this requirement and return a value even when not all of the interval had NO2 values, use the required_percentage argument. For example, here we accept an average if at least 80% of the interval has values.
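
The coverage requirement can be illustrated with plain date arithmetic. The sketch below is not how the package computes coverage. It assumes, based on the head(no2) output above, that modeled values begin on 1990-01-01 at every location and continue past the end of both periods, and that intervals include both endpoints. The function name pct_covered is hypothetical.

```r
# Percentage of the period [y_start, y_end] that lies on or after the first
# day with a modeled value (assumes values are continuous from that day on)
pct_covered <- function(y_start, y_end, x_first_day) {
  y_start     <- as.Date(y_start)
  y_end       <- as.Date(y_end)
  x_first_day <- as.Date(x_first_day)
  total   <- as.numeric(y_end - y_start) + 1
  covered <- max(0, as.numeric(y_end - max(y_start, x_first_day)) + 1)
  100 * covered / total
}

pct_covered("1989-06-01", "1992-09-30", "1990-01-01")  # period for location 6
pct_covered("1989-06-01", "1990-06-30", "1990-01-01")  # period for location 7
```

Under these assumptions the first period is about 82% covered and the second about 46%, which is consistent with location 6 receiving a value once required_percentage = 80 is set while location 7 remains NA.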

averaged_exposures <- intervalaverage(
  x = no2,
  y = unique_addresses,
  interval_vars = c("start_date", "end_date"),
  value_vars = "no2", 
  group_vars = "location_id", 
  required_percentage = 80
)
head(averaged_exposures[ , .(location_id, start_date, end_date, no2)])
#>    location_id start_date   end_date      no2
#> 1:           1 1990-07-01 2005-06-30 24.83144
#> 2:           2 2005-07-01 2021-12-31       NA
#> 3:           6 1989-06-01 1992-09-30 21.85671
#> 4:           7 1989-06-01 1990-06-30       NA
#> 5:           8 1989-06-01 2021-12-31       NA
#> 6:           9 1989-06-01 1992-06-30 17.59011

intervalaverage also includes some diagnostic columns in its output that can be useful in determining why NAs are being returned (this is why we selected only location_id, start_date, end_date, and no2 to display above). See the “Value” section of help(intervalaverage) for an explanation of what these columns mean.

Averaging over an Address History

If you look closely at our address history, you’ll see that some locations and dates are shared among multiple person_ids. This makes sense, since more than one person often lives in the same house! You’ll also notice that people move (so each person_id may be associated with multiple location_ids).

To create an average for each person (rather than location), we’ll need to link the no2 measurement at each location to the person who lived there at the time. Remember, multiple people might have lived at the same location at the same or different times.

The intervalaverage package does this with the intervalintersect function. It looks very similar to the intervalaverage function:

no2_by_person <- intervalintersect(
  x = no2, 
  y = address_history, 
  interval_vars = c('start_date', 'end_date'), 
  group_vars = 'location_id'
)  
head(no2_by_person)
#>    location_id      start        end   no2 person_id
#> 1:           1 1990-07-01 1990-12-31 24.05         6
#> 2:           1 1991-01-01 1991-12-31 22.99         6
#> 3:           1 1992-01-01 1992-12-31 24.90         6
#> 4:           1 1993-01-01 1993-12-31 25.88         6
#> 5:           1 1994-01-01 1994-12-31 25.99         6
#> 6:           1 1995-01-01 1995-12-31 25.56         6

To understand what’s happening here, let’s look at person 5.

This person moved from location 40 to location 41 at the end of November 1996.

address_history[person_id == 5]
#>    start_date   end_date location_id person_id
#> 1: 1989-06-01 1996-11-30          40         5
#> 2: 1996-12-01 2021-12-31          41         5

If you look at the result of intervalintersect, you’ll see that person 5 has two no2 values for 1996, and that each one is associated with the part of the year when they lived where that value applied.

no2_by_person[person_id == 5 & year(start) %in% 1995:1997]
#>    location_id      start        end   no2 person_id
#> 1:          40 1995-01-01 1995-12-31 44.08         5
#> 2:          40 1996-01-01 1996-11-30 37.17         5
#> 3:          41 1996-12-01 1996-12-31 28.02         5
#> 4:          41 1997-01-01 1997-12-31 25.00         5
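
The two 1996 rows can be reproduced by hand: the intersection of two overlapping intervals runs from the later of the two start dates to the earlier of the two end dates. The following base-R sketch is illustrative only; intersect_interval is a made-up helper, not a function in the package.

```r
# Intersection of two closed date intervals, or NULL if they do not overlap
intersect_interval <- function(start1, end1, start2, end2) {
  s <- max(as.Date(start1), as.Date(start2))
  e <- min(as.Date(end1), as.Date(end2))
  if (s > e) return(NULL)
  data.frame(start = s, end = e)
}

# NO2 model year 1996 intersected with person 5's two address periods
at_40 <- intersect_interval("1996-01-01", "1996-12-31", "1989-06-01", "1996-11-30")
at_41 <- intersect_interval("1996-01-01", "1996-12-31", "1996-12-01", "2021-12-31")
at_40
at_41
```

The first call yields 1996-01-01 through 1996-11-30 and the second yields 1996-12-01 through 1996-12-31, matching the interval columns of rows 2 and 3 above.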

Now that we have a set of NO2 values associated with the person_id rather than location_id, we can use intervalaverage to find the average for any period of time. That could be a period that is unique to each participant. Or it could be consistent for all people in the data set, such as a yearly average (where each year is a time-weighted average of the places the person lived that year).
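
As a hand check of what such a yearly average means, person 5's 1996 value would be the two 1996 rows shown earlier, weighted by the number of days each one covers. This base-R sketch assumes inclusive day counts and is not the package's implementation.

```r
# Person 5's two 1996 rows, copied from the intervalintersect output above
rows_1996 <- data.frame(
  start = as.Date(c("1996-01-01", "1996-12-01")),
  end   = as.Date(c("1996-11-30", "1996-12-31")),
  no2   = c(37.17, 28.02)
)

# Number of days in each row, counting both endpoints
days <- as.numeric(rows_1996$end - rows_1996$start) + 1
days
#> [1] 335  31

# Time-weighted 1996 average across the two addresses
weighted.mean(rows_1996$no2, days)
#> [1] 36.395
```

Because the person spent only December at location 41, the result stays close to the value at location 40.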

We will work through both, and introduce one more function along the way.

Average Over the Year Prior to Study Enrollment

The package includes another data.table with each person’s enrollment date in our study. We will use it to calculate the average exposure for each participant in the year before they joined our study.

data(enrollment)
setDT(enrollment)
head(enrollment)
#>    person_id enroll_date
#> 1:         1  2016-01-08
#> 2:         2  2016-08-18
#> 3:         3  2015-08-12
#> 4:         4  2016-07-18
#> 5:         5  2016-07-11
#> 6:         6  2015-07-26

Again, the y argument of intervalaverage tells it what date range to average for every group in the values data set (x). So we first need to create a data.table that contains the beginning and end of our desired averaging period for each participant from the enrollment data set. We’ll define the “year prior” to enrollment in our study as the 365 days preceding the enrollment date (i.e. the period from 365 days before enrollment to 1 day before enrollment).

pre_enrollment <- enrollment[ , .(person_id, start = enroll_date - 365L, end = enroll_date - 1L)]
head(pre_enrollment)
#>    person_id      start        end
#> 1:         1 2015-01-08 2016-01-07
#> 2:         2 2015-08-19 2016-08-17
#> 3:         3 2014-08-12 2015-08-11
#> 4:         4 2015-07-19 2016-07-17
#> 5:         5 2015-07-12 2016-07-10
#> 6:         6 2014-07-26 2015-07-25
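
A quick check in base R confirms that this definition yields a 365-day window even when the window spans a leap day. The sketch uses Date rather than IDate, a substitution made only to keep it free of package dependencies.

```r
enroll_date <- as.Date("2016-08-18")  # person 2's enrollment date, from above
start <- enroll_date - 365
end   <- enroll_date - 1

start
#> [1] "2015-08-19"

# Inclusive length of the window in days
as.numeric(end - start) + 1
#> [1] 365
```

The start date is 2015-08-19 rather than 2015-08-18 because the window contains 2016-02-29.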

You may notice I’ve used start and end here rather than start_date and end_date as I did previously. This is because of a quirk of intervalintersect where it returns start and end as the interval column names regardless of the input names, and in the next step, the interval column names need to match. This will be fixed in future versions of the package.

Now we simply apply intervalaverage, again allowing for 20% missingness in no2 values:

no2_pre_enrollment <- intervalaverage(
  x = no2_by_person, 
  y = pre_enrollment, 
  interval_vars = c('start', 'end'), 
  group_vars = 'person_id',
  value_vars = 'no2',
  required_percentage = 80
)  
head(no2_pre_enrollment[, .(person_id, start, end, no2)])
#>    person_id      start        end no2
#> 1:         1 2015-01-08 2016-01-07  NA
#> 2:         2 2015-08-19 2016-08-17  NA
#> 3:         3 2014-08-12 2015-08-11  NA
#> 4:         4 2015-07-19 2016-07-17  NA
#> 5:         5 2015-07-12 2016-07-10  NA
#> 6:         6 2014-07-26 2015-07-25  NA

Every row shown here is NA, which, as before, indicates that NO2 values cover less than 80% of the averaging period for that person. The diagnostic columns in the full output show how much of each interval was covered.

Annual Averages

Finally, let’s calculate annual (calendar-year) averages for 2000-2004. We’ll need to build a y argument for the intervalaverage function that has a start date and an end date for each year. However, even though the periods we are averaging over are the same for each participant, we also need to repeat each year once for every participant. This is because y must contain the same grouping variable(s) as x.

This gives us an opportunity to try a convenient helper function in intervalaverage: CJ.dt. This is a more convenient version of data.table::CJ (CJ as in “Cross Join”) that can take data.tables as arguments instead of just vectors. By cross joining our unique person_ids with the years 2000-2004, we can construct a y argument that will work in the intervalaverage function.

# Build a data.table with start and end dates for each year
years <- data.table(year = 2000:2004) 
years[ , start := as.IDate(paste(year, "01", "01", sep = "-"))]
years[ ,   end := as.IDate(paste(year, "12", "31", sep = "-"))]
years[]
#>    year      start        end
#> 1: 2000 2000-01-01 2000-12-31
#> 2: 2001 2001-01-01 2001-12-31
#> 3: 2002 2002-01-01 2002-12-31
#> 4: 2003 2003-01-01 2003-12-31
#> 5: 2004 2004-01-01 2004-12-31
# Cross join with the unique person_ids in our data
years <- CJ.dt(years, unique(address_history[ , .(person_id)]))
head(years[order(year, person_id)])
#>    year      start        end person_id
#> 1: 2000 2000-01-01 2000-12-31         1
#> 2: 2000 2000-01-01 2000-12-31         2
#> 3: 2000 2000-01-01 2000-12-31         3
#> 4: 2000 2000-01-01 2000-12-31         4
#> 5: 2000 2000-01-01 2000-12-31         5
#> 6: 2000 2000-01-01 2000-12-31         6
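
For readers new to cross joins, the same result shape can be sketched in base R, where merge() with by = NULL returns every combination of rows from its two inputs. This only illustrates the idea with three made-up person_ids; it is not how CJ.dt is implemented.

```r
years_df <- data.frame(
  year  = 2000:2004,
  start = as.Date(paste(2000:2004, "01", "01", sep = "-")),
  end   = as.Date(paste(2000:2004, "12", "31", sep = "-"))
)
people_df <- data.frame(person_id = 1:3)  # hypothetical ids, for illustration

# by = NULL requests the Cartesian product: one row per (year, person) pair
crossed <- merge(years_df, people_df, by = NULL)
nrow(crossed)
#> [1] 15
```

Five years crossed with three people gives 15 rows, each carrying the year's start and end dates alongside a person_id.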

Now we can apply the intervalaverage function again to get a yearly average for each person. Recall that the NO2 data is also annual, so the only rows that will actually be averaged will be those for years where a person moved (since a person will have had an NO2 value for each address they lived in that year).

no2_annual <- intervalaverage(
  x = no2_by_person, 
  y = years, 
  interval_vars = c('start', 'end'), 
  group_vars = 'person_id',
  value_vars = 'no2',
  required_percentage = 80
)  
head(no2_annual[, .(person_id, start, end, no2)])
#>    person_id      start        end   no2
#> 1:         1 2000-01-01 2000-12-31 35.14
#> 2:         1 2001-01-01 2001-12-31 40.35
#> 3:         1 2002-01-01 2002-12-31 32.97
#> 4:         1 2003-01-01 2003-12-31 31.82
#> 5:         1 2004-01-01 2004-12-31 30.72
#> 6:         2 2000-01-01 2000-12-31 17.13

Further Reading

For a more detailed and advanced tutorial on how to use the intervalaverage package, see vignette("intervalaverage-advanced").

For information about the inner workings of the package (intended for maintainers), see vignette("intervalaverage-technicaloverview").