Chapter 1. Sources, Contents, and Quality Control of Data

1.1. Overview of Location Data

Location data are central to the design of the EAC MySQL database, as all exposure data compiled for the relevant studies are associated with a location and all exposure predictions will be made at locations. Locations of interest include air quality monitoring sites, meteorological monitoring sites, and participants’ homes. Other important locations include the intersection points of grids developed at various scales to facilitate preparation of visual aids like concentration maps. “Location data” refers to the data necessary to specify a position and its relevant characteristics. For example, a participant’s home is located at a certain latitude and longitude. The latitude and longitude can then be used to determine geographic variables, such as population density at that location according to a given Census year. The coordinates of each location were determined as specified in each relevant subsection.

1.1.1. Map Projections and Coordinates

Geographic coordinates and projected coordinates are stored in the EAC database for all locations. Latitude and longitude are stored as decimal degrees referenced to the North American Datum geographic coordinate system. Two sets of projected coordinates will be available; in each, the projected coordinates (x, y) represent the number of meters that the location lies from the origin of the projection. One set is referenced to a specific State Plane Zone. The other is referenced to the US Conical Lambert projection (lambert x, lambert y). Distance calculations using State Plane coordinates are expected to be more accurate, but these coordinates cannot be used to calculate distances across State Plane Zones. Distances calculated between two points within a State Plane Zone could vary by up to almost 1% depending on the coordinate system used (up to approximately 2 kilometers absolute distance).

Locations were assigned latitudes and longitudes (“geocoded”) in one of three ways: created in ArcGIS from street addresses, recorded directly from digital global positioning system (GPS) units, or acquired directly from a third-party data provider. The geographic coordinates were then projected into the State Plane Coordinate System (SPCS).

The latitude and longitude and the projected SPCS coordinates for participants’ residences are available to users with appropriate permission. Coordinates that uniquely identify a participant’s home are considered identifying information by the University of Washington Human Subjects Review Board. Access to participant home locations may be granted by the Principal Investigator to data users who demonstrate a need for that information, complete any relevant human subjects training, and sign the Data and Materials Distribution Agreement (DMDA).
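Because projected coordinates are expressed in meters from the projection origin, the distance between two locations in the same projection is a planar (Euclidean) distance. The sketch below illustrates this, along with the restriction that State Plane coordinates should not be compared across zones. The names `ProjectedPoint`, `planar_distance_m`, and the zone labels are hypothetical and are not taken from the EAC database.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectedPoint:
    x: float   # meters from the projection origin (easting)
    y: float   # meters from the projection origin (northing)
    zone: str  # hypothetical label for the projection or State Plane Zone


def planar_distance_m(a: ProjectedPoint, b: ProjectedPoint) -> float:
    """Return the straight-line distance in meters between two projected points.

    Raises ValueError when the points are referenced to different zones,
    because coordinates from different projections share no common origin.
    """
    if a.zone != b.zone:
        raise ValueError(
            f"cannot compare coordinates from zones {a.zone!r} and {b.zone!r}"
        )
    return math.hypot(a.x - b.x, a.y - b.y)


# Illustrative coordinates only, not real monitor or participant locations.
home = ProjectedPoint(x=0.0, y=0.0, zone="zone_A")
monitor = ProjectedPoint(x=3000.0, y=4000.0, zone="zone_A")
distance = planar_distance_m(home, monitor)
```

For locations in different State Plane Zones, the same function could be applied to the Lambert coordinates instead, at some cost in accuracy as noted above.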

1.1.2. Political Identifiers – County, State

Political identifiers of location that are available include state, county, and block group. As a default, counties and block groups are assigned by overlaying the locations’ latitudes and longitudes onto the TeleAtlas Dynamap 2000 v. 16.1 county shapefile. In the future, if necessary, historical counties will be determined by overlaying the locations onto historical census shapefiles in later versions of the database. State boundaries of the lower 48 states have not changed during the lifetimes of any study participants, so state is assigned only by the TeleAtlas 2000 boundaries. The term ‘city’ is vague in the context of our health studies and is to be avoided, although we are able to provide the year 2000 Metropolitan Statistical Area (MSA) for locations. For convenience, a standard set of monitors was selected in each MESA Air study area for exposure modeling. These are indicated on the maps on the internal website.

1.2. Air Quality System (AQS) Data

1.2.1 Description of Data Source

Air quality data (i.e., pollutant concentrations) are collected nationally by networks of federally and locally funded agencies, collectively called the Air Quality System (AQS). The AQS data from most agency monitoring locations were obtained from the Environmental Protection Agency’s (EPA) website at https://aqs.epa.gov/aqsweb/airdata/download_files.html. Monitoring data from national parks and some rural locations are collected by a specific sub-network of the AQS, called the Interagency Monitoring of Protected Visual Environments (IMPROVE) network. Data collected by the IMPROVE network were obtained from the Visibility Information Exchange Web System (VIEWS), at http://views.cira.colostate.edu/iwdw/QueryWizard/Default.aspx. Users should be aware that non-IMPROVE monitoring locations are generally selected to assess population exposures, and that IMPROVE monitoring locations are generally selected to assess pollution impacts in sparsely populated environments. Therefore, users may determine that IMPROVE monitors are not representative of pollutant levels to which “nearby” subjects are exposed if those subjects live in reasonably populous areas.

The AQS network comprises a heterogeneous mixture of state, county, and local agencies throughout the United States. As funding availability and monitoring aims evolve, monitors come online or are discontinued, and monitoring schedules are altered. The monitors near a given location that collected PM2.5 data from 2000 to 2002 may not have collected PM2.5 data from 2003 to 2005. Furthermore, an individual monitor that collected PM2.5 data every day in 2002 may have collected data every third day in 2003. Another monitor that collects PM2.5 data every day during the winter may only collect data every third day in the summer. These features of the data primarily affect analysts creating exposure models. In general, health analysts conducting analyses based on AQS monitoring data will receive data from monitors as described in Sections 2.6 and 2.7.

PM2.5 data are collected by air quality agencies via several methods, and are consistently reported in units of µg/m3. Federal Reference Methods (FRM) are used to determine compliance with National Ambient Air Quality Standards (NAAQS). These data are considered to be of the most reliable quality, so only PM2.5 data collected by FRM monitors will be available in the EAC database. FRM methods are similar to one another, and involve collecting PM2.5 on a filter for 24 hours at a time. Most commonly, agencies measure PM2.5 on one out of every three days, with occasional data loss. The IMPROVE network collects PM data using FRM-type methods. A comparison of IMPROVE data to AQS FRM data collected concurrently at the same locations indicated that these methods were comparable enough to be considered equivalent. Occasionally, EPA audits of AQS data have found that measurement conditions did not meet the specifications required to validate those measurements for regulatory purposes. Many of these measurements were re-classified as “raw data” by the EPA. Although there is some evidence that these data are noisier than validated measurements, they are collected using reference methods and EPA staff believe that they are “potentially of quality.” We have included these measurements with a flag where no validated measurement was available.

A number of users have asked if we will include data from Federal Equivalent Methods (FEMs). These include tapered element oscillating monitors (TEOMs) and beta attenuation monitors (BAMs), which are both one-hour integrated methods. When used for compliance purposes, these monitors are used alongside and corrected to filter methods. Uncorrected methods are not as well-correlated with the filter methods, show seasonal bias, and are relatively sparse in locations where we do not already have filter-based monitors. Our judgment has been that these monitors would add noise to our models rather than improve their spatial resolution.

We note that some agencies may operate two monitors concurrently at the same location for quality control purposes. The EAC database is intended to house no more than one concentration for a single point in time and space. Therefore, AQS monitoring methodologies were ranked in terms of reliability based on best professional judgment. A detailed list of the rankings can be found in Tables 1 through 5. Data from the more reliable monitor were retained for time periods during which two monitors’ data were available from exactly the same location. If two co-located monitors used the same method, the data from the monitor with the more complete time series were retained. The alternative monitor’s data were inserted for time periods during which only that monitor’s data were available. Thus, a given location could have a single time series of monitoring data sourced from more than one monitor.
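The retention rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the EAC database code: the function name `merge_colocated`, the tuple layout, and the use of date strings as keys are all hypothetical. Monitors are ordered by method rank (lower is more reliable), ties are broken by the number of measurements, and each date takes its value from the first monitor in that order that reported on it.

```python
from typing import Dict, List, Tuple

# (monitor_id, method_rank, series), where series maps a date string to a concentration.
Monitor = Tuple[str, int, Dict[str, float]]


def merge_colocated(monitors: List[Monitor]) -> Dict[str, Tuple[float, str]]:
    """Combine co-located monitors into one time series.

    Returns a mapping from date to (concentration, source monitor id),
    sorted by date, with at most one value per date.
    """
    # Prefer the better-ranked method; for equal ranks, prefer the monitor
    # with more measurements (used here as a stand-in for "more complete").
    ordered = sorted(monitors, key=lambda m: (m[1], -len(m[2])))
    merged: Dict[str, Tuple[float, str]] = {}
    for monitor_id, _rank, series in ordered:
        for day, value in series.items():
            # setdefault keeps the value from the preferred monitor and only
            # fills dates that no preferred monitor has covered.
            merged.setdefault(day, (value, monitor_id))
    return dict(sorted(merged.items()))


# Illustrative values only.
preferred = ("A", 1, {"2002-01-01": 10.0, "2002-01-04": 12.0})
alternate = ("B", 2, {"2002-01-01": 11.0, "2002-01-02": 9.0})
combined = merge_colocated([alternate, preferred])
```

Keeping the source monitor alongside each value makes it possible to see which parts of a location's time series came from the alternative monitor.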

Table 1. Ranking of PM2.5 collection methods

Rank	Parameter	Parameter Code	Method Code	Recording Mode	Collection Description	Analysis Description	Method Type	Reference Method ID	Equivalent Method	Federal MDL	Max Value	Digits	Round Truncate Indicator	Units	Table
1	PM2.5 - Local Conditions	88101	145	Intermittent	R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC	Gravimetric	FRM	RFPS-1006-145	R & P Model 2025 Sequential Air Sampler with BGI VSCC	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
2	PM2.5 - Local Conditions	88101	118	Intermittent	R & P Model 2025 PM2.5 Sequential w/WINS	GRAVIMETRIC	FRM	RFPS-0498-118	R & P CO PLUS MODEL 2025PM SEQ	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
3	PM2.5 - Local Conditions	88101	144	Intermittent	R & P Model 2000 PM-2.5 Audit Sampler w/VSCC	Gravimetric	FRM	RFPS-1006-144	R & P Model 2000 Audit Sampler with BGI VSCC	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
4	PM2.5 - Local Conditions	88101	143	Intermittent	R & P Model 2000 PM-2.5 Air Sampler w/VSCC	Gravimetric	FRM	RFPS-1006-143	R & P Model 2000 Air Sampler with BGI VSCC	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
5	PM2.5 - Local Conditions	88101	117	Intermittent	R & P Model 2000 PM2.5 Sampler w/WINS	GRAVIMETRIC	FRM	RFPS-0498-117	R & P CO Model 2000 PM-2.5	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
6	PM2.5 - Local Conditions	88101	116	Intermittent	BGI Model PQ200 PM2.5 Sampler w/WINS	GRAVIMETRIC	FRM	RFPS-0498-116	BGI INC Model PQ200 PM2.5	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
7	PM2.5 - Local Conditions	88101	155	Intermittent	Thermo Electron Model RAAS2.5-300 Sequential w/VSCC	Gravimetric	FRM	RFPS-1006-155	Thermo Electron RAAS2.5-300 Sequential with BGI VSCC	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
8	PM2.5 - Local Conditions	88101	142	Intermittent	BGI Models PQ200-VSCC or PQ200A-VSCC	Gravimetric	FRM	RFPS-1006-142	BGI Models PQ200-VSCC or PQ200A-VSCC	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
9	PM2.5 - Local Conditions	88101	129	Intermittent	R & P Model 2000 PM-2.5 Audit w/WINS	GRAVIMETRIC	FRM	RFPS-0499-129	R & P CO Model 2000 PM-2.5 Audit	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
10	PM2.5 - Local Conditions	88101	153	Intermittent	Thermo Electron Model RAAS2.5-100 w/VSCC	Gravimetric	FRM	RFPS-1006-153	Thermo Electron RAAS2.5-100 with BGI VSCC	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
11	PM2.5 - Local Conditions	88101	154	Intermittent	Thermo Electron Model RAAS2.5-200 Audit w/VSCC	Gravimetric	FRM	RFPS-1006-154	Thermo Electron RAAS2.5-200 Audit with BGI VSCC	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
12	PM2.5 - Local Conditions	88101	135	Intermittent	URG-MASS100 Single PM2.5 Sampler	GRAVIMETRIC	FRM	RFPS-0400-135	URG MASS100 Single PM2.5 Sampler	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
13	PM2.5 - Local Conditions	88101	136	Intermittent	URG-MASS300 Sequential PM2.5 Sampler	GRAVIMETRIC	FRM	RFPS-0400-136	URG MASS300 Sequential PM2.5 Sampler	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
14	PM2.5 - Local Conditions	88101	128	Intermittent	Andersen RAAS2.5-2000PM2.5 Aud w/WINS	Gravimetric	FRM	RFPS-0299-128	Andersen RAAS2.5-200 PM2.5 Audit	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
15	PM2.5 - Local Conditions	88101	119	Intermittent	Andersen RAAS2.5-100 PM2.5 SAM w/WINS	GRAVIMETRIC	FRM	RFPS-0598-119	Andersen RAAS2.5-100 PM2.5 SAM	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
16	PM2.5 - Local Conditions	88101	120	Intermittent	Andersen RAAS2.5-300 PM2.5 SEQ w/WINS	GRAVIMETRIC	FRM	RFPS-0598-120	Andersen RAAS2.5-300 PM2.5 SEQ	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
17	PM2.5 - Local Conditions	88101	123	Intermittent	Thermo Env Model 605 CAPS	Gravimetric	FRM	RFPS-1098-123	THERMO ENVIR MODEL 605 CAPS	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
18	PM2.5 - Local Conditions	88101	177	Intermittent	Thermo Scientific Partisol 2000-D Dichot.	Gravimetric	FEM	EQPS-0509-177	Thermo Scientific Partisol 2000-D Dichot.	2	5000	1	T	Micrograms/cubic meter (LC)	pm25
19	PM2.5 - Local Conditions	88101	179	Intermittent	Thermo Scientific Dichot. Partisol-Plus Model 2025-D Seq	Gravimetric	FEM	EQPS-0509-179	Thermo Scientific Dichot. Partisol-Plus Model 2025-D Seq	2	5000	1	T	Micrograms/cubic meter (LC)	pm25

Table 3. Ranking of O3 collection methods

Rank	Parameter	Parameter Code	Method Code	Recording Mode	Collection Description	Analysis Description	Method Type	Reference Method ID	Equivalent Method	Federal MDL	Min Value	Max Value	Digits	Round Truncate Indicator	Units	Table
1	Ozone	44201	047	Continuous	INSTRUMENTAL	ULTRA VIOLET	FEM	EQOA-0880-047	THERMO ELECTRON 49	0.005	-0.004	0.5	3	T	Parts per million	o3
2	Ozone	44201	103	Continuous	INSTRUMENTAL	OPEN PATH O3 ANALYZER	FEM	EQOA-0495-103	OPSI MODEL AR 500 O3 ANALYZER	0.005	-0.004	0.5	3	T	Parts per million	o3
3	Ozone	44201	056	Continuous	INSTRUMENTAL	ULTRA VIOLET	FEM	EQOA-0383-056	DASIBI 1008-AH	0.005	-0.004	0.5	3	T	Parts per million	o3
4	Ozone	44201	019	Continuous	INSTRUMENTAL	ULTRA VIOLET	FEM	EQOA-0577-019	DASIBI 1003-AH–PC–RS	0.005	-0.004	0.5	3	T	Parts per million	o3
5	Ozone	44201	053	Continuous	INSTRUMENTAL	ULTRA VIOLET	FEM	EQOA-0881-053	MONITOR LABS 8810	0.005	-0.004	0.5	3	T	Parts per million	o3
6	Ozone	44201	078	Continuous	INSTRUMENTAL	ULTRA VIOLET	FEM	EQOA-0990-078	ENVIRONICS SERIES 300	0.005	-0.004	0.5	3	T	Parts per million	o3
7	Ozone	44201	087	Continuous	INSTRUMENTAL	ULTRA VIOLET ABSORPTION	FEM	EQOA-0992-087	MODEL 400 OZONE ANALYZER	0.005	-0.004	0.5	3	T	Parts per million	o3
8	Ozone	44201	112	Continuous	INSTRUMENTAL	ULTRAVIOLET ABSORPTION	FEM	EQOA-0196-112	HORIBA APOA-360	0.005	-0.004	0.5	3	T	Parts per million	o3
9	Ozone	44201	105	Continuous	INSTRUMENTAL	UV PHOTOMETRIC	FEM	EQOA-0895-105	ENVIRONMENT SA MODEL Q341M	0.005	-0.004	0.5	3	T	Parts per million	o3
10	Ozone	44201	020	Continuous	INSTRUMENTAL	CHEMILUMINESCENCE	FRM	RFOA-0577-020	BECKMAN 950A	0.005	-0.004	0.5	3	T	Parts per million	o3

Table 4. Ranking of SO2 collection methods

Rank	Parameter	Parameter Code	Method Code	Recording Mode	Collection Description	Analysis Description	Method Type	Reference Method ID	Equivalent Method	Federal MDL	Min Value	Max Value	Digits	Round Truncate Indicator	Units	Table
1	Sulfur dioxide	42401	560	Continuous	INSTRUMENTAL	Pulsed Fluorescent 43C-TLE/43i-TLE	FEM	EQSA-0486-060	Thermo Electron 43c-TLE/43i-TLE	0.2	-4	1500	1	T	Parts per billion	so2
2	Sulfur dioxide	42401	060	Continuous	INSTRUMENTAL	PULSED FLUORESCENT	FEM	EQSA-0486-060	THERMO ELECTRON 43A, 43B, 43C	2.0	-4	10000	1	T	Parts per billion	so2
3	Sulfur dioxide	42401	009	Continuous	INSTRUMENTAL	PULSED FLUORESCENT	FEM	EQSA-0276-009	THERMO ELECTRON 43	2.0	-4	1500	1	T	Parts per billion	so2

Table 5. Ranking of EC/OC collection methods

Chemical Speciation Network (CSN) Temperature Protocol, Total Optical Transmittance

Rank	Parameter	Parameter Code	Method Code	Recording Mode	Collection Description	Analysis Description	Method Type	Reference Method ID	Equivalent Method	Federal MDL	Min Value	Max Value	Digits	Round Truncate Indicator	Units	Table
1	EC CSN PM2.5 LC TOT	88307	847	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz VSCC	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
1	OC CSN Unadjusted PM2.5 LC TOT	88305	847	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz VSCC	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
2	EC CSN PM2.5 LC TOT	88307	843	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
2	OC CSN Unadjusted PM2.5 LC TOT	88305	843	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
3	EC CSN PM2.5 LC TOT	88307	853	Intermittent	R&P MDL2300 PM2.5 Seq Spec Quartz	STN TOT	NA	NA	NA	0.163	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
3	OC CSN Unadjusted PM2.5 LC TOT	88305	853	Intermittent	R&P MDL2300 PM2.5 Seq Spec Quartz	STN TOT	NA	NA	NA	0.163	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
4	EC CSN PM2.5 LC TOT	88307	873	Intermittent	URG MASS450 Quartz VSCC	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
4	OC CSN Unadjusted PM2.5 LC TOT	88305	873	Intermittent	URG MASS450 Quartz VSCC	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
5	EC CSN PM2.5 LC TOT	88307	833	Intermittent	URG MASS450 Quartz WINS	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
5	OC CSN Unadjusted PM2.5 LC TOT	88305	833	Intermittent	URG MASS450 Quartz WINS	STN TOT	NA	NA	NA	0.098	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
6	EC CSN PM2.5 LC TOT	88307	823	Intermittent	Andersen RAAS Quartz	STN TOT	NA	NA	NA	0.224	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
6	OC CSN Unadjusted PM2.5 LC TOT	88305	823	Intermittent	Andersen RAAS Quartz	STN TOT	NA	NA	NA	0.224	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
7	EC CSN PM2.5 LC TOT	88307	813	Intermittent	Met One SASS Quartz	STN TOT	NA	NA	NA	0.245	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1
7	OC CSN Unadjusted PM2.5 LC TOT	88305	813	Intermittent	Met One SASS Quartz	STN TOT	NA	NA	NA	0.245	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.1

IMPROVE Temperature Protocol, Total Optical Reflectance

Rank	Parameter	Parameter Code	Method Code	Recording Mode	Collection Description	Analysis Description	Method Type	Reference Method ID	Equivalent Method	Federal MDL	Min Value	Max Value	Digits	Round Truncate Indicator	Units	Table
1	EC PM2.5 LC TOR	88321	859	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz VSCC	EC1+EC2+EC3-OP	NA	NA	NA	0.093	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
1	OC PM2.5 LC TOR	88320	859	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz VSCC	OC1+OC2+OC3+OC4+OP	NA	NA	NA	0.402	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
2	EC PM2.5 LC TOR	88321	857	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz	EC1+EC2+EC3-OP	NA	NA	NA	0.093	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
2	OC PM2.5 LC TOR	88320	857	Intermittent	R & P Model 2025 PM2.5 Sequential Quartz	OC1+OC2+OC3+OC4+OP	NA	NA	NA	0.402	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
3	EC PM2.5 LC TOR	88321	855	Intermittent	R&P MDL2300 PM2.5 Seq Spec Quartz	EC1+EC2+EC3-OP	NA	NA	NA	0.155	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
3	OC PM2.5 LC TOR	88320	855	Intermittent	R&P MDL2300 PM2.5 Seq Spec Quartz	OC1+OC2+OC3+OC4+OP	NA	NA	NA	0.670	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
4	EC PM2.5 LC TOR	88321	877	Intermittent	URG MASS450 Quartz VSCC	EC1+EC2+EC3-OP	NA	NA	NA	0.025	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
4	OC PM2.5 LC TOR	88320	877	Intermittent	URG MASS450 Quartz VSCC	OC1+OC2+OC3+OC4+OP	NA	NA	NA	0.025	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
5	EC PM2.5 LC TOR	88321	829	Intermittent	URG 3000N w/Pall Quartz filter and Cyclone Inlet	EC1+EC2+EC3-(OP(TOR))=(88329+88330+88331-88328)	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.2
6	EC PM2.5 LC TOR	88321	838	Intermittent	URG 3000N w/Pall Quartz filter and Cyclone Inlet	EC1+EC2+EC3-OP (88329+88330+88331-88336)	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.2
6	OC PM2.5 LC TOR	88320	838	Intermittent	URG 3000N w/Pall Quartz filter and Cyclone Inlet	OC1+OC2+OC3+OC4+OP (88324+88325+88326+88327+88328)	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.2
7	EC PM2.5 LC TOR	88321	837	Intermittent	URG MASS450 Quartz WINS	EC1+EC2+EC3-OP	NA	NA	NA	0.025	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
7	OC PM2.5 LC TOR	88320	837	Intermittent	URG MASS450 Quartz WINS	OC1+OC2+OC3+OC4+OP	NA	NA	NA	0.025	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
8	EC PM2.5 LC TOR	88321	815	Intermittent	Met One SASS Quartz	EC1+EC2+EC3-OP	NA	NA	NA	0.233	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
8	OC PM2.5 LC TOR	88320	815	Intermittent	Met One SASS Quartz	OC1+OC2+OC3+OC4+OP	NA	NA	NA	1.005	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
9	EC PM2.5 LC TOR	88321	825	Intermittent	Andersen RAAS Quartz	EC1+EC2+EC3-OP	NA	NA	NA	0.213	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
9	OC PM2.5 LC TOR	88320	825	Intermittent	Andersen RAAS Quartz	OC1+OC2+OC3+OC4+OP	NA	NA	NA	0.918	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
10	EC PM2.5 LC TOR	88321	809	Intermittent	IMPROVE Module C with Cyclone Inlet-Quartz Filter	EC1+EC2+EC3-OP (88329+88330+88331-88336)	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.2
10	OC PM2.5 LC TOR	88320	809	Intermittent	IMPROVE Module C with Cyclone Inlet-Quartz Filter	OC1+OC2+OC3+OC4+OP (88324+88325+88326+88327+88328)	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.2
11	EC PM2.5 LC TOR	88321	805	Intermittent	IMPROVE	EC1+EC2+EC3-OP	NA	NA	NA	0.020	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2
11	OC PM2.5 LC TOR	88320	805	Intermittent	IMPROVE	OC1+OC2+OC3+OC4+OP	NA	NA	NA	0.087	NA	NA	2	R	Micrograms/cubic meter (LC)	ecoc.2

IMPROVE Temperature Protocol, Total Optical Transmittance

Rank	Parameter	Parameter Code	Method Code	Recording Mode	Collection Description	Analysis Description	Method Type	Reference Method ID	Equivalent Method	Federal MDL	Min Value	Max Value	Digits	Round Truncate Indicator	Units	Table
NA	EC CSN_Rev Unadjusted PM2.5 LC TOT	88357	838	Intermittent	URG 3000N w/Pall Quartz filter and Cyclone Inlet	EC1+EC2+EC3-OP (88329+88330+88331-88388)	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.3
NA	EC CSN_Rev Unadjusted PM2.5 LC TOT	88357	840	Intermittent	URG 3000N w/Pall Quartz filter and Cyclone Inlet	EC1+EC2+EC3-(OP(TOT))=(88383+88384+88385-88388))	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.3
NA	OC CSN_Rev Unadjusted PM2.5 LC TOT	88355	838	Intermittent	URG 3000N w/Pall Quartz filter and Cyclone Inlet	OC1+OC2+OC3+OC4+OP (88374+88375+88376+88377+88388)	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.3

Revised CSN Temperature Protocol, Total Optical Reflectance, Unadjusted

Rank	Parameter	Parameter Code	Method Code	Recording Mode	Collection Description	Analysis Description	Method Type	Reference Method ID	Equivalent Method	Federal MDL	Min Value	Max Value	Digits	Round Truncate Indicator	Units	Table
NA	EC CSN_Rev Unadjusted PM2.5 LC TOR	88380	831	Intermittent	URG 3000N w/Pall Quartz filter and Cyclone Inlet	EC1+EC2+EC3-(OP(TOR))=(88383+88384+88385-88378))	NA	NA	NA	0.002	NA	NA	3	R	Micrograms/cubic meter (LC)	ecoc.4

Elemental and organic carbon (EC/OC) are not listed as criteria air pollutants, and as such have no federal reference methods. Similar equipment is used for the collection of samples for EC/OC analysis as is used for the collection of PM2.5 for gravimetric analysis, so similar rankings were used to prioritize EC/OC methods. One issue that is important to consider when requesting EC/OC data is that these pollutants are operationally defined. Through 2007, EC/OC was primarily measured using the Chemical Speciation Network (CSN) temperature protocol. In 2007, many locations switched to using the IMPROVE temperature protocol. Comparisons between HEI/NPACT EC/OC measurements and monitoring network measurements by different protocols are presented in the XRF and EC/OC supplement to the QA report.

AQS monitors collect hourly NOX and NO2 data by chemiluminescence, generally reported in units of ppm (and stored in the EAC database in units of ppb). All chemiluminescent methods are considered equivalent, so data for the monitor with the more complete time series were retained for days with two monitors’ data available. If a duplicate monitor existed, data from that monitor were inserted into the time series for time periods during which only that monitor’s data were available.

Speciated data for the same species are rarely assessed using more than one method, so data for the monitor with the more complete time series were retained for days with two monitors’ data available. If a duplicate monitor existed, data from that monitor were inserted into the time series for time periods during which only that monitor’s data were available.

1.2.2 AQS Locations

AQS locations were downloaded from AQS Data Mart as latitude and longitude. Flat files were converted to shapefiles in ArcGIS and re-projected into the appropriate State Plane Coordinate System before geographic variables were calculated.

AQS monitoring locations are identifiable by unique identifiers assigned by the EPA. These identifiers have three parts: a two-digit state code, a three-digit county code, and a four-digit site code. Individual monitors at monitoring sites also have a five-digit parameter (pollutant) code and a two-digit ‘parameter occurrence code’ (POC). The two-digit state code and three-digit county code are defined under the Information Technology Management Reform Act (Public Law 104-106) and by the National Institute of Standards and Technology (NIST) for Federal computer systems. The four-digit site code defines unique AQS monitor locations within a given county, and the five-digit parameter code defines the specific parameter being collected by a designated monitor. For example, a PM2.5 monitor in Los Angeles County, California at site 0001 could be numbered 0603700018810101. As described above, multiple monitors for the same pollutant are occasionally operated concurrently at the same site. The POC identifies unique monitors in this case, with lower numbers indicating more established monitors. Maps of the AQS monitor locations for which we retain data are available online.
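The identifier layout described above (two-digit state, three-digit county, four-digit site, five-digit parameter, two-digit POC) can be unpacked with fixed-width slicing. The sketch below is illustrative only; the names `AqsMonitorId` and `parse_monitor_id` are hypothetical, and the fields are kept as strings so that leading zeros are preserved.

```python
from typing import NamedTuple


class AqsMonitorId(NamedTuple):
    state: str      # two-digit state code
    county: str     # three-digit county code
    site: str       # four-digit site code
    parameter: str  # five-digit parameter (pollutant) code
    poc: str        # two-digit parameter occurrence code

    @property
    def site_id(self) -> str:
        """Nine-digit identifier of the monitoring location (state + county + site)."""
        return self.state + self.county + self.site


def parse_monitor_id(monitor_id: str) -> AqsMonitorId:
    """Split a 16-digit monitor identifier into its five components."""
    if len(monitor_id) != 16 or not (monitor_id.isascii() and monitor_id.isdigit()):
        raise ValueError(f"expected 16 digits, got {monitor_id!r}")
    return AqsMonitorId(
        state=monitor_id[0:2],
        county=monitor_id[2:5],
        site=monitor_id[5:9],
        parameter=monitor_id[9:14],
        poc=monitor_id[14:16],
    )


# The example identifier from the text above.
example = parse_monitor_id("0603700018810101")
```

Grouping records by `site_id` and `parameter` would bring together the co-located monitors that differ only in POC.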

1.2.3 AQS Monitor Deployment Height

AQS monitors are generally deployed at ground level or on the roof of a low building, but this is not universally the case. This may be of particular importance in New York, so the floor at which monitors were deployed was obtained from the air quality agency. Monitors deployed above the ground floor in the New York City metropolitan area are flagged with this information.

1.2.4 Data completeness rules for averaging

Data completeness rules are based on a minimum number of data points and a maximum gap between consecutive measurements, which are derived from the length of the averaging period and the typical monitoring frequency of the pollutant in question.

Table 6. Data completeness rules for averaging

Note that ozone is measured seasonally at approximately half of locations. For many applications, it may be more appropriate to request the April – October average rather than the annual average.

These standards were established for convenience and are not expected to cover the needs of every data request. All available AQS data are housed in the EAC database, not just those meeting these criteria. Users are welcome to design their own inclusion criteria; space is provided in the electronic data request form to outline criteria specific to each user’s needs. The monitor identification numbers are indicated on the study area maps at http://www.uwchscc.org/MESAAP/Data.aspx, so that users can list individual monitor numbers for which they would like data. These maps indicate the monitors that are included in the standard set, as well as additional monitors in the region. If no inclusion criteria are described, the standards above will be used as the default.

1.2.5 Quality Control

Air pollution monitoring is primarily conducted to comply with federal regulations, though individual air quality agencies may have other secondary objectives. Siting priorities, funding, local culture, and chance all have the potential to impact the quality of data from individual monitors. Therefore, certain criteria have been set for the inclusion of AQS data. After data are downloaded, the “Basic Level” of QC consists of the following steps:

Limits of detection (LOD) are determined from the documentation available on the AQS website; some may depend on the collection or analysis method. A range is provided for these species.

Measurements below the limit of detection (LOD) are replaced with a value equal to half the LOD. These values, and any daily averages associated with them, are assigned an ‘LOD’ flag.
Very high outliers (>= 1000 ppb for gases or >= 1000 µg/m3 for PM2.5) are investigated. Occasionally, the wrong units appear to have been assigned to the raw data (e.g., ppm instead of ppb) by the local agency. A more reasonable unit may be assigned by the data manager, but since the air quality agency typically is not asked to confirm this error, the measurement is assigned a ‘unit’ flag.
Daily averages are calculated for those pollutants for which AQS captures hourly data, provided that the hourly measurements (where they exist) meet the following criteria. The same applies to daily meteorology data that are computed from hourly data:
At least 18 hours for the day are available
At least 4 hours between 4:00 am and 9:00 am are available
At least 4 hours between 1:00 pm and 6:00 pm are available
For ozone, the 10 am – 6 pm average will also be available. The inclusion criterion for these measurements is that 6 of the intended 8 hours must be available.
For ozone, the maximum 8-hour average will also be available. This measurement can be calculated only if that day’s data meet the daily average criteria and if at least 6 hours’ data are available for each 8-hour averaging window.
A small number of additional measurements are excluded based on personal communication with local air monitoring agencies, which is documented in the Monitor Issue Log (see Appendix C). If extremely suspicious data were brought to the attention of a local agency, but no response was received or the agency confirmed their validity, then those data are retained in the database but may be assigned an ‘Agency’ flag to alert the analyst.
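Two of the “Basic Level” steps above, the half-LOD substitution and the hourly completeness criteria for daily averages, can be sketched as follows. This is an illustration under stated assumptions rather than the EAC processing code: the function names are hypothetical, hours are assumed to be labeled by the hour in which they begin (so “4:00 am to 9:00 am” is taken as hours 4 through 8, and “10 am to 6 pm” as hours 10 through 17), and the input is assumed to contain only valid measurements keyed by hour of day (0 to 23).

```python
from statistics import fmean
from typing import Dict, Optional, Tuple

# Assumed hour-beginning convention; adjust if hours are labeled differently.
MORNING_HOURS = range(4, 9)          # 04:00 through 08:59
AFTERNOON_HOURS = range(13, 18)      # 13:00 through 17:59
OZONE_DAYTIME_HOURS = range(10, 18)  # the intended 8 hours, 10:00 through 17:59


def substitute_half_lod(value: float, lod: float) -> Tuple[float, Optional[str]]:
    """Replace a measurement below the LOD with LOD/2 and return an 'LOD' flag."""
    if value < lod:
        return lod / 2.0, "LOD"
    return value, None


def daily_average(hourly: Dict[int, float]) -> Optional[float]:
    """Return the daily mean, or None if the completeness criteria are not met."""
    if len(hourly) < 18:
        return None
    if sum(h in hourly for h in MORNING_HOURS) < 4:
        return None
    if sum(h in hourly for h in AFTERNOON_HOURS) < 4:
        return None
    return fmean(hourly.values())


def ozone_daytime_average(hourly: Dict[int, float]) -> Optional[float]:
    """Return the 10 am to 6 pm ozone mean if at least 6 of the 8 hours exist."""
    values = [hourly[h] for h in OZONE_DAYTIME_HOURS if h in hourly]
    if len(values) < 6:
        return None
    return fmean(values)


# Illustrative data: a complete day, and a day missing two morning hours.
complete_day = {hour: 10.0 for hour in range(24)}
gappy_day = {hour: 10.0 for hour in range(24) if hour not in (4, 5)}
```

Returning `None` rather than a partial average keeps incomplete days visibly missing. The maximum 8-hour ozone average is omitted here because the text does not specify how the averaging windows are defined.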

1.2.6 Data Archiving

Older versions of the AQS data already used for publications and manuscripts have sometimes been archived on DVD, including a pre-database version of AQS data for the years 1980-1999. Historical versions of data within the EAC database are archived in compressed files that can be accessed to fill old data requests or retrieve older versions of covariates and predictions.

1.3. Monitoring Data from DEOHS Air Monitoring Studies

Below are overviews of the types of air monitoring that our working group has conducted. Study-specific QA/QC documents and other materials are available and can be accessed as needed. Some of these documents are listed in Appendix C. The EAC database contains the dates that sampling started and ended for each individual measurement, as well as the middle day of the intended two-week sampling period (typically a Wednesday). Samples with concentrations below the LOD are flagged, and the LODs associated with these samples are provided. Additional flags for minor circumstances were associated with a small number of measurements as described in the final Quality Assurance / Quality Control Report (see Appendix C).

In addition to the geographic covariates that are available for all locations, DEOHS monitoring locations are also associated with a building floor at which the data were collected. This is primarily of interest in cities such as New York where many living units are located well above street level. For the Coarse monitoring sites and where the living floor is not indicated, data were collected at the first (ground) or second floor.

1.3.1 Fixed Site Monitoring

Fixed sites are similar to AQS sites in that they are designed to collect a continuous series of measurements over the monitoring period. Fixed sites were operated continuously in all MESA Air regions (2-5 per region) from approximately July 2005 until July 2009. The exact start and end dates vary from site to site. The methods used for these monitoring sites collected two-week integrated measurements, using either Teflon filters for PM or Ogawa samplers for gases. Each PM sampler was intended to run on a 50% duty cycle for consecutive two-week intervals. These sites were chosen to represent MESA Air participant exposures or near-road exposures, or were co-located with an AQS monitor in the area. PM2.5 (for mass, LAC, and elements), NO2, NOX, SO2, and sometimes O3 were measured at these sites. The HEI/NPACT study collected PM2.5 for EC/OC analysis at these sites during some time periods.

Remote Air Data (RAD) / “Low-Cost” Sensor Monitoring: “Low-cost” sensors with 5-min resolution for PM2.5, NO2, NO, CO, and O3 were deployed in the seven regions of the ACT AP and MESA Air studies (Seattle, WA; Los Angeles, CA; St. Paul, MN; New York City, NY; Chicago, IL; Winston-Salem, NC; Baltimore, MD). In Seattle, monitors were deployed at participant and volunteer homes, at community sites, and co-located at AQS sites from Spring 2017 through Winter 2019. Other regions had slightly shorter deployment periods and fewer monitors, which were primarily sent to AQS sites (except Baltimore, which also had two periods of home site monitoring). Sensors were calibrated using regression models developed with co-located government monitoring data. Quality control screens were applied to remove data from clearly broken sensors. Calibrated data are available for PM2.5 in all regions, CO in Seattle and Baltimore, and NO2/NO/O3 in Seattle only. However, data quality is a concern for some of the data, especially for the gas sensors, which had several issues that made calibration difficult (NO2/NO/O3 in particular).

1.3.2 1.3.2 Home Outdoor Monitoring

Home outdoor monitoring refers to air monitoring conducted outside residential locations, primarily of participants involved in one of the health studies. In theory, repeated monitoring periods (or rounds) at the same home were intended to occur in distinct seasons: summer, winter, or “transitional” (spring or fall). Home outdoor monitoring was conducted in MESA Air, SPIROMICS Air, ACT-AP, CCAR Project 5 (Ogawas and VOCs), and the PM Center Panel Study.

1.3.3 1.3.3 Home Indoor Monitoring

Home indoor monitoring occurred at a subset of outdoor home monitoring locations. Similar sampling set-ups were operated concurrently inside and outside the participant’s home. PM2.5 (for mass, LAC, and elements), NO2, NOX, SO2, and O3 were measured at these sites in MESA Air. SPIROMICS Air additionally included nicotine. CCAR Project 5 included VOCs but not PM2.5.

1.3.4 1.3.4 Personal Monitoring

Personal monitoring occurred at a subset of indoor home monitoring locations. Participants carried samplers for NOX, NO2, SO2, and PM2.5 (for mass, LAC, and elements) for two weeks in MESA Air. In SPIROMICS Air, participants carried NOX, NO2, SO2, and O3 samplers for two weeks. CCAR Project 5 participants carried these Ogawas as well as VOC samplers.

1.3.5 1.3.5 In-Vehicle Monitoring

In-vehicle monitoring occurred in the personal vehicles of subjects who participated in CCAR Project 5. These samplers were sealed in canisters when the vehicle was not in use.

1.3.6 1.3.6 Roadway Gradient Snapshot Monitoring

Snapshot monitoring included ~100-150 individual locations monitored simultaneously for a two-week period using Ogawas mounted on telephone poles. The aim of this type of campaign is to learn more about pollutant gradients near roads. Gradient-type snapshot monitoring was conducted by MESA Air, SPIROMICS Air, and ACT-AP.

1.3.7 1.3.7 Other Kinds of Snapshot Monitoring

MESA Air Coarse PM: The Coarse PM Study collected snapshot data in the winter and summer at MESA Air participant homes in Chicago, Winston-Salem, and St. Paul. This study measured PM2.5, PM10, and endotoxin, and was conducted primarily at participant homes. NOX/NO2/SO2 Ogawa data from these home locations is available for most sampling rounds.
CCAR “Fuzzy Points”: The Center for Clean Air Research Project 1 conducted mobile monitoring campaigns in Atlanta, Winston-Salem, St. Paul, LA, and Baltimore. “Fuzzy points” were intersections that the monitoring vehicle passed through several times from different directions. Ogawa and VOC badges were hung near these intersections over the 2-week monitoring periods. Each city was sampled in a heating (winter) and non-heating (summer) season.
LAX “Flightpath”: Two airplanes land at LAX every minute between 6:30 am and midnight, with all airplanes following the same flightpath over Inglewood. During Fall 2014, 24 NOX/NO2 Ogawas were distributed in roughly a grid pattern in this community.
Yesler Terrace: The Seattle Housing Authority was planning a re-development of the Yesler Terrace neighborhood. NOX/NO2 Ogawas were deployed at 28 locations in February 2010 and 86 locations in March 2010.
Diesel Exhaust Exposure in the Duwamish Study (DEEDS): DEEDS measured PM2.5, light absorbing carbon (indicated by filters’ light absorption coefficients), NOX, NO2, and SO2 at snapshot sites in Seattle, WA in Summer and Winter 2012. These data are described in Jill Schulte’s DEOHS master’s thesis. All monitoring results are based on a two-week integrated sampling design.

Table 7. Pollutant availability by study. Rounds per home refers to home outdoor monitoring. Rounds per pole refers to snapshot monitoring. The air monitoring methods were either two-week integrated (TWI) using HPEMs and passive badges or remote air data (RAD).

1 MESA Air Cities: Winston-Salem, NC; New York City, NY; Baltimore, MD; Chicago, IL; St. Paul, MN; Los Angeles, CA
2 SPIROMICS Air Cities: Winston-Salem, NC; New York City, NY; Baltimore, MD; Los Angeles, CA; Ann Arbor, MI; Salt Lake City, UT; San Francisco, CA

1.3.8 1.3.8 Data Description

Sampling designs prior to 2017 were generally intended to yield consecutive two-week integrated air pollutant measurements. The start and end dates of each sampling event are available, and the ‘intended middle day’ is available for MESA Air measurements. Pump failure, building closures, and participant availability often resulted in measurements that were shorter than two weeks or that were shifted from the standard Wednesday-to-Wednesday schedule by one or two days. A few Coarse PM Study snapshots were scheduled for a two week period that began and ended on a day other than Wednesday. For all MESA Air samples, the ‘intended middle day’ is the Wednesday that best aligns that sample with the standard Wednesday-to-Wednesday schedule in that study area. For the Coarse PM Study snapshots, the intended middle day aligns all of the Coarse PM samples that were collected concurrently. SPIROMICS Air homes were deployed over the course of 1-2 weeks and do not have an intended middle day.

Measurements are reported as ppb for gases. Gas samples collected via Ogawas are analyzed using ion chromatography (IC) analysis, which is quite sensitive and can detect very small concentrations of ions. However, a small number of measurements were made that were below the detection limit of the instrument, also known as the lab LOD. These measurements will be reported as half the noise in the instrument’s baseline.

Samples typically accumulate a small amount of contamination during handling and shipping. This level will often exceed the lab limit of detection. “Field blanks” are used to determine this level of contamination, and are used to determine a correction for contamination and the “field” limit of detection. The limit of detection is provided along with the measured value (after correction), as some users may choose to replace the measurements made below the LOD with other values. More information about correction and limits of detection can be found in the QA/QC report (see Appendix C).

Measurements are reported as µg/m3 for particulate matter. Because particulate matter mass is determined gravimetrically for the two-week integrated method, and because the microbalance can always provide a filter’s weight, lab LODs are not available. Other measures of uncertainty are addressed in the QA/QC report. The field LOD is determined for mass for PM, EC/OC, and elemental species. Data users will receive this LOD as a concentration, calculated by the volume of air sampled. Because the volume of air varies by sample based on sampling duration and flow rate, the concentration LOD will not be the same for all samples. As for gases, data users will receive both the corrected measurement and the LOD. RAD monitors evaluate particulate matter using a laser (Plantower). Data processing for this method has not been finalized.
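The conversion of a mass-based field LOD to a per-sample concentration LOD can be sketched as follows. This is a minimal illustration, not the EAC processing code: the function name, the argument units (µg, L/min, minutes), and the example values are assumptions chosen only to show why the concentration LOD differs between samples.

```python
def lod_concentration(lod_mass_ug, flow_rate_lpm, duration_min):
    """Convert a mass-based LOD (ug) to a concentration LOD (ug/m3).

    The sampled air volume is flow rate times pump run time, so a sample
    that ran for less time has a smaller volume and a higher LOD.
    All argument names and units are illustrative assumptions.
    """
    if flow_rate_lpm <= 0 or duration_min <= 0:
        raise ValueError("flow rate and duration must be positive")
    volume_m3 = flow_rate_lpm * duration_min / 1000.0  # litres to cubic metres
    return lod_mass_ug / volume_m3


# Hypothetical two-week sample on a 50% duty cycle (pump runs 7 of 14 days).
full_run_min = 14 * 24 * 60 * 0.5
full = lod_concentration(5.0, 1.8, full_run_min)
# The same filter LOD when the pump failed halfway through the period.
short = lod_concentration(5.0, 1.8, full_run_min / 2)
```

With the same mass LOD, the shortened sample has twice the concentration LOD of the full sample, which is why the LOD is delivered alongside each measurement.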

1.3.9 1.3.9 Quality Control of Data

A full treatment of the QA/QC process for all of our monitoring data is out of scope for this document. Please refer to a project-specific Quality Assurance Project Plan (QAPP) and the final Quality Assurance/ Quality Control Report (see Appendix C) as needed.

Flags are associated with a small number of measurements and are provided to data users. Measurements taken near a source of pollution or with concentrations that might be impacted by slight deviations from the sampling protocol are indicated with a Source or Concentration flag. A full description of the flags exists in the QAPP and the final QA/QC report (see Appendix C). A number of measurements were also judged implausible by analysts who have worked extensively with these data. These measurements are flagged, and each flag is provided with a brief description of the issue that the analyst identified.

1.3.10 1.3.10 Raw Monitoring Data

Raw data include field log data, questionnaire data, and measurements. The collection, measurement, and transmittance of samples and data are covered extensively in the relevant field and lab Standard Operating Procedures documents (see Appendix C). Locations of the raw monitoring data files are listed in this Appendix.

1.4 1.4 Monitoring Data from Other Sources

1.4.1 1.4.1 Inhalable Particulate Network (IPN)

The Inhalable Particulate Network (IPN) was an EPA monitoring campaign from 1979 to 1984. This network included fine particulate measurements (PM2.5) and coarse particulate measurements (PM2.5-PM15) taken from dichot samplers, “inhalable particulates” (PM15) taken from size-selective input HIVOL samplers, and “total suspended particulate matter” (PM30) taken from HIVOL samplers, as well as a variety of other particulate components.

The EPA no longer has any data associated with this network. We digitized a printout of the data that was transferred to us from colleagues at NYU. Note that while this dataset was printed in 1984, measurements only extend through 1982 (indicating the printed version we have was not the final dataset). This printout contained site codes but not site locations. We manually linked site codes to tables with geographic locations; these tables were found in two separate EPA documents (Analysis of Inhalable and Fine Particulate Matter Measurements 1981 and Directory of Air Quality Monitoring Sites Active in 1977). Out of the 132 sites with fine PM (PM2.5) measurements, 102 of these sites were listed in either of the documents containing geolocations, and therefore 30 sites are missing lat/long data and geocovariates. Data from these 30 sites are therefore unusable.

Table 8. Description of particle pollution data available from IPN, including particle size and collection methods.

1.4.2 1.4.2 New York City Community Air Survey (NYCCAS)

NYCCAS collected 2-week measurements of air pollutants (PM2.5, LAC, NOX, O3) at utility pole locations throughout the 5 boroughs beginning in December 2008. Our database currently contains measurements collected through November 2017. More information regarding site selection can be found on NYCCAS’ website: http://www.nyc.gov/html/doh/html/environmental/community-air-survey.shtml

1.5 1.5 Participant Address History

1.5.1 1.5.1 MESA Air and MESA Neighborhood Participants

The main MESA study collected current addresses on all MESA Classic participants from the baseline exam and updated these addresses at all subsequent follow-up calls and clinic visits. At the beginning of MESA, addresses were only used for mailings, etc., and out-of-date addresses were overwritten by current addresses. In Exams 2 and 3, the MESA Neighborhood study, a study ancillary to MESA and directed by Dr. Ana Diez Roux at the University of Michigan, administered a Residential History Questionnaire to all MESA Classic participants who attended these exams and consented to MESA Neighborhood. This questionnaire acquired historical addresses from 1980 until the administration of the survey. MESA Air funded identical Residential History Questionnaires for all MESA Air New Recruits and MESA Air participants recruited from the MESA Family study during Exams 3/4. In 2006, MESA Air requested that all addresses collected by the main MESA study during clinic visits and follow-up phone calls be retained by the Coordinating Center. Residential histories are available for participants in the MESA Neighborhood Study for the period between 1980 and Exams 2/3 (2002-2004), and from the main MESA study from 2006 onward. Living floor and building type are available for locations for which an Air Questionnaire was administered.

The MESA Neighborhood study cleaned all addresses they collected, and contracted with Mapping Analytics to geocode the cleaned addresses. In geocoding these addresses, Mapping Analytics employed a 50 foot offset from the centerline of the road. The EAC database includes the geocodes generated by Mapping Analytics for the addresses at which participants lived between 1980 and 1999, with a few exceptions as described below. The MESA Air EAC geocoded all addresses where participants reported residing from January 1, 1999 forward with ArcGIS using a 30 foot offset perpendicular to the street. During data cleaning, MESA Air recovered some addresses that Mapping Analytics was not able to geocode. These were geocoded with a 30 foot offset.

Efforts were made to establish an address history that was sensible and complete, such that each participant had a single residence for every time point between 1980 and 2010. In the residential history, not all addresses had move-in or move-out dates. Addresses were ordered by available move dates and track date (contact date or date of questionnaire administration). A missing move-in date was assigned as the day after the previous address’s move-out date. A missing move-out date was assigned as the day before the track date of the next address, unless the last-known date at the previous address was also a track date. If only consecutive track dates were available, then the midpoint of the dates was used as the move-out date and the next day was used as the move-in date. Some track dates were misclassified as move-in dates; these were identified by comparison to available exam dates and track dates. PO Box addresses were dropped whenever possible, though they were included (and flagged) if no other address appeared to be valid for the concurrent time period. If only month and year (not day) were available for move dates, the move-out date was assigned to the end of the month, and the move-in date was assigned to the beginning of the month. If the year was available for the move date, but not the month or day, the middle date of the year was used. If two addresses had the same track dates, the midpoint of the difference was used for the move dates. The majority of these rules were based on those employed by the MESA Neighborhood study at the University of Michigan.
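Three of the date rules above (month-only dates, year-only dates, and consecutive track dates) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the EAC cleaning code: the function names are hypothetical, and the exact day taken as the “middle date of the year” and the rounding of an uneven midpoint (rounded down here) are assumptions.

```python
import calendar
from datetime import date, timedelta


def impute_move_date(year, month=None, day=None, kind="in"):
    """Fill in a partially known move date.

    kind is "in" for a move-in date or "out" for a move-out date.
    """
    if month is None:
        # Only the year is known: use the middle date of the year
        # (assumed here to be halfway between Jan 1 and Dec 31).
        start, end = date(year, 1, 1), date(year, 12, 31)
        return start + (end - start) // 2
    if day is None:
        if kind == "out":
            # Month and year only: a move-out falls at the end of the month.
            return date(year, month, calendar.monthrange(year, month)[1])
        # Month and year only: a move-in falls at the start of the month.
        return date(year, month, 1)
    return date(year, month, day)


def split_between_track_dates(prev_track, next_track):
    """Only consecutive track dates are known for two addresses.

    The midpoint is the move-out date of the earlier address and the
    following day is the move-in date of the later address.
    """
    move_out = prev_track + (next_track - prev_track) // 2
    return move_out, move_out + timedelta(days=1)


move_out, move_in = split_between_track_dates(date(2005, 1, 1), date(2005, 1, 11))
```

The remaining rules (ordering addresses, handling PO Boxes, and detecting misclassified track dates) depend on record-level context and are not covered by this sketch.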

Decease dates were provided by the Coordinating Center. For these participants, an end date for the last known address was set as the decease date. Current addresses for living participants have a null end date.

The majority of addresses included in the database are the participants’ primary addresses, but some participants also provide secondary addresses, which are indicated as appropriate.

1.5.2 1.5.2 Technical Note: Address geocoding and geocode flags

Participant addresses that were added in Rev 12 were geocoded using ArcGIS 10.5.1, Business Analyst, and the USA Local Composite 2016 parcel and street information.

Previously, participants’ home addresses (street address, city, state, and zip code) for MESA were geocoded in ArcGIS 9.2 or 9.3 (ESRI, Redlands, CA) using data provided by the MESA Coordinating Center and the TeleAtlas Dynamap 2000 v.16.1 road network (Boston, MA).

MESA Air participant locations are geocoded at the EAC using automated geocoding procedures in ArcGIS for all addresses that match up to a selected sensitivity (80% for this study). Spelling sensitivity and minimum match score are both set to 80 for automated geocoding. The default minimum candidate score, 10, is used. Originally, the “side offset”2 used was 30 feet, with a 0 foot end offset (the default in ArcGIS 9.2 and 9.3). Addresses that were geocoded onto A1, A2, or A3 roads due to the 0 offset setting were re-geocoded with a 3 percent end offset (the default in ArcGIS 10). In the event that ArcGIS is unable to match the address with 80% accuracy, it will prompt the user to match the addresses interactively. The user must exercise good judgment for this process, and common fixes include removing apartment or unit numbers, fixing spelling errors, and checking the address with Google Earth, Google Maps, or Bing Maps.

In addition, a subset of addresses was geocoded with the parcel-based geocoding available in ArcGIS 10. This was done to increase the spatial accuracy of geocodes for residences located very close to roadways, since very small differences in locations near roads can have a large influence on exposures. Ideally, all locations would be geocoded using parcel-based geocoding, but it would be time-consuming to create new geocodes (and, more importantly, new geocovariates) for all locations, and the differences in locations between street- and parcel-based geocodes are generally small and likely only important for locations near roads. In addition, the data underlying parcel-based geocoding are not available for all locations, so at best, parcel-based geocoding could only happen in a subset of homes. To select the locations to re-geocode and relocate using the parcel data, we selected all locations identified as being within 150 m of a major road based on the street geocodes and, when possible, calculated the parcel geocodes. If the parcel geocode and street geocodes were more than 100 m apart or more than 50% different, and the parcel geocode was more than 10 m from any major road, the parcel geocode replaced the original geocode. A street geocode was retained if the parcel geocode fell within 10 m of a road.
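The replacement rule described above can be written as a small decision function. This is a hedged sketch rather than the procedure actually run at the EAC. In particular, the text does not say what quantity “more than 50% different” refers to; the sketch assumes it means the relative change in distance to the nearest major road, and the function and argument names are hypothetical.

```python
def use_parcel_geocode(separation_m, street_dist_to_road_m, parcel_dist_to_road_m):
    """Return True if the parcel geocode should replace the street geocode.

    separation_m: distance between the street and parcel geocodes.
    street_dist_to_road_m / parcel_dist_to_road_m: distance from each
    geocode to the nearest major road.
    """
    # Only street geocodes within 150 m of a major road were candidates.
    if street_dist_to_road_m > 150:
        return False
    # A parcel geocode that falls within 10 m of a road is not trusted.
    if parcel_dist_to_road_m <= 10:
        return False
    far_apart = separation_m > 100
    # ASSUMPTION: "50% different" is read as the relative change in
    # distance to the nearest major road; the source does not define it.
    if street_dist_to_road_m > 0:
        change = abs(parcel_dist_to_road_m - street_dist_to_road_m) / street_dist_to_road_m
    else:
        change = float("inf")
    return far_apart or change > 0.5
```

For example, under these assumptions a parcel geocode 30 m from the street geocode that moves the home from 40 m to 70 m from the road would replace the street geocode, while one that moves it from 40 m to 45 m would not.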

All addresses geocoded to the exact location by street geocoding will be noted as “Exact” under the geocode type; those geocoded to an exact location by parcel geocoding will be noted as “Parcel”. Occasionally a road may have more than one name, so any addresses that were geocoded with different street names to exactly the same location were considered exact matches. Some addresses were identifiable through Google Earth but could not be geocoded to the exact street number in ArcGIS. These addresses were geocoded to the nearest intersection on the same street. For invalid street addresses with a valid zip code, participant locations were geocoded to the centroid of the zip code. Intersection and zip code geocodes were always created via street (not parcel) geocoding methods. Invalid addresses, such as addresses with no valid zip code, PO Boxes, and addresses outside the continental US, were noted as fatal. Each address will be associated with one of the geocode types that are listed below.

Table 9. Geocode types

1.5.3 1.5.3 Cardiovascular Health Study (CHS) Participants

For Rev 14, CHS address locations and address histories were replaced with a new dataset provided by the CHS Coordinating Center. Geocoding for Rev 14 was conducted using Business Analyst for ArcGIS 10.3.

1.5.4 1.5.4 Women’s Health Initiative – Observational Study (WHI - OS) and Clinical Trial (WHI – CT) Participants

Addresses for participants in the Women’s Health Initiative were collected and geocoded by the WHI Coordinating Center. These addresses were collected at the participants’ initial interview and at follow-up. Follow-up after 2009 was conducted primarily by mail. As of Rev14, address locations through 2016 are included in the database. The latitudes and longitudes of these locations were transmitted to the EAC without street addresses, participant names, or participant unique numbers. The exact method of geocoding for these addresses is not known, and no data cleaning was performed by the EAC.

1.5.5 1.5.5 SPIROMICS Air Participants

Addresses were provided by the Coordinating Center at UNC with a dummy participant ID and were geocoded at the EAC. Participants were asked to provide the current address at baseline as well as all addresses at which they had lived for 10 years prior to the study. The current address and any recent addresses were recorded at follow-up visits and during quarterly follow-up calls. Geocoding for Rev 11 was conducted using Business Analyst for ArcGIS 10.3, with no offset.

1.5.6 1.5.6 ACT – Air Pollution Study

Billing and study addresses were provided by Kaiser Permanente Research Institute (KPRI) with a dummy participant ID and were geocoded at the EAC. Geocoding for Rev 14 was conducted using Business Analyst for ArcGIS 10.3.

1.5.7 1.5.7 The Sister Study and the Two Sister Study Participants

Addresses for participants in The Sister Study and the Two Sister Study were geocoded at the EAC in two batches. The first batch included four kinds of addresses: “current” (at enrollment), secondary (at enrollment), longest-lived, and childhood (indicated in that order by a string of 4 binary digits at the end of the native_id). Sister Study participants were enrolled between 2003 and 2009. The same types of addresses were geocoded for the Two Sister Study. Participants in The Sister Study were women whose sisters had breast cancer; the Two Sister Study was an ancillary study to The Sister Study that enrolled the women with breast cancer themselves (the sisters of the original Sister Study participants). Participants in the Two Sister Study were enrolled between 2008 and 2010. Geocode types are similar to the EAC types that are listed in Table 9: exact, intersection, zip, or fatal.

The second batch of addresses included addresses, contact dates, and move dates since enrollment for Sister Study participants only (i.e., not Two Sister Study). The EAC determined a unique address history for each Sisters participant and then conducted geocoding using Business Analyst for ArcGIS 10.3.

1.5.8 1.5.8 PSID Study Participants

IDs of census blocks where participants in the Panel Study of Income Dynamics (PSID) reside were provided by Kyle Crowder’s group. Locations were approximated at the EAC by calculating census block centroids using ArcGIS 10.

1.5.9 1.5.9 BRFSS Study Participants

Zip codes were provided for survey respondents in the Centers for Disease Control and Prevention’s (CDC) Behavioral Risk Factor Surveillance System (BRFSS) in Washington State. Depending on the area of the ZCTA, between one and four census block centroids within each zip code were randomly selected to represent the average pollution level. Census blocks were required to have a total year 2010 population > 0. Geocovariates and predictions were generated for each point selected.

1.5.10 1.5.10 REGARDS Study Participants

The REGARDS Coordinating Center provided the EAC with geocodes for REGARDS participants from baseline up through Dec 19, 2017. All data cleaning and geocoding were conducted by the REGARDS Coordinating Center, which used SAS to geocode locations.

1.5.11 1.5.11 Small-Fee Projects

Collaborators sometimes send us a small set of locations for which they would like model predictions and sometimes geographic covariates. These cohorts include: 1) 1000 locations relevant to the OMEGA study, 2) a group of women that underwent IVF, 3) residence locations for the Washington Heights Inwood Aging Project (WHICAP), 4) a small number of locations relevant to the Religious Orders Study (ROS), 5) residence locations for Ginkgo Evaluation of Memory Study (GEMS) and 6) monitor and other locations for a study conducted in Oakland.

1.6 1.6. Participant Exam Dates

The dates on which each MESA or MESA Air participant came into the clinic for each of the five MESA exams are provided to the EAC by the Coordinating Center. In addition to the primary exam date, the Coordinating Center also provided the dates for each participant’s coronary artery CT scan, ultrasound, spirometry, and MRI as these tests sometimes occurred on a separate day.

Since a very limited number of participants have an exam at a given field center on any particular day, these dates are considered identifying information and, as such, cannot be distributed by the EAC. These dates are used by the EAC to provide “year prior to exam” averaging or time boundaries. Thus, in some cases, address histories and exam dates may be ‘masked’ by providing a move date or exam date as the number of days since an event of interest (such as a participant’s baseline exam).
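The masking described above amounts to replacing a calendar date with an offset from a reference event. A minimal sketch follows; the function name and the example dates are hypothetical and are not taken from any participant record.

```python
from datetime import date


def mask_date(event_date, reference_date):
    """Express a date as days since a reference event.

    The result is negative when the event precedes the reference, for
    example a move that happened before the baseline exam.
    """
    return (event_date - reference_date).days


baseline = date(2001, 1, 1)          # hypothetical baseline exam date
moves = [date(1995, 6, 30), date(2001, 1, 11)]
masked_moves = [mask_date(d, baseline) for d in moves]
```

Because only the offsets are distributed, the order of events and the intervals between them are preserved for averaging, while the calendar dates themselves are withheld.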

In addition, baseline exam dates for SPIROMICS Air participants are available.

1.7 1.7. Participant Time Location

As part of the MESA Air Questionnaire, participants reported location patterns by season (summer or winter) and day of the week. This included questions specific to the number of hours spent in transit, at home indoors and out, at work indoors and out, and at “other activity locations” (such as volunteering) indoors and out. The primary purpose of these data is to provide time-weighted, infiltration-adjusted, aggregated estimates of PM2.5 exposure, or to provide the percent of time spent indoors and outdoors. For these calculations, we sum the total time spent in indoor locations (reported as home indoor, work indoor, and other indoor) and outdoor locations (reported as home outdoor, work outdoor, and other outdoor) for each season-day. We then average time indoors/outdoors across all days of the week separately for the summer and winter. When calculating individual-level exposures integrating indoor and outdoor concentrations with time-location information, the “summer” answers will be used when the two-week average temperature exceeds 18 degrees Celsius, and the “winter” answers will be used for periods with average temperatures equal to or below 18 degrees Celsius. A number of participants did not complete the Air Questionnaire, and we imputed missing responses for these participants. In instances where a participant completed some, but not all, of the time-location section of the questionnaire, we assumed that a missing day was the same as a weekday or weekend day in the same season. Otherwise, it was assumed to be the average of non-missing days. If an entire season was missing, we used the responses from the other, non-missing season. For more specific analyses involving time-location, the raw, unaggregated data are available from the Coordinating Center.
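The aggregation and season-selection steps above can be sketched as follows. This is an illustration only: the dictionary keys, function names, and example hours are assumptions, and the imputation of missing responses is not shown.

```python
INDOOR = ("home_indoor", "work_indoor", "other_indoor")
OUTDOOR = ("home_outdoor", "work_outdoor", "other_outdoor")


def season_for_period(mean_temp_c):
    """Pick which questionnaire answers apply to a two-week period.

    Summer answers are used above 18 degrees Celsius; winter answers are
    used at or below 18 degrees Celsius.
    """
    return "summer" if mean_temp_c > 18 else "winter"


def mean_daily_hours(days):
    """Average indoor and outdoor hours across the reported days.

    days holds one dict per day of the week for a single season, with
    hours reported for each location key (key names are hypothetical).
    """
    n = len(days)
    indoor = sum(sum(d[k] for k in INDOOR) for d in days) / n
    outdoor = sum(sum(d[k] for k in OUTDOOR) for d in days) / n
    return indoor, outdoor


# Hypothetical winter responses: five identical weekdays, two weekend days.
weekday = {"home_indoor": 14, "work_indoor": 8, "other_indoor": 0,
           "home_outdoor": 1, "work_outdoor": 0, "other_outdoor": 0}
weekend = {"home_indoor": 18, "work_indoor": 0, "other_indoor": 2,
           "home_outdoor": 2, "work_outdoor": 0, "other_outdoor": 1}
indoor_h, outdoor_h = mean_daily_hours([weekday] * 5 + [weekend] * 2)
```

Time in transit is reported separately in the questionnaire and is left out of both totals in this sketch.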

1.8 1.8. AQS Monitors for ‘Acute’ Analysis (Averaging Period < 1 Year)

Studies of the effects of short-term variations in air pollution exposure (“acute analyses”) generally leverage the variability in day-to-day pollution concentrations, rather than the variability in concentrations over space. The most common exposure period for this type of analysis is 1, 2, or 3 days prior to the exam or event. As MESA Air and SPIROMICS models are generally on the 2-week resolution, AQS data must be relied upon for these acute analyses.

The MESA Air EAC developed a daily time series of PM2.5 data collected by AQS monitors in each of the MESA Air study regions (Baltimore, Chicago, New York/Rockland County, LA/Coastal, Riverside (defined as those locations in Riverside or San Bernardino County), St. Paul, and Winston-Salem), as well as for the Ann Arbor, San Francisco, and Salt Lake regions. To ensure that analyses are not biased by the effects of inconsistent or erratic monitor schedules, we only included data from monitors within these regions that recorded daily PM2.5 levels using a federal reference method and were in operation from July 1999 through the end of 2011.

In areas other than LA/Coastal and NYC, these time series consist of monitoring data from a single monitor in each region. In LA, there were several qualifying monitors, and a single time series was constructed by taking each monitor’s mean concentration, subtracting that from each individual measurement for that monitor, averaging the residuals across monitors for each day, and then adding the average of the monitors’ means back to each of the daily averaged residuals. In NYC, the only long-running daily monitor, 360050110, is missing 73 days of data in 2001 and was offline from mid-2010 through mid-2012. To create a time series with better coverage, data from the ten nearest monitors that operated while 360050110 was offline were used in addition to this daily monitor. The ten monitors were combined according to the procedure used in LA to approximate a regional trend, which was then used to model the output at site 360050110. Measurements that were modelled in this way are flagged in the column ‘estimate_flag’.
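The LA combination procedure can be sketched directly from the description above. This is a minimal sketch, not the EAC code: the function name, the nested-dictionary input layout, and the example values are assumptions, and the subsequent NYC modelling step is not shown.

```python
from statistics import mean


def combine_monitors(series):
    """Combine several monitors into one daily regional time series.

    series maps monitor id -> {day: concentration}. Days may be any
    sortable key (for example ISO date strings). A monitor missing a
    given day is left out of that day's average.
    """
    # Step 1: each monitor's own mean concentration.
    monitor_means = {m: mean(obs.values()) for m, obs in series.items()}
    # The average of the monitors' means, added back in step 4.
    grand_mean = mean(monitor_means.values())
    days = sorted({d for obs in series.values() for d in obs})
    combined = {}
    for d in days:
        # Step 2: residual of each measurement from its monitor's mean.
        residuals = [obs[d] - monitor_means[m]
                     for m, obs in series.items() if d in obs]
        # Steps 3 and 4: average the residuals, then restore the level.
        combined[d] = grand_mean + mean(residuals)
    return combined


# Hypothetical monitors: B reads higher than A and is missing day 3.
example = combine_monitors({
    "A": {1: 10, 2: 12, 3: 14},
    "B": {1: 20, 2: 22},
})
```

Working with residuals rather than raw values means a day on which only the lower-reading monitor reported is not pulled down by that monitor's lower baseline.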

Additionally, the time series in St. Paul is based on monitor 271230871. This monitor collected data on a daily basis during the Exam 1 period but switched to a 1-in-3 day schedule in 2003. No other federal reference method (FRM) daily monitoring data was available for this area. In order to improve coverage, the FRM measurement was modelled from the measurements reported from two continuous monitors co-located with the FRM monitor, but using non-reference methods. While this supplemented time series has been made available for analysts who require daily data, it should be noted that the agreement between non-reference and reference methods is not great (mean RPD = 19%) and analysts should consider using a three-day average in Saint Paul for acute analysis. For more information, see the acute monitoring appendix. As with NYC, these estimates are flagged in the column ‘estimate_flag’.

Analysts should note that studies that include a spatial component and an averaging time greater than 14 days may find an average of modeled exposures (see Section 2.12) more appropriate than these AQS data. Requests that include an exposure time period between two weeks and one year will be provided with both an average of the acute time series and appropriate modeled exposure averages; the requesting analysts should determine the most appropriate exposure metric for their specific analysis.

1.8.1 1.8.1 Pre-Adjusted PM2.5 Exposure

Some users may prefer to conduct an analysis using PM2.5 that has been pre-adjusted for seasonal variability, to ensure that observed effects are PM2.5-related rather than (say) temperature-related. This would primarily be of concern when studying acute outcomes that vary seasonally, and is primarily directed at outcomes which may be ‘triggered’ by unusually high deviations of PM2.5 from recent and typical levels. Pre-adjusted PM2.5 exposures are the residuals from a prediction model of PM2.5 that includes 6-degrees-of-freedom-per-year B-splines on temperature and humidity, 12 degrees of freedom per year on calendar time, and categorical adjustment for day of week. The R code for pre-adjustment can be made available for analysts desiring a different model specification.

1.9 1.9. AQS Monitors Eligible for ‘Nearest Monitor’ Analysis (Averaging Period >= 1 Year)

Simplified spatial analyses can be conducted using the ‘nearest monitor’ average PM2.5 as the exposure estimate for a particular location. Generally we recommend using a modeled exposure for the primary hypothesis, but there are some cases in which analysts may want to compare results from nearest monitors. Nearest monitor analyses are not recommended for pollutants other than PM2.5, because those pollutants vary significantly on the local scale rather than the regional scale. Due to seasonality in air pollution trends, this analysis is usually recommended for exposure periods of one year or longer.

Monitors are typically considered eligible for inclusion in a nearest monitor analysis if they provide good coverage for the entire study period. Specific monitors have been identified for MESA Air regions from 1999-2012. Other regions and time periods will require monitor selection.

Distances to AQS monitors within 100 kilometers of each participant home were calculated via the spatial-statistics ‘sp’ package of the R language. The standard distance function of that package uses location latitudes and longitudes and the Great Circle Distance formula to calculate distances on a spherical surface without the need for planar projections. See Appendix C for the location of the code.
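The great-circle calculation described above can be illustrated with a minimal Python sketch. It uses the haversine formula on a sphere; the Earth radius constant, the function names, and the dictionary layout for monitors are assumptions made for illustration, and the ‘sp’ package's own implementation may differ in detail.

```python
import math

# Assumed mean Earth radius in km; this value is not taken from the EAC code.
EARTH_RADIUS_KM = 6371.0088


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Haversine great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing the argument slightly above 1.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def monitors_within(home, monitors, max_km=100.0):
    """Return (monitor_id, distance_km) pairs within max_km of home, nearest first.

    home is a (lat, lon) tuple; monitors maps a hypothetical monitor id to (lat, lon).
    """
    found = []
    for monitor_id, (lat, lon) in monitors.items():
        d = great_circle_km(home[0], home[1], lat, lon)
        if d <= max_km:
            found.append((monitor_id, d))
    return sorted(found, key=lambda pair: pair[1])
```

The first element of the returned list is the ‘nearest monitor’ candidate; eligibility screening for coverage of the study period would still need to be applied separately.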

1.10 1.10. Grid Locations

In order to produce maps that reflect the spatial heterogeneity inherent in our exposure models, geographic covariates were calculated for grid points at several different scales. For city-specific models and maps, grid points were generated at three densities in each MESA Air area. Grids were designed to show a high level of detail in the areas with the greatest density of participant addresses (a dense grid at 500 m resolution) and to show less detail in outlying and rural areas (1 km resolution and 2 km resolution at the edges of the grid area). In addition, fine-scale grids at a 50 m resolution were generated in order to produce insets that highlight very small-scale variation in model predictions. For the national models, a grid with a 25 km resolution was created. This grid covers the continental US.

1.11 1.11. Geographic Covariates

The following table summarizes the geographic variables that are available and the sources of the data. Subsections follow that address each data source in more depth. Geographic covariates are typically not available outside the continental US.

Table 10. Available geographic information

a Distances calculated to spatial features are truncated at 25 km
c Distances calculated to spatial features are truncated; see covariate-specific section
d Available only in MESA Air areas
e Available only for New York City
f Not buffered in the same way that other variables were
j Available for NYC and Chicago
g Available for USGS historical source only
h Available only for LA/Riverside
i See the “Documentation of MESA Air Implementation of CALINE3QHR Model” for details on the inputs to the CALINE dispersion model

Figure 1. Major roads may be represented by 2-4 parallel lines, such as the A1 (I-5). Here, the black lines that represent the A1 and the blue dotted line that represents an A3 are inside a 300m buffer. These lengths are summed for line length-type variables. Distances are calculated from the point (often the center of a building) to the nearest point on a line or polygon feature (such as the commercial district shaded in yellow). Example calculations: ll_a1_r00300 = 650; ll_a3_r00300 = 550; m_to_a1 = 250; m_to_a3 = 40; m_to_comm = 1360.

1.11.1 1.11.1 Sources of GIS data

Aside from MESA Air monitoring and home locations, geographic data are obtained from various external sources, such as TeleAtlas, the US Census Bureau, and the US Geological Survey. All of these data are free of charge and can be acquired at any time, with the exception of the TeleAtlas data. The TeleAtlas database was obtained on DVD from the USEPA under their usage license. This DVD is currently stored at the EAC; contact the data manager if necessary. Other sources, and the websites from which their data can be downloaded, are listed below by data category. For detailed publication sources, publication dates, and accuracy information for the obtained data, please contact the EAC for the metadata document.

Table 11. Data sources

1.11.2 1.11.2 Creation or Projection of Shapefiles from Raw Data Sources

All geographic variables must be calculated from shapefiles. Emissions data are downloaded as flat files and shapefiles are created from latitude and longitude at the EAC. USGS land use, TeleAtlas road data, census data for the year 2000, and NDVI images are downloaded as shapefiles from the source website. All shapefiles were re-projected into State Plane Zones and clipped with a 25 kilometer ‘buffer’ that extends beyond the boundaries of the State Plane Zone. This enables geographic calculations for points near the border of a particular zone.

1.11.3 1.11.3 Land use data

Variables for land use as a percentage of a buffer were calculated from two sources. USGS polygon layers, generated by manually intensive methods using aerial photography from the 1970s and 1980s, were used to calculate the variables with the “lu” prefix. Rasters based on satellite data from the year 2006 were obtained from the Multi-Resolution Land Characteristics (MRLC) Consortium, and these were used to calculate the variables with the “rlu” prefix. In general, the EAC recommends that the Raster Land Use data be used for exposure models developed to reflect “current” exposures (e.g., from 1999 to the present) and that the USGS data be used to calculate “historical” exposure (prior to 1999). Please note that it is not advisable to create an exposure model that includes both sets of land use variables. Please contact the EAC with further questions on which set of land use variables to use.

Tables 12 and 13 contain the full lists of the possible land use designations, with rough equivalents between the two sources. More information about the USGS land use classifications can be found at http://landcover.usgs.gov/pdf/anderson.pdf. The stated positional accuracy for USGS land cover is approximately 200 meters. This affects all areas, and may produce unexpected results, especially within small buffers. Analysts are encouraged to scrutinize results related to these variables. The positional accuracy for the satellite-based rasters is 30 meters and these variables may be more reliable than those calculated from the USGS polygon files. More information about the satellite-based land use classifications can be found here: http://www.mrlc.gov/nlcd06_leg.php. However, analysts are cautioned that the ice and snow designations in this data source may be inaccurate, as pavement is occasionally misclassified as this land use type.

As an additional note on the processing of the polygon files, the USGS organizes land cover data by grids. Multiple grids might be required for a single State Plane Zone. Land use grids are merged into a single shapefile and projected accordingly. Commercial land use is selected and exported as a separate data category for the distance to commercial land use calculation.

Table 12. Land use variable names (lu_<type>_p<radius>) for variables based on 1970s and 1980s aerial photography

a Provided as a “distance” variable, not as land use in a buffer b Mixture of developed, agricultural, and natural areas c Mixture of categories

Table 13. Land use variable names (rlu_<type>_p<radius>) for variables calculated from satellite-imagery rasters

1.11.4 1.11.4 TeleAtlas Road Data

TeleAtlas road data were obtained from the EPA for the year 2000. These road networks had duplicate road segments in some areas. That is, two road segments with the same name and other identifying information were occasionally found to lie directly on top of one another. These were considered digitization errors and were removed in ArcGIS using Python code.

Feature class codes are used to categorize roadways. Limited access highways are designated as A1 roads. Other major roads, such as state and county highways without limited access, are designated as A2 or A3 roads. More detailed information about the road classification system used in our database can be found at http://www.maris.state.ms.us/pdf/CFCCcodes.pdf.

1.11.5 1.11.5 Distance to Road and Near Road Determination

The perpendicular distance from locations to A1, A2, and A3 roads was calculated, based on the TeleAtlas road network and the geocoding of addresses as specified in section 2.4. Locations are considered ‘near road’ if the distance to an A1 or A2 is less than 100 meters, or if the distance to an A3 is less than 50 meters.
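As a minimal sketch of the near-road rule, the following Python computes the shortest distance from a point to a road segment in projected (meter) coordinates and then applies the 100 m and 50 m thresholds. The function names are hypothetical, and representing each road as a list of straight segments is a simplifying assumption; the production calculation was done with GIS tools.

```python
import math


def point_to_segment_m(px, py, ax, ay, bx, by):
    """Shortest distance from point P to segment AB, all in projected meters."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: A and B coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of P onto AB, clamped to the segment ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_near_road(m_to_a1, m_to_a2, m_to_a3):
    """Apply the 'near road' thresholds: <100 m to an A1 or A2, or <50 m to an A3."""
    return m_to_a1 < 100 or m_to_a2 < 100 or m_to_a3 < 50
```

With the Figure 1 example values (m_to_a1 = 250, m_to_a3 = 40), a location would be flagged as near road because of the A3 distance, regardless of its A2 distance.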

1.11.6 1.11.6 Sum of line lengths in buffers

The total length of A1, A2, and A3 roads was calculated in various buffer sizes, based on the TeleAtlas road network and the geocoding of addresses as specified in section 2.4. For example, if multiple A3 roads were present within the buffer, the lengths of all segments contained within the buffer were summed. See Figure 1 for an illustration. These values are provided in units of meters and variables have the format ll_<road type>_s<buffer radius in m>.
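The buffer summation can be sketched in Python as follows, assuming roads are stored as straight segments in projected meters and the buffer is a circle around the location. The function names are hypothetical; the approach clips each segment to the circle by solving the quadratic for the two crossing points.

```python
import math


def length_in_buffer_m(cx, cy, r, ax, ay, bx, by):
    """Length of segment AB that lies inside the circle of radius r centered at (cx, cy)."""
    dx, dy = bx - ax, by - ay
    fx, fy = ax - cx, ay - cy
    a = dx * dx + dy * dy
    if a == 0.0:
        return 0.0  # zero-length segment contributes nothing
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - r * r
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:
        return 0.0  # the line misses the circle or only touches it
    root = math.sqrt(disc)
    # Parameters (0..1 along AB) where the segment enters and leaves the circle.
    t_in = max(0.0, (-b - root) / (2.0 * a))
    t_out = min(1.0, (-b + root) / (2.0 * a))
    return max(0.0, t_out - t_in) * math.sqrt(a)


def sum_line_length_m(cx, cy, r, segments):
    """Sum the in-buffer lengths of all segments, each given as (ax, ay, bx, by)."""
    return sum(length_in_buffer_m(cx, cy, r, *seg) for seg in segments)
```

Because parallel carriageways of a divided highway are separate lines, each one contributes its own length to the sum, as in the Figure 1 illustration.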

1.11.7 1.11.7 Airports and Major Airports

Airport shapefiles and airport emission sums were obtained from the NEI database. Airport runways were subsetted, and the centroids of the runways were calculated. Emissions in tons and freight numbers were then merged to the runway centroids and projected into the appropriate State Plane projections. Airports were classified as “major” according to the number of passengers per year served by the airport and by the freight tonnage. A value of approximately 8,000 passengers per year was determined to be the approximate center of the distribution of passenger volumes among all airports in the study area. Approximately 160,000 pounds per year was determined to be the approximate center of the distribution for freight. Airports reporting passenger volumes or freight tonnage above these values, and all international airports, were considered major airports. Therefore, distance to major (large) airport (m_to_l_airp) was calculated separately from the distance to any airport (m_to_airp).

1.11.8 1.11.8 Coastlines, Railroads, and Rail Yards

Coastlines, railroads, and rail yard locations were obtained from the TeleAtlas geodatabase.

1.11.9 1.11.9 Ports

Port locations were obtained from the World Port Index from the National Imagery and Mapping Agency (now the National Geospatial-Intelligence Agency). The designations of ‘small’, ‘medium’, and ‘large’ are made by that agency. The documentation says only that “the classification of harbor size is based on several applicable factors, including area, facilities, and wharf space. It is not based on area alone or on any other single factor.”

1.11.10 1.11.10 Distance to Nearest Truck Route and Length of Truck Routes in Buffers

Truck route data was obtained from the National Transportation Atlas Database 2009 for data collected in 2008. The distance to the nearest truck route in meters is provided (m_to_truck). Truck route lengths in buffers were also summed (tl_s<radius>), in the same manner as road lengths.

1.11.11 1.11.11 Population

Population buffers are provided as the estimated total number of people living within the specified area as of the year 2000 census, calculated by multiplying each block group's population density by the area of the block group lying within the buffer and then taking the sum. Population data at the block group level were obtained from the US Census Bureau for the year 2000, and block group boundaries were extracted from the TeleAtlas database. Data from these two sources were then merged by block group key to create a new shapefile. This shapefile was then split into a series of shapefiles based on the designated State Plane projection. New columns were added to calculate the area of each block group polygon in km2 and then the population density in persons per km2. Point locations are buffered, and the population densities and areas of the block groups within the buffer are used to calculate the total number of individuals within certain radii (measured in meters). These variables appear in the database with names such as pop_s01000.
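The area-weighting arithmetic can be illustrated with a short Python sketch. It assumes the overlap area between each block group and the buffer has already been computed by GIS software; the function names and example numbers are hypothetical.

```python
def density_per_km2(population, area_km2):
    """Population density of a block group in persons per square kilometer."""
    return population / area_km2


def population_in_buffer(pieces):
    """Estimate the population inside a buffer.

    pieces is an iterable of (density_per_km2, overlap_area_km2) pairs, one per
    block group that intersects the buffer.
    """
    return sum(density * overlap for density, overlap in pieces)


# Hypothetical example: two block groups partly inside a buffer.
group_a = (density_per_km2(1200, 0.8), 0.5)  # 0.5 km2 of group A lies in the buffer
group_b = (density_per_km2(900, 3.0), 2.0)   # 2.0 km2 of group B lies in the buffer
estimate = population_in_buffer([group_a, group_b])
```

This approach assumes that residents are spread uniformly within each block group, so estimates for small buffers that cut through large block groups should be read with that in mind.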

1.11.12 1.11.12 Emissions Data

EPA’s Emission Inventory and Analysis Group prepares a national database, the National Emission Inventory (NEI), of air emissions information with input from numerous State and local air agencies, from tribes, and from industry. The NEI database includes estimates of facility-specific Criteria Air Pollutants (CAPs) and Hazardous Air Pollutants (HAPs) emissions, along with the source-specific parameters necessary for modeling, such as location and facility characteristics (stack height, exit velocity, temperature, etc.). The most recent NEI data available are the third and final version of the 2002 NEI, posted in January 2008. Using these data, the EAC sums the total tonnage of NOX, SO2, PM2.5, PM10, and CO emitted from short stacks within 3 km and from tall stacks between 3 km and 15 km or 30 km of locations of interest. Facilities having stack heights of at least 30 meters (98.425 feet) are considered “tall” stack facilities. The rest are considered “short” stack facilities.
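A minimal Python sketch of the short-stack and tall-stack sums follows. The dictionary keys and function name are hypothetical, and the treatment of facilities exactly on a distance boundary (inclusive upper bounds) is an assumption, since the text does not specify it.

```python
TALL_STACK_MIN_M = 30.0  # stacks at least this tall are "tall" (from the text)
SHORT_STACK_MAX_KM = 3.0


def emission_sums(facilities, outer_km=15.0):
    """Return (short_stack_tons, tall_stack_tons) for one location and one pollutant.

    facilities is an iterable of dicts with hypothetical keys:
      distance_km     distance from the location of interest
      stack_height_m  stack height in meters
      tons            emitted tonnage of the pollutant
    outer_km is the outer limit for tall stacks (15 or 30 km).
    """
    short_total = 0.0
    tall_total = 0.0
    for fac in facilities:
        is_tall = fac["stack_height_m"] >= TALL_STACK_MIN_M
        d = fac["distance_km"]
        if not is_tall and d <= SHORT_STACK_MAX_KM:
            short_total += fac["tons"]
        elif is_tall and SHORT_STACK_MAX_KM < d <= outer_km:
            tall_total += fac["tons"]
    return short_total, tall_total
```

Note that under this rule a tall stack closer than 3 km and a short stack farther than 3 km contribute to neither sum.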

Table 14. Emissions variable names

1.11.13 1.11.13 Normalized Difference Vegetation Index

NDVI was obtained as a series of 16-day composite satellite images from the year 2006. The index was converted by the University of Maryland from the -1 to 1 scale to the 0-255 (pixel brightness) scale. On this scale, water has a value of approximately 50 and areas with dense vegetation have values around 200. For each location of interest, for each image, all pixels with a centroid within a certain distance of the location were averaged (radii included 250 m, 500 m, 1 km, and 5 km). For each buffer size, five summary numbers were calculated from the series of 23 averages for each location: the 25th percentile, median, and 75th percentile of the entire year’s series; the median of the expected “high vegetation” season, defined as April 1 - September 30; and the median of the expected “low vegetation” season, defined as the rest of the year.
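The five NDVI summaries can be sketched in Python using only the standard library. Several details here are assumptions rather than documented EAC choices: composites are assigned to a season by their start date, start dates are taken to fall every 16 days from January 1, percentiles use the "inclusive" interpolation method, and the dictionary keys are hypothetical names.

```python
import statistics
from datetime import date, timedelta


def ndvi_summaries(buffer_means, year=2006):
    """Summarize one location's series of 23 buffer-averaged NDVI values (0-255 scale)."""
    if len(buffer_means) != 23:
        raise ValueError("expected 23 sixteen-day composites")
    # Assumed start date of each composite: every 16 days from January 1.
    starts = [date(year, 1, 1) + timedelta(days=16 * i) for i in range(23)]
    season_start, season_end = date(year, 4, 1), date(year, 9, 30)
    high = [v for d, v in zip(starts, buffer_means) if season_start <= d <= season_end]
    low = [v for d, v in zip(starts, buffer_means) if not season_start <= d <= season_end]
    q25, q50, q75 = statistics.quantiles(buffer_means, n=4, method="inclusive")
    return {
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "high_season_median": statistics.median(high),
        "low_season_median": statistics.median(low),
    }
```

The function would be called once per location and buffer radius, with the 23 values produced by averaging the pixels whose centroids fall inside that buffer.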

Table 15. NDVI variable names

1.11.14 1.11.14 Impervious Surface

Impervious surface was obtained from the Multi-Resolution Land Characteristics (MRLC) Consortium’s National Landcover Dataset. Imperviousness was calculated by the MRLC from Landsat 7 Enhanced Thematic Mapper Plus (ETM+) satellite imagery. Briefly, the MRLC uses regression tree methods and the reflectance of different wavelengths of energy that are measured by satellite to characterize land cover. Image data are from 2006. Imperviousness refers to the percentage of area in a pixel that is covered with an impervious surface, such as pavement or concrete. Covariates provided by the EAC are averages of pixel values within various radii.

Figure 2. Each pixel in the impervious surface raster is assigned a value according to the permeability of the surfaces in that pixel. We provide the average pixel value within a buffer. In the illustration, both A and B have the same average value (50%).

1.11.15 1.11.15 CALINE Long-Term Average

Dispersion of air pollutants was modeled using the CALINE3QHCR line-source model. This model integrates road position, traffic volume, diurnal traffic patterns, and meteorology to produce an hourly estimate of the concentration of a generic, non-reactive pollutant at specific locations known as “receptors”. These concentrations have arbitrary units, and are not comparable between different MESA Air regions. Hourly estimates were averaged over time to produce a long-term (2004-2006) average concentration. This value can be used as a spatial covariate representing the average pollution exposure due to major-road traffic, relative to other locations in the same region.

In the EAC database Rev 4, slight modifications of CALINE3QHCR were implemented, based on prior experience and operational considerations. Datasets generated prior to Rev 4 contain “CALINE”-type covariates, whereas datasets generated at Rev 4 or later contain ‘CALINEMOD’-type covariates. The modifications included:

The minimum distance between receptors and any road link was set to 20m, to reduce outliers due to positional-accuracy limitations.
The minimum mixing height was set to 50m (CalPuff default)
At any specific hour, only road links within +/-30 degrees of the upwind direction from the receptor were considered.
The Python code preparing road link input files from shapefiles was standardized to include only links within 9km and +/-30 degrees, and to reduce run time and storage overhead.
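The modifications listed above can be sketched in Python as simple filters and floors. This is an illustration only, not the MESA Air implementation: each road link is reduced to a single representative point, the wind direction is taken as the compass direction the wind blows from, and all function names are hypothetical.

```python
import math

MIN_RECEPTOR_LINK_M = 20.0   # minimum receptor-to-link distance (from the text)
MIN_MIXING_HEIGHT_M = 50.0   # minimum mixing height (from the text)


def bearing_deg(x0, y0, x1, y1):
    """Compass bearing from (x0, y0) to (x1, y1), clockwise from north, in projected meters."""
    return math.degrees(math.atan2(x1 - x0, y1 - y0)) % 360.0


def angle_diff_deg(a, b):
    """Smallest absolute difference between two compass angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def link_is_considered(receptor, link_point, wind_from_deg, max_km=9.0, half_width_deg=30.0):
    """True if the link lies within max_km and within +/-half_width_deg of upwind."""
    dist = math.hypot(link_point[0] - receptor[0], link_point[1] - receptor[1])
    if dist > max_km * 1000.0:
        return False
    to_link = bearing_deg(receptor[0], receptor[1], link_point[0], link_point[1])
    return angle_diff_deg(to_link, wind_from_deg) <= half_width_deg


def effective_distance_m(distance_m):
    """Apply the 20 m floor on receptor-to-link distance."""
    return max(distance_m, MIN_RECEPTOR_LINK_M)


def effective_mixing_height_m(height_m):
    """Apply the 50 m floor on mixing height."""
    return max(height_m, MIN_MIXING_HEIGHT_M)
```

In practice a link is a line segment, so a fuller version would test whether any part of the segment falls inside the upwind sector rather than a single point.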

In the greater Los Angeles and Riverside study areas, due to the geographical extent of the MESA Air cohort sub-regions (120 km east to west) and the meteorological heterogeneity, separate CALINE values were calculated using meteorological stations situated at the area’s eastern and western edges. These are available for all monitoring locations, and for many other locations (see below).4 For tractability and reproducibility reasons, in the LA area the CALINE variable is designated as “primary” or “alternate” according to standard geographical boundaries. The designation is as follows:

In LA, Ventura and Orange counties, the primary (“calinemod_lt_a0xxxx”) variable is calculated using the western, LAX airport meteorology. The alternate variable (“calinemod_alternate_a0xxxx”) is available for all monitoring locations, and for cohort and grid locations >15km from the coast or in State Plane Zone 6.
In Riverside and San Bernardino counties, the primary variable is calculated using the eastern, March Air Force Base meteorology (10km SE of Riverside). The alternate variable is available for all monitoring locations, and for cohort and grid locations <25km from the coast, in State Plane Zone 5, or west of -117.7 degrees longitude.
In other counties (if relevant), the primary variable is designated according to state plane – LAX for locations in Zone 5 and March AFB in Zone 6.

CALINE, auxiliary code, and further documentation are available in the CALINE directory of MESA-AIR’s Subversion repository.

1.11.16 1.11.16 Motor Vehicle Emissions Trends

Modeling was performed using the Motor Vehicle Emission Simulator 2014 (MOVES2014) software package5 for 10 parameters in 21 counties that cover the MESA Air study areas. These parameters were: CO, NO2, NOX, Total PM2.5, PM10 and PM2.5 from brake wear, PM10 and PM2.5 from tire wear, EC, Non-EC PM, sulfate, aerosol H2O, and total energy expended by vehicles.

Vehicle types were classified into two distinct categories, broadly labeled “Heavy Duty Vehicles” and “Light Duty Vehicles.” Only the contributions of vehicles travelling on urban restricted access and urban unrestricted access road types were considered in this analysis. Each possible combination of road type and vehicle type was run separately.

Table 16. Classification of vehicle types

Monthly values are available for the parameters by county for the years 1990 and 1999-2014, by vehicle/access category (light duty restricted access, light duty unrestricted access, heavy duty restricted access, heavy duty unrestricted access).

1.11.17 1.11.17 Residual Oil in New York City

Residual oil boilers are associated with high emissions of soot and of some elements (e.g., sulfur and nickel). Such boilers are common in the New York City metropolitan area. They are used in medium-to-large apartment buildings and in institutions such as hospitals and colleges. Most of the emission is usually through roof chimneys, rather than at street level. Residual oils graded 4 and 6 are the heaviest fractions of petroleum distillation, with 6 being heavier and dirtier.

Data were received from the Environmental Defense Fund (EDF), which transferred the information to the public domain via a FOIA request to the New York City government. Data were cleaned of a few gross location and BTU-capacity errors. Boiler locations were geocoded by the MESA Air EAC. Only boilers installed prior to 2007 were included (>95% of the original EDF dataset).

Table 17. Residual oil variable names

1.11.18 1.11.18 Distance to City Hall

This variable is available only for MESA regions, out to a distance of about 100-150km from each MESA city. Additionally, some challenges were encountered in converting the calculation to the psql process and in Rev 11 it is missing for many locations. It can be used as a proxy for the excess emissions due to congestion and increased activity around the center of a metropolitan area, above and beyond the emissions predicted by other variables such as population, land use or roads.

For the 3 smaller MESA cities (Baltimore, St. Paul, Winston-Salem), the variable provides the distance in meters to the nearest city hall of a city as large as the MESA city or larger. The distance is censored between 1 and 25 km. Therefore, the urban center is effectively approximated as a circle 2km in diameter, with all points inside the circle assumed equally “central”. The reference city halls for these study areas are:

Baltimore - Baltimore, DC (White House), Philadelphia.
St. Paul - Minneapolis, St. Paul.
Winston-Salem - Winston-Salem, Greensboro, Charlotte, Durham, Raleigh, Roanoke (VA).

In the Winston-Salem region the minimum distance is 500 m rather than 1 km, since most of the reference cities are smaller, including Winston-Salem itself.

For the 3 larger MESA cities, distances to two locations were calculated. The “m_to_main_cityhall” variable measures the distance to the center of the main MESA city only. It is censored between 3 km (2 km in Chicago) and 50 km. It is used as a proxy for the overall metropolitan-center plume. In addition, for these 3 cities the “m_to_local_cityhall” variable also includes city halls of edge cities (but, generally, not of continuous suburbs to the main city). Distances are censored between 1 and 25 km. The added cities are:

Chicago - Milwaukee, Madison, South Bend (IN).
Los Angeles - Riverside, San Bernardino, Ontario, Pomona, Long Beach, Santa Ana, San Diego, Bakersfield.
New York - Newark, Trenton, Bridgeport (CT), New Haven (CT).

The list of added cities was assembled to accommodate the location of agency and MESA monitors in the outlying areas that have been used in MESA Air exposure models. Values are missing for locations outside the main study areas and for locations geocoded to zip code centroids.
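The censoring applied to the city hall distances amounts to clamping each distance between a lower and an upper bound. A minimal Python sketch follows; the function names are hypothetical, and only the bounds for the “m_to_main_cityhall” variable are encoded.

```python
def censor_m(distance_m, lower_m, upper_m):
    """Clamp a distance in meters to the interval [lower_m, upper_m]."""
    return max(lower_m, min(upper_m, distance_m))


def main_cityhall_bounds_m(city):
    """Censoring bounds for m_to_main_cityhall: 3 km (2 km in Chicago) to 50 km."""
    lower = 2000 if city == "Chicago" else 3000
    return lower, 50000
```

For example, a location 400 m from a reference city hall in Baltimore would be assigned 1000 m under the 1 to 25 km censoring, which is what makes all points inside the central circle equally “central”.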

City halls were located via Google Maps, with geocodes rounded to the 3rd digit (~100m). Distances were calculated using R “sp” package function “spDistsN1”, which calculates distances directly from the geocodes, assuming a spherical Earth.

1.11.19 1.11.19 Elevation

An elevation value in meters was extracted for each MESA location point from a National Elevation Dataset (NED)-based raster grid provided by the USGS. In the continental US and Hawai’i, the resolution of the raster was 1 arc second. The resolution was 2 arc seconds for Alaska. A concentric set of twenty-four elevation points was then generated around each location point as a statistical sample at two radii (1000 m and 5000 m). The standard deviation and the counts of points “above” or “below” a threshold elevation (+/- 20 or 50 meters, respectively) were calculated for each location using the twenty-four point sample.
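The ring statistics can be sketched in Python as shown below. The sketch assumes that the threshold is measured relative to the elevation at the location itself and that a population standard deviation is used; neither detail is stated in the text, and the function name and dictionary keys are hypothetical.

```python
import statistics


def elevation_ring_stats(center_elev_m, ring_elevs_m, threshold_m):
    """Summarize elevations sampled on a ring around a location.

    center_elev_m  elevation at the location itself, in meters
    ring_elevs_m   elevations at the sampled surrounding points, in meters
    threshold_m    20 for the 1000 m ring or 50 for the 5000 m ring
    """
    n_above = sum(1 for e in ring_elevs_m if e > center_elev_m + threshold_m)
    n_below = sum(1 for e in ring_elevs_m if e < center_elev_m - threshold_m)
    return {
        "stdev_m": statistics.pstdev(ring_elevs_m),
        "n_above": n_above,
        "n_below": n_below,
    }
```

A location in a valley would show a high count of surrounding points above the threshold, while a hilltop would show a high count below it.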

Table 18. Elevation variable names

A small number of other statistics were calculated regarding the points surrounding the locations. These statistics are not anticipated to be useful for general modeling applications and will not be distributed as part of the set of standard covariates. Users interested in creating a more sophisticated model for the effect of elevation should contact the EAC for more details.

1.11.20 1.11.20 Urban Topography

Building footprints, heights and parcel information were obtained for Chicago and New York City, from the respective city governments (see Table 11 for data source details). Statistics were calculated in an effort to characterize the degree to which each location in these two cities is situated in a “Street Canyon” (i.e., surrounded by buildings in a manner that significantly constrains air flow). If these data are missing for any specific location in NYC or Chicago, this indicates that no building was found within 60m of that location.

Generally speaking, individual urban topography variables do not directly quantify the pollution exposure exacerbation due to street canyons. They serve as building blocks for street canyon models currently under development and can be provided to analysts with a specific use for them. However, they are not intended to be distributed with the ‘standard set’ of modeling variables.

Table 19. Urban topography variable names

1.11.21 1.11.21 Census Data

Census data for the year 2000 were obtained from the US Census via the University of Washington Library. These demographic variables are included in the Census’ Summary File 3 (SF 3). Data were compiled at the block (bk), block group (bg), and tract (tr) aggregating levels.

Table 21. 2000 Census variable names

Census data are also available for 2010. After the 2000 Census, the Census Bureau began administering the long-form questionnaire on a rolling (yearly) basis through the American Community Survey (ACS). The Census only releases ACS data for small areas in five-year aggregations, and so we provide ACS data from the period 2006 through 2010 for tracts and block groups. Block-level data are more restricted, and therefore short-form (SF1) 2010 data are provided. 2010 data were retrieved through the National Historical Geographic Information System6 (http://www.nhghis.org/) and from Social Explorer (http://www.socialexplorer.com/). All dollar figures were reported in 2010 inflation-adjusted dollars. More variables, years, and geographic areas are publicly available through these websites and through http://www.census.gov/.

Table 22. Year 2010 census variable names

1.11.22 1.11.22 Distance to Bus Route

Bus route data was obtained through contact with the New York Department of Transportation and included bus route information for the New York City area including routes in New York, New Jersey, and Connecticut. The distance to the nearest bus route in meters is provided (m_to_bus). Bus route lengths in buffers were also summed (bus_s<radius>), in the same manner as road lengths.

Bus routes for other areas of the US will be added in the future. These data were downloaded from GTFS Feeds (http://transitfeeds.com/) in October 2016. GIS software was used to reproject the data into State Plane Zones, remove duplicate links, and remove unwanted transportation types (ferry, light rail).

1.11.23 1.11.23 Satellite Data: PM2.5, NO2, SO2, HCHO, CO

Satellite-based estimates of air pollution for PM2.5, NO2, SO2, CO, and formaldehyde (HCHO) were obtained.

Annual satellite-based estimates of ground-level PM2.5 (1998-2014) at 0.1° were obtained from a previously published, publicly available data set (van Donkelaar et al., 2016). Briefly, five aerosol optical depth (AOD) satellite retrievals were combined with (1) satellite-based measurements of vertical aerosol profiles, (2) modeled AOD and ground-level PM2.5 from a global chemical transport model (GEOS-Chem), and (3) ground-based AOD measurements from the aerosol robotic network (AERONET) to estimate annual ground-level PM2.5 on a 0.1° grid (van Donkelaar et al., 2016).

We obtained daily L2 surface-level CO from the Measurements of Pollution in The Troposphere (MOPITT) sensor on NASA’s Terra satellite for years 2001-2016 (Deeter et al., 2017). Tropospheric NO2, SO2, and HCHO were derived from daily measurements obtained by the Ozone Monitoring Instrument (OMI) onboard the NASA Earth Observing System (EOS)-Aura satellite. Measurements were screened for quality based on cloud cover, illumination, and data flags and averaged temporally following a standard method for satellite data that considers pixel values within a buffer. The resolution of the final, processed rasters is provided in the table below. More details are available from Matt Bechle.

The value of the covariate in each case is the value of the pixel/grid cell in which the location lies.
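Assigning the value of the containing pixel can be sketched in Python as a simple index lookup on a regular latitude/longitude grid. The grid layout (rows running north to south from an upper-left origin) and all names are assumptions for illustration; the actual rasters have the resolutions listed in Table 23.

```python
import math


def pixel_value(lat, lon, grid, lon_min, lat_max, cell_deg=0.1):
    """Return the value of the grid cell containing (lat, lon), or None if outside.

    grid is a list of rows; row 0 is the northernmost row and column 0 the
    westernmost column. (lon_min, lat_max) is the upper-left corner of the grid.
    """
    row = math.floor((lat_max - lat) / cell_deg)
    col = math.floor((lon - lon_min) / cell_deg)
    if row < 0 or col < 0 or row >= len(grid) or col >= len(grid[0]):
        return None
    return grid[row][col]
```

Because no interpolation is performed, two nearby locations on opposite sides of a cell boundary can receive different covariate values.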

Table 23. Satellite data methods

1.11.24 1.11.24 Method of Covariate (Variable) Calculation

Calculations were performed using the PostGIS spatial extension to PostgreSQL. Relevant scripts are located in Q:.

All distance calculations are truncated at 25 kilometers, except for distances to A2 or A3 roads, which are truncated at 20 kilometers. For example, the distance to the nearest coastline will be 25,000 meters for all locations in the Winston-Salem study area, as well as for those locations in the Los Angeles study area that are more than 25 kilometers from the coast. Data users should be aware that this will be true whether the location is 25.1 kilometers from the coast or 100 kilometers from the coast.
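The truncation rule can be expressed as a one-line cap, sketched here in Python. The function name is hypothetical, and the variable names used to select the 20 km cap (m_to_a2, m_to_a3) follow the naming pattern seen elsewhere in this document but are assumptions as used here.

```python
DEFAULT_CAP_M = 25000
ROAD_CAP_M = 20000  # applies to distances to A2 and A3 roads


def truncate_distance_m(distance_m, variable):
    """Cap a distance at 25 km, or at 20 km for distance to A2 or A3 roads."""
    cap = ROAD_CAP_M if variable in ("m_to_a2", "m_to_a3") else DEFAULT_CAP_M
    return min(distance_m, cap)
```

Analysts should therefore treat a value equal to the cap as "at least this far" rather than as a measured distance.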

1.11.25 1.11.25 Data Quality

The EAC does not perform QC on the source shapefiles that were downloaded from third-party data providers. Data quality as reported by the data generating agency is disclosed in the metadata document and discussed in the MESA Air Quality Assurance Project Plan (see Appendix C).

1.12 1.12 Exposure Model Outputs

Modeled pollutant estimates may lag behind location data corrections by one database revision. Therefore, locations may not be assigned the estimate produced for an old location that lies more than 10m from the updated location.

For pollutants modeled at the two-week time resolution, an annual average will not be distributed for a participant that has fewer than 25 two-week predictions available for the residences at which they lived during the time period of interest. This could happen if, for example, a participant lived outside the MESA Air modeling areas during half of the year prior to their exam.

Table 24. Availability of predictions in MESA Air and SPIROMICS modeling regions (and WHICAP). For participant locations, predictions are generated for all addresses geocoded to intersections or exact locations. Models are maximum-likelihood based, unless otherwise indicated. Cells indicate the database version providing the underlying covariate data.
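The minimum-coverage rule for annual averages (at least 25 two-week predictions in the period of interest) can be sketched in Python as follows. The function name is hypothetical, missing predictions are represented as None, and the simple unweighted mean is an assumption; the distributed averages may weight predictions differently.

```python
def annual_average(two_week_predictions, min_count=25):
    """Return the mean of the available two-week predictions, or None if too few.

    two_week_predictions holds one value per two-week period in the year of
    interest, with None where no prediction is available (for example, when the
    participant lived outside the modeling areas).
    """
    available = [p for p in two_week_predictions if p is not None]
    if len(available) < min_count:
        return None
    return sum(available) / len(available)
```

A participant who lived outside the modeling areas for half of the year would have roughly 13 available predictions and would therefore receive no annual average.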

1.12.1 1.12.1 NOX, NO2, and PM2.5 Likelihood Model Predictions

The NOX, NO2, and PM2.5 predictions use a spatio-temporal modeling methodology originally developed by Paul D. Sampson. It was first described in Fuentes et al.7, and adapted to MESA Air’s data systems by Sampson et al.8. The model is optimized via maximum likelihood, as developed and described by Szpiro et al.9 Subsequently, Keller et al. (citation) implemented the full unified hierarchical spatiotemporal model for MESA Air data. Both monitoring data and geographical variables were extracted from MESA Air Exposure Assessment Center’s database. All models are based on Version 14 (DR0328, September 2020) data for predictions from the beginning of 1999 through at least June 2018 for NO2, July 2018 for PM2.5, and January 2019 for NOx. Beginning from Rev 14, an “omnibus” version of city-specific modeling is implemented. This unifies the prediction process for all cohort locations in cities for which supplemental monitoring is available, namely Baltimore, New York City, Los Angeles, Winston-Salem, Chicago, St. Paul, Ann Arbor, Salt Lake City, and San Francisco. Therefore, a single, more convenient and efficient modeling process produces predictions for MESA Air, SPIROMICS Air and WHICAP participants.

It is worth noting that, in addition to the data described in this document and provided as part of the MESA Air Data Request System (V7 and above), models for pollutants in NYC and Rockland include NYCCAS monitoring data. NYCCAS collected NO, NO2, and PM2.5 in New York City over a period of two years between December 2008 and December 2010. Samples were collected at 150 sites for 7 to 8 two-week periods (one per season per year) over the two-year period. Five reference locations, one in each NYC borough, collected two-week samples for the entire two-year period10.

The open-source R statistical analysis language was used, with the core model-fitting functions taken from the “SpatioTemporal” R package11, authored by Lindstrom et al.12 and maintained by Lindstrom. The package is available on CRAN. Predictions were generated at all MESA Air participant addresses that were valid from 1999-2019, were within a modeling area, and were geocoded to an ‘exact’, ‘block’, or ‘intersection’ location. See appendix for modeling area and model performance statistics.

1.12.2 1.12.2 O3 Likelihood Model Predictions

The O3 models were originally developed by Meng Wang using equivalent methods to those used in the previous section. The current predictions are based on Version 14 data and run through at least March 2018.

Again, the open-source R statistical analysis language was used, with the core model-fitting functions taken from the “SpatioTemporal” R package13, authored by Lindstrom et al.14 and maintained by Lindstrom. The package is available on CRAN. Predictions were generated at all MESA Air participant addresses that were valid from 1999-2018, were within a modeling area, and were geocoded to an ‘exact’, ‘block’, or ‘intersection’ location. See appendix for modeling area and model performance statistics.

1.12.3 1.12.3 SPIROMICS-Specific Likelihood Model Predictions

Predictions were obtained using methods similar to those utilized in MESA-Air (described briefly in 1.12.1) following the operational details of the hierarchical spatiotemporal model in Keller et al. Monitoring data and geographical covariates were extracted from MESA Air Exposure Assessment Center’s database for seven SPIROMICS cities (Baltimore, MD; New York, NY; Winston-Salem, NC; Los Angeles, CA; San Francisco, CA; Salt Lake City, UT; and Ann Arbor, MI; the first four of which were also MESA cities). After covariate pre-processing based on the recommendations in Keller et al., core model fitting was performed using SpatioTemporal, an R package available on CRAN. As indicated above, beginning from Rev 14, an “omnibus” version of city-specific modeling is implemented. This unifies the prediction process for all cohort locations in cities for which supplemental monitoring is available, namely Baltimore, New York City, Los Angeles, Winston-Salem, Chicago, St. Paul, Ann Arbor, Salt Lake City, and San Francisco. Therefore, a single, more convenient and efficient modeling process produces predictions for MESA Air, SPIROMICS Air and WHICAP participants. Predictions are based on a draft of Rev 14 of the data (DR0328, September 2020). Coverage periods are as indicated above for MESA Air. See the appendix for a summary of source monitors, modeling areas and model performance.

1.12.4 1.12.4 National Spatiotemporal Model Predictions - PM2.5

The national spatiotemporal model is fit separately on 9 regions. Notably, the model is fit on Wednesday-to-Wednesday two-week periods, but due to inconsistent monitoring schedules between MESA monitoring campaigns, those two-week schedules are not the same in all regions. Therefore, these two-week predictions are split into one-week predictions (by repeating the prediction for a given two-week period in each of its two one-week periods). This allows consistent (one-week) prediction intervals even in region-overlap areas. Note that while one-week predictions exist in the database, the model is not reliable down to this one-week temporal scale, and therefore the minimum recommended averaging period is two weeks. Time-weighted two-week predictions will incorporate monitoring data from adjacent (i.e., preceding or subsequent) two-week monitoring periods. In general, caution should be used when using short exposure averaging periods since there will be some amount of temporal measurement error.
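As a minimal sketch of the splitting step described above, the following Python shows how a two-week prediction could be repeated into two one-week records, and how a two-week average that starts off-cycle then draws on adjacent two-week periods. The function names and the (start date, value) record layout are assumptions made for illustration; they are not taken from the EAC database.

```python
from datetime import date, timedelta

def split_two_week_predictions(predictions):
    """Expand two-week predictions into one-week records.

    `predictions` is a list of (start_date, value) pairs, where start_date
    is the first day of a two-week period (hypothetical layout). Each value
    is repeated for both one-week halves of its period.
    """
    weekly = []
    for start, value in predictions:
        weekly.append((start, value))
        weekly.append((start + timedelta(days=7), value))
    return weekly

def two_week_average(weekly, start):
    """Average the one-week records beginning at `start` and `start + 7 days`."""
    lookup = dict(weekly)
    return (lookup[start] + lookup[start + timedelta(days=7)]) / 2.0

# Two consecutive two-week periods, both starting on a Wednesday.
two_week = [(date(2010, 1, 6), 10.0), (date(2010, 1, 20), 14.0)]
weekly = split_two_week_predictions(two_week)

# A two-week window starting one week off-cycle mixes both model periods.
off_cycle = two_week_average(weekly, date(2010, 1, 13))
```

Because each one-week record is only a copy of its parent two-week prediction, the one-week values carry no additional temporal information, which is consistent with the two-week minimum averaging period recommended above.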

Within each region, a model is fit similar to MESA-Air, except that smoothing parameters are fit on only a subset of the long-term monitors to reduce computational time. To reduce obvious discontinuities at region borders, these regions were designed to be slightly overlapping (approximately 100km in either direction of region borders). Predictions in overlaps are made separately for both (or all) models in overlapping regions, then these predictions are distance-weighted to produce a single prediction for a given period. See appendix for summary of model performance.
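The exact weighting scheme used in overlap areas is not specified here. The sketch below illustrates one plausible form of distance weighting, under the assumption that each regional prediction is weighted by how far inside that region's boundary the location lies. The function name and the `depth_km` input are hypothetical.

```python
def blend_overlap_predictions(preds_and_depths):
    """Combine regional predictions at one location in an overlap area.

    `preds_and_depths` is a list of (prediction, depth_km) pairs, where
    depth_km is the (assumed) distance from the location to the edge of
    the region whose model produced the prediction. Deeper inside a
    region means more weight for that region's model.
    """
    total_depth = sum(depth for _, depth in preds_and_depths)
    if total_depth <= 0:
        raise ValueError("at least one region must have positive depth")
    weighted_sum = sum(pred * depth for pred, depth in preds_and_depths)
    return weighted_sum / total_depth

# A location 150 km inside region A and 50 km inside region B.
blended = blend_overlap_predictions([(10.0, 150.0), (14.0, 50.0)])
```

With weights of this kind, a prediction shifts smoothly from one regional model to the other as a location moves across the overlap, which is the purpose of the overlapping design.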

1.12.5 1.12.5 Light Absorption Coefficient (LAC) Predictions

LAC predictions are available at MESA participants’ residential locations. A spatio-temporal model for LAC was developed based on MESA Air monitoring data collected between July 2005 and July 2009, NYCCAS data from 2008 – 2010 for NYC and Rockland, and geocovariates from Rev 7 of the database. The model was fit using the SpatioTemporal R package, and predictions were averaged over the timeframe of MESA Air monitoring (2006-2008). The spatial averages from the model are intended to represent long term exposures to black carbon. See appendix for model performance statistics.

1.12.6 1.12.6 Pragmatic PM2.5 Predictions

The pragmatic PM2.5 predictions were calculated using R and the methods described in “Pragmatic Estimation of a Spatio-Temporal Air Quality Model with Irregular Monitoring Data”15. These are available as two-week integrated predictions for ambient PM2.5 concentrations. “Initial” pragmatic predictions were generated for the “at exam” addresses for MESA Classic participants at Exam 1 and MESA Air Family and New Recruits for their baseline exam (Exam 3 or 4). “Final” pragmatic predictions were generated at all MESA Air participant addresses that were valid from 1999-2009, were within a modeling area, and were geocoded to an ‘exact’ or ‘intersection’ location. Besides encompassing more addresses and a longer timeframe, the road network used was changed from the Census source to the TeleAtlas source. Additional covariates, such as NDVI, and larger buffers of existing covariates, were available for the partial least squares regression. Site-varying neighborhood radii were used in the local random effects modeling strategy to compute trend functions. Finally, the Hawkins and Cressie variogram estimator was used for space-time residuals. See appendix or citation for model performance statistics.

1.12.7 1.12.7 National Model PM2.5, PM10, and NO2 Predictions

The national models for PM2.5, PM10 and NO2 are universal kriging models with partial least squares used to select relevant components for the mean regression. In Rev 9, we additionally include an alternative national NO2 model that includes columnar satellite NO2 measurements as an independent predictor variable. This model showed improved performance compared to the model without satellite data and is the recommended exposure. The nation was divided into three regions based on topography. Each model was fit as a whole using maximum likelihood, with each region having its own set of estimated parameters for each pollutant and year. That is, for each year of annual average monitoring data, models were separately estimated (meaning that for each year PLS components, PLS coefficients, and variograms were estimated separately). Predictions were generated for all participant residence locations. The method was developed by Paul Sampson et al. and implemented by Michael Young and Josh Keller16. PM2.5 (1999-2011) is based on Version 5 (August 2012) data, PM10 (1990-2010) is based on Version 6 (February 2013) data, NO2 (1990-2012) is based on Version 7 (August 2013), and satellite-enhanced NO2 (1990-2012) is based on Version 8 data. Models through 2014 (NO2) or 2015 (PM2.5) were added in Rev 11. Models for 2014-2016 were added for PM10 in Rev 14. See appendix for model performance statistics.

1.12.8 1.12.8 National Model Historical PM2.5 Predictions

The model for historical PM2.5 predictions relies on the MESA Air spatio-temporal modeling framework applied to annual average concentrations. See section 2.12.1 for a summary of these methods. Models were built using Version 6 covariates (DR0110). Temporal trends were back-casted using line extrapolation of trends estimated from 1999-2012 data. Historical prediction models were developed by Sun-Young Kim.17 See appendix for model performance statistics.

1.12.9 1.12.9 National Model for As, Cr, Cu, EC, Ni, NO3-, OC, S, Si, SO2, SO42- and V Predictions

Methods used for the national PM2.5 model (excluding the regional interactions) were applied to As, EC, OC, Si, S, Cr, Cu, Ni, V, SO2, nitrate, and sulfate data covering the one-year period between July 2009 and June 2010. This period was chosen to maximize the number of whole-year averages, which was necessary to accommodate analysis method changes that were implemented at STN monitors. Covariates and AQS data came from Rev 5 (DR0096) for all pollutants except Cr (year 2009, covariates from Rev 4, DR0076). Predictions were generated for WHI-OS locations by Silas Bergen. See appendix for model performance statistics.

1.12.10 1.12.10 Coarse PM Land Use Regression Predictions

Model predictions for the MESA Coarse PM Study were developed for PM10-2.5, Cu, Si, P, and Zn using monitoring data collected at Coarse Snapshot locations (see section 2.3.1) and covariates from Rev 2 (DR0056). Land use regression models were selected using a separate exhaustive search for each study area and pollutant for the model with the lowest RMSE. Variable selection for the candidate models is described in Zhang et al. (submitted). Briefly, models were designed to accommodate monitoring data from two rounds of sampling with interaction terms for season. The variable list was pared down from the available set described in section 2.10 to include those covered adequately by both the monitored locations and participant locations, then further reduced to 15 using LASSO. Kriging was not supported by the monitoring data and was not used. Predictions for locations lying more than 25 km from a monitoring location or with covariate values lying outside the range observed in the monitoring locations (with a 10% tolerance) were flagged. By default, these predictions are excluded from datasets provided to analysts. Note that this means that ‘missing’ predictions will vary by pollutant. See appendix for model performance statistics.
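The flagging rule above can be sketched as follows. This is an illustrative reading only: the 10% tolerance is assumed here to mean that the observed covariate range is widened by 10% of its width on each side, and the function name, argument names, and data layout are hypothetical.

```python
def flag_prediction(dist_to_monitor_km, covariates, monitor_ranges,
                    max_dist_km=25.0, tolerance=0.10):
    """Return True if a prediction should be flagged as out of range.

    covariates:     {name: value} at the prediction location
    monitor_ranges: {name: (min, max)} observed at monitoring locations
    The interpretation of `tolerance` as a fraction of the range width
    is an assumption, not a documented rule.
    """
    if dist_to_monitor_km > max_dist_km:
        return True
    for name, value in covariates.items():
        low, high = monitor_ranges[name]
        margin = tolerance * (high - low)
        if value < low - margin or value > high + margin:
            return True
    return False

# Hypothetical covariate range observed across the monitoring locations.
ranges = {"ndvi": (0.0, 1.0)}

near_in_range = flag_prediction(5.0, {"ndvi": 1.05}, ranges)   # within tolerance
near_out_range = flag_prediction(5.0, {"ndvi": 1.2}, ranges)   # beyond tolerance
too_far = flag_prediction(30.0, {"ndvi": 0.5}, ranges)         # beyond 25 km
```

Since each pollutant's model uses its own covariates, the same location can be flagged for one pollutant and not another, which is why the set of ‘missing’ predictions varies by pollutant.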

1.12.11 1.12.11 Alternative National NO2

NO2 predictions are available nationally at census block centroids from the national NO2 model developed by Novotny et al.18 Briefly, this is a land use regression model that depends primarily on satellite data for column NO2 and impervious surface.

1.12.12 1.12.12 Individual-level Exposures to Ambient PM2.5

Ambient source exposures incorporate the likelihood-based PM2.5 predictions, predicted infiltration fractions, and time-location questionnaire data. The fraction of PM2.5 that infiltrates indoors (Finf) is impacted by temperature (as temperature affects behavior such as window opening and heat and air conditioning use). The temperature used when calculating Finf is based on the two-week average temperature observed at a central monitor in each study area. A ‘warm season’ model is applied when the average temperature exceeds 18 degrees Celsius, and the “summer” time-location patterns are used to calculate the percent of time spent indoors during a typical week. The infiltration model was developed by Ryan Allen at Simon Fraser University19. This is a regression model based on MESA Air monitoring data and Air Questionnaire information. Each two-week PM2.5 prediction is multiplied by the infiltration fraction and the percent of time spent indoors and this is added to the PM2.5 prediction multiplied by the percent of time spent outdoors. These two-week individual-level exposures can be aggregated up to any time scale desired.
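The arithmetic in the preceding paragraph can be written as a short sketch. The function and argument names are illustrative assumptions; time fractions are expressed on a 0-1 scale rather than as percentages, and the infiltration fraction is taken as an input rather than estimated.

```python
WARM_SEASON_THRESHOLD_C = 18.0  # from the text: warm-season model above 18 degrees C

def is_warm_season(two_week_mean_temp_c):
    """True when the two-week average temperature exceeds the threshold."""
    return two_week_mean_temp_c > WARM_SEASON_THRESHOLD_C

def individual_exposure(ambient_pm25, f_inf, frac_indoors, frac_outdoors):
    """Two-week individual-level exposure to ambient-source PM2.5.

    ambient_pm25:  two-week outdoor PM2.5 prediction
    f_inf:         predicted infiltration fraction (Finf)
    frac_indoors:  fraction of time spent indoors (0-1)
    frac_outdoors: fraction of time spent outdoors (0-1)
    """
    indoor_part = ambient_pm25 * f_inf * frac_indoors
    outdoor_part = ambient_pm25 * frac_outdoors
    return indoor_part + outdoor_part

# Example: outdoor prediction of 10, Finf of 0.5, 80% of time indoors.
exposure = individual_exposure(10.0, 0.5, 0.8, 0.2)
```

In this example the indoor term contributes 4 and the outdoor term contributes 2, so the individual-level exposure is lower than the outdoor prediction because only part of the ambient PM2.5 infiltrates indoors.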

1.12.13 1.12.13 K-means

In order to assign exposure to multiple pollutants simultaneously, a clustering method was developed on the national scale. Twenty-two components of PM2.5 were used to define 7 component profiles, and predicted cluster membership was assigned based on geographic covariates and proximity to monitoring locations with cluster membership assigned based on measured component profile.20

1.12.14 1.12.14 ACT-Specific PM2.5 Likelihood Model Predictions

Predictions were obtained using methods similar to those utilized in MESA-Air (described briefly in Section 1.12.1 above) following the operational details of the hierarchical spatiotemporal model in Keller et al. The model was developed using geographical covariates and PM2.5 monitoring data from government monitors and supplemental monitoring campaigns in the Puget Sound, WA region. Government monitoring data include federal reference method (FRM) data from the EPA’s Air Quality System (AQS) database, tapered element oscillating microbalance (TEOM) monitors, a federal equivalency method (FEM), and nephelometer monitoring data from the Puget Sound Clean Air Agency (PSCAA). The historical nephelometer data were calibrated with co-located FRM data, available at certain non-industrial nephelometer sites starting around 1999 when FRM monitors became available. Additional monitoring campaigns include the PANEL study, DEEDS study, and remote air data (RAD) monitoring campaign. After covariate pre-processing based on the recommendations in Keller et al., core model fitting was performed using SpatioTemporal, an R package available on CRAN. Predictions cover the period 02/08/1978 – 03/27/2019. However, due to limited monitoring data in the 1980s and 1990s, the quality of model predictions may vary over time, and predictions for the 1980s and 1990s are less reliable. See the appendix for a summary of source monitors, modeling areas and model performance.

1.12.15 1.12.15 SPIROMICS Indoor Exposure Modeling Predictions: Nicotine, NO2, NOX, PM2.5

Predictions from indoor exposure models are available for 1,579 SPIROMICS participants in seven cities. The predictions are available for two-week periods generated at one-week intervals. The availability of predictions depends on the availability of participants’ answers to the Home Information Questionnaire (HIQ) for the address(es) at which they reside, their smoking habits as reported in the Respiratory Disease and Smoke Exposure Questionnaire (RDS), and availability of outdoor pollutant estimates at their location.

All of the models contain census block group-level predictors from the 2010 census, namely median family and household income, median housing value, rates of high school attainment, and rates of owner-occupied housing units. However, this data was not available for all block groups. In cases of missing data for any of these from 2010 census data, values compiled from ACS 5-year estimates (2009-2013) were used instead where possible. Dollar amounts from these were adjusted to 2010 levels using consumer price index data from the Bureau of Labor Statistics.
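A minimal sketch of the fallback and inflation adjustment described above is shown below. The function names are hypothetical, and the index values in the example are placeholders chosen for easy arithmetic, not actual Bureau of Labor Statistics CPI figures.

```python
def adjust_to_2010_dollars(amount, cpi_source_year, cpi_2010):
    """Rescale a dollar amount from its source year to 2010 price levels."""
    return amount * cpi_2010 / cpi_source_year

def block_group_dollar_value(census_2010_value, acs_value,
                             cpi_acs_year, cpi_2010):
    """Pick the 2010 census value if present, else the adjusted ACS value.

    Returns None when neither source has a value for the block group.
    """
    if census_2010_value is not None:
        return census_2010_value
    if acs_value is None:
        return None
    return adjust_to_2010_dollars(acs_value, cpi_acs_year, cpi_2010)

# Placeholder index values (NOT real CPI data): 100 for 2010, 110 for the ACS year.
from_census = block_group_dollar_value(48000.0, 55000.0, 110.0, 100.0)
from_acs = block_group_dollar_value(None, 55000.0, 110.0, 100.0)
no_data = block_group_dollar_value(None, None, 110.0, 100.0)
```

Only dollar-denominated predictors (income and housing value) would need this adjustment; rate variables such as high school attainment can be substituted directly.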

The model for PM2.5 includes building age (an item on the HIQ) as a predictor. At the time the HIQ was administered, 139 of the 1,579 participants for whom predictions are available had not reported this information. In order to generate predictions for these cases, building ages were obtained from various commercial and municipal data sources, primarily Zillow and PropertyShark, for 141 locations. More information, including references, is provided in Appendix L.

1.13 1.13. Meteorological Data

Meteorological (“met”) data (temperature, humidity, wind speed, wind direction, ceiling height, dew point, sea level pressure, station level pressure, and visibility) are downloaded from the National Oceanic and Atmospheric Administration’s (NOAA) National Climatic Data Center (NCDC) at http://www7.ncdc.noaa.gov/CDO/dataproduct, as are the locations of the weather stations. Meteorological data are available nationally for the years 1979 through September 2019. Weather stations are identified by their Weather Bureau Army Navy (WBAN) and Cooperative Station (COOP) identification numbers. Meteorological data can be provided from the nearest met monitor or from a single station in each MESA Air community (including Riverside, CA; Rockland County, NY; and the metropolitan areas of the six cities). Because the data outside the MESA Air study areas have not been thoroughly reviewed for completeness and data quality issues, we recommend that “nearest met monitor” data be used with caution. For health analysis requests, we will provide the “community met monitor” data as the default.

These data are available at daily time resolution. For a subset of timepoints, data are reported at multiple altitudes at the same location. These data are all considered “surface” level data by NOAA. Whenever multiple recordings from the same location are available during the same hour, recordings are ranked by altitude at which they were recorded, and a single datapoint is selected and maintained in the EAC database. The “ground level” recording is retained if available. The next choice is a “sea level” recording, which is often a conversion of the ground level recording to sea level conditions. The last choice would be a recording taken at an unknown elevation, reported for aviation, but still considered “surface level”. Even with these rules, users should be aware that coverage in time is not perfect. No further cleaning has been performed, and no QA/QC rules have been set for meteorological data. Data are formatted and uploaded to the EAC database ‘as-is’.
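The ranking rule above can be sketched as follows. The level labels (`ground`, `sea_level`, `unknown`) and the record layout are assumptions made for illustration; they are not NOAA or EAC field names.

```python
# Lower rank is preferred: ground level, then sea level, then unknown elevation.
LEVEL_RANK = {"ground": 0, "sea_level": 1, "unknown": 2}

def select_recordings(recordings):
    """Keep one recording per (station, hour), preferring the best-ranked level.

    `recordings` is a list of dicts with keys "station", "hour", "level",
    and "value" (hypothetical layout).
    """
    best = {}
    for rec in recordings:
        key = (rec["station"], rec["hour"])
        current = best.get(key)
        if current is None or LEVEL_RANK[rec["level"]] < LEVEL_RANK[current["level"]]:
            best[key] = rec
    return best

recs = [
    {"station": "A", "hour": 12, "level": "unknown", "value": 21.0},
    {"station": "A", "hour": 12, "level": "ground", "value": 20.5},
    {"station": "A", "hour": 13, "level": "sea_level", "value": 22.0},
    {"station": "A", "hour": 13, "level": "unknown", "value": 22.4},
]
selected = select_recordings(recs)
```

Hours with no recording at any level simply do not appear in the result, which matches the caveat that coverage in time is not perfect.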

Temperature and dewpoint are reported in degrees Celsius. Relative humidity is reported as a percentage. Wind speed is reported in m/s. Wind direction is reported as an angle, with zero degrees representing a wind blowing from due north. The daily prevailing wind was calculated using vector averaging, with calm wind hours considered to be the 0 vector (no direction reported, speed reported as 0-1.5 m/s). Vectors with no direction and wind speeds above 1.5 m/s were considered missing data and dropped from the averaging. Visibility and ceiling height are reported in meters. Pressure is reported in hectopascals. Daily values for all meteorological components met daily completeness rules outlined in section 2.2.4.
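The vector averaging of hourly winds can be sketched as below. This is an illustration under stated assumptions rather than the EAC implementation: hours are given as (speed, direction) pairs with `None` for a missing direction, and the returned speed is the magnitude of the mean vector, which the text does not specify.

```python
import math

CALM_MAX_SPEED = 1.5  # m/s; no-direction hours at or below this count as calm

def daily_prevailing_wind(hours):
    """Vector-average hourly winds into a daily (speed, direction) pair.

    `hours` is a list of (speed_m_s, direction_deg_or_None). Direction is
    the angle the wind blows from, with 0 degrees meaning due north.
    Returns None if no usable hours remain.
    """
    sum_east = 0.0
    sum_north = 0.0
    n_used = 0
    for speed, direction in hours:
        if direction is None:
            if speed > CALM_MAX_SPEED:
                continue  # no direction but not calm: treated as missing
            east, north = 0.0, 0.0  # calm hour enters as the zero vector
        else:
            rad = math.radians(direction)
            east = speed * math.sin(rad)
            north = speed * math.cos(rad)
        sum_east += east
        sum_north += north
        n_used += 1
    if n_used == 0:
        return None
    mean_east = sum_east / n_used
    mean_north = sum_north / n_used
    mean_speed = math.hypot(mean_east, mean_north)
    if mean_speed == 0.0:
        return (0.0, None)  # all calm: no prevailing direction
    mean_dir = math.degrees(math.atan2(mean_east, mean_north)) % 360.0
    return (mean_speed, mean_dir)

# One easterly hour, one calm hour, and one hour dropped as missing.
result = daily_prevailing_wind([(4.0, 90.0), (0.5, None), (3.0, None)])
```

Averaging vectors rather than angles avoids the wrap-around problem at north: winds from 350 and 10 degrees average to roughly 0 degrees, not 180.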

Visibility was truncated at 10 miles due to a method change at most sites by the year 2001. Noon visibility is reported alongside noon relative humidity and noon weather conditions. Weather conditions are reported as a string of binary indicators for the presence of airborne dust, fog, and precipitation.

1.13.1 1.13.1 B-Spline Variables for Temperature, Humidity, and Calendar Time

For studies of the chronic effects of air pollution on outcomes which may show seasonal variability, an alternative to adjusting directly for temperature and humidity is to adjust for b-spline (basis spline) variables. Users should be aware that each spline generates a large number of variables. However, this allows for very fine and flexible control of seasonal variability. We typically distribute these as variables from b-splines with 6 degrees of freedom per year (i.e. 6 variables per year).