Section 2 Writing New Code
Summary
- Use R whenever possible.
- Use or modify existing tools whenever possible.
- Create an R package with a Github repository if you write new code that will be reused.
- Create a data request ticket for one-off code.
- Create a database update ticket when adding data to the database or modifying data in the database.
- Document, Document, Document!
2.1 General Guidelines
2.1.1 Programming Languages
EAC code should be written in R whenever feasible, except for simple shell scripts or SQL queries. Choosing a single language helps us to write, review, and maintain code as a team and simplifies hiring new staff. Even in cases where another language might provide better performance or have more features related to a particular project, having unfamiliar languages in our code base complicates review and maintenance. Occasionally, however, the benefits of using another language for a project may outweigh the costs. If you believe there is a compelling reason to use a different language for your project, consult the team.
2.1.2 Be Abstract
Just like we write code in a common language to make it maintainable by others, we should write reusable tools that can be implemented by others across different projects. While some code may seem inherently one-time use (e.g. filling a data request), it is likely made up of steps that will be re-used over and over. Be suspicious of every line of code in your one-off scripts, and ask if it could be wrapped into a reusable tool (or whether that tool already exists). If you find yourself with a script hundreds of lines long that you don’t think you’ll use more than once, you’re doing something wrong!
Coding processes this way has the benefits of saving time later when the code is reused for another project, improving accuracy by reducing the number of times a process is replicated (and therefore the number of opportunities for error), and making it easier to implement fixes that take effect everywhere the process is used.
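As a small illustration of this idea, the sketch below pulls a step that tends to recur in one-off scripts (averaging a daily value over a date window) into a single function. The function name mean_over_window, its arguments, and the toy data are hypothetical and are not part of any existing EAC package.

```r
# Hypothetical helper: average a daily value over an inclusive date window.
mean_over_window <- function(dates, values, start, end) {
  stopifnot(length(dates) == length(values))
  start <- as.Date(start)
  end <- as.Date(end)
  stopifnot(start <= end)
  in_window <- dates >= start & dates <= end
  if (!any(in_window)) {
    # No observations fall inside the window.
    return(NA_real_)
  }
  mean(values[in_window], na.rm = TRUE)
}

# Toy data, made up for illustration only.
daily <- data.frame(
  date = as.Date("2020-01-01") + 0:9,
  pm25 = c(8, 10, 12, 9, 11, 13, 7, 6, 14, 10)
)

# The same function can now be called from any request or package.
first_week <- mean_over_window(daily$date, daily$pm25, "2020-01-01", "2020-01-07")
```

Once a step like this lives in one function, a fix to the windowing logic takes effect everywhere the function is called, rather than in each script that copied it.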
In order of preference, you should use code from:
- An appropriate existing R package
- An internal R package you have modified (with a PR submitted to the maintainer)
- A new R package you develop (potentially open source)
- Custom written, one-off code
2.1.3 Open Sourcing
One consequence of writing code abstractly is that the tools we develop may be useful in all sorts of circumstances, even outside of our group. Code that could be useful outside of our lab should generally be developed as an open source R package in a public Github repository. This allows us to give back to the open source community from which we benefit, but it also benefits us because outside collaborators can help us maintain our code. However, sharing a tool publicly does not commit us to implementing features beyond our own needs or to solving other people’s problems.
Public R packages should be developed according to the standards of ROpenSci, and their authors should consider submitting in-scope packages to ROpenSci review. Packages that may be in-scope deal with managing the data life cycle, geospatial analysis, or statistical analysis. See here and here.
2.2 Organization and Types of Code
- R Packages: Tools that will be reused in different projects should generally be implemented as R Packages. R Packages are stored on the kaufman-lab Github organization (open-source repositories may be stored in another organization’s Github or on the primary maintainer’s Github).
- Tickets: One-off code that uses the packages above to distribute, update, or modify our stored data.
2.3 R Packages
See https://r-pkgs.org/ for modern documentation on how to develop an R package.
2.3.1 Documentation
Internal R packages should include (at a minimum) documentation for all exported functions (using Roxygen), a README, and for all but the simplest packages, a vignette explaining basic usage. For open source packages, follow ROpenSci’s standards for documentation.
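A minimal sketch of what Roxygen documentation for an exported function can look like is shown below. The function ppb_to_ppm is a hypothetical example, not an existing EAC function; the tags used (@param, @return, @examples, @export) are standard Roxygen tags.

```r
#' Convert a concentration from parts per billion to parts per million
#'
#' @param ppb Numeric vector of concentrations in parts per billion.
#'
#' @return A numeric vector the same length as `ppb`, in parts per million.
#'
#' @examples
#' ppb_to_ppm(c(40, 1000))
#'
#' @export
ppb_to_ppm <- function(ppb) {
  stopifnot(is.numeric(ppb))
  ppb / 1000
}
```

Inside a package, running devtools::document() turns comments like these into help files and updates the NAMESPACE file.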
2.3.2 Review
Internal packages should be reviewed by another team member according to the standards in this manual. Open source packages may be reviewed according to those same standards or else submitted for review by ROpenSci (or both).
2.3.3 Version Control
Packages should be tracked in Github repositories owned by the “kaufman-lab” organization (rather than your personal Github account). Repositories can be public or private (depending on whether the project is intended to be open-source). You should make the “data team” team and/or “modeling team” team maintainers of the repository so that others can contribute to your code.
2.4 Requests & Updates
Once you have diligently created a reusable R package for performing every task that is not unique to your project, you may need to write some one-off code to actually call those tools. In almost all cases, one-off code should be stored in a request or update. Even minor internal analyses and data pulls can and should be implemented as requests – just use a “test” data request (TR####).
Any work that changes or distributes the EAC’s code or database should have a ticket that records what is done, when, and by whom.
2.4.1 Types of Tickets
2.4.1.1 Data Requests
A task that requires extracting and distributing data is a data request. Data requests are named according to the type of data being requested:
- Data Requests (DR####) extract existing data and format/transform it according to the needs of internal or external analysts. Clients request data via an online REDCap form, which alerts the data team via email that a new data request needs to be created. Using the eactickets R package’s make_data_request function, we then create a directory for the request on mesa3 in /var/local/QUTE/eac_database/requests. This function also copies the client’s form submission to this directory, populates the request directory with template files for use in filling the request, and logs the request in the MESA database.
- Test Requests (TR####) are for internal analyses and data pulls. These serve as work spaces for data team members’ projects, but are still subject to requirements for documentation and tracking. These are also created using eactickets::make_data_request, but do not need to have a REDCap form submission.
- [DEPRECATED] Health Requests (HR####) distribute exposure estimates and other variables averaged over dates relevant to cohort studies managed by the EAC. NOTE: Health Requests are no longer differentiated from Data Requests when creating a new request. They are included here only to explain why old requests may be labeled “HR”.
- [DEPRECATED] Model Requests (MR####) are used to create and distribute new air pollution models. NOTE: Model Requests are no longer differentiated from Data Requests when creating a new request. They are included here only to explain why old requests may be labeled “MR”.
2.4.1.2 Database Updates (DU####)
A task that creates or changes data in our database or that changes the structure of the database is a database update. This includes simple updates such as generating new air pollution predictions as well as more substantial changes such as restructuring a table in the database.
A database update may be required to fulfill a data request. In this case, it is still tracked as its own ticket, and the relationship between the two tickets should be noted in the ticket’s documentation file.
Using the eactickets R package, we create a directory for the database update on mesa3 in /var/local/QUTE/eac_database/staging/tickets. The eactickets package populates this directory with template files, including template documentation, and logs the update description in the MESA database.
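The ticket prefixes described in this section can be summarized in a small lookup table. The sketch below is illustrative only: the helper ticket_type is hypothetical and is not part of eactickets, and it assumes an ID consists of a two-letter prefix followed only by digits.

```r
# Prefixes taken from the ticket types described in this section.
ticket_types <- c(
  DR = "Data Request",
  TR = "Test Request",
  HR = "Health Request (deprecated)",
  MR = "Model Request (deprecated)",
  DU = "Database Update"
)

# Hypothetical helper: look up the ticket type for IDs such as "DR0123".
# Assumes an ID is a two-letter prefix followed by digits only.
ticket_type <- function(id) {
  id <- toupper(id)
  ok <- grepl("^[A-Z]{2}[0-9]+$", id)
  # Unknown prefixes index to NA in the named vector.
  type <- unname(ticket_types[substr(id, 1, 2)])
  type[!ok] <- NA_character_
  type
}

ticket_type(c("DR0123", "tr001", "DU0042", "XX0001"))
```

IDs with an unknown prefix or an unexpected format come back as NA rather than raising an error, which makes the helper easy to apply to a whole column of old ticket names.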
2.4.2 Request Structure
Data requests are generated with boilerplate files and a standard directory structure. If you need to create extra directories or files, make sure to document what they are.
These files and directories are described below for an example request “TR001”:
tr001/
    tr001.Rmd                      # analyst documentation, rendered to HTML when the request is run
    tr001.pdf                      # a copy of the data request form submission (if one exists)
    tr001_qa.Rmd                   # R markdown file for use by QA analyst
    tr001.Rproj
    renv/
    code/
        tr001.R                    # main R script to generate deliverable data
        [01_do_stuff.R]            # optional additional code files (should be sourced by tr001.R)
        [02_do_more_stuff.R]
        [03_do_even_more_stuff.R]
    data/                          # raw data to be used in the request (if not sourced from database), or intermediate data sets that don't need to be distributed to the client
    deliverables/                  # data for client
        delivered_YYYYMMDD.zip     # archive of actual delivered data
In the above example, the (unarchived) data in deliverables should be reproduced by running code/tr001.R.
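The following sketch shows one way a main script could source numbered helper scripts in order and write its result to deliverables. It builds a throwaway request in a temporary directory so that it runs on its own. The helper script names, their contents, and the output file name tr001_data.csv are hypothetical; the actual request template may organize this differently.

```r
# Build a throwaway request in a temporary directory so the sketch is
# self-contained; a real request is created from the template instead.
request_dir <- file.path(tempdir(), "tr001")
for (d in c("code", "data", "deliverables")) {
  dir.create(file.path(request_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Two hypothetical helper scripts, written here only for illustration.
writeLines(
  "raw <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.0))",
  file.path(request_dir, "code", "01_load_data.R")
)
writeLines(
  "deliverable <- transform(raw, value_doubled = value * 2)",
  file.path(request_dir, "code", "02_transform.R")
)

# --- What the main script (code/tr001.R) might contain ---
# Source the numbered helper scripts in order, sharing one environment.
helpers <- sort(list.files(
  file.path(request_dir, "code"),
  pattern = "^[0-9]+_.*\\.R$",
  full.names = TRUE
))
env <- new.env()
for (f in helpers) {
  source(f, local = env)
}

# Write the result to deliverables/ so rerunning the script reproduces it.
out_file <- file.path(request_dir, "deliverables", "tr001_data.csv")
write.csv(env$deliverable, out_file, row.names = FALSE)
```

Numbering the helper files keeps the execution order explicit, and sourcing everything from the main script means a reviewer only needs to run one file to regenerate the deliverables.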
2.4.3 Documentation
All requests need to be documented in the Rmd file included in the request template. This file contains a template for you to explain the context for the code, who developed it, when, and to describe input and output data. In-code comments (within the R code itself) should be restricted to explaining how the code works and organizing it into sections. Quality checks should be run and documented in the QA Rmd file included in the request template.
2.4.4 Version Control
Proposed: For requests and updates, use git for version control. In general, however, these should not be hosted on Github. For more complicated requests and updates, commit as you go. A git repository will also be initialized automatically when the request is created, and its contents will be committed automatically when the request is closed.