Project workflows – from data to manuscript

CMOR Lunch’n’Learn

11 August 2023

Ross Wilson

Data workflow

Ideally, we would…

  • Have a complete record of all analysis steps, from start to finish
  • Integrate analysis into reporting of results
  • Be able to share the entire project workflow with others
  • Easily update the pipeline when data/analysis changes
  • Maintain a common structure across projects

Solution – An integrated analysis workflow with R

  • Have a complete record of all analysis steps, from start to finish
    • _plan.R includes (in code) all of the steps in the analysis workflow
  • Integrate analysis into reporting of results
    • Quarto allows us to refer to the results of our analysis directly in the report/manuscript
  • Be able to share the entire project workflow with others
    • Git & GitHub for collaborative version control
  • Easily update the pipeline when data/analysis changes
    • targets tracks dependencies between analysis stages and re-runs steps as needed
  • Maintain a common structure across projects
    • cmor.tools brings all of this together under a common structure

The targets pipeline tool for R

  • We looked briefly at the targets package in an earlier Lunch’n’Learn (LnL) session
  • The key concept is an analytical pipeline: a computational workflow consisting of
    • targets – the individual tasks involved in the workflow (data import, cleaning, analysis, etc.)
    • methods – the code used to complete each task
    • dependencies – which targets depend on the results of which other targets
  • targets analyses the pipeline, runs the code for each target, and stores the results in the _targets/ folder in the project directory
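
As a rough illustration of that run-and-store cycle, the sketch below builds a tiny pipeline in a temporary folder, runs it, changes the raw data, and checks what targets considers out of date. The dataset, its columns x and y, and the use of read.csv() and lm() as the methods are assumptions made for this example only; they are not part of any CMOR project.

```r
library(targets)

# Work in a throwaway project folder so no real project is touched
project <- tempfile("targets-demo-")
dir.create(project)
old_wd <- setwd(project)

# A small made-up dataset (hypothetical columns x and y)
writeLines(c("x,y", "1,2.1", "2,3.9", "3,6.2"), "data.csv")

# Write the pipeline definition to _targets.R
tar_script({
  list(
    tar_target(file, "data.csv", format = "file"),
    tar_target(data, read.csv(file)),
    tar_target(model, lm(y ~ x, data = data))
  )
})

tar_make()                     # first run: builds all three targets
up_to_date <- tar_outdated()   # names of targets that still need running

# Append an observation: the file target and everything downstream
# of it should now be flagged as outdated
write("4,8.1", "data.csv", append = TRUE)
needs_rerun <- tar_outdated()

tar_make()                     # re-runs the outdated targets
n_obs <- nrow(tar_read(data))  # read a stored result back from _targets/

setwd(old_wd)
```

tar_outdated() and tar_read() are useful for checking the state of a pipeline interactively; in day-to-day use, tar_make() is usually the only call needed.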

The pipeline is described in _plan.R

_plan.R
targets <- list(
  tar_target(file, "data.csv", format = "file"),
  tar_target(data, get_data(file)),
  tar_target(model, fit_model(data)),
  tar_target(plot, plot_model(model, data))
)
  • This is a list (an R list object) of targets (specified with tar_target())
  • Each target specifies one step in the pipeline
    • identify the raw data file
    • import the data from the file into R
    • fit a statistical model to the data
    • create a plot showing the fitted model
  • The methods for each target are defined in the functions
    get_data(), fit_model(), and plot_model()
  • targets works out the dependencies automatically
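
The slides do not show the method functions themselves, so the following is only a sketch of what get_data(), fit_model(), and plot_model() might look like. The column names x and y, the choice of a linear regression, and the use of ggplot2 for the plot are all assumptions for illustration.

```r
# Hypothetical method functions for the pipeline in _plan.R.
# The columns `x` and `y` and the model form are illustrative assumptions.

# Import the raw data from the CSV file identified by the `file` target
get_data <- function(file) {
  read.csv(file)
}

# Fit a simple linear regression of y on x
fit_model <- function(data) {
  lm(y ~ x, data = data)
}

# Plot the observed data with the fitted regression line
# (uses ggplot2, which must be installed for this function to run)
plot_model <- function(model, data) {
  data$fitted <- predict(model, newdata = data)
  ggplot2::ggplot(data, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_point() +
    ggplot2::geom_line(ggplot2::aes(y = fitted))
}
```

Because each function takes the upstream results as arguments and returns its result as a value, targets can read the dependencies straight from the commands in _plan.R: fit_model(data) depends on the data target, and plot_model(model, data) depends on both model and data.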

Quarto

  • Quarto is a scientific and technical authoring and publishing system that allows us to mix text and executable R code
  • Quarto documents are plain text, but can be rendered to multiple output formats (HTML, PDF, Word, PowerPoint)
    • Even whole books or websites – the CMOR website is written with Quarto, as is this presentation
  • When the Quarto document is rendered, any R code will be run and the result (numeric values, a table, a figure, etc.) included in the resulting output document

Quarto in a targets pipeline

  • The tarchetypes package provides tar_quarto(), which allows Quarto documents to be used within a targets pipeline
  • tar_quarto(report, path = "report.qmd") defines a step that renders the source document "report.qmd" to a target named report
  • The source document should use tar_load() or tar_read() in an R code chunk to load dependency targets
    • targets will scan the source for these calls to know what the target dependencies are
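
A minimal sketch of how the pieces might fit together is given below. It writes a small report.qmd and a pipeline that ends with tar_quarto(), all in a temporary folder. The file contents, the example dataset, and the model are hypothetical, and the pipeline is only run if the Quarto CLI and the required R packages are found.

```r
library(targets)

# Throwaway project folder; all file names and contents are hypothetical
project <- tempfile("quarto-demo-")
dir.create(project)
old_wd <- setwd(project)

writeLines(c("x,y", "1,2.1", "2,3.9", "3,6.2"), "data.csv")

# Build the code-chunk fence programmatically to keep the example readable
fence <- strrep("`", 3)

# report.qmd: the chunk calls tar_load(model), so scanning the source
# shows that the report depends on the `model` target
writeLines(c(
  "---",
  "title: \"Example report\"",
  "---",
  "",
  paste0(fence, "{r}"),
  "library(targets)",
  "tar_load(model)",
  fence,
  "",
  "The estimated slope is `r round(coef(model)[[\"x\"]], 2)`."
), "report.qmd")

# Pipeline definition with the report as its final step
tar_script({
  library(tarchetypes)
  list(
    tar_target(file, "data.csv", format = "file"),
    tar_target(data, read.csv(file)),
    tar_target(model, lm(y ~ x, data = data)),
    tar_quarto(report, path = "report.qmd")
  )
})

# Rendering needs the Quarto CLI and several R packages, so only run
# the pipeline when all of them are available
needed <- c("tarchetypes", "quarto", "knitr", "rmarkdown")
can_render <- nzchar(Sys.which("quarto")) &&
  all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))
if (can_render) tar_make()

setwd(old_wd)
```

With this setup, a change to data.csv invalidates data, model, and then report, so the rendered document is rebuilt with the updated slope the next time tar_make() is run.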

cmor.tools

  • Our cmor.tools package (github.com/uo-cmor/cmor.tools) provides various tools to bring these ideas together and provide a common structure for managing data analysis projects
  • Key ideas implemented in cmor.tools:
    • A common project folder structure, to keep data, code, and output separate from one another
    • A targets pipeline to specify all of the steps in the workflow
    • Outputs (reports, manuscripts, etc.) written in Quarto and rendered as part of the pipeline
    • Version control with Git, and repository hosted on GitHub
    • A few other tools for consistent formatting of outputs, etc.
    • (Some analysis tools, not yet fully implemented)

cmor.tools

  • You can install cmor.tools from GitHub:
# install.packages("remotes")  # run this first if remotes is not already installed
remotes::install_github("uo-cmor/cmor.tools")