Coding strategies for future us
Coding strategies can blend with workflow strategies, and the idea is working in a way that is not just for you in this moment. Here we will discuss good coding practices for beginning and seasoned coders alike that make it easier to work with other people, times, and computers.
WTF: What they forgot to teach you about R
Most of this advice comes directly from Jenny Bryan & Jim Hester’s awesome course What They Forgot to Teach You About R. We highly recommend reading Chapters 1-4, which go into much more detail than we cover here. The advice here is solid coding practice for any language, with examples from R.
Workflow versus product
There is a distinction between things you do because of personal taste & habits (“workflow”) and the logic and output that are the essence of your project (“product”).
Workflow:
- Editor you use to write code.
- Name of your home directory.
- R code you ran before lunch.
Clearly product:
- Raw data.
- R code (a script or notebook) that someone needs to run on your raw data to get your results, including the explicit library() calls to load necessary packages.
Ideally, you don’t hardwire anything about your workflow into your product.
Source files
What are they and why?
Code that creates objects is “source code”. Source code lives in plain text files that you edit in a text editor and then execute in the console.
Examples:
- .R, .Rmd
- .py
- .m
Save the source, not the workspace
Save the source code; do not save the R object itself.
Save your commands as “scripts” (.R, .py) or “notebooks” (.Rmd, .ipynb). It doesn’t have to be polished. Just save it!
Everything that really matters should be achieved through code that you save, including objects and figures. Contrast this with storing them, implicitly or explicitly, as part of an entire workspace, or saving them by clicking with the mouse.
Load libraries/packages at the top. Just like a recipe: tell us the ingredients needed before we get going!
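As a minimal sketch of these habits, the script below declares its packages at the top and then writes both an object and a figure to disk through code. The folder and file names are hypothetical, and it writes under tempdir() only so the sketch is safe to run anywhere; a real project would write into its own folders.

```r
# --- Ingredients: every package the script needs, declared at the top ---
library(stats)      # lm(), coef()
library(graphics)   # plot(), abline()
library(grDevices)  # pdf(), dev.off()
library(utils)      # write.csv()

# Hypothetical output folder (assumption: a temporary location for this demo).
out_dir <- file.path(tempdir(), "results")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# --- Analysis: can be re-run from scratch at any time ---
fit <- lm(dist ~ speed, data = datasets::cars)  # `cars` ships with R
coef_table <- data.frame(
  term = names(coef(fit)),
  estimate = unname(coef(fit)),
  stringsAsFactors = FALSE
)

# --- Outputs: created explicitly by code, not via the workspace or the mouse ---
csv_file <- file.path(out_dir, "speed-model-coefficients.csv")
write.csv(coef_table, csv_file, row.names = FALSE)

pdf_file <- file.path(out_dir, "speed-vs-distance.pdf")
pdf(pdf_file, width = 6, height = 4)
plot(dist ~ speed, data = datasets::cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit)
dev.off()
```

Because the table and the figure are regenerated whenever the script runs, neither needs to be kept in a saved workspace.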
Always start R with a blank slate
Saving code is an absolute requirement for reproducibility.
When you quit, do not save the workspace to an .RData file. When you launch, do not reload the workspace from an .RData file.
In RStudio, set this via Tools > Global Options.
Restart R often during development
“Have you tried turning it off and then on again?” – timeless troubleshooting wisdom, applies to everything
If you use RStudio, use the menu item Session > Restart R
Additional ways to restart development where you left off, i.e. “re-run all the code up to HERE”
Avoid rm(list = ls())
It’s common to see scripts begin with this object-nuking command: rm(list = ls())
This is highly suggestive of a non-reproducible workflow.
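The problem with rm(list = ls()) is that, given the intent, it does not go far enough.
It only deletes user-created objects from the global workspace. Packages that have been attached stay attached, options that have been changed stay changed, and the working directory stays where it was.
Instead, restart R with a clean slate OFTEN (e.g. many times per day), and write every script assuming it will be run in a fresh R process.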
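To see why rm(list = ls()) falls short, here is a small sketch you can run in a throwaway R session. The object name my_data is made up for illustration; tools is a package that ships with R.

```r
library(tools)                   # attach a package
options(digits = 3)              # change a global option
my_data <- c(1.23456, 2.34567)   # create a user object (hypothetical name)

rm(list = ls())                  # the "object-nuking" command

# Record what is left behind after the purge.
object_survived  <- exists("my_data")              # the user object is gone
package_survived <- "package:tools" %in% search()  # the package is still attached
digits_after     <- getOption("digits")            # the option is still changed

options(digits = 7)              # put the option back to R's default by hand
```

Only the object disappears. A script that runs after this one still inherits the attached package and the changed option, which is exactly the hidden state that restarting R clears.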
Filepaths
Every saved thing gets a unique path.
Your code needs to run from somewhere specific. And when it interacts with other things (data or other code), you need to tell your code where things are.
The more deliberate you are about where things live,
- The easier it will be for you and future you
- The easier it will be for other people
- The easier it will be on another computer
setwd("path/that/only/works/on/my/machine")
The chance of setwd() having the desired effect (making the file paths work) for anyone besides its author is 0%.
It’s also unlikely to work for the author one or two years or computers from now.
Hard-wired, absolute paths, especially when sprinkled throughout the code, make a project brittle. Such code does not travel well across time or space.
setwd()
BUT, if you still decide to use setwd() in your scripts, you should at least be very disciplined about it:
- Only use setwd() at the very start of a file, i.e. in an obvious and predictable place.
- Always set the working directory to the same thing, namely the top level of the project. Always build subsequent paths relative to that.
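A sketch of that discipline, assuming a hypothetical project folder called projectA (created under tempdir() here so the example is safe to run): one setwd() call at the very top, and every later path relative to the project's top level.

```r
# Hypothetical project folder, created in a temporary location for this demo.
project_root <- file.path(tempdir(), "projectA")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

old_wd <- setwd(project_root)    # the ONLY setwd() call, at the very start

# Every subsequent path is relative to the top level of the project.
data_file <- file.path("data", "measurements.csv")
write.csv(
  data.frame(site = c("a", "b"), value = c(1.5, 2.5)),
  data_file,
  row.names = FALSE
)
measurements <- read.csv(data_file)

setwd(old_wd)                    # restore the previous directory (demo only)
```

If the project moves, only the single line at the top has to change; data_file and every other relative path stay the same.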
R users: use the here package
here() identifies your project’s files, based on the current working directory at the time when the package is loaded. See https://here.r-lib.org/.
library(here)
here()
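Beyond printing the project root, here() can also build paths from it. The sketch below assumes a hypothetical data/raw/measurements.csv inside the project. The fallback branch exists only so the example still runs where the here package is not installed.

```r
if (requireNamespace("here", quietly = TRUE)) {
  # here::here() joins its arguments onto the project root it detected.
  data_file <- here::here("data", "raw", "measurements.csv")
} else {
  # Fallback for this sketch only: build the same path from the working directory.
  data_file <- file.path(getwd(), "data", "raw", "measurements.csv")
}

data_file  # a full path ending in data/raw/measurements.csv
```

The path is assembled from the project root rather than typed out in full, so the same line works on any machine where the project folder is intact.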
Project oriented workflows
Dilemma and Solution
Problem statement:
We want to work on project A with the working directory set to path/to/projectA (my data analysis) and on project B with the working directory set to path/to/projectB (my teaching material).
But we also want to keep code like setwd("path/to/projectA") out of our scripts.
Solution:
Use an IDE that supports a project-based workflow.
An integrated development environment (IDE) offers:
- a powerful, R-aware code editor
- many ways to send your code to a running R process
- other modern conveniences
And it eliminates:
- temptation to develop code directly in the Console (instead: save it in a .R file!).
- tension between development convenience and portability of the code.
Organize your work into projects
Here’s what I mean by “work in a project”:
- File system discipline: put all files related to a project in a designated folder.
- This applies to data, code, figures, notes, etc.
- Depending on project complexity, you might enforce further organization into subfolders.
- Working directory intentionality: when working on project A, make sure working directory is set to project A’s folder.
- Ideally, this is achieved via the development workflow and tooling, not by baking absolute paths into the code.
- File path discipline: all paths are relative — relative to the project’s folder.
Synergistic habits: you’ll get the biggest payoff if you practice all of them together.
Portability: the project can be moved around on your computer or onto other computers and will still “just work”. This is the only practical convention that creates reliable, polite behavior across different computers, users, and time. This convention is neither new, nor unique to R.
It’s like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behavior a little, in the name of public safety.
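The habits above can be sketched in a few lines of base R. The project name, subfolder names, and file names are all hypothetical, and the project is created under tempdir() so that running the sketch does not touch your real files.

```r
# File system discipline: one designated folder, with subfolders as needed.
project_root <- file.path(tempdir(), "projectB")
subfolders <- c("data", "code", "figures", "notes")
for (folder in subfolders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# File path discipline: paths are written relative to the project's folder.
script_path <- file.path("code", "01_clean-data.R")
figure_path <- file.path("figures", "fig-01_overview.pdf")

# Portability: if the project moves, only the root changes.
moved_root   <- file.path(tempdir(), "somewhere-else", "projectB")
old_location <- file.path(project_root, script_path)
new_location <- file.path(moved_root, script_path)

created <- sort(list.dirs(project_root, full.names = FALSE, recursive = FALSE))
created
```

The relative paths script_path and figure_path are identical before and after the move, which is what lets the project “just work” in its new location.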
RStudio Projects
The RStudio IDE has a notion of a (capital “P”) Project, which is a very effective implementation of (small “p”) projects.
Projects have an .Rproj file in the folder, which is used to store settings specific to that project. Use File > New Project … to get started.
- Allows for multiple Projects.
- No danger of crosstalk: each has its own R process, global workspace, and working directory.
- Same “unit” as a GitHub repo!
Tips for RStudio Projects
One suggestion for organizing:
Have a dedicated folder for your Projects.
- If you have One Main Place for Projects, then go there in Finder/File Explorer to launch any specific Project with its .Rproj file.
- Mine is called “~/github/”.
Switching Projects: RStudio knows about recent Projects.
Name files deliberately
Jenny Bryan’s 3 rules for Naming Things:
- machine readable
- human readable
- plays well with default ordering
Slides available from Speakerdeck. See also Jenny Bryan’s “Naming things” video (5 mins) from NormConf · Dec 4, 2022
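As an illustration of the three rules, the sketch below uses made-up file names that follow one possible pattern: an ISO 8601 date, then a site, then a variable, with "_" between fields and "-" within a field. The pattern is an assumption chosen for the example, not a prescribed standard.

```r
# Hypothetical file names: date_site_variable.csv
file_names <- c(
  "2022-12-04_site-b_temperature.csv",
  "2022-01-15_site-a_temperature.csv",
  "2022-06-30_site-a_salinity.csv"
)

# Plays well with default ordering: ISO dates sort chronologically as plain text.
sorted_names <- sort(file_names)

# Machine readable: drop the extension, then split each name on "_".
stems <- tools::file_path_sans_ext(sorted_names)
parts <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))

# Human readable: each field means something on its own.
metadata <- data.frame(
  date     = as.Date(parts[, 1]),
  site     = parts[, 2],
  variable = parts[, 3],
  stringsAsFactors = FALSE
)
metadata
```

Because the delimiters are used consistently, the metadata comes straight out of the file names without any manual editing.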
Software considerations for coding
The following advice is from Tiffany Timbers, UBC Data Science, Intro to Reticulate:
You will need these software tools:
- Programming language (R, Python)
- Code editor (RStudio IDE, Jupyter)
- Version control software (Git, GitHub/Bitbucket)
How to choose the “best” tool for the job:
- Reproducible and auditable
- Accurate
- Collaborative (and portable)
If you’re choosing between R, Python, and other modern languages, they will already be reproducible, auditable, and accurate. That leaves collaboration: what do your collaborators use? What do folks in your lab, research team, or field of study use? What is mentioned in the papers you read? There is increasing interoperability between languages (e.g. see reticulate to run Python code from R), so once you have some idea, it’s best to get started!
See also: Opinionated analysis development (Parker 2017). Tools like RStudio are already doing this to help you. Reserve your mental energy for the fun part of the analysis!
Further reading
- Tidyverse Skills for Data Science - Wright, Ellis, Hicks, & Peng
- Principles for data analysis workflows - Stoudt, Vasquez, and Martinez 2021, PLOS Computational Biology
- R Cookbook — JD Long
- Big Book of R - Oscar Baruffa
- R for Excel Users - Lowndes & Horst
- R/RStudio workflows with tidyverse, RMarkdown, and GitHub, using ecological data from LTER (update from OHI’s intro to data science)
- An introduction to Earth and Environmental Data Science - Abernathey
- Intro to Python, JupyterLab, Unix, Git, some packages & workflows
- Data analysis and visualization in Python for ecologists - Carpentries
- Setup recommends using Anaconda and Jupyter Notebooks
- Python for Data Analysis - Thompson