Coding strategies for future us

Coding strategies blend with workflow strategies; the idea is to work in a way that serves more than just you in this moment. Here we will discuss good coding practices for beginning and seasoned coders alike that make it easier to work with other people, times, and computers.

Previous iterations of Coding strategies for future us: Filepaths and Project-oriented workflows

Expanding ouR community!, contributed by Dr. Chanté Davis

State of the Ecosystem Product Development Workflow, contributed by Kim Bastille


Software considerations for coding

The following advice is from Tiffany Timbers, UBC Data Science, Intro to Reticulate:

You will need these software tools:

  • Programming language (R, Python)
  • Code editor (RStudio IDE, Jupyter)
  • Version control software (git, GitHub/Bitbucket)

How to choose the “best” tool for the job:

  • Reproducible and auditable
  • Accurate
  • Collaborative (and portable)

If you’re choosing between R, Python, and other modern languages, they will already be reproducible, auditable, and accurate. That leaves collaboration: what do your collaborators use? What do folks in your lab or field use? What is mentioned in the papers you read? There is increasing interoperability between languages (e.g. see reticulate to run Python code from R), so once you have some idea, it’s best to get started!

See also: Opinionated analysis development (Parker 2017). Tools like RStudio are already doing this to help you. Reserve your mental energy for the fun part of the analysis!

WTF: What They Forgot to teach you

Most of this advice comes directly from Jenny Bryan & Jim Hester’s awesome course What They Forgot to Teach You About R. I highly recommend reading Chapters 1-4, which go into much more detail than the notes here. The advice here is solid coding practice for any language, with examples from R.

Workflow versus product

There is a distinction between things you do because of personal taste and habits (“workflow”) and the logic and output that is the essence of your project (“product”).

Workflow:

  • Editor you use to write code.
  • Name of your home directory.
  • R code you ran before lunch.

Clearly product:

  • Raw data.
  • R code someone needs to run on your raw data to get your results, including the explicit library() calls to load necessary packages (script, notebook).

Ideally, you don’t hardwire anything about your workflow into your product.

Source files

What are they and why?

Code that creates objects is “source code”. Source code lives in text files that you edit in a text editor and then execute in the console.

Examples:

  • .R, .Rmd
  • .py
  • .m

Save the source, not the workspace

Save the source code; do not save the R object itself.

Save your commands as “scripts” (.R, .py) or “notebooks” (.Rmd, .ipynb). They don’t have to be polished. Just save them!

Everything that really matters should be achieved through code that you save, including objects and figures. The contrast is storing objects implicitly or explicitly as part of an entire workspace, or producing figures by clicking with the mouse.

Load libraries/packages at the top. Just like a recipe: tell us the ingredients needed before we get going!
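
As a minimal sketch of these habits, the script below loads its packages at the top and then creates both a data object and a figure purely through saved code. The output folder (a temporary directory) and the file names are hypothetical, chosen only so the example runs anywhere; it uses the built-in cars dataset and packages that ship with R.

```r
# Ingredients first: every package the script needs is loaded at the top.
# These four ship with R and are attached by default; a real script would
# list its add-on packages here in the same way.
library(stats)     # lm(), coef()
library(graphics)  # plot(), abline()
library(grDevices) # pdf(), dev.off()
library(utils)     # write.csv()

# Hypothetical output folder; a temporary directory keeps the sketch portable.
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Create the object with code...
fit <- lm(dist ~ speed, data = cars)
coef_table <- data.frame(term = names(coef(fit)), estimate = unname(coef(fit)))

# ...and save it explicitly, rather than relying on a saved workspace.
csv_path <- file.path(out_dir, "speed-model_coefficients.csv")
write.csv(coef_table, csv_path, row.names = FALSE)

# Save the figure with code instead of exporting it with the mouse.
pdf_path <- file.path(out_dir, "speed-vs-dist.pdf")
pdf(pdf_path, width = 6, height = 4)
plot(dist ~ speed, data = cars)
abline(fit)
invisible(dev.off())
```

Because both outputs are written by the script, rerunning it in a fresh R session recreates them without any saved workspace.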

Always start R with a blank slate

Saving code is an absolute requirement for reproducibility.

When you quit, do not save the workspace to an .Rdata file. When you launch, do not reload the workspace from an .Rdata file.

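In RStudio, set this via Tools > Global Options: uncheck “Restore .RData into workspace at startup” and set “Save workspace to .RData on exit” to “Never”.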

Restart R often during development

“Have you tried turning it off and then on again?” – timeless troubleshooting wisdom, applies to everything

If you use RStudio, use the menu item Session > Restart R

There are additional ways to restart development where you left off, i.e. to “re-run all the code up to HERE”; in RStudio, for example, Code > Run Region > Run From Beginning To Line.

Avoid rm(list = ls())

It’s common to see scripts begin with this object-nuking command: rm(list = ls())

This is highly suggestive of a non-reproducible workflow.

The problem with rm(list = ls()) is that, given the intent, it does not go far enough.

It only deletes user-created objects from the global workspace. Attached packages, changed options, and the working directory all survive it.

Instead, restart R with a clean slate OFTEN (e.g. many times a day), and write every script assuming it will be run in a fresh R process.
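
To see why rm(list = ls()) falls short, the sketch below changes some session state, runs the command, and then checks what is left. The object name my_results is hypothetical, and the tools package is used only because it ships with R and is not attached by default.

```r
# State left behind by "earlier work" in the same R session.
my_results <- 1:10    # a user-created object
options(digits = 3)   # a changed global option
library(tools)        # an attached package

rm(list = ls())       # the object-nuking command

# Check what survived the cleanup.
object_survived  <- exists("my_results", inherits = FALSE)
option_survived  <- getOption("digits") == 3
package_survived <- "package:tools" %in% search()

print(c(object = object_survived,
        option = option_survived,
        package = package_survived))

# Put the option back to R's default of 7 so the sketch leaves no trace.
options(digits = 7)
```

Only the object is gone; the option and the attached package are still in effect, so a script that starts this way can still behave differently from one run in a fresh R process.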

Filepaths

Every saved thing gets a unique path.

Your code needs to run from somewhere specific. And when it interacts with other things (data or other code), you need to tell your code where things are.

The more deliberate you are about where things live,

  • The easier it will be for you and future you
  • The easier it will be for other people
  • The easier it will be on another computer

setwd("path/that/only/works/on/my/machine")

The chance of setwd() having the desired effect – making the file paths work – for anyone besides its author is 0%.

It’s also unlikely to work for the author one or two years or computers from now.

Hard-wired, absolute paths, especially when sprinkled throughout the code, make a project brittle. Such code does not travel well across time or space.

setwd()

BUT, if you still decide to use setwd() in your scripts, you should at least be very disciplined about it:

Only use setwd() at the very start of a file, i.e. in an obvious and predictable place.

Always set working directory to the same thing, namely to the top-level of the project. Always build subsequent paths relative to that.
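
If setwd() stays, the discipline described above might look like the following sketch. The project folder is a hypothetical one created under a temporary directory so that the example can run anywhere; in a real script the setwd() call would point at your own project's top level.

```r
# Stand-in for a real project: a hypothetical folder with a data/ subfolder.
project_root <- file.path(tempdir(), "projectA")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# The one and only setwd() call, at the very start, pointing at the top level.
old_wd <- setwd(project_root)

# Every later path is built relative to the project's top level.
cars_path <- file.path("data", "cars.csv")
write.csv(head(cars), cars_path, row.names = FALSE)
cars_subset <- read.csv(cars_path)

# Restore the previous working directory (only needed for this demo).
setwd(old_wd)
```

With this layout, moving the project means changing a single line rather than hunting for paths throughout the code.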

R users: use the here package

here() identifies your project’s files, based on the current working directory at the time when the package is loaded.

library(here)  # load the package; it locates the project root at this point
here()         # print the path to the project root
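
As a hedged illustration of how here() is typically used to build paths, the sketch below composes a path to a data file starting from the project root. It assumes the here package is installed (the code is skipped otherwise), and the folder and file names are hypothetical.

```r
# Hypothetical location of a data file, relative to the project root.
data_path <- NULL

if (requireNamespace("here", quietly = TRUE)) {
  # here::here() joins its arguments onto the project root, so the call does
  # not depend on which subfolder the script sits in, as long as here can
  # locate the project root.
  data_path <- here::here("data", "raw", "survey.csv")
  print(data_path)
}
```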

Project oriented workflows

Dilemma and Solution

Problem statement:

We want to work on project A with the working directory set to path/to/projectA (my data analysis) and on project B with the working directory set to path/to/projectB (my teaching material).

But we also want to keep code like setwd("path/to/projectA") out of our scripts.

Solution:

Use an IDE that supports a project-based workflow.

An integrated development environment (IDE) offers:

  • a powerful, R-aware code editor
  • many ways to send your code to a running R process
  • other modern conveniences

And it eliminates:

  • temptation to develop code directly in the Console. (Instead: save it in an .R file!)
  • tension between development convenience and portability of the code.

Organize your work into projects

Here’s what I mean by “work in a project”:

  • File system discipline: put all files related to a project in a designated folder.
    • This applies to data, code, figures, notes, etc.
    • Depending on project complexity, you might enforce further organization into subfolders.
  • Working directory intentionality: when working on project A, make sure working directory is set to project A’s folder.
    • Ideally, this is achieved via the development workflow and tooling, not by baking absolute paths into the code.
  • File path discipline: all paths are relative — relative to the project’s folder.
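
One way to picture this layout is the sketch below, which builds a small project skeleton. The folder names (data, code, figures, notes), the file name, and the temporary location are assumptions for illustration, not a required structure.

```r
# Hypothetical project folder; tempdir() is used only so the sketch runs anywhere.
project_dir <- file.path(tempdir(), "my-analysis")

# File system discipline: everything for the project lives under one folder.
subfolders <- c("data", "code", "figures", "notes")
for (folder in subfolders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# File path discipline: paths are written relative to the project folder...
relative_path <- file.path("figures", "fig-01_overview.pdf")

# ...and resolved against wherever the project currently lives.
full_path <- file.path(project_dir, relative_path)

print(list.dirs(project_dir, full.names = FALSE, recursive = FALSE))
```

Only project_dir changes when the project moves; relative_path stays the same on every computer.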

Synergistic habits: you’ll get the biggest payoff if you practice all of them together.

Portability: the project can be moved around on your computer or onto other computers and will still “just work”. This is the only practical convention that creates reliable, polite behavior across different computers, users, and time. This convention is neither new nor unique to R.

It’s like agreeing that we will all drive on the left or the right. A hallmark of civilization is following conventions that constrain your behavior a little, in the name of public safety.

RStudio Projects

The RStudio IDE has a notion of a (capital “P”) Project, which is a very effective implementation of (small “p”) projects.

Projects have an .Rproj file in the folder, which is used to store settings specific to that project. Use File > New Project … to get started.

RStudio allows multiple Projects to be open at once with no danger of crosstalk: each has its own R process, global workspace, and working directory.

Same “unit” as a GitHub repo!

Tips for RStudio Projects

One suggestion for organizing:

Have a dedicated folder for your Projects.

  • If you have One Main Place for Projects, then go there in Finder/File Explorer to launch any specific project with .Rproj.
  • Mine is called “~/github/”.

Switching Projects: RStudio knows about recent Projects.

Name files deliberately

Jenny Bryan’s 3 rules for Naming Things:

  • machine readable
  • human readable
  • plays well with default ordering
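
The three rules can be made concrete with a short sketch. The file names below are hypothetical, and the convention of "_" between fields and "-" within a field is one common choice rather than a requirement.

```r
# Plays well with default ordering: zero-pad numbers so they sort as intended.
# method = "radix" sorts byte-wise, independent of the session's locale.
unpadded <- c("1_import.R", "2_clean.R", "10_model.R")
padded   <- sprintf("%02d_%s.R", c(1L, 2L, 10L), c("import", "clean", "model"))

sorted_unpadded <- sort(unpadded, method = "radix")  # "10_model.R" jumps to the front
sorted_padded   <- sort(padded, method = "radix")    # stays in the intended order

# Machine readable: "_" separates fields and "-" separates words within a
# field, so the name can be split back into metadata.
file_name <- "2023-06-14_survey_site-a.csv"
fields <- strsplit(tools::file_path_sans_ext(file_name), "_", fixed = TRUE)[[1]]

# Human readable: each field says what the file contains.
metadata <- list(
  date    = as.Date(fields[1]),  # ISO 8601 dates (YYYY-MM-DD) also sort chronologically
  dataset = fields[2],
  site    = fields[3]
)
```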

Available from Speakerdeck

Further reading