Data strategies for future us

Data strategies are part of a shared workflow strategy: how do we structure data, where do we store and back up data, how do we create metadata, and how do we follow the mantra of "keep the raw data raw," keeping raw data distinct from your analysis. Here we will discuss personal and team habits for data management and sharing: data strategies for future us.


Data strategies for Future Us

Data Strategies in the Cloud. Tips for managing your data and adjusting for the cloud, by Alexis Hunzinger, NASA GES DISC (2024)

Data strategies for Future Us for Cloud, by Andy Barrett, NASA National Snow and Ice Data Center (NSIDC) (2023)

Data strategies for Future Us for Metadata, contributed by Jessica Couture (2021)

Video recordings

Data strategies for Future Us, contributed by Ileana Fenwick, Openscapes Pathways to Open Science.

These slides were contributed by researchers from previous Champions cohorts who shared examples and stories from their work.

Data strategies; examples from NOAA Fisheries Alaska (Marine Mammal Lab Stock Assessment Report), contributed by Rod Towell, Nancy Young, Tony Orr, Brian Fadely, Erin Richmond

Data to Product Workflows, contributed by Emily Markowitz (video)

Data organization in spreadsheets

This publication by Broman & Woo, 2018 appears in the “Practical Data Science for Stats” collection in PeerJ & American Statistician.

It is a delightful read, from the first opening sentences:

“Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades…. Amid this debate, spreadsheets have continued to play a significant role in researchers’ workflows. The dangers of spreadsheets are real, however – so much so that the European Spreadsheet Risks Interest Group keeps a public archive of spreadsheet ‘horror stories’…”

Broman & Woo share practical tips to make spreadsheets less error-prone, easier for computers to process, and easier to share. And, something incredibly cool: it's the 3rd most-downloaded stats paper in the American Statistician, behind two papers about p-values (twitter thread).

Read their paper for strategies behind their basic principles:

  1. Be consistent
  2. Write dates like YYYY-MM-DD
  3. Don’t leave any cells empty
  4. Put just one thing in a cell
  5. Organize data as a rectangle (“Tidy data”)
  6. Create a data dictionary
  7. Don’t include calculations in the raw data files
  8. Don’t use font color or highlighting as data
  9. Choose good names for things
  10. Make backups
  11. Use data validation to avoid data entry errors
  12. Save the data in plain text files
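Several of these principles (ISO dates, no empty cells, plain-text files) can be checked programmatically when data leaves the spreadsheet. Here is a minimal sketch in Python, using only the standard library; the CSV contents and column names are invented for illustration:

```python
import csv
import io
from datetime import datetime

# Hypothetical plain-text export of a survey spreadsheet, following the
# principles above: YYYY-MM-DD dates, one value per cell, no empty cells.
raw = """date,site,species,count
2024-05-01,A,Zalophus californianus,12
2024-05-01,B,Zalophus californianus,7
2024-05-02,A,Eumetopias jubatus,3
"""

rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    # Principle 2: dates written like YYYY-MM-DD (raises ValueError otherwise)
    datetime.strptime(row["date"], "%Y-%m-%d")
    # Principle 3: don't leave any cells empty
    assert all(value.strip() for value in row.values()), f"empty cell in {row}"

print(f"{len(rows)} rows passed validation")
```

A check like this is cheap to run every time the file changes, which complements principle 11 (use data validation to avoid data entry errors).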

Good enough practices in scientific computing

This publication by Wilson et al. 2017 in PLoS Computational Biology follows a previous publication by Wilson et al. 2014: Best practices for scientific computing.

In terms of data management recommendations, they have two main themes:

  1. work towards ready-to-analyze data incrementally, documenting both the intermediate data and the process
  2. embrace the idea of “tidy data”, which can be a powerful accelerator for analysis

Read their paper for strategies behind their basic principles (Box 1):

  1. Save the raw data.
  2. Ensure that raw data are backed up in more than one location.
  3. Create the data you wish to see in the world.
  4. Create analysis-friendly data.
  5. Record all the steps used to process data.
  6. Anticipate the need to use multiple tables, & use a unique identifier for every record.
  7. Submit data to a reputable DOI-issuing repository so that others can access & cite.

The publication also covers:

  • Software: write, organize, and share scripts and programs used in an analysis.
  • Collaboration: make it easy for existing and new collaborators to understand & contribute to a project.
  • Project organization: organize the digital artifacts of a project to ease discovery & understanding.
  • Tracking changes: record how various components of your project change over time.
  • Manuscripts: write manuscripts in a way that leaves an audit trail & minimizes manual merging of conflicts.

Tidy data for efficiency, reproducibility, & collaboration

By Lowndes & Horst 2020, posted on the Openscapes blog: an illustrated series to tell a story about tidy data.

Tidy data has been mentioned in each of the above, as a way to organize data in spreadsheets, to prepare ready-to-analyze data, and for sharing with the FAIR principles.

So what is tidy data?

First, a preamble:

  • Raw data may not be stored in a tidy way
  • “Wrangling” data into tidy structure should be done programmatically as part of the analytical process - keep the raw data raw
  • Tidy data is a philosophy
  • There are existing tools to help

Remember that tidying data ("data wrangling") can take up to 50–80% of a data scientist's time (Lohr 2014, New York Times), so it's important to leverage these existing philosophies and tools.

When we talk about organizing data to help us work in an efficient, reproducible, and collaborative way, we are talking about TIDY DATA. We mean deliberately thinking about the shape and structure of data – something that might not seem super exciting but is truly game-changing.

So let’s talk about what tidy data is and why it is so empowering for your analytical life.

What is tidy data?

Tidy data is a way to describe data that’s organized with a particular structure – a rectangular structure, where each variable has its own column, and each observation has its own row (Wickham 2014).

This standard structure of tidy data led Hadley Wickham to describe it the way Leo Tolstoy describes families. Leo says “Happy families are all alike; every unhappy family is unhappy in its own way”. Similarly, Hadley says “tidy datasets are all alike, but every messy dataset is messy in its own way”.

Tidy data for more efficient data science

Tidy data allows you to be more efficient by using existing tools deliberately built to do the things you need to do, from subsetting portions of your data to plotting maps of your study area. Using existing tools saves you from building from scratch each time you work with a new dataset (which can be time-consuming and demoralizing). And luckily, there are a lot of tools specifically built to wrangle untidy data into tidy data (for example, in the tidyr package). By being more equipped to wrangle your data into a tidy format, you can get to your analyses faster to start answering the questions you’re asking.
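The tidyr package mentioned above is an R tool; as a rough analogue, here is a minimal wide-to-long reshape in plain Python. The dataset, column names, and values are invented for illustration:

```python
# An untidy "wide" table: the years are data, but they live in column headers.
wide = [
    {"site": "A", "2023": 10, "2024": 14},
    {"site": "B", "2023": 5, "2024": 8},
]

# Tidy form: each variable (site, year, count) has its own column,
# and each observation has its own row.
tidy = [
    {"site": row["site"], "year": year, "count": row[year]}
    for row in wide
    for year in ("2023", "2024")
]

for record in tidy:
    print(record)
```

In R this same reshape would be a single call to tidyr's pivot_longer; in Python, pandas offers melt for the same job. The point is the same either way: once data is tidy, downstream tools for subsetting, summarizing, and plotting all work without custom reshaping code.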

Tidy data for easier collaboration

Tidy data makes it easier to collaborate because our friends can use the same tools in a familiar way. Whether thinking about collaborators as current teammates, your future self, or future teammates, organizing and sharing data in a consistent and predictable way means less adjustment, time, and effort for all.

Tidy data for reproducibility and reuse

Tidy data also makes it easier to reproduce analyses because they are easier to understand, update, and reuse. By using tools together that all expect tidy data as inputs, you can build and iterate really powerful workflows. And, when you have additional data entries, it’s no problem to re-run your code!

Tidy data for the win!

Once you are empowered with tools to work with tidy data generally, it opens up a whole new world of datasets that feel more approachable because you can work using familiar tools. This transferable confidence and ability to collaborate might be the best thing about tidy data.

So for more efficient, reproducible, and collaborative analyses, make friends with tidy data!

Learning Resources