Many aspects of doing a PhD feel like being thrown into the ocean without any help or support. This is especially true of data analysis and coding. Unsurprisingly, as a PhD student you end up using your time and effort inefficiently. Sadly, research culture currently doesn’t appreciate, fund or support these aspects of science as much as would be required to solve these problems. One of the first steps to changing this culture is the training and education of PhD students and early career researchers. Taking a course on being reproducible and open can make you more productive and less stressed and, over time, teaching courses like these can help spread awareness of these issues and slowly improve research culture.
Becoming more reproducible and open in your research can save you so much time and help to make your data analysis for your PhD less stressful and quicker. Teaching it to others helps you to keep learning and to become better at it for your own work.
From random data wrangling to reproducible research
If you work with any type of quantitative data that needs to be programmatically processed, you might have asked yourself questions like “Where is that bit of code I wrote four hours ago?” or “What does this code I wrote six months ago mean and what was it supposed to do?” We definitely have! We both started using R early in our PhDs and mostly taught ourselves, often through online courses such as those offered by edX or Coursera. These are great resources for learning how to code in general, but they don’t really teach you how to incorporate coding into your daily work and your research projects.
For instance, when Bettina Lengger started her PhD work, she quickly found herself immersed in hundreds of files that worked for the moment, but when she opened them again after only a few weeks, she had no idea what she had originally been doing in them. The messy code and file structure took a lot of time to understand again, and many hours were wasted redoing analyses and updating or fixing scripts that had been written beforehand. Bettina often felt like the person in Figure 1, overwhelmed and frustrated. This messiness and disorder is not only wasteful from a time-efficiency point of view, but also less environmentally sustainable because of the extra energy used to redo things. She eventually got so annoyed and frustrated that she started looking into how to be more structured and efficient with her project files, analyses and data. If not for colleagues and supervisors, then at least for her own sanity and for the planet.
The problem was not only the inefficient use of time and energy; collaborating was difficult too. Not agreeing beforehand on a standardized file structure, among other things, with collaborators and labmates was negatively impacting her PhD research projects. For instance, sharing analysis scripts among labmates was tedious, understanding how others’ code worked or how to use it was difficult, and running the code on another computer was not always an easy task. But PhD life is busy and these types of tasks and challenges are not given high priority in science. Mostly it felt like organizing analyses in a reproducible way was a ‘bonus’, not a requirement.
This isn’t limited to Bettina’s personal experience either: Luke Johnston and many others have had similar experiences. Throughout most fields of science, there are huge increases in how much data are produced as data collection technologies rapidly advance. While this is great overall, it puts enormous strain on researchers because the skills and tools we need to work with this volume of data have not kept pace. To make things even worse, demand for openness and transparency in how science is done is growing, while funding agencies have not yet recognized the importance and value of these skills enough to spend money on hiring qualified personnel such as research software engineers or data scientists.
This has led to the current situation where it is largely not possible to determine whether a scientific study is reproducible or not. Reproducibility is when findings are repeated when the data and analysis code are the same, but the researcher is different, and it is a key component of verifying scientific findings. Replicability, often confused with reproducibility, is when findings are repeated when the study design and analysis code are the same, but the data and researcher are different. As shown in Figure 2, the reason scientific reproducibility is difficult to determine is that most studies do not provide enough detail to reproduce or replicate them, as almost no study provides the code or the data necessary to attempt to reproduce its findings. This lack of detail and transparency generally arises because researchers are unaware of reproducible research, have no training in it, or lack the incentives or funds to conduct it.
Training and education are critical in preparing us for the modern demands of science, and it is for these reasons that Bettina decided to attend (and eventually help develop and teach) a course specifically aimed at teaching reproducible research in R along with collaborative practices. The r-cubed (Reproducible Research in R) course is a three-day course for PhD students and postdocs working in any biomedical field who want to improve their literacy in data and coding (in R), to gain skills and knowledge on how to do modern and reproducible data analysis, and to get insights into the barriers to open and reproducible research. Specifically, the course teaches researchers how to use Git, GitHub and R Markdown, and how to wrangle and visualize data in R. The course is designed as an open educational resource, found at r-cubed.rostools.org, that instructors can use directly or modify for their own lessons and that learners can use independently or as a reference after participating in the workshop. As you can see in Figure 3, many participants liked the course and felt that they learned a lot from it.
After Bettina attended the r-cubed course, the next challenge was to translate her newly acquired skills to her daily work. While it felt intimidating and time consuming at first, focusing on reproducibility within a research project does not have to be difficult. We like to compare the time and energy invested in adopting reproducible practices to a chemical reaction that can be catalysed by an enzyme (Figure 4). Adopting reproducibility-friendly practices requires an initial commitment of time and effort (like the uncatalysed curve) that can be greatly reduced by taking the course (like an enzyme), but in both cases the later effort of doing research is actually lower.
The first step for Bettina after taking the course was to apply a standardized workflow to her flow analysis, which typically consists of several hundred samples. Each research project is its own self-contained R project, with its own file structure and standard R files that can be easily modified for the specific flow analysis. She loads in her raw data, merges them with metadata from a CSV file and saves them as a clean working data set that is then used for all analyses and plots. It now takes her about 30 minutes to go from processing the data to making the first graph, whereas before her data analysis could easily take half a workday.
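The load, merge and save steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Bettina’s actual script: it is written in Python rather than the R used in the course, and the column names (`sample_id`, `signal`, `strain`) and the example values are made up for the sketch.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_with_metadata(raw_rows, metadata_rows, key="sample_id"):
    """Attach the metadata columns to each raw row that shares the same key."""
    metadata_by_key = {row[key]: row for row in metadata_rows}
    merged = []
    for row in raw_rows:
        extra = metadata_by_key.get(row[key])
        if extra is None:
            # Fail loudly rather than silently dropping a sample.
            raise KeyError(f"No metadata for {key}={row[key]!r}")
        merged.append({**row, **extra})
    return merged


def to_csv_text(rows):
    """Serialize rows back to CSV text so the clean data set can be saved."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up data standing in for raw measurements and a metadata file.
RAW = "sample_id,signal\nS1,10.5\nS2,12.0\n"
METADATA = "sample_id,strain\nS1,control\nS2,treated\n"

clean = merge_with_metadata(read_rows(RAW), read_rows(METADATA))
clean_csv = to_csv_text(clean)
print(clean_csv)
```

Keeping the merge in one small function, and saving its result once, means every later analysis and plot starts from the same clean data set instead of repeating the cleaning steps.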
Other steps that she’s taken include getting her research group’s undergraduate and Master’s students to start using the collaboration workflow taught in the course. First, they all start projects by setting up the standard file structure for the group’s experiments and use GitHub to share their projects and collaborate. This common file structure makes it much easier for Bettina to understand and inspect the data that the students generate and how they processed it, which makes it easier to help them solve their problems and prevents delays to their research. This is just a single example of how planning to be reproducible can save time and effort, ultimately making a scientist’s life easier.
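As a rough sketch of what “setting up the standard file structure” can look like in practice, a small helper can create the same folders for every new project. The folder names below are hypothetical, since the group’s actual layout is not described here, and the sketch uses Python rather than the R tooling taught in the course.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names; the group's real layout is not given in the text.
STANDARD_FOLDERS = ("data-raw", "data", "R", "doc")


def create_project(root, name, folders=STANDARD_FOLDERS):
    """Create a project folder with the same subfolders every time."""
    project = Path(root) / name
    for folder in folders:
        # exist_ok=True makes the helper safe to run again on an existing project.
        (project / folder).mkdir(parents=True, exist_ok=True)
    return project


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(tmp, "example-project")
    created = sorted(p.name for p in project.iterdir())
    print(created)
```

Because every project is created by the same helper, anyone in the group knows where to look for raw data, processed data and scripts without asking.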
From learning to teaching: spreading reproducibility to the world
Since the first run of the course, self-described ‘beginner coders’ have been included as instructors and helpers. At the end of each run, lead instructor Luke asks if anyone wants to be involved in teaching or helping out next time. After Bettina had taken the course and gained so much from it, she wanted to get involved too, so she volunteered and ended up teaching the course several times.
While the idea of having instructors who are also relatively inexperienced in the course material may seem counter-intuitive, there are many advantages to mixing experienced and novice coders and teachers. For instance, novice coders are better able than experienced coders to empathize with the challenges and struggles that course participants are likely to encounter, since they have probably gone through similar struggles recently themselves. From the participants’ perspective, seeing non-experts use what they are being taught helps dispel any feelings of ‘I’m not smart/knowledgeable/skilled enough’ and also gives them potential role models as motivation and inspiration for continuing to use these skills. They see that ‘normal’ people can learn and use these things too, something that they might not feel if only experts were teaching.
This specific aspect of teaching also shows up in the feedback we get as a major positive component of the course. Participants frequently write in the feedback surveys that being able to see the instructors make mistakes, mistype things or encounter errors and then fix them makes the content less intimidating and more grounded. It normalizes the fact that nobody who codes is perfect, that everyone makes mistakes, and that these challenges are simply part of learning, writing code, and doing data analysis.
So, why would a beginner want to be an instructor? Not only do you get some experience teaching and get to teach potential future collaborators, it also pushes you to stay updated and actually use the skills you teach in your own work too. After all, teaching is one of the best ways to learn; it is a highly efficient way of getting into the mindset of coding, of learning and of practicing what you preach.
As an instructor, it’s also great to see how excited participants are to learn more efficient and productive ways of working, but also how they have applied these skills to their work and workflows. We’ve delivered this course multiple times and often, many months afterwards, we get feedback from past participants on how they have improved in their data analysis and overall work, and how much they love using the tools and skills that we taught. Some have even convinced their entire research group to implement some of the practices and to put their code on GitHub. So while our primary learning outcome for the course is to have the participants use what they learned, hearing how they’ve taken those skills and spread their learning of better practices to their colleagues is a big bonus.
Changing research culture takes time and effort, driven largely by education and (re-)training. We are slowly changing the culture around reproducibility and open science, one course at a time, and not only do we teach it, we also live it. Our course material can be found online (r-cubed.rostools.org) under an open Creative Commons Attribution license, meaning anyone can take and modify the material for their own purposes as long as the original source is referenced. It is available for anyone to study and use for learning or for teaching. With this material and the courses we teach, we hope that science will become more open, reproducible and collaborative.
Paper describing the r-cubed course: Johnston et al. (2021). r-cubed: Guiding the overwhelmed scientist from random wrangling to Reproducible Research in R. Journal of Open Source Education, 4(44), 122. https://doi.org/10.21105/jose.00122. See the course material on the r-cubed website: r-cubed.rostools.org.
Open Science Training Handbook chapter on Reproducible Research and Data Analysis.
The Turing Way’s Guide for Reproducible Research.
We thank all the instructors and helpers of the r-cubed course for helping make it as good as it can be, as well as the participants whose feedback helped substantially improve the course over the years.
Bettina Lengger is a third-year PhD student working at the Novo Nordisk Center for Biosustainability in Denmark. Her research interests lie within synthetic biology and metabolic engineering in yeast. In her PhD project, she is exploring biosensing with GPCR receptors in yeast. She teaches data wrangling in the r-cubed course for the Danish Diabetes Academy.
Luke W. Johnston is a diabetes epidemiologist working at Steno Diabetes Center Aarhus in Denmark. He teaches researchers how to use R to do reproducible data analysis. His other projects include creating R packages for statistical methods and for making it easier for researchers to implement reproducible and open science practices in their work.