Dean Marchiori | A retrospective on a year of commercial data science projects in R | RStudio
Transcript#
This transcript was generated automatically and may contain errors.
G'day, I'm Dean Marchiori. I'm a statistician from Sydney, Australia, and I'm interested in reproducible analysis and workflow choices in R. But in my day-to-day practice in industry as a commercial data scientist, is my work actually reproducible? To answer this, I conducted a reproducibility retrospective, or ReproRetro. I went back through the last year of my commercial data science projects, and rated each project across a number of dimensions that are valued in my practice.
You can pick aspects that suit your work, but the dimensions I picked cover whether an analysis was replicable, modular, auditable, automated, and collaborative. So what were the results? Out of 55 projects, I scored an average of 3.8 out of 5. This is good, but not great. So if I care so much about this, why wasn't my work rated higher? Here are the lessons I took from my ReproRetro.
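A retrospective like this can be kept as a simple table of scores. As a minimal sketch in base R, with entirely hypothetical project names and ratings (the real dimensions are the ones named above, but the values here are invented for illustration):

```r
# Hypothetical 1-5 ratings for a few projects across the five
# dimensions mentioned above; names and values are made up.
scores <- data.frame(
  project       = c("churn-model", "sales-report", "adhoc-query"),
  replicable    = c(5, 4, 2),
  modular       = c(4, 3, 2),
  auditable     = c(5, 4, 3),
  automated     = c(4, 2, 1),
  collaborative = c(4, 4, 3)
)

# Average rating per project, then an overall average
scores$avg <- rowMeans(scores[, -1])
round(mean(scores$avg), 1)
# -> 3.3
```

A plain spreadsheet works just as well; the point is recording a score per dimension so low-scoring projects stand out.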
Lesson one: not everything needs to be reproducible
Controversial opinion number one, not everything needs to be reproducible. While all projects benefited from some aspect of reproducibility, the most important consideration for me was, how do you tell what needs to be reproducible?
Enter my Fiji test. If you have a task that's sufficiently important, you may want to consider the Fiji test. Aussies like Fiji because it's only a few hours away, but if you're lying on a beach on holiday in Fiji, spare a thought for your colleagues back in the office. If they need to rerun some of your analysis, will they be able to find it, understand your work, run it without too much fuss, and make reasonable enhancements or small changes? The last thing you want while you're sipping your pina colada is to be disturbed by a weird phone call about some model that you trained.
When I looked at the types of analytics tasks I completed, not everything needed to pass the Fiji test. A lot of work in industry is ad hoc and quite simple. In a commercial or high-pressure environment, I think it's perfectly reasonable to have different standards for workflow choices depending on the type of work and the return on investment you need, so long as a reasonable standard is maintained and your team is consistent with these choices. This may mean getting better at anticipating the full scope of a project before rushing into solving a problem, which can be a challenge. But it also gives you the freedom to pick a more lightweight and flexible workflow when the job allows it.
Picking the right tools for the job
My next lesson was picking the right tools for the job. I'd been searching for the biggest and best workflow choice, but I found it paid to have options. When I examined my tooling and workflow options, the results of my ReproRetro varied. Using a basic directory structure template to organize my projects was okay all round, but quickly got cumbersome. Notebooks like R Markdown were nicely replicable, but tended to become a dumping ground for code. drake, a more formal workflow system, carried a bit more overhead and took more admin to get used to, but overall it enforced much better standards, particularly around writing functional and modular code.
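To make the drake comparison concrete, a minimal drake pipeline looks something like the sketch below. It uses drake's real `drake_plan()`, `file_in()`, and `make()` functions, but the file name, columns, and model are hypothetical stand-ins:

```r
library(drake)  # install.packages("drake")

# A plan declares targets as named expressions; drake works out the
# dependency graph and only rebuilds targets whose inputs changed.
plan <- drake_plan(
  raw    = read.csv(file_in("sales.csv")),     # hypothetical input file
  fit    = lm(revenue ~ spend, data = raw),    # hypothetical model
  report = summary(fit)
)

make(plan)  # builds raw, then fit, then report, caching each result
```

Because each target is a named expression, this style pushes you toward small, pure functions rather than one long script, which is exactly the "functional and modular code" benefit mentioned above. (drake has since been superseded by the targets package, which follows the same idea.)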
While any system can be made arbitrarily sophisticated and do a good job, I found it more effective to have a few different workflow choices available rather than try a one size fits all approach. All workflow choices are naturally optimized for certain applications, and I found it better to only take on additional complexity when the job really needed it.
Replication vs. reproduction
Next, I found that in my work I had a tendency to mistake replication for reproduction, and the same goes for automation. Getting your code to rerun is only part of the solution, so consider the factors that matter in your practice, be deliberate about what you optimize for, and use that to guide your workflow choice. And how do you do that? Well, do a ReproRetro. The process wasn't and isn't intended to be scientific or unbiased. The true value of a ReproRetro is in the self-reflection and improvement. So go and dig through your old code and think about how you might do things better.
If you want to learn more, you can reach out to me or visit my AnalysisFlow GitHub repo. I'd also like to acknowledge and thank my collaborator Myles McBain for many interesting discussions in this space. Thank you and good luck.
