Resources

Hannah Podzorski | Advocating for Automation: Adapting Current Tools in Environmental Science with R

video
Oct 24, 2022
14:49


Transcript

This transcript was generated automatically and may contain errors.

Hi everybody, my name is Hannah Podzorski, and today I want to talk to you about how I adapted current tools in environmental science through R by advocating for automation. And to start, I want to tell you a little bit about my colleague, let's call him John. I work with a lot of Johns.

John might be a biologist, a chemist, maybe a geologist. He's been in his field for quite some time, and he has a lot of very important technical knowledge within his field, but little to no knowledge on modern day programming languages. In addition to this, John keeps getting handed data sets that are much larger and more complex than what his current tools can handle. And on top of this, just like everybody else, he's got too many tasks and not enough time.

So this is the type of scenario where I would advocate for automation. My goal is to streamline the processing of data so my team has more time to apply their technical knowledge to the problem at hand.

But John might be a little bit hesitant, and I understand that, because as a programmer, I would much rather spend five days automating something than five minutes doing a simple task by hand. Because of that, today I'm going to talk not only about how I used R to automate workflows, but also about how I reduced the activation energy it took to get my team on board.

Reducing activation energy

So in chemistry, activation energy is the energy needed to get from one side of a chemical reaction to another. And you can reduce the activation energy by applying a catalyst. So when I want to move my team from their current workflow to an automated workflow, I can reduce the activation energy by rephrasing how I pitch automation, starting small, and being a little bit creative.

So let's start with the pitch, and most importantly, that word automation. It can be a little vague and field-dependent, but what I mean when I say automation is the streamlining of a workflow by reducing unnecessary human interaction, so my team has more time to spend on higher-value tasks. Most times when I pitch automation, I don't even use the word automation. I say let's streamline this workflow.


So let's talk about the difference between the workflow I traditionally see, which for the purposes of this talk I'll call a reactionary workflow, and an automated workflow. Say we have four work products: A, B, C, and D. A might be our database, B our tables of summary stats, C our figures, and D the text that goes into our final report. Now say we have to update our database. Maybe we know in advance that we'll need to, like when we go out and collect more data, or maybe we don't, like when someone finds an error in unit conversions at the last minute.

With that change, we now have to manually update all of the workflow downstream. In an automated workflow, by comparison, when you make that one change to your database, you can use functions or scripts to propagate that one edit to all of your work products downstream. I know this is a very simple picture of what automation and streamlining a workflow mean, but I like to lay it out this way because it makes the benefits very obvious.

So number one, reproducibility. Because those scripts, those recipes, are being executed by a computer, you know you're going to get the same thing every time. And this is important not just when you can predict you'll need to update something, like annual sampling events, but also when you can't, like when you catch an error in a simple mathematical equation. Number two, it can be extremely simple; though it pains me to say this, this type of workflow can even be accomplished in Excel workbooks, just maybe not at the scale we currently need. And number three, it saves time, because you're reducing that human interaction and also reducing the number of errors, which saves my team time in quality assurance and quality control.

Starting small with familiar tools

So I've found it helpful to start small, with tasks that can be automated in the same amount of time as the original task. For me and the projects I work on, that's most likely going to be moving tables of summary statistics and figures out of Excel and into R. Most importantly, I like to meet my team members where they are, by creating work products they're familiar with; in my case, that's most likely Microsoft Office products.

So I'm going to talk about two packages I love for doing this, the first being openxlsx, a package designed to read, edit, and create Excel files in R. Traditionally, when I go to save a data frame in R, I write it as a CSV. More recently, though, I've tried to force myself to save more of these data frames, especially the ones I know will be of interest to my team, as Excel files. That's because I want to help them with the transition between the reactionary workflow and the automated workflow.

Because when they're working on so many projects at the same time, and they're used to all their files being dynamic, constantly changing, and written in Excel, and you throw one CSV in there, more than once they've opened that CSV in Excel, added a few tabs, highlighted a few cells, saved it, and closed out. And since those edits aren't supported by CSV, they've just lost all that work. So by making the simple change of saving data frames as .xlsx files, I'm helping to reduce the activation energy and get my team moving toward automation. I also love openxlsx because it lets me make formatted Excel tables in R, such as the one you see here, and it's great when I have to put together the hundreds of tables that go into the back of a report, which I'm sure eventually, one day, someone might look at. Maybe.
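A minimal sketch of both uses of openxlsx, assuming a hypothetical summary-statistics data frame and file names, might look like:

```r
library(openxlsx)

# Hypothetical summary-stats data frame
summary_stats <- data.frame(
  compound = c("Chloride", "Sulfate"),
  mean     = c(12.4, 8.7),
  sd       = c(1.3, 0.9)
)

# Quick option: save straight to .xlsx instead of .csv, so tabs and
# highlighting the team adds later in Excel survive a re-save
write.xlsx(summary_stats, "summary_stats.xlsx")

# Formatted option: build the workbook explicitly with a styled header
wb <- createWorkbook()
addWorksheet(wb, "Summary")
header <- createStyle(textDecoration = "bold", border = "bottom")
writeData(wb, "Summary", summary_stats, headerStyle = header)
saveWorkbook(wb, "summary_stats.xlsx", overwrite = TRUE)
```

The second form is what scales to hundreds of report tables, since the same styling code runs for every one.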

I also love to use officer. officer allows you to manipulate Word documents and PowerPoint documents in R. I use it mostly for exporting ggplots into PowerPoint. Here's some example code doing just that, but what I want to point out is the rvg package and its function dml(). Running your ggplot object, in our case called plot, through this function before you export the figure to PowerPoint allows you to actually edit that figure in PowerPoint.
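A sketch of that pattern, assuming a ggplot object named `plot` and a hypothetical chloride data frame (the slide's exact code isn't in the transcript), might look like:

```r
library(officer)
library(rvg)
library(ggplot2)

# Hypothetical chloride time series
df <- data.frame(
  date     = as.Date("2022-01-01") + seq(0, 300, by = 30),
  chloride = c(12, 14, 13, 15, 18, 17, 16, 19, 21, 20, 22)
)

plot <- ggplot(df, aes(date, chloride)) +
  geom_line() +
  labs(x = "Date", y = "Chloride (mg/L)")

# dml() wraps the plot as editable vector graphics, so annotations,
# axis titles, and the legend can still be moved in PowerPoint
doc <- read_pptx()
doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")
doc <- ph_with(doc, dml(ggobj = plot), location = ph_location_fullsize())
print(doc, target = "chloride.pptx")
```

Without the dml() wrapper, ph_with() would place a static image instead of an editable drawing.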

So here's an example of a plot I created and exported to PowerPoint using the code from the previous slide. It shows the concentration of chloride over time, and since I used the dml() function, I can move my annotations around or change them, and I can change the axis titles and the legend. This is great when my team wants a little more control over the final product, or if they just don't like where an annotation sits, or if they want to make last-minute nomenclature adjustments; it means we can still interact with the figure.

In addition, with officer you can use PowerPoint templates to create placeholders, and then add things like figure captions, figure numbers, notes, and logos. So I can automate the process of creating these figures and exporting them from R into PowerPoint so they're ready for a report.
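A sketch of the template-placeholder idea, where the template file, layout name, and placeholder labels are all hypothetical (they come from whatever labels are defined in your own template's slide layout):

```r
library(officer)

# Start from a corporate template whose slide layout defines
# labelled placeholders for the caption and notes
doc <- read_pptx("report_template.pptx")
doc <- add_slide(doc, layout = "Figure Slide", master = "Office Theme")

# Fill each placeholder by its label in the layout
doc <- ph_with(doc, "Figure 3. Chloride concentration over time",
               location = ph_location_label(ph_label = "caption"))
doc <- ph_with(doc, "Notes: concentrations in mg/L.",
               location = ph_location_label(ph_label = "notes"))
print(doc, target = "figure_slide.pptx")
```

Looping this over a list of figures is what turns one slide into a report-ready deck.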

Automating Pro-UCL for environmental statistics

So now that I've got my team a little bit on board, starting with these smaller tasks, I can take on bigger tasks that might require more initial investment and delayed gratification. So I'm going to talk about a specific example, where my team decided to automate software developed by a regulatory agency. That software is Pro-UCL. Pro-UCL was developed by the U.S. Environmental Protection Agency, or EPA, and it specializes in performing statistical analysis on left-censored data.

In environmental science we have a lot of left-censored data. Whenever we're measuring the concentration of some compound in water, soil, or air, depending on our methods and equipment, we have something called a detection limit. Below that detection limit we don't know where the concentration sits; it can be anywhere between zero and the detection limit, so there's a lot of uncertainty around it. Pro-UCL takes that uncertainty into account when computing simple statistics like means and standard deviations, as well as more complex statistics like distributions and upper confidence limits.

So Pro-UCL is definitely solving a problem for us, but it has some less-than-great design features, one of which is that it will crash if you put too much data into it. By too much, I mean about 10,000 records, and for the type of projects we're working on, which require us to analyze hundreds of compounds in a variety of environmental media, we were having to run Pro-UCL many times. Because of that, we decided to automate it using a mouse macro, in our case Mini Mouse Macro, to automate the clicking of the software itself.

So we wrote a function in R that subsets the data into more manageable chunks. When you run the function, it creates an input file for the mouse macro, which is just a simple text file, and then runs it. I used some command-line code to open the mouse macro and load in the input file, and then it clicks through the software for you. Here it's using the text string we gave it to enter the location of the input file, clicking through all the dropdowns we need, loading in the compounds we want to look at, and finally saving the results to the text string I gave it for the output file.
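The exact macro input format and command-line invocation aren't shown in the talk, but the chunking-and-input-file step could be sketched like this (the chunk size, directory, and file naming are all assumptions):

```r
# Split a long data frame into chunks Pro-UCL can handle
# (it crashes around ~10,000 records), then write one input
# file per chunk for the mouse macro to pick up.
run_proucl_chunks <- function(data, chunk_size = 9000, out_dir = "proucl_runs") {
  dir.create(out_dir, showWarnings = FALSE)
  chunk_id <- ceiling(seq_len(nrow(data)) / chunk_size)
  chunks <- split(data, chunk_id)
  for (i in seq_along(chunks)) {
    input_csv <- file.path(out_dir, sprintf("chunk_%02d.csv", i))
    write.csv(chunks[[i]], input_csv, row.names = FALSE)
    # The macro input is a plain text file telling the macro where to
    # load data from and where Pro-UCL should save its results
    writeLines(
      c(input_csv, file.path(out_dir, sprintf("output_%02d.xls", i))),
      file.path(out_dir, sprintf("macro_input_%02d.txt", i))
    )
    # ...then launch the mouse macro from the command line and wait
    # for the output file to appear before starting the next chunk
  }
}
```

The waiting step matters: the loop can only move on once Pro-UCL has actually written the output for the current chunk.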

And then once it senses there's information in that output file, it closes out of Pro-UCL and, if necessary, runs it again, so we can loop this over and over. Now, that might have seemed like a very simple process; why would you automate it? Well, that was a very small data set, and there's actually a lot of computation time involved with larger data sets. For small projects, running Pro-UCL through automation like this takes 15 to 30 minutes, but for bigger projects we're talking six to eight hours, so it's definitely saving us time.

But writing the code to automate the clicking of my mouse through the software was actually the easy part. The hard part was deciphering the output file. Here's what Pro-UCL gives you as an output file: a formatted Excel sheet. This is just a small subsection of the statistics it gives you, for only one compound. Once again, for our sites we're looking at hundreds of compounds, and we might even have to group our data spatially, so we'll need additional information.

And so before we automated, what we did was just have someone go through and copy and paste the values we were really interested in. And that took quite some time. So instead, we automated it, and now we create tables both unformatted and formatted in Excel that my team can use for their analysis and their final report.

Lessons learned and final thoughts

So overall, my team thinks this went great. It does exactly what we want it to do. When we first decided we wanted to automate Pro-UCL, we saw two paths we could go down: one using our mouse macro, the other recreating all the statistical tests in R. That definitely would have been possible, but at the time we just didn't have the resources or time to devote to such a project. In addition, Pro-UCL is almost an industry standard at this point. It's been used for quite some time, and there's a lot of important technical knowledge built into it, so using it is a benefit, especially when we have to deal with regulators.

In addition, Pro-UCL is used on projects such as risk assessments, where we look at how compounds in the environment can affect different receptors, like humans and wildlife. Sometimes we have to go back and forth with regulatory agencies to define our conceptual model for a site, which means we might be adding receptors, adding compounds, or changing how we spatially divide up our data. That means we very rarely run Pro-UCL just once; it's usually two or three times. So automating it definitely saved time, and it reduced human error because there's less copying and pasting. But it's not perfect.

I'm sure one day, hopefully, EPA will update Pro-UCL, and then my code will become obsolete, but at this point it's already paid for itself; it works, it does what we need it to do, and we're happy with that. There are other things I wanted to do to improve this function, to make it easier for anybody to use, but we just didn't have the time. So it does require some special setup, especially with the coordinates for the mouse clicks; you have to make sure your screen resolution is set appropriately.

So, final thoughts: I've given you a bunch of different hybrid approaches to automation that helped me save time and reduce human error, which ultimately is what I believe the goal of automation is and should be. I showed some easier examples and some more advanced ones, so I hope you recognize that no matter what your skill set, you can benefit your team now. And with a little rephrasing of your pitch, starting small, and being creative, you too can reduce the activation energy needed to get your team on board, and reduce the road bumps along the way.


And with that, I can go ahead and answer questions. First, I wanted to let everybody know that at this link you'll find my slides, as well as example code for the smaller examples, openxlsx and officer, and links to more information.