Resources

Mike Stackhouse | Dive Deep into Metadata with Tplyr | RStudio (2022)

video
Oct 24, 2022
14:59

Transcript

This transcript was generated automatically and may contain errors.

Thank you, everyone, for coming. We're really excited to be speaking here today. My name's Mike Stackhouse, I'm the Chief Innovation Officer of Atorus Research, and with me I have Jessica Higgins, Director of Analytics Engineering. But really, I'm a nerd who likes to code, and Jessica is a nerd who likes science and coding. Today we're going to be talking about our package, our R package, Tplyr.

And where I want to start with this is a little bit about why we made it, because typically I'm speaking at industry conferences that are pharma-centric, and specifically our pocket of pharma that we work in, which is late-phase clinical trials. So the first thing to understand about programming for late-phase is that we're not doing as much exploratory work, which is what a lot of the tools in the R ecosystem are built for, allowing you to explore your data, model, and do all sorts of things. But when you're in late-phase, a lot of what you need to do is already laid out for you. It's highly standardized, you already have basically all of the outputs that you need to create, so you're really trying to operate more like a factory.

So the first kind of principle of why we made Tplyr is that summary tables are highly repetitive, so your code should be highly reusable. You want to operate like a factory, you're putting out maybe 300 tables for a single trial. So you need to get through that volume and you need to get through it quickly, so you really want to be able to reuse that code. The second principle is that actually for the outputs that we create, summarizing the data isn't what takes a lot of time. It's easy enough to pop a variable into counts or summarize and do some means and standard deviations and get those numbers. That doesn't take much time. The cosmetics of getting it to look the way that your statistician asks you to is what takes a lot of time.

So the other thing that we tried to build into the package is that text formatting of numbers can be very tedious, so we wanted to make it intuitive and easy. But this package has been out for two years at this point, and I've talked about this enough that my team is definitely sick of hearing about it. So the new thing, the last principle, which we've finally gotten to, is that we want to make Tplyr more friendly for interactive tables.

So back in 2020, we work with a lot of different SAS programming groups who are looking to adopt R, and the question that I get constantly is why R? And the point that was made to me early on that resonated that I've never really forgotten is that someone told me that the cost of doing an additional, say, adverse event table for a trial within their company is free, because you just pop it into a SAS macro and out comes the table and you're done. And when you're trying to talk about cost savings and what R can do for you in that area, that's a hard thing to compete against when the output is free.

So what does R really have to offer? And that gets into the world of R Markdown and Shiny and interactivity, and that's where we really want to make things more accessible.

About the Tplyr package

So just to give a little bit of background on the package itself: Tplyr is an R package that we released back in 2020. I shamelessly took the name from dplyr, along with the principle behind dplyr that the package is a grammar. Tplyr is a grammar of clinical summary tables.

So the reason that we try to make it a grammar of summary tables is that we wanted to take your focus off of doing the actual summary and thinking about the algorithms that you need to write to do the summarization and how you get the numbers that you want and really design it more based off of describing the output that you want to create. So shifting that focus of just saying, this is what I want to build. A statistician, typically you get your CSR and you get the mocks of the tables that you need to produce, and that's going to have different characteristics that are going to need to be fulfilled. So we wanted Tplyr syntax to focus on describing the output and building up the characteristics and then saying go and build this for me.

The last thing was that we also wanted to really draw our line in the sand. I did not want to anchor into any specific reporting package itself. I favored huxtable early on because it had good RTF support. gt has been building up over the last few years and it's in an amazing place. Jessica loves flextable. And then in the Shiny world, you have things like reactable. So I wanted to make the things that come out of our package really easily accessible to any styling package that you want to use. And we just wanted to really stop at the character formatting of the numbers, so that that annoying part was done for you and you can pop it into some other library that will actually make it pretty.

To describe Tplyr itself, I want to thank Christina Fillmore for giving me the best metaphor for this that I've had, which is think about cake. So cake can take a lot of forms. It can have a lot of flavors and it can have a lot of different decorations. You can have multiple layers into your cake and those things can look differently depending on how you build it up. And that's really what a Tplyr table ends up being.

So for a demographics table, which I have a concise example of up here, we have three different variables from our dataset that are being summarized. So we have the age group of subjects, which is a categorical variable. We have age, which is a continuous variable. And then we have race, which is another categorical variable. So for the categorical variables, you're going to count. You're going to represent percents. And then for the continuous variables, you're going to do different descriptive statistics for your n, mean, standard deviation, so on and so forth. And then those things are all going to be combined together and they're going to have their different characteristics of how you want that styled. And we're going to bake our cake and pop that out and you get the data frame all collected together and assembled and already formatted so that you can go and put that into some sort of presentation.
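A minimal sketch of the three-layer demographics table described above, assuming a recent Tplyr release that ships the example dataset tplyr_adsl with the columns TRT01A, AGEGR1, AGE, and RACE. The row labels and format strings are illustrative choices, not the ones shown on the slide.

```r
library(Tplyr)
library(dplyr)

# Describe the table: one layer per variable, split by treatment arm.
t <- tplyr_table(tplyr_adsl, TRT01A) %>%
  # Categorical variable: counts and percents
  add_layer(group_count(AGEGR1, by = "Age group")) %>%
  # Continuous variable: descriptive statistics with explicit number formats
  add_layer(
    group_desc(AGE, by = "Age (years)") %>%
      set_format_strings(
        "n"         = f_str("xx", n),
        "Mean (SD)" = f_str("xx.x (xx.xx)", mean, sd),
        "Median"    = f_str("xx.x", median)
      )
  ) %>%
  # Another categorical variable
  add_layer(group_count(RACE, by = "Race"))

# "Bake the cake": build() returns a data frame of formatted character results
demog <- build(t)
head(demog)
```

The result is an ordinary data frame, so it can be handed to whichever presentation package is preferred.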

What metadata brings to the table

But our focus here today and what we're talking about for Tplyr today is what metadata can really bring to the table. So the question that we're trying to answer and what we're trying to make accessible here is that when you look at a result on the table, asking the question, what data produced that result? So if I look at this table, I know that I have six subjects in the low-dose subgroup who are black or African-American, but who are the six subjects? What are the data that were summarized to actually produce that result? If I'm looking at the median of the high-dose group, again, what data produced that result? What subjects were fed in? What was the data that went into my dplyr summarize and went through the median function to actually produce that result? Because these are questions that are often asked when you want to drill down into your data and understand more about it.

So specifically, when we talk about metadata within Tplyr, we're asking two different questions. What variables were necessary to derive that result? And what filters were necessary to derive that result? So really, what is the slice of data picking out the different columns and the different rows that were actually necessary to produce the result that you see on the table? And here, I will hand it over to Jess.

Okay. So thanks, Mike, for giving us a great intro on why you created Tplyr, and now we've talked about what do we mean by metadata. So how are we going to harness this metadata?

So in this process, you've created your Tplyr table, which produces your summary data frame. These are the results you see. You would see them on paper if you printed it out. And additionally, what we're doing here now is including this table metadata. That metadata, remember, is the names of the variables and the filters used to create that result. And we can use that metadata to create the relevant subset, and we can wrap all of this in a really easy Shiny application to create an interactive table that, upon a click action, will give us both our summary table and the actual data that produced that result.

Creating and extracting metadata

So what does this look like, and how do you build it? How do you get that metadata? It's very simple. To create the metadata, all you have to do is set the argument metadata = TRUE when you build the table, and the metadata for your table will be created automatically. For example, we have our result output data frame, and now you can see what's actually added to the table is the Tplyr metadata object. So each cell contains this object, and this object is the metadata, again, those variables, those names and filters that created that specific result cell.

As we were talking about earlier, this is indicated by a row ID and a column ID, and we can go from there to find who are these people. They're not just numbers in the summary table. These are subjects in a clinical trial. So we've got this metadata. How do we extract it? What do we see? How do we get to this? As I mentioned, the input is really simple. You supply it a Tplyr table, a row ID, and a column name, and the output, what I really like about this, is that you can see, okay, these are the names that are required to create this result. I already know what columns I'm looking at. These are the filters, and you can see exactly what it looks like. What are the filters that are going into that result?
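A minimal sketch of building with metadata and looking up one cell, assuming a recent Tplyr release with the bundled tplyr_adsl dataset. The column name "var1_Placebo" assumes the Placebo arm of that dataset; the row ID is read from the built table rather than hard-coded.

```r
library(Tplyr)
library(dplyr)

t <- tplyr_table(tplyr_adsl, TRT01A) %>%
  add_layer(group_count(RACE)) %>%
  add_layer(group_desc(AGE))

# metadata = TRUE adds a row_id column and records metadata for every cell
dat <- build(t, metadata = TRUE)

# Row IDs identify a row within a layer (for example "c1_1")
first_row <- dat$row_id[1]

# Inputs: the table, a row ID, and a result column name
meta <- get_meta_result(t, first_row, "var1_Placebo")

meta$names    # variables needed to derive the result
meta$filters  # filter expressions needed to derive the result
```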

You're saying, great, I see this. Where the heck is that subset? What is this data? This is a really simple function, too. Again, it's a new function, get_meta_subset. You provide the same inputs, and it produces that subset, that subset of data that produced your results. You can see exactly who is in that result.

Additionally, this get_meta_subset function includes an additional argument to add relevant columns that might give your subset a little additional context. It defaults to USUBJID, which, for those of you in the clinical world, is a CDISC-specific variable that all these datasets have. However, you can change it to whatever you want. This is the only place where any clinical or CDISC-specific defaults or language are used in this package. We really want Tplyr to be extensible beyond just using finished CDISC data, but this is the one place where you see it.
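A minimal sketch of pulling the subset behind one cell, again assuming the bundled tplyr_adsl dataset and its Placebo arm. Adding SEX through add_cols is only an illustration of supplying extra context columns.

```r
library(Tplyr)
library(dplyr)

t <- tplyr_table(tplyr_adsl, TRT01A) %>%
  add_layer(group_count(RACE))

dat <- build(t, metadata = TRUE)
first_row <- dat$row_id[1]

# Same inputs as get_meta_result(); the default extra column is USUBJID
subjects <- get_meta_subset(t, first_row, "var1_Placebo")

# Override add_cols to bring along more context for the reviewer
subjects_ctx <- get_meta_subset(
  t, first_row, "var1_Placebo",
  add_cols = vars(USUBJID, SEX)
)
head(subjects_ctx)
```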

Demo: interactive Shiny app

To summarize, we've created our Tplyr table, which includes our summary results. We've got our relevant metadata, our filters and names. We can create the subset to see that. What does this actually look like in an interactive table?

I'm going to give you a quick little demo. What you're going to see is the most bare bones basic Shiny app ever created. I'm pretty proud of this fact. I didn't create it, but I could, and I'm not a Shiny developer. Here it is, in all its glory. In fact, it's so simple, it's 85 lines long, which I had to look up. Again, you can see the code. We'll provide that.

What do we have here? You can drive around. As we mentioned earlier, we were looking at who are those eight subjects in the placebo group that are black or African American. You click on the result, and your metadata subset pops up. You can see it. You can check. This is incredibly useful. As someone who spent a lot of time looking at summary tables, then looking at a really long listing, trying to find those people individually, or programming it on the side while comparing against a summary table, being able to see all of these things at once takes your review from a very manual process to something that's interactive, that's quick, that's fast. That's what we're about, trying to make things quicker, faster, and easier, and to bring pharma along in that process slowly.

Extending metadata beyond Tplyr tables

Okay, that's great. Regular summary table. These are pretty common. What if you have a summary table, and you've included some additional statistical analyses or models, and you have those results that you want to share, and you also want to include the metadata for that test? You want to know exactly what subjects went into that analysis. With Tplyr, you can extend and append the metadata. You can create a new metadata object. You can add those filters and names that will help create the results.

In addition, and my favorite part, is you don't actually need a Tplyr table to do this. This will work with a data frame. I was just recently thinking about the context of this in my way-former life. I was an evolutionary biologist. I worked with a lot of butterflies and used to put them in models, and I was like, man, if I could have seen the subset that would go into that at that time, it would be really useful. I can see how I personally would have taken this and used it in my own work. Again, extending that metadata, adding that metadata directly to your results so you can pull that up as needed.
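A minimal sketch of attaching metadata to a result that did not come from a Tplyr table, assuming a recent Tplyr release with the bundled tplyr_adsl dataset. The model, the row ID "x1_1", the label, and the column name var1_model are all hypothetical placeholders, not taken from the talk.

```r
library(Tplyr)
library(dplyr)

# Hypothetical: record which variables and rows fed a model comparing AGE
# between the two active arms in the safety population.
model_meta <- tplyr_meta(
  names   = quos(TRT01A, SAFFL, AGE),
  filters = quos(SAFFL == "Y", TRT01A != "Placebo")
)

# A plain data frame of results: row_id identifies the row, and the result
# column holds the metadata object in a list column.
results_meta <- tibble(
  row_id     = "x1_1",
  row_label1 = "p-value (High vs. Low)",
  var1_model = list(model_meta)
)

# Without a Tplyr table, the source data is supplied through `target`
get_meta_result(results_meta, "x1_1", "var1_model")
model_data <- get_meta_subset(
  results_meta, "x1_1", "var1_model",
  target = tplyr_adsl
)
head(model_data)
```

When the extra rows belong alongside an existing Tplyr table, append_metadata() can attach a data frame like results_meta to that table's metadata instead.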

We'll show it quick. This Shiny app, again, really simple. Now we're looking at, we have the results of a dose-response model between high and low, I think. I can't see that well. Again, you're going to see who's in this model. What are the subjects here? What is the data that's included in this model? You can click on this really simple interactive table, and the results are produced right there for you. It's a really quick and snazzy way to see your results and the metadata that created them all at once.

With that, like I said, I'm really excited at how we're getting some of the idea of taking these formerly paper clinical tables and moving them into more of a screen-friendly format, into a quick review format, and a way that you can keep moving forward. Thank you.