Daniel Sjoberg - gtsummary: Streamlining Summary Tables for Research and Regulatory Submissions
videoimage: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
I'm a data scientist at Genentech, and today I want to talk to you about the gtsummary package, which has, adorably, been mentioned in the previous two talks, so I love that.
So this is a bit of a different talk than you might usually hear about a package at posit::conf. Oftentimes you hear about cool new tools you're learning about for the first time, and while I think this is a cool tool, it's not exactly new. It just turned five years old on CRAN, so happy birthday, gtsummary.
I want to talk to you a bit today about this journey from honestly being an absolute R noob to developing a package that was meant really for my team or my department at Sloan Kettering, where I was working at the time, and then to maintaining a project that's actually pretty widely used in the community now.
Origins at Memorial Sloan Kettering
So picture it: New York City, the year is 2018, and I'm working as a biostatistician at Memorial Sloan Kettering Cancer Center. My team and I took major pride in the quality of our code and the reproducibility of our results; we were constantly patting ourselves on the back about it. I debated whether I should share how we were reproducible, because it's a little embarrassing in retrospect, but here's what we would do. In Stata (cue the Imperial March music), if we needed a table one, a demographics table, we would say: hey, I need age and grade and stage, and I need that in table one format. It would calculate the median, the IQR, what have you, round them, and print the table to the console with ampersands between where you would see the columns. Then we would copy it, paste it into a Word document, highlight it, convert it to a table, and we're like, wow, reproducibility.
So while this was good, you know, we weren't transcribing manually, we weren't retyping numbers, it left something to be desired, for sure. It was around this time that I started hearing some rumblings about something called R Markdown, and it was infinitely superior to what we were doing, and it was very clear that the team needed to make a change.
So we were just delighted to find the solution. Almost as delighted as I was to find a color match between this sticker and the crushed velvet ensemble Dorothy's wearing. Flawless, right?
So we needed to make the switch. There were just a couple of issues. We didn't know R. I had kind of used R in passing, but I didn't know what the tidyverse was; I didn't know anything, really. But we thought that shouldn't be an issue. It's just writing things in a script; how hard could it be?
So before we made this transition, we took a survey of what was out there: does it really meet our needs as a team? ggplot was great, but for summary tables, we didn't find the exact solution that was going to work for us. So we thought: we'll just build one, you know? How hard could it be?
So with the confidence that you can have only when you have absolutely no idea what you're doing, we were like, we are going to build this package, it's going to be so great, it's going to make this table and this table. These are the two common tables that I make in my work as a statistician here.
So I should say that while I was totally ignorant of how to program in R at the time, thinking about statistical reporting was something I had been doing for many years. I was on the editorial board of the journal European Urology for a couple of years, and I had co-authored the reporting guidelines for all of the studies there, and those guidelines have since been adopted by seven other journals. So thinking about how to report statistics was something I thought a lot about. The mechanism of doing that with R, I was way out of my league, but I didn't even know it.
Anyway, fast forward, we cobbled this together. Picking up a project is a great way to learn a programming language, right? So that's what we did. The first release came in May 2019, and I remember being so nervous about putting this work out in public to be scrutinized, to see what people thought of what I did. It's really funny to think back on those feelings, because I'll now put pretty much anything out and be like, hey, I made some garbage. Do you like it?
But the community's reaction was incredibly kind and engaging, and it was through that community engagement that the package added additional functionality, and the product just got better and better and better. The package grew in both users and functionality, and it was really exciting to see the contributions coming in, both code contributions via pull requests and ideas for improvements, some really fantastic ideas. This was also the beginning of my engagement with the R community, and it was such a happy time.
And funny enough, everyone speaking in this session, Rich, Shannon, Becca, hello. We all knew each other before this from the community. Rich, for example: I was in Toronto two weeks ago, and he gives a great walking tour of the city, FYI. Shannon and I have collaborated on a couple of packages together as well, and we have found ourselves in no fewer than two service elevators trying to get to a rooftop bar that we may or may not have had a reservation at. And Becca and I have a standing weekly call because we're actively collaborating right now. So I just love the community. It's really wonderful.
How it's going
So that's kind of how it got started, but how's it going? Well, much better than I anticipated, honestly. So we got over a million downloads from CRAN, a thousand GitHub stars, a hundred GitHub forks, and in 2021, the American Statistical Association gave the package the Innovation Programming Award, and last month, Augustin created an entire trial readout using gtsummary and won the 2024 Posit Table Contest. So that's pretty awesome. It's really exciting stuff.
The package in practice
So you've heard about the package, but let's take a look at it in practice. The ethos, the idea behind this package, was that we wanted dead simple code that was also super customizable, which are sometimes two conflicting goals.
So in this example, I think it's pretty dead simple. We have trial, which is a data frame, a data set, and we pass that to tbl_summary() from the gtsummary package, and we say: I would like summary statistics split by treatment, and I want to see age, grade, and response. So internally, we look at this and say: okay, age looks continuous, I'll default to the median and IQR. Grade looks pretty categorical, I'll do that. Response looks dichotomous, I'll just give you a single line. Two of these, age and tumor response, have missing values; let me make sure I put some information about that in the table so you don't think there's no missing data.
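The call described here might look like this: a minimal sketch using the trial data frame that ships with gtsummary, where the include argument limits the table to the listed columns.

```r
library(gtsummary)

# Summarize age, grade, and response, split by treatment arm.
# Continuous variables default to median (IQR), categorical to n (%),
# and missing values get their own "Unknown" row automatically.
trial |>
  tbl_summary(
    by = trt,
    include = c(age, grade, response)
  )
```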
And it looks at each variable again. Start at the top, at age: it sees that your ages span roughly this range, so rounding to zero decimal places, the nearest integer, is reasonable for this variable. So there's a lot of stuff going on in the background so that, in a couple lines of code, you can get a table that is pretty much ready to go to a journal. That was the idea behind it. Dead simple code, quick table. That's the takeaway right here.
And those defaults, of course, are all modifiable. So in this example, I'm adding the statistic argument and saying: take all the continuous variables, which in this case is just age, and rather than the default median and IQR, I would like to see the mean with the standard deviation in parentheses next to it. You may recognize some of this syntax. It looks just like glue, which you see in dplyr, in stringr, throughout the tidyverse. And that's exactly what we're doing, but we make it do double duty. We're not just saying, put that in here. The mean is in curly brackets, so we go look for the mean function, run it on the age vector, format the result, do the same for the standard deviation, and then do the proper gluing, popping those numbers into the table. So you can end up with pretty complicated, very cute looking tables using this glue syntax. It's pretty simple. I love it.
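A minimal sketch of overriding the default statistic with the glue-style syntax described above:

```r
library(gtsummary)

# Report mean (SD) instead of the default median (IQR)
# for every continuous variable (here, just age).
# "{mean}" and "{sd}" are looked up and run on each variable's vector.
trial |>
  tbl_summary(
    by = trt,
    include = c(age, grade, response),
    statistic = all_continuous() ~ "{mean} ({sd})"
  )
```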
Composable tables and regression summaries
So in addition to these kinds of basic tables, gtsummary tables are very composable. While this drug A, drug B comparison is a very important part of the table, there's oftentimes much more we want to report about these items, right? A very common thing is that you'll want to see a p-value comparing the values of, for example, age, or in this case marker level and tumor response, across those treatments of drug A and drug B. So we have a function, add_p(), that will do that, and it adds a single p-value column. There's also an add_difference() function: when you have two groups, it will add the mean difference or the rate difference along with a confidence interval and the p-value. And from these tables, you can merge these statistics into a single cell, or you can hide the p-value if you don't like p-values.
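The add_p() and add_difference() calls described above chain directly onto a summary table; a sketch:

```r
library(gtsummary)

# Add a p-value column comparing the two treatment arms
trial |>
  tbl_summary(by = trt, include = c(age, marker, response)) |>
  add_p()

# Or, with two groups, report the difference with its
# confidence interval and p-value instead
trial |>
  tbl_summary(by = trt, include = c(age, marker, response)) |>
  add_difference()
```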
Quite customizable here. And while we have wonderful, lovely defaults, they're, of course, all modifiable. I realize while practicing this talk, I'm going to say this maybe three more times.
My next favorite feature is regression model summaries. Here, you may recognize your typical logistic regression model from the glm() function in the stats package, base R essentially. We have tumor response as our endpoint and treatment and marker level as covariates. And you've all seen this output: it's not that easy to read, and it's often not that easy to work with. So we have a function called tbl_regression(). You pass it the modeling object, and it gives you something quite reasonable back. In this case, because we did a logistic regression, I want to exponentiate, so we have odds ratios. The function can recognize: hey, this looks like a logistic regression, so these are odds ratios; let me put a good header up there, OR, with a footnote saying that OR means odds ratio. It also does fantastic stuff like identifying the reference level. This is big. This is not ambiguous. That's what I love: the clarity and the context you're adding to these tables is really fantastic. Now, we default to an em dash for the reference level, because I love em dashes, but you can put a 1 there instead. A lot of people use 1 or "ref", depending on the journal you're submitting to, for example.
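The logistic regression example walked through above might be sketched like this, again using the trial data that ships with the package:

```r
library(gtsummary)

# Logistic regression: tumor response by treatment and marker level
mod <- glm(response ~ trt + marker, data = trial, family = binomial)

# exponentiate = TRUE reports odds ratios; tbl_regression() adds the
# OR header, its footnote, and the reference-level rows automatically
tbl_regression(mod, exponentiate = TRUE)
```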
But I can't move on from this table without a small anecdote, again, about the community. I had written the original version to work with linear models, GLMs, and Cox proportional hazards regression, because those were kind of my bread and butter at the time. Joseph Larmarange, who wrote the labelled package, which you also heard about earlier today, said: hey, this is fantastic, can I take this little garbage you wrote and make it better, so we can support everything? And I was like, yeah, that sounds amazing. So there's another package called broom.helpers. He didn't describe it like that; that's me saying that. broom.helpers is kind of like broom++. It takes your modeling object and does all sorts of additional formatting for it, like finding those reference rows, and it even handles really complex things if you're using complex contrasts, what have you. So now tbl_regression() and broom.helpers support 40-plus packages and regression modeling functions, which is just really, really great. And if you are writing a regression modeling function, just do the basic stuff that R wants you to do: write a model.frame() method, write a model.matrix() method, and it'll be supported out of the box.
Table cobbling
My next favorite feature: table cobbling. While we export many functions for creating rather simple tables, you need to make a complex table every now and then, I presume. I did as well, and so did my team. So we implemented infrastructure for merging, stacking, and stratifying tables very, very easily. On the far right, where it says multivariable in the header, that's the exact table you saw previously, the multivariable regression model. To the left, under the univariable heading, is the output of another function, tbl_uvregression(), which creates univariable regression model summaries. I pass both of these to tbl_merge(), specify the headings I want, and now I have a rather complex table with my univariable results on one side and my multivariable results on the other, all merged together with one line of code.
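A sketch of the univariable/multivariable merge described above, with tbl_uvregression() fitting one model per covariate for the univariable column:

```r
library(gtsummary)

# Univariable logistic regressions, one model per covariate
tbl_uv <- trial |>
  tbl_uvregression(
    method = glm,
    y = response,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    include = c(trt, marker)
  )

# The multivariable model from before
tbl_mv <- glm(response ~ trt + marker, data = trial, family = binomial) |>
  tbl_regression(exponentiate = TRUE)

# Merge side by side, with a spanning header over each set of columns
tbl_merge(
  list(tbl_uv, tbl_mv),
  tab_spanner = c("**Univariable**", "**Multivariable**")
)
```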
Similarly, you can stack them. And I love the story I'm about to tell you about this table in particular. If you took epidemiology in school, you probably recognize it. It's a very common table of odds ratios, combining odds ratios across strata using the Cochran-Mantel-Haenszel test or method; that's the CMH you see over there.
So very common table. So some epidemiologists in Sweden said, hey, this is a super common table, and we love your package. Can we make this with gtsummary? And my initial reaction is always, yes. Let me help you. Let's write the function. Let's do it together. And then I started writing, and I was like, you know what? Maintaining work is a thing. So if I put this in my package, I have to maintain it forever. And it's not something that I needed. So I had a better idea. I said, how about I teach you how to cobble this table together, and we put it in a package that you maintain? And they were really excited, and it worked out really well.
So just quickly, the basic component of this table is a cross-tabulation, tbl_cross(), to get that exposed and not exposed tabulation by T stage. Once you have that one table, you just do it again for cases and controls and merge them. Then do it for stage and for grade, and stack them. And the add_stat() function is a very general way to add pretty much anything you want to a gtsummary table, so you can slap your odds ratios on there. So very, very easily, you have a pretty complex table, which I think is so cute.
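The cross-tabulate-then-stack steps described above might be sketched like this. The final add_stat() step, which would append the odds-ratio and CMH column computed by a function you supply, is elided here because its shape depends entirely on the statistics you want.

```r
library(gtsummary)

# One cross-tabulation: T stage by tumor response
t_stage <- trial |>
  tbl_cross(row = stage, col = response)

# Repeat for another stratifying variable...
t_grade <- trial |>
  tbl_cross(row = grade, col = response)

# ...then stack the pieces into a single table
tbl_stack(list(t_stage, t_grade))
```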
Broad community and language support
So like I said, I've interacted with these epidemiologists, and I wasn't working as an epidemiologist. Over the years, I've interacted with people in many, many fields, and hearing how they do their work has been super interesting. I've learned a lot from them, and vice versa, perhaps: economists, financial analysts, research scientists, data scientists, across the gamut of people doing analytic work. And the package is better for that. In addition to being for everyone, gtsummary tables can be translated into, I believe, 14 languages at the moment. You can just set at the top of your script: hey, I'm going to Spanish, or Icelandic, I just saw, and your results will all be translated for you. So if you're working in a country where English isn't the primary language, the package is still really wonderful, and you can show your results in your native language. I will say that if you find something that's not translated, because we do keep adding new functionality, just shoot me a message with what the translation should be, and I'll add it.
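Setting the output language as described is one call at the top of the script, via the theme_gtsummary_language() theme function; a sketch:

```r
library(gtsummary)

# Translate table headers, labels, and footnotes into Spanish
theme_gtsummary_language("es")

trial |>
  tbl_summary(by = trt, include = c(age, grade))
```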
So we've covered some really basic summaries here, but I wanted to touch on the fact that there are also cross-tabulations and other continuous, subgroup-type analyses you can do. Wide summaries, where you have your n and percent, for example, in different columns. Survey data and survival, or time-to-event, data are easy to summarize. And any number you see in a gtsummary table can be reported inline in an R Markdown or Quarto document using some other functionality that we have. I just love that for the end-to-end reproducibility of it. So it's fantastic.
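The inline reporting mentioned above is handled by inline_text(), which pulls a formatted statistic out of a table; a sketch (the column value assumes the treatment level is labeled "Drug A" in this table):

```r
library(gtsummary)

tbl <- trial |>
  tbl_summary(by = trt, include = age)

# In R Markdown / Quarto prose you would write:
#   `r inline_text(tbl, variable = age, column = "Drug A")`
# which returns the formatted "median (IQR)" string for that cell
inline_text(tbl, variable = age, column = "Drug A")
```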
Output engines and GT integration
Next favorite feature. Well, some of this is becoming moot, I think, the more I learn about where gt is going today with PowerPoint and Excel, so I'm looking forward to that. And gt is obviously the focus; it's in the name of the package. But you can also export to all of these other engines for drawing your table. flextable has some pretty strong connections within the Microsoft Office universe, so Word and PowerPoint. huxtable, I think, is really nice for going to PDF or even to Excel. kableExtra is great at PDF as well. And I love kable for when you're giving examples on Stack Overflow or making reproducible examples on GitHub, because its output is just a simple markdown table, so it renders properly there. It doesn't have all the bells and whistles of a gt table, of course, but most of the time it tells you what you need to know. So this, I think, is super helpful, because you're able to use the package in various contexts, whatever you need at that time in your analytic life.
If you just print a gtsummary table, that conversion to gt happens in the background; you don't even know it's happening. But if you explicitly call as_gt(), then you have the, I don't know, 150-plus functions from the gt package to style your table in precisely the way you love. That is incredibly powerful, because it means I don't have to support all of that: Rich is already doing it.
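The output engines listed above are each one conversion call on the finished table; a sketch:

```r
library(gtsummary)

tbl <- trial |>
  tbl_summary(by = trt, include = c(age, grade))

tbl |> as_gt()          # gt: full styling control from there
tbl |> as_flex_table()  # flextable: Word / PowerPoint
tbl |> as_hux_table()   # huxtable: PDF, Excel
tbl |> as_kable_extra() # kableExtra: PDF
tbl |> as_kable()       # plain markdown table for GitHub / Stack Overflow
```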
Getting started and ARDs for pharma
So I hope you'll check out the package in the near future, and I would start at the package's website. The documentation is wonderful: there are a lot of vignettes and articles going through basic and advanced use, and a gallery full of pretty complex tables to illustrate how you'd put all this together. There's a one-hour presentation on my YouTube channel, and on the R in Medicine channel there's a three-hour workshop plus a 20-minute talk. Lots of ways, whether you like to read or to listen and watch, to learn more about this.
But I wanted to add one more thing for my pharmaceutical friends in the audience. It's about analysis results datasets, an emerging standard from CDISC. ARDs are a super structured way to store results. In the pharmaceutical industry, these live after your ADaM datasets, the datasets that are ready for analysis, and before you create your TLGs: your tables, listings, and graphs. And gtsummary, in the 2.0 release, has been entirely refactored to run on ARDs. That means that for every gtsummary table, you can go in and get a highly structured object with all of the unformatted results in there, and that's going to be perfect for being compliant with this CDISC specification. You can also build your own ARDs and send them through the same tabling functions. So here's what an ARD looks like; I'm running low on time, so I'll skip it.
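A sketch of pulling the ARD back out of a finished table; gather_ard() reflects my understanding of the 2.0 API, so check the current documentation for the exact name:

```r
library(gtsummary)

tbl <- trial |>
  tbl_summary(by = trt, include = c(age, grade))

# Extract the underlying analysis results dataset: a structured,
# unformatted record of every statistic shown in the table
gather_ard(tbl)
```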
So I love the community. Thank you for being my friend.
And before we ask the first question, I just wanted to put one little note out there for tomorrow. We're having a lunch in the Regency Ballroom for the Rainbow R community. Everyone's welcome; please join. Thank you so much. It was a great talk. We have time for one quick question.
How does someone get involved with suggesting updates or making updates to gtsummary? I would get started on GitHub, or if you see me in the hall, you can just stop me and chat with me. Great. Thank you so much. Thank you, everyone.
