Ask Hadley Anything
Transcript
This transcript was generated automatically and may contain errors.
So yeah, I'd like to thank everyone for attending this Ask Hadley Anything, which will run for the next half hour. useR! is coming up in Salzburg this year, with just under a month to go, and Hadley is a presenter, so there's a chance to hear more from him at that event. But today he's here representing one of our sponsors that made useR! possible.
I guess, Hadley, could you give us a quick intro to yourself and our sponsor, Posit?
Yeah, sure. Thanks, Jane. Thanks for having me. So as you probably know, I write a lot of R packages. My team and I look after the tidyverse, which is a set of packages for doing data science, and the devtools ecosystem of packages, which helps folks maintain their own packages. For example, one of the things I've been working on recently is pkgdown, the R package that we use to make websites for our other R packages, and now many other people use it for their package websites too.
So I work for Posit, which is a PBC, or Public Benefit Corporation. That's a little bit different from a traditional corporation, where the overarching goal is to maximize shareholder value. The idea of a PBC is to relax that and add goals beyond just making a ton of money. And so the goals of Posit are to really support the data science ecosystems, R and Python, and scientific publishing in general.
So we make a lot of open source software, but we also sell commercial products. Our main commercial products are Posit Workbench, Posit Connect, and Posit Package Manager, basically designed to solve problems that larger teams of R users tend to have. So Posit Workbench allows you to run RStudio or VS Code or Jupyter Notebooks on centralized compute, which allows you to have a few big servers with a lot of computing power, or perhaps you're in a regulated environment, and that allows you to run all your code in an environment where everything's set up correctly.
Connect allows you to get the results of your analysis out of your hands and into the hands of decision makers, whether that's by publishing Shiny apps or Dash apps or scheduled R Markdown reports, or by sending an email with a link to a dashboard at 9 a.m. every Friday. And then last but not least is Posit Package Manager, which allows your organization to use a shared set of packages that have all the qualities your organization might want.
So that was a very quick overview. I work on the open source side of Posit, but I'm tremendously grateful for all of my colleagues who make the products, because they make the money that pays my salary.
Current projects and IDE preferences
I'll tell you, I've been playing around a little bit with VS Code, which is a pretty interesting IDE, or not exactly an IDE; it's almost like a tool for building your own IDE. I've just been playing around with that, learning a bit about how it works, and that's been pretty interesting. It's a very, very popular IDE used by all sorts of developers for other languages, so it's been interesting to see what folks expect in a modern IDE.
What is your preferred IDE and does it change depending on what you're doing?
Yeah, I will say 90% of what I do is package development. I do a little bit of other things: a little bit of Shiny development, a few websites, a little bit of data science from time to time. So I really like to stick as much as possible with one thing so I can really get familiar with it. Currently that means a mix of RStudio, which is my day-to-day driver, and VS Code, which I'm just exploring and learning from.
Yeah, so pkgdown: I'm currently getting very close to finishing off a pkgdown release. The two big features in this release are going to be optional dark mode for your website, so that users of your package website can flip between light mode and dark mode; that's a very, very popular feature on websites. And, a little more significantly, the ability to use Quarto for the vignettes in your packages. Traditionally, people have written their vignettes using R Markdown. Fairly recently, CRAN installed Quarto, and so now you can use Quarto to write your vignettes, which brings all of the nice features that Quarto has, from better figure layout to easier cross-referencing and equations and all that kind of stuff, to your package vignettes as well.
And that's kind of a dark art, because you've got to merge the templating systems of pkgdown and Quarto together. So it's certainly never going to be 100% perfect, but I think we can get to 95% pretty easily and get most of the features that people care about working. And then as people report bugs, we'll fix them. But that should be coming very, very soon.
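For anyone who wants to try Quarto vignettes, a Quarto vignette looks much like an R Markdown one but lives in a `.qmd` file and declares the Quarto vignette engine in its header. A minimal sketch, based on the quarto R package's documented vignette setup (the title here is a placeholder):

```yaml
# vignettes/my-vignette.qmd
---
title: "My vignette"
format: html
vignette: >
  %\VignetteIndexEntry{My vignette}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---
```

Your package's DESCRIPTION also needs `VignetteBuilder: quarto` and quarto listed in Suggests so the engine is available at build time.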
GitHub Actions and R in production
Someone said that they recently saw a video where you showed an example of web scraping and GitHub automation. And they asked: is it possible to use GitHub to automatically render multiple Quarto documents to PDF and HTML and upload them to a server?
Yeah, absolutely. CRAN is now making it clearer which packages are scheduled for removal if their problems haven't been fixed. So I've been working to turn that into a little dashboard that gets automatically updated every day by a GitHub Action.
But this sort of thing, where you schedule a GitHub Action or something similar to do something and then publish it somewhere, is super, super useful and powerful. And you can basically do it today; it's just not terribly well documented. So one of the things I've been thinking more generally about lately is this issue of how you effectively use R in production. To me, the two big things that make putting R, or anything, in production challenging are that your code is now being run regularly, and it's being run on another computer. So I'm trying to do more and more of these regular things in GitHub Actions, just to better learn how you can write code that is both more reliable and, when it fails, gives you more actionable insights.
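As a sketch of the kind of workflow the question asks about: a scheduled GitHub Action that renders a Quarto project and publishes the output. This is a minimal illustration, assuming the publicly available quarto-dev/quarto-actions actions and GitHub Pages as the target; publishing to your own server would replace the last step:

```yaml
# .github/workflows/render.yml — render Quarto docs daily and publish
name: render-docs
on:
  schedule:
    - cron: "0 6 * * *"   # every day at 06:00 UTC
  workflow_dispatch:       # also allow manual runs
jobs:
  render:
    runs-on: ubuntu-latest
    permissions:
      contents: write      # needed to push to the gh-pages branch
    steps:
      - uses: actions/checkout@v4
      - uses: quarto-dev/quarto-actions/setup@v2
      - uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages  # or swap this step for an upload to your server
```

The `schedule` trigger is what turns a one-off render into the kind of regularly updated dashboard he describes.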
Another thing I did sort of along a similar lines, probably the video you saw was there was an artist I follow on TikTok and every time he sold his work, it was always sold out by the time I found out about it. So I wrote a little script to scrape his website every three hours and then send a push notification to my phone whenever it changed so that I could finally buy some of the art. So I think there's a lot of cool little things you can do with GitHub Actions in that way. And I want to try and make this accessible to more people.
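The scrape-and-notify pattern he describes can be sketched in a few lines of R. This is a hypothetical illustration, not his actual script: the URL, the CSS selector, and the notification topic are all placeholders (here ntfy.sh, a public push-notification service, stands in for whatever notification mechanism you prefer):

```r
library(rvest)   # HTML scraping
library(httr2)   # HTTP requests

# Hypothetical page and selector: grab the current list of works for sale
page  <- read_html("https://example.com/shop")
items <- page |> html_elements(".product-title") |> html_text2()

# Compare against the snapshot saved by the previous run (if any)
old <- if (file.exists("items.rds")) readRDS("items.rds") else character()
new <- setdiff(items, old)

if (length(new) > 0) {
  # Send a push notification (the ntfy.sh topic name is a placeholder)
  request("https://ntfy.sh/my-art-alerts") |>
    req_body_raw(paste(new, collapse = "\n")) |>
    req_perform()
}
saveRDS(items, "items.rds")
```

Run on a cron schedule, with the snapshot file committed back to the repository or cached between runs, this is enough to get a phone notification whenever the page changes.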
Maintaining the tidyverse
How does Posit balance continuing to support packages, like all the ones in the tidyverse, without the effort of keeping those packages supported leaving no capacity to do anything new?
Yeah, I mean, that's a really tough challenge. And I think the other aspect of this that we're starting to learn about is that most people tend to get sick of maintaining the same package after somewhere between five and ten years. That's sort of what happened with me and ggplot2: I love it, it's still close to my heart, but after so many years, you just want to do something else. And so the original fix was to hire Thomas Lin Pedersen to maintain it. And now he's getting sick of maintaining it. So obviously hiring a new person every time is not a terribly scalable strategy.
So I don't know, I don't have any great insights here. One of the things we're trying to be clearer about in the tidyverse is our development cadence, and how many packages, most of the time, are just lying in a fallow state: we will keep them on CRAN, we will fix any critical bugs that we come across, but they're mostly just lying there, accumulating issues, until we notice, oh hey, there's a couple of weeks' worth of work on this package, and then we'll come back to it. But in general, the thing that is tough is saying no to people. I think we have gotten a bit better at that, in terms of just being like: hey, this is a cool idea, we'd love to do it, but we just don't have the time and we're probably never going to have the time. And so we close issues for that reason, which kind of sucks. But we have to carve out the time and the energy to work on new stuff as well as maintaining old stuff.
Updates on ggvis, Mastering Shiny, and dbplyr
Yeah, it's definitely unlikely that we'll see another ggvis update. But Thomas is starting to think about what the future of graphics should look like. It's not super clear exactly what that should be, or how it ties in with web graphics and reactivity, but he's starting to do a bunch of projects to explore that. So nothing in the near future, but definitely something he's going to think more about.
Mastering Shiny, I don't have any current plans to update it. But we've talked about doing like a translation of Mastering Shiny to Python. I think that's probably more likely to happen first, just as we try and help Python folks master Shiny as well.
And then dbplyr: it seems to be a pretty important package for a lot of people, and we'll just continue to see a steady stream of fixes and improvements. I'm really grateful that dbplyr now has another maintainer, Maximilian, who has been contributing a lot of the fixes lately. It's been really great to have someone else thinking about it and caring about it. It's definitely a package with a long tail of small issues, just grinding down all these minor SQL translation problems.
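The SQL translation he mentions is the core of what dbplyr does: you write ordinary dplyr code against a remote table, and dbplyr generates the SQL for that backend. A minimal sketch using an in-memory SQLite database, assuming the DBI, RSQLite, dplyr, and dbplyr packages are installed:

```r
library(dplyr)
library(dbplyr)

# An in-memory SQLite database with a copy of mtcars
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db  <- copy_to(con, mtcars, name = "mtcars")

# Ordinary dplyr code; nothing is executed until you ask for results
query <- db |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Inspect the SQL that dbplyr generated for this backend
show_query(query)

# Run the query on the database and bring the result back as a tibble
collect(query)
```

The "minor SQL translation problems" he describes are exactly the cases where a given R function doesn't yet translate cleanly to a particular database's SQL dialect.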
I think the only thing I'll be specifically pitching in on in the database ecosystem at the moment is integration with our partners. We're now partnering with Databricks and Snowflake, and I'm doing anything I can on the open source side to make sure that if you are using Databricks or Snowflake, everything just works seamlessly and painlessly from RStudio and from our packages.
R and Python interoperability
Yeah, I think a lot of what we discussed is how to make R and Python work better together. And the story there is the same as it's been for quite a long time: a lot of the stuff that you don't want to write in R or Python, you write in C or C++ or Rust or some other high-performance programming language, and then on top of that you build a nice user interface in R or Python. We've seen that with DuckDB, and we've seen that with some interesting work in Polars. The reticulate package in R makes it easier to talk between the two languages, and the Arrow project makes it easier to share data across them. In some ways, it feels to me like it's just getting easier and easier to collaborate across these boundaries.
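The reticulate bridge he mentions looks like this in practice; a minimal sketch, assuming reticulate is installed and a Python interpreter with NumPy is available:

```r
library(reticulate)

# Import a Python module and call it from R; R vectors are converted
# to Python objects on the way in and back to R values on the way out
np <- import("numpy")
np$mean(c(1, 2, 3, 4))

# You can also run Python source directly and read its variables from R
py_run_string("greeting = 'hello from Python'")
py$greeting
```

The automatic conversion at the boundary is what makes it feel like one environment rather than two.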
One thing that Gábor Csárdi on my team did recently was write a little package for reading parquet files into R. You can certainly do that with Arrow, but Arrow is a big, heavy dependency. This one, nanoparquet, is a very, very lightweight package, and it just makes it easier to use this pretty standard file format in R if you're working with smaller datasets. So we're continuing to look for opportunities to collaborate across language boundaries.
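The appeal of nanoparquet is that the whole round trip is two function calls with no heavy dependencies; a minimal sketch, assuming the package is installed:

```r
library(nanoparquet)

# Write a data frame to parquet and read it back; no Arrow needed
write_parquet(mtcars, "mtcars.parquet")
df <- read_parquet("mtcars.parquet")

# Parquet is a typed, columnar format, so column types survive the trip
str(df)
```

Because parquet is the same format Python, Polars, and DuckDB all read natively, this is also a simple way to hand data across the language boundary he's describing.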
Yeah. I mean, it's interesting, because Python is legitimately the most popular programming language on the planet right now, and an incredibly powerful general-purpose tool. There are obviously a lot of advantages to having one tool that you turn to for every problem you could possibly imagine, and that's Python. But there are also advantages to working with tools that specialize in the problem at hand. There's just no way that in Python you're ever going to be able to interact with data as fluidly as you can with R, because of the way that missing values are built into the language, and because of the way non-standard evaluation works and allows us to write interfaces like ggplot2 and dplyr. Yeah, sure, the R community is smaller, and there are lots of things that R can't do that Python can. But at the end of the day, we can have general-purpose tools and we can have special-purpose tools. And I love R, and I have no intention of stopping developing it.
Hypothetical unlimited resources
In a hypothetical scenario where you had unlimited resources, is there a project that you would want to start or revive?
I think one of the projects that's really interesting to me is tools for gradual typing, where you add types to a programming language. TypeScript's a good example of this, and Pydantic's kind of in this space too. The advantage of adding types, of being very strict for every function about what sorts of inputs it can take and what sort of outputs it produces, is that it leads to better documentation, better error messages, and potentially much better performance. Languages with strict type systems are pretty far away from R, but there's been interesting research lately on how you can gradually add types to more dynamic languages. That would be a big project, and it's kind of hard to tell what the payoff would be in the long run, but it's something I would love to invest more in, to figure out how we can make our programs safer, faster, and more informative when they error.
LLMs and the future of data science
What's your opinion on the future of data science in the era of tools like ChatGPT? And I guess there are also GitHub Copilot and similar IDE-based tools.
Yeah, I find it very hard to imagine that you could replace a data scientist with anything less than full-on general AI, because you have to bring so much domain knowledge and expertise to any problem beyond just writing code. But I do think that LLMs are going to have a pretty transformative impact on the way that we write code. And certainly it seems like it fundamentally changes how we might teach programming, because now you can express yourself in English, or whatever human language, and a lot of the time get pretty reasonable generated code. I use Copilot. I've found it really interesting and useful, particularly for doing tedious tasks or changing things between two different structures. It seems to be pretty good at guessing the old structure and the new structure, and then prompting me.
So there are routine bits of programming that I've found it really helpful for. I think in general it's pretty good at that kind of lower-level work. But I don't think it provides much sense of direction, so there still really needs to be a human in the loop, steering and thinking about the bigger picture and the overall goal, and I think we're a very long way away from having LLMs close that gap. LLMs are really, really good at solving problems that have been solved a ton of times before, because they can just average over all of those problems as seen on the internet; on things that have never been done before, they're clearly not as good. And I guess I'm very skeptical that we can bootstrap genuine inferential capabilities just by throwing more and more data at them.
For people like yourself and those at Posit who are professional package maintainers and developers: if you have code completions enabled, do you find yourself tabbing the suggestions in? Or is it something where it's not using the correct style, or not referencing things correctly?
I'm tabbing it in quite a lot. Again, I've noticed it's really good when, say, I'm writing a function, I've written the documentation, and now I want to check that each argument is the correct type. Once I've done the first two arguments, it's pretty good at stepping through all the other arguments and guessing what function I want to call to check each one's type. It's stuff like that: I would have figured it out, and it wouldn't have taken me that long to type, but now I can just press tab and complete the whole thing instead of having to think about it. Sometimes that's a good thing, and sometimes it's a bad thing, because it's easy for subtle mistakes to creep in when you didn't really think about it or really read it. So there's definitely also a cost to tabbing in larger blocks of code, because sometimes the effort of understanding exactly what you just did is greater than the cost of writing it yourself. But I'm definitely noticing I'm relying on it more. And even if it gets half the suggestion right, it's still useful to tab it in and then delete and change the rest.
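The argument-checking pattern he describes might look something like this; a hand-written sketch, not his actual code, with `check_*` helpers modeled loosely on the standalone type-check helpers used across tidyverse packages:

```r
# Hypothetical helpers in the style of tidyverse standalone type checks
check_string <- function(x, arg = deparse(substitute(x))) {
  if (!is.character(x) || length(x) != 1) {
    stop(sprintf("`%s` must be a single string.", arg), call. = FALSE)
  }
}
check_number <- function(x, arg = deparse(substitute(x))) {
  if (!is.numeric(x) || length(x) != 1) {
    stop(sprintf("`%s` must be a single number.", arg), call. = FALSE)
  }
}

# Once the first check is written, completion tools tend to be good at
# producing the matching checks for the remaining arguments
greet <- function(name, times) {
  check_string(name)
  check_number(times)
  rep(paste0("Hello, ", name, "!"), times)
}
```

This is exactly the repetitive, pattern-following work where tabbing in a suggestion saves typing without requiring much design thought.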
Working at Posit and contributing to open source
If they want to become a data scientist at Posit, does that exist? And then just generally, if you did want to become an engineer at Posit, what makes a Posit-er, or whatever Posit calls themselves?
Yeah. So we do have a data science team. It's pretty small, about four people currently, and it's pretty scrappy. And I will say our data science team at Posit has exactly the same problems that small data science teams have at every company where they exist: it's challenging to get high-quality data, and it's challenging to create dashboards that yield actionable insights without various execs trying to change things all the time. So because we have a small data science team, if you want to be a data scientist at Posit, you have to be strong, you've got to be able to act pretty independently, and you need to know a little bit about all the parts of the data science pipeline. You're likely to be doing a little bit of data engineering, you're going to be writing R code, you're going to be writing Shiny apps and dashboards; you've got to have the full stack of experience there.
And for engineering more broadly: on the open source side, we're looking for people who have generally contributed to the R ecosystem already and are present on GitHub. It's certainly much easier to hire people we have interacted with positively in the past, because we know they're going to be a good fit with the team. But we also have a bunch of general engineers. If you're in the R community, Posit is a company that you know and hopefully respect; in the broader developer ecosystem, Posit is just a tiny company amongst thousands of others. And it's a totally different ballgame hiring general engineers, because most of them don't know about Posit at all, or what we're trying to do. There I think it's just about general programming skills and having some interest in data science.
And maybe just to close us out, there's a question from Sierra here: if you know R and you want to take the step of getting involved in an open source project, do you have any tips for someone who's never contributed to someone else's code in an open source setting before?

Yeah, I think my advice is to start small and get accustomed to the technical process of how you make a pull request on GitHub. The easiest way to do that is just to proofread documentation, because I'm terrible at proofreading, and in whatever I've been working on recently, I'm sure you can find ridiculous small errors. That's a useful contribution; it's not amazingly valuable, but it allows you to get over that first hurdle: how do I actually do all this stuff? What are the mechanics of creating a patch and submitting it on GitHub? So I really recommend that approach, and then as you get comfortable with that, you can start to look at some of the issues and try to create reprexes, and then maybe start looking at the code a little bit yourself.
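A reprex (reproducible example) like the ones he mentions is easiest to make with the reprex package: you run a small, self-contained snippet through `reprex::reprex()`, and it renders the code together with its output in a format ready to paste into a GitHub issue. A minimal sketch, assuming the package is installed:

```r
library(reprex)

# Write the smallest self-contained code that shows the problem,
# then render it; the result is copied to the clipboard by default
reprex({
  x <- c(1, 2, NA)
  mean(x)   # returns NA because na.rm = TRUE was forgotten
})
```

Because the snippet runs in a fresh session, a reprex proves that the issue reproduces outside your own environment, which is exactly what maintainers need.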
Another great opportunity, if you happen to be coming to posit::conf in August or you live in Seattle: we have a Tidyverse Developer Day coming up, which is just a day where you can come and try out that process in a very friendly, supportive environment with tons of helpers around. I'll put the link in the chat. So if you are interested and you can make it, I'd say that's 100% the best way. Otherwise, start small, figure out the mechanics of GitHub, and then start trying to tackle some of the simpler issues.
Cool, well, yeah, thank you for your time. As I mentioned, the questions we have around things like Posit Academy we'll follow up on in an email. But yeah, thank you for your time, Hadley, and I'm looking forward to seeing your workshop and hearing you talk next month in Salzburg. Yeah, awesome, thanks James, thanks Edwin, thanks for coming, everyone.
