Ask Hadley Anything
Transcript
This transcript was generated automatically and may contain errors.
So yeah, I'd like to thank everyone for attending this Ask Hadley Anything, which will run for the next half hour. useR! is coming up in Salzburg this year, with just under a month to go, and Hadley is a presenter, so there's a chance to hear more from him at that event. But today he's here representing one of our sponsors that made useR! possible.
I guess, Hadley, could you give us a quick intro to yourself and our sponsor, Posit?
Yeah, sure. Thanks, Jane. Thanks for having me. So as you probably know, I write a lot of R packages. My team and I look after the tidyverse, which is a set of packages for doing data science, and the devtools ecosystem of packages, which helps folks maintain their own packages. For example, one of the things I've been working on recently is pkgdown, the R package that we use to make websites for our other R packages, and now many other people use it for their package websites too.
So I work for Posit, which is a PBC, or Public Benefit Corporation. That's a little bit different from a traditional corporation, where the overarching goal is to maximize shareholder value. The idea of a PBC is to relax that and add goals beyond just making a ton of money. And so the goals of Posit are to really support the data science ecosystems, R and Python, and scientific publishing in general.
So we make a lot of open source software, but we also sell commercial products. Our main commercial products are Posit Workbench, Posit Connect, and Posit Package Manager, basically designed to solve problems that larger teams of R users tend to have. So Posit Workbench allows you to run RStudio or VS Code or Jupyter Notebooks on centralized compute, which allows you to have a few big servers with a lot of computing power, or perhaps you're in a regulated environment, and that allows you to run all your code in an environment where everything's set up correctly.
Connect allows you to get the results of your analysis out of your hands and into the hands of decision makers, whether that's by publishing Shiny apps or Dash apps or scheduled R Markdown reports, or by sending an email with a link to a dashboard at 9 a.m. every Friday. And then last but not least is Posit Package Manager, which allows your organization to use a shared set of packages that have all the qualities your organization might want.
So that was a very quick overview. I work on the open source side of Posit, but I'm tremendously grateful for all of my colleagues who make the products, because they make the money that pays my salary.
Current projects and IDE preferences
I'll tell you, I've been playing around a little bit with VS Code, which is a pretty interesting IDE, or not exactly an IDE; it's almost like a tool for building your own IDE. I've just been playing around with that, learning a bit about how it works, and that's been pretty interesting. It's a very, very popular IDE used by all sorts of developers for other languages, so it's been interesting to see what folks expect in a modern IDE.
What is your preferred IDE and does it change depending on what you're doing?
Yeah, I will say 90% of what I do is package development. I do a little bit of other things: a little bit of Shiny development, a few websites, a little bit of data science from time to time. So I really like to stick as much as possible with one thing so I can really get familiar with it. Currently that means a mix of RStudio, which is my day-to-day driver, and VS Code, which I'm just exploring and learning from.
Yeah, so pkgdown: I'm currently getting very close to finishing off a pkgdown release. The two big features in this release are going to be optional dark mode for your website, so that users of your package website can flip between light mode and dark mode; that's a very, very popular feature on websites. And, a little more significantly, the ability to use Quarto for the vignettes in your packages. Traditionally, people have written their vignettes using R Markdown. Fairly recently, CRAN installed Quarto, and so now you can use Quarto to write your vignettes, which brings all of the nice features that Quarto has, from better figure layout to easier cross-referencing and equations and all that kind of stuff, to your package vignettes as well.
And that's kind of a dark art, because you've got to merge the templating systems of pkgdown and Quarto together. So it's certainly never going to be 100% perfect, but I think we can get to 95% pretty easily and get most of the features that people care about working. And then as people report bugs, we'll fix them. But that should be coming very, very soon.
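For anyone who wants to try Quarto vignettes, a Quarto vignette looks much like an R Markdown one but lives in a `.qmd` file and declares the Quarto vignette engine in its header. A minimal sketch, based on the quarto R package's documented vignette setup (the title here is a placeholder):

```yaml
# vignettes/my-vignette.qmd
---
title: "My vignette"
format: html
vignette: >
  %\VignetteIndexEntry{My vignette}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---
```

Your package's DESCRIPTION also needs `VignetteBuilder: quarto` and quarto listed in Suggests so the engine is available at build time.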
GitHub Actions and R in production
Someone said that they recently saw a video where you showed an example of web scraping and GitHub automation. And they asked: is it possible to use GitHub to automatically render multiple Quarto documents to PDF and HTML and upload them to a server?
Yeah, absolutely. CRAN is now making it clearer which packages are scheduled for removal if their problems haven't been fixed. So I've been working to turn that into a little dashboard that gets automatically updated every day by a GitHub Action.
But this sort of thing, where you schedule a GitHub Action or something similar to do something and then publish it somewhere, is super, super useful and powerful. And you can basically do it today; it's just not terribly well documented. So one of the things I've been thinking more generally about lately is this issue of how you effectively use R in production. To me, the two big things that make putting R, or anything, in production challenging are that your code is now being run regularly, and it's being run on another computer. So I'm trying to do more and more of these regular things in GitHub Actions, just to better learn how you can write code that is both more reliable and, when it fails, gives you more actionable insights.
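As a sketch of the kind of workflow the question asks about: a scheduled GitHub Action that renders a Quarto project and publishes the output. This is a minimal illustration, assuming the publicly available quarto-dev/quarto-actions actions and GitHub Pages as the target; publishing to your own server would replace the last step:

```yaml
# .github/workflows/render.yml — render Quarto docs daily and publish
name: render-docs
on:
  schedule:
    - cron: "0 6 * * *"   # every day at 06:00 UTC
  workflow_dispatch:       # also allow manual runs
jobs:
  render:
    runs-on: ubuntu-latest
    permissions:
      contents: write      # needed to push to the gh-pages branch
    steps:
      - uses: actions/checkout@v4
      - uses: quarto-dev/quarto-actions/setup@v2
      - uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages  # or swap this step for an upload to your server
```

The `schedule` trigger is what turns a one-off render into the kind of regularly updated dashboard he describes.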
Another thing I did sort of along a similar lines, probably the video you saw was there was an artist I follow on TikTok and every time he sold his work, it was always sold out by the time I found out about it. So I wrote a little script to scrape his website every three hours and then send a push notification to my phone whenever it changed so that I could finally buy some of the art. So I think there's a lot of cool little things you can do with GitHub Actions in that way. And I want to try and make this accessible to more people.
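The scrape-and-notify pattern he describes can be sketched in a few lines of R. This is a hypothetical illustration, not his actual script: the URL, the CSS selector, and the notification topic are all placeholders (here ntfy.sh, a public push-notification service, stands in for whatever notification mechanism you prefer):

```r
library(rvest)   # HTML scraping
library(httr2)   # HTTP requests

# Hypothetical page and selector: grab the current list of works for sale
page  <- read_html("https://example.com/shop")
items <- page |> html_elements(".product-title") |> html_text2()

# Compare against the snapshot saved by the previous run (if any)
old <- if (file.exists("items.rds")) readRDS("items.rds") else character()
new <- setdiff(items, old)

if (length(new) > 0) {
  # Send a push notification (the ntfy.sh topic name is a placeholder)
  request("https://ntfy.sh/my-art-alerts") |>
    req_body_raw(paste(new, collapse = "\n")) |>
    req_perform()
}
saveRDS(items, "items.rds")
```

Run on a cron schedule, with the snapshot file committed back to the repository or cached between runs, this is enough to get a phone notification whenever the page changes.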
Maintaining the tidyverse
How does Posit balance continuing to support packages, like all the ones in the tidyverse, without the effort of keeping those packages supported leaving no capacity to do anything new?
Yeah, I mean, that's a really tough challenge. And I think the other aspect of this that we're starting to learn about is that most people tend to get sick of maintaining the same package after somewhere between five and ten years. That's sort of what happened with me and ggplot2: I love it, it's still close to my heart, but after so many years, you just want to do something else. And so the original fix was to hire Thomas Lin Pedersen to maintain it. And now he's getting sick of maintaining it. So obviously hiring a new person every time is not a terribly scalable strategy.
So I don't know, I don't have any great insights here. One of the things we're trying to be clearer about in the tidyverse is our development cadence, and how many packages, most of the time, are just lying in a fallow state: we will keep them on CRAN, we will fix any critical bugs that we come across, but they're mostly just lying there, accumulating issues, until we notice, oh hey, there's a couple of weeks' worth of work on this package, and then we'll come back to it. But in general, the thing that is tough is saying no to people. I think we have gotten a bit better at that, in terms of just being like: hey, this is a cool idea, we'd love to do it, but we just don't have the time and we're probably never going to have the time. And so we close issues for that reason, which kind of sucks. But we have to carve out the time and the energy to work on new stuff as well as maintaining old stuff.
Updates on ggvis, Mastering Shiny, and dbplyr
Yeah, it's definitely unlikely that we'll see another ggvis update. But Thomas is starting to think about what the future of graphics should look like. It's not super clear exactly what that should be, or how it ties in with web graphics and reactivity, but he's starting to do a bunch of projects to explore that. So nothing in the near future, but definitely something he's going to think more about.
Mastering Shiny, I don't have any current plans to update it. But we've talked about doing like a translation of Mastering Shiny to Python. I think that's probably more likely to happen first, just as we try and help Python folks master Shiny as well.
And then dbplyr: it seems to be a pretty important package for a lot of people, and we'll just continue to see a steady stream of fixes and improvements. I'm really grateful that dbplyr now has another maintainer, Maximilian, who has been contributing a lot of the fixes lately. It's been really great to have someone else thinking about it and caring about it. It's definitely a package with a long tail of small issues, just grinding down all these minor SQL translation problems.
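The SQL translation he mentions is the core of what dbplyr does: you write ordinary dplyr code against a remote table, and dbplyr generates the SQL for that backend. A minimal sketch using an in-memory SQLite database, assuming the DBI, RSQLite, dplyr, and dbplyr packages are installed:

```r
library(dplyr)
library(dbplyr)

# An in-memory SQLite database with a copy of mtcars
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
db  <- copy_to(con, mtcars, name = "mtcars")

# Ordinary dplyr code; nothing is executed until you ask for results
query <- db |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Inspect the SQL that dbplyr generated for this backend
show_query(query)

# Run the query on the database and bring the result back as a tibble
collect(query)
```

The "minor SQL translation problems" he describes are exactly the cases where a given R function doesn't yet translate cleanly to a particular database's SQL dialect.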
I think the only thing I'll be specifically pitching in on in the database ecosystem at the moment is integration with our partners. We're now partnering with Databricks and Snowflake, and I'm doing anything I can on the open source side to make sure that if you are using Databricks or Snowflake, everything just works seamlessly and painlessly from RStudio and from our packages.
R and Python interoperability
Yeah, I think a lot of what we discussed is how to make R and Python work better together. And the story there is the same as it's been for quite a long time: a lot of the stuff that you don't want to write in R or Python, you write in C or C++ or Rust or some other high-performance programming language, and then on top of that you build a nice user interface in R or Python. We've seen that with DuckDB, and we've seen that with some interesting work in Polars. The reticulate package in R makes it easier to talk between the two languages, and the Arrow project makes it easier to share data across them. In some ways, it feels to me like it's just getting easier and easier to collaborate across these boundaries.
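The reticulate bridge he mentions looks like this in practice; a minimal sketch, assuming reticulate is installed and a Python interpreter with NumPy is available:

```r
library(reticulate)

# Import a Python module and call it from R; R vectors are converted
# to Python objects on the way in and back to R values on the way out
np <- import("numpy")
np$mean(c(1, 2, 3, 4))

# You can also run Python source directly and read its variables from R
py_run_string("greeting = 'hello from Python'")
py$greeting
```

The automatic conversion at the boundary is what makes it feel like one environment rather than two.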
One thing that Gábor Csárdi on my team did recently was write a little package for reading parquet files into R. You can certainly do that with Arrow, but Arrow is a big, heavy dependency. This one, nanoparquet, is a very, very lightweight package, and it just makes it easier to use this pretty standard file format in R if you're working with smaller datasets. So we're continuing to look for opportunities to collaborate across language boundaries.
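The appeal of nanoparquet is that the whole round trip is two function calls with no heavy dependencies; a minimal sketch, assuming the package is installed:

```r
library(nanoparquet)

# Write a data frame to parquet and read it back; no Arrow needed
write_parquet(mtcars, "mtcars.parquet")
df <- read_parquet("mtcars.parquet")

# Parquet is a typed, columnar format, so column types survive the trip
str(df)
```

Because parquet is the same format Python, Polars, and DuckDB all read natively, this is also a simple way to hand data across the language boundary he's describing.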
Yeah. I mean, it's interesting, because Python is legitimately the most popular programming language on the planet right now, and an incredibly powerful general-purpose tool. There are obviously a lot of advantages to having one tool that you turn to for every problem you could possibly imagine, and that's Python. But there are also advantages to working with tools that specialize in the problem at hand. There's just no way that in Python you're ever going to be able to interact with data as fluidly as you can with R, because of the way that missing values are built into the language, and because of the way non-standard evaluation works and allows us to write interfaces like ggplot2 and dplyr. Yeah, sure, the R community is smaller, and there are lots of things that R can't do that Python can. But at the end of the day, we can have general-purpose tools and we can have special-purpose tools. And I love R, and I have no intention of stopping developing it.
Hypothetical unlimited resources
In a hypothetical scenario where you had unlimited resources, is there a project that you would want to start or revive?
I think one of the projects that's really interesting to me is tools for gradual typing, where you add types to a programming language. TypeScript's a good example of this, and Pydantic's kind of in this space too. The advantage of adding types, of being very strict for every function about what sorts of inputs it can take and what sort of outputs it produces, is that it leads to better documentation, better error messages, and potentially much better performance. Languages with strict type systems are pretty far away from R, but there's been interesting research lately on how you can gradually add types to more dynamic languages. That would be a big project, and it's kind of hard to tell what the payoff would be in the long run, but it's something I would love to invest more in, to figure out how we can make our programs safer, faster, and more informative when they error.
LLMs and the future of data science
What's your opinion on the future of data science in the era of tools like ChatGPT? And I guess there are also GitHub Copilot and similar IDE-based tools.
Yeah, I find it very hard to imagine that you could replace a data scientist with anything less than full-on general AI, because you have to bring so much domain knowledge and expertise to any problem beyond just writing code. But I do think that LLMs are going to have a pretty transformative impact on the way that we write code. And certainly it seems like it fundamentally changes how we might teach programming, because now you can express yourself in English, or whatever human language, and a lot of the time get pretty reasonable generated code. I use Copilot. I've found it really interesting and useful, particularly for doing tedious tasks or changing things between two different structures. It seems to be pretty good at guessing the old structure and the new structure, and then prompting me.
So there are routine bits of programming that I've found it really helpful for. I think in general it's pretty good at that kind of lower-level work. But I don't think it provides much sense of direction, so there still really needs to be a human in the loop, steering and thinking about the bigger picture and the overall goal, and I think we're a very long way away from having LLMs close that gap. LLMs are really, really good at solving problems that have been solved a ton of times before, because they can just average over all of those problems as seen on the internet; on things that have never been done before, they're clearly not as good. And I guess I'm very skeptical that we can bootstrap genuine inferential capabilities just by throwing more and more data at them.
For people like yourself and those at Posit who are professional package maintainers and developers: if you have code completions enabled, do you find yourself tabbing the suggestions in? Or is it something where it's not using the correct style, or not referencing things correctly?
I'm tabbing it in quite a lot. Again, I've noticed it's really good when, say, I'm writing a function, I've written the documentation, and now I want to check that each argument is the correct type. Once I've done the first two arguments, it's pretty good at stepping through all the other arguments and guessing what function I want to call to check each one's type. It's stuff like that: I would have figured it out, and it wouldn't have taken me that long to type, but now I can just press tab and complete the whole thing instead of having to think about it. Sometimes that's a good thing, and sometimes it's a bad thing, because it's easy for subtle mistakes to creep in when you didn't really think about it or really read it. So there's definitely also a cost to tabbing in larger blocks of code, because sometimes the effort of understanding exactly what you just did is greater than the cost of writing it yourself. But I'm definitely noticing I'm relying on it more. And even if it gets half the suggestion right, it's still useful to tab it in and then delete and change the rest.
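The argument-checking pattern he describes might look something like this; a hand-written sketch, not his actual code, with `check_*` helpers modeled loosely on the standalone type-check helpers used across tidyverse packages:

```r
# Hypothetical helpers in the style of tidyverse standalone type checks
check_string <- function(x, arg = deparse(substitute(x))) {
  if (!is.character(x) || length(x) != 1) {
    stop(sprintf("`%s` must be a single string.", arg), call. = FALSE)
  }
}
check_number <- function(x, arg = deparse(substitute(x))) {
  if (!is.numeric(x) || length(x) != 1) {
    stop(sprintf("`%s` must be a single number.", arg), call. = FALSE)
  }
}

# Once the first check is written, completion tools tend to be good at
# producing the matching checks for the remaining arguments
greet <- function(name, times) {
  check_string(name)
  check_number(times)
  rep(paste0("Hello, ", name, "!"), times)
}
```

This is exactly the repetitive, pattern-following work where tabbing in a suggestion saves typing without requiring much design thought.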
Working at Posit and contributing to open source
If they want to become a data scientist at Posit, does that exist? And then just generally, if you did want to become an engineer at Posit, what makes a Posit-er, or whatever Posit calls themselves?
Yeah. So we do have a data science team. It's pretty small, about four people currently, and it's pretty scrappy. And I will say our data science team at Posit has exactly the same problems that small data science teams have at every company where they exist: it's challenging to get high-quality data, and it's challenging to create dashboards that yield actionable insights without various execs trying to change things all the time. So because we have a small data science team, if you want to be a data scientist at Posit, you have to be strong, you've got to be able to act pretty independently, and you need to know a little bit about all the parts of the data science pipeline. You're likely to be doing a little bit of data engineering, you're going to be writing R code, you're going to be writing Shiny apps and dashboards; you've got to have the full stack of experience there.
And for engineering more broadly: on the open source side, we're looking for people who have generally contributed to the R ecosystem already and are present on GitHub. It's certainly much easier to hire people we have interacted with positively in the past, because we know they're going to be a good fit with the team. But we also have a bunch of general engineers. If you're in the R community, Posit is a company that you know and hopefully respect; in the broader developer ecosystem, Posit is just a tiny company amongst thousands of others. And it's a totally different ballgame hiring general engineers, because most of them don't know about Posit at all, or what we're trying to do. There I think it's just about general programming skills and having some interest in data science.
And maybe just to close us out, there's a question from Sierra here: if you know R and you want to take the step of getting involved in an open source project, do you have any tips for someone who's never contributed to someone else's code in an open source setting before?

Yeah, I think my advice is to start small and get accustomed to the technical process of how you make a pull request on GitHub. The easiest way to do that is just to proofread documentation, because I'm terrible at proofreading, and in whatever I've been working on recently, I'm sure you can find ridiculous small errors. That's a useful contribution; it's not amazingly valuable, but it allows you to get over that first hurdle: how do I actually do all this stuff? What are the mechanics of creating a patch and submitting it on GitHub? So I really recommend that approach, and then as you get comfortable with that, you can start to look at some of the issues and try to create reprexes, and then maybe start looking at the code a little bit yourself.
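A reprex (reproducible example) like the ones he mentions is easiest to make with the reprex package: you run a small, self-contained snippet through `reprex::reprex()`, and it renders the code together with its output in a format ready to paste into a GitHub issue. A minimal sketch, assuming the package is installed:

```r
library(reprex)

# Write the smallest self-contained code that shows the problem,
# then render it; the result is copied to the clipboard by default
reprex({
  x <- c(1, 2, NA)
  mean(x)   # returns NA because na.rm = TRUE was forgotten
})
```

Because the snippet runs in a fresh session, a reprex proves that the issue reproduces outside your own environment, which is exactly what maintainers need.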
Another great opportunity, if you happen to be coming to posit::conf in August or you live in Seattle: we have a Tidyverse Developer Day coming up, which is just a day where you can come and try out that process in a very friendly, supportive environment with tons of helpers around. I'll put the link in the chat. So if you are interested and you can make it, I'd say that's 100% the best way. Otherwise, start small, figure out the mechanics of GitHub, and then start trying to tackle some of the simpler issues.
Cool, well, yeah, thank you for your time. As I mentioned, the questions we have around things like Posit Academy we'll follow up on in an email. But yeah, thank you for your time, Hadley, and I'm looking forward to seeing your workshop and hearing you talk next month in Salzburg. Yeah, awesome, thanks James, thanks Edwin, thanks for coming, everyone.
