Resources

Aaron Jacobs - Auth is the product, making data access simple with Posit Workbench

video
Oct 31, 2024
21:33


Transcript

This transcript was generated automatically and may contain errors.

My name is Aaron Jacobs. I work at Posit on our professional, commercial products. I'm going to talk today about something thematic that we've changed about our pro products.

When I submitted this talk, when I was working on this back in December, I thought that maybe no one had heard of any of these features, and that few people outside of the company, or even the team that I was working with, would know about this stuff at all. This week has been very surprising, because not only do people keep asking me about this, but we now have a ton of demos in the lounge, and James mentioned this stuff in the keynote; all of that was news to me. I think this work has accelerated incredibly quickly.

And because of that, I think today I'm actually going to talk more about the strategy and the ideas behind it, and what influenced our thinking about these problems, as opposed to walking through specific features. Hopefully that is interesting.

So I think this is a pretty mixed audience, is my guess. Can I see a show of hands for people who have actually used Workbench? That's so many more than I was expecting. Awesome. Okay, so if you are not in that crowd, have you used RStudio? Like most people. Okay, so I think that a lot of people who are familiar with RStudio think about Workbench as the pro version of RStudio, especially because Workbench was actually called RStudio Server Pro for the first couple of years it existed. But I don't think this is a terribly accurate view anymore, not least because Workbench now supports four data science IDEs and, as you learned this morning, pretty soon five.

But in our own words, we would describe Workbench as, and this is what our website says, a secure and scalable developer platform for data science in R and Python. The way I generally think about Workbench is as centralized hosting for data science IDEs. And it includes a ton of features geared towards folks who are working in enterprise. That includes things like SSO support, admin-managed R and Python environments, resource controls, auditing, monitoring, a form of project sharing, and more.

What it means to be successful inside organizations

So more broadly than Workbench, I would describe our professional products as devoted to making code-first data science successful inside of organizations. One of the things that came up in the keynote this morning is that we do tend to draw the line between our open source and professional work in that way: the focus of our professional products is very much on organizations, and that tends to be the dividing line. But one thing that didn't come up very much today is that the problems you tend to encounter in enterprises are often fundamentally uninteresting, or surprising, if you're a pure open source user.

So what does it mean to be successful inside of organizations? The truth is that we are also constantly trying to figure that out. That's the process of making a commercial product for customers: you have to try to figure out what they actually want. And if we know one thing, it's that the meaning of that has not stayed constant over time. It's readily apparent to us that the capabilities that drove customers to purchase RStudio Server Pro in 2014, when we released it, are not the same things that drive them to purchase Posit Workbench in 2024. What our customers and users need out of our professional products is changing over time.

And what I want to focus on today is one of the important ways we have observed that data science inside of organizations is changing, and the ways that our professional products are adapting.

I'm sure plenty of pedagogical material gave you the idea that all data science would begin and end with tidy, self-contained CSV or maybe Parquet files that were just waiting to be analyzed. I like to describe that as the "download the training data and get started" world view. And as most of you probably know, the reality is much more complicated.

So a key part of data science work is just confronting the fact that data is messy, inconclusive, and incomplete; its provenance is unclear, and its value is often unclear. Plenty of folks at this conference will have something to say about this problem. But I want to focus a bit on a second, very unglamorous reality of data science work inside of organizations, which is that getting access to data at all can be a struggle.

The data access problem in enterprises

So for much of Posit's early history, we mostly saw enterprise customers manage access to data, if they were doing so at all, via Windows file shares and on-prem database management systems like Oracle or SQL Server. And over time, our enterprise products ended up working reasonably well in this context. But over the last five years or so, because everything is slow in enterprise, we've seen a pretty dramatic shift among our customers to the cloud, surprise. And a lot of the fairly simple access control mechanisms we were accustomed to, and the credential paradigms we used in this on-prem context, have not served folks well in this brave new cloud world.

So to give a hopefully humorous example of this: suppose you or your team inside of your organization are tasked with analyzing customer churn, a classic problem, right? When you get started on this project, you know that your company stores basic customer data in Salesforce, so you know you'll eventually want to access it via the Salesforce API. It turns out that this means asking your IT department to register a Salesforce external app for you, which takes a few weeks, and when it comes back it comes with a warning not to embed the OAuth client credentials in scripts or reports, whatever that means.

Next you want to understand how your customers' experience in your web app might have influenced their desire to churn. To get that data, you might actually need to ask a friend on the development team, because you want them to run a query against the production database, which is in, say, AWS Aurora, because the activity data you actually need hasn't yet been persisted to the internal data lake. The internal data lake uses Azure SQL Server, as it turns out, because in the year that project started, Microsoft gave them more cloud credits than Amazon. Meanwhile, the CIO is saying everything is going to move to Snowflake starting in the fall, but they've been saying that for three years, and you don't really know whether that's still a real thing at all.

If you're counting, that's three different SQL dialects you need to know at this point. Meanwhile, you've heard that one of the other departments has been playing with Spark on Databricks, and it seems like maybe you could get access to cool streaming data there to make a nice dashboard. But they're hesitant to give you access, because they want to make sure the compute costs for your project don't get attributed to their department. Then you discover through Salesforce that some important aspects of customer information are actually only available in the original invoices, which are PDFs stored in an AWS S3 bucket. And they're in an S3 bucket because that's what was used by an internal tool that is currently unmaintained but critical to the success of the sales organization.

And then it turns out that it's not the AWS account you normally use but a different one, and that one by policy does not allow you to use IAM user credentials, so now you have to figure out how to get IAM role assumption working. Then, to top it all off, customer success only tracks their meetings in a Google spreadsheet, which has cross-references to Google Docs with their notes from those calls, which are the only place you could find a textual source that might actually contain hints as to why those customers churned, and which you need a GCP service account to access. And so on, and so on, and so on.

So hopefully some of you find this amusing; for some of you it may trigger PTSD. It certainly does for me. And I think this is an exaggeration, but having talked to a reasonable number of customers, I think it is only a slight exaggeration.


Common themes in enterprise data access challenges

So the common themes we hear today when we're talking about the challenges of getting access to data inside of organizations would be, first, complexity. Although there's a limited number of cloud or data platforms — essentially AWS, Azure, GCP, Snowflake, and Databricks are, from our vantage point, the biggest ones we see — many customers are using more than one, and many customers are actually using all of them together. And everyone seems to be in the middle of a years-long migration to an internal data platform that has yet to materialize, so you are even more likely to be pulling data from multiple places in the same project.

The second thing we see is new forms of authentication and authorization. The native language of these cloud platforms is IAM and OAuth, not usernames and passwords. And the third major theme is more scrutiny about the use of credentials. When I worked as a data scientist, the entire department, actually the entire company, shared a single database account and password. This would have been considered wildly insecure even at the time, but it would be wildly inappropriate today. IT folks are increasingly unhappy with data scientists managing long-lived credentials themselves. They really do not want you to embed your personal Snowflake username and password inside of a Quarto report, which is exactly the formal guidance that Snowflake sent me via email and PDF a year ago.

Not only have the technologies changed, the expectations have changed as well. Actually, something a co-worker reminded me of this morning: there was a big data breach at Snowflake about six weeks ago. It was caused by someone using a service account without single sign-on, unsurprisingly, because it's a service account, so you can't do single sign-on with it directly anyway.

I think that something that Posit has always valued both on the open source side and in its professional products is meeting people where they are, helping data scientists be heroes inside of their organizations. And right now we think that an increasing amount of heroism is coming from the ability of our users to navigate these complex data platforms.

Workbench managed credentials

So with that in mind, beginning about two years ago, we started building deep integrations into Workbench to manage data and cloud platform credentials for you. And as of the Workbench release this week, I hope it's this week, that includes managed credentials for AWS, Azure, Databricks, and Snowflake.

So what this means in practice, because you're all nerds here, is that we've taught Workbench to speak these new languages like IAM and OAuth. The idea is that when you arrive in an IDE like RStudio, all of the credentials you need to access your organization's data are already present, quietly, in the background. You don't need to manage an AWS access key, you don't need to manage a Databricks personal access token, you don't need a Snowflake service account. You don't, as a data scientist, need to do this work yourself. This makes not only you happy, it also makes your admins and hopefully your IT folks very happy as well.

Now, on the subject of being nerdy, you may be interested to know that the way this works under the hood is that Workbench manages configuration files and environment variables for you. We do things like run through the OAuth flows, refresh tokens for you, and generate these configuration files. And we've been very, very careful to make this as compatible with existing tools as possible. So you'll find that the AWS CLI just works in RStudio, the Databricks SDKs just work inside of Visual Studio Code in Python, the Snowflake CLI works. An amazing percentage of this work was actually just very carefully reading the public source code of these SDKs, figuring out how they manage credentials, and teaching Workbench to operate in that paradigm. And, as Edgar talked about a little bit today, we then took a lot of what we had learned by looking particularly at the Python ecosystem and ported some of those credential paradigms to R, largely through the odbc package but also through other mechanisms.
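To make the "configuration files" idea concrete: tools like the AWS CLI and SDKs read credentials from the standard shared-credentials (INI) file, so a manager process can hand out short-lived credentials just by rewriting that file. This is a minimal illustrative sketch of that pattern, not Posit's actual implementation; the function name is hypothetical.

```python
import configparser
from pathlib import Path

def write_shared_credentials(path, access_key, secret_key, session_token,
                             profile="default"):
    """Write short-lived AWS credentials in the standard shared-credentials
    (INI) format that the AWS CLI and SDKs read, e.g. ~/.aws/credentials."""
    config = configparser.ConfigParser()
    config[profile] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "aws_session_token": session_token,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        config.write(f)
```

A background refresher can simply call this again with new values whenever it rotates the token; CLIs and SDKs pick up the new credentials on their next read, which is what makes this approach compatible with existing tools.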

What's next for managed credentials

So, what's next for us on managed credentials? I said we currently support four major providers; some of those have paper cuts we would like to improve, and I think that will certainly happen for AWS and Azure. But one of the things that happened as a result of this project is that introducing these capabilities into Workbench changed how we thought about what Workbench could do as a product. A good illustration of this is that Workbench has supported the idea of ad hoc jobs for a while, but it was not really clear what the advantage of using that feature over just running R at a terminal would be. With managed credentials, we actually have a really nice answer, which is that jobs can use credentials. So you can be working on a project and need to run a simulation that takes six hours, so you run it overnight, and it uses some Databricks data. Before, we wouldn't really have had an answer to that. But with managed credentials, that job can run for six hours, and when it needs to pull some new data from Databricks five hours in, Workbench will have refreshed the credential for you, so it'll work.
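The refresh idea behind that long-running-job story can be sketched in a few lines. This is an illustrative pattern, not Workbench's actual code, and the class and parameter names are hypothetical: a token is proactively re-fetched shortly before expiry, so any work that outlives the original credential still sees a valid one.

```python
import time

class RefreshingToken:
    """Hands out a token, transparently re-fetching it shortly before expiry.

    `fetch` is any callable returning (token_value, lifetime_seconds);
    `margin` is how many seconds before expiry we refresh proactively.
    """

    def __init__(self, fetch, margin=60.0):
        self._fetch = fetch
        self._margin = margin
        self._value = None
        self._expires_at = 0.0  # monotonic-clock deadline; 0.0 forces a fetch

    def get(self):
        # Refresh when we are inside the safety margin (or have no token yet).
        if time.monotonic() >= self._expires_at - self._margin:
            self._value, lifetime = self._fetch()
            self._expires_at = time.monotonic() + lifetime
        return self._value
```

A six-hour job would just call `get()` whenever it needs a credential; the hour-five Databricks pull gets a fresh token without the job ever handling refresh logic itself.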

It also allows us to tackle problems we've had for a long time and didn't know how to solve, like the paper-cut issues around how people use git inside of Workbench. It's very painful to manage SSH keys, and we don't have a great OAuth story — hey, wait, we do have a great OAuth story now, can we make that work? All of this to say, I'm very excited about the direction this sets for the product and its future.

The last thing I'll say is that we have not forgotten about publishing. It's not enough just to have a great authoring experience; we know that these cloud credentials are now vitally important for the kinds of data products people are producing: reports, emails, dashboards, wonderful Shiny and Shiny for Python applications, Streamlit apps, Voilà apps, the diversity is wide. When I was preparing this talk, there was nothing to say about this, but as of last week, Connect has publicly released its first round of support for these integrations as well. And now I'm even more hopeful about this future, because we can probably tell a great story where you move from working on data science projects that use all of these new cloud credential systems to a seamless publishing workflow that picks up on those same credentials. James also mentioned in his keynote this morning that there are some very cool aspects of this Connect integration for viewer-based permissions.

So, okay, that's it; I can answer questions. I will say, if you're interested in this at all, definitely drop by the lounge. The eight people in the world I could pick to talk about this are basically all here, including most of the engineers who work with me on these projects across both products, and also solutions engineers who give better demos than I could ever aspire to in my life. So thank you.

Q&A

Thank you, Aaron, and as a solutions engineer I'll say some of our demos aren't quite as good as maybe you'd like to think they are. But you know, I gotta say, I've had the experience of watching a solutions engineer give a demo of a feature that I wrote and thinking, wow, you did a much better job. So you guys are great.

Okay, so a couple of questions. Well, this first one isn't a question, it's just an exclamation: GitHub credentials, yes please. Yeah, yes, we'll see; I have to argue with Tom Hawk, who's the product manager, about feature spots for Workbench. And I guess it's not just GitHub; there's GitLab and Bitbucket too. Okay, I won't say more, but yeah, it's on our mind.

Is this solution also available in the free RStudio download? No, because arguably it doesn't really make sense there. If you're operating on desktop, a lot of this OAuth stuff works through a browser flow and typically uses a localhost redirect, so all of those workflows already kind of work. The place they're really painful is in server environments. And in particular, if you're working on desktop, you want to be managing this stuff yourself, because you're operating as an individual. One of the main benefits of Workbench, which is a trade-off, is that you're moving to a more admin-controlled environment, and that is a trade-off; there are definitely uncomfortable aspects of not working on your desktop. I could say more about this; there are various technical and also social reasons why it doesn't really make sense for these features to be available in the open source version. And this is actually an illustration of the point I tried, clumsily, to make at the beginning: we sometimes think, could we not just put enterprise features in our open source products? But a lot of the time they're not as portable as you think they are, and often they rely on assumptions that are not possible in an open source context, or are fundamentally uninteresting to open source users. So I would say no, these things are not coming to our open source products, though some of this did land in open source, like the odbc work, for example; that's all open source.

Are these integrations only for the RStudio IDE? Ha, no, they are not; they already work in VS Code. Of course we made them work in VS Code, because we knew Positron was coming, so they will work on launch when Positron comes to Workbench. And we also intend to port them to JupyterLab. Don't get me started on how difficult that is, but we'll do that too.

Can SAML or OIDC connections be latently asserted by Workbench? I don't actually know what that means. I will say there's already machinery in Workbench to do clever things with OIDC; that's how the AWS integration works. Don't get me started on SAML and how it lacks a refresh protocol and how that makes everything terrible. If you're really interested in that, come find me after and maybe I can answer your question.

Are these integrations only for individual Snowflake or Databricks user accounts, or can you use a service account? They are focused on the IDE user. I think there's a larger theme here, which is that these platforms are increasingly moving away from things like service accounts. If you go to the Databricks docs, they say please do not use personal access tokens anymore, please use OAuth. You go to the Snowflake docs, and they say please do not use service accounts. You go to the IAM user page in AWS's documentation, and it says please do not use this, please use short-lived credentials. So I think we're all going in this direction of short-lived, individually attributed credentials. That said, these mechanisms can basically all be individually turned off when needed, so you can make all the existing patterns work; we don't close the door on using other methods.

Yeah, thank you, Aaron, and thank you to the rest. I would also say: don't get the impression that I did all of this work myself. One thing about a lot of conference talks is that you're talking about open source work that you've done individually. Here, so many people internally at Posit, and also externally, contributed to these features, and they were done over the course of two years, as you can imagine.