Resources

Databricks x Posit | Improved Productivity for your Data Teams

video
Dec 5, 2023
1:07:34

Transcript

This transcript was generated automatically and may contain errors.

All right, well, welcome, everyone. We're excited to see you all here. My name is James Blair, and I work as a product manager at Posit for cloud integrations. I'm excited to be here with you all today to talk about some of the recent work that we've done with Databricks, the partnership that we have with them, and what we're excited about. I'm joined today by Rafi Kurlansik, who is a product specialist at Databricks, and we'll be walking through some of the Databricks perspective on the partnership, as well as the work that we're doing together. So I'm actually going to turn the time over to him to get things started today. But we're excited to be here with you all.

Introduction to Databricks

OK, awesome. Thank you so much, James. I'm going to go into a little bit of who we are at Databricks and what we're all about. Our mission is to democratize data and AI for the entire world, and the strategy we use to accomplish this mission is twofold. First, we have created three very successful open source projects: Apache Spark, which is scalable in-memory data processing; Delta Lake, which is scalable data storage and management; and MLflow, which is used to manage the entire machine learning lifecycle. In addition to this commitment to open source, we have also developed a managed service that is beloved by many of our customers. We have over 6,000 employees across the world, we're very successful, thankfully, in terms of our revenue and investment from top firms around the world, and we're the inventor of the data lakehouse paradigm, as well as a pioneer in generative AI.

Now, what I would like to go into is our view that data and AI are universal. Everyone already knows that this has impacted every single industry, every single sector of the economy. And we believe that the companies that win in their industry will themselves become data and AI companies. This is easier said than done, though. If you look at what data and AI means, it's actually a collection of very different but related technologies. So, for example, you have everything from data warehousing and BI to streaming, real-time analytics, data science, and machine learning. And then you have to govern all of these different use cases and technologies.

Not to mention the emergence of generative AI and the promise that it brings to the table. So it's very, very hard for organizations to wrap their heads around all of this and be successful with all of it. Our effort has really been to simplify all of these things, and we've done so with the data lakehouse.

So if you're not so familiar with the lakehouse, let me walk you through it quickly. The term lakehouse is essentially a portmanteau of data warehouse and data lake. You get the structured data, performance, and consistency that you have with a data warehouse, alongside the unstructured data and open nature of a data lake. Our data lakehouse is built on an open data lake: you use cloud object storage, S3, ADLS, or GCS, to store all of your data, and then you can unify your data under one single storage and management system, with Delta Lake as the underlying technology. Once you have all of your data in storage and you're managing it with Delta Lake, you can secure and govern it with Unity Catalog. And once you've secured and governed all of your data, one copy of the data, then you can unlock these different pillars of data, AI, and analytics.

We pioneered this concept in 2020, and we're really excited and happy to report that it has been pretty successful: 74% of global enterprises have adopted the lakehouse architecture. So this really speaks to the power of it and its simplicity and unification.

All right, now that you understand a little bit about what a data lakehouse is, this open, unified foundation for all of your data, what happens when you combine that kind of foundation with generative AI? How does that change things? We think it actually introduces a new type of data platform: the data intelligence platform. The concept of a data intelligence platform furthers the democratization of data and AI. It makes it more accessible to more people than ever before.

So practically, how does that work out? What does this really mean? I want to touch on two elements. The first is that you want to be able to use generative AI to understand the semantics of your data so that you can ask questions of your data in natural language. That's a huge part of what a data intelligence platform is, and we are building Databricks to support that type of interaction with data. The other way that you can use generative AI is to take open source models and combine them with your own data so that you can build applications on top of it.

So that's what I'm going to focus on a little bit more today when we go through a demo. But rest assured that Databricks is working on both of these components of a data intelligence platform: for end users who want to ask questions of their data, and for developers who want to build with generative AI.

Databricks products for Posit integration

Okay, so how do you access this data intelligence platform using Posit products? James is going to focus a lot on this question, and I'm going to show a little bit as well. But before I hand it over to James, I want to talk quickly about the Databricks products I'm going to be using that enable this type of activity.

The first is going to be Databricks Connect version 2. Databricks Connect is a lightweight client library that you can install on your laptop or any server from which you want to run a remote connection to Databricks, and it allows you to use Apache Spark in a very simple way. So it's very, very useful when you're working in your IDE and you want to do exploratory or development work in the comfort of your IDE. But it's also very useful when you want to develop a data app, like a Shiny app or a Streamlit app, and you want to be able to access Databricks and all the data in there that's secure and governed.

The second thing I'm going to be showing you is the Databricks extension for VS Code. I'm going to show you how easy it is to get set up, how you can find it in the VS Code marketplace, and how it lets you use all the features that you like about VS Code while still being able to run code against Databricks and work with the data that you have there.

And then lastly, I'm going to be showing a little bit about Databricks Model Serving. This is our online, real-time inference for machine learning models. It is fully managed, including support for GPUs. It's one-click deployment of models, it's serverless, and it scales up and down seamlessly. We've done special optimizations specifically for serving large language models. And you can also set up monitoring of the inference: you can set up tables that collect the inferences that you're making and help you set up monitoring out of the box with a few clicks.

Introduction to Posit

Hey, thanks, Rafi. I think it's clear that Databricks has a lot to offer and a lot to bring to the table with this data intelligence platform. And we're excited to talk about the ways in which Posit has worked closely with Databricks to make the experience for our joint customers even better. To highlight this, I want to talk a little bit about the history of Posit as a company, which may be familiar to some of you and not to others. I think it will be good to give some context on where Posit comes from and how we fit into this picture.

Okay, so Posit has its roots in open source software, primarily in the form of the RStudio IDE. If you've ever used the R language, odds are at some point you've used the RStudio IDE, which is where Posit has come from. We continue to be the maintainers of this open source IDE today, and it is the preferred development environment for developers and data scientists working with the R language. We also contribute heavily to the open source community through a variety of open source packages. These include a number of popular R packages: the tidyverse and its collection of packages for data analysis and data science, Shiny for interactive applications, Quarto for dashboarding and reproducible reporting, Reticulate, which allows R users to work with Python directly from the R language, and many, many more.

One of the things that's exciting as we look at where Posit has been and where we're going is our renewed and continued investment into not only R but also Python. We focus not exclusively on a single language, but instead on promoting open source technology for data scientists, regardless of what language that might be in.

Just a little bit of historical context, and this is not comprehensive at all, but it gives some sense of what Posit is as a company and who we are. The RStudio IDE was first introduced in 2011, again, as an open source tool that's available to download, and it continues to exist today as an open source desktop installation that anybody can go and use. That's something that we firmly believe in: providing those open source tools to give anyone in the world access to the tools necessary to be data literate and to make informed decisions based on data they have access to. Shiny, the package for building interactive web applications in R, was introduced in 2012. Our first enterprise software, where we were charging customers to run software in an enterprise environment with additional features that were attractive to the enterprise, was introduced in 2014. Then in 2020, we announced that we were becoming a public benefit corporation, and included in that effort are some specific promises that we make to our community as part of our company mission. And in 2022, the company rebranded from RStudio, which had been the name of the company up until that point, to Posit.

So if you've ever worked with RStudio, the company, we are one and the same. We just now go by the name of Posit, which helps us push forward our mission of open source data science, regardless of the language of choice. One thing that's pretty unique about how Posit operates is that a lot of what we do today on the engineering side is focused on open source technology and open source packages. Again, we continue to support the open source RStudio IDE, which remains very, very popular amongst data scientists and users, and we also have this growing collection of packages that we work to support as well. We firmly believe that this open source commitment is beneficial across the board. It helps users get early access to tools and technology that enable them to be successful in working with data at no financial cost. It also enables them to take those tools into a professional context, into a commercial environment, where we can support those tools with our enterprise products that customers purchase. And then that purchase, the income that we receive there, gets reinvested back into the open source side. This is something that we refer to often as the virtuous cycle.

When we talk about the commercial side or the enterprise side of Posit, there are really three main components that we'll talk about. Each of these will be highlighted to some extent in our conversation today as we talk about Databricks. But just to give those who may be unfamiliar an idea: we offer a tool called Posit Workbench, which enables users to access common data science development environments through a browser in a centrally managed way. Through Posit Workbench, you have access to the RStudio development environment, VS Code, Jupyter Notebook, and JupyterLab. And there are a number of different ways that Posit Workbench can be configured to run within an organization and can be designed architecturally to fit whatever type of environment an organization is running in.

As a complement to Posit Workbench, we offer Posit Connect, which serves as a bridge between the developers and data scientists and the end users. Posit Connect allows developers to publish and share the work that they've done inside of a development environment, whether that work is something interactive, like a Shiny application built in R or a Streamlit application built in Python, or even a web API built in R or Python. Those can be published to Posit Connect. And then other users in the organization, whether they're business users, decision makers, or even downstream processes that consume APIs, can utilize the work that's been done by the data scientists and developers, because it's hosted and shared in a central location that's secured and governed according to the rules of the business and the organization.

As a kind of behind-the-scenes component to all of this, we also offer Posit Package Manager, which enables organizations to provide a secure and essentially regulated way for users to access R and Python packages. Organizations can define which packages are part of their ecosystem. Posit Package Manager supports snapshotting by date, so if I had an analysis that I worked on two years ago and I need to roll back to the packages as they were at that point in time, Posit Package Manager makes all of that very simple and very straightforward. It's a benefit to organizations where reproducibility and security are of the utmost importance.

So what we'll talk about now as we move forward today is how all of these different pieces, the open source side and the commercial side, work together with Databricks. What is it that we're doing today, and what do we plan to do in the future, to make life easier for those organizations that happen to be using Posit and Databricks together? Collectively, we refer to the commercial products that we offer as Posit Team. Again, Posit Team is Posit Workbench, Posit Connect, and Posit Package Manager. What we look forward to, and what we're going to talk through today, is how Posit Team can sit alongside this data intelligence platform that Rafi's described and interact in a variety of different ways with that platform, leveraging all that Databricks has to offer: data governance, security, compute power, distributed computing, model serving, all these things that Rafi's touched on already. We can leverage those and make use of them from inside of these different products that Posit offers.

Demo of VS Code and Model Serving

Awesome. Thank you, James. All right. So I'm going to go through a quick demo here, and I want to start with Posit Workbench. So for people who may not be so familiar, you sign into Posit Workbench, and then you're able to basically launch different sessions for different development environments that you would like to use. So you can create a new Jupyter notebook session, JupyterLab, RStudio if you want to develop in R, and then, of course, VS Code, which is preferred by many for Python as well as Jupyter.

Now, I've already gone ahead and created a VS Code session. But if I wanted to, we could create more, and I could have multiple different IDEs going at the same time. My VS Code session is over here, and I want to start off by just showing how we can get access to the VS Code extension. From VS Code, I can come over here to the Extensions tab, and if you just search for Databricks in the marketplace, you'll find that the very first result is owned by the Databricks organization. You can install it, and then you're good to go. It's that simple.

Once you have the extension installed, you'll see this icon over here in the left nav bar. The first time that you open this up, you'll be prompted to configure it, which will ask you which Databricks workspace you want to sign into. So you can specify the URL for your particular workspace here. Once you choose that, you have the option of signing in via OAuth, or you can use the Databricks CLI. I've already gone ahead and installed the Databricks CLI here. Actually, James has done that on my behalf. And I've configured it and logged into Databricks through the CLI as well. The point being that it's very, very simple to sign in to Databricks through the UI just by clicking. You don't have to go through and mess with environment variables if you don't want to.

And once you set up the connection to the Databricks workspace, you'll see this left panel be populated. You'll see which workspace I'm connected to and which cluster in the workspace I'm connected to. I can browse the various clusters that I may want to connect to and choose which one to attach my session to. And this is going to allow us to use Databricks Connect and run files on this particular compute resource that lives in Databricks.

The last thing I'm going to cover, and I'm not going to do an exhaustive view of the VS Code extension because of time constraints, but just to call it out: you can also choose a destination in the Databricks workspace to sync your code to. This is very, very helpful if you want relative paths to work when you run code on Databricks from VS Code. You basically have a copy of your code in Databricks so that all the relative paths resolve correctly. But it's also helpful if you want to do some development work in VS Code and then flip over to a notebook in Databricks to use some of the features we have there as well.

All right, so I'm going to start off. I've already gone ahead and connected to this particular compute, and I'm going to go through a Jupyter notebook that I have, showing how we can use Databricks Connect. First, just to remind everyone about Databricks Connect: this is essentially a library that is installed on this Posit Workbench server, which is running in the cloud outside of Databricks; this is a remote connection. Normally we just create a new Spark session, and then any PySpark code that we use, any Spark code in the DataFrame API, will be run remotely on Databricks. If we were to run any pure Python code, that would still be running on the machine where this is installed. So it's important to keep in mind that Databricks Connect allows you to use Apache Spark to process data in the cloud, but anything that's not using Apache Spark is going to be executed on the local machine.
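As a rough sketch of that split, the comments below mark where each line executes. The table name is illustrative, and actually running this requires `databricks-connect` installed and a workspace configured (for example, via the Databricks CLI), so this is a sketch under those assumptions, not the demo's exact code:

```python
# Hedged sketch of Databricks Connect v2: Spark DataFrame operations run on the
# remote Databricks cluster, while plain Python runs wherever this script runs
# (here, the Posit Workbench server). The table name is hypothetical.

def remote_row_count(table: str) -> int:
    # Import is deferred so the sketch can be read without a workspace available.
    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.getOrCreate()  # connects using CLI/OAuth config
    df = spark.read.table(table)   # remote: resolved and planned on the cluster
    return df.count()              # remote: the Spark action executes on the cluster
    # Anything using only plain Python objects after this runs locally.
```

Calling something like `remote_row_count("main.default.ufo_sightings")` would push the read and count to the attached cluster and hand back a plain Python integer to the local session.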

Okay, now one of the nice things about the VS Code extension is that it has a particular integration with Databricks Connect: if you're using a Jupyter notebook in VS Code, it will actually create the Spark session for you. So here you'll see I'm going to read from a UFO sightings table that is in my catalog, under the default schema, and then the UFO sightings table. So I'm accessing data that's in Unity Catalog here. I'm bringing it to pandas because I just want it to render a little more nicely. And this is a very interesting data set, because it's essentially records of UFO sightings in the United States.

You can see which particular country, which state, and what time it was. The thing that I want to focus on is the text: the text is the actual report itself, what someone observed. I live in New Jersey, so I'm going to filter down to the state of New Jersey, and we're just going to take a look at the text of one of these sightings. So we have "an object like a plane without wings" and so on and so forth. There's a lot of descriptive text here. It would be challenging to go through and read all of these reports and draft up an analysis of the similarities and discrepancies between the various reports. I think this is a very good use case: take all this text data, give it to a language model, and ask it to summarize it for us.
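The filtering step itself is plain pandas and runs locally once the table has been pulled back (for example, with `.toPandas()`). A toy stand-in, with column names assumed from the walkthrough, looks something like this:

```python
import pandas as pd

# Toy stand-in for the UFO sightings table; in the demo this DataFrame would come
# back from Databricks Connect, e.g. spark.read.table(...).toPandas().
# Column names and report text here are made up for illustration.
sightings = pd.DataFrame({
    "state": ["NJ", "CA", "NJ"],
    "text": [
        "an object like a plane without wings moved silently overhead",
        "bright light hovering over the bay for several minutes",
        "three lights moving in formation, then vanishing",
    ],
})

# Filter down to New Jersey and inspect one report, as in the demo.
nj = sightings[sightings["state"] == "NJ"]
print(nj["text"].iloc[0])
```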

So I'm going to show how we can do that with Databricks Model Serving. The Model Serving product on Databricks is remarkable because it allows you to serve, via REST API, every type of machine learning model that you would want. Everything from classic machine learning, everything that we were so invested in up until the rise of generative AI, you can serve all of those on Databricks, and this is done in a secure and scalable way. This also applies to foundation models. We do the work to go get the best state-of-the-art open source models so you don't have to do all that research, and we'll actually have those hosted in your environment for you, and you'll be able to call them very simply.

And then third, we can provide a secure route, or gateway, that lets you manage commercial offerings from companies like OpenAI, if you want to use those, in the same place that you manage your custom models and your open source foundation models. We provide the means for you to do that as well. So we bring all these different kinds of machine learning models into a single serving product.

Now, foundation models in particular, these are the state-of-the-art open source models. They're really good for summarization, for content generation, and for building a chatbot using RAG, or retrieval augmented generation. And just FYI, this is per-token pricing, and the Foundation Model APIs are in preview. If you're interested in this kind of thing and you like what we're showing you today, you should definitely reach out to your Databricks rep.

Now let's actually take a look at it. What I'm going to do here is take just eight observations from the DataFrame that we created before, and we're just going to take the text for those eight observations. So here are these UFO sightings from New Jersey; we have eight observations. Now what I want to do is access a Llama 2 70B chat model that we have hosted in Databricks, and just call it from VS Code right here.

I've already gone ahead and installed this inference SDK, so I'm not going to run the install right now. Then let's start here: we'll import this ChatCompletion, and we'll define the prompt, which is going to be: you're a helpful assistant, you're going to summarize UFO sightings and explain the similarities and differences between them, and summarize them in three bullet points. We want this to be pretty snappy and pretty concise. And then to actually make that call, we just have several lines of code. Very, very simple. We pass in the system prompt and we pass in the sightings object that we created before, that blob of text. And then here's our response.
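A hedged sketch of that call, assuming the `databricks-genai-inference` package: the model name, argument names, and response shape here are assumptions based on the demo, and running it requires a Databricks workspace with the Foundation Model APIs enabled, so verify the details against the package's documentation:

```python
# Summarize a blob of UFO sighting reports with a hosted Llama 2 70B chat model.
# Model name and call signature are assumptions; this is a sketch, not the
# demo's exact code, and it requires workspace credentials to actually run.

SYSTEM_PROMPT = (
    "You are a helpful assistant. You will summarize UFO sightings and explain "
    "the similarities and differences between them in three bullet points."
)

def summarize_sightings(sightings_text: str) -> str:
    # Deferred import: the package and credentials are only needed at call time.
    from databricks_genai_inference import ChatCompletion

    response = ChatCompletion.create(
        model="llama-2-70b-chat",  # assumed endpoint/model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sightings_text},
        ],
        max_tokens=512,
    )
    return response.message
```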

So we have three bullet points. It says all the sightings describe objects or lights in the sky that can't be identified as man-made or natural. Many of them are described as moving quickly or hovering, and some as disappearing or changing direction suddenly. And let's take a look at some of the differences: some of the sightings describe the objects as being very large, while others describe them as being relatively small, and so on. So I think this is a really nice way to combine Databricks Model Serving with a challenging data set, and leverage some of the superpowers that LLMs actually have.

One last thing I want to call out before I move on: notice how I didn't have to specify any credentials or anything like that when making these API calls. All of the Databricks SDKs that have come out in the last year have been designed to reuse the credentials that are created when you sign in with OAuth or when you use the Databricks CLI. So this is a much more secure and easy way to use these SDKs.

Now the second example that I want to show you uses the Python SDK. Databricks recently released a pure Python SDK for interacting with the Databricks REST API, and I want to show you how simple it is to call a language model on Databricks using it. We'll import the workspace client from the Databricks Python SDK, instantiate it, specify which model serving endpoint we're going to call, and then give our prompt. In this case, we're not looking at the UFO data anymore. We're going to be calling a model that has been built on top of a RAG architecture: it indexes all the latest Databricks documentation, stores it in a vector database, and then uses those embeddings to return a more accurate response to us when we query.

Querying it is also super simple. We use the workspace client and the serving endpoints service, and then we pass in the query, which says go hit this model, along with the prompt that we want to pass. For this one, I'm going to use the Run Python File functionality here in VS Code, and we'll just wait a few seconds; we should be able to see the results here. Okay, so what does our language model have to say? Databricks Model Serving deploys your MLflow machine learning models and exposes them as REST API endpoints. Serverless compute resources, easy configuration, high availability and scalability, dashboards, monitoring. All sounds pretty great.
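A hedged sketch of that pattern with the Databricks Python SDK (`databricks-sdk`): the endpoint name here is made up, and the client picks up the OAuth or CLI credentials mentioned earlier, so no tokens appear in the code. Treat this as a sketch under those assumptions rather than the demo's exact script:

```python
# Query a model serving endpoint through the Databricks Python SDK.
# The endpoint name is hypothetical; running this requires workspace
# credentials (e.g. from the Databricks CLI) and a recent databricks-sdk.

def ask_docs_assistant(question: str, endpoint: str = "databricks-docs-rag") -> str:
    # Deferred imports: only needed when a request is actually sent.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

    w = WorkspaceClient()  # reuses OAuth/CLI credentials automatically
    response = w.serving_endpoints.query(
        name=endpoint,
        messages=[ChatMessage(role=ChatMessageRole.USER, content=question)],
    )
    return response.choices[0].message.content
```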

Okay, now I'm just going to show this quickly, and then I'm going to hand it back over to James. This is in Databricks itself. This is the model that we were calling, and you can see some of the dashboards that exist here. It's beyond the scope of the presentation to go into detail, but I just wanted to show that this is the model that we were calling.

Okay, so to recap: we showed how you can launch VS Code in Posit Workbench and how you can access Databricks by installing the VS Code extension. You can use Databricks Connect to have interactive sessions with the data that's being managed in Databricks and Unity Catalog. And we also showed how you can access Databricks Model Serving, both through the Python SDK and through the Foundation Model APIs. All right, and now I will pass it back over to you, James.

Demo of signing into Posit Workbench via OAuth

Excellent. Thanks, Rafi, for walking us through that demo. I think, again, it's clear that Databricks has a lot to offer in regards to the ways in which we interact with modern tools in the data stack. And it's really great to see some of those become more and more available to users from whatever environment they want to be in, whether it's VS Code or something else. We're going to continue that theme here as we talk about some new changes that we're introducing into Posit Workbench. A lot of what we're going to walk through in a moment is going to be made available in the upcoming December release of Posit Workbench, if you happen to be a Posit Workbench customer, and there are also some open source components here that we'll talk through as well.

The first thing I want to highlight is that when I'm in Posit Workbench and I go to start a new session, I'm presented with the option to choose, like Rafi demonstrated earlier, which editor and environment I want to run my session in, whether it's Jupyter Notebook or JupyterLab or VS Code, whatever the case might be. But there's this new portion down below the session name where I can choose a Databricks workspace that I'm going to want to authenticate into. These workspaces are configured by an admin, so an admin would go through and define which workspaces users have access to. Then I can select whichever workspace I want, whether it's none or a predefined one, and choose to sign into that workspace. Once I click sign in, this will transfer me over to another browser window where I will sign into Databricks using whatever mechanism is used within my organization. On our side, we use SSO, so I would go through that whole flow to get signed into Databricks. Once that's happened, Posit Workbench will manage a specific token credential for me that I won't need to worry about. The benefit of this is that, as a user, I don't need to worry about downloading or managing a personal access token to gain access to Databricks from within Posit Workbench. It's all managed behind the scenes for me and automatically kept up to date once I've gone through this sign-in process from the new session launch screen.

New Databricks pane in RStudio

Once we've started this session, we'll be dropped into the RStudio IDE. If you've used the open source version, the version that's available on Posit Workbench feels essentially identical and familiar. I have the ability to browse files and packages and other things that I might find interesting in here. I'm going to go ahead and open a project that contains some content we're going to walk through as part of this demonstration. And then I want to highlight one of the things that's going to be new in this upcoming release of Posit Workbench. If I look up here in the top right-hand corner, I have a number of different panes that I can open up. There's a new pane, however, called Databricks. If I click on this new pane, it will open up a view into my Databricks compute console. Essentially, I'm able to see what compute clusters I have access to on Databricks and what status they have: whether they're running, whether they're stopped, whether they're in the process of starting up or shutting down.

And if I want to, I can click on any of these. So let's take a look at this old cluster here at the top of my list. I can click on it and find additional details about the cluster, including its ID, its policy, and information about the resources assigned to it. All of this is made available directly here inside of the RStudio IDE within Posit Workbench. And to be transparent, this is specific to Posit Workbench; it's not something that will come to the open source version of the RStudio IDE.

The advantage of this is kind of twofold. One, as a Databricks user, if I want to run some workload on Databricks, I can start the cluster I want to run that workload against directly from within RStudio. I don't need to log into the Databricks UI or navigate some other system; I can do everything I need directly from within the development environment window. And then once I've started a cluster, let's take this old cluster, for example, I can easily connect to it by clicking on this connection icon. This opens up a dialog box that pre-populates all the information necessary to make the connection to this cluster. Again, one of the advantages of Posit Workbench managing my Databricks credentials for me is that I don't need to input any sort of password, key, or token here. All of that is automatically passed through from Workbench when this connection is made.

You'll see here that this cluster ID was automatically populated in this connection dialog. It tells me, okay, the Databricks runtime for this cluster is 13.3. And then it informs me here that the Python environment necessary for DBR 13.3 isn't available locally. So Databricks Connect is something that Rafi talked about a little bit. And that's the tool that we use to connect remotely into these Databricks sessions or these Databricks compute clusters to orchestrate workloads there. But there are some local dependencies necessary in order to have that functionality. I need to have a version of Databricks Connect locally. And that version of Databricks Connect needs to align with the version of Databricks runtime on the cluster that I'm connecting to. So all of these checks happen automatically.

And so in this case, this is a cluster that I just started this morning for this purpose with an older version of Databricks runtime, 13.3. And it tells me I don't have the right Python dependencies in place. I'll need to install those, so we'll go through that process in just a moment. It tells me down here, look, your credentials are being managed by Workbench. We've already passed the credential. Everything's good to go for connecting. The only thing missing is the local Python dependencies that I need to make this connection.

OK, so if we run this code here inside of our console in RStudio, we'll get a prompt to install the right Python dependencies for this particular connection. So here in this case, it says, look, we don't have the right version of the dependencies necessary to support a connection to a cluster running DBR 13.3. Do you want to install those dependencies? And we can hit yes here. And this will go through the process of creating a virtual environment, specific to my user on Posit Workbench, that contains all the correct dependencies necessary to create a connection to this specific cluster. Now, if I were to try to connect to another cluster that was running the same version of Databricks runtime, in this case 13.3, it would know that this environment already exists and use the preexisting virtual environment. So I don't need to recreate virtual environments every time I connect. Rather, each time I connect to a new version of Databricks runtime, I'll be prompted to install an updated set of Python dependencies, and those will be isolated in their own virtual environment so that I can reuse whichever one I need for the clusters that I might want to access.
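The connection code behind that dialog looks something like the following, a minimal sketch using sparklyr's Databricks Connect method (available in recent sparklyr releases); the cluster ID shown here is a hypothetical placeholder:

```r
library(sparklyr)

# Connect to a Databricks cluster via Databricks Connect. The workspace
# URL and token are read from DATABRICKS_HOST / DATABRICKS_TOKEN, which
# Posit Workbench manages, so no credentials appear in the code.
sc <- spark_connect(
  cluster_id = "0501-125556-abcd1234",   # hypothetical cluster ID
  method     = "databricks_connect",
  version    = "13.3"                    # must match the cluster's DBR version
)
```

The `version` argument is what drives the dependency check described above: sparklyr needs a local Python environment whose Databricks Connect version matches the cluster's runtime.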

You'll notice up here in the top right hand corner now, once I've made this connection, it automatically opened this connections pane. This connections pane now shows me a view into Unity Catalog. So I can browse through the different catalogs, schemas, and tables that I have available to me on the Databricks side and view specific details about those tables. So if I wanted to look at the UFO data that Rafi was talking about, I could come in and expand this out and see, okay, here's this NUFORC reports table. It's inside of this demos catalog and this NUFORC schema. And then this will open up in just a moment and show me a preview of the different columns that the table has and the different values it contains.

Everything that we've walked through here, with the exception of this Databricks pane over on the right hand side, is a function of the sparklyr package, which is an open source package. So being able to connect to Databricks from a local R process using sparklyr is something that's available in that open source package. It's just this Databricks pane that's specific to the Posit Workbench release that's coming later this month.

New simplified ODBC access to Databricks

In addition to connecting through sparklyr, we're also excited to announce that we will begin offering a Databricks ODBC connector to our professional customers here in the next short little while. This Databricks ODBC driver will be made available alongside the collection of existing ODBC drivers that we offer to customers. What that means is if I come in and I say, let's go back to connections, let's create a new connection. Here you can see I have a long list of available connection types. And if I scroll down to the middle of this list, I can see that Databricks is listed in here as one of those drivers that I can connect to.

It's also worth noting that this "Databricks Connect (DBR 13+)" entry is a way to connect directly from the connections pane with the sparklyr package. So if I wanted to connect through sparklyr, I have a number of different ways I can facilitate that, one of which is through the connections pane, which will bring up a dialog similar to what we saw previously. If I go to the Databricks ODBC driver here in this list, this will prompt me for the HTTP path that I need, whether for the cluster or for the SQL warehouse on Databricks that I want to connect to. And I can retrieve that information from the Databricks console, or the Databricks UI.

Here's an example of how we can make an ODBC connection into Databricks. And I want to highlight one additional thing here that we're excited about. If I connect traditionally, a typical ODBC connection from R would rely on the odbc package, and specifically on the odbc() function from that package, to facilitate a connection. That allows me to pass in things like what driver I want to use and what connection string details need to be provided to that driver to create the connection to this data source. In the case of Databricks, I actually need to provide a lot of detail. I need to provide the driver name, the host where Databricks is being run, the port, the HTTP path, and the catalog I want to connect to. I need to define whether I'm using SSL. There's this thrift transport argument. I need to define what auth mechanism I'm using and provide a user ID and password, whether that's a username and password combination or the literal user "token" plus the personal access token that I'm supplying. So I have a lot of details that need to be provided in order to create a connection to Databricks.
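A traditional connection of that kind might look roughly like this. To be clear, this is a sketch rather than the demo's actual code, and the driver name, host, and HTTP path below are all placeholders:

```r
library(DBI)

# A verbose, "classic" ODBC connection to Databricks. Every value here
# is a placeholder; the argument names follow the Databricks/Simba
# Spark ODBC driver's connection parameters.
con <- dbConnect(
  odbc::odbc(),
  Driver          = "Databricks",                   # name registered in odbcinst.ini
  Host            = "example.cloud.databricks.com", # hypothetical workspace host
  Port            = 443,
  HTTPPath        = "/sql/1.0/warehouses/abc123",   # hypothetical warehouse path
  SSL             = 1,
  ThriftTransport = 2,
  AuthMech        = 3,                 # 3 = user/password auth, where the
  UID             = "token",           #     user is literally "token"
  PWD             = Sys.getenv("DATABRICKS_TOKEN")
)
```

This is exactly the pile of detail the talk is describing: driver, host, port, path, SSL, transport, auth mechanism, and credentials, all supplied by hand.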

Well, because we've done a lot of work inside of Workbench to simplify things like credential management and other components necessary to access Databricks, we're introducing a new function, databricks(), to the odbc package in an upcoming release. This function exists to help facilitate connections to Databricks through ODBC, whether that's the ODBC driver that you install from us, or the driver and connector that you download and install from Databricks themselves. With this new odbc::databricks() function, we can get rid of essentially everything that we saw in the previous connection string and instead just supply the HTTP path that we need. And that's it.
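With that helper, the same connection can be sketched in a couple of lines; the HTTP path here is a placeholder, and the host and token are assumed to come from the environment that Workbench manages:

```r
library(DBI)

# Simplified connection using the odbc package's databricks() helper.
# The workspace host and token are discovered from the environment
# (DATABRICKS_HOST / DATABRICKS_TOKEN), so only the HTTP path is needed.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/abc123"   # placeholder warehouse path
)
```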

And so to highlight this, if we run this chunk of code right here, we'll see that we have an ODBC connection into Databricks. And this ODBC connection will give me a similar view into Unity Catalog as what we saw with sparklyr. So once again, I can expand out and I can look at, okay, here's this NUFORC schema, here's this table inside of there, here's the different values that are included in that table.

And if we come back here, we can see that we now have two different connections: an ODBC connection into Databricks, and a sparklyr connection into Databricks, both being supported from this local environment. Once we've created these connections into the Databricks environment, we can create local references to data tables that might exist there. So for example, if I run this line of code here that uses the tbl() function to reference the data inside of Databricks, notice that I'm referencing demos, then NUFORC, and then the NUFORC reports table, this will create a local reference that points to that data inside of Databricks. In fact, we can see this because if we run the head() function, which would typically return the first six rows of a given dataset, and then we run show_query(), this will show me how this ends up being translated on the backend into the Spark SQL that gets executed on Databricks before the results come back into my environment. The key here, and the benefit of this, is that I can keep all of my data in Databricks as long as I need to and work with it there. And then when necessary, I can move whatever summarized version of the data I need back into my RStudio environment.
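That lazy-reference pattern can be sketched like so, assuming `con` is the connection created in the demo and using the catalog, schema, and table names as heard in the talk (which may differ in the actual workspace):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Lazy reference to the table on Databricks -- no data is pulled yet.
# `con` is assumed to be the ODBC connection from earlier in the demo.
ufo <- tbl(con, in_catalog("demos", "nuforc", "nuforc_reports"))

# head() is translated rather than executed locally; show_query()
# prints the Spark SQL that would run on Databricks.
ufo |>
  head() |>
  show_query()
```

Because the reference is lazy, dplyr verbs build up a query that executes on Databricks, and nothing comes back to R until you explicitly ask for it.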

So if we wanted to explore this UFO sightings data and gain some insight from it, we could do something like take the NUFORC data, filter by date time, and then calculate by month how many UFO sightings there were in the dataset, and then we could explore what that might look like. Again, we can keep all the data inside of Databricks to perform the aggregation and then finally collect the resulting aggregation into our R session, where we can work with other tools within R, like ggplot2, to create some sort of visualization of whatever it is we're investigating. In this case, we're asking how many UFO sightings there were by month and year in this dataset. And we can see, if we look at this, there's not a lot up until about 2000. Then things started to increase, and around 2009 or 2010 things really kind of ramped up, though it maybe died down a little bit in the past couple of years.
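A sketch of that aggregate-then-collect workflow might look like this, assuming `ufo` is the lazy table reference from the demo and that the table has a `date_time` column (the column name is a guess, not confirmed by the talk):

```r
library(dplyr)
library(ggplot2)

# Aggregate on Databricks, then collect only the small summary into R.
monthly_sightings <- ufo |>                      # lazy table reference
  filter(!is.na(date_time)) |>                   # `date_time` is assumed
  count(year = year(date_time), month = month(date_time)) |>
  collect() |>                                   # summary comes back to R here
  mutate(month_start = as.Date(sprintf("%d-%02d-01", year, month)))

# Visualize locally with ggplot2.
ggplot(monthly_sightings, aes(month_start, n)) +
  geom_line() +
  labs(x = NULL, y = "Reported UFO sightings per month")
```

Note that `year()` and `month()` are translated to Spark SQL and run on Databricks; only the per-month counts cross the wire.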

Publishing data apps to Posit Connect

One thing to note: as a data scientist, this is how I typically spend my time. I work interactively with some sort of dataset until I find some insight or actionable piece of information in it. At that point, I want to be able to share that insight, whatever it is that I found, with other members of my organization. That can take a number of different forms. It could be an email. I could build a PowerPoint presentation. I could send a Slack message. There are all kinds of different ways that I can distribute information. But what we've found at Posit is that the most effective way to distribute information is to provide something that's reproducible and lasting, that users can come back to and reference whenever they need to. And it's difficult to do that with something like an email that can get buried in an inbox or a Slack message that can get buried in a Slack thread.

To help support that idea, we have a product I talked briefly about earlier called Posit Connect. And Posit Connect would allow us to do something like this: here's an example where I can take a Quarto document. So this is Quarto, which is a kind of successor to R Markdown, a way of interspersing code with plain text and prose. I'm not going to dive too deeply into this, but this is a preview of a new feature that's coming in the next release of Quarto that allows you to create dashboards. So I can create this document here that contains R code interspersed with plain text. If I look at this document locally, so I'm just going to go ahead and render this, this will create a dashboard that allows me to view some key summaries of, in this case, this UFO data.

And once I've created this dashboard locally, right, once I've got all the code working and I have this rendering and looking the way that I want it to look, I can easily publish and distribute this through something like Posit Connect. Once this is distributed on Posit Connect, let's open this up here. So here's an example of this dashboard. I can see over time percentage of reported sightings by different countries. I can also break this down by day of the week or by hour of the day. Looks like most sightings happen in the evening, which is to be expected. Much harder to see bright UFOs in the middle of the day. I can see a summary of the different types of ways in which these sightings are described, as well as the different ways in which the shape of the UFO has changed over time over here on the right hand side.

So this gives me a really nice way to package up some information that might be highly relevant to different members of my organization and then share and distribute it with them. Let me come back in here for just a moment to show what that looks like. There are a number of different ways you can deploy and share content on Posit Connect. But if you use the rsconnect package, there are a couple of key advantages, particularly when it comes to working with Databricks-backed content like a Quarto dashboard that queries data on the Databricks side. One is that I can define the Python environment that I want Posit Connect to use. One of the advantages of Posit Connect is that it will capture a definition of whatever environment I have locally, including the R packages and their versions and the Python packages and their versions, and it will send that definition to Posit Connect. When Posit Connect tries to publish whatever it is that I submitted, in this case this Quarto document, it will first recreate the environment it needs to run that content, and it will do this in isolation. So I do not have to worry about managing a complex network of dependencies on Connect. Instead, every piece of content has its own collection of dependencies. There's a shared cache, so I'm not duplicating work. But it means that everything can operate in isolation without interference with other content.

So if this dashboard uses some version of PySpark and another dashboard uses a different version of PySpark, there's not going to be a conflict when it comes to publishing and sharing on Posit Connect. The other advantage of using this rsconnect package is that it allows me to pass local environment variables in a very secure way to Posit Connect. So this is how I can pass information about how I'm making the connection to Databricks, so that Posit Connect can make that connection itself.
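A deployment call along those lines might look like this; the file name and content name are hypothetical, and `envVars` is the rsconnect (1.0+) argument that forwards the values of the named local environment variables to Connect:

```r
library(rsconnect)

# Publish the Quarto dashboard to Posit Connect. `envVars` sends the
# *values* of these local environment variables to Connect securely,
# without writing them into the deployment bundle itself.
deployDoc(
  doc     = "ufo-dashboard.qmd",             # hypothetical file name
  appName = "ufo-sightings-dashboard",       # hypothetical content name
  envVars = c("DATABRICKS_HOST", "DATABRICKS_TOKEN")
)
```

This is what lets the published dashboard re-run its Databricks queries on Connect using the same credentials you used locally, without ever committing them to code.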

If I look, so here's an example of this document, this dashboard now published and shared on Posit Connect. And now I can come over here and I can say, let's add Rafi to this dashboard. We could have other members of our organization. We could add groups and users. We could be very specific about who has access to this dashboard. We could schedule this dashboard to refresh every week or every day or however often the underlying data is updating. How often do we want to be informed of the latest UFO trends? We can define all of that here inside of Posit Connect and end users can come visit this whenever they want to. It's a specific URL they can log into and view.

Just to highlight one other thing we could do in this same vein: if we wanted to provide something that was a little bit more interactive and less static than the dashboard, we could also offer a Shiny application. So we could build an interactive web application in R or Python, publish that application to Posit Connect, and that application would enable users to come and investigate this data on their own and ask questions on their own. So here's an example of a Shiny application, using this data that's being hosted on Databricks, where we can take a look at, OK, I live in Utah. Here's the collection of UFO sightings that have happened in Utah.
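A minimal sketch of such a Shiny app is below. The `state` column, the table reference, and the app structure are all assumptions for illustration, not the demo's actual application:

```r
library(shiny)
library(dplyr)

# `ufo` is assumed to be a lazy reference to the Databricks table,
# e.g. tbl(con, in_catalog("demos", "nuforc", "nuforc_reports")).
ui <- fluidPage(
  titlePanel("UFO sightings by state"),
  selectInput("state", "State", choices = state.abb, selected = "UT"),
  tableOutput("sightings")
)

server <- function(input, output, session) {
  output$sightings <- renderTable({
    # Each change to the input re-runs this query on Databricks;
    # only the 20 requested rows are collected into R.
    ufo |>
      filter(state == !!input$state) |>   # `state` column is an assumption
      head(20) |>
      collect()
  })
}

shinyApp(ui, server)
```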

And this type of interactivity allows end users not just to ingest the work that's been done for them, but also to ask their own questions and get their own answers from the underlying data. And all of the data here is being stored and managed on Databricks through the data governance capabilities that Databricks has to offer. And queries are being submitted and executed against Databricks compute.

Demo recap

OK, so let me just take a minute and summarize a little bit of what we've chatted through. What do we have today? Well, Rafi talked to us about VS Code and some of the changes and opportunities that exist within VS Code and Databricks, the VS Code extension for Databricks. We talked about RStudio and the Databricks pane and the way in which Posit Workbench can manage authentication and credentials for Databricks users. And these are all things that are available either today or in the very near future with the upcoming release of Posit Workbench. We also talked a little bit about sparklyr and all the new changes we introduced there. There's been an incredible amount of work that's gone into the sparklyr package over the past few months to support this new Databricks Connect functionality and to enable our users to seamlessly create these remote connections into the Databricks environment. And then finally, we're simplifying the ability to connect to Databricks through ODBC by providing our own