Resources

Databricks x Posit | Improved Productivity for your Data Teams

video
Dec 5, 2023
1:07:34

Transcript

This transcript was generated automatically and may contain errors.

All right, well, welcome, everyone. We're excited to see you all here. My name is James Blair, and I work as a product manager at Posit for cloud integrations. I'm excited to be here with you all today to talk about some of the recent work that we've done with Databricks, the partnership that we have with them, and what we're excited about. I'm joined today by Rafi Kurlansik, who is a product specialist at Databricks, and we'll be walking through some of the Databricks perspective on the partnership, as well as the work that we're doing together. So I'm actually going to turn the time over to him to get things started today. But we're excited to be here with you all.

Introduction to Databricks

OK, awesome. Thank you so much, James. I'm going to go into a little bit of who we are at Databricks and what we're all about. Our mission is to democratize data and AI for the entire world, and the strategy we use to accomplish this mission is twofold. First, we have created three very successful open source projects: Apache Spark, which is scalable in-memory data processing; Delta Lake, which is scalable data storage and management; and MLflow, which is used to manage the entire machine learning lifecycle. In addition to this commitment to open source, we have also developed a managed service that is beloved by many of our customers. We have over 6,000 employees across the world, we're very successful, thankfully, in terms of our revenue and investment from top firms around the world, and we're the inventor of the data lakehouse paradigm, as well as a pioneer in generative AI.

Now, what I would like to go into is our view that data and AI are universal. Everyone already knows that this has impacted every single industry, every single sector of the economy. And we believe that the companies that win in their industry will themselves become data and AI companies. This is easier said than done, though. If you look at what data and AI means, it's actually a collection of very different but related technologies. So, for example, you have everything from data warehousing and BI to streaming, real-time analytics, data science, and machine learning. And then you have to govern all of these different use cases and technologies.

Not to mention the emergence of generative AI and the promise that it brings to the table. So it's very, very hard for organizations to wrap their heads around all of this and be successful with all of it. Our effort has really been to simplify all of these things, and we've done so with the data lakehouse.

So if you're not so familiar with the lakehouse, let me walk you through it quickly. The term lakehouse is essentially a portmanteau of data warehouse and data lake. You get the structured data, performance, and consistency that you have with a data warehouse, alongside the unstructured data and open nature of a data lake. Our data lakehouse is built on an open data lake: you use cloud object storage, S3, ADLS, or GCS, to store all of your data, and then you can unify your data under one single storage and management system, with Delta Lake as the underlying technology. Once you have all of your data in storage and you're managing it with Delta Lake, you can secure and govern it with Unity Catalog. And once you've secured and governed all of your data, one copy of the data, then you can unlock these different pillars of data, AI, and analytics.

We pioneered this concept in 2020, and we're really excited and happy to report that it has been pretty successful: 74% of global enterprises have adopted the lakehouse architecture. So this really speaks to the power of it and its simplicity and unification.

All right, now that you understand a little bit about what a data lakehouse is, this open, unified foundation for all of your data, what happens when you combine that kind of foundation with generative AI? How does that change things? We think it actually introduces a new type of data platform: the data intelligence platform. The concept of a data intelligence platform furthers the democratization of data and AI. It makes it more accessible to more people than ever before.

So practically, how does that work out? What does this really mean? I want to touch on two elements. The first is that you want to be able to use generative AI to understand the semantics of your data so that you can ask questions of your data in natural language. That's a huge part of what a data intelligence platform is, and we are building Databricks to support that type of interaction with data. The other way that you can use generative AI is to take open source models and combine them with your own data so that you can build applications on top of it.

So that's what I'm going to focus on a little bit more today when we go through a demo. But rest assured that Databricks is working on both of these components of a data intelligence platform: for end users who want to ask questions of their data, and for developers who want to build with generative AI.

Databricks products for Posit integration

Okay, so how do you access this data intelligence platform using Posit products? James is going to focus a lot on this question, and I'm going to show a little bit as well. But before I hand it over to James, I want to talk quickly about the Databricks products I'm going to be using that enable this type of activity.

The first is going to be Databricks Connect version 2. Databricks Connect is a lightweight client library that you can install on your laptop or any server from which you want to run a remote connection to Databricks, and it allows you to use Apache Spark in a very simple way. So it's very, very useful when you're working in your IDE and you want to do exploratory or development work in the comfort of your IDE. But it's also very useful when you want to develop a data app, like a Shiny app or a Streamlit app, and you want to be able to access Databricks and all the data in there that's secure and governed.

The second thing I'm going to be showing you is the Databricks extension for VS Code. I'm going to show you how easy it is to get set up, how you can find it in the VS Code marketplace, and how it lets you use all the features that you like about VS Code while still being able to run code against Databricks and work with the data that you have there.

And then lastly, I'm going to be showing a little bit about Databricks Model Serving. This is our online, real-time inference for machine learning models. It is fully managed, including support for GPUs. It's one-click deployment of models, it's serverless, and it scales up and down seamlessly. We've done special optimizations specifically for serving large language models. And you can also set up monitoring of the inference: you can set up tables that collect the inferences that you're making and help you set up monitoring out of the box with a few clicks.

Introduction to Posit

Hey, thanks, Rafi. I think it's clear that Databricks has a lot to offer and a lot to bring to the table with this data intelligence platform. And we're excited to talk about the ways in which Posit has worked closely with Databricks to make the experience for our joint customers even better. To highlight this, I want to talk a little bit about the history of Posit as a company, which may be familiar to some of you and not to others. I think it will be good to give some context on where Posit comes from and how we fit into this picture.

Okay, so Posit has its roots in open source software, primarily in the form of the RStudio IDE. If you've ever used the R language, odds are at some point you've used the RStudio IDE, which is where Posit has come from. We continue to be the maintainers of this open source IDE today, and it is the preferred development environment for developers and data scientists working with the R language. We also contribute heavily to the open source community through a variety of open source packages. These include a number of popular R packages: the tidyverse and its collection of packages for data analysis and data science, Shiny for interactive applications, Quarto for dashboarding and reproducible reporting, Reticulate, which allows R users to work with Python directly from the R language, and many, many more.

One of the things that's exciting as we look at where Posit has been and where we're going is our renewed and continued investment into not only R but also Python. We focus not exclusively on a single language, but instead on promoting open source technology for data scientists, regardless of what language that might be in.

Just a little bit of historical context, and this is not comprehensive at all, but it gives some sense of what Posit is as a company and who we are. The RStudio IDE was first introduced in 2011, again, as an open source tool that's available to download, and it continues to exist today as an open source desktop installation that anybody can go and use. That's something that we firmly believe in: providing those open source tools to give anyone in the world access to the tools necessary to be data literate and to make informed decisions based on data they have access to. Shiny, the package for building interactive web applications in R, was introduced in 2012. Our first enterprise software, where we were charging customers to run software in an enterprise environment with additional features that were attractive to the enterprise, was introduced in 2014. Then in 2020, we announced that we were becoming a public benefit corporation, and included in that effort are some specific promises that we make to our community as part of our company mission. And in 2022, the company rebranded from RStudio, which had been the name of the company up until that point, to Posit.

So if you've ever worked with RStudio, the company, we are one and the same. We just now go by the name of Posit, which helps us push forward our mission of open source data science, regardless of the language of choice. One thing that's pretty unique about how Posit operates is that a lot of what we do today on the engineering side is focused on open source technology and open source packages. Again, we continue to support the open source RStudio IDE, which remains very, very popular amongst data scientists and users, and we also have this growing collection of packages that we work to support as well. We firmly believe that this open source commitment is beneficial across the board. It helps users get early access to tools and technology that enable them to be successful in working with data at no financial cost. It also enables them to take those tools into a professional context, into a commercial environment, where we can support those tools with our enterprise products that customers purchase. And then that purchase, the income that we receive there, gets reinvested back into the open source side. This is something that we refer to often as the virtuous cycle.

When we talk about the commercial side or the enterprise side of Posit, there are really three main components that we'll talk about. Each of these will be highlighted to some extent in our conversation today as we talk about Databricks. But just to give those who may be unfamiliar an idea: we offer a tool called Posit Workbench, which enables users to access common data science development environments through a browser in a centrally managed way. Through Posit Workbench, you have access to the RStudio development environment, VS Code, Jupyter Notebook, and JupyterLab. And there are a number of different ways that Posit Workbench can be configured to run within an organization and can be designed architecturally to fit whatever type of environment an organization is running in.

As a complement to Posit Workbench, we offer Posit Connect, which serves as a bridge between the developers and data scientists and the end users. Posit Connect allows developers to publish and share the work that they've done inside of a development environment, whether that work is something interactive, like a Shiny application built in R or a Streamlit application built in Python, or even a web API built in R or Python. Those can be published to Posit Connect. And then other users in the organization, whether they're business users, decision makers, or even downstream processes that consume APIs, can utilize the work that's been done by the data scientists and developers, because it's hosted and shared in a central location that's secured and governed according to the rules of the business and the organization.

As a kind of behind-the-scenes component to all of this, we also offer Posit Package Manager, which enables organizations to provide a secure and essentially regulated way for users to access R and Python packages. Organizations can define which packages are part of their ecosystem. Posit Package Manager supports snapshotting by date, so if I had an analysis that I worked on two years ago and I need to roll back to the packages as they were at that point in time, Posit Package Manager makes all of that very simple and very straightforward. It's a benefit to organizations where reproducibility and security are of the utmost importance.

So what we'll talk about now as we move forward today is how all of these different pieces, the open source side and the commercial side, work together with Databricks. What is it that we're doing today, and what do we plan to do in the future, to make life easier for those organizations that happen to be using Posit and Databricks together? Collectively, we refer to the commercial products that we offer as Posit Team. Again, Posit Team is Posit Workbench, Posit Connect, and Posit Package Manager. What we look forward to, and what we're going to talk through today, is how Posit Team can sit alongside this data intelligence platform that Rafi's described and interact in a variety of different ways with that platform, leveraging all that Databricks has to offer: data governance, security, compute power, distributed computing, model serving, all these things that Rafi's touched on already. We can leverage those and make use of them from inside of these different products that Posit offers.

Demo of VS Code and Model Serving

Awesome. Thank you, James. All right. So I'm going to go through a quick demo here, and I want to start with Posit Workbench. So for people who may not be so familiar, you sign into Posit Workbench, and then you're able to basically launch different sessions for different development environments that you would like to use. So you can create a new Jupyter notebook session, JupyterLab, RStudio if you want to develop in R, and then, of course, VS Code, which is preferred by many for Python as well as Jupyter.

Now, I've already gone ahead and created a VS Code session. But if I wanted to, we could create more, and I could have multiple different IDEs going at the same time. My VS Code session is over here, and I want to start off by just showing how we can get access to the VS Code extension. From VS Code, I can come over here to the Extensions tab, and if you just search for Databricks in the marketplace, you'll find that the very first result is owned by the Databricks organization. You can install it, and then you're good to go. It's that simple.

Once you have the extension installed, you'll see this icon over here in the left nav bar. The first time that you open this up, you'll be prompted to configure it, which will ask you which Databricks workspace you want to sign into. So you can specify the URL for your particular workspace here. Once you choose that, you have the option of signing in via OAuth, or you can use the Databricks CLI. I've already gone ahead and installed the Databricks CLI here. Actually, James has done that on my behalf. And I've configured it and logged into Databricks through the CLI as well. The point being that it's very, very simple to sign in to Databricks through the UI just by clicking. You don't have to go through and mess with environment variables if you don't want to.

And once you set up the connection to the Databricks workspace, you'll see this left panel be populated. You'll see which workspace I'm connected to and which cluster in the workspace I'm connected to. I can browse the various clusters that I may want to connect to and choose which one to attach my session to. And this is going to allow us to use Databricks Connect and run files on this particular compute resource that lives in Databricks.

The last thing I'm going to cover, and I'm not going to do an exhaustive view of the VS Code extension because of time constraints, but just to call it out: you can also choose a destination in the Databricks workspace to sync your code to. This is very, very helpful if you want relative paths to work when you run code on Databricks from VS Code. You basically have a copy of your code in Databricks so that all the relative paths resolve correctly. But it's also helpful if you want to do some development work in VS Code and then flip over to a notebook in Databricks to use some of the features we have there as well.

All right, so I'm going to start off. I've already gone ahead and connected to this particular compute, and I'm going to go through a Jupyter notebook that I have, showing how we can use Databricks Connect. First, just to remind everyone about Databricks Connect: this is essentially a library that is installed on this Posit Workbench server, which is running in the cloud outside of Databricks; this is a remote connection. Normally we just create a new Spark session, and then any PySpark code that we use, any Spark code in the DataFrame API, will be run remotely on Databricks. If we were to run any pure Python code, that would still be running on the machine where this is installed. So it's important to keep in mind that Databricks Connect allows you to use Apache Spark to process data in the cloud, but anything that's not using Apache Spark is going to be executed on the local machine.
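As a rough sketch of that split, the comments below mark where each line executes. The table name is illustrative, and actually running this requires `databricks-connect` installed and a workspace configured (for example, via the Databricks CLI), so this is a sketch under those assumptions, not the demo's exact code:

```python
# Hedged sketch of Databricks Connect v2: Spark DataFrame operations run on the
# remote Databricks cluster, while plain Python runs wherever this script runs
# (here, the Posit Workbench server). The table name is hypothetical.

def remote_row_count(table: str) -> int:
    # Import is deferred so the sketch can be read without a workspace available.
    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.getOrCreate()  # connects using CLI/OAuth config
    df = spark.read.table(table)   # remote: resolved and planned on the cluster
    return df.count()              # remote: the Spark action executes on the cluster
    # Anything using only plain Python objects after this runs locally.
```

Calling something like `remote_row_count("main.default.ufo_sightings")` would push the read and count to the attached cluster and hand back a plain Python integer to the local session.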

Okay, now one of the nice things about the VS Code extension is that it has a particular integration with Databricks Connect: if you're using a Jupyter notebook in VS Code, it will actually create the Spark session for you. So here you'll see I'm going to read from a UFO sightings table that is in my catalog, under the default schema, and then the UFO sightings table. So I'm accessing data that's in Unity Catalog here. I'm bringing it to pandas because I just want it to render a little more nicely. And this is a very interesting data set, because it's essentially records of UFO sightings in the United States.

You can see which particular country, which state, and what time it was. The thing that I want to focus on is the text: the text is the actual report itself, what someone observed. I live in New Jersey, so I'm going to filter down to the state of New Jersey, and we're just going to take a look at the text of one of these sightings. So we have "an object like a plane without wings" and so on and so forth. There's a lot of descriptive text here. It would be challenging to go through and read all of these reports and draft up an analysis of the similarities and discrepancies between the various reports. I think this is a very good use case: take all this text data, give it to a language model, and ask it to summarize it for us.
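The filtering step itself is plain pandas and runs locally once the table has been pulled back (for example, with `.toPandas()`). A toy stand-in, with column names assumed from the walkthrough, looks something like this:

```python
import pandas as pd

# Toy stand-in for the UFO sightings table; in the demo this DataFrame would come
# back from Databricks Connect, e.g. spark.read.table(...).toPandas().
# Column names and report text here are made up for illustration.
sightings = pd.DataFrame({
    "state": ["NJ", "CA", "NJ"],
    "text": [
        "an object like a plane without wings moved silently overhead",
        "bright light hovering over the bay for several minutes",
        "three lights moving in formation, then vanishing",
    ],
})

# Filter down to New Jersey and inspect one report, as in the demo.
nj = sightings[sightings["state"] == "NJ"]
print(nj["text"].iloc[0])
```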

So I'm going to show how we can do that with Databricks Model Serving. The Model Serving product on Databricks is remarkable because it allows you to serve, via REST API, every type of machine learning model that you would want. Everything from classic machine learning, everything that we were so invested in up until the rise of generative AI, you can serve all of those on Databricks, and this is done in a secure and scalable way. This also applies to foundation models. We do the work to go get the best state-of-the-art open source models so you don't have to do all that research, and we'll actually have those hosted in your environment for you, and you'll be able to call them very simply.

And then third, we can provide a secure route, or gateway, that lets you manage commercial offerings from companies like OpenAI, if you want to use those, in the same place that you manage your custom models and your open source foundation models. We provide the means for you to do that as well. So we bring all these different kinds of machine learning models into a single serving product.

Now, foundation models in particular, these are the state-of-the-art open source models. They're really good for summarization, for content generation, and for building a chatbot using RAG, or retrieval augmented generation. And just FYI, this is per-token pricing, and the Foundation Model APIs are in preview. If you're interested in this kind of thing and you like what we're showing you today, you should definitely reach out to your Databricks rep.

Now let's actually take a look at it. What I'm going to do here is take just eight observations from the DataFrame that we created before, and we're just going to take the text for those eight observations. So here are these UFO sightings from New Jersey; we have eight observations. Now what I want to do is access a Llama 2 70B chat model that we have hosted in Databricks, and just call it from VS Code right here.

I've already gone ahead and installed this inference SDK, so I'm not going to run the install right now. Then let's start here: we'll import this ChatCompletion, and we'll define the prompt, which is going to be: you're a helpful assistant, you're going to summarize UFO sightings and explain the similarities and differences between them, and summarize them in three bullet points. We want this to be pretty snappy and pretty concise. And then to actually make that call, we just have several lines of code. Very, very simple. We pass in the system prompt and we pass in the sightings object that we created before, that blob of text. And then here's our response.
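A hedged sketch of that call, assuming the `databricks-genai-inference` package: the model name, argument names, and response shape here are assumptions based on the demo, and running it requires a Databricks workspace with the Foundation Model APIs enabled, so verify the details against the package's documentation:

```python
# Summarize a blob of UFO sighting reports with a hosted Llama 2 70B chat model.
# Model name and call signature are assumptions; this is a sketch, not the
# demo's exact code, and it requires workspace credentials to actually run.

SYSTEM_PROMPT = (
    "You are a helpful assistant. You will summarize UFO sightings and explain "
    "the similarities and differences between them in three bullet points."
)

def summarize_sightings(sightings_text: str) -> str:
    # Deferred import: the package and credentials are only needed at call time.
    from databricks_genai_inference import ChatCompletion

    response = ChatCompletion.create(
        model="llama-2-70b-chat",  # assumed endpoint/model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sightings_text},
        ],
        max_tokens=512,
    )
    return response.message
```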

So we have three bullet points. It says all the sightings describe objects or lights in the sky that can't be identified as man-made or natural. Many of them are described as moving quickly or hovering, and some as disappearing or changing direction suddenly. And let's take a look at some of the differences: some of the sightings describe the objects as being very large, while others describe them as being relatively small, and so on. So I think this is a really nice way to combine Databricks Model Serving with a challenging data set, and leverage some of the superpowers that LLMs actually have.

One last thing I want to call out before I move on: notice how I didn't have to specify any credentials or anything like that when making these API calls. All of the Databricks SDKs that have come out in the last year have been designed to reuse the credentials that are created when you sign in with OAuth or when you use the Databricks CLI. So this is a much more secure and easy way to use these SDKs.

Now the second example that I want to show you uses the Python SDK. Databricks recently released a pure Python SDK for interacting with the Databricks REST API, and I want to show you how simple it is to call a language model on Databricks using it. We'll import the workspace client from the Databricks Python SDK, instantiate it, specify which model serving endpoint we're going to call, and then give our prompt. In this case, we're not looking at the UFO data anymore. We're going to be calling a model that has been built on top of a RAG architecture: it indexes all the latest Databricks documentation, stores it in a vector database, and then uses those embeddings to return a more accurate response to us when we query.

Querying it is also super simple. We use the workspace client and the serving endpoints service, and then we pass in the query, which says go hit this model, along with the prompt that we want to pass. For this one, I'm going to use the Run Python File functionality here in VS Code, and we'll just wait a few seconds; we should be able to see the results here. Okay, so what does our language model have to say? Databricks Model Serving deploys your MLflow machine learning models and exposes them as REST API endpoints. Serverless compute resources, easy configuration, high availability and scalability, dashboards, monitoring. All sounds pretty great.
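A hedged sketch of that pattern with the Databricks Python SDK (`databricks-sdk`): the endpoint name here is made up, and the client picks up the OAuth or CLI credentials mentioned earlier, so no tokens appear in the code. Treat this as a sketch under those assumptions rather than the demo's exact script:

```python
# Query a model serving endpoint through the Databricks Python SDK.
# The endpoint name is hypothetical; running this requires workspace
# credentials (e.g. from the Databricks CLI) and a recent databricks-sdk.

def ask_docs_assistant(question: str, endpoint: str = "databricks-docs-rag") -> str:
    # Deferred imports: only needed when a request is actually sent.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

    w = WorkspaceClient()  # reuses OAuth/CLI credentials automatically
    response = w.serving_endpoints.query(
        name=endpoint,
        messages=[ChatMessage(role=ChatMessageRole.USER, content=question)],
    )
    return response.choices[0].message.content
```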

Okay, now I'm just going to show this quickly, and then I'm going to hand it back over to James. This is in Databricks itself. This is the model that we were calling, and you can see some of the dashboards that exist here. It's beyond the scope of the presentation to go into detail, but I just wanted to show that this is the model that we were calling.

Okay, so to recap: we showed how you can launch VS Code in Posit Workbench and how you can access Databricks by installing the VS Code extension. You can use Databricks Connect to have interactive sessions with the data that's being managed in Databricks and Unity Catalog. And we also showed how you can access Databricks Model Serving, both through the Python SDK and through the Foundation Model APIs. All right, and now I will pass it back over to you, James.

Demo of signing into Posit Workbench via OAuth

Excellent. Thanks, Rafi, for walking us through that demo. I think, again, it's clear that Databricks has a lot to offer in regards to the ways in which we interact with modern tools in the data stack. And it's really great to see some of those become more and more available to users from whatever environment they want to be in, whether it's VS Code or something else. We're going to continue that theme here as we talk about some new changes that we're introducing into Posit Workbench. A lot of what we're going to walk through in a moment is going to be made available in the upcoming December release of Posit Workbench, if you happen to be a Posit Workbench customer, and there are also some open source components here that we'll talk through as well.

The first thing I want to highlight is that when I'm in Posit Workbench and I go to start a new session, I'm presented with the option to choose, like Rafi demonstrated earlier, which editor and environment I want to run my session in, whether it's Jupyter Notebook or JupyterLab or VS Code, whatever the case might be. But there's this new portion down below the session name where I can choose a Databricks workspace that I'm going to want to authenticate into. These workspaces are configured by an admin, so an admin would go through and define which workspaces users have access to. Then I can select whichever workspace I want, whether it's none or a predefined one, and choose to sign into that workspace. Once I click sign in, this will transfer me over to another browser window where I will sign into Databricks using whatever mechanism is used within my organization. On our side, we use SSO, so I would go through that whole flow to get signed into Databricks. Once that's happened, Posit Workbench will manage a specific token credential for me that I won't need to worry about. The benefit of this is that, as a user, I don't need to worry about downloading or managing a personal access token to gain access to Databricks from within Posit Workbench. It's all managed behind the scenes for me and automatically kept up to date once I've gone through this sign-in process from the new session launch screen.

New Databricks pane in RStudio

Once we've started this session, we'll be dropped into the RStudio IDE. If you've used the open source version, the version that's available on Posit Workbench feels essentially identical and familiar. I have the ability to browse files and packages and other things that I might find interesting in here. I'm going to go ahead and open a project that contains some content we're going to walk through as part of this demonstration. And then I want to highlight one of the things that's going to be new in this upcoming release of Posit Workbench. If I look up here in the top right-hand corner, I have a number of different panes that I can open up. There's a new pane, however, called Databricks. If I click on this new pane, it will open up a view into my Databricks compute console. Essentially, I'm able to see what compute clusters I have access to on Databricks and what status they have: whether they're running, whether they're stopped, whether they're in the process of starting up or shutting down.

And if I want to, I can click on any of these. So let's take a look at this old cluster here at the top of my list. I can click on it and find additional details about the cluster, including its ID, its policy, and information about the resources assigned to it. All of this is made available directly here inside of the RStudio IDE within Posit Workbench. And to be transparent, this is specific to Posit Workbench; it's not something that will come to the open source version of the RStudio IDE.

The advantage of this is kind of twofold. One, as a Databricks user, if I want to run some workload on Databricks, I can start the cluster I want to run that workload against directly from within RStudio. I don't need to log into the Databricks UI or navigate some other system; I can do everything I need directly from within the development environment window. And then once I've started a cluster, let's take this old cluster, for example, I can easily connect to it by clicking on this connection icon. This opens up a dialog box that pre-populates all the information necessary to make the connection to this cluster. Again, one of the advantages of Posit Workbench managing my Databricks credentials for me is that I don't need to input any sort of password, key, or token here. All of that is automatically passed through from Workbench when this connection is made.

You'll see here that this cluster ID was automatically populated in this connection dialog. It tells me, okay, the Databricks runtime for this cluster is 13.3. And then it informs me here that the Python environment necessary for DBR 13.3 isn't available locally. So Databricks Connect is something that Rafi talked about a little bit. And that's the tool that we use to connect remotely into these Databricks sessions or these Databricks compute clusters to orchestrate workloads there. But there are some local dependencies necessary in order to have that functionality. I need to have a version of Databricks Connect locally. And that version of Databricks Connect needs to align with the version of Databricks runtime on the cluster that I'm connecting to. So all of these checks happen automatically.

And so in this case, this is a cluster that I just started this morning for this purpose with an older version of Databricks runtime, 13.3. And it tells me I don't have the right Python dependencies in place. I'll need to install those, so we'll go through that process in just a moment. It tells me down here, look, your credentials are being managed by Workbench. We've already passed the credential. Everything's good to go for connecting. The only thing missing is the local Python dependencies that I need to make this connection.

OK, so if we run this code here inside of our console in RStudio, we'll get a prompt to install the right Python dependencies for this particular connection. So here in this case, it says, look, we don't have the right version of the dependencies necessary to support a connection to a cluster running DBR 13.3. Do you want to install those dependencies? And we can hit yes here. And this will go through the process of creating a virtual environment, specific to my user on Posit Workbench, that contains all the correct dependencies necessary to create a connection to this specific cluster. Now, if I were to try to connect to another cluster that was running the same version of Databricks runtime, in this case 13.3, it would know that this environment already exists and use the preexisting virtual environment. So I don't need to recreate virtual environments every time I connect. Rather, each time I connect to a new version of Databricks runtime, I'll be prompted to install an updated set of Python dependencies, and those will be isolated in their own virtual environment so that I can reuse whichever one I need for the clusters that I might want to access.
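The connection code behind that dialog looks something like the following, a minimal sketch using sparklyr's Databricks Connect method (available in recent sparklyr releases); the cluster ID shown here is a hypothetical placeholder:

```r
library(sparklyr)

# Connect to a Databricks cluster via Databricks Connect. The workspace
# URL and token are read from DATABRICKS_HOST / DATABRICKS_TOKEN, which
# Posit Workbench manages, so no credentials appear in the code.
sc <- spark_connect(
  cluster_id = "0501-125556-abcd1234",   # hypothetical cluster ID
  method     = "databricks_connect",
  version    = "13.3"                    # must match the cluster's DBR version
)
```

The `version` argument is what drives the dependency check described above: sparklyr needs a local Python environment whose Databricks Connect version matches the cluster's runtime.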

You'll notice up here in the top right hand corner now, once I've made this connection, it automatically opened this connections pane. This connections pane now shows me a view into Unity Catalog. So I can browse through the different catalogs, schemas, and tables that I have available to me on the Databricks side and view specific details about those tables. So if I wanted to look at the UFO data that Rafi was talking about, I could come in and expand this out and see, okay, here's this NUFORC reports table. It's inside of this demos catalog and this NUFORC schema. And then this will open up in just a moment and show me a preview of the different columns that the table has and the different values it contains.

Everything that we've walked through here, with the exception of this Databricks pane over on the right hand side, is a function of the sparklyr package, which is an open source package. So being able to connect to Databricks from a local R process using sparklyr is something that's available in that open source package. It's just this Databricks pane that's specific to the Posit Workbench release that's coming later this month.

New simplified ODBC access to Databricks

In addition to connecting through sparklyr, we're also excited to announce that we will begin offering a Databricks ODBC connector to our professional customers here in the next short little while. This Databricks ODBC driver will be made available alongside the collection of existing ODBC drivers that we offer to customers. What that means is if I come in and I say, let's go back to connections, let's create a new connection. Here you can see I have a long list of available connection types. And if I scroll down to the middle of this list, I can see that Databricks is listed in here as one of those drivers that I can connect to.

It's also worth noting that this "Databricks Connect (DBR 13+)" entry is a way to connect directly from the connections pane with the sparklyr package. So if I wanted to connect through sparklyr, I have a number of different ways I can facilitate that, one of which is through the connections pane, which will bring up a dialog similar to what we saw previously. If I go to the Databricks ODBC driver here in this list, this will prompt me for the HTTP path that I need, whether for the cluster or for the SQL warehouse on Databricks that I want to connect to. And I can retrieve that information from the Databricks console, or the Databricks UI.

Here's an example of how we can make an ODBC connection into Databricks. And I want to highlight one additional thing here that we're excited about. If I connect traditionally, a typical ODBC connection from R would rely on the odbc package, and specifically on the odbc() function from that package, to facilitate a connection. That allows me to pass in things like what driver I want to use and what connection string details need to be provided to that driver to create the connection to this data source. In the case of Databricks, I actually need to provide a lot of detail. I need to provide the driver name, the host where Databricks is being run, the port, the HTTP path, and the catalog I want to connect to. I need to define whether I'm using SSL. There's this thrift transport argument. I need to define what auth mechanism I'm using and provide a user ID and password, whether that's a username and password combination or the literal user "token" plus the personal access token that I'm supplying. So I have a lot of details that need to be provided in order to create a connection to Databricks.
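A traditional connection of that kind might look roughly like this. To be clear, this is a sketch rather than the demo's actual code, and the driver name, host, and HTTP path below are all placeholders:

```r
library(DBI)

# A verbose, "classic" ODBC connection to Databricks. Every value here
# is a placeholder; the argument names follow the Databricks/Simba
# Spark ODBC driver's connection parameters.
con <- dbConnect(
  odbc::odbc(),
  Driver          = "Databricks",                   # name registered in odbcinst.ini
  Host            = "example.cloud.databricks.com", # hypothetical workspace host
  Port            = 443,
  HTTPPath        = "/sql/1.0/warehouses/abc123",   # hypothetical warehouse path
  SSL             = 1,
  ThriftTransport = 2,
  AuthMech        = 3,                 # 3 = user/password auth, where the
  UID             = "token",           #     user is literally "token"
  PWD             = Sys.getenv("DATABRICKS_TOKEN")
)
```

This is exactly the pile of detail the talk is describing: driver, host, port, path, SSL, transport, auth mechanism, and credentials, all supplied by hand.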

Well, because we've done a lot of work inside of Workbench to simplify things like credential management and other components necessary to access Databricks, we're introducing a new function, databricks(), to the odbc package in an upcoming release. This function exists to help facilitate connections to Databricks through ODBC, whether that's the ODBC driver that you install from us, or the driver and connector that you download and install from Databricks themselves. With this new odbc::databricks() function, we can get rid of essentially everything that we saw in the previous connection string and instead just supply the HTTP path that we need. And that's it.
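With that helper, the same connection can be sketched in a couple of lines; the HTTP path here is a placeholder, and the host and token are assumed to come from the environment that Workbench manages:

```r
library(DBI)

# Simplified connection using the odbc package's databricks() helper.
# The workspace host and token are discovered from the environment
# (DATABRICKS_HOST / DATABRICKS_TOKEN), so only the HTTP path is needed.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/abc123"   # placeholder warehouse path
)
```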

And so to highlight this, if we run this chunk of code right here, we'll see that we have an ODBC connection into Databricks. And this ODBC connection will give me a similar view into Unity Catalog as what we saw with sparklyr. So once again, I can expand out and I can look at, okay, here's this NUFORC schema, here's this table inside of there, here's the different values that are included in that table.

And if we come back here, we can see that we now have two different connections: an ODBC connection into Databricks, and a sparklyr connection into Databricks, both being supported from this local environment. Once we've created these connections into the Databricks environment, we can create local references to data tables that might exist there. So for example, if I run this line of code here that uses the tbl() function to reference the data inside of Databricks, notice that I'm referencing demos, then NUFORC, and then the NUFORC reports table, this will create a local reference that points to that data inside of Databricks. In fact, we can see this because if we run the head() function, which would typically return the first six rows of a given dataset, and then we run show_query(), this will show me how this ends up being translated on the backend into the Spark SQL that gets executed on Databricks before the results come back into my environment. The key here, and the benefit of this, is that I can keep all of my data in Databricks as long as I need to and work with it there. And then when necessary, I can move whatever summarized version of the data I need back into my RStudio environment.
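That lazy-reference pattern can be sketched like so, assuming `con` is the connection created in the demo and using the catalog, schema, and table names as heard in the talk (which may differ in the actual workspace):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Lazy reference to the table on Databricks -- no data is pulled yet.
# `con` is assumed to be the ODBC connection from earlier in the demo.
ufo <- tbl(con, in_catalog("demos", "nuforc", "nuforc_reports"))

# head() is translated rather than executed locally; show_query()
# prints the Spark SQL that would run on Databricks.
ufo |>
  head() |>
  show_query()
```

Because the reference is lazy, dplyr verbs build up a query that executes on Databricks, and nothing comes back to R until you explicitly ask for it.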

So if we wanted to explore this UFO sightings data and gain some insight from it, we could do something like take the NUFORC data, filter by date time, and then calculate by month how many UFO sightings there were in the dataset, and then we could explore what that might look like. Again, we can keep all the data inside of Databricks to perform the aggregation and then finally collect the resulting aggregation into our R session, where we can work with other tools within R, like ggplot2, to create some sort of visualization of whatever it is we're investigating. In this case, we're asking how many UFO sightings there were by month and year in this dataset. And we can see, if we look at this, there's not a lot up until about 2000. Then things started to increase, and around 2009 or 2010 things really kind of ramped up, though it maybe died down a little bit in the past couple of years.
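A sketch of that aggregate-then-collect workflow might look like this, assuming `ufo` is the lazy table reference from the demo and that the table has a `date_time` column (the column name is a guess, not confirmed by the talk):

```r
library(dplyr)
library(ggplot2)

# Aggregate on Databricks, then collect only the small summary into R.
monthly_sightings <- ufo |>                      # lazy table reference
  filter(!is.na(date_time)) |>                   # `date_time` is assumed
  count(year = year(date_time), month = month(date_time)) |>
  collect() |>                                   # summary comes back to R here
  mutate(month_start = as.Date(sprintf("%d-%02d-01", year, month)))

# Visualize locally with ggplot2.
ggplot(monthly_sightings, aes(month_start, n)) +
  geom_line() +
  labs(x = NULL, y = "Reported UFO sightings per month")
```

Note that `year()` and `month()` are translated to Spark SQL and run on Databricks; only the per-month counts cross the wire.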

Publishing data apps to Posit Connect

One thing to note: as a data scientist, this is how I typically spend my time. I work interactively with some sort of dataset until I find some insight or actionable piece of information in it. At that point, I want to be able to share that insight, whatever it is that I found, with other members of my organization. That can take a number of different forms. It could be an email. I could build a PowerPoint presentation. I could send a Slack message. There are all kinds of different ways that I can distribute information. But what we've found at Posit is that the most effective way to distribute information is to provide something that's reproducible and lasting, that users can come back to and reference whenever they need to. And it's difficult to do that with something like an email that can get buried in an inbox or a Slack message that can get buried in a Slack thread.

To help support that idea, we have a product I talked briefly about earlier called Posit Connect. And Posit Connect would allow us to do something like this: here's an example where I can take a Quarto document. So this is Quarto, which is a kind of successor to R Markdown, a way of interspersing code with plain text and prose. I'm not going to dive too deeply into this, but this is a preview of a new feature that's coming in the next release of Quarto that allows you to create dashboards. So I can create this document here that contains R code interspersed with plain text. If I look at this document locally, so I'm just going to go ahead and render this, this will create a dashboard that allows me to view some key summaries of, in this case, this UFO data.

And once I've created this dashboard locally, right, once I've got all the code working and I have this rendering and looking the way that I want it to look, I can easily publish and distribute this through something like Posit Connect. Once this is distributed on Posit Connect, let's open this up here. So here's an example of this dashboard. I can see over time percentage of reported sightings by different countries. I can also break this down by day of the week or by hour of the day. Looks like most sightings happen in the evening, which is to be expected. Much harder to see bright UFOs in the middle of the day. I can see a summary of the different types of ways in which these sightings are described, as well as the different ways in which the shape of the UFO has changed over time over here on the right hand side.

So this gives me a really nice way to package up some information that might be highly relevant to different members of my organization and then share and distribute it with them. Let me come back in here for just a moment to show what that looks like. There are a number of different ways you can deploy and share content on Posit Connect. But if you use the rsconnect package, there are a couple of key advantages, particularly when it comes to working with Databricks-backed content like a Quarto dashboard that queries data on the Databricks side. One is that I can define the Python environment that I want Posit Connect to use. One of the advantages of Posit Connect is that it will capture a definition of whatever environment I have locally, including the R packages and their versions and the Python packages and their versions, and it will send that definition to Posit Connect. When Posit Connect tries to publish whatever it is that I submitted, in this case this Quarto document, it will first recreate the environment it needs to run that content, and it will do this in isolation. So I do not have to worry about managing a complex network of dependencies on Connect. Instead, every piece of content has its own collection of dependencies. There's a shared cache, so I'm not duplicating work. But it means that everything can operate in isolation without interference with other content.

So if this dashboard uses some version of PySpark and another dashboard uses a different version of PySpark, there's not going to be a conflict when it comes to publishing and sharing on Posit Connect. The other advantage of using this rsconnect package is that it allows me to pass local environment variables in a very secure way to Posit Connect. So this is how I can pass information about how I'm making the connection to Databricks, so that Posit Connect can make that connection itself.
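A deployment call along those lines might look like this; the file name and content name are hypothetical, and `envVars` is the rsconnect (1.0+) argument that forwards the values of the named local environment variables to Connect:

```r
library(rsconnect)

# Publish the Quarto dashboard to Posit Connect. `envVars` sends the
# *values* of these local environment variables to Connect securely,
# without writing them into the deployment bundle itself.
deployDoc(
  doc     = "ufo-dashboard.qmd",             # hypothetical file name
  appName = "ufo-sightings-dashboard",       # hypothetical content name
  envVars = c("DATABRICKS_HOST", "DATABRICKS_TOKEN")
)
```

This is what lets the published dashboard re-run its Databricks queries on Connect using the same credentials you used locally, without ever committing them to code.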

If I look, so here's an example of this document, this dashboard now published and shared on Posit Connect. And now I can come over here and I can say, let's add Rafi to this dashboard. We could have other members of our organization. We could add groups and users. We could be very specific about who has access to this dashboard. We could schedule this dashboard to refresh every week or every day or however often the underlying data is updating. How often do we want to be informed of the latest UFO trends? We can define all of that here inside of Posit Connect and end users can come visit this whenever they want to. It's a specific URL they can log into and view.

Just to highlight one other thing we could do in this same vein: if we wanted to provide something that was a little bit more interactive and less static than the dashboard, we could also offer a Shiny application. So we could build an interactive web application in R or Python, publish that application to Posit Connect, and that application would enable users to come and investigate this data on their own and ask questions on their own. So here's an example of a Shiny application, using this data that's being hosted on Databricks, where we can take a look at, OK, I live in Utah. Here's the collection of UFO sightings that have happened in Utah.
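A minimal sketch of such a Shiny app is below. The `state` column, the table reference, and the app structure are all assumptions for illustration, not the demo's actual application:

```r
library(shiny)
library(dplyr)

# `ufo` is assumed to be a lazy reference to the Databricks table,
# e.g. tbl(con, in_catalog("demos", "nuforc", "nuforc_reports")).
ui <- fluidPage(
  titlePanel("UFO sightings by state"),
  selectInput("state", "State", choices = state.abb, selected = "UT"),
  tableOutput("sightings")
)

server <- function(input, output, session) {
  output$sightings <- renderTable({
    # Each change to the input re-runs this query on Databricks;
    # only the 20 requested rows are collected into R.
    ufo |>
      filter(state == !!input$state) |>   # `state` column is an assumption
      head(20) |>
      collect()
  })
}

shinyApp(ui, server)
```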

And this type of interactivity allows end users not just to ingest the work that's been done for them, but also to ask their own questions and get their own answers from the underlying data. And all of the data here is being stored and managed on Databricks through the data governance capabilities that Databricks has to offer. And queries are being submitted and executed against Databricks compute.

Demo recap

OK, so let me just take a minute and summarize a little bit of what we've chatted through. What do we have today? Well, Rafi talked to us about VS Code and some of the changes and opportunities that exist within VS Code and Databricks, the VS Code extension for Databricks. We talked about RStudio and the Databricks pane and the way in which Posit Workbench can manage authentication and credentials for Databricks users. And these are all things that are available either today or in the very near future with the upcoming release of Posit Workbench. We also talked a little bit about sparklyr and all the new changes we introduced there. There's been an incredible amount of work that's gone into the sparklyr package over the past few months to support this new Databricks Connect functionality and to enable our users to seamlessly create these remote connections into the Databricks environment. And then finally, we're simplifying the ability to connect to Databricks through ODBC by providing our own