Resources

Zac Davies - Elevating enterprise data through open source LLMs

video
Oct 31, 2024
18:02

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi guys, my name is Zac, I work at Databricks, and I'm going to talk to you about how you can take the data you've all hopefully got sitting there as part of your organisation and use it with LLMs to make it something more than the sum of its parts.

So data isn't anything new. We all work with data, more than likely, and it's important, we get it, it's our job. Data is the new oil; we've heard that for the last 15 to 20 years, but with ChatGPT over the last 18 months, everything started to change. We've got this hype cycle, and it's pretty much here to stay, by the way. And at Databricks, we're seeing this story time and time again, which is that your competitive advantage as a company is tied to your data maturity.

So if we have this chart, on the x-axis we'll have data maturity, and on the y-axis we've got your competitive advantage, and we paint this lovely picture where, as you get more mature, you're going to be more competitive. And it's really about these two extremes. On the left, you've got this data silo where things are really hard, life is difficult, you've got lots of tooling, lines of business may not be communicating with one another; it's just difficult. And on the right-hand side, we have the polar opposite. You're data-forward, you're using AI, you've got metrics and insights, and everything you do in the company has data as part of the decision.

Everyone wants to be on the right-hand side; you want to be on the right-hand side. Quick show of hands: who feels like they're in a company at the moment that's trying to be on the right-hand side? Yeah. There are a few of you. Keep your hand raised if that was done predominantly through competitive pressure, like, if we don't do this, we're going to get left behind; our competitors are doing it, everyone's doing it, so we've got to be doing it too.

And the last question here: how many of you have been in a meeting where you've been told by someone to work on a project, and you've heard something along the lines of, we need to be doing more with AI, we can solve this problem with an LLM, can we do this? And leadership wants to see AI features.

So you end up in this situation where you've got all of the LLMs, you've been told you need to do something, and you have your data and the LLM. And the data is your competitive edge, like, hopefully no one else has this. I'm sure your company would be displeased if that data was with everyone else. And you've got to keep everything governed, of course.

What is RAG?

And you kind of end up naturally gravitating towards RAG, which is Retrieval Augmented Generation. And if you haven't, it's probably a good starting point as well. To explain what RAG is, let's start by looking at what it isn't.

So in a traditional LLM flow, we might ask a question: I've been hearing a lot about Positron, what on earth is that? I'll ask that of the LLM, it gets fed in, and then we get an answer back. And in this case, this is not what I was expecting at all. I didn't want to know about the physics of the positron, the electron's antimatter counterpart. What I wanted to know about is an IDE.

Now for RAG, it's a little bit different, but it's pretty much the same. My question is still there: what is Positron? But the difference is it has access to the data. You set it up so that it knows how to talk to your data. And this is where things get interesting, and this is where the retrieval part of the name comes from: it's looking for relevant documents. Then the augmentation part comes in, which is taking your original question and augmenting it with that data. So it might be that you're an employee at Posit, and it's got access to all the internal documentation, if this was a year ago. Or it might be that it's just got documentation from the website. And then you get back your answer, and that's the generative part. And in this case, that was what I was looking for.
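The retrieve-augment-generate loop described here can be sketched in a few lines of Python. This is a toy illustration, not the code from the talk: the documents and the keyword-overlap scoring are stand-ins for a real embedding model and vector database.

```python
import re


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question (a stand-in
    for real vector similarity search)."""
    terms = set(re.findall(r"\w+", question.lower()))
    scored = sorted(
        documents,
        key=lambda d: len(terms & set(re.findall(r"\w+", d.lower()))),
        reverse=True,
    )
    return scored[:k]


def augment(question: str, context: list[str]) -> str:
    """Build the augmented prompt that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}"
    )


docs = [
    "Positron is a next-generation data science IDE from Posit.",
    "A positron is the antimatter counterpart of the electron.",
]
question = "What is the Positron IDE for data science?"
prompt = augment(question, retrieve(question, docs))
print(prompt)
```

With retrieval in place, the IDE document outranks the physics one, so the generation step sees the right context, which is exactly the behaviour the talk contrasts with the plain LLM flow.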

So the benefit here is you get to use the data with something that's already there, right? These models are there; you don't have to go and fine-tune them or train them to do RAG, you can just add in your data and off you go. You've kind of supercharged everything. And you can change that data in real time. It's flexible, it's dynamic. You don't necessarily have to retrain a model to keep things up to date.

Security and compliance requirements

Maybe I can just shove everything into OpenAI and get on with my day. I know that they can do this. Let's do that. Sounds good. Oh, but the legal and security teams have something to say about this. There are a lot of stringent requirements you probably have in your organization, and they want to make sure that you're following the internal guidelines. Things like data residency and industry-specific regulation: do you have to abide by HIPAA, PCI, or GDPR compliance? You've got to have audit logs set up. You've got to have the model under the right license; it potentially has to allow commercial use.

And then the security team is saying you've got to have SSO, and you've got to have everything remain within the organization's tenancy, with access controls. I mean, when I access the application, I should only be able to see what I'm allowed to see, not necessarily what you might be able to see. That's the big one. Monitoring and logging, encryption, and really it boils down to control and trust. You want to be in control, and you want to trust as little as you can.

So things don't sound so simple anymore. I'm going to need more than just an API and my data. So what are the components that are required? It sounds like a lot. So the data, we get that. We've talked about it. You're going to need something to take that data and chunk it up into nice, lovely pieces. You might even want a place to train a model if you're so bold. You need a place to take those chunks and put them into what we call an embedding model to make what we call vectors. And you'll need a place to have the large language model, of course. Then you'll need a place to host those models. You need the vector database. You need a place to monitor it. You need a chain to tie it all together. You need an application. You need auditing. And then you'll need a place to host the application.
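Of the components just listed, the chunking step is the easiest to make concrete. Here's a minimal sketch of a fixed-size overlapping chunker; the character-based sizes are illustrative assumptions, and real pipelines often chunk on tokens or document structure instead.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap, so an idea
    cut at one chunk boundary still appears whole in the next chunk."""
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]


document = "".join(str(i % 10) for i in range(500))  # dummy 500-char document
chunks = chunk_text(document)
print(len(chunks))  # → 3
```

Each of these chunks would then go through the embedding model to become a vector, and the vectors land in the vector database that the retrieval step queries.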

The tooling: Posit Connect and Databricks

It's still a lot, right? And how do you pull this all together and then keep it in check? Well, thankfully, it's not as daunting as it sounds. The tooling that I'll show you a quick example of today is Posit Connect with a Shiny for Python application, and Databricks for the rest. And really, that'll be using something like LangChain combined with MLflow. I'm not going to show you any code; there are examples you can get from the links at the end. But really, Databricks will be handling all the things like the model serving, the vector database, and the data processing, and making that compliant with all the logging and so on that you need. And then Connect is going to have everything behind your application secured and authenticated.

Demo

So let's have a look at a little bit of a demo. Here, we're just going to jump into Posit Connect and click on my application. And I've got a little bit of a chat app here. It's using one of the new components from Shiny for Python. And the first thing is at the bottom, you can see that it's running as me. So I'm going to ask my question, which is: how do I create an AI/BI Genie space? Now this is a preview feature in Databricks. It wasn't in the public documentation when the model was created. And my documents, it can't see them. So it really has no idea what's going on, or so I would expect.

It's giving me an answer, but it's not looking correct, right? There's no specific information that it found, but it's going to go ahead and try anyway, which is not really what I wanted. And if I turn off RAG with the flick of a switch and try the question again, it's just going to hallucinate. It's going to make something up, because it does what it can. So this is what I see, and it's not ideal, but it works to some extent, right? It's doing what we asked of it.

Now if I flick a switch, let's go and look at what my colleague sees, from a different point of view. I have a colleague named Rafi. He works closely with this feature on the team. So hopefully he would have access to the data and know a little bit more about it. If he goes ahead and asks the same question, making sure RAG's enabled, it's going to tell him how to do it. It's going to have access to that data, it's going to find it, and it's all going to work correctly in giving the right response. And he can ask a follow-up question, and it's going to give the correct answer there too, with a whole lot of steps.

So the important thing here is that I've got something in Posit Connect with Shiny for Python, relatively straightforward, I just hooked it up to the LLM and to the chain I had built with my data, and it already had the correct access control, behaving as me or as my colleague, with all the permissions there as expected.

The other nice bit is that within Posit Connect, it's set so that everyone within the organization who's got access to Connect can see the app. I'm not trying to hide it from a specific subset of users. I'm relying on the integration between Databricks and Posit Connect, and authentication, to do that, plus all the nice logging and auditing that's there within Posit Connect.

Now, one nice thing from the Databricks side I'll show you is what we call an inference table. Every time I ask a question in my application, I'm generating data. I've got the question I asked it and the answer, and it turns out that when you go and train these models or fine-tune them, that's the information you need. So when you have these applications, you're actually creating this cycle where you're generating data that you can later use whenever you're ready to improve something, or at least to monitor the situation. Here you can see an example of what that looks like in a table, and that's done for me automatically. I just have to toggle a box, and it goes and creates this table, and you can see that it's got the request, the response, who did it, when, and all the other stuff. So it's an important component for auditing, but also very useful for that edge you want to create for yourself as an organization.
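To make the idea concrete, here's a hedged sketch of the kind of row an inference table captures: the request, the response, who asked, and when. Databricks creates and populates the real table automatically; this simplified schema is an assumption for illustration only.

```python
import datetime

# An in-memory stand-in for an inference table: one row per model invocation.
inference_table: list[dict] = []


def log_inference(user: str, request: str, response: str) -> dict:
    """Append one row of request/response metadata, inference-table style."""
    row = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "request": request,
        "response": response,
    }
    inference_table.append(row)
    return row


log_inference("zac", "How do I create an AI/BI Genie space?", "Open the workspace and ...")
print(inference_table[0]["user"])  # → zac
```

Rows like these are exactly the question-and-answer pairs you'd want later for fine-tuning or monitoring, which is the data-generating cycle the talk describes.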

Architecture and governance overview

So moving on and looking at things in review from an architecture perspective: we have the users, they're integrated, and they're working with Posit Connect and the application I've built with Shiny. In this case, it's using the Posit SDK with one of the integrations for Databricks, communicating with Databricks and talking to this chain model that I've built via an SDK that we have. That chain talks to all the other components on the Databricks side, which are the embedding model, the vector database, the LLM, and of course the inference tables I just mentioned, and the data as well. The data feeds the vector database, and that updates when I decide.

From a governance and security perspective, we've made sure to tick all these boxes. We've got the application again, and the first port of call is the SSO that we've got set up in Posit Connect. I can only access this if I'm part of the organization, and it's going to automatically have my user credentials embedded. That token is used for all of the processes within the application, and it makes sure that when I ask that question again, what is Positron, and it does that check against the data, it's doing it as me, and it's only seeing what I should be able to see. We can regularly update that data from secure sources, and we don't have to worry about things going to people who aren't meant to see them. When we feed that to the model again for our augmentation phase and get our answer back, we've audited and logged all the actions that are going on.

Summary

In summary, there are three pillars that we've gone through and that you should probably think about. The first is your data, and we know that it's your competitive advantage. I've mentioned it a bunch of times, and when you combine it with LLMs, you start to get something that's far more useful than either part individually. It's really, really important that you do this sooner rather than later, because you're going to be creating this cycle where you're generating data for yourself, and you can use that whenever you see fit.

From the model perspective, we've got these open models, and you need to have something with a permissive license. You need to be very careful here, because not every model will let you use the outputs for something like that cycle I just described. You want end-to-end control, or maybe your organization does, and you want to be able to customize things as required. You don't want to relinquish control to a third party. You want to make sure that you're capturing this data, because again, it is your competitive edge. And from a governance and security perspective, we need to make sure that everything is behind authentication; access controls are key here, along with audit logging.

That's it from my side. Thank you. There's a QR code for the slides, and you'll find code for doing this in Databricks, plus some of the examples from the Posit Connect side, at those links. If you have any questions, you can either email me at that address or come and find me for a chat.

Q&A

Thank you, Zac. Do we have any questions coming from the crowd? Just give it a minute. I don't see anything right now, but I'll give you all a minute to ask anything.

I guess one thing while we wait, I showed you how to do it in Databricks. I talked about how to do it in Databricks, but it'll work with a lot of other tooling as well. I guess it's just mainly the thought process about how you go about it.

With this one-stop solution, how can the end user get insights on the specifications of the Databricks black box? Not sure what you mean by the black box, but maybe we can have a chat about it.

I can understand how connecting Databricks allows authentication and specific user access, but how can we get our organization on board with using an open source permissive model? What are some examples of these open source models? Yeah. So one of the best models that hits all the checkboxes in my mind at the moment would be Llama 3.1 specifically. They changed the license slightly between 3 and 3.1. It's got kind of the broadest capabilities for you to do something with it as a model, and you can actually use the outputs to train something else yourself. Whereas if you're using something like ChatGPT, and you should read the licensing, there's a gray area where it says, hey, you can't use this to build something that would compete with us, although you own the outputs, and all these other various things. So you really will need to go and read the license. The good thing is, with RAG, the model doesn't need to be the absolute best. It just needs to be pretty good. Llama 3.1 is pretty good.

Can the data set stay in the local file system, or must it go to Databricks? You can use whatever you want. You can do this without Databricks entirely. So if you wanted to run your own vector search locally and you had something like DuckDB with a vector extension, and many other databases have one now, you could use that. There's no requirement for this to be in Databricks at all. Again, you can do RAG anywhere. You can do it locally. The example I showed was just specific to Databricks from an enterprise perspective.
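The local vector search mentioned in that answer can be sketched in pure Python: embed documents as vectors and rank them by cosine similarity against the query vector. The hand-made three-dimensional vectors here are toy stand-ins for real embedding-model output.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query: list[float], index: dict[str, list[float]]) -> str:
    """Return the document id whose vector is most similar to the query."""
    return max(index, key=lambda doc_id: cosine(query, index[doc_id]))


# Toy "vector database": document id -> embedding vector.
index = {
    "ide_doc": [0.9, 0.1, 0.0],
    "physics_doc": [0.0, 0.2, 0.9],
}
print(nearest([1.0, 0.0, 0.1], index))  # → ide_doc
```

A real local setup would swap the dictionary for a vector store such as a database with a vector extension, but the similarity ranking underneath works the same way.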

I think we have time for one more question. What's your experience with citing sources with RAG? So you'll need to use a little bit of LangChain magic, I think, to make those come up. I didn't do it for my demo, but we have examples somewhere. So if you have a particular need for that, I can probably dig it up. Well, give it up for Zac, Zac from Databricks.