Resources

Rafi Kurlansik @ Databricks | Data Science Hangout

video
May 14, 2024
1:01:59

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi everybody, welcome back to the Data Science Hangout. I'm Rachel Dempsey and I lead Customer Marketing here at Posit. I've actually learned that some people are hearing about Posit through the Hangouts, so I've started just adding this here in the beginning. If Posit is new to you, we're the open source data science company building tools for the individual, team, and enterprise, and I'm so happy to have you joining us here today.

The Hangout is our open space to hear what's going on in the world of data across different industries, chat about data science leadership, and connect with others who are facing similar things as you. We get together here every Thursday at the same time, same place, unless it's a holiday. If you're watching the recording on YouTube and want to join us in the future, there are details below to add it to your calendar. Just make sure it adds it for 12 p.m. Eastern Time so you can join live.

I'm learning from the Hangout survey, which I'll share with everybody in the chat here, that people really enjoy connecting with other attendees in the chat. So if you are interested in connecting with others and want to share your LinkedIn or whatever, I encourage you to say hello there and maybe briefly introduce yourself, your role, where you're based, something you do for fun, too. We're all dedicated to keeping this a friendly and welcoming space for everyone and love to hear from you no matter your years of experience, titles, industry, or languages that you work in.

It's totally okay if you just want to listen in here, but there are also three ways you can jump in to ask questions today or provide your own perspective. You can raise your hand on Zoom; if you need a refresher on how to do that, there's a Reactions button in the Zoom bar below, and if you click that, you can raise your hand. You can put questions in the Zoom chat, and if it's something you want me to read out loud instead, just put a little asterisk or star next to it, maybe if you're in a coffee shop or something. And then we also have a Slido link where you can ask questions anonymously, and I'm sure our co-hosts will share that in the chat in just a second.

I'll add this real quickly: I love getting to hear from and learn from all of you, and I really value your feedback about how we can best support this awesome community. So after the Hangout today, if you would be open to sharing your feedback, I would love to hear from you. I just shared the quick Google form in the chat, and I'll share it again at the end too.

With that, thank you so much. I'm so excited to be joined by my co-host today, Rafi Kurlansik, Principal Product Specialist at Databricks, where he specializes in data science, machine learning, and the developer experience. So, Rafi, I'd love to kick it off by having you introduce yourself and share a little bit about your role, but also something you like to do outside of work too.

Rafi's background and career journey

Sure. So, thank you, Rachel, thank you Posit for having me. Thank you everyone for the opportunity to meet you and have a conversation with you all. Looking forward to hearing your questions and having a nice discussion.

So a little bit about myself: I actually did not go to school for computer science or anything like that. I went to school for nursing. I graduated with a BSN from Adelphi University on Long Island, and then I promptly wound up not going into nursing, and instead trading commodities, oil and gas futures, for seven years. I would have kept doing that, but essentially around 2013, algorithmic trading really started taking over, and it was very hard for me to make any money. So after looking into maybe getting a job at a firm, I discovered that all the job postings were talking about R and Python and random forests and all these different algorithms, and I had no idea what any of that was.

So I went and took the Johns Hopkins data science specialization on Coursera. That was where I learned how to program in R; it's the first language that I learned to write code in, unless you count HTML. And from there, I realized that data science really is universal, like it applies to every industry, every sector of the economy.

So then I wound up getting a job in the Philadelphia area at a hospital system doing data analysis, on problems like estimated length of stay for an inpatient, somebody who goes to the hospital, how long should they be staying there, and other problems like, how can we predict who in the outpatient population is going to be most likely to be readmitted to the hospital. So I worked on some data science problems there.

And then I wound up transitioning into field engineering, or technical sales, or sales engineering; there are different ways of saying the same thing. I joined IBM in 2016 and worked there for two and a half years. I learned all about the world of technology beyond the limited sort of applications in healthcare, and learned a lot about enterprise sales and all that kind of stuff.

And then from there, I went to Databricks in 2019. I went to Databricks because I always really appreciated open source technology, and Databricks' founders created several very popular open source technologies. I've been at Databricks for five years. What I do here now is still kind of in field engineering, so I work with salespeople to help customers understand whether Databricks is a good fit for them and help them adopt the technology.

But my particular role now as a product specialist is sort of like a liaison between R&D and field engineering. So I spend some of my time talking with product managers and engineers about the stuff that they're building, understanding it, giving them feedback from customers, then sharing the things that they're building with the field, and then also taking all the feedback from the field and bringing it back to the product and engineering teams. So that's my professional journey.

And as far as things I like to do outside of that, this morning, I planted a whole bunch of vegetables that I started from seed in my backyard and some raised beds that I have. So it was a really, really nice way to start the morning in May.

The Posit and Databricks relationship

Well, thank you so much, Rafi, for the introduction and for sharing a little bit about your journey into the world of data. I was so happy to see on your LinkedIn when you mentioned it's been a highlight of your career to work with Posit and the team at Databricks to bring our companies closer together. And so I was just curious to ask you, what is it about this relationship that's so important to you?

Oh, for sure. I'm so glad you asked that. I tried to answer that last year at Posit Conf, and I think I did an okay job. Maybe I'll do a better job this time.

So going back to that realization that I had when I first started learning R and more about data science, I really do believe in general that science is a very solid path for humanity to make progress on the problems that we have, to make our lives better, to understand the world more. And open source technology is so great for that because it democratizes it and makes it more accessible to more people. The more brains that we can have working on these problems, and the more accessible the technologies are, the better. So I very much see R and RStudio and Posit as part of that.

And then when I joined Databricks, I learned that the founders initially, when they invented Spark, which is the main technology that Databricks got started with, and it's come a long way from there, they wanted to give it away. They did make it open source, and they were going to these different companies trying to say, hey, use this awesome technology. And the companies kind of pushed back and were like, hey, that's a nice science project you have here, but there's no enterprise support, there's no adoption of this, there's too much risk. So they decided, okay, well, we'll make a company and build a commercial offering on top of it.

But the spirit of that was definitely, we want to democratize the power of big data and these technologies so that people can work on really difficult problems and not be limited by the amount of data that they can process. So I think the two companies actually work really well together, because they both share that same fundamental view of the world, which is that technology can facilitate the cultivation of knowledge. I really think there's a lot there.

The two companies actually work really well together because they both share that same fundamental view of the world, which is that technology can facilitate the cultivation of knowledge.

I love that. The shirt that I chose today says the future is open. Let me see, here it is over here. Future is open. And this is from our Data and AI Summit a couple of years ago. I think there's just a lot of good that comes from open source technology.

Prioritizing work as a principal product specialist

So I get to ask you questions while we wait for questions to come in from everybody here. But in case you joined after I mentioned it: you can ask questions right in the Zoom chat, you can use the Slido link that we'll share, where you can ask anonymously, or just feel free to raise your hand here on Zoom, and we'll be on the lookout for raised hands.

But I appreciate you also explaining a little bit about what it means to be a principal product specialist. And it sounds like there's a lot of different things that go into that. You mentioned R&D, field engineering. How do you prioritize what you work on in that role?

That's also a very good question. There are company goals; Databricks has certain products that we as a company want to make successful, and that's always a useful way to prioritize your work. Beyond that, what I try to do is find areas that people are not paying a lot of attention to that do need support, and then go work on those.

So for example, Gen AI is amazing, super interesting and magical technology. Everybody's paying attention to that; there are plenty of very smart people working on all of it. An area where people are maybe not spending as much time is something like, how do you use Databricks with an IDE? There's been a lot more effort put into that over the past couple of years, and I've been working on it for the past couple of years. That's just one example, but things that may get overlooked, those are the areas that I try to focus in on a little bit more.

Data governance vs. data stewardship

So in quite a few Hangouts, we've talked a bit about data stewardship and data governance and what those words mean. We actually have a roundtable conversation with the community next week on data stewardship at the individual level. And I was just wondering, how do you distinguish between data stewardship and data governance, or how would you explain data governance?

I mean, my initial thought is, governance is more of a security kind of thing: ensuring that the right people have access to the right data. And it's for a few different reasons. It could be a risk to the company to let certain data out, like financial data. But there can also be compliance risks, with things like personally identifiable information. So governance is ensuring that you don't leak any of that, that you're fully compliant, and that you're avoiding risks, I would say.

Stewardship to me, I actually don't have a great definition for, but what it sounds like to me is something more on the individual level. So say I'm tasked with a project, a certain use case that I've been asked to build out. In my experience, the way this usually starts is, okay, here are some tables in the data lake or the data warehouse or whatever, this is where you can go find the data, and you get started with that. Inevitably, you're going to wind up creating some sort of derivative datasets from there. And some of the question becomes, are you going to be a good steward of that data or not?

So what does it mean to be a good steward of that data? To me, that means: are you clearly documenting everything associated with it? Are you documenting the code that you wrote to transform it into the derived datasets? Are you documenting the datasets themselves? Databricks has pretty good support for these kinds of things, but if nothing else, make a page in Confluence or a Google Doc or a Word doc, whatever it is, where you actually describe: okay, this derivative table came from these other tables, and this is what these columns mean. And put it somewhere people can actually find it and understand it. That, to me, would be data stewardship.

And obviously, being a poor data steward would be not commenting your code, not commenting any of your tables, and having hard-to-read, obscure column names. Then when somebody finds your data, it could be the most valuable data in the world for your organization, but who's going to know what it means? So that's the distinction: governance is more risk- and security-oriented, and stewardship is maybe closer to data quality or something like that.
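By way of illustration, documentation like that can even live in the catalog itself. Here's a minimal sketch of adding table and column comments to a derived table from R; it assumes `con` is an existing DBI connection to a Databricks SQL warehouse, and the table and column names are hypothetical.

```r
# A small sketch of "stewardship as code": documenting a derived table
# in the catalog so others can discover where it came from and what it means.
# Assumes `con` is a DBI connection to Databricks; names are hypothetical.
library(DBI)

dbExecute(con, "
  COMMENT ON TABLE analytics.readmissions_derived IS
  'Derived from ehr.encounters and ehr.patients; one row per inpatient stay.
   Transformation code lives in the readmissions ETL repo.'
")

dbExecute(con, "
  ALTER TABLE analytics.readmissions_derived
  ALTER COLUMN los_days COMMENT 'Length of stay in whole days (discharge minus admit)'
")
```

Even if you also keep a Confluence page or Word doc, comments stored alongside the table travel with the data, so whoever finds it next can make sense of it.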

What is Databricks?

I'm seeing a few questions coming in here on Slido, so I'll make sure I'll jump over to one of those. And I probably should have asked something about this in the beginning, but somebody asked, how would I describe what Databricks is to my non-technical manager?

Oh, that is a good question. I would say that Databricks offers software that lets you analyze your data at whatever scale you want, data big or small, and it does it in the cloud, so it's easy to use. Or maybe you could say it does it through your web browser, which makes it easy to use.

Learning open source tools and going to production

Well, yeah, thank you for joining, Rafi. I don't want to get too awfully technical, but adding another layer to the stack, adding Databricks, means for some of us that we go on a journey of trying to figure out best practices, you know, knowledge discovery. Could you comment a little bit? You mentioned open source, which to me, in some ways, also means using tools like Arrow. Could you comment on how we learn about these tools and exploit them in the best possible way for something like a production deployment?

Okay, so I think there are two different things there. There's learning about them. So let's say the three main open source technologies from Databricks today are Apache Spark, Delta Lake, and MLflow. Apache Spark is a big data computing engine. Delta Lake is a storage format slash table format. And MLflow is a framework for managing machine learning models throughout their entire lifecycle.

So if you want to learn about these, all three of them have their own open source project websites, and there are tutorials there. There are instructions on how to download and install them on whatever computer you have, like your laptop. You can totally do the hello world examples and maybe even go a few levels beyond that. And of course, now there are tons of videos on the Internet explaining all of these things.
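For instance, a local hello world with Spark from R might look something like this sketch using the sparklyr package; everything runs on your laptop, no Databricks account needed (exact setup steps vary by machine).

```r
# Minimal local Spark "hello world" from R with sparklyr.
install.packages(c("sparklyr", "dplyr"))
library(sparklyr)
library(dplyr)

spark_install()                          # download a local copy of Spark
sc <- spark_connect(master = "local")    # start a local Spark session

cars_tbl <- copy_to(sc, mtcars)          # ship a small data frame to Spark

cars_tbl %>%                             # dplyr verbs run as Spark SQL
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                              # bring the small result back to R

spark_disconnect(sc)
```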

But if you want to go to production, I think that becomes a different story. Because when you want to have an application or some production use case, you have to start thinking about the risk of it failing. If it's something personal on your laptop, there's probably not that much risk in it failing. If it's a use case the business has decided on, we're going to invest, we're going to put all of our people on this and build it out, that's a lot of risk immediately.

So for that, you have to start thinking about DevOps and engineering-type concepts. How do I ensure that whatever system we're installing all this software onto actually scales and is reliable? You have to think about who is going to manage the infrastructure this is running on. And let's say we get it up and running; inevitably we'll have to make some sort of changes to it. How are we going to make sure that when we make changes, we don't break anything? How do we make it secure? These are all much bigger problems.

These are all things that I've seen over all the years that I've been in field engineering. There's this tension between DIY, because you look at open source and think, oh, it's cheaper if we just do it ourselves, versus a managed service, where the managed service takes care of a lot of those things for you. The managed service will say, I guarantee that the infrastructure will be reliable and secure, so you don't have to spend any money on that. You just pay me, and I'll make sure the environment is totally secure and reliable.

So that's the tradeoff you have to think about when you have this interesting, useful technology and ask, how do I get this into production? You have to weigh spending money on people to manage it at your company against paying for a managed service. Depending on what the managed service is, either one could be worth it. But certainly, with customers that lean away from the managed service approach, I find that they're not always thinking about the total cost of ownership, and the full-time employees who have to maintain the infrastructure are pretty expensive. So that's my perspective on it.

Resources for using Posit with Databricks in R

I saw Fariza asked in the chat, and there's a star next to it, so I'll read it: what are good resources for learning to use Posit and Databricks together? I'm finding a lot for Python, but having a hard time finding R resources.

Yeah. So I'm sure Isabella just shared some of the documentation, examples, and blogs that we've put out in the past year. Aside from that, there will be a training that Edgar is running this summer at Posit Conf. I think Edgar's on. Edgar, do you want to tell us a little bit about that workshop?

Yeah, I'd be happy to. Can you hear me okay? Yes, sounds great. I'm actually really excited about it. This is the first time we're going to be doing this class, and I've done other classes before, especially at Posit Conf. We're going to be focusing on how to connect to, use, and interact with Databricks through R, where we'll be covering ODBC connections to the warehouse, accessing Unity Catalog, as well as Spark, using it through sparklyr, and how to do all of that setup.

I'm also very excited that our plan at this point is to make available to all the students their own Posit Workbench instance. That's the professional version of RStudio, and it has the Databricks pane that we just added recently. So you'll get to experience that as well. Very much looking forward to it.
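As a small taste of what that workshop covers, a warehouse connection from R can be as short as the following sketch, using the odbc package's Databricks helper. The HTTP path is a placeholder you'd get from your own workspace, and it assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set as environment variables.

```r
# Minimal sketch: querying a Databricks SQL warehouse from R over ODBC.
# odbc::databricks() picks up DATABRICKS_HOST / DATABRICKS_TOKEN from the
# environment; the httpPath below is a placeholder.
library(DBI)

con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/your-warehouse-id"
)

dbGetQuery(con, "SELECT current_catalog(), current_schema()")
dbDisconnect(con)
```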

Databricks Data and AI Summit

Now, since we're on the topic of conferences, it might be good to chat a little bit about the Databricks Data and AI Summit, if you want to share anything about that, Rafi.

Oh, yeah. So Databricks has a conference every year, called Data and AI Summit for the past four years or so. It's pretty large, usually in San Francisco, and it's also available online virtually for free. So you can buy tickets, obviously, like any other conference, and go there, or you can tune in virtually and catch most of the talks, and certainly the keynotes and all that kind of stuff.

I can tell you, as a Databricks employee, that they always save some pretty exciting announcements for the first day, usually for the keynotes. So if you like what Databricks is doing, or you're interested and you want to see what the next big splash is going to be, the keynotes at Data and AI Summit are definitely the best place to check it out. This year it's June 10th to the 13th.

Prototyping with Databricks

So, yeah, we hear a lot about the advantages at the production level with Databricks and its integrations with Posit and R, but I'm wondering about prototyping purposes. So not thinking about keeping code at production level or maintaining stuff, but more about sketching, really doing proof-of-concept building. What advantages would you say Databricks gives developers that are looking to create applications such as R Shiny or Python apps?

So there are two things that come to mind. The first is that if your company manages its data in Databricks, then your advantage is that that's where you're going to be able to access it. The second is that even if your company doesn't have its data in Databricks, you can use the compute to pull a lot of different things together. And the ease of managing infrastructure is probably some of the best out there. It was definitely the best for a long time, and I still think it's some of the best, if not the best.

The reason I say that is, as someone on a laptop, you're limited by the RAM and CPU that you have on your laptop. When you move to the cloud, all of a sudden you have practically infinite choice of what configuration you want for compute. Databricks makes it very simple: just click, give me this instance with a little more RAM, and restart it. Or if you want a GPU, you just select the GPU and restart. So when you're prototyping, it makes it very simple to get the resources that you need. I think that's the biggest thing.

Whenever I have to do something significant, I use RStudio. But if I just need to quickly explore something or look at something, and the data is in Databricks, then I open up a Databricks notebook and I'm up and running in a few seconds. So those are my thoughts on that.

Excel users and Databricks accessibility

Yeah, the Databricks stuff is always interesting to me. I've noticed Databricks has kind of pitched a pretty big enterprise story. Have you noticed any interesting collaborations between, say, Excel users and teams that are a little more data science focused? Just wondering, how do the Excel users on analytics and BI teams work with Databricks and collaborate with teams that are more Python and R focused?

Yeah, that's a fantastic question. The population of users that are in spreadsheets is probably way bigger than the population writing code in R or Python or SQL all put together for data science and data analytics. I think that's probably still true.

So today, if you're an Excel user and you want to use Databricks, it's going to be challenging. There's probably a way to do it, I've seen some ways, but none of them are super intuitive and great. So for now, what I would say is, at a minimum, to get the most out of Databricks, you should know SQL. The introduction of AI assistants, and there is an AI assistant in Databricks now, definitely lowers the barrier to entry. It makes things more accessible because it can help you write SQL or Python or R and get you started. But it's still going to be hard if you're an Excel user.

Where I think Databricks is going, though, and what's very exciting, is trying to use large language models to let you just use English as your programming language. That's something the CEO, Ali Ghodsi, has talked about: that English is the new programming language. There's something called a data room in Databricks now, where you basically choose the tables that you want to analyze, and then you can just ask questions. Behind the scenes, it'll call functions, actually run the aggregations and whatever code for you, and just give you the results back.

So I think somewhere in between that and something that actually lives in Excel is where the gap will be bridged. But directionally, Databricks is always trying to make things more and more accessible. And I want to add that this is totally consistent with the original vision of putting Spark in open source, right? Make the technology as accessible as possible. That includes Excel users, people who don't necessarily know how to code.

Total cost of ownership with Databricks

I would like to ask a question based on your experience. When you bring Databricks into a customer, how does that impact total cost of ownership, in terms of bringing in people to manage it, the kind of staff that can manage the security of the tool, or consulting services for Databricks to manage these resources for you? How do these conversations normally go when you bring Databricks into a new customer?

Yeah. It very much depends upon the organization and the skill sets that exist there. I worked in the startup segment at Databricks for the first three years, so I would work with companies that were very DIY, very engineering savvy, and they were very hard to sell to, to be honest, because they could do it on their own if they really wanted to; they could stand up everything, and they understood the technology extremely well. So selling to them was different, and it was really making the case that, look, you could do all that, or you could use a managed service, offload a lot of this work, and just focus on the business logic and the actual things that are going to generate revenue for you.

And those customers are much more sensitive to the actual cost because they understand the cloud more. So they understand how the cloud works. They understand how to limit costs. And they're much more just vigilant about all of that.

On the other extreme, you have organizations that know they need to do something about data and AI, and they really want to, and they look at Databricks as the fastest path to do that, but they don't necessarily have the in-house skills to fully administer it, or they don't fully understand how the cloud works. Those are cases where they'll probably need consulting. If they don't have consulting, they'll hit the learning curve of working with the cloud, which is that you left something on, or you chose some huge cluster or huge resource, and then you get a huge bill. I've seen that happen a lot. It's not necessarily a Databricks thing; you just have to learn how to work in the cloud. Every cloud service has that.

So those are the two extremes, and I think the differentiating factor is how much in-house experience you have working with the cloud and with these big data and data science technologies.

I'll share a blog post I wrote back in January 2023 that was all about how to provide sane cost control mechanisms for data science teams on Databricks. There's a feature called cluster policies, which basically lets you constrain the size or the dollar value of any resources that somebody can create. I think it's a nice balance of giving people flexibility, but not unlimited flexibility such that they could accidentally create something that costs $10,000 or whatever.
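To give a feel for what that looks like, a cluster policy is a JSON document of constraints on cluster attributes. This is a rough sketch, with made-up limits and instance types, that caps auto-termination, restricts instance choices, and bounds autoscaling:

```json
{
  "autotermination_minutes": {
    "type": "range",
    "maxValue": 120,
    "defaultValue": 60
  },
  "node_type_id": {
    "type": "allowlist",
    "values": ["i3.xlarge", "i3.2xlarge"],
    "defaultValue": "i3.xlarge"
  },
  "autoscale.max_workers": {
    "type": "range",
    "maxValue": 8
  }
}
```

Anyone creating a cluster under this policy can still pick within those bounds, but can't, say, spin up a hundred-node GPU cluster by accident.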

Career reflections and gardening

Let me take a quick break from the technical questions. And I would love to ask you, Rafi, what did you want to be when you were a little kid?

Oh, wow. A little kid. I don't know if I remember when I was a little kid. I can tell you that in high school, I decided I didn't want to work in a cubicle. I mean, I work in my office in my house right now, and I stare at a computer a lot, so I don't know how successful I've been with that. But I do have a dream of starting some sort of small agricultural business, where I'd try to use some of the things I've learned in technology, but that would definitely not be the main thing. It would not be a startup; it would just be a small business. That's my way to get outside again and not be in a cubicle.

Only a little bit, actually; I think more the other way. Like, I could see interesting ways to apply technology to agriculture. It's not really a case of, I just left tech for agriculture; you'd be in for a world of hurt.

Yeah, I can see some applications both ways. So to answer your question, an interesting thing about gardening is that when you plant something in the ground, it's usually very small, and it kind of stays small for about two years. Then in the spring of the third year, all of a sudden it explodes in growth. What's happening is the plant is putting out roots and building structure underneath, and then it finally has the energy to really expand above the ground. So I think there's a lesson there about patience and about getting foundations in order. It applies to all areas of life, but it also applies to working in business with long-term projects and things like that. You kind of have to wait; you can't be too impatient.

And then the other way, going from tech to agriculture, I'm going to share this with everybody here. If somebody wants to take this idea and run with it, sure, just maybe message me and include me in it. The idea I had is that, as far as I'm aware, there isn't a standard data model for farming. What I mean by that is there are standard data models in certain areas of finance and in certain areas of healthcare, but there isn't really one that I'm aware of for agriculture.

So I think it could be interesting to look at what that would be and how you could use sensors as part of it. So: the type of thing you're growing, consistent real-time data on the conditions it's growing in, geospatial data on where it's growing, and then how the plant is actually changing and growing over time. I really want to do that, to create a dataset like that and then make it open source so that people could analyze it. That's my fantasy.

R support in Databricks

So jumping back to some of the anonymous questions that were asked earlier, there was one that said: Databricks seems to implement new features first in Python or SQL. Will R continue to be supported or enhanced in notebooks, clusters, or new Unity Catalog features?

That is an excellent question. I think R is going to continue to be supported. The level of support is going to be: you'll be able to get data, and you'll be able to use any open source packages to train models on Databricks and manage them. I think a lot of that is just going to be there in Databricks for the foreseeable future. I don't really see that changing.

Where I think things are a little harder, and it's been this way since I joined Databricks, is with things that are a little more cutting edge. That's harder because Databricks is innovating. I'll give specific examples, like model serving on Databricks and Delta Live Tables. There are probably other examples I could think of, but those are very innovative spaces, and the engineering resources of Databricks are really focused on making them excellent and feature complete. Python is going to be the number one language for that, because it's the biggest population of users; that's the biggest population of Databricks users for sure.

I think Python is going to be where you'll be able to use all of Databricks no matter what. With R, you can use everything that's in R; we'll support R and the R ecosystem, but new stuff in Databricks is probably going to come to Python first.

How Databricks and Posit work together

The other anonymous question from earlier was, how would you describe how Databricks and Posit work together?

I think they work very well together, first of all for the developer experience; primarily for that, I would say. Databricks has notebooks, and the notebooks are not just a single-file notebook. There's actually a workspace file system where you can have arbitrary files alongside your notebook, so you get a lot of an IDE-type experience, with relative paths and things like that. You could build an R package or a Python package in Databricks if you wanted to. There's also a variable explorer and a web terminal. There's a lot there, but it's not the same thing as RStudio or VS Code. It's just not the same, and it's not meant to be the same.

I think that if you are an IDE user, Posit Workbench is going to be your best bet. You can connect to Databricks remotely; that's a lot of the work that was done in the past year by Edgar and Tom and their teams at Posit, and by some of the folks at Databricks. For being able to work in an IDE and access data that's in Databricks, compute that's in Databricks, have easy sign-in, and not have to worry about your credentials and all that kind of stuff, Posit Workbench is the best way to do that in the market today, for sure.
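For the Spark side of that remote workflow, a connection from RStudio or Workbench to a Databricks cluster might look roughly like this sketch with sparklyr's Databricks Connect integration. It assumes the pysparklyr package is installed and workspace credentials are in the environment; the cluster ID is a placeholder, and the table queried is one of the Databricks sample datasets.

```r
# Rough sketch: remote connection from an IDE to a Databricks cluster
# via Databricks Connect. Requires sparklyr + pysparklyr, plus
# DATABRICKS_HOST / DATABRICKS_TOKEN set in the environment.
library(sparklyr)
library(dplyr)

sc <- spark_connect(
  cluster_id = "1234-567890-abc123",   # placeholder cluster ID
  method     = "databricks_connect"
)

# Query a Unity Catalog table with ordinary dplyr verbs
trips <- tbl(sc, dbplyr::in_catalog("samples", "nyctaxi", "trips"))
trips %>% count()

spark_disconnect(sc)
```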

On the other end, once you build something that you want to share, Databricks has some good stuff for that. If you want to build a dashboard, you can build one inside of Databricks and share it with people. But if you want to build a Shiny app or some other data app, that's best suited for Posit Connect today, I think.

Where they work really well together, in the big picture, is that you can do your development inside of Posit Workbench using data that's governed in Databricks and compute that's in Databricks. Then when you're done and you want to make the app available and share it with other people, you can use Posit Connect. And on the other side, that Shiny app on Posit Connect is connected to Databricks, querying data that's governed and all that kind of stuff. Those are the two sides of it, and Databricks is in the middle.

Shiny and Quarto in Databricks

I think this question actually fits in there as well: are there any potential further integrations of Shiny into the Databricks UI, or plans for implementing Quarto?

So Shiny, you can run a Shiny app on Databricks, but I would only do that for very lightweight cases. For anything more significant, I would use Posit Connect. There may be an evolution of that lightweight app hosting this year; I would stay tuned for any announcements at Data and AI Summit for that. I don't know too much about it, but if there were an announcement, it would be made there.

Quarto, so I'm going to say that I think Quarto and R Markdown are the best reporting technology in data science. I have not seen anything as good, and I really wish we had something like that in Databricks. You can use Quarto and R Markdown in Databricks, but you don't get the same rendering. You can import an R Markdown or Quarto file as a Databricks notebook, but you don't get the knit functionality where you end up with that nice, beautiful rendered doc. We need more of that. Those are my thoughts on that.

I think that Quarto and R Markdown are the best reporting technology in data science. I have not seen anything as good. And I really wish we had something like that in Databricks.

Data validation and the bronze-silver-gold model

Kanupraya, and apologies if I mispronounce your name, I saw you asked a question that touched on something you were asked in an interview. Do you want to jump in?

Yes, absolutely. So this question has been asked to me a lot of times: how do I validate my data in a project, or how do I constantly do integrity checks or check for integrity issues in my project? How should I answer that, or what should my answer be?

Yep. Important question. So what data validation means to me, as I understand it, is that you can't just rely on anything that comes in from some source system, right? You have to check and make sure the data actually makes sense. For example, let's say that you have a date; this is a classic example. Say you have a date for a sale of some product from your company, and your company started in 2015. Then a record comes in where the date of the product sold is 1970 or 1990, right? That's impossible; it makes no sense. So validating the data means: did you check to make sure that all of it is actually correct and sane?

That's what I think data validation means: checking to see if the data is correct. There are lots of different ways and lots of different technologies you could do this with, but I'm going to talk from a Databricks point of view, which is that we advise our customers to build out three main layers of tables when building any kind of data pipeline or working on any kind of project.

There's this concept of bronze, silver, and gold tables. The idea is that bronze is the data exactly as you found it, as it came to you. You just capture it and put it in the bronze table. It could be flawed, it doesn't matter; you want a copy of the raw data exactly as it is. Then you write some code with logic to run all of the validation: filtering out rows that have impossible dates, dealing with missing values, things like that. The result is a set of silver tables that are complete observations, tidy data, already transformed, with some sort of logic applied to them.

At the third level, the gold tables are essentially aggregates of the silver tables. If the silver tables hold single observations, the gold tables are what you get after a group-by and some sort of aggregation. In this way, as you go further down the pipeline, more and more business logic has been applied. You can always go back to the bronze table and see the source data, even if it was flawed, and you can also see the logic that transformed and cleaned it. So that's what data validation means to me, and that's how we would take care of it: you have to write the code to clean the data up, but you should stage it in these different layers so that each step of the way is very clear.
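As a rough illustration, a bronze-to-silver-to-gold pipeline along those lines might look like this sketch in R with sparklyr; the table and column names are hypothetical, and it assumes an existing Spark connection `sc` (for example, inside Databricks).

```r
# Hypothetical bronze -> silver -> gold pipeline with sparklyr.
# Assumes an existing Spark connection `sc`.
library(sparklyr)
library(dplyr)

# Bronze: raw sales records exactly as ingested; may contain bad dates.
bronze <- tbl(sc, "sales_bronze")

# Silver: validated, tidy observations. The company started in 2015,
# so any earlier sale date is impossible and gets filtered out.
silver <- bronze %>%
  filter(!is.na(sale_date), sale_date >= "2015-01-01", !is.na(amount))

spark_write_table(silver, "sales_silver", mode = "overwrite")

# Gold: aggregates of the silver table for reporting.
gold <- tbl(sc, "sales_silver") %>%
  group_by(product_id) %>%
  summarise(total_revenue = sum(amount, na.rm = TRUE))

spark_write_table(gold, "sales_gold", mode = "overwrite")
```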

And if they asked about consistency in the data, what exactly would they mean by that? It's hard for me to answer perfectly because I don't know the context in which the person was asking the question, but consistency to me, in this context, would mean, let's go back to that date column: if we have January 1st, 2015, you can encode that three or four different ways. It's all the same information, the same data point, an observation of January 1st, 2015, but you could have literally "January 1st, 2015" or you could have "01/01/2015". Being consistent means it should be in the same format, the same structure, the same data type, that kind of thing.
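To make that concrete, here's a tiny sketch in R using the lubridate package to normalize differently encoded versions of the same date into one representation:

```r
# Normalizing inconsistent encodings of the same date into one type.
library(lubridate)

raw_dates <- c("January 1, 2015", "01/01/2015", "2015-01-01")

# parse_date_time() tries each format in `orders`:
# "mdy" covers both the month-name and slash forms,
# "ymd" covers the ISO form. All three parse to the same date.
parse_date_time(raw_dates, orders = c("mdy", "ymd"))
```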

Managing RStudio on Databricks clusters

So I use RStudio through a compute cluster in the web browser a lot on Databricks, instead of connecting to it and doing things locally. And I've noticed that there doesn't seem to be an easy way to close the cluster after X minutes of inactivity; when you launch the cluster that way, you're not allowed to set that. But if I use a regular instance with a Databricks notebook, I'm able to do that. I was wondering if there's any avenue forward to make that feature available, whether there's something in the pipeline for it, or whether, if I switched to Posit Workbench, there's a solution there that I'm not aware of.

So just to make sure I got everything: you're using the hosted RStudio in Databricks? Yeah. Okay. So for people on the call who may not know, Databricks does offer the ability to install RStudio on the driver node, on an instance that you launch in Databricks. You can open it up in your web browser and start working with the data that's in Databricks.

There are some requirements for doing this, though, and they make it harder to work with. One of the biggest is that Databricks compute, by default, will turn off if you're not using it, which is great. But if you want to launch RStudio, we force you to disable auto termination, so it will not turn off; you have to turn it off manually. The reason we do that is that IDEs are stateful. You want to preserve the code that you've written and maybe some data that you've saved locally. We don't want you to walk away from your computer for an hour, have us shut it down, and lose everything. So we make it so that it can't auto terminate.

There are two ways to get around this. The best way, to be honest with you, is to not use this feature. The better path is to use RStudio Desktop or Posit Workbench and set up a remote connection, or to use Databricks notebooks inside of Databricks if you can't do that.

However, if you do want a workaround, there's a package out there called brickster; I'm so glad I got a chance to bring this up. You can use the Databricks REST API to turn clusters off and on. So it's very easy to use brickster to write some R code that identifies your cluster with RStudio on it and shuts it off. What I've advised customers in the past is, at maybe 10 p.m., when you don't think people are going to be using it anymore, you can schedule a Databricks job, or run it locally on your machine as a cron job. It'll hit the REST API and shut the cluster off, and when you come back the next morning, you turn it on yourself and you're good to go.
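A minimal sketch of that workaround, calling the Databricks REST API's clusters/delete endpoint (which terminates, but does not permanently delete, a cluster) directly from R; brickster wraps the same API, and the host, token, and cluster ID here are placeholders:

```r
# Hypothetical nightly shutdown script for an RStudio cluster.
# Assumes DATABRICKS_HOST (e.g. "https://myworkspace.cloud.databricks.com")
# and DATABRICKS_TOKEN are set in the environment; schedule via cron or
# a Databricks job for, say, 10 p.m.
library(httr)

resp <- POST(
  url = paste0(Sys.getenv("DATABRICKS_HOST"), "/api/2.0/clusters/delete"),
  add_headers(Authorization = paste("Bearer", Sys.getenv("DATABRICKS_TOKEN"))),
  body = list(cluster_id = "1234-567890-abc123"),  # placeholder cluster ID
  encode = "json"
)
stop_for_status(resp)  # error out if the API call failed
```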

Career advice

But a question that I always ask, Rafi, and it's one of my favorite questions: is there a piece of career advice that you have either given to somebody or received along your career journey that you'd like to share with us?

Okay, the one that comes to mind is that I would recommend feeling free to ask questions when you don't understand something because you really have nothing to lose, and I'll explain why. If you don't understand something and you don't ask a question, then you will not understand it because you're not going to get the answer. If you don't understand it and you ask the question and you get the answer, then fantastic. Now you understand that. Now you're that much more knowledgeable and that much more capable. If you ask the question and people deride you or basically give you any other response than answering the question, then you know that maybe you're not in the best situation, and that's also valuable information, and you should maybe look for a place where you are free to ask questions. So I think that that's a really powerful thing.

People are often afraid. Why wouldn't someone ask a question? Because they're afraid of looking stupid, or of looking like they don't know everything. The truth is, nobody knows everything. One of the best experiences I had at Databricks was when I first met one of the engineers I admire so much. I was asking him about some things, about how Databricks works, and without missing a beat, he said, I don't know. And I was just like, that's amazing that that's the culture, that you're not afraid to just say, oh, I don't know, that's not my area. So that's the advice I would give. I think it's served me very well, and I think it would serve anyone well.

I would recommend feeling free to ask questions when you don't understand something because you really have nothing to lose. The truth is nobody knows everything.

Tips for transitioning into sales engineering

First and foremost, thank you all for hosting, especially you, Rafi. I feel like I can really relate to you on a lot of levels, because I've been in enterprise sales for a number of years, I'm currently enrolled in a data science boot camp, and I'm trying to make a career transition to sales engineer. My question is: what are some tips or resources to really focus on to help me through my transition to become a sales engineer, especially since I don't really have any skin in the game yet?

Yeah. So there's a book here. So there's a book