Rafi Kurlansik @ Databricks | Data Science Hangout

So I think that those kind of, the two companies I think actually work really well together because they both share that same fundamental view of the world, which is that technology can facilitate the cultivation of knowledge.

I love that. The shirt that I chose today says the future is open. Let me see, here it is over here. Future is open. And this is from our Data and AI Summit a couple of years ago. And I think that there's a lot of, there's just a lot of good that comes from open source technology.

Quarto, so I'm going to say that I think that Quarto and R Markdown is like the best reporting technology in data science. I have not seen anything as good. And I really wish we had something like that in Databricks.

Data validation and the bronze-silver-gold model

Kanupraya, and apologies if I mispronounce your name, I saw you asked a question that touched on something you were asked in an interview. Do you want to jump in?

Yes, absolutely. So this question, this has been asked to me a lot of times about how do I validate my data in a project or how do I constantly do integrity checks or check for integrity issues in my project? So how should I answer that? Or what should be my answer to that?

Yep. Important question. The, so what I think data validation means to me, as I understand it, is you can't just necessarily rely on anything that you get coming in from some source system, right? You have to check and make sure that the data actually makes sense. So for example, let's say that you have a date. This is a classic example. Let's say you have a date for a sale for some product from your company and your company started in 2015. You have a record come in that the date of the product sold is like 1970 or 1990 or something like that, right? So it's impossible. It makes no sense. So validating the data means did you check to make sure that all of this is actually correct and sane?

That's what I think data validation means. It's like checking to see if the data is correct. And the way that you would do this, there's lots of different ways you could do this, lots of different technologies you could do this, but I'm going to just talk from a Databricks point of view, which is that we advise our customers to basically build out like three sort of main layers of tables when you're building any kind of data pipeline or working on any kind of project.

There's this concept of a bronze table. There's basically bronze, silver, and gold. So the idea is that bronze is data as you found it, like as it came to you. So you just capture it as you found it. You put it in the bronze table. It could be flawed, whatever. It doesn't matter. You want to get it exactly as you found it so you have a copy of the raw data as it is. Then you have some sort of code that you write that has logic to run all of the validation and check to make sure that you basically are, you know, filtering out rows that have incorrect dates, stuff like that, right? That's going to, and like dealing with missing values, things like that. Then you're going to have a set of silver tables that are complete observations, they're tidy data, and they've already been transformed. They've already had some sort of logic applied to it.

At the third level, the gold tables are essentially aggregates of the silver tables. So if you have single observations, your gold tables are going to be, if you did a group by and you made some sort of aggregation, that would be your gold level tables. So in this way, you have a pretty clear, it's sort of like as you go further in the pipeline, you know, more and more business logic has been applied to it. You can always go back to the bronze table and see the source of where it came from, even if it was flawed. And then you also can see the logic that transformed it and made it clear. So that's what I think data validation means, a meaning of data validation. And that's how we would take care of it is essentially you have to write the code to clean it up, but you should stage the data in these different tables, these different layers so that it's very clear each step of the way.

And, and if they asked about like consistency in the data, what do they exactly mean by consistent consistency in the data set? I think consistency in this context, again, it's hard for me to answer you like perfectly because I don't know the context in which the person was asking you the question, but consistency to me in this context would mean like, let's go back to that date column. So, you know, if we have like a January 1st, 2015, you can encode that like three or four different ways. So it's all the same information. At the end of the day, it's the same data point. It's an observation of January 1st, 2015, but somebody, you know, you could have literally January 1st, 2015, or you could have 0, 1, 0, 1, 2015. So having it be consistent means it should be in the same, it should be in the same format. It should have the same data. Same structure, same data type, that kind of thing.

Managing RStudio on Databricks clusters

So I use RStudio through a compute cluster in the web terminal a lot for Databricks instead of like connecting it and doing it down locally. And I've noticed that there's not an easy way, it seems, to close the cluster after X minutes of inactivity. Like when you end up trying to launch the cluster, you're not allowed to do that. But if I do a regular R instance and then I use Databricks Notebook, I'm able to do that. I was wondering if there's any avenue forward to make that a feature available, or if there's a pipeline for that, or maybe I switched to Posit Workbench and maybe there's a solution there that I'm not aware of.

So just to make sure I got everything, you're using the hosted RStudio in Databricks? Yeah. Okay. So just for people who may not know on the call, Databricks does offer the ability to install RStudio on the driver node or on an instance that you launch on Databricks. So you can open it up in your web browser and you can start working with the data that's in Databricks.

There's like some requirements though for you to do this, and they make it hard to work with. So one of the biggest ones is that Databricks clusters by, or compute by definition, like the default setting is that it'll turn off if you're not using it, which is great. But for this, if you want to launch RStudio, then we force you to disable the auto termination. So it will not turn off. You have to manually turn it off. So the reason why we do that is IDEs are stateful. You want to preserve the code that you've written and maybe some data that you've saved locally. So we don't want you to walk away from your computer for an hour and then we shut that down and you lost everything. So we make it so that you can't auto terminate it.

So the ways to get around this, there's two ways. The best way is to not use this feature, to be honest with you. The better way to go is to use RStudio desktop or Posit Workbench and then set up like a remote connection or to use Databricks notebooks inside of Databricks if you can't do that.

However, if you want to work around, then there's a package out there called Brickster. I'm so glad I got a chance to bring this up. But you can use the REST API, the Databricks REST API, to turn off and turn on clusters. So it's very easy to use Brickster, write some R code that basically identifies your cluster with RStudio on it and shuts it off. So what I've advised customers in the past is maybe at like 10 p.m. and you don't think people are going to be using it anymore, then you can schedule a Databricks job or you can run it locally on your machine if you want, like a cron job. And then it'll hit the REST API and it'll shut it off. And you come back the next morning, you turn it on yourself, and then you're good to go.

Career advice

But a question that I always ask Rafi, and it's one of my favorite questions, is there a piece of career advice that you have either given to somebody or received along your career journey that you'd like to share with us?

Okay, the one that comes to mind is that I would recommend feeling free to ask questions when you don't understand something because you really have nothing to lose, and I'll explain why. If you don't understand something and you don't ask a question, then you will not understand it because you're not going to get the answer. If you don't understand it and you ask the question and you get the answer, then fantastic. Now you understand that. Now you're that much more knowledgeable and that much more capable. If you ask the question and people deride you or basically give you any other response than answering the question, then you know that maybe you're not in the best situation, and that's also valuable information, and you should maybe look for a place where you are free to ask questions. So I think that that's a really powerful thing.

People are often afraid. So why wouldn't someone ask your question? Because they're afraid to look stupid or they're afraid to look like they don't know everything. The truth is nobody knows everything. One of the best experiences that I had at Databricks was when I first met some of the engineers that I admire so much, and I was asking them about some things, about how Databricks works, and without missing a beat, he said, I don't know. And I was just like, that's amazing that that's the culture, that you are just not afraid to just be like, oh, I don't know. That's not my area. I don't know that. So I think that that's the advice that I would give. I think it's served me very well, and I think it would serve anyone well.

I would recommend feeling free to ask questions when you don't understand something because you really have nothing to lose. The truth is nobody knows everything.

Tips for transitioning into sales engineering

First and foremost, thank you all for hosting, especially you, Rafi. I feel like I can really relate to you on a lot of levels. And the reason being is because, you know, I've been in enterprise sales for a number of years, and I'm currently enrolled in a data science boot camp, and I'm trying to career transition as a sales engineer. My question being is, what are some tips or resources to really focus on in helping me throughout my transition to become a sales engineer, especially since I don't really have any skin in the game?

Yeah. So there's a book here. So there's a book

Rafi Kurlansik @ Databricks | Data Science Hangout

Transcript#

Rafi's background and career journey

The Posit and Databricks relationship

Prioritizing work as a principal product specialist

Data governance vs. data stewardship

What is Databricks?

Learning open source tools and going to production

Resources for using Posit with Databricks in R

Databricks Data and AI Summit

Prototyping with Databricks

Excel users and Databricks accessibility

Total cost of ownership with Databricks

Career reflections and gardening

R support in Databricks

How Databricks and Posit work together

Shiny and Quarto in Databricks

Data validation and the bronze-silver-gold model

Managing RStudio on Databricks clusters

Career advice

Tips for transitioning into sales engineering