Nathan Stephens | Scaling Spreadsheets with R | RStudio
videoimage: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
Just to go through a pretty quick agenda here, we'll just go through some introductions and we'll have our awesome presentation by Nathan Stephens and then a lot of time for Q&A and just some open discussion and networking amongst the group too. But I do just want to make sure I note that it is being recorded and the session will be shared up to YouTube for anybody who missed it or if you want to go back and check it out again.
But for anyone who's joining this group for the first time, this is a friendly and open meetup environment for teams to share the work that they're doing within their organizations. Also teams from RStudio to share some of the work that we're doing, teach lessons, learn, network with each other, but really just allow us all to learn from each other. So thank you all for making this a welcoming community. And if you do ever have suggestions or general feedback, I'll share a form at the end as well. But we will have a Slido link for any questions that you have during the talk.
But with that, I would love to introduce our speaker, Nathan Stevens. So Nathan will be presenting on scaling spreadsheets with R. Nathan is an enterprise architect here at RStudio and has a background in analytic solutions and consulting and experience building data science team, architecting analytic infrastructure, and also delivering innovative data products. So with that, I will stop my share, Nathan, and turn it over to you.
Great. Thanks, Rachel. It's a real honor to be doing this. And I appreciate you, Rachel, setting this up. And thanks, Maria, for getting this going and being the inspiration here. Really appreciate the opportunity. Hopefully, I don't... I value everyone's time. I have a lot of people on the call. And so I want to make sure that you walk away with some information, some new ideas, something useful that you can take with you. And if you can't, I'm going to try to entertain you to some extent. So if there's nothing here new, at least maybe there are a few laughs along the way.
So I've done data science for a while, and I'm sure if you combine all the years here, it'd be hundreds or thousands of years probably in doing data analysis, right? And in preparing for this presentation, one of the first things I did was open up Excel and start putting ideas in for my topic about why you should use R beyond just Excel, right? So I'm actually in Excel doing work about things about R. So that clued me in on like, okay, where are we here? We're in this world where Excel is ubiquitous.
So spreadsheets have been here for a long time. They're not going away. They're incredibly useful, like I said. I used them to prepare for this presentation. What I want to do is kind of give an overview of this relationship with programming languages.
Why use R instead of Excel?
Excel is not the problem. PowerPoint is the problem. If you want to see the knives come out, let's talk PowerPoint. Then the knives come out. So this will be a friendly discussion. All right. Let's get into spreadsheets with R. Why do we use... Why are we using Excel? We use Excel for a lot of things, right? But, you know, in R, we have this notion of wrangling, visualizing, communicating insights about data, which is the exact same thing that you're doing with Excel, right?
So this is a very business intelligence-like mindset, right? And a lot of this presentation is about, like, you know, making decisions, influencing, you know, taking actions, seeing results, right, in a business setting, right? You can use Excel to build a wedding list too, and some people at RStudio have used R to build their wedding lists. They will go unnamed, but I know who they are.
However, if you've used Excel to any extent, you know that you end up with some problems. If your Dilbert's boss is saying, did you see any errors on the spreadsheet? And he's like, only three. What are they? Your data, your format, and your formulas, right? So we know that as you use these spreadsheets, you know, you kind of run into some problems, and that's where I want to talk about how R can address some of those problems.
I'm going to actually focus on two problems today. The first one is file size, okay? So Excel handles kilobytes and megabytes, no problem, right? When you start getting into, like, the gigabytes, right, and tens of gigabytes, like, lots of problems, right, it becomes much more difficult. And introducing large amounts of data to Excel is kind of like a fireball for me.
So Excel does have some hard limits. So in a spreadsheet, a spreadsheet is limited by a million rows and 16, 17,000 columns, right? Hard limits, good luck pushing on that wall, right? There's some softer limits, too, around, like, the memory limitations, like I said earlier. Megabytes, sweet spot, right? Tens of megabytes, no problem. Start getting to hundreds of megabytes, a little more difficult, right? And then you talk about, like, tens of gigabytes. That's a no-go. Like, that's no bueno, right?
So the second problem I want to talk about is complexity, right? And complexity, this is a nebulous topic, right? I'm taking a multidimensional space. I'm putting it into one latent factor called complexity, right? And I'm going to define complexity for my own purposes in this presentation. But I think there's some intuitive appeal here, too, because, like, I pulled this right off XKCD, right? The most complicated thing on the planet is a sprawling Excel spreadsheet built up over 20 years by a church group in Nebraska coordinating their scheduling. We've all seen it. We've all seen that spreadsheet, right?
But let's put a little bit more meat on the bones and say, what are we talking about by complexity? Well, I think if you have, I think a simple workbook looks something like this, right? It's like a single data source, some summary statistics. It's no data updates. It's got a few functions that are easy to understand. Maybe you throw in a dropdown list for, you know, fancy, you know, use case show off a little bit. You have a couple pivot tables and just a handful of sheets, maybe just one, right?
A complex workbook, but by contrast, would be many data sources with advanced algorithms built into it somehow, right? And then dynamically updating the data on a regular basis with, you know, formulas to get F2 and, like, the thing drops down, fills up half the screen. All the formulas are nested, like, seven times over. There's visual basic scripts running in the background. You've got several pivot tables or charts, and they're all sharing data. So they're all kind of, like, linked to the same spreadsheet. And you've got, you know, maybe dozens or hundreds of spreadsheets. Like, those things are terrifying, right?
So why scale spreadsheets with R? Well, R can handle far more data than Excel. So in Excel, the sweet spot's really in that megabyte range. In R, the sweet spot's in the gigabyte range, right? So, like, handling a gigabyte or a 10 gigabytes, no problem in R. No problem at all. R can also handle much more complexity than Excel because it's code based, right? So a complex workbook in Excel equals a simple script in R.
So a complex workbook in Excel equals a simple script in R.
So I'm going to come back to this previous slide here for a second. So this complex workbook over here in this other column, that is what we call a simple R script, right? So that's simple for R. In fact, that's why you want to use R because R's really good at these things, right? So simple in R equals complex in Excel.
All right. So if you put these two dimensions on two axes, right, and you say, I've got a simple design, a complex design, and on one, and small size and large size on the other axis, then Excel fits neatly, in my opinion, my estimation, around these small file sizes with simple designs. When things get really complicated or really large, so we're thinking like that previous thing, like multiple levels and data sources and update, all this stuff, right? I would argue R is really good at that, right?
What I really want to talk about is the off-diagonals. I want to talk about where, when you have something that's fairly small file size, but very complicated, like when would you think about using R instead of Excel? And when you have something that's really large file size, but it's a very, very simple data structure, just like one table that's just a little too big, like where do you think about using R in that dimension? So I'm very interested in this boundary, right? And defining this boundary. At least that's what this talk is about. It's like, what does this boundary look like? When does it make sense to start thinking about maybe leaving the Excel world and trying some functions in R?
Demo: handling large file sizes with R
Okay, so the first data we're going to use is popular baby names from the Social Security Administration that are state-specific. This is a zipped file. If I click on this, it'll start downloading. You can see I've downloaded it four times, right? It's 21 megabytes, right? It downloads very quickly. And if you open it up, it looks something like that. It's like this. Names by state. This is the zip here. You can zip it. It's got 50 files in Washington, D.C.
Okay, so here's Arkansas. And I loaded up every single state into Excel. And I did some Googling about, like, how to do that the best. And I came to the conclusion it's probably just easiest if I just do this by hand. It took somewhere between five and ten minutes to load everything in to every state.
But there's a lot of issues here right off the bat, right? So the first issue I already mentioned. I had to, like, load them all up. That was kind of a pain. But one-time cost, right? But now I don't have any headers. So I need to go back and I want to put the headers in. Now I've got to put headers into 50 files. And that's a pain. When I put all the data together, yeah, it was 21 megabytes zipped. But when it was unzipped, it was 128 megabytes. And that's what it is in the Excel file, too. It's around 128 megabytes. And that's large enough to get the spinning wheel on the auto-updates that require you to hit the escape on the cancel.
When I got to California, I was looking for names. And it turns out this drop-down menu has a list, a cap of about 10,000 names. And in California, there's 20,000 names in this thing. So I can just get up to carry. And then I can't get beyond that, right? So that's frustrating. And I'm just starting to hit some rough edges here. But the real hard edge is, like, I don't really want to do this state-based. I want to combine everything and then analyze across states. And this thing is over 6 million records long. It's not going to fit into a worksheet, right?
So I want to show you what this would look like in R. So again, one of the big issues here is I can't load all the data into R. Let me show you what it looks like just to load in one state. I come into import data set. So if you've never seen RStudio, I don't know if anyone is new to RStudio, but this is RStudio, right? You get the R prompt down low, your R script up high. You've got some tools over here on the right.
So I'm going to go ahead and read in the data. I'm going to browse through the data. And notice the headers are off, right? We don't have a header in there. So we're going to change this to character. There you go. Now you're male, female. And you can see it actually wrote the code for me over here, right? It just did it all for me, which is really nice.
So I'm going to use the Vroom package, because the Vroom package makes this really easy. It's a very good package for like loading in lots of data. So if you have like many gigabytes, that's great. If you just have 128, it's going to be lightning fast. I'm going to go ahead and grab all the files in that directory. So these are all the files. And the Vroom command says just take the files and take the column and add a column header, right? So I don't have to like do that 50 times. That's a really simple command.
So it says that we have 6,000 records, right? Five columns. And I can, you know, I can take a look at that if I want to, or I can save this. So I'm actually going to write this back to file. I'm going to write this as a CSV. Great. Piece of cake, right? Super fast, super easy. No problem. I now have one file that's concatenated to all of them. No mistakes in the code, because I wouldn't do any copying and pasting, right?
Now, one thing that Excel has that R doesn't have is this notion of notebooks, right? So notebooks are formats that contain the pros and the code and the results of the code all in the same document. And they're really useful because it lets you keep track of your thought process, right? So if you go back later, you can see why, what you were thinking of doing, which is why it's called a notebook, right?
When you go into this larger file size area where data becomes large-ish, almost universally you're going to default back to something that's like SQL, no matter what tool you're using. It's going to look very SQL-like, right? And that's no different with R. So in R, I'm going to use a tool called dplyr. And dplyr looks like SQL because it uses the notion of verbs. And this is what that looks like, right? So I've got a filter statement instead of a where, right? I have a group by statement, a summarize statement, which is like, you know, the aggregate, you know, variables in the select statement or the window variables. And then I do actually have a select statement here at the end, right? So this is how you would manipulate data. Very SQL-like way of doing the data.
Most popular names from 1996, which is Emily, Jessica, Matthew, and Michael. And then if you look at the most popular names leading up in 2020, last year, the most popular names were Ava, Emma, Liam, Noah, Oliver, and Olivia. Man, Olivia, just going strong.
Okay. So basically you get to this and you're like, okay, this is cool. I might have something to say about this. I want to do something. So I'm going to build an application. This is a Shiny application. This is not a Shiny demo, so I'm just going to leave it at that. This is how you can build applications in R.
Okay. So in the Shiny app, when I've blown it up so much, I can't really see it. So let me open up a window here. There we go. Okay. So this is Jesse, and I can go ahead and add in my Colorado and California, right? And that's the name of Jesse. And you can see that Jesse is mostly a male name, right? And there's some females named Jesse as well.
And the cool thing here is all of the names are in here. I can pick any name I want. I picked Jesse because for some reason I was thinking about Uncle Jesse when I pulled the name. And I was like, okay, Uncle Jesse was popular right about here. So good job, John Stamos. You pretty much killed the name.
Now, if you want to share this application online, you do this with RStudio's product, which is RStudio Connect. And so you can publish this thing with the push of a button, and then you can share it with other people. And that becomes a very powerful tool to use in the enterprise, right? You can also share the notebook, right? So here's the notebook that's also on Connect. And this will allow you to keep track of your content and other people to collaborate to or to see what you're doing.
Demo: handling complexity with R
So that, in a nutshell, is size. I'm going to keep moving here because I'm going to run out of time. The next thing I want to talk about was this quadrant here around where the workbook is too complex. So we're going to toggle over and talk about something, a file that's pretty small, but the workbook ends up being really complicated.
So let me do that here. Come back to Excel. New data set. OK, this is a new data set. This is about customer data. This is called a growth equation that breaks down revenue into constituent parts. And it's got these cool little drop downs, right, where I can choose a segment. I can choose a metric and I can either be, you know, year to date or whatnot. And I worked really hard on this thing. This thing took a long time to build. I mean, we're talking weeks to build this thing. Right. And this is actually the simplified view.
In all of its glory, it was actually a lot more sophisticated than this. But basically, you pull some data in the pre-period. You pull some data in the post-period. You write some lookups here, some offsets to get the data that you need based on the inputs here in the parameters table. And then you pretty it up here and you do that every week and you mail it out. And, you know, people that like it go, oh, cool. And they copy and paste it and they use it too.
Right now, there was a problem with this thing that I ran into. And that was that it was really hard to maintain. Like it was always changing and any change was very difficult to make. And sometimes the data were really hard to get right because the data were coming in in such a way that even the smallest changes were hard to track down. So this thing that was, you know, kind of my pride and joy, it ended up being kind of this taskmaster monster for me.
And unfortunately, the bottom line here is that, you know, this took a lot of time and ultimately yielded, you know, the best result here was I became very good at Excel. So there was that. But like from the business value perspective, I think the business value is a little bit off.
This is another application, a little hard to see here, but this is the application. It's not it's it's really not that bad. It's it's not that bad of a shiny application. Let me go ahead and run it.
So we're here. Here we go. Cool. So I can choose my segment. Right. I can choose. I can do all the same stuff. Right. And I've got my output here, which is really nice. And I can download the data set. Right. And this is really cool, because if I open up this thing, I actually get back to the Excel output. Right. With the chart and the data. And I actually get the data that I used to create those metrics here. So that's really that that's really nice, because then you're just pulling the exact data that you want and the users can do it themselves.
Now, again, this is something that you're probably going to want to publish to connect. And this is what it looks like in connect. This is a tool where everybody can access it. And honestly, you know, at this point, this is really fancy and nice, but I've kind of gone above and beyond. But shiny apps are sexy and interesting because they're interactive. They're reactive. Right. And people like that. But it doesn't have to be that way.
So in that situation, you use a report and the report here looks like this. This is the R Markdown. And this is actually a simpler thing to do, because you put all of your parameters here at the top and the header and then all of your stuff is here. Now, one of the cool things I really like about this script is is stuff like this. This shows me all my formulas. Right. Like anybody can read this. Anyone on this call can read this.
If you want to knit this thing, come up here to knit with parameters. You can just choose, you know, the things here like shiny. This is actually a shiny gadget. Get this thing and it will give you the report that you want. And this is actually a little easier to maintain. And if you put this into if you put the same report into connect, you can do this really cool thing where you can like email yourself.
This code here, I dusted off that thing that took like weeks and weeks to build and I recreated it from the ground up and are in about six hours. From six weeks to six hours, literally. OK. So when we talk about complexity, that's what I'm talking about. Like writing the code here like are that. Things complex workbooks just become simple. And they become more. I would argue to be more accurate to at least in my experience.
This code here, I dusted off that thing that took like weeks and weeks to build and I recreated it from the ground up and are in about six hours. From six weeks to six hours, literally.
Summary and Q&A
So this boundary, hopefully this boundary is a little bit clearer. I think I know I drew it as a hard line. I actually think that's a pretty wide spectrum. Right. I think there's a pretty big difference between what spot you're going to move from Excel to R. But hopefully this presentation has helped clarify why you would use R instead of Excel, even in the event you're not doing like a bunch of random forests. Right. Even in the event where you just need to do some regular old data science exploration.
So the last thing I will say is that in this R space, you know, you're doing data science. You know, you can do data science in Excel. Right. You can do it in R. And Excel is a great tool. It's got it's Turing complete. Right. It's got power query built into it. Right. You can publish things to Power BI. You know, you can. You've got the visual basic in the back end, which is not really my cup of tea, but it is a programming language and it works.
The nice one of the nice things about R is it really clarifies when you're using the code. And I'm going to suggest to you that, you know, in data science, the source for all your results really is your code. It's not it's not the report. It's not the output, but it's the process that you use to get the code. So in Excel, if you think about Excel and you like share some output from Excel or like in part. So you take some pivot tables from Excel and put them in the PowerPoint. So that's nice. That output is really valuable. But the source for that was the process that you went through to get that information. That whole process is where the value is in data science. And when you use things like R scripting or code or scripting language, like you get that part as a natural part of the process.
But, Christina, I see you just asked the question around what if the executive wants the report for every combination of parameters? Do you have strategies to iterate through them and generate a full PDF report? Yeah, so if your executive wants every combination, it sounds like your executive wants to do data science. You can build tools for that executive to do the data science. And that's what you would do with the shiny application. You build a shiny application and you put that out there.
So what we actually do are Tableau dashboards, but then we also have to publish all like we have to do all the filters for like every combination, basically. And then we have this like C++ program that takes web shots of all of them and like puts this report together. And it's a really complicated process. That's what that's what I was getting at with that.
Oh, that's so interesting. I actually I actually took that part of this presentation out, but I gave this so I can send you a link. Another version of this presentation actually does the web shots, puts them into PowerPoint presentations programmatically and would allow you to do something like that. I think that the so Tableau is a great tool. Tableau and RStudio products are basically solving the same problem. It's just that R is doing it with code and Tableau is not. So I'd be surprised if there were things that you're doing in Tableau that you couldn't do above and beyond R. But the thing that you mentioned, we're talking about like programmatically pulling all combinations and doing web shots, putting them in a presentation. Yes, that's all possible. And I've done it.
Leo asked, could you please show a quick step by step to achieve that awesome email with the graph and formatted table? Oh, you want to see how the email is created? Yeah. Yeah, I can do that.
Blastula is the package that you're looking for. And then DT is another one that you're looking for. No, no, GT, GT, not DT, GT, Grammar of Tables and Blastula. But this is where I put in the subject line. Right. And this is where I define the plot. And then I compose the email. This might look like comments, but this is all being recognized by R, right? So this is like actually pulling in the things in R and inserting them into the content of the document. And you can read all about that in the Blastula help site. And then on the R Markdown side, you have to attach the subject line here. So this is the subject line. This is the body. These are images and these are attachments. So the XLS file gets attached here in the R Markdown file. Makes sense.
I just want to make sure that you're aware that when we do like a presentation here on R Markdown, if you come here to presentation on R Markdown, one of the options is PowerPoint. So you can actually generate PowerPoints here. And then in PowerPoint, going back to the web shots, you can put web shots in here to like, you know, pick up the things that you're trying to insert into the PowerPoint.
So we take security very seriously at RStudio. It became very clear to me early on that we don't have a company without secure software. So I'm not saying that we've got everything, you know, figured out because it's an ongoing process. But I will tell you that, you know, we have onboarded hundreds of customers that have had a variety of security needs and we haven't lost any deals specifically because security.
Yeah, the general question is, you know, can I trust, you know, my shiny applications? And keep in mind that normally these shiny applications are used for internal purposes. Like normally you are like you're putting this behind your firewall. So that's usually the standard you're looking for.
I see another anonymous question that was Excel may be considered as interactive app for data analysis. How do you help Excel users successfully transition to a shiny app developer? Well, yeah, you're in luck. I mean, there are a lot of resources around learning to do shiny. So if you go to, what is it, shiny.arcstudio.com, that's a great place to get started. The creator of shiny is Joe Chang. And shiny is 10 years old now. It's been around for a while. But the cool thing about shiny is that it was designed to fail nicely. So Joe talks about the pit of success, right. So you might make a mistake, but shiny is going to be nice to you, you know, when you make that mistake.
So the way if you're transitioning to shiny, be nice to yourself, start simple, and then just follow the patterns and fall into that pit of success. I say that myself because I'm not the world's best shiny app developer. My background is in big data, right, as opposed to front-end application design. But I have learned that just going with the patterns, going with the flow, you're going to have a pretty gentle ride with shiny. And if you do have problems, go to community.arcstudio.com. There's a lot of resources on the community site about doing shiny.
I will point out one thing that's interesting about shiny is that reactive nature is very similar to the Excel paradigm, right? So in Excel, you change one cell and it just trickles through the whole spreadsheet and can just have massive changes to the experience. Shiny is the same way. You'll change one thing and it will just reactively, like, percolate through the entire application.
Another question on Slido from Josh was tools such as GitHub, SQL, Power BI, Tableau, Excel. Curious your thoughts on these tools as they don't seem to be emphasized enough with respect to their use in the workplace.
I'm struggling with that question a little bit because I deal with customers on a daily basis and it seems like everybody's using those tools. And SQL has been such a dominant paradigm for so long. I've worked with no SQL databases for quite a while, right? No structured things like Tableau or Redis. There's always some sort of SQL-ness around it. You're like, hi, it's not even SQL, but they put HQL on top of it. So I think SQL is one of the most enduring paradigms in technology that I'm aware of. And don't bet against it.
I think one of the problems with SQL in general, with the world, is very few people get formalized SQL training. I was lucky enough to get that. So I learned SQL on the job. Nobody in academia talks about SQL, right? I shouldn't say nobody, but nobody in my experience in academia ever mentions anything about SQL. And then I go into the job and that's all anybody was doing. So I learned SQL that way. But then I actually did take a training class on SQL and then talked about the query planners, right, and other clauses that maybe you're not so familiar with. And window functions and different styles of it. And that was very useful.
If I could kind of just clarify, I think you did end up touching on what I meant. With R, I learned a lot about R through school. Barely was exposed to any of the other tools. So it seems now that I'm backtracking to learn the other tools to match my use with R, but I would barely use R in the workplace.
I think R is a niche programming language mostly used by scientists who are seeking the truth. And if you still go to the R project page today, it'll say it's a language for statistical computing and visualization. And how many companies are sitting around thinking, I need some statistical computing visualization software around here. That's not what they're thinking. They've got data. They've got operations. They've got BI. They've got these big systems. So it's not a surprise that, like, in academia, you're using R to do science and learning, and then the industry is using more enterprise tools.
I think one of the reasons why I wanted to join RStudio was to help bridge that gap. But I think, yeah, Josh, you and I should grab a cup of coffee and talk.
I wrote a post many years ago when I first joined RStudio about making R legitimate in your organization, and the tenets of that was to make a decision and allocate resources. So you're looking for a stakeholder, usually someone very senior in your organization, to say, we will consider R to be an analytic standard, and we're going to support it, just like we would SQL or these other tools that we're talking about. And then if you can get resources allocated to them, whether it be, like, you know, like a server, like money resources or software, or you get individuals, like saying, we're going to train these individuals on those things, that's what you're looking for, right? And that's the natural process in most organizations to recognize standards.
There's an assumption that R is – that data science is things that people do on their laptops with free software, right? And I think that greatly undermines the value of data science. I think that – and it's, you know, Josh pointed out, well, what about Tableau, right? And what about Power BI? Power BI is a little different than Microsoft, right? So Microsoft is a different thing. But, you know, all these other tools. Well, if you brought Tableau into your organization, somebody went up in there and said, we're going to make Tableau an analytic standard for this organization, we're going to allocate resources for it. You have to do the same thing with the R programming language.
There's an assumption that R is – that data science is things that people do on their laptops with free software, right? And I think that greatly undermines the value of data science.
I think most organizations aren't very data-driven, right? And there's a spectrum. But they want to be, and they know that it's relevant. I think going through the process of saying, like, you know, R, you take all the emotions out of it and say, like, you have a business to run, right? We're talking just enterprise. We have a business to run. R is a competitive advantage. It brings in great people. It allows you to use technologies. It needs support. You're already using C Sharp, and you're using some Python, and you're using Java. Why not use R? And those are arguments that you can win.
There's usually some sort of fulcrum, and getting that executive sponsor is key. Like, if there's no executive sponsor, no one's willing to actually say, yes, this is what we're going to do, then you really – you're back in that – you're back in the guerrilla warfare, right? You're just back in there, like, doing stuff and, you know, throwing something against the wall and hoping it sticks, right? And I have to say, like, if you're in that organization, there are better organizations than that, right? Like, if you're looking to, like, make an impact, make a difference, and improve your skills, like, why would you want to be in an organization that didn't value data science, right?
And data science is mainstream now. Most of them are not data-driven at all. And for them, it's one extra step that they need to do. And, you know, like, people are lazy. They don't really want to do an extra step if they can, like – if they don't have to. So before they realize all the benefits, they need to do effort.
There's usually some sort of fulcrum, and getting that executive sponsor is key. Like, if there's no executive sponsor, no one's willing to actually say, yes, this is what we're going to do, then you really – you're back in that – you're back in the guerrilla warfare, right?
But one of the questions from earlier, Nathan, was what packages do you recommend for importing Excel files? Wasn't there, like, a readxls and then there's, like, an xlsx package as well? Yeah, you know, that's a miss in my presentation because I was showing how to read in the flat files. But, yeah, just reading those things are pretty straightforward. Everyone in the chat is confirming that it is readxl. So it's – okay, so if I come in here to environment, import dataset from Excel. Yeah, readxl is the one that's used by default.
And then another question from earlier was, can you version control Excel? Yeah, the one I'm familiar with is a paid solution, though. So you actually have to, like, subscribe to it. It's an enterprise solution, which makes sense, right, because Excel is an enterprise tool. Like, you have to pay for it. That's one of the nice things about R is that it is free, right? So if you want to use it, you can start it easily.
But last – a few months ago they released some new function that made it Turing complete, you know. So it's like Excel is there forever. And it's not just Excel either, right? It's also Google Sheets, and it's also OpenOffice, right? So it's like you actually have three, you know, spreadsheet platforms to choose from that are common.
My experience is I've been using R and R Markdown for three years, using it in the enterprise business. I first started people introducing – getting them to know R, but that was a little step too big. But once I introduced them to R Markdown, they could see the link with Excel very easily, very straightforward. And they love R Markdown, and it connects with them, with Excel, because you can work in the R chunks and generate nice reports easily, do it step by step.
And how I use it now in the enterprise, I use it in the oil and gas, I use it in the marketing business, is they have these ERP systems, SAP, Microsoft Dynamics, and when they have to do reporting with those systems, it is very cumbersome. And when they use R Markdown, they extract the data from the backend, put it in a small database, access the data with R Markdown, then it is for them easy peasy. Really, they can generate fantastic reports, and they love it.
Yeah, I appreciate that feedback. I'd be really – so I showed the visual editor today on the demo. It broke at one point, which was unfortunate. It's new. I'd be curious to know your feedback introducing these same people to visual editor. That actually made things even better for them.
They don't like R, but they like R Markdown. They love R Markdown. Well, I mean, R Markdown, you've got R code chunks inside of R Markdown, so you can't really use R Markdown without R, right? But you're saying they don't like the R script. They don't like the R script. They like the R Markdown. It's easier to understand, to connect with Excel, all things work.
I like it because you – I just have too many experiences where I wrote R code and I came back six months later and I couldn't understand what the heck I was thinking. Like I hardly recognized it was me, you know. And R Markdown at least gives you some opportunity to say, this is what I'm thinking. This is why I'm doing this. This is what you should expect here. And, yeah, that's really valuable just to me personally, right, like me as a coder.
And what I also like is about the different output formats that you can easily generate with R Markdown. You can make an HTML, you can make a dashboard, you can make a PDF, you can make whatever you wish.
The one area of Java that I have a hard time avoiding is Spark. Spark depends on Java. And guess what? Spark and Java. Spark is kind of difficult, you know, because of Java.
Can I put one thing? I couldn't get into the presentation because we were talking about Excel, but everything I showed on the presentation today was all file-based, right? And there's another type of data that we didn't get into at all that would be interesting to do, and that's machine data. So things are coming from databases, right? So when I built this presentation, I kept thinking about the data sources. But if your data are coming from Excel, meaning that the original source of your data actually is Excel, so somebody got into Excel and inputted the data, then you're in Excel land. You're not in R land. I mean, you're going to be in Excel for quite a while. But if your data source is a database and you're pulling tables from a database, then I think the jump to R is actually a lot easier. And I personally would probably bypass Excel in most cases, right? I know there's Power Query and all that stuff, right? But if I'm getting machine data, and that's not just databases, but that's like APIs, right, or some sort of streaming feed, or even just curling information from the website, all of those things, I would start thinking R, personally.
A follow-up question there was, can R work directly within a database? Yeah, that's the right guy. So, yeah. So, if you come here to RStudio, and you do Connections and New Connection here. Yeah, these are all the database connections, right? So, you probably, I don't know if people use Snowflake out there, but this is a Snowflake connection. And then it will populate right here into the connection window, and you can browse things.
A follow-up question there, Nathan, was have you ever connected to any bank portals? Bank portals. Man, I don't even know what we're talking about. What's a bank portal? Like, banks keep their data really locked down. We work with some banks, and they usually have, their databases are usually pretty traditional. You're looking like a DB2 database or something like that.
Well, there's lots of ways to interface with APIs through R. So, again, I need a little bit more information. But, yeah, the bank group, the banks are, those environments can be very restrictive, right? So usually my experience with the bank, it's not the tools that are the problem, it's the policies that are the problem, right? So what ends up happening is that users kind of find a way to get around those barriers and those problems. Like, maybe they'll just download the data to their desktop, you know, and just kind of email that around or something like that.
And Nathan, one of the questions on Slido was how can we connect with you to chat about other topics like enterprise solutions or architecting workflows? Yeah, it just brings joy to my life. I'd love to connect with any of you. My email is Nathan at RStudio.com. And I'm on LinkedIn, of course. I used to have Twitter, but I've kind of uncoupled from Twitter and Facebook. So I'd say the email or the LinkedIn is probably fine. I'm also on community.rstudio.com. You can find me there, too.
I see one other question that came in from Rafael. Can we connect R directly for any, and I'm assuming this means database, the DDBB, and visualize real-time data? Yes, if you want to hook up like a Shiny application to a live database, that is very common. We have a lot of customers doing that. And there are some resources online for doing that. You'll have to make sure you have permissions to connect to that database. But what you normally do is you run the Shiny application either as yourself or as a service account that has permissions to access that database. And then the tools are you're just going to use the same R tools to pull that data in depending on the user inputs. Very, very common request. Very common use case for Shiny.
Antonia Lee said, I'm new to R and the dopamine rush is so rewarding. What's the best way to start getting your hands messy? Is there a site with cases to start analyzing data? Oh, man, that's going to make me cry. Okay, so, yes, I hear you. I see what you see. R speaks to me as well, right? There are a lot of people that do that. I just tell people that like I will just pull up R on the weekend and just kick around a few ideas in my head. I'll just do it for fun, you know.
But Tidy Tuesday, I think, is a great project. So, every week, Tuesday, there's a new data set released, so then people kind of visualize that data and then share it out on Twitter for other people to see.
I think necessity is the mother of invention for me. I mean, you can go to Kaggle and do that competition if you want. That could be fun. But, like, if you have data, like, that's where things get really, you know, that's where you really get the motivation to do stuff. So, if you have the opportunity to seek out that data and, like, seek out a problem that matters, that's where I learn the most.
When I was starting R, this isn't going to date me, guys, but I started using R pre-Google, okay? So, like, finding information about R back then was almost impossible. And then when Google did come out, it wasn't smart enough to distinguish between, like, R, the programming language, and the letter R. So, it's like you put R in there and it was a worthless search, you know? You'll still see issues now with R getting conflated with Reddit, right? Like, R something and sometimes you'll get Reddit as a result. So, Google still struggles with that.
And we'll also just put a plug out there for rweekly.org as well. Does a really nice job of pulling in blog posts and different materials every week.
But thank you all so much. Have a great day ahead, depending on where you are. Have a great evening.
