Resources

Daniel Chen - Moving to Quarto from RMarkdown and Python Jupyter Notebooks

video
Aug 4, 2023
20:28

Transcript

This transcript was generated automatically and may contain errors.

Next month, he's running 100K through the woods of British Columbia, and he grew up in New York City, so he's hoping that as a city boy he doesn't get mauled by a bear. Please welcome Dan.

All right, so I'm here to talk to you about moving to Quarto from RMarkdown or Python Jupyter Notebooks, so hello everyone.

But before we get started, I would first like to acknowledge that we are currently on the unceded land of the Munsee and the Lenape people.

And I'm Daniel. I'm currently a postdoctoral research and teaching fellow at the University of British Columbia up in Vancouver. I'm from New York City, but I moved up there over a year ago, and I'm currently a data science educator at Posit. I got all of my teaching background from the Carpentries, and I also authored a book about pandas in Python.

What is literate programming?

But this talk is about literate programming, which is the ability for you to write code and write regular prose text and interweave it together in a single document or a report. It also allows you to write slide decks like we've learned from Emil earlier today, and also this talk is also written in Quarto, which is also another form of literate programming that I'm going to be talking to you about today.

So literate programming, who would use it? Why would you want to use it? So as a data scientist, you may or may not have learned or used R Markdown or Jupyter Notebooks, and for people like us, it's really good for doing analysis or reports and documentation. If you're also an academic, you can write your papers in it. I sort of have this project that maybe I'll try to translate my dissertation into Quarto and just see how well the tool works.

But even as a technical writer, you can use it for not just Quarto, but just literate programming in general. You can use it to write blogs or websites or presentations or books and all of that stuff. So that whole realm of content is called literate programming.

R Markdown overview

As R users, we probably have heard or used a tool called R Markdown. It's been around for many, many years, and I'll give you a... If you don't know about R Markdown, then this is sort of the main component of an R Markdown document, where you have three backticks, a set of curly braces, the letter R, and then in here, in your regular document, you get to write regular R code as if it was in your console or in a script or right in RStudio.

And this gives you the ability to take that R code and write regular prose text with it. So I can have a regular Markdown header that says, like, here's a section about loading data, here's some regular text about what the dataset is, and here's the code to load the data. Up at the top, I have some metadata information about what the title would be, what type of output. So right now, I'm picking an HTML output. You can set it to PDFs or other forms of output as well. And then there's also some things that you just have to do or load, but you don't really want to be part of the report. So you have the ability to still run code, but sort of turn off the output so no one really sees it in your final report. So that's the main part of an R Markdown document.
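For readers without the slides, a document of the kind described here might look roughly like the one below. This is only a sketch: the file name `analysis.Rmd`, the title, and the data path are made up for illustration, and the source is written from Python so the example stays in one language.

```python
from pathlib import Path

# Three backticks, built indirectly so this example can sit inside a fenced block.
FENCE = "`" * 3

# Hypothetical report: the title, data path, and chunk contents are illustrative only.
RMD_TEXT = f"""---
title: "Example analysis"
output: html_document
---

{FENCE}{{r setup, include=FALSE}}
library(readr)
{FENCE}

## Load data

Some regular text about what the dataset is.

{FENCE}{{r}}
dat <- read_csv("data/example.csv")
head(dat)
{FENCE}
"""


def write_rmd(path: str = "analysis.Rmd") -> Path:
    """Write the example R Markdown source to disk and return its path."""
    target = Path(path)
    target.write_text(RMD_TEXT, encoding="utf-8")
    return target
```

The `include=FALSE` chunk is the "run it, but keep it out of the report" case: the code still executes, but neither the code nor its output appears in the rendered file.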

And what we can do is we can take that regular R code or that document that you just saw, and you can run a command right in your command line called Rscript, and you can run this function called render from the rmarkdown library. And what it will do is then generate the format that you asked for. So in the previous example, you saw I asked for an HTML output. Running this code will give me an HTML file of that code.

I talk a lot about project formats and project templates. So what I also like to do is specify like, hey, when you generate this HTML output, please put it in the output folder so I don't end up with a folder with 50 files all in one go. And you end up with a file that looks like this during that output. So here's the longer form of that type of analysis. I'm loading data, filtering data, saving, doing some tidying stuff, plotting, and then fitting a model. And you get like a nice little report in here. And so I can have code, see its output, and write about it all in one go.
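The render call with an output folder can be sketched as follows. The exact command on the slide is not in the transcript, so this is an assumption about its shape; `analysis.Rmd` and `output` are hypothetical names, while `Rscript -e` and the `output_dir` argument of `rmarkdown::render` are existing features.

```python
def rmarkdown_render_command(rmd_path: str, output_dir: str = "output") -> list[str]:
    """Build the Rscript call that renders an R Markdown file into output_dir."""
    r_expr = f'rmarkdown::render("{rmd_path}", output_dir = "{output_dir}")'
    return ["Rscript", "-e", r_expr]


# Hypothetical file name. To actually run it (needs R and the rmarkdown package):
#   import subprocess
#   subprocess.run(rmarkdown_render_command("analysis.Rmd"), check=True)
command = rmarkdown_render_command("analysis.Rmd")
```

Building the command as a list keeps it easy to drop into a Makefile-style workflow or to pass to `subprocess.run` without shell quoting problems.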

Introducing Quarto

But this talk is about Quarto. So you can think of Quarto as like the next iteration of R Markdown. R Markdown's not really going anywhere. But there was a lot of things learned during the development of R Markdown. And so Quarto sort of encompasses all of the things that were learned and sort of tries to make it better. So it's still a plain text source document. So you can open it in any type of program that can read plain text files. It still does literate programming. Quarto natively just has multi-language support. So you can have an R document, a Python document, or an R and Python document. It also out-of-the-box supports Julia and Observable for JavaScript.

And you can take a Quarto document, very similar to R Markdown, and generate multiple outputs. So PDFs, HTML files, anything else that I can't think of because I don't really use. And that's all because it's built on top of Pandoc. So you can write regular Markdown, and then it can generate out into something a little bit prettier. And the nice thing about it, about Quarto, at least from my end and maybe a lot of people here at this conference, is that it's just more familiar because it's coming from the R Markdown world. So hopefully, you don't have to do that much more work to get using Quarto if you are already using R Markdown.

So we can actually use Quarto to render the same exact R Markdown file. And we'll get a nice little piece of output that looks like this. So it's really similar. The formatting is a little bit different. You'll see at the bottom, I added the ability to just put in a table of contents. And so you get a nice little table of contents on the side instead of at the top. So as I scroll, it's kind of always there with us. So that's one of the nicer things that comes out of the box compared to something like R Markdown.

But there's a couple of differences if we actually want to create a Quarto document, like a .qmd instead of .Rmd. In the YAML header, really the only thing we need to change is instead of output: html_document, it's now format: html. And a lot of the other bits and pieces, there are some overlaps or some new things. Some things get taken out, et cetera, et cetera.
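The header change is small enough to show as a before and after. The sketch below rewrites only the single line called out in the talk and leaves everything else alone; the title is a placeholder, and a real migration may need other keys adjusted too.

```python
def rmd_header_to_qmd(header: str) -> str:
    """Rewrite the one header line the talk calls out; leave every other line alone."""
    converted = []
    for line in header.splitlines():
        if line.strip() == "output: html_document":
            converted.append("format: html")
        else:
            converted.append(line)
    return "\n".join(converted)


# Hypothetical R Markdown header.
RMD_HEADER = """---
title: "Example analysis"
output: html_document
---"""

QMD_HEADER = rmd_header_to_qmd(RMD_HEADER)
```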

The other thing about Quarto and R Markdown documents, I forgot the exact version of R Markdown. But you can now take all of those chunk options from the first line of the chunk and actually put them inside the chunk using this hash pipe (#|). And then you can now have all of those options as a comment in your code. And that's really nice because if you start doing things like doing figure captions or alt text, the problem in the previous way was everything actually had to be in that first line. And things just ran off the screen, and it gets really hard, especially if you're using some version control system. So this sort of allows us to separate things out and make that part just a little bit nicer to work with.
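To make the hash pipe layout concrete, here is a minimal sketch that formats a set of options as one comment line each. The option values are invented; `label`, `echo`, `fig-cap`, and `fig-alt` are existing Quarto cell options, but the helper itself is hypothetical and not part of any tool.

```python
def hash_pipe_lines(options: dict) -> list[str]:
    """Format chunk options as '#| key: value' comment lines, one per option."""
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # YAML booleans are lowercase
        elif isinstance(value, str):
            value = f'"{value}"'
        lines.append(f"#| {key}: {value}")
    return lines


# Hypothetical options for a plotting chunk.
lines = hash_pipe_lines({
    "label": "fig-hist",
    "echo": False,
    "fig-cap": "Histogram of values",
    "fig-alt": "A histogram with one peak",
})
```

With one option per line, a change to the caption shows up in version control as a one-line diff instead of an edit to a long chunk header.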

That's generally the only sets of changes that we need to make. And then from there, instead of telling Quarto render to render the RMD document, we can just say, hey, I just created this QMD document. Can you go and output that as well?

And the only sort of tricky thing, if you're just working with QMD files and you want the output to go into a different directory, is that you have to turn the whole thing into a Quarto project, and then you can use the output directory and output file options. But if it's just a regular Quarto document and you don't care about exactly what the file name and the output location are, you can just use plain old quarto render. Otherwise, I do this little hack where you just turn the whole thing into a Quarto project. So that's sort of like the little weird thing with using Quarto documents. It only matters if you're trying to render individual Quarto documents; it doesn't apply to websites or slides and stuff like that, which already use the _quarto.yml file. And that file just needs to exist for this whole process to work. So you don't have to put anything in that YAML file. Quarto just sees the file and then is like, oh, it's a project. I know what to do now.
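The project hack can be sketched in a few lines. It assumes, as described in the talk, that an empty `_quarto.yml` is enough for Quarto to treat the folder as a project; the document name and the `output` folder are hypothetical.

```python
from pathlib import Path


def make_quarto_project(directory: str) -> Path:
    """Create an empty _quarto.yml so Quarto treats the directory as a project."""
    marker = Path(directory) / "_quarto.yml"
    marker.touch()
    return marker


def render_to_folder(qmd_path: str, output_dir: str = "output") -> list[str]:
    """Build a quarto render command that sends results to output_dir."""
    return ["quarto", "render", qmd_path, "--output-dir", output_dir]
```

The second function only builds the command; passing it to `subprocess.run` requires Quarto to be installed and on the path.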

But if you want to read more about this sort of issue or any developments, here's two links about the discussion. And then there's this other way where you can have a pre-render script or a post-render script that if you really want to move stuff around, you can sort of put that in your workflow as well.

And that's all around this concept of project templates that I talk a lot about. So at DCR and NYR, I've given two talks about project templates. I've taught a course with Tiffany Timbers at UBC. It's DSCI 310. And it's around reproducible workflows for data science. And project templates are also another thing about it. So I really care about project templates. I'm the person, if I'm in your group project, I just deal with the file structure of it. And then I just make that. That's one of my things that I get really annoyed if it's not set up properly. So I usually end up handling that.

Jupyter Notebooks: pros and cons

So Jupyter. If you've used Jupyter Notebooks, there's these two sort of talks. One that says, I Don't Like Notebooks. And that sort of stirred the boat a little bit. And then there was a follow-up discussion called I Like Notebooks. And you'll have access to the slides, and you can actually watch those talks about the pros and cons of working with actual Jupyter Notebook files. But I think both of the speakers there, they do have a point. There are pros and cons with any tool that you use.

Here's my list of what I agree with and what I don't agree with, or just my workings with Jupyter Notebooks, is that it's really good for technical writing. It does the whole literate programming thing almost too much, because there's no easy way to hide certain bits of the output. You just see everything all the time, which is, for better or for worse, if you really want transparency, you have that option. But on the downside, at the end of the day, if you try to double-click the file, if you can even double-click the file, it's really just this giant JSON file. And that just makes version control kind of messy.

But on the plus side, if you upload a Jupyter Notebook to GitHub, it automatically renders. So that's a really nice thing as well. Quarto documents and R Markdown documents, if you upload them to GitHub, it just gets rendered as just a plain text file. So Jupyter Notebooks, it's really nice. After a workshop where I'm teaching in Jupyter, it's just like, here's all my stuff, and I don't really have to think about making things prettier. This is exactly what happened in the workshop.

For data science, the thing with Jupyter Notebooks is I sort of treat it more like an output format than a source document, or at least that's how most people use it, because all of the output and figures are all in the same document. So again, if you start using version control, you sort of end up in this scenario where you're now version controlling output of code and not just the raw source part of the code. So again, it's really nice for a workshop. You sort of run into issues or humps, speed bumps, as you're working in version control.

But at UBC, we use this a lot, or Jupyter Notebooks a lot, in teaching. There's this tool called nbgrader, and it helps automate a lot of our student assignments and stuff like that. So there is a lot of tools, and there's an ecosystem around Jupyter Notebooks, so that's always really great, especially if you're a teacher.

But again, I've talked about Jupyter Notebooks being JSON. Here's just the first two or three bits of the actual JSON. So you see some of the code bits, and here's the markdown of the thing, and then I tell it to give me the first five lines of the data set. So here's the HTML view, here's the actual plain text view, and then this is the actual code. So you can see it's really easy to accidentally delete a comma, especially if you're trying to deal with merge conflicts of some sort. And then the entire notebook doesn't even load. So there's a really high chance that if you're not looking at it in a dedicated notebook application, you're going to screw something up.
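The missing-comma problem is easy to reproduce with the standard library. The notebook below is a deliberately tiny stand-in with invented cell contents, not the one from the slides; real notebook files carry much more metadata and output.

```python
import json

# A tiny stand-in for a notebook file.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [],
            "source": ["df.head()"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

text = json.dumps(notebook, indent=1)

# Simulate a bad merge by deleting the first comma in the file.
broken = text.replace(",", "", 1)


def loads_ok(raw: str) -> bool:
    """Return True if raw parses as JSON, False otherwise."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False
```

Here `loads_ok(text)` succeeds and `loads_ok(broken)` fails, which is the whole notebook refusing to load because of a single character.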

So there are tools, VSCode on the left and JupyterLab on the right. There are now tools that, at least with VSCode, I can at least double-click the thing and just view it. With JupyterLab, you always have to make sure that Python's installed and running, all the tools are there, and then you have to be able to load it up, and then hopefully everything loads. But at least with VSCode, you have a chance to just look at the output without having to worry about if you have everything installed properly. But that's sort of the downside, because it's not really a plain text format. I still really need a dedicated tool to sort of just browse and look at a notebook, compared to a QMD file that's a markdown file, really, with some special bits of syntax in it. But this is sort of just what evolved in the Python ecosystem, and that's why it's super popular.

And Jupyter does do R. You need the IRkernel that you can install, and then all of a sudden you get this nice little R notebook. So for those of you who are in a Python shop, but you know R, you can actually get this stuff installed and sort of integrate yourself into a Python team if you want, and still have a thing that looks like a Python notebook until they open it, and it's like, what is going on here? But it is kind of nice, and we do this at UBC as well when we teach our intro data science course. We teach it in R, but using Jupyter notebooks.

Using Quarto with Jupyter notebooks

And so if you want to render a Python notebook programmatically from the command line, you can use jupyter nbconvert, and you can convert whether or not it's a Python notebook or an R notebook, and it will give you the HTML format, because that's what I asked for. And like I said, one of the things is I personally try to keep the Jupyter notebook as a source document, so one of the things I like to do is when I'm in my Makefile or right before I'm going to check my code in, there's this option now to pass in --clear-output, and so you don't have to have the notebook open to sort of like kernel reset, kernel delete output. You can do this in the command line, which sort of gives you a lot more flexibility in how you call this. And this way, at least if you're in a notebook flow, what's in version control is still just the source code.
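What clearing the output does to the file can be sketched with plain JSON handling. The `clear_outputs` function is a simplified stand-in for illustration, not how nbconvert is implemented; the two command builders use nbconvert options that exist (`--to html`, `--clear-output`, `--inplace`), with a hypothetical notebook name left to the caller.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs and counters removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def nbconvert_html_command(notebook_path: str) -> list[str]:
    """Command that converts a notebook to HTML."""
    return ["jupyter", "nbconvert", "--to", "html", notebook_path]


def nbconvert_clear_command(notebook_path: str) -> list[str]:
    """Command that strips outputs and rewrites the notebook file in place."""
    return ["jupyter", "nbconvert", "--clear-output", "--inplace", notebook_path]
```

Running the clear step before every commit means diffs show changes to code and prose only, not to rendered tables and images.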

So if you're really like, I want to use notebooks all the time, this is sort of how you can reduce the amount of friction, especially if you're working with version control. And then if you really want the rendered output, you can turn the output to an actual notebook file, so you actually do have source and rendered version as part of your analysis or your project.

But Quarto, again, can also work with notebooks as well. So your barrier of entry to using Quarto is really as simple as like calling quarto render and then just passing in the Jupyter notebook file. It doesn't, by default, execute the code from top to bottom. It just says like, whatever is rendered in here, I'm going to dump it out. So if you really wanted to run the code from top to bottom, like kernel run all, you pass in this --execute, and it will actually go and run the code for you.

And you end up with the Quarto version of a Jupyter notebook, looks something like this. And then the regular version of the Jupyter notebook looks like this. So really, it's the same amount of things, and you can see like here's actual Python code has like the Quarto template around it as well.

And so this is just the code bits for like all of the extra stuff that I usually put in. So convert to an HTML, execute the code from top to bottom, add in a table of contents, put it in a special location, really like some type of output folder, and then I have it as a name. This is the actual name of the HTML file. That's what you ended up seeing in that particular example.
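A command with all of those extras might be assembled like this. The exact flags on the slide are not in the transcript, so this is one plausible combination; the notebook and file names are hypothetical, and the table of contents is switched on through the `-M` metadata option.

```python
def quarto_render_command(
    notebook_path: str,
    output_name: str,
    output_dir: str = "output",
    execute: bool = True,
    toc: bool = True,
) -> list[str]:
    """Build a quarto render call for a notebook with the extras listed above."""
    cmd = ["quarto", "render", notebook_path, "--to", "html"]
    if execute:
        cmd.append("--execute")  # run every cell from top to bottom first
    if toc:
        cmd += ["-M", "toc:true"]  # set document metadata from the command line
    cmd += ["--output-dir", output_dir, "--output", output_name]
    return cmd


# Hypothetical names for illustration.
command = quarto_render_command("analysis.ipynb", "analysis-report.html")
```

As noted earlier in the talk, sending output to another directory works when the folder is a Quarto project.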

So the other thing, let's say, for example, you already have your notebook, and everything's already rendered, and you don't really want to change it because everything's already built in that pipeline. But you want to start like a new report. And that report needs bits and pieces from that Jupyter notebook. So what you can actually do is actually reference a Jupyter notebook, add in some metadata comments, and then it can find those bits in the Jupyter notebook and put it into your Quarto document. So you don't really have to change any of the code bits if you already have a Jupyter notebook pipeline. But if you're ready for like, oh, I need to write like another report, and I'm going to try this new tool called Quarto, that's another way you can sort of dabble with using Quarto.

So I have all of these example files. If you look at the repository at the bottom, all the example files are there. There's a Makefile, so you can actually see all the commands that get run as well. But so first, let's render a notebook just so that all of the output is somewhere. And then all you need to do is in the cell of a Jupyter notebook, use regular Quarto comments, and you can add bits and pieces of Quarto-like metadata stuff into a Jupyter notebook cell. And then in your Quarto document, you use a shortcode. You say embed, and then the actual name of the Jupyter notebook, and then literally like whatever the label that you have in that cell. So right here, label: fig-hist. That's all you need to put into your Quarto document.
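The two pieces fit together as shown below. The notebook name and the plotting code are hypothetical, and the label is written as `fig-hist`, which is an assumption about how it was spelled on the slide; the embed shortcode itself is standard Quarto syntax.

```python
def embed_shortcode(notebook_path: str, label: str) -> str:
    """Build the Quarto shortcode that pulls one labeled cell out of a notebook."""
    return "{{< embed " + f"{notebook_path}#{label}" + " >}}"


# In the notebook cell (hypothetical plotting code):
NOTEBOOK_CELL = """#| label: fig-hist
df["value"].plot.hist()
"""

# In the Quarto document:
QMD_LINE = embed_shortcode("analysis.ipynb", "fig-hist")
```

`QMD_LINE` is the only line that has to go into the Quarto document; the notebook itself only gains the one comment line at the top of the cell.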

And then you render the Quarto document just like normal, and you end up with a document like this. So this is a Quarto document, and all I ended up doing was putting in that short code. And this is pulling actually that image straight from the Jupyter notebook and just inserting it into my Quarto document. And at the bottom, you'll see that there's even a reference to the Jupyter notebook. So it does cross-reference back to this is where the original thing came from, and if you want, you can download that notebook as well. So it's a really nice way of like, hey, I just want to start a report. I already have something working, and I don't really want to change that pipeline too much. This is a nice little way to like, oh, use Quarto, use it to build your websites or your slide decks or something, and just use bits and pieces from your notebook world.

Converting and publishing

Last thing I want to talk about is converting. So there's another tool called Jupytext. You can use this particular tool to rewrite your RMD document to a Quarto document or a Jupyter notebook file to a Quarto document. So you just end up with your regular QMD file that's been translated over. And then quarto convert is also another tool that you can use where you can say like, hey, here's my Jupyter notebook file. Turn this notebook file into a QMD file, and it will do its best to translate everything over. It does a pretty good job, but you might still have to go do some minor tweaks of things, but that's about it. But it's a really good way to get started. If you already have a notebook and you just want to, I just want to try Quarto, let me just convert everything and start from there, this is a really good way of dealing with that.
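Both conversions are single commands, sketched here as argument lists. `quarto convert` is the documented route from a notebook to a QMD file; the Jupytext call assumes that `--to qmd` selects its Quarto format, which is worth checking against `jupytext --help` for the installed version. File names are hypothetical.

```python
def quarto_convert_command(notebook_path: str) -> list[str]:
    """Command that turns a notebook into a .qmd file next to it."""
    return ["quarto", "convert", notebook_path]


def jupytext_to_qmd_command(source_path: str) -> list[str]:
    """Command that asks Jupytext for a Quarto version of an .Rmd or .ipynb file.

    Assumes the 'qmd' target name is accepted by the installed Jupytext.
    """
    return ["jupytext", "--to", "qmd", source_path]


# Hypothetical inputs.
commands = [
    quarto_convert_command("analysis.ipynb"),
    jupytext_to_qmd_command("analysis.Rmd"),
]
```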

Last is publication. So you can actually publish your files using quarto publish. So for example, this talk is actually published on Quarto Pub, so if I were you, I'd go to quartopub.com, make an account, just claim your username right now, and then you can sort of publish your stuff onto Quarto Pub, right? So thank you. And then here's all the resources for everything that you need. Thank you.