Reproducible Publications with Julia and Quarto | J.J. Allaire | JuliaCon 2022

Jul 29, 2022

24:32

Transcript

This transcript was generated automatically and may contain errors.

Hey everyone, my name is J.J. Allaire. I work at RStudio and I'm here today to talk about a new scientific and technical publishing system called Quarto, and specifically how Quarto can be used with Julia. So a little bit of a road map. I'll cover first what Quarto is and where it fits among other projects you might use or be aware of. I'll talk a little bit about the concept of scientific markdown, which is an important thing that underpins Quarto and its design philosophy. I'll talk a little bit about the different types of output you can create with Quarto: different types of documents, but also websites, books, presentations, and blogs. So we'll cover that and provide some examples of using Julia with all of those formats. And then we'll take more of a deep dive into how specifically Quarto works with Julia, some of the ins and outs of that, and how some of that can be made better.

What is Quarto?

So I'd start with the very basics, which is what specifically Quarto is. You might be familiar with other systems that are similar to Quarto. I would call it a literate programming system in the tradition of Org Mode or Sweave, or, in Julia, you may have seen Weave.jl; R Markdown and Jupyter Book are others. The fundamental idea is that we want to create a scientific and technical publishing system that uses markdown and enhances it with features that are required for scientific communication. First and foremost of those is the incorporation of computations into documents and publications. Again, at the core we're going to use markdown. In scientific communication there's a substantial tradition of using LaTeX, and as you'll see, the flavor of markdown that is used within Quarto and within Pandoc borrows a lot from LaTeX and in fact is able to incorporate LaTeX directly. And then, as I said, we'll talk about the different types of output that can be created.

So where did this project come from? It's an open source project. It's sponsored primarily by RStudio, and it comes out of about 10 years of experience with a similar system called R Markdown, which was announced 10 years ago almost to the day. We worked on R Markdown for about 10 years and developed a lot of good, sound core ideas, and folks in the R community got a huge amount of value out of the system. But as we're all aware, the number of languages and runtimes and environments used for scientific discourse is very broad. There's R, there's Julia, there's Python, there's Scala, and there will be many more. What we really wanted to do was take all of the experience we had with R Markdown and build a new system, reimagined with the benefit of hindsight, and make that system fundamentally multi-language and multi-engine. So it's not tied to R in any way, and not tied to Python or Julia in any way. It's able to work with lots of different languages and engines that are available now and in the future.

Goals of the project

Before I get into more of the mechanics, the how and the what, I wanted to talk a little bit about the goals of the project. One is this idea of computational documents, that is, documents that incorporate the source code required for their production directly inside the document. This can take the form of notebooks, which we're all familiar with, but there are also plain text flavors of computational documents. The goal here is principally reproducibility: the ability to replicate the studies and documents and analyses that we create. And also automation. Automation leads to reproducibility, but it's also a significant practical benefit to working this way. So we want to provide a system that makes it easier than not to create computational documents.

We also have this idea of scientific markdown. I'm sure many or most of you are familiar with preparing scientific manuscripts, and you can see here that if you try to use a tool like Microsoft Word, it might be relatively smooth going at the beginning, but the requirements of technical documents quickly escalate and make a tool like Word quite unwieldy. You may have used LaTeX, which, as you can see from the red line here, starts out at a higher bar of difficulty, but it's a relatively flat curve once you climb up to that bar. So LaTeX has been a great system for creating technical documents, but it is not the most accessible. Markdown has some similarities to LaTeX in that it's a plain text format that's compiled or rendered to a final document, but it's not nearly as capable as LaTeX. It's easier to start with, approaching as easy as using something like Word, but it doesn't quite have the features and functionality of LaTeX. So our goal here really is to take Markdown, hopefully take that line all the way down to where Word starts, make it extremely easy to work with Markdown, but then have a system whose complexity scales a lot more like LaTeX. Once you learn the basics of it, it's very straightforward to do sophisticated things.

And another goal is this idea of single-source publishing. The content that we create oftentimes needs to go to multiple variations of HTML, it may need to go to print, it may need to become a Word document or a presentation or an EPUB book. There are lots of different ultimate locations that we need to deploy our content to, and we'd really like to write it one time. And that's what a system like Markdown, or the approach to publishing that Markdown affords, leads to.

A concrete Quarto example

Okay so let's take a simple concrete example of Quarto. This is a Markdown document, you can see on the top there's some metadata which gives basic information, title, author, also points to a specific Jupyter kernel to use, in this case Julia 1.7. And then you see some Markdown, you see a code block there, an executable code block, it's got a little bit of Julia code in it, it's got some other metadata in it, some options specifying a label and a caption. And this is a complete Quarto document and what you can see on the bottom is we can render that document to a wide variety of different formats. So let me show you a little bit of what that rendered content looks like. You can see HTML is a web page, PDF you can kind of tell from the typography that this is something that was produced by LaTeX, but that's a PDF variation of the same document produced from the same code. You can see a Word version of the same document and then a PowerPoint slide produced from the same code. So one set of code producing multiple formats.
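To make the one-source, many-formats idea concrete, here is a minimal Python sketch that builds one `quarto render` command per target format. The file name `hello.qmd` and the helper name `render_commands` are assumptions made for illustration; the four format names passed to `--to` correspond to the HTML, PDF, Word, and PowerPoint outputs described above. The sketch only prints the commands and does not run Quarto.

```python
import shlex

# Format names for the four outputs mentioned in the talk.
FORMATS = ["html", "pdf", "docx", "pptx"]


def render_commands(document, formats=FORMATS):
    """Build one `quarto render` argument list per output format.

    `document` is a hypothetical file name; nothing is executed here.
    """
    return [["quarto", "render", document, "--to", fmt] for fmt in formats]


# Print the commands so they can be copied into a terminal.
for command in render_commands("hello.qmd"):
    print(shlex.join(command))
```

Every command starts from the same source file, which is the point of single-source publishing: only the `--to` argument changes.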

I want to dig a little bit more into this code cell construct. If you're familiar with Markdown, you've seen a normal fenced block, three backticks followed by julia, which says I would like to include some Julia code for the reader. The braces around julia indicate that this is an executable code block, which means it's going to be run when the document is rendered. The code is executed and its output is included in the document. You can see at the top there are some special comments that provide options; in this case echo: false means don't show the source code. There's a whole bunch of different options. I won't enumerate them all here, but as you can see from the slide, there are lots of different ways to control how output from code cells is handled.
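As a rough illustration of how option comments at the top of a cell can be separated from the code itself, here is a small Python sketch. It assumes the options are written as `#| key: value` comment lines and handles only that simple form; the function name `parse_cell_options` and the label `fig-sine` are hypothetical, and this is not a description of how Quarto itself parses cells.

```python
def parse_cell_options(cell_source):
    """Split leading '#| key: value' lines from the remaining cell code."""
    options = {}
    lines = cell_source.splitlines()
    index = 0
    # Option comments are only recognized at the very top of the cell.
    while index < len(lines) and lines[index].startswith("#|"):
        key, _, value = lines[index][2:].partition(":")
        options[key.strip()] = value.strip()
        index += 1
    code = "\n".join(lines[index:])
    return options, code


# A hypothetical Julia cell body with two options.
cell = "\n".join([
    "#| label: fig-sine",
    "#| echo: false",
    "using Plots",
    "plot(sin, 0, 2pi)",
])

options, code = parse_cell_options(cell)
print(options)
print(code)
```

In this sketch the values stay as plain strings, so `echo` comes back as the text `false`; a real implementation would interpret the values as typed settings.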

Talking in a little more depth about how the rendering pipeline works: I was showing you a .qmd file, which is a Markdown text file, and it's essentially picked apart into Julia code chunks and Markdown. The Julia code chunks are executed using Jupyter, and specifically IJulia. I'll talk a little bit more about that soon. Their output is turned into Markdown and sent to Pandoc, which then renders the final output formats. That's the scenario for a .qmd file. You can also just take an .ipynb file directly. So an .ipynb file, a Jupyter notebook with the Julia kernel, can be rendered directly to Markdown, on to Pandoc, and then into all the various target output formats.
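The first step of that pipeline, picking a document apart into Markdown and executable code chunks, can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not Quarto's implementation: it treats a fenced block as executable only when its language is wrapped in braces, and the names `split_document` and `EXECUTABLE_CELL` are made up for the example.

```python
import re

FENCE = "`" * 3  # three backticks, built here so the example is easy to embed

# An executable cell opens with a fence whose language is wrapped in braces.
EXECUTABLE_CELL = re.compile(
    r"^" + FENCE + r"\{(\w+)\}[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def split_document(text):
    """Return ("markdown", text) and ("code", language, source) pieces in order."""
    pieces = []
    position = 0
    for match in EXECUTABLE_CELL.finditer(text):
        if match.start() > position:
            pieces.append(("markdown", text[position:match.start()]))
        pieces.append(("code", match.group(1), match.group(2)))
        position = match.end()
    if position < len(text):
        pieces.append(("markdown", text[position:]))
    return pieces


# A hypothetical document: one executable cell and one display-only block.
sample = "\n".join([
    "# Hello",
    "",
    FENCE + "{julia}",
    "1 + 1",
    FENCE,
    "",
    "Some text.",
    FENCE + "julia",
    'println("shown, not run")',
    FENCE,
])

for piece in split_document(sample):
    print(piece[0])
```

Only the cell with braces is classified as code. The plain fenced block stays inside the Markdown piece, which matches the distinction between showing code to the reader and executing it.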

Scientific markdown features

So let's go through some examples of what this sort of scientific Markdown looks like in Quarto. You can see this is some Markdown syntax that's citing things inside a work. Bibliographies can be specified in lots of different formats, including BibTeX and CSL. And you can see that when I use that Markdown, the citations are resolved in the document. As you know, many disciplines have many different variations and standards for how citations and bibliographies are formatted. There's a thing called the Citation Style Language that allows you, once you've got that bibliography, to output citations in over 10,000 different styles. So very robust support for citations.

There's also support for cross-references. I want to reference a figure, a subfigure, a table, or an equation, and I want to have those references automatically numbered and resolved in my document rather than having to track them manually. So this is an example of referencing a figure and a subfigure. Again, we can do this for tables, equations, theorems, and sections. I can show you a little bit of what the syntax for that looks like in Markdown. Here I'm defining a couple of figures with an HTML-looking ID, and then I reference it here, and you can see those references are automatically resolved. The same thing works for computations. There's an example in Jupyter that I won't show now, but basically, if you have a code cell that produces a couple of plots, you can say I'd like to be able to cross-reference those plots, and the numbering system will also work with computational outputs.
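The automatic numbering described here can be illustrated with a short Python sketch. It assumes figures are given IDs of the form `{#fig-name}` and referenced as `@fig-name`; the function name `resolve_figure_references` and the sample file names are hypothetical. It only models the numbering idea for figures and is not how Quarto resolves references internally.

```python
import re


def resolve_figure_references(text):
    """Replace @fig-... references with numbers based on definition order."""
    numbers = {}
    # Number figures in the order their IDs are defined in the text.
    for identifier in re.findall(r"\{#(fig-[\w-]+)\}", text):
        numbers.setdefault(identifier, len(numbers) + 1)

    def replace(match):
        identifier = match.group(1)
        if identifier not in numbers:
            return match.group(0)  # leave unknown references untouched
        return f"Figure {numbers[identifier]}"

    return re.sub(r"@(fig-[\w-]+)", replace, text)


# A hypothetical document that references figures before defining them.
document = "\n".join([
    "See @fig-tiger and @fig-elephant.",
    "",
    "![Elephant](elephant.png){#fig-elephant}",
    "",
    "![Tiger](tiger.png){#fig-tiger}",
])

print(resolve_figure_references(document))
```

Because numbers come from the order of definition rather than the order of reference, moving a figure renumbers every reference to it without any manual tracking.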

Callouts are another thing that are quite useful. They're used very often in books, and they allow you to highlight specific pieces of content in different ways. And then, if you're familiar with LaTeX, the LaTeX grid system, there's quite a bit that can be done to do sophisticated layout of pages, including the use of one or both margins, putting notes or content in the margins, and having figures or code span the full page while still maintaining an optimal reading width. We have a lot of tools in Quarto for advanced page layout. This is an example of some of the different kinds of columns, including full-bleed and full-width treatments. We also have lots of ways to use the margin: you can put content in the margin, but something that's also very popular is putting an equation or side note or even footnotes in the margin. So lots of tools for advanced page layout. And then finally, we have integrated support for embedding diagrams, either Mermaid diagrams or Graphviz diagrams. So this is an example of a Mermaid diagram, very helpful for a lot of technical publications.

Output formats

Let's talk a little bit about output formats. There are lots of different document formats. I've highlighted HTML, PDF, and Word, but there's also JATS, ConTeXt, RTF, AsciiDoc, lots and lots. There are probably over 30 different formats supported. For presentations, there are lots of different formats available too. Reveal.js does HTML presentations. We can create PowerPoint presentations, and Beamer presentations as well. There are lots of advanced features in Reveal.js: speaker notes, printing to PDF, animations. I'll give an example here of a Julia presentation that takes advantage of Julia to do some fancy diagrams. I'll just flick through here quickly, but this is an HTML presentation created to talk about a Julia package.

Websites are another very useful type of content you can create. Here's an example of a Julia workshop for data science. You can see a website with the kind of navigation you'd expect. You can see callouts and code blocks, as you'd expect. So this is a website, and almost all of this navigation is provided automatically by the website framework. Search is provided automatically. So it's a very convenient way to publish collections of documents. And the Quarto website itself actually uses Quarto. As you can see here, we've got hundreds of documents that are easily navigable and also support search.

Books are sort of in some ways a variation of website. They inherit all the features of Quarto websites. So navigation, search, etc. But they also support cross-references across chapters. So when you reference a figure from chapter 2 and you're in chapter 5, it'll say figure 2.3. And it also supports print format. So books support PDFs, Word documents, EPUBs, as well as essentially a website for your book. And here is a book example created by Doug Bates about mixed-effects models with Julia. And you can see here's examples of hiding and showing code. There are some cross-references in here. You can see some citations, tables. And I don't see a download for it, but it would be again possible from the same source to create a PDF of this book or an EPUB version of this book.

And then blogs are another variation of websites that, again, can have any old pages you want and arbitrary navigation and search. But blogs are also collections of posts, and so we can automatically generate a listing and automatically generate an RSS feed. Here's an example of a blog created for Julia. You can see here's a list of posts, and you can see it's got its own kind of theme. I won't cover it in this talk, but Quarto has lots and lots of different themes, and you as the publisher can create your own themes or adapt themes that are created by others. So that's an example of a blog created with Quarto and Julia.

Using Quarto with Julia

Okay, so let's talk more specifically about how Quarto and Julia can be used together, how we actually execute Julia computations, and what some of the drawbacks and opportunities for improvement are. If you want to use Quarto with Julia, the first thing you need is Jupyter, because we're going to use Jupyter to manage execution and collect output from execution. So you install Jupyter and then you install IJulia, the Julia kernel for Jupyter. That's all that's required. It's recommended, and I'll explain why in a minute, that you also add the Revise package. The basic workflow, and this really applies to Julia or Python or R or any engine that you would use with Quarto, is to run quarto render to render a document. I showed that earlier. You can also render a notebook. So if you prefer to work in a notebook versus plain text, you can just render the notebook directly. And then there's quarto preview. I'll get into Quarto tooling in a few minutes, but preview lets you preview documents, and as you save them and work with them, the preview is automatically updated. So that's a nice iterative workflow that you can use before doing final rendering.

So this is an example of that preview. I've got a Julia notebook in JupyterLab, and as I work with the notebook and save it, the preview on the right is automatically updated.

All right, so let's talk a little bit about IJulia. If Quarto sees a Julia code cell inside your document, it's automatically going to assume that it should use the IJulia Jupyter kernel. It'll find the most recent version of the Julia kernel on your system, but you can also pick a specific version by specifying it directly, in this example Julia 1.7. IJulia executes Julia code and then transforms the output to plain text, graphics, Markdown, or HTML. It knows how to take Julia output and render it into something that can then be published. One piece that I'll get into in more depth in a minute is that for interactive sessions, we don't want to absorb the startup time of the kernel for every document render, so Quarto will keep the Jupyter kernel resident to mitigate this. Revise is then used to make sure that if dependent files or packages change while that long-running session exists, they're updated and refreshed.

Managing performance

I want to talk a little bit about how we manage two types of performance. One is startup performance, which is how long it takes to load the interpreter and packages and how often I need to do that. The other is rendering performance: how expensive are computations and how frequently do we need to run them. For startup performance, I talked a little bit about this idea of keeping a kernel daemon around to mitigate startup costs. That hello.qmd example from earlier takes about 30 seconds on first run on my machine, but it takes less than half a second on subsequent runs, so keeping that kernel around is really valuable. But that creates the problem of stale code, and the solution to that, again, is Revise, which I'm sure most of you are familiar with from long-running REPL sessions. You can make Revise always run inside IJulia by adding this code to the IJulia startup file. This is all documented on the Revise website. So people have found Revise to be a great addition to IJulia when working with Quarto.

Thinking about rendering performance, this has nothing to do with startup performance; this is just how long my computations take. If I'm working iteratively on a document and focusing on content, I don't always want to rerun all my computations, so there are a few approaches to that. One is just authoring inside a notebook, which allows you to control exactly when code execution occurs and keeps the results cached in the notebook. Another is Jupyter Cache, a package you can install separately, which will cache all your cell outputs. It's all or nothing, so if any of your code changes it's got to re-execute the whole document, but if you've done your computations and now you're writing and doing analysis, this allows you to re-render with zero computations. And then, I won't get into all the mechanics of it, but Quarto has a freeze feature that is separate from the cache and a little more explicit and durable: you can permanently save and reuse computational output. So for example, you don't want to have to keep re-rendering a blog post that you wrote three years ago, and so its output can be saved. Or if you're deploying to a server that may or may not have the permissions and software required to render, you may want to render everything locally, freeze it, and then deploy the frozen execution results onto the server.
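The all-or-nothing behaviour of a document-level cache can be modelled with a short Python sketch. This is a conceptual illustration only: the class `DocumentCache`, its methods, and the fake executor are hypothetical, and nothing here reflects how Jupyter Cache or Quarto actually store results. The idea is that one key is derived from every cell's source, so a change to any cell invalidates the whole document.

```python
import hashlib


class DocumentCache:
    """Cache all cell outputs under one key derived from all cell sources."""

    def __init__(self):
        self._outputs_by_key = {}
        self.executions = 0  # how many times the whole document was run

    @staticmethod
    def _key(cells):
        digest = hashlib.sha256()
        for source in cells:
            digest.update(source.encode("utf-8"))
            digest.update(b"\0")  # separator so cell boundaries affect the key
        return digest.hexdigest()

    def render(self, cells, execute):
        key = self._key(cells)
        if key not in self._outputs_by_key:
            # Any change to any cell lands here and re-executes everything.
            self.executions += 1
            self._outputs_by_key[key] = [execute(source) for source in cells]
        return self._outputs_by_key[key]


def fake_execute(source):
    """Stand-in for running a cell in a kernel."""
    return f"output of: {source}"


cache = DocumentCache()
cells = ["x = 1", "x + 1"]
cache.render(cells, fake_execute)           # first render executes the cells
cache.render(cells, fake_execute)           # unchanged code, nothing re-executes
cache.render(["x = 2", "x + 1"], fake_execute)  # one cell changed, all re-execute
print(cache.executions)
```

Editing only the prose between cells would leave the key unchanged in this model, which is why re-rendering while writing costs no computation.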

We picked IJulia mostly because we had made a bunch of investments to make Quarto and Jupyter work really well together. One was this kernel daemonization and caching work. IJulia had also implemented a lot of primitives for supporting MIME outputs from Julia results, including the ability to output raw LaTeX, which ends up being important for some more sophisticated tables. And as a format, .ipynb was supported in a bunch of popular notebook front ends: JupyterLab, VS Code, etc. So I think it aligned well with a lot of infrastructure that we had and that users were taking advantage of. But it's definitely conceivable that other notebook or literate programming systems like Pluto or Neptune could be integrated as an alternative to IJulia. So that's something we'd certainly be interested in talking about. The execution engine system is pluggable and intended for future extension, so it's something that we could definitely do.

Tooling and editors

A little bit about tools. We have a VS Code extension for Quarto. I won't go into exhaustive detail about all the different features, but it's pretty deep feature-wise. As regards Julia specifically, it integrates with the Julia VS Code extension. So here I've got a Julia .qmd document, and I'm doing side-by-side render and preview. Here I'm actually able to run code. You can see Run Cell. I can run individual cells in the .qmd, and their results are put in the interactive Julia session that's running in the terminal, with plots displayed as well. And we also integrate with the Julia VS Code extension for code completion. So lots of good productivity tools are available if you're a VS Code user.

As I showed before, we also integrate pretty well with JupyterLab. You can run quarto preview on a notebook, launch it in JupyterLab, and as you're working with it and you save it, there's a browser preview of it that refreshes automatically. You can also use quarto preview with any text editor. Just type quarto preview julia.qmd or julia.ipynb from any terminal, and you'll get live reloading from whatever editor you're in. There are also some Quarto extensions available for popular editors, some of which we maintain and some of which others maintain: Emacs, Vim, Neovim. I'll make the link to these slides available at the end of the talk, so you can find those links if you need to.

And that is, in fact, the end of the talk, and there is the link to the slides. We've got the Quarto website, and specifically the Quarto and Julia article you can see linked there. It repeats a lot of the Quarto and Julia section, so if that went a little bit fast, the article breaks it down in much more detail and has more step-by-step instructions. The Getting Started guide gives you a tutorial that walks you through the basics of how to use the system. We'd love to hear from folks in our discussions or GitHub issues, and I'd also love to hear any questions that people have now. Thank you.
