Christoph Scheuch - Empowering Reproducible Finance through Tidy Finance with R and Python

Oct 31, 2024

21:02

Transcript

This transcript was generated automatically and may contain errors.

Hi everybody, I'm Christoph and I'm really excited that I get to tell you the story behind Tidy Finance and why we believe creating educational content that supports both R and Python yields tremendous benefits for researchers, teachers and students.

The reproducibility crisis in finance

Our story actually begins in 2015. Back then, Professor Campbell Harvey, who is a well-known figure in the field of finance, published a quite provocative paper. In the abstract of his paper, he stated that most claimed research findings in financial economics are likely false. As you can imagine, this triggered quite a bit of debate in finance, and even the Financial Times reported on it back then.

But one of the most important and long-lasting effects this paper had is that top journals introduced code and data sharing policies. So now, as a researcher, when you want to publish a paper, you have to provide your code and at least some mock data, and ideally the actual data that is used to create your results. The Journal of Finance was the first journal to implement this because Campbell Harvey was the editor at the time, so this helped a lot. But the other journals followed as well in the coming years.

Origins of Tidy Finance

One important consequence of these code and data sharing policies was that as young researchers, we kind of faced a hard time. So you see here a picture of my brilliant colleague and co-author and friend Stefan Voigt and me at the time in 2015 when we started our PhD.

There was hardly any public code or data available. So this really sucked because we wanted to do empirical work. So you have to use proprietary data and come up with the code yourself. It was really hard to reproduce papers. I mean, obviously, people are not sharing all the artifacts that you need to replicate the results. It's very hard to do so.

And since we loved working with data, we found ourselves in this well-known position of data scientists: we spent a lot of time preparing data that people had actually already worked with thousands of times. We felt like we were reinventing the wheel every time. So that was annoying.

And Stefan and I, we had different coping mechanisms with this situation. I decided to leave academia, actually. Before I did that, I started writing blog posts. I figured that's a great opportunity to signal to potential future employers that I know different things. I can do asset pricing stuff, for instance.

So what I did is I picked up a textbook that is used for teaching all over the world. And I love the tidyverse. I love R. So I combined those two and created a series of blog posts where I just tried to reproduce what's happening in the textbook. So this is my tidy asset pricing blog post series that I started in 2020.

And what happened is that over time, people reached out to me thanking me for putting this stuff online. They're like, oh, my God, you make my life so much easier. Thank you for doing that. And I mean, they were also asking me to do more stuff, obviously. But I was really happy to receive that kind of feedback. It's like, OK, I mean, people find my stuff, they use it. This is great feedback.

Stefan took another route. He stayed in academia. He's a brilliant researcher. He's a brilliant professor. He's at the University of Copenhagen at the moment. And he created a course. The course is called Advanced Empirical Finance: Topics and Data Science. So in this course, he immediately started working with data. And students, even if they had never programmed before, worked with R and analyzed financial data right from the beginning. And it was a success. Students loved it.

And he created some lecture notes, used some of my stuff. And at some point, we kind of jokingly said, well, you know, we have so much stuff, we could create a textbook ourselves. This is what we did, actually.

So a couple of years ago, we started creating a website to compile all the things that we've worked on. And at some point, CRC Press reached out to us. They discovered our content, said, don't you want to publish with us? And we were thrilled. It sounds really exciting to have a book out there that kind of lives in libraries and bookshelves now.

But it was very important for us to keep the stuff open source. I mean, this is why we started with all of this. And CRC Press was very supportive. They said, of course, no problem. You can just keep your website as it is. Just make sure that people cannot download the PDF. People can still download HTML files and print to PDF. But everything's open source online. And we are really happy that this worked out this way.

Benefits of open source educational content

And this is the first point that I want to make in my presentation. There are tremendous benefits of this open source educational content. First, fairly obvious, everybody can access it. Wherever you are, regardless of your financial constraints, you can just visit the website, use the stuff. That's brilliant.

As an author, it's also great that you can just keep it up to date. I mean, you will immediately, once you send stuff off to the printing press, you will find a mistake like every time. But you can just correct it. You can keep a changelog. When a new package comes around, you can just update all the relevant chapters. You can add new stuff as you go and collect feedback. That's amazing.

And I want to stress this feedback collection very much because readers can actually contribute. So we have our book on GitHub so people can create issues. And it has happened frequently that people raised issues around, oh, are you sure that this is the way to do it? Yes, we were most of the time. But sometimes they also ask clarifying questions because the text is not clear enough or the code is not clear enough. Or they come up with new edge cases that we have not thought about yet. And that's amazing for me as an author to have this kind of feedback.

Transparent code: narrative, code, and results

Now, as a next step, I want to show you a typical page in our Tidy Finance book. Don't focus so much on the details here. It's just about the outline. So on the top of this example, you have the narrative. We economists, we love regression models. So this is an example where there is a regression model. So we explain the equation, what's happening, what do we want to estimate? And this is usually where typical textbooks in finance stop.

What we do is we talk about the package, why we're using it, what potential alternatives might be. So for each code chunk, we try to provide much more context than just describing what's going to happen next from an empirical perspective.

So second piece, the code chunk. We try to keep code chunks as small as possible to solve a very specific problem. Not too many things at the same time in a code chunk, which is sometimes challenging. And we try to keep the object naming and the syntax consistent across all the chapters. So people should feel familiar with the syntax when they browse through the book. And here we're just implementing exactly the regression model that we described above and we're using the function that we described.

And as a third part, we have the results. And as much as I love gt and gtExtras to create amazing tables, in our book, we wanted to print the results as they are for people who would just copy the code into their console. So we didn't want to put many bells and whistles around it because we want it to be easily reproducible. People can just go to the website, copy the stuff that they find interesting and ideally get exactly the same results.

And we call this combination of narrative, code, and results transparent code. And I think there are also a lot of tremendous benefits. I've just talked about the reproducibility. People can just copy the stuff. They can understand what exactly is happening. We can bridge the gap between theory and applications very easily. We also have a chapter on machine learning where we do a brief introduction and then immediately jump into applying machine learning models. And the content is extendable. So when people understand what's happening in a chapter in specific code chunks, they can just copy it and modify it for their specific application.
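To make the narrative, code, and results pattern concrete, here is a minimal sketch in Python. It is not taken from the book: the data are made up, the variable names (`market_excess`, `stock_excess`) are hypothetical, and it assumes NumPy is installed. The narrative would explain that we estimate stock = alpha + beta * market + error; the code chunk does only that one thing; and the results are printed as they are, without table formatting.

```python
import numpy as np

# Hypothetical monthly excess returns (made-up numbers, for illustration only).
market_excess = np.array([0.010, -0.020, 0.030, 0.015, -0.005, 0.020])
stock_excess = np.array([0.012, -0.025, 0.040, 0.018, -0.004, 0.022])

# Design matrix with an intercept column: stock = alpha + beta * market + error.
X = np.column_stack([np.ones_like(market_excess), market_excess])

# Ordinary least squares via NumPy's least-squares solver.
coefficients, *_ = np.linalg.lstsq(X, stock_excess, rcond=None)
alpha, beta = coefficients

# Print the plain results, exactly as a reader would see them in the console.
print(f"alpha: {alpha:.4f}")
print(f"beta:  {beta:.4f}")
```

Keeping the chunk this small means a reader can copy it, swap in their own return series, and rerun it without touching anything else.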

Adding Python support

Some things happened in 2023. We were super happy with the content out there. We were motivated to create additional stuff. And some professors had already picked up our stuff. But then Mark Zellman from the University of Cambridge approached us and said, guys, we love your content. It's great. But I'm not allowed to teach R. I can put you in the recommended literature, but my students have to use Python. We said, yeah, what are you going to do?

Then all of a sudden we receive an email. Somebody on GitHub said, I translate your book to Python. It's like, oh, wow, resist. So he started translating the first few chapters. He did not exactly reproduce it. It was very hard to follow. But it's nice if people do that. It's a community. I get a community feeling.

And then ChatGPT came along. It can write Python code for you. So we thought, it's easy. Let's do that. We can do it with the help of ChatGPT. And it obviously wasn't as easy as we thought. So we spent the last year writing the second book. And fortunately, CRC Press was again very supportive. So we pitched them the idea that we try to translate the stuff to Python. We think there is a big demand for this. And they said, great, send us the completed manuscript in a month. And we thought this is not going to fly.

So we spent a year on that. But we managed to publish this. And we got another co-author on the book who is a Python expert. Because Stefan and I, we were very much coming from R and loving the tidyverse. So we needed somebody to discipline us in the sense that we should do stuff that is very common in Python. That was an interesting experience.

Now, our website supports both languages. So you can go to tidyfinance.org and choose the R or Python version.

Trade-offs of multi-language support

At this stage, I don't want to just talk about benefits. I feel like there's a stronger trade-off here that you have to face when you're creating this type of content. And one big pro to me is enhanced reproducibility. I mean, we had our stuff in R and we just reproduced the stuff in Python. That's great. We found some mistakes, obviously, in the R version. But we now have two consistent versions of the same applications.

It increases the accessibility. And now the University of Cambridge, they can use our stuff. There's also learning flexibility. So for instance, Stefan extended his course by telling his students, I'm just teaching these topics. You might pick your own programming language. So in the same course, you have some people working with R, some with Python. And they are able to kind of communicate with each other. And they are super happy that they can pick the tool that they find most valuable.

And I want to mention, Quarto makes it really easy. We were super grateful that the Quarto team worked on features that we needed to publish the book just as we wrote it. So we actually used pre-release versions to get the final version of the book to the printing press. But it's a great support.

But the cons are huge, because this type of consistency is quite challenging. You have a numerical problem, for instance, an optimization problem. And then you implement it in a different programming language. And all of a sudden, you get different results. So is it the optimizer? Is it the language? Did I make a mistake with the data? You really have to tinker with things until everything works out. And that's the most painful part.
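The kind of discrepancy described here can be sketched in a few lines of Python. This is an illustrative toy, not the book's code: the objective function, the helper names (`objective_gradient`, `gradient_descent`), and the tolerance are all assumptions. Two runs of the same optimizer with different settings stand in for two implementations in different languages. They land on slightly different numbers, so an exact comparison fails while a comparison within a tolerance passes.

```python
import math


def objective_gradient(w):
    # Gradient of the toy objective f(w) = (w - 0.3)^2 + 0.5 * w^2.
    return 2.0 * (w - 0.3) + w


def gradient_descent(start, step_size, iterations):
    # Plain gradient descent with a fixed step size.
    w = start
    for _ in range(iterations):
        w -= step_size * objective_gradient(w)
    return w


# Closed-form minimizer: 2 * (w - 0.3) + w = 0 implies w = 0.2.
closed_form = 0.2

# Two runs with different settings, standing in for two implementations.
run_a = gradient_descent(start=0.0, step_size=0.1, iterations=50)
run_b = gradient_descent(start=1.0, step_size=0.05, iterations=50)

# Exact equality is too strict for numerical results.
print("exactly equal:", run_a == run_b)

# Comparing within an explicit tolerance is the more useful check.
print("close enough: ", math.isclose(run_a, run_b, abs_tol=1e-3))
print("run_a vs closed form:", abs(run_a - closed_form))
print("run_b vs closed form:", abs(run_b - closed_form))
```

Comparing each run against a known reference value, where one exists, helps narrow down whether the difference comes from the optimizer settings or from the data.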

You now have double the code maintenance. So if somebody discovers a potential mistake in your code and you have to correct it, you have to do it twice. Or another example, I updated the data that we use in the book for another year. And now instead of looking at 12 chapters, I had to check 24 chapters. So that was kind of a painful experience. But overall, it's totally worth it for us. So we're really happy that we now have both R and Python versions on the website. And we are really looking forward to taking this to the next level by creating new content.

Conclusion

So with this, I want to conclude my talk. I hope that our story inspires some of you to make the world a more reproducible place. And I think there are three very simple ingredients. First, share your code publicly. We've heard many stories about blog posts. Or you don't necessarily have to write a book. So you can collect feedback.

Use Quarto, because combining narrative, code, and results in the same document is just a much better experience than sharing code as a script with your students. And if you have the capacity to support multiple programming languages, it's also an amazing experience. Thank you everybody for your attention. I'm happy to connect via LinkedIn or chat more here at the conference about reproducibility in finance or maybe in your field.

Q&A

But first one, was it hard to format the book for both HTML and for the version sent to the publisher? Did you have to change much?

Yeah. So in LaTeX, there are these style files, .cls files, and they are a real pain because there are thousands of lines of code that you somehow have to integrate. And I felt like on some occasions we reached the limits of the Quarto documentation, but eventually it worked out. So we can just compile the same Quarto files and render both the HTML and the PDF version of the book. So that was fine.

Although you had to create two versions of the same book, do you think it would have been easier to teach people to think like a programmer while using R rather than creating R and Python books? That way it forces them to experiment?

Well, it sort of happened. I mean, we started out with the R version and figured this is it. This is what we'll be doing for the rest of our lives. But yeah, life changes in unexpected ways. So I'm curious to see where this goes. We had a request to create a Julia translation. I have some experience with Julia, but I don't think I have the capacity at the moment. Well, thankfully it's open source, right? So if anybody else wants one... Exactly. Yeah, if somebody wants to create a Julia translation, I'm happy to support.

Can you speak to your experience trying to have ChatGPT translate for you?

I only have 20 seconds left. So I use ChatGPT in my everyday life now as a coder, also for stuff that is purely in R. Usually when we translated something, we just asked ChatGPT to translate it. Then we tried to implement it, we tested it, and it didn't work out. So we debugged it. And fortunately we had four co-authors. So basically everybody did the same and we got different responses and somehow we merged everything together. So there were basically no cases where you would just ask ChatGPT to translate it, get the code back and use it. We always had to clean it up, make it more concise, make it work, something like that.

So alluding to Emily's point of using method chaining, this is something that ChatGPT does not use at all. And we love it very much because it makes the code more readable.
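For readers unfamiliar with method chaining in Python, here is a minimal sketch using pandas. It is not from the book: the data are made up and the column names (`symbol`, `ret`, `ret_pct`) are hypothetical. Each method returns a new DataFrame that feeds the next step, so the pipeline reads top to bottom, much like a tidyverse pipe in R.

```python
import pandas as pd

# Hypothetical toy data: two stocks, three monthly returns each (made up).
returns = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "ret": [0.01, -0.02, 0.03, 0.00, 0.02, 0.04],
})

# Method chaining: every step returns a DataFrame that the next step uses.
summary = (
    returns
    .assign(ret_pct=lambda df: df["ret"] * 100)  # convert to percent
    .groupby("symbol", as_index=False)           # one group per stock
    .agg(mean_ret_pct=("ret_pct", "mean"), n_obs=("ret_pct", "count"))
    .sort_values("mean_ret_pct", ascending=False)
    .reset_index(drop=True)
)

print(summary)
```

The alternative style assigns an intermediate variable after every step, which works equally well but makes the sequence of transformations harder to see at a glance.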