Resources

Johnny Breen | Model Migration: from Excel to R | Posit (2022)

video
Oct 24, 2022
16:46

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

So I'm here to talk about model migration from Excel to R, but I'd like to actually begin this presentation with a story.

So this is a story about a model migration exercise that I did about five or six years ago, and it was from Excel to R, subject of this talk, and the ultimate outcome of this story was that I really improved my programming skills in the language R. So, you know, this is a positive story, right?

But I don't think it would be interesting to tell a story about just good things happening to me. I think the best stories have conflict, and in the best stories, things go wrong. And this is a story, I'm pleased to say, where the price of my success is that I ended up crashing the server.

And it's doubly tragic, because not only did this happen to me, I was also having to use Excel day in, day out, and I know people in the audience, if you've had that experience before, that can be a bit of a struggle.

The job

So let me take you back five or six years ago. Where did this all start? So this started with my boss at the time.

And my boss comes over to me, and he says, Johnny, I've got a job for you. And the job is this. So we have a model that we've configured in Microsoft Excel. It's slightly complicated. It takes multiple people multiple days to actually execute the model. And it's highly error-prone process. So at the end of it, we often have to check if things have gone wrong. There's a lot of validation involved in this process. And personally, I think it would be better suited to a programming language like R. So would you be interested in doing this?

So my instinctive reaction is, yeah, that sounds great. What's the model? So my boss goes on to explain, it's a statistical model, okay? And it's also a simulation model. So at the time, I was working in financial services. And this was for a banking client. And what they were interested in was multiple simulations of yearly loss scenarios. And after those simulations had been executed, what they wanted to see was, well, if you plotted them on a distribution, what would the 99.9th percentile look like?

And if you think about that, for a yearly loss distribution, that's an event that would only happen once in 1,000 years. So it's pretty rare. And the point about this is, this was not only a complicated model, it was something that was quite intensive. It needed to be run multiple times. And I'm talking 10 to 15 million iterations. So I was aware of the challenge that lay before me.

Diving into the Excel model

And so I dived into the Excel model. And it's kind of hard to explain to people who haven't worked in financial service or industries like this before, where you come across these Excel models, and they're monolithic. They're trying to do everything. And I think this picture kind of sums up the feeling of walking into one of these things. It's a bit like booking an Airbnb, and you turn up in a demolition house.

But I was resilient. I was a resilient individual. And I understood the task before me. And I wanted to carry this out and prove that I could do this. So I opened up a script, a single R script. And I proceeded to copy the contents of the Excel model into this script. And I basically did this verbatim, because quite honestly, I didn't fully understand what was going on in the Excel model. And I'd also written on my CV that I was an advanced user of R, which wasn't quite true.

And I proceeded to write the whole thing up. This was painful, but I got through it. I was really proud of myself. And I kind of thought that, like, by the end of this, like, my boss is really going to be impressed with this. It's going to really change the business. So I go away to lunch. I click run. And I'm like, great. Happy days. I'm on cloud nine.

Crashing the server

Then I come back from lunch, and there's a message in my inbox. And somebody has sent an email, and it's not addressed to me. It's addressed to the team. That's the first kind of sign that you've done something really wrong. And someone says, what's the effect of I can't log on to the server?

Now I'm a junior analyst at the time, so I don't really understand the consequences that I can create for other people. So my reaction to this is, well, that's a bit of a shame for you, but I hope you solve the problem soon.

And then inevitably you start getting other messages. So someone else says, yeah, I can't log on. Someone else inevitably needs this server for critical work, and you're blocking it. And then eventually someone comes to you, and they're like, do you know what's going on, Johnny?

My initial reaction is no, but then it quickly dawns on me that I kind of do have some idea of what's going wrong. That very last shred of denial is quickly eroding, and I finally realize that something I've done has caused this mess. And I realize that I crashed the server.

Now I put a celebratory emoji there, because I think this is a bit of a badge of honor. If you're ever involved in this, treat it as a badge of honor. Don't treat it as a total failure. You will come through better in the end.

Now I put a celebratory emoji there, because I think this is a bit of a badge of honor. If you're ever involved in this, treat it as a badge of honor. Don't treat it as a total failure. You will come through better in the end.

So this was — I jest about this, and it's kind of a funny experience in hindsight, but at the time this was kind of a serious issue. So I was working in an industry that is dominated by Microsoft Excel. In fact, Excel is like a fortress. Most models are configured in Excel, and Excel is the most trustworthy piece of software for many senior executives in these scenarios.

There's a quote from The Wire. I don't know if anybody has seen The Wire. The character Omar says, if you're going to come at the king, you best not miss. And I'd just come at the king, and I'd missed pretty badly. I was lucky enough to have a boss at the time who was sympathetic enough to, you know, persevere with this project beyond my drastic failure.

Tip 1: Don't copy or replicate Excel

And so what's the moral of this story? Why am I here? What kind of advice can I give you? Let's kind of analyze what went wrong here. So the first big tip I have, and it's really the biggest fundamental problem that I encountered here is, you don't really want to copy or replicate Excel. Okay?

Now, like, this is an obvious point to make to people if you're a seasoned programmer, but let's take a step back and put yourselves in the shoes of somebody who hasn't really written much code in their life. So somebody who's only ever used Excel before, for somebody like that, everything looks a bit like this. And that's a spreadsheet. But if you think about it visually, it's a bit like a data frame in R. And that's not a mistake. But Excel models a lot of things using data frames.

And whilst this isn't a problem in itself, you end up with a lot of Excel models which have a few questionable design choices. So the first issue you encounter typically in the wild is Excel models which start out with, you know, a few number of tabs of slowly increasing levels of information, and you think we're okay, we're stable, and then by the end of it, you've got hundreds of tabs. And if you're the creator of this model, you typically end up having a nervous sweat any time anybody else needs to use it. It's a very precarious situation to be in.

And all of this comes about because the most natural data structure to use when you're using Excel is a data frame object. So this is one issue that you encounter sometimes. The second is a lot of these models, you can store data just in a spreadsheet or in a data frame. So you don't really need a database, right? Well, not quite. I mean, a lot of these models probably should be supported by a database, but they're not. And that's nobody's fault, right? Like a lot of analysts who have built these models don't necessarily have the knowledge or the skills to even think about databases, let alone implement a database.

Now why am I mentioning all this? These are issues with Excel models. But the point is, if you're migrating an existing model into an R environment, you definitely don't want to be copying over these design choices. If you do this, then you're only going to reintroduce the problems in a different context.

Tip 2: Structure your model

So what's my next tip off the back of this? So what you want to do, instead of kind of copying what's going on in the existing platform, which is Excel in this case, you want to think about the structure of this model. So what you want to do is abstract away the implementation that you see in Excel and think about, well, what is it actually trying to do, right? What are the functional requirements? What is this model trying to achieve?

And the first fundamental step of this process is where does my model begin and where does it end? And that can be pretty tricky with some of these Excel models. They are so monolithic that it's actually very hard sometimes to know where the calculations begin and where they end. But this is a process that you have to go through in order to start unpicking this object and migrating it across.

So this is why I have this A to B presented here. So you want to kind of know, how do I get from point A to point B? That's the first ingredient. The second thing is, well, along that route, what are the operations that I'm applying to the data at point A to get to point B? And if you're doing a migration, especially in R or any language, really, these are really important things to identify, because you'll be able to encapsulate these into functions and objects that you can use in your programming environment. So this is a good kind of planning scheme to go through before you actually do anything.

And the second thing kind of related to this is you want to think about, what are the data structures that I need to transport the data along this route? Okay. Just because the Excel model that you're looking at is doing everything in Excel and you've got multiple tabs and multiple bits of information hanging in memory, it doesn't mean that you need to do the exact same thing when you migrate this into R. In fact, you could probably save a lot of the hassle, a lot of the overuse of this memory and all the other problems that come with it by just thinking about, what do I need to retain as I go from A to B?

So let me kind of put this into context with an example. So I mentioned earlier that I've been working on a simulation model at this time. Now, not to go into too much detail, but the simulation model in each outcome, it would generate one simulation which had a set of 30 numeric outputs across 30 different risk classes.

And one stage of this model is involve the task of trying to aggregate these simulations together. So the author of the Excel model decided that, well, the best idea that we can use in this context is we'll take the current simulation that you can see on the screen here, and then we'll take the previous simulations, OK, the previous set of simulations, we'll add that on to the existing set, and then we'll get another set of simulations that is one row bigger than the previous. And that would go on recursively until you got to, in this case, the 15 millionth iteration.

So yeah, this was something you needed to execute across multiple spreadsheets, and it was a slow process. But here's the problem, right? If you kind of replicate this approach and you just stick it in an R script, you're going to run into trouble really quickly. And it'll look something a bit like this. And this is like a classic problem that you'll encounter in the R programming world, which is copying objects in loops unnecessarily.

And unfortunately, like, this is one of the key design flaws that I had first implemented in my script, which meant that this thing would just eat up your memory and it would keep on going for ages. You probably wouldn't get to the end of this on 15 million iterations, at least within a few days. So this was a really, like, key design flaw in the way I had structured this.

So how did I fix this in the end? So I thought about it in a more abstract way, and I abstracted away from whatever the model was doing initially, and I thought, well, what I actually need to generate here is a list of simulation results, and that will be equal to the length of the number of simulations I want to run. And for each of these entries in this list, I just need a function that will generate my simulation results across these 30 risk classes, and then I just need to run this again and again until I get to the very end. And then we just need to bind these together.

So this looks simple and, like, it was kind of a subtly different way of designing this, but this very different way of doing this was actually, like, a massive improvement on the existing script that I had written. And it's small tweaks like that where you think about the requirements, what you're trying to do, which have quite a significant effect on the performance.

Tip 3: Keep things simple

So my key tip here, as I've said already, is to structure your model. And the final tip I have here is try to keep things simple. So one of the problems is, and I don't know what possessed me to do this at the time, but as a budding analyst, I thought it might be a good idea to, you know, try and do a bit of parallel processing in this script. And I think this was the straw that broke the camel's back in my case.

Coupled with some of the other design flaws that I'd implemented, I was trying to do multi-threaded processing and it all went a bit wrong. I didn't really read the documentation properly. And I kind of was overcomplicating things, right? Was this really a concern that I needed to address? The answer is no. What I really needed to do with this process was start simple and start with a sample of iterations that I'm trying to run. And then I want to build this up gradually, profile what I'm doing, and work from there. So doing the basics right in these cases can be really important.

So with that, I want to thank you for listening, and I want to just leave you with this quote from Bob Ross, which I really like, is that, you know, you make mistakes, and we all make mistakes in life, and it's a cliché, but I like the addition that mistakes are just happy accidents, right? Like now I can jest about this and I can be happy about my accident, because I actually learned a lot in the long run.

you make mistakes, and we all make mistakes in life, and it's a cliché, but I like the addition that mistakes are just happy accidents, right? Like now I can jest about this and I can be happy about my accident, because I actually learned a lot in the long run.

So thank you all for listening. That's the end. Thank you.