Resources

Data Science Hangout | JD Long, RenaissanceRe | Empathy When Integrating with Other Tools

video
Sep 15, 2022
1:10:37


Transcript

This transcript was generated automatically and may contain errors.

Hi, everybody. Welcome to the Data Science Hangout. If you're joining for the first time today, it's nice to meet you. I'm Rachel. I think we maybe have some first timers from JD's Twitter post earlier. So the Data Science Hangout is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing, and what's going on in the world of data science. So these sessions are recorded and shared to YouTube, as well as the RStudio Data Science Hangout site. So you can always go back and rewatch or find helpful resources.

We also have a LinkedIn group for the Hangout. So if you ever want to continue a discussion, or if you just want to meet somebody and talk in there (other than me being the one talking in there), feel free to use that. Tyler or Hannah will share it in the chat as well.

Together, we're all dedicated to creating a welcoming environment for everyone. So we love when everybody can participate in these, and we can hear from everyone. So there's three ways you can ask JD questions today. You can jump in by raising your hand on Zoom, and I can call on you. You can put questions in the Zoom chat, and just put a little star next to your question if you want me to read it out loud instead. Maybe your dog's barking or you're in a coffee shop or something. And then lastly, we also have a Slido link where you can ask questions anonymously too.

And I see Hannah just shared that in the chat. Just to reiterate, we love to hear from everybody, no matter your level of experience or your area of work. So with that, I am so excited to be joined by my co-host for today, about whom there was a lot of excitement on Twitter: JD Long, VP of Risk Management at RenaissanceRe.

JD Long's background and role

Well, hey y'all.

JD, I'd love to have you maybe start by introducing yourself and telling us a little about your role, your company, and maybe also something you like to do in your free time.

Absolutely. So I'm JD Long, and I've been in and around the R community in particular for a number of years. I love telling the story of the first time I met JJ: he came to Chicago, and they had this company. I can't remember if they were still in stealth or if they were open, but they were going to make an R IDE and somehow or another make a business out of that. It sounded kind of ridiculous, but I loved the idea that it was hosted on a server. And a bunch of us went out to dinner, maybe six or seven of us, and JJ and the folks involved shared this idea of the RStudio editor. And I was like, well, I've got no idea how they'll ever make a business out of that. How do you compete with Emacs?

So this was 13 years ago, and I was living in Chicago, but I was like, hey, I really like this idea. I immediately took RStudio and stood it up on an AWS EC2 machine, running RStudio on the web, and I could connect to it from my browser. And I was like, whoa, I can get a really big machine from AWS to do my stats on and run RStudio on it, and it feels like a native desktop application. I thought it was the most amazing thing I'd ever seen. They figured out a business model, as far as I can tell. And so, you know, here we are.

Now, what I do for a living is financial risk modeling for a reinsurance company. Most folks have never heard of reinsurance because we're not a consumer-facing product, right? When you buy your homeowner's insurance, or if you're Travis and you're buying insurance for your RV: when all of those RVs converge in Florida and then there is a hurricane, the RV insurers may not be able to cover that loss. So what reinsurance does is spread the risk of insurance companies around the globe, by taking bits of risk from many insurance companies in many different regions all over the world and spreading it around. Our only product is capital, being able to spread that risk and pay claims to rebalance risk globally. So we're part of the global risk-spreading mechanism that allows global finance to work, especially in the insurance industry.

As a result, it's all about stochastic, probabilistic modeling. So I spend a lot of my time discussing things like, you know, how do we calculate the return probabilities and the tails of events? The one-in-a-hundred, the one-in-a-thousand. I was literally just teaching a class on how to do this with SQL using one of our internal systems. I do a lot of that. I teach a lot. I manage a small team. The team I manage does a whole lot of taking Excel that someone in the business has created. And we're part of the business; I'm not part of IT. But my team has a little more software experience, so they help port that Excel into typically Python, sometimes R, and run it using Airflow, with jobs created automatically, so that the business can stop drowning in Excel hell, right? Not that Excel isn't a great tool. It is. But sometimes we want things that are more automated, that take no human touch to run.
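The one-in-a-hundred and one-in-a-thousand numbers JD mentions are return-period losses. As a minimal sketch (the loss data here is made up, not anything from an internal system), this is how you might read them off a set of simulated annual losses in Python:

```python
import random

random.seed(42)

# Hypothetical simulated annual losses (in $M): a heavy-tailed draw
# standing in for the output of a catastrophe model.
annual_losses = [random.paretovariate(1.5) * 10 for _ in range(100_000)]

def return_period_loss(losses, years):
    """Empirical loss exceeded with probability 1/years per year,
    i.e. the (1 - 1/years) quantile of annual losses."""
    ranked = sorted(losses)
    idx = int(len(ranked) * (1 - 1 / years))
    return ranked[idx]

for rp in (100, 1000):
    print(f"1-in-{rp} loss: {return_period_loss(annual_losses, rp):,.0f}")
```

The same quantile logic is what a SQL version would express with a window function or `PERCENTILE_CONT`.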

So I manage a team that does that. And I'm on a team called Risk Solutions. We joke that our team exists to answer the questions for which there is no easy button, right? We have internal applications, and a whole lot of stuff can be done automatically: you hit the calculate button and it calculates something. Then somebody wants to ask a question that that button wasn't designed for, and somebody's got to reach into the data, pull it out, understand it, and do a calculation. That's the kind of stuff the Risk Solutions team does. We also do a lot of building prototypes for risk analysis. And we do a thing we call building productotypes. Everybody knows what that is, right? That's a prototype that you're actually using, say to show the board of directors. So you've got a prototype, but you're using it in production. And then when we pass those off to IT, we no longer call them productotypes. That's a production system.

So I'm involved in that value chain. I live right now in Richmond, Virginia, and I'm based out of the Raleigh, North Carolina office, which is about three hours away. When I go to the office in Raleigh, I stay for a few days. Like Travis, I've got an RV, so I stay at a state park in Raleigh, hang out, go into the office, and then invite folks over to the campsite to shitpost in real life. As for my interests, I like building things. I've got an old Jeep in the garage right below me that's become a really good Jeep, because I took every single mechanical piece that had Jeep stamped on it and replaced it with something else. That made it an incredibly better vehicle. So I enjoy fabrication; I do both metal and wood fabrication. And I've got a dog named Sparky, who's over here asleep. If he goes apeshit in a minute, y'all get to meet him.

Modeling heavy-tail distributions

Ricardo, do you want to jump in?

Sure, J.D. Can you hear me?

Yep. I can hear you great. Go for it.

Okay, thank you. So if you're modeling risk, I guess you are using heavy-tail distributions. Do you have some that are your favorites? What kind of distributions do you use to model those unlikely events?

So, for distributions for heavy events... pardon me, now I've got to pick up the pieces I dropped. The way we think about modeling is we model individual risks and then we aggregate those up. We don't fit one distribution to represent the whole company; we have individual risks that we aggregate up. The property losses are a major component, right? So we think about hurricane, severe convective storm, flooding. We actually don't fit distribution shapes to those. We use an internal model that's an event-driven model, not a probabilistic distribution-based model. So we aren't fitting a distribution. We're actually simulating storms, simulating actual events hitting areas we're interested in.

So our North American model will simulate a number of hurricanes, right? And we'll do some of what we would call deterministic modeling, which is not a probabilistic model. A deterministic model would ask a question like this. You may know the big hurricane in 2005?

Yes, I was there.

Okay, so, Hurricane Katrina. But what's interesting about 2005 is it wasn't just Katrina. It was KRW: Katrina, Rita, Wilma. And there was actually a fourth named storm that went into Florida, but those are the big three. So we will run a deterministic model against our portfolio that is a simulated footprint of those three events, and we'll say: model our current portfolio losses under the 2005 loss year. We'd call that "2005 deterministic." We may have a different deterministic that is just Katrina, to try to understand, if those events were to play out again today, what would be the financial impact? Because obviously property values are tremendously higher in 2022 than they were in 2005.

Different areas of Florida would have grown faster, or become more valuable, or had more development. So you can't just take the number from 2005 and gross it up 20%, right? And our footprint has changed. We have a different reinsurance footprint than we had then, in terms of which properties we reinsure. So we would run it against the current portfolio. That's a deterministic. Our stochastic catalog would potentially include some deterministic events, but also events we've never seen before. So Hurricane Katrina actually nailing Miami, right? That's a big tail event, because that's worse than Katrina actually was. Or Hurricane Andrew, which came near hitting Miami. So our stochastic catalog is going to be synthetic events, hurricanes that didn't happen but could have happened. And there are going to be events in there that have very low probability and are really bad, and a whole bunch of average kinds of years, right? A whole distribution of outcomes.
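As a toy illustration of the split JD describes (event names aside, every number here is invented), a deterministic run replays a fixed set of event footprints against the current portfolio, while a stochastic run samples synthetic event years from a catalog:

```python
import random

# Toy event catalog: each event maps to a loss ($M) against the
# current portfolio. All figures are illustrative inventions.
catalog = {
    "Katrina": 800.0,
    "Rita": 250.0,
    "Wilma": 300.0,
    "SynthMiamiCat5": 2500.0,  # synthetic: a Katrina-strength storm into Miami
    "SynthAverage": 50.0,
}

def deterministic_year(event_names):
    """Replay a fixed historical footprint (e.g. the 2005 KRW year)
    against the current portfolio."""
    return sum(catalog[name] for name in event_names)

def stochastic_years(n_years, seed=0):
    """Sample synthetic years from the catalog: a random number of
    events per year, giving a whole distribution of annual outcomes."""
    rng = random.Random(seed)
    names = list(catalog)
    years = []
    for _ in range(n_years):
        n_events = rng.choice([0, 0, 1, 1, 2, 3])  # most years are quiet
        years.append(sum(catalog[rng.choice(names)] for _ in range(n_events)))
    return years

print("2005 deterministic (KRW):", deterministic_year(["Katrina", "Rita", "Wilma"]))
print("Worst simulated year:", max(stochastic_years(10_000)))
```

A real event model simulates storm physics and exposure, of course; the point of the sketch is only the deterministic-replay versus stochastic-catalog distinction.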

So when we do an event-based model, there's no distribution fitting per se; there's event modeling. But when we come around to modeling a casualty portfolio, or an individual casualty deal, it's not uncommon there to use more of what I would call a curve-fitting exercise. We just have some loss experience, and we try to say, okay, what do we think that shape might look like? So we're literally fitting a distribution, and that's some black art, right? And the reason it is, is because we have so little data and we're in the world of long-tail distributions. So we're only getting a few observations, and by definition, almost always those ain't in the tail.

And so we have to create some a priori assumptions about what we think that may look like. We may use something that feels a little Bayesian, and we try to model things that way. And we'll use a host of different distributions to get those, right? There are a number of things you might do to get a long tail. Sometimes I'll literally run it through a best-fit routine and say, run 50 different distributions through there, and I'll kind of look at them. We also have some a priori beliefs about what the assumed underlying pricing might be, so we'll bring those into the modeling. So the casualty lines are much more likely to look like curve fitting, while over in property it's more likely to be an event-based simulation instead of curve fitting. Now, I didn't give you a list of distributions I like, because it varies wildly, but did that help a little bit?
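A minimal stdlib sketch of that best-fit idea, comparing just two candidate shapes by maximum-likelihood fit on made-up loss experience (a real exercise would try many more distributions and fold in the a priori pricing beliefs JD mentions):

```python
import math
import random

random.seed(1)
# Hypothetical loss experience: few observations, heavy-tailed.
losses = [random.lognormvariate(2.0, 1.2) for _ in range(30)]

def loglik_exponential(xs):
    """Log-likelihood of xs under an exponential fit (MLE rate = 1/mean)."""
    rate = 1 / (sum(xs) / len(xs))
    return sum(math.log(rate) - rate * x for x in xs)

def loglik_lognormal(xs):
    """Log-likelihood of xs under a lognormal fit (MLE mu, sigma of log-data)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))
    return sum(
        -math.log(x * sigma * math.sqrt(2 * math.pi))
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )

fits = {
    "exponential": loglik_exponential(losses),
    "lognormal": loglik_lognormal(losses),
}
best = max(fits, key=fits.get)
print("log-likelihoods:", fits)
print("best fit:", best)
```

With only 30 observations, the ranking can flip run to run, which is exactly the "black art" point: sparse long-tail data rarely pins the shape down on its own.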

Helping the business escape Excel hell

JD, I know in the beginning you mentioned that you help keep the business from drowning in Excel. And I'm curious, what is one of your favorite examples of helping the business with something that was maybe taking people way too long, or that they'd mess up?

Yeah, let me give you a design pattern rather than one specific example, because if I tell you, oh, some legal entity's risk reporting book for the board of directors, nobody knows what that looks like. Let me give you the big picture. Every organization I've ever been in, in 25 years or more of work, has always had these spreadsheets where there's a tab with a query in it. Somebody takes that query, pastes it into an IDE, maybe changes a date or something, gets a result, takes the results, pastes them back into Excel, builds a pivot. And then they build a bunch of other stuff right off of that. Maybe they have three queries they run, right? And these get built because people have hammers and they're going to find nails, right? They're going to solve problems. People are crafty as hell.

And you often get folks in the business who have cobbled together enough SQL, or they got someone over in IT to help them write a lot of SQL. Excel is the only tool they really know, but they can copy and paste a query and maybe change a value in it. And so they run this thing. And it is potentially error-prone, it's often slow, and they maybe only update it quarterly, because it takes them a few hours to get all the fiddly pieces in place.

And sometimes it's a good prototype. But it's probably not what they should be running every day, right? If somebody ran this every day and it takes them an hour to update, they're not going to do that. What we have gotten great mileage out of is analysts coming in, and I joked when I pitched this on Twitter that the first thing we help people do is fix their shitty Excel. And I'm not kidding. The first thing we do is maybe say, hey, let's make this Excel structured more logically. We'll do things like: let's make it flow through the tabs left to right. Call me crazy, it's a little bit easier to read, right? I want the very first page to have all the inputs. So if you have to have a date as an input, or a key from a system or something, we're going to have a little block on the first page that's all of our inputs, instead of having them spread across five or six different tabs, buried where cell D76 has a magic number in it.

We're going to get all of that over on this inputs page. And when we change those inputs, we may do something really simple, like string replacement, so the value gets put in the query, right? Really basic stuff, and make sure we understand how this thing works. Then we're going to test it and make sure it still works. And then we say, cool, how about if instead of copying and pasting this SQL, we just run this SQL every night? Then we'll create a database connection between Excel and the table where we put the results of your SQL, and not break the rest of it, right? So step one is that minimum viable product thing (Sparky says hey): you start with a skateboard, then you build a kick scooter, then a bicycle, then a motorcycle, right? So we're helping folks do that. We're not going to come in and say, that's ridiculous, let's throw it away and build you an application. It's: let's just take a piece of this out and automate it, see if we can make your life better. And we'll iterate on that.
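That "inputs block plus string replacement" step might look like this once it's ported out of Excel. All table and column names here are hypothetical, and a production version would use bound query parameters rather than string substitution; this mirrors the Excel-era first step JD describes:

```python
# All inputs live in one place, and get substituted into a query
# template. Table and field names are invented for illustration.
QUERY_TEMPLATE = """
SELECT region, SUM(loss_amount) AS total_loss
FROM portfolio_losses
WHERE as_of_date = '{as_of_date}'
  AND portfolio_key = '{portfolio_key}'
GROUP BY region
"""

def build_query(inputs):
    """Substitute the input block into the query template.
    (A hardened version would use the database driver's bound
    parameters instead of string replacement.)"""
    return QUERY_TEMPLATE.format(**inputs)

inputs = {"as_of_date": "2022-09-15", "portfolio_key": "NA-PROP-01"}
print(build_query(inputs))
```

Once this runs nightly under a scheduler like Airflow, the spreadsheet only needs a connection to the results table, which is exactly the "don't break the rest of it" step.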

So I have two analysts who work for me who do this, supporting one of our teams internally. Their goal is to turn every one of those workbooks into a fully automated process. They will take every one of these all the way through to a system where, if it has any Excel in it at all, it's only the last step that drops results into Excel. No calculations in Excel. Excel becomes a reporting format, because some folks really like to see that. And we have one external partner that needs it in Excel, because they use it as part of a process after that; Excel is how they want to pick up our results and put them into their system.

So the process is: make little changes, see if we can take some pain points out, keep iterating, and we'll keep Excel in there for a while. Sometimes we find that if we just make it less painful, they can use it more, and we'll stop there. Sometimes we find we want to turn it all the way into a fully automated process that runs every day in Airflow, updating a database, with a reporting tool on top of the results. I am a big fan and very strong believer in doing your data transformations before your reporting tool, as opposed to doing a bunch of data transformations in the reporting tool, because reporting tools are where business logic goes to die, right, or goes to get calcified. The other problem is we end up reproducing business logic in a bunch of different reports, often in inconsistent ways. So if we have a calculation, I'd really like to have it done in one central place, using code that's in Git, not buried in Power BI, which is hard to put in Git, or buried in Tableau, which has the same problem, right?

reporting tools are where business logic goes to die, right, or goes to get calcified.

So I try to have those reporting tools mostly pull from the calculated values. Maybe you have to do some division, right, because if you want scaled ratios, you need to actually pull the numerator, pull the denominator, and then do the math in the reporting tool. But the big principle is the analysts try to move things in that direction. And the other big principle is: in small steps, move toward the platonic ideal of a fully automated process that runs without human interaction and has business logic calculations that are completely decoupled from the reporting. That's the platonic ideal. We don't always take everything all the way there.
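The numerator-and-denominator point matters because a pre-computed ratio can't be re-aggregated correctly. A tiny sketch with invented figures:

```python
# Why a reporting tool should pull numerator and denominator
# separately: averaging pre-computed per-row ratios gives the wrong
# combined number. All figures are illustrative.
rows = [
    {"region": "NA", "losses": 50.0, "premium": 100.0},  # ratio 0.50
    {"region": "EU", "losses": 10.0, "premium": 400.0},  # ratio 0.025
]

# Wrong: average the pre-computed per-row ratios.
avg_of_ratios = sum(r["losses"] / r["premium"] for r in rows) / len(rows)

# Right: sum numerator and denominator, divide in the report layer.
total_ratio = sum(r["losses"] for r in rows) / sum(r["premium"] for r in rows)

print(f"average of ratios: {avg_of_ratios:.3f}")  # 0.263, misleading
print(f"combined ratio:    {total_ratio:.3f}")    # 0.120
```

So the central pipeline ships `losses` and `premium` as calculated values, and the one bit of math left in the reporting tool is the final division.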

Low-code and no-code tools

Do you ever recommend replacing Excel workflows with dedicated no-code, low-code, like visual data prep tools instead of using R or Python? If so, how do you think about this choice?

All right, so I like low-code, no-code on the reporting side, with the annoyance that it's hard to get it into version control, and that bugs me, right? Power BI is reasonable for making a report or dashboard. Tableau is reasonable for making reports and dashboards. Some of that drives me crazy, but those are great. The challenge I have with low-code, no-code as a general principle is that you basically get 75 or 80 percent of a workflow that's visual, and then you've got this one cell, and you shove all the code into that cell, because you can't quite do everything in the tool. Most of the tools have the ability to let you write some piece of code, you know, R, Python, SQL, something. So what we've basically done is hidden our code in these little magic cells inside a low-code or no-code tool that doesn't fit well in version control, and it feels a lot of times like the worst of both worlds.

I've now taken the bits that could be in code and hidden them, tucked them away. It feels to me an awful lot like putting code inside a cell in Excel. It's kind of frustrating. The way I wish these tools worked, and I use AWS Glue some, and Glue kind of works this way: you use the low-code, no-code tool, and it generates code, and then you have code, and you can do whatever you want with it. I like that model a lot. So in the R community we have Esquisse (and forgive me, any francophones here will know I'm pronouncing that wrong), a tool that lets you use a drag-and-drop, interactive UI to make ggplot code. When you're done, it gives you the ggplot code, and then you go put that in your script, and you can tweak it a little by hand or do other stuff with it. It's a GUI for writing code, as opposed to a no-code solution. I like that a lot better, because I end up with code.

Now, some kind of magic, great world would be low-code, no-code tools that generated code, where I could have the GUI on one side and the code on the other, and I could edit the code and the GUI would change, or edit the GUI and the code would change. That's really hard, right? That's a pain in the ass to implement, so nobody does it, but that would be the platonic ideal for how I wish these tools worked. So, to answer the question about whether I use some of those tools: I like them. I worry about getting code locked inside of them that's hard to see in GitHub. I want all business logic to live machine-readable in GitHub, so we can do stuff like, hey, if we change this table name, how many queries is this going to impact, right? Just search GitHub; we can see them. I think having a plain-code interface with our tools, plus version control and tracking and all that, is so powerful. I hate to give that up just so I can get lots of people writing mediocre versions, so I'm a little cynical.

Sparse data and stakeholder hypotheses

I had a project a while back where we had kind of a small data set with very expensive data, and there wasn't a lot of correlation between the variables, but we had a fairly strong hypothesis about what was happening, and it was difficult to validate that with the data. So I was wondering what you do in situations like that, where stakeholders have a pretty good idea of what's going on and want to prove it with the data, but the data is sparse and lacks information.

Yeah, that's a good question. I can think of a few situations in my life where I've been there. The first thing I want to do, in the interest of intellectual honesty, is be real transparent with the folks I'm working with about what we're doing, right? We're kind of doing a validation exercise, which is a little different from discovery. And actually, we may not even be doing hypothesis testing, because folks often don't really want to know if we can prove or disprove that they're right. What they really want to ask (and I apologize, I should be able to say this in the language of statistical inference, but I'm hesitant to use the terms of art for fear I may misuse them) is, in principle: is there evidence that supports this thing we want to believe? We're not really looking to disprove it, or we may be, but usually we're just saying: this is our intuition, it's almost like our Bayesian prior, and is the data inconsistent with our prior?

We're not really asking to prove it. And so sometimes, when I get a situation like that, we go looking for evidence that supports the conclusion we're starting with, and often the best we can say is: we can't really find a lot that supports or disproves it. That does not mean in any way, shape, or form that it's an incorrect hypothesis. It's just hard to see evidence of it in the data, but we have an a priori belief that we can't disprove. Now, often I'll go ahead and peek and see if we have anything in the data that would be a strong indicator that it's incorrect. That may be a career-limiting move if you're at an organization that doesn't have a healthy relationship with the truth.

I have worked in some of those organizations. The organization I'm at currently, which I've been with for 13 or 14 years, has a really healthy relationship with the truth, and lots of people who are very comfortable saying, I think this is the answer. And I can say, I see no evidence in the data, and in fact I find strong evidence to the contrary, and their response is not "get out" but "oh, really?" And so it depends on your organization. If your organization is a get-out organization, that isn't going to work as well as if it's an oh-really organization. And when you do point out the "oh, really," the thing I always try to do is be high-empathy, right? It's not "you're stupid, the answer is this," which I have seen done; that's the low-empathy response. It's "I can't disprove it. I see some evidence to the contrary. Maybe we should look into this, that, or the other."

The other thing I often look for (now, this is very different) is to ask, when I'm digging around in the data: might this be a situation where rare events are causing a disproportionate effect? That's another way of saying non-linearity. There are two things that make modeling any system really hard: non-linearity and feedback effects. The presence of either of those, or God forbid both, right? So let's pause right here and think: what types of systems have non-linearity and feedback effects? I'm an economist by training, and this is why macroeconomics always feels like a black art and not like a real set of analyses: because there is non-linearity and there are feedback effects in the economy. It's tremendously hard to model. It's tremendously hard to calibrate parameters, because of all the non-linearity, feedback effects, lag effects, all that sort of thing. Similarly, over the last two or three years, we have all gotten a tremendous education in how hard epidemiology is, right? Those of us who couldn't even pronounce epidemiology four years ago have learned that epidemiology is full of non-linearity and feedback effects.

That makes it really hard. And so a lot of times, when I come to a data set and I'm seeing no effects that I would theorize are there, I try to drill into the data and see if there's possibly some non-linearity or some feedback effect, because both of those will mess up correlations, right? You may see data that has relatively low correlation, because correlation boils the relationship down to an average across a range. If one variable is zero most of the time and occasionally one, and the other variable is continuous and noisy, but doubles when the first one goes to one, that correlation is going to look not very significant, even though it may be a real, meaningful effect, right? The first variable isn't linear: it's usually zero, occasionally one. The other one's noisy, but when the first goes to one, it doubles its range or something. That's an example of a non-linear relationship where linear correlations are really hard-pressed to tip you off to what's going on. So sometimes I go digging, looking for those, or for some kind of feedback effect, some combination of two variables interacting with each other that produces the effect you're interested in. So I guess that's the big picture.

Best practices and the ham story

So, I know when I reached out to you first about the Hangout, I had just watched your conference talk from a few years ago on empathy and action and building communities of practice, and I see that somebody just asked, Brian just asked a question about that as well. Brian, I'm curious, are you starting to build a community as well?

Yeah, we've had a community of practice for a few years, but it's, you know, it's always a work in progress, and it's difficult. We have a federated model of how we do data science, and not by design, but by, you know, it's just organically sprung up that way. I work for Delta Airlines. I work in a group that was part of Northwest Airlines before the merger, and so we had a centralized operations research group, which has become, you know, the data science group, but, you know, other groups have sprung up within operations, within maintenance, within marketing, and so we started a mailing list a few years ago. We just send stuff out to kind of get people interested in data science. We probably have 500 people on our mailing list. Now we have a monthly meetup where we typically have 70 to 80 people who will tune in for a deep dive on a project.

That's great.

But, you know, I noticed the last few days there were some disparaging comments about the phrase "best practices," just as we're about to have a panel session on best practices. So anyway, I just wanted to...

Do you want feedback on best practices, or should I stay away from that?

No, no, no. Dive into the controversy, JD. That's what we're here for.

So here's my thoughts on best practices. By the way, that sounds like a thriving community, right? I'm in an organization that only has 600 people total, so a community of analytically minded people where you can get 80 people together seems just tremendous to me. All right, let's discuss best practices. There was a thread discussion on Twitter about this. One of the things I learned from that discussion is that someone opined that in medicine, "best practice" has a very different meaning than how I see it used in business. In medicine it's like: we have practices, and this is the best known practice for something, right? It has a very specific meaning. That's almost never how it's used in business. What I often hear is more like the comment someone else made, which I think I amplified: it usually means "I have organizational status over you; I want you to shut up." That's what best practice often means. I'm going to end this conversation by saying this is best practice, right?

it usually means I have organizational status over you. I want you to shut up is what best practice often means.

And, you know, I kind of joke internally that I don't have people in my organization who operate like that. We're low on assholes, long on intellectual curiosity. So if somebody says something or other is best practice, I'm like, okay, cool. Well, we want to be better than that, so let's talk about how to do it well. That's my joke: best practice means average, and we don't want to be average. And I'll give you a very specific example of this that we just went through in our organization, one that in my mind shows a lot of good thought. So we implemented a best practice for system passwords. We have system accounts that we use for these automated processes I talked about, system accounts for database access or whatever. And as a best practice, our ops team required a very large character set that included some particularly pernicious special characters (ampersands, semicolons, and backslashes, I think), all of which produce real challenges on different systems.

But the bottom line is, what you want in a password is lots of entropy. A password of a given length over a given character set gives you a certain amount of entropy. You can shrink the character set and lengthen the password and get the same or more entropy. What you care about is entropy. Nobody gives a crap whether there are special characters in there if you know what you're after is entropy. But "best practice" says: use a big character set. So a bunch of us had systems that were barfing when the passwords got rotated. We would end up with, say, an ampersand in the password, and certain systems were falling over because it caused problems. And we went to our security team and we said, you know what... what's that?

Quick question in between: what is entropy?

So entropy is the amount of randomness, right? What we really care about is how hard the password is to guess, and we use entropy as a proxy for that: if a search loop went through every combination of characters, how long would it take to guess the password? That's what we really care about. But yet our best practice was, you know, eight characters and a big character space, so each position can be 50-odd different things because we include special characters. We went to them and said, let's take the special characters out, because they're causing us pain, and let's make the password really long, because we don't care about length. Length is easy; it's an automated system, right? But these special characters are giving us a problem. And our ops team said, oh, that makes sense. Cool. Are we not doing best practice? No. Are we accomplishing our goal? Hell yeah, we are. And we've got more entropy than we had with the special-character set, because we made the passwords a lot longer over a smaller character set. So if someone's trying to brute-force it, it's actually harder to brute-force our passwords now, with no special characters, than it would have been with the special characters in there, because we made them really long.
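The trade JD's team made is easy to check: entropy in bits is length times log2 of the character-set size, so a longer password over a smaller set can carry far more entropy. A quick sketch (the character-set sizes are illustrative):

```python
import math

def entropy_bits(length, charset_size):
    """Bits of entropy in a uniformly random password:
    length * log2(size of the character set)."""
    return length * math.log2(charset_size)

# Illustrative sizes: 94 printable ASCII characters with specials,
# 62 alphanumerics without them.
short_special = entropy_bits(8, 94)   # "best practice" short password
long_simple = entropy_bits(24, 62)    # longer, no special characters

print(f"8 chars over 94 symbols:  {short_special:.1f} bits")
print(f"24 chars over 62 symbols: {long_simple:.1f} bits")
```

The longer alphanumeric password wins by a wide margin, which is exactly the "we care about entropy, not special characters" argument.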

And that's an example of how, if we use our heads about what problem we're actually trying to solve, and don't hide behind "this is best practice," we get to a better outcome. And it's because, I hate to sound like a Stephen Covey book or something, but we started with the end in mind. What's the thing we're optimizing? We want passwords that are hard to guess. All right, how do we accomplish that, subject to the constraint that special characters are giving our systems problems? That's an example of, I think, bucking a best practice, but doing it thoughtfully because you understand the principle.

And I think we need to have openness in our organizations to do that. With a lot of best practices, you know, I watch folks do stuff that they inherited because Google does it this way or Facebook does it this way, and they think that should be a best practice. And I'm always reminded of a very particular anecdote. So I'm going to tell you all an anecdote. There was a woman whose mother taught her, and I should retell the story with a son, right? This feels kind of sexist, but I'm sorry, I'm already in now; I'm not going to change it. The woman had learned to cook a ham at Thanksgiving from her mom. Later she's married, she has her own kid, and they're cooking a ham at Thanksgiving. And the little kid asks the mother, why do you cut the hock, the end, the bony bit, off the ham before you cook it? And she's like, I honestly don't know. That's the way your grandmother always did it, so I do it the same way. Let's call grandma and ask her. So they call grandma and they say, grandma, why do you always cut the hock off the ham before you cook it for Thanksgiving? And grandma said, well, the ham wouldn't fit in my pan. The principle being: understand why we're doing these things, right? If you don't have a pan that's too small, there's no advantage to cutting the hock off. And I feel like a lot of times people replicate what they see Amazon, Facebook, or Google doing. They don't understand what problem those companies were solving, but yet they say, we should do that. We should cut the hock off our ham because Facebook's cutting the hock off their ham. I'm not sure we've got the same problem Facebook has.

Building communities of practice

Before leaving the communities of practice discussion, though, I am curious: what does your team do to get people together?

All right. So we have a couple of communities, some more formal than others. One kind of formal community is, we have a team called Risk Solutions, and I'm a VP on the Risk Solutions team. We get that whole team together periodically, on a virtual call once a week. They have a meeting that the analysts lead, and we usually only have one or two of the managers actually go to that, because the purpose is for it to be led by the analysts, the doers. The managers don't lead it and don't interject themselves unless they're asked a question, or at the very end, when there's time for Q&A. So that way we're trying to develop a community of peer support among our analysts, not driven by managers, right? It's driven peer to peer. So they lead the meeting. They have a spreadsheet where they track who's leading it each week; it rotates. They bring in speakers from inside or outside the organization to come speak to that group, and then they do Q&A at the end. I love that, right? Because it's pushing ownership of development and team and all this stuff down to the analyst level, as opposed to being something dictated by people like me. I love that as a way to build community. And we have told them, we want you all to do this, but what you do during that time and how you do it is up to you.

The other example of communities is things like, we use Dremio internally as our data lake, and we really like Dremio. So we have a Dremio community internally. We use Microsoft Teams, so it has its own Teams channel where people can ask Dremio questions. We periodically do a Dremio meetup where all the people in the organization who are using Dremio, in finance, risk, actuarial, whoever, get together, and there'll be a couple of presentations. So that one cuts across different business units. That's a great little community. I like that a lot.

I have led, and we've kind of let it fizzle the last year or two, more of a data science community, and we would do topics like, how do you make a good Jupyter notebook? What are the characteristics of a good Jupyter notebook? And that was driven by me going into Git and seeing Jupyter notebooks with 500 lines of code in three cells, two comments, and no rich text. It's like, why are we even in a notebook, right? This isn't the spirit that Knuth had in mind with literate programming. This is just a Python script, or some kind of script, dumped into a notebook instead of into a text file. I would even hope that if this weren't a notebook, it would have more comments than this. And so I was like, okay, let's talk about what a notebook could look like, or should look like. And so we'd do a presentation on it, things like that.

People sometimes ask me about starting communities because I started the Chicago R User Group, you know, 13 or 14 years ago. And the design pattern for that group was, I think, a really good one. We had about four or five people who were going to get together, drink beer, and talk about R anyway. And what we effectively did was say, let's put on the types of presentations the four or five of us would like to hear, and then invite other people. And if they don't show up, we don't really care, because we're going to drink beer and listen to this anyway; we're interested in it. And that community grew wildly with me running it like a benevolent dictator, just getting the presentations I wanted to hear. And then one of the things I observed is that we were getting lots of new users of R. So I decided that periodically, once a quarter or something, we would have a beginners' night. And at the beginners' night, we would make sure we had two things. One is topics appropriate for beginners, right? So no sophisticated stochastic gradient descent garbage. It's grouping and summing and getting environments set up or updating packages, all the normal stuff people have friction with. Focus on that. And then we would make sure we had, I forget what we called it, I would call it now something like a learner salon. We had a handful of people who were more experienced who said, I will sit at a table, and anybody who brings their computer with questions, I will answer. And we would have really senior people with lots of experience there doing Q&A, and people could bring their own problems, literally eat pizza, drink beer, and look at code. Those beginner nights were super helpful.

And so anyway, when people ask about building communities, I recommend those two things. Get yourself a core group of a small number of people and put on the presentations you would want. And if the beginner concept is germane to what you're doing, if you have beginners around who may be intimidated, do explicit beginners' nights. I'm also a big fan of having the extended Q&A or tutoring time. Those were all hugely helpful.

That's great. Thank you. I remember, JD, at the San Diego RStudio conference, I was talking to you out by the pool, and I was telling you I was going to start maybe doing the Boston useR group, and I wasn't sure how to get it started. I was thinking, should I make a form and have people submit ideas, and then we could vote on talks and all that? And JD had shared with me: when someone wants to give a talk, just let them. If people are interested in coming to this community and sharing different ideas and have topic ideas, don't overdo it by having a form and having people vote. Yes, it's great to be able to source lots of ideas, but you don't have to make it so formal.

I did very little voting. I did the benevolent dictator approach because, you know, I don't know. I did some voting on things a few times, but it was voting as input for me, not deterministic voting. The votes weren't deciding the outcome; they were signaling to me what's important to the group, and I integrated that into my own objective function and then made a bunch of decisions. So it was more representative than "I'm going to run this group and let them vote on everything." People lack creativity, and I found that being a little bit creative myself and taking the votes as input worked better. Like the first time we did the beginners' night, nobody voted on that. Somebody recommended it, and I was like, oh, that sounds great, let's do it. We tried it. And then afterwards we did do some voting about what frequency to do the beginners' night at, and I was surprised how frequently people wanted it. That was the surprise for me. So we did it more frequently than I would have otherwise.

Separation of concerns in R Markdown and reporting

Yeah, I just have a sort of general brainstorming topic. It's not necessarily attached to your personal experience, but I do clinical research, and our pipeline is, you know, data of some sort into R and then to R Markdown or Shiny or, you know, different reporting endpoints. And particularly with R Markdown, we use a lot of templated LaTeX PDF output, you know, for reasons, mainly because we can't convince people to use HTML or whatever. But I feel like one of the problems we run into is that the separation of concerns between the computation or calculation and the formatting of the output is kind of broken in R Markdown, especially with LaTeX. And, you know, I've been looking at Quarto too, and I don't necessarily see that it fixes the problem. I was just curious if you had any ideas about whether we need a whole new framework to completely separate the way something looks from the way the pieces in it, the plots and the tables, were created. You see a lot of packages rise up to address this challenge by, you know, saying I can format this line and I can bold this and all that, but it still doesn't change the fact that you need to have this LaTeX package installed but not that one.

Let me give you a couple of thoughts on the separation of concerns. There are kind of two parts here: I'm going to talk about the separation of concerns between analysis and reporting, and then the other thing you brought up, which sounds like package management on the output side, especially at the LaTeX level. I've got weaker thoughts there; I don't do that as much. For the first part, the design pattern that I have had good success with, and that I think is important, is a special case of this: it's easy to co-mingle what I would call business logic, which for you isn't business logic, it's analysis maybe, with your reporting framework. And a lot of times we do that, especially when we first start, because it's real convenient. Like, I'm in this doc, it's got R, I'm going to do all my data manipulation and then output a result, then do all my next manipulation and output a result. Next thing you know, you've got a 1,500-line R Markdown document, and somewhere buried in the middle of that is a little free-form narrative explaining what's going on, and then there are hundreds of lines of R code. That's tough to maintain, in my experience.

And what I have moved towards is having usually an R script, not even an R Markdown document, an R script or a Python script, that does all the data manipulation and calculates all the things. Now, here's what's different from the clinical world: in my world, almost anything I do, I'm going to want to do over and over. So I take the results, I tag them with, like, a date, a time, whatever configuration tags I need, and I write them to a database. And that just runs on some sort of regular interval and populates that database, right? No reporting at all. It's just the manipulation, the calculation, whatever. And then when I come over to do the reporting, what the reporting does is read those effectively cached values and grab what it needs for itself. And you've got narrative, but your R blocks are all about formatting the output. They're not doing any math, particularly, or if they are, it's very limited. Mostly they make the exhibit: the setup for the ggplot to do the graph, the table formatting to format the table. And so you end up with that second document being all about presentation, and the calculation has a separate set of concerns over in this other script.
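A minimal sketch of that calculate-then-report split, in Python with SQLite from the standard library; the table layout, the "baseline" tag, and the metric names are invented for illustration, not from the talk:

```python
import os
import sqlite3
import tempfile
from datetime import datetime, timezone

db_path = os.path.join(tempfile.gettempdir(), "results_cache.db")

# --- calculation script: does the math, tags the results, caches them ---
conn = sqlite3.connect(db_path)
conn.execute("DROP TABLE IF EXISTS results")
conn.execute(
    "CREATE TABLE results (run_at TEXT, config TEXT, metric TEXT, value REAL)"
)
run_at = datetime.now(timezone.utc).isoformat()
# Stand-ins for real analysis output
for metric, value in [("mean_loss", 1.23), ("max_loss", 9.87)]:
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                 (run_at, "baseline", metric, value))
conn.commit()

# --- reporting document: reads the cached values, does no math ---
rows = conn.execute(
    "SELECT metric, value FROM results WHERE config = 'baseline' ORDER BY metric"
).fetchall()
conn.close()

for metric, value in rows:
    # In the real report these would feed tables and ggplot exhibits
    print(f"{metric}: {value}")
```

In practice the two halves live in separate files, and the calculation script runs on its own schedule, so the reporting side only ever formats what is already cached.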

Now, that R script that does the manipulation need not be run nightly or whatever, right? It could be run on an ad hoc basis if you're only going to run this, you know, once a year or something. But that's the separation of concerns that I like. And I tend to do that when an R Markdown document gets out of control, or if I can see that it's going to get out of control. Now, in terms of reporting environments, and what packages you need installed in what order, especially when it comes to LaTeX, that's just got Docker written all over it, I think. Because if you've got to manage, like, six or seven of these, where each report is really bespoke and can't coexist with the others, you could probably do that with virtual environments, but my virtual environment foo is not that good. So what I would tend to do, probably overkill, would be having, you know, Docker images that have all the packages installed in the right order in the right place, and just have a script that stands up the container, builds the R Markdown inside it, and tears it all down. It kind of feels like killing a gnat with a sledgehammer, but it could be a way to solve that problem.
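As a sketch of what that per-report Docker setup might look like (the base image tag and LaTeX package names here are illustrative guesses, not from the talk), each bespoke report gets its own image with its dependencies pinned:

```dockerfile
# One image per bespoke report; versions and package names are illustrative.
# rocker/verse bundles R, rmarkdown, and a TinyTeX-based LaTeX toolchain.
FROM rocker/verse:4.2

# Pin exactly the LaTeX packages this report needs.
RUN tlmgr install booktabs fancyhdr

WORKDIR /work
COPY report.Rmd .

# Render the report; a small wrapper script builds the image, runs it,
# copies out the PDF, and tears the container down.
CMD ["Rscript", "-e", "rmarkdown::render('report.Rmd')"]
```

Because each report's LaTeX dependencies live in its own image, two reports with conflicting package requirements never have to coexist in one environment.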

Integrating multiple tools and frameworks

So Eric asks, do you have advice on how you navigate through complications when integrating multiple tools and frameworks together in a project? Eric said, I have major pains linking R to AWS as I speak.

Yeah, it depends on the nature of the friction. I like clean interface points. An example of a clean interface point is when the Python process writes out Parquet files and the R process reads those Parquet files. That's a super clean interface, right? As opposed to having R running inside of Python, or Python running inside of R using reticulate. Those can be really useful, but I really like that clean separation. And for me, the new interface format is Parquet. It used to be CSV, and now I'd much rather use Parquet as the interface between all my steps.
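A toy version of that file-based handoff, in Python. Parquet needs a library on each side (for example pyarrow in Python and the arrow package in R), so to keep this sketch dependency-free it uses CSV, the speaker's "old" interface format; swapping in Parquet changes only the writer and reader calls, not the shape of the handoff:

```python
import csv
import os
import tempfile

# Step 1 (producer, e.g. a Python job): write results to a file at the
# interface point. With Parquet this would be a pyarrow/pandas write instead.
path = os.path.join(tempfile.gettempdir(), "interface_demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "loss"])
    writer.writerows([["NA", 1.5], ["EU", 2.25]])

# Step 2 (consumer, e.g. an R job): read the file back.
# No in-process bridging like reticulate is needed; the file IS the interface.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

The point is that each process only has to agree on the file format and location, so either side can be rewritten, rerun, or debugged independently.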