Resources

Data Science Hangout | JD Long, RenaissanceRe | Empathy When Integrating with Other Tools

video
Sep 15, 2022
1:10:37


Transcript

This transcript was generated automatically and may contain errors.

Hi, everybody. Welcome to the Data Science Hangout. If you're joining for the first time today, it's nice to meet you. I'm Rachel. I think we maybe have some first timers from JD's Twitter post earlier. So the Data Science Hangout is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing, and what's going on in the world of data science. So these sessions are recorded and shared to YouTube, as well as the RStudio Data Science Hangout site. So you can always go back and rewatch or find helpful resources.

We also have a LinkedIn group for the Hangout. So if you ever want to continue a discussion, or if you just want to meet somebody and talk in there (other than me being the one talking in there), feel free to use that. Tyler or Hannah will share it in the chat as well.

Together, we're all dedicated to creating a welcoming environment for everyone. So we love when everybody can participate in these, and we can hear from everyone. So there's three ways you can ask JD questions today. You can jump in by raising your hand on Zoom, and I can call on you. You can put questions in the Zoom chat, and just put a little star next to your question if you want me to read it out loud instead. Maybe your dog's barking or you're in a coffee shop or something. And then lastly, we also have a Slido link where you can ask questions anonymously too.

And I see Hannah just shared that in the chat. Just to reiterate, we love to hear from everybody, no matter your level of experience or your area of work. So with that, I am so excited to be joined by my co-host for today, about whom there was a lot of excitement on Twitter: JD Long, VP of Risk Management at RenaissanceRe.

JD Long's background and role

Well, hey y'all.

JD, I'd love to have you maybe start by introducing yourself and telling us a little about your role, your company, and maybe also something you like to do in your free time.

Absolutely. So I'm JD Long, and I've been in and around the R community in particular for a number of years. I love telling the story of the first time I met JJ: he came to Chicago, and they had this company. I can't remember if they were still in stealth or if they were open, but they were going to make an R IDE and somehow or another make a business out of that. It sounded kind of ridiculous, but I loved the idea that it was hosted on a server. And a bunch of us went out to dinner, maybe six or seven of us, and JJ and the folks involved shared this idea of the RStudio editor. And I was like, well, I've got no idea how they'll ever make a business out of that. How do you compete with Emacs?

So this was 13 years ago, and I was living in Chicago, but I was like, hey, I really like this idea. I immediately took RStudio and stood it up on an AWS EC2 machine, running RStudio on the web, and I could connect to it from my browser. And I was like, whoa, I can get a really big machine from AWS to do my stats on and run RStudio on it, and it feels like a native desktop application. I thought it was the most amazing thing I'd ever seen. They figured out a business model, as far as I can tell. And so, you know, here we are.

Now, what I do for a living is financial risk modeling for a reinsurance company. Most folks have never heard of reinsurance because we're not a consumer-facing product, right? When you buy your homeowner's insurance, or if you're Travis and you're buying insurance for your RV: when all of those RVs converge in Florida and then there is a hurricane, the RV insurers may not be able to cover that loss. So what reinsurance does is spread the risk of insurance companies around the globe, by taking bits of risk from many insurance companies in many different regions all over the world and spreading it around. Our only product is capital, being able to spread that risk and pay claims to rebalance risk globally. So we're part of the global risk-spreading mechanism that allows global finance to work, especially in the insurance industry.

As a result, it's all about stochastic, probabilistic modeling. So I spend a lot of my time discussing things like, you know, how do we calculate the return probabilities and the tails of events? The one-in-a-hundred, the one-in-a-thousand. I was literally just teaching a class on how to do this with SQL using one of our internal systems. I do a lot of that. I teach a lot. I manage a small team. The team I manage does a whole lot of taking Excel that someone in the business has created. And we're part of the business; I'm not part of IT. But my team has a little more software experience, so they help port that Excel into typically Python, sometimes R, and run it using Airflow, with jobs created automatically, so that the business can stop drowning in Excel hell, right? Not that Excel isn't a great tool. It is. But sometimes we want things that are more automated, that take no human touch to run.
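The one-in-a-hundred and one-in-a-thousand numbers JD mentions are return-period losses. As a minimal sketch (the loss data here is made up, not anything from an internal system), this is how you might read them off a set of simulated annual losses in Python:

```python
import random

random.seed(42)

# Hypothetical simulated annual losses (in $M): a heavy-tailed draw
# standing in for the output of a catastrophe model.
annual_losses = [random.paretovariate(1.5) * 10 for _ in range(100_000)]

def return_period_loss(losses, years):
    """Empirical loss exceeded with probability 1/years per year,
    i.e. the (1 - 1/years) quantile of annual losses."""
    ranked = sorted(losses)
    idx = int(len(ranked) * (1 - 1 / years))
    return ranked[idx]

for rp in (100, 1000):
    print(f"1-in-{rp} loss: {return_period_loss(annual_losses, rp):,.0f}")
```

The same quantile logic is what a SQL version would express with a window function or `PERCENTILE_CONT`.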

So I manage a team that does that. And I'm on a team called Risk Solutions. We joke that our team exists to answer the questions for which there is no easy button, right? We have internal applications, and a whole lot of stuff can be done automatically: you hit the calculate button and it calculates something. Then somebody wants to ask a question that that button wasn't designed for, and somebody's got to reach into the data, pull it out, understand it, and do a calculation. That's the kind of stuff the Risk Solutions team does. We also do a lot of building prototypes for risk analysis. And we do a thing we call building productotypes. Everybody knows what that is, right? That's a prototype that you're actually using, say to show the board of directors. So you've got a prototype, but you're using it in production. And then when we pass those off to IT, we no longer call them productotypes. That's a production system.

So I'm involved in that value chain. I live right now in Richmond, Virginia, and I'm based out of the Raleigh, North Carolina office, which is about three hours away. When I go to the office in Raleigh, I stay for a few days. Like Travis, I've got an RV, so I stay at a state park in Raleigh, hang out, go into the office, and then invite folks over to the campsite to shitpost in real life. As for my interests, I like building things. I've got an old Jeep in the garage right below me that's become a really good Jeep, because I took every single mechanical piece that had Jeep stamped on it and replaced it with something else. That made it an incredibly better vehicle. So I enjoy fabrication; I do both metal and wood fabrication. And I've got a dog named Sparky, who's over here asleep. If he goes apeshit in a minute, y'all get to meet him.

Modeling heavy-tail distributions

Ricardo, do you want to jump in?

Sure, J.D. Can you hear me?

Yep. I can hear you great. Go for it.

Okay, thank you. So if you're modeling risk, I guess you are using heavy-tail distributions. Do you have some that are your favorites? What kind of distributions do you use to model those unlikely events?

So, for distributions for heavy events... pardon me, now I've got to pick up the pieces I dropped. The way we think about modeling is we model individual risks and then we aggregate those up. We don't fit one distribution to represent the whole company; we have individual risks that we aggregate up. The property losses are a major component, right? So we think about hurricane, severe convective storm, flooding. We actually don't fit distribution shapes to those. We use an internal model that's an event-driven model, not a probabilistic distribution-based model. So we aren't fitting a distribution. We're actually simulating storms, simulating actual events hitting areas we're interested in.

So our North American model will simulate a number of hurricanes, right? And we'll do some of what we would call deterministic modeling, which is not a probabilistic model. A deterministic model would ask a question like this. You may know the big hurricane in 2005?

Yes, I was there.

Okay, so, Hurricane Katrina. But what's interesting about 2005 is it wasn't just Katrina. It was KRW: Katrina, Rita, Wilma. And there was actually a fourth named storm that went into Florida, but those are the big three. So we will run a deterministic model against our portfolio that is a simulated footprint of those three events, and we'll say: model our current portfolio losses under the 2005 loss year. We'd call that "2005 deterministic." We may have a different deterministic that is just Katrina, to try to understand, if those events were to play out again today, what would be the financial impact? Because obviously property values are tremendously higher in 2022 than they were in 2005.

Different areas of Florida would have grown faster, or become more valuable, or had more development. So you can't just take the number from 2005 and gross it up 20%, right? And our footprint has changed. We have a different reinsurance footprint than we had then, in terms of which properties we reinsure. So we would run it against the current portfolio. That's a deterministic. Our stochastic catalog would potentially include some deterministic events, but also events we've never seen before. So Hurricane Katrina actually nailing Miami, right? That's a big tail event, because that's worse than Katrina actually was. Or Hurricane Andrew, which came near hitting Miami. So our stochastic catalog is going to be synthetic events, hurricanes that didn't happen but could have happened. And there are going to be events in there that have very low probability and are really bad, and a whole bunch of average kinds of years, right? A whole distribution of outcomes.
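As a toy illustration of the split JD describes (event names aside, every number here is invented), a deterministic run replays a fixed set of event footprints against the current portfolio, while a stochastic run samples synthetic event years from a catalog:

```python
import random

# Toy event catalog: each event maps to a loss ($M) against the
# current portfolio. All figures are illustrative inventions.
catalog = {
    "Katrina": 800.0,
    "Rita": 250.0,
    "Wilma": 300.0,
    "SynthMiamiCat5": 2500.0,  # synthetic: a Katrina-strength storm into Miami
    "SynthAverage": 50.0,
}

def deterministic_year(event_names):
    """Replay a fixed historical footprint (e.g. the 2005 KRW year)
    against the current portfolio."""
    return sum(catalog[name] for name in event_names)

def stochastic_years(n_years, seed=0):
    """Sample synthetic years from the catalog: a random number of
    events per year, giving a whole distribution of annual outcomes."""
    rng = random.Random(seed)
    names = list(catalog)
    years = []
    for _ in range(n_years):
        n_events = rng.choice([0, 0, 1, 1, 2, 3])  # most years are quiet
        years.append(sum(catalog[rng.choice(names)] for _ in range(n_events)))
    return years

print("2005 deterministic (KRW):", deterministic_year(["Katrina", "Rita", "Wilma"]))
print("Worst simulated year:", max(stochastic_years(10_000)))
```

A real event model simulates storm physics and exposure, of course; the point of the sketch is only the deterministic-replay versus stochastic-catalog distinction.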

So when we do an event-based model, there's no distribution fitting per se; there's event modeling. But when we come around to modeling a casualty portfolio, or an individual casualty deal, it's not uncommon there to use more of what I would call a curve-fitting exercise. We just have some loss experience, and we try to say, okay, what do we think that shape might look like? So we're literally fitting a distribution, and that's some black art, right? And the reason it is, is because we have so little data and we're in the world of long-tail distributions. So we're only getting a few observations, and by definition, almost always those ain't in the tail.

And so we have to create some a priori assumptions about what we think that may look like. We may use something that feels a little Bayesian, and we try to model things that way. And we'll use a host of different distributions to get those, right? There are a number of things you might do to get a long tail. Sometimes I'll literally run it through a best-fit routine and say, run 50 different distributions through there, and I'll kind of look at them. We also have some a priori beliefs about what the assumed underlying pricing might be, so we'll bring those into the modeling. So the casualty lines are much more likely to look like curve fitting, while over in property it's more likely to be an event-based simulation instead of curve fitting. Now, I didn't give you a list of distributions I like, because it varies wildly, but did that help a little bit?
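A minimal stdlib sketch of that best-fit idea, comparing just two candidate shapes by maximum-likelihood fit on made-up loss experience (a real exercise would try many more distributions and fold in the a priori pricing beliefs JD mentions):

```python
import math
import random

random.seed(1)
# Hypothetical loss experience: few observations, heavy-tailed.
losses = [random.lognormvariate(2.0, 1.2) for _ in range(30)]

def loglik_exponential(xs):
    """Log-likelihood of xs under an exponential fit (MLE rate = 1/mean)."""
    rate = 1 / (sum(xs) / len(xs))
    return sum(math.log(rate) - rate * x for x in xs)

def loglik_lognormal(xs):
    """Log-likelihood of xs under a lognormal fit (MLE mu, sigma of log-data)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))
    return sum(
        -math.log(x * sigma * math.sqrt(2 * math.pi))
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )

fits = {
    "exponential": loglik_exponential(losses),
    "lognormal": loglik_lognormal(losses),
}
best = max(fits, key=fits.get)
print("log-likelihoods:", fits)
print("best fit:", best)
```

With only 30 observations, the ranking can flip run to run, which is exactly the "black art" point: sparse long-tail data rarely pins the shape down on its own.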

Helping the business escape Excel hell

JD, I know in the beginning you mentioned that you help keep the business from drowning in Excel. And I'm curious, what is one of your favorite examples of helping the business with something that was maybe taking people way too long, or that they'd mess up?

Yeah, let me give you a design pattern rather than one specific example, because if I tell you, oh, some legal entity's risk reporting book for the board of directors, nobody knows what that looks like. Let me give you the big picture. Every organization I've ever been in, in 25 years or more of work, has always had these spreadsheets where there's a tab with a query in it. Somebody takes that query, pastes it into an IDE, maybe changes a date or something, gets a result, takes the results, pastes them back into Excel, builds a pivot. And then they build a bunch of other stuff right off of that. Maybe they have three queries they run, right? And these get built because people have hammers and they're going to find nails, right? They're going to solve problems. People are crafty as hell.

And you often get folks in the business who have cobbled together enough SQL, or they got someone over in IT to help them write a lot of SQL. Excel is the only tool they really know, but they can copy and paste a query and maybe change a value in it. And so they run this thing. And it is potentially error-prone, it's often slow, and they maybe only update it quarterly, because it takes them a few hours to get all the fiddly pieces in place.

And sometimes it's a good prototype. But it's probably not what they should be running every day, right? If somebody ran this every day and it takes them an hour to update, they're not going to do that. What we have gotten great mileage out of is analysts coming in, and I joked when I pitched this on Twitter that the first thing we help people do is fix their shitty Excel. And I'm not kidding. The first thing we do is maybe say, hey, let's make this Excel structured more logically. We'll do things like: let's make it flow through the tabs left to right. Call me crazy, it's a little bit easier to read, right? I want the very first page to have all the inputs. So if you have to have a date as an input, or a key from a system or something, we're going to have a little block on the first page that's all of our inputs, instead of having them spread across five or six different tabs, buried where cell D76 has a magic number in it.

We're going to get all of that over on this inputs page. And when we change those inputs, we may do something really simple, like string replacement, so the value gets put in the query, right? Really basic stuff, and make sure we understand how this thing works. Then we're going to test it and make sure it still works. And then we say, cool, how about if instead of copying and pasting this SQL, we just run this SQL every night? Then we'll create a database connection between Excel and the table where we put the results of your SQL, and not break the rest of it, right? So step one is that minimum viable product thing (Sparky says hey): you start with a skateboard, then you build a kick scooter, then a bicycle, then a motorcycle, right? So we're helping folks do that. We're not going to come in and say, that's ridiculous, let's throw it away and build you an application. It's: let's just take a piece of this out and automate it, see if we can make your life better. And we'll iterate on that.
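That "inputs block plus string replacement" step might look like this once it's ported out of Excel. All table and column names here are hypothetical, and a production version would use bound query parameters rather than string substitution; this mirrors the Excel-era first step JD describes:

```python
# All inputs live in one place, and get substituted into a query
# template. Table and field names are invented for illustration.
QUERY_TEMPLATE = """
SELECT region, SUM(loss_amount) AS total_loss
FROM portfolio_losses
WHERE as_of_date = '{as_of_date}'
  AND portfolio_key = '{portfolio_key}'
GROUP BY region
"""

def build_query(inputs):
    """Substitute the input block into the query template.
    (A hardened version would use the database driver's bound
    parameters instead of string replacement.)"""
    return QUERY_TEMPLATE.format(**inputs)

inputs = {"as_of_date": "2022-09-15", "portfolio_key": "NA-PROP-01"}
print(build_query(inputs))
```

Once this runs nightly under a scheduler like Airflow, the spreadsheet only needs a connection to the results table, which is exactly the "don't break the rest of it" step.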

So I have two analysts who work for me who do this, supporting one of our teams internally. Their goal is to turn every one of those workbooks into a fully automated process. They will take every one of these all the way through to a system where, if it has any Excel in it at all, it's only the last step that drops results into Excel. No calculations in Excel. Excel becomes a reporting format, because some folks really like to see that. And we have one external partner that needs it in Excel, because they use it as part of a process after that; Excel is how they want to pick up our results and put them into their system.

So the process is: make little changes, see if we can take some pain points out, keep iterating, and we'll keep Excel in there for a while. Sometimes we find that if we just make it less painful, they can use it more, and we'll stop there. Sometimes we find we want to turn it all the way into a fully automated process that runs every day in Airflow, updating a database, with a reporting tool on top of the results. I am a big fan and very strong believer in doing your data transformations before your reporting tool, as opposed to doing a bunch of data transformations in the reporting tool, because reporting tools are where business logic goes to die, right, or goes to get calcified. The other problem is we end up reproducing business logic in a bunch of different reports, often in inconsistent ways. So if we have a calculation, I'd really like to have it done in one central place, using code that's in Git, not buried in Power BI, which is hard to put in Git, or buried in Tableau, which has the same problem, right?

reporting tools are where business logic goes to die, right, or goes to get calcified.

So I try to have those reporting tools mostly pull from the calculated values. Maybe you have to do some division, right, because if you want scaled ratios, you need to actually pull the numerator, pull the denominator, and then do the math in the reporting tool. But the big principle is the analysts try to move things in that direction. And the other big principle is: in small steps, move toward the platonic ideal of a fully automated process that runs without human interaction and has business logic calculations that are completely decoupled from the reporting. That's the platonic ideal. We don't always take everything all the way there.
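The numerator-and-denominator point matters because a pre-computed ratio can't be re-aggregated correctly. A tiny sketch with invented figures:

```python
# Why a reporting tool should pull numerator and denominator
# separately: averaging pre-computed per-row ratios gives the wrong
# combined number. All figures are illustrative.
rows = [
    {"region": "NA", "losses": 50.0, "premium": 100.0},  # ratio 0.50
    {"region": "EU", "losses": 10.0, "premium": 400.0},  # ratio 0.025
]

# Wrong: average the pre-computed per-row ratios.
avg_of_ratios = sum(r["losses"] / r["premium"] for r in rows) / len(rows)

# Right: sum numerator and denominator, divide in the report layer.
total_ratio = sum(r["losses"] for r in rows) / sum(r["premium"] for r in rows)

print(f"average of ratios: {avg_of_ratios:.3f}")  # 0.263, misleading
print(f"combined ratio:    {total_ratio:.3f}")    # 0.120
```

So the central pipeline ships `losses` and `premium` as calculated values, and the one bit of math left in the reporting tool is the final division.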

Low-code and no-code tools

Do you ever recommend replacing Excel workflows with dedicated no-code, low-code, like visual data prep tools instead of using R or Python? If so, how do you think about this choice?

All right, so I like low-code, no-code on the reporting side, with the annoyance that it's hard to get it into version control, and that bugs me, right? Power BI is reasonable for making a report or dashboard. Tableau is reasonable for making reports and dashboards. Some of that drives me crazy, but those are great. The challenge I have with low-code, no-code as a general principle is that you basically get 75 or 80 percent of a workflow that's visual, and then you've got this one cell, and you shove all the code into that cell, because you can't quite do everything in the tool. Most of the tools have the ability to let you write some piece of code, you know, R, Python, SQL, something. So what we've basically done is hidden our code in these little magic cells inside a low-code or no-code tool that doesn't fit well in version control, and it feels a lot of times like the worst of both worlds.

I've now taken the bits that could be in code and hidden them, tucked them away. It feels to me an awful lot like putting code inside a cell in Excel. It's kind of frustrating. The way I wish these tools worked, and I use AWS Glue some, and Glue kind of works this way: you use the low-code, no-code tool, and it generates code, and then you have code, and you can do whatever you want with it. I like that model a lot. So in the R community we have Esquisse (and forgive me, any francophones here will know I'm pronouncing that wrong), a tool that lets you use a drag-and-drop, interactive UI to make ggplot code. When you're done, it gives you the ggplot code, and then you go put that in your script, and you can tweak it a little by hand or do other stuff with it. It's a GUI for writing code, as opposed to a no-code solution. I like that a lot better, because I end up with code.

Now, some kind of magic, great world would be low-code, no-code tools that generated code, where I could have the GUI on one side and the code on the other, and I could edit the code and the GUI would change, or edit the GUI and the code would change. That's really hard, right? That's a pain in the ass to implement, so nobody does it, but that would be the platonic ideal for how I wish these tools worked. So, to answer the question about whether I use some of those tools: I like them. I worry about getting code locked inside of them that's hard to see in GitHub. I want all business logic to live machine-readable in GitHub, so we can do stuff like, hey, if we change this table name, how many queries is this going to impact, right? Just search GitHub; we can see them. I think having a plain-code interface with our tools, plus version control and tracking and all that, is so powerful. I hate to give that up just so I can get lots of people writing mediocre versions, so I'm a little cynical.

Sparse data and stakeholder hypotheses

I had a project a while back where we had kind of a small data set with very expensive data, and there wasn't a lot of correlation between the variables, but we had a fairly strong hypothesis about what was happening, and it was difficult to validate that with the data. So I was wondering what you do in situations like that, where stakeholders have a pretty good idea of what's going on and want to prove it with the data, but the data is sparse and lacks information.

Yeah, that's a good question. I can think of a few situations in my life where I've been there. The first thing I want to do, in the interest of intellectual honesty, is be real transparent with the folks I'm working with about what we're doing, right? We're kind of doing a validation exercise, which is a little different from discovery. And actually, we may not even be doing hypothesis testing, because folks often don't really want to know if we can prove or disprove that they're right. What they really want to ask (and I apologize, I should be able to say this in the language of statistical inference, but I'm hesitant to use the terms of art for fear I may misuse them) is, in principle: is there evidence that supports this thing we want to believe? We're not really looking to disprove it, or we may be, but usually we're just saying: this is our intuition, it's almost like our Bayesian prior, and is the data inconsistent with our prior?

We're not really asking to prove it. And so sometimes, when I get a situation like that, we go looking for evidence that supports the conclusion we're starting with, and often the best we can say is: we can't really find a lot that supports or disproves it. That does not mean in any way, shape, or form that it's an incorrect hypothesis. It's just hard to see evidence of it in the data, but we have an a priori belief that we can't disprove. Now, often I'll go ahead and peek and see if we have anything in the data that would be a strong indicator that it's incorrect. That may be a career-limiting move if you're at an organization that doesn't have a healthy relationship with the truth.

I have worked in some of those organizations. The organization I'm at currently, which I've been with for 13 or 14 years, has a really healthy relationship with the truth, and lots of people who are very comfortable saying, I think this is the answer. And I can say, I see no evidence in the data, and in fact I find strong evidence to the contrary, and their response is not "get out" but "oh, really?" And so it depends on your organization. If your organization is a get-out organization, that isn't going to work as well as if it's an oh-really organization. And when you do point out the "oh, really," the thing I always try to do is be high-empathy, right? It's not "you're stupid, the answer is this," which I have seen done; that's the low-empathy response. It's "I can't disprove it. I see some evidence to the contrary. Maybe we should look into this, that, or the other."

The other thing I often look for (now, this is very different) is to ask, when I'm digging around in the data: might this be a situation where rare events are causing a disproportionate effect? That's another way of saying non-linearity. There are two things that make modeling any system really hard: non-linearity and feedback effects. The presence of either of those, or God forbid both, right? So let's pause right here and think: what types of systems have non-linearity and feedback effects? I'm an economist by training, and this is why macroeconomics always feels like a black art and not like a real set of analyses: because there is non-linearity and there are feedback effects in the economy. It's tremendously hard to model. It's tremendously hard to calibrate parameters, because of all the non-linearity, feedback effects, lag effects, all that sort of thing. Similarly, over the last two or three years, we have all gotten a tremendous education in how hard epidemiology is, right? Those of us who couldn't even pronounce epidemiology four years ago have learned that epidemiology is full of non-linearity and feedback effects.

That makes it really hard. And so a lot of times, when I come to a data set and I'm seeing no effects that I would theorize are there, I try to drill into the data and see if there's possibly some non-linearity or some feedback effect, because both of those will mess up correlations, right? You may see data that has relatively low correlation, because correlation boils the relationship down to an average across a range. If one variable is zero most of the time and occasionally one, and the other variable is continuous and noisy, but doubles when the first one goes to one, that correlation is going to look not very significant, even though it may be a real, meaningful effect, right? The first variable isn't linear: it's usually zero, occasionally one. The other one's noisy, but when the first goes to one, it doubles its range or something. That's an example of a non-linear relationship where linear correlations are really hard-pressed to tip you off to what's going on. So sometimes I go digging, looking for those, or for some kind of feedback effect, some combination of two variables interacting with each other that produces the effect you're interested in. So I guess that's the big picture.

Best practices and the ham story

So, I know when I reached out to you first about the Hangout, I had just watched your conference talk from a few years ago on empathy and action and building communities of practice, and I see that somebody just asked, Brian just asked a question about that as well. Brian, I'm curious, are you starting to build a community as well?

Yeah, we've had a community of practice for a few years, but it's, you know, it's always a work in progress, and it's difficult. We have a federated model of how we do data science, and not by design, but by, you know, it's just organically sprung up that way. I work for Delta Airlines. I work in a group that was part of Northwest Airlines before the merger, and so we had a centralized operations research group, which has become, you know, the data science group, but, you know, other groups have sprung up within operations, within maintenance, within marketing, and so we started a mailing list a few years ago. We just send stuff out to kind of get people interested in data science. We probably have 500 people on our mailing list. Now we have a monthly meetup where we typically have 70 to 80 people who will tune in for a deep dive on a project.

That's great.

But, you know, I noticed the last few days there were some disparaging comments about the phrase "best practices," just as we're about to have a panel session on best practices. So anyway, I just wanted to...

Do you want feedback on best practices, or should I stay away from that?

No, no, no. Dive into the controversy, JD. That's what we're here for.

So here's my thoughts on best practices. By the way, that sounds like a thriving community, right? I'm in an organization that only has 600 people total, so a community of analytically minded people where you can get 80 people together seems just tremendous to me. All right, let's discuss best practices. There was a thread discussion on Twitter about this. One of the things I learned from that discussion is that someone opined that in medicine, "best practice" has a very different meaning than how I see it used in business. In medicine it's like: we have practices, and this is the best known practice for something, right? It has a very specific meaning. That's almost never how it's used in business. What I often hear is more like the comment someone else made, which I think I amplified: it usually means "I have organizational status over you; I want you to shut up." That's what best practice often means. I'm going to end this conversation by saying this is best practice, right?

it usually means I have organizational status over you. I want you to shut up is what best practice often means.

And, you know, I kind of joke internally that I don't have people in my organization who operate like that. We're low on assholes, long on intellectual curiosity. So if somebody says something or other is best practice, I'm like, okay, cool. Well, we want to be better than that, so let's talk about how to do it well. That's my joke: best practice means average, and we don't want to be average. And I'll give you a very specific example of this that we just went through in our organization, one that in my mind shows a lot of good thought. So we implemented a best practice for system passwords. We have system accounts that we use for these automated processes I talked about, system accounts for database access or whatever. And as a best practice, our ops team required a very large character set that included some particularly pernicious special characters (ampersands, semicolons, and backslashes, I think), all of which produce real challenges on different systems.

But the bottom line is, what you want in a password is lots of entropy. A password of a given length over a given character set gives you a certain amount of entropy. You can shrink the character set and lengthen the password and get the same or more entropy. What you care about is entropy. Nobody gives a crap whether there are special characters in there if you know what you're after is entropy. But "best practice" says: use a big character set. So a bunch of us had systems that were barfing when the passwords got rotated. We would end up with, say, an ampersand in the password, and certain systems were falling over because it caused problems. And we went to our security team and we said, you know what... what's that?

Quick question in between: what is entropy?

So entropy is the amount of randomness, right? What we really care about is how hard the password is to guess, and we use entropy as a proxy for that: if a search loop went through every combination of characters, how long would it take to guess the password? That's what we really care about. But yet our best practice was, you know, eight characters and a big character space, so each position can be 50-odd different things because we include special characters. We went to them and said, let's take the special characters out, because they're causing us pain, and let's make the password really long, because we don't care about length. Length is easy; it's an automated system, right? But these special characters are giving us a problem. And our ops team said, oh, that makes sense. Cool. Are we not doing best practice? No. Are we accomplishing our goal? Hell yeah, we are. And we've got more entropy than we had with the special-character set, because we made the passwords a lot longer over a smaller character set. So if someone's trying to brute-force it, it's actually harder to brute-force our passwords now, with no special characters, than it would have been with the special characters in there, because we made them really long.
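The trade JD's team made is easy to check: entropy in bits is length times log2 of the character-set size, so a longer password over a smaller set can carry far more entropy. A quick sketch (the character-set sizes are illustrative):

```python
import math

def entropy_bits(length, charset_size):
    """Bits of entropy in a uniformly random password:
    length * log2(size of the character set)."""
    return length * math.log2(charset_size)

# Illustrative sizes: 94 printable ASCII characters with specials,
# 62 alphanumerics without them.
short_special = entropy_bits(8, 94)   # "best practice" short password
long_simple = entropy_bits(24, 62)    # longer, no special characters

print(f"8 chars over 94 symbols:  {short_special:.1f} bits")
print(f"24 chars over 62 symbols: {long_simple:.1f} bits")
```

The longer alphanumeric password wins by a wide margin, which is exactly the "we care about entropy, not special characters" argument.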

And that's an example of how, if we use our heads about what problem we're actually trying to solve, and don't hide behind "this is best practice," we get to a better outcome. And it's because, I hate to sound like a Stephen Covey book or something, but we started with the end in mind. What's the thing we're optimizing? We want passwords that are hard to guess. All right, how do we accomplish that, subject to the constraint that special characters are giving our systems problems? That's an example of, I think, bucking a best practice, but doing it thoughtfully because you understand the principle.

And I think we need to have openness in our organizations to do that. With a lot of best practices, you know, I watch folks do stuff that they inherited because Google does it this way or Facebook does it this way, and they think that should be a best practice. And I'm always reminded of a very particular anecdote. So I'm going to tell you all an anecdote. There was a woman whose mother taught her, and I should retell the story with a son, right? This feels kind of sexist, but I'm sorry, I'm already in now; I'm not going to change it. The woman had learned to cook a ham at Thanksgiving from her mom. Later she's married, she has her own kid, and they're cooking a ham at Thanksgiving. And the little kid asks the mother, why do you cut the hock, the end, the bony bit, off the ham before you cook it? And she's like, I honestly don't know. That's the way your grandmother always did it, so I do it the same way. Let's call grandma and ask her. So they call grandma and they say, grandma, why do you always cut the hock off the ham before you cook it for Thanksgiving? And grandma said, well, the ham wouldn't fit in my pan. The principle being: understand why we're doing these things, right? If you don't have a pan that's too small, there's no advantage to cutting the hock off. And I feel like a lot of times people replicate what they see Amazon, Facebook, or Google doing. They don't understand what problem those companies were solving, but yet they say, we should do that. We should cut the hock off our ham because Facebook's cutting the hock off their ham. I'm not sure we've got the same problem Facebook has.

Building communities of practice

Before leaving the communities of practice discussion, though, I am curious: what does your team do to get people together?

All right. So we have a couple of communities, some more formal than others. One kind of formal community is, we have a team called Risk Solutions, and I'm a VP on the Risk Solutions team. We get that whole team together periodically, on a virtual call once a week. They have a meeting that the analysts lead, and we usually only have one or two of the managers actually go to that, because the purpose is for it to be led by the analysts, the doers. The managers don't lead it and don't interject themselves unless they're asked a question, or at the very end, when there's time for Q&A. So that way we're trying to develop a community of peer support among our analysts, not driven by managers, right? It's driven peer to peer. So they lead the meeting. They have a spreadsheet where they track who's leading it each week; it rotates. They bring in speakers from inside or outside the organization to come speak to that group, and then they do Q&A at the end. I love that, right? Because it's pushing ownership of development and team and all this stuff down to the analyst level, as opposed to being something dictated by people like me. I love that as a way to build community. And we have told them, we want you all to do this, but what you do during that time and how you do it is up to you.

The other example of communities is things like, we use Dremio internally as our data lake, and we really like Dremio. So we have a Dremio community internally. We use Microsoft Teams, so it has its own Teams channel where people can ask Dremio questions. We periodically do a Dremio meetup where all the people in the organization who are using Dremio, in finance, risk, actuarial, whoever, get together, and there'll be a couple of presentations. So that one cuts across different business units. That's a great little community. I like that a lot.

I have led, and we've kind of let it fizzle the last year or two, more of a data science community, and we would do topics like, how do you make a good Jupyter notebook? What are the characteristics of a good Jupyter notebook? And that was driven by me going into Git and seeing Jupyter notebooks with 500 lines of code in three cells, two comments, and no rich text. It's like, why are we even in a notebook, right? This isn't the spirit that Knuth had in mind with literate programming. This is just a Python script, or some kind of script, dumped into a notebook instead of into a text file. I would even hope that if this weren't a notebook, it would have more comments than this. And so I was like, okay, let's talk about what a notebook could look like, or should look like. And so we'd do a presentation on it, things like that.

People sometimes ask me about starting communities because I started the Chicago R User Group, you know, 13 or 14 years ago. And the design pattern for that group was, I think, a really good one. We had about four or five people who were going to get together, drink beer, and talk about R anyway. And what we effectively did was say, let's put on the types of presentations the four or five of us would like to hear, and then invite other people. And if they don't show up, we don't really care, because we're going to drink beer and listen to this anyway; we're interested in it. And that community grew wildly with me running it like a benevolent dictator, just getting the presentations I wanted to hear. And then one of the things I observed is that we were getting lots of new users of R. So I decided that periodically, once a quarter or something, we would have a beginners' night. And at the beginners' night, we would make sure we had two things. One is topics appropriate for beginners, right? So no sophisticated stochastic gradient descent garbage. It's grouping and summing and getting environments set up or updating packages, all the normal stuff people have friction with. Focus on that. And then we would make sure we had, I forget what we called it, I would call it now something like a learner salon. We had a handful of people who were more experienced who said, I will sit at a table, and anybody who brings their computer with questions, I will answer. And we would have really senior people with lots of experience there doing Q&A, and people could bring their own problems, literally eat pizza, drink beer, and look at code. Those beginner nights were super helpful.

And so anyway, when people ask about building communities, I recommend those two things. Get yourself a core group of a small number of people and put on the presentations you would want. And if the beginner concept is germane to what you're doing, if you have beginners around who may be intimidated, do explicit beginners' nights. I'm also a big fan of having the extended Q&A or tutoring time. Those were all hugely helpful.

That's great. Thank you. I remember, JD, at the San Diego RStudio conference, I was talking to you out by the pool, and I was telling you I was going to start maybe doing the Boston useR group, and I wasn't sure how to get it started. I was thinking, should I make a form and have people submit ideas, and then we could vote on talks and all that? And JD had shared with me: when someone wants to give a talk, just let them. If people are interested in coming to this community and sharing different ideas and have topic ideas, don't overdo it by having a form and having people vote. Yes, it's great to be able to source lots of ideas, but you don't have to make it so formal.

I did very little voting. I did the benevolent dictator approach because, you know, I don't know. I did some voting on things a few times, but it was voting as input for me, not deterministic voting. The votes weren't deciding the outcome; they were signaling to me what's important to the group, and I integrated that into my own objective function and then made a bunch of decisions. So it was more representative than "I'm going to run this group and let them vote on everything." People lack creativity, and I found that being a little bit creative myself and taking the votes as input worked better. Like the first time we did the beginners' night, nobody voted on that. Somebody recommended it, and I was like, oh, that sounds great, let's do it. We tried it. And then afterwards we did do some voting about what frequency to do the beginners' night at, and I was surprised how frequently people wanted it. That was the surprise for me. So we did it more frequently than I would have otherwise.

Separation of concerns in R Markdown and reporting

Yeah, I just have a sort of general brainstorming topic. It's not necessarily attached to your personal experience, but I do clinical research, and our pipeline is, you know, data of some sort into R and then to R Markdown or Shiny or, you know, different reporting endpoints. And particularly with R Markdown, we use a lot of templated LaTeX PDF output, you know, for reasons, mainly because we can't convince people to use HTML or whatever. But I feel like one of the problems we run into is that the separation of concerns between the computation or calculation and the formatting of the output is kind of broken in R Markdown, especially with LaTeX. And, you know, I've been looking at Quarto too, and I don't necessarily see that it fixes the problem. I was just curious if you had any ideas about whether we need a whole new framework to completely separate the way something looks from the way the pieces in it, the plots and the tables, were created. You see a lot of packages rise up to address this challenge by, you know, saying I can format this line and I can bold this and all that, but it still doesn't change the fact that you need to have this LaTeX package installed but not that one.

Let me give you a couple of thoughts on the separation of concerns. There are kind of two parts here: I'm going to talk about the separation of concerns between analysis and reporting, and then the other thing you brought up, which sounds like package management on the output side, especially at the LaTeX level. I've got weaker thoughts there; I don't do that as much. For the first part, the design pattern that I have had good success with, and that I think is important, is a special case of this: it's easy to co-mingle what I would call business logic, which for you isn't business logic, it's analysis maybe, with your reporting framework. And a lot of times we do that, especially when we first start, because it's real convenient. Like, I'm in this doc, it's got R, I'm going to do all my data manipulation and then output a result, then do all my next manipulation and output a result. Next thing you know, you've got a 1,500-line R Markdown document, and somewhere buried in the middle of that is a little free-form narrative explaining what's going on, and then there are hundreds of lines of R code. That's tough to maintain, in my experience.

And what I have moved towards is having usually an R script, not even an R Markdown document, an R script or a Python script, that does all the data manipulation and calculates all the things. Now, here's what's different from the clinical world: in my world, almost anything I do, I'm going to want to do over and over. So I take the results, I tag them with, like, a date, a time, whatever configuration tags I need, and I write them to a database. And that just runs on some sort of regular interval and populates that database, right? No reporting at all. It's just the manipulation, the calculation, whatever. And then when I come over to do the reporting, what the reporting does is read those effectively cached values and grab what it needs for itself. And you've got narrative, but your R blocks are all about formatting the output. They're not doing any math, particularly, or if they are, it's very limited. Mostly they make the exhibit: the setup for the ggplot to do the graph, the table formatting to format the table. And so you end up with that second document being all about presentation, and the calculation has a separate set of concerns over in this other script.
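A minimal sketch of that calculate-then-report split, in Python with SQLite from the standard library; the table layout, the "baseline" tag, and the metric names are invented for illustration, not from the talk:

```python
import os
import sqlite3
import tempfile
from datetime import datetime, timezone

db_path = os.path.join(tempfile.gettempdir(), "results_cache.db")

# --- calculation script: does the math, tags the results, caches them ---
conn = sqlite3.connect(db_path)
conn.execute("DROP TABLE IF EXISTS results")
conn.execute(
    "CREATE TABLE results (run_at TEXT, config TEXT, metric TEXT, value REAL)"
)
run_at = datetime.now(timezone.utc).isoformat()
# Stand-ins for real analysis output
for metric, value in [("mean_loss", 1.23), ("max_loss", 9.87)]:
    conn.execute("INSERT INTO results VALUES (?, ?, ?, ?)",
                 (run_at, "baseline", metric, value))
conn.commit()

# --- reporting document: reads the cached values, does no math ---
rows = conn.execute(
    "SELECT metric, value FROM results WHERE config = 'baseline' ORDER BY metric"
).fetchall()
conn.close()

for metric, value in rows:
    # In the real report these would feed tables and ggplot exhibits
    print(f"{metric}: {value}")
```

In practice the two halves live in separate files, and the calculation script runs on its own schedule, so the reporting side only ever formats what is already cached.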

Now, that R script that does the manipulation need not be run nightly or whatever, right? It could be run on an ad hoc basis if you're only going to run this, you know, once a year or something. But that's the separation of concerns that I like. And I tend to do that when an R Markdown document gets out of control, or if I can see that it's going to get out of control. Now, in terms of reporting environments, and what packages you need installed in what order, especially when it comes to LaTeX, that's just got Docker written all over it, I think. Because if you've got to manage, like, six or seven of these, where each report is really bespoke and can't coexist with the others, you could probably do that with virtual environments, but my virtual environment foo is not that good. So what I would tend to do, probably overkill, would be having, you know, Docker images that have all the packages installed in the right order in the right place, and just have a script that stands up the container, builds the R Markdown inside it, and tears it all down. It kind of feels like killing a gnat with a sledgehammer, but it could be a way to solve that problem.
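As a sketch of what that per-report Docker setup might look like (the base image tag and LaTeX package names here are illustrative guesses, not from the talk), each bespoke report gets its own image with its dependencies pinned:

```dockerfile
# One image per bespoke report; versions and package names are illustrative.
# rocker/verse bundles R, rmarkdown, and a TinyTeX-based LaTeX toolchain.
FROM rocker/verse:4.2

# Pin exactly the LaTeX packages this report needs.
RUN tlmgr install booktabs fancyhdr

WORKDIR /work
COPY report.Rmd .

# Render the report; a small wrapper script builds the image, runs it,
# copies out the PDF, and tears the container down.
CMD ["Rscript", "-e", "rmarkdown::render('report.Rmd')"]
```

Because each report's LaTeX dependencies live in its own image, two reports with conflicting package requirements never have to coexist in one environment.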

Integrating multiple tools and frameworks

So Eric asks, do you have advice on how you navigate through complications when integrating multiple tools and frameworks together in a project? Eric said, I have major pains linking R to AWS as I speak.

Yeah, it depends on the nature of the friction. I like clean interface points. An example of a clean interface point is when the Python process writes out Parquet files and the R process reads those Parquet files. That's a super clean interface, right? As opposed to having R running inside of Python, or Python running inside of R using reticulate. Those can be really useful, but I really like that clean separation. And for me, the new interface format is Parquet. It used to be CSV, and now I'd much rather use Parquet as the interface between all my steps.
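A toy version of that file-based handoff, in Python. Parquet needs a library on each side (for example pyarrow in Python and the arrow package in R), so to keep this sketch dependency-free it uses CSV, the speaker's "old" interface format; swapping in Parquet changes only the writer and reader calls, not the shape of the handoff:

```python
import csv
import os
import tempfile

# Step 1 (producer, e.g. a Python job): write results to a file at the
# interface point. With Parquet this would be a pyarrow/pandas write instead.
path = os.path.join(tempfile.gettempdir(), "interface_demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "loss"])
    writer.writerows([["NA", 1.5], ["EU", 2.25]])

# Step 2 (consumer, e.g. an R job): read the file back.
# No in-process bridging like reticulate is needed; the file IS the interface.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

The point is that each process only has to agree on the file format and location, so either side can be rewritten, rerun, or debugged independently.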