Kirill Müller | dm: Analyze, build and deploy relational data models | RStudio (2022)
Transcript
This transcript was generated automatically and may contain errors.
Hi, I'm Kirill and I want to tell a few stories about the dm package and show recent developments.
In a nutshell, dm helps when you have multiple tables in R. Some call this a relational data model. We had computers for sale in the 50s, we started to think about databases in the 60s, and in 1970 Edgar Codd published the relational model. That is way before I was born, and it is still in use today. An example: the points at the right are energy consumers in Switzerland, stored as rows in the table at the left, and there is a lot of detail in all those related tables, and this looks scary.
With our client, we wanted to build a Shiny app that allows filtering across all those related tables. This shows the Shiny app in action. All we had back then was a tool to visualize the relationships. We did not have tools that made the task less scary, so we went ahead and implemented a routine that did all those joins and filters in something like 800 lines of code. It worked, but it was tedious, it was difficult to maintain, it was error-prone, and we thought we could do better than that. That's how dm was born.
With the dm package, the filtering logic was reduced to a variant of this. So from 800 lines down to 6, and the best thing is: if the data model changes, the code stays the same. I call it a success. Others call it a success too: "Amazing." "It's a treasure." "It's a lifesaver." Thanks, Gary. Thanks, Hadley. Thanks, Niels. Thanks to everybody who filed issues and who contributed code to the dm package. You are awesome.
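The roughly six-line pattern is not shown in the transcript; a minimal sketch of the idea, using dm's bundled nycflights13 example model instead of the client data, might look like this (assuming dm >= 1.0, where `dm_filter()` takes per-table conditions):

```r
library(dm)

# dm's built-in example data model (the real app used client data)
dm <- dm_nycflights13()

# One filter condition propagates along the foreign keys to every
# connected table in the model; no hand-written joins needed.
dm_jfk <- dm |>
  dm_filter(airports = (faa == "JFK"))

# Every related table is now restricted to JFK-linked rows.
dm_jfk |>
  pull_tbl(flights)
```

If the data model gains or loses tables, this code stays the same, which is the point being made above.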
The best thing is, we are on CRAN in version 1. We are on R-universe and GitHub as well, so if you want the latest and greatest, you may want to go to R-universe. If you don't want to mess up your local installation, click this Posit Cloud button. This gives you, well, should I say RStudio Cloud? An IDE in your browser with everything installed, so you can just test. So there is no excuse not to try today.
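For reference, the two installation routes mentioned here would look roughly like this (the R-universe repository URL follows the usual r-universe.dev naming convention for the cynkra organization; verify it before use):

```r
# Release version from CRAN (version 1.x):
install.packages("dm")

# Latest development build from the maintainers' R-universe:
install.packages("dm", repos = "https://cynkra.r-universe.dev")
```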
Some people are wary about dependencies. We are not among those people. In fact, dm wouldn't have been possible without all those great packages we depend upon, without igraph, without everything. Thanks. This is us.
Why multiple tables need dm
So we're talking about multiple tables. Which of these tools has a concept of an active dataset? You start a program, you open a file, and this is your dataset and you work with it. One file, one dataset. Many files, many datasets. In R we don't have the active dataset. That's a good thing.
It took me a while to realize there's a difference between data for analysis and data for cleaning and management. For analysis, all we really want is tidy data: rows, columns, cells with atomic values, everything lined up cleanly, ready to put into our model. Except there is duplication in those highlighted parts at the left, and we can't have that for data management or data cleaning. This leads to problems further down the road. If we try a different approach, wide form, this is better: there is no duplication, except that those beautiful column names make me want to turn this back into long form the first time I see it.
So we want, and often need, multiple tables, and we want dm, because that's what dm is doing: if you have more than one table, dm makes your life easier. We are at a database session, so many of you will know about primary keys and foreign keys; this is just to make sure we're all on the same page, very briefly. The dataset we saw in the previous slides could be decomposed into two tables, measurements at the left, entities at the right, so that multiple measurements correspond to a single entity. The entities have a primary key, the id column. The measurements have a foreign key, also in the id column, linking to the entities. And in fact, this is what a dm object contains.
Tables, which could be local data frames or lazy tables on the database via the dbplyr package. Maybe arrow works as well, but I'm not sure; more on that on the last slide. We have primary keys, id, and we have foreign keys that link tables together. That's a dm object. But do we really need a new package? For two tables? For five tables? Even for ten tables? Will it actually make your life easier? Will it integrate into your existing workflows? Will it help you in your work? I believe the answer is yes, yes, yes, and I'll show you how.
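The measurements/entities decomposition just described can be built by hand; a toy sketch with made-up data (the table and column names here are illustrative, not from the talk):

```r
library(dm)

# Two tables: one entity, multiple measurements per entity
entities <- data.frame(id = c(1, 2), name = c("a", "b"))
measurements <- data.frame(id = c(1, 1, 2), value = c(3.1, 2.7, 4.2))

# A dm object = tables + primary keys + foreign keys
my_dm <- dm(entities, measurements) |>
  dm_add_pk(entities, id) |>
  dm_add_fk(measurements, id, entities)

my_dm |>
  dm_draw()  # visualize the relationship
```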
New Shiny app for building data models
So this is a very new thing, a Shiny app we have built, and the team at cynkra helped a lot with that, so thanks for all the support I got when preparing this. This is the AdventureWorks dataset, if you're familiar with that, and this is the Shiny app in action. Starting from the database, we can, by pointing and clicking, select the tables we care about, and we can create code that we can run immediately, store in an R script, or put into a function. David will talk more about how to turn this into a real R package, but for us, this kind of works. You run this Shiny app once, pretty much like the RStudio IDE import assistant, and it gives you code that you can reuse.
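The database starting point mentioned here is typically `dm_from_con()`. A self-contained sketch with an in-memory SQLite stand-in (the talk used AdventureWorks on a real server; SQLite cannot report key constraints, hence `learn_keys = FALSE`):

```r
library(dm)
library(DBI)

# In-memory stand-in for a production database
con <- dbConnect(RSQLite::SQLite())
dbWriteTable(con, "entities", data.frame(id = 1:2, name = c("a", "b")))

# dm_from_con() lists the tables; with learn_keys = TRUE it would also
# ask the backend for primary/foreign key constraints where supported
dm_db <- dm_from_con(con, learn_keys = FALSE)
dm_db

dbDisconnect(con)
```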
This is just a summary: selecting tables gives you code which you can copy to your console at once. For the data frames use case, when you start with data frames, data frames don't know about primary or foreign keys; they have just rows and columns. So we have to teach dm what those keys are, and we can do this with the Shiny app as well: we select the tables, we select the columns that constitute the primary keys and the foreign keys, and we get an instant preview in the diagram.
And here we get the code, plus we can also experiment with colors. I find this very helpful for guiding the eye in a more complex data model. There's an undo function if we mess up. If you're satisfied, the "copy to console" button will give you the code for this data model from your data frames. That's new, that's in dm version 1.0.1. To sum up: it works for databases, for tables on the database, and for data frames. Try it out.
Using dm in consulting work
In our consulting work we are using dm, well, you saw the expression, whenever we have more than one table; dm is just the basic thing that we do. We used it to help with a Shiny app that handles contracts. This is a situation where, for a contract partner, we have in the first stage contracts at the level of individual countries, and in the second stage we pool contracts together to get an overarching quote for this customer.
And this shows the database UI that we have contributed. We have projects, families, quotes, and this maps to this data model that we designed with the dm package and applied to the database as soon as we were ready. We could also continuously update it, because the data model is specified in code. The first stage is everything that's orangish at the bottom. What we can do very easily with dm: for a specific quote, which we identify by its primary key, the quote ID, we can get all the related data in one swoop, load it locally, update what we need based on the input from the Shiny app, and put it back to the database. So this is create, read, update, delete for you in Shiny, made easy with the dm package on a complex data model.
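The read-update-write cycle described here can be sketched as follows; this is a hypothetical outline, not the client code, and the `quotes` table and `quote_id` column are assumed names. `dm_db` stands for a dm object built on the database connection:

```r
library(dm)
library(dplyr)

# READ: one dm_filter() call follows the keys through the whole model,
# collect() then loads all related rows into local data frames
quote_dm <- dm_db |>
  dm_filter(quotes = (quote_id == 42)) |>
  collect()

# UPDATE: modify the local tables based on the Shiny input, then write
# the changes back, e.g. with dplyr's rows_update() on the database:
# rows_update(tbl(con, "quotes"), updated_quotes,
#             by = "quote_id", in_place = TRUE)
```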
Deconstruct and reconstruct for data analysis
For data analysis: this is the nycflights13 data; we use it throughout the documentation of the dm package. If we aggregate the flights table, let's say we group by origin and summarize, computing the number of flights per origin airport, or say the mean arrival delay, the result is a table that is conceptually linked with the airports table. This group-by-summarize creates an implicit foreign key into the airports table, because the data in the origin column maps to what's in the airports table. This link is useful if we later want to bring in details from airports or from other related tables, maybe not shown here.
How does the dm package make it easy to keep track of your keys? We have a new feature we call deconstruct-reconstruct. You can go from a dm object, from a data model, to individual tables that still know about their primary and foreign keys. I'm not going to tell you how, but we can talk about it later. We can apply data transformations on those separate tables with dplyr or whatever works for you, and we can put this back into a relational data model, a dm object.
A bit of code. The new dm_deconstruct() function does nothing except print code: from a dm object that has tables, primary keys and foreign keys, it gives you code that gets you those separate tables, which you then can compute on. keyed = TRUE is kind of the key here. The dm() constructor on those objects will result in an object that still has all those primary and foreign keys in place. And this also works when we have analyzed the data: here, the new table will contribute one primary key and one foreign key.
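Putting the pieces together, a sketch of the deconstruct-transform-reconstruct cycle on the nycflights13 example (assuming dm >= 1.0, where keyed tables carry their keys through dplyr verbs):

```r
library(dm)
library(dplyr)

dm <- dm_nycflights13()

# dm_deconstruct(dm) prints pull_tbl() calls like these;
# keyed = TRUE keeps the key information on each table
flights <- pull_tbl(dm, flights, keyed = TRUE)
airports <- pull_tbl(dm, airports, keyed = TRUE)

# Ordinary dplyr; the origin column retains its foreign-key link
# into airports, so the summary contributes one PK and one FK
flights_per_origin <- flights |>
  group_by(origin) |>
  summarize(n = n(), mean_arr_delay = mean(arr_delay, na.rm = TRUE))

# The dm() constructor reassembles a data model, keys included
dm2 <- dm(airports, flights_per_origin)
dm2
```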
I'm very excited about this feature. We had a similar concept called zooming before, but that, of course, doesn't integrate as well with existing workflows. So I think this is a step towards adapting dm to existing code, so that you can benefit from dm even if you're not starting on a greenfield.
ETL across multiple data models
Why one data model? Why just one data model? We can do more. This is an example from another project with another customer. The nodes, the boxes in this diagram, are whole data models. The arrows represent data transformations. This is an ETL process, extract, transform, load, targeting whole data model objects, going from dataset to dataset, giving better guarantees, moving to the right in terms of data quality, data integrity, and so forth. We don't have a specific framework to do this; right now this is all custom. This could be a blog post.
But we have a pretty good idea how to do this, because there is an inherent chicken-and-egg problem here. When populating a database, we first need to create the structure, the tables with the columns and the relationships, in an empty form, before we can fill them. And how do you know the structure? The dm_ptype() function helps us here: it gives us the prototype of a dm object, meaning tables and relationships minus the data, which we can then write to the database and populate.
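A sketch of this structure-first deployment, again using the bundled example model in place of real project data:

```r
library(dm)
library(DBI)

dm <- dm_nycflights13()
con <- dbConnect(RSQLite::SQLite())

# dm_ptype() keeps tables, columns and key relationships,
# but drops all rows
empty <- dm_ptype(dm)

# copy_dm_to() deploys the empty structure to the database
# (and sets key constraints where the backend supports them)
deployed <- copy_dm_to(con, empty, temporary = FALSE)

# ...then populate the tables, e.g. with dm_rows_append()

dbDisconnect(con)
```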
Bridging normalized and nested data
I want to spend the last few minutes on a new idea that didn't make it into dm 1.0, but that we are also excited about. Normalized data has been around for ages and will stay around. At the other end of the spectrum, we have deeply nested data: when you query a web API, what you typically get is JSON, deeply nested. And somewhere in the middle, we have nested tables. With the tidyr package, think of nest(), unnest(), pack(), unpack(). Arrow has great support for this in their data format. What we can do in dm today is go from normalized data, from a dm object, to nested tables, to a single nested table, and back. More or less losslessly, more or less in all cases; we're working on that.
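The dm functions for this round trip are the wrap/unwrap pair; a sketch, assuming dm >= 1.0 and with the caveat from the talk that the round trip is only more or less lossless:

```r
library(dm)

dm <- dm_nycflights13()

# dm_wrap_tbl() collapses the whole model into a single table with
# nested/packed columns, rooted at the chosen table...
wrapped <- dm_wrap_tbl(dm, root = flights)

# ...and dm_unwrap_tbl() goes back, using a ptype as the recipe
# for which tables and keys to restore
dm2 <- dm_unwrap_tbl(wrapped, ptype = dm_ptype(dm))
dm2
```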
But this idea looks very appealing, in particular with a very nice package which I think is really underrated. We can go from deeply nested data to nested tables in a controlled way, and the way back to JSON is understood. So, with all those bits and pieces in place, I hope very soon to be able to present a way to seamlessly move between all those formats, so all of this becomes one. And perhaps nested tables are a good middle ground, because they would allow us to go back to one single table. That's just easier than all those boxes and arrows.
We are cynkra. The team, as I said, helped a lot with preparing the Shiny app and making sure this all works; thanks. We're very excited to be a sponsor here. If you have detailed questions, would like to chat, or would like one of those cheat sheets, come by. Happy to talk or to take questions here. Thank you.
