Bryan Shalloway | From summarizing projects to setting tags, uses of parsing R files | RStudio

video
Oct 24, 2022
12:22

Transcript

This transcript was generated automatically and may contain errors.

Today, I'm going to be talking about some of the ways that parsing out the functions and packages in your files can be helpful for organizing and viewing projects. And I'll also be introducing a new package I wrote called funspotter that provides some helpers for parsing, analyzing, and organizing functions in different project structures.

Okay, so I hope I'm not the only one here who maybe follows some of their favorite R developers a little too closely. So for example, before going to sleep, rather than scrolling through Instagram, I'll just scroll through the recent GitHub activity of my favorite R developers. And then when a new blog post or new book comes out that I'm super excited about, it's like a major event for me, and I just want to jump straight in. And I would love to say that I'm just this sponge of information, and that I immediately know exactly what I'm doing, but usually it ends up looking a little bit more like this.

And you know, there's all those things coming at you, and you're trying to figure out how these things relate to each other. And I've come up with this kind of heuristic where I'll just start writing down the functions that I see as I'm going through the new book or blog or whatever it is. And I'm happy that I've seen other people in the R community that share this same pattern. Here's a tweet from Alex Cookson, where he says, "Any other rstats people find drob's TidyTuesday screencasts useful? I made a spreadsheet with timestamps for hundreds of specific tasks he does." And then another example from Jeff Rothschild, who made something similar for different developers in the tidymodels space, particularly Julia Silge and other authors there.

And I think that creating a reference table like this for yourself can be really helpful in terms of building a mental model for what's going on and noting these different functions that you can go back to and check on, to give yourself a little bit of context for the materials that you're working through. One downside with doing this is it just takes a long time. And you also may miss some specific examples or packages as you're going through and creating these by hand. And I think that this is where something like funspotter can be useful, either in supplementing these materials or in creating an initial reference table for yourself.

So I'm going to start by just showing what a typical funspotter workflow looks like and what the output looks like. I'm going to give you a few seconds to kind of soak this in. And this is also another opportunity, if you find you can't read the code, to scoot closer up. There's plenty of seats in the front. So I'll just give you a few seconds to see if you can soak this in.

Okay, so what I'm doing here is I'm specifying the GitHub repo where Julia Silge's blog lives. And then I'm just pulling out all of the individual functions and packages that are used there. And then we have links to the relative paths within the repo as well as the URLs where these files actually exist. And then I think passing something like this into a nice little HTML lookup table can give you a reference table for reviewing whatever material you're in the process of going through. So here I'm just looking up the cases where she's used or written about random forests in the past. And this gives this nice little way of looking this up.

Another place where this is useful is just in those cases where you kind of want to sift through some set of information. You may not even remember the function or package that you saw her write about. You just remember, okay, there was something that was useful that I think she did. And you just kind of want to look through a reference sheet. And I think that's also where these types of tables can be really helpful and provide nice little supplements to whatever other method you use for searching information or helping you kind of in your learning process.

funspotter functions overview

So next I want to talk a little bit about each of the different groups of functions that exist within funspotter. So one set are these list_files functions that essentially take in some location and return all the different files in that location. Then there's a set of functions that spot things. So in this case we're talking about spotting functions. And then I have a variety of helpers that either manipulate the output of these objects or are helpful for plotting. And then there's also a bunch of unexported functions that I'm not really going to get into today but that can help with certain situations you may run into if you have different dependency structures than just the standard setup.

Okay. So now I'm going to walk through this step by step just so you can see the details of how this works. So after running list_files_github_repo(), what is returned are the file paths for each of your R and R Markdown files within that repo, and then also the URLs to those. Next we pipe this into spot_funs_files(), which is going to, for each of those individual file paths, parse out the actual functions and packages within them. So if we just hone in on an individual element within that spotted column, you'll see this is where the actual functions and packages that are used within that file exist. And then that next step, unnest_results(), is just going to take each of these individual data frames containing the functions and packages and put them into a simpler data frame structure that you can use as a reference table for yourself.
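To make the shape of that last step concrete, here is a minimal sketch written in Python rather than R, with plain lists and dictionaries standing in for data frames. It is an illustration only, not funspotter's code: the file paths, the example.com URLs, and the field names (`path`, `url`, `spotted`, `fun`, `pkg`) are all made up for the example.

```python
# Hypothetical result of the "list files" and "spot" steps:
# one record per file, each holding a nested list of spotted functions.
files = [
    {
        "path": "content/post-a.Rmd",
        "url": "https://example.com/post-a.Rmd",
        "spotted": [
            {"fun": "mutate", "pkg": "dplyr"},
            {"fun": "nest", "pkg": "tidyr"},
        ],
    },
    {
        "path": "content/post-b.R",
        "url": "https://example.com/post-b.R",
        "spotted": [{"fun": "ggplot", "pkg": "ggplot2"}],
    },
]


def unnest(files):
    """Flatten to one row per spotted function, keeping the file it came from."""
    return [
        {"fun": s["fun"], "pkg": s["pkg"], "path": f["path"], "url": f["url"]}
        for f in files
        for s in f["spotted"]
    ]


table = unnest(files)

# The flat table is what makes lookups easy, e.g. "where was dplyr used?"
dplyr_files = [row["path"] for row in table if row["pkg"] == "dplyr"]
print(dplyr_files)
```

The flat form is the one that works as a reference table, since every row can be searched or filtered by function, package, or file.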

And most of the work that's going on here is really in this spotting step. And one thing about R that adds a little bit of complexity here is that when you do a library call, you're attaching all the functions within that package. So what funspotter does is essentially try and recreate what it thinks that search space is going to look like. It takes a variety of steps to do this so that when it identifies a function it can say, I think this is the package this function is actually coming from. I'll note too that this takes a few seconds to do for each file. So while you can do it for a few hundred files, if you try and do it for millions this isn't going to work; it's going to take too long. If you have any ideas and suggestions for how to do this faster, definitely feel free to open a GitHub issue for me if you happen to be an expert in this space.

And I'll also note that if you don't have the package installed locally, funspotter is not going to know which package that particular function came from; it's just going to say unknown. So if you need to install packages just to run funspotter over some set of files, and you don't want those installed across all your projects, you may want to start off by using something like renv so that you can keep the packages you're installing specific to that project. And again, this process of recreating the search space is currently set up to be used on self-contained files, and by that I just mean picture an R file or R Markdown file that has a bunch of library calls at the top. So it's going to work on things like repos for blogs or books and things more along those lines. It's not going to work so well on things like targets workflows or Shiny apps or other types of projects that potentially have more sophisticated dependency structures.
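The idea of recreating the search space can be sketched with a toy model. The Python below is not how funspotter is implemented. It only mimics the rule that a package attached later sits ahead of earlier ones when R resolves a bare function name, and that an explicit `pkg::fun` call names its package directly. The `INSTALLED` table, the regular expressions, the default `stats` and `base` entries, and the `(unknown)` label are assumptions for illustration, and a regex scan is far cruder than actually parsing R code.

```python
import re

# Hypothetical export tables standing in for packages installed locally.
INSTALLED = {
    "base": {"print", "sum"},
    "stats": {"filter", "lag", "lm"},
    "dplyr": {"filter", "lag", "mutate", "group_by"},
    "tidyr": {"nest"},
}

# library(pkg) or require(pkg) with an unquoted package name.
ATTACH = re.compile(r"\b(?:library|require)\(\s*([A-Za-z][\w.]*)\s*\)")
# A call such as fun( or pkg::fun(, not preceded by another name character.
CALL = re.compile(r"(?<![\w.:])(?:([A-Za-z][\w.]*)::)?([A-Za-z.][\w.]*)\s*\(")
# Names that look like calls to the regex but are not worth reporting here.
SKIP = {"library", "require", "if", "for", "while", "function"}


def spot_functions(source, installed, default_attached=("stats", "base")):
    """Return (function, package) pairs in order of first appearance."""
    attached = ATTACH.findall(source)
    # Packages attached later are searched first, so they mask earlier ones.
    search_path = attached[::-1] + list(default_attached)
    rows = []
    for pkg, fun in CALL.findall(source):
        if fun in SKIP:
            continue
        if not pkg:  # bare call: walk the search path in order
            pkg = next(
                (p for p in search_path if fun in installed.get(p, ())),
                "(unknown)",  # nothing installed exports this name
            )
        if (fun, pkg) not in rows:
            rows.append((fun, pkg))
    return rows


example = """
library(dplyr)
library(tidyr)
library(madeUpPkg)

mpg %>%
  group_by(class) %>%
  nest() %>%
  mutate(n = purrr::map(data, nrow))

filter(mpg, cty > 20)
made_up_fun()
"""

rows = spot_functions(example, INSTALLED)
for fun, pkg in rows:
    print(f"{fun:12} {pkg}")
```

In this toy run, `filter` resolves to dplyr rather than stats because dplyr was attached after the defaults, and `made_up_fun` falls through to `(unknown)` because no entry in the table exports it, which mirrors the not-installed case described above.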

Creating reference tables for your own code

But outside of just creating reference tables for blogs or books or screencasts, I think another set of examples is just your own code. I know for me my own code snippets can get really disorganized and tough to keep track of. Here is a tweet from Chelsea where she says, "I have so many snippets of rstats code and untitled files that have useful bits of code that I want to catalog somewhere but I don't know where would be best." And I definitely have this experience, particularly with my GitHub gists. If you don't know what GitHub gists are, they're just a place online where you can save little code snippets. And for me, I know I'll make some little example of something and post it on GitHub gists and then kind of forget about it. And then sometime down the road I'll think, OK, I know I did something along these lines and I can't quite remember what that was. And this is where I think creating a little reference table for yourself, so you can go back and glance through those or search through what functions and packages you were using in your prior examples, can be really useful.

And I'm using this example just for your own code, but you can picture that in an organizational context it may also be useful to do this across a wide range of different folder structures so you can see, OK, what functions and packages are different people using. And just one little caveat about these examples: they're a little bit manicured. Sometimes, depending on what the dependency structure looks like or what the files that you're actually saving look like, there may be a little bit of cleanup that I'm just not exposing you to here because I don't want to scare you away too much.

Using funspotter to organize blogdown posts

OK, so up to this point I've gone through a few examples where we are just creating reference tables for yourself, but this can also be used for actually directly organizing your files. And I'll walk through an example using blogdown. So I started blogging in 2020 after the RStudio conference in San Francisco. Rebecca Barter had a really great talk that inspired me to start blogging more regularly. And one issue I ran into when creating my blogs initially was my tags really weren't coherent at all. I'd have ML, machine learning, I'd have data science spelled three different ways. I had a bunch of synonyms for each other in my tags, to the point that the tags were really not a useful way of actually organizing my posts.

And I thought a better way of organizing my posts would be to just directly see which packages are being used in each post and use those as the way to actually organize those tags. So that's where this function spot_tags() can be really helpful. What this does is, when it actually renders the file, it first goes through and checks which packages are actually being used and then puts those into a format that can be used for the little tags at the top of the header for your post. So now, rather than actually having to put any thought into what my tags should be for my blog, I just have spot_tags() and it does all that for me. I don't need to type anything out, and it's one less thing to worry about when blogging can already be a somewhat intimidating thing to do in the first place. And then if you actually go and look at what the tag section of my website looks like, all my posts are organized based off of whatever packages are used within them, which I think is this nice, clean, coherent way of setting things up.
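As a rough sketch of the idea of deriving tags from the packages a post uses, the Python below scans a post's source for `library()`/`require()` calls and `pkg::` prefixes and formats the result as an inline YAML list. This is not the package's actual code: the `tags_line` helper, the regular expression, and the exact output format are assumptions made for illustration.

```python
import re

# Either library(pkg) / require(pkg), or a pkg:: prefix on a call.
PKG_USE = re.compile(
    r"\b(?:library|require)\(\s*([A-Za-z][\w.]*)\s*\)"
    r"|(?<![\w.])([A-Za-z][\w.]*)::"
)


def tags_line(post_source):
    """Build a YAML-style tags line from the packages used in a post."""
    pkgs = []
    for attached, prefixed in PKG_USE.findall(post_source):
        pkg = attached or prefixed
        if pkg not in pkgs:  # keep first-seen order, drop repeats
            pkgs.append(pkg)
    quoted = ", ".join(f'"{p}"' for p in pkgs)
    return f"tags: [{quoted}]"


post = """
library(dplyr)
library(ggplot2)

x <- purrr::map(1:3, sqrt)
dplyr::glimpse(mtcars)
"""

line = tags_line(post)
print(line)
```

Because the tags come straight from the code, two posts that use the same package always end up with the identically spelled tag, which is the coherence problem described above.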

Analyzing package relationships with network plots

Okay, so now I've talked through some examples of creating reference tables. I just went through an example of how you can use this to organize your files directly. And the third area that I just want to touch on really briefly is that I think it can also be helpful to actually analyze this information. So there's a lot of potential data here. This is again Julia Silge's blog. Maybe we want to see how these different posts and packages relate to one another. So here's another example where I'm passing these into this network plot. In this case I'm actually just parsing out the packages, not the functions. So in the network plot, each circle here represents an individual package. Each square represents an individual file. The connections represent those cases where a package is loaded within that particular script.
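The data behind a plot like this is a two-sided (bipartite) graph: files on one side, packages on the other, and an edge wherever a file loads a package. A minimal Python sketch of that structure follows. The file names and package lists are invented for the example and are not taken from any real blog, and it stops at the edge list and simple counts rather than drawing anything.

```python
from collections import Counter

# Hypothetical spotted results: file -> packages loaded in it.
spotted = {
    "post-a.Rmd": ["tidyverse", "tidymodels"],
    "post-b.Rmd": ["tidyverse", "tidytext"],
    "post-c.Rmd": ["tidymodels", "tidyverse", "vip"],
}

# One edge per (file, package) pair; the set drops accidental repeats.
edges = sorted({(f, p) for f, pkgs in spotted.items() for p in pkgs})

# Degree of each package node = number of files that load it.
degree = Counter(pkg for _, pkg in edges)


def shared_packages(file_a, file_b):
    """Packages two files have in common, a crude measure of relatedness."""
    return sorted(set(spotted[file_a]) & set(spotted[file_b]))


print(degree.most_common(2))
print(shared_packages("post-a.Rmd", "post-c.Rmd"))
```

Packages with a high degree are the ones that would sit at the center of a cluster in the plot, and the shared-package count is one simple way to ask how related two files are.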

And you may notice that there's some clustering of groups here, and we see, okay, there's these two big clusters. One of these is packages associated primarily with the core tidyverse, and then there's another set of packages that are associated primarily with tidymodels. So just glancing at this, you get a sense that Julia obviously writes a lot about tidyverse and tidymodels and does a lot of tidyverse code in terms of doing her cleanup. You can imagine another author who writes more about spatial analytics may have some different set of packages central to what their posts are largely about.

And this is just one example. I think there's a lot of different ways you could take this. You could imagine trying to explore the relatedness between different projects or between what different authors are writing about. Or you could also use this to help organize those files directly based off of what types of things actually exist within them. So this is just a little taster example, but I think there's a lot of interesting things you could do here. And I just want to reference a few other interesting projects. This is not within the R community directly, but if you're interested in visualizing code repos or using the information that's within the files to help organize those files, these are just a couple of other projects I think are worth looking into.

Okay. So I hope that you will jump in and try out the package. I'm planning on submitting it to CRAN in the next few weeks. So if you try it, I'd love to get feedback from you. And I also have on my README just a few examples of reference tables created using funspotter. If you use this and create some reference table, definitely open up an issue or a pull request and I'd be happy to add that there just to get some other examples. So thank you.