How to mitigate package security risks with Posit Package Manager

Sep 27, 2023
31:34

Transcript

This transcript was generated automatically and may contain errors.

Hi, everybody. I'm Rachel and I lead our customer marketing here at Posit. It was so nice to see so many of you last week at the conference, and thank you to everybody who joined us both in person and virtually. A lot of our time was focused on last week's conference, as you could imagine. So rather than skipping the session this month, we're just going to do things a little bit differently. At last week's conference, Joe Roberts gave a talk on package security that I thought would be helpful as part of our workflow demos here. I recorded this part because I'm actually on vacation this week, but Ryan and Joe will be hanging out with you all in the chat right here. So if you have any questions today, or maybe follow-up questions from last week, you can ask them in the YouTube chat. And we also have a Slido link where you can ask them anonymously too. And I've shown that Slido link here on the screen. But thank you again for joining us and I'll turn it over to Joe.

Hi, I'm Joe Roberts and I'm a product manager here at Posit. And one of the most common questions I get asked, especially by IT administrators and security teams, is how to make sure the public packages that their data scientists want to use are safe. So today we're going to explore that topic. And we're going to start with an overview of the main public package repositories used by data scientists working in R and Python to understand how they work, and more importantly, how they differ in their approaches to package publishing. We'll also explore some of the more common types of security risks in using public packages. And we'll look at strategies to mitigate those risks using tools like Posit Package Manager.

Overview of public package repositories

So let's get started. And let's take a look at the main public package repositories for data science.

CRAN is the primary repository for R packages. Bioconductor is also widely used for R packages, primarily in data science and life sciences. And on the Python side, there's the Python Package Index, or PyPI. CRAN and Bioconductor take a similar approach to packages. And if you've ever submitted a package to be published on CRAN, you know exactly how stringent those requirements are. Each package has to pass a comprehensive set of checks to ensure the package is properly formed and documented, builds properly, and doesn't cause any other packages to fail. Most importantly, from a security standpoint, these packages must be submitted entirely as source code so there's full transparency of what is contained within the package, and it will go through an actual human verification before being published. It's like the exclusive club with the bouncer out front, where they may not make sure you can dance, but they definitely are going to make sure you're well-dressed before letting you in.

In contrast, PyPI takes the opposite approach. PyPI is much more like Wikipedia, a completely open repository where anyone can create an account and publish packages, as either source or binary packages, which then become instantly available to everyone around the world. This isn't necessarily a bad thing, as it does make a far wider array of packages available, though it does pose some additional security risks that we'll have to take a closer look at.

Package quality risks

So let's start with the security risks now, and the first security risk is around the more fundamental question of package quality, or does this package accurately do what it claims to do? And that's not always an easy question to answer, especially if you're not a domain expert, but there are some other more objective factors you can look at to help you assess quality. Things like the number of package downloads, or how many other people are using this package. The author's reputation: is this package author well known in the community, do they have a lot of other well-used packages that they've published? Is this package well documented: are the functions documented, are there instructional guides, vignettes, or other tutorials available? How frequently is the package updated: when was the last update, is it actively maintained and frequently updated with fixes? And finally, does the package include a comprehensive set of tests to verify that the functionality is working correctly?

So how do we mitigate package quality concerns? Well, the obvious option is for someone to carefully scrutinize the package and make a determination, but as we pointed out earlier, that's not always easy. Fortunately, the community has developed some tools to help with those assessments. For example, our friends over at the R Validation Hub have produced an R package called riskmetric that takes the factors I just talked about and many more quality metrics into account to help you get an objective assessment of package quality. And once you've done your reviews and done your assessments and decided which packages you want to allow your data scientists to use, you can leverage a feature in Posit Package Manager called curated repositories, allowing you to create a separate repository just containing the CRAN or PyPI packages that you've approved for use in your team or organization.
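As a rough illustration of turning the factors above into a repeatable checklist, here is a minimal Python sketch. It is not how riskmetric works; the `PackageFacts` fields, the `quality_flags` function, and the thresholds (1,000 monthly downloads, one year since the last release) are all hypothetical and would need to be tuned by whoever does the review.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PackageFacts:
    """Hand-collected facts about one package (all fields are illustrative)."""
    name: str
    monthly_downloads: int
    author_other_packages: int
    has_function_docs: bool
    has_vignettes: bool
    last_release: date
    has_tests: bool


def quality_flags(pkg, today, min_downloads=1000, max_age_days=365):
    """Return a list of human-readable concerns; an empty list means none found."""
    flags = []
    if pkg.monthly_downloads < min_downloads:
        flags.append("low downloads")
    if pkg.author_other_packages == 0:
        flags.append("author has no other packages")
    if not pkg.has_function_docs:
        flags.append("functions undocumented")
    if not pkg.has_vignettes:
        flags.append("no vignettes or tutorials")
    if (today - pkg.last_release).days > max_age_days:
        flags.append("not updated recently")
    if not pkg.has_tests:
        flags.append("no test suite")
    return flags
```

A reviewer could treat any non-empty result as a prompt for closer inspection rather than as an automatic rejection.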

Vulnerabilities in packages

So next let's talk about what you typically think of when you say security, and that's vulnerabilities. And these can occur at any point in the package, from the source code, to issues introduced in compiled code, to vulnerabilities that exist in external libraries that are used by the package. And it's important to note that while these are sometimes malicious, most often these are just due to mistakes in poorly written code. But they're still just as important to assess and resolve.

So mitigating vulnerabilities is actually a little bit easier for public packages, and that's because the packages are known published quantities. You don't generally need to worry about scanning them yourself because numerous others are scanning these already published packages and reporting any vulnerabilities on them. So the best resources for mitigating those are databases of known vulnerabilities. In CRAN's case, there are relatively few vulnerabilities reported, partially due to that strong curation of packages by the bouncer at the door. But Posit has recently started working with the R Consortium on building out an advisory database to track known vulnerabilities in a standardized format that was first adopted by the widely used Python Packaging Advisory Database that's maintained by the Python Packaging Authority. Both of these now feed into the OSV open source vulnerability database that was created by Google and is the best source for looking up known vulnerabilities in both R and Python packages at this point.
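For readers who want to try a lookup, here is a minimal sketch of querying the OSV database for one package version using only the Python standard library. It assumes the public OSV query endpoint at `https://api.osv.dev/v1/query` and its JSON request shape; the helper names `build_osv_query` and `known_vulnerability_ids` are made up for this example, and the network call is not exercised here.

```python
import json
import urllib.request

# Assumed public OSV query endpoint; check the OSV documentation before relying on it.
OSV_QUERY_URL = "https://api.osv.dev/v1/query"


def build_osv_query(name, version, ecosystem="PyPI"):
    """Build the JSON payload for a single package/version lookup."""
    return {"version": version, "package": {"name": name, "ecosystem": ecosystem}}


def known_vulnerability_ids(name, version, ecosystem="PyPI"):
    """Ask OSV for advisories affecting this version and return their IDs."""
    body = json.dumps(build_osv_query(name, version, ecosystem)).encode("utf-8")
    request = urllib.request.Request(
        OSV_QUERY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        result = json.load(response)
    # A response without a "vulns" key means no known advisories were returned.
    return [vuln["id"] for vuln in result.get("vulns", [])]
```

Passing `ecosystem="CRAN"` instead of `"PyPI"` would target R packages, assuming the advisory data described above is available for the package in question.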

Once you have that, you can leverage a feature in Package Manager called Package Blocking, where instead of taking a bottom-up curated approach to approving every package you want to make available, you can instead go top-down and allow the majority of packages available in the public repositories, giving your users the most compatibility with everything that's out there, but still blocking the ones that have known vulnerabilities, and even blocking packages that may have open source licenses that you don't want to make available inside your organization.
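The difference between the bottom-up (curated) and top-down (blocking) approaches can be sketched as two filters over the same public catalog. This is only a conceptual model, not Package Manager's implementation or configuration; the function names, package names, and license labels are hypothetical.

```python
def curated_repository(public, approved):
    """Bottom-up: serve only the packages that were explicitly approved."""
    return {name: lic for name, lic in public.items() if name in approved}


def blocking_repository(public, vulnerable_names, blocked_licenses):
    """Top-down: serve everything except known-vulnerable or unwanted-license packages."""
    return {
        name: lic
        for name, lic in public.items()
        if name not in vulnerable_names and lic not in blocked_licenses
    }


# Hypothetical public catalog: package name -> license label.
public = {
    "pandas": "BSD-3-Clause",
    "leftpadx": "MIT",
    "gplthing": "GPL-3.0",
    "oldcrypto": "MIT",
}

curated = curated_repository(public, approved={"pandas"})
blocked = blocking_repository(
    public,
    vulnerable_names={"oldcrypto"},
    blocked_licenses={"GPL-3.0"},
)
```

In this toy catalog the curated view exposes one package, while the blocking view exposes everything that is neither flagged as vulnerable nor under the disallowed license.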

Package confusion: typo squatting

Now, separate from the security risks in the packages themselves, there are a couple more malicious risks that prey on weaknesses in the infrastructure that is used to distribute these packages. We group these into a category called Package Confusion, which in simplest terms is all about deceiving the user into installing the wrong package. So instead of getting the package they want to use, they are tricked into installing a malicious package that they don't. The two risks we'll talk about today are typo squatting and dependency confusion, which are some of the more common ones and most high-profile ones that we've seen in the last year.

So I'm going to start and give some examples in terms of Python packages, as due to that more permissive nature of PyPI, these have shown up in high-profile real-world exploits in the past year. But the same cases can apply to R package repositories as well.

So first, let's learn about typo squatting. And let's start with you, the data scientist or developer. And let's say you want to start a machine learning project. And so you want to grab the TensorFlow package from PyPI. You use the standard Python installation utility pip and say pip install tensorflow, which is the command to install TensorFlow, equivalent to doing install.packages in R to install a package from CRAN. And so pip goes out to PyPI, looks for a package called tensorflow, downloads it, installs it into your environment. Perfect. Everything's fine.

But now let's start again, back in our clean environment. We've still got PyPI there, but this time when we try to install TensorFlow, we accidentally make a typo and add two Ns in the name of the package. So we have pip install tensorflow with two Ns. Pip goes out and looks for it on PyPI and can't find a package by that name and returns an error. You notice you mistyped the name, fix it and try again. All is well. That's what you expect to happen in those situations and there's nothing inherently wrong with that.

But here's where things get scary. So let's go back to our clean environment here. And let's say that some evil person, up to no good, looks at the most popular public packages on PyPI. Like TensorFlow, which, for example, gets downloaded from PyPI on average about 600,000 times a day globally. And so they take that and they write their own malicious package and upload it to PyPI with a bunch of different similar names that are common things that you might type by accident if you were trying to install the actual TensorFlow package.

So now let's take our same scenario. We try to install TensorFlow with two Ns in it and make the same typo as before, but instead of getting an error this time, pip goes out to PyPI, finds this malicious version that someone put there with the same name, and downloads it, and we've been exploited now because we've gotten the wrong package here.
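One lightweight safeguard against the scenario above is to compare a requested name with a list of well-known packages before installing it. The sketch below uses Python's standard `difflib` module; the `POPULAR` list, the `possible_typosquat` helper, and the 0.85 similarity cutoff are illustrative assumptions, not part of pip or Package Manager.

```python
import difflib

# Hypothetical short list of well-known package names to compare against.
POPULAR = ["tensorflow", "pandas", "numpy", "requests"]


def possible_typosquat(requested, popular=POPULAR, cutoff=0.85):
    """Return the popular name this request closely resembles, or None.

    An exact match is treated as fine; only near-misses are reported.
    """
    name = requested.lower()
    if name in popular:
        return None
    matches = difflib.get_close_matches(name, popular, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suspect = possible_typosquat("tennsorflow")  # the two-N typo from the talk
if suspect is not None:
    print(f"'tennsorflow' looks like a misspelling of '{suspect}'; check before installing")
```

A check like this only catches names close to the packages on the list, so it complements rather than replaces the repository-side mitigations discussed in the talk.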

So these are tough to mitigate against, but there are a few things that we can do. And I should point out again that these are higher risk with PyPI than on CRAN due to the more permissive publishing nature we talked about. But one thing working to our advantage in that case is that there are many organizations and companies in the Python community watching for these malicious packages now. So they're identified relatively quickly once they are published, usually within a few days or less. So the simplest mitigation we can do is just avoid the latest packages and updates from public repositories and give them time to be discovered, reported, and blocked or removed. And so latest isn't always the safest there, and sometimes it's worth it to sacrifice a bit of cutting edge and compatibility in the name of security.

So in Package Manager, we can leverage what we call repository snapshots to easily pin our installation source to slightly older versions of the repositories. We also get the added benefit of reproducibility of our projects by not always getting updates to the packages that might break our existing code. And it's not a perfect solution, but it's one tool that can be used together with these other strategies we're talking about to reduce our risks.
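To make the "stay a few days behind" idea concrete, the sketch below computes a snapshot date some days in the past and builds a pip index URL from it. The URL layout (`<base>/pypi/<YYYY-MM-DD>/simple`) and the seven-day delay are assumptions for illustration; the exact snapshot URL for a given server should be taken from that server's own setup page.

```python
from datetime import date, timedelta

# Base address of the hosted service mentioned in the talk; the path layout
# below is an assumption and may differ on a self-hosted Package Manager.
P3M_BASE = "https://p3m.dev"


def snapshot_index_url(today, delay_days=7, base=P3M_BASE):
    """Build a pip index URL pinned to a snapshot `delay_days` before `today`."""
    snapshot = today - timedelta(days=delay_days)
    return f"{base}/pypi/{snapshot.isoformat()}/simple"


url = snapshot_index_url(date(2023, 9, 27))
print(f"pip install --index-url {url} tensorflow")
```

Writing the resulting URL into pip's configuration once means every install in the project draws from the same dated snapshot, which is also what gives the reproducibility benefit described above.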

Package confusion: dependency confusion

Finally, I want to talk about a different variant, which we call dependency confusion, that preys particularly on larger teams, especially those who develop and share their own internal packages to supplement their work.

So let's go back into our clean environment here, where we're now part of a larger company that has its own internal packages. Let's say we're at Posit, and we have our internal Posit tools package that we use, along with some public packages from PyPI. We may even be so sophisticated as to have our own internal packages in a PyPI-like repository so that pip can install from there.

So we want to install Posit tools. We add this extra index URL flag pointing to our internal server so that pip can find where our additional packages are. In reality, this extra index URL setting is probably set as part of our system configuration, so we don't have to remember to type it every time we install something, but I include this here because this is exactly what would actually happen in reality. So let's say pip goes out, looks for a Posit tools package on PyPI, doesn't find it, realizes it's not there, so it's probably an internal package, and then searches the extra index URL, finds it on our internal repository, and installs it. I also need pandas, so I do the same command, but this time, it does find it on PyPI and installs it just as expected. So everything's perfect until our malicious actor steps in.

So an evildoer is targeting Posit employees, for example, and maybe guesses that we probably have an internal package named Posit tools that's not available on PyPI. A lot of companies have internal packages like this for connecting to internal resources, databases, or other internal sources, and so this malicious actor creates their own malicious package and gives it a name they think that our company might use internally, like Posit tools, and they publish it on PyPI, probably with a large version number, in this case like 9.0, so it looks newer than anything else.

And now we go back to our user who's trying to, say, upgrade the Posit tools they have installed, and so they, same as before, have their extra index URL, and pip goes out, and it sees this newer version of Posit tools, not from our internal repository, but available now on PyPI, and it doesn't know any better, assumes that's what you want, and installs it, and you've been exploited.
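The failure mode can be reproduced with a toy resolver. This is a simplified model, assuming the installer pools candidates from every configured index and takes the highest version; the index contents, the `posit-tools` name, and the `naive_resolve` function are hypothetical, and real installers handle versions and priorities in more detail.

```python
def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def naive_resolve(name, indexes):
    """Pick the highest version of `name` found on ANY index, ignoring its source."""
    candidates = [
        (parse_version(version), source)
        for source, packages in indexes.items()
        for version in packages.get(name, [])
    ]
    if not candidates:
        raise LookupError(f"no index provides {name}")
    version, source = max(candidates)
    return source, ".".join(str(part) for part in version)


# Hypothetical state after the attacker publishes a look-alike with a big version number.
indexes = {
    "pypi": {"pandas": ["2.1.1"], "posit-tools": ["9.0"]},
    "internal": {"posit-tools": ["1.2.0"]},
}

print(naive_resolve("posit-tools", indexes))  # the public 9.0 wins over the internal 1.2.0
```

Because the resolver has no notion of which source is trusted for which name, the inflated version number alone is enough to redirect the install.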

So, you know, there's lots of variants of this, not just directly installing packages, but these packages being dependencies of other packages, and all of this can happen in a similar way here. So this one's actually great because Posit Package Manager can completely insulate you from this case using a unified local and public repository. So in that case, we access all of our packages through Package Manager, we put our internal packages in front of the public package repository, and present that to the user as a single repository, taking that decision of which package to install completely out of pip's hands.

So now when we ask pip to install Posit tools from our repository here, again, it's probably been pre-configured, so we don't have to add the Package Manager address directly, but there's only one place that pip knows to pull things from. And so pip goes out and asks Package Manager, I need the Posit tools package, and Package Manager knows, hey, I'm always going to give you the internal one, because I know that the internal package supersedes the public one, and I will never ever serve you the public package, and our case is solved. Similarly, using the same server, I ask for pandas. Pandas is not in our internal package source, so Package Manager says, okay, here you go, here's the public one from PyPI. And we're still taking advantage of all of the other security measures that we can also use: curated repositories, package blocking, or even repository snapshots can all be used in conjunction with this unified local and public repository. Everyone's happy, and we've solved at least one of many security risks that we have to worry about.
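The "internal supersedes public" rule described here can be sketched as a resolver that checks the internal source first and never falls through to a public package of the same name. This is a conceptual model only, not Package Manager's actual code; `unified_resolve` and the sample data are hypothetical.

```python
def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def unified_resolve(name, internal, public):
    """Serve the internal package whenever one exists; otherwise use the public one."""
    if name in internal:
        # A public package with the same name is never considered, whatever its version.
        return "internal", max(internal[name], key=parse_version)
    if name in public:
        return "public", max(public[name], key=parse_version)
    raise LookupError(f"{name} is not available")


# Hypothetical sources, including the attacker's public look-alike at version 9.0.
internal = {"posit-tools": ["1.1.0", "1.2.0"]}
public = {"pandas": ["2.1.0", "2.1.1"], "posit-tools": ["9.0"]}

print(unified_resolve("posit-tools", internal, public))  # internal 1.2.0, not the public 9.0
print(unified_resolve("pandas", internal, public))       # falls through to the public source
```

The key design choice is that the name, not the version number, decides which source is consulted, so an inflated public version has no effect.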

Summary and Posit Public Package Manager

But in summary, the reality is we can't deny that public packages do present risks, but really understanding those risks gives you the knowledge to manage them. Today, we've talked about some of those strategies, and how tools like Posit Package Manager can help you reduce some of those risks.

And for those of you interested in learning more about Posit Package Manager, I want to make you aware of our free hosted service, Posit Public Package Manager, or P3M. We provide full mirrors of CRAN, Bioconductor, and PyPI that are free to use, including historic snapshots that you can take advantage of, as well as the added benefit of our Posit-built binary CRAN packages to make things easier and faster to install in your R environment. You can check it out and explore it today at p3m.dev, and definitely reach out to Posit if you're interested in learning more about the advanced risk management features we talked about today, and how you can bring the power of Posit Package Manager into your own organization. Thank you so much.

Thanks so much, Joe. It's great to be able to dive deeper into Package Manager today. And while we're not jumping over to a live Q&A session here, both Joe and Ryan Johnson are here hanging out in the chat to answer any questions that you have. So we'll leave this chat open for the next 15 minutes or so. You can type your questions into YouTube or use the Slido link for anonymous questions. The short link for Slido is pos.it/demo-questions, which you can see on the screen.

We host these monthly end-to-end workflow demos on the last Wednesday of every month. We'd love to have you join us again. Over the last five months, Ryan Johnson has walked us through workflows for deploying visualizations to stakeholders with Dash and Shiny, shared how to create scheduled and company-branded Quarto docs for redundant reports, and also shown two ways to ensure consistent and up-to-date data in your work, with APIs and also with pins. The links to those previous workflow demos are also in the YouTube description below. But thank you again for joining us today. Ryan and Joe will be here in the chat for questions, but have a great rest of the day, everybody.