Resources

Luis Lopez - earthaccess: Accelerating NASA Earthdata science through open, collaborative development

video
Oct 31, 2024
17:49

Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Luis Lopez and I'm a software engineer at one of the NASA Earth Science Data Centers. I hope that I don't mention a lot of acronyms because that's, you know, a disease that NASA and government agencies have, but I'll try to explain all the acronyms, and hopefully this is not a boring talk for you. So I'm going to present a Python package that has improved the experience of scientists trying to get data from NASA Earth Sciences.

This is basically the gist of it. The message to take home is that a couple of years ago we started developing this package that makes it easier for scientists to access satellite data, basically, and reduces the time to science. This is a must, I think, for the times that we're living in, and it has now become kind of a small community-led project that has helped a lot of different research groups accelerate their science. And the other message that you have to take home is that we have the coolest logo. You are probably an R user, and you may notice something particular about our logo, and that is that it fits with the R ecosystem. And the reason for that is that it was designed by Allison Horst, the designer for a lot of the R Shiny packages and different things in the R world. So we're really happy about that.

Why this matters: the state of our planet

Now, I'm going to tell you the story of why, right? The main reason why is, as you probably all know, what is happening on our planet lately. How many of you have experienced it? Every day you read in the news, or you experience yourself, that there is something off with the planet, either in the form of a natural catastrophe or in the form of, you know, disruptions to what used to be normal. And now we're living in a world where things like this happen more often. And I'm not going to dive too deep into the science of why this is caused by humans because that's, I mean, a given. But one important thing to take into account is that we're altering the ecosystems and the natural systems of the planet.

So one way science looks at this is called the planetary boundaries. The planet has its own way of, you know, balancing different systems. This is a way of looking at the whole thing, right? We have ocean acidification. We have pollution in the atmosphere. We have things like, well, sea level rise. And all of those things belong to a system whose parts interact with each other. So they're not isolated. And to study the whole thing, you basically need to pull information from different sources.

And if you're a research scientist, most likely you're going to deal with different data inputs from different data centers and from different research groups to try to understand how we're altering the systems and what we can do about, for example, climate adaptation. And this is very pressing, because you can see the progression from one year to the last year.

NASA's Earth data infrastructure

But the good thing is that, you know, over the last 50 years, NASA and other government agencies have been monitoring the planet and gathering a lot of information about it. So we have what NASA calls an Earth fleet. All of the satellites that are not looking outward at the planets and stuff are in a science division that NASA has, which is called the NASA Earth Sciences Mission Directorate. And all of that data, starting from the 80s, contains invaluable information about our planet. Each of those satellites has very complex sensors, and very accurate measurements are made on a daily basis. That information is sent back to us and distributed to one of these data archive centers. Historically, there have been 12, and they are divided by kind of a science domain expertise. And another thing that is important to note from that graphic is that the information is booming, because each new mission probably carries a more complicated sensor, or more sensors, or more coverage. So it's an exponential growth in the information that scientists now have to deal with.

The cool thing about having all this information is that now scientists can actually answer research questions at a planetary scale. Before having satellites and remote sensing, we had surveys, we had, you know, these big research campaigns that took a lot of effort. But with a satellite, scientists basically have the information at their fingertips, or should have the information at their fingertips, to say, for example, what the biomass of the boreal forest is or what the extent of the sea ice is. And the answers to those questions have big implications for the planet and for the understanding of the systems that I was explaining.

Now, to get to these nice graphics and the data processing for answering those questions, they have to deal with a lot of the data that's coming from the satellites, of course. But once they have that information, they can definitely answer those questions. And one of the cool things that they can do with that data is to create machine learning models, for example. This is from a very recent article in The New York Times that talks about the latest Google DeepMind model, called GraphCast, which uses a lot of these observations and machine learning to predict the trajectory of a hurricane, for example. And as you can see, the model did better than the actual numerical models that are in blue at the bottom.

It takes a lot of data and computation to come up with this stuff. But the good thing about it, the promising thing about it, is that it can help us, you know, adapt to what's coming or to what we're already living through today.

The problem: complexity blocking scientists

Now, so we have data, we have AI, we have scientists. Now let's get to work, right? We're going to solve this thing, and if we put it all together and we have smart policies, then we shouldn't have to worry about climate change. But first we need to start by finding the data and working with the data. And I really like this tweet from a NASA scientist that I put in all the presentations that I give. It's basically about a pain point: if you're a research scientist who works with some of this, you know, scientific data, or any data, you have probably come across the same issue. You're trying to find something useful for your research and you Google it, and then you end up on a government page that has some links, and, you know, that link takes you to another link, and that takes you to another link, until you finally find the information that you're looking for.

But that gets complicated. So, of course, NASA being NASA is like, well, there must be an easier way, right? Or a more efficient way of distributing this information. Of course, let's create an API. So now, say that you're looking for some data about the ocean or some data about aerosols in the atmosphere. NASA developed APIs to get to that information. But because historically this data has been distributed across these different data centers that are domain-specific, well, something happened with the evolution of these APIs. And that is that every data center in NASA has its own set of APIs and its own set of, you know, ways of getting to the data.

Now, this is changing because NASA is migrating all this information to the cloud. All this information is like halfway into AWS. But the APIs are kind of an artifact of the history of how this data was distributed. Now, once you take those APIs and put them in the cloud, that's when the real complexities start to emerge for a research scientist. So you are a biologist, but guess what? Now you have to learn AWS and take a course in cloud architecture. As you can see, this diagram illustrates a little bit of the journey: oh, so NASA has the data in the cloud, so I know that I can access it in an S3 bucket. S3 buckets are, you know, the Amazon way of storing objects in the cloud. So you try to access it, and then you find all of those obstacles that are related to how the APIs work together. And if you're a software engineer, well, you kind of get that. If you're not a software engineer, like most research scientists, then you have a bad day.

And then it leads to this. You start writing code just to deal with AWS, just to deal with how to get the data. And by the end of the day, you are frustrated because it's like, show me the data. I just want the data. I don't want to learn, you know, software engineering or AWS. And we can all agree that it shouldn't be that way. Science is, you know, a collaborative effort. And my vision, and the people that I work with have the same vision, is that it shouldn't be exclusive, right? It should be inclusive. And everybody that can participate in science should participate in science. And software shouldn't be a limitation. So no one should be left behind. Scientists, most importantly.

The earthaccess solution

So in late 2021, NASA partnered with Openscapes. And if you were here last year, you know what Openscapes is. But if not, it's a really cool project about teaching open science practices to different cohorts of scientists across, you know, different organizations. So NASA partnered with Openscapes to kind of learn together: okay, we know that this is painful. We know that data in the cloud is going to be problematic for the scientists. So how can we all try to solve this problem together?

And well, then we went back to what is problematic for the scientists. And we kind of divided the problem into, you know, the operations that you have to do in order to get to the data. Regardless of where the data is, you need some authentication. You need to search for the data. And once you find the data, you need to access it. So once you divide it like that, two years ago, it occurred to me that, oh, this should be in a package, because those operations are very concrete. And that's how the package that I'm talking about came to be.

So earthaccess is really simple. There are probably a lot of things going on under the hood, but for the scientists, it's basically three lines of code. They have to have a credential with NASA. They have to search for something, and you can use the DOI for the data set, if one is available, your bounding box, your temporal domain. And you can use more of those things. But as you can see, even if you don't understand Python, you can see that it's very concrete and very simple.
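The three operations described here (authenticate, search, access) can be sketched roughly as follows. This is a minimal illustration, not the code from the slides: `build_query` and `fetch` are hypothetical helpers introduced only for this example, the DOI in the comments is a placeholder, and the `earthaccess` calls (`login`, `search_data`, `download`) need network access and an Earthdata Login account to actually run.

```python
def build_query(doi=None, bounding_box=None, temporal=None):
    """Collect the search filters mentioned in the talk, skipping unset ones.

    Hypothetical helper for illustration; it is not part of earthaccess.
    bounding_box is (west, south, east, north) in degrees.
    temporal is a (start, end) pair of date strings.
    """
    query = {}
    if doi is not None:
        query["doi"] = doi
    if bounding_box is not None:
        west, south, east, north = bounding_box
        if not (-180 <= west <= 180 and -180 <= east <= 180):
            raise ValueError("longitudes must be within [-180, 180]")
        if not (-90 <= south <= north <= 90):
            raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
        query["bounding_box"] = (west, south, east, north)
    if temporal is not None:
        start, end = temporal
        query["temporal"] = (start, end)
    return query


def fetch(query, local_path="./data"):
    """Run the three steps: authenticate, search, access."""
    # Imported lazily so build_query stays usable without the package installed.
    import earthaccess

    earthaccess.login()                         # 1. authenticate with Earthdata Login
    results = earthaccess.search_data(**query)  # 2. search for matching granules
    return earthaccess.download(results, local_path=local_path)  # 3. access the files


# Example call (placeholder DOI, not a real data set):
# fetch(build_query(doi="10.0000/EXAMPLE",
#                   bounding_box=(-10, 20, 10, 50),
#                   temporal=("2020-01-01", "2020-12-31")))
```

Keeping the filter handling separate from the network calls is only a convenience for this sketch; in practice the same keyword arguments can be passed straight to `earthaccess.search_data`.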

Now, this is a real science example for earthaccess. This is searching for some data, getting some of what in the NASA lingo are called granules, or files from the satellite, and plotting that data using xarray, the library that we heard about in a previous talk. So this is the whole code for plotting sea level rise: the line on the right is from the satellite, and the other is from a different data set that came from buoys in the ocean. Well, you know, you're looking at it, and it's not as long as it used to be, because now earthaccess allows scientists to focus on their science and not on the other problems of accessing the data. So with very few lines of code you can basically come up by yourself with this infographic that NASA has on their sea level rise group webpage.
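A workflow of this shape (search for granules, open them, hand them to xarray) might look like the sketch below. It is not the code shown in the talk: `open_granules` is a hypothetical wrapper, the short name and variable name in the comments are placeholders, and the `client` parameter exists only so the function can be exercised without network access. By default it uses the real `earthaccess` module and its `login`, `search_data`, and `open` functions.

```python
def open_granules(short_name, temporal, count=10, client=None):
    """Search for granules and return file-like objects ready for xarray.

    Hypothetical wrapper for illustration. `client` defaults to the
    earthaccess module; any object with the same three functions works.
    """
    if client is None:
        import earthaccess as client  # needs network and Earthdata credentials

    client.login()
    granules = client.search_data(
        short_name=short_name,  # data set short name (placeholder in this sketch)
        temporal=temporal,      # (start, end) date strings
        count=count,            # cap on the number of granules returned
    )
    return client.open(granules)


# With the real package, the returned objects can be passed to xarray:
#
#   import xarray as xr
#   files = open_granules("HYPOTHETICAL_SHORT_NAME", ("2020-01", "2020-12"))
#   ds = xr.open_mfdataset(files)
#   ds["some_variable"].plot()   # variable name is a placeholder
```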

And now the next thing that we're trying to do with earthaccess is, well, we know that this is a toy example, kind of. What scientists really want is to answer those planetary questions that I was talking about. So they don't deal with one file or two files. They deal with archives that contain a thousand or a million files, and you need to scale this thing. So once you apply that same philosophy to doing that, the code doesn't change and the semantics don't change. What the scientist wants is that answer. And the next thing for earthaccess as a library is how to deal with distributed authentication, with optimized reads of the files, I/O, caching, and all of the cool stuff, while keeping the simplicity of the API.
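One generic way to picture that scaling step is to split a long list of granules into batches and fan the per-batch work out to a pool of workers, while the caller still makes a single call. The sketch below uses only the Python standard library and is an assumption about the general pattern, not a description of how earthaccess implements or plans to implement it; `batched`, `process_all`, and the `process_batch` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_all(granules, process_batch, batch_size=100, workers=4):
    """Apply `process_batch` to every batch in parallel, preserving batch order.

    `process_batch` is whatever per-batch science code the user supplies,
    for example opening the files and reducing them to a small summary.
    """
    batches = batched(list(granules), batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_batch, batches))


# Toy usage with integers standing in for granules:
# process_all(range(10), sum, batch_size=4, workers=2)
```

Threads suit this sketch because remote file reads are I/O-bound; a distributed scheduler would be the natural replacement once the work no longer fits on one machine.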

A bridge between ecosystems and languages

So in a way, I think the project has become this technical and social bridge between the APIs that NASA has produced over the years, and is still producing, and the libraries that scientists are using for getting to that data. So you see, you can use earthaccess to get some data and put it in pandas or open it in xarray. And we have this interoperability because, you know, we have this connection of ecosystems. And fortunately, that has resonated with the scientists, because it's like, wow, this is not as complicated as it used to be. And from only me and a few others, the library grew to the point where now a lot of people are using it, because it makes sense. It's simpler.

And the other side of this is that I thought from the beginning that this shouldn't be language specific. So Carl Boettiger, a professor at the University of California, Berkeley, said, oh, we should take some part of earthaccess and do the login mechanism in R, if you are an R scientist. And he started this project called earthdatalogin, which has some of the basic core things that earthaccess does. And the idea that I have is that it becomes an SDK, because that's what scientists will need in the end. They are the same operations, and it shouldn't be limited to one language. Whatever language a scientist uses, that's where earthaccess should be living. And we have had some talks with people that work with Julia. So that will be the next one.

And I think this is the message, right? If we reduce accidental complexity and make things simpler for the scientists, we help them actually solve the problems, because we're not the ones solving the problems. That is the complicated part. I think the technical side, on the left of this graph, is the simple part. The really complicated thing is, once you have, you know, the biggest study about what's happening in the Earth system, like what's happening to the Amazon, what's happening in the rivers, then how do you transform that into policy? That is the complicated stuff. So on the technical side, I invite you all to help your local scientists in solving that part.

And these are some of the people that have helped the library over these last two years. And you might see some familiar faces. But with that, I thank you, and I'll take your questions.