Resources

Saumiitha Leelakrishnan - Partnering with Posit for progress on Environmental Stewardship

video
Oct 31, 2024
20:09

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Good afternoon, everyone. My name is Saumitha Leelakrishnan. I'm a technical specialist on the Global Emissions Center of Excellence team at Cummins.

I'm here to share with you a story, a journey involving a very interesting example of wider adoption of R for business applications. And this story, surprisingly, starts with a deep breath.

If all of you could take a deep breath with me, if that's OK?

Every day, we are breathing, and we need fresh and clean air. We at Cummins recognize it. Cummins leads and invests in innovative data-driven solutions to enhance environmental stewardship through emissions analysis and sustainable practices.

Cummins, a Fortune 120 company with a 100-year history, historically a diesel engine manufacturer, has embraced stricter emission standards over the years and has now broadened its portfolio to include diverse energy sources. The company's investment in clean diesel, natural gas, and now electrified products and hydrogen and fuel cell technology is commendable.

So in this talk, I'll be sharing how partnering with Posit has helped the global product compliance team at Cummins deliver solutions for our customers that are safe and lead to a cleaner environment. Even if you have a different data set, some of the lessons that I've learned in emissions analysis and environmental stewardship will definitely improve your workflow. So let's dive in.

Step one: grasp your needs

So step one, grasp your needs. For all of you in this room to understand the needs of this specific project, I would have to give a picture of what our team does and what the data is. We are a functional excellence team responsible for creating tools and processes for onboard diagnostics and emissions compliance data.

So what is onboard diagnostics? They are systems and algorithms to detect failures that adversely affect engine emissions. Within that data, one metric is very prominent: the in-use monitor performance ratio (IUMPR), which allows us to verify whether the diagnostics are actively making decisions.
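The idea behind the in-use monitor performance ratio can be sketched very simply: the numerator counts the times a diagnostic monitor completed its check, and the denominator counts the drive cycles where conditions allowed it to run. The function and numbers below are illustrative only, not Cummins' actual computation.

```python
def iumpr(monitor_completions: int, eligible_drive_cycles: int) -> float:
    """Ratio of completed monitor checks to eligible drive cycles.

    A low ratio suggests the monitor rarely got a chance to run
    (or rarely completed), which is what the engineers investigate.
    """
    if eligible_drive_cycles == 0:
        return 0.0
    return monitor_completions / eligible_drive_cycles

# Hypothetical engine whose monitor completed 12 times over 80 eligible cycles:
ratio = iumpr(12, 80)  # 0.15
```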

So this onboard diagnostic data, along with a huge amount of engineering data that gets collected from the engine units out there in the real world through telematics and data logger devices, gets uploaded to the cloud.

So I utilized extract, transform, load to begin with. For some of you here who are wondering about the ETL term, it is the process of combining data from multiple sources and moving it into a large centralized repository, often called a data warehouse. ETL uses a set of logical rules to clean and organize the data and prepare it for storage, data handling, data analytics, and machine learning.

So why did we use ETL? High-volume data sets necessitate ETL for the telematics and data logger data collected from the various engine units, which goes to the cloud. The onboard diagnostics data has multiple disparate sources: a data lake, Power BI models, a relational data dictionary, Excel, and many more. Data cleaning and enrichment using ETL in this example means picking the latest image data for each engine unit and associating it with its equipment manufacturer's details.
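The cleaning-and-enrichment step described above ("latest record per engine unit, joined to manufacturer details") can be sketched in a few lines. All field names and values here are made up for illustration; the real pipeline runs at much larger scale on Databricks.

```python
from operator import itemgetter

# Hypothetical raw uploads: multiple records per engine serial number (ESN).
raw_records = [
    {"esn": "E100", "uploaded": "2024-09-01", "idle_pct": 31.0},
    {"esn": "E100", "uploaded": "2024-10-01", "idle_pct": 28.5},
    {"esn": "E200", "uploaded": "2024-10-01", "idle_pct": 12.0},
]
# Hypothetical equipment-manufacturer (OEM) lookup.
oem_details = {"E100": "Acme Trucks", "E200": "Metro Coach"}

# Keep only the latest upload for each engine serial number:
# iterating in upload order means the newest record overwrites older ones.
latest = {}
for rec in sorted(raw_records, key=itemgetter("uploaded")):
    latest[rec["esn"]] = rec

# Enrich each surviving record with its manufacturer details.
enriched = [
    {**rec, "oem": oem_details.get(esn, "unknown")}
    for esn, rec in latest.items()
]
```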

So now that I've identified and utilized ETL, the next step is to move this data to a centralized repository. To achieve this step, I worked with the analytics engineering team to identify and set up Databricks jobs to automate the execution of the data processing.

So the next step is to read this data from the Delta Lake, which is an open-source storage layer. For that, I utilized PySpark, which is an API for Apache Spark, to read the data from the Delta Lake and move it to a centralized repository, which in this case was a SQL database.

Because it is relational data. This relational data is organized into multiple engine platforms, like light duty, medium duty, and heavy duty. As you see, it's further segmented into model years. Different model years have their own engine families, which are grouped based on fuel type, transmission, et cetera. Each engine family can be applicable to various applications, be it a truck, shuttle bus, motor coach, RV, street sweeper, ambulance, any number of applications. And each application has numerous engine serial numbers.
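The hierarchy described above (platform → model year → engine family → application → engine serial numbers) can be pictured as a nested structure. All names here are invented for the example; the real data lives in normalized SQL tables.

```python
# Illustrative shape of the relational hierarchy.
fleet = {
    "heavy_duty": {
        2024: {
            "HD-X15-DSL": {                      # engine family (fuel type, etc.)
                "truck": ["ESN-0001", "ESN-0002"],
                "RV":    ["ESN-0003"],
            },
        },
    },
}

def serials_for(platform, model_year, family, application):
    """Walk the hierarchy; return [] when any level is missing."""
    return (
        fleet.get(platform, {})
             .get(model_year, {})
             .get(family, {})
             .get(application, [])
    )
```

In SQL this becomes a chain of foreign keys (serial number → application → engine family → model year → platform), which is why a relational database was the natural landing zone.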

Step two: use your needs as a compass

So what were the primary needs? We had to move from merely describing the data to making inferences and drawing conclusions, and secondly, to embrace machine learning algorithms to extract meaningful insights. With these primary needs in mind, the idea was to build a full-blown R Shiny web application and host it on Posit Connect so that the 600-plus engineers who have to look into this data globally would be able to utilize it.

So the web application was built. It was first able to highlight the applications that required attention, then show the population distribution of the engine serial numbers based on application type, duty cycle, equipment manufacturer details, and so on, along with multiple interactive visualizations and an export that works offline as well.

Advancing with machine learning

So now we have a web application. Engineers can use it. They can derive insights from it. They have action items to work on. So what is the next step? Is that all there is? No, we would have to advance and adapt.

So in this example, for the engine serial numbers that have been identified as presumably not having accumulated enough of that IUMPR ratio I mentioned earlier, you need some key attributes, like duty cycle and application, so that the engineering teams can go and analyze the data. So how do I find the duty cycle? First, what is a duty cycle? A duty cycle is the history of speed and load conditions over which an engine is operating over a period of time for any selected application.

So I wanted to find the duty cycle with the engineering data I had in hand. In that case, I jumped into using the simplest and most popular unsupervised machine learning model, which is k-means clustering, and I utilized a Python library for it. I worked with subject matter experts to identify five engine features, based on their characteristics, to determine the duty cycle: idle percent, torque curve percent, motoring percent, key cycles per hour, and drive cycle average percent.

So once I identified these engine features, here is how I used them in the model. The k-means clustering model works like this: it first randomly assigns each data point to one of your clusters, then computes the centroid of each cluster. Then each data point is reassigned to the nearest centroid. This keeps going until the cluster assignments stop changing. Thus, finally, you're able to identify the duty cycle of each of the engine serial numbers out there.
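The assign-recompute-repeat loop described above can be written in a few dependency-free lines. In practice a library implementation (for example, scikit-learn's KMeans) would be run on the five engine features; this sketch, with made-up 2-D points and fixed starting centroids, just shows the mechanics.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means: iterate assignment and centroid updates to convergence."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster put
        if new_centroids == centroids:  # converged: no assignments changed
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of points and two starting centroids:
pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
labels, centers = kmeans(pts, [(0.0, 0.0), (9.0, 9.0)])
# labels -> [0, 0, 1, 1]
```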

That gives an insight into the conditions under which those engines were not able to accumulate enough of the IUMPR ratio.

Generating and analyzing reports with Quarto

So the predictive modeling is complete, and we have some actionable insights. Is that all there is, or is our compass taking us further? Looks like it is, because the next step is to generate and analyze reports. Proactive reporting anticipates issues and opportunities in a timely manner so that you can make informed decisions.

So in this case, I utilized Quarto to generate monthly and quarterly reports for each engine family, highlighting the applications and the duty cycle for which the engine units were not able to accumulate enough of the IUMPR ratio we were talking about earlier. So Quarto came in very handy, and these are some of the highlight packages I was able to utilize with Quarto: knitr, so that you can integrate your code, text, and graphs, because the report had not just text but tables, graphs, and interactive visualizations; kableExtra, to bring in pretty tables; and webshot, for the interactive HTML widgets.

And so that these reports were not static, you could go back to the Posit Connect app that we had published earlier, which has many details, whereas this report is something high level, right? So the engineers and the product compliance managers have a way to interact directly with your app and through your report as well.

So why did we use Quarto? Quarto seamlessly integrates code, text, and graphs. Quarto's batch processing automates the generation of multiple reports seamlessly, because at any single point in time you have a minimum of 60 different engine families that you have to generate reports for. It smoothly integrates with Git for version control, and it enables automated report generation.
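Batch-rendering one report per engine family typically comes down to looping over the families and invoking `quarto render` with a parameter per run. The template name, parameter name, and family names below are hypothetical; the sketch only builds the command lines without executing them.

```python
# Hypothetical engine families (around 60 in practice).
engine_families = ["HD-X15-DSL", "MD-B6.7-DSL", "LD-R2.8-DSL"]

# One `quarto render` invocation per family: -P passes an execute
# parameter into the .qmd, --output names the rendered PDF.
commands = [
    ["quarto", "render", "iumpr_report.qmd",
     "-P", f"engine_family:{family}",
     "--output", f"iumpr_{family}.pdf"]
    for family in engine_families
]
# Each command could then be run with subprocess.run(cmd, check=True).
```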

Monitor and maintain

So now we come to the last, but not the least, step as we wrap up this whole journey and pipeline: monitor and maintain. Only once we are able to set up this process to go on seamlessly and automatically will we be able to reap the best of the benefits.

So a monthly and quarterly data refresh was set up using ETL jobs and scripts so that the reports get generated monthly and quarterly. The reports included average IUMPR computations for each engine family based on duty cycle, application, equipment manufacturer details, and so on. And the report included clear actions directed to the engineering teams who are working on fine-tuning the system performance of the engines, so that you can reduce emissions. These were generated as PDFs and made available in the Posit Connect apps and also in a secure SharePoint location.

Data reflections and closing thoughts

And this is a glimpse of the whole journey. By embracing the right tools and technologies, you will be able to build sustainable solutions.

So I also wanted to share my data reflections on this journey. Take the plunge: stepping out of the ordinary and taking bold initiatives drives impactful results and growth. Bring the productive synergy: integrating the best of tools and methodologies, be it R, Python, Databricks, SQL, Quarto, and many more, creates comprehensive solutions. Proactive automation is paramount: as demonstrated by this Quarto report example, it seamlessly builds efficiency into your process.

I hope this talk will benefit the data science community in understanding a real-world example of how to harness the power of both R and Python when you are faced with solving a challenging problem, and how to drive innovation within your organization. Whether you are a data scientist, data engineer, developer, or researcher, some of the lessons that I've learned and shared with you will definitely help your workflow.

And I would like to give thanks for the partnership with Posit, because partnering with Posit has helped the global product compliance team at Cummins deliver solutions for our customers that are safe and lead to a cleaner environment. A special thanks to our Global Emissions Center of Excellence team, Product Compliance and Regulatory Affairs at Cummins, and a big shout-out to the brilliant brains at Posit who have been continuously supporting us. And last but not least, thank you, everyone, for your participation and attention today. Thank you.

Q&A

Thank you, Saumitha. We do have a couple of questions. By the way, this was outstanding.

So what tools do you use to move between R and Python? For example, Reticulate or something else?

No, though going forward, I might. I was kind of convinced at this conf by some of the tools and packages, even thinking about Posit bringing an integration of both worlds, right? But in this case, I had to use Visual Studio Code for anything in Python, and I directly used Posit Workbench for anything in R. And for Quarto, I was using RStudio. So I was completely convinced by Posit yesterday, because, as you see, it's complicated; there are so many things you have to do just to achieve some results. But then the results are sustainable, right? So I went into three different environments, basically, for this project. But going forward, I think I'm going to embrace a few things, for example, Positron and others.

Did the results from the k-means analysis that you did, did that end up in Connect at one point?

Yes, yes. From the k-means data, we were able to identify the duty cycle for each engine serial number. I was taking that data back to SQL itself and relationally connecting it with all of the engine serial numbers so that, in real time, the Posit Connect app can read that data back. And this is real time. The data refresh is monthly, given the amount of data we had to get from the cloud, but anything that is up to date in the back end is live in the Posit Connect app.

There's another question here. How does the Posit ecosystem help with data transparency?

Data transparency. I'm also struggling a bit with what that would mean. So, the Posit ecosystem and data transparency.

The Posit ecosystem was very compatible with bringing a crystal-clear idea to the respective people, in terms of the engineering insights, the regulatory insights, or even to the decision makers, on what this data says. So I did not experience any gaps there, at least. Posit was able to deliver what I envisioned, and even what the regulators and the product compliance team envisioned, very transparently to the whole engineering community.

Well, thank you so much. And that's all the time we have for questions. I appreciate it. Yeah, thank you, everyone.