Intro to Functional Data Analysis - Part 2 | Matthew Malloure, Dow Chemical
Transcript
This transcript was generated automatically and may contain errors.
Thank you so much for joining us today. Welcome to the RStudio Enterprise Community Meetup. I'm Rachel. I'm actually calling in from San Diego today, so more of a morning meetup for me. We're streaming out to LinkedIn and YouTube right now, so I'd love to have you all introduce yourselves and where you're calling in from in the chat. I'm excited to have Matt Malloure here with us today for Part 2 of our Functional Data Analysis Meetup series. A big thank you to Santiago who kicked off the intro to FDA meetup initially as well.
If this is your first time joining one of these sessions, welcome. We're so glad to have you here. This is a friendly meetup environment for teams to share use cases, teach lessons learned, and just meet each other and ask questions. I just wanted to say this as well. While it's called the Enterprise Community Meetup, this is open to everyone regardless of where you work and the tools that you use. I created this group as the Enterprise Meetup so that it would be okay for us to occasionally talk about the RStudio professional products like RStudio Workbench and Connect, but everybody is welcome. Together we are all dedicated to making this an inclusive and open environment for everyone, no matter your experience, industry, or background.
Also during the event you are able to ask questions. You can ask them through the chat. If you're on LinkedIn and YouTube, it will gather them for me. You can also ask questions anonymously through Slido. You can just use the link that I'll show on the screen here in just a second to be able to ask questions anonymously as well.
With that said, I would love to introduce our speaker for today. I'm so happy to be joined by Matt here. Matt is an Associate Research Data Scientist supporting new product development within the packaging and specialty plastics business at the Dow Chemical Company. Matt and I actually met through Santiago's FDA Meetup, which was your first meetup, I think. I wanted to share that too just to say if there's ever a topic or use case that you all would like to highlight from your team, we'd love to hear from you too. If you ever even want to just float an idea by me, please feel free to connect on LinkedIn or email me too. It's just Rachel at RStudio.com.
Awesome. Thanks, Rachel, for the introduction. Thank you all for joining. I'm excited for this opportunity to continue the discussion on functional data. During the Q&A session of Santiago's meetup, I think it was in March, two topics came up: functional PCA (principal components analysis) and functional regression, both of which I've applied in various projects at Dow. So, connecting with Rachel, we said, all right, let's go for part two of the series, and hence we're here today.
So I've broken this talk into four major sections. First, an introduction, both of myself and a quick refresher on functional data and functional data analysis, plus the simulated case study that we'll walk through and the questions we want to solve using functional data methods. Second, again somewhat of a refresher, some data pre-processing steps for FDA, particularly smoothing with penalized splines to create the true functional observations that we want to analyze.
And then the two topics du jour: functional PCA, which will help us answer the first question about clustering different additives, or experimental materials, based on the decay curves we see when measuring a material characteristic; and functional regression, taking those characteristic profiles to predict application performance. In that section it will also be important to compare against traditional or other methods we might use to analyze such data, given various limitations we'll talk about, as well as the tall versus wide data matrix question. And last, of course, some summary and takeaways.
About the speaker
So quickly about myself. I'm native to Michigan, where Dow's global headquarters is located in Midland, but I actually work at our Texas location. Metro Detroit is where I was born and raised, and I didn't travel very far for my first two degrees: Grand Valley State University, on the western side of the lower peninsula just outside of Grand Rapids, Michigan, where I did my bachelor's in statistics and master's in biostatistics. Even though the master's program is technically a terminal degree for most, I still wanted to learn more. I also realized that I wasn't a fan of cold and winter, so I came south, where I've stayed. I went to Texas A&M University for my PhD, which is also a connection with Santiago, as he was also an Aggie.
My PhD research was a grab bag of topics: Bayesian nonparametric goodness-of-fit testing using cross-validation Bayes factors. It's essentially the use of kernel density estimators for the alternative hypothesis in multivariate goodness-of-fit testing, like testing multivariate normality, and it required some data-splitting and divide-and-conquer techniques for cross-validation. So I'll throw in a shameless plug: one chapter of that work is published in International Statistical Review for those interested in checking it out.
So after graduation, I went from College Station to Lake Jackson, Texas, just south of Houston. I'm coming up on my five-year work-a-versary, and for all but about two months of that time I was a statistician within Core R&D, supporting global R&D at Dow in any and all statistical needs, primarily experimental design. Just recently, I took my current role as a data scientist in packaging and specialty plastics, and that was a natural transition because my primary focal point was already the plastics business; it was just taking a step to do more projects within the business. My general specialty areas at Dow, hence today's talk, include functional data analysis, which is probably my main area, in addition to nonparametrics and statistical computing, or simulation.
Recap: what is functional data analysis?
So let's quickly recap functional data and functional data analysis. We're all familiar with the two major types of variables, continuous and categorical, and our usual scalar approaches. Functional variables arise when we don't have a single value to measure the characteristic of interest for an experimental unit; instead we have an entire process, a distribution, or a curve. So we want a class of methods that allows us to analyze functional variables. Here are some examples, all available in the fda package, which stems from Ramsay and Silverman, whose book of the same title, first published in the mid-1990s, is sort of the original bible of functional data analysis.
You can find all of this data within that package in R. Common examples: top left, growth curves. Those of you who have children, or remember going to the pediatrician as a child, know there are unequally spaced milestone visits, but we want to track how individuals grow from birth all the way to 18 years, so each person has an entire growth curve. The Canadian weather data, which Santiago talked about quite a bit: for, I think, 35 weather stations in Canada, we look at the temperature profile across an entire year for each city and try to analyze how the profiles differ, say, based on region in Canada. Functional data can also be two-dimensional or multi-dimensional. An example is writing the letters FDA in cursive on an electronic pad multiple times, so you have a spatial orientation as well. It's rescaled to time, but you're not required to have time as the only dimension, or domain, for your functional observation.
And lastly, something that looks like distributional data, where you can maybe see some challenges in the analysis. It has to do with pinching your thumb against your forefinger and the force exerted. You can see that the force starts and the peak force occurs at different time values in the interval from zero to 0.3 seconds, plus the curves have different heights, so they're shifted as well as scaled. We won't talk about it today, but registration and alignment, say peak alignment, is another pre-processing step that's very important for data like this.
Now, for analyzing functional data, many of the methods are just analogs of traditional scalar approaches, but two are more unique to functional data. One is curve estimation, or smoothing; certainly you need more than one point to smooth a curve, and that's a pre-processing step we'll look at. The other interesting one is the analysis of derivatives. For growth curves, you could look at acceleration functions, because you've now got a smooth, continuous observation that you can differentiate. So you get derivative functions, as opposed to just rise-over-run calculations.
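Since a smoothed curve is a genuine function, its derivatives come for free from the fitted representation. Here is a rough sketch of the idea in Python/SciPy (the talk's own analyses use R's fda package; the ages, heights, and logistic shape below are made-up illustration data):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Made-up growth-curve data: height (cm) at unequally spaced ages (years).
age = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 15.0, 18.0])
height = 50 + 120 / (1 + np.exp(-0.4 * (age - 9)))

# Fit a cubic spline through the points (real FDA work would penalize it).
fit = make_interp_spline(age, height, k=3)

velocity = fit.derivative(1)      # growth-rate function (first derivative)
acceleration = fit.derivative(2)  # growth-acceleration function

grid = np.linspace(0, 18, 181)
peak_velocity_age = grid[np.argmax(velocity(grid))]
print(round(peak_velocity_age, 1))  # growth spurt near the logistic midpoint
```

The point is that `velocity` and `acceleration` are themselves functions you can evaluate anywhere, not just finite differences at the measured ages.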
As for the functional analogs: within exploratory data analysis you can compute means, variances, and correlations; you can do principal components analysis, linear or generalized linear models, and even functional experimental design. Many of these are related, and I think we'll see they're natural extensions of methods we know and love, applied to functional observations.
The case study: simulated additive screening data
So, to motivate use cases for FDA, this is the case study data that we'll talk about. It is simulated, and in the link that Rachel sent out, or will send out, I've made the slides as well as my code available. Every analysis and every picture that you see, including all the seeds, is available to explore, so you'll be able to reproduce everything. For this example, we have 41 different additives, or experimental materials, one of which is a standard. You can think of the standard as the gold standard on the market, a control, or just a theoretical target that you want to hit; that's the solid black curve. Further study on 41 different additives could be resource intensive, so one of the questions we'll look at is: how can we quickly select profiles, or new materials, for further study that are most similar to this standard?
Now, this material characteristic is measured over 24 hours, and it resembles exponential decay. As we might often see, especially with decay functions, the time points at which we measure are unequally spaced: it starts out condensed near time zero, where all the curves start at 100, at times one and two, then goes every two hours until eight, then every four hours until 24. Also, in the example we had, based on the data processing needed to actually create these profiles, we had a lot of missing data. Rarely did we have all 10 points; we had somewhere between five and eight points for each curve. So each observation, each material, had unequally spaced points and a differing number of points.
So the first question we'll try to answer, using functional PCA, is: can we pick three to five experimental materials that are top candidates for further study? And second, this material characteristic profile is strongly related to application performance. As we try to create and design new materials, we'd also want to know, from this profile, which ones to look at based on predicted performance in the final application. That's where we'll look at functional regression.
Now, we've seen pictures like this quite often in traditional analyses. Some things we might use to analyze this data, maybe not to answer these exact questions, are time series, repeated measures, and longitudinal data analysis, all of a similar flavor, though they answer different questions. We could do moment-based regression: summarize these profiles into a set of scalar moments, and then apply typical linear models. Or we could do a multivariate approach, PCA or PLS, say, if we had multiple correlated responses, and analyze the data matrix formed here. But each of these has limitations. Time series and repeated measures typically require equally spaced time points and the same number of points for each curve, that is, a consistent domain for the discrete measurements. With moments, we know that in dimension reduction we typically lose information; that's always a pro and con. It's easier to analyze, but we suffer information loss. And multivariate methods like PCA and PLS look at the data matrix in a way that ignores the functional relationship between successive columns. In fact, with PLS it's known that if you permute the columns of your data matrix, you get the same model in the end. So that functional relationship is essentially destroyed and not captured in these multivariate approaches.
But FDA methods allow us to alleviate or basically remove many of these limitations.
Initial step in FDA: smoothing
So once you have measured your functional observations, you'll notice that it's not truly a smooth, continuous function that we're analyzing, because we never actually observe the true process; we measure it at set time points, or points in the domain. It's always a discretized function. That's why smoothing is often the first step of pre-processing, and in many cases it's a requirement, for example for functional PCA.
So why do we have to do this? Again, we have a discrete curve. X is the true process that we want to explore, but we actually observe Y: when we go to measure the value of the process at different time points t_j, there's likely error involved, measurement error or process error. So we observe something like y_ij = x_i(t_j) + e_ij, a noisy version that we need to convert into the functional observation we analyze. The common approaches are basis expansions with a roughness penalty, penalized splines in some fashion: Fourier series for periodic data, cubic B-splines or some other regression spline for non-periodic data. We play with the bias-variance tradeoff via a smoothing parameter lambda. An important note is that you can do this all at once, taking all your observations and smoothing them together, or you can treat them individually, which we'll see we have to do in our case.
And just to give an example of this penalization and smoothing: I've generated some points from the sine function, the gray circles, I think 100 or maybe 200 points, equally spaced within the domain minus three to three. You can see there's an underlying process, a functional relationship across these points, but we wouldn't want to just connect them. So how do we estimate the true function, the black curve? If we pick a lambda that's too small, we undersmooth, and as lambda approaches zero we basically interpolate the points: high variance, low bias. We probably wouldn't want to use that red curve. If we pick a lambda too big, the green curve, we oversmooth, so we typically miss the valleys and peaks, which you can observe here: very low variance, but high bias. We can pick lambda through trial and error or some cross-validation technique, but with an appropriately smoothed function, you see it's actually hard to discern the blue dashed line from the black, which tells us we were able to use splines to capture the true underlying process here.
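To make lambda's role concrete, here is a rough Python/NumPy sketch of the penalized-spline idea (the talk's own code uses R's fda package; the basis size, penalty form, and lambda values here are illustrative): fit noisy sine data with 20 cubic B-splines and penalize the second differences of the coefficients.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
t = np.linspace(-3, 3, 100)
y = np.sin(t) + rng.normal(0, 0.3, t.size)   # noisy observations of sin(t)

# Cubic B-spline basis with 20 basis functions on [-3, 3].
k, nbasis = 3, 20
inner = np.linspace(-3, 3, nbasis - k + 1)
knots = np.r_[[inner[0]] * k, inner, [inner[-1]] * k]
B = BSpline.design_matrix(t, knots, k).toarray()   # 100 x 20 design matrix

# P-spline idea: minimize ||y - B c||^2 + lambda * ||D c||^2,
# where D takes second differences of the coefficients.
D = np.diff(np.eye(nbasis), n=2, axis=0)

def smooth(lam):
    coef = np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)
    return B @ coef

for lam in (1e-6, 1.0, 1e6):   # undersmooth, moderate, oversmooth
    rmse = np.sqrt(np.mean((smooth(lam) - np.sin(t)) ** 2))
    print(f"lambda={lam:g}  rmse vs true curve: {rmse:.3f}")
```

As lambda grows toward infinity, the second-difference penalty forces the fit toward a straight line (high bias); as it shrinks toward zero, the fit chases the noise (high variance).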
Now, how does this apply to our case study data? Remember, we have unequally spaced and differing numbers of points, so we'll smooth individually, one curve at a time. We only have five to eight measurements per curve, but because we're using penalized splines, we can have great flexibility with 20 cubic B-splines. We actually have something like four times as many splines as points, but we can still estimate the function. We fit each smooth one at a time and then re-evaluate to form a consistent data matrix: evaluate each fit at every hourly time point, stack them all together, and you get a 41-by-25 matrix where each row is an experimental material. Then we reapply the same smoothing to create a functional data object with all 41 curves. So it's two different rounds of smoothing, and if you check out the code, you'll see how this works, but we need that functional data object for later use.
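Here is a rough sketch of that stacking step in Python (the talk's actual workflow uses R's fda package with 20 penalized B-splines; in this simplified, simulated version each sparse curve gets its own unpenalized cubic spline, then is re-evaluated on a common hourly grid):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

rng = np.random.default_rng(7)
full_times = np.array([0., 1., 2., 4., 6., 8., 12., 16., 20., 24.])
grid = np.arange(25.0)              # common hourly grid: 0, 1, ..., 24

rows = []
for _ in range(41):                 # 41 simulated decay profiles
    rate = rng.uniform(0.05, 0.3)
    # keep the endpoints plus 3-6 random interior points (5-8 points total)
    interior = np.sort(rng.choice(np.arange(1, 9), rng.integers(3, 7), replace=False))
    times = np.r_[full_times[0], full_times[interior], full_times[-1]]
    values = 100 * np.exp(-rate * times)          # noiseless for brevity
    fit = make_interp_spline(times, values, k=3)  # smooth each curve separately
    rows.append(fit(grid))          # re-evaluate on the common grid

X = np.vstack(rows)                 # 41 x 25 data matrix, one row per material
print(X.shape)                      # (41, 25)
```

Each curve is fit on its own irregular time points, so unequal spacing and differing counts are no problem; the common grid is only imposed afterward.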
And how well does this work? Here's the example for the standard. The discrete curve is the dashed red. Applying these splines, re-evaluating, and plotting, you can see that even with only five to eight points, the smooth can estimate the underlying process very well. We apply the same approach to all the curves, and I think you can see we do reasonably well at capturing the overall shape of each individual profile.
Q&A: getting started with FDA
One question on Slido was: what made you first interested in FDA? Ah, great question. I didn't cover it in the bio. One of the elective courses at A&M, I forget the number, it's not important, was nonparametric curve estimation, and because of where my research was heading, my advisor said I should take it. In that class the professor taught us FDA, so it was one of those happy accidents: my advisor had the project and said, take this class, and we did FDA. Part of that class was a project where you had to implement all of the methods, and I did a study of state-level obesity curves, essentially growth curves, to look at the important factors related to state obesity and to test hypotheses about socioeconomic status and other demographic information.
A question I had is: if we're just learning about this right now, what's a good way to know a project could use FDA? Yeah, there are some natural scenarios: when one experimental unit, one thing in your study, has a measurement that's an entire distribution, not a single value. A molecular weight distribution is an example: we know there isn't just one polymer chain of constant length, so you have to look at chromatographic data in analytical science or from different characterization groups. Or if you're looking at growth or decay functions, you naturally ask the question: can I represent this as one number, or do I need all of these values? Two-dimensional or higher cases may be more difficult, but usually it's that you have something measured over time or space, and it's hard to summarize as one single value.
And Santiago asked: hi Matt, great job so far, by the way. What is your decision framework for which basis function to use? That's a good question. It's like a tree diagram. The first thing you look at is: do I have periodic or non-periodic data? Periodic means that the far-left point and far-right point of the function should match. The temperature profile is an example, where December 31 and January 1 are closely related even though they're at opposite ends of the domain; those temperatures should be close to each other, so there you use Fourier series. Otherwise, cubic B-splines are just a good default, and there's some theory out there, from de Boor, that says cubic B-splines with a knot at each data point are the best you can select. You could use wavelets, as an example, when you have really peaked functions. As a response I saw said, it really depends on the data you have and what you're trying to do. If you take something like a normal-density shape, which is what some of the basis functions look like in the interior, and you try to fit a really peaked function, like GC mass spec data, you're always going to underestimate the peak, because the normal shape just can't get all the way up to an absolute-value-looking function, but wavelets have that ability. So it's based on the context and what you observe in the data. A foreshadowing answer is that sometimes you can use your data to define basis functions for later use, and we'll see that with FPCA here.
Screening additives with functional PCA
So the first question, remember, is to identify the three to five experimental materials most similar to the standard, and to do that we'll use functional PCA. I mentioned that functional methods are often analogs of traditional ones, so here's a quick three-bullet review of traditional PCA. We have a data matrix with many columns, likely correlated. PCA is first a data transformation: a rotation converting those correlated columns into uncorrelated linear combinations of the original vectors. We can then do dimension reduction, because the rotation is such that the first principal component explains the most variation, followed by the next most important, uncorrelated with the first, and onward. Often we can take many dimensions and reduce them down to one, two, or three that explain 90 or 95% of the total variability in the system. Mathematically, it's an eigenanalysis of either the covariance or correlation matrix: the eigenvectors define the linear combinations, the eigenvalues give the variance each one explains, and projecting the data onto the eigenvectors gives the PC scores.
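That eigenanalysis recipe fits in a few lines of Python/NumPy (illustrative toy data; the dimensions just echo the 41-by-25 matrix from the case study):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data matrix: 41 rows, 25 correlated columns (rank-3 signal plus noise).
z = rng.normal(size=(41, 3))
X = z @ rng.normal(size=(3, 25)) + rng.normal(0, 0.1, (41, 25))

Xc = X - X.mean(axis=0)                     # center each column
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]             # largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

scores = Xc @ evecs                         # PC scores (the rotated data)
varprop = evals / evals.sum()               # variance explained per PC
print(varprop[:3].sum())                    # three PCs capture nearly all of it
```

By construction the score columns are uncorrelated with each other, and the variance proportions decide how many components to keep.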
So what is functional PCA? Well, mathematically, I guess I'm going out of order here, we still perform an eigenanalysis, but this time on the functions. Slightly more complex, but the same general approach: we still capture the primary modes of variation in successive order, first, second, third. But now the functional PCs are orthonormal functions, with properties similar to the linear combinations in traditional PCA. I think one of the interesting additional pieces we get from functional PCA is this idea of empirical basis functions, via the Karhunen-Loève expansion. Basically, it says that any functional observation in our sample can be written as the mean function of the sample plus a weighted sum of the functional PCs, our new building blocks.
So let's actually apply this to our case study data. The Karhunen-Loève expansion gives us our empirical basis functions: we can use our sample data to make new building blocks. It says that any functional observation in our sample can be written as the overall mean function, y-bar of t, plus a weighted sum of the functional PC functions, with coefficients that start out as the PC scores of the observations we actually observed. So what does this look like for our case study data? We have the overall mean function, which resembles the typical exponential decay of our samples, and we get two PC functions. The first explains 96.5%, the second 3%, so in two PCs we capture basically all of the variability, 99.5%. They're not the easiest to interpret on their own, but we'll see on the next slide a picture that makes them easier to interpret. That's another bonus over traditional PCA: those linear combinations are often difficult to interpret in context, but here we can get an idea of what the PC functions look like and how they contribute to the different functions we observe in our sample. And we can still get a score plot. We have two dimensions, and we don't see just a random scatter here; we see a kind of parabolic shape, but the scores are uncorrelated with each other. Thinking about how we might solve the question with typical PCA, we can already see that we have PC scores in two dimensions: maybe we just perform simple clustering to solve our question.
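The Karhunen-Loève claim is easy to verify numerically. Here is a rough Python/NumPy sketch on simulated exponential-decay curves evaluated on the hourly grid (illustrative stand-in data; in R, the fda package's pca.fd does the functional version): FPCA is approximated by ordinary PCA of the curve matrix, and each curve is rebuilt from the mean plus two weighted PC functions.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(25.0)
rates = rng.uniform(0.05, 0.3, 41)
Y = 100 * np.exp(-np.outer(rates, t))   # 41 decay curves, one row per material

mean_curve = Y.mean(axis=0)
Yc = Y - mean_curve
evals, evecs = np.linalg.eigh(np.cov(Yc, rowvar=False))
order = np.argsort(evals)[::-1]
pcs = evecs[:, order[:2]]               # first two discretized PC "functions"
scores = Yc @ pcs                       # one (score1, score2) pair per curve

# Karhunen-Loeve with two terms: y_i(t) ~ mean(t) + s_i1*PC1(t) + s_i2*PC2(t)
Y_hat = mean_curve + scores @ pcs.T
max_err = np.abs(Y - Y_hat).max()
varprop2 = evals[order[:2]].sum() / evals.sum()
print(round(varprop2, 4), round(max_err, 2))
```

Two PC functions explain nearly all the variation in this family, and the reconstruction error stays small relative to the 0-100 scale, which is exactly the "two scalar coefficients describe a whole curve" idea.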
Now, before we get there, this interpretation idea: if you create a functional object in R that's the result of functional PCA and just plot it with the base plot function, you'll get this picture by default, and it really helps with interpretation, because the PC functions are essentially perturbations of the mean. What does that mean? Take PC1 in the left panel. The solid line in each picture is the overall mean curve. The line of pluses is a positive coefficient multiplied by PC1 and added to the mean function: for a positive PC1 score, you'd expect the shape to change in that way, becoming sort of an increase with a slower decay. A negative value of the same magnitude gives you the minus line. It basically tells us that the difference between these curves is really in the first half of time, the relative steepness of the initial 10 to 12 hours, because after that point the lines are roughly parallel. PC2, remember, is uncorrelated with PC1, so it makes sense that, regardless of magnitude, there's minimal change in the first half of time: PC2 explains the relative steepness of the decay in the second half. This is actually pretty powerful for understanding how the PC functions relate to your overall data via that expansion, and it can really help to explain the scenario, I think.
As dimension reduction, let's also look at these building blocks. Once you do FPCA and define the PC functions, they're fixed and known, which is why you can use them as basis functions, and the mean function is also known. The only things in the Karhunen-Loève expansion that change now are scalar coefficients. That can be pretty powerful for dimension reduction, because curves are more complex, with more degrees of freedom to change shape, compared to just two coefficients here. Here's how this works: we can generate new scores to build intuition for how the resulting function shape changes as the coefficients change. I've plotted the two PC functions again; they look pretty flat because the scale has changed. I'll generate new pairs of scores, this one about 50 and minus 25 for X and Y. 50 times blue gives me the dashed blue; minus 25 times red gives me the dashed red. That's the weighted contribution of PC1 and PC2. Add that back to the black mean curve and you get this maroon function. So for positive PC1 and negative PC2, the curve is pulled above the mean curve, and the decay is slower compared to the mean.
Do this again, with the past results grayed out, and now our point is at a negative PC1 and roughly zero PC2. We see the shape we'd expect: as PC1 goes negative, the decay at the beginning is greater, quicker, steeper, matching expectations. We can keep doing this, say a hundred times, constraining the scores with basic box constraints, the minimum and maximum from our sample, and we get an idea of all the possible functional observations, all the possible shapes, for new materials we might want to explore.
Now think of functional experimental design. If my factor is a functional observation, it's really hard to define its low value, its minimum, maximum, maybe midpoints as in traditional experimental design. But with FPCA, I can convert that factor into two scores, two scalar values, for which I can define highs and lows; we've already essentially explored that. In fact, if you try to do a functional experimental design, JMP has a platform that does this, and this is what's happening in the background: it uses functional PCA to define scores that alter shapes for defining functional factors in DOE.
All right, back to the analysis. How do we answer our question? Remember, we had the PC scores in two dimensions, so, as with the typical approach, we'd just use k-means clustering; I picked four clusters. The first component is 96.5% of the total variation, so it's no surprise that basically three vertical lines separate the four clusters. Our standard falls in the green cluster, and you can see the colored overlays here. To find the three closest candidates, I just took the Euclidean distance between every point and the standard and picked the top three; they're all in the green cluster. And I think this is a pretty cool picture showing the raw data, back to the discrete samples from the very first plot: the black and the three green are extremely similar.
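The selection step really is just a few lines once the scores exist. Here is a rough Python/SciPy sketch (random stand-in scores; the talk's analysis clusters the actual FPCA scores in R, with the standard as one of the 41 rows):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(5)
# Stand-in for the 41 pairs of FPCA scores; row 0 plays the standard.
scores = rng.normal(size=(41, 2)) * np.array([20.0, 4.0])  # PC1 dominates

centroids, labels = kmeans2(scores, 4, minit='++', seed=5)  # k-means, k = 4

# Euclidean distance from every material to the standard (row 0),
# then pick the three nearest, excluding the standard itself.
dist = np.linalg.norm(scores - scores[0], axis=1)
closest = np.argsort(dist)[1:4]
print(labels[0], sorted(closest.tolist()))
```

Because PC1 carries most of the variance, the clusters split mainly along the horizontal axis, mirroring the "vertical lines" the speaker describes.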
I'd encourage anyone interested to look at the code, because once you have the functional object, it's only a few lines of well-defined code: smoothing plus PCA plus clustering. It's a pretty easy workflow to reach this result. I actually think this is probably quicker than trying some of the other methods and searching for moments that would capture this shape appropriately.
The last point on this: for the actual application at Dow, we similarly selected three similar materials based on this approach, and it was a nice result that two of those three were experimentally validated in downstream work. So it was a very successful project, and really the first time this was applied within Dow at the time.
Q&A: resources and applications
So I'll definitely pause there for questions about FPCA. But one question that came up a little earlier was: specific FDA problems can pop up in my work and I'm not always sure which analysis is right. Is there a recommended resource for selecting approaches? Yes, there is a reference slide; I'll pull it up when we get to it. There's a series of three texts by slightly different sets of authors, but Ramsay and Silverman are involved in all of them, and they're basically three different levels. The first is a really small Springer book about just applications. For all those examples I showed in the very beginning, it'll say: here's a type of problem or a type of data, and here's how you would solve it with a functional analysis method. Some of them are pretty creative too, and it's a very easy read; you don't have to know anything about the code. So I would start there. The second one in the series is about how you do this in R or MATLAB. And the third one is the full book, with all the math, the derivations, and the theory.
Could FDA be used to describe sales of a certain product family? I would say possibly; I don't know if we have enough information. If the product family is defined by something like what I'm showing, material characteristics where you have a functional observation, then certainly. You could also have cost as an output; that's commonly used when you do an optimization of multiple properties. You could throw cost in and say, I want to optimize these application properties while minimizing cost. That's a typical approach. But I think it's hard to answer for sure without knowing where the functional observation, the functional data, is in that question.
Can you define functional as it is used in the FDA context? Yes, simply that the underlying variable you're trying to analyze is a smooth, continuous function: a process, a curve, a distribution, a profile. Functional just refers to the type of data being analyzed. It's the equivalent of saying multivariate data analysis, where multivariate implies you have more than one variable being analyzed at a time. Functional data analysis means you have functional data, and it's the class of methods for analyzing functional data: curves, distributions, processes, or profiles.
Awesome, thank you. I just thought of a question here as well. For people who are doing functional data analysis, are there typical resources you use to connect with each other? Are you on a Slack group somewhere, or is there some community group? No, at least not that I know of. Like I said, I had the one class, and the essential library of textbooks out there is still pretty small; the list of maybe seven that I share is maybe 75% of the ones I know about. So it's mostly self-learning. I do know some universities have functional data analysis groups, but I don't follow them. The authors of the texts built an FDA website that is now defunct; they tried to build a community, but it's not being maintained anymore. So unfortunately it's mostly self-learning and exploring. And as I meet new people who do this at Dow, we kind of make a community there, but nothing external for me.
I see a couple pop up here. One about FPCA for two variables, or maybe two dimensions. I haven't explored those problems myself, but the refund package is newer and might tackle more; refund stands for regression with functional data, and it also includes approaches like functional principal component regression, so they might tackle it there too. Again, because I haven't done it myself, I'd recommend the examples in those three books I talked about; they might give you some background on how to handle two-dimensional or multi-dimensional functional data problems.
Which industries or sectors widely use FDA? That's a good question. I don't remember off the top of my head, but there is a report; Rachel, I'll find it and maybe we can share it somehow with everyone. It's a systematic review of functional data analysis across all industries. A few years ago the authors went back through all the literature where functional data was represented or referred to in keywords and summarized it by method, industry, and type of problem. It was the only exhaustive list I had found, at least up to a few years ago.
Functional regression
So, just like we did with functional PCA, what are the different types of functional regression? It's not as straightforward as linear regression, because you can have functions on either side of your equation, or on both. The first is function on scalar: you have a functional response and a scalar variable, a single scalar in this case, as your input. I think this is easiest to look at in terms of functional ANOVA, which also shows another analog to a traditional method, plain ANOVA. The equation looks very similar, but you end up with an overall mean function across all of your groups or levels, and then a group-specific contribution that's also a function. To give you an example with that Canadian weather data, you could look at the temperature or precipitation profile as a function of climate zone. Compared to the mean profile, if you're in the Arctic region, your contribution might be some negative function that drops your profile to be colder. For precipitation, if you're in the Pacific zone, it's probably wetter, for instance.
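That decomposition, an overall mean function plus a group-specific function, can be sketched pointwise on a grid. This is an illustrative NumPy version with invented "climate zone" curves, not the Canadian weather data or the talk's R code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for temperature profiles: 3 climate zones, 5 curves each,
# on a common grid. Zone names and shifts are invented for illustration.
t = np.linspace(0, 1, 50)
zone_shift = {"arctic": -10.0, "continental": 0.0, "pacific": 5.0}
curves, labels = [], []
for zone, shift in zone_shift.items():
    for _ in range(5):
        curves.append(10 * np.sin(2 * np.pi * t) + shift
                      + rng.normal(0, 0.5, t.size))
        labels.append(zone)
curves = np.array(curves); labels = np.array(labels)

# Functional ANOVA decomposition, evaluated pointwise in t:
# x_ij(t) = mu(t) + alpha_j(t) + residual.
mu = curves.mean(axis=0)                                    # overall mean function
alpha = {z: curves[labels == z].mean(axis=0) - mu for z in zone_shift}

# The "arctic" contribution comes out as a negative function, dropping the
# profile below the overall mean, as described in the talk.
print({z: round(a.mean(), 1) for z, a in alpha.items()})
```

With equal group sizes the group contributions sum to zero at every t, the functional analog of the usual ANOVA constraint.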
Next is scalar on function: just flip them, so now you have a functional input and a scalar response. This is what we're going to apply in our situation, and we'll see what the response looks like in a second. I always thought this equation was interesting, because the first question is how you relate a scalar and a function. You actually estimate a slope function, which we'll see, and this slope function is represented as a basis expansion. So you use basis functions all over again when you fit this, and what you actually estimate are those coefficients, because the basis functions are known. The integral builds the relationship that converts the profile to a scalar. In the weather example, this would be like predicting total precipitation from temperature profiles.
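The trick of turning the integral into a linear model can be shown in a few lines. This is a NumPy sketch under stated assumptions (random quadratic covariates, a monomial basis for the slope function, a simple trapezoid quadrature); it is not the talk's refund/pfr code. With a basis expansion beta(t) = sum_k c_k * phi_k(t), the model y_i = integral of x_i(t) * beta(t) dt becomes ordinary least squares in the coefficients c_k:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 101)
dt = t[1] - t[0]
w = np.full(t.size, dt); w[0] = w[-1] = dt / 2   # trapezoid quadrature weights

# Hypothetical functional covariates: random quadratic curves.
n = 40
coefs = rng.normal(size=(n, 3))
X = coefs @ np.array([np.ones_like(t), t, t**2])

# Simulated scalar responses: y_i = integral x_i(t) * beta(t) dt, with an
# assumed true slope function beta(t) = sqrt(t), no intercept, no noise.
beta_true = np.sqrt(t)
y = (X * beta_true) @ w

# Represent beta(t) in a small monomial basis phi_k(t) = t^k. Then
# y_i = sum_k c_k * integral x_i(t) t^k dt, i.e. least squares in c.
K = 4
Phi = np.array([t**k for k in range(K)])      # K x len(t) basis matrix
Z = (X[:, None, :] * Phi[None, :, :]) @ w     # n x K design matrix
c, *_ = np.linalg.lstsq(Z, y, rcond=None)
beta_hat = c @ Phi                            # one valid slope-function estimate

# Predictions are essentially exact even though beta_hat itself need not
# match beta_true: the slope function is not uniquely identified here,
# which is exactly the stability caveat raised later in the talk.
print(np.allclose(Z @ c, y))  # -> True
```

Note the deliberate weakness of this toy: the covariates span only a three-dimensional space, so four basis coefficients cannot be pinned down uniquely even though the fit predicts perfectly.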
More generally, you could have function on function, with functions on both sides. Now you actually have a slope surface, and the domains can be different; they don't have to be exactly the same, though of course it gets more complex when they differ. In the same example, this would be predicting the entire precipitation profile from the temperature profile. And last is principal components regression, in its functional version. We'll see how we can use the results from our first exploration to predict our property. What's interesting and quite useful is that it essentially becomes multiple linear regression for a scalar response, because we have uncorrelated scores that capture our variation. So it really does resemble what you do in regular principal components regression.
To set the stage, the true slope function I have here is proportional to the square root of time. The reason is that in general, for the property we're looking to predict in this application, the further the curve is bounded away from zero and the slower the decay, the higher the property value. Since all the curves start at the same point, there's relatively little to no contribution at time zero, and a larger contribution the further the curve stays away from zero at longer times. The responses are simulated; again, all the code is available. I just use the true exponential decay function, with no error, together with the true slope function and no intercept term. And that's how we get the response.
Both of these perform similarly well, and we'd expect that, or at least hope so, because we capture nearly all the variability in the system with the functional PCA. So let's compare the two quickly. For the functional linear model, I think one big benefit is that you get an estimate of the slope function. You don't do any dimension reduction, so you use the entire functional observation. And pre-smoothing is not always required, depending on your inputs to the pfr function; that's how you'd fit this within refund. Now there are some cons. Because the slope function is a basis expansion, you do have to worry about degrees of freedom: if I use 10 basis functions, that's 10 scalars I'm estimating, 10 degrees of freedom. So in some cases you can need a lot of data if the slope function requires a very extensive basis expansion. Sometimes the estimated slope function may not resemble the truth; I'll show an example of that. And a unique solution may not be guaranteed, so while you can get the estimate, take it with a grain of salt sometimes. I've also seen cases with large standard errors on the intercept, where you get good predictive performance but I think the model tends to be overfit. So with the functional linear model I always recommend careful exploration of model stability, standard errors, and the typical things we'd look at in a regression setting.
To show the example, I actually used 10 different seeds and regenerated the response each time. The true slope function is in red, the solid black is from the original model fit, and the gray lines are the refits. You see some are steeper, some are flatter, and some are more quadratic. They all have similar, though maybe not exactly the same, prediction performance, but it is possible to get a shape that's reasonable yet slightly different from the truth.
With functional PCR, I think it's easier from both a fitting and an inversion standpoint, because we go back to scalar scores: which scores achieve a given property, and then we apply the expansion to get the function that gives us that predicted response. Again, this is all tied to the amount of variation explained. In this case, if we used FPC1 only, we'd lose some prediction performance, but only three and a half percent. If PC1 were 50%, PC2 30%, and PC3 10%, you'd expect very different performance using PC1 alone compared to all three. One practical thing to keep in mind with this approach: I'd recommend using all the PCs for a given functional observation, or none. If you just applied model selection methods, you might see them say you only need FPC2. Well, that's only three and a half percent of our 100%. You really need the more important PCs as well, so if you pick three, use all three.
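Here is what functional PCR reduces to once you have scores, sketched in NumPy under invented assumptions (30 exponential-decay curves with random rates, a made-up property defined through the curve shape); the talk's actual analysis uses R. Center the discretized curves, take an SVD to get FPC scores, and regress the scalar response on those scores like any multiple linear regression:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 24, 25)
dt = t[1] - t[0]
w = np.full(t.size, dt); w[0] = w[-1] = dt / 2   # trapezoid quadrature weights

# Hypothetical decay curves: 30 materials with random decay rates.
rates = rng.uniform(0.05, 0.3, size=30)
X = np.exp(-np.outer(rates, t))

# A made-up scalar property driven by the whole curve shape.
y = (X * np.sqrt(t)) @ w

# Functional PCA on the discretized, centered curves via SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * s[:2]             # first two FPC scores per material

# The scores are uncorrelated, so functional PCR is just ordinary
# multiple linear regression of y on the scores.
Z = np.column_stack([np.ones(len(y)), scores])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ coef
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 3))
```

Inversion works the same way in reverse: pick target scores, then rebuild the corresponding curve from the mean plus the score-weighted rows of Vt.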
All right. Just to show how powerful functional regression can be: consider scenarios where you have a wide matrix, far more columns than rows. In this case I simulated only 10 materials, but everything is measured at every hour, so it's a 10 by 25 matrix. We can apply various methods. I can compute moments like numerical derivatives and just apply regression: rise over run, either over the entire domain or over subsections and bins. I can look at the area under the curve; the larger that area, the further the curve is from zero, so it's a natural moment to pick. I can do regular PC regression on the data matrix. Or I can say exponential decay seems appropriate here, estimate that parameter, and use it as the moment. What you see is that compared to the functional methods, moment-based approaches typically just lose too much information. And even where multivariate approaches perform similarly, we still lose the functional relationship. So I think this is a powerful example to show that in very data-limited scenarios, we actually can apply and use the entire function in regression or PCA settings.
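For reference, the moment-based features mentioned above are each a one-liner. This sketch uses invented data (10 exact exponential curves, no noise), so the assumed-form moment recovers the truth essentially exactly; with real, noisy curves each moment throws away most of the shape:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 24, 25)
dt = t[1] - t[0]
w = np.full(t.size, dt); w[0] = w[-1] = dt / 2   # trapezoid quadrature weights

# Ten hypothetical materials measured every hour: a wide 10 x 25 matrix.
rates = rng.uniform(0.05, 0.3, size=10)
X = np.exp(-np.outer(rates, t))

# Scalar "moments" one might extract instead of using the whole curve:
slope = (X[:, -1] - X[:, 0]) / (t[-1] - t[0])    # rise over run, full domain
auc = X @ w                                       # area under the curve
# Assumed-form moment: fit log x(t) = -r * t to estimate a decay rate.
rate_hat = np.array([-np.polyfit(t, np.log(x), 1)[0] for x in X])

# With exact exponentials the fitted rate matches the true rate; each of
# these moments still compresses a 25-point curve into a single number.
print(X.shape, bool(np.max(np.abs(rate_hat - rates)) < 1e-8))
```

The point of the comparison in the talk is that these scalars can work when the assumed form is right, but the functional methods use the whole curve without committing to one.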
Closing Q&A
I see we're right at time. I'll leave the takeaways and further thoughts up here, and I'm happy to answer questions. Rachel, if you're allowing this to extend, I'm happy to stay and chat; if it needs to end, I'll let you make that call. Yeah, if you're able to stay and chat with us, that's awesome.
Between FPCA with K-means on the PCs and functional clustering techniques such as funFEM (which, Santiago, sounds fun), any thoughts on how to choose between the two approaches? Yeah, I have the same question; I've not seen funFEM before. I'd have to look that one up. Or, if you could, send in the chat what the FEM there stands for.
Another question was: is there some overlap between functional data analysis and time series? Yeah, and that gets to a potential takeaway I didn't actually talk about when covering smoothing. The data can be very related. We have points measured over time in our case study, so it looks like you could do something with time series. However, a key limitation is that time series methods are very rigid about needing equally spaced points, and, if you have multiple series, the same number of points in each. What you can do is use smoothing as an imputation approach, a way to convert data like our case study, which I did in an earlier step: I smoothed all the curves and then re-evaluated them at every hour from zero to 24, so I essentially converted them all to equally spaced series with the same number of time points. That's one approach I think can be used, provided the series isn't so complex that a smoother can't give reasonable estimates where you're interpolating. As long as it's reasonable, you could use smoothing as a precursor to time series. Remember, too, that time series and FDA may solve different problems. If you're looking at stationarity or seasonality, that's a different type of question from clustering or prediction with functional data. I'd use time series for forecasting, for instance; there are FDA approaches to forecasting, but if I could use time series, I would.
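The regridding step described above can be sketched simply. Here np.interp is a stand-in for a proper basis smoother (the talk used smoothing in R); the two irregular series and their sampling times are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two hypothetical series observed at different, unequally spaced times.
t1 = np.sort(rng.uniform(0, 24, 15)); y1 = np.exp(-t1 / 8)
t2 = np.sort(rng.uniform(0, 24, 21)); y2 = np.exp(-t2 / 5)

# Re-evaluate both on a common hourly grid from 0 to 24. np.interp only
# interpolates linearly and holds endpoint values outside the data range;
# a real smoother would also handle noise.
grid = np.arange(0, 25)
g1 = np.interp(grid, t1, y1)
g2 = np.interp(grid, t2, y2)

# Both series are now equally spaced with the same number of points,
# which is what classical time-series machinery expects.
print(g1.shape, g2.shape)
```

After this conversion, standard time-series tools that require a regular grid become applicable, at the cost of trusting the smoother between observations.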
Another question was: have you tried your example with traditional PCA or PLS, and, for everybody here, can you explain those acronyms, plus any comparison on how FDA performs better or worse? Yeah. So PCA is principal components analysis and PLS is partial least squares. The difference between the two is that with PLS you actually have a response or set of responses. PCA is dimension reduction on a data matrix, typically when you might not have a response. PLS says: I want to reduce dimension on the input space, reduce dimension on the output space, and relate the two together. The big limitation of PLS, and there's actually a paper that compares these two, shows up in that example of functional observations: say that column one is related to column two and column three based on that function, right?
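The PCA-versus-PLS distinction can be shown with a tiny deterministic example. This is an illustrative sketch, not the talk's comparison: three orthogonal, zero-mean predictors where only predictor 0 drives the response but predictor 1 has by far the largest variance. PCA's leading direction chases variance; the first PLS weight vector, proportional to X^T y, chases covariance with the response:

```python
import numpy as np

# Tiny deterministic example: three orthogonal, zero-mean predictors.
# Predictor 1 has the largest variance; only predictor 0 drives y.
X = np.array([[ 1.,  10.,  1.],
              [-1.,  10., -1.],
              [ 1., -10., -1.],
              [-1., -10.,  1.]])
y = X[:, 0].copy()                 # response depends on predictor 0 only

# PCA looks only at predictor variance: its leading direction is the top
# right-singular vector, which lines up with high-variance predictor 1.
pca_dir = np.linalg.svd(X, full_matrices=False)[2][0]

# PLS also uses the response: its first weight vector is proportional to
# X^T y, which lines up with predictor 0 instead.
pls_dir = X.T @ y
pls_dir /= np.linalg.norm(pls_dir)

print(np.abs(pca_dir).argmax(), np.abs(pls_dir).argmax())  # -> 1 0
```

So even before any functional structure enters, the two methods can pick completely different directions from the same matrix, which is the root of the comparison the question asks about.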
