# Advances in stochastic and functional modeling of high dimension data

- Christian José Acal González

- Ana María Aguilera del Pino (Director)
- Juan Eloy Ruiz Castro (Director)

Defence university: Universidad de Granada

Defence date: 02 July 2021

- María del Mar Rueda García (Chair)
- Francisco de Asís Torres Ruiz (Secretary)
- Cristian Preda (Committee member)
- Rosa Elvira Lillo Rodríguez (Committee member)
- Juan Carlos Ruiz Molina (Committee member)

Type: Thesis

## Abstract

In many scientific fields, it is usual to find magnitudes characterized by the evolution of a random variable over some continuum (a stochastic process). Although the experimental data measured on these variables are functions (curves, surfaces or images), they have historically been treated through multivariate or time-series analysis, which loses key information. Fortunately, the great technological advances of recent years have made it quick and effortless to monitor and reconstruct the functions, so that it is now possible to work with the complete functions. In this scenario, the data are very likely to be high dimensional, that is, the number of variables is greater than the number of sampled individuals, so traditional statistical methods may not be appropriate. Depending on the final purpose, this thesis tackles these data from two different and complementary statistical perspectives: Functional Data Analysis (FDA) and Reliability Analysis (RA) based on Phase-type (PH) probability distributions.

FDA arose from the need to build robust tools to model and predict functional data, whose observations are normally curves depending on time or any other continuous argument. In the last two decades, FDA has been the subject of intensive research in which most multivariate techniques have been generalized, especially dimension reduction, regression and classification methods. Functional Principal Component Analysis (FPCA) stands out because it reduces the dimension and explains the variability structure in terms of a small number of uncorrelated variables.

In the reliability field, one of the main objectives is to study the behaviour of complex systems, whose operation is conditioned by several uncontrollable variables. In this sense, RA attempts to identify the probability distribution of the data to shed light on the variability behind the systems' operation.
A suitable solution is to consider Markovian processes and PH distributions. This class is known to be able to approximate any non-negative distribution as closely as desired thanks to its versatility, and to model complex problems with well-structured results.

The methodological contributions of this thesis are developed on the basis of data-driven problems of great interest related to Resistive Random Access Memories (RRAMs) and the COVID-19 pandemic. RRAMs raise great expectations because they are an important source of income in the industry, whereas to mitigate the spread of the virus it is crucial to develop suitable models that support correct decisions.

A new statistical approach based on PH distributions is developed to analyze RRAM variability, which is one of the key issues to solve. A wide comparison with experimental data shows that the fitted PH distributions work better than the classic probability distributions and help to understand the internal performance of RRAMs. A new stochastic process is built by considering the internal performance of macro-states in which the sojourn time is PH distributed. It is shown that the internal behaviour of the process is Markovian, but both homogeneity and Markovianity are lost in the new macro-state model. Other associated measures are also obtained. The new methodology allows complex systems to be modeled in an algorithmic way, in particular the noise produced inside RRAMs.

FPCA based on the Karhunen-Loève expansion makes it possible to characterize the stochastic evolution of RRAMs. Nevertheless, it is essential to identify the distribution of the principal components (pc's) to describe the entire process. In this sense, a new class of distributions, Linear PH (LPH) distributions, is introduced. Specifically, it is proved that if the principal components are LPH distributed, then the process follows an LPH distribution at each point.
As for the pc's, their interpretation is sometimes not immediate, and a rotation is needed to facilitate it. We develop two new functional Varimax rotation approaches based on the equivalence between FPCA and PCA. One method rotates the eigenvectors, and the other rotates the loadings of the standardized pc scores. They are applied to interpret the variability of the curves of positive COVID-19 cases in the Spanish autonomous communities.

Additionally, two different parametric and non-parametric functional homogeneity testing approaches are proposed by assuming a basis expansion of the sample curves. They consist of testing multivariate homogeneity on a vector of basis coefficients and on a vector of pc scores, respectively. These tests are useful to check the influence of material and thickness on RRAM behaviour. For the case of more than one functional response variable, the previous homogeneity testing methodology is extended by means of multivariate FPCA. It is used to test whether the levels of several pollutants differ according to the location of the measuring stations in the Abruzzo region, Italy. An approach for repeated measures is also considered to study whether the level of each pollutant decreased during the lockdown established by the Italian Government for the COVID-19 pandemic.

Finally, a multiple function-on-function regression model in terms of pc's is proposed for the imputation of missing data in the functional response, assuming that the multiple functional predictors are completely observed. This approach makes it possible to impute missing data related to COVID-19.

The content of this thesis is presented as a compendium of seven publications. The full versions of the papers are included in the Appendices.
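
As a notational aid, the two objects the abstract relies on most, the PH distribution and the Karhunen-Loève expansion, can be written in their standard textbook form as sketched below. The symbols ($\alpha$, $T$, $\mathbf{e}$, $\mu$, $\xi_j$, $f_j$, $\lambda_j$, $\mathcal{I}$) are assumptions chosen here for illustration and are not taken from the thesis. The PH formulas also assume no probability mass at zero, that is, $\alpha \mathbf{e} = 1$.

```latex
% Phase-type distribution PH(\alpha, T): time to absorption of a
% continuous-time Markov chain with m transient states.
% \alpha: 1 x m initial probability vector, T: m x m sub-generator,
% \mathbf{e}: m x 1 vector of ones (illustrative notation).
F(t) = 1 - \alpha \exp(T t)\, \mathbf{e}, \qquad
f(t) = \alpha \exp(T t)\, T^{0}, \qquad
T^{0} = -T \mathbf{e}, \qquad t \ge 0.

% Karhunen-Loeve expansion of a second-order process X(t), t in \mathcal{I},
% with mean \mu(t) and covariance eigenpairs (\lambda_j, f_j).
X(t) = \mu(t) + \sum_{j=1}^{\infty} \xi_j f_j(t), \qquad
\xi_j = \int_{\mathcal{I}} \bigl( X(t) - \mu(t) \bigr) f_j(t)\, dt, \qquad
\operatorname{Var}(\xi_j) = \lambda_j, \quad
\operatorname{Cov}(\xi_j, \xi_k) = 0 \ \ (j \neq k).
```

In this form, FPCA truncates the sum to the first few terms, so describing the whole process reduces to describing the joint behaviour of a small number of uncorrelated scores $\xi_j$. This is the point at which the distribution of the pc's becomes the object of interest.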