A robust factor analysis model based on the unrestricted contaminated skew-normal with missing information

Document Type: Research Paper

Author

Department of Statistics, University of Kashan, Kashan, Iran

Abstract

Advances in data collection technology have led to a surge in the use of factor analysis (FA) for dimensionality reduction. As the number of measurements grows, so does the risk of outliers, which can bias estimates, reduce their stability, and ultimately lead to inaccurate conclusions. To accommodate asymmetric high-dimensional data containing outliers, we develop an FA model in which the latent factors follow the unrestricted contaminated skew-normal distribution and which detects outliers automatically. Maximum likelihood estimates of the parameters are computed with EM-type algorithms developed for this model, and the asymptotic standard errors of the estimates are obtained by an information-based method. The asymptotic properties of the maximum likelihood estimators and the robustness of the model to outliers are examined in simulation studies. The Pima data serve as a case study that illustrates the practical use of the model in social data analysis.
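As a rough illustration of how a contaminated model can flag outliers automatically, the sketch below uses the simpler symmetric contaminated normal distribution, not the unrestricted contaminated skew-normal distribution of this paper. It assumes a two-component density alpha * N(mu, Sigma) + (1 - alpha) * N(mu, eta * Sigma) with eta > 1, and computes the posterior probability that an observation belongs to the "good" component from its squared Mahalanobis distance. The function names and the parameters `alpha` and `eta` are illustrative assumptions, and the 0.5 cut-off is one common convention rather than a rule taken from the paper.

```python
import math


def prob_good(d2, p, alpha, eta):
    """Posterior probability that a point is a 'good' (non-outlying) point.

    Assumed model (symmetric contaminated normal, used for illustration only):
        f(x) = alpha * N(x; mu, Sigma) + (1 - alpha) * N(x; mu, eta * Sigma)
    d2    : squared Mahalanobis distance of x from mu under Sigma
    p     : dimension of x
    alpha : proportion of good points, 0 < alpha < 1
    eta   : inflation factor of the contaminating component, eta > 1
    """
    # Terms shared by both components (2*pi constant, log|Sigma|) cancel
    # in the ratio, so only the parts that differ are kept.
    log_good = math.log(alpha) - 0.5 * d2
    log_bad = math.log(1.0 - alpha) - 0.5 * p * math.log(eta) - 0.5 * d2 / eta
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_good, log_bad)
    w_good = math.exp(log_good - m)
    w_bad = math.exp(log_bad - m)
    return w_good / (w_good + w_bad)


def is_outlier(d2, p, alpha, eta, threshold=0.5):
    """Flag a point as an outlier when its 'good' probability is below threshold."""
    return prob_good(d2, p, alpha, eta) < threshold


if __name__ == "__main__":
    # Hypothetical settings: 2-dimensional data, 90% good points, eta = 10.
    for d2 in (0.0, 4.0, 16.0, 50.0):
        print(d2, round(prob_good(d2, 2, 0.9, 10.0), 4), is_outlier(d2, 2, 0.9, 10.0))
```

In an EM-type algorithm for such a model, probabilities of this kind would typically be produced in the E-step and reused as weights in the M-step, so that outlier detection comes as a by-product of estimation.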
