Penalized logistic regression models for phenotype prediction based on Single Nucleotide Polymorphisms
Most studies of phenotype differences, including some diseases, focus on positions in the genome called Single Nucleotide Polymorphisms (SNPs). Some SNPs act on their own, and others act through interactions with other SNPs, to play an important role in a phenotype or a specific disease. Various models, including regression models, have been designed and implemented to predict these diseases. Because phenotypes can be either quantitative or binary, linear regression, based only on the number of minor alleles per SNP, is used to predict quantitative phenotypes, and logistic regression is used for binary ones such as complex diseases. Complex diseases are not caused by independent SNPs alone but by the interaction of a large number of SNPs, a number that usually exceeds the number of samples, so penalized logistic regression is considered a better choice. These models can therefore overcome the limitations of ordinary logistic regression on high-dimensional SNP datasets. In this paper, three penalized regression models, Ridge, Lasso and Elastic Net (EN), were applied to 10,000 samples from the SNP datasets of the OWKIN-Inserm Institute to predict the risk of a specific disease (undisclosed for confidentiality reasons). Among the three, the Lasso model with the minimizing value of lambda achieved the highest accuracy (73.73%) and AUC (83.54%). This model is also the least complex, since it eliminates weakly related features as far as possible and keeps only the most informative ones. Moreover, the better results obtained with Lasso suggest that multicollinearity between the variables is either absent or low enough to be neglected.
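The difference between the three penalties can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic minor-allele counts (0, 1 or 2) rather than the OWKIN-Inserm dataset, the function names and the values of `lam`, `alpha`, `step` and `n_iter` are hypothetical, and lambda is fixed by hand rather than tuned. The penalty is assumed to take the usual elastic-net form lam * (alpha * |w|_1 + (1 - alpha)/2 * |w|_2^2), so that alpha = 0 gives Ridge and alpha = 1 gives Lasso.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_penalized_logistic(X, y, lam, alpha, step=0.1, n_iter=300):
    """Proximal gradient descent for elastic-net penalized logistic regression.

    alpha = 0 -> Ridge, alpha = 1 -> Lasso, values in between -> Elastic Net.
    The intercept b is left unpenalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        for j in range(p):
            # Gradient step on the smooth part (log-loss plus ridge term) ...
            z = w[j] - step * (grad_w[j] + lam * (1.0 - alpha) * w[j])
            # ... then the L1 proximal step, which can set w[j] exactly to zero.
            w[j] = soft_threshold(z, step * lam * alpha)
    return w, b


def count_nonzero(w):
    return sum(1 for wj in w if wj != 0.0)


# Synthetic data (an assumption for illustration only).
random.seed(0)
n_samples, n_snps = 60, 20
true_w = [1.5, -1.5, 1.0] + [0.0] * (n_snps - 3)  # only 3 SNPs carry signal
X, y = [], []
for _ in range(n_samples):
    # Minor-allele count per SNP: two draws with minor-allele frequency 0.3.
    row = [(random.random() < 0.3) + (random.random() < 0.3) for _ in range(n_snps)]
    p_case = sigmoid(-0.5 + sum(wj * xj for wj, xj in zip(true_w, row)))
    X.append(row)
    y.append(1 if random.random() < p_case else 0)

results = {}
for name, alpha in [("Ridge", 0.0), ("Elastic Net", 0.5), ("Lasso", 1.0)]:
    w, b = fit_penalized_logistic(X, y, lam=0.05, alpha=alpha)
    results[name] = w
    print(f"{name:12s} non-zero coefficients: {count_nonzero(w)} of {n_snps}")
```

Only the L1 part of the penalty passes through the soft-thresholding step, which is why Lasso and Elastic Net can drop SNPs entirely while Ridge only shrinks their coefficients. In practice lambda would be chosen by cross-validation on held-out data rather than fixed as it is here.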