Stochastic LASSO for extremely high-dimensional genomic data
- Authors
- Baek, Beomsu; Jo, Jongkwon; Kang, Mingon; Kim, Youngsoon
- Issue Date
- Jan-2026
- Publisher
- Nature Publishing Group
- Keywords
- Stochastic LASSO; LASSO; High-dimensional data; Variable selection
- Citation
- Scientific Reports, v.16, no.1
- Indexed
- SCIE; SCOPUS
- Journal Title
- Scientific Reports
- Volume
- 16
- Number
- 1
- URI
- https://scholarworks.gnu.ac.kr/handle/sw.gnu/82450
- DOI
- 10.1038/s41598-026-35273-3
- ISSN
- 2045-2322
- Abstract
- Accurate identification of significant features in high-dimensional data is indispensable in high-throughput genomic analysis and association studies. The Least Absolute Shrinkage and Selection Operator (LASSO) and its derivatives have been widely adopted as feature selection schemes for discovering potential biomarkers in various biological systems. Recently, bootstrap-based LASSO models, such as Random LASSO and Hi-LASSO, have been effective solutions for extremely high-dimensional but low sample size (EHDLSS) genomic data. However, bootstrap-based LASSO models still have several drawbacks, such as multicollinearity within bootstrap samples, predictors that are missed in the draws, and randomness in predictor sampling. To address these limitations, we propose a new bootstrap-based LASSO, named Stochastic LASSO, that effectively reduces multicollinearity in bootstrap samples and mitigates randomness in predictor sampling, and thereby outperforms the benchmarks in feature selection and coefficient estimation. Furthermore, Stochastic LASSO provides a two-stage t-test strategy for selecting statistically significant features. The performance of Stochastic LASSO was assessed by comparing it with existing benchmark models in extensive simulation experiments. In these experiments, Stochastic LASSO consistently showed significant improvements over state-of-the-art LASSO models in feature selection, coefficient estimation, and robustness. We also applied Stochastic LASSO to gene expression data from publicly available TCGA cancer datasets and identified statistically significant genes associated with the prediction of survival months. The source code is publicly available at: https://github.com/datax-lab/StochasticLASSO.
- Appears in Collections
- College of Natural Sciences > Dept. of Information and Statistics > Journal Articles
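
For readers unfamiliar with the bootstrap-based LASSO family mentioned in the abstract, the following is a minimal, generic sketch of the idea: fit LASSO on bootstrap samples that each use a random subset of predictors, then test whether each averaged coefficient differs from zero. It is not the authors' Stochastic LASSO and does not reproduce their two-stage t-test or their multicollinearity reduction; the authors' implementation is at the GitHub link in the abstract. All function names, the toy data, the penalty `lam`, and the choice to record a zero for predictors left out of a draw are assumptions made for illustration, and the data are assumed to be centred (no intercept is fitted).

```python
import random
import statistics


def soft_threshold(z, lam):
    """Soft-thresholding operator used by coordinate-descent LASSO."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, which equals y while b = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def bootstrap_lasso(X, y, n_boot=30, q=20, lam=0.1, seed=0):
    """Return, per predictor, its coefficient in each bootstrap fit.

    Each fit resamples rows with replacement and draws q predictors uniformly
    at random; predictors not drawn get a coefficient of 0 (an assumption).
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    coefs = [[] for _ in range(p)]
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]
        cols = rng.sample(range(p), q)
        Xb = [[X[i][j] for j in cols] for i in rows]
        yb = [y[i] for i in rows]
        fitted = dict(zip(cols, lasso_cd(Xb, yb, lam)))
        for j in range(p):
            coefs[j].append(fitted.get(j, 0.0))
    return coefs


def t_statistics(coefs):
    """One-sample t statistic (H0: mean coefficient = 0) for each predictor."""
    stats = []
    for vals in coefs:
        sd = statistics.stdev(vals)
        mean = statistics.fmean(vals)
        stats.append(0.0 if sd == 0.0 else mean / (sd / len(vals) ** 0.5))
    return stats


def make_toy_data(n=30, p=40, seed=0):
    """Hypothetical data with p > n; only predictors 0 and 1 affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    t = t_statistics(bootstrap_lasso(X, y))
    ranked = sorted(range(len(t)), key=lambda j: abs(t[j]), reverse=True)
    print("predictors ranked by |t|:", ranked[:5])
```

Uniform predictor sampling is the simplest choice here and is exactly the kind of randomness the abstract identifies as a drawback; it is used only to keep the sketch short.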
