Comparison of Methods for Converting Categorical Variables to Continuous Measures: Resolving Temporal Discontinuity in the Smoking Questionnaires in NHIS-National Sample Cohort (범주형 변수의 연속형 환산 방법론 비교: 국민건강보험공단 표본 코호트의 흡연 문진의 시계열 단절 문제 해결)
- Authors
- 하강희; 양혜지; 박민주; 이시경; 김수환
- Issue Date
- Nov-2025
- Publisher
- Korean Society of Health Informatics and Statistics (한국보건정보통계학회)
- Keywords
- Smoking questionnaire; Discontinuation; Machine learning
- Citation
- Journal of Health Informatics and Statistics, v.50, no.4, pp. 441-452
- Pages
- 12
- Indexed
- KCI
- Journal Title
- Journal of Health Informatics and Statistics (보건정보통계학회지)
- Volume
- 50
- Number
- 4
- Start Page
- 441
- End Page
- 452
- URI
- https://scholarworks.gnu.ac.kr/handle/sw.gnu/81463
- DOI
- 10.21032/jhis.2025.50.4.441
- ISSN
- 2465-8014; 2465-8022
- Abstract
- Objectives: We developed and evaluated statistical and machine-learning approaches to convert categorical smoking variables into continuous values, addressing temporal discontinuity caused by questionnaire format changes in the National Health Insurance Service–National Sample Cohort (NHIS-NSC). Using repeated measurements from the same individuals, we compared strategies for transforming objective (multiple-choice) responses into subjective (open-ended) numerical values. Methods: We analyzed 44,755 smokers who completed health examinations during 2007-2010 and answered both objective (2007-2008) and subjective (2009-2010) smoking questionnaires. After temporally correcting smoking-duration variables, we compared simple substitution rules (median, mean, mode, midpoint), regression models, and machine-learning algorithms (random forest, gradient boosting, XGBoost, k-nearest neighbors, support vector regression). Performance was assessed using mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²). Results: For smoking duration, random forest performed best (MSE=35.18, R²=0.70), followed by XGBoost (MSE=35.30, R²=0.70) and gradient boosting (MSE=36.63, R²=0.68). For daily cigarette consumption, random forest (MSE=32.02, R²=0.38) and XGBoost (MSE=32.07, R²=0.38) outperformed the alternatives. Machine-learning models consistently outperformed simple substitution methods; notably, the midpoint approach performed poorly for daily consumption, yielding a negative coefficient of determination (R²=-0.10). Predicted values generally respected category boundaries, with minor discrepancies in the extreme categories. Conclusions: Machine-learning approaches, particularly random forest and XGBoost, substantially outperformed traditional statistical conversions when mapping categorical smoking variables to continuous values. The proposed framework preserves temporal continuity in longitudinal health surveys affected by questionnaire changes and is portable to other public-health databases undergoing similar methodological transitions.
- Appears in Collections
- College of Natural Sciences > Dept. of Information and Statistics > Journal Articles
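The simple substitution rules and the four error metrics named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the category bounds in `BINS`, the toy paired data, and the helper names `fit_substitution` and `evaluate` are hypothetical and do not come from the paper or from the NHIS-NSC questionnaire. The machine-learning models are not covered here.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical category bounds (years smoked) for a multiple-choice item.
# These are illustrative assumptions, NOT the actual NHIS-NSC bins.
BINS = {1: (0, 5), 2: (5, 10), 3: (10, 20), 4: (20, 30), 5: (30, 50)}


def fit_substitution(categories, values, rule):
    """Return one replacement value per category for the given rule."""
    if rule == "midpoint":
        # The midpoint rule ignores the data and uses the bin centre.
        return {c: (lo + hi) / 2 for c, (lo, hi) in BINS.items()}
    stat = {"mean": mean, "median": median}[rule]
    grouped = {}
    for c, v in zip(categories, values):
        grouped.setdefault(c, []).append(v)
    return {c: stat(vs) for c, vs in grouped.items()}


def evaluate(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 for paired observations."""
    n = len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    y_bar = mean(y_true)
    sst = sum((t - y_bar) ** 2 for t in y_true)
    mse = sse / n
    return {"MSE": mse, "RMSE": sqrt(mse), "MAE": mae, "R2": 1 - sse / sst}


# Toy paired data: the category a person ticked in one survey wave and the
# numeric value the same person reported in a later wave (made up).
cats = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
years = [2, 4, 6, 9, 12, 15, 22, 25, 31, 35]

results = {}
for rule in ("midpoint", "mean", "median"):
    table = fit_substitution(cats, years, rule)
    preds = [table[c] for c in cats]
    results[rule] = evaluate(years, preds)
    print(rule, {k: round(v, 3) for k, v in results[rule].items()})
```

Because R² is computed as 1 minus the ratio of the squared error to the variance around the overall mean, it turns negative whenever a rule predicts worse than that overall mean would. This is how a fixed rule such as the midpoint can end up with a negative R², as the abstract reports for daily consumption.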
