Detailed Information

Comparison of Methods for Converting Categorical Variables to Continuous Measures: Resolving Temporal Discontinuity in the Smoking Questionnaires in NHIS-National Sample Cohort (original Korean title: 범주형 변수의 연속형 환산 방법론 비교: 국민건강보험공단 표본 코호트의 흡연 문진의 시계열 단절 문제 해결)

Other Titles
Comparison of Methods for Converting Categorical Variables to Continuous Measures: Resolving Temporal Discontinuity in the Smoking Questionnaires in NHIS-National Sample Cohort
Authors
하강희; 양혜지; 박민주; 이시경; 김수환
Issue Date
Nov-2025
Publisher
Korean Society of Health Informatics and Statistics (한국보건정보통계학회)
Keywords
Smoking questionnaire; Discontinuation; Machine learning
Citation
Journal of Health Informatics and Statistics (보건정보통계학회지), v.50, no.4, pp. 441-452
Pages
12
Indexed
KCI
Journal Title
Journal of Health Informatics and Statistics (보건정보통계학회지)
Volume
50
Number
4
Start Page
441
End Page
452
URI
https://scholarworks.gnu.ac.kr/handle/sw.gnu/81463
DOI
10.21032/jhis.2025.50.4.441
ISSN
2465-8014
2465-8022
Abstract
Objectives: We developed and evaluated statistical and machine-learning approaches to convert categorical smoking variables into continuous values, addressing the temporal discontinuity caused by questionnaire format changes in the National Health Insurance Service–National Sample Cohort (NHIS-NSC). Using repeated measurements from the same individuals, we compared strategies for transforming "objective" (multiple-choice) responses into "subjective" (free-response numerical) values.

Methods: We analyzed 44,755 smokers who completed health examinations during 2007-2010 and answered both the objective (2007-2008) and subjective (2009-2010) smoking questionnaires. After temporally correcting the smoking-duration variables, we compared simple substitution rules (median, mean, mode, midpoint), regression models, and machine-learning algorithms (random forest, gradient boosting, XGBoost, k-nearest neighbors, support vector regression). Performance was assessed using mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²).

Results: For smoking duration, random forest performed best (MSE=35.18, R²=0.70), followed by XGBoost (MSE=35.30, R²=0.70) and gradient boosting (MSE=36.63, R²=0.68). For daily cigarette consumption, random forest (MSE=32.02, R²=0.38) and XGBoost (MSE=32.07, R²=0.38) outperformed the alternatives. Machine-learning models consistently exceeded simple substitution methods; notably, the midpoint approach performed poorly for daily consumption, with a negative R² (-0.10), meaning it predicted worse than the overall mean would. Predicted values generally respected category boundaries, with minor discrepancies in the extreme categories.

Conclusions: Machine-learning approaches, particularly random forest and XGBoost, substantially outperformed traditional statistical conversions when mapping categorical smoking variables to continuous values. The proposed framework preserves temporal continuity in longitudinal health surveys affected by questionnaire changes and is portable to other public-health databases undergoing similar methodological transitions.
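The comparison of substitution rules described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the category bounds, the paired responses, and the helper names (`CATEGORY_BOUNDS`, `regression_metrics`, `midpoint_rule`, `fit_median_rule`) are hypothetical and are not taken from the study or from the NHIS-NSC coding. The sketch scores a midpoint rule and a category-median rule with MSE, RMSE, MAE, and R², where R² = 1 - SS_res / SS_tot, so a rule that predicts worse than the overall mean yields a negative R².

```python
from statistics import mean, median

# Hypothetical category bounds (cigarettes per day); NOT the NHIS-NSC coding.
CATEGORY_BOUNDS = {1: (0, 10), 2: (10, 20), 3: (20, 40)}


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)     # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1 - ss_res / ss_tot,                     # negative if worse than the mean
    }


def midpoint_rule(categories):
    """Replace each category with the midpoint of its (assumed) bounds."""
    return [sum(CATEGORY_BOUNDS[c]) / 2 for c in categories]


def fit_median_rule(categories, values):
    """Learn one substitution value per category: the median of the numeric answers."""
    grouped = {}
    for c, v in zip(categories, values):
        grouped.setdefault(c, []).append(v)
    return {c: median(vs) for c, vs in grouped.items()}


# Made-up paired data: an earlier categorical answer and a later numeric answer
# from the same (imaginary) respondents.
categories = [1, 1, 1, 2, 2, 2, 3, 3]
reported = [3, 5, 10, 10, 15, 20, 20, 25]

midpoint_pred = midpoint_rule(categories)
median_lookup = fit_median_rule(categories, reported)
median_pred = [median_lookup[c] for c in categories]

midpoint_scores = regression_metrics(reported, midpoint_pred)
median_scores = regression_metrics(reported, median_pred)

for name, scores in [("midpoint", midpoint_scores), ("median", median_scores)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

For brevity the median rule is fitted and scored on the same toy data; a fair comparison would score each rule on held-out respondents. The same `regression_metrics` helper could score the output of any regression or machine-learning model in place of the two rules shown.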
Files in This Item
There are no files associated with this item.
Appears in
Collections
College of Natural Sciences > Dept. of Information and Statistics > Journal Articles

Related Researcher

Kim, Su Hwan
College of Natural Sciences (Department of Information and Statistics)