Machine-Learning-Based Gender Distribution Prediction from Anonymous News Comments: The Case of Korean News Portal (open access)
- Authors
- Suh, Jong Hwan
- Issue Date
- Aug-2022
- Publisher
- MDPI
- Keywords
- anonymity; social media; big data; news comments; gender prediction; word embedding; machine learning
- Citation
- SUSTAINABILITY, v.14, no.16
- Indexed
- SCIE; SSCI; SCOPUS
- Journal Title
- SUSTAINABILITY
- Volume
- 14
- Number
- 16
- URI
- https://scholarworks.gnu.ac.kr/handle/sw.gnu/1023
- DOI
- 10.3390/su14169939
- ISSN
- 2071-1050
- Abstract
- Anonymous news comment data from a news portal in South Korea, naver.com, can help conduct gender research and resolve related issues for sustainable societies. Nevertheless, only a small portion of gender information (i.e., gender distribution) is open to the public, and therefore, it has rarely been considered for gender research. Hence, this paper aims to resolve the matter of incomplete gender information and make the anonymous news comment data usable for gender research as new social media big data. This paper proposes a machine-learning-based approach for predicting the gender distribution (i.e., male and female rates) of anonymous news commenters for a news article. Initially, the big data of news articles and their anonymous news comments were collected and divided into labeled and unlabeled datasets (i.e., with and without gender information). The word2vec approach was employed to represent a news article by the characteristics of its news comments. Then, using the labeled dataset, various prediction techniques were evaluated for predicting the gender distribution of anonymous news commenters for a labeled news article. As a result, the neural network was selected as the best prediction technique, and it could accurately predict the gender distribution of anonymous news commenters of the labeled news article. Thus, this study showed that a machine-learning-based approach can overcome the incomplete gender information problem of anonymous social media users. Moreover, when the gender distributions of the unlabeled news articles were predicted using the best neural network model, trained with the labeled dataset, their distribution turned out to differ from that of the labeled news articles. The result indicates that using only the labeled dataset for gender research can result in misleading findings and distorted conclusions. The predicted gender distributions for the unlabeled news articles can help to better understand anonymous news commenters as humans for sustainable societies. Ultimately, this study provides a new way to conduct data-driven computational social science with incomplete and anonymous social media big data.
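The pipeline described in the abstract (embed an article through its comments, fit a model on labeled articles, then predict male and female rates for unlabeled ones) can be illustrated with a minimal sketch. This is not the author's implementation: the toy word vectors stand in for trained word2vec embeddings, a single softmax layer stands in for the neural network, and every name and number below (`TOY_VECTORS`, `article_vector`, `train`, the sample rates) is a hypothetical chosen for illustration only.

```python
import math

# Hypothetical toy word vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "economy": [0.9, 0.1],
    "tax": [0.8, 0.2],
    "fashion": [0.1, 0.9],
    "recipe": [0.2, 0.8],
}


def article_vector(comments, vectors=TOY_VECTORS):
    """Average the vectors of all known words across an article's comments."""
    dim = len(next(iter(vectors.values())))
    total, count = [0.0] * dim, 0
    for comment in comments:
        for word in comment.lower().split():
            if word in vectors:
                total = [t + v for t, v in zip(total, vectors[word])]
                count += 1
    return [t / count for t in total] if count else total


def softmax(scores):
    """Turn raw scores into rates that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def predict(weights, bias, x):
    """Return [male_rate, female_rate] for one article vector."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


def train(samples, epochs=2000, lr=0.5):
    """Fit a single softmax layer by gradient descent on cross-entropy."""
    dim = len(samples[0][0])
    weights = [[0.0] * dim for _ in range(2)]
    bias = [0.0, 0.0]
    for _ in range(epochs):
        for x, target in samples:
            pred = predict(weights, bias, x)
            for k in range(2):
                grad = pred[k] - target[k]  # d(cross-entropy) / d(score k)
                bias[k] -= lr * grad
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
    return weights, bias


# Labeled articles: (comment-based vector, [male_rate, female_rate]).
# The rates are invented for this sketch, not taken from the paper.
labeled = [
    (article_vector(["economy tax", "tax economy economy"]), [0.8, 0.2]),
    (article_vector(["fashion recipe", "recipe fashion"]), [0.3, 0.7]),
]
weights, bias = train(labeled)

# An unlabeled article has comments but no published gender distribution.
unlabeled = article_vector(["tax tax economy"])
male_rate, female_rate = predict(weights, bias, unlabeled)
```

The softmax output is one possible design choice for this kind of target: it guarantees that the two predicted rates are non-negative and sum to 1, which a plain regression on each rate would not.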
- Files in This Item
- There are no files associated with this item.
- Appears in Collections
- College of Business Administration > Department of Management Information Systems > Journal Articles
Items in ScholarWorks are protected by copyright, with all rights reserved, unless otherwise indicated.