The Influence of NPMI and TF-IDF-Based Automatic Stopword Generation on Semantic Consistency
Hye-soo Cho (Department of Sports Science, Hanyang University ERICA); Eun-Hyung Cho (Korea Institute of Sports Science); Hong-suk Kim (Department of Sports Science, Hanyang University); Soo-Kyung Cho (Department of Sports Science, Hanyang University); Ji-Yong Park (Department of Sports Science, Hanyang University)
Vol.36, No.4, pp.557-567
https://doi.org/10.24985/kjss.2025.36.4.557
Abstract

PURPOSE This study optimized stopword removal to enhance topic modeling performance. We propose an objective method combining normalized pointwise mutual information (NPMI) with median-based term frequency–inverse document frequency (TF–IDF) to automatically generate stopwords.

METHODS Using text data from 443 research papers on “Taekwondo sparring,” we selected stopword candidates based on NPMI and identified 30 words with the lowest TF–IDF scores. We examined the impact of removing 1–30 stopwords on u_mass coherence scores.

RESULTS The NPMI–TF–IDF method significantly improved coherence (R² = .456; p < .001). However, excessive removal led to diminishing returns, with the optimal coherence score (−11.442) achieved at 200 stopwords. In contrast, manually selected stopwords yielded a lower coherence score (−16.001). The findings indicate that integrating TF–IDF with NPMI effectively preserves meaningful words and outperforms PMI2 and PMI3 approaches.

CONCLUSIONS Manual stopword selection can reduce reproducibility. Optimizing stopword removal based on domain-specific characteristics is essential. Future research should validate this method across diverse fields to establish a more generalizable standard.
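The two scores named in the abstract can be illustrated with a minimal sketch. The abstract does not state how NPMI is used to pick candidates or how the median TF–IDF is computed, so everything below is an assumption made for illustration: NPMI is computed from document-level co-occurrence, a word is treated as a candidate when its strongest NPMI with any other word is at or below a hypothetical `npmi_threshold`, and the median TF–IDF is taken over the documents that contain the term. The function names and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import median


def npmi(p_x, p_y, p_xy):
    """Normalized PMI: log(p_xy / (p_x * p_y)) / -log(p_xy)."""
    if p_xy == 0:
        return -1.0  # usual convention: the pair never co-occurs
    if p_xy == 1:
        return 0.0  # 0/0 case (both words in every document); treated as 0 here
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def doc_freqs(docs):
    """Count documents containing each word and each unordered word pair."""
    df, pair_df = Counter(), Counter()
    for doc in docs:
        words = sorted(set(doc))
        df.update(words)
        pair_df.update(combinations(words, 2))
    return df, pair_df


def max_npmi(word, n_docs, df, pair_df):
    """Strongest NPMI between `word` and any other vocabulary word."""
    best = -1.0
    for other in df:
        if other == word:
            continue
        key = tuple(sorted((word, other)))
        score = npmi(df[word] / n_docs, df[other] / n_docs, pair_df[key] / n_docs)
        best = max(best, score)
    return best


def median_tfidf(docs):
    """Median TF-IDF per term over the documents containing it (assumed definition)."""
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    per_term = {}
    for doc in docs:
        for w, count in Counter(doc).items():
            tfidf = (count / len(doc)) * math.log(n_docs / df[w])
            per_term.setdefault(w, []).append(tfidf)
    return {w: median(values) for w, values in per_term.items()}


def generate_stopwords(docs, k, npmi_threshold=0.3):
    """Keep weakly associated words (assumed rule), then take the k lowest median TF-IDF."""
    df, pair_df = doc_freqs(docs)
    candidates = [w for w in df if max_npmi(w, len(docs), df, pair_df) <= npmi_threshold]
    scores = median_tfidf(docs)
    return sorted(candidates, key=lambda w: (scores[w], w))[:k]


# Toy corpus (hypothetical): "the" appears everywhere and is tied to no other word.
docs = [
    ["the", "kick", "score", "match"],
    ["the", "athlete", "kick", "speed"],
    ["the", "injury", "knee", "athlete"],
    ["the", "score", "match", "rule"],
]
stopwords = generate_stopwords(docs, k=2, npmi_threshold=0.3)
print(stopwords)  # ['the']
```

In this toy corpus only "the" passes the NPMI filter, because every other word has at least one strongly associated partner. In a setting like the one the abstract describes, `k` would be varied and each resulting stopword list evaluated by topic coherence (for example with gensim's `CoherenceModel` using `coherence="u_mass"`), which this sketch does not attempt.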

