Token-Based Classification and Dataset Construction for Detecting Modified Profanity 


Vol. 13, No. 4, pp. 181-188, Apr. 2024
https://doi.org/10.3745/TKIPS.2024.13.4.181


  Abstract

Traditional profanity detection methods struggle to identify profanity that has been intentionally altered. This paper introduces a new method based on Named Entity Recognition, a subfield of Natural Language Processing. We developed a profanity detection technique that uses sequence labeling; to support it, we constructed a dataset by labeling some of the profanities in Korean malicious comments and conducted experiments on it. To further improve the model's performance, we augmented the dataset by using a large language model, ChatGPT, to label parts of a Korean hate speech dataset, and then trained on the augmented data. During this process, we confirmed that human filtering alone of the dataset created by the large language model could improve performance. This suggests that human oversight is still necessary in the dataset augmentation process.
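
To make the sequence-labeling framing concrete, the following is a minimal sketch of how a character-level profanity annotation might be turned into per-token tags. It is an illustration only: the BIO scheme, the `PROF` label name, the whitespace tokenization, the `bio_tags` helper, and the placeholder sentence are all assumptions made for this example and are not taken from the paper, whose actual labeling scheme and tokenizer are not described in the abstract.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Assign B-PROF / I-PROF / O tags to whitespace-separated tokens.

    `spans` holds (start, end) character offsets, end exclusive, measured
    in the string " ".join(tokens). The label name PROF is hypothetical.
    """
    tags = []
    offset = 0  # character offset of the current token in the joined string
    for token in tokens:
        start, end = offset, offset + len(token)
        tag = "O"
        for span_start, span_end in spans:
            if start < span_end and end > span_start:  # token overlaps the span
                # The token that contains the span start opens the entity;
                # later overlapping tokens continue it.
                tag = "B-PROF" if start <= span_start else "I-PROF"
                break
        tags.append(tag)
        offset = end + 1  # skip the single space between tokens
    return tags


# Placeholder sentence with an obfuscated two-token expression at offsets 5..13.
tokens = ["that", "b@d", "w0rd", "again"]
print(list(zip(tokens, bio_tags(tokens, [(5, 13)]))))
```

A token classifier trained on tags of this kind predicts one label per token, which is what allows it to mark an altered expression in place rather than only flagging the whole comment.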

  Cite this article

[IEEE Style]

S. Ko and Y. Shin, "Token-Based Classification and Dataset Construction for Detecting Modified Profanity," The Transactions of the Korea Information Processing Society, vol. 13, no. 4, pp. 181-188, 2024. DOI: https://doi.org/10.3745/TKIPS.2024.13.4.181.

[ACM Style]

Sungmin Ko and Youhyun Shin. 2024. Token-Based Classification and Dataset Construction for Detecting Modified Profanity. The Transactions of the Korea Information Processing Society, 13, 4, (2024), 181-188. DOI: https://doi.org/10.3745/TKIPS.2024.13.4.181.