Improving Text Categorization with High Quality Bigrams 


Vol. 9,  No. 4, pp. 415-420, Aug.  2002
10.3745/KIPSTB.2002.9.4.415


  Abstract

This paper presents an efficient text categorization algorithm that generates high quality bigrams by using the information gain metric, combined with various frequency thresholds. The bigrams, along with unigrams, are then given as features to a Naïve Bayes classifier. The experimental results suggest that the bigrams, while small in number, can substantially contribute to improving text categorization. Upon close examination of the results, we conclude that the algorithm's main benefit is that it correctly classifies more positive documents, although it may cause more negative documents to be classified incorrectly.
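
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the actual thresholds or the exact procedure, so the single document-frequency threshold `min_df`, the cutoff `top_k`, the binary positive/negative labels, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the set of adjacent word pairs in one tokenized document."""
    return set(zip(tokens, tokens[1:]))


def entropy(pos, neg):
    """Binary entropy (in bits) of a positive/negative document split."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(df_pos, df_neg, n_pos, n_neg):
    """Information gain of a binary feature with respect to a binary category.

    df_pos / df_neg: positive / negative documents containing the feature.
    n_pos / n_neg: total positive / negative documents.
    """
    n = n_pos + n_neg
    present = df_pos + df_neg
    absent = n - present
    conditional = (present / n) * entropy(df_pos, df_neg)
    conditional += (absent / n) * entropy(n_pos - df_pos, n_neg - df_neg)
    return entropy(n_pos, n_neg) - conditional


def select_bigrams(docs, labels, min_df=2, top_k=10):
    """Keep bigrams found in at least min_df documents, ranked by information gain.

    min_df and top_k are hypothetical parameters chosen for this sketch.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (df_pos if label else df_neg).update(bigrams(tokens))
    scored = []
    for bg in set(df_pos) | set(df_neg):
        if df_pos[bg] + df_neg[bg] >= min_df:
            gain = information_gain(df_pos[bg], df_neg[bg], n_pos, n_neg)
            scored.append((gain, bg))
    # Highest gain first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [bg for _, bg in scored[:top_k]]
```

In a pipeline like the one the abstract outlines, the bigrams returned by `select_bigrams` would be added to the unigram vocabulary before the classifier is trained; the classifier itself is left out of this sketch.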

  Cite this article

[IEEE Style]

C. D. Lee, C. M. Tan, Y. F. Wang, "Improving Text Categorization with High Quality Bigrams," The KIPS Transactions:PartB, vol. 9, no. 4, pp. 415-420, 2002. DOI: 10.3745/KIPSTB.2002.9.4.415.

[ACM Style]

Chan Do Lee, Chade Meng Tan, and Yuan Fang Wang. 2002. Improving Text Categorization with High Quality Bigrams. The KIPS Transactions:PartB, 9, 4, (2002), 415-420. DOI: 10.3745/KIPSTB.2002.9.4.415.