Uso de la detección de bigramas para categorización de texto en un dominio científico [Using bigram detection for text categorization in a scientific domain]

  1. Montejo Ráez, Arturo
  2. Martín Valdivia, María Teresa
  3. Perea Ortega, José Manuel
  4. Ureña López, Luis Alfonso
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2010

Issue: 44

Pages: 91-98

Type: Article


This paper presents experiments that apply multi-word detection to text categorization in a scientific domain. We used part of the collection of High Energy Physics (HEP) papers provided by the European Laboratory for Particle Physics (CERN). The supervised machine learning algorithms employed were Rocchio and PLAUM. Multi-word detection was limited to fixed sequences of at most two terms, known as bigrams. The aim of the study is to determine whether treating frequent bigrams as single features improves text categorization in this specific domain. We conclude that multi-word detection should not be used for this task in the HEP domain.
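
To make the idea of bigram detection concrete, the following is a minimal sketch of one way frequent bigrams could be found and merged into single features before classification. It is not the authors' implementation: the function names, the `min_count` threshold, the underscore joining convention, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def frequent_bigrams(docs, min_count=2):
    """Return the adjacent token pairs seen at least min_count times.

    Hypothetical helper: the frequency threshold is an assumption,
    not a value taken from the paper.
    """
    counts = Counter()
    for tokens in docs:
        # zip pairs each token with its right-hand neighbour
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_bigrams(tokens, bigrams):
    """Replace each detected bigram with one joined feature, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip both terms of the merged bigram
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy documents, invented for illustration only
docs = [
    "higgs boson decay in the standard model".split(),
    "search for the higgs boson".split(),
    "standard model predictions".split(),
]
bigrams = frequent_bigrams(docs, min_count=2)
features = [merge_bigrams(d, bigrams) for d in docs]
```

The merged token lists could then be weighted and passed to any supervised classifier. The paper's conclusion suggests that, for the HEP collection, this extra step did not help.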
