Integrating Lexical Databases in Text Categorization

An introduction

In the last decade we have witnessed an impressive growth of online information, mostly represented by the Internet and digital libraries. Most of the information in these environments exists in the form of text. For instance, some reports estimate that more than 90% of the information that a corporation produces and manages is text [Oracle97]. Text classification tasks such as text retrieval and text categorization aim at organizing and providing access to this huge amount of information.

Text categorization -- the assignment of documents to predefined categories -- is one of the most prominent information organization and access tasks. Text categorization (TC) has been applied to a wide variety of problems, ranging from the automatic cataloguing of web resources to spam message filtering [Lewis92, Sebastiani99, Gomez00]. Great attention has been paid to the automatic induction of TC systems, especially in connection with the resurgence of empirical methods in Natural Language Processing and the increasing availability of electronic text resources.

The induction of a TC system is a machine learning problem. Given a set of documents (manually) classified into a set of predefined categories, the problem is to learn or train a classifier, that is, a program that automatically assigns new documents to the categories. The main resource involved in learning a classifier is a prelabelled text collection, that is, a training collection. Such text collections are available both in research and in real-world environments. Two popular examples of research collections are Reuters and Ohsumed. Entire web directories like Yahoo! and online medical information sources like Medline are examples of huge, real text collections available for training and using TC systems.
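As a minimal illustration of how a training collection feeds a learning algorithm, the sketch below turns a toy set of labelled documents (invented here, not taken from Reuters or Ohsumed) into (term-frequency vector, category) pairs, the usual input representation for TC learners:

```python
from collections import Counter

# A toy prelabelled training collection: (document text, category) pairs.
# Real collections such as Reuters contain thousands of such pairs.
training_collection = [
    ("wheat prices rose sharply in chicago trading", "grain"),
    ("corn and wheat exports fell this quarter", "grain"),
    ("the central bank raised interest rates", "money"),
    ("interest rates and bank lending remain stable", "money"),
]

def bag_of_words(text):
    """Represent a document as a term-frequency vector (a Counter)."""
    return Counter(text.lower().split())

# Each training example becomes a (feature vector, label) pair,
# the input expected by a learning algorithm.
examples = [(bag_of_words(doc), cat) for doc, cat in training_collection]
print(examples[0][0]["wheat"], examples[0][1])  # → 1 grain
```

The bag-of-words representation discards word order but preserves the term frequencies that most TC learning algorithms rely on.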

Other resources, amenable to use in Natural Language Processing tasks, have also been built in recent years. Lexical databases like WordNet, EDR [Yokoi95], and EuroWordNet are especially relevant to text classification. Lexical databases are repositories that accumulate information about the lexical items of one or several languages. For instance, WordNet 1.6 includes information about more than 125,000 words and nearly 100,000 concepts of the English language, organized according to semantic relations. This great amount of data has been successfully employed to improve text classification tasks like text retrieval [Gonzalo98] and word sense disambiguation [Agirre96]. The utilization of lexical databases for TC is the focus of our work.

Lexical databases can be used to improve TC. We have developed a general model for including lexical information to complement training information in TC. The model focuses on linear classifiers [Lewis96], an important subclass of classifiers that can be learnt by machine learning algorithms. Linear classifiers show an interesting set of properties that make them especially useful for industrial applications, namely simplicity, efficiency and a surprising degree of effectiveness [Lewis98]. Most linear classifiers allow an easy and seamless integration of external information sources into the system as prior knowledge.
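A linear classifier for a category amounts to a weight per term plus a threshold: a document is assigned to the category when the weighted sum of its term frequencies exceeds the threshold. The sketch below shows this scoring rule with illustrative, hand-set weights (not taken from any trained model in our work):

```python
# Illustrative per-term weights and threshold for a hypothetical "grain"
# category; in practice these values are produced by a learning algorithm.
weights = {"wheat": 2.0, "corn": 1.5, "harvest": 1.0, "bank": -1.0}
threshold = 1.0

def classify(document, weights, threshold):
    """Assign the document to the category if its linear score
    (sum of weights of its terms) exceeds the threshold."""
    terms = document.lower().split()
    score = sum(weights.get(t, 0.0) for t in terms)
    return score > threshold

print(classify("wheat harvest begins", weights, threshold))   # → True
print(classify("the bank opened today", weights, threshold))  # → False
```

Because the model is just a weight vector, external knowledge can be injected simply by presetting some of the weights before training, which is precisely the integration path exploited below.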

Our model is based on integrating lexical information about categories in the form of prior knowledge for training algorithms [Mitchell97]. The prior knowledge is represented as an initial distribution that is later refined by learning. The utilization of lexical knowledge leads to an improvement in performance on all categories. Moreover, the model especially improves performance on less populated categories, that is, categories for which less training data is available. The lexical knowledge about categories is extracted from the lexical database WordNet. The utilization of this information requires a process of automatic word sense disambiguation (WSD) [Urena98], because the information about categories is potentially ambiguous.
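The prior-knowledge scheme can be sketched as follows: weights for a category are seeded from lexical information (here a hand-made stand-in for disambiguated terms extracted from WordNet) and then refined by simple perceptron-style updates on the training collection. All term lists, documents, and update constants below are illustrative assumptions, not the exact algorithm of the model:

```python
# Stand-in for terms extracted from WordNet for a "grain" category;
# in the real model these come from the lexical database after WSD.
lexical_terms = {"grain", "wheat", "corn", "cereal"}

# Prior knowledge: a small positive initial weight for every lexical term.
weights = {term: 0.5 for term in lexical_terms}
threshold = 0.5

# Toy training collection: (document, belongs-to-category) pairs.
training_data = [
    ("wheat exports rose", True),
    ("interest rates fell", False),
]

def score(doc):
    """Linear score of a document under the current weights."""
    return sum(weights.get(t, 0.0) for t in doc.lower().split())

# One pass of perceptron-style refinement: nudge weights of the
# document's terms whenever the current classifier errs on it.
for doc, label in training_data:
    predicted = score(doc) > threshold
    if predicted != label:
        update = 0.2 if label else -0.2
        for t in doc.lower().split():
            weights[t] = weights.get(t, 0.0) + update
```

Starting from the lexical prior rather than zero weights is what lets the classifier behave sensibly on sparsely populated categories, where few training updates ever occur.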

We have evaluated our work on the standard TC test collection Reuters. The results of our experiments show that the model we have designed is effective. We have also applied the model in the prototypes of two industrial projects, namely Mercurio and EIII. The prototypes, built for a popular Spanish newspaper and a major Spanish news and photography agency, implement a set of personalized search and routing functionalities in which a TC module plays a key role.


[Agirre96] E. Agirre and G. Rigau, Word sense disambiguation using conceptual distance, Proceedings of COLING'96, 1996.

[Gomez00] J.M. Gómez, M. Maña and E. Puertas Sanz, Combining Text and Heuristics for Cost-Sensitive Spam Filtering, Computational Natural Language Learning Workshop, CoNLL-2000, 2000.

[Gonzalo98] J. Gonzalo, F. Verdejo, I. Chugur and J. Cigarrán, Indexing with WordNet synsets can improve text retrieval, Proceedings of the ACL/COLING Workshop on Usage of WordNet for Natural Language Processing, 1998.

[Lewis92] David D. Lewis, An evaluation of phrasal and clustered representations on a text categorization task, Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval, 1992.

[Lewis96] David D. Lewis, Robert E. Schapire, James P. Callan and Ron Papka, Training algorithms for linear text classifiers, Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, 1996.

[Lewis98] David D. Lewis, Naive Bayes at forty: The independence assumption in information retrieval, Proceedings of ECML-98, 10th European Conference on Machine Learning, 1998.

[Mitchell97] T. Mitchell, Machine Learning, McGraw Hill, 1997.

[Oracle97] Oracle, Managing Text with Oracle8(TM) ConText Cartridge, An Oracle Technical White Paper, 1997.

[Sebastiani99] Fabrizio Sebastiani, A Tutorial on Automated Text Categorisation, Proceedings of the First Argentinian Symposium on Artificial Intelligence (ASAI-99), 1999.

[Urena98] L.A. Ureña, M. de Buenaga, M. García and J.M. Gómez, Integrating and evaluating WSD in the adaptation of a lexical database in text categorization task, Proceedings of the First Workshop on Text, Speech, Dialogue --TSD'98--, 1998.

[Yokoi95] T. Yokoi, The EDR electronic dictionary, Communications of the ACM, 1995.

NOTE: Some of these papers are available at the Bibliography on Automated Text Categorization by F. Sebastiani.

Some other relevant papers

de Buenaga Rodríguez, M., Gómez Hidalgo, J.M., Díaz Agudo, B. Using WordNet to Complement Training Information in Text Categorization. 2nd International Conference on Recent Advances in Natural Language Processing (RANLP), Tzigov Chark (Bulgaria), 1997.

Gómez Hidalgo, J.M., de Buenaga Rodríguez, M. Integrating a Lexical Database and a Training Collection for Text Categorization. ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP, Madrid (Spain), 1997.

M. Junker and A. Abecker. Exploiting Thesaurus Knowledge in Rule Induction for Text Classification. Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing, pp. 202-207, 1997.

Sam Scott. Feature Engineering for a Symbolic Approach to Text Classification. Technical Report, Computer Science Department, University of Ottawa, 1998.

Sam Scott and Stan Matwin. Feature engineering for text classification. Proceedings of ICML-99, 16th International Conference on Machine Learning, pp. 379-388, Morgan Kaufmann Publishers, San Francisco, US, 1999.

Ureña López, L.A., de Buenaga Rodríguez, M., García Vega, M., Gómez Hidalgo, J.M. Integrating and evaluating WSD in the adaptation of a lexical database in text categorization task. First Workshop on Text, Speech, Dialogue --TSD'98--, 1998.

Ureña López, L.A., Gómez Hidalgo, J.M., de Buenaga Rodríguez, M. Information Retrieval by Means of Word Sense Disambiguation. TSD 2000, Third International Workshop on Text, Speech and Dialogue, Brno, Czech Republic, September 13-16, 2000.

José María Gómez Hidalgo - 22 November 2000