C/ Tajo, s/n, 28670 Villaviciosa de Odón
While information overload affects all of society, its impact is perhaps greatest in biomedicine, where researchers and professionals increasingly need tools that give them access to information suited to their needs. This domain also offers the opportunity to investigate new and better techniques for analyzing the content of texts, capable of solving the specific problems of new application environments.
In recent years, much of the group's research has focused on textual content analysis, especially automatic text categorization and summarization, and on its application to different information access environments.
The effectiveness and value of applying text analysis techniques to information access tasks are supported by a large body of work in the area and by international initiatives as important as the TREC and DUC conference series organized by NIST. In our own experience, over recent years we have successfully completed work that shows the interest and feasibility of the proposal presented here. For example, [Maña, 1998, 1999 and 2000] tested the effectiveness of user-tailored single-document summaries in ad hoc retrieval and relevance feedback tasks. [Maña, 2004] reports a user experiment showing improved effectiveness in an interactive retrieval task when clustering and multidocument summarization techniques are used together. Finally, [Buenaga, 2000] and [Gómez, 2003] show the integration of personalization and topical categorization techniques for editorial information.
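To make the idea of a user-tailored single-document summary more concrete, the following is a minimal sketch of query-biased extractive summarization. It is not the method used in the cited work: the scoring scheme (average term frequency plus a bonus for each query term a sentence contains), the function names and the weight `query_weight` are all illustrative assumptions, and the sketch omits stop-word removal and any linguistic analysis.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercase word tokens; the character class also covers Spanish letters.
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def summarize(text, query, max_sentences=2, query_weight=2.0):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    doc_freq = Counter(tokenize(text))
    query_terms = set(tokenize(query))

    scored = []
    for idx, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        if not tokens:
            continue
        # Average document frequency of the sentence's terms.
        freq_score = sum(doc_freq[t] for t in tokens) / len(tokens)
        # Number of distinct query terms that appear in the sentence.
        query_score = sum(1 for t in set(tokens) if t in query_terms)
        scored.append((freq_score + query_weight * query_score, idx))

    # Pick the best sentences, breaking ties in favour of earlier ones.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    return [sentences[idx] for _, idx in sorted(best, key=lambda pair: pair[1])]


if __name__ == "__main__":
    document = (
        "Aspirin reduces fever. The weather was nice. "
        "Aspirin may also reduce pain and inflammation."
    )
    for line in summarize(document, "aspirin pain", max_sentences=1):
        print(line)
```

Because the query terms drive the score, the same document yields different summaries for different user needs, which is the property that the user-tailored summaries rely on.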
The main objective of this project is to develop new mechanisms for information access by applying human language analysis techniques. The analysis techniques focus on automatic text categorization and automatic summarization.
We propose original and relevant improvements to these techniques and algorithms, together with the specializations and adaptations required by the specific application environment and by bilingual (English and Spanish) information processing. An implementation and experimentation environment of suitable size will be developed on biomedical domain documents: Medline, MedlinePlus / HealthDay (English / Spanish) and the TREC Genomics Track. This environment will integrate the text analysis techniques listed above into search engines to facilitate access to the information the user requires. The application environment and each of its integrated components will be evaluated under the general and specific standards of TREC information retrieval, and under those specific to the categorization and summarization tasks.
The project proposes the design and integration of automatic summarization and text categorization techniques for bilingual information access in the biomedical field.
The main objectives of the project are:
Development of advanced techniques for generating single-document and multidocument summaries adapted to the biomedical domain:
- Study the real contribution of thematic structure to single-document summaries
- Exploit the formal and thematic structure of documents in multidocument summaries
- Improve summaries of differences, based on effective techniques for finding relevant and novel information
- Adapt the techniques to the special features of biomedical texts, integrating lexical-semantic resources such as UMLS (Unified Medical Language System)
Improvement of text categorization techniques and their adaptation to the biomedical domain:
- Adapt and apply automatic text categorization techniques to the biomedical domain, increasing the effectiveness of the task in this domain.
- Study the use of multilingual lexical-semantic resources for the domain, especially UMLS, to improve current techniques based on machine learning
Development of integrated bilingual Spanish / English technologies:
- Develop software components for processing texts in Spanish and English, focused on implementing categorization and summarization algorithms and promoting the use of and integration with existing resources such as Freeling and GATE (language analysis), Weka (machine learning), and ontologies and lexical resources (UMLS, MeSH, WordNet and EuroWordNet).
Development of a search and information access system that integrates interaction methods based on summaries and categories:
- Develop a search system that gives the user easier access to information, reducing information overload through summaries and categorization and improving the organization of the response by showing groups of related documents contextualized by categories.
- Develop fully operational interfaces for real users, offering interaction through the devices that suit their needs: standard and mobile.
Evaluation of usability and effectiveness: evaluations will be conducted with end-user groups of suitable size, in two types of environments:
- Open environment: these experiments are designed to measure interface usability and user satisfaction in end-user information access tasks on Medline and HealthDay. The usability parameters to be evaluated include the average time taken for queries of interest and the degree of user satisfaction.
- Controlled environment: these experiments are designed to measure the improvement in effectiveness on problems specially designed to evaluate the gains achieved in information access. They will be conducted on the document collections, results and relevance judgments of the TREC Genomics Track experimental benchmark.
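As an illustration of how effectiveness might be measured in the controlled environment, the sketch below computes three standard retrieval measures (precision at k, recall and average precision) from a ranked result list and a set of relevance judgments. The function names and the toy document identifiers are assumptions for illustration only; the actual TREC Genomics Track evaluation uses its own official measures and tooling, and the sketch assumes the ranking contains no duplicate documents.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall(ranking, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranking) & set(relevant)) / len(relevant)


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    # Hypothetical system output and relevance judgments for one query.
    ranking = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2", "d5"}
    print(precision_at_k(ranking, relevant, 2))
    print(recall(ranking, relevant))
    print(average_precision(ranking, relevant))
```

Averaging `average_precision` over all queries of a benchmark gives mean average precision, a common single-figure summary when comparing a baseline search engine against one enhanced with summaries and categories.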
Journal articles and book chapters
Maña, M.J., M. de Buenaga and J.M. Gómez. 2004. Multidocument summarization: An added value to clustering in interactive retrieval. ACM Transactions on Information Systems, vol. 22, no. 2, 215-241.
J.M. Gómez, I. Giráldez, M. de Buenaga. 2004. Text Categorization for Internet Content Filtering (Categorización para los filtros de contenido en Internet). Inteligencia Artificial, vol. III/2004, no. 22, 147-160.
J.M. Gómez, E. Puertas, F. Carrero, M. de Buenaga. 2003. Categorización de texto sensible al coste para el filtrado de contenidos inapropiados en Internet. Procesamiento del Lenguaje Natural, vol. 31, 13-20.
Ureña, L.A., de Buenaga, M., Gómez, J.M. Integrating linguistic resources in Text Categorization through Word Sense Disambiguation. Computers and the Humanities, Kluwer Academic Press, vol. 35, no. 2, 215-230.
Maña, M.J., L.A. Ureña and M. de Buenaga. 2000. Tareas de análisis del contenido textual para la recuperación de información con realimentación. Procesamiento del Lenguaje Natural, no. 26, September 2000, 215-222.
Buenaga, M., Gómez, J.M., Díaz, B. 2000. Using Wordnet to Complement Training Information in Text Categorization. Chapter in Recent Advances in Natural Language Processing II, John Benjamins, 353-364.
Maña, M.J., M. de Buenaga and J.M. Gómez. 1999. Using and Evaluating User Directed Summaries to Improve Information Access. In S. Abiteboul and A.M. Vercoustre (eds.), Research and Advanced Technology for Digital Libraries, LNCS, vol. 1696, 198-214, Springer-Verlag. Paris (France).
Maña, M.J., M. de Buenaga and J.M. Gómez. 1998. Diseño y evaluación de un generador de resúmenes de texto con modelado de usuario en un entorno de recuperación de información. Procesamiento del Lenguaje Natural, no. 23, September 1998, 32-39.
Buenaga, M., Fernández-Manjón, B., Fernández-Valmayor, A. 1995. Information Overload at the Information Age. Chapter in Innovating Adult Learning with Innovative Technologies, Elsevier, 17-30.
Communications in conferences
José María Gómez Hidalgo, José Carlos Cortizo Pérez, Enrique Puertas Sanz, Miguel Ruíz Leyva. Concept Indexing for Automated Text Categorization. 9th International Conference on Applications of Natural Language to Information Systems, NLDB 2004, Salford, UK.
Manuel de Buenaga Rodríguez, José María Gómez Hidalgo, Enrique Puertas Sanz. 2004. Text Filtering for Spanish. Workshop on present and future of open-source content-based Web Filtering, Pisa, Italy.
Mark Hepple, Neil Ireson, Paolo Allegrini, Simone Marchi, José María Gómez Hidalgo. NLP-enhanced Content Filtering within the POESIA Project. Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.
Gómez Hidalgo, J.M. 2003. Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization. ACM Symposium on Applied Computing, Madrid.
Gómez Hidalgo, J.M., de Buenaga Rodríguez, M., Ureña López, L.A., Martín Valdivia, M.T., García Vega, M. 2002. Integrating Lexical Knowledge in Learning-Based Text Categorization. 6th International Conference on the Statistical Analysis of Textual Data, St. Malo, France.
Ignacio Giráldez, Enrique Puertas, José María Gómez, Raúl Murciano, Inmaculada Chacón. 2002. HERMES: Intelligent Multilingual News Filtering Based on Language Engineering for Advanced User Profiling. Multilingual Information Access and Natural Language Processing Workshop, IBERAMIA, Seville.