POESIA

Principal Investigator:

José María Gómez Hidalgo


Contact:
buenaga<at>uem.es

Address:
C/ Tajo, s/n, 28670 Villaviciosa de Odón

Duration:
2002-2004

Project Page:

Introduction

The Internet gives children access to pornography and other unsuitable content far more readily than other media. To improve the effectiveness of existing filters, the POESIA project aims to develop and evaluate an open-source tool for filtering material accessible via the Internet in educational settings.

Operation

POESIA (Public Open-source Environment for a Safer Internet Access) is a tool for filtering inappropriate Internet content in educational settings. The software contains different filters, covering pornography and Internet channels, and works in several languages (English, Italian, Spanish and French).

It uses text analysis, image processing and code analysis techniques, among others.

The software is designed to be deployed in educational settings such as libraries and classrooms.

Entities involved in the development

Istituto di Linguistica Computazionale (Italy), Commissariat à l’Energie Atomique (France), Ecole Nouvelle d’Ingénieurs en Communication (France), M.E.T.A. S.r.l. (Italy), Universidad Europea UEM (Spain), University of Sheffield (United Kingdom), Fundació Catalana per a la Recerca (Spain), PIXEL Associazione (Italy), Liverpool Hope University College (United Kingdom) and Telefónica.

System and architecture

POESIA follows two different strategies. The first is integration with various existing technologies, which is possible because POESIA is open source.

The second strategy is to foster external collaborations in order to achieve more effective filters in several languages. This is achieved using additional technologies such as Squid, Jigsaw and WEKA, among others.

Light filters use the VSM (Vector Space Model). Hard filters use a representation based on natural language processing, drawing on terminology, phrases, words and other specific content related to pornography, as well as semantic analysis techniques.
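The Vector Space Model idea behind the light filters can be illustrated with a minimal sketch. It is not the POESIA implementation: the stoplist, the function names and the use of cosine similarity are assumptions made for illustration, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical stoplist; a real one would be language-specific and much longer.
STOPLIST = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stoplist words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPLIST]


def to_vector(text):
    """Represent a text as a sparse term-frequency vector (term -> count)."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In this representation each document becomes a point in a space with one dimension per term, so a new page can be compared against known examples, or passed to a learning algorithm, without any deeper linguistic analysis. That is what makes the approach suitable for a lightweight first pass.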

Results

There is a prototype lightweight filter that works in Spanish. The prototype has been built on the WEKA environment, the Muffin proxy system and the HTMLParser library.

It comprises 26 Java classes and 2,700 lines of code. It uses a binary vector model for text representation, together with a deny list.

In a test with 55 sites (27 of them pornographic), it obtained scores of 0.941 for pornographic content and 0.806 for non-pornographic content. The results are promising.
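As a rough illustration of a binary vector representation combined with a deny list, consider the sketch below. The prototype itself is written in Java; this Python version is only a sketch. The vocabulary and deny-list entries are hypothetical, and since the text does not say whether the deny list holds terms or addresses, the sketch assumes terms.

```python
import re

# Hypothetical vocabulary and deny list, for illustration only.
VOCABULARY = ["school", "homework", "science", "adult", "explicit"]
DENY_LIST = {"adult", "explicit"}


def words_in(text):
    """Return the set of lowercase words (runs of letters) found in the text."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def binary_vector(text, vocabulary=VOCABULARY):
    """Return a 0/1 vector: 1 where the vocabulary term occurs in the text."""
    present = words_in(text)
    return [1 if term in present else 0 for term in vocabulary]


def deny_list_hits(text, deny_list=DENY_LIST):
    """Return the deny-list terms that occur in the text, sorted."""
    return sorted(words_in(text) & deny_list)
```

For example, `binary_vector("Science homework for school")` marks only the first three vocabulary positions, while `deny_list_hits("Explicit adult content")` reports both deny-list terms. A binary vector records only presence or absence, which keeps the representation small and fast to compute.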
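The text does not name the metric behind the two figures. As one possibility, per-class scores of this kind can be computed as precision, recall and F1 for each class. The sketch below assumes that reading; the labels and function name are hypothetical, and it does not reproduce the reported numbers.

```python
def per_class_scores(actual, predicted, label):
    """Return (precision, recall, f1) for one class label.

    `actual` and `predicted` are equal-length sequences of class labels.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Reporting one score per class, rather than overall accuracy alone, shows separately how well the filter blocks unsuitable pages and how often it wrongly blocks suitable ones.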

Bibliography

For the light filter, a Vector Space Model (Salton, 1989) text representation has been selected, with stemming and stoplist filtering. A cost sensitive learn

Gómez, J.M., M. Maña, and E. Puertas. 2000. Combining text and heuristics for cost-sensitive spam filtering. In Proceedings of the Fourth Computational Natural Language Learning Workshop, CoNLL-2000, Lisbon, Portugal. Association for Computational Linguistics.

Gómez, J.M. 2002. Evaluating cost-sensitive unsolicited bulk email categorization. In Proceedings of the ACM Symposium on Applied Computing.

Salton, G. 1989. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley.

Sebastiani, Fabrizio. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.