C/ Tajo, s/n, 28670 Villaviciosa de Odón
The objective of the Tefillah project is the design, development, evaluation and promotion of techniques for building tools that filter information on the WWW and that are more advanced, flexible, configurable, and effective than current ones, aimed at Internet service providers. These techniques allow Internet service providers to offer their client companies a new value-added service. This service gives client companies greater assurance of productive and profitable use of the Internet in the workplace.
Current tools for filtering Internet content have very limited effectiveness because they rely on overly simplistic techniques. Furthermore, these tools are rarely designed to be operated by Internet service providers interested in offering value-added services, such as secure Internet access, to their corporate customers. The tools rarely cover multiple content domains (e.g., violent or racist content, Internet gaming, etc.) and are usually limited to pornographic content. Moreover, current tools filter content in only one language, usually English. In short, the tools lack the flexibility and configurability needed to adapt them to other content domains and languages.
The Tefillah project aims to develop a set of innovative techniques for producing filtering systems that are more effective, flexible, and configurable than today's. The main scientific contributions of Tefillah fall within the following research areas:
- Natural Language Engineering. Content filtering is specifically a document categorization task [Sebastiani, 2002]. On the one hand, current categorization techniques are often based on very superficial representations of documents; syntactic and semantic information should be used to increase categorization effectiveness. On the other hand, there are hardly any multilingual categorization systems. Techniques geared toward this functionality must be developed, given the multilingual nature of the Internet.
- Machine Learning. The use of learning techniques significantly reduces the effort of developing text classification systems. It is necessary to investigate cost-sensitive learning methods such as MetaCost [Domingos, 1999], since users typically consider failing to block harmful content more damaging than the opposite error. It is also necessary to develop techniques for evaluating learning-based systems when these costs are unknown, imprecise, or variable, such as the ROCCH method [Provost and Fawcett, 2001].
- Agent Technology. Agent technologies are used to develop more robust and flexible systems. We intend to investigate opportunities to incorporate these technologies into Internet content filtering systems.
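The cost asymmetry mentioned under Machine Learning can be illustrated with a minimal sketch of a minimum-expected-cost decision rule. This is not MetaCost itself, which relabels training examples using bagged probability estimates; it only shows the underlying idea that a page should be blocked when blocking has the lower expected cost. The class name, method names, and cost values are hypothetical and chosen purely for illustration.

```java
// Illustrative sketch only: all names and cost values are assumptions,
// not part of the project's actual design.
final class CostSensitiveFilter {
    private CostSensitiveFilter() {}

    /** Expected cost of allowing a page: a cost is paid only if the page is harmful. */
    static double expectedCostOfAllowing(double pHarmful, double costMissedHarmful) {
        return pHarmful * costMissedHarmful;
    }

    /** Expected cost of blocking a page: a cost is paid only if the page is legitimate. */
    static double expectedCostOfBlocking(double pHarmful, double costBlockedLegitimate) {
        return (1.0 - pHarmful) * costBlockedLegitimate;
    }

    /** Blocks the page when blocking has the strictly lower expected cost. */
    static boolean shouldBlock(double pHarmful,
                               double costBlockedLegitimate,
                               double costMissedHarmful) {
        if (pHarmful < 0.0 || pHarmful > 1.0) {
            throw new IllegalArgumentException("pHarmful must be in [0, 1]");
        }
        return expectedCostOfBlocking(pHarmful, costBlockedLegitimate)
                < expectedCostOfAllowing(pHarmful, costMissedHarmful);
    }

    /** Probability above which blocking becomes the cheaper decision. */
    static double blockingThreshold(double costBlockedLegitimate, double costMissedHarmful) {
        return costBlockedLegitimate / (costBlockedLegitimate + costMissedHarmful);
    }

    public static void main(String[] args) {
        double pHarmful = 0.2; // assumed classifier estimate for one page
        // Equal costs versus a missed harmful page assumed nine times as costly.
        System.out.println(shouldBlock(pHarmful, 1.0, 1.0));
        System.out.println(shouldBlock(pHarmful, 1.0, 9.0));
        System.out.println(blockingThreshold(1.0, 9.0));
    }
}
```

With equal costs the rule reduces to the usual 0.5 probability threshold; raising the assumed cost of a missed harmful page lowers the threshold, so the filter blocks more aggressively.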
Among the features planned for Tefillah, we highlight those that, according to the ISO/IEC 9126 standard, determine software quality.
- One of the basic objectives of Tefillah is to provide both installers and users with a tool that is easy to install and configure. Our intention is to create a simple installation system that does not take much time, either when installing the application or when adding new users. In addition, dynamic HTML pages will let users easily configure the filter elements and even enable or disable the filter.
- Given that users already face long waiting times when accessing Internet services, Tefillah should add no more delay than is strictly necessary. Efficiency will therefore receive great care, with the whole filtering process completing in a fraction of a second.
- Inherent to Tefillah is its GPL license. This gives it almost limitless possibilities for extension and improvement, but also requires that the system be easy to maintain and highly adaptable to change.
- A final aspect to consider is the platform on which the tool can run. Since we rely entirely on Java technology, this poses no limitation: the tool will be cross-platform.
[Domingos, 1999] P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proc. of KDD, 1999.
[Gómez et al., 2000] Gómez Hidalgo, J.M., Maña López, M., Puertas Sanz, E. Combining Text and Heuristics for Cost-Sensitive Spam Filtering. Fourth Computational Natural Language Learning Workshop, CoNLL-2000, Lisbon, September 14, 2000.
[Gómez et al., 2002] Gómez Hidalgo, J.M., Puertas Sanz, E., Maña López, M. Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization. Accepted for 6th International Conference on the Statistical Analysis of Textual Data, Palais du Grand Large, St-Malo, France, March 13-15, 2002.
[Jacobson, 1999] Jacobson, I., Booch, G., and Rumbaugh, J. 1999. “Unified Software Development Process”, Addison Wesley.
[Provost and Fawcett, 2001] F. Provost and T. Fawcett. Robust classification for imprecise environments. Mach. Learn. J., 42(3), 2001.
[Sebastiani, 2002] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 2002.
[Simmers, 2002] Simmers, C. Aligning Internet usage with business priorities. Comm. of the ACM, January, 2002.
[Urbaczewski and Jessup, 2002] Urbaczewski, A. and Jessup, L. Does electronic monitoring of employee Internet usage work? Comm. of the ACM, January, 2002.
[Yasin, 1999] Yasin, R. Web slackers put on notice. Internet Week, October, 1999.