SMS Spam Corpus v.0.1

The SMS Spam Corpus v.0.1 is a set of tagged SMS messages that have been collected for SMS spam research. It contains two collections of SMS messages in English, of 1084 and 1319 messages, each tagged as either legitimate (ham) or spam.

Note: This collection is provided for compatibility with the experiments in the papers below. A new collection that extends this one has been prepared and is available at the SMS Spam Collection v.1 web site.

Corpus collection

This corpus has been collected from sources on the Web that are free or free for research:

Corpus details and download

There are two collections:

As reported at [3] in the About section, the average number of words per message in the big corpus is 15.72, and the average word length is 4.44 characters.
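For readers who want to compute comparable figures on their own copy of the data, the following is a minimal sketch. It assumes words are split on whitespace, which may differ from the tokenization used in [3], so the results need not match the reported values exactly. The function name corpus_stats is hypothetical.

```python
from typing import Sequence, Tuple


def corpus_stats(messages: Sequence[str]) -> Tuple[float, float]:
    """Return (average words per message, average word length in characters).

    Assumption: a "word" is any whitespace-delimited token. The paper's
    own tokenization is not described on this page and may differ.
    """
    if not messages:
        raise ValueError("no messages given")
    # Flatten all messages into one list of whitespace-delimited tokens.
    words = [word for message in messages for word in message.split()]
    if not words:
        raise ValueError("messages contain no words")
    avg_words_per_message = len(words) / len(messages)
    avg_word_length = sum(len(word) for word in words) / len(words)
    return avg_words_per_message, avg_word_length


if __name__ == "__main__":
    sample = ["Ok then i come n pick u at engin?", "Anything lor... U decide..."]
    print(corpus_stats(sample))
```

The function takes message texts only, so the trailing tag of each line should be removed before calling it.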

The files contain one message per line in raw text. Each line ends with a comma-separated tag, which can be "ham" or "spam". Here are some examples:

Urgent! call 09061749602 from Landline. Your complimentary 4* Tenerife Holiday or £10,000 cash await collection SAE T&Cs BOX 528 HP20 1YF 150ppm 18+,spam
Ok then i come n pick u at engin?,ham
Anything lor... U decide...,ham
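Because the message text itself can contain commas (as in "£10,000" above), a line is best split at its last comma rather than its first. The sketch below illustrates one way to do this. The function names parse_line and parse_corpus are hypothetical, and the code assumes the tag is always the final comma-separated field, as in the examples.

```python
from typing import Iterable, Iterator, Tuple

VALID_TAGS = {"ham", "spam"}


def parse_line(line: str) -> Tuple[str, str]:
    """Split one corpus line into (message, tag) at the LAST comma."""
    # rpartition splits on the final comma, so commas inside the
    # message text are preserved.
    text, sep, tag = line.rstrip("\r\n").rpartition(",")
    tag = tag.strip()
    if not sep or tag not in VALID_TAGS:
        raise ValueError(f"line does not end with ',ham' or ',spam': {line!r}")
    return text, tag


def parse_corpus(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (message, tag) pairs, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_line(line)


if __name__ == "__main__":
    examples = [
        "Ok then i come n pick u at engin?,ham",
        "Anything lor... U decide...,ham",
    ]
    for message, tag in parse_corpus(examples):
        print(tag, "|", message)
```

parse_corpus accepts any iterable of lines, including an open file object. The file names and character encoding are not stated on this page, so check the included readme file before opening the files.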

The messages are not chronologically sorted.

Here you can download the SMS Spam Corpus v.0.1. Please read the included readme file. We would appreciate it if:

Previous usage

This corpus has been used in the following research. The SMS Spam Corpus v.0.1 Small:

[1] Gómez Hidalgo, J.M., Cajigas Bringas, G., Puertas Sanz, E., Carrero García, F. Content Based SMS Spam Filtering. Dick Bulterman, David F. Brailsford (Eds.), Proceedings of the 2006 ACM Symposium on Document Engineering, Amsterdam, The Netherlands, October 10-13, 2006. ACM Press. (preprint)

[2] Cormack, G. V., Gómez Hidalgo, J. M., and Puertas Sánz, E. 2007. Feature engineering for mobile (SMS) spam filtering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Amsterdam, The Netherlands, July 23 - 27, 2007). SIGIR '07. ACM, New York, NY, 871-872. DOI= http://doi.acm.org/10.1145/1277741.1277951. (preprint)

The SMS Spam Corpus v.0.1 Big:

[3] Cormack, G. V., Gómez Hidalgo, J. M., and Puertas Sánz, E. 2007. Spam filtering for short messages. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007). CIKM '07. ACM, New York, NY, 313-320. DOI= http://doi.acm.org/10.1145/1321440.1321486. (preprint)

The NUS SMS Corpus has been used in the following research:

Yijue How and Min-Yen Kan (2005). Optimizing predictive text entry for short message service on mobile phones. In M. J. Smith & G. Salvendy (Eds.) Proc. of Human Computer Interfaces International (HCII 05). Lawrence Erlbaum Associates. Las Vegas, July 2005. ISBN 0805858075.

Yijue How (2004). Analysis of SMS Efficiency. Undergraduate Thesis. National University of Singapore.

Ming Fung Lee (2005). SMS Short Form Identification and Codec. Undergraduate Thesis. National University of Singapore.

About

The corpus has been collected by José María Gómez Hidalgo and Enrique Puertas Sánz.

We would like to thank:

Tao Chen and Min-Yen Kan are currently collecting a larger public domain SMS corpus.

Other collections mentioned in the previous papers are:


(c) José María Gómez Hidalgo, 2011