
Anti-spam measure: Against bots harvesting email addresses

Web robots, or simply bots, are programs that run automated tasks over the World Wide Web. Harvesting bots try to collect email addresses from web pages in order to send spam.

What is spam? See its etymology summed up here and its economy. In short, spam is harmful.

In this post, we tackle the following problem: how to provide an email address on a website following the standard practice, that is via a hyperlink opening a mailer, while preventing bots from harvesting this address. The solution should be easy to use and "harvest-proof", that is, effective against harvesting bots. Different solutions exist; they are described and analyzed in the first sections of the post. Finally, I describe the solution that I have adopted.


Solutions for address obfuscation

The solutions can be analyzed according to two criteria, usability (convenience of use) and efficiency (protection against harvesting), in comparison with the standard practice. They all resort to a form of obfuscation (see footnote 1): the address is hidden from bots but can easily be discovered by a human. Such solutions thus belong to a specific class of tests, called reverse Turing tests, human interactive proofs (HIP), or completely automated public Turing tests to tell computers and humans apart (CAPTCHA).

Standard practice: mailto hyperlink

<a href="mailto:user@domain.tld">Email: user@domain.tld</a>

Perfect usability: just click to open the mailer with the address pre-filled. But the efficiency against harvesting bots is nil.

No mailto hyperlink

A hyperlink with a mailto URL is easily found in a web page. A first solution is to remove the hyperlink while preserving the plain-text address. Usability decreases considerably, since a copy-paste operation is required to get the address and the mailer must be opened manually; efficiency increases, but only marginally, since bots analyze not only mailto hyperlinks but the whole body of the web page.

It turns out that the only solution is to give the email address in an obfuscated form. More precisely, before being sent to the requesting client, each web page must be checked and, if needed, transformed by replacing each occurrence of an email address with an obfuscated version. Of course, this transformation can be automated on the server. Email addresses thus occur in the source code only in an obfuscated form. But when the browser interprets the source code, it shows a web page where the email addresses appear in a human-readable form.

Here are the most common methods for address obfuscation.

String substitution

It is possible to use specific encodings for the characters of an email address. Percent-encoding is a method for encoding characters in uniform resource identifiers (URI), including mailto URIs. It can be applied not only to reserved characters, like @, but also to unreserved characters, like the letters of our alphabet. For instance, @ can be replaced with %40 and u with %75: a character is replaced with a percent sign followed by the two-digit hexadecimal value of its ASCII code. Note that the browser only decodes these sequences inside a URI, which is a limitation: in the textual content of the following anchor, %40 is not decoded and is displayed as is.

<a href="mailto:%75%73%65%72%40%64%6f%6d%61%69%6e%2e%74%6c%64">Email: user%40domain.tld</a>
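The encoded href above can be produced mechanically. Here is a minimal sketch, assuming the address only contains ASCII characters; the function name percentEncodeAll is mine and is not part of any library.

```javascript
// Percent-encodes every character of a string, including the unreserved
// characters that encodeURIComponent would leave untouched.
// Assumption: the input only contains ASCII characters (codes below 128).
function percentEncodeAll(text) {
  return Array.from(text, function (ch) {
    var hex = ch.charCodeAt(0).toString(16);
    return "%" + (hex.length < 2 ? "0" + hex : hex);
  }).join("");
}

var encodedAddress = percentEncodeAll("user@domain.tld");
console.log("mailto:" + encodedAddress);
// decodeURIComponent reverses the substitution, as a browser does for a URI.
console.log(decodeURIComponent(encodedAddress));
```

The second call shows why the efficiency is limited: decoding is a single standard function call, for browsers and for bots alike.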

A numeric character reference is a construct used in markup languages like HTML that allows a character to be replaced with a short sequence of characters in any context, not only in email addresses. For instance, @ can be replaced with &#64; and u with &#117;: a character is replaced with an ampersand and a number sign (&#), followed by its Unicode decimal code (or its hexadecimal code preceded by x) and a terminating semicolon.

<a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#117;&#115;&#101;&#114;&#64;&#100;&#111;&#109;&#97;&#105;&#110;&#46;&#116;&#108;&#100;">Email: 
   &#117;&#115;&#101;&#114;&#64;&#100;&#111;&#109;&#97;&#105;&#110;&#46;&#116;&#108;&#100;</a>
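Such an anchor can also be generated by a small function. The following sketch uses a name of my own, toNumericReferences; it emits decimal references and terminates each one with a semicolon, as the HTML syntax expects.

```javascript
// Replaces every character with its decimal numeric character reference.
function toNumericReferences(text) {
  return Array.from(text, function (ch) {
    return "&#" + ch.codePointAt(0) + ";";
  }).join("");
}

// Builds an anchor whose href and textual content are both encoded.
var encodedAnchor = '<a href="' + toNumericReferences("mailto:user@domain.tld") + '">'
  + "Email: " + toNumericReferences("user@domain.tld") + "</a>";
console.log(encodedAnchor);
```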

With these methods, usability is largely preserved since the rendering engines of web browsers decode the characters directly. Efficiency is limited: bots just need to recognize the sequences that encode characters.

To avoid this recognition, a possible solution is to adopt an original, non-standard encoding. Of course, usability decreases, since the user must recognize the sequences used to encode characters, without the help of the browser. Efficiency can be good, depending on how original the encoding method is. Here are common examples.

user [at] domain [dot] tld
user_nospam@domain.tld

Bots are gradually learning to recognize these sequences. But the solution space has no limit, as shown by the following example, where a complex periphrasis describes the address.

known as user inside domain in top-level domain tld
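The [at] [dot] substitution shown earlier is simple enough to be applied automatically to every address of a page. Here is a minimal sketch; the function names are mine, and the pattern is a deliberately simple approximation of an email address, not a complete parser.

```javascript
// Assumption: a simplified pattern for addresses such as user@domain.tld.
var ADDRESS_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;

// Rewrites one address: "@" becomes " [at] " and each "." becomes " [dot] ".
function obfuscateAtDot(address) {
  return address.replace("@", " [at] ").replace(/\./g, " [dot] ");
}

// Rewrites every address found in a text fragment.
function obfuscateAddresses(text) {
  return text.replace(ADDRESS_PATTERN, function (address) {
    return obfuscateAtDot(address);
  });
}

console.log(obfuscateAddresses("Contact: user@domain.tld."));
```

A periphrasis, in contrast, has to be written by hand for each address, which is precisely what makes it hard to recognize automatically.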

HTML and CSS obfuscation

An alternative to string substitutions is to add HTML code inside the address, while telling the rendering engine that these additions must be ignored. Two methods are common, HTML comments and display:none spans.

user<!-- pre @ -->@<!--- post @ -->do<!-- post first syllable -->main<!--- pre dot -->.tld

<style type="text/css">
  p span.no_display { display:none; }
</style>
<p>user@<span class="no_display">post @</span>domain.tld</p>

Another method rewrites the whole address into its mirror.

<style type="text/css">
  span.mirror { unicode-bidi:bidi-override; direction: rtl; }
</style>
<p><span class="mirror">dlt.niamod@resu</span></p>
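The mirrored string placed in the span can be computed as follows. This is a minimal sketch with a function name of my own; since the mirror function is its own inverse, the same call recovers the address.

```javascript
// Returns the characters of the string in reverse order.
function mirror(text) {
  return Array.from(text).reverse().join("");
}

var mirrored = mirror("user@domain.tld");
console.log(mirrored);         // the string to put in the source code
console.log(mirror(mirrored)); // the address as rendered by the browser
```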

These methods cannot be applied to a mailto hyperlink, which decreases usability. Initially, these methods were not targeted by bots, but they are gradually becoming targets.

JavaScript obfuscation

Another, more sophisticated method is to replace the email address with a computation. The computation is described by a small program embedded within the web page and written in a language that browsers can interpret, typically JavaScript. Two kinds of computations are common: simple concatenations and simple encryptions, like Caesar ciphers.

Concatenation:

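<script language="JavaScript" type="text/javascript">
  // The address is split into fragments so that it never appears verbatim in the source code.
  var u = "user";
  var arr = "@";
  var d = "domain";
  var dot = ".";
  var t = "tld";
  // The href value is quoted so that the generated anchor is well-formed.
  document.write('<a href="' + "mail" + "to:" + u + arr + d + dot + t
        + '">' + "Email (concatenation)" + "</a>" + "<br>");
</script>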

Encryption with the ROT13 function:

<script language="JavaScript" type="text/javascript">
  function decode(a) {
    // ROT13: a Caesar cipher with a shift of 13.
    // letter -> letter' such that rank(letter') = (rank(letter) + 13) modulo 26,
    // where rank is the position of the letter in the alphabet (from 0 to 25).
    return a.replace(/[a-zA-Z]/g, function(c){
      return String.fromCharCode((c <= "Z" ? 90 : 122) >= (c = c.charCodeAt(0) + 13) ? c : c - 26);
    });
  }
  document.write('<a href="' + decode("znvygb:hfre@qbznva.gyq") + '">' + "Email (ROT13)" + "</a>" + "<br>");
</script>
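The encrypted string passed to decode must be prepared beforehand. Since a shift of 13 applied twice gives a shift of 26, ROT13 is its own inverse and the same transformation serves for encryption. Here is a more explicit sketch, with a function name of my own, that computes the rank of each letter:

```javascript
// ROT13 written with explicit ranks: rank' = (rank + 13) modulo 26.
function rot13(text) {
  return text.replace(/[a-zA-Z]/g, function (ch) {
    var base = ch <= "Z" ? 65 : 97; // character code of "A" or "a"
    var rank = ch.charCodeAt(0) - base;
    return String.fromCharCode(base + (rank + 13) % 26);
  });
}

var encrypted = rot13("mailto:user@domain.tld");
console.log(encrypted);        // the string to embed in the web page
console.log(rot13(encrypted)); // the original href
```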

Both methods preserve usability. Initially their efficiency was total: no bots analyzed JavaScript code. But times have changed: some bots correctly generate addresses computed by JavaScript concatenations. However, the encryption method using ROT13 still seems to be safe, which suggests that bots do not fully interpret JavaScript. If bots came to do so, a possible countermeasure would be to trigger the computation only after some user event happens, for instance a click on the mailto hyperlink.

<script language="JavaScript" type="text/javascript">
  function decode(a) {
    ... 
  };
  function openMailer(element) {
    var y = decode("znvygb:hfre@qbznva.gyq");
    element.setAttribute("href", y);
    element.setAttribute("onclick", "");
    element.firstChild.nodeValue = "Email (ROT13)";
  };
</script>
<a id="email" 
   href="click:the.address.will.be.decrypted.by.javascript"
   onclick='openMailer(this);'>Email: Please click.</a>

With this method, bots must not only interpret JavaScript but also simulate user events.

Addresses as images

A last method is to provide the email address as an image. Usability decreases severely, since copy-paste operations and mailto hyperlinks are no longer available. But efficiency is maximal: currently no harvesting bots analyze images. Of course, in other contexts, there exist bots, mainly aimed at popular websites, that analyze images to recognize words. The answer is to use text-based captchas, where the characters appear distorted in the image, as in the following example.

[Figure: A text-based captcha]

In fact, bot designers and captcha designers are engaged in an arms race: better recognition against more distortion. Whereas captcha recognition is intended to be easy for humans and difficult for machines, a recent study shows that the situation is deteriorating: recognition is also becoming difficult for humans.

Comparison and statistics

The following table sums up the properties of the different methods previously described, with respect to the standard practice, a mailto hyperlink with a plain-text address.

Usability and efficiency (present and trend) for the different obfuscation methods

Method                        Usability   Efficiency   Trend
mailto                        \(+\)       \(0\)        \(\longrightarrow\)
percent-encoding              \(+\)       \(60\)       \(\searrow\)
numeric character reference   \(+\)       \(70\)       \(\searrow\)
[at] [dot]                    \(0\)       \(70\)       \(\searrow\)
_nospam@                      \(0\)       \(80\)       \(\searrow\)
periphrase                    \(0\)       \(100\)      \(\longrightarrow\)
HTML comments                 \(0\)       \(65\)       \(\searrow\)
display:none                  \(0\)       \(90\)       \(\searrow\)
mirror                        \(0\)       \(100\)      \(\searrow\)
Javascript concatenation      \(+\)       \(85\)       \(\searrow\)
Javascript encryption         \(+\)       \(100\)      \(\longrightarrow\)
image                         \(-\)       \(100\)      \(\longrightarrow\)

Legend

Usability    \(+\)         as usable as the standard practice
             \(0\)         one manipulation (copy-paste operation or mailer opening) required
             \(-\)         both manipulations required
Efficiency   \(100 \%\)    no spam
             \(X \%\)      \(X\) percent of spams avoided
             \(0 \%\)      as inefficient as the standard practice
Trend        \(\longrightarrow\), \(\searrow\)   predictable evolution of the efficiency percentage

The percentages and the trends have been roughly estimated from three sources:

  • two experiments conducted by software developers around 2008 and 2012 respectively,
  • one experiment for a research paper about harvesting spam bots, conducted around 2010 and published in 2012.

All these experiments are based on the same method: publish a web page (the honeypot) containing email addresses obfuscated to varying degrees, wait, and measure the amount of spam received at each address. Here are the results of the first two experiments.

[Figure: Results of the experiment conducted by Mühlemann around 2008 (over a 1.5-year period). Image: http://techblog.tilllate.com/wp-content/uploads/2008/07/obfuscation_methods.png]

[Figure: Results of the experiment conducted by Lyness around 2012 (over a 2-year period). Image: https://davidlyness.com/blog/wp-content/uploads/2012/04/obfuscation-results.png]

As for the research paper, entitled Longtime Behavior of Harvesting Spam Bots and presented by Hohlfeld, Graf and Ciucu at the ACM Internet Measurement Conference in 2012, it describes a large-scale experiment based on a method that "enables the mapping of crawling activities to the actual spamming process": fresh email addresses were generated for each request. It leads to the following results for obfuscation methods.

[Figure: Results of the experiment conducted by Hohlfeld and collaborators around 2010 (over a 3-year period)]

Abbreviation   Method
CMT            HTML comments
FRM            included in a form field
JS             Javascript obfuscation
OBF            [at] [dot]
TXT            plain-text
MTO            mailto

Another interesting observation concerns the usage of the harvested addresses. First, only 0.5% of the addresses lead to spam. Second, for approximately 75% of the addresses leading to spam,

  • the time between the harvest and the first use is less than 10 days,
  • the period between the first and last uses is less than 7 days.

As a result, since the corruption of an address is essentially temporary, the webmaster can react after an address has received some spam, by increasing the protection against harvesting bots.

The preceding experiments allow the efficiency of bots to be studied, given some obfuscation techniques. It would also be interesting to determine whether these methods faithfully represent current practice. Here are the results of a quick survey of some institutional websites in my professional environment.

  • Mines de Nantes: no obfuscation, mailto hyperlinks with plain-text addresses
  • Lina: numeric character reference
  • Université de Nantes: Javascript computation (function concatenating the address)
  • Inria: Javascript encryption with an original and simple cipher

For their professional websites, colleagues resort to string substitutions ([at][dot]) and Javascript obfuscation by encryption, with the ROT13 function.

A new solution for Javascript obfuscation

Only one method is optimal with respect to both usability and efficiency: JavaScript encryption. It requires JavaScript to be enabled in the browser, which is now the norm. For the rare browsers without JavaScript, it is possible to resort to a safe but less usable alternative: email addresses described by complex periphrases or captchas, as shown before, and enclosed in a noscript tag.

<noscript>
  Email (JavaScript disabled): known as user inside domain in
    top-level domain tld

  Email (JavaScript disabled): 
    <img src="images/emailCaptcha.jpg" alt="email" width="25%"
         height="25%" />
</noscript>

The new solution is based on Javascript encryption. Compared with traditional solutions, it anticipates the interpretation of Javascript by bots. It is based on two phases:

  • an initialization phase involving a complex user event,
  • a steady state, where the email addresses appear through computed mailto hyperlinks.

As seen before, a first response to bots interpreting JavaScript is to trigger the computations generating email addresses in mailto hyperlinks only after a user event has occurred. The new solution goes one step further. First, the user must type a password suggested by a text or an image. Second, the user must click a button to generate the mailto hyperlinks with decrypted addresses. If the decryption process depends on the password, bots must not only interpret JavaScript and simulate user events but also guess the password and transmit it to the decryption function: current bots are far from having these capabilities.

However, a solution requiring the user to type a password at every visit would decrease usability. That is why the password is required only in an initialization phase, at the first visit. Once typed by the user, the password is stored locally, using the new API for persistent storage of key-value pairs in web clients (see footnote 2). After the first generation, at each subsequent visit, the password is read from the store to generate the mailto hyperlinks automatically. Thus, during the steady state, mailto hyperlinks are directly available: usability is preserved.

The new solution is implemented on a webpage, whose code is commented below.

Initial phase: a form asks the user for a password, suggested as a text above the field to be filled.

<form id="emailPass">
  Anti-spam protection: the email address is obfuscated. 
  <br> 
  To decrypt the address and generate a
  permanent mailto hyperlink, please 
  <ul>
    <li> <b>type</b> the following password: <b>vigenere</b>,
    <li> click <b>Generate</b>.
  </ul>
  <input type="text" id="password" placeholder="Type here the password." value=""> <br>
  <input class="button" type="button" id="transmit" value="Generate"
         onClick="generateMailtoLink(this.form.password.value)">
</form>

When the user clicks over the button Generate, the function generateMailtoLink(password) decrypts the email address by using the password as a key, stores the password and calls the function transformFormIntoMailtoLink(href) that replaces the form with a mailto hyperlink using the decrypted address (passed as argument in href).

/*
 * "mailto:user@domain.tld" in an encrypted form.
 */
function cryptedHref(){
  return "c\^W\~WKe\\\\vOXPR\_\&dSVKbqaGe\^MY\{LeSXS";
}
/*
 * Creates and stores the pair (demoKey, password).
 */
function createKey(password) {
  localStorage.setItem("demoKey", password); // stores (demoKey, password) 
};
/*
 * Replaces the form with a mailto hyperlink.
 */  
function transformFormIntoMailtoLink(href){
  ...
};
/*
 * Decrypts the content of the attribute 'href' ('mailto:' + address).
 * Stores the password.
 * Calls the function replacing the form with a mailto hyperlink.
 */
function generateMailtoLink(password) {  
    var href = decrypt(33, 94, password, cryptedHref());
    if(isValid(href)){
      createKey(password); // stores the password
      transformFormIntoMailtoLink(href);
    }else{
      alert("ERROR. The password is not valid: try again.")
    }
};
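The function isValid is not shown in this post. As an illustration only, a check of this kind could test that the decrypted string looks like a mailto URL, so that a wrong password is rejected. The name looksLikeMailtoHref and the pattern below are assumptions of mine, not the code of the webpage.

```javascript
// Hypothetical validity check: "mailto:" followed by something of the form
// local@domain.tld, without any whitespace. This is a rough plausibility
// test, not a complete validation of email addresses.
var MAILTO_PATTERN = /^mailto:[^\s@]+@[^\s@]+\.[^\s@]+$/;

function looksLikeMailtoHref(href) {
  return typeof href === "string" && MAILTO_PATTERN.test(href);
}

console.log(looksLikeMailtoHref("mailto:user@domain.tld"));
console.log(looksLikeMailtoHref("znvygb:hfre@qbznva.gyq"));
```

With a wrong password, the decryption yields a string that almost never starts with mailto:, so even a rough test of this kind is enough to trigger the error message.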

Steady state: once the mailto hyperlink has been generated, it appears each time the web page is loaded or shown, whether or not in the same page session, and whether or not in the same tab or window. The function selectFormOrEmailLink reads the key in the local storage. If the key is initialized, the function decrypts the email address by using the password as a key and calls the function transformFormIntoMailtoLink(href), which replaces the form, if needed, with a mailto hyperlink using the decrypted address (passed as argument in href). The function selectFormOrEmailLink is called at each occurrence of the events pageshow and visibilitychange.

/*
 * Reads the value associated to demoKey in the local store.
 */
function readKey() {
  return localStorage.getItem("demoKey");
};
/*
 * Replaces the form with a mailto hyperlink if needed. 
 */
function selectFormOrEmailLink(){
  var password = readKey();
  if(password !== null){
    var href = decrypt(33, 94, password, cryptedHref());
    if(isValid(href)){
      transformFormIntoMailtoLink(href);
    }
  }
};
/*
 * Calls selectFormOrEmailLink each time the page is shown
 * or becomes visible again.
 */ 
window.addEventListener("pageshow", selectFormOrEmailLink);    
document.addEventListener("visibilitychange", function() {
    if(document.visibilityState === 'visible'){
      selectFormOrEmailLink();
    }
});

It is relatively easy to make the decryption function depend on the password, for instance by using the Vigenère cipher. Each character is replaced by another one after a variable shift in the given alphabet. The shift varies from one character to the next according to a key, here the password, which is repeated as many times as needed to cover the message. Attacks become more difficult as the key gets longer: the cipher is optimal if the key is random and as long as the message to be encrypted (it then becomes a one-time pad).
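To make the principle concrete, here is a minimal sketch of a Vigenère cipher over a range of character codes. It is not the code of the webpage: the function names are mine, and I assume here that the first two arguments give the code of the first character of the alphabet and the size of the alphabet, so that (33, 94) covers the printable ASCII characters from "!" to "~".

```javascript
// Shifts each character of text by the rank of the matching key character,
// forwards (sign = +1) or backwards (sign = -1). The key is repeated as
// needed. Characters outside the alphabet are left unchanged.
function vigenereShift(first, size, key, text, sign) {
  if (key.length === 0) {
    throw new Error("The key must not be empty.");
  }
  var result = "";
  for (var i = 0; i < text.length; i++) {
    var rank = text.charCodeAt(i) - first;
    if (rank < 0 || rank >= size) {
      result += text.charAt(i);
      continue;
    }
    var shift = key.charCodeAt(i % key.length) - first;
    // The double modulo keeps the new rank between 0 and size - 1.
    var newRank = (((rank + sign * shift) % size) + size) % size;
    result += String.fromCharCode(first + newRank);
  }
  return result;
}

function vigenereEncrypt(first, size, key, text) {
  return vigenereShift(first, size, key, text, 1);
}

function vigenereDecrypt(first, size, key, text) {
  return vigenereShift(first, size, key, text, -1);
}

var secret = vigenereEncrypt(33, 94, "vigenere", "mailto:user@domain.tld");
console.log(secret);
console.log(vigenereDecrypt(33, 94, "vigenere", secret));
```

Decrypting with another key yields a different string, which is why the validity test on the result can serve to detect a wrong password.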

Online tools

Some JavaScript functions used to obfuscate, encrypt and decrypt addresses are available online, via HTML forms:

  • conversions from characters to numbers,
  • mirror function,
  • Caesar cipher,
  • Vigenère cipher.

Footnotes:

1 Obfuscation is also a cryptographic technique allowing programs to preserve a secret. Given a program that performs a computation by using some secret data, it is possible to transform this program into an obfuscated form, while preserving the result of the computation and hiding the secret data embedded in the program. This is an active field of research.

2 Some browsers may not implement the standard for Web storage, which is recent. Workarounds exist: see two implementations using cookies in the DOM Storage guide.

Last Updated 2015-06-05T07:45+0200. Comments or questions: Send a mail.