Journey is the Reward: Can you top this regex?

Wednesday, March 12, 2008

Regular expressions (regexes) are an indispensable tool for string processing. Languages like Perl are specially optimized for their use (as a side note, Perl was also one of the first popular high-level languages to have them built in). Recently I was looking for a regex to validate an email address. How hard can it be? After all, an email address is just something of the form 'something@something.something'. As it happens, there is an RFC (RFC 822, to be precise) that describes the form of a valid email address, so you need to check the email address against that RFC.

Now, the power of 'the net' is precisely that somebody might have already solved your problem. And here is the deal: there exists a regular expression to validate an email address. Now you will ask, what is so special? You got stuck, googled it and found a solution, big deal, huh? So let me tell you: the point is not that I found a solution, but what I found. Paul Warren wrote this particular regular expression as part of the Perl module Mail::RFC822::Address, and it is 6343 characters long. Yes, you read that right: six thousand three hundred and forty-three characters. Take a look, here.

And why is it so complex? Because as per the standard '!@' is a valid email address :-). I am sure there are many more such cases. To quote the author, "The grammar described in RFC 822 is surprisingly complex. Implementing validation with regular expressions somewhat pushes the limits of what it is sensible to do with regular expressions, although Perl copes well."
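To get a feel for the gap, here is a minimal Python sketch of the naive 'something@something.something' check. The pattern and the name `looks_like_email` are my own illustrative assumptions, not anything taken from Mail::RFC822::Address. It rejects two addresses that the RFC 822 grammar allows: one with a quoted local part containing a space, and one whose domain is a single label.

```python
import re

# Naive shape check: something@something.something
# (an illustrative assumption, NOT the Mail::RFC822::Address regex)
NAIVE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(address: str) -> bool:
    """Return True if the whole string fits the naive shape."""
    return NAIVE_EMAIL.fullmatch(address) is not None


samples = [
    "alice@example.com",        # ordinary address
    '"john doe"@example.com',   # quoted local part, allowed by RFC 822
    "user@localhost",           # single-label domain, allowed by RFC 822
]

for sample in samples:
    print(sample, looks_like_email(sample))
```

The naive check accepts only the first sample. Covering quoted strings, comments, domain literals and the rest of the grammar is what makes the full expression grow so large.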

For the life of me, I never imagined there could be a regex this long (and useful too). So what is the lesson here? Could we make do with simpler standards, which have a better chance of being implemented correctly and efficiently? I think so.

2 comments:

Ashwin said...: hahaha...when i saw that script on my Linux box ...i thought theres some erroneous display due to an invalid font...(just like what happens when u try to see Marathi/Hindi text on Orkut on a fresh Linux m/c)
:-))
Huh! mann ....i ve never written a script even 10% long as what he has! :-P
kewl stuff!; April 1, 2008 at 2:44 PM
Mohsin said...: And to think we call ourselves programmers :-D; April 1, 2008 at 3:29 PM