Email Address – wranglings

In a previous life, the need arose to validate email addresses, in a dB which had historically had very little email address formatting/checking controls in place, as such the email address data contained an awful lot of rubbish data, invalid emails mixed in with good email addresses.

The project goal was to allow those clients with a valid email address to use this email address for logon password reset; unfortunately there was no budget to introduce an email verification step, which would have been perfect.

The dB contained some 3M+ rows of client data, so once again I reverted to my SQL toolbox looking for options as well as a quick google search looking for a decent regex for email addresses (which proved harder than I initially thought it would do).

Email Address Parts

Email address rules are quite strange, the email address is split 4 parts (most people only count 3, but there are actually 4), as follows: –

constituent parts of the email address
  • Recipient Name – the recipient name represents an email mailbox normally belongs to:
    • A specific person
    • A mailing list
    • A department
    • A role within a company (such as sales or customer service)
  • @ Sign – required to separate the email recipient from the domain details.
  • Domain Name – the domain name is normally a specific mailbox provider or organisation
  • Top-Level Domain – top-level domains are the highest level of the domain name system and is placed after the domain name in an email address.
Email Address Rules

Email address rule are quite complex, I’ve done my best to explain them as easy as I can below…

Recipient Name – is a maximum of 64 characters long and consist of:

  • Uppercase and lowercase letters in English (A to Z, a to z)
  • Digits (0 to 9)
  • Special characters such as ! # $ % & ‘ * . + – / = ? ^ _ ` { } | (no spaces)

Although these special characters cannot appear as the first or last character of a recipient name or appear consecutively. The most commonly used special characters are the dot (.), underscore(_), hyphen (-) and plus sign (+).

  • Alternative special characters such as  ” ( ) , : ; < > @ [ \ ] 

These alternative special characters may be used but are generally avoided since they may be prohibited by some sending or receiving email servers.

Although recipient names are technically case sensitive, most organisations and mailbox providers tend to accept upper and lower case letters to indicate the same user (for example, PAUL.MILLER@domain.com is the same as paul.miller@domain.com).

Domain Name – is a maximum of 255 characters long (including Top-Level Domain) and consist of:

  • Uppercase and lowercase letters in English (A to Z, a to z)
  • Digits (0 to 9) 
  • Special characters such as – (hyphen) and . (dot)
  • A dot (.)  (used to identify a sub-domain; for example,  department.domain)

Top-Level domain – has the same rules as the domain name, some common examples include:

  • .com
  • .co.uk
  • .net
  • .org

there are many TLD’s, a somewhat comprehensive list can be found here.

Email Address Validation

So the easiest way to validate the email address, is to cut it into its constituent parts, using the SQL command split_part; this command does exactly what is says on the tin, it splits a source string into parts.

SPLIT_PART() function
split_part(<string>,<delimiter>, <part_number>)
The split_part function is used to split a source string based on a delimiter and pick out the desired part number from the string, starting from the left of the string.
e.g. SELECT split_part(‘paul.miller@domain.co.uk’,’@’,1); output ‘paul.miller’ (part 1)
e.g. SELECT split_part(‘paul.miller@domain.co.uk’,’@’,2); output ‘domain.co.uk’ (part 2)

After splitting into parts, then it’s a case of validating each part, I prefer to use a combination of SQL commands rather than rely on just a very complex regular expression.

Recipient Name, a simple regex with all the valid local characters; this regex handles the regular character set allowed in the local part. I’ve excluded the alternative characters as not all email software can handle all these characters (you can expand to include all, just be wary).

^[A-Z0-9_!#$%&'*+/=?`{|}~^.-]+

No leading, trailing, or consecutive dots, both the recipient and the top-level domain can contain one or more dots, but not consecutive dots; furthermore, the neither can start or end with a dot. The following regex is an enhancement of the above to restrict this.

^[A-Z0-9_!#$%&'*+/=?`{|}~^-]+(?:\.[A-Z0-9_!#$%&'*+/=?`{|}~^-]+)*

Domain Name, a simple regex to ensure the domain name has two to six letters, must include at least one dot, and the domain name after the last dot can only consist of letters. That is, the domain name must contain at least two levels, such as abc.com or abc.def.com.

The top-level domain, “.com” in the above examples, must consist of two to six letters. All country-code TLD (.us.uk, etc.) have two letters. The generic top-level domains have between three (.com) and six letters (.museum).

(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$

Putting it all together,

^[A-Z0-9_!#$%&'*+/=?`{|}~^-]+(?:\.[A-Z0-9_!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$

I’d appreciate your feedback, as this is one of first blog posts;
how does it read? is it pitched right? or too technical?
have you tried the code? does it work for you? should I include any other SQL formats?


** The SQL provided is from SNOWFLAKE, but the inbuilt SQL functions are available in other forms of SQL (PostgreSQL includes SPLIT_PART).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website with WordPress.com
Get started
%d bloggers like this: