Phone Number – wranglings

On previous project a few years ago, I had the unenvious job of validating mobile numbers, not just UK mobile numbers, but worldwide; and in a legacy banking database with some data from the late 1990’s.

The end goal was to allow those clients with a valid mobile to use this mobile number for MFA (multi-factor authentication) via a SMS/MSG step-up verification. The dB contained some 800K+ rows of client data, so I reverted to my SQL toolbox looking for patterns.

Rather than dealing with all mobiles individually, which would have been impossible, I needed a reliable way to separate valid numbers, valid numbers disguised as invalid and the downright invalid numbers – along came patterns – by translating characters into ‘a’ (for alpha) and numbers into ‘n’ (for numeric) – I converted some 800K+ mobile phone numbers into a list of 560 different patterns, of which114 were phone based (using a regex – see below, the others we’re completely invalid, including “DO NOT CALL”, “NO MOBILE”, “N”, “N/A”, etc.

-- Mobile REGEX --
--    07nnnnnnnnn, 07nnn nnn nnn, 07nnn-nnnnnn
--    +447nnnnnnnnn, +447nnn nnn nnn, +447nnn-nnnnnn
--    (0)7nnnnnnnnn, (0)7nnn nnn nnn, (0)7nnn-nnnnnn
--    +44(0)7nnnnnnnnn, +44(0)7nnn nnn nnn, +44(0)7nnn-nnnnnn
--    00447nnnnnnnnn, 00447nnn nnn nnn, 00447nnn-nnnnnn
--    0044(0)7nnnnnnnnn, 0044(0)7nnn nnn nnn, 0044(0)7nnn-nnnnnn
SELECT
CASE WHEN Mobile_Phone REGEXP ('^(07\\d{3}|\\+447\\d{3}|\\(07\\d{3}\\)|\\(0\\)7\\d{3}|\\+44\\(0\\)7\\d{3}|00447\\d{3}|0044\\(0\\)7\\d{3}|\\(07\\)\\d{3}\\s)(\\d{6}|\\s\\d{3}\\s\\d{3}|\\-\\d{6})$') 
          THEN 'UK Mobile'
     WHEN TRIM(TRANSLATE(Mobile_Phone,' .-()','')) REGEXP ('^07\\d{9}$') 
          THEN 'UK Mobile'
     ELSE 'False' END AS valid_mobile
FROM client

07789123456 becomes nnnnnnnnnnn (length 11)
447789123456 becomes nnnnnnnnnnnn (length 12)
+447789123456 becomes +nnnnnnnnnnnn (length 13 starting with +)
00447789123456 becomes nnnnnnnnnnnnnn (length 14 starting 00)

This may seem pretty useless, as all the above are valid mobile numbers, but people enter mobiles in the strangest of ways, none of which are really valid, especially for an autodialler…
(0)7789 123 456 or +44 (0) 7789 123456 or 07789-123-456 or 07789.123.456 and some even add text 07789123456 (MOBILE) or 07789123456(BROTHER).

Patterns will highlight these anomalies, as well as other gems in the mobiles.

Mobile Phone patterns reduced 800K+ mobiles into just 114 patterns

SQL has some really useful tools built in, the tool used for the translation of these numbers into their underlying patterns, is TRANSLATE.

TRANSLATE() function
translate(<source_string>,<matching_string>, <replace_string>)
The translate() function is used to translate character in the source_string, matching the matching_string with a characters in replace_string.
e.g. SELECT translate(‘translate’, ‘rnlt’, ‘123’); output ‘1a2s3ae’

tfound ‘t‘ in the matching_string, but nothing in the replace_string
r1found ‘r‘ in the matching_string, and corresponding ‘1‘ in the replace_string
aano ‘a‘ found in the matching_string
n2found ‘n‘ in the matching_string, and corresponding ‘2‘ in the replace_string
ssno ‘s‘ found in the matching_string
l3found ‘l‘ in the matching_string, and corresponding ‘3‘ in the replace_string
aano ‘a‘ found in the matching_string
tfound ‘t‘ in the matching_string, but nothing in the replace_string
eeno ‘e‘ found in the matching_string
Explanation of the example TRANSLATE function

Creating the Phone Number Pattern

RAW DATA: Sample Database with 30 example clients, and a varying set of Mobile Phone Numbers

The following code block converts all numeric data within the mobile_phone field into ‘n’, then counts each pattern and sorts by the highest occurrence at the top.

-- Mobile Patterns --
SELECT TRANSLATE(mobile_phone,'1234567890','nnnnnnnnnn') as Mobile_Pattern, COUNT(*)
FROM client
GROUP BY Mobile_Pattern
ORDER BY count(*) desc
PATTERN DATA: Sample Database with 30 example clients, converted into 19 patterns.

Taking it one step further, you can also use this to tidy up phone data, removing brackets, dashes, dots and spaces which confuse the numbers, results in 19 patterns becoming 12.

-- Mobile Patterns with Special Characters --
SELECT TRANSLATE(mobile_phone,'1234567890() -.','nnnnnnnnnn') as Mobile_Pattern, COUNT(*)
FROM client
GROUP BY Mobile_Pattern
ORDER BY count(*) desc
By removing the Brackets, Spaces, Dashes and Dots, we’re down to 12 patterns.

Although the above does not remove the need to validate the data, it’s a good start and therefore makes the size of the challenge more manageable; and also means that recognising valid numbers is a lot easier.

There maybe other methods that people use to accomplish a similar effect, I just needed to cut down the data into manageable chunks. After the above was completed, I can easily see invalid mobiles, you can also replace all the numeric data with ‘n’, and the alpha data and special characters with ”; therefore leaving nothing but n’s…which easily shows those phone numbers which are just plain invalid.

I’d appreciate your feedback, as this is my first blog post;
how does it read? is it pitched right? or too technical?
have you tried the code? does it work for you? should I include any other SQL formats?


** The SQL provided is from SNOWFLAKE, but the inbuilt SQL functions are available in other forms of SQL.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website with WordPress.com
Get started
%d bloggers like this: