So we all know the great benefits of Excel, it’s easy to use, easy to get up to speed, very powerful, quite easy to control, widely used and therefore widely available, but there are also pitfalls and these pitfalls are easy to fall into without realising.
Recently, while doing some address clean-ups, I came across an interesting little piece of data stored in address line 1, which contained ’02-Jan’, and upon further investigation this particular anomaly and similar ones, turned out to be quite prevalent within the address data I was cleaning.
This and the others, turned out to be Scottish addresses (see the example below), whereby the ‘/’ is used to designate a floor/flat, so 2/1 would be 2nd Floor, 1st Flat (or Flat 1).
So where did ’02-Jan’ come from, simply put the incoming data file (probably a .csv file) was at some point after extraction was opened and subsequently saved in Excel, which in true excel style, was ‘trying to be helpful’ decided to convert the 2/1 to a date, which was ’02-Jan’.
Excel is awesome, but be careful
The easiest way to sort this is to build the extracted data with every string enclosed in quotes (this will also ensure commas embedded in data, like addresses will be treated correctly), but this isn’t always feasible with some data extracts, as such instead of loading the data into excel, you should import the file; which allows you to import the data as text; the following shows the process using Excel for Mac
Import the file (either using Data > Get External Data > From Text, or File > Import > CSV/TXT)
Select the file to be imported
Select either Delimited or Fixed Width (depending on your file type), followed by delimiter (if required), then select ALL columns (its safer to select ALL) and change the type to TEXT
As you can see the data loaded now into column A stays as it should be, rather than Excel trying to be clever and changing it…in the case below it would be 0/1, 02-Jan, 01-Jan and 0/1.
The same technique above will also correct some other Excel quirks, including:
- Preserve Leading Zero’s (which are dropped by Excel when loading normally)
- Formatting of Decimals (again dropping zeros, but this time trailing zeros)
- Large Numbers are converted to Scientific Notation
I’d appreciate your feedback, as this is one of first blog posts;
how does it read? is it pitched right? or too technical?
have you tried the code? does it work for you? should I include any other SQL formats?
* Excel used is Excel for Mac
** Any SQL provided is from SNOWFLAKE, but the inbuilt SQL functions are available in other forms of SQL.