Wednesday 18 May 2011

State of the Union

80%+ of a dataminers time is normally spent manipulating the data so they can get to a point where they can start extracting insights. This data preparation is actually the most important part of the process where you should be most diligent.

Here is another example (to add to the one in the previous post) that demonstrates you don't always get what you think you should get.

We have been supplied with two data sets of names and want to combine them into a single table. The UNION command will oblige us, but we have got to be very careful how we pose the question...

(This is MS SQLServer)





The result of this is not what we might think. Union seems to ignore the actual name of the field, so our firstnames and surnames are now all muddled up because they were in different column positions in each source table.

So to get what we really wanted, we can get the columns aligned first...



Don't assume the source data is consistent and the fields are in the same order in the database. Fortunately if you have lots of fields there is a way to check this...


If the field orders are not consistent, then it can be a pain explicitly typing out all the fields in your SQL statement to get the orders to match. Now there is also a time saver for this...



We can then just copy the result and paste it into our SQL to save us typing. 

No comments:

Post a Comment