Data Analysis – The Second Requisite

To recap, the first requisite of a competent data analyst is the ability to look.  On reflection, we might clarify this by saying that it’s the willingness to demonstrate the ability to look.  I mean let’s face it – if given sufficient and/or correct things to look at, anyone should be able to look, right?  Well what about this?  What about being willing to show that you can look?

Let’s take a short side-excursion and explore this notion, shall we?

Consider a kid.  First or second grade maybe, a little shy (not being entirely used to being out of the home for such long periods of time) but eager to make friends, and full of the idea that if he can do something that people will admire, friendships will occur.  So he takes the one skill that he’s been praised so much for (by his parents): let’s say it’s playing the violin.

He brings his violin to school one day and has the guts to stand up in front of the whole class and play a song.  The class sits silently, then from the back comes a stifled chortle.  From the side, a raspberry.  A titter from the pretty girl in the second row.  The teacher intervenes loudly, but the damage is done.  The kids don’t want to see talent, they want to see the latest fashion, the latest electronic gizmo, the (fill in the blank – it’s something that our burningly embarrassed protagonist doesn’t have).  So he sits down, puts his violin away, and resolves to never, ever show talent again.

Far-fetched?  Ask any K-12 teacher.  I dare you.

Point is, you can look all you want, but if you’re not willing to demonstrate your ability, you might as well have blinders on.

So now let’s go on.  Given that a person can look, what’s next?

Next is the second requisite of a competent data analyst: the ability to see.

Wait a minute.  The eyes are open, there’s something there in front of the guy’s face, there’s nothing wrong with his optic nerve, what do you mean, the ability to see?

Well, let’s do an experiment.  This evening, when you go to bed, bring a book with you.  A nice, thick book that you’ve been promising yourself to read for quite some time now, but never could get started on it.  Atlas Shrugged, maybe.  Whatever it is, start reading.  Eventually, if you’re like most people, you’ll reach the end of a page and suddenly realize that for whatever reason, your eyes have scanned every line but the words that you “saw” somehow never made it to your brain.

So did you “see” them?  Of course you did.

Or did you?

I’m not going to get into cognitive psychology in this discussion – merely point out that just because someone is observing does not mean that he is observant.

The consequences of having someone “see” data and not really see it are, of course, similar to the consequences of driving a train while texting.

I’ll give you an example.

There’s a well-known data aggregator that uses a procedure as part of their DQ arsenal that they call “stare and compare”.  Two data files are put side by side: on one side, the previous file and on the other,  the new one.  The analyst then pages through the two documents, eyes scanning from one to the other, until he’s satisfied that the data is satisfactory.  Any changes between the two files must be researched and explained in terms of known source data changes or known process changes, otherwise it’s back in the barrel for the production team.

So what happens?  The analyst’s eyes glaze over and God only knows what gets passed on to the customer.
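For what it's worth, the mechanical half of "stare and compare" can be roughed out in a few lines.  What follows is only a sketch under assumptions of my own: the two files are CSV extracts, they share a key column, and the sample rows are invented for illustration.  It lists what changed; it does not explain why, and that part is still the analyst's job.

```python
import csv
import io


def compare_files(previous_text, new_text, key):
    """Return the differences between two CSV extracts, matched on a key column."""
    previous = {row[key]: row for row in csv.DictReader(io.StringIO(previous_text))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_text))}

    differences = []
    for record_id in sorted(previous.keys() | new.keys()):
        old_row = previous.get(record_id)
        new_row = new.get(record_id)
        if old_row is None:
            differences.append((record_id, "added", None, new_row))
        elif new_row is None:
            differences.append((record_id, "removed", old_row, None))
        else:
            # Compare column by column, using the previous file's columns.
            for column in old_row:
                if old_row[column] != new_row.get(column):
                    differences.append(
                        (record_id, column, old_row[column], new_row.get(column))
                    )
    return differences


# Hypothetical sample data, invented for illustration only.
PREVIOUS = "id,name,amount\n1,Acme,100.00\n2,Bolt,250.50\n3,Crane,75.25\n"
NEW = "id,name,amount\n1,Acme,100.00\n2,Bolt,2505.0\n4,Dune,10.00\n"

for difference in compare_files(PREVIOUS, NEW, key="id"):
    print(difference)
```

Every line it prints still has to be researched and explained by somebody who is actually seeing it.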

As with the ability to look, the ability to see can be enhanced by training and practice.  And as with the ability to look, the ability to see is followed by yet another requisite for a competent data analyst.  We’ll look at that (and hopefully, see it too) next time.









Data Analysis – The First Requisite

Okay, as promised, we’re going to take a look at what it takes to do data analysis.

First though, let’s clear away some underbrush.

We’re not going to talk about programs here.  If you’ve been reading along, you’ll know that I’m not the sort of person who stands in awe of digital manipulation tools.  Sure, number-crunchers have their uses and give you results that you can’t get any other way, but in terms of getting f2f with the data, there is absolutely nothing in the world that beats actually getting f2f with the data.

It follows, then, that data analysis has something to do with data quality.  The relationship is obvious: if you can’t see what’s going on, your DQ efforts are going to go nowhere.  (On the plus side, of course, they’ll go nowhere fast, but that may not be the sort of upside you’re looking for.)

So what makes a good data analyst?  If anyone with two or three brain cells can be trained to use the tools, then the answer to this question does not lie in the realm of commercially available (or, for that matter, open source) programs.  It lies within the individual, and if it can be developed and enhanced, that development slash enhancement will probably not come in classrooms, seminars, or tutorials.

In my experience, the first requisite for data analysis is the ability to look.  Notice that this does not say what to look at, or what to look for.  Those are skills that can be learned and drilled to the point of competency.  The simple ability to look underlies those skills and forms the basis of all observation.

For whatever reason, individuals vary in their ability – or willingness – to confront what’s in front of them.  They sit in movie theaters and cover their faces with their hands and exclaim, “Oh, I can’t watch!”  People shy away from things they cannot confront.  They can’t look at chaos, at evil, at mayhem, and if chaos, evil, and mayhem do not form the matrix one sees when one looks at data, then there are others who similarly can’t look at tables of numbers, printed directions, and computer monitors.

Conversely, there are those who are able to look at what’s there without flinching, without experiencing an emotional reaction, and without selectively looking at, or looking for, that which coincides with pre-existent belief.  Those are the people who have the first attribute of a (potentially) successful DQ analyst.  Those are the people who have a foundation on which good data analysis practices can be built, and from whom effective data quality measures can flow.

Now this is a quick-and-dirty exposition, and there’s more to be said both about the ability to look and the other core competencies that form the complete foundation of competent DQ.  Next time, we’ll look at number two.




Side note on “madness”

There was such an outpouring of reaction to “Madness” that I felt I had to post a response before moving on.

Am I a Luddite?  Is it really true that I eschew electronic data processing in favor of cuneiform on clay tablets, smoke signals instead of email, carbon paper and slide rules instead of scripts and widgets?

No, not really.

The point I wanted to make in “Madness” is that it’s convenient – intentionally made so – to see electronic programs as the be-all and end-all of DQ efforts.  In reality and as with any other tool, ETL tools, analysis tools, reporting tools, and the suite to which these belong do have their limitations.  The most important of these limitations is that because we have the tools and use them, we think that we’re doing what can / should be done to ensure DQ in our products.

It’s an illusion.  A subtle illusion, but an illusion nonetheless.  Basic analysis skills (that have nothing to do with programs or, for that matter, data itself) underlie methodology and govern, to a huge degree, the results that we get from using the tools.

The take-away?  Anyone with reasonable intelligence can be trained in the use of the tools.  Not everyone, no matter how intelligent, has the ability – and the willingness to demonstrate that ability – to see what he’s looking at.

Next time, I promise, I’ll go into this in more depth.

There’s Madness to My Method

Before we swing into any sort of discussion of fixing DQ errors – much less finding the pesky little things – let’s take a look at the wonderful array of tools that are available to us.  These tools, every one of them electronic, virtually guarantee that we WILL have errors to find, and fix.

Wait!  What’s that?  The tools that we’re using to handle data quality issues actually CREATE those issues?  That’s crazy!

Or is it?


[pause for effect]


Alright now, settle down and let’s take a look at how this may be possible.  After all, if it IS possible then we, as DQ professionals, better know about it so we don’t all get snookered!

Let’s take a hypothetical scenario.  We have a data shop of some sort, and they have the latest and greatest Cadillac of all data manipulation programs.  I won’t name names since we all know what programs we’re talking about here – they’re the industry leaders, the “must-have-on-resume”s, the ones with the bells and the whistles that management can’t seem to do without.

Now this particular program has been through the development mill.  It’s been beaten and forged and hammered and tonged to within an inch of its binary little life, and by all accounts, even among the developers who were responsible for the hammering and tonging, it does a pretty good job.

Of course it has its limitations – after all, there will be a new version coming out in a year or two to fix some of those limitations – but by and large, it does as good a job as an electronic program can reasonably be expected to do.

Perfect?  No.

Pretty darn close?  You bet.  That’s why you bought it.

But wait.  If it’s not perfect, then there are situations – some known, some ready to rise up and bite you in the keister – where it will fail to one extent or another.  It’ll misplace a decimal point.  It’ll fail to round.  It’ll truncate.  It’ll do something unanticipated in some situation, most likely (a) when you’re least expecting it and (b) when the downside for failure is greatest.
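To make a few of those failure modes concrete, here is a minimal sketch.  The values are hypothetical and the helper names (truncate_to_cents, round_to_cents) are mine, not anything out of a real product; the point is only how quietly the same input turns into different outputs.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

CENT = Decimal("0.01")


def truncate_to_cents(value):
    """Drop everything past the second decimal place (no rounding)."""
    return value.quantize(CENT, rounding=ROUND_DOWN)


def round_to_cents(value):
    """Round to the nearest cent, with halves rounded away from zero."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)


price = Decimal("19.999")  # hypothetical value
print(truncate_to_cents(price))  # 19.99
print(round_to_cents(price))     # 20.00

# Binary floating point cannot represent 0.1 or 0.2 exactly, so the sum
# is not exactly 0.3.
print(0.1 + 0.2 == 0.3)  # False

# A misplaced decimal point: the same digits, scaled by the wrong power of ten.
print(Decimal("250.50") * 10)  # 2505.00
```

None of these raises an error or a warning.  The program runs, the numbers come out, and the only safeguard is somebody looking at them.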

The problem isn’t that the program has flaws.  Everybody knows that the program has flaws!  Just ask the developers!

The problem is that somebody somewhere thinks that it is flawless.  “Look, we ran it through the logic and the results speak for themselves,” they say, as if the Voice of the Machine can never stumble, stutter, or mispronounce a word.

The assumption of perfection – the eyes made glassy watching PowerPoint presentations – collides with the reality of oops, and the result is a train wreck.

Now that’s half the story.

The other half is the human element: the nut behind the wheel.  And again, it’s assumptions that make the whole thing collapse.  Let’s see how it works.

Back in our scenario, we now have a perfect program!  It works flawlessly in all circumstances and there won’t be another version out until Windows becomes Doors, or whatever the next OS paradigm is.  The program is sitting there, ready to launch, and at the keyboard sits Joe Collegegrad, head filled with stuff that he’s absorbed (more or less, between parties) at school.

Does he know how to run the program?  Yes.

Does he have the manual in case he gets into trouble?  Yes.

Does he know who to ask in case the manual doesn’t say what to do?  Yes.

Does he know where the little boys’ (or girls’) room is?  Of course!  He’s an Employee, and he’s ready to Rock and Roll!

So he launches the program and sets it to work, and we notice, sooner or later, that a curious thing has happened.  You see, it really doesn’t make any difference who this Employee is, how long and how well he was trained, how much experience he has – sooner or later, he’s gonna goof up.  He’ll sneeze and his little finger’ll hit the CONTROL key by mistake.  It’ll be something and all of a sudden, whether or not you know it, you have errors in the data.

And again, it’s the result of a collision between the assumption (that the guy went to school and had the experience and therefore knows what he’s doing) and the reality (that sooner or later, he’s gonna trip and fall) that throws the sand into the gears.

So what can be done?  Should we throw the programs away and fire the employees who run them?  Of course not.

We should – and here’s what we’ll be talking about next time – keep the limitations of the program and the employee in mind.  Don’t assume that just because you have the program, it’ll work flawlessly every time.  Don’t assume that the guy sitting at the keyboard will be perceptive enough to see what’s going wrong, as it happens.  Having a Program and an Employee puts one in a box: the box that’s wrapped in paper printed with the slogan “It’s all OK”.

If you don’t want to live in that box, you need to cultivate the skills needed to stay out of it.

That’s what we’ll talk about next time.






Corollary to the First Law of Data Quality

If data in its native environment is pristine, then it follows that the best way to “fix” errors in data is to find the point closest to native state, and put the fix in at that point.

Makes sense, doesn’t it?  If the data you see is corrupt, why put band-aid after band-aid on it, trying to make it look OK?  Why not find the point where things first went awry, and put the fix in at that point?

There’s a fallacy at work here that I’ll try my best to obliterate: the fallacy that everything that’s gone before was done right.   Yes, if everything that’s gone before – data identification, collection, and so forth – was done exactly right, then it’s completely justifiable to slap band-aids on your data until all you see is band-aids.

Fact is, though, that I’ve yet to see a data-driven enterprise in which data in its native state is completely and accurately represented in digital form.  It just doesn’t happen.  The minute you touch the data, in its native state, you introduce errors – by which I mean, deviations from what is there to begin with.

So following the logic, remediation begins at the point closest to data in its native state.

Now here’s the kicker.  I didn’t say “the first point after the data was collected”.  I said “the point closest to data in its native state”.  You may be selling your DQ efforts short if you limit your range to points in the production sequence north of data acquisition.  There may be opportunities to refine, for instance, the methods by which the data was collected in the first place so that errors that creep in at that point are minimized.  It may be possible to refine self-reported analog data south of the point of capture, to the same end.

As DQ professionals we hamstring ourselves if we limit the scope of our activities to points where our fingers and the data we work with come in contact.  Truthfully, we can achieve beneficial improvements before we see one single datum in our capture stream.

Think about it!







The First Law of Data Quality

If data in its native state is pure, then the First Law of Data Quality amounts to this: The more data is manipulated, the worse it gets.

Now obviously, this doesn’t speak to a standard of suitability, but rather to a standard of duplication of that which can be sensed or experienced in the real world.  It may well be that data, in order to be suitable for use, must be manipulated in one way or another.  If it is manipulated, though, a certain amount of error must be taken as a given.
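A tiny sketch of what "a certain amount of error" can look like, using one manipulation that everybody considers harmless: a unit conversion stored as a whole number.  The reading is hypothetical and the whole-number storage is an assumption of mine for the sake of illustration.

```python
def fahrenheit_to_celsius(degrees_f):
    """Convert and store as a whole number, as a narrow column might."""
    return round((degrees_f - 32) * 5 / 9)


def celsius_to_fahrenheit(degrees_c):
    """Convert back, again keeping only a whole number."""
    return round(degrees_c * 9 / 5 + 32)


original = 71  # hypothetical reading in degrees Fahrenheit
stored = fahrenheit_to_celsius(original)
restored = celsius_to_fahrenheit(stored)
print(original, stored, restored)  # 71 22 72
```

Two manipulations, each of them reasonable on its own, and the value that comes back is not the value that went in.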

Now in a little while I’m going to set out an example of this and show, step by step, how data degrades with manipulation.  For now, though, let’s take a look at a very simple example that illustrates how even the act of data collection in binary form is prone to the introduction of error.

Let’s look at a click.

John Q Public is sitting at his computer after dinner, noodling around here and there, and he sees a link.  Let’s say it’s a banner that advertises an internet storefront of some sort.  John is intrigued by the banner and clicks it.  Instantly (more-or-less) he’s transported to the storefront and certain data is collected from him.  This data can include the referrer (which site the link was located on), the time of click, and the IP address of John’s computer.

These are all important data, for from them (and other data) the webmaster of the storefront site can tell (a) how long John stayed on the site, (b) which pages he visited, and (c) whether or not he bought anything.  In terms of e-commerce, this last bit of knowledge is exceedingly important because it represents, if nothing else, a gauge of how effective the banner was in pulling paying customers into the store.

But there’s another layer to this data; one that’s even more important.  If the webmaster of the storefront site can connect John’s IP address with an email – which is often collected elsewhere on a storefront, as when someone registers to receive sale notices – and an individual’s name – ditto – then customized emailings can be sent directly to John from the storefront, enticing him with merchandise that based on his past purchases he might also want to buy.

Very sophisticated system.  All electronic, and therefore foolproof, right?  Well?  It’s all zeros and ones, right?  Disregarding the trivial instance of typos, is there really any room for error?  John’s IP address is captured, his email address is captured, and his name is captured.  It all goes into a database and it’s all electronic, with no fudge factor possible. Right?


John, you see, is just an average guy and he’s not particularly internet-savvy.  He’s got a wireless connection to the web (doesn’t everybody these days?) but when he set it up, he left it wide open. John’s little wireless network, which he uses to get out onto the web, is not password-protected.  And it just so happens that he’s got a neighbor, Sneaky Pete, who’s too poor or too cheap to get his own internet connection and ends up finding and using John’s.  Everything that John does can be linked to John’s IP address, but everything that Pete does can also be linked to John’s IP address because essentially, they’re both using the same internet connection.  John certainly doesn’t know about this, and he might not even care.

But back in the database of that storefront, associated with John’s name and IP address and email address, is the name of his neighbor, Pete!  As far as the webmaster of the site is concerned, Pete is the guy who made the purchases and Pete is the guy who gets the emails.  Except that the emails go to John’s email address and John gets all puzzled why he’s getting emails addressed to Pete all of a sudden.

Okay.  That’s the end of the example.  Data is there, it gets collected, and it’s wrong.
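Here is roughly how that plays out in the storefront's records.  This is a sketch only: the field names, the records, and the idea that the site joins everything on IP address are assumptions of mine, and the address is one reserved for documentation.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration. 203.0.113.7 is a
# documentation-only address (RFC 5737).
registrations = [
    {"ip": "203.0.113.7", "name": "John", "email": "john@example.com"},
]
purchases = [
    {"ip": "203.0.113.7", "name_on_order": "John", "item": "garden hose"},
    {"ip": "203.0.113.7", "name_on_order": "Pete", "item": "binoculars"},
]


def names_per_ip(purchase_rows):
    """Group the names seen on orders by the IP address they came from."""
    names = defaultdict(set)
    for row in purchase_rows:
        names[row["ip"]].add(row["name_on_order"])
    return names


def mailing_list(registration_rows, purchase_rows):
    """Naive join on IP address: every order inherits the registered email."""
    email_by_ip = {row["ip"]: row["email"] for row in registration_rows}
    return [
        (row["name_on_order"], email_by_ip[row["ip"]], row["item"])
        for row in purchase_rows
        if row["ip"] in email_by_ip
    ]


for entry in mailing_list(registrations, purchases):
    print(entry)

# Flag addresses that map to more than one person before trusting the join.
suspect = {ip for ip, names in names_per_ip(purchases).items() if len(names) > 1}
print(suspect)
```

The join works exactly as written, and Pete's order still lands in John's inbox.  The last two lines are the sort of check somebody has to think to run.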

(Of course, in real life John’s email address is sold to other parties, so now John finds his emailbox getting cluttered up with advertising from people he knows he’s never dealt with.)

Stay tuned: next time we’ll walk through an example that shows how really big, serious decisions can be made based on data that’s got more holes in it than Albert Hall.





Data in its native state is flawless

This is the rock on which the church is built, ladies and gentlemen.  Data in its native state is flawless.

By this, of course, I mean “that which can be sensed or experienced in the real world is what it is, independent of subsequent manipulation”.  Now it’s true, what I’m talking about here isn’t actually DATA – you will not, ever, see a wheelbarrow full of data sitting out in the middle of a field, rolling down a highway, or voting.  What we consider “data” is an abstract representation of something that can be sensed or experienced, though, and it is in the sensing or experiencing that “what’s out there” begins to be seen as something more than an existential phenomenon.

Let’s take an example.

Pitkin County, Colorado is heavily forested: its area includes one National Forest and four named wilderness areas.  From the air, much of Pitkin County looks like this:

This picture (from Google Maps) is data. It is a direct representation of something that can be sensed or experienced in the real world.

This particular type of data can be pretty, but most of the time not really useful.  When we manipulate it, though, it gains significance and utility.  For instance, someone could take this picture, measure the sides and count the trees, and derive a number that would represent “trees per acre”. This would be information: it has significance brought about by aligning the original data to an external environment.

Assuming that this picture is of private land, the owner could take that information and combine it with other information (such as the price of lumber) and estimate how much the timber on his land is worth. This is knowledge.

Based on this knowledge, the owner could then make a business decision whether to harvest his timber and sell it, or whether to let it mature for another season. Making that decision is the reason why the owner needed the data in the first place. Without the data, he would not have had the information needed to give him the knowledge that would enable him to make his business decision.
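The chain from data to information to knowledge is just arithmetic, and a short sketch may make it easier to follow.  Every number here is hypothetical (the plot dimensions, the tree count, and the dollar value per tree are mine, picked so the sums come out even); only the size of an acre is a real figure.

```python
SQUARE_FEET_PER_ACRE = 43_560  # definition of the acre

# Every number below is hypothetical, chosen only to make the arithmetic easy.
side_a_feet = 1_320
side_b_feet = 660
trees_counted = 2_400
value_per_tree_dollars = 85.0


def trees_per_acre(tree_count, length_feet, width_feet):
    """Information: a tree count aligned to a land-area measurement."""
    acres = (length_feet * width_feet) / SQUARE_FEET_PER_ACRE
    return tree_count / acres


def timber_value_per_acre(density, dollars_per_tree):
    """Knowledge: the information combined with a price."""
    return density * dollars_per_tree


density = trees_per_acre(trees_counted, side_a_feet, side_b_feet)
print(density)  # 120.0
print(timber_value_per_acre(density, value_per_tree_dollars))  # 10200.0
```

Notice that the decision at the end is only as good as the count at the beginning: miscount the trees or mismeasure a side, and every figure downstream inherits the error.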

The data itself – the picture – is a representation of something – the trees – which exist in the real world.  You can go out there in the middle of the forest and put out your hand and feel the bark.  You can hear the birds above and your feet could crunch the dead leaves below.  It’s a forest, it’s real, and before anybody takes a picture of it or otherwise represents it for a decisionmaking process, it is flawless.

There are no errors in data when the data exists in its native state.