Importance of Pursuing Ignorance in Data Science

According to Gartner, through 2022, only 20% of analytic insights will successfully deliver business outcomes. This is in line with the 80-85% failure rate often cited for data science projects. I would like to propose that the cause of this high failure rate is an inability to pursue thoroughly conscious ignorance.

The inspiration for this article is Stuart Firestein’s discussion of the perception versus the pursuit of science and the importance of ignorance. I highly recommend his book Ignorance: How It Drives Science as well as his TED Talk.

Stuart Firestein: The pursuit of ignorance

Perception vs Pursuit of Science: 

It is fairly common to think of science as working in one of the following ways: putting together the pieces of a puzzle, peeling away the layers of an onion, or seeing the tip of an iceberg. Firestein disagrees with this perception of how science works because each metaphor implies an endpoint (a complete picture, a core, the whole iceberg). Instead, he suggests:

Rather it’s like the magic well: no matter how many buckets of water you remove, there’s always another one to be had. Or even better, it’s like the widening ripples on the surface of a pond, the ever larger circumference in touch with more and more of what’s outside the circle, the unknown.

As knowledge grows, so does the unknown. Like many great minds before him (from Socrates to Marie Curie), he reminds us that ignorance is a scientist’s greatest asset.

But not every kind of ignorance should be embraced:

One kind of ignorance is willful stupidity; worse than simple stupidity, it is a callow indifference to facts or logic. It shows itself as a stubborn devotion to uninformed opinions, ignoring (same root) contrary ideas, opinions, or data.

But there is another, less pejorative sense of ignorance that describes a particular condition of knowledge: the absence of fact, understanding, insight, or clarity about something. It is not an individual lack of information but a communal gap in knowledge. It is a case where data don’t exist, or more commonly, where the existing data don’t make sense, don’t add up to a coherent explanation, cannot be used to make a prediction or statement about some thing or event. This is knowledgeable ignorance, perceptive ignorance, insightful ignorance. It leads us to frame better questions, the first step to getting better answers. It is the most important resource we scientists have, and using it correctly is the most important thing a scientist does. James Clerk Maxwell, perhaps the greatest physicist between Newton and Einstein, advises that “Thoroughly conscious ignorance is the prelude to every real advance in science.”

Importance of Pursuing Ignorance

So, how do scientists pursue ignorance?

By using the seemingly structured scientific method: proposing a hypothesis and devising an experiment to test it. But this is another misunderstanding; in practice, the scientific method can be very messy.

As the Princeton mathematician Andrew Wiles describes it: It’s groping and probing and poking, and some bumbling and bungling, and then a switch is discovered, often by accident, and the light is lit, and everyone says, “Oh, wow, so that’s how it looks,” and then it’s off into the next dark room, looking for the next mysterious black feline.

While most of us think of the scientific method as leading from Ignorance to Knowledge, Firestein says that the reverse is true. Real science starts with the facts and edges toward the unknown, the uncertain. It actually goes from Knowledge to Ignorance.

Science produces ignorance, and ignorance fuels science. There is even a quality scale for ignorance: we judge the value of science by the ignorance it defines.

Ignorance in Data Science

The science of data and analytics is no different. It is an iterative process of asking better and better questions. The Data & Analytics Maturity of an organization is reflected in its ability to accept the unknowns and uncertainties. More mature organizations let their ignorance fuel the business questions. However, such organizations are rare.

An obvious source of ignorance in data science projects is low-quality data: missing, inaccurate, unreliable, incomplete, or irrelevant. This “bad” or “dirty” data often takes the blame when a project fails to give the “right answer”. Blank cells, outliers, and high-variance fields are the stuff of nightmares for an analytics team.

To increase the level of data and analytics maturity, one needs to face and embrace such ignorance. Instead of labeling the data as “bad” and tossing it aside, qualify and quantify this ignorance: the incompleteness, irrelevance, unreliability, and uncertainty of the data, the analysis, and the model.
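
As a minimal illustration of what “quantifying this ignorance” could look like in practice, here is a sketch (Python/pandas) that profiles a toy dataset for incompleteness, outliers, and relative variation, producing an “ignorance profile” rather than a verdict of “bad data”. The example dataset, metrics, and the 1.5×IQR outlier rule are illustrative assumptions, not a prescribed method.

```python
# A sketch of an "ignorance profile": quantify incompleteness, outliers, and
# relative variation per column instead of discarding "bad" data.
# The example data and the 1.5*IQR outlier rule are illustrative assumptions.
import numpy as np
import pandas as pd


def ignorance_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-column incompleteness and uncertainty for numeric columns."""
    rows = []
    for col in df.select_dtypes(include=np.number).columns:
        s = df[col]
        missing_rate = s.isna().mean()                        # incompleteness
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        outlier_rate = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).mean()
        cv = s.std() / abs(s.mean()) if s.mean() else np.nan  # relative variation
        rows.append({"column": col,
                     "missing_rate": missing_rate,
                     "outlier_rate": outlier_rate,
                     "coeff_of_variation": cv})
    return pd.DataFrame(rows)


# Toy example: one suspicious spike and a couple of blank cells
df = pd.DataFrame({"revenue": [100, 102, np.nan, 98, 450],
                   "units": [10, 11, 9, np.nan, 12]})
print(ignorance_profile(df))
```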

Additionally, instead of judging the success of a project on its ability to give one “right” answer, it must be judged on the quality of the questions it raises. The opportunities for improvement it uncovers. The new ripples it generates. In terms of business value, insight is not one right answer. It is an enhancement in one’s ability to discern a situation more clearly.

Data quality is just one example of where organizations deplore ignorance. To identify other sources of ignorance and to explore Maxwell’s idea of thoroughly conscious ignorance further, let’s use the Johari Window.

Johari Window


Johari Window, Adapted from Luft (Of Human Interaction, 1969)

This 2×2 matrix is most frequently used as a self-awareness tool in team-building activities. It is named after the two psychologists who first proposed it at a conference in the 1950s: Joseph Luft and Harrington Ingham. It models what is known and unknown to self and to others within its four panes:

  • Open Area: Behaviors, feelings, and motivations known to myself and to others (public)
  • Blind Area: Behaviors, feelings, and motivations known to others but not to myself (unaware)
  • Hidden Area: Behaviors, feelings, and motivations known to myself but not to others (private)
  • Unknown Area: Behaviors, feelings, and motivations known to neither (potential)

Adapting this to a data science project (a code sketch of this mapping follows the figure below):

  • Open Area: Well-defined business question, relevant data sources, validated assumptions, known data findings, established data definitions
  • Blind Area: Business knowledge not known or available to the data thinker; expectations and deliverables of the project not made explicit to the data thinker
  • Hidden Area: Statistical details (black-box models, missing measures of variation, missing measures of uncertainty, data quality) not shared by the data thinker with the stakeholders
  • Unknown Area: Unrecognized biases (cognitive, statistical), unexplored data sources, unexpected business challenges, logical fallacies, invalid assumptions

Johari Window for Data Science Projects, Adapted from Luft (Of Human Interaction, 1969)
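
To make this mapping concrete, here is a minimal, purely illustrative Python sketch that treats the four panes as a project “knowledge audit”. The pane contents, and the moves that model feedback and disclosure, are hypothetical examples, not a prescribed checklist.

```python
# A sketch of the Johari Window panes as a project "knowledge audit".
# Pane contents are hypothetical examples for illustration only.
from dataclasses import dataclass, field


@dataclass
class JohariWindow:
    open_area: list = field(default_factory=list)     # known to data thinker and stakeholders
    blind_area: list = field(default_factory=list)    # known to stakeholders, not the data thinker
    hidden_area: list = field(default_factory=list)   # known to the data thinker, not stakeholders
    unknown_area: list = field(default_factory=list)  # known to neither (potential)


project = JohariWindow(
    open_area=["agreed business question", "established data definitions"],
    blind_area=["seasonality the sales team already knows about"],
    hidden_area=["model assumes linearity", "20% of usage records are missing"],
    unknown_area=["unexplored data sources", "unrecognized cognitive/statistical biases"],
)

# Disclosure: sharing a statistical detail moves it from the Hidden to the Open area
project.open_area.append(project.hidden_area.pop(0))

# Feedback: learning a business fact moves it from the Blind to the Open area
project.open_area.append(project.blind_area.pop(0))

print(project.open_area)
```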

By sharing expert knowledge (business, statistical, programming), asking questions, and seeking and providing feedback, the Blind and Hidden areas can be reduced relative to the Open and Unknown areas. This increases data maturity and the chance of success because:

  • Reducing the Blind area by clarifying project expectations and success measures helps mitigate the peak of expectations.
  • Reducing the Hidden area promotes data literacy (as sketched below).
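
As one illustration of shrinking the Hidden area, the sketch below (with made-up numbers) reports a bootstrap confidence interval alongside the point estimate, so the uncertainty the data thinker already knows about becomes visible to stakeholders as well.

```python
# A sketch of surfacing uncertainty instead of hiding it: report a bootstrap
# confidence interval alongside the point estimate. The data are made up.
import numpy as np

rng = np.random.default_rng(42)
observed_lift = rng.normal(loc=0.03, scale=0.05, size=200)  # hypothetical per-user lift

point_estimate = observed_lift.mean()

# Resample the mean to expose how uncertain it is
boot_means = [rng.choice(observed_lift, size=observed_lift.size, replace=True).mean()
              for _ in range(5000)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"Estimated lift: {point_estimate:.3f} (95% CI: {low:.3f} to {high:.3f})")
```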

In the adaptation above, I have placed the Data Thinker on one axis. This model can actually promote open communication among all members of the team, e.g., Domain Expert and IS Expert, Data Analyst and Data Quality Manager, Manager and Leader, etc. The goal is to encourage everyone on the team to uncover and fill gaps in their knowledge.

Three of the four panes of the Johari Window deal with the unknown. I think this makes it an especially effective model for communication and critical thinking because it relieves the pressure of having to know the answers. It makes it okay not to know. This can spark intellectual humility and curiosity in the team. Combined with Socratic questioning, this framework becomes even more effective.

Luft theorizes that reducing the Blind and Hidden areas expands the Open area.

However, we are applying this model to science, REAL science, where knowledge leads to ignorance!

So, growth in the Open area calls for growth in the Unknown area. As the discussion leads to the sharing of known facts, the Open area expands. These facts should then prompt unanswered questions, embracing and expanding the Unknown area.

Destined to fail…

Data science projects that naively place their business question in the Unknown area and then aim to shrink that area by analyzing the data will fail.

To be successful:

  • The business question must reside in the Open area.
  • Good questions and an open mind must shrink the Blind and Hidden areas.
  • The team must expect and accept the Unknown area to grow in size.
  • The team must be comfortable with “groping and probing and poking, and some bumbling and bungling” in the dark until a switch is found.

An appeal:

In his book, Firestein addresses the need to revise our education system:

Instead of a system where the collection of facts is an end, where knowledge is equated with accumulation, where ignorance is rarely discussed, we will have to provide the Wiki-raised student with a taste of and for boundaries, the edge of the widening circle of ignorance, how the data, which are not unimportant, frames the unknown.

I would like to extend this eloquent appeal to the way most organizations think of the knowns and unknowns in Data Science.  

As a project progresses through various stages (exploring, analyzing, optimizing, delivering, revising), the size of the panes will vary for all stakeholders. It is crucial to bring everyone to the table frequently and keep the Blind and Hidden areas in check. It is even more important to acknowledge the Unknown area. This is the pane from which opportunities to advance and excel will emerge. It should not be ignored, feared, or looked down upon.

This is where analytically literate and mature organizations thrive.

This is the pane of innovation and even disruption!