Thursday, March 29, 2012

How Useful is Additional Information?

As informed and learned people, we are exposed to large amounts of data and information all the time.  Some of these data are incomplete or noisy; some are confusing and contradictory; and some are misleading or downright wrong.  Over time, with some filtering and learning, we form our own (biased) opinions and apply our judgment to events of interest.  While the intent and process may seem innocuous, our biases can and sometimes do have unintended or even horrific consequences.  Take the recent news from Sanford, Florida as an example.  Trayvon Martin, an unarmed 17-year-old, was shot dead by a neighborhood watch volunteer, George Zimmerman.  While many details and facts are still unavailable to the public, we do know that the tragic event started when George Zimmerman, according to his account, saw the hooded teen in the neighborhood and thought “he looks suspicious”.

In his recent book Thinking, Fast and Slow, Professor Daniel Kahneman, 2002 Nobel Laureate in Economics, discusses how our brains struggle to balance quick intuition and slow analysis when making judgments.  We are proud of the built-in survival instincts and life experience that shape our quick intuitive responses.  We insist that, although such responses are not always correct, they are time-tested and often sound.  We are perfectly comfortable having such an efficient way of cutting through complex reality to get to the heart of a matter, and we can’t imagine how we could live without it.  On the flip side, words such as profiling, stereotyping, bias, and prejudice are often used to describe the same intuitive conclusions.  As we live in this information age and are constantly bombarded with noisy information and news, one must wonder how good our gut reactions are and how wrong they could be.

In the chapter “Causes Trump Statistics” of that book, Professor Kahneman gives the following example:  “A cab was involved in a hit-and-run accident at night.  Two cab companies, the Green and the Blue, operate in the city.  You are given the following data: 85% of the cabs in the city are Green and 15% are blue.  A witness identified the cab as Blue.  The court tested the reliability of the witness under the circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time.”  He then asks us to answer intuitively the question “What is the probability that the cab involved in the accident was Blue rather than Green?”

I don’t know about you, but if I had to answer intuitively, I would probably bet that the taxi involved was Blue, given the seemingly solid witness account.  In real life, each of us runs into this type of situation from time to time and offers a guess freely.  But what if the answer matters (e.g., you are a juror in such a case in court), and what if we are wrong?

So, what are the odds that I would win my bet in the absence of any further information?  In this example, we were first given the fact that only 15% of the taxis in this city are Blue.  If we knew nothing else, most of us would probably agree that a reasonable guess is a 15% chance that a taxi involved in an accident is Blue.  The challenge comes when we are given additional information, since we then believe we know a lot more about the event.

In this example, we are told that there is an eyewitness who said the taxi involved was Blue.  After all, the witness is 80% reliable, which seems pretty good.  In a different setup, we might be told that, based on historical data, Blue taxis are much more prone to accidents.  The core issue remains the same, however: how can we properly incorporate such additional information?

About 250 years ago, the 18th-century English mathematician Thomas Bayes addressed this very issue and developed an approach for updating one’s beliefs when given new evidence.  His work is fundamental to the theory of probability and statistics, and his name lives on in Bayes’ Theorem, Bayes’ law, Bayesian statistics, and Bayesian inference.  If we apply Bayes’ Theorem to our taxi problem as Professor Kahneman suggests, we can update our estimate as follows once we are given the second piece of information, that an eyewitness identified the taxi as Blue.

The calculation is based on the simple observation that the probability of both {the taxi is Blue} and {eyewitness thinks it is Blue} can be expressed in two ways, using prior or posterior probabilities.  It is equal to the product of the posterior probability of {the taxi is Blue, given eyewitness thinks it is Blue}, which is the quantity of interest, and the probability of {eyewitness thinks it is Blue}.  It can also be expressed as the product of the probability of {eyewitness thinks it is Blue, given the taxi is Blue}, which is 0.8, and the prior probability of {the taxi is Blue}, which is 0.15.
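
In symbols, the two ways of writing this joint probability look as follows.  Here B stands for {the taxi is Blue} and W for {eyewitness thinks it is Blue}; the letters are just labels chosen for convenience, not notation from the book.

```latex
P(B \cap W) = P(B \mid W)\,P(W) = P(W \mid B)\,P(B),
\qquad P(W \mid B) = 0.8, \quad P(B) = 0.15 .
```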

But the eyewitness could be wrong.  The eyewitness may think the taxi is Blue either because it is actually Green and he mistook it for Blue (a 20% chance) or because it is indeed Blue and he got it right (an 80% chance).  Thus the probability of the evidence, that the eyewitness thinks it is Blue, is 0.20x0.85 + 0.80x0.15, or 0.29.  Therefore the posterior probability of our interest that the taxi is indeed Blue given eyewitness thinks it is Blue would be the ratio of 0.80x0.15 over 0.29, or 41%.  In other words, with the additional evidence provided by the eyewitness, the chance that the taxi is Blue has risen almost threefold, from 15% (based on the prior statistic) to 41%.  However, the resulting probability is still smaller than the obvious reference figure of 50%, the toss of a fair coin, which is what you would guess if you had absolutely no information other than the fact that there are two colors of taxi in the city.
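
For readers who like to check the arithmetic, here is a minimal sketch of the same update in Python.  The function name posterior_blue and its parameter names are made up for this illustration; only the numbers 0.15 and 0.80 come from the example.

```python
def posterior_blue(prior_blue, reliability):
    """P(taxi is Blue | witness says Blue), by Bayes' theorem.

    prior_blue  -- share of Blue taxis in the city (0.15 in the example)
    reliability -- chance the witness names the right color (0.80)
    """
    # Witness says Blue and is right: the taxi really is Blue.
    right = reliability * prior_blue
    # Witness says Blue but is wrong: the taxi is really Green.
    wrong = (1 - reliability) * (1 - prior_blue)
    # Probability of the evidence: the witness says Blue either way.
    says_blue = right + wrong
    return right / says_blue


print(round(posterior_blue(0.15, 0.80), 4))  # 0.4138, i.e. about 41%
```

Changing either argument shows how sensitive the answer is to the share of Blue taxis and to the quality of the witness.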

What the example illustrates is that additional information (here, the eyewitness account) may not give us nearly as much as we thought it would.  That is, we tend to give such information more significance and weight than it really deserves.  In this example, for the updated estimate that the taxi is Blue to be better than 50% (a totally uninformed random guess), the eyewitness needs to be more than 85% reliable.  And if we want to be 95% sure, the eyewitness needs to be better than 99% reliable.
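
The reliability needed for a given level of confidence can be found by turning Bayes' formula around and solving for the witness reliability.  The sketch below does that, assuming as in the example that the witness is equally reliable for both colors; the function name required_reliability is invented for this illustration.

```python
def required_reliability(prior_blue, target_posterior):
    """Witness reliability r at which P(Blue | witness says Blue)
    equals target_posterior.

    Obtained by solving
        target = r * p / (r * p + (1 - r) * (1 - p))
    for r, where p is the prior share of Blue taxis.
    """
    numerator = target_posterior * (1 - prior_blue)
    denominator = numerator + (1 - target_posterior) * prior_blue
    return numerator / denominator


for target in (0.50, 0.95):
    needed = required_reliability(0.15, target)
    # With 15% Blue taxis: about 0.85 for a 50% target,
    # and about 0.991 for a 95% target.
    print(f"target {target:.2f}: reliability must exceed {needed:.3f}")
```

With a low share of Blue taxis, the required reliability climbs quickly as the target confidence goes up.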

You can easily create examples like this with other subjects and numbers.  Perhaps you are visiting an area and were told in advance that the minority population there is 15%.  Or perhaps you were told that a group fitting a certain profile is known to be responsible for 80% of the crimes in that area.  Suppose a crime is reported and a witness identifies the perpetrator as a member of the minority or as fitting the profile of that group.  What is the chance that the crime was indeed committed by such a person, as the eyewitness claimed?

When George Zimmerman of Sanford thought Trayvon Martin looked suspicious and began to follow him with a handgun at his waist, the encounter ended with the tragic shooting death of the teen.  When law enforcement and intelligence agencies collect statistics and profile groups, and when institutions collect statistics about college admission of minority or under-represented groups, shouldn’t we all be concerned about how the data are used and whether there is a rush to judgment?  Shouldn’t we all be worried about making proper inferences when we pick up new knowledge?

Talk to you soon!


Anonymous said...

Thanks for enlightening us on the taxi problem.

I thought it would have been clearer if you could have shown the mathematical formulas and workings to illustrate the solution.

iFROG said...

The answer was given in the 4th to the last paragraph that "... Therefore the posterior probability of our interest that the taxi is indeed Blue given eyewitness thinks it is Blue would be the ratio of 0.80x0.15 over 0.29, or 41%". The derivation was suggested in the preceding paragraph. What we want to find is the (posterior) probability of {the taxi is Blue, given eyewitness thinks it is Blue}. The product of this posterior probability and the probability of {eyewitness thinks it is Blue} (which is 0.29) is equal to the product of the probability of {eyewitness thinks it is Blue, given the taxi is Blue} (which is 0.8) and the prior probability of {the taxi is Blue} (which is 0.15), so we can calculate the answer knowing three of the four terms in this equality. Perhaps the more intricate part of the derivation is recognizing that the eyewitness can think the taxi is Blue either because 1) he/she is correct and the taxi is Blue, or 2) he/she is wrong and the taxi is Green. Hope this helps.