There is a popular saying: “There are three kinds of lies: lies, damned lies, and statistics.”
Well, as an MA student trying to pursue a career in data journalism, I don’t entirely agree with the statement. But the phrase, popularized by the great Mark Twain, is a reminder that when dealing with numbers a good dose of skepticism and critical thinking is imperative.
Recently, while analysing datasets, I experienced first-hand how easy it is to draw incorrect conclusions when you manipulate numbers without enough thought.
The first dataset was about HIV in the UK. After Nigel Farage’s controversial statement, I wanted to check the incidence of new HIV infections acquired while patients were in the UK.
Proportion: a basic and useful concept of statistics
I downloaded a dataset from Public Health England (PHE) covering the last nine years of HIV infections and care in the UK.
The dataset specifies the proportion of infections acquired while patients were in the UK per category: men who have sex with men, heterosexual contact, injecting drug use and “others”. My goal was to find out if there was a trend in HIV infections and, if so, what it was.
I started my analysis focusing on the number of infections per category per year. Then problems appeared.
When looking at the numbers, there was a rise in the number of gay men who had been infected in the UK over the years and a sharp drop in the number of heterosexual people who had contracted the virus while in the country.
It seemed that more gay people were acquiring the disease in this country and fewer heterosexuals were being infected while here.
That would be completely fine if not for one thing: it was not telling the whole story.
The big drop in diagnoses amongst heterosexual people was due to a sharp fall in the overall number of infections amongst this group. It did not mean that fewer heterosexual people were being infected in the UK.
In fact, the proportion of heterosexual infections acquired in this country has risen sharply over the past years, from 31 per cent in 2004 to 57 per cent in 2013. And despite the upward trend for gay men seen in the graphic, the proportion of UK-acquired cases in that group had fallen.
In 2013 the proportion of all infections that were UK-acquired was 66 per cent, up from 48 per cent in 2004. The proportion had increased, despite the fact that the number of infections had dropped.
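To make the difference between counts and proportions concrete, here is a minimal Python sketch. The counts are made up for illustration and are not the PHE figures; I only chose them so that the resulting shares match the 31 per cent and 57 per cent quoted above for heterosexual infections. The function name `uk_acquired_share` is also just a name I picked for this example.

```python
def uk_acquired_share(uk_acquired: int, total: int) -> float:
    """Return the percentage of infections that were acquired in the UK."""
    if total <= 0:
        raise ValueError("total must be a positive number of infections")
    return 100 * uk_acquired / total


# Hypothetical counts, NOT taken from the PHE dataset.
# They are chosen only so the shares come out at 31% and 57%.
illustrative = {
    2004: {"uk_acquired": 310, "total": 1000},
    2013: {"uk_acquired": 285, "total": 500},
}

# Share of infections acquired in the UK, per year
shares = {
    year: uk_acquired_share(row["uk_acquired"], row["total"])
    for year, row in illustrative.items()
}

for year, row in illustrative.items():
    print(
        f"{year}: {row['uk_acquired']} UK-acquired cases, "
        f"{shares[year]:.0f}% of {row['total']} infections"
    )
```

In this toy example the number of UK-acquired cases falls between the two years while the UK-acquired share rises, because the total falls much faster. That is the same pattern that tripped me up in the real data.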
The correct analysis is here.
Comparisons: getting the numbers right
I recently analysed homicide rates in Brazil. My goal was to check whether violent crime affects white and black people differently.
Again, the notion of proportion, a fairly basic principle of statistics, was essential.
The dataset I was analysing had the number of homicides by skin colour per year from 2010 to 2012 for 5,565 municipalities in Brazil. I built a quick pivot table to get the homicide numbers per state and per skin colour.
The numbers were shocking in themselves. More than 23,000 young black people were killed in Brazil in 2013. That total was far higher than the 6,806 young white people killed in the same year.
But with Brazil having a larger black population, how could I compare the two values?
I used the rate formula “fact per 100,000 population”, a rate used by demographers in Brazil and by the United Nations when analysing certain statistics, such as crime.
However, simply calculating the number of homicides per 100,000 people in the total population wouldn’t rule out incorrect conclusions. The rule with rates is that, for a rate to be a true rate, we must try to have only those at risk in the denominator.
There are states in Brazil, such as Bahia, Amazonas and Pará, where the black population accounts for 80 per cent of the inhabitants. In such places, crimes were likely to affect more black people simply because they were the majority.
The solution for a more accurate picture: compare the crimes against black people within the black population and compare the crimes against white people within the white population.
This analysis made it possible to see, for example, that in the state of Santa Catarina, even though the total number of homicides of white people was three times higher than the number of homicides of black people, the homicide rate per 100,000 was still higher for black people.
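The same comparison can be sketched in a few lines of Python. This is not the spreadsheet workflow I actually used, and every number below is invented: the state, its populations and its homicide counts are hypothetical, picked only so that one group has three times as many victims but a lower rate. The helper name `rate_per_100k` is my own as well.

```python
def rate_per_100k(events: int, population_at_risk: int) -> float:
    """Return events per 100,000 people in the population at risk."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return events * 100_000 / population_at_risk


# Hypothetical state: invented figures, NOT from the Brazilian dataset.
hypothetical_state = {
    "white": {"homicides": 300, "population": 5_000_000},
    "black": {"homicides": 100, "population": 1_000_000},
}

total_population = sum(g["population"] for g in hypothetical_state.values())

# Misleading version: each group's homicides over the whole population
crude_rates = {
    group: rate_per_100k(g["homicides"], total_population)
    for group, g in hypothetical_state.items()
}

# Fairer version: each group's homicides over that group's own population
group_rates = {
    group: rate_per_100k(g["homicides"], g["population"])
    for group, g in hypothetical_state.items()
}

for group in hypothetical_state:
    print(
        f"{group}: crude {crude_rates[group]:.1f}, "
        f"group-specific {group_rates[group]:.1f} per 100,000"
    )
```

With the whole population in the denominator, the larger group looks worse off simply because it is larger. With each group's own population in the denominator, the ordering flips, which is the kind of result the raw totals hide.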
What I learnt from these episodes: numbers can’t be detached from the whole picture. Proportions are more important than absolute numbers, and you may mislead your audience if you don’t present the context your numbers belong to.
Want to know more?
Here are some resources: