China is the world’s biggest consumer of energy. Private funders gave more money to Pakistan flood aid than Germany, France, Sweden or the United Arab Emirates. More than 3.6 million scientific procedures were carried out on live animals in the UK last year, and most of those creatures were mice.
That’s the kind of data that has been getting The Guardian’s data blog more than a million page-views a month, in good months. Founded for use by geeks and developers, the data blog is winning popular acclaim from ordinary readers who want to pore over MPs’ expense chits or find out exactly how many times The Beatles have used the word “love” in their songs. The answer is 613, just in case you were wondering.
The blog is a brave, pioneering step in open-source journalism. Simon Rogers, The Guardian’s news editor for data, thinks basic free tools, a fearlessness about massive data sets and a willingness to call on crowdsourcing have created a resource that is being used by media companies around the world.
“We want to be the trusted curator of data,” said Rogers, speaking to The Daily Maverick from London. “People are increasingly interested in the raw data behind the story. If the story is based on data, they are really interested in seeing it.” Rogers said there had been such an explosion of data that people were worried about where information was coming from and whether they could trust the sources. “Most of the data we have are available elsewhere, but it is about the fact that people trust us, and that we make it simple and accessible for people to download and use.”
Rogers said a lot of the data published by the UK government came in “the worst format in the world”. As a curator of data, what The Guardian’s blog does is to make that information accessible and usable, and to apply a journalistic lens to statistics. Then there’s the matter of ensuring information can be read by computers so data sets can be easily combined and aggregated into metadata. “There is a lot of new thinking about semantic data or linked data that Tim Berners-Lee is involved in. This is the idea that if a computer was reading a piece of data it would understand what (the information) stood for.” Rogers uses the example of the different country names for the same place to explain this. For some people the country of Burma is called Myanmar. “How does the computer know to match those two things together? So we use codes where we can that make it easier to mash up different data sets with each other.”
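The matching Rogers describes can be sketched in a few lines of Python. This is only an illustration, not The Guardian’s own method: the article does not say which codes the blog uses, so the ISO 3166-1 alpha-3 codes, the alias table and the function names below are all assumptions, and the numbers are placeholders rather than real statistics.

```python
# Hypothetical alias table: maps the names a source might use to one shared code.
# ISO 3166-1 alpha-3 codes are an assumption; the article does not name the codes used.
NAME_TO_CODE = {
    "burma": "MMR",
    "myanmar": "MMR",
    "united kingdom": "GBR",
    "uk": "GBR",
}


def to_code(name):
    """Return the shared code for a country name, or None if it is unknown."""
    return NAME_TO_CODE.get(name.strip().lower())


def key_by_code(rows):
    """Re-key a {country name: value} data set by country code."""
    keyed = {}
    for name, value in rows.items():
        code = to_code(name)
        if code is not None:  # skip names the alias table does not know
            keyed[code] = value
    return keyed


def mash_up(first, second):
    """Join two data sets on country code, keeping codes present in both."""
    a, b = key_by_code(first), key_by_code(second)
    return {code: (a[code], b[code]) for code in a if code in b}


# Placeholder values for illustration only, not real statistics.
dataset_a = {"Burma": 1, "UK": 2}
dataset_b = {"Myanmar": 10, "United Kingdom": 20}
combined = mash_up(dataset_a, dataset_b)
```

Because both spellings resolve to the same code, the rows for “Burma” and “Myanmar” end up side by side in `combined`, even though the two sources never agreed on a name.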
The big change in thinking proposed by The Guardian’s data blog is about opening up sources. The past saw a journalism that was precious with its information and reluctant to share sources or data. “Now it’s all about the ‘mutualisation’ of news. It’s a horrible word that, ‘mutualisation’, but it is good at describing what happens when journalists stop keeping all the stuff to themselves.” Rogers said old-school journalism was all about beautiful words and writing skilfully crafted news that was thrown out to a grateful readership. “Now what we’re finding is that there are people that know much more about some subjects than we do because they have specialist knowledge or can offer us stories,” he said.
A case in point was the UK’s member of parliament expenses scandal, which sounds similar to some of the exposés of largesse in South Africa. “A lot of the members of parliament had been found to do dodgy things with their expense claims. We got the data and there was too much data for us to analyse. There were 450,000 items of PDF receipts that we received. So we put them out there to be crowdsourced and asked our readers what they thought of it, and the stories that they could find.”
Readers pored over the expenses, with some people analysing thousands and thousands of receipts and offering The Guardian new angles and insights on the scandal. “It is a reciprocal relationship and at the end of the day we view the Web as being about information that is open and available.”
The Guardian’s Data Blog isn’t rocket science. On the contrary, it’s the use of simple and easy-to-use tools that has made it accessible and well trafficked. “I deliberately keep things simple so we use things like Excel and our illustrators use Adobe Illustrator. We also use tools that are free and available online like ManyEyes, which is IBM’s visualisation tool, and Timetric, which is useful for time-sensitive data and enables you to produce a graphic in five minutes, which is appealing if you want to get stuff out quickly. Then we probably qualify as one of Google Documents’ first super-users.”
The data unit also built tools for massive information dumps like Wikileaks, where the mass of data was so broad and extensive it required the development of databases to manage it. “Wikileaks was a real example of how the operation of the newspaper would have been very different if the data unit hadn’t been around. Our top investigative journalists were working with massive amounts of spreadsheets and volumes of data they were unused to. Half the work we did was internal facing, helping journalists find stories, helping journalists with the analysis of the data and excising the bits we wanted to get out. What we’re finding is a dual role here. We have this role in helping reporters produce stories and get things out of huge amounts of data, and then we also have the role of helping our readers access and find their way around massive data sets.
“We are now in a position where big data don’t frighten us and having a massive amount of information to process isn’t scary for us. The big lesson from Wikileaks was to ensure we have new tools and the right tools to handle the data. It also taught people outside of The Guardian how to work with us in terms of a story,” said Rogers.
The “good old days” were about journalists who would pride themselves on their lack of numerical ability. Rogers said things had changed, radically. “What you are going to find is that a lot of stories are going to come our way as data, and as journalists, if you are scared of the tools and scared of treating data seriously as journalism, then you are going to lose out. The organisations that don’t adopt this approach are the ones that are going to lose out.”
The final point Rogers made was telling. “There’s snobbery about what journalism is. In some quarters, journalism is perceived as writing huge amounts of beautiful text. Actually, isn’t publishing useful information and data that people can see in new ways and access journalism too?”
By Mandy de Waal
Visit The Guardian’s Data Blog.
Daily Maverick © All rights reserved