Digging for Digital Gold

Originally published in SPIDER Magazine on October 15, 2010.

If the trends of rising hard disk space and bandwidth requirements have taught us anything, it’s that we generate a staggering amount of data every day: 200 billion emails exchanged each day, more than 20 hours of video uploaded to YouTube every minute, well over a trillion unique links discovered by Google alone, and over 112 million active websites.

It may sound over the top when you hear it, but we truly are living in an information age.

Information is valuable, no doubt about it – just ask the militaries, news corporations, stock traders and even individuals that depend on new and reliable information every instant to make and save money and lives. The problem, however, arises when most of the data you have turns out to be useless junk.

That is exactly the case with most of the data out there. Around 90% of those 200 billion emails are spam, and while YouTube may be able to provide nearly a day’s worth of new video content every minute, no one in their right mind would watch all of the inane nonsense that goes on there. Even if it’s not total gibberish, if it doesn’t relate to your purpose it might as well be.

Edward Wilson, in his book Consilience: The Unity of Knowledge, wrote “We are drowning in information, while starving for wisdom.” Despite the tons of information available to us almost instantaneously over the internet, it still takes as long as ever to make sense of any of it.

Humans have been manually extracting meaning from data for centuries – that’s basically what statistics are used for. But with the numerous petabytes of data being generated every day, manual analysis just isn’t an option – that is of course where computers come in and data mining begins.

Data mining is, in essence, the use of automated processes to extract patterns from a given set of data. That is, using a computer to analyze raw data for you. Credit card companies, security agencies, pharmaceutical companies, banks, stocks and commodities traders, and even advertisers use it to try and gain the edge over competitors.

But it’s not just about competition and stealing business away from a rival; data mining techniques can also bring to light patterns and trends in a data set that people had no idea existed, or had misconceptions about.

Hans Rosling, professor of global health at Karolinska Institute in Sweden, for example, uses raw data to illustrate trends in so-called ‘third world countries’ in sectors such as healthcare, average incomes, birth and death rates, etc. To the surprise of everyone at TED2007 where he first demo’d it, the data said ‘third world countries’ were actually progressing more rapidly than anyone anticipated. This data was available to anyone, but no one had actually used it to analyze what was going on – Rosling mined the boring spreadsheets and tables and hit gold in the form of valuable insight into how global society is progressing. His software, Gapminder, was bought by Google in March 2007, but is available for free use on its website.

Another illustration of how data can be mined for information that makes sense is Google Insights for Search – a tool that lets curious visitors look for trends in what people are searching for with Google. The search giant’s philanthropic arm, Google.org, further developed Insights to look at flu trends across the globe. By analyzing how many people were searching for flu-related queries and where, they were able to predict flu activity. Their results, published in the journal Nature, showed a close correlation between actual flu data provided by the government and their search trends; in most instances their data was also a much faster indicator of flu activity.
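To make the idea of a "correlation" between search trends and reported flu data concrete, here is a minimal sketch in Python. It is not Google's method; it simply computes the standard Pearson correlation coefficient between two weekly series. The numbers are invented for illustration, and the names `pearson`, `weekly_flu_searches` and `weekly_reported_cases` are assumptions of this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented weekly figures, for illustration only (not real search or government data).
weekly_flu_searches = [120, 150, 310, 480, 450, 260, 140]
weekly_reported_cases = [30, 41, 80, 131, 120, 66, 35]

r = pearson(weekly_flu_searches, weekly_reported_cases)
print(f"correlation between searches and reported cases: {r:.3f}")
```

A value near 1 means the two series rise and fall together; a value near 0 means search volume says little about reported cases.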

Data mining is not just done at Google, though; it’s everywhere.

In the financial market, big-name traders such as Goldman Sachs use special algorithms, dedicated connections to news services like Reuters and high-grade GPU-powered computers to process news and make decisions on what to buy and what to sell. Without the ability to analyze incoming data from news services, they wouldn’t make the multi-million-dollar profits they do.

Banks and credit card companies use it to figure out which customers are most likely to honor their debts by looking at data such as purchasing habits (time of day, etc.). This data is also used for fraud and identity theft detection.
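As a rough illustration of how purchasing data might be screened for fraud, the sketch below flags a purchase that sits far outside a customer's usual spending. This is a toy rule and an assumption of this example, not a description of what any bank actually runs; the function name `flag_unusual`, the threshold of three standard deviations and the purchase amounts are all hypothetical.

```python
from statistics import mean, stdev


def flag_unusual(amounts, new_amount, threshold=3.0):
    """Return True if new_amount is more than `threshold` sample standard
    deviations away from the mean of the customer's past purchases."""
    if len(amounts) < 2:
        raise ValueError("need at least two past purchases")
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        # Every past purchase was identical, so any other amount stands out.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold


# Hypothetical purchase history for one customer.
history = [42.0, 38.5, 51.0, 45.25, 40.0, 47.75]

print(flag_unusual(history, 49.0))   # an ordinary purchase
print(flag_unusual(history, 900.0))  # far outside the usual range
```

Real systems weigh many more signals, such as location, merchant and time of day, but the principle is the same: learn what normal looks like, then look for what does not fit.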

For a business, the ability to identify trends in consumer behavior could be what sets it apart from the competition. As computing power gets cheaper and businesses realize the importance of information in effective decision making, more and more of them will begin to utilize data mining techniques.

The term ‘data mining’ is often perceived negatively – something only a bad guy would do to steal your information. Yet data mining – the extraction and analysis of information – is a necessary part of business. Companies need information to figure out how their products are doing in the market. The concerns with data mining arise when a person’s anonymity is compromised.

Back in 2006, AOL released logs of searches made by nearly 657,000 Americans in the hopes it would aid academic researchers. In order to anonymize the data, AOL simply replaced usernames with numbers and hoped that would work. But of course, it didn’t. Reporters locked on to user 4417749 and, using just her search query history, managed to trace her to her home in Lilburn, Ga. It turned out that Thelma Arnold, a 62-year-old widow, had three dogs and frequently looked up ailments her friends and family were suffering from to gauge their symptoms. She had searched for “numb fingers”, “tea for good health”, and wondered how she could send “school supplies for Iraq children.” It was, in essence, her whole life right there in the logs.

The search logs were pulled from AOL’s website but nothing is truly deleted once it gets on the internet. Searches by users revealed intimate personal conditions – one user highlighted in the New York Times article on this debacle suspected their “spouse [was] contemplating cheating.” Another wondered about “depression and medical leave.”
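Why swapping usernames for numbers fails as anonymization can be shown in a few lines. The sketch below is an illustration only, not AOL's actual procedure or log format: `pseudonymize` and `group_by_id` are hypothetical helpers, and the usernames are invented. The queries for the first user are the ones quoted above.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each username with a numeric ID, leaving the queries untouched."""
    ids = {}
    next_id = count(1)
    released = []
    for username, query in log:
        if username not in ids:
            ids[username] = next(next_id)
        released.append((ids[username], query))
    return released


def group_by_id(released):
    """Collect every query issued under the same numeric ID."""
    grouped = defaultdict(list)
    for user_id, query in released:
        grouped[user_id].append(query)
    return dict(grouped)


# Hypothetical raw log; usernames are invented for this sketch.
raw_log = [
    ("user_a", "numb fingers"),
    ("user_b", "cheap flights"),
    ("user_a", "tea for good health"),
    ("user_a", "school supplies for Iraq children"),
]

released = pseudonymize(raw_log)
profiles = group_by_id(released)
for user_id, queries in profiles.items():
    print(user_id, queries)
```

The names are gone, but every query a person made still hangs together under one number, and it is that combined profile, not the username, that gives the person away.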

Search history can reveal a lot about us. Most of us use our search engine of choice like a personal oracle, typing in any question we want answered without a second thought. Companies like Google and Microsoft of course recognize the value of their search logs and turn them into advertising revenue.

But when does data mining for bettering business end and where does the privacy invasion begin? Targeted advertising is the pot of gold that advertisers are vying for these days, and data mining is the only way to reach it. Companies like Phorm or NebuAd, for example, sit between the ISP and the user to glean data out of the connection to help their advertising partners better target ads to customers. Although this move was blocked by the British government, many privacy advocates fear the day is not far away when this sort of thing is put into practical use.

Another source of data to mine is social networks like Facebook and MySpace, where people are a bit less guarded about what they reveal. Companies are already using these sites to check out prospective employees for questionable behavior, and it’s not that difficult to build a profile of what a person is like, what with all the “Which TV character are you” and “Personality quizzes” people take all the time.

While data mining can be a useful tool in business analysis, it can also be a real threat to users and their privacy. Tools like PASW Modeler and Gapminder can help businesses extract trends from any data they have and provide interesting results. However, the practice of profiling individuals for advertising is perhaps more of a privacy invasion than a business tool.