Statistics

Brian Keegan: Big Data

Editor’s Note: Brian Keegan is a post-doctoral research fellow in Computational Social Science with David Lazer at Northeastern University. He defended his Ph.D. in the Media, Technology, and Society program at Northwestern University’s School of Communication. He also attended the Massachusetts Institute of Technology, where he received bachelor’s degrees in Mechanical Engineering and in Science, Technology, and Society in 2006.

His research employs a variety of large-scale behavioral data sets such as Wikipedia article revision histories, massively-multiplayer online game behavioral logs, and user interactions in a crowd-sourced T-shirt design community. He uses methods in network analysis, multilevel statistics, simulation, and content analysis. To learn more about him, please visit his official website Brianckeegan.com.

eTalk’s Niaz Uddin recently interviewed Brian Keegan to gather his ideas and insights about Big Data, Data Science, and Analytics; the interview is given below.

Niaz: Brian, we are really excited to have you here to talk about Big Data. Let’s start from the beginning: how do you define Big Data?

Brian: Thank you, Niaz, for having me. Well, a common joke in the community is that “big data” is anything that makes Excel crash. That’s not fair to Microsoft, because the dirty secret of data science is that you can get pretty far using Excel; nor is it fair to researchers whose data could hypothetically fit in Excel but are so complicated that it would make no sense to try in the first place.

Big data is distinct from traditional industry and academic approaches to data analysis because of what are called the three Vs: volume, variety, and velocity.

      • Volume is what we think of immediately: server farms full of terabytes of user data waiting to be analyzed. This data doesn’t fit into a single machine’s memory, hard drive, or even a traditional database. The sheer size of the data makes analyzing it with traditional tools really hard, which is why new tools are being created.
      • Second, there’s variety, which reflects the fact that data aren’t just lists of numbers but include complex social relationships, collections of text documents, and sensor readings. All these different kinds of data have different structures, granularity, and errors, which need to be cleaned and integrated before you can start to look for relationships among them (a minimal cleaning sketch follows this list). Cleaning data is fundamentally unsexy and grueling work, but if you put garbage into a model, garbage is all you get back out. Making sure all these diverse kinds of data play well with each other and with the models you run on them is crucial.
      • Finally, there’s velocity, which reflects the fact that data are not only being created in real time but that people want to act on the incoming information in real time as well. This means the analysis also has to happen in real time, which is quite different from the old days, when a bunch of scientists could sit around for weeks testing different kinds of models on data collected months or years earlier before writing a paper or report that takes still more months to be published. APIs, dashboards, and alerts are part of big data because they make data available fast.
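To make that cleaning and integration work concrete, here is a minimal sketch in Python (pandas) of the kind of wrangling described in the variety bullet above. The file names, columns, and granularities are hypothetical; the point is simply that two differently structured data sets have to be cleaned and rolled up to a common granularity before they can be joined.

```python
# A minimal sketch of cleaning and integrating two data sets with
# different structures and granularity. File and column names are
# hypothetical.
import pandas as pd

# Per-transaction sales records and hourly sensor readings.
sales = pd.read_csv("sales.csv", parse_dates=["timestamp"])
sensors = pd.read_csv("sensors.csv", parse_dates=["timestamp"])

# Clean: drop duplicates, impossible values, and missing readings,
# because garbage into a model means garbage back out.
sales = sales.drop_duplicates()
sales = sales[sales["amount"] > 0]
sensors = sensors.dropna(subset=["temperature"])

# Integrate: roll both up to a common daily granularity, then join.
daily_sales = sales.set_index("timestamp").resample("D")["amount"].sum()
daily_temp = sensors.set_index("timestamp").resample("D")["temperature"].mean()
combined = pd.concat([daily_sales, daily_temp], axis=1).dropna()
print(combined.head())
```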

Niaz: Can you please give us some examples?

Brian: Data that is big is definitely not new. Two centuries ago, the US Census already required collecting and analyzing millions of data points gathered by hand. Librarians and archivists have always struggled with how to organize, find, and share information on millions of physical documents like books and journals. Physicists have been grappling with big data for decades, where the data is literally astronomical. Biologists sequencing the genome needed ways to manipulate and compare data involving billions of base pairs.

While “data that was big” existed before computers, the availability of cheap computation has accelerated and expanded our ability to collect, process, and analyze data that is big. So while we now think of things like tweets or financial transactions as “big data” because those industries have rushed to adopt computation or are completely dependent upon it, it’s important to keep in mind that lots of big data exist outside of social media, finance, and e-commerce, and that’s where a lot of opportunities and challenges still lie.

Niaz: What are some of the possible use cases for big data analytics? Which major companies are producing gigantic amounts of data?

Brian: Most people think of internet companies like Google, Facebook, Twitter, LinkedIn, FourSquare, Netflix, Amazon, Yelp, Wikipedia, and OkCupid when they think of big data. These companies are definitely the pioneers of the algorithms, platforms, and other tools, like PageRank, map-reduce, user-generated content, and recommender systems, that require combining millions of data points to provide fast and relevant content.
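As a rough illustration of the map-reduce idea mentioned above, here is a toy word count in plain Python. Real platforms like Hadoop distribute the map and reduce phases across many machines; this sequential sketch only shows the shape of the computation.

```python
# A toy map-reduce word count. Real systems run the map and reduce
# phases in parallel across many machines; this sketch is sequential.
from collections import defaultdict

documents = ["big data is big", "data about data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by their key (the word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```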

    • Companies like Crimson Hexagon mine Twitter and other social media streams for their clients to detect patterns of novel phrases or changes in the sentiment associated with keywords and products. This can let their clients know if people are having problems with a product or if a new show is generating a lot of buzz despite mediocre ratings.
    • The financial industry uses big data not only for high-frequency trading based on combining signals from across the market, but also for evaluating the credit risks of customers by combining various data sets. Retailers like Target and WalMart have large analytics teams that examine consumer transactions for behavioral patterns so they know which products to feature. Telecommunications companies like AT&T or Verizon collect the call data records produced by every cell phone on their networks, which let them know your location over time so they can improve coverage. Industrial companies like GE and Boeing put more and more sensors into their products so that they can monitor performance and anticipate maintenance.
    • Finally, one of the largest producers and consumers of big data is the government. Law enforcement agencies publish data about crime, and intelligence agencies monitor communication data from suspects. The Bureau of Labor Statistics, the Federal Reserve, and the World Bank collect and publish extremely rich and useful economic time series data. Meteorologists collect and analyze large amounts of data to make weather forecasts.

Niaz: Why has big data become so important now?

Brian: Whether in business, politics, or the military, decisions were (and continue to be) made under uncertainty about history or context, because getting timely and relevant data was basically impossible. Directors didn’t know what customers were saying about their product, politicians didn’t know which issues constituents were talking about, and officers faced a fog of war. Ways of getting data were often slow and/or suspect: for example, broadcast stations used to price advertising time by paying a few dozen people in a city to keep journals of what stations they remembered hearing every day. Looking back now, this seems like an insane way not only to collect data but also to make decisions based on obviously unreliable data, but it’s how things were done for decades because there was no better way of measuring what people were doing. The behavioral traces we leave in tweets and receipts are not only much finer-grained and more reliable, but also encompass a much larger and more representative sample of people and their behaviors.

Data lets decision makers know and respond to what the world really looks like instead of going on their gut. More data usually gives a more accurate view, but too much data can also overwhelm and wash out the signal with noise. The job of data scientists is less like trying to find a single needle in a haystack and more like collecting as much hay as possible, to be sure there are a few needles in there, before sorting through the much bigger haystack. In other words, data might be collected for one goal, but it can also be repurposed for other goals and follow-on questions that come along, providing new insights. More powerful computers, algorithms, and platforms make assembling and sorting through these big haystacks much easier than before.

Niaz: Recently I have seen that IBM has started to work with Big Data. What roles do companies like IBM play in this area?

Brian: IBM is just one of many companies that are racing “upstream” to analyze data on larger and more complex systems, like an entire city, by aggregating tweets, traffic, surveillance camera feeds, electricity consumption, and emergency services data, all of which feed into each other. IBM is an example of an organization that has shifted from providing value by transforming raw materials into products like computers to transforming raw data into unexpected insights about how a system works, or doesn’t. The secret sauce is collecting existing data, building new data collection systems, and developing statistical models and platforms that are able to work in the big data domain of volume, variety, and velocity, which traditional academic training doesn’t equip people for.

Niaz: What are the benefits of Big Data for business? How is it influencing innovation and business?

Brian: Consider the market capitalization of three major tech companies on a per capita basis: Microsoft makes software and hardware as well as running web services like Bing based on big data, and is worth about $2.5 million per employee; Google mostly makes software and runs web services, and is worth about $4.6 million per employee; and Facebook effectively just runs a web service, its social network site, and is worth about $19 million per employee. These numbers may be outliers or unreliable for a variety of reasons, but the trend suggests that organizations like Facebook that are focused solely on data produce more value per employee.

This obviously isn’t a prescription for every company: ExxonMobil, WalMart, GE, and Berkshire produce value in fundamentally different ways. But Facebook did find a way to capture and analyze data about the world, our social relationships and preferences, that was previously hidden. There are other processes happening beyond the world of social media that currently go uncaptured, but with the advent of new sensors and new opportunities for collecting data, they will become ripe for the picking. Mobile phones in developing countries will reveal patterns of human mobility that could transform finance, transportation, and health care. RFIDs on groceries and other products could reveal patterns of transportation and consumption that could reduce wasted food while opening new markets. Smart meters and grids could turn the tide against global climate change while lowering energy costs. Politicians could be made more accountable and responsive through crowd-sourced fundraising and analysis of regulatory disclosures. The list of data out there waiting to be collected and analyzed boggles the mind.

Niaz: How do you define a data scientist? What suggestions do you have for those who want to become data scientists?

Brian: A data scientist needs familiarity with a wide set of skills, so wide that it’s impossible to be an expert in all of them.

      • First, data scientists need the computational skills that come from learning a programming language like Python or Java, so that they can acquire, clean up, and manipulate data from databases and APIs, hack together different programs developed by people who are far more expert in network analysis or natural language processing, and use difficult tools like MySQL and Hadoop. There’s no point-and-click program out there with polished tutorials that does everything you’ll need end-to-end. Data scientists spend a lot of time writing code, working at the command line, and reading technical documentation, but there are tons of great resources like StackOverflow, GitHub, free online classes, and active and friendly developer communities where people are happy to share code and solutions.
      • Second, data scientists need statistical skills at both a theoretical and a methodological level. This is the hardest part and favors people who have backgrounds in math and statistics, computer and information sciences, physical sciences and engineering, or quantitative social sciences. Theoretically, they need to know why some kinds of analyses should be run on some kinds of data but not others, and what the limitations of one kind of model are compared to another. Methodologically, data scientists need to actually be able to run these analyses using statistical software like R, interpret the output of the analyses, and do the statistical diagnostics to make sure all the assumptions baked into a model are actually behaving properly.
      • Third, data scientists need some information visualization and design skills so they can communicate their findings effectively with charts or interactive web pages for exploration. This means learning to use packages like ggplot in R or matplotlib in Python for statistical distributions, d3 in JavaScript for interactive web visualizations, or Gephi for network visualizations. (A short sketch below ties these three skill sets together.)

All of the packages I mentioned are open-source, which also reflects the culture of the data science community: expensive licenses for software or services are viewed with suspicion, because others should be able to easily replicate and build upon your analysis and findings.
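As a rough illustration of how these three skill sets fit together, here is a minimal end-to-end sketch in Python using the kind of open-source tools mentioned above: acquire data, fit and diagnose a simple model, and communicate the result. The data are simulated in place of a real database or API call, and the model is deliberately simple.

```python
# A minimal end-to-end sketch: acquire data (simulated here in place
# of a database or API call), fit a simple model, check a diagnostic,
# and communicate the result with a chart. Illustrative only.
import numpy as np
import matplotlib.pyplot as plt

# "Acquire": simulate noisy observations of a linear relationship.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, size=200)

# "Model": fit a least-squares line and inspect the coefficients.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")

# "Diagnose": look at the residuals, not just the fitted line.
residuals = y - (slope * x + intercept)
print(f"residual std={residuals.std():.2f}")

# "Communicate": plot the data and the fitted line.
plt.scatter(x, y, s=10, alpha=0.5, label="observations")
xs = np.linspace(0, 10, 100)
plt.plot(xs, slope * xs + intercept, color="red", label="fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("fit.png")
```

In practice the acquisition step would read from a real database or API and the diagnostics would be far more thorough, but the division of labor between code, statistics, and visualization stays the same.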

Niaz: Finally, what do you think about the impact of Big Data in our everyday life?

Brian: Big Data is a dual-use technology that can serve multiple goals, some of which may be valuable and others unsavory. On one hand, it can help entrepreneurs be more nimble and open new markets, or help researchers gain new insights into how the world works; on the other hand, as the Arab Spring suggested, it can also reinforce the power of repressive regimes to monitor dissidents, or of unsavory organizations to do invasive personalized marketing.

danah boyd and Kate Crawford have argued persuasively that the grand possibilities of big data, whether to address societal ills or to undermine social structures, obscure very real but subtle changes that are happening right now: big data can displace existing theory and knowledge, cloak subjectivity in quantitative objectivity, confuse bigger data with better data, separate data from context and meaning, raise real ethical questions, and create or reinforce inequalities.

Big data also raises complicated questions about who has access to data. On one hand, privacy is a paramount concern as organizations shouldn’t be collecting or sharing data about individuals without their consent. On the other hand, there’s also the expectation that data should be shared with other researchers so they can validate findings. Furthermore, data should be preserved and archived so that it is not lost to future researchers who want to compare or study changes over time.

Niaz: Brian, thank you so much for giving me time in the midst of your busy schedule. It has been really great to learn the details of Big Data from you. I wish you good luck with your studies, research, projects, and work.

Brian: You are welcome. Good luck to you too.

_  _  _  _  ___  _  _  _  _

Further Reading:

1. Viktor Mayer-Schönberger on Big Data Revolution

2. Gerd Leonhard on Big Data and the Future of Media, Marketing and Technology

3. Ely Kahn on Big Data, Startup and Entrepreneurship

4. James Kobielus on Big Data, Cognitive Computing and Future of Product

5. danah boyd on Future of Technology and Social Media

6. Irving Wladawsky-Berger on Evolution of Technology and Innovation

7. Horace Dediu on Asymco, Apple and Future of Computing

8. James Allworth on Disruptive Innovation