Data Science

James Kobielus: Big Data, Cognitive Computing and Future of Product

Editor’s Note: As IBM’s Big Data Evangelist, James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics Solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in Big Data, Hadoop, Enterprise Data Warehousing, Advanced Analytics, Business Intelligence, Data Management, and Next Best Action Technologies. He works with IBM’s product management and marketing teams in Big Data. He has spoken at such leading industry events as IBM Information On Demand, IBM Big Data Integration and Governance, Hadoop Summit, Strata, and Forrester Business Process Forum. He has published several business technology books and is a very popular provider of original commentary on blogs and many social media channels.


eTalk’s Niaz Uddin recently interviewed James Kobielus to gain insights into his ideas, research and work in the field of Big Data. The interview is given below.

Niaz: Dear James, thank you so much for joining us in the midst of your busy schedule. We are very thrilled and honored to have you at eTalks.

James: And I’m thrilled and honored that you asked me.

Niaz: You are a leading expert on Big Data, as well as on such enabling technologies as enterprise data warehousing, advanced analytics, Hadoop, cloud services, database management systems, business process management, business intelligence, and complex-event processing. At the beginning of our interview can you please tell us about Big Data? How does Big Data make sense of the new world?

James: Big Data refers to approaches for extracting deep value from advanced analytics and trustworthy data at all scales. At the heart of advanced analytics is data mining, which is all about using statistical analysis to find non-obvious patterns (segmentations, correlations, trends, propensities, etc.) within historical data sets.

Some might refer to advanced analytics as tools for “making sense” of this data in ways that are beyond the scope of traditional reporting and visualization. As we aggregate and mine a wider variety of data sources, we can find far more “sense”–also known as “insights”–that previously lay under the surface. Likewise, as we accumulate a larger volume of historical data from these sources and incorporate a wider variety of variables from them into our models, we can build more powerful predictive models of what might happen under various future circumstances. And if we can refresh this data rapidly with high-velocity high-quality feeds, while iterating and refining our models more rapidly, we can ensure that our insights reflect the latest, greatest data and analytics available.

That’s the power of Big Data: achieve more data-driven insights (aka “making sense”) by enabling our decision support tools to leverage the “3 Vs”: a growing Volume of stored data, higher Velocity of data feeds, and broader Variety of data sources.

Niaz: As you know, Big Data has already started to redefine search, media, computing, social media, products, services and so on. The availability of data is helping us analyze trends and do interesting things in more accurate and efficient ways than before. What are some of the most interesting uses of big data out there today?

James: Where do I start? There are interesting uses of Big Data in most industries and in most business functions.

I think cognitive computing applications of Big Data are among the most transformative tools in modern business.

Cognitive computing is a term that probably goes over the head of most of the general public. IBM defines it as the ability of automated systems to learn and interact naturally with people to extend what either man or machine could do on their own, thereby helping human experts drill through big data rapidly to make better decisions.

One way I like to describe cognitive computing is as the engine behind “conversational optimization.” In this context, the “cognition” that drives the “conversation” is powered by big data, advanced analytics, machine learning and agile systems of engagement. Rather than rely on programs that predetermine every answer or action needed to perform a function or set of tasks, cognitive computing leverages artificial intelligence and machine learning algorithms that sense, predict, infer and, if they drive machine-to-human dialogues, converse.

Cognitive computing performance improves over time as systems build knowledge and learn a domain’s language and terminology, its processes and its preferred methods of interacting. This is why it’s such a powerful conversation optimizer. The best conversations are deep in give and take, questioning and answering, tackling topics of keenest interest to the conversants. When one or more parties has deep knowledge and can retrieve it instantaneously within the stream of the moment, the conversation quickly blossoms into a more perfect melding of minds. That’s why it has been deployed into applications in healthcare, banking, education and retail that build domain expertise and require human-friendly interaction models.

IBM Watson is one of the most famous exemplars of the power of cognitive computing driving agile human-machine conversations.  In its famous “Jeopardy!” appearance, Watson illustrated how its Deep Question and Answer technology—which is cognitive computing to the core—can revolutionize the sort of highly patterned “conversation” characteristic of a TV quiz show. By having its Deep Q&A results rendered (for the sake of that broadcast) in a synthesized human voice, Watson demonstrated how it could pass (and surpass) any Turing test that tried to tell whether it was a computer rather than, say, Ken Jennings. After all, the Turing test is conversational at its very core.

What’s powering Watson’s Deep Q&A technology is an architecture that supports an intelligent system of engagement. Such an architecture is able to mimic real human conversation, in which the dialogue spans a broad, open domain of subject matter; uses natural human language; is able to process complex language with a high degree of accuracy, precision and nuance; and operates with speed-of-thought fluidity.

Where the “Jeopardy!” conversational test was concerned (and where the other participants were humans literally at the top of that game), Watson was super-optimized. In the real world of natural human conversation, the notion of “conversation optimization” might seem, at first glance, like a pointy-headed pipedream par excellence. However, you don’t have to be an academic sociologist to realize that society, cultures and situational contexts impose many expectations, constraints and other rules to which our conversations and actions must conform (or face disapproval, ostracism, or worse). Optimizing our conversations is critical to surviving and thriving in human society.

Wouldn’t it be great to have a Watson-like Deep Q&A adviser to help us understand the devastating faux pas to avoid and the right bon mot to drop into any conversation while we’re in the thick of it? That’s my personal dream and I’ll bet that before long, with mobile and social coming into everything, it will be quite feasible (no, this is not a product announcement—just the dream of one IBMer). But what excites me even more (and is definitely not a personal pipedream), is IBM Watson Engagement Advisor, which we unveiled earlier this year. It is a cognitive-computing assistant that revolutionizes what’s possible in multichannel B2C conversations. The  solution’s “Ask Watson” feature uses Deep Q&A to greet customers, conduct contextual conversations on diverse topics, and ensure that the overall engagement is rich with answers, guidance and assistance.

Cognitive/conversational computing is also applicable to “next best action,” which is one of today’s hottest new focus areas in intelligent systems. At its heart, next best action refers to an intelligent infrastructure that optimizes agile engagements across many customer-facing channels, including portal, call center, point of sale, e-mail and social. With cognitive-computing infrastructure as the silent assistant, customers engage in a never-ending whirligig of conversations with humans and, increasingly, with automated bots, recommendation engines and other non-human components that, to varying degrees, mimic real-human conversation.

Niaz: So do you think machine learning is the right way to analyze Big Data?

James: Machine learning is an important approach for extracting fresh insights from unstructured data in an automated fashion, but it’s not the only approach. For example, machine learning doesn’t eliminate the need for data scientists to build segmentation, regression, propensity, and other models for data mining and predictive analytics.

Fundamentally, machine learning is a productivity tool for data scientists, helping them to get smarter, just as machine learning algorithms can’t get smarter without some ongoing training by data scientists. Machine learning allows data scientists to train a model on an example data set, and then leverage algorithms that automatically generalize and learn both from that example and from fresh feeds of data. To varying degrees, you’ll see the terms “unsupervised learning,” “deep learning,” “computational learning,” “cognitive computing,” “machine perception,” “pattern recognition,” and “artificial intelligence” used in this same general context.

Machine learning doesn’t mean that the resultant learning is always superior to what human analysts might have achieved through more manual knowledge-discovery techniques. But you don’t need to believe that machines can think better than or as well as humans to see the value of machine learning. We gladly offload many cognitive processes to automated systems where there just aren’t enough flesh-and-blood humans to exercise their highly evolved brains on various analytics tasks.

Niaz: What technologies are available out there that help profoundly in analyzing data? Can you please briefly tell us about Big Data technologies and their important uses?

James: Once again, it’s a matter of “where do I start?” The range of Big Data analytics technologies is wide and growing rapidly. We live in the golden age of database and analytics innovation. Their uses are everywhere: in every industry, every business function, and every business process, both back-office and customer-facing.

For starters, Big Data is much more than Hadoop. Another big data “H”—hybrid—is becoming dominant, and Hadoop is an important (but not all-encompassing) component of it. In the larger evolutionary perspective, big data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing enterprise data warehouses, in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.

Hybrid architectures address the heterogeneous reality of big data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent big data platform is fit-for-purpose to the role for which it’s best suited. These big data deployment roles may include any or all of the following: data acquisition, collection, transformation, movement, cleansing, staging, sandboxing, modeling, governance, access, delivery, archiving, and interactive exploration. In any role, a fit-for-purpose big data platform often supports specific data sources, workloads, applications, and users.

Hybrid is the future of big data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in big data deployments.

Hybrid deployments are already widespread in many real-world big data deployments. The most typical are the three-tier—also called “hub-and-spoke”—architectures. These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.

The complexity of hybrid architectures depends on the range of sources, workloads, and applications you’re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured.

In the hub tier, you may need disparate clusters configured with different underlying data platforms—RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on—and corresponding metadata, governance, and in-database execution components.

And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
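As a rough illustration of the three-tier, fit-for-purpose idea described above, the sketch below maps each deployment role to the tier that handles it. The tier keys, the dictionary layout and the `tier_for` helper are assumptions made for this example; the platform types and roles are taken from the interview.

```python
# Illustrative mapping of the three tiers to the roles named above.
# The dictionary layout and the tier keys are assumptions for this sketch.
TIERS = {
    "staging": {
        "platform": "Hadoop",
        "roles": {"acquisition", "collection", "staging",
                  "preprocessing", "transformation"},
    },
    "hub": {
        "platform": "relational MPP EDW",
        "roles": {"governance"},
    },
    "access": {
        "platform": "in-memory database",
        "roles": {"access", "interaction"},
    },
}


def tier_for(role):
    """Return the name of the tier that handles a deployment role, or None."""
    for name, tier in TIERS.items():
        if role in tier["roles"]:
            return name
    return None


for role in ("transformation", "governance", "interaction"):
    tier = tier_for(role)
    print(f"{role} -> {tier} tier ({TIERS[tier]['platform']})")
```

A real deployment would attach sources, workloads and users to each tier as well; the point here is only that each role resolves to exactly one fit-for-purpose platform.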

Niaz: That’s really amazing. How do you connect these two dots: Big Data Analytics and Cognitive Computing? How does this connection make sense?

James: The relationship between Cognitive computing and Big Data is simple. Cognitive computing is an advanced analytic approach that helps humans drill through the unstructured data within Big Data repositories more rapidly in order to see correlations, patterns, and insights more rapidly.

Think of cognitive computing as a “speed-of-thought accelerator.” Speed of thought is something we like to imagine operates at a single high-velocity setting. But that’s just not the case. Some modes of cognition are painfully slow, such as pondering the bewildering panoply of investment options available under your company’s retirement plan. But some other modes are instantaneous, such as speaking your native language, recognizing an old friend, or sensing when your life may be in danger.

None of this is news to anybody who studies cognitive psychology or has followed advances in artificial intelligence, aka AI, over the past several decades. Different modes of cognition have different styles, speeds, and spheres of application.

When we speak of “cognitive computing,” we’re generally referring to the ability of automated systems to handle the conscious, critical, logical, attentive, reasoning mode of thought that humans engage in when they, say, play “Jeopardy!” or try to master some rigorous academic discipline. This is the “slow” cognition that Nobel-winning psychologist/economist Daniel Kahneman discussed in a recent IBM Colloquium speech.

As anybody who has ever watched an expert at work will attest, this “slow” thinking can move at lightning speed when the master is in his or her element. When a subject-domain specialist is expounding on their field of study, they often move rapidly from one brilliant thought to the next. It’s almost as if these thought-gems automatically flash into their mind without conscious effort.

This is the cognitive agility that Kahneman examined in his speech. He described the ability of humans to build skills, which involves mastering “System 2” cognition (slow, conscious, reasoning-driven) so that it becomes “System 1” (fast, unconscious, action-driven). Not just that, but an expert is able to switch between both modes of thought within the moment when it becomes necessary to rationally ponder some new circumstance that doesn’t match the automated mental template they’ve developed. Kahneman describes System 2 “slow thinking” as well-suited for probability-savvy correlation thinking, whereas System 1 “fast thinking” is geared to deterministic causal thinking.

Kahneman’s “System 2” cognition–slow, rule-centric, and attention-dependent–is well-suited for acceleration and automation on big data platforms such as IBM Watson. After all, a machine can process a huge knowledge corpus, myriad fixed rules, and complex statistical models far faster than any mortal. Just as important, a big-data platform doesn’t have the limited attention span of a human; consequently, it can handle many tasks concurrently without losing its train of thought.

Also, Kahneman’s “System 1” cognition–fast, unconscious, action-driven–is not necessarily something we need to hand to computers alone. We can accelerate it by facilitating data-driven interactive visualization by human beings, at any level of expertise. When a big-data platform drives a self-service business intelligence application such as IBM Cognos, it can help users to accelerate their own “System 1” thinking by enabling them to visualize meaningful patterns in a flash without having to build statistical models, do fancy programming, or indulge in any other “System 2” thought.

And finally, based on those two insights, it’s clear to me that cognitive computing is not simply limited to the Watsons and other big-data platforms of the world. Any well-architected big data, advanced analytics, or business intelligence platform is essentially a cognitive-computing platform. To the extent it uses machines to accelerate the slow “System 2” cognition and/or provides self-service visualization tools to help people speed up their wetware’s “System 1” thinking, it’s a cognitive-computing platform.

Now I will expand upon the official IBM definition of “cognitive computing” to put it in a larger frame of reference. As far as I’m concerned, the core criterion of cognitive computing is whether the system, however architected, has the net effect of speeding up any form of cognition, executing on hardware and/or wetware.

Niaz: How is Big Data Analytics changing the nature of building great products? What do you think about the future of products?

James: That’s a great question that I haven’t explored to much extent. My sense is that more “products” are in fact “services”–such as online media, entertainment, and gaming–that, as an integral capability, feed on the Big Data generated by their users. Companies tune the designs, interaction models, and user experiences of these productized services through Big Data analytics. To the extent that users respond or don’t respond to particular features of these services, that will be revealed in the data and will trigger continuous adjustments in product/service design. New features might be added on a probationary basis, to see how users respond, and just as quickly withdrawn or ramped up in importance.

This new product development/refinement loop is often referred to as “real-world experiments.” The process of continuous, iterative, incremental experimentation both generates and depends on a steady feed of Big Data. It also requires data scientists to play a key role in the product-refinement cycle, in partnership with traditional product designers and engineers.  Leading-edge organizations have begun to emphasize real-world experiments as a fundamental best practice within their data-science, next-best-action, and process-optimization initiatives.

Essentially, real-world experiments put the data-science “laboratory” at the heart of the big data economy.  Under this approach, fine-tuning of everything–business model, processes, products, and experiences–becomes a never-ending series of practical experiments. Data scientists evolve into an operational function, running their experiments–often known as “A/B tests”–24×7 with the full support and encouragement of senior business executives.

The beauty of real-world experiments is that you can continuously and surreptitiously test diverse product models inline to your running business. Your data scientists can compare results across differentially controlled scenarios in a systematic, scientific manner. They can use the results of these in-production experiments–such as improvements in response, acceptance, satisfaction, and defect rates on existing products/services–to determine which work best with various customers under various circumstances.
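To make the A/B comparison concrete, here is a minimal sketch of one common way to compare response rates between two variants: a two-proportion z-test. The interview does not name a specific statistical method, so the choice of test, the function name and the counts below are all assumptions for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in response rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (every user or no user responded): no evidence.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: A is the existing feature, B the probationary one.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between variants is unlikely to be chance alone, which is the kind of evidence that would justify ramping a probationary feature up or withdrawing it.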

Niaz: What is a big data product? How can someone make beautiful stuff with data?

James: What is a Big Data product? It’s any product or service that helps people to extract deep value from advanced analytics and trustworthy data at all scales, but especially at the extreme scales of volume (petabytes and beyond), velocity (continuous, streaming, real-time, low-latency), and/or variety (structured, semi-structured, unstructured, streaming, etc.). That definition encompasses products that provide the underlying data storage, database management, algorithms, metadata, modeling, visualization, integration, governance, security, management, and other necessary features to address these use cases. If you track back to my answer above regarding “hybrid” architectures you’ll see a discussion of some of the core technologies.

Making “beautiful stuff with data”? That suggests advanced visualization to call out the key insights in the data. The best data visualizations provide functional beauty: they make the process of sifting through data easier, more pleasant, and more productive for end users, business analysts, and data scientists.

Niaz: Can you please tell us about building a data-driven culture that fosters data-driven innovation to build the next big product?

James: A key element of any data-driven culture is establishing a data science center of excellence. Data scientists are the core developers in this new era of Big Data, advanced analytics, and cognitive computing.

Game-changing analytics applications don’t spring spontaneously from bare earth. You must plant the seeds through continuing investments in applied data science and, of course, in the big data analytics platforms and tools that bring it all to fruition. But you’ll be tilling infertile soil if you don’t invest in sustaining a data science center of excellence within your company. Applied data science is all about putting the people who drill the data in constant touch with those who understand the applications. In spite of the mythology surrounding geniuses who produce brilliance in splendid isolation, smart people really do need each other. Mutual stimulation and support are critical to the creative process, and science, in any form, is a restlessly creative exercise.

In establishing a center of excellence, you may go the formal or informal route. The formal approach is to institute an ongoing process for data-science collaboration, education, and information sharing. As such, the core function of your center of excellence might be to bridge heretofore siloed data-science disciplines that need to engage more effectively. The informal path is to encourage data scientists to engage with each other using whatever established collaboration tools, communities, and confabs your enterprise already has in place. This is the model under which centers of excellence coalesce organically from ongoing conversations.

Creeping polarization, like general apathy, will kill your data science center of excellence if you don’t watch out. Don’t let the center of excellence, formal or informal, degenerate into warring camps of analytics professionals trying to hardsell their pet approaches as the one true religion. Centers of excellence must serve as a bridge, not a barrier, for communication, collegiality, and productivity in applied data science.

Niaz: As you know, leaders and managers have always been challenged to get the right information to make good decisions. Now, with the digital revolution and technological advancement, they have opportunities to access huge amounts of data. How will this trend change management practice? What do you think about the future of decision making, strategy and running organizations?

James: Business agility is paramount in a turbulent world. Big Data is changing the way that management responds to–and gets ahead of–changes in their markets, competitive landscape, and operational conditions.

Increasingly, I prefer to think of big data in the broader context of business agility. What’s most important is that your data platform has the agility to operate cost-effectively at any scale, speed, and scope of business that your circumstances demand.

In terms of scale of business, organizations operate at every scale from breathtakingly global to intensely personal. You should be able to acquire a low-volume data platform and modularly scale it out to any storage, processing, memory and I/O capacity you may need in the future. Your platform should elastically scale up and down as requirements oscillate. Your end-to-end infrastructure should also be able to incorporate platforms of diverse scales—petabyte, terabyte, gigabyte, etc.—with those platforms specialized to particular functions and all of them interoperating in a common fabric.

Where speed is concerned, businesses often have to keep pace with stop-and-start rhythms that oscillate between lightning fast and painfully slow. You should be able to acquire a low-velocity data platform and modularly accelerate it through incorporation of faster software, faster processors, faster disks, faster cache and more DRAM as your need for speed grows. You should be able to integrate your data platform with a stream computing platform for true real-time ingest, processing and delivery. And your platform should also support concurrent processing of diverse latencies, from batch to streaming, within a common fabric.

And on the matter of scope, businesses manage almost every type of human need, interaction and institution. You should be able to acquire a low-variety data platform—perhaps an RDBMS dedicated to marketing—and be able to evolve it as needs emerge into a multifunctional system of record supporting all business functions. Your data platform should have the agility to enable speedy inclusion of a growing variety of data types from diverse sources. It should have the flexibility to handle structured and unstructured data, as well as events, images, video, audio and streaming media with equal agility. It should be able to process the full range of data management, analytics and content management workloads. It should serve the full scope of users, devices and downstream applications.

Agile Big Data platforms can serve as the common foundation for all of your data requirements. Because, after all, you shouldn’t have to go big, fast, or all-embracing in your data platforms until you’re good and ready.

Niaz: In your opinion, given the current available Big Data technologies, what is the most difficult challenge in filtering big data to find useful information?

James: The most difficult challenge is in figuring out which data to ignore, and which data is trustworthy enough to serve as a basis for downstream decision-support and advanced analytics.

Most important, don’t always trust the “customer sentiment” reported by your social-media listening tools as if it were gospel. Yes, you care deeply about how your customers regard your company, your products, and your quality of service. You may be listening to social media to track how your customers—collectively and individually—are voicing their feelings. But do you bother to save and scrutinize every last tweet, Facebook status update, and other social utterance from each of your customers? And if you are somehow storing and analyzing that data—which is highly unlikely—are you linking the relevant bits of stored sentiment data to each customer’s official record in your databases?

If you are, you may be the only organization on the face of the earth that makes the effort. Many organizations implement tight governance only on those official systems of record on which business operations critically depend, such as customers, finances, employees, products, and so forth. For those data domains, data management organizations that are optimally run have stewards with operational responsibility for data quality, master data management, and information lifecycle management.

However, for many big data sources that have emerged recently, such stewardship is neither standard practice nor should it be routine for many new subject-matter data domains. These new domains refer to mainly unstructured data that you may be processing in your Hadoop clusters, stream-computing environments, and other big data platforms, such as social, event, sensor, clickstream, geospatial, and so on.

The key difference from system-of-record data is that many of the new domains are disposable to varying degrees and are not regarded as a single version of the truth about some real-world entity. Instead, data scientists and machine learning algorithms typically distill the unstructured feeds for patterns and subsequently discard the acquired source data, which quickly become too voluminous to retain cost-effectively anyway. Consequently, you probably won’t need to apply much, if any, governance and security to many of the recent sources.

Where social data is concerned, there are several reasons for going easy on data quality and governance. First of all, data quality requirements stem from the need for an officially sanctioned single version of the truth. But any individual social media message constituting the truth of how any specific customer or prospect feels about you is highly implausible. After all, people prevaricate, mislead, and exaggerate in every possible social context, and not surprisingly they convey the same equivocation in their tweets and other social media remarks. If you imagine that the social streams you’re filtering are rich founts of only honest sentiment, you’re unfortunately mistaken.

Second, social sentiment data rarely has the definitive, authoritative quality of an attribute—name, address, phone number—that you would include in or link to a customer record. In other words, few customers declare their feelings about brands and products in the form of tweets or Facebook updates that represent their semiofficial opinion on the topic. Even when people are bluntly voicing their opinions, the clarity of their statements is often hedged by the limitations of most natural human language. Every one of us, no matter how well educated, speaks in sentences that are full of ambiguity, vagueness, situational context, sarcasm, elliptical speech, and other linguistic complexities that may obscure the full truth of what we’re trying to say. Even highly powerful computational linguistic algorithms are challenged when wrestling these and other peculiarities down to crisp semantics.

Third, even if every tweet was the gospel truth about how a customer is feeling and all customers were amazingly articulate on all occasions, the quality of social sentiment usually emerges from the aggregate. In other words, the quality of social data lies in the usefulness of the correlations, trends, and other patterns you derive from it. Although individual data points can be of marginal value in isolation, they can be quite useful when pieced into a larger puzzle.

Consequently, there is little incremental business value from scrutinizing, retaining, and otherwise managing every single piece of social media data that you acquire. Typically, data scientists drill into it to distill key patterns, trends, and root causes, and you would probably purge most of it once it has served its core tactical purpose. This process generally takes a fair amount of mining, slicing, and dicing. Many social-listening tools, including the IBM® Cognos® Consumer Insight application, are geared to assessing and visualizing the trends, outliers, and other patterns in social sentiment. You don’t need to retain every single thing that your customers put on social media to extract the core intelligence that you seek, as in the following questions: Do they like us? How intensely? Is their positive sentiment improving over time? In fact, doing so might be regarded as encroaching on privacy, so purging most of that data once you’ve gleaned the broader patterns is advised.

Fourth, even outright customer lies propagated through social media can be valuable intelligence if we vet and analyze each effectively. After all, it’s useful knowing whether people’s words (“we love your product”) match their intentions (“we have absolutely no plans to ever buy your product”) as revealed through their eventual behavior, for example, buying your competitor’s product instead.

If we stay hip to this quirk of human nature, we can apply the appropriate predictive weights to behavioral models that rely heavily on verbal evidence, such as tweets, logs of interactions with call-center agents, and responses to satisfaction surveys. I like to think of these weights as a truthiness metric, courtesy of Stephen Colbert.

What we can learn from social sentiment data of dubious quality is the situational contexts in which some customer segments are likely to be telling the truth about their deep intentions. We can also identify the channels in which they prefer to reveal those truths. This process helps determine which sources of customer sentiment data to prioritize and which to ignore in various application contexts.
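One way to picture such a "truthiness" weight is sketched below in Python. This is only an illustration under stated assumptions, not a method described in the interview: the weight for a channel is assumed to be the fraction of past statements on that channel that matched later behavior, and all function names and data are hypothetical.

```python
from collections import defaultdict


def truthiness_by_channel(observations):
    """Estimate, per channel, how often stated intent matched behavior.

    `observations` is an iterable of (channel, said_would_buy, did_buy)
    tuples, where the last two items are booleans.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for channel, said, did in observations:
        totals[channel] += 1
        if said == did:
            matches[channel] += 1
    return {ch: matches[ch] / totals[ch] for ch in totals}


def weighted_sentiment(signals, weights):
    """Average sentiment scores, weighting each by its channel's truthiness."""
    total_weight = sum(weights.get(ch, 0.0) for ch, _ in signals)
    if total_weight == 0:
        return 0.0
    return sum(weights.get(ch, 0.0) * s for ch, s in signals) / total_weight


# Made-up history: tweets matched behavior half the time, surveys always.
history = [
    ("tweet", True, False), ("tweet", True, True),
    ("tweet", False, False), ("tweet", True, False),
    ("survey", True, True), ("survey", False, False),
]
weights = truthiness_by_channel(history)
score = weighted_sentiment([("tweet", 1.0), ("survey", -0.5)], weights)
```

In this toy data an enthusiastic tweet is discounted enough that a mildly negative survey response cancels it out, which is the kind of prioritization between sources that the passage describes.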

Last but not least, apply strong governance only to data that has a material impact on how you engage with customers, remembering that social data rarely meets that criterion. Customer records contain the key that determines how you target pitches to them, how you bill them, where you ship their purchases, and so forth. For these purposes, the accuracy, currency, and completeness of customers’ names, addresses, billing information, and other profile data are far more important than what they tweeted about the salesclerk in your Poughkeepsie branch last Tuesday. If you screw up the customer records, the adverse consequences for all concerned are far worse than if you misconstrue their sentiment about your new product as slightly positive, when in fact it’s deeply negative.

However, if you greatly misinterpret an aggregated pattern of customer sentiment, the business risks can be considerable. Customers’ aggregate social data helps you compile a comprehensive portrait of the behavioral tendencies and predispositions of various population segments. This compilation is essential market research that helps gauge whether many high-stakes business initiatives are likely to succeed. For example, you don’t want to invest in an expensive promotional campaign if your target demographic isn’t likely to back up their half-hearted statement that your new product is “interesting” by whipping out their wallets at the point of sale.

The extent to which you can speak about the quality of social sentiment data all comes down to relevance. Sentiment data is good only if it is relevant to some business initiative, such as marketing campaign planning or brand monitoring. It is also useful only if it gives you an acceptable picture of how customers are feeling and how they might behave under various future scenarios. Relevance means having sufficient customer sentiment intelligence, in spite of underlying data quality issues, to support whatever business challenge confronts you.

Niaz: How do you see data science evolving in the near future?

James: In the near future, many business analysts will enroll in data science training curricula to beef up their statistical analysis and modeling skills in order to stay relevant in this new age.

However, they will confront a formidable learning curve. To be an effective, well-rounded data scientist, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in a core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation: paradigms and practices, algorithms and modeling, tools and platforms, and applications and outcomes.

Classroom instruction is important, but a data-science curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect them actually developing statistical models that use real data and address substantive business issues.

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Niaz: We have already seen the huge implications and remarkable results of Big Data from tech giants. Do you think Big Data can also play a great role in solving social problems? Can we measure and connect all of our big and important social problems and design sustainable solutions with the help of Big Data?

James: Of course. Big Data is already being used worldwide to address the most pressing problems confronting humanity on this planet. In terms of “measuring and connecting all our big and important social problems and designing sustainable solutions,” that’s a matter for collective human ingenuity. Big Data is a tool, not a panacea.

Niaz: Can you please tell us about ‘Open Source Analytics’ for Big Data? What open-source initiatives have IBM’s Big Data group and other groups (startups) undertaken or planned?

James: The principal open-source communities in the big data analytics industry are Apache Hadoop and R. IBM is an avid participant in both communities, and has incorporated these technologies into our solution portfolio.

Niaz: What are some of the concerns (privacy, security, regulation) that you think can dampen the promise of Big Data?

James: You’ve named three of them. Overall, businesses should embrace the concept of “privacy by design” – a systematic approach that takes privacy into account from the start – instead of trying to add protection after the fact. In addition, the sheer complexity of the technologies and their steep learning curve are barriers to realizing their full promise. All of these factors introduce time, cost, and risk into the Big Data ROI equation.

Niaz: What are the new technologies you are most passionate about? What are going to be the next big things?

James: Where to start? I prefer that your readers follow my IBM Big Data Hub blog to see the latest things I’m passionate about.

Niaz: Last but not least, what is your advice for Big Data startups and for people who are working with Big Data?

James: Find your niche in the Big Data analytics industry ecosystem, go deep, and deliver innovation. It’s a big, growing, exciting industry. Brace yourself for constant change. Be prepared to learn (and unlearn) something new every day.

Niaz: Dear James, thank you very much for your invaluable time and also for sharing with us your incredible ideas, insights, knowledge and experiences. We wish you the very best of luck in all of your upcoming endeavors.

_  _  _  _  ___  _  _  _  _

Further Reading:

1. Viktor Mayer-Schönberger on Big Data Revolution

2. Gerd Leonhard on Big Data and the Future of Media, Marketing and Technology

3. Ely Kahn on Big Data, Startup and Entrepreneurship

4. Brian Keegan on Big Data

5. danah boyd on Future of Technology and Social Media

6. Irving Wladawsky-Berger on Evolution of Technology and Innovation

7. Horace Dediu on Asymco, Apple and Future of Computing

8. James Allworth on Disruptive Innovation

Ely Kahn: Big Data, Startup and Entrepreneurship

Editor’s Note: Ely Kahn is the Co-founder and VP of Business Development for Sqrrl, a Big Data Startup. Previously, Ely served in a variety of positions in the Federal Government, including Director of Cybersecurity at the National Security Staff in the White House, Deputy Chief of Staff at the National Protection Programs Directorate in the Department of Homeland Security, and Director of Risk Management and Strategic Innovation in the Transportation Security Administration. Before his service in the Federal Government, Ely was a management consultant with Booz Allen Hamilton. Ely has a BA from Harvard University and an MBA from the Wharton School at the University of Pennsylvania.

You can find him on Twitter and LinkedIn. Learn more about his Big Data Startup Sqrrl [here]

eTalk’s Niaz Uddin has interviewed Ely Kahn recently to gain insights about Big Data, Startup and Entrepreneurship which is given below.

Niaz: Dear Ely, thank you so much for joining us in the midst of your busy schedule. We are very thrilled to have you at eTalks.

Ely: My pleasure. Thank you for having me.

Niaz: You’re a former management consultant and senior government official who turned Big Data Entrepreneur. At the beginning of our interview, can you please tell us something about entrepreneurship? What is entrepreneurship? Why are you an entrepreneur?

Ely: While in government, I viewed myself as an “intrapreneur”, and I focused on developing new public sector programs that could disrupt traditional ways of doing business.  Moving to private sector entrepreneurship was a natural evolution for me.  Entrepreneurship takes all different forms, but the type of entrepreneurship that is most interesting to me is modeled around Clayton Christensen’s theory of “Disruptive Innovation.”

Niaz: You have a BA from Harvard University and an MBA from the Wharton School at the University of Pennsylvania. You’ve served in a variety of positions in the Federal Government, and before your service in the Federal Government, you were a management consultant with Booz Allen Hamilton. How did you move your career into entrepreneurship, and why? What’s the most exciting thing about entrepreneurship to you?

Ely: Innovation has been a key theme in all my jobs so far and cuts across consulting, government, and startups.  However, business school was actually an incredibly valuable tool for making the transition from government to a technology startup.  More than anything, it was two years that allowed me to explore different startup ideas in a very low risk environment.

The most exciting thing about entrepreneurship for me is the continuous learning environment.  Every week it seems I am picking up something new across a wide variety of functional areas, including sales, marketing, business development, product management, and finance.

Niaz: You’re the Co-founder and VP of Business Development of Sqrrl, a Big Data company. How did the idea for Sqrrl come up, and how did you get started?

Ely: Sqrrl’s technology has its roots in the National Security Agency (NSA) and that technology is called Accumulo.  Accumulo powers many of NSA’s analytic programs.  I was introduced to the NSA engineers that helped create Accumulo while I was in business school, and from there I started to put together the business plan and investor pitch to commercialize Accumulo.

Niaz: At this point, can you please tell us a bit about funding? Who are the core investors at Sqrrl?

Ely: We have two world-class investors:  Atlas Venture and Matrix Partners.  We closed a $2M seed round with them in August 2012.

Niaz: So everything you are doing at Sqrrl is all about Big Data and Big Data products. Can you please tell us what Big Data is?

Ely: Big Data is generally referred to as data that cannot be processed using traditional database technologies because of the volume, velocity, and variety of data.  Big Data typically includes tera- and petabytes of structured, semi-structured, and unstructured data, and examples are sensor data, social media, clickstreams, and log files.

Niaz: Why do you think Big Data is the next big opportunity for all of us?

Ely: Big Data technologies like Hadoop and Accumulo enable companies to analyze datasets that were previously too expensive or burdensome to process.  This analysis can become new forms of competitive advantage or can open up completely new lines of business.

Niaz: How do you define a Big Data product? Can you please give us some examples of Big Data products?

Ely: Big Data products span a wide range of technologies, including storage, databases, analytical tools, and visualization platforms.  Two classes of Big Data technologies that are of particular importance are Hadoop vendors and NoSQL database vendors.  Hadoop + NoSQL enable organizations to process petabytes of multi-structured data in real-time.

Niaz: How will Big Data products change the perception of building products?

Ely: Many Big Data products are still “crossing the chasm” from early adopters to mainstream users.  However, these products have the potential to bring the power of massive parallel computing to many companies.  Historically, these types of capabilities have been the domain of massive web companies like Google and Facebook or large government agencies like the NSA.

Niaz: Now can you please briefly tell us about Sqrrl?

Ely: Sqrrl is the provider of a Big Data platform that powers secure, real-time applications.  Our technology leverages both Apache Hadoop and Apache Accumulo, which are open source software technologies.

Niaz: What are your core products and who are the main customers of Sqrrl?

Ely: Our technology offering is called Sqrrl Enterprise and it enables organizations to securely bring their data together on a single platform and easily build real-time applications that leverage this data.  Some of the use cases for Sqrrl Enterprise include serving as the platform for applications that detect insider threats in financial services companies or serving as the platform for predictive medicine in healthcare companies.

Niaz: You started in August 2012. How is the company doing now?

Ely: The company is doing great.  We now have about 20 employees and a number of customers in a variety of industries.

Niaz: What is your vision at Sqrrl?

Ely: Our vision is to enable organizations to “securely analyze everything.”  Our Big Data platform helps organizations perform analytics on massive amounts of data, and oftentimes this data has very strict privacy or security requirements on it.

Niaz: How big is the Big Data industry?

Ely: According to the analyst firm Wikibon “the Big Data market is projected to reach $18.1 billion in 2013… [and] on pace to exceed $47 billion by 2017.”

Niaz: What do you think about the other Big Data startups? How is the Big Data community doing?

Ely: There is an amazing ecosystem of Big Data startups that are doing some amazingly innovative things.  I am paying particularly close attention to startups focused on machine learning and data visualization, as these are complementary areas to our product.

Niaz: Well, we all know that starting a company is not an easy task. So, can you please tell us what difficulties we may face when starting a company?

Ely: The thing that is fascinating about doing a startup is that there is a never ending series of challenges:  raising funding, hiring, finding product-market fit, customer acquisition and retention, and the list goes on.  The key is to be continuously prioritizing where to spend your time.

Niaz: What have you learned by starting a company?

Ely: I have learned many things, but the lesson that I am continuously learning is to be resilient.  Startups are inevitably filled with small failures, but the key is to quickly learn from them to avoid any large failures.

Niaz: What are the mistakes an entrepreneur can make in the early stage?

Ely: I think the biggest mistake that an entrepreneur can make is being afraid to make mistakes.  Early stage entrepreneurs need to be continuously running experiments to find product-market fit.

Niaz: Can you please share some of your life lessons for our readers?

Ely: Stay humble.  Entrepreneurship requires both luck and skill, and I think people sometimes mistake luck for skill.

Niaz: Thank you so much for joining us and sharing your invaluable ideas, insights and knowledge. We are wishing you very good luck for the greater success of Sqrrl.

Ely: Many thanks.

_  _  _  _  ___  _  _  _  _

Further Reading:

1. James Allworth on Disruptive Innovation

2. Viktor Mayer-Schönberger on Big Data Revolution

3. Gerd Leonhard on Big Data and the Future of Media, Marketing and Technology

4. Brian Keegan on Big Data

5. Irving Wladawsky-Berger on Evolution of Technology and Innovation

Brian Keegan: Big Data

Editor’s Note: Brian Keegan is a post-doctoral research fellow in Computational Social Science with David Lazer at Northeastern University. He defended his Ph.D. in the Media, Technology, and Society program at Northwestern University’s School of Communication.  He also attended the Massachusetts Institute of Technology and received bachelor’s degrees in Mechanical Engineering and Science, Technology, and Society in 2006.

His research employs a variety of large-scale behavioral data sets such as Wikipedia article revision histories, massively-multiplayer online game behavioral logs, and user interactions in a crowd-sourced T-shirt design community. He uses methods in network analysis, multilevel statistics, simulation, and content analysis. To learn more about him, please visit his official website Brianckeegan.com.

eTalk’s Niaz Uddin has interviewed Brian Keegan recently to gain his ideas and insights about Big Data, Data Science and Analytics which is given below.

Niaz: Brian, we are really excited to have you here to talk about Big Data. Let’s start from the beginning. How do you define Big Data?

Brian: Thank you, Niaz, for having me. Well, a common joke in the community is that “big data” is anything that makes Excel crash. That’s fair neither to Microsoft, because the dirty secret of data science is that you can get pretty far using Excel, nor to researchers whose data could hypothetically fit in Excel but are so complicated that it would make no sense to try in the first place.

Big data is distinct from traditional industry and academic approaches to data analysis because of what are called the three Vs: volume, variety, velocity.

      • Volume is what we think of immediately – server farms full of terabytes of user data waiting to be analyzed. This data doesn’t fit into a single machine’s memory, hard drive, or even a traditional database. The size of the data makes analysis with traditional tools really hard, which is why new tools are being created.
      • Second, there’s variety, which reflects the fact that data aren’t just lists of numbers, but include complex social relationships, collections of text documents, and sensor readings. The scope of the data means that all these different kinds of data have different structures, granularity, and errors, which need to be cleaned and integrated before you can start to look for relationships among them. Cleaning data is fundamentally unsexy and grueling work, but if you put garbage into a model, all you get is garbage back out. Making sure all these diverse kinds of data are playing well with each other and the models you run on them is crucial.
      • Finally, there’s velocity, which reflects the fact that data are not only being created in real time, but people want to act on the incoming information in real time as well. This means the analysis also has to happen in real time, which is quite different from the old days, when a bunch of scientists could sit around for weeks testing different kinds of models on data collected months or years ago before writing a paper or report that takes still more months before it’s published. APIs, dashboards, and alerts are part of big data because they make data available fast.
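To make the "variety" point concrete, here is a small Python sketch of the kind of cleaning and integration step described above. It is only an illustration: the field names (`user`, `user_id`, `value`) and the rules for what counts as unusable are assumptions invented for the example, not something from the interview.

```python
def clean_records(raw_records):
    """Normalize records from differently structured sources.

    Each record is a dict. Sources are assumed to name the user field
    either "user" or "user_id" and to store "value" as a number or string.
    Records with no user, an unparseable value, or that duplicate an
    earlier (user, value) pair are dropped.
    """
    cleaned = []
    seen = set()
    for rec in raw_records:
        user = str(rec.get("user") or rec.get("user_id") or "").strip().lower()
        try:
            value = float(rec.get("value"))
        except (TypeError, ValueError):
            continue  # skip rows whose measurement cannot be parsed
        if not user or (user, value) in seen:
            continue  # skip anonymous rows and exact duplicates
        seen.add((user, value))
        cleaned.append({"user": user, "value": value})
    return cleaned


# Made-up input mixing two source layouts and several kinds of garbage.
raw = [
    {"user": " Alice ", "value": "3.5"},
    {"user_id": "alice", "value": 3.5},   # duplicate after normalization
    {"user": "bob", "value": "n/a"},      # unparseable value
    {"user": "", "value": 1},             # missing user
    {"user_id": "Carol", "value": None},  # missing value
    {"user": "dave", "value": "2"},
]
cleaned = clean_records(raw)
```

Only two of the six input rows survive here, which is a small-scale version of keeping garbage out of whatever model runs downstream.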

Niaz: Can you please provide us some examples?

Brian: Data that is big is definitely not new. The US Census two centuries ago still required collecting and analyzing millions of data points collected by hand. Librarians and archivists have always struggled with how to organize, find, and share information on millions of physical documents like books and journals. Physicists have been grappling with big data for decades where the data is literally astronomical. Biologists sequencing the genome needed ways to manipulate and compare data involving billions of base pairs.

While “data that was big” existed before computers, the availability of cheap computation has accelerated and expanded our ability to collect, process, and analyze data that is big. So while we now think of things like tweets or financial transactions as “big data” because these industries have rushed to adopt or are completely dependent upon computation, it’s important to keep in mind that lots of big data exist outside of social media, finance, and e-commerce and that’s where a lot of opportunities and challenges still exist.

Niaz: What are some of the possible use cases for big data analytics? Which major companies are producing gigantic amounts of data?

Brian: Most people think of internet companies like Google, Facebook, Twitter, LinkedIn, FourSquare, Netflix, Amazon, Yelp, Wikipedia, and OkCupid when they think of big data. These companies are definitely the pioneers in coming up with the algorithms, platforms, and other tools, like PageRank, map-reduce, user-generated content, and recommender systems, that require combining millions of data points to provide fast and relevant content.

    • Companies like Crimson Hexagon mine Twitter and other social media streams for their clients to detect patterns of novel phrases or changes in the sentiment associated with keywords and products. This can let their clients know if people are having problems with a product or if a new show is generating a lot of buzz despite mediocre ratings.
    • The financial industry uses big data not only for high-frequency trading based on combining signals from across the market, but also evaluating credit risks of customers by combining various data sets. Retailers like Target and WalMart have large analytics teams that examine consumer transactions for behavioral patterns so they know what products to feature. Telecommunications companies like AT&T or Verizon collect call data records produced by every cell phone on their networks that lets them know your location over time so they can improve coverage. Industrial companies like GE and Boeing put more and more sensors into their products so that they can monitor performance and anticipate maintenance.
    • Finally, one of the largest producers and consumers of big data is the government. Law enforcement agencies publish data about crime and intelligence agencies monitor communication data from suspects. The Bureau of Labor Statistics, Federal Reserve, and World Bank collect and publish extremely rich and useful economic time series data. Meteorologists collect and analyze large amounts of data to make weather forecasts.

Niaz: Why has big data become so important now?

Brian: Whether it was business, politics, or military, decisions were (and continue to be) made under uncertainty about history or context because getting timely and relevant data was basically impossible. Directors didn’t know what customers were saying about their product, politicians didn’t know the issues constituents were talking about, and officers faced a fog of war. Ways of getting data were often slow and/or suspect: for example, broadcast stations used to price advertising time by paying a few dozen people in a city to keep journals of what stations they remembered hearing every day. Looking back now, this seems like an insane way not only to collect data but also to make decisions based on obviously unreliable data, but it’s how things were done for decades because there was no better way of measuring what people were doing. The behavioral traces we leave in tweets and receipts are not only much finer-grained and more reliable, but also encompass a much larger and more representative sample of people and their behaviors.

Data lets decision makers know and respond to what the world really looks like instead of going on their gut. More data usually gives a more accurate view, but too much data can also overwhelm and wash out the signal with noise. The job of data scientists is less about trying to find a single needle in a haystack and more about collecting as much hay as possible, to be sure there are a few needles in there, before sorting through the much bigger haystack. In other words, data might be collected for one goal, but it can also be repurposed for other goals and follow-on questions that come along to provide new insights. More powerful computers, algorithms, and platforms make assembling and sorting through these big haystacks much easier than before.

Niaz: Recently I have seen IBM has started to work with Big Data. What roles do companies like IBM play in this area?

Brian: IBM is just one of many companies that are racing “upstream” to analyze data on larger and more complex systems, like an entire city, by aggregating data on tweets, traffic, surveillance cameras, electricity consumption, and emergency services, which feed into each other. IBM is an example of an organization that has shifted from providing value by transforming raw materials into products like computers to transforming raw data into unexpected insights about how a system works, or doesn’t. The secret sauce is collecting existing data, building new data collection systems, and developing statistical models and platforms that are able to work in the big data domain of volume, variety, and velocity, something traditional academic training doesn’t equip people for.

Niaz: What are the benefits of Big Data to business? How is it influencing innovation and business?

Brian: Consider the market capitalization of three major tech companies on a per-employee basis: Microsoft makes software and hardware as well as running web services like Bing based on big data and is worth about $2.5 million per employee, Google mostly makes software and runs web services and is worth about $4.6 million per employee, and Facebook effectively just runs a web service of its social network site and is worth about $19 million per employee. These numbers may be outliers or unreliable for a variety of reasons, but the trend suggests that organizations like Facebook focused solely on data produce more value per employee.

This obviously isn’t a prescription for every company: ExxonMobil, WalMart, GE, and Berkshire produce value in fundamentally different ways. But Facebook did find a way to capture and analyze data about the world, namely our social relationships and preferences, that was previously hidden. There are other processes happening beyond the world of social media that currently go uncaptured, but with the advent of new sensors and opportunities for collecting data, they will become ripe for the picking. Mobile phones in developing countries will reveal patterns of human mobility that could transform finance, transportation, and health care. RFIDs on groceries and other products could reveal patterns of transportation and consumption that could reduce wasted food while opening new markets. Smart meters and grids could turn the tide against global climate change while lowering energy costs. Politicians could be made more accountable and responsive through crowdsourced fundraising and analysis of regulatory disclosures. The list of data out there waiting to be collected and analyzed boggles the mind.

Niaz: How do you define a Data Scientist? What suggestions do you have for those who want to become data scientists?

Brian: A data scientist needs familiarity with a wide set of skills, so much so that it’s impossible for them to be expert in all of them.

      • First, data scientists need the computational skills that come from learning a programming language like Python or Java so that they can acquire, clean up, and manipulate data from databases and APIs, hack together different programs developed by people who are far more expert in network analysis or natural language processing, and use difficult tools like MySQL and Hadoop. There’s no point-and-click program out there with polished tutorials that does everything you’ll need from end to end. Data scientists spend a lot of time writing code, working at the command line, and reading technical documentation, but there are tons of great resources like StackOverflow, GitHub, free online classes, and active and friendly developer communities where people are happy to share code and solutions.
      • Second, data scientists need statistical skills at both a theoretical and methodological level. This is the hardest part and favors people who have backgrounds in math and statistics, computer and information sciences, physical sciences and engineering, or quantitative social sciences. Theoretically, they need to know why some kinds of analyses should be run on some kinds but not other kinds of data and what the limitations of one kind of model are compared to others. Methodologically, data scientists need to actually be able to run these analyses using statistical software like R, interpret the output of the analyses, and do the statistical diagnostics to make sure all the assumptions that are baked into a model are actually behaving properly.
      • Third, data scientists need some information visualization and design skills so they can communicate their findings in an effective way with charts or interactive web pages for exploration. This means learning to use packages like ggplot in R or matplotlib in Python for statistical distributions, d3 in JavaScript for interactive web visualizations, or Gephi for network visualizations.

All of the packages I mentioned are open source, which also reflects the culture in the data science community; expensive licenses for software or services are very suspect because others should be able to easily replicate and build upon your analysis and findings.

Niaz: Finally, what do you think about the impact of Big Data in our everyday life?

Brian: Big Data is a dual-use technology that can satisfy multiple goals, some of which may be valuable and others of which may be unsavory. On one hand, it can help entrepreneurs be more nimble and open new markets or help researchers gain new insights about how the world works; on the other hand, the Arab Spring suggested it can also reinforce the power of repressive regimes to monitor dissidents or of unsavory organizations to do invasive personalized marketing.

Danah Boyd and Kate Crawford have argued persuasively about how the various possibilities of big data to address societal ills or undermine social structure obscure the very real but subtle changes that are happening right now that replace existing theory and knowledge, cloak subjectivity with quantitative objectivity, confuse bigger data with better data, separate data from context and meaning, raise real ethical questions, and create or reinforce inequalities.

Big data also raises complicated questions about who has access to data. On one hand, privacy is a paramount concern as organizations shouldn’t be collecting or sharing data about individuals without their consent. On the other hand, there’s also the expectation that data should be shared with other researchers so they can validate findings. Furthermore, data should be preserved and archived so that it is not lost to future researchers who want to compare or study changes over time.

Niaz: Brian, thank you so much for giving me your time in the midst of your busy schedule. It is really great to learn the details of Big Data from you. I wish you good luck with your study, research, projects and work.

Brian: You are welcome. Good luck to you too.

_  _  _  _  ___  _  _  _  _

Further Reading:

1. Viktor Mayer-Schönberger on Big Data Revolution

2. Gerd Leonhard on Big Data and the Future of Media, Marketing and Technology

3. Ely Kahn on Big Data, Startup and Entrepreneurship

4. James Kobielus on Big Data, Cognitive Computing and Future of Product

5. danah boyd on Future of Technology and Social Media

6. Irving Wladawsky-Berger on Evolution of Technology and Innovation

7. Horace Dediu on Asymco, Apple and Future of Computing

8. James Allworth on Disruptive Innovation