Editor’s Note: As IBM’s Big Data Evangelist, James Kobielus is IBM Senior Program Director, Product Marketing, Big Data Analytics Solutions. He is an industry veteran, a popular speaker and social media participant, and a thought leader in Big Data, Hadoop, Enterprise Data Warehousing, Advanced Analytics, Business Intelligence, Data Management, and Next Best Action Technologies. He works with IBM’s product management and marketing teams in Big Data. He has spoken at such leading industry events as IBM Information On Demand, IBM Big Data Integration and Governance, Hadoop Summit, Strata, and Forrester Business Process Forum. He has published several business technology books and is a very popular provider of original commentary on blogs and other social media.
eTalk’s Niaz Uddin recently interviewed James Kobielus to gain insights into his ideas, research, and work in the field of Big Data. The interview is given below.
Niaz: Dear James, thank you so much for joining us in the midst of your busy schedule. We are very thrilled and honored to have you at eTalks.
James: And I’m thrilled and honored that you asked me.
Niaz: You are a leading expert on Big Data, as well as on such enabling technologies as enterprise data warehousing, advanced analytics, Hadoop, cloud services, database management systems, business process management, business intelligence, and complex-event processing. At the beginning of our interview can you please tell us about Big Data? How does Big Data make sense of the new world?
James: Big Data refers to approaches for extracting deep value from advanced analytics and trustworthy data at all scales. At the heart of advanced analytics is data mining, which is all about using statistical analysis to find non-obvious patterns (segmentations, correlations, trends, propensities, etc.) within historical data sets.
Some might refer to advanced analytics as tools for “making sense” of this data in ways that are beyond the scope of traditional reporting and visualization. As we aggregate and mine a wider variety of data sources, we can find far more “sense”–also known as “insights”–that previously lay under the surface. Likewise, as we accumulate a larger volume of historical data from these sources and incorporate a wider variety of variables from them into our models, we can build more powerful predictive models of what might happen under various future circumstances. And if we can refresh this data rapidly with high-velocity high-quality feeds, while iterating and refining our models more rapidly, we can ensure that our insights reflect the latest, greatest data and analytics available.
That’s the power of Big Data: achieve more data-driven insights (aka “making sense”) by enabling our decision support tools to leverage the “3 Vs”: a growing Volume of stored data, higher Velocity of data feeds, and broader Variety of data sources.
Niaz: As you know, Big Data has already started to redefine search, media, computing, social media, products, services and so on. The availability of data is helping us analyze trends and do interesting things more accurately and efficiently than before. What are some of the most interesting uses of big data out there today?
James: Where do I start? There are interesting uses of Big Data in most industries and in most business functions.
Cognitive computing is a term that probably goes over the head of most of the general public. IBM defines it as the ability of automated systems to learn and interact naturally with people to extend what either man or machine could do on their own, thereby helping human experts drill through big data rapidly to make better decisions.
One way I like to describe cognitive computing is as the engine behind “conversational optimization.” In this context, the “cognition” that drives the “conversation” is powered by big data, advanced analytics, machine learning and agile systems of engagement. Rather than rely on programs that predetermine every answer or action needed to perform a function or set of tasks, cognitive computing leverages artificial intelligence and machine learning algorithms that sense, predict, infer and, if they drive machine-to-human dialogues, converse.
Cognitive computing performance improves over time as systems build knowledge and learn a domain’s language and terminology, its processes and its preferred methods of interacting. This is why it’s such a powerful conversation optimizer. The best conversations are deep in give and take, questioning and answering, tackling topics of keenest interest to the conversants. When one or more parties has deep knowledge and can retrieve it instantaneously within the stream of the moment, the conversation quickly blossoms into a more perfect melding of minds. That’s why it has been deployed into applications in healthcare, banking, education and retail that build domain expertise and require human-friendly interaction models.
IBM Watson is one of the most famous exemplars of the power of cognitive computing driving agile human-machine conversations. In its famous “Jeopardy!” appearance, Watson illustrated how its Deep Question and Answer technology—which is cognitive computing to the core—can revolutionize the sort of highly patterned “conversation” characteristic of a TV quiz show. By having its Deep Q&A results rendered (for the sake of that broadcast) in a synthesized human voice, Watson demonstrated how it could pass (and surpass) any Turing test that tried to tell whether it was a computer rather than, say, Ken Jennings. After all, the Turing test is conversational at its very core.
What’s powering Watson’s Deep Q&A technology is an architecture that supports an intelligent system of engagement. Such an architecture is able to mimic real human conversation, in which the dialogue spans a broad, open domain of subject matter; uses natural human language; is able to process complex language with a high degree of accuracy, precision and nuance; and operates with speed-of-thought fluidity.
Where the “Jeopardy!” conversational test was concerned (and where the other participants were humans literally at the top of that game), Watson was super-optimized. In the real world of natural human conversation, however, the notion of “conversation optimization” might seem, at first glance, like a pointy-headed pipedream par excellence. But you don’t have to be an academic sociologist to realize that society, cultures and situational contexts impose many expectations, constraints and other rules to which our conversations and actions must conform (or face disapproval, ostracism, or worse). Optimizing our conversations is critical to surviving and thriving in human society.
Wouldn’t it be great to have a Watson-like Deep Q&A adviser to help us understand the devastating faux pas to avoid and the right bon mot to drop into any conversation while we’re in the thick of it? That’s my personal dream and I’ll bet that before long, with mobile and social coming into everything, it will be quite feasible (no, this is not a product announcement—just the dream of one IBMer). But what excites me even more (and is definitely not a personal pipedream), is IBM Watson Engagement Advisor, which we unveiled earlier this year. It is a cognitive-computing assistant that revolutionizes what’s possible in multichannel B2C conversations. The solution’s “Ask Watson” feature uses Deep Q&A to greet customers, conduct contextual conversations on diverse topics, and ensure that the overall engagement is rich with answers, guidance and assistance.
Cognitive/conversational computing is also applicable to “next best action,” which is one of today’s hottest new focus areas in intelligent systems. At its heart, next best action refers to an intelligent infrastructure that optimizes agile engagements across many customer-facing channels, including portal, call center, point of sale, e-mail and social. With cognitive-computing infrastructure as the silent assistant, customers engage in a never-ending whirligig of conversations with humans and, increasingly, with automated bots, recommendation engines and other non-human components that, to varying degrees, mimic real-human conversation.
Niaz: So do you think machine learning is the right way to analyze Big Data?
James: Machine learning is an important approach for extracting fresh insights from unstructured data in an automated fashion, but it’s not the only approach. For example, machine learning doesn’t eliminate the need for data scientists to build segmentation, regression, propensity, and other models for data mining and predictive analytics.
Fundamentally, machine learning is a productivity tool for data scientists, helping them to get smarter, just as machine learning algorithms can’t get smarter without some ongoing training by data scientists. Machine learning allows data scientists to train a model on an example data set, and then leverage algorithms that automatically generalize and learn both from that example and from fresh feeds of data. To varying degrees, you’ll see the terms “unsupervised learning,” “deep learning,” “computational learning,” “cognitive computing,” “machine perception,” “pattern recognition,” and “artificial intelligence” used in this same general context.
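The train-then-generalize loop described above can be sketched in a few lines. This is my own minimal illustration, not an IBM example; the library choice (scikit-learn), the feature, and the labels are all invented for demonstration:

```python
# Sketch of the machine learning workflow: a data scientist trains a model
# on an example data set, then the algorithm generalizes to fresh data feeds.
from sklearn.linear_model import LogisticRegression

# Hypothetical example (training) data set: feature = hours of product usage,
# label = 1 if the customer renewed, 0 if the customer churned.
X_train = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
y_train = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)  # "training" on the example data set

# A fresh feed of data: the model generalizes beyond the examples it saw.
X_fresh = [[1.5], [9.5]]
predictions = model.predict(X_fresh)
print(predictions)  # low usage resembles churners, high usage resembles renewers
```

In practice the data scientist's ongoing role is exactly what the code hides: choosing the features, labeling the examples, and retraining as fresh data arrives.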
Machine learning doesn’t mean that the resultant learning is always superior to what human analysts might have achieved through more manual knowledge-discovery techniques. But you don’t need to believe that machines can think better than or as well as humans to see the value of machine learning. We gladly offload many cognitive processes to automated systems where there just aren’t enough flesh-and-blood humans to exercise their highly evolved brains on various analytics tasks.
Niaz: What are the available technologies out there that help profoundly to analyze data? Can you please briefly tell us about Big Data technologies and their important uses?
James: Once again, it’s a matter of “where do I start?” The range of Big Data analytics technologies is wide and growing rapidly. We live in the golden age of database and analytics innovation. Their uses are everywhere: in every industry, every business function, and every business process, both back-office and customer-facing.
For starters, Big Data is much more than Hadoop. Another big data “H”—hybrid—is becoming dominant, and Hadoop is an important (but not all-encompassing) component of it. In the larger evolutionary perspective, big data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing enterprise data warehouses, in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.
Hybrid architectures address the heterogeneous reality of big data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent big data platform is fit-for-purpose to the role for which it’s best suited. These big data deployment roles may include any or all of the following: data acquisition, collection, transformation, movement, cleansing, staging, sandboxing, modeling, governance, access, delivery, archiving, and interactive exploration. In any role, a fit-for-purpose big data platform often supports specific data sources, workloads, applications, and users.
Hybrid is the future of big data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in big data deployments.
Hybrid deployments are already widespread in many real-world big data deployments. The most typical are the three-tier—also called “hub-and-spoke”—architectures. These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.
The complexity of hybrid architectures depends on the range of sources, workloads, and applications you’re trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured.
In the hub tier, you may need disparate clusters configured with different underlying data platforms—RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on—and corresponding metadata, governance, and in-database execution components.
And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
Niaz: That’s really amazing. How do you connect these two dots: Big Data Analytics and Cognitive Computing? How does this connection make sense?
James: The relationship between cognitive computing and Big Data is simple. Cognitive computing is an advanced analytic approach that helps humans drill through the unstructured data within Big Data repositories more rapidly in order to surface correlations, patterns, and insights.
Think of cognitive computing as a “speed-of-thought accelerator.” Speed of thought is something we like to imagine operates at a single high-velocity setting. But that’s just not the case. Some modes of cognition are painfully slow, such as pondering the bewildering panoply of investment options available under your company’s retirement plan. But some other modes are instantaneous, such as speaking your native language, recognizing an old friend, or sensing when your life may be in danger.
None of this is news to anybody who studies cognitive psychology or has followed advances in artificial intelligence, aka AI, over the past several decades. Different modes of cognition have different styles, speeds, and spheres of application.
When we speak of “cognitive computing,” we’re generally referring to the ability of automated systems to handle the conscious, critical, logical, attentive, reasoning mode of thought that humans engage in when they, say, play “Jeopardy!” or try to master some rigorous academic discipline. This is the “slow” cognition that Nobel-winning psychologist/economist Daniel Kahneman discussed in a recent IBM Colloquium speech.
As anybody who has ever watched an expert at work will attest, this “slow” thinking can move at lightning speed when the master is in his or her element. When a subject-domain specialist is expounding on their field of study, they often move rapidly from one brilliant thought to the next. It’s almost as if these thought-gems automatically flash into their mind without conscious effort.
This is the cognitive agility that Kahneman examined in his speech. He described the ability of humans to build skills, which involves mastering “System 2” cognition (slow, conscious, reasoning-driven) so that it becomes “System 1” (fast, unconscious, action-driven). Not just that, but an expert is able to switch between both modes of thought within the moment when it becomes necessary to rationally ponder some new circumstance that doesn’t match the automated mental template they’ve developed. Kahneman describes System 2 “slow thinking” as well-suited for probability-savvy correlation thinking, whereas System 1 “fast thinking” is geared to deterministic causal thinking.
Kahneman’s “System 2” cognition–slow, rule-centric, and attention-dependent–is well-suited for acceleration and automation on big data platforms such as IBM Watson. After all, a machine can process a huge knowledge corpus, myriad fixed rules, and complex statistical models far faster than any mortal. Just as important, a big-data platform doesn’t have the limited attention span of a human; consequently, it can handle many tasks concurrently without losing its train of thought.
Also, Kahneman’s “System 1” cognition–fast, unconscious, action-driven–is not necessarily something we need to hand to computers alone. We can accelerate it by facilitating data-driven interactive visualization by human beings, at any level of expertise. When a big-data platform drives a self-service business intelligence application such as IBM Cognos, it can help users to accelerate their own “System 1” thinking by enabling them to visualize meaningful patterns in a flash without having to build statistical models, do fancy programming, or indulge in any other “System 2” thought.
And finally, based on those two insights, it’s clear to me that cognitive computing is not simply limited to the Watsons and other big-data platforms of the world. Any well-architected big data, advanced analytics, or business intelligence platform is essentially a cognitive-computing platform. To the extent it uses machines to accelerate the slow “System 2” cognition and/or provides self-service visualization tools to help people speed up their wetware’s “System 1” thinking, it’s a cognitive-computing platform.
Now I will expand upon the official IBM definition of “cognitive computing” to put it in a larger frame of reference. As far as I’m concerned, the core criterion of cognitive computing is whether the system, however architected, has the net effect of speeding up any form of cognition, executing on hardware and/or wetware.
Niaz: How is Big Data Analytics changing the nature of building great products? What do you think about the future of products?
James: That’s a great question, and one I haven’t explored to any great extent. My sense is that more “products” are in fact “services”–such as online media, entertainment, and gaming–that, as an integral capability, feed on the Big Data generated by their users. Companies tune the designs, interaction models, and user experiences of these productized services through Big Data analytics. To the extent that users respond or don’t respond to particular features of these services, that will be revealed in the data and will trigger continuous adjustments in product/service design. New features might be added on a probationary basis, to see how users respond, and just as quickly withdrawn or ramped up in importance.
This new product development/refinement loop is often referred to as “real-world experiments.” The process of continuous, iterative, incremental experimentation both generates and depends on a steady feed of Big Data. It also requires data scientists to play a key role in the product-refinement cycle, in partnership with traditional product designers and engineers. Leading-edge organizations have begun to emphasize real-world experiments as a fundamental best practice within their data-science, next-best-action, and process-optimization initiatives.
Essentially, real-world experiments put the data-science “laboratory” at the heart of the big data economy. Under this approach, fine-tuning of everything–business model, processes, products, and experiences–becomes a never-ending series of practical experiments. Data scientists evolve into an operational function, running their experiments–often known as “A/B tests”–24×7 with the full support and encouragement of senior business executives.
The beauty of real-world experiments is that you can continuously and surreptitiously test diverse product models inline to your running business. Your data scientists can compare results across differentially controlled scenarios in a systematic, scientific manner. They can use the results of these in-production experiments–such as improvements in response, acceptance, satisfaction, and defect rates on existing products/services–to determine which work best with various customers under various circumstances.
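The comparison step in such an A/B test can be sketched with a standard two-proportion z-test. This is my own illustration of the statistical mechanics, not a tool from the interview; the variant names and counts are invented:

```python
# Sketch of comparing two in-production experiment variants: did variant A's
# response rate differ significantly from variant B's?
from math import sqrt

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two observed response rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical results: variant A drew 540 responses from 4,000 impressions,
# variant B drew 460 from 4,000.
z = two_proportion_z(540, 4000, 460, 4000)
print(round(z, 2))  # |z| > 1.96 suggests significance at the 5% level
```

Running this kind of comparison 24×7 across many controlled scenarios is what turns the data-science “laboratory” into an operational function.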
Niaz: What is a big data product? How can someone make beautiful stuff with data?
James: What is a Big Data product? It’s any product or service that helps people to extract deep value from advanced analytics and trustworthy data at all scales, but especially at the extreme scales of volume (petabytes and beyond), velocity (continuous, streaming, real-time, low-latency), and/or variety (structured, semi-structured, unstructured, streaming, etc.). That definition encompasses products that provide the underlying data storage, database management, algorithms, metadata, modeling, visualization, integration, governance, security, management, and other necessary features to address these use cases. If you track back to my answer above about “hybrid” architectures, you’ll see a discussion of some of the core technologies.
Making “beautiful stuff with data”? That suggests advanced visualization to call out the key insights in the data. The best data visualizations provide functional beauty: they make the process of sifting through data easier, more pleasant, and more productive for end users, business analysts, and data scientists.
Niaz: Can you please tell us about building a data-driven culture that fosters data-driven innovation to build the next big product?
James: A key element of any data-driven culture is establishing a data science center of excellence. Data scientists are the core developers in this new era of Big Data, advanced analytics, and cognitive computing.
Game-changing analytics applications don’t spring spontaneously from bare earth. You must plant the seeds through continuing investments in applied data science and, of course, in the big data analytics platforms and tools that bring it all to fruition. But you’ll be tilling infertile soil if you don’t invest in sustaining a data science center of excellence within your company. Applied data science is all about putting the people who drill the data in constant touch with those who understand the applications. In spite of the mythology surrounding geniuses who produce brilliance in splendid isolation, smart people really do need each other. Mutual stimulation and support are critical to the creative process, and science, in any form, is a restlessly creative exercise.
In establishing a center of excellence, you may go the formal or informal route. The formal approach is to institute an ongoing process for data-science collaboration, education, and information sharing. As such, the core function of your center of excellence might be to bridge heretofore siloed data-science disciplines that need to engage more effectively. The informal path is to encourage data scientists to engage with each other using whatever established collaboration tools, communities, and confabs your enterprise already has in place. This is the model under which centers of excellence coalesce organically from ongoing conversations.
Creeping polarization, like general apathy, will kill your data science center of excellence if you don’t watch out. Don’t let the center of excellence, formal or informal, degenerate into warring camps of analytics professionals trying to hard-sell their pet approaches as the one true religion. Centers of excellence must serve as a bridge, not a barrier, for communication, collegiality, and productivity in applied data science.
Niaz: As you know, leaders and managers have always been challenged to get the right information to make good decisions. Now, with the digital revolution and technological advancement, they have the opportunity to access huge amounts of data. How will this trend change management practice? What do you think about the future of decision making, strategy and running organizations?
James: Business agility is paramount in a turbulent world. Big Data is changing the way that management responds to–and gets ahead of–changes in their markets, competitive landscape, and operational conditions.
Increasingly, I prefer to think of big data in the broader context of business agility. What’s most important is that your data platform has the agility to operate cost-effectively at any scale, speed, and scope of business that your circumstances demand.
In terms of scale of business, organizations operate at every scale from breathtakingly global to intensely personal. You should be able to acquire a low-volume data platform and modularly scale it out to any storage, processing, memory and I/O capacity you may need in the future. Your platform should elastically scale up and down as requirements oscillate. Your end-to-end infrastructure should also be able to incorporate platforms of diverse scales—petabyte, terabyte, gigabyte, etc.—with those platforms specialized to particular functions and all of them interoperating in a common fabric.
Where speed is concerned, businesses often have to keep pace with stop-and-start rhythms that oscillate between lightning fast and painfully slow. You should be able to acquire a low-velocity data platform and modularly accelerate it through incorporation of faster software, faster processors, faster disks, faster cache and more DRAM as your need for speed grows. You should be able to integrate your data platform with a stream computing platform for true real-time ingest, processing and delivery. And your platform should also support concurrent processing of diverse latencies, from batch to streaming, within a common fabric.
And on the matter of scope, businesses manage almost every type of human need, interaction and institution. You should be able to acquire a low-variety data platform—perhaps a RDBMS dedicated to marketing—and be able to evolve it as needs emerge into a multifunctional system of record supporting all business functions. Your data platform should have the agility to enable speedy inclusion of a growing variety of data types from diverse sources. It should have the flexibility to handle structured and unstructured data, as well as events, images, video, audio and streaming media with equal agility. It should be able to process the full range of data management, analytics and content management workloads. It should serve the full scope of users, devices and downstream applications.
Agile Big Data platforms can serve as the common foundation for all of your data requirements. Because, after all, you shouldn’t have to go big, fast, or all-embracing in your data platforms until you’re good and ready.
Niaz: In your opinion, given the current available Big Data technologies, what is the most difficult challenge in filtering big data to find useful information?
James: The most difficult challenge is in figuring out which data to ignore, and which data is trustworthy enough to serve as a basis for downstream decision-support and advanced analytics.
Most important, don’t always trust the “customer sentiment” that your social-media listening tools report, as if it were gospel. Yes, you care deeply about how your customers regard your company, your products, and your quality of service. You may be listening to social media to track how your customers—collectively and individually—are voicing their feelings. But do you bother to save and scrutinize every last tweet, Facebook status update, and other social utterance from each of your customers? And if you are somehow storing and analyzing that data—which is highly unlikely—are you linking the relevant bits of stored sentiment data to each customer’s official record in your databases?
If you are, you may be the only organization on the face of the earth that makes the effort. Many organizations implement tight governance only on those official systems of record on which business operations critically depend, such as customers, finances, employees, products, and so forth. For those data domains, data management organizations that are optimally run have stewards with operational responsibility for data quality, master data management, and information lifecycle management.
However, for many big data sources that have emerged recently, such stewardship is neither standard practice nor necessarily warranted. These new subject-matter domains consist mainly of unstructured data that you may be processing in your Hadoop clusters, stream-computing environments, and other big data platforms: social, event, sensor, clickstream, geospatial, and so on.
The key difference from system-of-record data is that many of the new domains are disposable to varying degrees and are not regarded as a single version of the truth about some real-world entity. Instead, data scientists and machine learning algorithms typically distill the unstructured feeds for patterns and subsequently discard the acquired source data, which quickly become too voluminous to retain cost-effectively anyway. Consequently, you probably won’t need to apply much, if any, governance and security to many of the recent sources.
Where social data is concerned, there are several reasons for going easy on data quality and governance. First of all, data quality requirements stem from the need for an officially sanctioned single version of the truth. But the notion that any individual social media message constitutes the truth of how a specific customer or prospect feels about you is highly implausible. After all, people prevaricate, mislead, and exaggerate in every possible social context, and not surprisingly they convey the same equivocation in their tweets and other social media remarks. If you imagine that the social streams you’re filtering are rich founts of only honest sentiment, you’re unfortunately mistaken.
Second, social sentiment data rarely has the definitive, authoritative quality of an attribute—name, address, phone number—that you would include in or link to a customer record. In other words, few customers declare their feelings about brands and products in the form of tweets or Facebook updates that represent their semiofficial opinion on the topic. Even when people are bluntly voicing their opinions, the clarity of their statements is often hedged by the limitations of most natural human language. Every one of us, no matter how well educated, speaks in sentences that are full of ambiguity, vagueness, situational context, sarcasm, elliptical speech, and other linguistic complexities that may obscure the full truth of what we’re trying to say. Even highly powerful computational linguistic algorithms are challenged when wrestling these and other peculiarities down to crisp semantics.
Third, even if every tweet was the gospel truth about how a customer is feeling and all customers were amazingly articulate on all occasions, the quality of social sentiment usually emerges from the aggregate. In other words, the quality of social data lies in the usefulness of the correlations, trends, and other patterns you derive from it. Although individual data points can be of marginal value in isolation, they can be quite useful when pieced into a larger puzzle.
Consequently, there is little incremental business value from scrutinizing, retaining, and otherwise managing every single piece of social media data that you acquire. Typically, data scientists drill into it to distill key patterns, trends, and root causes, and you would probably purge most of it once it has served its core tactical purpose. This process generally takes a fair amount of mining, slicing, and dicing. Many social-listening tools, including the IBM® Cognos® Consumer Insight application, are geared to assessing and visualizing the trends, outliers, and other patterns in social sentiment. You don’t need to retain every single thing that your customers put on social media to extract the core intelligence that you seek, as in the following questions: Do they like us? How intensely? Is their positive sentiment improving over time? In fact, doing so might be regarded as encroaching on privacy, so purging most of that data once you’ve gleaned the broader patterns is advised.
Fourth, even outright customer lies propagated through social media can be valuable intelligence if we vet and analyze them effectively. After all, it’s useful knowing whether people’s words—“we love your product”—match their intentions—“we have absolutely no plans to ever buy your product”—as revealed through their eventual behavior—for example, buying your competitor’s product instead.
If we stay hip to this quirk of human nature, we can apply the appropriate predictive weights to behavioral models that rely heavily on verbal evidence, such as tweets, logs of interactions with call-center agents, and responses to satisfaction surveys. I like to think of these weights as a truthiness metric, courtesy of Stephen Colbert.
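The truthiness idea — scaling each channel of verbal evidence by how well its words have historically matched eventual behavior — can be sketched as a simple weighted average. The channel names, sentiment scores, and weights below are illustrative assumptions, not any actual IBM model:

```python
# Each channel contributes (stated_sentiment, truthiness_weight),
# where the weight reflects how often that channel's words have
# historically predicted actual buying behavior. All values here
# are hypothetical, chosen only to illustrate the weighting.
evidence = {
    "tweets":              (0.9, 0.3),  # enthusiastic, but rarely predictive
    "call_center_logs":    (-0.4, 0.7), # complaints track behavior better
    "satisfaction_survey": (0.6, 0.5),
}

weighted_sum = sum(s * w for s, w in evidence.values())
total_weight = sum(w for _, w in evidence.values())
adjusted_sentiment = weighted_sum / total_weight
print(round(adjusted_sentiment, 2))
```

Note how the glowing tweets are largely discounted: the truthiness-weighted score lands well below the raw tweet sentiment, closer to what the more behaviorally reliable channels suggest.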
What we can learn from social sentiment data of dubious quality is the situational contexts in which some customer segments are likely to be telling the truth about their deep intentions. We can also identify the channels in which they prefer to reveal those truths. This process helps determine which sources of customer sentiment data to prioritize and which to ignore in various application contexts.
Last but not least, apply strong governance only to data that has a material impact on how you engage with customers, remembering that social data rarely meets that criterion. Customer records contain the key that determines how you target pitches to them, how you bill them, where you ship their purchases, and so forth. For these purposes, the accuracy, currency, and completeness of customers’ names, addresses, billing information, and other profile data are far more important than what they tweeted about the salesclerk in your Poughkeepsie branch last Tuesday. If you screw up the customer records, the adverse consequences for all concerned are far worse than if you misconstrue their sentiment about your new product as slightly positive, when in fact it’s deeply negative.
However, if you greatly misinterpret an aggregated pattern of customer sentiment, the business risks can be considerable. Customers’ aggregate social data helps you compile a comprehensive portrait of the behavioral tendencies and predispositions of various population segments. This compilation is essential market research that helps gauge whether many high-stakes business initiatives are likely to succeed. For example, you don’t want to invest in an expensive promotional campaign if your target demographic isn’t likely to back up their half-hearted statement that your new product is “interesting” by whipping out their wallets at the point of sale.
The extent to which you can speak about the quality of social sentiment data all comes down to relevance. Sentiment data is good only if it is relevant to some business initiative, such as marketing campaign planning or brand monitoring. It is also useful only if it gives you an acceptable picture of how customers are feeling and how they might behave under various future scenarios. Relevance means having sufficient customer sentiment intelligence, in spite of underlying data quality issues, to support whatever business challenge confronts you.
Niaz: How do you see data science evolving in the near future?
James: In the near future, many business analysts will enroll in data science training curricula to beef up their statistical analysis and modeling skills in order to stay relevant in this new age.
However, they will confront a formidable learning curve. To be an effective, well-rounded data scientist, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.
Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.
But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.
Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation: paradigms and practices, algorithms and modeling, tools and platforms, and applications and outcomes.
Classroom instruction is important, but a data-science curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect actual hands-on development of statistical models using real data to address substantive business issues.
A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.
Niaz: We have already seen the huge implication and remarkable results of Big Data from tech giants. Do you think Big Data can also have great role in solving social problems? Can we measure and connect all of our big and important social problems and design the sustainable solutions with the help of Big Data?
James: Of course. Big Data is already being used worldwide to address the most pressing problems confronting humanity on this planet. In terms of “measuring and connecting all our big and important social problems and designing sustainable solutions,” that’s a matter for collective human ingenuity. Big Data is a tool, not a panacea.
Niaz: Can you please tell us about ‘Open Source Analytics’ for Big Data? What are the initiatives regarding open source that IBM’s Big Data group and others group (startups) have done or are planning?
James: The principal open-source communities in the big data analytics industry are Apache Hadoop and R. IBM is an avid participant in both communities and has incorporated these technologies into our solution portfolio.
Niaz: What are some of the concerns (privacy, security, regulation) that you think can dampen the promise of Big Data?
James: You’ve named three of them. Overall, businesses should embrace the concept of “privacy by design” – a systematic approach that takes privacy into account from the start – instead of trying to add protection after the fact. In addition, the sheer complexity of these technologies and their steep learning curves are barriers to realizing their full promise. All of these factors introduce time, cost, and risk into the Big Data ROI equation.
Niaz: What are the new technologies you are mostly passionate about? What are going to be the next big things?
James: Where to start? I prefer that your readers follow my IBM Big Data Hub blog to see the latest things I’m passionate about.
Niaz: Last but not least, what is your advice for Big Data startups and for people who are working with Big Data?
James: Find your niche in the Big Data analytics industry ecosystem, go deep, and deliver innovation. It’s a big, growing, exciting industry. Brace yourself for constant change. Be prepared to learn (and unlearn) something new every day.
Niaz: Dear James, thank you very much for your invaluable time and for sharing with us your incredible ideas, insights, knowledge and experiences. We wish you the very best of luck in all of your upcoming endeavors.
* * *
1. Viktor Mayer-Schönberger on Big Data Revolution
2. Gerd Leonhard on Big Data and the Future of Media, Marketing and Technology
3. Ely Kahn on Big Data, Startup and Entrepreneurship
4. Brian Keegan on Big Data
5. danah boyd on Future of Technology and Social Media
6. Irving Wladawsky-Berger on Evolution of Technology and Innovation
7. Horace Dediu on Asymco, Apple and Future of Computing
8. James Allworth on Disruptive Innovation