A Whitepaper on
Big Data – A New Analytic Science
October 28, 2013
The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of “Big Data.”
Heterogeneity, scale, timeliness, complexity, and privacy problems with Big Data impede progress at all phases of the pipeline that can create value from data.
Today the term big data draws a lot of attention, but for decades companies have been making business decisions based on transactional data stored in relational databases.
Much of today’s data is not natively in structured format. Beyond traditional transactional data lies a potential treasure trove of non-traditional, less structured data: weblogs, social media, email, sensor readings, and photographs that can be mined for useful information. Tweets and blogs are weakly structured pieces of text, while images and video are structured for storage and display but not for semantic content and search; transforming such content into a structured format for later analysis is a major challenge. Decreases in the cost of both storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago.
The value of data explodes when it can be linked with other data, thus data integration is a major creator of value. Since most data is directly generated in digital format today, we have the opportunity and the challenge both to influence the creation to facilitate later linkage and to automatically link previously created data. Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed. Finally, presentation of the results and its interpretation by non-technical domain experts is crucial to extracting actionable knowledge.
Need for Big Data
Today’s world moves in real time, and your business needs to be able to quickly analyze new data and act on it.
IDC’s most recent worldwide Big Data technology and services market forecast shows that the worldwide Big Data technology and services market will grow at a 31.7% compound annual growth rate (CAGR) – about seven times the rate of the overall information and communication technology (ICT) market – with revenues reaching $23.8 billion in 2016. The Big Data market is expanding rapidly as large IT companies and startups vie for customers and market share, providing technology buyers with more opportunities to use Big Data technology to improve operational efficiency and to drive innovation.
The intelligent economy produces a constant stream of data that is being monitored and analyzed. IDC estimates that in 2011, the amount of information created and replicated surpassed 1.8 ZB (1.8 trillion gigabytes). Social interactions, mobile devices, facilities, equipment, R&D, simulations, and physical infrastructure all contribute to the flow. In aggregate, this is what is called Big Data.
Structured / Unstructured Data:
Data management needs have evolved from traditional relational storage to both relational and non-relational storage, and a modern information management platform needs to support all types of data. To deliver insight on any data, you need a platform that provides a complete set of capabilities for managing relational, non-relational, and streaming data; that can seamlessly move data from one store to another; and that can monitor and manage all your data regardless of its type or structure – all without the application having to worry about scale, performance, security, and availability.
As you integrate more diverse data and the data sets become larger, the more you need an analytical solution to help you figure out what’s important. As companies become able to identify, combine, and manage multiple sources of data, they need the capability to build advanced analytics models for predicting and optimizing outcomes. They must also possess the muscle to transform the organization so that the data and models actually yield better decisions and help predict their customers’ behaviour.
Social Media Drivers
Companies are taking advantage of social media’s growing user base, using platforms such as Facebook, LinkedIn and Twitter to engage with customers directly. Other examples of online communities include consumers reviewing products at mobile app stores and third-party merchant websites. Firms such as AT&T, Carphone Warehouse, Domino’s, Procter & Gamble, Tesco and Unilever now regularly use a variety of these platforms to engage with their customers. Data from social media helps organizations undertake sentiment analysis on their consumers and better tailor their offerings.
The proliferation of mobile devices, particularly smartphones and tablets, makes it easier to use social media and other data-generating applications. Mobile devices also collect and transmit location data. There is a strong need to process the data generated by these different devices and turn it into actionable insight. Electronic devices of all sorts – including servers and other IT hardware, smart energy meters, and temperature sensors – create semi-structured log data that records every action.
Challenges of Big Data
Acquiring a “Big Data” solution
Almost every vendor would be happy to provide a “Big Data” solution to you, whether it is their own distribution of Hadoop or a full appliance that simply comes pre-installed with Hadoop. While this might be something you have decided to do, there is a big learning curve when your IT department needs to re-orient itself around HDFS, MapReduce, Hive, HBase, and so on rather than T-SQL and a standard RDBMS design. It will require significant re-training around Hadoop and its ecosystem, as well as a major effort to integrate the Hadoop implementation with the data warehouse.
High costs – buying a brand-new tier-one hardware appliance
One option is to purchase one of the tier-one hardware appliances you’ve heard about from several hardware vendors. The problem with this approach is that these appliances often come with a high price tag – on average around a million dollars – which just isn’t something you want to invest in at this moment. Further, many have a disjointed “Big Data” story that requires a separate environment unintegrated with the data warehouse.
Companies might need to add to their big data skill set by hiring data scientists, mathematicians, and information architects. Big Data deployments also require new IT administration and application developer skill sets, and people with these skills are likely to be in short supply for quite some time. You may be able to retrain some existing team members, but once you do, they will be highly sought after by competitors and Big Data solution providers.
Making sense of the explosion of data
Organizations need the right tools to make sense of the overwhelming amount of data generated by declining hardware costs and complex data sources.
Understanding a wider variety of data
Organizations need to analyze both relational and non-relational data. Over 85 percent of data captured is unstructured.
Enabling real-time analysis of data
New data sources—such as social media sites like Twitter, Facebook, and LinkedIn—are producing unprecedented volumes of data in real time, which cannot be analyzed effectively with simple batch processing. A typical analyst spends too much time searching for the right data across thousands of sources, which adversely impacts productivity.
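The difference between batch and streaming analysis can be pictured with a toy sketch in Python: rather than waiting for a complete batch of events, a streaming consumer updates its aggregate per event, so an answer is available at any moment. The function name and stream contents below are illustrative, not part of any particular product.

```python
# Minimal sketch of streaming (incremental) aggregation, as opposed to
# batch processing: counts are updated per event, so an answer is
# available at any moment rather than after the whole batch arrives.
from collections import Counter

def stream_counts(events):
    """Consume an event stream and yield the running top term after each event."""
    counts = Counter()
    for term in events:
        counts[term] += 1
        yield counts.most_common(1)[0]  # (term, count) seen most often so far

# Hypothetical stream of social-media topic mentions
stream = ["sale", "outage", "sale", "launch", "sale"]
for top in stream_counts(stream):
    print(top)
# the final line printed is ('sale', 3)
```

A batch job would only produce the final count after the stream ends; the streaming version has a usable (if provisional) answer after every event.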
Achieving simplified deployment and management
Organizations need a simpler, more streamlined deployment and setup experience. Ideally they would prefer fewer installation files that package the required Hadoop-related projects, instead of having to assemble the projects themselves.
Internal IT Challenges
Big Data also poses a number of internal IT challenges. Big Data buildouts can disrupt current datacenter transformation plans. Using Big Data pools in the cloud may help many companies overcome this challenge.
Introduction to Big Data
We are awash in a flood of data today. In a broad range of application areas, data is being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences.
Defining Big Data
Big data typically refers to the following types of data:
Traditional enterprise data – includes customer information from CRM systems, transactional ERP data, web store transactions, and general ledger data.
Machine-generated /sensor data – includes Call Detail Records (“CDR”), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust), and trading systems data.
Social data – includes customer feedback streams, micro-blogging sites like Twitter, and social media platforms like Facebook.
But big data has changed dramatically. The evolution of the Web has redefined:
- The speed at which information flows into these primary online systems
- The number of customers a company must deal with
- The acceptable interval between the time that data first enters a system, and its transformation into information that can be analyzed to make key business decisions
- The kind of data that needs to be handled and tracked
The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020. But while it’s often the most visible parameter, volume of data is not the only characteristic that matters. In fact, there are four key characteristics that define big data:
Volume— Machine-generated data is produced in much larger quantities than non-traditional data. For instance, a single jet engine can generate 10TB of data in 30 minutes. With more than 25,000 airline flights per day, the daily volume of just this single data source runs into the Petabytes. Smart meters and heavy industrial equipment like oil refineries and drilling rigs generate similar data volumes, compounding the problem.
Driven by new data sources such as RFID, the web, and social media.
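As a sanity check on the jet-engine figures above, simple arithmetic (treating each flight as a single 30-minute engine recording, which is a deliberate oversimplification) already lands in the hundreds of petabytes per day:

```python
# Rough back-of-the-envelope check of the jet-engine example:
# 10 TB per engine per 30 minutes, ~25,000 flights per day.
# Simplification: count one 30-minute recording per flight.
tb_per_recording = 10
flights_per_day = 25_000

daily_tb = tb_per_recording * flights_per_day  # 250,000 TB
daily_pb = daily_tb / 1_000                    # 1 PB = 1,000 TB
print(f"{daily_pb:,.0f} PB per day")           # prints "250 PB per day"
```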
Velocity— Social media data streams – while not as massive as machine-generated data – produce a large influx of opinions and relationships valuable to customer relationship management. Even at 140 characters per tweet, the high velocity (or frequency) of Twitter data ensures large volumes (over 8 TB per day).
Fueled by real-time data capture from websites, ATMs, point-of-sale devices, and other sources.
Variety–Traditional data formats tend to be relatively well defined by a data schema and change slowly. In contrast, non-traditional data formats exhibit a dizzying rate of change. As new services are added, new sensors deployed, or new marketing campaigns executed, new data types are needed to capture the resultant information.
Spurred by a variety of information—text, blogs, videos, photos, names, addresses, purchase histories, and inventory.
Value—The economic value of different data varies significantly. Typically there is good information hidden amongst a larger body of non-traditional data; the challenge is identifying what is valuable and then transforming and extracting that data for analysis.
To make the most of big data, enterprises must evolve their IT infrastructures to handle these new high-volume, high-velocity, high-variety sources of data and integrate them with the pre-existing enterprise data to be analyzed.
So, What Is Big Data?
Big Data is about the growing challenge that organizations face as they deal with large and fast-growing sources of data or information that also present a complex range of analysis and use problems. These can include:
- Having a computing infrastructure that can ingest, validate, and analyze high volumes (size and/or rate) of data
- Assessing mixed data (structured and unstructured) from multiple sources
- Dealing with unpredictable content with no apparent schema or structure
- Enabling real-time or near-real-time collection, analysis, and answers
Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.
Big Data defined by top Research Experts
Gartner defines big data as high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. The volume, velocity, and variety of big data can be overwhelming to IT organizations and their leaders. Gartner predicts that by 2015, big data demand will reach 4.4 million IT jobs globally. While this provides many opportunities to collect, manage, and deploy data, a well-thought-out strategy is needed.
IDC defines Big Data technologies as a new generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis.
Measured in terms of volume, velocity, and variety, big data represents a major disruption in the business intelligence and data management landscape, upending fundamental notions about governance and IT delivery. With traditional solutions becoming too expensive to scale or adapt to rapidly evolving conditions, companies are scrambling to find affordable technologies that will help them store, process, and query all of their data.
The Importance of Big Data
When big data is distilled and analyzed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation – all of which can have a significant impact on the bottom line.
For example, in the delivery of healthcare services, management of chronic or long-term conditions is expensive. Use of in-home monitoring devices to measure vital signs and monitor progress is just one way that sensor data can be used to improve patient health and reduce both office visits and hospital admissions.
Big Data in Practice
Regardless of industry or sector, the ultimate value of Big Data implementations will be judged based on one or more of three criteria:
Does it provide more useful information? For example, a major retailer might implement a digital video system throughout its stores, not only to monitor theft but to implement a Big Data system to analyze the flow of shoppers — including demographic information such as gender and age — through the store at different times of the day, week, and year. It could also compare flows in different regions with core customer demographics. This makes it easier for the retailer to tune layouts and promotion spaces on a store-by-store basis.
Does it improve the fidelity of the information? For example, IDC spoke to several earth science and medical epidemiological research teams using Big Data systems to monitor and assess the quality of data being collected from remote sensor systems; they are using Big Data not just to look for patterns but to identify and eliminate false data caused by malfunctions, user error, or temporary environmental anomalies.
Does it improve the timeliness of the response? For example, several private and government healthcare agencies around the world are deploying Big Data systems to reduce the time to detect insurance fraud from months (after checks have been mailed and cashed) to days (eliminating the legal and financial costs associated with fund recovery).
Impact of Big Data on businesses
In the global marketplace, businesses, suppliers and customers are creating and consuming vast amounts of information. Gartner predicts that enterprise data in all forms will grow 650 percent over the next five years.
According to IDC, the world’s volume of data doubles every 18 months. This flood of data, often referred to as “information overload,” “data deluge” and “big data,” clearly creates a challenge for business leaders.
One in three executives is regularly unable to find the right people who can provide the information they need when they need it. And, according to a recent industry report, during the latest recession more than one-quarter of executives lost business because they couldn’t access the right information. It is clear that information overload is real and is causing problems for business leaders.
Deriving Business Value from Big Data
Many companies lack the basic measures to manage big data, but see huge potential benefits if they can learn to leverage it effectively. Forty-six percent of companies report that they have made an inaccurate business decision as a result of bad or outdated data. It is imperative that organizations address this filter failure in order to reduce detrimental business decisions and position themselves to react quickly to changing business conditions.
Big Data strategy can address the following key organization requirements:
Flexible data management layer that supports all data types—structured, semi-structured, and unstructured data at rest or in motion.
Enrichment layer for discovering, transforming, sharing, and governing data.
Compelling suite of BI tools to help users gain insight from analytics.
Deeper insights that combine an organization’s data with data and services from external sources.
Thus, Big Data can help companies to extract usable insight from data sources to help them make better business decisions.
How Big Data Works
A Big Data solution gives you the power to manage virtually any data, regardless of size or location; add value to your data by enriching it with external input; and enable anyone in your organization to easily glean insight from your data so they can make smarter decisions.
Benefits related to gathering the data. A Big Data solution enables you to gather data of any type, whether structured, semi-structured, or completely unstructured. You can easily bring such data into one system and start processing it. You can gather data of any size, ranging from a few gigabytes to petabytes, and you can also gather data in real time, i.e. streaming data.
Benefits related to processing the data. A Big Data solution enables you to unlock the value hidden behind lots of data. This means moving beyond the data that is easy to manage and taking away the fear of handling, massaging, and playing with unstructured data. It’s about opening up new sources of data from within your business that you haven’t tackled before, such as tweets, and uncovering non-intuitive relationships in data that you may not have thought of before. By thinking globally and looking externally, you can gain a richer picture of the factors affecting your business. This is about taking your data culture to the next level by obtaining publicly available data and mashing it up with your internal data.
Benefits related to analytics over the processed data. Big Data enables you to gain faster insights from the processed data. It enables you to identify patterns that predict future opportunities. You can move from responding to customer needs to anticipating those needs, potentially before customers are even aware of them, and in so doing delight your customers in ways they never knew they wanted. The analytics tools allow you to analyse the processed data and make more informed decisions.
Big Data Technologies
There is a growing number of technologies used to aggregate, manipulate, manage, and analyze big data. We have detailed some of the more prominent ones below, but this list is not exhaustive, especially as more technologies continue to be developed to support big data techniques.
Hadoop – An open source software framework for processing huge datasets on certain kinds of problems on a distributed system. Its development was inspired by Google’s MapReduce and Google File System papers. It was originally developed at Yahoo! and is now managed as a project of the Apache Software Foundation. Its responsibilities include chunking up the input data, sending the chunks to each machine, running code on each chunk, checking that the code ran, passing any results on to further processing stages or to the final output location, performing the sort that occurs between the map and reduce stages, sending each chunk of sorted data to the right machine, and writing debugging information on each job’s progress, among other things.
Hadoop Distributed File System (HDFS) – Scalable to thousands of nodes. Its design assumes that failures (hardware and software) are common, and it is targeted at small numbers of very large files. Files are written once and read many times.
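HDFS’s design can be pictured with a toy sketch: files are split into fixed-size blocks, and each block is replicated across several nodes so that common failures do not lose data. Real HDFS uses 64 MB or 128 MB blocks; the 4-byte blocks and node names below are only to keep the demo readable.

```python
# Toy illustration of HDFS-style block storage: a file is split into
# fixed-size blocks, and each block is assigned to several nodes
# (replication) so that hardware failures do not lose data.
def split_into_blocks(data: bytes, block_size: int):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(blocks, nodes, replication=3):
    """Round-robin each block onto `replication` distinct nodes."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data", block_size=4)
print(blocks)  # [b'hell', b'o bi', b'g da', b'ta']
placement = assign_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(placement[0])  # ['node1', 'node2', 'node3']
```

Losing any single node still leaves two copies of every block, which is the property the real system relies on.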
HBase – An open source, distributed, non-relational database modeled on Google’s BigTable. It was originally developed by Powerset and is now managed as a project of the Apache Software Foundation as part of Hadoop.
MapReduce – A software framework introduced by Google for processing huge datasets on certain kinds of problems on a distributed system, also implemented in Hadoop as the programming framework for analyzing data sets stored in HDFS. MapReduce is a programming paradigm that distributes a task across multiple nodes. The “map” function takes the problem, splits it into sub-parts, and sends them to different machines so that all the sub-parts can run concurrently. The results from the parallel map functions are collected and distributed to a set of servers running “reduce” functions, which take the results from the sub-parts and recombine them into a single answer. MapReduce jobs are composed of user-supplied Map and Reduce functions.
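The map/shuffle/reduce flow described above can be sketched in plain Python with the classic word-count job. The function names are ours, and a real Hadoop job would distribute each phase across machines rather than run it in one process.

```python
# Word count expressed in MapReduce style: map emits (key, value) pairs,
# the shuffle groups them by key, and reduce combines each group.
from itertools import groupby

def map_fn(line):
    # Map phase: each input line yields (word, 1) pairs.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all counts for one word.
    return (word, sum(counts))

def run_mapreduce(lines):
    # Map: every input line produces zero or more (word, 1) pairs.
    pairs = [pair for line in lines for pair in map_fn(line)]
    # Shuffle/sort: pairs with the same key are brought together.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce: each key's values are combined into one result.
    return [reduce_fn(k, (v for _, v in grp))
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]

print(run_mapreduce(["big data is big", "data is data"]))
# [('big', 2), ('data', 3), ('is', 2)]
```

The sort between the map and reduce steps is exactly the shuffle stage that the Hadoop framework performs across machines on the user’s behalf.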
Hive – A data warehouse system that stores its tables in HDFS and uses MapReduce as its execution engine. Hive offers the ability to plug in custom code for situations that don’t fit into SQL, as well as many tools for handling input and output. To use it, you define structured tables that describe the input and output, issue load commands to ingest files, and then write queries as you would for any other relational database.
Sqoop – A package for moving data between HDFS and relational database systems. It offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS. This process is called ETL, for Extract, Transform, and Load.
While getting data into Hadoop is critical for processing with MapReduce, it is also critical to get data out of Hadoop and into an external data store for use in other kinds of applications. Sqoop can do this as well.
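The extract-transform-load cycle that Sqoop automates can be illustrated with a small stand-alone sketch, using an in-memory SQLite table in place of the enterprise RDBMS and an in-memory text buffer in place of HDFS. The table and column names are invented for the example.

```python
# Stand-in for a Sqoop-style import: extract rows from a relational
# database, transform them into delimited text that Hadoop jobs can
# consume, and load the result into a file (here a StringIO buffer
# standing in for HDFS).
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "alice", 19.99), (2, "bob", 5.00)])

# Extract: pull the rows out of the RDBMS.
rows = conn.execute("SELECT id, customer, total FROM orders").fetchall()

# Transform + load: write comma-delimited records, one per line --
# the flat layout MapReduce jobs typically expect.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

A real Sqoop job parallelizes the extract across mappers and writes the delimited files directly into HDFS; the shape of the data movement is the same.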
Pig – The Apache Pig project provides a procedural data processing language designed for Hadoop. In contrast to Hive’s approach of writing declarative queries, Pig lets you specify a series of steps to perform on the data. It is closer to an everyday scripting language, but with a specialized set of functions that help with common data processing problems. Frequently used operations, such as filters and joins, are also supported in Pig.
Microsoft Excel and Analysis Services
With the PowerPivot add-in, Excel can manipulate over 100 million rows of data in memory for analysis. PowerPivot is an Excel 2013 add-in (included in Office Professional Plus) that can be used to perform powerful data analysis and create sophisticated data models. PowerPivot enables an Excel user to mash up large volumes of data from various sources, perform information analysis rapidly, and share insights easily. In both Excel and PowerPivot, a Data Model – a collection of tables with relationships – can be created.
Big Data Business Scenarios
Ecommerce Merchants Use Big Data
Ecommerce merchants require Big Data to analyze and manage a massive volume of both structured and unstructured data to gain a significant competitive advantage. The “structured” portion of Big Data refers to fixed fields within a database; this could be customer data – address, zip code – stored in a shopping cart. The “unstructured” part encompasses email, video, tweets, and Facebook Likes. None of the unstructured data resides in a fixed database that’s accessible to merchants, but feedback from social media is a very useful tool for ecommerce businesses.
Ecommerce merchants can use Big Data to compare traffic to a particular product with the sales of that product. You would expect a correlation between web traffic and sales, but if you find a lot of web traffic and few sales, something is wrong. Big Data helps you check whether the product is competitively priced, has a compelling and informative presentation, offers an array of colors and sizes, and meets all the other conditions a customer requires to purchase it – whereas in the past, without Big Data analytics, the product might simply have been discarded due to low sales.
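The traffic-versus-sales analysis described above reduces to computing a conversion rate per product and flagging high-traffic, low-conversion items for review. The thresholds and product data below are illustrative, not from any real store.

```python
# Flag products that attract many visitors but convert few of them:
# high traffic with low sales usually signals a pricing, presentation,
# or assortment problem rather than a lack of demand.
def flag_underperformers(stats, min_traffic=1000, max_conversion=0.01):
    flagged = []
    for product, (views, purchases) in stats.items():
        conversion = purchases / views if views else 0.0
        if views >= min_traffic and conversion < max_conversion:
            flagged.append((product, conversion))
    return flagged

# Hypothetical traffic/sales figures per product: (page views, purchases)
stats = {
    "red-jacket":  (5000, 10),   # 0.2% conversion -> worth investigating
    "blue-jeans":  (4000, 200),  # 5% conversion -> healthy
    "green-scarf": (50, 0),      # too little traffic to judge
}
print(flag_underperformers(stats))  # [('red-jacket', 0.002)]
```

At Big Data scale the same logic runs over clickstream logs for millions of products, but the decision rule is no more complicated than this.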
Transportation planning
Big Data can help citizens get where they need to go and can be used for planning new infrastructure projects. It also helps evaluate multiple transportation modes, carriers, routes, and shipping strategies, and supports finding the lowest-cost combination for transportation optimization needs.
Social network awareness
The recent emergence of social network applications, the prevalence of sensor-equipped mobile devices, and the availability of large amounts of geo-referenced data have enabled the analysis of new context dimensions that involve individual, social, and urban context. Governments can determine trends in what citizens are saying or searching for on topics that could impact services.
Federal government agencies are using geographic information systems (GIS) and big data for many types of activities, such as real-time location analysis of high-volume, high-velocity streams of sensor data, fraud detection, and disease surveillance. New data sources can be used to find patterns that better identify different levels of fraud. For example, weather and disaster-relief data can be used to verify insurance claims related to an event.
Analyzing a Practical Scenario (Case Study)
Crime Analysis and Assessment
In this scenario, I have analyzed the crime rate in a few cities, categorized by small crimes such as bicycle theft and burglary, expanding on the information provided in the dataset. I used the Azure Marketplace data services to get a crime dataset for US cities. I then added Population Census data for US cities, which provides the population distribution and age range in the urban and rural areas of each city, and added a further filter (slicer) for education and employment data in each part of the city. The scenario concludes that people in particular areas of a city, in a specific age range, who have been unemployed for a long time are more involved in these types of crime, such as bicycle theft.
My initial goal is to see whether there are particular geographic areas in the city that have a high rate of such crimes. I analyzed the sex ratio, age ratio, and education ratio. One hypothesis is that people between particular ages are more likely to be involved in crime, so I wanted to compare the age data with the education and employment data to identify the trend.
In this case study, crime, employment, census, and state-list data was pulled from sources made available by the US government. The census data included a lot of unstructured data, which was processed with the big data technology HDInsight using MapReduce. The crime, employment, and state data were pulled into Excel using Power Query and processed in MS Excel. In this analysis, we want to analyze the crime ratio and capture which areas are more affected by crime.
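At its core, the combination step in this case study is a join of datasets on a shared city key, from which a per-capita crime rate can be derived. A minimal sketch with invented placeholder numbers (not the actual Azure Marketplace or census figures):

```python
# Sketch of the case study's join: crime counts and census figures are
# combined per city so that a crime rate per resident can be derived.
# All numbers are invented placeholders, not the real datasets.
crime = {"Seattle":  {"bicycle_theft": 320, "burglary": 180},
         "Portland": {"bicycle_theft": 410, "burglary": 90}}
census = {"Seattle":  {"population": 650_000},
          "Portland": {"population": 600_000}}

def crime_rates(crime, census, offense):
    """Offenses per 100,000 residents, for cities present in both datasets."""
    rates = {}
    for city in crime.keys() & census.keys():   # join on the shared key
        per_capita = crime[city][offense] / census[city]["population"]
        rates[city] = round(per_capita * 100_000, 1)
    return rates

print(crime_rates(crime, census, "bicycle_theft"))
```

Slicing the same joined data by age range or employment status, as described above, just adds further keys to the join.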
Excel 2013 Visualizations
In this scenario, we will first make sure the “Hive for Excel”, “Microsoft Office PowerPivot for Excel 2013”, and “Power View” add-ins are enabled.
I then used Power View to create some analytical visualizations, plotting crime in various cities using the Azure Marketplace dataset. This dataset also contains census data, population data, and the sex ratio of each city.
Big Data Competing Technologies
Oracle’s Big Data Solution
Oracle is the first vendor to offer a complete and integrated solution to address the full spectrum of enterprise big data requirements. Oracle’s big data strategy is centered on the idea that you can extend your current enterprise information architecture to incorporate big data. New big data technologies, such as Hadoop and Oracle NoSQL database, run alongside your Oracle data warehouse to deliver business value and address your big data requirements.
IBM’s Big Data Solutions
IBM is unique in developing an enterprise-class big data platform that allows organizations to address the full spectrum of big data business challenges.
IBM is the only vendor with so broad and balanced a view of big data and the needs of a platform – and the benefit is pre-integration of its components, reducing your implementation time and cost.
The key platform capabilities include:
Analytic Solutions: Comprehensive business analytics to deliver information-based insights into every process, decision, and action
Visualization and Discovery: Discover, understand, search, and navigate federated sources of big data while leaving that data in place.
Hadoop-based analytics: Store any data type in the low-cost, scalable Hadoop engine to lower the cost of processing and analyzing massive volumes of data.
Stream Computing: Continuously analyze massive volumes of streaming data with sub-millisecond response times for real-time action.
Data Warehousing: Store and analyze large volumes of structured information with workload-optimized systems designed for deep and operational analytics.
Text Analytics: Analyze textual content to uncover hidden meaning and insight in unstructured information.
Information Integration and Governance: Integrate, protect, cleanse, govern, and deliver your trusted information.
The platform blends traditional technologies that are well suited for structured, repeatable tasks together with complementary new technologies that address speed and flexibility and are ideal for ad hoc data exploration, discovery and unstructured analysis.
Big data isn’t just hype – and it’s much more than a buzz phrase. Today, companies across industries are finding they not only need to manage increasingly large data volumes in their real-time systems, but also analyze that information so they can make the right decisions – fast – to compete effectively in the market. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation.
Big data is changing the way companies of all sizes, in all industries, go about their business.
When the Economist Intelligence Unit asked survey respondents to describe the impact data has had on their organization over the past five years, nearly 10% said it had completely changed the way they do business. Forty-six percent of respondents said it had become an important factor that drives business decisions.
There is no reason to think these trends will not continue. Of course, big data will always be but one of the tools that companies use to inform decisions. But it is an increasingly critical part of that portfolio. And companies that fail to develop a competency around it are likely to be left behind. Fortunately, the science of extracting insight from data is constantly evolving. Tools are more readily available as industries begin to invest in the technology that supports big data. And as the competency levels of firms continue to move along the big data continuum, increasing value will be realized.