I remember hearing the term ‘Big Data’ for the first time some 2 years ago. I was at Capgemini, talking with our colleagues from Oracle. “Big Data! What is this Big Data? Haven’t we had enough with ‘Cloud’ and everyone wanting to be a ‘Social-Media expert’?” I think that in 2012 not a week went by in the life of a CxO without this term coming up in some conversation.
I’ll be honest: when I first heard the term, I thought it was a fad. ‘People like making things overly complex and fancy; this craze will die down. After all, how big is Big Data? Essentially you are trying to analyze a lot of data. How hard is that? There are various Business Intelligence tools in the market. Big Brother Google is there, indexing the entire Internet, and some people are working on web semantics. What Big Data are people raving about?’
Those well versed in Big Data may be rolling on the floor laughing at the above statement, or shaking their heads with pursed lips. For others, I hope the posts that follow will show you why.
This post is only an introduction to Big Data and some of the more frequently used terms you will hear when people speak on this topic.
So, how can we define Big Data? It is the ability to deal with large volumes of various types of data, accurately and in real time, to get deeper insights for better decision-making.
Leaving the terms Velocity and Variety out of a Big Data definition is almost criminal, but this was the shortest explanation I could think of that gives an overview without using jargon. Big Data has, in essence, 3 aspects – Volume of Data, Velocity of Data and Variety of Data. I could have been cruel and given you the 4 definitions which most people read, but since I ran the risk of having you hit that tiny “x” mark on the browser, I decided not to start with them.
More formal definitions come from McKinsey, IBM, SAS and Forrester, each with their own perspective on Big Data (I give Forrester the edge over its competitors not only because Mike Gualtieri and I follow each other on Twitter but because he goes one step further and makes it practical).
Now, let’s go through a few (umm, let’s keep it at 10) of the common terms used in this world of Big Data.
1) Volume – The ‘Big’ in Big Data perhaps starts with this aspect. Volume here means the size of data.
Remember bit, byte, KB, MB, GB etc.? Now we have gone to the petabyte (1,000 terabytes) and the exabyte (1,000,000 TB) and beyond. While there is no specific threshold at the moment (at least not to my knowledge) above which data counts as “Big Data”, it is usually spoken of in the range of petabytes, exabytes and above. Frankly, ‘Big Data’ volume will differ from industry to industry and company to company. We probably cannot even comprehend the amount of data companies like Facebook and Google are dealing with. In short, volume is size, and this is relative.
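To make the jump between these units a little more concrete, here is a small Python sketch. It assumes decimal units (each one 1,000 times the previous), which is how the figures above are stated; the function name and unit list are my own choices for illustration.

```python
# Decimal (SI) storage units, each 1,000 times the previous one.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB"]

def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value at 1 or more."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

TB = 1000 ** 4  # bytes in one terabyte

print(human_readable(1_000 * TB))      # 1 PB
print(human_readable(1_000_000 * TB))  # 1 EB
```

So a thousand 1 TB hard disks make up a single petabyte, and it takes a million of them to reach an exabyte.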
2) Velocity & Real-Time – Velocity is not just about large amounts of data, but about the rate at which this data comes into the system. Real-Time is the near-instantaneous processing of that data. This aspect makes Big Data different. No more waiting for months, weeks or days – analysis and interpretation of massive amounts of data is possible in a few hours, possibly minutes. Imagine what that means for understanding consumer behavior and needs, stock analysis and predictions etc. Velocity is data coming into the system at any possible interval, amount and speed. Real-Time is knowledge flowing out of the system as quickly as possible.
3) Variety – Variety refers to the different sources and types of data. Data is both Structured and Unstructured. Structured data is the data inside a relational database. Unstructured data is – yes, you are right – data not in that database. But this is not exactly clear, is it?
Let’s put it this way – until recently a database had a specific architecture (consisting of tables…). Data fed in the required format could easily be retrieved, accessed and viewed with some commands. Data in this format is structured data. But what happens when you have image files, e-mails, Word documents, voice messages, videos, ‘Re-Tweets’ on Twitter and ‘Likes’ on Facebook, PowerPoint presentations etc.? All of this is ‘stored’ and all of this holds information, but it cannot easily be placed in the rows and columns of traditional databases (Oracle, SQL Server etc.). This is unstructured data.
Big Data is about using both structured and unstructured information and making sense out of it. This is hugely significant, because companies no longer simply have to store ALL this data; they can actually use it and make sense out of it. This is in line with web semantics (but we won’t talk of that now).
4) SQL and NoSQL – Structured Query Language (SQL) is used for accessing a relational database. A relational database basically stores data in a particular format, in tables, from which it can be accessed later (Oracle RDBMS and SQL Server are examples). NoSQL doesn’t mean ‘No SQL’ – it is usually read as ‘Not Only SQL’. A NoSQL database is one that can handle unstructured data. It is NOT a relational database. Its architecture and way of functioning are quite different from those of relational databases, and it is designed for handling large and complex data (Cassandra and MongoDB are examples).
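A tiny Python sketch may help show the difference in spirit. The first half uses SQLite (which ships with Python) as the relational database. The second half uses a plain Python dictionary as a stand-in for a document-style NoSQL store; it is NOT how Cassandra or MongoDB are actually called, and all the names and sample records are made up for illustration.

```python
import json
import sqlite3

# Structured: every row must fit the columns declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers (id, name, city) VALUES (?, ?, ?)", (1, "Asha", "Pune"))
row = conn.execute("SELECT name, city FROM customers WHERE id = ?", (1,)).fetchone()
print(row)  # ('Asha', 'Pune')

# Document-style: each record carries its own shape, so fields can differ between records.
documents = {
    "tweet-1": {"type": "tweet", "text": "Big Data is everywhere", "retweets": 12},
    "mail-1": {"type": "email", "subject": "Q3 numbers", "attachments": ["q3.pptx"]},
}
tweets = [doc for doc in documents.values() if doc["type"] == "tweet"]
print(json.dumps(tweets))
```

Notice that the table forces every customer into the same three columns, while the tweet and the e-mail happily sit side by side with completely different fields.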
5) MapReduce – Developed by Google, it is a programming framework used to write programs that process large amounts of data (especially unstructured data) in parallel. The word “parallel” is significant. MapReduce has 2 functions – Map and Reduce.
- Map breaks all the data it gets into smaller parts, assigns key/value pairs to them and distributes them to various ‘computers’ for processing.
A key-value pair lets data be identified by a name. The key is the name that identifies the data, while the value is the data itself. Many pairs can share the same key.
- Reduce takes each set of values with the same key from all the ‘computers’ and combines them to give a single value.
The beauty is that you get a single result much quicker and don’t know/care how many computers were working on it. A big task is broken into several small tasks. The individual results are then combined to give a final solution without overloading any one system.
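The classic way to picture Map and Reduce is counting words. The Python sketch below runs everything on one machine, so it only imitates the idea; in a real MapReduce system the chunks would be handled by different computers and the grouping step in the middle would be done by the framework. The function names are my own.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all values that share a key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single value."""
    return key, sum(values)

chunks = ["big data is big", "data is everywhere"]  # each chunk could go to a different machine
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Here the word is the key and the number 1 is the value; Reduce simply adds up all the 1s that share the same word.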
6) Hadoop – It’s an open-source, Java-based programming framework from the Apache Software Foundation, based on the MapReduce model. Google’s own MapReduce implementation is proprietary. Essentially how it works is this – you buy several servers and run the Hadoop software on each of them. Hadoop breaks up all the data and sends it to different servers. Hadoop keeps track of the data (this is the Hadoop Distributed File System – HDFS) and keeps multiple copies of it as well. This ensures that even if a node/server fails, there is another copy and processing can resume. Distributing the computation makes the most efficient use of the processing power, and by using MapReduce you get 1 final result.
By itself, Hadoop is just a framework. It has dependencies and needs other complementary services.
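To see why keeping several copies of the data on different servers protects against failures, here is a toy Python simulation. It is only a sketch of the idea and not Hadoop code: the round-robin placement is my own simplification (real HDFS decides placement differently), the node and block names are invented, and the choice of 3 copies is an assumption for this example.

```python
def place_blocks(blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes, round-robin.

    Assumes replicas <= len(nodes), otherwise copies would land on the same node twice.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one copy on a live node."""
    return [b for b, holders in placement.items() if any(n not in failed for n in holders)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = ["block-0", "block-1", "block-2", "block-3", "block-4"]
placement = place_blocks(blocks, nodes)

# Lose two servers: every block still has a surviving copy somewhere.
survivors = readable_blocks(placement, failed={"node-a", "node-b"})
print(survivors == blocks)  # True
```

With 3 copies of every block, any 2 servers can go down and all the data is still reachable; only when a third server fails does a block start to go missing.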
7) Business Intelligence / Business Analytics – Some might disagree with my clubbing these terms but frankly I think we’re losing ourselves in terminology. To me, in essence, they are the same. Use of applications or tools on data, to analyze it and improve decision-making by making sense of all the data is BI / BA.
8) Cloud – I hope this earlier post of mine on the Cloud will help you. The cloud has been a big driver for Big Data adoption. With increased digitization and the rise of unstructured data, Big Data is increasingly becoming mainstream.
9) Visualization – The graphical representation of the analysis. This area is getting more notice. With large amounts of data being analyzed, C-suite members need a simpler representation of the result. The different relations between numbers and topics can be represented graphically, and visualization is key for this.
10) In-memory – In-memory means that data is stored in the computer’s memory (example – RAM, Random Access Memory) rather than outside of it (example – on hard disks) to make access to data faster, thereby improving the speed of any instruction/process and hence giving a quicker final result.
SAP HANA is one name you will almost certainly come across in any discussion around real-time and in-memory databases. HANA is a combined software/hardware in-memory computing platform designed for high-volume transactions and real-time analytics.
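The general in-memory idea (not HANA specifically, which I am not attempting to imitate) can be sketched in a few lines of Python. The sample file, its contents and the function names are all invented for this example. One function goes back to the disk for every question, the other loads the data into memory once and answers from there.

```python
import csv
import os
import tempfile

# Write a small sample file to disk (hypothetical data, just for the sketch).
path = os.path.join(tempfile.mkdtemp(), "prices.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([("sku-%d" % i, i * 10) for i in range(1000)])

def lookup_on_disk(sku):
    """Re-open and scan the file for every single question."""
    with open(path, newline="") as f:
        for key, price in csv.reader(f):
            if key == sku:
                return int(price)
    return None

# Load once, then answer every later question straight from memory.
with open(path, newline="") as f:
    in_memory = {key: int(price) for key, price in csv.reader(f)}

def lookup_in_memory(sku):
    """Answer from the dictionary already held in RAM."""
    return in_memory.get(sku)

print(lookup_on_disk("sku-42"), lookup_in_memory("sku-42"))  # 420 420
```

Both give the same answer; the difference is that the second one never touches the disk again after the initial load, which is where the speed comes from when the same data is queried over and over.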
I deliberately left out terms like storage, search and query etc. only because I didn’t know what I could write about them that would add any value for a person reading my post. Would saying – “Storage : Infrastructure/Application to hold large amounts of data. Example – Data centers, Databases” – help? It is somewhat obvious and yet an important term while talking about Big Data. The problems with Storage are enough to fill blog post after blog post, and I promise to write about it sometime later.
Infographics are always a nice change from boring text. Maybe you can have a look at these 6 Infographics for now.
I hope this post served its purpose of helping you gain familiarity with some terms around Big Data and the concept itself. In my next post we will discuss some of the questions companies should ask themselves around Big Data, a little focus on Oracle in this space and their continued fight with SAP, current trends and problems in Big Data.
While you question if Big Data brings answers from the outside world, until my next post, I hope you spend some time exploring the world within.