Big data is now a dictionary term, and most people have a general sense of what it means. However, if you ask Google, it gives you the following answer.
One thing is evident from the Google response: big data is a large data set. But the question is, how large?
How big is big data?
Do you call a 100GB data set Big Data? Or should it be terabytes or petabytes before we call it Big Data? This confusion existed for some time, until people agreed on an acceptable definition. Gartner gives the most widely accepted definition of big data.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
If you quickly analyse the definition, you will see three characteristics.
- Volume
- Velocity
- Variety
Some people call these the 3Vs of big data.
So, coming back to our question: do we call a 100GB data set Big Data?
In this case, we already know that the volume is 100GB. That doesn't appear to be too big. To call it big data, we need details of the other two Vs: velocity and variety. If we realize that we are receiving data at a rate of 100GB per minute, and there is a need to store or process it at the same speed, I would want to call it big data.
You may ask me why. Why would you call it big data?
Ok, let’s look at the other parts of the definition. If you don't have a clear, cost-effective solution for the problem created by the combination of the 3Vs, and you need to innovate to solve that problem, believe me, you have a big data problem.
So, Big Data is a problem characterized by the 3Vs. To address it cost-effectively, you require innovative thinking and the use of innovative tools and techniques. If your problem fits this definition, you have a big data problem.
How did Big Data start?
The Big Data problem started more than a decade ago. People may debate this, but I believe that it began with the growth and popularity of the World Wide Web. Search engine companies like Google and Yahoo were the first to recognize it, but soon many Internet-scale companies like Amazon and Facebook started realizing the problem. They were in the first row because they had to deal with internet-scale volume, velocity, and variety.
In today’s world, data has become synonymous with oil and electricity. Organizations run on data. Every business has begun to depend on data to derive insights and use them in decision making. Now, they are moving to the next level and have started using data to automate systems, processes, and even decision making.
So, if you haven’t already started learning and becoming comfortable with a whole new fleet of tools and technologies for dealing with the Big Data problem, you are losing opportunities, just like those organizations that still do not recognize their Big Data problem.
History of Big Data and Hadoop
Hadoop is one of the toolsets that enable us to deal with Big Data. But Hadoop is not the only one. Innovation is happening all over, and you will find specialized tools for specific problems. Still, Hadoop is one of the most successful and widely adopted tools in this space.
As I said, search engine companies were the first to realize the Big Data problem, and Google was the first to solve the puzzle. Google disclosed the first part of the solution in 2003 by publishing the paper “The Google File System.” This paper presented a distributed file system for writing and reading large data sets efficiently. The second part of the solution came in 2004 in another paper, “MapReduce: Simplified Data Processing on Large Clusters.” This paper presented a framework for processing files stored on the Google File System.
I am not sure how many people realized the importance of those two Google research papers, but Doug Cutting did. He recognized their importance because he was working on a project and facing the Big Data problem himself. So, he decided to implement the ideas described in those two Google papers and built them successfully to a demonstrable extent.
Later, in 2005, he was hired by Yahoo and given a dedicated and talented team to create a distributed computing platform. He named it Hadoop, and Yahoo decided to open-source it. After several iterations, Hadoop 1.0.0 came out in December 2011.
So, initially, Hadoop had just two core components.
- A distributed file system, called the Hadoop Distributed File System (HDFS)
- A distributed programming framework, called MapReduce
Both are implementations of the ideas in those two Google papers.
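To get a feel for the MapReduce programming model that the second component implements, here is a minimal sketch in plain Python. This is not Hadoop code and makes no use of any Hadoop API; it only mimics the three phases (map, shuffle, reduce) on an in-memory word-count example, which is the classic illustration of the model.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real Hadoop job, you write only the map and reduce functions; the framework handles the shuffle, distributes the work across the cluster, and reads the input from HDFS instead of an in-memory list.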
The open source community released Hadoop 2.0 in May 2012. This version added another core component called YARN. As we stand today, Hadoop 3.0 alpha2 came out in January 2017; however, it introduced no new core component.
In upcoming videos, we will learn about these three main parts. We will cover Hadoop 2.x and the relevant features introduced in Hadoop 3.x.
What is the Hadoop ecosystem?
Now, we know about the Hadoop core components. But in today’s world, when people refer to Hadoop, they don’t just mean these three essential elements. They also include a set of tools that work on top of, or around, the Hadoop core. They call it the Hadoop ecosystem. There is no precise definition of what falls under the Hadoop ecosystem and what stands outside it. Without going into that debate, I want to list some names that are widely considered part of the Hadoop ecosystem.
- Hive
- Pig
- HBase
- Spark
- Sqoop
- Flume
- Kafka
- Oozie
There are many more names, but I have listed only those that evolved alongside Hadoop. This list is meant to give you a sense of the Hadoop ecosystem, but you don’t need to learn all of them. Many of them never achieved wide adoption. Hive, Spark, and Kafka are the three most widely adopted components from this list.
This tutorial will build a sound foundation for the Hadoop core. We will cover individual components separately. We already have a separate Apache Kafka Tutorial and Apache Spark Tutorial, and we will create similar tutorials for every other tool. Our focus will primarily be on the perspective of the developer and application architect.
So please subscribe to our YouTube channel and stay tuned.
Thank you very much. Keep learning and keep growing.