More Hadoop

What is Big Data? – A beginner’s tutorial
(http://www.hadoopinrealworld.com/what-is-big-data/)

Content:

  1. What is Big Data?
  2. Examples of Big Data
  3. What are the problems that come with Big Data?
  4. What Hadoop can offer in terms of solutions to the Big Data problems
  5. Hadoop vs. Traditional Solutions

What is Big Data?
If you asked me to define Big Data in one sentence, I would say: Big Data is an extremely large volume of data.
The natural follow-up question is – what is considered "huge"? 10 GB, 100 GB or 100 TB?
Well, there is no hard number that defines "huge" or Big Data. I know that is not the answer you are looking for (I can see your disappointment :-)). There are 2 reasons why there is no straightforward answer –
First, what is considered big or huge today in terms of data volume need not be considered big a year from now. It is very much a moving target.
Second, it is all relative. What you and I consider to be "Big" may not be the case for companies like Google and Facebook.
Hence, for the above 2 reasons, it is very hard to put a number on Big Data volume. That said, if we had to define a Big Data problem in terms of pure volume alone, then in our opinion –

100 GB – Not a chance. We all have hard disks larger than 100 GB. Clearly not Big Data.
1 TB – Still no, because a well-designed traditional database can handle 1 TB or even more without any issues.
100 TB – Likely. Some would call 100 TB a Big Data problem and others might disagree. Again, it's all relative.
1000 TB – Definitely a Big Data problem.

Also, volume is not the only factor that classifies your data as Big Data. So what other factors should be considered? Perfect segue to our next question.
Let's say we work at a startup and we recently launched a very successful email service where users can log in to send and receive emails. Our email service is so good (better than Gmail :-)) that in 3 months we have 100,000 active users signed up and using our service. Hypothetically, let's say we are currently using a traditional database to store email messages and their attachments, and the current size of our database is 1 TB.
So, do we have a Big Data problem? The simple and straight answer is no. 1 TB or even more is a manageable size for a traditional database. The more important question is – at this growth rate, will we have a Big Data problem in the (near) future? To answer that we need to consider 3 factors.

[Figure: Volume, Velocity, Variety]

Volume
Obvious factor. In 3 months our startup has 100,000 active users and our volume is 1 TB. If we keep growing at the same rate, at the end of the year we will have 400,000 users and our volume will be 4 TB; at the end of year 2, with the same growth rate, we will have 8 TB. What if we doubled or tripled our user base every 3 months? So the bottom line is that we should not just look at the volume when we think of Big Data; we should also look at the rate at which our data grows. In other words, we should watch the velocity, or speed, of our data growth.
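To make the projection concrete, here is a minimal sketch using the hypothetical numbers from our startup example (not real measurements), comparing steady growth of 1 TB per quarter with a scenario where the data doubles every quarter:

```java
// Hypothetical growth projection for our startup's mail data:
// linear growth (+1 TB per quarter) vs. doubling every quarter.
public class GrowthProjection {
    public static void main(String[] args) {
        double linearTb = 1.0;     // 1 TB after the first quarter
        double doublingTb = 1.0;
        for (int quarter = 2; quarter <= 8; quarter++) {   // project two more years
            linearTb += 1.0;       // +1 TB of new mail data per quarter
            doublingTb *= 2.0;     // user base (and data) doubles every quarter
            System.out.printf("Quarter %d: linear = %.0f TB, doubling = %.0f TB%n",
                    quarter, linearTb, doublingTb);
        }
    }
}
```

By quarter 8 the doubling scenario is already at 128 TB, which is why velocity matters as much as today's volume.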

Velocity
This is an important factor. Velocity tells you how fast your data is growing. If your data volume stays at 1 TB for a year, all you need is a good database; but if the growth rate is 1 TB every week, you have to think about a scalable solution.
Most of the time, volume and velocity are all you need to decide whether you have a Big Data problem or not.

Variety
Variety adds one more dimension to your analysis. Data in traditional databases is highly structured, i.e. rows and columns. But take our hypothetical startup's mail service: it receives data in various formats – text for mail messages, images and videos as attachments. When data comes into your system in different formats and you have to process or analyze it in those formats, traditional database systems are sure to fail; combine that with high volume and velocity and you most certainly have a Big Data problem.
This happens to Big Data consultants all the time: they are called in by clients with data storage or performance issues who hope that a Big Data solution like Hadoop is going to solve their problem, and most of the time the client's case fails the volume and velocity tests.
Their volume is in the high gigabytes or low terabytes, and their data growth has been relatively low for the past 6 months and in the foreseeable future. Hence their volume does not qualify as Big Data, and since the data growth is low it fails the velocity test as well. What the client needs is to optimize the existing process, not a sophisticated Big Data solution.
Now you know what Big Data is. Given a scenario, if someone comes up to you and asks whether their data problems can be solved by Big Data solutions, you know the factors – volume, velocity and variety – that need to be considered to make a sound decision.
So when we say Big Data, we are potentially talking about hundreds to thousands of terabytes. If you are new to the Big Data space, you are probably wondering whether there are real use cases. The answer is absolutely yes, across all domains – science, government and the private sector.

Examples of Big Data

[Figure: Hadoop in multiple domains]

Let's talk about science first. The Large Hadron Collider at CERN produces about 1 petabyte of data every second, mostly sensor data. The volume is so huge that they don't even retain or store all the data they produce.
Source

NASA gathers about 1.73 gigabytes of data every hour – weather data, geolocation data, etc.
Source

Let's talk about government. The NSA is known for its controversial data collection programs, and guess how much the NSA's data center in Utah can house in terms of volume? A yottabyte of data – that is, 1 trillion terabytes. Pretty massive, isn't it?
Source

In March 2012, the Obama administration announced it would spend $200 million on Big Data initiatives.
Source

Even though we cannot technically classify the next one under government, it's an interesting use case so we included it anyway. Obama's second-term election campaign used Big Data analytics, which gave it a competitive edge in winning the election.
Source

Next, let's look at the private sector. With the advent of social media like Facebook, Twitter, LinkedIn, etc., there is no scarcity of data. eBay is known to have a 30 PB cluster and Facebook another 30 PB. The numbers are probably much higher now, since these stats are old.
Source

Not only social media companies but also the retail space. It is common for major retail websites to capture clickstream data. For example, when you shop at amazon.com, Amazon is not only capturing data when you click checkout – every click on the website is tracked to deliver a personalized shopping experience. When Amazon shows you recommendations, Big Data analytics is at work behind the scenes.

You might also be interested to know how Target could figure out that a woman was pregnant using data analytics.

What are the problems that come with Big Data?

Now you should be convinced that Big Data exists, even if you did not believe it before.
So we have Big Data, so what?
Big Data comes with big problems. Let's talk about a few problems you may run into when you deal with Big Data.
Since the datasets are huge, you need to find a way to store them as efficiently as possible – not just efficiency in terms of storage space, but also storing the dataset in a form that is suitable for computation. Another problem with big datasets is data loss due to corruption or hardware failure, so you need to have proper recovery strategies in place.
The main purpose of storing data is to analyze it, and how long it takes to analyze your big data and produce an answer is the million-dollar question. What good is storing the data if you cannot analyze or process it in a reasonable time? With big datasets, computation with reasonable execution times is a challenge.
Finally, cost. You are going to need a lot of storage space, so the storage solution you plan to use should be cost effective.
But why do we need a new Big Data solution like Hadoop? Why are traditional database solutions like MySQL or Oracle not a good fit for storing and analyzing big datasets?
You will encounter scalability issues with a traditional RDBMS when the data volume moves into the terabytes. You will be forced to denormalize and pre-aggregate the data for faster query execution. As the data gets bigger, you will be forced to keep changing the process – adjusting indexes, optimizing queries and so on.
If you have worked with databases before, you know that even when your database runs on ample hardware resources, a performance issue still forces you to change the query itself or the way the data you are accessing is stored – there is no working around it. You cannot simply add more hardware resources or compute nodes and distribute the problem to bring the computation time down. In other words, a database is not horizontally scalable.
The second problem is that databases are designed to process structured data. Imagine records that do not conform to a table-like structure, where each record differs in the number of columns and there is no uniformity in column lengths and types between rows. In this case a database is not a good choice; when your data does not have a proper structure, a database will struggle to store and process it. Furthermore, a database is not a good choice when you have a variety of data, that is, data in several formats like text, images, etc.
The other challenge is cost. A good enterprise-grade database solution can be quite expensive for a relatively low volume of data. When you add the hardware costs and the platinum-grade storage costs, it quickly becomes very expensive.

So what is the solution? A good solution should –

  • Of course, handle huge volumes of data
  • Efficient storage – the ability to store data efficiently
  • Recovery from data loss – data loss is unavoidable, so the proposed solution should have a good recovery mechanism in place
  • Scale horizontally as your data grows
  • Be cost effective
  • Minimize the learning curve – easy for programmers and data analysts to work with

That is exactly what Hadoop offers!!!

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

Hadoop can handle huge volumes of data, it can store data efficiently both for storage and computation, it has a good recovery mechanism for data loss, and above all it scales horizontally: as your data gets bigger, you add more nodes and Hadoop takes care of the rest. That simple.

Above all, Hadoop is cost effective – we don't need any specialized hardware to run it, which makes it great even for startups. Finally, it is easy to learn and implement.

Hadoop vs. Traditional Solutions

[Figure: Hadoop vs. traditional systems]

High Performance or Grid Computing

Distributed computation is not a new concept. Traditional distributed solutions like grid computing essentially have many nodes operating on data in parallel, which makes computation faster. But there are 2 challenges with that approach.

  1. Grid or high-performance computing is good for compute-intensive tasks with relatively low volumes of data, but it does not perform well when the volume of data is huge.
  2. Grid computing requires solid experience with low-level programming to implement and hence is not suitable for the mainstream.

RDBMS

What about Hadoop vs. RDBMS? Is Hadoop a replacement for the database? The straight answer is no. There are things Hadoop is good at and there are things a database is good at.
An RDBMS works exceptionally well with volumes in the low terabytes, whereas with Hadoop the volumes we speak of are in petabytes.
Hadoop can work with a dynamic schema and can support files in many different formats, whereas a database schema is very strict, not so flexible, and cannot handle multiple formats.
Database solutions can scale vertically, meaning you can add more resources to the existing solution, but they do not scale horizontally – that is, you cannot bring down the execution time of a query by simply adding more computers.
Finally, the cost: database solutions get expensive very quickly as you increase the volume of data you are trying to process, whereas Hadoop offers a cost-effective solution. Hadoop infrastructure is made up of commodity computers, meaning there is no need for special hardware. Commodity computers are not cheap computers – they are still enterprise-grade hardware, but relatively inexpensive compared with specialized equipment.
Hadoop is a batch processing system. It is not as interactive as a database; you cannot expect millisecond response times from Hadoop the way you would from a database.
With Hadoop you write the file or dataset once and read or analyze the data many times, whereas with a database you can read and write many times.
But these gaps between Hadoop and RDBMS are closing. Hadoop offers a cost-effective solution to Big Data problems, but Hadoop is not the only solution on the market. NoSQL databases like HBase and Cassandra bring a great deal of value in analyzing huge volumes of data and are a great alternative to an RDBMS.

Conclusion

In this day and age, one thing is a fact – there is no shortage of information. That does not mean that a decade ago there was much less data; a few years ago, the data we collected was simply ignored and not analyzed, due to the lack of frameworks or technology to analyze it at scale. Quite honestly, ignoring data is not an option any more, and it is important for businesses and organizations to analyze their data to get realistic and useful predictions about their business and their customers. With a great technology like Hadoop, analyzing huge datasets has never been this easy and cost effective.
At this point you should have a good idea of what is classified as Big Data and what the challenges with it are. We will now move on to learn what Hadoop is and how it proposes to store and analyze big datasets.


What is Hadoop? – A beginner’s tutorial to understand Big Data problem and Hadoop
(http://www.hadoopinrealworld.com/what-is-hadoop/)

In the previous post we looked at what Big Data is. To learn what Hadoop is, we first need to understand the challenges that big datasets pose in terms of storage and computation. Once we understand the problems we are trying to solve, we can easily understand the solution – which is Hadoop.
Let’s take a sample big data problem, analyze it and see how we can arrive at a solution together. Ready?
Imagine you work at one of the major exchanges, like the New York Stock Exchange or Nasdaq. One morning, someone from your risk department stops by your desk and asks you to calculate the maximum closing price of every stock symbol ever traded on the exchange since its inception. Also assume the size of the dataset you were given is 1 TB.
Immediately the business user who gave you this problem asks for an ETA on when he can expect the results.
Wow!! There is a lot to think about to answer that question. So you ask him to give you some time and you start to work. What would your next steps be?
We need to figure out 2 things – first, how to store this dataset, and second, how to perform the computation. Let's talk about storage first.
Your workstation has only 20 GB of free space, so you go to your storage team and ask them to copy the dataset to a Network Attached Storage (NAS) or SAN server and give you the location of the dataset. A NAS is a data storage server connected to your network, so any computer on the network can access the data on the NAS server if it is permissioned to see it. So far so good. The data is stored, you have access to it, and now you set out to solve the next problem, which is computation.
You are a Java programmer, so you write an optimized Java program to parse the file and perform the computation. Everything looks good and you are ready to execute the program against your dataset.
You realize it is already noon when the business user who gave you this request stops by for an ETA. That's an interesting question, isn't it? For the program to work on the dataset, the dataset first needs to be copied from storage to working memory, or RAM. So to estimate the execution time you need to know how long it takes to read the dataset from the hard disk and how long it takes to compute the result.

So how long does it take to copy a 1 TB file from storage?

Let's take a traditional hard disk drive (HDD) – the kind attached to your laptop or workstation. HDDs have magnetic platters on which the data is stored. When you request a read, the head in the hard disk first positions itself on the platter and then starts transferring data from the platter to the head. The speed at which data is transferred from the platter to the head is called the data access rate.
The average data access rate of an HDD is about 122 MB per second. If you do the math, reading a 1 TB file from an HDD takes about 2 hours and 22 minutes.

[Figure: Hadoop – data access rate]
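Here is a minimal back-of-the-envelope sketch of that calculation (the 122 MB/s figure is the assumed average access rate from above, not a measured value):

```java
// Rough estimate of the time to read a 1 TB file sequentially from a single HDD.
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double fileSizeMb = 1024.0 * 1024.0;   // 1 TB expressed in MB (binary units)
        double accessRateMbPerSec = 122.0;     // assumed average HDD data access rate
        double seconds = fileSizeMb / accessRateMbPerSec;
        System.out.printf("Sequential read time: %.0f minutes (~%.1f hours)%n",
                seconds / 60, seconds / 3600);
        // Prints a little under two and a half hours, in line with the estimate above.
    }
}
```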

That is for an HDD connected to your workstation. When you transfer a file from the NAS server, you should know the transfer rate of the hard disk drives in the NAS server. For now we will assume it is the same as a regular HDD, 122 MB per second, so it would take 2 hours and 22 minutes just to read the data from the hard disk. While you were calculating all this, the business user is still standing there waiting for an ETA.
Now, what about computation time? Since you have not executed the program even once, you cannot say for sure; plus your data comes from storage on the network, so you have to consider the network bandwidth as well. With all that in mind you tell him the ETA is about 3 hours, but it could easily be over 3 hours since you are not sure about the computation time.
Your user is shocked to hear 3 hours, and the next question comes: "Can we get it sooner than 3 hours, say maybe in 30 minutes?" You know there is no way you can get the result in 30 minutes. And of course the business cannot wait 3 hours, especially in finance where time is money. 3 hours is unacceptable.

So let's work this problem together. How can we calculate the result in 30 minutes or less?

Let's break this down. The majority of the time spent calculating the result will be attributed to 2 tasks.

  1. Transferring the data from storage (the hard disk drive), which is about 2 and a half hours, and
  2. Computation time – that is, the time for your program to compute the actual maximum close price. Let's say it would take 60 minutes.

Crazy idea! To reduce the time to read the data from the hard disk, what if we replace the HDD with an SSD? Solid State Drives (SSDs) are very powerful alternatives to HDDs. An SSD has no magnetic platters or heads – no moving components at all – and is based on flash memory, which makes it extremely fast.

Sounds great. Why don’t we use SSD in place of HDD?

SSDs come with a price. They usually cost 5 to 6 times as much as a regular HDD. Although prices continue to go down over time, given the data volumes we are talking about with Big Data, it is not a viable option right now. So for now we are stuck with HDDs. Let's talk about how we can reduce the computation time. Hypothetically we think the program will take 60 minutes to complete, and let's also assume the program is already optimized for execution.

What can be done next? Any ideas? I have a crazy idea!!

How about dividing the 1 TB dataset into 100 equal-sized chunks, or blocks, and having 100 computers, or nodes, do the computation in parallel?

In theory, this cuts both the data access time and the computation time by a factor of 100. With this idea, you can bring the data access time down to less than 2 minutes and the computation time down to 0.6 minutes. Beautiful!!

[Figure: What is Hadoop – 100 blocks across 100 nodes]
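Sticking with the same hypothetical numbers, a quick sketch of the speed-up (it assumes a perfectly parallel read and an even split of the data, which we question next):

```java
// Hypothetical speed-up from splitting 1 TB across 100 nodes working in parallel.
public class ParallelEstimate {
    public static void main(String[] args) {
        double readMinutes = 142.0;     // sequential read time for 1 TB at ~122 MB/s
        double computeMinutes = 60.0;   // assumed single-node computation time
        int nodes = 100;                // number of blocks / nodes working in parallel
        System.out.printf("Per-node read time:    %.1f min%n", readMinutes / nodes);
        System.out.printf("Per-node compute time: %.1f min%n", computeMinutes / nodes);
        // Roughly 1.4 minutes to read and 0.6 minutes to compute per node,
        // ignoring network bottlenecks and coordination overhead for now.
    }
}
```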

It is a promising idea, so let's explore it further.

There are a couple of issues with this idea. If you have more than one chunk of your dataset stored on the same hard drive, you cannot get a true parallel read, because there is only one head in the hard drive to read the data. But for the sake of argument, let's assume you do get a true parallel read, meaning you have 100 computers reading the data at the same time. Assuming the data can be read in parallel, you will now have 100 times 122 MB/sec of data trying to flow through the network.

Imagine this: what would happen if every member of your family started streaming their favorite TV show or movie at the same time over the single internet connection you have at home?

It would result in a very poor streaming experience with a lot of buffering, and no one in the family could enjoy their show. What you have essentially done is choke your network: the combined download speed requested by all the devices exceeds the bandwidth offered by the internet connection, resulting in poor service. That is exactly what will happen here when 100 nodes try to transfer data over the network at the same time.
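To put rough numbers on it (the 10 Gbit/s link capacity below is an assumption for illustration, not something from the original scenario):

```java
// Aggregate bandwidth demanded by 100 nodes reading from shared network storage,
// compared against a hypothetical 10 Gbit/s network link.
public class NetworkChokeEstimate {
    public static void main(String[] args) {
        int nodes = 100;
        double perNodeMbPerSec = 122.0;                                  // each node reads at HDD speed
        double demandGbitPerSec = nodes * perNodeMbPerSec * 8 / 1000.0;  // MB/s -> Gbit/s (approx.)
        double linkGbitPerSec = 10.0;                                    // assumed shared link capacity
        System.out.printf("Aggregate demand: %.1f Gbit/s vs. link capacity: %.1f Gbit/s%n",
                demandGbitPerSec, linkGbitPerSec);
        // Nearly 100 Gbit/s of demand against a 10 Gbit/s link: the network,
        // not the disks, becomes the bottleneck.
    }
}
```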

How can we solve this?

Why do we have to rely on Network Attached Storage? Why don't we bring the data closer to the computation? That is, let's store the data locally on each node's own hard disk: block 1 of the data on node 1, block 2 on node 2, and so on.
Now we can achieve a true parallel read on all 100 nodes, and we have also eliminated the network bandwidth issue.
However, now that we have divided our data into blocks, our exposure to data loss increases: when one of the 100 nodes goes down, we lose the data stored on that node. And we need all 100 chunks, or blocks, of data to calculate the maximum closing price, so losing a block of data results in an incomplete answer. In reality, data loss is unavoidable even in a single-node setup, but we need a way to recover from it.

Data Loss

I am sure most of us have been through a hard drive failure at least once, and it is not much fun. So how can you protect your data from hard disk failure, data corruption, etc.?
Let's take an example. Say you have a photo of your loved ones and you treasure that picture; there is no way you could bear to lose it. How would you protect it? You would keep copies of the photo in different places – maybe one on your laptop, one copy in Picasa, one copy on your external hard drive. You get the idea. If your laptop crashes, you can still get that picture from Picasa or your external hard drive.
We can do something similar for our data loss situation. Why don't we copy each block of data to 2 more nodes – in other words, replicate the block on 2 more nodes – so that in total we have 3 copies of each block?
Take a look at the figure below. Node 1 has blocks 1, 7 and 10. Node 2 has blocks 7, 11 and 42. Node 3 has blocks 1, 7 and 10. So if block 1 is unavailable on node 1 due to a hard disk failure or corruption in the block, it can easily be fetched from node 3.

[Figure: Hadoop – block replication]
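A tiny sketch of the bookkeeping this implies – which node holds which blocks, and where to look when a copy is lost (the placement mirrors the example above and is purely illustrative):

```java
import java.util.*;

// Illustrative block-to-node placement with 3-way replication, and a lookup of
// an alternative replica when a node loses a block.
public class BlockPlacement {
    public static void main(String[] args) {
        Map<String, List<Integer>> nodeToBlocks = new HashMap<>();
        nodeToBlocks.put("node1", Arrays.asList(1, 7, 10));
        nodeToBlocks.put("node2", Arrays.asList(7, 11, 42));
        nodeToBlocks.put("node3", Arrays.asList(1, 7, 10));

        // Suppose node1's disk fails; find another node that still holds block 1.
        int lostBlock = 1;
        for (Map.Entry<String, List<Integer>> entry : nodeToBlocks.entrySet()) {
            if (!entry.getKey().equals("node1") && entry.getValue().contains(lostBlock)) {
                System.out.println("Block " + lostBlock + " can be fetched from " + entry.getKey());
            }
        }
    }
}
```

Of course, in a real cluster nobody maintains this map by hand – which is exactly what the questions below are getting at.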

So this means that nodes 1, 2 and 3 must have access to one another; they should be connected in a network, forming a 100-node cluster. Conceptually this is great, but it has some implementation challenges.
How does node 1 know that node 3 also has block 1? Who decides that block 7 should be stored on nodes 1, 2 and 3? And first of all, who will break the 1 TB into 100 blocks? This solution does not look easy, does it? And that is just the storage part of it.
Computation brings other challenges. Node 1 can only compute the maximum close price from block 1, node 2 only from block 2, and so on. This is a problem because data for, say, stock GE could be in block 1, in block 2, and also in block 82. So you have to consolidate the results from all the nodes to compute the final result. Who is going to coordinate all that?
The solution we are proposing is distributed computing, and as we are seeing, it has several complexities to implement, both at the storage layer and at the computation layer.
The answer to all those open questions and complexities is Hadoop. Hadoop offers us a framework for distributed computing. Hadoop has 2 core components – HDFS and MapReduce.

HDFS

HDFS stands for Hadoop Distributed File System, and it takes care of all the storage-related complexities: splitting your dataset into blocks, replicating each block to more than one node, keeping track of which block is stored on which node, and so on.
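As a small illustration, the HDFS Java client even lets you ask where the blocks of a file ended up. This is only a sketch – the path "/data/stocks.csv" is hypothetical, and the splitting and replica placement themselves are handled by HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Query HDFS for the block size, replication factor and block locations of a file.
public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/stocks.csv");      // hypothetical dataset path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());

        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i + " lives on " + String.join(", ", blocks[i].getHosts()));
        }
    }
}
```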

MapReduce

MapReduce is a programming model, and it takes care of all the computational complexities – like bringing the intermediate results from every single node together to produce a consolidated output.
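For our maximum close price problem, a MapReduce job could look roughly like the sketch below. The CSV column layout (symbol in column 2, close price in column 7) is an assumption for illustration, not a format given in the article; adjust the indexes to your real data:

```java
import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a MapReduce job that finds the maximum close price per stock symbol.
// Assumed input: CSV lines like "exchange,symbol,date,open,high,low,close,volume".
public class MaxClosePrice {

    public static class MaxCloseMapper
            extends Mapper<LongWritable, Text, Text, FloatWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String symbol = fields[1];                        // assumed column layout
            float close = Float.parseFloat(fields[6]);
            context.write(new Text(symbol), new FloatWritable(close));
        }
    }

    public static class MaxCloseReducer
            extends Reducer<Text, FloatWritable, Text, FloatWritable> {
        @Override
        protected void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float max = Float.NEGATIVE_INFINITY;
            for (FloatWritable v : values) {                  // partial results from every block
                max = Math.max(max, v.get());
            }
            context.write(key, new FloatWritable(max));       // one maximum per symbol
        }
    }
}
```

The framework handles splitting the input into blocks, running the mapper on the node that holds each block, shuffling the (symbol, price) pairs, and feeding them to the reducer – exactly the coordination problems we listed above.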

What is Hadoop?

Hadoop is a framework for distributed processing of large datasets across clusters of commodity computers. The last two words of the definition are what make Hadoop even more special. "Commodity computers" means that the 100 nodes in the cluster we talked about do not need any specialized hardware. They are enterprise-grade server nodes with a processor, hard disk and RAM in each of them, but other than that there is nothing special about them. Don't confuse commodity hardware with cheap hardware: it means inexpensive hardware, not cheap hardware.

[Figure: What is Hadoop?]

Now you know what Hadoop is and how it can offer an efficient solution to your maximum close price problem against the 1 TB dataset. You can go back to that business user and propose Hadoop to solve the problem and achieve the execution time your users are expecting. If you propose a 100-node cluster to your business, expect some crazy looks. But that is the beauty of Hadoop: you don't need a 100-node cluster. We have seen Hadoop production environments from small 10-node clusters all the way to 800-1000 node clusters. You can simply start with a 10-node cluster, and if you want to reduce the execution time further, all you have to do is add more nodes to your cluster. That simple. In other words, Hadoop helps you scale horizontally.

Now you know what Hadoop is and, conceptually, how it solves the problem of big datasets.