Hadoop

Apache Hadoop
(https://en.wikipedia.org/wiki/Apache_Hadoop)

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to the nodes so that each node processes, in parallel, the data it holds. This approach takes advantage of data locality (nodes manipulating the data they have local access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications; and
  • Hadoop MapReduce – an implementation of the MapReduce programming model for large scale data processing.

The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm.

Apache Hadoop’s MapReduce and HDFS components were inspired by Google papers on their MapReduce and Google File System.

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with “Hadoop Streaming” to implement the “map” and “reduce” parts of the user’s program. Other projects in the Hadoop ecosystem expose richer user interfaces.


MapReduce
(https://en.wikipedia.org/wiki/MapReduce)

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar approaches have been very well known since 1995 with the Message Passing Interface standard having reduce and scatter operations.

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
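
To make the model concrete, the following is a minimal Hadoop MapReduce sketch in Java that counts how often each name appears in the input, mirroring the student-queue example above. The class names and input/output paths are illustrative, not part of any particular distribution.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NameCount {

        // Map(): emit (name, 1) for every name found in a line of input.
        public static class NameMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text name = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        name.set(token);
                        context.write(name, ONE);   // one "queue entry" per occurrence
                    }
                }
            }
        }

        // Reduce(): sum the counts for each name, yielding name frequencies.
        public static class NameReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text name, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) {
                    sum += c.get();
                }
                context.write(name, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "name count");
            job.setJarByClass(NameCount.class);
            job.setMapperClass(NameMapper.class);
            job.setCombinerClass(NameReducer.class);   // summing is associative, so a combiner is safe
            job.setReducerClass(NameReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input/students
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output/name-counts
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework, rather than this code, takes care of splitting the input, shuffling the intermediate (name, 1) pairs to the reducers, and re-running tasks that fail.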

The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine once. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations. The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and the fault-tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as their primary Big Data processing model, and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.


Hadoop Distributed File System (HDFS)
(https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode due to its criticality. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to communicate with each other.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (though some RAID configurations are still useful for increasing I/O performance). With the default replication value of 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file system differ from the target goals of a Hadoop application. The trade-off of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append.

HDFS added high-availability capabilities, as announced for release 2.0 in May 2012, letting the main metadata server (the NameNode) fail over manually to a backup. The project has also started developing automatic fail-over.
The HDFS file system includes a so-called secondary namenode, a misleading name that some might incorrectly interpret as a backup namenode for when the primary namenode goes offline. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode’s directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes.

An advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example: if node A contains data (x,y,z) and node B contains data (a,b,c), the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and node A would be scheduled to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems, this advantage is not always available. This can have a significant impact on job-completion times, which has been demonstrated when running data-intensive jobs.

HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write-operations.

HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems.

File access can be achieved through the native Java application programming interface (API), through the Thrift API to generate a client in the language of the user’s choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), through the command-line interface, through the HDFS-UI web application (webapp) over HTTP, or via third-party network client libraries.
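
As a rough illustration of the native Java API mentioned above, the following sketch writes and then reads a small file through HDFS. The namenode address and file path are assumptions made for the example; in practice, fs.defaultFS would normally come from the cluster’s core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");   // assumed namenode address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/hello.txt");

                // Write a small file; it is replicated according to dfs.replication (3 by default).
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read it back from whichever datanodes hold the blocks.
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                    System.out.println(in.readLine());
                }
            }
        }
    }

The command-line interface (hadoop fs) exposes equivalent operations on top of the same FileSystem abstraction.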


Get Started with Hadoop – Hadoop: Open Insight Anywhere
(https://www.ibm.com/analytics/uk/en/technology/hadoop.html)

What is Hadoop?
Apache™ Hadoop® is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

IBM provides the industry’s premier open Hadoop solution that delivers critical insights, deployed anywhere. IBM is a proud member of ODPi, a shared industry effort focused on promoting and advancing Apache Hadoop for the enterprise.

IBM Open Platform with Apache Hadoop
IBM Open Platform is an industry-standard Hadoop distribution that is built to ODPi standards and available for free or with a paid support option.


Welcome to Apache™ Hadoop®!
(https://hadoop.apache.org/)

What Is Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

What is Apache Hadoop?
(https://www.sas.com/en_us/insights/big-data/hadoop.html)

Hadoop: What is it and why does it matter?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop History
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).

One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.

In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided – the web crawler portion remained as Nutch and the distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.

Why is Hadoop important?

  • Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
  • Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
  • Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
  • Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
  • Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
  • Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

What are the challenges of using Hadoop?

MapReduce programming is not a good match for all problems. It’s good for simple information requests and problems that can be divided into independent units, but it’s not efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between MapReduce phases and is inefficient for advanced analytic computing.

There’s a widely acknowledged talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That’s one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills. And, Hadoop administration seems part art and part science, requiring low-level knowledge of operating systems, hardware and Hadoop kernel settings.

Data security. Another challenge centers around the fragmented data security issues, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.

Full-fledged data management and governance. Hadoop does not have easy-to-use, full-feature tools for data management, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.


What is Apache Hadoop?
(http://hortonworks.com/hadoop/)

The open source framework for storing and extracting insight from massive volumes of data.

Open Enterprise Apache Hadoop: The Ecosystem of Projects
Apache Hadoop® is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data.

Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop.  Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles.

Hortonworks Data Platform

In this guide, discover:

  • The technology components of HDP and its blueprint for Enterprise Hadoop.
  • How HDP integrates and complements your existing data systems.
  • The versatility of applications enabled by Apache Hadoop YARN at the core of HDP.
  • How HDP provides security, operations and governance capabilities.
  • Deployment for HDP from Linux to Windows, On-Premise to In-Cloud.
  • How HDP assists all stakeholders in data teams: from architects to scientists.

HDP is the 100% Open Source Blueprint for Apache Hadoop. Download the Guide Now.


Apache Hadoop – Cloudera
(https://www.cloudera.com/products/apache-hadoop.html)

Apache Hadoop
Hadoop is an ecosystem of open source components that fundamentally changes the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time. CDH, Cloudera’s open source platform, is the most popular distribution of Hadoop and related projects in the world (with support available via a Cloudera Enterprise subscription).

Why Cloudera?
(https://www.cloudera.com/why-cloudera.html)


What is Apache Hadoop?
(https://www.mapr.com/products/apache-hadoop)

Hadoop and Big Data
Apache Hadoop™ was born out of a need to process an avalanche of big data. The web was generating more and more information on a daily basis, and it was becoming very difficult to index over one billion pages of content. In order to cope, Google invented a new style of data processing known as MapReduce. A year after Google published a white paper describing the MapReduce framework, Doug Cutting and Mike Cafarella, inspired by the white paper, created Hadoop to apply these concepts to an open-source software framework to support distribution for the Nutch search engine project. Given the original case, Hadoop was designed with a simple write-once storage infrastructure.

Hadoop has moved far beyond its beginnings in web indexing and is now used in many industries for a huge variety of tasks that all share the common theme of high variety, volume and velocity of data – both structured and unstructured. It is now widely used across industries, including finance, media and entertainment, government, healthcare, information services, retail, and other industries with big data requirements, but the limitations of the original storage infrastructure remain.

Download The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014 to read about the top Hadoop solutions, and why MapR scored the highest for its current offering of all the vendors. Download Report


Hadoop Tutorial
(http://www.tutorialspoint.com/hadoop/)

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

This brief tutorial provides a quick introduction to Big Data, MapReduce algorithm, and Hadoop Distributed File System.


Apache Hadoop
(https://developer.yahoo.com/hadoop/)

Hadoop at Yahoo!
Introduction

Apache Hadoop* is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top level Apache project, initiated and led by Yahoo!. It relies on an active community of contributors from all over the world for its success.

With a significant technology investment by Yahoo!, Apache Hadoop has become an enterprise-ready cloud computing technology. It is becoming the industry de facto framework for big data processing.
The Hadoop project is an integral part of the Yahoo! cloud infrastructure — and is the heart of many of Yahoo!’s important business processes.

We run the world’s largest Hadoop clusters, work with academic institutions and other large corporations on advanced cloud computing research and our engineers are leading participants in the Hadoop community.

What’s new from Yahoo!?
Hadoop with security

Hadoop with security is a significant update to Apache Hadoop. This update integrates Hadoop with Kerberos, a mature open source authentication standard.
Hadoop with security:

  • Prevents unauthorized access to data on Hadoop clusters
  • Authenticates users sharing business sensitive data
  • Reduces operational costs by consolidating Hadoop clusters
  • Collocates data for new classes of applications

Oozie – Yahoo!’s workflow engine for Hadoop
Oozie, Yahoo!’s workflow engine for Hadoop, is an open-source workflow solution to manage and coordinate jobs running on Hadoop, including HDFS, Pig and MapReduce.
Oozie was designed for Yahoo!’s complex workflows and data pipelines at global scale. It is integrated with the Yahoo! Distribution of Hadoop with security and is a primary mechanism to manage complex data analysis workloads across Yahoo!.


Hadoop: What it is, how it works, and what it can do
(https://www.oreilly.com/ideas/what-is-hadoop)

Cloudera CEO Mike Olson on Hadoop’s architecture and its data applications.
Hadoop gets a lot of buzz these days in database and content management circles, but many people in the industry still don’t really know what it is or how it can best be applied.
Cloudera CEO and Strata speaker Mike Olson, whose company offers an enterprise distribution of Hadoop and contributes to the project, discusses Hadoop’s background and its applications in this interview.


Hadoop Platform and Application Framework
(https://www.coursera.org/learn/hadoop)

Drive better business decisions with an overview of how big data is organized, analyzed, and interpreted. Apply your insights to real-world problems and questions. Do you need to understand big data and how it will impact your business? This Specialization is for you. You will gain an understanding of what insights big data can provide through hands-on experience with the tools and systems used by big data scientists and engineers. Previous programming experience is not required! You will be guided through the basics of using Hadoop with MapReduce, Spark, Pig and Hive. By following along with provided code, you will experience how one can perform predictive modeling and leverage graph analytics to model problems. This specialization will prepare you to ask the right questions about data, communicate effectively with data scientists, and do basic exploration of large, complex datasets. In the final Capstone Project, developed in partnership with data software company Splunk, you’ll apply the skills you learned to do basic analyses of big data.


Hadoop: What It Is And How It Works
(http://readwrite.com/2013/05/23/hadoop-what-it-is-and-how-it-works/)

You can’t have a conversation about Big Data for very long without running into the elephant in the room: Hadoop. This open source software platform managed by the Apache Software Foundation has proven to be very helpful in storing and managing vast amounts of data cheaply and efficiently.

But what exactly is Hadoop, and what makes it so special? Basically, it’s a way of storing enormous data sets across distributed clusters of servers and then running “distributed” analysis applications in each cluster.
It’s designed to be robust, in that your Big Data applications will continue to run even when individual servers — or clusters — fail. And it’s also designed to be efficient, because it doesn’t require your applications to shuttle huge volumes of data across your network.

Here’s how Apache formally describes it:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Look deeper, though, and there’s even more magic at work. Hadoop is almost completely modular, which means that you can swap out almost any of its components for a different software tool. That makes the architecture incredibly flexible, as well as robust and efficient.


Spring for Apache Hadoop
(http://projects.spring.io/spring-hadoop/)

Introduction
Spring for Apache Hadoop simplifies developing Apache Hadoop applications by providing a unified configuration model and easy-to-use APIs for using HDFS, MapReduce, Pig, and Hive. It also provides integration with other Spring ecosystem projects such as Spring Integration and Spring Batch, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.

Features

  • Support to create Hadoop applications that are configured using Dependency Injection and run as standard Java applications vs. using Hadoop command line utilities.
  • Integration with Spring Boot to simply create Spring apps that connect to HDFS to read and write data.
  • Create and configure applications that use Java MapReduce, Streaming, Hive, Pig, or HBase
  • Extensions to Spring Batch to support creating Hadoop based workflows for any type of Hadoop Job or HDFS operation.
  • Script HDFS operations using any JVM based scripting language.
  • Easily create custom Spring Boot based applications that can be deployed to execute on YARN.
  • DAO support (Template & Callbacks) for HBase.
  • Support for Hadoop Security.

Spark and Hadoop on Google Cloud Platform
(https://cloud.google.com/hadoop/)

You can run powerful and cost-effective Apache Spark and Apache Hadoop clusters easily on Google Cloud Platform. When you run Hadoop and Spark on Google Cloud Platform, you combine these powerful open source ecosystems with Google’s reliable and highly-scalable infrastructure, Google’s pricing philosophy, and a powerful portfolio of cloud technologies.

There are many easy ways to get started with Spark and Hadoop on Google Cloud Platform, including:

  1. Google Cloud Dataproc — A managed Spark and Hadoop service that allows anyone to create and use fast, easy, and cost-effective clusters
  2. Command line tools (bdutil) — A collection of shell scripts to manually create and manage Spark and Hadoop clusters
  3. Third party Hadoop distributions:

Hadoop Course
(http://bigdatauniversity.com/courses/hadoop-course/)

Hadoop Fundamentals I
Hadoop Fundamentals I teaches you the basics of Apache Hadoop and the concept of Big Data. This Hadoop course is entirely free, and so are the materials and software provided. This is the third version of our most popular Hadoop course. Since Version 2 was published, several more detailed courses covering topics such as MapReduce, Hive, HBase, Pig, Oozie, and Zookeeper have been added.  We recommend you start here and then dig deeper into the specific Hadoop technology you wish to learn more about.

Learn Hadoop
This Hadoop course is designed to give you a basic understanding of key Big Data technologies. In this Hadoop tutorial, we first begin with describing what Big Data is and the need for Hadoop to be able to process that data in a timely manner. This is followed by describing the Hadoop architecture and how to work with the Hadoop Distributed File System (HDFS) both from the command line and using the BigInsights Console that is supplied with InfoSphere BigInsights.

Hadoop Developer Day Event
(http://bigdatauniversity.com/courses/hadoop-hackathon-onsite-and-live-stream-event/)

  • Understand Hadoop and the Big Data Problems It Can Solve
  • Know the key open source components related to Hadoop
  • Understand how Big Data solutions can work in the Cloud
  • Understand the components of Hadoop and how to leverage HDFS
  • Know the essential steps involved with implementing a MapReduce program that runs in Hadoop
  • Use Hive to implement data warehousing solutions in a Hadoop environment (using HiveQL)
  • Understand the BigSQL approach to managing structured data in Hadoop
  • Use BigSQL to store and query large data sets in Hadoop

How to Process Big Data
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. Learn the fundamental principles behind it, and how you can use its power to make sense of your Big Data.

  • How Hadoop fits into the world (recognize the problems it solves)
  • Understand the concepts of HDFS and MapReduce (find out how it solves the problems)
  • Write MapReduce programs (see how we solve the problems)
  • Practice solving problems on your own

The history of Hadoop: From 4 nodes to the future of data
(https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/)

Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.

Alone, Hadoop is a software market that IDC predicts will be worth $813 million in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions of dollars in venture capital investment.

In this four-part series, we’ll explain everything anyone concerned with information technology needs to know about Hadoop. Part I is the history of Hadoop from the people who willed it into existence and took it mainstream. Part II is more graphic: a map of the now-large and complex ecosystem of companies selling Hadoop products. Part III is a look into the future of Hadoop that should serve as an opening salvo for much of the discussion at our Structure: Data conference March 20-21 in New York. Finally, part IV will highlight some of the best Hadoop applications and seminal moments in Hadoop history, as reported by GigaOM over the years.


Top 6 Hadoop Vendors providing Big Data Solutions in Open Data Platform
(https://www.dezyre.com/article/-top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93)

With the demand for big data technologies expanding rapidly, Apache Hadoop is at the heart of the big data revolution. It is labelled as the next-generation platform for data processing because of its low cost and ultimately scalable data processing capabilities. The open source Hadoop framework is somewhat immature, and big data analytics companies are now eyeing Hadoop vendors, a growing community that delivers robust capabilities, tools and innovations for improved commercial Hadoop big data solutions. Here are the top 6 big data analytics vendors that are serving the Hadoop needs of various big data companies by providing commercial support.

Hadoop Market Share

Allied Market Research predicts that the “Hadoop-as-a-Service” market will grow to $50.2 billion by 2020. The global Hadoop market is anticipated to reach $8.74 billion by 2016, growing at a CAGR of 55.63% from 2012 to 2016. Wikibon’s latest market analysis states that spending on Hadoop software and subscriptions accounted for less than 1% of the $27.4 billion in overall Big Data spending in 2014, or approximately $187 million. Wikibon predicts that spending on Hadoop software and subscriptions will increase to approximately $677 million by the end of 2017, with the overall big data market anticipated to reach the $50 billion mark.

Big data analytics market share

Big Data and Hadoop are on the verge of revolutionizing enterprise data management architectures. Cloud and enterprise vendors are competing to stake a claim in the big data ‘gold-rush market’ with pure plays from several top Hadoop vendors. Apache Hadoop is an open source big data technology with HDFS, Hadoop Common, Hadoop MapReduce and Hadoop YARN as the core components. However, without the packaged solutions and support of commercial Hadoop vendors, Hadoop distributions can just go unnoticed.

Need for Commercial Hadoop Vendors
Today, Hadoop is an open-source, catch-all technology solution with incredible scalability, low cost storage systems and fast paced big data analytics with economical server costs.
Hadoop Vendor distributions overcome the drawbacks and issues with the open source edition of Hadoop. These distributions have added functionalities that focus on:
Support:
Most of the Hadoop vendors provide technical guidance and assistance that makes it easy for customers to adopt Hadoop for enterprise level tasks and mission critical applications.
Reliability:
Hadoop vendors promptly act in response whenever a bug is detected. With the intent to make commercial solutions more stable, patches and fixes are deployed immediately.
Completeness:
Hadoop vendors couple their distributions with various other add-on tools which help customers customize the Hadoop application to address their specific tasks.

Top Commercial Hadoop Vendors
Here is a list of top Hadoop Vendors who will play a key role in big data market growth for the coming years:

Leading commercial Hadoop distributions in the market

1) Amazon Web Services Elastic MapReduce Hadoop Distribution
Amazon has been there since the dawn of Hadoop, and Hadoopers boast of its success stories for the innovative Hadoop distributions in the open data platform. AWS Elastic MapReduce renders an easy-to-use and well-organized data analytics platform built on the powerful HDFS architecture. With a major focus on map/reduce queries, AWS EMR exploits Hadoop tools to a great extent by providing a highly scalable and secure infrastructure platform to its users. Amazon Web Services EMR is one of the top commercial Hadoop distributions, with the highest market share leading the global market.

AWS EMR handles important big data uses like web indexing, scientific simulation, log analysis, bioinformatics, machine learning, financial analysis and data warehousing. AWS EMR is the best choice for organizations who do not want to manage thousands of servers directly – as they can rent out this cloud ready infrastructure of Amazon for big data analysis.

DynamoDB is another major NoSQL database offering from AWS that was deployed to run its giant consumer website. Redshift is a completely managed petabyte-scale data analytics solution that is cost-effective for big data analysis with BI tools. Redshift has costs as low as $1000 per terabyte annually. According to Forrester, Amazon is the “King of the Cloud” for companies in need of public cloud hosted Hadoop platforms for big data management services.

2) Hortonworks Hadoop Distribution
Hortonworks features in the list of Top 100 winners of “Red Herring”. Hortonworks is a pure-play Hadoop company that drives open source Hadoop distributions in the IT market. The main goal of Hortonworks is to drive all its innovations through the Hadoop open data platform and build an ecosystem of partners that speeds up the process of Hadoop adoption amongst enterprises.

Apache Ambari is an example of a Hadoop cluster management console developed by Hortonworks for provisioning, managing and monitoring Hadoop clusters. Hortonworks is reported to attract 60 new customers every quarter, with some giant accounts like Samsung, Spotify, Bloomberg and eBay. Hortonworks has garnered strong engineering partnerships with RedHat, Microsoft, SAP and Teradata.

Hortonworks has grown its revenue at a rapid pace. The revenue generated by Hortonworks totaled $33.38 million in first nine months of 2013 which was a significant increase by 109.5% from the previous year. However, the professional services revenue generated by Hortonworks Hadoop vendor increases at a faster pace when compared to support and subscription services revenue.

3) Cloudera Hadoop Distribution
Cloudera ranks top in the big data vendors list for making Hadoop a reliable platform for business use since 2008. Cloudera, founded by a group of engineers from Yahoo, Google and Facebook, is focused on providing enterprise-ready solutions of Hadoop with additional customer support and training. Cloudera has close to 350 paying customers, including the U.S. Army, AllState and Monsanto. Some of them boast of deploying 1000 nodes on a Hadoop cluster to crunch big data analytics for one petabyte of data. Cloudera owes its long-term success to corporate partners (Oracle, IBM, HP, NetApp and MongoDB) that have been consistently pushing its services.

Cloudera is on the right path towards its goal, with 53% of the Hadoop market compared to 11% held by MapR and 16% by Hortonworks. Forrester says “Cloudera’s approach to innovation is to be loyal to core Hadoop but to innovate quickly and aggressively to meet customer demands and differentiate its solution from those of other commercial Hadoop vendors.”

4) MapR Hadoop Distribution
MapR has been recognized extensively for its advanced distributions in Hadoop marking a place in the Gartner report “Cool Vendors in Information Infrastructure and Big Data, 2012.” MapR has scored the top place for its Hadoop distributions amongst all other vendors.

MapR has made considerable investments to get over the obstacles to worldwide adoption of Hadoop which include enterprise grade reliability, data protection, integrating Hadoop into existing environments with ease and infrastructure to render support for real time operations.

In 2015, MapR plans to make further investments to maintain its significance in the Big Data vendors list. Apart from this MapR is all set to announce its technical innovations for Hadoop with the intent of supporting ‘business-as-it-happens’- to increase revenue, mitigate risks and reduce costs.

5) IBM Infosphere BigInsights Hadoop Distribution
IBM InfoSphere BigInsights is an industry-standard IBM Hadoop distribution that combines Hadoop with enterprise-grade characteristics. IBM provides BigSheets and BigInsights as a service via its SmartCloud Enterprise infrastructure. With IBM Hadoop distributions, users can easily set up and move data to Hadoop clusters in no more than 30 minutes, with a data processing rate of 60 cents per Hadoop cluster, per hour. With IBM BigInsights innovation, customers can get to market at a rapid pace with applications that incorporate advanced Big Data analytics by harnessing the power of Hadoop.

6) Microsoft Hadoop Distribution
Forrester rates the Microsoft Hadoop distribution as 4/5, based on the vendor’s current Hadoop distributions, market presence and strategy, with Cloudera and Hortonworks scoring 5/5.

Microsoft is an IT organization not known for embracing open source software solutions, but it has made efforts to run this open data platform software on Windows. Hadoop as a service in Microsoft’s big data solution is best leveraged through its public cloud product, Windows Azure’s HDInsight, particularly developed to run on Azure. There is another production-ready Microsoft feature named PolyBase that lets users search for information available on SQL Server during the execution of Hadoop queries. Microsoft has great significance in delivering a growing Hadoop stack to its customers.

Commercial Hadoop vendors continue to mature over time with increased worldwide adoption of Big Data technologies and growing vendor revenue. There are several top Hadoop vendors, namely Hortonworks, Cloudera, Microsoft and IBM. These Hadoop vendors are facing tough competition in the open data platform. With the war heating up amongst big data vendors, nobody is sure who will top the list of commercial Hadoop vendors. With the Hadoop buying cycle on the upswing, Hadoop vendors must capture market share at a rapid pace to make their venture investors happy.


The Hadoop Ecosystem Table
(https://hadoopecosystemtable.github.io/)

This page is a summary to keep track of Hadoop-related projects, focused on the FLOSS environment.


Udemy Free Online Video Courses
(https://www.udemy.com/)

Big Data and Hadoop Essentials
Essential Knowledge for everyone associated with Big Data & Hadoop
(https://www.udemy.com/big-data-and-hadoop-essentials-free-tutorial/learn/v4/overview)

Are you interested in the world of Big Data technologies, but find it a little cryptic and see the whole thing as a big puzzle?

Are you looking to understand how Big Data impacts large and small businesses and people like you and me?
Do you feel many people talk about Big Data and Hadoop but do not even know the basics, like the history of Hadoop and its major players and vendors? Then this is the course just for you!

This course builds an essential, fundamental understanding of Big Data problems and Hadoop as a solution. This course takes you through:

  1. Understanding of Big Data problems with easy to understand examples.
  2. History and advent of Hadoop right from when Hadoop wasn’t even named Hadoop.
  3. What the Hadoop magic is that makes it so unique and powerful.
  4. Understanding the difference between data science and data engineering, which is one of the big points of confusion in selecting a career or understanding a job role.
  5. And most importantly, demystifying Hadoop vendors like Cloudera, MapR and Hortonworks by learning about them.

Big Data Basics: Hadoop, MapReduce, Hive, Pig, & Spark
Learn about some of the most popular big data analysis frameworks in commercial use – and look at some real code!

(https://www.udemy.com/big-data-basics-hadoop-mapreduce-hive-pig-spark/learn/v4/overview)

Interested in analyzing big data sets? You should be – according to CareerCast, “data scientist” is the 8th most highly paid profession in the United States! If you have some coding or scripting background, you can make your experience even more valuable by understanding how to use Hadoop, MapReduce, Hive, Pig, and Spark to crunch huge data sets in parallel.

These techniques are used by some of the largest and most prestigious tech employers, including Google, Facebook, Twitter, Amazon, EBay, Yahoo, and many more. After this course, you’ll speak their language!

What We’ll Cover:

  • What is MapReduce and Hadoop?
  • What are some real-world applications of these technologies?
  • A walk-through of designing, coding, and running a real example of MapReduce using real data.
  • How Hadoop distributes computing across a cluster of machines
  • An overview of Hive, Pig, and Spark along with a couple of small examples.

Hadoop Starter Kit
Hadoop learning made easy and fun. Learn HDFS, MapReduce and introduction to Pig and Hive with FREE cluster access.
(https://www.udemy.com/hadoopstarterkit/learn/v4/overview)

The objective of this course is to walk you step by step through all the core components in Hadoop, but more importantly, to make the Hadoop learning experience easy and fun.

By enrolling in this course you can also get free access to our multi-node Hadoop training cluster so you can try out what you learn right away in a real multi-node distributed environment.

In the first section you will learn what big data is, with examples. We will discuss the factors to consider when deciding whether a problem is a big data problem or not. We will talk about the challenges with existing technologies when it comes to big data computation. We will break down the Big Data problem in terms of storage and computation and understand how Hadoop approaches the problem and provides a solution.

In the HDFS section, you will learn about the need for another file system like HDFS. We will compare HDFS with traditional file systems and discuss its benefits. We will also work with HDFS and discuss the architecture of HDFS.

In the MapReduce section you will learn about the basics of MapReduce and phases involved in MapReduce. We will go over each phase in detail and understand what happens in each phase. Then we will write a MapReduce program in Java to calculate the maximum closing price for stock symbols from a stock dataset.
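
As a rough idea of what that Java MapReduce program might look like, here is a hedged sketch of the mapper and reducer only. The record layout (comma-separated fields with the symbol first and the closing price last) is an assumption for illustration, not the course’s actual dataset, and the job driver would be wired up like any other Hadoop job.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Assumed record layout: symbol,date,open,high,low,close
    public class MaxClosePrice {

        public static class StockMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",");
                if (fields.length == 6) {
                    try {
                        // Emit (symbol, closing price); header or malformed lines are skipped.
                        context.write(new Text(fields[0]),
                                new DoubleWritable(Double.parseDouble(fields[5])));
                    } catch (NumberFormatException ignored) {
                        // skip lines whose last field is not a number
                    }
                }
            }
        }

        public static class MaxReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text symbol, Iterable<DoubleWritable> closes, Context context)
                    throws IOException, InterruptedException {
                double max = Double.NEGATIVE_INFINITY;
                for (DoubleWritable close : closes) {
                    max = Math.max(max, close.get());   // keep the highest closing price per symbol
                }
                context.write(symbol, new DoubleWritable(max));
            }
        }
    }

Because taking a maximum is associative, the same reducer class could also be registered as a combiner to cut down shuffle traffic.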

In the next two sections, we will introduce you to Apache Pig & Hive. We will try to calculate the maximum closing price for stock symbols from a stock dataset using Pig and Hive.

Learning Apache Hadoop EcoSystem- Hive
Learn Apache Hive and Start working with SQL queries which is on Data which is in Hadoop
(https://www.udemy.com/learning-apache-hive/)

This tutorial starts with understanding the need for Hive, its architecture, and the different configuration parameters in Hive. During this course you will learn different aspects of Hive and how it fits as a data warehousing platform on Hadoop. Please subscribe to my YouTube channel “Hadooparch” for more details.

This course covers Hive, the SQL of Hadoop (HQL). We will learn why and how Hive is installed and configured on Hadoop. We will cover the components and architecture of Hive to see how it stores data in table-like structures over HDFS data. You will understand the architecture, installation and configuration of Hive. We will install and configure HiveServer2 and replace the PostgreSQL database with MySQL; we will also learn how to install MySQL and configure it as the Hive metastore.

This course is full of Hive demonstrations. We’ll cover how to create databases, understand data types, create external, internal, and partitioned Hive tables, set up bucketing, load data from the local filesystem as well as the distributed filesystem (HDFS), set up dynamic partitioning, create views, manage indexes, and see how the different layers work together in Hive.
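
For a sense of what such operations look like from code, here is a hedged Java sketch that talks to HiveServer2 over JDBC; the connection URL, credentials, table definition and HDFS location are illustrative assumptions, not values taken from the course.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC endpoint; host, port and database are assumptions.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://localhost:10000/default";

            try (Connection con = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = con.createStatement()) {

                // An external, partitioned table over delimited files already sitting in HDFS.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS stocks ("
                        + " symbol STRING, close_price DOUBLE)"
                        + " PARTITIONED BY (trade_date STRING)"
                        + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
                        + " LOCATION '/data/stocks'");

                // A HiveQL query that Hive compiles into jobs running on the cluster.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT symbol, MAX(close_price) FROM stocks GROUP BY symbol")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                    }
                }
            }
        }
    }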

We will go through the different roles involved in implementing real-time projects, how projects are set up, permissions, auditing, and troubleshooting.

Hadoop Big Data – Must See Introduction to Big Data
Get Your Feet Wet in Big Data: What can you do with Big Data and Hadoop Developing
(https://www.udemy.com/hadoop-big-data-must-see-introduction-to-big-data/)

You are interested in learning about Hadoop and getting into the wide realm of Big Data. You have probably wondered what the most practical way of getting your feet wet in Hadoop and Data Science is (even if you are not interested in this, you should seriously consider it, since in the next 10-15 years Data Science and Artificial Intelligence will be everywhere).

I have learned over 14 programming languages, such as Java, Python, C++, R, Matlab, Ruby, CSS, HTML, AngularJS and JavaScript, as well as others. I have also had a successful freelancing career programming software and mobile applications, as well as working as a finance data analyst. Since technology is changing every day, it is adding new realms of complexity to Big Data and Data Science beyond what is already out there. It is getting exponentially harder for new people to learn and navigate the immense number of data science aspects. My job in this course is to demystify Big Data so you can see a clear road to success as a Hadoop developer.

This course serves to help you navigate Hadoop and know what this seemingly difficult concept really is. In this course, I cover what it is like to be a data analyst, what some jobs of data analysts are, what sorts of superpowers you can possess by learning Hadoop, and also what resources you will need on your path to development in Big Data. My hope is to transform you in 4-5 lectures from being a novice to data and what is possible with Hadoop to being someone who has a clear idea of whether he/she is interested in Hadoop and what path they can take to further their knowledge and harness the power of Hadoop.

There is no risk for you as a student in this course. I have put together a course that is not only worth your money, but also worth your time. This course encompasses the basics of Hadoop. I urge you to join me on this journey to learn how you can start to dominate data analytics and how you can supercharge your business, marketing, or clients with your superb analytics skills.

Big Data in Cloud for E-Commerce companies
A learning guide for technology enthusiasts to quickly mind map the evolving concepts such as Big Data and Cloud.

(https://www.udemy.com/big-data-in-cloud-for-e-commerce-companies/)

Welcome to this course on Big Data’s percolation into e-commerce: how Big Data in the cloud can be an added advantage. This course is targeted at tech enthusiasts who want to keep themselves up to date in the technology space. At the end of this course, we aim to help students: get an understanding of the practical implications of implementing Big Data projects in e-commerce companies; get a fair level of knowledge of what Big Data is and how it evolved; spot some of the latest updates in Big Data; understand the level of maturity attained by cloud computing; derive a high-level knowledge of how the cloud environment is favourable for implementing Big Data projects; and study some business drivers that are motivating firms to move to Data-Cloud models.

Data Analytics using Hadoop eco system
Convert NYSE data into useful insights. In this course we will perform top down analysis of stock data based on volume.

(https://www.udemy.com/data-analytics-using-hadoop-in-2-hours/learn/v4/overview)

Are you an IT professional interested in exploring Hadoop? Just take this free course and you will understand how easy it is not only to explore but also to implement proofs of concept. You need a PC or Mac with 8 GB of RAM to run the code examples. You also need to be comfortable writing basic SQL queries.

You will learn to set up the Cloudera VM and use tools like Hadoop, Hive and Hue to convert raw data into useful insights. You will also become familiar with basic Hive commands and queries to process the data, and you will be able to develop basic reports using Tableau Public and publish them to your network.
You need not be overwhelmed by the number of tools and technologies that are referred to as part of the Hadoop ecosystem.
There is no need to struggle with the command line while developing PoCs, validating data, testing the code, etc.
You can be an Architect, Developer, Tester, Analyst, Project Manager or any other IT professional.


What is Hadoop?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Big data applications: Real-world strategies for managing big data

(http://searchdatamanagement.techtarget.com/essentialguide/Big-data-applications-Real-world-strategies-for-managing-big-data)

Introduction
Big data management strategies and best practices are still evolving, but joining the big data movement has become an imperative for companies across a wide variety of industries. This guide delves into the experiences of early-adopter companies that have already deployed big data applications and technologies. IT professionals, C-level executives and industry analysts offer insights into what strategies work on big data projects and how to best integrate big data management initiatives with related processes such as data warehousing, data governance and data analytics.

The following stories explore the steps these companies took to set up big data systems and to update their approaches as needed. Readers will find practical information on implementing big data strategies, mixing Hadoop clusters and conventional data warehousing tools, incorporating big data analytics into the process and translating big data ideas into successful deployments.

Big data strategy

Advertising firm attracts clients with big data strategy
The search advertising company adMarketplace processes billions of ad requests daily in near real time using a pay-per-click platform. Due partly to the high level of data customization it offers to advertisers, adMarketplace processes 100 gigabytes of data per hour. The first article below explains how the company implemented a platform combining a traditional data warehouse and a NoSQL database to power the big data environment that feeds its search syndication system. Other stories in this section offer more insights into managing and using big data and how it fits into the data warehousing and data governance process.

Successful big data projects

Sprint’s big data project highlighted at IBM conference
At IBM’s most recent Information on Demand conference, an executive from Sprint presented information on the telecommunications company’s efforts to harness big data to gain new insights on how customers are using its network. An IBM vice president also offered guidance for other companies interested in tackling big data initiatives. Below, find related articles that explore IBM’s own use of big data tools and provide advice on developing and implementing big data strategies.

Big data tools

Managing online data: comScore explains big data analytics move
In an effort to improve upon its big data analytics program, Web analytics and customer intelligence services provider comScore Inc. moved from one Hadoop distribution to a competing platform. The lead article below explores the reasons why and offers advice on big data analytics and management best practices. The other related stories offer additional insight and information on Hadoop and other big data technologies.

Data warehouses and big data

Big data applications create new opportunities — and challenges
Catalina Marketing Corp., which tracks and analyzes the shopping activities of consumers, was managing huge data sets long before big data became a C-suite phrase. In the first article in this section, Catalina’s now-retired CIO offers advice to organizations that are launching big data management and analytics strategies, highlighting the importance of using distributed systems to lighten the load on data warehouses. Other users and analysts weigh in as well. The related stories below further explore ideas for effectively combining data warehousing and big data management technologies.

Glossary

Big data terminology
Terms like “big data” are used in a variety of contexts. Check out the technical definitions here.

Big data
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.

Big data management
Big data management is the organization, administration and governance of large volumes of both structured and unstructured data.

Big data analytics
Big data analytics is the process of examining large data sets containing a variety of data types (i.e., big data) to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits.

Big data as a service (BDaaS)
Big data as a service (BDaaS) is the delivery of statistical analysis tools or information by an outside provider that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.

NoSQL (Not Only SQL database)
NoSQL database, also called Not Only SQL, is an approach to data management and database design that’s useful for very large sets of distributed data.

Data warehouse
A data warehouse is a federated repository for all the data that an enterprise’s various business systems collect. The repository may be physical or logical.

Data mining
Data mining is sorting through data to identify patterns and establish relationships.

Data mining parameters include:

  • Association – looking for patterns where one event is connected to another event
  • Sequence or path analysis – looking for patterns where one event leads to another later event
  • Classification – looking for new patterns (this may result in a change in the way the data is organized, but that is okay)
  • Clustering – finding and visually documenting groups of facts not previously known
  • Forecasting – discovering patterns in data that can lead to reasonable predictions about the future (This area of data mining is known as predictive analytics.)

Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. Web mining, a type of data mining used in customer relationship management (CRM), takes advantage of the huge amount of information gathered by a Web site to look for patterns in user behavior.


IBM Bluemix
(https://www.ibm.com/cloud-computing/bluemix/)

The cloud platform to accelerate innovation on both sides of the firewall

Analytics for Hadoop on Bluemix: Sign in and access the service
(https://www.ibm.com/developerworks/library/ba-analytics-for-hadoop1/index.html)

It has never been easier to get your hands on Apache Hadoop. And now with IBM® Bluemix, you can have a Hadoop instance that runs in the cloud in minutes. IBM’s Analytics for Hadoop Bluemix service allows developers to use the power of IBM’s Hadoop offering, IBM InfoSphere® BigInsights™, to power the latest mobile and web applications. In addition, you are also able to quickly deploy a single cluster of BigInsights in Bluemix, so that you can start playing with Hadoop directly on the cloud. And for a limited time—free! This tutorial series explains how.


Analytics for Hadoop on Bluemix: Navigate the BigInsights web console
(https://www.ibm.com/developerworks/library/ba-analytics-for-hadoop2/index.html)

The first tutorial in this series helped you get an active Bluemix account and IBM ID, with the Analytics for Hadoop instance up and running in your browser. Now, learn how to use the IBM InfoSphere BigInsights web console to check the status of services, view the health of your system, and monitor the status of your big data environment.


Analytics for Hadoop on Bluemix: Load data into InfoSphere BigInsights
(https://www.ibm.com/developerworks/analytics/library/ba-analytics-for-hadoop3/index.html)

Business data is stored in various formats and sources. Before you import your data into the IBM® InfoSphere BigInsights distributed file system, you must:

  • Determine what questions you want to answer through analysis
  • Identify the data type of your sources
  • Use the tools and procedures that best fit your business need

Analytics for Hadoop on Bluemix: Explore data with BigSheets
(https://www.ibm.com/developerworks/analytics/library/ba-analytics-for-hadoop4/index.html)

The previous tutorials in this series helped you sign up for a Bluemix account, select and start the Analytics for Hadoop service, explore the BigInsights™ web console, and upload data to BigInsights.

In this tutorial, analyze and review big data with BigSheets: a browser-based, spreadsheet-like tool that can model, filter, combine, and chart data that is collected from multiple sources, and ships with all versions of BigInsights.


Jhajj, Raman – Apache Hadoop Cookbook – Hot Recipes for Apache Hadoop
(JavaCodeGeeks)
Download from here