Hadoop Training | Sue Brandreth's Learning Resources

Hadoop Training – Become a Big Data Expert with Free, Comprehensive Hadoop Online Training Courses
(https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training)

Hadoop On-Demand Training offers full-length courses on a range of Hadoop technologies for developers, data analysts and administrators. Designed in a format that meets your convenience, availability and flexibility needs, these courses will lead you on the path to becoming a certified Hadoop professional.

HDE 100 – Hadoop Essentials
This is an introductory level course about big data, Hadoop and the Hadoop ecosystem of products. Covered are a big data definition, details about the Hadoop core components, and examples of several common Hadoop use cases: enterprise data hub, large scale log analysis, and building recommendation engines.

HDE 110 – MapR Distribution Essentials
This course is an introduction to the features of the MapR Distribution including Hadoop. Topics include the basic architectural components of the MapR file system (MapR-FS), and information on how this architecture overcomes the limitations of the Hadoop Distributed File System (HDFS). You will also learn the basics of designing MapR-DB tables and how to migrate between HBase and MapR-DB tables. At end of the course, you will be able to describe the components of MapR-FS, compare and contrast HDFS to MapR-FS, and describe the architectural advantages of MapR-DB.

ADM 200 – Cluster Administration: Install a MapR Cluster
This is the first course in the Cluster Administration curriculum. This course covers pre-installation testing and verification, installing a MapR cluster, and performing post-installation benchmarking.

ADM 201 – Cluster Administration: Configure a MapR Cluster
ADM 201 is the second course in the Cluster Administration curriculum. This course covers how to configure the cluster’s storage resources once the cluster has been installed.

ADM 202 – Cluster Administration: Data Access and Protection
ADM 202 is the third course in the Cluster Administration curriculum. This course defines methods for data ingestion, and covers the use of snapshots and mirrors.

ADM 203 – Cluster Administration: Cluster Maintenance
This is the fourth and final course in the Cluster Administration curriculum. This course teaches you how to configure cluster settings, monitor the cluster, resolve issues, and optimize cluster performance.

DEV 301 – Developing Hadoop Applications
This course teaches developers, with lectures and hands-on lab exercises, how to write Hadoop Applications using MapReduce and YARN in Java. The course extensively covers MapReduce programming, debugging, managing jobs, improving performance, working with custom data, managing workflows, and using other programming languages for MapReduce.

DEV 320 – HBase Data Model and Architecture
This course is intended for data analysts, data architects and application developers. DEV 320 provides you with a thorough understanding of the HBase data model and architecture, which is required before going on to designing HBase schemas and developing HBase applications.

DEV 325 – HBase Schema Design
Targeted towards data analysts, data architects and application developers, the goal of this course is to enable you to design HBase schemas based on design guidelines. You will learn about the various elements of schema design and how to design for data access patterns. The course offers an in-depth look at designing row keys, avoiding hot-spotting and designing column families. It discusses how to transition from a relational model to an HBase model. You will learn the differences between tall tables and wide tables. Concepts are conveyed through lectures, hands-on labs and analysis of scenarios.

DEV 330 – Developing HBase Applications: Basics
Targeted towards data architects and application developers who have experience with Java, the goal of this course is to learn how to write HBase programs using Hadoop as a distributed NoSQL datastore.

DEV 335 – Developing HBase Applications: Advanced
Targeted towards data architects and application developers who have experience with Java, the goal of this series of courses is to learn how to write HBase programs using Hadoop as a distributed NoSQL datastore. This course builds on DEV 320 and 325 – HBase Data Model and Schema Design. This is a continuation of DEV 330 – Developing HBase Applications: Basics.

DEV 340 – Apache HBase Applications: Bulk Loading, Performance & Security
Targeted towards data analysts, data architects, and application developers, the goal of this course is to learn more about architecting your Apache HBase applications for performance and security. This course covers how to bulk load data into HBase, performance considerations and tips for designing your HBase application, benchmarking and monitoring your HBase application, and MapR-DB security. Concepts are conveyed through lectures, hands-on labs, and scenario analyses.

DEV 350 – MapR Streams Essentials
This introductory-level course teaches the core concepts necessary to understand and begin using MapR Streams to develop big data processing applications.

DEV 351 – Developing MapR Streams Applications
This course is targeted towards developers and administrators to give them the core concepts necessary to build simple MapR Streams applications.

DEV 360 – Apache Spark Essentials
This introductory course enables developers to get started developing big data applications with Apache Spark. In the first part of the course, you will use Spark’s interactive shell to load and inspect data. The course then describes the various modes for launching a Spark application. You will then go on to build and launch a standalone Spark application.

DEV 361 – Build and Monitor Apache Spark Applications
This course is the second in the Apache Spark series. You will learn to create and modify pair RDDs, perform aggregations, and control the layout of pair RDDs across nodes with data partitioning. This course also discusses Spark SQL and DataFrames, the programming abstraction of Spark SQL. This course also describes the components of the Spark execution model using the Spark Web UI to monitor Spark applications.

DEV 362 – Create Data Pipeline Applications Using Apache Spark
This course is the third in the Apache Spark series. In this course, you cover the following Apache Spark libraries – Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX. This course describes the benefits of the Apache Spark unified platform and how to build a data pipeline application using Spark Streaming, Spark SQL, Spark GraphX, and MLlib. The concepts are taught using scenarios in Scala that also form the basis of hands-on labs.

DA 410 – Apache Drill Essentials
This introductory Apache Drill course, targeted at Data Analysts, Scientists and SQL programmers, covers how to use Drill to explore known or unknown data without writing code. You will write SQL queries on a variety of data types including structured data in a Hive table, semi-structured data in HBase or MapR-DB, and complex data file types, such as Parquet and JSON.

DA 415 – Apache Drill Architecture
DA 415 is an intermediate level course designed for data analysts, developers, and systems administrators. It is a continuation of DA 410 – Apache Drill Essentials, and describes how a query is received and executed by Drill. You will learn the different services involved at each step, and how Drill optimizes a query for distributed SQL execution.

DA 440 – Apache Hive Essentials
DA 440 is an introductory-level course designed for data analysts and developers. You will learn how Apache Hive fits in the Hadoop ecosystem, how to create and load tables in Hive, and how to query data using the Hive Query Language.

DA 450 – Apache Pig Essentials
DA 450 – Apache Pig Essentials is an introductory-level course designed for data analysts and developers. The course begins with a review of data pipeline tools, then covers how to load and manipulate relations in Pig.

MCHA – MapR Certified Hadoop Administrator
This certification exam measures technical knowledge, skill, and ability to configure, deploy, maintain, and secure a Hadoop cluster. This exam covers the architecture of a Hadoop cluster, planning and preparing the nodes, data ingestion, disaster recovery, availability, management and monitoring.

MCHBD – MapR Certified HBase Developer
This certification exam measures and validates the technical knowledge, skills and abilities required to write HBase programs using HBase as a distributed NoSQL datastore. This exam covers HBase architecture, the HBase data model, APIs, schema design, performance tuning, bulk-loading of data, and storing complex data structures.

MCHD – MapR Certified Hadoop Developer
This certification exam measures the specific technical knowledge, skills and abilities required to design and develop MapReduce programs in Java. This exam covers writing MapReduce programs, using MapReduce API, managing, monitoring and testing MapReduce programs and workflows.

MCSD – MapR Certified Spark Developer
The MapR Certified Spark Developer credential is designed for Engineers, Programmers, and Developers who prepare and process large amounts of data using Spark. The certification tests your ability to use Spark in a production environment; where coding knowledge is tested, we lean toward the use of Scala for our code samples.

The “Getting Started with Hadoop” Tutorial
(https://www.cloudera.com/developers/get-started-with-hadoop-tutorial.html)

Getting started with the Apache Hadoop stack can be a challenge, whether you’re a computer science student or a seasoned developer. There are many moving parts, and unless you get hands-on experience with each of those parts in a broader use-case context with sample data, the climb will be steep.

Following this tutorial—which is based on the use of Cloudera Live (a full, cloud-based Hadoop cluster for testing and learning) as a back-end demo environment—will not only give you examples on how to get started with some of the tools provided in CDH, Cloudera’s platform containing Hadoop and related projects, and how to manage your services via Cloudera Manager, but also give you a taste of what it means to “Ask bigger questions.” By the end of this tutorial you will:

Understand how to use some of the powerful tools in CDH
Know how to setup and execute some basic business intelligence and analytics use cases
Be able to explain to your manager why you deserve a raise!

If you intend to use Cloudera Live as your demo environment for this tutorial (which we recommend), register now and wait for your instructions to arrive. Or, follow along on your desktop using the QuickStart VM.

Note: Some parts of this tutorial require Cloudera Manager to be running; other parts also require an enterprise license or trial to be activated. To enable these parts of the tutorial, choose one of the following options:

To use Cloudera Express (free), run Launch Cloudera Express on the Desktop in Cloudera Manager. This requires at least 8 GB of RAM and at least 2 virtual CPUs.
To begin a 60-day trial of Cloudera Enterprise with advanced management features, run Launch Cloudera Enterprise (trial) on the Desktop. This requires at least 10 GB of RAM and at least 2 virtual CPUs.

Cloudera – Video Tutorials
(https://www.cloudera.com/training/library/developers.html)

Resources for Developers

Cloudera University’s e-learning courses present a deeper dive into the projects, skills, and techniques that aid and complement the core topics covered by the developer learning path. These on-demand videos address the concepts required to achieve true expertise. They also include interactive demonstrations and lab instruction so that you can work your way through technical challenges in your own time and at your own pace.

Online Training: Hadoop Essentials
Learn how Apache Hadoop addresses the limitations of traditional computing, helps businesses overcome real challenges, and powers new types of big data analytics. This series also introduces the rest of the Apache Hadoop ecosystem and outlines how to prepare the data center and manage Hadoop in production.

Explore the basics of Apache Hadoop and learn how Cloudera products and services addresses the limitations of traditional computing, helps businesses overcome real challenges, and powers new types of Big Data analytics. This series also introduces the rest of the Apache Hadoop ecosystem and outlines how to prepare the data center and manage Hadoop in production.

Cloudera Essentials for Apache Hadoop
(https://www.cloudera.com/training/library/hadoop-essentials.html)

Learn how Apache Hadoop addresses the limitations of traditional computing, helps businesses overcome real challenges, and powers new types of big data analytics. This series also introduces the rest of the Apache Hadoop ecosystem and outlines how to prepare the data center and manage Hadoop in production.

Chapters 1: Introduction
Explore the basics of Apache Hadoop, including the Hadoop Distributed File System (HDFS), MapReduce, and the anatomy of a Hadoop cluster.

Chapters 2: Hadoop Basics
Explore the theory behind the creation of Hadoop, the anatomy of Hadoop itself, and an overview of complimentary projects that make up the Hadoop ecosystem. We then share several Hadoop use cases across a few industries including financial services, insurance, telecommunications, intelligence, and healthcare.

Chapter 3 : Hadoop Basic Concepts
There are many components working together in the Apache Hadoop stack. By understanding how each functions, you gain more insight into Hadoop’s functionality in your own IT environment. This chapter goes beyond the motivation for Apache Hadoop and dissects the Hadoop Distributed File System (HDFS), MapReduce, and the general topology of a Hadoop cluster.

Chapter 4: Hadoop Solutions
Learn how Apache Hadoop is used in the real world. This chapter explores ways to use Apache Hadoop to harness Big Data and solve business problems in ways never before imaginable. Explore common business challenges that can be addressed using Hadoop, the origins of Big Data, types of analyses powered by Hadoop, and industry use cases for Hadoop.

Chapter 5: The Hadoop Ecosystem
Various projects make up the Apache Hadoop ecosystem, and each improves data storage, management, interaction, and analysis in its own unique way. This chapter reviews Hive, Pig, Impala, HBase, Flume, Sqoop, and Oozie, how they function within the stack and how they help integrate Hadoop within the production environment.

Chapter 6: Managing Your Hadoop Solution
It is critical to understand how Apache Hadoop will affect the current setup of the data center and to plan ahead. This chapter helps you seamlessly integrate the platform into your environment. Find out what resources are required to deploy Hadoop, how to plan for cluster capacity, and how to staff for your Big Data strategy.

Chapter 7: Conclusion
Once you have Hadoop implemented in your environment, what’s next? How do you get the most out of the technology while managing it on a daily basis? This chapter reviews the previous topics, introduces CDH (Cloudera’s Distribution Including Apache Hadoop), and describes how Cloudera can help you maximize the value of all your data.

Online Training: Introduction to Hadoop and MapReduce
Start on your path to big data expertise with our open, online Udacity course. Cloudera University’s free three-lesson program covers the fundamentals of Hadoop, including getting hands-on by developing MapReduce code on data in HDFS.

Hortonworks Tutorials
(http://hortonworks.com/tutorials/)

Get started on Hadoop with these tutorials based on the Hortonworks Sandbox

Develop with Hadoop
Start developing with Hadoop. These tutorials are designed to ease your way into developing with Hadoop:

Apache Spark on HDP
Apache Zeppelin on HDP 2.4
Apache Zeppelin on HDP Technical Preview
Hands-on Tour of Apache Spark in 5 Minutes
Introduction Apache…
A Lap Around Apache Spark
Interacting with Data on HDP using Apache Zeppelin and Apache Spark
Using Hive with ORC from Apache Spark
Using IPython Notebook with Apache Spark

Hello World
These tutorials are a great jumping off point for your journey with Hadoop.
Learning the Ropes of the Hortonworks Sandbox
Learning the Ropes of the Hortonworks Sandbox Introduction This tutorial is aimed for users who do not have much experience in using the Sandbox….
Hadoop Tutorial – Getting Started with HDP
This tutorial will help you get started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application.
Zoomdata
Faster Pig with Tez
Introduction In this tutorial, you will explore the difference between running pig with execution engine of MapReduce and Tez.
How to Process Data with Apache Hive
This Hadoop tutorial shows how to Process Data with Hive using a set of Baseball statistics on American players from 1871-2011.
How To Process Data with Apache Pig
This Hadoop tutorial shows how to Process Data with Apache Pig using a set of Baseball statistics on American players from 1871-2011.
Exploring Data with Apache Pig from the Grunt shell
In this tutorial, you will learn how to load a data file into HDFS; Learn about ‘FILTER, FOREACH’ with examples; storing values into HDFS and Grunt shell’s file commands.
Loading and Querying Data with Hadoop
In this tutorial, we will load and review data for a fictitious web retail store in what has become an established use case for Hadoop: deriving insights from large data sources such as web logs.
Get Started with Cascading on Hortonworks Data Platform 2.1
How to get started with Cascading and Hortonworks Data Platform using the Word Count Example.
Get Started with Cascading on Hortonworks Data Platform 2.1
Cascading Pattern
Learn how to use Cascading Pattern to quickly migrate Predictive Models (PMML) from SAS, R, MicroStrategy onto Hadoop and deploy them at scale.
Processing streaming data in Hadoop with Apache Storm
How to use Apache Storm to process real-time streaming data in Hadoop with Hortonworks Data Platform.
Interactive Query for Hadoop with Apache Hive on Apache Tez
How to use Apache Tez and Apache Hive for Interactive Query with Hadoop and Hortonworks Data Platform 2.1
Indexing and Searching Documents with Apache Solr
In this tutorial we will walk through how to run Solr in Hadoop with the index (solr data files) stored on HDFS and using a map reduce jobs to index files.
Define and Process Data Pipelines in Hadoop with Apache Falcon
Use Apache Falcon to define an end-to-end data pipeline and policy for Hadoop and Hortonworks Data Platform 2.1
Introducing Apache Hadoop to Java Developers
Introduction In this tutorial for Hadoop Developers, we will explore the core concepts of Apache Hadoop and examine the process of writing a MapReduce…

Real World Examples
Indexing and Searching text within images with Apache Solr
A very common request from many customers is to be able to index text in image files; for example, text in scanned PNG files.
Incremental Backup of Data from HDP to Azure using Falcon for Disaster Recovery and Burst capacity
Realtime Event Processing in Hadoop with Storm and Kafka
Processing Real-time events with Apache Storm
In this tutorial, we will explore Apache Storm and use it with Apache Kafka to develop a multi-stage event processing pipeline….
Real time Data Ingestion in HBase & Hive using Storm Bolt
In this tutorial, we will build a solution to ingest real time streaming data into HBase and HDFS.
In previous tutorial we have explored generating and processing streaming data with Apache Kafka and Apache Storm. In this tutorial we will create HDFS Bolt & HBase Bolt to read the streaming data from the Kafka Spout and persist in Hive & HBase tables.
Visualize Website Clickstream Data
How do you improve the chances that your online customers will complete a purchase? Hadoop makes it easier to analyze and then change how visitors behave on your website. Here you can see how an online retailer optimized buying paths to reduce bounce rates and improve conversions. HDP can help you capture and refine website clickstream data to exceed your company’s e-commerce goals. The tutorial that comes with this video describes how to refine raw clickstream data using HDP.
How to Refine and Visualize Server Log Data
Security breaches happen. And when they do, server log analysis helps you identify the threat and then protect yourself better in the future. See how Hadoop takes server-log analysis to the next level by speeding forensics, retaining log data for longer and demonstrating compliance with IT policies. The tutorial that comes with this video describes how to refine raw server log data using HDP.
Analyzing Social Media and Customer Sentiment
With Hadoop, you can mine Twitter, Facebook and other social media conversations to analyze customer sentiment about you and your competition. With more social Big Data, you can make more targeted, real-time, decisions. The tutorial that comes with this video describes how to refine raw Twitter data using HDP.
How To Analyze Machine and Sensor Data
Machines know things. Sensors stream low-cost, always-on data. Hadoop makes it easier for you to store and refine that data and identify meaningful patterns, providing you with the insight to make proactive business decisions using predictive analytics. See how Hadoop can be used to analyze heating, ventilation and air conditioning data to maintain ideal office temperatures and minimize expenses
Natural Language Processing and Sentiment Analysis for Retailers using HDP and ITC Infotech Radar
RADAR is a software solution for retailers built using ITC Handy tools (NLP and Sentiment Analysis engine) and utilizing Hadoop technologies in …
Predictive Analytics on H2O and Hortonworks Data Platform
Introduction H2O is the open source in memory solution from 0xdata for predictive analytics on big data. It is a math and machine learning engine…

Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL
How to get started with Hadoop and your favorite databases
(https://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/)

This article is focused on explaining Big Data and then providing simple worked examples in Hadoop, the major open-source player in the Big Data space. You’ll be happy to hear that Hadoop is NOT a replacement for Informix® or DB2®, but in fact plays nicely with the existing infrastructure. There are multiple components in the Hadoop family and this article will drill down to specific code samples that show the capabilities. No Elephants will stampede if you try these examples on your own PC.

Hadoop Tutorial – Getting Started with HDP
(http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/)

Introduction
This tutorial describes how to refine data for a Trucking IoT Data Discovery (aka IoT Discovery) use case using the Hortonworks Data Platform. The IoT Discovery use cases involves vehicles, devices and people moving across a map or similar surface. Your analysis is interested in tying together location information with your analytic data. Hello World is often used by developers to familiarize themselves with new concepts by building a simple program. This tutorial aims to achieve a similar purpose by getting practitioners started with Hadoop and HDP. We will use an Internet of Things (IoT) use case to build your first HDP application. For our tutorial we are looking at a use case where we have a truck fleet. Each truck has been equipped to log location and event data. These events are streamed back to a datacenter where we will be processing the data. The company wants to use this data to better understand risk. Here is the video of Analyzing Geolocation Data to show you what you’ll be doing in this tutorial.

The goal of this tutorial is that you get familiar with the basics of following:

Hadoop and HDP
Ambari File User Views and HDFS
Ambari Hive User Views and Apache Hive
Ambari Pig User Views and Apache Pig
Apache Spark
Data Visualization with Excel (Optional)
Data Visualization with Zeppelin (Optional)
Data Visualization with Zoomdata (Optional)

Outline

Introduction
Pre-Requisites
1. Data Set Used: Geolocation.zip
2. Latest Hortonworks Sandbox Version
3. Learning the Ropes of the Hortonworks Sandbox – Become familiar with your Sandbox and Ambari.
Tutorial Overview
Goals of the Tutorial (outcomes)
Hadoop Data Platform Concepts (New to Hadoop or HDP- Refer following)
1. Apache Hadoop and HDP (5 Pillars)
2. Apache Hadoop Distributed File System (HDFS)
3. Apache YARN
4. Apache MapReduce
5. Apache Hive
6. Apache Pig
Get Started with HDP Labs
Next Steps/Try These
1. Practitioner Journey- As a Hadoop Practitioner you can adopt following learning paths
  - Hadoop Developer – Click Here!
  - Hadoop Administrator –Click Here!
  - Data Scientist – Click Here!
2. Case Studies – Learn how Hadoop is being used by various industries.
References and Resources

A Beginners Guide to Hadoop
(http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html)

The goal of this article is to provide a 10,000 foot view of Hadoop for those who know next to nothing about it. This article is not designed to get you ready for Hadoop development, but to provide a sound knowledge base for you to take the next steps in learning the technology.

MapReduce Tutorial
(https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html)

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture Guide).

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

MapReduce Tutorial

Hadoop Tutorial: Developing Big-Data Applications with Apache Hadoop
(http://www.coreservlets.com/hadoop-tutorial/)

Following is an extensive series of tutorials on developing Big-Data Applications with Hadoop. Since each section includes exercises and exercise solutions, this can also be viewed as a self-paced Hadoop training course. All the slides, source code, exercises, and exercise solutions are free for unrestricted use. Click on a section below to expand its content. The relatively few parts on IDE development and deployment use Eclipse, but of course none of the actual code is Eclipse-specific. These tutorials assume that you already know Java; they definitely move too fast for those without at least moderate prior Java experience. If you don’t already know the Java language, please see the Java programming tutorial series.

Free Hadoop Tutorial: Master BigData
(http://www.guru99.com/bigdata-tutorials.html)

BigData is the latest buzzword in the IT Industry. Apache’s Hadoop is a leading Big Data platform used by IT giants Yahoo, Facebook & Google. This course is geared to make a Hadoop Expert.