Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more! Over 25 technologies. Includes 14.5 hours of on-demand video and a certificate of completion.
Also available at Udemy
Buy This Course
Learn at your own pace! Lifetime access to all course videos and materials for this course, with a one-time payment.
The world of Hadoop and “Big Data” can be intimidating – hundreds of different technologies with cryptic names form the Hadoop ecosystem. With this Hadoop tutorial, you’ll not only understand what those systems are and how they fit together – but you’ll go hands-on and learn how to use them to solve real business problems!
Learn and master the most popular big data technologies in this comprehensive course, taught by a former engineer and senior manager from Amazon and IMDb. We’ll go way beyond Hadoop itself, and dive into all sorts of distributed systems you may need to integrate with.
- Install and work with a real Hadoop installation right on your desktop with Hortonworks (now part of Cloudera) and the Ambari UI
- Manage big data on a cluster with HDFS and MapReduce
- Write programs to analyze data on Hadoop with Pig and Spark
- Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
- Design real-world systems using the Hadoop ecosystem
- Learn how your cluster is managed with YARN, Mesos, Zookeeper, Oozie, Zeppelin, and Hue
- Handle streaming data in real time with Kafka, Flume, Spark Streaming, Flink, and Storm
Understanding Hadoop is a highly valuable skill for anyone working at companies with large amounts of data.
Almost every large company you might want to work at uses Hadoop in some way, including Amazon, eBay, Facebook, Google, LinkedIn, IBM, Spotify, Twitter, and Yahoo! And it’s not just technology companies that need Hadoop; even the New York Times uses Hadoop for processing images.
This course is comprehensive, covering over 25 different technologies in over 14 hours of video lectures. It’s filled with hands-on activities and exercises, so you get some real experience in using Hadoop – it’s not just theory.
You’ll find a range of activities in this course for people at every level. If you’re a project manager who just wants to learn the buzzwords, many of the activities in the course have web UIs that require no programming knowledge. If you’re comfortable with command lines, we’ll show you how to work with them too. And if you’re a programmer, I’ll challenge you with writing real scripts on a Hadoop system using Scala, Pig Latin, and Python.
You’ll walk away from this course with a real, deep understanding of Hadoop and its associated distributed systems, and you can apply Hadoop to real-world problems. Plus a valuable completion certificate is waiting for you at the end!
Please note that the focus of this course is on application development, not Hadoop administration, although you will pick up some administration skills along the way.
Knowing how to wrangle “big data” is an incredibly valuable skill for today’s top tech employers. Don’t be left behind – enroll now!
Andrew Corkill
This is an excellent course for someone looking to understand the Hadoop ecosystem as a whole. While it covers a lot of different technologies, they are still covered in enough detail, with working examples, to gain a sufficient understanding of the basics to get started (the hardest step), and also of what problem each technology solves and its strengths and weaknesses. The most valuable part of all this for me was not just the information about when to use each technology, but the consistent comparison between each technology and other similar ones, giving a great set of points for deciding which to use in a given business case, and how to combine the technologies into a system. This was invaluable given the number of technologies around that all do similar things, and it really helped clear up the decision-making process. Good job Frank. A+
Bobby Baker
Frank is one of the best instructors on any platform! His clear explanations, obviously from years of hands-on, real-world experience, and course flow make it a pleasure to learn from Frank. I’m a 40-year veteran of IT and always learn from Frank’s courses!
Jose Eduardo Thurler Tecles
Great course! I had no previous knowledge about Hadoop and I learned a lot. Very clear explanations, good hands-on practices.
Varun Tyagi
The instructor explained everything very, very clearly. The best part is the hands-on work after introducing each technology. I would recommend this course to every aspiring Big Data engineer or professional.
Frank Kane
Author
Our courses are led by Frank Kane, a former Amazon and IMDb developer with extensive experience in machine learning and data science. With 26 issued patents and 9 years of experience at the forefront of recommendation systems, Frank brings real-world expertise to his teaching. His ability to explain complex concepts in accessible terms has helped over one million students worldwide gain valuable skills in machine learning, data engineering, and AI development.
Learn all the buzzwords! And install the Hortonworks Data Platform Sandbox.
If you have trouble downloading Hortonworks Data Platform…
Lesson 1 of 4 within section Learn all the buzzwords! And install the Hortonworks Data Platform Sandbox.
You must enroll in this course to access course content.
Lesson 2 of 4 within section Learn all the buzzwords! And install the Hortonworks Data Platform Sandbox.
Hadoop Overview and History
Lesson 3 of 4 within section Learn all the buzzwords! And install the Hortonworks Data Platform Sandbox.
Using Hadoop’s Core: HDFS and MapReduce
HDFS: What it is, and how it works
Lesson 1 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
Installing the MovieLens Dataset
Lesson 2 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
[Activity] Install the MovieLens dataset into HDFS using the command line
Lesson 3 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
MapReduce: What it is, and how it works
Lesson 4 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
How MapReduce distributes processing
Lesson 5 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
MapReduce example: Break down movie ratings by rating score
Lesson 6 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
[Activity] Installing Python, MRJob, and nano
Lesson 7 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
[Activity] Code up the ratings histogram MapReduce job and run it
Lesson 8 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
[Exercise] Rank movies by their popularity
Lesson 9 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
[Activity] Check your results against mine!
Lesson 10 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
File Formats (Avro, Parquet, ORC, Protobuf, JSON, XML)
Lesson 11 of 11 within section Using Hadoop's Core: HDFS and MapReduce.
Programming Hadoop with Pig
Lesson 1 of 7 within section Programming Hadoop with Pig.
Lesson 2 of 7 within section Programming Hadoop with Pig.
Example: Find the oldest movie with a 5-star rating using Pig
Lesson 3 of 7 within section Programming Hadoop with Pig.
[Activity] Find old 5-star movies with Pig
Lesson 4 of 7 within section Programming Hadoop with Pig.
Lesson 5 of 7 within section Programming Hadoop with Pig.
[Exercise] Find the most-rated one-star movie
Lesson 6 of 7 within section Programming Hadoop with Pig.
Pig Challenge: Compare Your Results to Mine!
Lesson 7 of 7 within section Programming Hadoop with Pig.
Programming Hadoop with Spark
Lesson 1 of 8 within section Programming Hadoop with Spark.
The Resilient Distributed Dataset (RDD)
Lesson 2 of 8 within section Programming Hadoop with Spark.
[Activity] Find the movie with the lowest average rating – with RDDs
Lesson 3 of 8 within section Programming Hadoop with Spark.
Lesson 4 of 8 within section Programming Hadoop with Spark.
[Activity] Find the movie with the lowest average rating – with DataFrames
Lesson 5 of 8 within section Programming Hadoop with Spark.
[Activity] Movie recommendations with MLLib
Lesson 6 of 8 within section Programming Hadoop with Spark.
[Exercise] Filter the lowest-rated movies by number of ratings
Lesson 7 of 8 within section Programming Hadoop with Spark.
[Activity] Check your results against mine!
Lesson 8 of 8 within section Programming Hadoop with Spark.
Using relational data stores with Hadoop
Lesson 1 of 9 within section Using relational data stores with Hadoop.
[Activity] Use Hive to find the most popular movie
Lesson 2 of 9 within section Using relational data stores with Hadoop.
Lesson 3 of 9 within section Using relational data stores with Hadoop.
[Exercise] Use Hive to find the movie with the highest average rating
Lesson 4 of 9 within section Using relational data stores with Hadoop.
Compare your solution to mine.
Lesson 5 of 9 within section Using relational data stores with Hadoop.
Integrating MySQL with Hadoop
Lesson 6 of 9 within section Using relational data stores with Hadoop.
[Activity] Install MySQL and import our movie data
Lesson 7 of 9 within section Using relational data stores with Hadoop.
[Activity] Use Sqoop to import data from MySQL to HDFS/Hive
Lesson 8 of 9 within section Using relational data stores with Hadoop.
[Activity] Use Sqoop to export data from Hadoop to MySQL
Lesson 9 of 9 within section Using relational data stores with Hadoop.
Using non-relational data stores with Hadoop
Lesson 1 of 13 within section Using non-relational data stores with Hadoop.
Lesson 2 of 13 within section Using non-relational data stores with Hadoop.
[Activity] Import movie ratings into HBase
Lesson 3 of 13 within section Using non-relational data stores with Hadoop.
[Activity] Use HBase with Pig to import data at scale.
Lesson 4 of 13 within section Using non-relational data stores with Hadoop.
Lesson 5 of 13 within section Using non-relational data stores with Hadoop.
If you have trouble installing Cassandra…
Lesson 6 of 13 within section Using non-relational data stores with Hadoop.
[Activity] Installing Cassandra
Lesson 7 of 13 within section Using non-relational data stores with Hadoop.
[Activity] Write Spark output into Cassandra
Lesson 8 of 13 within section Using non-relational data stores with Hadoop.
[Activity] Install MongoDB, and integrate Spark with MongoDB
Lesson 10 of 13 within section Using non-relational data stores with Hadoop.
[Activity] Using the MongoDB shell
Lesson 11 of 13 within section Using non-relational data stores with Hadoop.
Choosing a database technology
Lesson 12 of 13 within section Using non-relational data stores with Hadoop.
[Exercise] Choose a database for a given problem
Lesson 13 of 13 within section Using non-relational data stores with Hadoop.
Querying your Data Interactively
Lesson 1 of 9 within section Querying your Data Interactively.
[Activity] Setting up Drill
Lesson 2 of 9 within section Querying your Data Interactively.
[Activity] Querying across multiple databases with Drill
Lesson 3 of 9 within section Querying your Data Interactively.
Lesson 4 of 9 within section Querying your Data Interactively.
[Activity] Install Phoenix and query HBase with it
Lesson 5 of 9 within section Querying your Data Interactively.
[Activity] Integrate Phoenix with Pig
Lesson 6 of 9 within section Querying your Data Interactively.
Lesson 7 of 9 within section Querying your Data Interactively.
[Activity] Install Presto, and query Hive with it.
Lesson 8 of 9 within section Querying your Data Interactively.
[Activity] Query both Cassandra and Hive using Presto.
Lesson 9 of 9 within section Querying your Data Interactively.
Managing your Cluster
Lesson 1 of 13 within section Managing your Cluster.
Lesson 2 of 13 within section Managing your Cluster.
[Activity] Use Hive on Tez and measure the performance benefit
Lesson 3 of 13 within section Managing your Cluster.
Lesson 4 of 13 within section Managing your Cluster.
Lesson 5 of 13 within section Managing your Cluster.
[Activity] Simulating a failing master with ZooKeeper
Lesson 6 of 13 within section Managing your Cluster.
Lesson 7 of 13 within section Managing your Cluster.
[Activity] Set up a simple Oozie workflow
Lesson 8 of 13 within section Managing your Cluster.
Lesson 9 of 13 within section Managing your Cluster.
[Activity] Use Zeppelin to analyze movie ratings, part 1
Lesson 10 of 13 within section Managing your Cluster.
[Activity] Use Zeppelin to analyze movie ratings, part 2
Lesson 11 of 13 within section Managing your Cluster.
Lesson 12 of 13 within section Managing your Cluster.
Other technologies worth mentioning
Lesson 13 of 13 within section Managing your Cluster.
Feeding Data to your Cluster
Lesson 1 of 6 within section Feeding Data to your Cluster.
[Activity] Setting up Kafka, and publishing some data.
Lesson 2 of 6 within section Feeding Data to your Cluster.
[Activity] Publishing web logs with Kafka
Lesson 3 of 6 within section Feeding Data to your Cluster.
Lesson 4 of 6 within section Feeding Data to your Cluster.
[Activity] Set up Flume and publish logs with it.
Lesson 5 of 6 within section Feeding Data to your Cluster.
[Activity] Set up Flume to monitor a directory and store its data in HDFS
Lesson 6 of 6 within section Feeding Data to your Cluster.
Analyzing Streams of Data
Spark Streaming: Introduction
Lesson 1 of 8 within section Analyzing Streams of Data.
[Activity] Analyze web logs published with Flume using Spark Streaming
Lesson 2 of 8 within section Analyzing Streams of Data.
[Exercise] Monitor Flume-published logs for errors in real time
Lesson 3 of 8 within section Analyzing Streams of Data.
Exercise solution: Aggregating HTTP access codes with Spark Streaming
Lesson 4 of 8 within section Analyzing Streams of Data.
Apache Storm: Introduction
Lesson 5 of 8 within section Analyzing Streams of Data.
[Activity] Count words with Storm
Lesson 6 of 8 within section Analyzing Streams of Data.
Lesson 7 of 8 within section Analyzing Streams of Data.
Designing Real-World Systems
Lesson 1 of 7 within section Designing Real-World Systems.
Review: How the pieces fit together
Lesson 2 of 7 within section Designing Real-World Systems.
Understanding your requirements
Lesson 3 of 7 within section Designing Real-World Systems.
Sample application: consume webserver logs and keep track of top-sellers
Lesson 4 of 7 within section Designing Real-World Systems.
[Exercise] Design a system to report web sessions per day
Lesson 6 of 7 within section Designing Real-World Systems.
Exercise solution: Design a system to count daily sessions
Lesson 7 of 7 within section Designing Real-World Systems.
Learning More
Books and online resources
Lesson 1 of 2 within section Learning More.
Continue your Learning Journey!
Lesson 2 of 2 within section Learning More.
This course was really good in terms of covering the topics from a breadth perspective. This is the first time I have subscribed to a Sundog Education course, thanks to Frank. I have been following a few of Frank’s courses for the last 3 years on other platforms like Udemy. This course is a must for beginners who are trying their luck in the Big Data space.
Dear Frank,
I am following your course “The Ultimate Hands-On Hadoop: Tame your Big Data!” on O’Reilly. I have installed the HDP 2.6.5 version. As you mentioned, I have followed the steps to install “mrjob”, but I am getting the error below:
################## Error Message Start ##################
[root@sandbox-hdp maria_dev]# pip install mrjob==0.5.11
Collecting mrjob==0.5.11
Using cached https://files.pythonhosted.org/packages/31/1c/f3bd5f21ebe57e6d9212b3942af9e9c3a48dce9f4ba921081971f7b41a0f/mrjob-0.5.11-py2.py3-none-any.whl
Collecting google-api-python-client>=1.5.0 (from mrjob==0.5.11)
Using cached https://files.pythonhosted.org/packages/66/e8/edabba76d451d2f82f817d72f0ccddb1cb8dc9dda84596a973dc2ef6f10b/google_api_python_client-2.33.0-py2.py3-none-any.whl
Collecting filechunkio (from mrjob==0.5.11)
Using cached https://files.pythonhosted.org/packages/10/4d/1789767002fa666fcf486889e8f6a2a90784290be9c0bc28d627efba401e/filechunkio-1.8.tar.gz
Collecting PyYAML>=3.08 (from mrjob==0.5.11)
Using cached https://files.pythonhosted.org/packages/36/2b/61d51a2c4f25ef062ae3f74576b01638bebad5e045f747ff12643df63844/PyYAML-6.0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File “”, line 1, in
File “/tmp/pip-build-lIGSu6/PyYAML/setup.py”, line 67, in
import sys, os, os.path, pathlib, platform, shutil, tempfile, warnings
ImportError: No module named pathlib
—————————————-
Command “python setup.py egg_info” failed with error code 1 in /tmp/pip-build-lIGSu6/PyYAML/
You are using pip version 8.1.2, however version 21.3.1 is available.
You should consider upgrading via the ‘pip install --upgrade pip’ command.
################## Error Message Ends ##################
Things I have tried:
1. pip install google-api-python-client==1.6.4 (Error: error in httplib2 setup command: ‘install_requires’ must be a string or list of strings containing valid project/version requirement specifiers. Command “python setup.py egg_info” failed with error code 1 in /tmp/pip-build-r6iFbW/httplib2/)
2. pip install --upgrade setuptools (Error: AttributeError: find_module. Command “python setup.py egg_info” failed with error code 1 in /tmp/pip-build-2JTceI/setuptools/)
3. Tried installing Python 3 but it complicates the entire setup and doesn’t work.
Could you please help me with it?
You need to install pathlib, and downgrade PyYAML.
pip install pathlib
pip install pyyaml==3.10
Seems the course videos on O’Reilly are out of date. I’ll see about getting them updated.
I have bought this course online from Udemy, where am I supposed to get access to the course materials from?
If you bought it on Udemy, you should access the course from Udemy. The first few lectures walk you through getting set up and downloading any materials you need. Generally scripts etc. are just downloaded within each individual activity as needed.
I come from a C# and Visual Studio background. Is there any kind of debugging tool / IDE we can use to write Spark Python code? For example, I want to set a breakpoint and debug each line of code to see what’s in an object.
How do you debug your own code? If you write some Python Spark code and it breaks when you run it from PuTTY, how do you debug the code line by line to find where the actual problem is, since the error messages are sometimes not so helpful?
Well… there isn’t one, really. Spark works very differently from C# programs. Things don’t execute sequentially; you’re just building up a queue of operations that Spark will later build a directed acyclic graph from, and then distribute across the machines in your cluster once some operation requires a final result. So for most commands, nothing really happens – you can’t debug Spark programs the way you would with lower-level systems, stepping into individual commands and seeing what they do. What you can do is force your Spark driver scripts to compute intermediate results and print them out for debugging purposes… and pay attention to any error messages at runtime. With Datasets, Spark can detect more errors at compile time to make life a little easier.
Further complicating matters, your Python Spark code is ultimately converted and run on a Java Virtual Machine, and you may need to debug both on the driver side and on the executor side (which may be running someplace else entirely).
That said, it’s not *impossible* – just really hard, and only worth the effort when you’re really stuck. More details are at https://spark.apache.org/docs/latest/api/python/development/debugging.html
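To make the idea concrete, here is a toy sketch in plain Python (no Spark required, so the names `spark_like_map` and `spark_like_filter` are made up for illustration) of why a lazy pipeline can’t be stepped through line by line, and of the “force an intermediate result and print it” strategy described above:

```python
# Lazy "transformations" built on Python generators: calling them does
# no work, just like rdd.map() and rdd.filter() in Spark.
def spark_like_map(func, data):
    return (func(x) for x in data)

def spark_like_filter(pred, data):
    return (x for x in data if pred(x))

ratings = [5, 3, 1, 4, 5, 2]

# Building the pipeline: these two lines return instantly and compute
# nothing, so a breakpoint here would show you no data at all.
doubled = spark_like_map(lambda r: r * 2, ratings)
high = spark_like_filter(lambda r: r >= 8, doubled)

# The debugging trick: force an intermediate stage to materialize,
# then print it, instead of trying to step into the pipeline.
intermediate = list(spark_like_map(lambda r: r * 2, ratings))
print("after map:", intermediate)   # after map: [10, 6, 2, 8, 10, 4]

# Only this "action" (materializing the final result) triggers the
# whole chain of work.
result = list(high)
print("final:", result)             # final: [10, 8, 10]
```

In real PySpark the equivalent moves are calling something like `take()` or `collect()` on an intermediate RDD or DataFrame and printing the result from the driver script.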
I have bought this course online from Udemy; where am I supposed to get access to the course materials (code and slides) from?
If you bought it on Udemy, you should access the course from Udemy. The first few lectures walk you through getting set up and downloading any materials you need. Generally scripts etc. are just downloaded within each individual activity as needed.
I could not see a resource file for the code you wrote in section 4 (Hadoop with Spark) on Udemy. Where can I find it?
Please post questions from the Udemy course in Udemy’s Q&A. In fact, you would have found the answer to this there: http://media.sundog-soft.com/hadoop/HadoopMaterials.zip
I don’t remember where in the course this is mentioned exactly, but if you are taking the lessons out of sequence it’s possible you missed it.
Thank you, my mistake – I saw it just now when I looked at the Udemy Q&A section.