Course Summary:
Taught by a four-person team that includes two Stanford-educated ex-Googlers and two ex-Flipkart Lead Analysts. The team has decades of practical experience working with Java and with billions of rows of data.
This course is a zoom-in, zoom-out, hands-on workout involving Hadoop, MapReduce and the art of thinking parallel.
Let’s parse that.
Zoom-in, zoom-out: This course is both broad and deep. It covers the individual components of Hadoop in great detail, and also gives you a higher-level picture of how they interact.
Hands-on workout involving Hadoop, MapReduce: This course will get you hands-on with Hadoop very early on. You'll learn how to set up your own cluster using both VMs and the Cloud. All the major features of MapReduce are covered, including advanced topics like Total Sort and Secondary Sort.
The art of thinking parallel: MapReduce completely changed the way people thought about processing Big Data. Breaking down any problem into parallelizable units is an art. The examples in this course will train you to "think parallel".
Lots of cool stuff...
- Using MapReduce to:
  - Recommend friends on a social networking site: generate top 10 friend recommendations using a collaborative filtering algorithm.
  - Build an inverted index for search engines: use MapReduce to parallelize the humongous task of building an inverted index for a search engine.
  - Generate bigrams from text: generate bigrams and compute their frequency distribution in a corpus of text.
- Build your Hadoop cluster:
  - Install Hadoop in Standalone, Pseudo-Distributed and Fully Distributed modes
  - Set up a Hadoop cluster using Linux VMs
  - Set up a cloud Hadoop cluster on AWS with Cloudera Manager
  - Understand HDFS, MapReduce and YARN and their interaction
- Customize your MapReduce jobs:
  - Chain multiple MR jobs together
  - Write your own customized Partitioner
  - Total Sort: globally sort a large amount of data by sampling input files
  - Secondary sorting
  - Unit tests with MRUnit
  - Integrate with Python using the Hadoop Streaming API
... and of course all the basics:
- MapReduce: Mapper, Reducer, Sort/Merge, Partitioning, Shuffle and Sort
- HDFS & YARN: NameNode, DataNode, ResourceManager, NodeManager, the anatomy of a MapReduce application, YARN scheduling, and configuring HDFS and YARN to performance-tune your cluster.
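To give a feel for what "thinking parallel" means for the bigram example above, here is a minimal sketch in plain Python. It is not the course's code (the course source is in Java and runs on Hadoop); it only imitates the three MapReduce stages in a single process. The function names (`map_bigrams`, `shuffle_and_sort`, `reduce_counts`, `count_bigrams`) are made up for illustration, and the sketch assumes that bigrams do not span line boundaries.

```python
from itertools import groupby
from operator import itemgetter


def map_bigrams(line):
    """Map step: emit ((word, next_word), 1) for each adjacent word pair in one line."""
    words = line.lower().split()
    for first, second in zip(words, words[1:]):
        yield (first, second), 1


def shuffle_and_sort(pairs):
    """Shuffle-and-sort step: sort the mapped pairs by key, then group values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one bigram."""
    return key, sum(values)


def count_bigrams(lines):
    """Run map, shuffle-and-sort, and reduce in sequence over an iterable of lines."""
    mapped = (pair for line in lines for pair in map_bigrams(line))
    return dict(reduce_counts(key, values) for key, values in shuffle_and_sort(mapped))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the quick red fox"]
    for bigram, count in sorted(count_bigrams(sample).items()):
        print(bigram, count)
```

The point of the split is that each call to `map_bigrams` depends only on its own line, and each call to `reduce_counts` depends only on its own key, so on a real cluster both stages could be spread across many machines.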
Write to us about anything - anything! - and we will always reply :-) Happy Learning at Unanth.
What are the requirements?
- You'll need an IDE where you can write Java code or open the source code that's shared. IntelliJ and Eclipse are both great options.
- You'll need some background in Object-Oriented Programming, preferably in Java. All the source code is in Java, and we dive right in without explaining objects, classes, etc.
- A bit of exposure to Linux/Unix shells would be helpful, but it won't be a blocker.
What is the target audience?
- Yep! Analysts who want to leverage the power of HDFS where traditional databases don't cut it anymore
- Yep! Engineers who want to develop complex distributed computing applications to process lots of data
- Yep! Data Scientists who want to add MapReduce to their bag of tricks for processing data
Section 1 - Introduction
You, this course and Us
Section 2 - Why is Big Data a Big Deal
DOWNLOAD SECTION 2- WhyBigData
The Big Data Paradigm
Serial vs Distributed Computing
What is Hadoop?
HDFS or the Hadoop Distributed File System
YARN or Yet Another Resource Negotiator
Section 3 - Installing Hadoop in a Local Environment
DOWNLOAD SECTION 3-Install-Guides
Hadoop Install Modes
Set up a Virtual Linux Instance (For Windows users)
Hadoop Standalone mode Install
Hadoop Pseudo-Distributed mode Install
Section 4 - The MapReduce "Hello World"
DOWNLOAD SECTION 4-MR-IntroSimpleWordCount
DOWNLOAD SECTION 4- SourceCode
The basic philosophy underlying MapReduce
MapReduce - Visualized And Explained
MapReduce - Digging a little deeper at every step
"Hello World" in MapReduce
Section 5 - Run a MapReduce Job
Get comfortable with HDFS
Run your first MapReduce Job
Section 6 - Juicing your MapReduce - Combiners, Shuffle and Sort and The Streaming API
DOWNLOAD SECTION 6-MR-CombinerStreamingAPIMultipleReduceShuffleSort
Parallelize the reduce phase - use the Combiner
Not all Reducers are Combiners
How many mappers and reducers does your MapReduce have?
Parallelizing reduce using Shuffle And Sort
MapReduce is not limited to the Java language - Introducing the Streaming API
Python for MapReduce
Section 7 - HDFS and YARN
DOWNLOAD SECTION 7-HDFS
HDFS - Protecting against data loss using replication
HDFS - Name nodes and why they're critical
HDFS - Checkpointing to back up name node information
DOWNLOAD SECTION 7-YARN
YARN - Basic components
YARN - Submitting a job to YARN
YARN - Plug in scheduling policies
YARN - Configure the scheduler
Section 8 - Setting up a Hadoop Cluster
Manually configuring a Hadoop cluster (Linux VMs)
Getting started with Amazon Web Services
Start a Hadoop Cluster with Cloudera Manager on AWS
Section 9 - MapReduce Customizations For Finer Grained Control
DOWNLOAD SECTION 9-Customizing-MR
Setting up your MapReduce to accept command line arguments
The Tool, ToolRunner and GenericOptionsParser
Configuring properties of the Job object
Customizing the Partitioner, Sort Comparator, and Group Comparator
Section 10 - The Inverted Index, Custom Data Types for Keys, Bigram Counts and Unit Tests!
DOWNLOAD SECTION 10-MR-InvertedIndex-WritableInterface-Bigram-MRUnit
The heart of search engines - The Inverted Index
Generating the inverted index using MapReduce
Custom data types for keys - The Writable Interface
Represent a Bigram using a WritableComparable
MapReduce to count the Bigrams in input text
Test your MapReduce job using MRUnit
Section 11 - Input and Output Formats and Customized Partitioning
DOWNLOAD SECTION 11-Formats-And-Sorting
Introducing the File Input Format
Text And Sequence File Formats
Data partitioning using a custom partitioner
Make the custom partitioner real in code
Total Order Partitioning
Input Sampling, Distribution, Partitioning and configuring these
Section 12 - Recommendation Systems using Collaborative Filtering
DOWNLOAD SECTION 12-MR-CollaborativeFiltering-Recommendations
Introduction to Collaborative Filtering
Friend recommendations using chained MR jobs
Get common friends for every pair of users - the first MapReduce
Top 10 friend recommendation for every user - the second MapReduce
Section 13 - Hadoop as a Database
DOWNLOAD SECTION 13-MR-Databases-Select-Grouping
Structured data in Hadoop
Running an SQL Select with MapReduce
Running an SQL Group By with MapReduce
A MapReduce Join - The Map Side
A MapReduce Join - The Reduce Side
A MapReduce Join - Sorting and Partitioning
A MapReduce Join - Putting it all together
Section 14 - K-Means Clustering
DOWNLOAD SECTION 14-MR-Kmeans-Algo
What is K-Means Clustering?
A MapReduce job for K-Means Clustering
K-Means Clustering - Measuring the distance between points
K-Means Clustering - Custom Writables for Input/Output
K-Means Clustering - Configuring the Job
K-Means Clustering - The Mapper and Reducer
K-Means Clustering - The Iterative MapReduce Job
Loonycorn: a 4-person team; ex-Google.
Loonycorn is us: Janani Ravi, Vitthal Srinivasan, Swetha Kolalapudi and Navdeep Singh. Between the four of us, we have studied at Stanford, IIM Ahmedabad and the IITs, and have spent years (decades, actually) working in tech in the Bay Area, New York, Singapore and Bangalore.
- Janani: 7 years at Google (New York, Singapore); studied at Stanford; also worked at Flipkart and Microsoft
- Vitthal: also Google (Singapore) and studied at Stanford; Flipkart, Credit Suisse and INSEAD too
- Swetha: early Flipkart employee; IIM Ahmedabad and IIT Madras alum
- Navdeep: longtime Flipkart employee too, and IIT Guwahati alum
We think we might have hit upon a neat way of teaching complicated tech courses in a funny, practical, engaging way, which is why we are so excited to be here on Unanth! We hope you will try our offerings, and think you'll like them :-)
Posted 4 months ago
Very Good Course....Value for money.
Very good course, concepts explained with details. At some places pace is fast but manageable with attached documents. I recommend this course.