I'll
try to give a very crude overview of how the pieces fit together,
because the details span multiple books. Please forgive me for some
oversimplifications.
- MapReduce is the Google paper that started it all.
It's a paradigm for writing distributed code inspired by some elements
of functional programming. You don't have to do things this way, but it
neatly fits a lot of problems we try to solve in a distributed way. The
Google-internal implementation is called MapReduce, and Hadoop is its open-source implementation. Amazon's hosted Hadoop offering is called Elastic MapReduce (EMR) and has plugins for multiple languages.
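To give a flavour of the programming model, here is the classic word-count example written as a plain map function and a reduce function in Python. This is a single-machine toy, not a Hadoop job; the "shuffle" step that a real framework performs between the two phases is simulated with a dictionary:

```python
from collections import defaultdict

def map_fn(line):
    # map phase: emit a (word, 1) pair for every word in one line of input
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce phase: combine all the values emitted for one key
    return (word, sum(counts))

def word_count(lines):
    grouped = defaultdict(list)
    for line in lines:                     # the framework would split the input
        for key, value in map_fn(line):    # across many map tasks
            grouped[key].append(value)     # "shuffle": group values by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(word_count(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```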
- HDFS (Hadoop Distributed File System) is an implementation inspired by the Google File System (GFS) for storing files across a bunch of machines when they're too big for one. Hadoop consumes data stored in HDFS.
- Apache Spark is an
emerging platform that has more flexibility than MapReduce but more
structure than a basic message passing interface. It relies on the
concept of distributed data structures (what it calls RDDs) and
operators on them. See the Apache Spark page at the Apache Software Foundation for more.
- Because
Spark is a lower-level layer that sits on top of a message passing
interface, it has higher-level libraries to make it more accessible to
data scientists. The machine learning library built on top of it is
called MLlib, and there's a distributed graph library called GraphX.
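As a rough sketch of what RDDs and their operators look like, here is the same word count expressed as a chain of PySpark transformations. It assumes a local PySpark installation, and the HDFS path is hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count")

counts = (sc.textFile("hdfs:///data/input.txt")    # distributed lines (an RDD)
            .flatMap(lambda line: line.split())    # words
            .map(lambda word: (word, 1))           # (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))      # summed per word

print(counts.take(10))   # peek at ten (word, count) pairs
sc.stop()
```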
- Pregel and its open-source twin Giraph are
a way to do graph algorithms on billions of nodes and trillions of
edges over a cluster of machines. Notably, the MapReduce model is not
well suited to graph processing, so Hadoop/MapReduce is avoided in this
model, but HDFS/GFS is still used as a data store.
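To illustrate the vertex-centric ("think like a vertex") model that Pregel and Giraph use, here is a single-machine toy that finds connected components by letting every vertex repeatedly adopt the smallest id it receives from its neighbours. It only mimics supersteps and message passing; the real Giraph API is Java and distributed:

```python
# adjacency list for a tiny undirected graph (made up for illustration)
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
value = {v: v for v in graph}        # each vertex starts with its own id

changed = True
while changed:                       # one loop iteration = one "superstep"
    changed = False
    # every vertex sends its current value to its neighbours
    inbox = {v: [value[u] for u in graph[v]] for v in graph}
    for v, messages in inbox.items():
        new_value = min([value[v]] + messages)
        if new_value != value[v]:    # a vertex only stays active while it changes
            value[v] = new_value
            changed = True

print(value)   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4} -> two components
```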
- ZooKeeper
is a coordination and synchronization service with which a distributed set of
computers can make decisions by consensus, handle failures, and so on.
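For a feel of how applications use it, here is a small sketch with the third-party kazoo Python client; the host, paths, and identifiers are assumptions:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# register this worker as an ephemeral, sequential znode: it disappears
# automatically if the process dies, which is how peers notice the failure
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-", b"host-1", ephemeral=True, sequence=True)

# take a distributed lock so only one worker runs a critical section at a time
lock = zk.Lock("/app/locks/reindex", "worker-1")
with lock:
    pass  # work that must not run concurrently

zk.stop()
```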
- Flume and Scribe are logging services;
Flume is an Apache project and Scribe is an open-source Facebook
project. Both aim to make it easy to collect tons of logged data,
analyze it, tail it, move it around, and store it in a distributed store.
- Google BigTable and its open-source twin HBase
were meant to be read-write distributed databases that sit on top of
GFS/HDFS and MapReduce/Hadoop; BigTable was originally built for the Google crawler. See the Google research publication on BigTable.
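A rough sketch of the data model (rows keyed by arbitrary byte strings, columns grouped into column families), using the third-party happybase client; the host, table, and column names are assumptions:

```python
import happybase

connection = happybase.Connection("localhost")   # talks to HBase via Thrift
table = connection.table("web_pages")

# write a sparse row: only the columns you actually set exist for that row
table.put(b"com.example/index.html",
          {b"content:html": b"<html>...</html>",
           b"meta:fetched_at": b"2014-01-01"})

# random-access read by row key
row = table.row(b"com.example/index.html")
print(row[b"meta:fetched_at"])

connection.close()
```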
- Hive and Pig
are abstractions on top of Hadoop designed to help with analysis of tabular
data stored in a distributed file system (think of Excel sheets too big
to store on one machine). They operate on top of a data warehouse,
so the high-level idea is to dump data in once and analyze it by reading
and processing it, rather than updating individual cells, rows, and columns;
you generally don't update a single cell in a table when processing it with Hive or Pig.
Hive has a language similar to SQL, while Pig is inspired by Google's Sawzall (see the Google research publication on Sawzall).
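For a sense of what "a language similar to SQL" means here, this is the kind of query you would run against Hive, shown through the third-party PyHive client; the table and column names are hypothetical:

```python
from pyhive import hive   # assumes a reachable HiveServer2 instance

cursor = hive.connect(host="localhost", port=10000).cursor()

# a typical read-mostly, scan-and-aggregate query over a big table
cursor.execute("""
    SELECT country, COUNT(*) AS visits
    FROM page_views
    WHERE view_date = '2014-01-01'
    GROUP BY country
    ORDER BY visits DESC
    LIMIT 10
""")
for country, visits in cursor.fetchall():
    print(country, visits)
```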
- Hive
and Pig turned out to be slow because they were built on Hadoop, which
optimizes for the volume of data moved around, not latency. To get
around this, engineers bypassed MapReduce and went straight to HDFS. They also
threw in some memory and caching, and this resulted in Google's Dremel ("Dremel: Interactive Analysis of Web-Scale Datasets"), F1 ("F1: The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business"), Facebook's Presto, Apache Spark SQL, Cloudera Impala, Amazon's Redshift,
etc. They all have slightly different semantics but are essentially
meant to be programmer- or analyst-friendly abstractions for analyzing
tabular data stored in distributed data warehouses.
- Mahout (scalable machine learning and data mining) is a collection of machine learning libraries written in the MapReduce paradigm, specifically for Hadoop. Google has its own internal version, but they haven't published a paper on it as far as I know.
- Oozie is a workflow scheduler. The oversimplified description is that it's something that puts together a pipeline of
the tools described above. For example, you can write an Oozie script
that will scrape your production HBase data into a Hive warehouse nightly,
then have a Mahout script train on this data. At the same time, you
might use Pig to pull the test set into another file, and when Mahout
is done creating a model you can pass the testing data through the model
and get results. You specify the dependency graph of these tasks
through Oozie (I may be messing up terminology since I've never used
Oozie but have used the Facebook equivalent).
- Lucene is
a bunch of search-related and NLP tools, but its core feature is being a
search index and retrieval system. It takes data from a store like
HBase and indexes it for fast retrieval from a search query. Solr uses Lucene under the hood to provide a convenient REST API for indexing and searching data. ElasticSearch is similar to Solr.
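As a rough sketch of that REST API, here is how indexing and querying a Solr core can look using plain HTTP from Python; the core name, URL, and field names are assumptions about your setup:

```python
import requests

solr = "http://localhost:8983/solr/articles"   # hypothetical core

# index a couple of documents
docs = [{"id": "1", "title": "Intro to Apache Spark"},
        {"id": "2", "title": "HBase schema design"}]
requests.post(solr + "/update?commit=true", json=docs)

# run a full-text query against the index
resp = requests.get(solr + "/select", params={"q": "title:spark", "wt": "json"})
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["title"])
```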
- Sqoop is
a command-line interface for copying SQL data into a distributed warehouse.
It's what you might use to snapshot and copy your database tables to a
Hive warehouse every night.
- Hue is a web-based GUI to a subset of the above tools - http://gethue.com
Let's start from the beginning:
- Code in at least one object oriented programming language: C++, Java, or Python
Beginner Online Resources:
Coursera - Learn to Program: The Fundamentals,
MIT Intro to Programming in Java,
Google's Python Class,
Coursera - Introduction to Python,
Python Open Source E-Book
Intermediate Online Resources:
Udacity's Design of Computer Programs,
Coursera - Learn to Program: Crafting Quality Code,
Coursera - Programming Languages,
Brown University - Introduction to Programming Languages
- Learn other Programming Languages
Notes: Add to your repertoire - JavaScript, CSS, HTML, Ruby, PHP, C, Perl, Shell, Lisp, Scheme.
Online Resources: w3schools.com - HTML Tutorial, Learn to code
- Test Your Code
Notes: Learn how to catch bugs, create tests, and break your software
Online Resources: Udacity - Software Testing Methods, Udacity - Software Debugging
- Develop logical reasoning and knowledge of discrete math
Online Resources:
MIT Mathematics for Computer Science,
Coursera - Introduction to Logic,
Coursera - Linear and Discrete Optimization,
Coursera - Probabilistic Graphical Models,
Coursera - Game Theory.
- Develop strong understanding of Algorithms and Data Structures
Notes:
Learn about fundamental data types (stacks, queues, and bags), sorting
algorithms (quicksort, mergesort, heapsort), data structures (binary
search trees, red-black trees, hash tables), and Big-O analysis.
Online Resources:
MIT Introduction to Algorithms,
Coursera - Introduction to Algorithms Part 1 & Part 2,
Wikipedia - List of Algorithms,
Wikipedia - List of Data Structures,
Book: The Algorithm Design Manual
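For a taste of the material, here is quicksort, one of the sorting algorithms mentioned above, in a few lines of Python (a simple non-in-place version; production implementations sort in place and choose pivots more carefully):

```python
def quicksort(items):
    """Average O(n log n); worst case O(n^2) when pivots are chosen badly."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal   = [x for x in items if x == pivot]
    larger  = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([5, 2, 9, 1, 5, 6]))   # [1, 2, 5, 5, 6, 9]
```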
- Develop a strong knowledge of operating systems
Online Resources: UC Berkeley Computer Science 162
- Learn Artificial Intelligence
Online Resources: Stanford University - Introduction to Robotics, Natural Language Processing, Machine Learning
- Learn how to build compilers
Online Resources: Coursera - Compilers
- Learn cryptography
Online Resources: Coursera - Cryptography, Udacity - Applied Cryptography
- Learn Parallel Programming
Online Resources: Coursera - Heterogeneous Parallel Programming
Tools and technologies for Big Data:
Apache Spark -
Apache Spark is an open-source data analytics cluster computing
framework originally developed in the AMPLab at UC Berkeley. Spark
fits into the Hadoop open-source community, building on top of the
Hadoop Distributed File System (HDFS). However, Spark is not tied to
the two-stage MapReduce paradigm, and promises performance up to 100
times faster than Hadoop MapReduce for certain applications.
Database pipelining -
As you will notice, it's not just about processing the data; it
involves a lot of other components. Collection, storage, exploration, ML,
and visualization are critical to the project's success.
SOLR
- Solr can be used to build a highly scalable data analytics engine that enables
lightning-fast, real-time knowledge discovery.
Solr (pronounced "solar") is an open-source enterprise search platform
from the Apache Lucene project. Its major features include full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, and rich document (e.g., Word, PDF) handling. Providing
distributed search and index replication, Solr is highly scalable.
Solr is the most popular enterprise search engine, and Solr 4 adds NoSQL
features.
S3 - Amazon S3 is an online file storage web
service offered by Amazon Web Services. Amazon S3 provides storage
through web services interfaces. (Wikipedia)
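A minimal sketch of using S3 from Python with the boto3 SDK; it assumes configured AWS credentials, and the bucket and key names are made up:

```python
import boto3

s3 = boto3.client("s3")

# S3 is a key/value store for blobs, not a filesystem: "paths" are just keys
s3.put_object(Bucket="my-data-bucket", Key="logs/2014-01-01.txt", Body=b"hello")

obj = s3.get_object(Bucket="my-data-bucket", Key="logs/2014-01-01.txt")
print(obj["Body"].read())
```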
Hadoop - Apache Hadoop is an
open-source software framework for storage and large-scale processing of
data sets on clusters of commodity hardware. Hadoop is an
Apache top-level project being built and used by a global community of contributors and users. It is licensed under the
Apache License 2.0.
Apache Hadoop MapReduce :
Hadoop MapReduce is a software framework for easily writing
applications which process vast amounts of data (multi-terabyte
data sets) in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
A MapReduce
job usually splits the input data set into independent chunks which are processed by the
map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
Typically both the input and the output of the job are stored in a
file system. The framework takes care of scheduling tasks, monitoring
them, and re-executing failed tasks.
Corona : Corona is
a new scheduling framework (from Facebook) that separates cluster resource management
from job coordination. Corona introduces a cluster manager whose only
purpose is to track the nodes in the cluster and the amount of free
resources. A dedicated job tracker is created for each job, and can run
either in the same process as the client (for small jobs) or as a
separate process in the cluster (for large jobs).
One
major difference from Facebook's previous Hadoop MapReduce implementation is
that Corona uses push-based, rather than pull-based, scheduling. After
the cluster manager receives resource requests from the job tracker, it
pushes the resource grants back to the job tracker. Also, once the job
tracker gets resource grants, it creates tasks and then pushes these
tasks to the task trackers for running. There is no periodic heartbeat
involved in this scheduling, so the scheduling latency is minimized.
Ref: Under the Hood: Scheduling MapReduce jobs more efficiently with Corona
HBase : HBase is an
open source, non-relational, distributed database modeled after
Google's BigTable and written in Java. It is developed as part of the
Apache Software Foundation's Apache Hadoop project and runs on top of
HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a
fault-tolerant way of storing large quantities of sparse
data (small amounts of information caught within a large collection of
empty or unimportant data, such as finding the 50 largest items in a
group of 2 billion records, or finding the non-zero items representing
less than 0.1% of a huge collection).
Zookeeper - Apache ZooKeeper is a software project of the
Apache Software Foundation, providing an
open-source distributed configuration service, synchronization service, and naming registry for large
distributed systems. ZooKeeper was a subproject of
Hadoop but is now a top-level project in its own right.
Hive - Apache Hive is a
data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and analysis. While initially developed by
Facebook, Apache Hive is now used and developed by other companies such as
Netflix. Amazon maintains a software fork of Apache Hive that is included in
Amazon Elastic MapReduce on
Amazon Web Services.
Mahout - Apache Mahout is a project of the
Apache Software Foundation to produce
free implementations of
distributed or otherwise
scalable machine learning algorithms focused primarily in the areas of
collaborative filtering, clustering and classification. Many of the implementations use the
Apache Hadoop
platform. Mahout also provides Java libraries for common maths
operations (focused on linear algebra and statistics) and primitive Java
collections. Mahout is a work in progress; the number of implemented
algorithms has grown quickly, but various algorithms are still missing.
Hue is a web-based GUI to a subset of the above tools. Hue aggregates the most common
Apache Hadoop
components into a single interface and targets the user experience. Its
main goal is to have the users "just use" Hadoop without worrying about
the underlying complexity or using a command line.
NLTK - The
Natural Language Toolkit, or more commonly
NLTK, is a suite of
libraries and programs for symbolic and statistical
natural language processing (NLP) for the
Python programming language.
NLTK includes graphical demonstrations and sample data. It is
accompanied by a book that explains the underlying concepts behind the
language processing tasks supported by the toolkit, plus a cookbook.
NLTK is intended to support research and teaching in NLP or closely related areas, including empirical
linguistics,
cognitive science,
artificial intelligence,
information retrieval, and
machine learning.
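A tiny example of the kind of thing NLTK does: tokenization, part-of-speech tagging, and simple frequency counts. The nltk.download calls are an assumption about a fresh install:

```python
import nltk

# one-time model downloads on a fresh install (names may vary by NLTK version)
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

text = "Hadoop stores the data and Spark processes it."
tokens = nltk.word_tokenize(text)              # split into word tokens
print(nltk.pos_tag(tokens))                    # attach part-of-speech tags
print(nltk.FreqDist(tokens).most_common(3))    # simple corpus statistics
```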
For Python: scikit-learn, NumPy, SciPy
Freebase - Freebase
is a large collaborative knowledge base consisting of metadata composed
mainly by its community members. It is an online collection of
structured data harvested from many sources, including individual 'wiki'
contributions.
DBPedia :
DBpedia (from "DB" for "
database") is a project aiming to extract
structured content from the information created as part of the
Wikipedia project. This structured information is then made available on the
World Wide Web.
DBpedia allows users to query relationships and properties associated
with Wikipedia resources, including links to other related
datasets. DBpedia has been described by
Tim Berners-Lee as one of the more famous parts of the decentralized
Linked Data effort.
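As a sketch of what querying DBpedia looks like, here is a SPARQL query sent from Python with the SPARQLWrapper package; the exact property used is an assumption, since DBpedia's schema evolves:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?place WHERE {
      <http://dbpedia.org/resource/Tim_Berners-Lee>
          <http://dbpedia.org/ontology/birthPlace> ?place .
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["place"]["value"])
```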
Visualization tools: ggplot in R, Tableau, QlikView
Mathematics: calculus, statistics, probability, linear algebra, and coordinate geometry
NER- Named
Entity Recognition (NER) labels sequences of words in a text which are
the names of things, such as person and company names, or gene and
protein names.
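A toy illustration using NLTK's built-in chunker, just one of many NER tools; the download calls are assumptions about a fresh install:

```python
import nltk

# one-time model downloads on a fresh install (names may vary by NLTK version)
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = "Tim Berners-Lee founded the World Wide Web Consortium in 1994."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

for subtree in tree:
    if hasattr(subtree, "label"):              # entity chunks are subtrees
        entity = " ".join(word for word, tag in subtree)
        print(subtree.label(), "->", entity)   # e.g. PERSON -> Tim Berners-Lee
```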
Faceted search : Faceted search also
called faceted navigation or faceted browsing, is a technique for
accessing information organized according to a faceted classification
system, allowing users to explore a collection of information by
applying multiple filters. A faceted classification system classifies
each information element along multiple explicit dimensions, called
facets, enabling the classifications to be accessed and ordered in
multiple ways rather than in a single, pre-determined, taxonomic order.
Source: Wikipedia, the free encyclopedia
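To make the faceted-search idea concrete, here is a toy sketch of facet counting over a small made-up product list; each facet is just a field whose value counts are shown next to the filters and re-counted as filters are applied:

```python
from collections import Counter

products = [
    {"name": "Phone A", "brand": "Acme",   "colour": "black"},
    {"name": "Phone B", "brand": "Acme",   "colour": "white"},
    {"name": "Phone C", "brand": "Globex", "colour": "black"},
]

def facet_counts(items, facet):
    """How many items fall under each value of one facet."""
    return Counter(item[facet] for item in items)

print(facet_counts(products, "brand"))    # Counter({'Acme': 2, 'Globex': 1})
print(facet_counts(products, "colour"))   # Counter({'black': 2, 'white': 1})

# applying a filter narrows the collection; the remaining facets are re-counted
black_only = [p for p in products if p["colour"] == "black"]
print(facet_counts(black_only, "brand"))  # Counter({'Acme': 1, 'Globex': 1})
```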