A Hadoop Primer

A simple introduction to Hadoop, given as a talk to the Maine Java Users' Group on February 15, 2011.

A Hadoop Primer

Feb 2011

http://redmonk.com/public/hadoop.pdf

The Background

October, 2003

December, 2004

Map::Reduce

Job::Map Reduce::Output

Counting Shakespeare
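
The slides here do not include the code behind the word-counting example, so what follows is only a minimal sketch of the map, shuffle, and reduce steps in plain Python, with no Hadoop involved. The function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample line are assumptions made for illustration, not the talk's own code.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: collapse the 1s emitted for a word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input; a real job would read a full corpus from storage.
    print(word_count(["To be, or not to be, that is the question"]))
```

On a cluster, the map and reduce functions would run on many machines at once and the framework would handle the grouping in between; this sketch runs all three steps in one process.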

The Birth of Hadoop

Project Architecture

Source: Running Hadoop On Ubuntu Linux, Michael G. Noll, 8.8.07

Project Traction

Employment Potential

Hadoop Users

Why Hadoop?

More Machines = More Faster

The reason everyone knows

BIG DATA

“The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data.

The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.”

- Bradford Cross, Flightcaster/Woven

The reason not everyone knows

Unstructured Data

What Hadoop Is

“build Amazon's product search indices”

“build the recommender system for behavioral targeting”

“ETL style processing and statistics generation”

“information extraction & search”

“searching and analysis of millions of rental bookings”

“we use Hadoop to summarize of user's tracking data”

“we use Hadoop to store ad serving logs”

“the freedom to query the data in an ad-hoc manner”

“generating web graphs on 100 nodes”

“we use Hadoop for batch-processing large RDF datasets”

“facial similarity and recognition across large datasets”

“We are using Hadoop and Nutch to crawl Blog posts”

“Used for ETL & data analysis on terascale datasets”

Source: http://wiki.apache.org/hadoop/PoweredBy

What Hadoop Isn't

A relational database killer

No Yes

Beyond Hadoop

The Hadoop Ecosystem

What We Use Hadoop For

Crawling Largeish Unstructured Datasets

Like 1.3M StackOverflow Questions

Or 1.7M HackerNews Entries

Or Years of Apache Log Files

How to Get Started

We use Cloudera

Mostly because it's easy

This easy

Or if you prefer

Or maybe this

QUESTIONS

Student? Talk to us