must support a single application generating and collecting many types of data
3 - Velocity
Late decisions → missing opportunities
for websites like Twitter, where gigabytes of data are created every minute, you must be able to queue and process data extremely fast
Kafka is a common buffering system
4 - Veracity
Data in doubt
can all the data be trusted?
Big Picture
What’s driving Big Data
Low value, low complexity tasks like Business Intelligence
data mining techniques
structured data, typical sources
ad-hoc querying and reporting
high value, high complexity tasks like data mining
more real-time tasks
very large datasets
all types of data, from many sources
complex statistical analysis
What will we learn
Hadoop/MapReduce technology, NoSQL, Bigtable, Cassandra, Spark, stream processing
Learn the platform
Learn to write Hadoop/Spark jobs in different languages
Learn advanced analytics tools on top of Hadoop
Parallel
multiple CPUs each sharing some memory space
Problems with Parallelization
How do we assign work to CPUs?
What if workers need to share partial results?
How do we know when all workers are done? What if a worker dies during the process?
Distributed
Multiple CPUs each with their own memory space
Inverted Index
When crawling the web, Google creates a word count for certain terms on every page it processes.
It then maintains an inverted index for every term it wants to track, perhaps hundreds of thousands of terms. The terms are sorted by some metric, and each term points to the different pages that contain that keyword and where on the page it appears.
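A minimal in-memory sketch of the idea, assuming whitespace tokenization and integer document ids (both simplifications; the docs list and build method are illustrative names):

    import java.util.*;

    public class InvertedIndex {
        // Build term -> sorted set of ids of the documents containing that term
        public static Map<String, Set<Integer>> build(List<String> docs) {
            Map<String, Set<Integer>> index = new HashMap<>();
            for (int docId = 0; docId < docs.size(); docId++) {
                for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
            return index;
        }

        public static void main(String[] args) {
            List<String> docs = List.of("big data big ideas", "data mining");
            // prints something like {data=[0, 1], big=[0], ideas=[0], mining=[1]}
            System.out.println(build(docs));
        }
    }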
Map Reduce
Map Reduce comes from functional programming
Two important functional programming functions
Map: do something to everything in a list
function is applied to every element in a list
result is a new list
Works in parallel: every element in the list can be put through the function concurrently, as sketched below
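A quick Java-streams illustration of the idea (plain Java, not Hadoop; the list and lambda are arbitrary examples, runnable in jshell):

    import java.util.List;
    import java.util.stream.Collectors;

    // map: the function x -> x * 2 is applied to every element of the list
    List<Integer> doubled = List.of(1, 2, 3, 4, 5).stream()
            .map(x -> x * 2)
            .collect(Collectors.toList());  // result is a new list: [2, 4, 6, 8, 10]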
Fold: combine the results of a list in some way
Steps of operation
accumulator set to some value
function applied to list element and the accumulator
result stored in the accumulator
repeated for every item
result is the final value in the accumulator
Ex: If we have the list (1,2,3,4,5)
Acc = 0
Acc + 1 = 1
Acc + 2 = 3
Acc + 3 = 6
Acc + 4 = 10
Acc + 5 = 15
Result = 15
Fold doesn’t parallelize well because each partial value must be passed on to the next operation
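The worked example above, written as a Java-streams reduce (a fold); the lambda's acc parameter plays the role of the accumulator (runnable in jshell):

    import java.util.List;

    // fold: acc starts at 0, the function is applied to each element in turn
    int sum = List.of(1, 2, 3, 4, 5).stream()
            .reduce(0, (acc, x) -> acc + x);  // 0+1=1, 1+2=3, 3+3=6, 6+4=10, 10+5=15

When the combining function is associative and commutative (like +), partial folds over sub-lists can be computed independently and merged; that is the loophole MapReduce's combiners and reducers exploit.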
TA Lecture - Sept 16
given a word-count problem with three documents, the naive solution emits n key-value pairs, where n is the total number of words across all the documents
This would be a massive amount of network traffic to transmit every <word, 1> key-value pair
Solution 1
Each mapper reduces its own word count, removing duplicate keys within its single document / mapper
Ex. <a, 1> <a, 1> <a, 1> ⇒ <a, 3>
If a mapper handles multiple documents, there will still be duplicates between documents
Each mapper then has the total count of each letter or word it saw within its domain/documents, but not the global sum; for that we need the reducers
These intermediate key-value pairs (e.g. <a, 3>, <a, 5>) are sent to the reducer sorted by key, and the reducer combines them into their lowest form (e.g. <a, 8>), as sketched below
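A minimal word-count sketch using Hadoop's Java MapReduce API (class names are illustrative, and each class would normally live in its own file); registering the reducer as a combiner is what performs the per-mapper reduction described above:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);  // naive: one <word, 1> per occurrence
            }
        }
    }

    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));  // e.g. <a,3>, <a,5> -> <a,8>
        }
    }

    // In the driver, job.setCombinerClass(SumReducer.class) runs this same
    // reduction locally on each mapper's output before the shuffle.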
Solution 2
using a mapper class that is initialized once and whose map method iterates over several documents, we can accumulate counts in memory and emit a single set of key-value pairs covering all documents in the domain, further reducing network traffic to the reducers (see the sketch below)
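A sketch of this in-mapper combining pattern: setup runs once, map accumulates into a HashMap across every record the mapper sees, and cleanup emits one pair per distinct word (the class name is illustrative):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCombiningMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private Map<String, Integer> counts;

        @Override
        protected void setup(Context context) {
            counts = new HashMap<>();  // lives across all documents this mapper sees
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            for (String token : value.toString().split("\\s+")) {
                counts.merge(token, 1, Integer::sum);  // accumulate, emit nothing yet
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // one <word, localTotal> pair per distinct word across all documents
            Text word = new Text();
            IntWritable total = new IntWritable();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                word.set(e.getKey());
                total.set(e.getValue());
                context.write(word, total);
            }
        }
    }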
Computing the Average count of a word within the documents
Method 1
For each document, find the number of “apples” divided by the number of key-value pairs. (These pairs can contain duplicate keys; for example, a mapper with 2 documents could emit 2 <k, v> pairs with the same key.)
This is a problem because the average changes with the number of files, even when the total number of words stays constant.
Method 2
for each document, use a mapper to emit <k, 1>
use a combiner to emit the sum of the key-value pairs, emitting <k, sum>
use a reducer to merge the results from all mappers/combiners into one list of the global sums of each key in the domain (sketched below)
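A sketch of the reducer finishing the job, assuming the total number of documents is passed in through the job Configuration under a made-up property name "num.docs":

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        private long numDocs;

        @Override
        protected void setup(Context context) {
            numDocs = context.getConfiguration().getLong("num.docs", 1);  // assumed property
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (IntWritable v : values) sum += v.get();  // global sum for this key
            context.write(key, new DoubleWritable((double) sum / numDocs));
        }
    }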
Method 3
Initialize a Term dictionary and a count dictionary
the map function takes in a string and an integer representing the count of that key within the document
the count dictionary entry is incremented by one, and the term dictionary entry is incremented by the count (see the sketch below)
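A sketch of Method 3's mapper, assuming (as the notes suggest) the map input is already <term, per-document count> pairs; cleanup packs each term's running sum and document count into a "sum,docs" Text value for a reducer to finish the average (names and the packed format are illustrative):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;

    public class AverageMapper extends Mapper<Text, IntWritable, Text, Text> {
        private Map<String, Integer> termSums;   // term dictionary: term -> running sum
        private Map<String, Integer> docCounts;  // count dictionary: term -> documents seen

        @Override
        protected void setup(Context context) {
            termSums = new HashMap<>();
            docCounts = new HashMap<>();
        }

        @Override
        protected void map(Text term, IntWritable count, Context context) {
            termSums.merge(term.toString(), count.get(), Integer::sum);
            docCounts.merge(term.toString(), 1, Integer::sum);
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // emit <term, "sum,docs"> so the reducer can compute sum / docs
            for (Map.Entry<String, Integer> e : termSums.entrySet()) {
                context.write(new Text(e.getKey()),
                        new Text(e.getValue() + "," + docCounts.get(e.getKey())));
            }
        }
    }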
Stripes Method
Each mapper accumulates the counts for every word / letter into a single associative array (a “stripe”) and emits it as one key-value pair
The reducer then does an element-wise, term-wise summation across stripes (summing all the a’s, b’s, c’s, etc.), as sketched below
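A stripes sketch using Hadoop's MapWritable: each map call builds one stripe of counts and emits it under a shared key (the "counts" key name and whitespace tokenization are assumptions), and the reducer sums the stripes element-wise:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
        private static final Text STRIPE_KEY = new Text("counts");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            MapWritable stripe = new MapWritable();  // word -> count for this record
            for (String token : value.toString().split("\\s+")) {
                Text term = new Text(token);
                IntWritable old = (IntWritable) stripe.get(term);
                stripe.put(term, new IntWritable(old == null ? 1 : old.get() + 1));
            }
            context.write(STRIPE_KEY, stripe);  // one KV pair per record, not per word
        }
    }

    public class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
        @Override
        protected void reduce(Text key, Iterable<MapWritable> stripes, Context context)
                throws IOException, InterruptedException {
            MapWritable sums = new MapWritable();
            for (MapWritable stripe : stripes) {
                for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                    IntWritable old = (IntWritable) sums.get(e.getKey());
                    int add = ((IntWritable) e.getValue()).get();
                    sums.put(e.getKey(), new IntWritable(old == null ? add : old.get() + add));
                }
            }
            context.write(key, sums);  // term-wise summation across all stripes
        }
    }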
Map Reduce in Java
Setup
Map / Reduce
map worker threads each read some split of the file / data
map workers write their intermediate results to local disk, split into two partitions (one per reducer)
Worker 1 has its own two partitions; each of those partitions is sent to one of the two reducers
reduce workers remotely read the intermediate files, then emit the output results
Cleanup
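A driver sketch tying the phases together, reusing the word-count classes sketched earlier; setNumReduceTasks(2) matches the two partitions per mapper described above (input/output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);  // from the sketch above
            job.setCombinerClass(SumReducer.class);     // local per-mapper reduction
            job.setReducerClass(SumReducer.class);
            job.setNumReduceTasks(2);                   // two partitions per mapper
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }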
Failures
Map worker failure
tasks are reset to idle
reduce workers are notified when a map task is rescheduled on another worker
Reduce worker failure
only in-progress tasks are reset and restarted
if the failure happens after completion no restart necessary
Master Failure
whole map reduce task is aborted and client is notified
Map Reduce Algorithm to Process Relational Data
when we have some sensor outputting data 1,000 times a second, over a year we have an insane amount of data
We can use MapReduce to sort and handle this amount of data
We would expect the mapper to output KV pairs of the form <timestamp, reading>
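A sketch of such a mapper, assuming each input line is a CSV record of the form "timestamp,reading" (the field layout and class name are assumptions); the shuffle phase then delivers readings to the reducers sorted by timestamp:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SensorMapper extends Mapper<LongWritable, Text, LongWritable, DoubleWritable> {
        private final LongWritable timestamp = new LongWritable();
        private final DoubleWritable reading = new DoubleWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");  // assumed "timestamp,reading" layout
            timestamp.set(Long.parseLong(fields[0]));
            reading.set(Double.parseDouble(fields[1]));
            context.write(timestamp, reading);  // shuffle sorts by the timestamp key
        }
    }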