- Big Data Notes:
- - climb mountains
- - especially everest
- - stalk everyone
- - MEMES
- - use more buzzwords on your resume
- - MapReduce, Spark, Flink, Pig, Dryad, Hive, NoSQL, Pregel, Giraph, Storm/Heron
- - Data Science, Data Analytics, Business Intelligence, Data Warehouses, Data Lakes
- - build more servers
- - remember cs 341
- - Data Intensive vs Compute Intensive: data intensive seems to mean the "difficulty" is having lots of data, not doing hard computation
- - Coarse vs Fine grained parallelism
- - Coarse means a bunch of shit doing the same thing
- - Fine means threads are not necessarily "clones" of each other
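- - functional operation called fold (a higher-order function, like map)
- - takes a function that combines the running result A with the next element B to produce a new result C, and keeps doing so until the entire "list" is done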
- - oh we're using scala. cool beans. probably because Spark
- - MapReduce just parallelizes map and reduce/fold. lol
- - programmer specifies a map function and a reduce function
- - map takes a key and a value and produces a list of key-value pairs
- - reduce takes a key and a list of values and produces a list of pairs with new keys and values
- - all values with the same key get reduced on the same reducer (thread?)
- - grouping/moving outputs of maps to the right worker for the reducer (the shuffle) is slow and shitty
- - framework (runtime environment) handles scheduling, threading/synchronisation, distribution of data across the cluster, etc
- - oh yeah it uses HDFS
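- rough sketch of fold plus the map/reduce flow described above, as a single-process word count. This is in Python rather than the course's Scala, and every name here (`fold`, `map_fn`, `reduce_fn`, `run_mapreduce`) is made up for illustration; a real framework would spread this across a cluster, this just mimics the data flow:

```python
from collections import defaultdict


def fold(f, init, xs):
    """Left fold: thread an accumulator through xs, one element at a time."""
    acc = init
    for x in xs:
        acc = f(acc, x)
    return acc


def map_fn(key, value):
    """Hypothetical mapper: key is a doc id, value is its text; emits (word, 1)."""
    return [(word, 1) for word in value.split()]


def reduce_fn(key, values):
    """Hypothetical reducer: gets one key plus all its values; emits (word, total)."""
    return [(key, fold(lambda acc, v: acc + v, 0, values))]


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # The grouping step: all values with the same key end up together.
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


docs = [("d1", "a b a"), ("d2", "b c")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The programmer only writes `map_fn` and `reduce_fn`; everything in `run_mapreduce` is the part the framework would normally own.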
- HE LIED there are 2 more functions:
- - partition takes in an intermediate key and information on the partitions, and outputs the partition that key will go to
- - this "splits up" the key space to define where the keys go and how many reducers there will be
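- - combiner runs after map, on the mapper's side, and does a local mini-reduce on the map output before it gets moved to the reducers, so less data has to be shuffled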
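- rough sketch of what partition and combiner might look like for a word count, in Python. All names (`partition`, `combine`, `shuffle`) are hypothetical, and hashing the key with CRC32 modulo the partition count is an assumption made for illustration, not necessarily what any real framework does:

```python
from collections import defaultdict
import zlib


def partition(key, num_partitions):
    """Hypothetical hash partitioner: the same key always maps to the same partition.

    zlib.crc32 is used instead of the built-in hash() because string hashes
    are randomized per Python process.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    """Hypothetical combiner for word count: pre-sum counts on the mapper side."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def shuffle(mapper_outputs, num_partitions):
    """Route each (key, value) pair to the bucket chosen by partition()."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            buckets[partition(key, num_partitions)][key].append(value)
    return buckets


# Output of two pretend mappers.
mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1)],
]
combined = [combine(pairs) for pairs in mapper_outputs]
buckets = shuffle(combined, num_partitions=2)

pairs_before = sum(len(p) for p in mapper_outputs)
pairs_after = sum(len(p) for p in combined)
# Each bucket stands in for one reducer; sum the values per key.
totals = {k: sum(vs) for b in buckets for k, vs in b.items()}
print(pairs_before, pairs_after, totals)
```

The combiner shrinks what has to be moved (fewer pairs after combining than before) without changing the final totals, which only works here because addition can be applied in pieces.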
- course website: lintool.github.io/bigdata-2018f
- STAR WARS REFERENCES!!!