Big Data Notes:
- climb mountains
  - especially everest
- stalk everyone
- MEMES
- use more buzzwords on your resume
  - MapReduce, Spark, Flink, Pig, Dryad, Hive, NoSQL, Pregel, Giraph, Storm/Heron
  - Data Science, Data Analytics, Business Intelligence, Data Warehouses, Data Lakes
- build more servers
- remember cs 341
- Data Intensive vs Compute Intensive: the "difficulty" seems to be having lots of data, not doing hard computation
- Coarse vs Fine grained parallelism
  - Coarse means a bunch of stuff doin the same thing
  - Fine means threads are not necessarily "clones" of each other
- functional operation called fold (like map)
  - takes a function that combines a running result with the next element to produce a new result, and does so until the entire "list" is done
- oh we're using scala. cool beans. probably because Spark
- MapReduce just parallelizes map and reduce/fold. lol
  - programmer specifies a map and a reduce function
  - map takes a key and value and produces a list of key-value pairs
  - reduce takes a key and a list of values and produces a list of pairs with a new key and values
  - all values with the same key get reduced on the same reducer (thread?)
  - grouping/moving outputs of maps to the right worker for the reducer is slow and painful
  - framework (runtime environment) handles scheduling, threading/synchronisation, distribution of data across the cluster, etc
  - oh yeah it uses HDFS

HE LIED, there are 2 more functions:
- partition takes in an intermediate key and information on the partitions, and outputs the partition that key will go to
  - this "splits up" the key space to define where the keys go; there's one partition per reducer
- combiner runs after map, and does something (a local mini-reduce on each mapper's output before it gets sent to the reducers, so less data moves around)

course website: lintool.github.io/bigdata-2018f

STAR WARS REFERENCES!!!
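
Rough sketch of the map / combine / partition / reduce pipeline from the notes above, as a single-process word count. This is only a toy to see how the four functions fit together, not how Hadoop or Spark actually run things: there is no cluster, no HDFS, and the "shuffle" is just a list of dicts. It's in Python rather than Scala to keep it short. All the names (`map_fn`, `combine_fn`, `partition_fn`, `reduce_fn`, `run_word_count`) are made up for this example, and the partitioner uses a sum of character codes as a stand-in for a real hash.

```python
from collections import defaultdict
from functools import reduce


def fold_sum(numbers):
    # fold: combine a running result with the next element until the list is done
    return reduce(lambda acc, x: acc + x, numbers, 0)


def map_fn(doc_id, text):
    # map: (key, value) -> list of (key, value) pairs
    return [(word, 1) for word in text.split()]


def combine_fn(pairs):
    # combiner: local mini-reduce over ONE mapper's output
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partition_fn(key, num_partitions):
    # partition: intermediate key -> partition index (one partition per reducer)
    # assumption: sum of character codes stands in for a stable hash function
    return sum(ord(c) for c in key) % num_partitions


def reduce_fn(key, values):
    # reduce: (key, list of values) -> list of (key, value) pairs
    return [(key, fold_sum(values))]


def run_word_count(docs, num_reducers=2):
    # "shuffle": one bucket per reducer, values grouped by key inside each bucket
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, text in docs.items():
        for key, value in combine_fn(map_fn(doc_id, text)):
            buckets[partition_fn(key, num_reducers)][key].append(value)

    # each bucket plays the role of one reducer; keys are processed in sorted order
    output = {}
    for bucket in buckets:
        for key in sorted(bucket):
            for out_key, out_value in reduce_fn(key, bucket[key]):
                output[out_key] = out_value
    return output


docs = {"d1": "a b a", "d2": "b c"}
counts = run_word_count(docs)
print(counts)
```

Because every occurrence of a key lands in the same bucket, each word is reduced in exactly one place. The combiner only changes how many pairs cross the "shuffle" (here `("a", 2)` instead of two `("a", 1)` pairs from `d1`), not the final counts.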