|
@@ -0,0 +1,34 @@
|
|
|
+Big Data Notes:
|
|
|
+- climb mountains
|
|
|
+ - especially everest
|
|
|
+- stalk everyone
|
|
|
+- MEMES
|
|
|
+- use more buzzwords on your resume
|
|
|
+ - MapReduce, Spark, Flink, Pig, Dryad, Hive, NoSQL, Pregel, Giraph, Storm/Heron
|
|
|
+ - Data Science, Data Analytics, Business Intelligence, Data Warehouses, Data Lakes
|
|
|
+- build more servers
|
|
|
+- remember cs 341
|
|
|
+- Data Intensive vs Compute Intensive: in data-intensive work the "difficulty" is the sheer amount of data, not hard computation
|
|
|
+- Coarse vs Fine grained parallelism
|
|
|
+ - Coarse means a bunch of threads all doing the same thing
|
|
|
+ - Fine means threads are not necessarily "clones" of each other
|
|
|
+- functional operation called fold (like map)
|
|
|
+ - takes a function that combines A and B to produce C, and keeps applying it to the running result and the next element until the entire "list" is done
|
|
|
+- oh we're using Scala. cool beans. probably because of Spark
|
|
|
+- MapReduce just parallelizes map and reduce/fold. lol
|
|
|
+ - programmer specifies a map and reduce function
|
|
|
+ - map takes a key and value and produces some list of key-value pairs
|
|
|
+ - reduce takes a key and a list of values and creates a list of pairs with a new key and values
|
|
|
+ - all values with the same key get reduced on the same reducer (thread?)
|
|
|
+ - grouping/moving outputs of maps to the right worker for the reducer is slow and shitty
|
|
|
+ - framework (runtime environment) handles scheduling, threading/synchronisation, distribution of data across the cluster, etc
|
|
|
+ - oh yeah it uses HDFS
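A minimal single-process sketch of the fold and map/reduce flow from the notes above, written in Python rather than the course's Scala. The names `fold`, `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical and not part of Hadoop or Spark; a real framework would run these steps across a cluster and read input from HDFS. Word count is assumed as the example job.

```python
from collections import defaultdict
from functools import reduce


def fold(fn, items, initial):
    # fold: start from `initial`, then combine the running result
    # with one element at a time until the list is done.
    return reduce(fn, items, initial)


def map_fn(key, value):
    # key: a record id (unused here), value: a line of text.
    # Emits one (word, 1) pair per word.
    return [(word, 1) for word in value.split()]


def reduce_fn(key, values):
    # Receives one key and all of its values; emits a list of pairs.
    return [(key, fold(lambda acc, v: acc + v, values, 0))]


def run_mapreduce(records, map_fn, reduce_fn):
    # "Shuffle": group every mapper output by key so that all values
    # for the same key reach the same reduce call.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)

    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


records = [(0, "a b a"), (1, "b c")]
counts = run_mapreduce(records, map_fn, reduce_fn)
print(counts)
```

Everything here happens in one loop in one process, so the grouping step is cheap; in the distributed setting that same step means moving data between machines, which is why the notes call it slow.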
|
|
|
+
|
|
|
+HE LIED there are 2 more functions:
|
|
|
+ - partition takes in an intermediate key and information on the partitions, and outputs the partition that key will go to
|
|
|
+ - this "splits up" the key space to define where the keys go and how many reducers there will be
|
|
|
+ - combiner runs after map, and does local aggregation of that mapper's output before it gets sent to the reducers (basically a mini-reduce)
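A rough sketch of the two extra functions, again in Python with hypothetical names (`partition`, `combine`). It assumes a hash-modulo partitioner, which is a common default but only one possible choice, and assumes the combiner sums counts locally the way a word-count reducer would.

```python
import zlib
from collections import Counter


def partition(key, num_partitions):
    # Assumed scheme: stable hash of the key, modulo the number of
    # partitions. The same key always lands in the same partition,
    # so one reducer sees all of its values.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    # Local aggregation on a single mapper's output, so fewer pairs
    # have to be moved across the network.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapper_output = [("a", 1), ("b", 1), ("a", 1)]
combined = combine(mapper_output)

# Route each combined pair to the bucket for its reducer.
num_reducers = 2
buckets = {}
for key, value in combined:
    buckets.setdefault(partition(key, num_reducers), []).append((key, value))
print(combined, buckets)
```

The combiner only helps when the reduce operation can safely be applied in pieces, as summing can; the partition count passed in is what fixes how many reducers there are.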
|
|
|
+
|
|
|
+course website: lintool.github.io/bigdata-2018f
|
|
|
+
|
|
|
+STAR WARS REFERENCES!!!
|