sept 6th notes. lots of memes, some stuff about mapreduce

tarfeef101 6 years ago
commit
a19ab2cda3
1 changed files with 34 additions and 0 deletions
sept_6

@@ -0,0 +1,34 @@
+Big Data Notes:
+- climb mountains
+  - especially everest
+- stalk everyone
+- MEMES
+- use more buzzwords on your resume
+  - MapReduce, Spark, Flink, Pig, Dryad, Hive, noSQL, Pregel, Giraph, Storm/Heron
+  - Data Science, Data Analytics, Business Intelligence, Data Warehouses, Data Lakes
+- build more servers
+- remember cs 341
+- Data Intensive vs Compute Intensive: in data intensive work the "difficulty" seems to be the sheer amount of data, not hard computation
+- Coarse vs Fine grained parallelism
+  - Coarse means a bunch of shit doin the same thing
+  - Fine means threads are not necessarily "clones" of each other
+- functional operation called fold (like map)
+  - takes a function of A and B that produces C, then feeds C back in with the next element until the entire "list" is done
+- oh we're using scala. cool beans. probably because Spark
+- MapReduce just parallelizes map and reduce/fold. lol
+  - programmer specifies a map and a reduce function
+    - map takes a key and a value and produces a list of key-value pairs
+    - reduce takes a key and a list of values and creates a list of pairs with a new key and values
+  - all values with the same key get reduced on the same reducer (thread?)
+    - grouping/moving outputs of maps to the right worker for the reducer is slow and shitty
+  - framework (runtime environment) handles scheduling, threading/synchronisation, distribution of data across the cluster, etc
+  - oh yeah it uses HDFS
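A rough sketch of the fold idea from the notes above, in Python rather than the course's Scala. The helper name `fold` and the sample inputs are made up for illustration; the standard library's `functools.reduce` already does the same job.

```python
from functools import reduce


def fold(func, initial, items):
    """Left fold: combine items one at a time into a running accumulator."""
    acc = initial
    for item in items:
        # func takes (accumulator, element) and returns the new accumulator
        acc = func(acc, item)
    return acc


# Sum a list: the accumulator starts at 0 and absorbs one element per step
total = fold(lambda acc, x: acc + x, 0, [1, 2, 3, 4])

# Keep the longest word seen so far: the accumulator is a string here
longest = fold(lambda acc, w: w if len(w) > len(acc) else acc, "",
               ["map", "reduce", "fold"])

# The standard library version gives the same result as the hand-rolled one
stdlib_total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
```

Unlike map, where every element is handled independently, each fold step depends on the result of the previous one.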
+
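The map / group-by-key / reduce flow described in the notes can be sketched in a single process as below. This is only a toy: a real framework spreads these phases across a cluster and reads from HDFS. The names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the word-count task itself, are assumptions picked for illustration.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: one (key, value) record in, a list of (key, value) pairs out."""
    return [(word, 1) for word in text.lower().split()]


def reduce_fn(word, counts):
    """Reduce: one key plus all of its values in, a list of pairs out."""
    return [(word, sum(counts))]


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: every input record is processed independently
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: bring all values that share a key together
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reduce call per distinct key
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


docs = [("d1", "a long time ago"), ("d2", "a galaxy far far away")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The shuffle loop is the step the notes call slow: on a cluster it means moving map outputs over the network to whichever worker owns each key.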
+HE LIED there are 2 more functions:
+  - partition takes in an intermediate key and information on the partitions, and outputs the partition that key will go to
+    - this "splits up" the key space to define where the keys go and how many reducers there will be
+  - combiner runs after map, on the mapper's own output, and does a local "mini reduce" so less data has to be moved to the reducers
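A minimal sketch of what these two extra functions might look like, assuming a word-count style job where values are counts. The function names, the choice of CRC32 as the hash, and the sample data are all assumptions for illustration, not the framework's actual API.

```python
import zlib
from collections import defaultdict


def partition(key, num_partitions):
    """Pick which reducer (0 .. num_partitions - 1) an intermediate key goes to."""
    # CRC32 is used because Python's built-in hash() of a str changes between runs
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    """Pre-aggregate one mapper's output locally by summing counts per key."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())


# Output of a single (hypothetical) mapper before anything crosses the network
mapper_output = [("far", 1), ("away", 1), ("far", 1), ("far", 1)]
combined = combine(mapper_output)

# Route each combined pair to its reducer's bucket
num_reducers = 3
buckets = defaultdict(list)
for key, value in combined:
    buckets[partition(key, num_reducers)].append((key, value))
```

Here the combiner shrinks four pairs down to two before the shuffle, and because the partition depends only on the key, every mapper sends a given key to the same reducer.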
+
+course website: lintool.github.io/bigdata-2018f
+
+STAR WARS REFERENCES!!!