tarfeef101/cs451-notes

Big Data Notes:
- climb mountains
  - especially everest
- stalk everyone
- MEMES
- use more buzzwords on your resume
  - MapReduce, Spark, Flink, Pig, Dryad, Hive, NoSQL, Pregel, Giraph, Storm/Heron
  - Data Science, Data Analytics, Business Intelligence, Data Warehouses, Data Lakes
- build more servers
- remember cs 341
- Data Intensive vs Compute Intensive: data intensive seems to mean the "difficulty" is having lots of data, not the computation being hard work
- Coarse vs Fine grained parallelism
  - Coarse means a bunch of shit doin the same thing
  - Fine means threads are not necessarily "clones" of each other
- functional operation called fold (like map)
  - takes a function on A + B that produces C (running result + next element gives new running result), and keeps applying it until the entire "list" is done
- oh we're using scala. cool beans. probably because Spark
- MapReduce just parallelizes map and reduce/fold. lol
  - programmer specifies a map and a reduce function
    - map takes a key and a value and produces a list of key-value pairs
    - reduce takes a key and the list of all values for that key, and produces a list of pairs with a new key and new values
  - all values with the same key get reduced on the same reducer (a reduce task on some worker, not just a thread)
    - grouping/moving the map outputs to the right worker for the reducer (the shuffle) is slow and shitty
  - framework (runtime environment) handles scheduling, threading/synchronisation, distribution of data across the cluster, etc
  - oh yeah it uses HDFS
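
To make the fold and map/reduce bullets above concrete, here is a minimal single-machine sketch in Python (the course itself uses Scala, so this is only an illustration). Everything in it is an assumption made up for the example: the names `fold`, `map_fn`, `reduce_fn`, and `run_mapreduce`, and the word-count task. A real framework would run the map and reduce calls on different workers and read its input from HDFS; here the "shuffle" is just a dictionary.

```python
from collections import defaultdict
from functools import reduce


def fold(f, init, xs):
    # fold: start from init, combine the running result with each element in turn
    return reduce(f, xs, init)


# Hypothetical user-supplied functions for a word count job.
def map_fn(doc_id, text):
    # one (key, value) pair in, a list of (key, value) pairs out
    return [(word, 1) for word in text.split()]


def reduce_fn(word, counts):
    # one key plus all of its values in, a list of (key, value) pairs out
    return [(word, fold(lambda acc, c: acc + c, 0, counts))]


def run_mapreduce(records, map_fn, reduce_fn):
    # map phase: apply map_fn to every input record
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # "shuffle": group all values that share a key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # reduce phase: each key is reduced exactly once, with all of its values
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


docs = [("d1", "a b a"), ("d2", "b c")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)
```

The grouping step in the middle is the part the framework does for you, and in a real cluster it is the expensive one, since it moves data between machines.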

HE LIED there are 2 more functions:
  - partition takes in an intermediate key and information on the partitions, and outputs the partition that key will go to
    - this "splits up" the key space to define where the keys go and how many reducers there will be
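  - combiner runs after map, on the map side, and pre-aggregates the map output locally (like a mini reduce) so less data has to be moved to the reducers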
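
A rough Python sketch of what these two extra functions might look like for word count. The names `partition` and `combine`, the use of a CRC32 hash, and the choice of 3 reducers are all assumptions for illustration, not the framework's actual API. Hashing the key modulo the number of partitions is a common default strategy, but any function that sends equal keys to the same partition would do.

```python
import zlib
from collections import Counter


def partition(key, num_partitions):
    # Use a stable hash so the same key always lands on the same partition.
    # (Python's built-in hash() of a str changes between runs, so it is avoided here.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    # Local pre-aggregation of a single mapper's output, before anything is shuffled.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


# Output of one hypothetical mapper for the text "a b a".
mapper_output = [("a", 1), ("b", 1), ("a", 1)]
combined = combine(mapper_output)  # 3 pairs shrink to 2 before the shuffle

# Decide which reducer each combined pair is sent to.
num_reducers = 3
buckets = {p: [] for p in range(num_reducers)}
for word, count in combined:
    buckets[partition(word, num_reducers)].append((word, count))

print(combined)
print(buckets)
```

Summing counts works as a combiner because addition gives the same total whether it is applied in one go or in stages; an operation like averaging could not be reused this way without changing what gets emitted.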

course website: lintool.github.io/bigdata-2018f

STAR WARS REFERENCES!!!