sept_6

  1. Big Data Notes:
  2. - climb mountains
  3. - especially everest
  4. - stalk everyone
  5. - MEMES
  6. - use more buzzwords on your resume
  7. - MapReduce, Spark, Flink, Pig, Dryad, Hive, NoSQL, Pregel, Giraph, Storm/Heron
  8. - Data Science, Data Analytics, Business Intelligence, Data Warehouses, Data Lakes
  9. - build more servers
  10. - remember cs 341
  11. - Data Intensive vs Compute Intensive: in data-intensive work the "difficulty" is having lots of data, not the computation being hard
  12. - Coarse vs Fine grained parallelism
  13. - Coarse means a bunch of shit doing the same thing
  14. - Fine means threads are not necessarily "clones" of each other
  15. - functional operation called fold (like map)
  16. - takes a function that combines A and B into C, and keeps applying it (result so far + next element) until the entire "list" is done
  17. - oh we're using scala. cool beans. probably because Spark
  18. - MapReduce just parallelizes map and reduce/fold. lol
  19. - programmer specifies a map and reduce function
  20. - map takes a key and value and produces a list of key-value pairs
  21. - reduce takes a key and a list of values and creates a list of pairs with a new key and values
  22. - all values with the same key get reduced on the same reducer (thread?)
  23. - grouping/moving outputs of maps to the right worker for the reducer is slow and shitty
  24. - framework (runtime environment) handles scheduling, threading/synchronisation, distribution of data across the cluster, etc
  25. - oh yeah it uses HDFS
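
A rough sketch of the fold and map/reduce ideas above, in plain Python on one machine (the course uses Scala; Python is just for illustration). The names map_fn, reduce_fn and run_mapreduce are made up for this sketch and are not any framework's API. The "shuffle" step here is a dictionary, standing in for the grouping/moving of map outputs that the real framework does across the cluster.

```python
from collections import defaultdict
from functools import reduce

# fold: combine the result so far with the next element until the list is done
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

def map_fn(key, value):
    """Hypothetical mapper: key is a doc id (unused), value is a line of text."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """Hypothetical reducer: one key plus all of its values -> list of pairs."""
    return [(key, sum(values))]

def run_mapreduce(records, map_fn, reduce_fn):
    # map phase: every input record produces a list of key-value pairs
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # shuffle phase: group all values that share a key
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # reduce phase: each key is reduced together with all of its values
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output

docs = [("doc1", "a b a"), ("doc2", "b c")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(total, counts)
```

In a real cluster the map calls and the per-key reduce calls run in parallel on different workers; this sketch only shows the data flow.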
  26. HE LIED there are 2 more functions:
  27. - partition takes in an intermediate key and information on the partitions, and outputs the partition that the key's pairs will go to
  28. - this "splits up" the key space to define where the keys go and how many reducers there will be
  29. - combiner runs after map, on the mapper's local output, and does a mini-reduce there so less data has to be moved to the reducers
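
A minimal sketch of what these two extra functions could look like, again in plain Python. The function names and the hash-mod-number-of-partitions scheme are assumptions for illustration, not the framework's actual code. zlib.crc32 is used only because Python's built-in hash() for strings changes between runs.

```python
from collections import defaultdict
import zlib

def partition(key, num_partitions):
    """Hypothetical partitioner: intermediate key -> partition (reducer) index."""
    # stable hash of the key, folded into the range [0, num_partitions)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def combine(pairs):
    """Hypothetical combiner: pre-sum counts on the mapper before the shuffle."""
    local = defaultdict(int)
    for k, v in pairs:
        local[k] += v
    return sorted(local.items())

# output of one mapper in a word count
map_output = [("a", 1), ("b", 1), ("a", 1)]

# combiner shrinks it locally: three pairs become two
combined = combine(map_output)

# partitioner decides which reducer each remaining pair is sent to
num_reducers = 3
buckets = defaultdict(list)
for k, v in combined:
    buckets[partition(k, num_reducers)].append((k, v))
```

The same key always lands in the same bucket, which is what guarantees all values for a key end up on one reducer. A combiner like this only works when the reduce operation can safely be applied to partial groups, as with sums.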
  30. course website: lintool.github.io/bigdata-2018f
  31. STAR WARS REFERENCES!!!