sept_20

  1. XML is shit
  2. Users are fucking shit
  3. Yahoo actually used to be a thing. Remember that?!
  4. Oh god don't remind me of HIVE... Thanks for that.
  5. apparently HIVE SQL actually does turn into mapreduce jobs... i thought it was separate, more like an interpreter. cool beans
  6. Assignment 1 Hint: generating final answers often requires multiple mapreduce jobs
  7. Think of data processing as shoving data through a processing graph:
  8. nodes are different operations
  9. edges define the order of them
  10. NOTE: This does assume a static input
  11. step 1: we NEED a map function to do per-record transformations, and this is super ez to parallelize
  12. step 2: we need somewhere to store and move intermediate results. this is kinda like a group by. if we can figure out how to address these values, we can easily to reduce
  13. as we can see from these basic steps, this is mapreduce. so that is a pretty basic system. let's see what more you can do
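steps 11-12 above can be sketched in plain python (just the dataflow, not a real framework -- the word-count data here is made up for illustration):

```python
from collections import defaultdict
from functools import reduce

records = ["to be or not to be", "be happy"]

# step 1: per-record map -> (key, value) pairs, trivially parallel
mapped = [(word, 1) for line in records for word in line.split()]

# step 2: shuffle -- group intermediate values by key (the "group by")
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# step 3: reduce each key's list of values down to one answer
counts = {key: reduce(lambda a, b: a + b, vals) for key, vals in groups.items()}

print(counts)  # {'to': 2, 'be': 3, 'or': 1, 'not': 1, 'happy': 1}
```

map + shuffle + reduce, which is exactly the mapreduce shape described above.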
  14. one limitation of mapreduce is:
  15. we have to write the output of one mapreduce job to disk and read it back in to chain things. 3 maps in a row requires 3 jobs instead of, say, one mapmapmapreduce job.
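the "mapmapmapreduce" point: since each map is per-record, consecutive maps can be fused into one pass instead of three disk-backed jobs. a toy sketch (the m1/m2/m3 functions are made up):

```python
# three per-record transforms
def m1(x): return x.strip()
def m2(x): return x.lower()
def m3(x): return x.split()

def fused_map(record):
    # compose the three maps: one pass over the data, no intermediate writes
    return m3(m2(m1(record)))

print(fused_map("  Hello World  "))  # ['hello', 'world']
```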
  16. Spark: RDD=Resilient Distributed Dataset
  17. has a lot more functions (transformations) and stuff than "map" and "reduce"
  18. you can do mapreduce in spark. slides/google will give the transformations needed to perfectly replicate it (if that's even desired)
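the usual spark pattern for replicating mapreduce is rdd.flatMap(mapper).reduceByKey(reducer) -- flatMap and reduceByKey are real spark transformations, but below is just a pure-python stand-in for the same semantics, not spark itself:

```python
def flat_map(f, data):
    # like spark's flatMap: apply f to each record, flatten the results
    return [y for x in data for y in f(x)]

def reduce_by_key(f, pairs):
    # like spark's reduceByKey: merge all values sharing a key with f
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

lines = ["spark does mapreduce", "mapreduce in spark"]
pairs = flat_map(lambda l: [(w, 1) for w in l.split()], lines)
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)  # {'spark': 2, 'does': 1, 'mapreduce': 2, 'in': 1}
```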
  19. A1 Stuff:
  20. - he said "don't overthink". hopefully that means shit won't get weird
  21. - most shit will need chained jobs (as mentioned above), so figure out how to chain them properly
  22. - investigate how to load in side data with mapreduce
  23. -