sept_18

  1. Do assignment 0. I missed the first 5 mins because that's what they were about
  2. map, reduce, combine, partition: how to express any problem in terms of these functions
  3. think of it as a design pattern
  4. you can make things run faster by avoiding communication between nodes
  5. use combiners to cut down on this communication!
  6. in the algorithm, spill files (disk storage holding mapper output) are sorted!
  7. spills are then merged together (the number of spills that build up before merging starts is configurable)
  8. combiners may be run during this merge
  9. when a reducer fetches the segments of the spills that match its keys from each mapper, it needs to combine them again
  10. so, sort/combine is done again here
  11. you can put your combining logic into the mapper itself, and that is fine. you don't need to "emit" on each iteration of your loop over inputs; you can emit as many times as you want, whenever you want
  12. a "mapper" and a "reducer" are objects which have setup, "mapping/reducing" (map is called once per input key/value pair, reduce once per key with all of its values), and cleanup methods
  13. you can subclass the "mapper" (or "reducer") class and override its setup, map/reduce, and cleanup methods
  14. this allows preserving state/storing intermediate values between calls
  15. this is better than using combiners, since the aggregation happens during mapping (in memory), whereas combiners operate on spills that have already been written to disk
  16. IMPORTANT: combiners are NOT guaranteed to run, so don't rely on them to provide necessary functionality
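
A rough sketch of the in-mapper combining idea from the notes above, in plain Python rather than the real Hadoop API. The class names, the `emit` callback, and the `run_mapper` driver are all made up for illustration; the only thing taken from the lecture is the setup / map / cleanup lifecycle and the point that a mapper may hold state and emit whenever it likes.

```python
from collections import defaultdict


class NaiveWordCountMapper:
    """Hypothetical mapper: emits (word, 1) for every word it sees."""

    def setup(self):
        pass

    def map(self, key, value, emit):
        for word in value.split():
            emit(word, 1)

    def cleanup(self, emit):
        pass


class InMapperCombiningMapper:
    """Hypothetical mapper: sums counts in memory and emits once, in cleanup."""

    def setup(self):
        # State that survives across map() calls.
        self.counts = defaultdict(int)

    def map(self, key, value, emit):
        # No emit here; just update the in-memory partial sums.
        for word in value.split():
            self.counts[word] += 1

    def cleanup(self, emit):
        # Emit one pair per distinct word after all input has been seen.
        for word, count in self.counts.items():
            emit(word, count)


def run_mapper(mapper, records):
    """Toy driver (an assumption, not Hadoop): setup, map per record, cleanup."""
    emitted = []

    def emit(key, value):
        emitted.append((key, value))

    mapper.setup()
    for key, value in records:
        mapper.map(key, value, emit)
    mapper.cleanup(emit)
    return emitted


def reduce_counts(pairs):
    """Toy reducer: sums all values for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


records = [(0, "a b a"), (1, "b a c")]
naive = run_mapper(NaiveWordCountMapper(), records)
combined = run_mapper(InMapperCombiningMapper(), records)
```

Both mappers lead to the same reduced totals, but the second one hands far fewer pairs to the sort/spill/shuffle stages, and it does not depend on a combiner being run. The trade-off in this sketch is that `self.counts` has to fit in memory.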