
more notes, some hints for a1

tarfeef101, 6 years ago
commit f711d03393
2 changed files with 28 additions and 1 deletion

+ 1 - 1
sept_18

@@ -13,4 +13,4 @@ a "mapper" and a "reducer" are objects which have setup, "mapping/reducing" (cal
   you can override the "mapper" class to make your own cleanup and map/reduce functions
   this allows preserving state/storing intermediary values
    this is better than using combiners, since in-mapper combining operates in memory during the map phase, while combiners operate on spills already written to disk
-IMPORTANT: Combiners are NOT guaranteed to run. so don't rely on them to provide necessary functionality
+IMPORTANT: Combiners are NOT guaranteed to run, so don't rely on them to provide necessary functionality
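Since combiners may run zero, one, or many times, the combine function has to be safe to apply repeatedly (associative and commutative, same input/output shape as the reducer). A minimal single-machine sketch of that contract, using a hypothetical word-count (not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    # per-record transformation: emit (word, 1) pairs
    for word in line.split():
        yield word, 1

def combine_or_reduce(word, counts):
    # sum is associative and commutative, so this function is safe
    # to run zero, one, or many times before the final reduce
    yield word, sum(counts)

def run(lines, use_combiner):
    grouped = defaultdict(list)
    for line in lines:
        pairs = mapper(line)
        if use_combiner:
            # optional pre-aggregation of this mapper's local output
            local = defaultdict(list)
            for k, v in pairs:
                local[k].append(v)
            pairs = (kv for k, vs in local.items()
                     for kv in combine_or_reduce(k, vs))
        for k, v in pairs:
            grouped[k].append(v)
    # final reduce over the grouped values
    return dict(kv for k, vs in grouped.items()
                for kv in combine_or_reduce(k, vs))

lines = ["a b a", "b a"]
# same answer whether or not the combiner ran
assert run(lines, True) == run(lines, False) == {"a": 3, "b": 2}
```

If the combine step were not associative/commutative (say, averaging), skipping it or running it twice would change the answer — which is exactly why correctness can't depend on it.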

+ 27 - 0
sept_20

@@ -0,0 +1,27 @@
+XML is shit
+Users are fucking shit
+Yahoo actually used to be a thing. Remember that?!
+Oh god don't remind me of HIVE... Thanks for that.
+  apparently Hive SQL actually does get turned into mapreduce jobs... i thought it was separate, more like an interpreter. cool beans
+Assignment 1 Hint: generating final answers often requires multiple mapreduce jobs
+
+Think of data processing as shoving data through a processing graph:
+  nodes are different operations
+  edges define the order of them
+  NOTE: This does assume a static input
+  step 1: we NEED a map function to do per-record transformations, and this is super ez to parallelize
+  step 2: we need somewhere to store and move intermediate results. this is kinda like a group by. if we can figure out how to address these values, we can easily reduce them
+  as we can see from these basic steps, this is mapreduce. so that is a pretty basic system. let's see what more you can do
+  one limitation of mapreduce:
+    we have to write the output of one job to disk and read it back to chain things. 3 maps in a row requires 3 separate jobs, instead of one map->map->map->reduce job.
+  
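The two steps above can be sketched as a tiny single-machine pipeline — a toy stand-in for a real mapreduce framework, with chaining shown by feeding one job's output into the next (all names here are illustrative):

```python
from itertools import groupby

def run_job(records, map_fn, reduce_fn):
    # step 1: per-record transformation (trivially parallelizable)
    intermediate = [kv for r in records for kv in map_fn(r)]
    # step 2: "group by" on the key so each reduce call
    # sees all values for one key
    intermediate.sort(key=lambda kv: kv[0])
    out = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        out.extend(reduce_fn(key, [v for _, v in group]))
    return out

# job 1: word count
wc = run_job(
    ["to be or not to be"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda k, vs: [(k, sum(vs))],
)
assert dict(wc) == {"to": 2, "be": 2, "or": 1, "not": 1}

# job 2 chained on job 1's output: invert to (count, word)
# to rank words by frequency — in real mapreduce this means
# writing wc to disk and reading it back as a second job's input
ranked = run_job(
    wc,
    map_fn=lambda kv: [(-kv[1], kv[0])],
    reduce_fn=lambda k, vs: [(w, -k) for w in sorted(vs)],
)
assert ranked[0] == ("be", 2)
```

The disk round-trip between the two `run_job` calls is implicit here, but it is exactly the chaining cost the note above describes.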
+Spark: RDD=Resilient Distributed Dataset
+  has a lot more functions (transformations) and stuff than "map" and "reduce"
+  you can do mapreduce in spark. slides/google will give the transformations needed to perfectly replicate it (if that's even desired)
+  
+
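To see how mapreduce falls out of RDD transformations: `flatMap` and `reduceByKey` are real PySpark transformation names, but the class below is a toy in-memory imitation (the real RDD API is distributed and lazily evaluated; this is neither):

```python
from functools import reduce
from collections import defaultdict

class FakeRDD:
    """Toy, eager, in-memory stand-in for a Spark RDD — just enough
    to show mapreduce expressed as two RDD transformations."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # ~ the "map" phase: one record in, any number of records out
        return FakeRDD(x for item in self.data for x in f(item))

    def reduceByKey(self, f):
        # ~ shuffle + "reduce" phase: group by key, then fold values
        groups = defaultdict(list)
        for k, v in self.data:
            groups[k].append(v)
        return FakeRDD((k, reduce(f, vs)) for k, vs in groups.items())

    def collect(self):
        return self.data

# word count, mapreduce-style, in RDD vocabulary
counts = (FakeRDD(["to be or not to be"])
          .flatMap(lambda line: [(w, 1) for w in line.split()])
          .reduceByKey(lambda a, b: a + b)
          .collect())
assert dict(counts) == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In real Spark the chain stays in memory between transformations, which is precisely the mapreduce disk-round-trip limitation it avoids.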
+A1 Stuff:
+  - he said "don't overthink". hopefully that means shit won't get weird
+  - most shit will need chained jobs (as mentioned above), so figure out how to chain them properly
+  - investigate how to load in side data with mapreduce
+  -
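On the side-data point: in Hadoop, small side data is typically shipped to every mapper (e.g. via the distributed cache or job configuration) and loaded once in setup. A sketch of that pattern — the class and method names are illustrative, not the Hadoop API:

```python
class FilterMapper:
    """Side-data pattern: load a small lookup table once in setup(),
    before any records arrive, then consult it on every map() call."""
    def setup(self, stopwords):
        # in Hadoop, this is where you'd read the side file shipped
        # to each mapper; here it's just passed in directly
        self.stopwords = set(stopwords)

    def map(self, line):
        for word in line.split():
            if word not in self.stopwords:
                yield word, 1

m = FilterMapper()
m.setup(["to", "or", "not"])
assert list(m.map("to be or not to be")) == [("be", 1), ("be", 1)]
```

The key property: the side data is read once per mapper, not once per record, and it must be small enough to replicate to every worker.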