
more notes, some hints for a1

tarfeef101 · 6 years ago
commit f711d03393
2 changed files with 28 additions and 1 deletion
  1. sept_18 (+1 -1)
  2. sept_20 (+27 -0)

+ 1 - 1
sept_18

@@ -13,4 +13,4 @@ a "mapper" and a "reducer" are objects which have setup, "mapping/reducing" (cal
   you can override the "mapper" class to make your own cleanup and map/reduce functions
   this allows preserving state/storing intermediary values
     this is better than using combiners since these operate during mapping (in mem) vs combiners that operate on spills on disk
-IMPORTANT: Combiners are NOT guaranteed to run. so don't rely on them to provide necessary functionality
+IMPORTANT: Combiners are NOT guaranteed to run, so don't rely on them to provide necessary functionality (sketch below)
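+
+rough sketch of the in-mapper combining idea, in python with the mrjob library (illustrative only: the library choice and all class/function names here are mine, not the course's):
+
+    from mrjob.job import MRJob
+
+    class InMapperCombineWordCount(MRJob):
+        # word count with in-mapper combining: aggregate in memory while
+        # mapping, emit in the cleanup hook. unlike a combiner, these hooks
+        # are guaranteed to run, and they work on in-memory state, not spills
+
+        def mapper_init(self):
+            # setup hook: per-mapper state that lives across map() calls
+            self.counts = {}
+
+        def mapper(self, _, line):
+            for word in line.split():
+                self.counts[word] = self.counts.get(word, 0) + 1
+            if len(self.counts) > 100000:
+                # spill early if the dict outgrows memory
+                for pair in self.counts.items():
+                    yield pair
+                self.counts = {}
+
+        def mapper_final(self):
+            # cleanup hook: flush whatever is still buffered
+            for word, count in self.counts.items():
+                yield word, count
+
+        def reducer(self, word, counts):
+            yield word, sum(counts)
+
+    if __name__ == '__main__':
+        InMapperCombineWordCount.run()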

+ 27 - 0
sept_20

@@ -0,0 +1,27 @@
+XML is shit
+Users are fucking shit
+Yahoo actually used to be a thing. Remember that?!
+Oh god don't remind me of HIVE... Thanks for that.
+  apparently HIVE SQL actually does turn into mapreduce... i thought it was separate, more like an interpreter. cool beans
+Assignment 1 Hint: generating final answers often requires multiple mapreduce jobs
+
+Think of data processing as shoving data through a processing graph:
+  nodes are different operations
+  edges define the order of them
+  NOTE: This does assume a static input
+  step 1: we NEED a map function to do per-record transformations, and this is super ez to parallelize
+  step 2: we need somewhere to store and move intermediate results. this is kinda like a group by. if we can figure out how to address these values, we can easily reduce
+  as we can see from these basic steps, this is mapreduce. so that is a pretty basic system. let's see what more you can do (toy sketch of these phases after this list)
+  one limitation of mapreduce is:
+    we have to write the output of one mapreduce job to disk and read it back in to chain things. 3 maps in a row requires 3 jobs instead of, say, a single map-map-map-reduce job.
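+
+toy single-machine version of those two steps in python (names are made up, just to make the dataflow concrete):
+
+    from collections import defaultdict
+
+    def run_mapreduce(records, mapper, reducer):
+        # step 1: per-record transformation (trivially parallel)
+        mapped = [kv for record in records for kv in mapper(record)]
+        # step 2: group intermediate values by key (the "shuffle")
+        groups = defaultdict(list)
+        for key, value in mapped:
+            groups[key].append(value)
+        # then reduce each group independently
+        return {key: reducer(key, values) for key, values in groups.items()}
+
+    # word count, expressed as a mapper and a reducer:
+    counts = run_mapreduce(
+        ["a b a", "b c"],
+        mapper=lambda line: [(w, 1) for w in line.split()],
+        reducer=lambda key, values: sum(values),
+    )
+    # counts == {'a': 2, 'b': 2, 'c': 1}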
+  
+Spark: RDD=Resilient Distributed Dataset
+  has a lot more functions (transformations) and stuff than "map" and "reduce"
+  you can do mapreduce in spark. slides/google will give the transformations needed to perfectly replicate it (if that's even desired)
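+  e.g. word count as spark transformations (assuming a local pyspark setup; "input.txt"/"output" are placeholder paths):
+
+    from pyspark import SparkContext
+
+    sc = SparkContext("local", "wordcount")
+
+    counts = (sc.textFile("input.txt")                  # RDD of lines
+                .flatMap(lambda line: line.split())     # per-record transform (the "map" side)
+                .map(lambda word: (word, 1))
+                .reduceByKey(lambda a, b: a + b))       # shuffle + per-key combine (the "reduce" side)
+
+    # unlike chained mapreduce jobs, intermediate RDDs aren't written out between steps
+    counts.saveAsTextFile("output")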
+  
+
+A1 Stuff:
+  - he said "don't overthink". hopefully that means shit won't get weird
+  - most shit will need chained jobs (as mentioned above), so figure out how to chain them properly
+  - investigate how to load in side data with mapreduce (rough sketch of both hints below)
+  -