
more notes, some hints for a1

tarfeef101 · 6 years ago
commit f711d03393
2 changed files with 28 additions and 1 deletion
  1. sept_18 (+1 -1)
  2. sept_20 (+27 -0)

+ 1 - 1
sept_18

@@ -13,4 +13,4 @@ a "mapper" and a "reducer" are objects which have setup, "mapping/reducing" (cal
   you can override the "mapper" class to make your own cleanup and map/reduce functions
   this allows preserving state/storing intermediary values
     this is better than using combiners since these operate during mapping (in mem) vs combiners that operate on spills on disk
-IMPORTANT: Combiners are NOT guaranteed to run. so don't rely on them to provide necessary functionality
+IMPORTANT: Combiners are NOT guaranteed to run, so don't rely on them to provide necessary functionality (sketch below)
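+
+rough sketch of the in-mapper combining idea, in python with the mrjob library (illustrative only: the library choice and all class/function names here are mine, not the course's):
+
+    from mrjob.job import MRJob
+
+    class InMapperCombineWordCount(MRJob):
+        # word count with in-mapper combining: aggregate in memory while
+        # mapping, emit in the cleanup hook. unlike a combiner, these hooks
+        # are guaranteed to run, and they work on in-memory state, not spills
+
+        def mapper_init(self):
+            # setup hook: per-mapper state that lives across map() calls
+            self.counts = {}
+
+        def mapper(self, _, line):
+            for word in line.split():
+                self.counts[word] = self.counts.get(word, 0) + 1
+            if len(self.counts) > 100000:
+                # spill early if the dict outgrows memory
+                for pair in self.counts.items():
+                    yield pair
+                self.counts = {}
+
+        def mapper_final(self):
+            # cleanup hook: flush whatever is still buffered
+            for word, count in self.counts.items():
+                yield word, count
+
+        def reducer(self, word, counts):
+            yield word, sum(counts)
+
+    if __name__ == '__main__':
+        InMapperCombineWordCount.run()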

+ 27 - 0
sept_20

@@ -0,0 +1,27 @@
+XML is shit
+Users are fucking shit
+Yahoo actually used to be a thing. Remember that?!
+Oh god don't remind me of HIVE... Thanks for that.
+  apparently HIVE SQL actually does turn into mapreduce... i thought it was separate, more like an interpreter. cool beans
+Assignment 1 Hint: generating final answers often requires multiple mapreduce jobs
+
+Think of data processing as shoving data through a processing graph:
+  nodes are different operations
+  edges define the order of them
+  NOTE: This does assume a static input
+  step 1: we NEED a map function to do per-record transformations, and this is super ez to parallelize
+  step 2: we need somewhere to store and move intermediate results. this is kinda like a group by. if we can figure out how to address these values, we can easily reduce
+  as we can see from these basic steps, this is mapreduce. so that is a pretty basic system. let's see what more you can do (toy sketch of these phases after this list)
+  one limitation of mapreduce is:
+    we have to write the output of one mapreduce job to disk and read it back in to chain things. 3 maps in a row requires 3 jobs instead of, say, a single map-map-map-reduce job.
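+
+toy single-machine version of those two steps in python (names are made up, just to make the dataflow concrete):
+
+    from collections import defaultdict
+
+    def run_mapreduce(records, mapper, reducer):
+        # step 1: per-record transformation (trivially parallel)
+        mapped = [kv for record in records for kv in mapper(record)]
+        # step 2: group intermediate values by key (the "shuffle")
+        groups = defaultdict(list)
+        for key, value in mapped:
+            groups[key].append(value)
+        # then reduce each group independently
+        return {key: reducer(key, values) for key, values in groups.items()}
+
+    # word count, expressed as a mapper and a reducer:
+    counts = run_mapreduce(
+        ["a b a", "b c"],
+        mapper=lambda line: [(w, 1) for w in line.split()],
+        reducer=lambda key, values: sum(values),
+    )
+    # counts == {'a': 2, 'b': 2, 'c': 1}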
+  
+Spark: RDD=Resilient Distributed Dataset
+  has a lot more functions (transformations) and stuff than "map" and "reduce"
+  you can do mapreduce in spark. slides/google will give the transformations needed to perfectly replicate it (if that's even desired)
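+  e.g. word count as spark transformations (assuming a local pyspark setup; "input.txt"/"output" are placeholder paths):
+
+    from pyspark import SparkContext
+
+    sc = SparkContext("local", "wordcount")
+
+    counts = (sc.textFile("input.txt")                  # RDD of lines
+                .flatMap(lambda line: line.split())     # per-record transform (the "map" side)
+                .map(lambda word: (word, 1))
+                .reduceByKey(lambda a, b: a + b))       # shuffle + per-key combine (the "reduce" side)
+
+    # unlike chained mapreduce jobs, intermediate RDDs aren't written out between steps
+    counts.saveAsTextFile("output")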
+  
+
+A1 Stuff:
+  - he said "don't overthink". hopefully that means shit won't get weird
+  - most shit will need chained jobs (as mentioned above), so figure out how to chain them properly
+  - investigate how to load in side data with mapreduce (rough sketch of both hints below)
+  -