- XML is shit
- Users are fucking shit
- Yahoo actually used to be a thing. Remember that?!
- Oh god don't remind me of HIVE... Thanks for that.
- apparently Hive SQL actually does get compiled down into mapreduce jobs... i thought it was separate, more like an interpreter. cool beans
- Assignment 1 Hint: generating final answers often requires multiple mapreduce jobs
- Think of data processing as shoving data through a processing graph:
- nodes are different operations
- edges define the order of them
- NOTE: This does assume a static input
- step 1: we NEED a map function to do per-record transformations, and this is super ez to parallelize
- step 2: we need somewhere to store and move intermediate results. this is kinda like a group by. if we can figure out how to address these values by key, we can easily reduce them
- as we can see from these basic steps, this is mapreduce. so that is a pretty basic system. let's see what more you can do
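The two steps above can be sketched as a toy in-memory word count (a sketch only: a real MapReduce system distributes the map calls and shuffle across machines; the function names here are made up):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # step 1: per-record transformation -- emit (word, 1) pairs
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # per-key aggregation
    return (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    # map phase: trivially parallel, one call per record
    pairs = [kv for r in records for kv in map_fn(r)]
    # step 2: the "group by" -- shuffle intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # reduce phase: one call per key
    return [reduce_fn(k, (v for _, v in vs)) for k, vs in grouped]

print(mapreduce(["a b a", "b c"], map_fn, reduce_fn))
# [('a', 2), ('b', 2), ('c', 1)]
```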
- one limitation of mapreduce is:
- we have to write the output of one mapreduce job to disk and read it back in to chain things together. 3 maps in a row need 3 separate jobs, instead of one fused map-map-map-reduce job.
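One workaround for that limitation, sketched in plain Python (the helper name `chain_maps` is made up): fuse consecutive per-record map functions into a single map, so the pipeline runs as one job with no intermediate disk writes.

```python
def chain_maps(*map_fns):
    # fuse several per-record map functions into one, so a
    # map->map->map->reduce pipeline is a single job, not three
    def fused(record):
        out = [record]
        for f in map_fns:
            # each map takes one record and emits zero or more records
            out = [r2 for r in out for r2 in f(r)]
        return out
    return fused

lower = lambda r: [r.lower()]   # normalize case
split = lambda r: r.split()     # one record per word
tag   = lambda w: [(w, 1)]      # emit (word, 1) pairs for the reducer

fused = chain_maps(lower, split, tag)
print(fused("Hello Hello World"))
# [('hello', 1), ('hello', 1), ('world', 1)]
```

This only works because maps are per-record and stateless; anything needing a group-by still forces a shuffle (and, in classic MapReduce, a job boundary).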
-
- Spark: RDD=Resilient Distributed Dataset
- has a lot more functions (transformations) and stuff than "map" and "reduce"
- you can do mapreduce in spark. slides/google will give the transformations needed to perfectly replicate it (if that's even desired)
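To see what "mapreduce in spark" looks like, here is a toy in-memory stand-in for an RDD (not real Spark, just the same transformation names chained the same way, with no disk writes between steps):

```python
class ToyRDD:
    """Tiny in-memory stand-in for a Spark RDD, only to show how
    transformations chain. Real RDDs are partitioned across a cluster."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return ToyRDD(f(x) for x in self.data)
    def flatMap(self, f):
        return ToyRDD(y for x in self.data for y in f(x))
    def reduceByKey(self, f):
        # the shuffle + reduce of classic MapReduce, rolled into one transformation
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())
    def collect(self):
        return self.data

# word count -- the classic MapReduce job as a Spark-style chain
counts = (ToyRDD(["a b a", "b c"])
          .flatMap(str.split)
          .map(lambda w: (w, 1))
          .reduceByKey(lambda x, y: x + y)
          .collect())
print(sorted(counts))
# [('a', 2), ('b', 2), ('c', 1)]
```

`flatMap`, `map`, and `reduceByKey` are real Spark transformation names; everything else here is a toy.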
-
- A1 Stuff:
- - he said "don't overthink". hopefully that means shit won't get weird
- - most shit will need chained jobs (as mentioned above), so figure out how to chain them properly
- - investigate how to load in side data with mapreduce
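One common pattern for side data (worth checking against the assignment's actual mechanism, e.g. Hadoop's distributed cache or a Spark broadcast variable): ship a small lookup table to every mapper and join against it in the map phase. The table and helper below are hypothetical illustrations:

```python
# hypothetical side table: small enough to ship to every mapper
country_names = {"CA": "Canada", "US": "United States"}

def make_mapper(side):
    # close over the side data so each record is enriched during the map phase,
    # avoiding a second job just for the join
    def map_fn(record):
        code, count = record
        yield (side.get(code, "unknown"), count)
    return map_fn

mapper = make_mapper(country_names)
print(list(mapper(("CA", 3))))
# [('Canada', 3)]
```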