XML is shit. Users are shit. Yahoo actually used to be a thing — remember that?! Oh god, don't remind me of HIVE... thanks for that. Apparently Hive SQL actually does get turned into MapReduce jobs — I thought it was separate, more like an interpreter. Cool beans.

Assignment 1 hint: generating final answers often requires multiple MapReduce jobs.

Think of data processing as shoving data through a processing graph:
- nodes are the different operations
- edges define the order they run in
NOTE: this does assume a static input.

Step 1: we NEED a map function to do per-record transformations, and this is super easy to parallelize.
Step 2: we need somewhere to store and move intermediate results. This is kind of like a GROUP BY: if we can figure out how to address (key) these values, we can easily reduce them.

As we can see from these basic steps, this is MapReduce. So that's a pretty basic system — let's see what more you can do.

One limitation of MapReduce: you have to write the output of one job to disk and read it back in to chain things together. Three maps in a row requires three separate jobs instead of, say, one map-map-map-reduce job.

Spark: RDD = Resilient Distributed Dataset. It has a lot more functions (transformations) than just "map" and "reduce". You can do MapReduce in Spark — the slides/Google will give the transformations needed to perfectly replicate it (if that's even desired).

A1 stuff:
- he said "don't overthink", so hopefully that means it won't get weird
- most of it will need chained jobs (as mentioned above), so figure out how to chain them properly
- investigate how to load side data with MapReduce
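The two steps (per-record map, then grouped reduce) can be sketched as a toy single-machine word count in plain Python. This is just a sketch of the idea, not real Hadoop — `run_mapreduce`, `wc_map`, and `wc_reduce` are made-up names for illustration:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-machine MapReduce: map each record to (key, value)
    pairs, shuffle (group by key), then reduce each group."""
    # Map phase: per-record transformation, trivially parallelizable
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            # Shuffle phase: address intermediate values by key (the GROUP BY part)
            shuffled[key].append(value)
    # Reduce phase: one reduce call per key over its grouped values
    return {key: reduce_fn(key, values) for key, values in shuffled.items()}

# Word count, the classic example
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

result = run_mapreduce(["the cat sat", "the cat"], wc_map, wc_reduce)
print(result)  # {'the': 2, 'cat': 2, 'sat': 1}
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network, but the addressing-by-key idea is the same.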