XML is shit

Users are fucking shit

Yahoo actually used to be a thing. Remember that?!

Oh god, don't remind me of Hive... Thanks for that.

    apparently Hive SQL actually does get compiled down into MapReduce jobs... I thought it was separate, more like an interpreter. cool beans
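
To make that concrete, here's a rough plain-Python sketch of how a GROUP BY aggregation could lower into a map phase plus a reduce phase (my own illustration of the idea, not actual Hive codegen; the query and field names are invented):

```python
# Roughly how a query like
#   SELECT word, COUNT(*) FROM docs GROUP BY word
# lowers into one MapReduce job. Illustrative only; real Hive
# plans are much more involved and may span several jobs.

def map_phase(row):
    # per-record: emit (group-by key, 1) for the COUNT(*)
    yield (row["word"], 1)

def reduce_phase(word, counts):
    # per-group: fold the partial counts for one key
    return (word, sum(counts))
```
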
Assignment 1 Hint: generating final answers often requires multiple MapReduce jobs


Think of data processing as shoving data through a processing graph:

    nodes are the different operations

    edges define the order between them

    NOTE: this does assume a static input
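
Just to make the graph picture concrete, a toy sketch (the operation names and data here are invented, not from the course):

```python
# Toy processing graph over a static input: nodes are operations,
# edges define the order they run in. Invented example for illustration.

nodes = {
    "read":   lambda _: ["ham", "spam", "eggs"],          # static source
    "upper":  lambda xs: [x.upper() for x in xs],         # per-record op
    "filter": lambda xs: [x for x in xs if x != "SPAM"],  # per-record op
}
edges = [("read", "upper"), ("upper", "filter")]

def run(nodes, edges):
    # push the data along the edges, one operation at a time
    data = None
    for name in ["read"] + [dst for _, dst in edges]:
        data = nodes[name](data)
    return data

print(run(nodes, edges))  # ['HAM', 'EGGS']
```
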
    step 1: we NEED a map function to do per-record transformations, and this is super easy to parallelize

    step 2: we need somewhere to store and move intermediate results. this is kinda like a group-by: if we can figure out how to address (key) these values, we can easily reduce them

    as we can see from these basic steps, this is MapReduce. so that's a pretty basic system. let's see what more you can do
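
A minimal single-process sketch of those two steps, using word count as the example (my own toy code, not course-provided):

```python
from collections import defaultdict

# step 1: per-record map -- embarrassingly parallel across records
def mapper(line):
    for word in line.split():
        yield (word, 1)

# step 2: group intermediate (key, value) pairs by key (the "shuffle"),
# then reduce each key's values independently
def group(pairs):
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets

def reducer(word, counts):
    return (word, sum(counts))

lines = ["the cat", "the dog"]
intermediate = (kv for line in lines for kv in mapper(line))
print([reducer(k, v) for k, v in group(intermediate).items()])
# [('the', 2), ('cat', 1), ('dog', 1)]
```
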
    one limitation of MapReduce:

    we have to write the output of one MapReduce job to disk and read it back in to chain things together. 3 maps in a row requires 3 separate jobs instead of, say, a single map-map-map-reduce job.
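
Here's a sketch of what that chaining overhead looks like: a tiny "driver" where every job must fully materialize its output on disk before the next one can read it (file layout and per-record logic invented for illustration):

```python
import json, os, tempfile

# Each "job" reads its whole input from disk and writes its whole
# output back to disk -- so chaining N jobs means N disk round-trips.

def run_job(in_path, out_path, fn):
    with open(in_path) as f:
        records = [json.loads(line) for line in f]
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(fn(rec)) + "\n")

work = tempfile.mkdtemp()
src, mid, out = (os.path.join(work, n) for n in ("in", "mid", "out"))
with open(src, "w") as f:
    f.write("1\n2\n3\n")

run_job(src, mid, lambda x: x * 10)  # job 1 (a map)
run_job(mid, out, lambda x: x + 1)   # job 2 can't start until job 1's
                                     # output is fully on disk
print(open(out).read())  # 11 / 21 / 31, one per line
```
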
Spark: RDD = Resilient Distributed Dataset

    has a lot more functions (transformations) and such than just "map" and "reduce"

    you can do MapReduce in Spark. the slides/Google will give the transformations needed to replicate it exactly (if that's even desired)
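
For example, the classic MapReduce word count comes out like this in Spark (a PySpark sketch; assumes a local SparkContext and that an input.txt exists):

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

counts = (
    sc.textFile("input.txt")               # assumed input path
      .flatMap(lambda line: line.split())  # the "map" side: emit words
      .map(lambda word: (word, 1))         # key the intermediate values
      .reduceByKey(lambda a, b: a + b)     # the "reduce" side, per key
)
print(counts.collect())
```

here reduceByKey is doing the work of the shuffle + per-key reduce from plain MapReduce, and the intermediate RDDs never have to hit disk between steps
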
A1 Stuff:

    - he said "don't overthink". hopefully that means shit won't get weird
    - most problems will need chained jobs (as mentioned above), so figure out how to chain them properly
    - investigate how to load side data in with MapReduce (see the sketch after this list)
    -
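
On the side-data point: one common pattern (with Hadoop Streaming, which may or may not be what we're using) is to ship a small file alongside the job with the generic `-files` option and load it once when the mapper starts. A sketch, with the file name and format invented:

```python
#!/usr/bin/env python3
# Streaming mapper sketch: join each input record against small side
# data. Assumes the job was launched with something like
#   hadoop jar hadoop-streaming.jar -files lookup.tsv -mapper mapper.py ...
# so lookup.tsv lands in the task's working directory.
import sys

# load the side data once, at mapper startup -- not per record
lookup = {}
with open("lookup.tsv") as f:
    for line in f:
        key, value = line.rstrip("\n").split("\t")
        lookup[key] = value

for line in sys.stdin:
    key = line.strip()
    # emit key \t enriched value; missing keys get a default
    print(f"{key}\t{lookup.get(key, 'UNKNOWN')}")
```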