- XML is shit
- Users are fucking shit
- Yahoo actually used to be a thing. Remember that?!
- Oh god don't remind me of HIVE... Thanks for that.
- apparently Hive SQL actually does get compiled down into mapreduce jobs... i thought it was separate, more like an interpreter. cool beans
- Assignment 1 Hint: generating final answers often requires multiple mapreduce jobs
- Think of data processing as shoving data through a processing graph:
- nodes are different operations
- edges define the order of them
- NOTE: This does assume a static input
- step 1: we NEED a map function to do per-record transformations, and this is super ez to parallelize
- step 2: we need somewhere to store and move intermediate results. this is kinda like a group by. if we can figure out how to address these values by key, we can easily reduce them
- as we can see from these basic steps, this is mapreduce. so that is a pretty basic system. let's see what more you can do
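The two steps above can be sketched as a toy in-memory word count (a sketch only: a real MapReduce system distributes the map calls and shuffle across machines; the function names here are made up):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # step 1: per-record transformation -- emit (word, 1) pairs
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # per-key aggregation
    return (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    # map phase: trivially parallel, one call per record
    pairs = [kv for r in records for kv in map_fn(r)]
    # step 2: the "group by" -- shuffle intermediate pairs by key
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # reduce phase: one call per key
    return [reduce_fn(k, (v for _, v in vs)) for k, vs in grouped]

print(mapreduce(["a b a", "b c"], map_fn, reduce_fn))
# [('a', 2), ('b', 2), ('c', 1)]
```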
- one limitation of mapreduce is:
- we have to write the output of one mapreduce job to disk and read it back in to chain things together. 3 maps in a row need 3 separate jobs, instead of one fused map-map-map-reduce job.
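One workaround for that limitation, sketched in plain Python (the helper name `chain_maps` is made up): fuse consecutive per-record map functions into a single map, so the pipeline runs as one job with no intermediate disk writes.

```python
def chain_maps(*map_fns):
    # fuse several per-record map functions into one, so a
    # map->map->map->reduce pipeline is a single job, not three
    def fused(record):
        out = [record]
        for f in map_fns:
            # each map takes one record and emits zero or more records
            out = [r2 for r in out for r2 in f(r)]
        return out
    return fused

lower = lambda r: [r.lower()]   # normalize case
split = lambda r: r.split()     # one record per word
tag   = lambda w: [(w, 1)]      # emit (word, 1) pairs for the reducer

fused = chain_maps(lower, split, tag)
print(fused("Hello Hello World"))
# [('hello', 1), ('hello', 1), ('world', 1)]
```

This only works because maps are per-record and stateless; anything needing a group-by still forces a shuffle (and, in classic MapReduce, a job boundary).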
-
- Spark: RDD=Resilient Distributed Dataset
- has a lot more functions (transformations) and stuff than "map" and "reduce"
- you can do mapreduce in spark. slides/google will give the transformations needed to perfectly replicate it (if that's even desired)
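To see what "mapreduce in spark" looks like, here is a toy in-memory stand-in for an RDD (not real Spark, just the same transformation names chained the same way, with no disk writes between steps):

```python
class ToyRDD:
    """Tiny in-memory stand-in for a Spark RDD, only to show how
    transformations chain. Real RDDs are partitioned across a cluster."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return ToyRDD(f(x) for x in self.data)
    def flatMap(self, f):
        return ToyRDD(y for x in self.data for y in f(x))
    def reduceByKey(self, f):
        # the shuffle + reduce of classic MapReduce, rolled into one transformation
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())
    def collect(self):
        return self.data

# word count -- the classic MapReduce job as a Spark-style chain
counts = (ToyRDD(["a b a", "b c"])
          .flatMap(str.split)
          .map(lambda w: (w, 1))
          .reduceByKey(lambda x, y: x + y)
          .collect())
print(sorted(counts))
# [('a', 2), ('b', 2), ('c', 1)]
```

`flatMap`, `map`, and `reduceByKey` are real Spark transformation names; everything else here is a toy.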
-
- A1 Stuff:
- - he said "don't overthink". hopefully that means shit won't get weird
- - most shit will need chained jobs (as mentioned above), so figure out how to chain them properly
- - investigate how to load in side data with mapreduce
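One common pattern for side data (worth checking against the assignment's actual mechanism, e.g. Hadoop's distributed cache or a Spark broadcast variable): ship a small lookup table to every mapper and join against it in the map phase. The table and helper below are hypothetical illustrations:

```python
# hypothetical side table: small enough to ship to every mapper
country_names = {"CA": "Canada", "US": "United States"}

def make_mapper(side):
    # close over the side data so each record is enriched during the map phase,
    # avoiding a second job just for the join
    def map_fn(record):
        code, count = record
        yield (side.get(code, "unknown"), count)
    return map_fn

mapper = make_mapper(country_names)
print(list(mapper(("CA", 3))))
# [('Canada', 3)]
```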