Basically work, but in class.

MapReduce Stuff:

GFS (the Google File System, built for Google's MapReduce, as opposed to Hadoop):
- files are broken into 64MB chunks
- each chunk is replicated to at least 3 places
- a single master coordinates everything
- no caching, since everything is serial/sequential reads/writes
- the API is simplified compared to, say, POSIX
- Hadoop (HDFS) is essentially this, but probably worse

HDFS:
- datanodes store their blocks as files on their own local filesystems (Linux; presumably any actual fs works?)
- the namenode/master node(s) holds the mapping of files to chunks (the fs tree, basically)
- the namenode does NOT hold or transfer data through itself; rather, an edge node asks the namenode for a file, the namenode says which datanodes to ask, and the edge node then reads directly from those datanodes (see the HDFS read sketch after these notes)
- there is also a JobTracker (JT), which coordinates/manages MapReduce jobs (as opposed to managing HDFS)
- each physical machine runs a datanode, as discussed before, plus a TaskTracker, which is basically the compute node
- the TaskTracker and datanode are daemons, so these are long-running services
- this is all WITHOUT YARN queues

Running Jobs:
- the number of mapper nodes in MapReduce is usually equal to the number of nodes holding chunks, since that's where a map task can run
- this means data (usually) doesn't move around for the map phase, since the host of a chunk does the map for that part of the data (see the word count sketch after these notes)
- individual map tasks are single-core jobs, so you can run up to n of them on n cores, as long as you have that many chunks
- if you have "extra compute room", maps for the same chunk will run in multiple locations and the first completion is used (speculative execution)
- when key/value pairs don't fit into the 64MB chunks perfectly (an entire "piece" of data straddles 2 chunks), a "remote read" fetches the missing part from another node

Distributed Group By:
- mappers write their output into a "circular buffer" in memory; when it fills, the contents are spilled to disk to open up space, and each spill is written in sorted order (by key)
- reducers copy what they need from the mappers holding that data, and the sorted spills are merged, again on disk (presumably the reduce itself then works in memory)
- when the spills are merged on the mapper, they are also combined; the same combining happens again on the reducer when it "imports" the data (see the sort-and-spill sketch after these notes)
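HDFS read sketch: a minimal example using the standard Hadoop FileSystem client, to make the read path above concrete. Opening the file only asks the namenode for metadata (which datanodes hold which blocks); the bytes themselves stream from the datanodes. The path /data/input/part-00000 is a made-up example, not something from the notes.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/input/part-00000");  // hypothetical path, for illustration only

    // Metadata from the namenode: which datanodes hold each block of the file.
    FileStatus status = fs.getFileStatus(path);
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(block);
    }

    // The actual bytes come straight from the datanodes, not through the namenode.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}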
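Word count sketch: the stock tutorial job against the Hadoop MapReduce API, not something specific to the lecture, just to tie "Running Jobs" to actual code. The framework creates one map task per input chunk/split and tries to schedule it on a node holding that chunk; setting a combiner is what triggers the "combine while merging spills" step mentioned under Distributed Group By.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // One mapper instance runs per input split, ideally on the node holding that chunk.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1); these land in the circular buffer
      }
    }
  }

  // Used both as combiner (during spill merges) and as the final reducer.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine on map output before/while spilling
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}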
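Sort-and-spill sketch: a toy model of the map-side buffering described under Distributed Group By. This is NOT Hadoop's actual internal buffer (the real one is a byte-array circular buffer with a background spill thread); the class name, the tab-separated record format, and the tiny 4-record limit are all invented for illustration. Records are sorted before each spill, and the sorted spill files are merged with a k-way merge at the end, the same kind of merge a reducer performs after copying sorted map output.

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class SortAndSpill {
  private static final int BUFFER_LIMIT = 4;                 // tiny limit so spills happen in the demo
  private final List<String[]> buffer = new ArrayList<>();   // in-memory (key, value) records
  private final List<Path> spills = new ArrayList<>();       // spill files already written to disk

  void emit(String key, String value) throws IOException {
    buffer.add(new String[]{key, value});
    if (buffer.size() >= BUFFER_LIMIT) {
      spill();                                               // buffer full: sort and write to disk
    }
  }

  private void spill() throws IOException {
    buffer.sort(Comparator.comparing((String[] kv) -> kv[0]));  // sort by key before writing
    Path file = Files.createTempFile("spill", ".txt");
    try (BufferedWriter w = Files.newBufferedWriter(file)) {
      for (String[] kv : buffer) {
        w.write(kv[0] + "\t" + kv[1]);
        w.newLine();
      }
    }
    spills.add(file);
    buffer.clear();
  }

  // k-way merge of the sorted spill files into one globally sorted stream.
  void mergeTo(Writer out) throws IOException {
    if (!buffer.isEmpty()) {
      spill();                                               // flush whatever is still in memory
    }
    List<BufferedReader> readers = new ArrayList<>();
    PriorityQueue<String[]> heap =                           // entries are [line, readerIndex], ordered by key
        new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0].split("\t", 2)[0]));
    for (int i = 0; i < spills.size(); i++) {
      BufferedReader r = Files.newBufferedReader(spills.get(i));
      readers.add(r);
      String line = r.readLine();
      if (line != null) {
        heap.add(new String[]{line, Integer.toString(i)});
      }
    }
    while (!heap.isEmpty()) {
      String[] entry = heap.poll();
      out.write(entry[0] + "\n");                            // smallest remaining key across all spills
      String next = readers.get(Integer.parseInt(entry[1])).readLine();
      if (next != null) {
        heap.add(new String[]{next, entry[1]});
      }
    }
    for (BufferedReader r : readers) {
      r.close();
    }
  }

  public static void main(String[] args) throws IOException {
    SortAndSpill s = new SortAndSpill();
    String[] words = {"the", "quick", "brown", "fox", "the", "lazy", "dog", "the"};
    for (String w : words) {
      s.emit(w, "1");                                        // like a mapper emitting (word, 1)
    }
    Writer out = new PrintWriter(System.out, true);
    s.mergeTo(out);                                          // merged output is sorted by key
    out.flush();
  }
}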