- Basically work. But in class.
- MapReduce Stuff:
- GFS (the Google File System, the storage layer under Google's MapReduce, as opposed to Hadoop's HDFS):
- files are broken into 64MB chunks
- each chunk is replicated to at least 3 places
- a single master coordinates everything
- no client caching, since workloads are mostly large sequential reads/writes
- the API is simplified compared to, say, POSIX (chunking/replication sketched below)
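A minimal sketch of the chunking-plus-replication idea, assuming a toy random placement (real GFS also weighs racks and disk usage; all names here are invented):

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the GFS chunk size

def place_chunks(file_size, datanodes, replicas=3):
    """Build a master-style map: chunk index -> the nodes holding it."""
    n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceil(size / 64MB)
    return {i: random.sample(datanodes, replicas) for i in range(n_chunks)}

# a 200MB file becomes 4 chunks (3 full + 1 partial), each on 3 of 5 nodes
print(place_chunks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```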
- Hadoop (HDFS) is a reimplementation of this design, but probably worse
- DataNodes store blocks as regular files on their own local filesystems (Linux, presumably any actual fs?)
- the NameNode (master node(s)) holds the mapping of the chunks (the fs tree, basically)
- it does NOT hold/transfer data through itself
- rather, edge nodes ask the NameNode for a file, the NameNode replies with the right DataNodes, and the edge node then reads directly from those DataNodes (sketched below)
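A toy sketch of that read path; the class and method names are hypothetical stand-ins, not the real HDFS client API:

```python
class NameNode:
    """Holds only metadata: which DataNodes have which blocks."""
    def __init__(self, block_map):
        self.block_map = block_map  # path -> [(block_id, [datanode names])]

    def get_block_locations(self, path):
        return self.block_map[path]  # metadata only, never the bytes

class DataNode:
    """Serves raw block bytes from its local filesystem."""
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def client_read(namenode, datanodes, path):
    """Edge-node read: ask the NameNode where, then pull from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].read_block(block_id)  # any replica works
    return data

nn = NameNode({"/f": [("b0", ["dn1", "dn2"]), ("b1", ["dn2"])]})
dns = {"dn1": DataNode({"b0": b"hello "}), "dn2": DataNode({"b1": b"world"})}
print(client_read(nn, dns, "/f"))  # b'hello world'
```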
- there is also a JobTracker (JT), which coordinates/manages MapReduce jobs (as opposed to managing HDFS)
- each physical machine runs a DataNode, as discussed before, and a TaskTracker, which is basically the compute side of the node
- the TaskTracker and DataNode appear to run as daemons, so these are long-running services
- this is WITHOUT YARN queues (layout sketched below)
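A rough picture of that pre-YARN daemon layout as plain data (hostnames invented):

```python
# one coordinator pair, plus a DataNode + TaskTracker pair on every worker
cluster = {
    "master":  ["NameNode", "JobTracker"],   # HDFS metadata + job coordination
    "worker1": ["DataNode", "TaskTracker"],  # storage + compute, co-located
    "worker2": ["DataNode", "TaskTracker"],
    "worker3": ["DataNode", "TaskTracker"],
}
# co-locating DataNode and TaskTracker is what makes data-local maps possible
```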
-
- Running Jobs:
- the # of mapper nodes in MapReduce is usually equal to the # of nodes holding input chunks, since that's where a map task can run
- this means data (usually) doesn't move around for the map phase, since the host of a chunk does the map for that part of the data
- individual map tasks are single-core jobs, so a node can run up to n tasks for its n cores, as long as there are that many chunks
- if there is "extra compute room", the same chunk's map runs speculatively in multiple locations and the 1st completion is used (see the scheduling sketch after this list)
- when key/value records don't fit the 64MB chunk boundaries perfectly (a single record split across 2 chunks), the mapper does a "remote read" to fetch the missing part from the node holding the neighboring chunk (second sketch below)
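A minimal sketch of that locality-first placement plus the "extra room" speculation, with invented names and a much-simplified JobTracker:

```python
def schedule_maps(chunk_map, free_slots):
    """chunk_map: chunk -> nodes holding a replica; free_slots: node -> idle cores.
    Returns (chunk, node) assignments, preferring nodes that hold the chunk."""
    assignments = []
    for chunk, holders in chunk_map.items():
        local = [n for n in holders if free_slots.get(n, 0) > 0]
        # data-local if possible; otherwise any free node (map will remote-read)
        node = local[0] if local else max(free_slots, key=free_slots.get)
        assignments.append((chunk, node))
        free_slots[node] -= 1
    # speculative execution: spend leftover cores on duplicate attempts;
    # whichever copy of a map finishes first is the one whose output is used
    for chunk in chunk_map:
        already = {n for c, n in assignments if c == chunk}
        spare = [n for n, s in free_slots.items() if s > 0 and n not in already]
        if spare:
            assignments.append((chunk, spare[0]))
            free_slots[spare[0]] -= 1
    return assignments

# 2 chunks, 4 cores total: both maps run locally, leftovers run duplicates
print(schedule_maps({"c0": ["dn1"], "c1": ["dn2"]}, {"dn1": 2, "dn2": 1, "dn3": 1}))
```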
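And a sketch of the boundary case from the last bullet: a record that starts in one chunk and ends in the next, so finishing it means reading a few bytes past the chunk (a remote read from the neighboring node in real HDFS; helper names invented):

```python
CHUNK = 16  # toy chunk size in bytes (stands in for 64MB)

def records_for_chunk(data, i):
    """Yield the newline-delimited records "owned" by chunk i: any record
    that starts inside the chunk, even if it ends in chunk i+1."""
    start, end = i * CHUNK, (i + 1) * CHUNK
    if start > 0:
        start = data.find(b"\n", start) + 1  # skip the tail owned by chunk i-1
    while start < end and start < len(data):
        nl = data.find(b"\n", start)  # may scan past `end`: the remote read
        stop = nl + 1 if nl != -1 else len(data)
        yield data[start:stop]
        start = stop

data = b"aaaa aaaa\nbbbb bbbb\ncccc cccc\n"
for i in range(2):
    print(i, list(records_for_chunk(data, i)))
# chunk 0 owns the record straddling byte 16; chunk 1 skips it
```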
- Distributed Group By:
- mappers write output into an in-memory "circular buffer"; when it fills, the contents are spilled to disk to open up space, and each spill is written in sorted order
- reducers then copy the partitions they need from each mapper and merge the sorted spills, again on disk; presumably the final reduce happens in memory
- the combiner runs when spills are merged on the mapper, and then again on the reducer when it "imports" (merges) the copied data (word-count sketch below)
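A toy end-to-end of that spill / sort / merge / combine path using word count; the threshold and function names are invented, and real Hadoop spills a byte-level circular buffer partitioned per reducer:

```python
import heapq
from collections import Counter

SPILL_THRESHOLD = 4  # spill after this many buffered pairs (tiny, for demo)

def map_words(lines):
    """Mapper side: buffer (word, 1) pairs; write each full buffer out sorted."""
    buffer, spills = [], []
    for line in lines:
        for word in line.split():
            buffer.append((word, 1))
            if len(buffer) >= SPILL_THRESHOLD:
                spills.append(sorted(buffer))  # each spill hits disk sorted
                buffer = []
    if buffer:
        spills.append(sorted(buffer))
    return spills

def merge_and_combine(spills):
    """Merge the sorted spills and combine by summing counts per word."""
    combined = Counter()
    for word, count in heapq.merge(*spills):
        combined[word] += count
    return sorted(combined.items())

spills = map_words(["the cat sat", "the cat ran", "the dog sat"])
print(merge_and_combine(spills))
# [('cat', 2), ('dog', 1), ('ran', 1), ('sat', 2), ('the', 3)]
```

The same combining then happens once more reducer-side, after the shuffle copies each mapper's sorted output over.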
-