sept_13

  1. Basically work. But in class.
  2. MapReduce Stuff:
  3. GFS (the Google File System, the filesystem under Google's MapReduce, as opposed to Hadoop):
  4. files are broken into 64 MB chunks
  5. each chunk is replicated to at least 3 places
  6. a single master coordinates everything
  7. no client caching, since access is almost entirely sequential/streaming reads and writes
  8. the API is simplified compared to, say, POSIX
  9. Hadoop is this, but probably worse
  10. DataNodes store blocks on their own local filesystems (Linux, presumably any real filesystem?)
  11. the NameNode / master node holds the mapping of the chunks to files and DataNodes (the fs tree, basically)
  12. it does NOT hold/transfer data through itself
  13. rather, edge nodes ask the NameNode for metadata, the NameNode tells them which DataNodes hold the blocks, and the edge node then reads from those DataNodes directly (see the read sketch at the end of this section)
  14. there is also a JobTracker (JT), which coordinates/manages MapReduce jobs (as opposed to managing HDFS)
  15. each physical machine runs a DataNode, as discussed above, plus a TaskTracker, which is basically the compute side of the node
  16. the TaskTracker and DataNode appear to be daemons, so these are long-running services
  17. this is the classic architecture, WITHOUT YARN queues
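A minimal sketch of that read path, using the standard Hadoop FileSystem client API; the file path here is made up, and the Configuration is assumed to point at a real cluster (core-site.xml / hdfs-site.xml on the classpath). The client library is what asks the NameNode for block locations and then streams bytes straight from the DataNodes:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle for the configured default FS
        Path path = new Path("/data/example.txt");  // hypothetical file

        // open() asks the NameNode where the file's blocks live; the returned
        // stream then reads the bytes directly from the DataNodes -- no file
        // data ever flows through the NameNode itself.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);  // stream contents to stdout
        }
    }
}
```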
  18. Running Jobs:
  19. the # of mapper nodes in MapReduce is usually equal to the # of nodes holding chunks of the input, since that's where a map task can run (see the job sketch after this section)
  20. this means data (usually) doesn't move around for the map phase, since the node hosting a chunk does the map for that part of the data
  21. individual map tasks are single-core jobs, so you can run up to n tasks on n cores, as long as you have that many chunks
  22. if there is "extra compute room", the same chunk's map is run speculatively in multiple locations and the first completion is used
  23. when your key/value records don't line up with the 64 MB boundaries exactly (an entire "piece" of data spans 2 chunks), a "remote read" fetches the missing part from another node
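A minimal job sketch with the standard org.apache.hadoop.mapreduce API, just the usual word count; input/output paths come from the command line, and the speculative-execution property name used here is the Hadoop 2+ spelling (older releases used mapred.map.tasks.speculative.execution). The framework creates one map task per input split, normally one HDFS block, and tries to schedule it on a node holding that block:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // One map task per input split (normally one HDFS block), scheduled on a
  // node that holds the block whenever possible.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : line.toString().split("\\s+")) {
        if (!tok.isEmpty()) {
          word.set(tok);
          ctx.write(word, ONE);   // lands in the map-side sort buffer
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "Extra compute room": let the framework run duplicate map attempts and
    // keep whichever finishes first (Hadoop 2+ property name).
    conf.setBoolean("mapreduce.map.speculative", true);

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // combiner, relevant to the next section
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```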
  24. Distributed Group By:
  25. mappers write output into a "circular buffer" (it takes input and spills to disk when full, to free up space); the spills are written to disk in sorted order. These spill files get merged onto the reducers (each reducer copies what it needs from the mappers that have its data), again on disk. Presumably the reduce itself then works in memory.
  26. when the spills are being merged they are also combined; the combiner runs again on the reducer when it "imports" (copies and merges) the map outputs. The knobs controlling this are sketched below.
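The buffer/spill/merge behavior above is controlled by a few job properties; a small sketch that just sets them. The property names are the Hadoop 2+ ones and the values shown are the stock defaults, so treat the exact names and defaults as assumptions to check against your distribution:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map output goes into a circular in-memory buffer; once it passes the
        // spill threshold it is sorted by (partition, key) and spilled to local disk.
        conf.setInt("mapreduce.task.io.sort.mb", 100);             // buffer size, in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // start spilling at 80% full
        conf.setInt("mapreduce.task.io.sort.factor", 10);          // spill files merged per pass

        Job job = Job.getInstance(conf, "shuffle tuning sketch");
        // The combiner (if set) runs while spills are written/merged on the map side,
        // and again on the reduce side while the copied map outputs are merged, e.g.:
        // job.setCombinerClass(WordCount.SumReducer.class);

        System.out.println("sort buffer MB = "
                + job.getConfiguration().get("mapreduce.task.io.sort.mb"));
    }
}
```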