tarfeef101
/
cs451-notes


			
				
					
						
						
							12345678910111213141516171819202122232425262728293031323334353637383940
							Everything is a graph. Except for graphs. Those are charts.

We're going to talk about sparse graphs. This means there aren't too many edges compared to the possible amt of edges.
CS people are masochists. (Because we love graphs, which are hard). Why:
  they can have weird structures (can be fun to make data structs for it)
  weird data access patters (Can be fun to make infrastructure to adjust for that)
  you can optimize things with graph things

What do graphs have to do with MR/Spark?
  lots of graph algos do:
    compute stuff at each node, and continue traversing (propogate results) (often these computations have to do with parents/children)
  
How to represent a graph in MR and Spark?
  Adjacency matrices (n * n matrix, each row/column represents a vertex, each cell is 1 or 0 if there is an edge from row to column)
    while easy to understand, it uses lots of space and is hard to actually code/compute on them
  Adjacency Lists: basically the above but just have a list of n lists, each list is a list of outgoing edges (this is like a postings list for text processing/search)
    compressible just like those postings lists, easy to compute things as your traverse from start -> end (compute over outgoing links). But rather hard to do the opposite without flipping the entire data strcuture beforehand
  Edge Lists: Just a fucking list of edges. Why? Cause it's fucking easy. Also you can always append shit, so good for processing a stream of input. But it uses lots of space, and computing on them efficiently is hard cause it's not in traversal order or anything.

How to process them:
  split by edges or vertices??

How to deal with undirected graphs?
  1) you could store the edge both ways, and ensure your algo handles the dupes
  2) store one edge using a static rule (eg start is always the lower ID)

Things we can do with MR/Spark:
  Invert the graph: flatMap: emit each edge with the pair reversed, regroup as needed depending on which representation you used
  Going from types of representations to each other is pretty self-explanatory

  Make those cool graphical representations of data and relations!:
    treat all vertices like "like charges" (they repel each other)
    treat edges like spring connecting them (so they pull vertices together)
    this aint scalable though. 10,000+ nodes makes this useless and a giant mess to see

  How to do that with say, the interwebs?
    painfully

SSSP:
  read the slides