6 years ago · ddc9223746
--- a/oct_4
+++ b/oct_4
@@ -0,0 +1,40 @@
 
				+Everything is a graph. Except for graphs. Those are charts.
			
 
				+
			
 
				+We're going to talk about sparse graphs. This means there aren't too many edges compared to the possible amt of edges.
			
 
				+CS people are masochists. (Because we love graphs, which are hard). Why:
			
 
				+  they can have weird structures (can be fun to make data structs for it)
			
 
				+  weird data access patters (Can be fun to make infrastructure to adjust for that)
			
 
				+  you can optimize things with graph things
			
 
				+
			
 
				+What do graphs have to do with MR/Spark?
			
 
				+  lots of graph algos do:
			
 
				+    compute stuff at each node, and continue traversing (propogate results) (often these computations have to do with parents/children)
			
 
				+  
			
 
				+How to represent a graph in MR and Spark?
			
 
				+  Adjacency matrices (n * n matrix, each row/column represents a vertex, each cell is 1 or 0 if there is an edge from row to column)
			
 
				+    while easy to understand, it uses lots of space and is hard to actually code/compute on them
			
 
				+  Adjacency Lists: basically the above but just have a list of n lists, each list is a list of outgoing edges (this is like a postings list for text processing/search)
			
 
				+    compressible just like those postings lists, easy to compute things as your traverse from start -> end (compute over outgoing links). But rather hard to do the opposite without flipping the entire data strcuture beforehand
			
 
				+  Edge Lists: Just a fucking list of edges. Why? Cause it's fucking easy. Also you can always append shit, so good for processing a stream of input. But it uses lots of space, and computing on them efficiently is hard cause it's not in traversal order or anything.
			
 
				+
			
 
				+How to process them:
			
 
				+  split by edges or vertices??
			
 
				+
			
 
				+How to deal with undirected graphs?
			
 
				+  1) you could store the edge both ways, and ensure your algo handles the dupes
			
 
				+  2) store one edge using a static rule (eg start is always the lower ID)
			
 
				+
			
 
				+Things we can do with MR/Spark:
			
 
				+  Invert the graph: flatMap: emit each edge with the pair reversed, regroup as needed depending on which representation you used
			
 
				+  Going from types of representations to each other is pretty self-explanatory
			
 
				+
			
 
				+  Make those cool graphical representations of data and relations!:
			
 
				+    treat all vertices like "like charges" (they repel each other)
			
 
				+    treat edges like spring connecting them (so they pull vertices together)
			
 
				+    this aint scalable though. 10,000+ nodes makes this useless and a giant mess to see
			
 
				+
			
 
				+  How to do that with say, the interwebs?
			
 
				+    painfully
			
 
				+
			
 
				+SSSP:
			
 
				+  read the slides