|
@@ -0,0 +1,40 @@
|
|
|
+Everything is a graph. Except for graphs. Those are charts.
|
|
|
+
|
|
|
+We're going to talk about sparse graphs. This means there aren't too many edges compared to the possible amt of edges.
|
|
|
+CS people are masochists. (Because we love graphs, which are hard). Why:
|
|
|
+ they can have weird structures (can be fun to make data structs for it)
|
|
|
+ weird data access patters (Can be fun to make infrastructure to adjust for that)
|
|
|
+ you can optimize things with graph things
|
|
|
+
|
|
|
+What do graphs have to do with MR/Spark?
|
|
|
+ lots of graph algos do:
|
|
|
+ compute stuff at each node, and continue traversing (propogate results) (often these computations have to do with parents/children)
|
|
|
+
|
|
|
+How to represent a graph in MR and Spark?
|
|
|
+ Adjacency matrices (n * n matrix, each row/column represents a vertex, each cell is 1 or 0 if there is an edge from row to column)
|
|
|
+ while easy to understand, it uses lots of space and is hard to actually code/compute on them
|
|
|
+ Adjacency Lists: basically the above but just have a list of n lists, each list is a list of outgoing edges (this is like a postings list for text processing/search)
|
|
|
+ compressible just like those postings lists, easy to compute things as your traverse from start -> end (compute over outgoing links). But rather hard to do the opposite without flipping the entire data strcuture beforehand
|
|
|
+ Edge Lists: Just a fucking list of edges. Why? Cause it's fucking easy. Also you can always append shit, so good for processing a stream of input. But it uses lots of space, and computing on them efficiently is hard cause it's not in traversal order or anything.
|
|
|
+
|
|
|
+How to process them:
|
|
|
+ split by edges or vertices??
|
|
|
+
|
|
|
+How to deal with undirected graphs?
|
|
|
+ 1) you could store the edge both ways, and ensure your algo handles the dupes
|
|
|
+ 2) store one edge using a static rule (eg start is always the lower ID)
|
|
|
+
|
|
|
+Things we can do with MR/Spark:
|
|
|
+ Invert the graph: flatMap: emit each edge with the pair reversed, regroup as needed depending on which representation you used
|
|
|
+ Going from types of representations to each other is pretty self-explanatory
|
|
|
+
|
|
|
+ Make those cool graphical representations of data and relations!:
|
|
|
+ treat all vertices like "like charges" (they repel each other)
|
|
|
+ treat edges like spring connecting them (so they pull vertices together)
|
|
|
+ this aint scalable though. 10,000+ nodes makes this useless and a giant mess to see
|
|
|
+
|
|
|
+ How to do that with say, the interwebs?
|
|
|
+ painfully
|
|
|
+
|
|
|
+SSSP:
|
|
|
+ read the slides
|