12345678910111213141516171819202122232425262728293031323334353637383940 |
- Everything is a graph. Except for graphs. Those are charts.
- We're going to talk about sparse graphs. This means there aren't too many edges compared to the possible amt of edges.
- CS people are masochists. (Because we love graphs, which are hard). Why:
- they can have weird structures (can be fun to make data structs for it)
- weird data access patters (Can be fun to make infrastructure to adjust for that)
- you can optimize things with graph things
- What do graphs have to do with MR/Spark?
- lots of graph algos do:
- compute stuff at each node, and continue traversing (propogate results) (often these computations have to do with parents/children)
-
- How to represent a graph in MR and Spark?
- Adjacency matrices (n * n matrix, each row/column represents a vertex, each cell is 1 or 0 if there is an edge from row to column)
- while easy to understand, it uses lots of space and is hard to actually code/compute on them
- Adjacency Lists: basically the above but just have a list of n lists, each list is a list of outgoing edges (this is like a postings list for text processing/search)
- compressible just like those postings lists, easy to compute things as your traverse from start -> end (compute over outgoing links). But rather hard to do the opposite without flipping the entire data strcuture beforehand
- Edge Lists: Just a fucking list of edges. Why? Cause it's fucking easy. Also you can always append shit, so good for processing a stream of input. But it uses lots of space, and computing on them efficiently is hard cause it's not in traversal order or anything.
- How to process them:
- split by edges or vertices??
- How to deal with undirected graphs?
- 1) you could store the edge both ways, and ensure your algo handles the dupes
- 2) store one edge using a static rule (eg start is always the lower ID)
- Things we can do with MR/Spark:
- Invert the graph: flatMap: emit each edge with the pair reversed, regroup as needed depending on which representation you used
- Going from types of representations to each other is pretty self-explanatory
- Make those cool graphical representations of data and relations!:
- treat all vertices like "like charges" (they repel each other)
- treat edges like spring connecting them (so they pull vertices together)
- this aint scalable though. 10,000+ nodes makes this useless and a giant mess to see
- How to do that with say, the interwebs?
- painfully
- SSSP:
- read the slides
|