|
@@ -0,0 +1,25 @@
|
|
|
+More memes - watch star wars again
|
|
|
+    Respond to Piazza posts with LMGTFY links
|
|
|
+"Big Data is 3 parts"
|
|
|
+ Data Science Tools, Analytics Infrastructure, and Execution Infrastructure
|
|
|
+Genetically program better people. We're better than god!!
|
|
|
+Don't change diapers sober
|
|
|
+Data Science = analysis of data to extract insight
|
|
|
+Data Products = what you actually do with that insight
|
|
|
+Getting spied on by your car is good
|
|
|
+Many model comparison graphs put dataset size on the x-axis and success/accuracy on the y-axis, with one line per approach
|
|
|
+    More data generally means better results
|
|
|
+Stereotypes are funny and we should embrace them
|
|
|
+Rogelio wouldn't like our prof cause he likes "data ethics", but he gets frustrated by our lack of ability to use all the data
|
|
|
+
|
|
|
+MapReduce:
|
|
|
+Physical procedure:
|
|
|
+Master node interprets input and schedules/coordinates workers (YARN queues, basically)
|
|
|
+The master splits up the input files and sends the splits to workers; workers map and write their output locally; that output gets sent to the reducers (with a combine step), which then reduce
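The physical procedure above can be sketched as a single-process word count. This is only an illustration under stated assumptions: the function names (`split_input`, `map_phase`, `combine`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, the "workers" are plain function calls rather than machines, and placing the combine step on the worker before the data is sent to reducers is one common reading of the notes, not something they spell out.

```python
from collections import defaultdict
from itertools import chain


def split_input(lines, num_workers):
    """Master step: divide the input into one split per worker."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_phase(split):
    """Worker step: emit a (word, 1) pair for every word in this split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def combine(pairs):
    """Assumed local pre-aggregation on the worker, before sending to reducers."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(all_pairs, num_reducers):
    """Route each key to exactly one reducer so its values end up together."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in all_pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reducer step: sum the partial counts for each key."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(lines, num_workers=3, num_reducers=2):
    splits = split_input(lines, num_workers)
    mapped = [combine(map_phase(split)) for split in splits]
    partitions = shuffle(chain.from_iterable(mapped), num_reducers)
    result = {}
    for partition in partitions:
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    print(word_count(["a b a", "b c", "a"]))
```

In a real cluster each split, map, and reduce call would run on a different machine and the intermediate pairs would be written to local disk and moved over the network; here everything stays in memory to keep the data flow visible.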
|
|
|
+Considerations in MapReduce Clusters
|
|
|
+    Scale "out", not "up": instead of upgrading to more and more powerful hardware, get a bunch of machines that are each a bit less powerful (it's cheaper)
|
|
|
+    Assume machines will fail, so redundancy/failsafes must exist
|
|
|
+    Process on the cluster: it's faster to move code to the cluster and use its power than to move data to single machines and process it there
|
|
|
+    Process data sequentially; random accesses are slow (don't jump around disks or thrash memory)
|
|
|
+        This can be done with algorithms that read all the data and simply do nothing with the non-relevant parts, as opposed to jumping around a dataset
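The last point, reading everything in order and ignoring what is not relevant, might look like the sketch below. It assumes a hypothetical record layout of `key,value` per line and a hypothetical helper name `sum_matching`; neither comes from the notes.

```python
import csv


def sum_matching(lines, wanted_key):
    """Scan every record once, in order, and skip the ones we don't need.

    `lines` can be an open file or any iterable of "key,value" strings.
    There are no seeks and no index lookups: irrelevant rows are read
    and dropped rather than jumped over.
    """
    total = 0.0
    for key, value in csv.reader(lines):
        if key != wanted_key:
            continue  # non-relevant record: do nothing with it
        total += float(value)
    return total


if __name__ == "__main__":
    records = ["a,1", "b,2", "a,3.5"]
    print(sum_matching(records, "a"))
```

With a real file this would be called as `sum_matching(open(path, newline=""), "a")`, where `path` is a placeholder; the file is then streamed front to back, which is the access pattern the notes recommend.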
|
|
|
+
|