had another class

tarfeef101 6 years ago
parent
commit
862d634c79
1 changed files with 25 additions and 0 deletions
sept_11

@@ -0,0 +1,25 @@
+More memes - watch star wars again
+  Respond to piazza posts with LMGTFY links
+"Big Data is 3 parts"
+  Data Science Tools, Analytics Infrastructure, and Execution Infrastructure
+Genetically program better people. We're better than god!!
+Don't change diapers sober
+Data Science = analysis of data to extract insight
+Data Products = what you actually do with that insight
+Getting spied on by your car is good
+Many model comparison graphs are: x-axis for dataset size, y-axis for success/accuracy, with one line per approach being compared
+  more data more better
+Stereotypes are funny and we should embrace them
+Rogelio wouldn't like our prof cause he likes "data ethics", but he gets frustrated by our lack of ability to use all the data
+
+MapReduce:
+Physical procedure:
+Master node interprets input and schedules/coordinates workers (YARN queues, basically)
+Master splits up the input files and sends the splits to workers; workers map and write their output locally; that output gets sent to the reducers (optionally combined first), then reduced
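The procedure above can be sketched as a single-process word count. This is only an illustration of the split / map / combine / shuffle / reduce order, not how a real cluster runs it: the function names (`map_phase`, `combine`, `shuffle`, `reduce_phase`, `word_count`) are made up for this sketch, and reading "(combine)" as a local pre-aggregation on each worker's map output is an assumption.

```python
from collections import defaultdict


def map_phase(split):
    # Each worker emits a (word, 1) pair for every word in its split.
    return [(word.lower(), 1) for line in split for word in line.split()]


def combine(pairs):
    # Assumed reading of "(combine)": pre-aggregate on the worker so that
    # fewer pairs have to be sent to the reducers.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle(mapped_outputs, num_reducers):
    # Route every key to exactly one reducer by hashing the key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_outputs:
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(partition):
    # Each reducer sums the values collected for each of its keys.
    return {key: sum(values) for key, values in partition.items()}


def word_count(lines, num_workers=2, num_reducers=2):
    # The "master" step: split the input and hand one split to each worker.
    splits = [lines[i::num_workers] for i in range(num_workers)]
    mapped = [combine(map_phase(split)) for split in splits]
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reduce_phase(partition))
    return result


if __name__ == "__main__":
    print(word_count(["a b a", "b c", "a"]))
```

Since every key lands in exactly one partition, merging the reducer outputs with `update` never overwrites a count.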
+Considerations in MapReduce Clusters
+  Scale "out", not "up": instead of upgrading to more and more powerful hardware, get a bunch of machines that are each a bit less powerful (it's cheaper)
+  Assume machines will fail, so redundancy/failsafes must exist
+  Process on the cluster: it's faster to move code to the cluster and use its power than to move data to single machines and process it there
+  Process data sequentially; random accesses are slow (don't jump around disks or thrash memory)
+    This can be done with algorithms that read all the data and simply do nothing with the non-relevant parts, as opposed to jumping around a dataset
+
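A minimal sketch of the sequential-processing point, assuming newline-delimited records: read the whole stream once, front to back, and ignore the records that are not relevant, rather than seeking to the ones that are. The name `count_matching` and the `predicate` parameter are hypothetical, chosen only for this example.

```python
import io


def count_matching(stream, predicate):
    # One forward pass over the stream: no seek(), no index lookups.
    count = 0
    for line in stream:
        record = line.rstrip("\n")
        if predicate(record):
            count += 1
        # Non-relevant records are read and then simply dropped.
    return count


if __name__ == "__main__":
    data = io.StringIO("error a\nok b\nerror c\n")
    print(count_matching(data, lambda r: r.startswith("error")))
```

The same function works on a file opened with `open(path)`, because iterating over a file object also yields lines in order.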