sept_11 1.5 KB

  1. More memes - watch star wars again
  2. Respond to piazza posts with LMGTFY links
  3. "Big Data is 3 parts"
  4. Data Science Tools, Analytics Infrastructure, and Execution Infrastructure
  5. Genetically program better people. We're better than god!!
  6. Don't change diapers sober
  7. Data Science = analysis of data to extract insight
  8. Data Products = what you actually do with that insight
  9. Getting spied on by your car is good
  10. Many model comparison graphs are: x-axis for dataset size, y-axis for success/accuracy, and one line per approach being compared
  11. more data more better
  12. Stereotypes are funny and we should embrace them
  13. Rogelio wouldn't like our prof cause he likes "data ethics", but he gets frustrated by our lack of ability to use all the data
  14. MapReduce:
  15. Physical procedure:
  16. Master node interprets input and schedules/coordinates workers (YARN queues, basically)
  17. Input files get split up and sent to workers; workers map and write their output locally; that output gets sent to reducers (combine), which then reduce
  18. Considerations in MapReduce Clusters
  19. Scale "out", not "up": instead of upgrading to more and more powerful hardware, get a bunch of machines that are each a bit less powerful (it's cheaper)
  20. Assume machines will fail so redundancy/failsafes must exist
  21. Process on the cluster: it's faster to move code to the cluster and use its power than to move data to single machines and process it there
  22. Process data sequentially; random accesses are slow (don't jump around disks or thrash memory)
  23. This can be done with algorithms that read all the data and just do nothing with the non-relevant stuff, as opposed to jumping around a dataset
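
A rough single-process sketch of the MapReduce flow from the notes above (split, map, write local, send to reducers, reduce), using word count as the job. This is not how Hadoop or any real cluster implements it. The function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`), the `num_workers` parameter, and the "lines starting with # are non-relevant" rule are all made up for illustration. Each worker reads its chunk front to back and just emits nothing for records it doesn't care about, which is the sequential-scan idea from the last two points.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) pairs for one input line."""
    # Assumed rule for this sketch: lines starting with '#' are non-relevant.
    # We still read them in order, we just emit nothing for them.
    if line.startswith("#"):
        return []
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, standing in for sending map output to reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines, num_workers=2):
    """Play the master node: split the input, run the map phase, then reduce."""
    # Split the input into one chunk per (simulated) worker.
    chunks = [lines[i::num_workers] for i in range(num_workers)]

    # Each worker scans its chunk sequentially and keeps its output "locally".
    local_outputs = []
    for chunk in chunks:
        worker_output = []
        for line in chunk:
            worker_output.extend(map_words(line))
        local_outputs.append(worker_output)

    # Collect every worker's local output and group it by key.
    all_pairs = [pair for output in local_outputs for pair in output]
    grouped = shuffle(all_pairs)

    # One reduce call per key.
    return dict(reduce_counts(key, values) for key, values in grouped.items())


lines = [
    "more data more better",
    "# skip me",
    "data science data products",
]
counts = run_job(lines)
print(counts)
```

On a real cluster the chunks would live on different machines and the map code would be shipped to them, rather than the data being pulled into one process like it is here.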