- More memes - watch Star Wars again
- Respond to Piazza posts with LMGTFY links
- "Big Data is 3 parts"
- Data Science Tools, Analytics Infrastructure, and Execution Infrastructure
- Genetically program better people. We're better than god!!
- Don't change diapers sober
- Data Science = analysis of data to extract insight
- Data Products = what you actually do with that insight
- Getting spied on by your car is good
- Many model comparison graphs look like this: x-axis for dataset size, y-axis for success/accuracy, and one line per approach being compared
- more data more better
- Stereotypes are funny and we should embrace them
- Rogelio wouldn't like our prof cause he likes "data ethics", but he gets frustrated by our lack of ability to use all the data
- MapReduce:
- Physical procedure:
- Master node interprets input and schedules/coordinates workers (YARN queues, basically)
- Splits up the input files and sends the pieces to workers; workers map and write results locally; those results get sent to reducers (combine), then reduce
- Considerations in MapReduce Clusters
- Scale "out", not "up": instead of upgrading to more and more powerful hardware, get a bunch of machines that are each a bit less powerful (it's cheaper)
- Assume machines will fail so redundancy/failsafes must exist
- Process on the cluster: it's faster to move code to the cluster and use its power than to move data to single machines and process it there
- Process data sequentially; random accesses are slow (don't jump around disks or thrash memory)
- This can be done with algos that read all data, and just do nothing with non-relevant stuff as opposed to jumping around a dataset
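
To make the physical procedure above concrete, here is a rough single-process sketch of the split / map / combine / shuffle / reduce flow, using word count as the job. Everything in it is an assumption for illustration: the function names (`split_input`, `map_words`, `combine`, `shuffle`, `reduce_counts`, `word_count`) are made up, the "workers" are just list chunks, and nothing here is Hadoop's actual API. Note that the map step reads every line and simply emits nothing for lines it doesn't care about (blank ones), which is the "read all data, ignore the non-relevant stuff" idea.

```python
from collections import defaultdict


def split_input(lines, n_workers):
    """Master step: deal the input lines out into one chunk per worker."""
    return [lines[i::n_workers] for i in range(n_workers)]


def map_words(line):
    """Map step: emit (word, 1) for each word; blank lines emit nothing."""
    for word in line.lower().split():
        yield word, 1


def combine(pairs):
    """Local combine on one worker: pre-sum counts before sending them on."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return dict(local)


def shuffle(worker_outputs, n_reducers):
    """Route each key to exactly one reducer, grouping its values together."""
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for output in worker_outputs:
        for key, value in output.items():
            partitions[hash(key) % n_reducers][key].append(value)
    return partitions


def reduce_counts(partition):
    """Reduce step: sum the grouped partial counts for every key."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(lines, n_workers=3, n_reducers=2):
    """Run the whole pipeline in one process (hypothetical driver)."""
    chunks = split_input(lines, n_workers)
    # Each "worker" maps its chunk sequentially, then combines locally.
    worker_outputs = [
        combine(pair for line in chunk for pair in map_words(line))
        for chunk in chunks
    ]
    result = {}
    for partition in shuffle(worker_outputs, n_reducers):
        result.update(reduce_counts(partition))
    return result
```

Calling `word_count(["more data more better", "", "scale out not up"])` gives a dict of word counts. On a real cluster each chunk and each partition would live on a different machine and the master would reschedule work when a machine fails; this sketch skips all of that.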