More memes - watch Star Wars again
Respond to Piazza posts with LMGTFY links
"Big Data is 3 parts": Data Science Tools, Analytics Infrastructure, and Execution Infrastructure
Genetically program better people. We're better than god!!
Don't change diapers sober
Data Science = analysis of data to extract insight
Data Products = what you actually do with that insight
Getting spied on by your car is good
Many model comparison graphs are: x-axis for dataset size, y-axis for success/accuracy, and one line per approach
    more data more better
Stereotypes are funny and we should embrace them
Rogelio wouldn't like our prof cause he likes "data ethics", but he gets frustrated by our lack of ability to use all the data

MapReduce:
    Physical procedure:
        Master node interprets input and schedules/coordinates workers (YARN queues, basically)
        Master splits up files and sends the splits to workers
        Workers map and write their output to local disk
        Map output gets sent to reducers (combine), then reduce
    Considerations in MapReduce clusters:
        Scale "out", not "up": instead of upgrading to more and more powerful hardware, get a bunch of machines that are a bit less powerful (is cheaper)
        Assume machines will fail, so redundancy/failsafes must exist
        Process on the cluster: faster to move code to the cluster and use its power than to move data to single machines and process there
        Process data sequentially, random accesses are slow (don't jump around disks or thrash mem)
            This can be done with algos that read all the data and just do nothing with non-relevant stuff, as opposed to jumping around a dataset
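
Rough sketch of the MapReduce physical procedure described above, as a word count in plain single-process Python. This is not Hadoop or YARN: the function names (split_input, map_words, combine, shuffle, reduce_counts, word_count) and the worker/reducer counts are made up for illustration, "workers" are just loop iterations, and "write local" is just a dict in memory. Note the map step reads every line in order and never jumps around the input.

```python
from collections import defaultdict


def split_input(lines, num_workers):
    """Master step: deal the input lines out to workers round-robin."""
    splits = [[] for _ in range(num_workers)]
    for i, line in enumerate(lines):
        splits[i % num_workers].append(line)
    return splits


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combine step: pre-sum counts on the worker before anything is sent."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return dict(local)


def shuffle(worker_outputs, num_reducers):
    """Route each key to exactly one reducer so its values end up together."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in worker_outputs:
        for key, value in output.items():
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_counts(partition):
    """Reduce step: sum the partial counts collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(lines, num_workers=2, num_reducers=2):
    """Run split -> map -> combine -> shuffle -> reduce in one process."""
    worker_outputs = []
    for split in split_input(lines, num_workers):
        # Sequential pass over the split: every line is read once, in order.
        pairs = [pair for line in split for pair in map_words(line)]
        worker_outputs.append(combine(pairs))

    result = {}
    for partition in shuffle(worker_outputs, num_reducers):
        result.update(reduce_counts(partition))
    return result


lines = ["more data more better", "big data is three parts", "more memes"]
counts = word_count(lines)
print(counts)
```

Because each key goes to exactly one reducer, the per-reducer results never overlap and can simply be merged at the end. In a real cluster the splits, local map output, and reducer partitions would live on different machines, with retries when a machine fails.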