More memes - watch Star Wars again
  Respond to Piazza posts with LMGTFY links
"Big Data is 3 parts"
  Data Science Tools, Analytics Infrastructure, and Execution Infrastructure
Genetically program better people. We're better than god!!
Don't change diapers sober
Data Science = analysis of data to extract insight
Data Products = what you actually do with that insight
Getting spied on by your car is good
Many model comparison graphs are: x-axis for dataset size, y-axis for success/accuracy, and one line per approach being compared
  more data more better
Stereotypes are funny and we should embrace them
Rogelio wouldn't like our prof cause he likes "data ethics", but he gets frustrated by our lack of ability to use all the data

MapReduce:
Physical procedure:
Master node interprets input and schedules/coordinates workers (YARN queues, basically)
Splits up the input files and sends the splits to workers; workers map and write results to local disk; those results get sent to reducers (combine), then reduce
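Rough sketch of the procedure above as a single-process word count, just to make the split -> map -> send to reducers -> reduce flow concrete. Everything here is made up for illustration (the names `map_phase`, `shuffle`, `reduce_phase`, `run_job`, and the hash-based routing of keys to reducers are assumptions, not from lecture); a real cluster runs each phase on different machines.

```python
from collections import defaultdict

def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle(mapped_outputs, num_reducers):
    """Route every key to exactly one reducer and group its values there."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            # Assumption: keys are assigned to reducers by hash partitioning.
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    """Reducer: sum the grouped values for each key it owns."""
    return {key: sum(values) for key, values in partition.items()}

def run_job(lines, num_splits=2, num_reducers=2):
    """Stand-in for the master: split input, run mappers, shuffle, run reducers."""
    splits = [lines[i::num_splits] for i in range(num_splits)]
    mapped = [map_phase(split) for split in splits]
    partitions = shuffle(mapped, num_reducers)
    result = {}
    for partition in partitions:
        result.update(reduce_phase(partition))
    return result

counts = run_job(["big data is big", "data is data"])
print(counts)
```

Since each key lands on exactly one reducer, merging the reducer outputs at the end never has conflicting keys.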
Considerations in MapReduce Clusters
  Scale "out", not "up": instead of upgrading to more and more powerful hardware, get a bunch of machines that are each a bit less powerful (it's cheaper)
  Assume machines will fail, so redundancy/failsafes must exist
  Process on the cluster: it's faster to move code to the cluster and use its power than to move data to single machines and process it there
  Process data sequentially; random accesses are slow (don't jump around disks or thrash memory)
    This can be done with algos that read all the data and just do nothing with non-relevant stuff, as opposed to jumping around a dataset
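Tiny sketch of the "read everything, ignore what you don't need" idea. The record format (`user,amount` per line), the function name `scan_total`, and the sample data are all hypothetical; the point is only that the stream is read once, front to back, with no seeking.

```python
import io

def scan_total(stream, wanted_user):
    """Read every record once, in order; skip the ones that are not relevant."""
    total = 0
    for line in stream:
        user, amount = line.rstrip("\n").split(",")
        if user != wanted_user:
            continue  # non-relevant record: do nothing, keep streaming
        total += int(amount)
    return total

# In-memory stand-in for a large file on disk (assumed data).
log = io.StringIO("alice,3\nbob,5\nalice,4\ncarol,1\n")
alice_total = scan_total(log, "alice")
print(alice_total)
```

The skipped records still get read, but that is the trade: wasted sequential reads are cheap compared to seeking around the disk for only the relevant ones.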