6 years ago · 862d634c79
--- a/sept_11
+++ b/sept_11
@@ -0,0 +1,25 @@
 
				+More memes - watch star wars again
			
 
				+  Respond to piazza posts with LMGTFY links
			
 
				+"Big Data is 3 parts"
			
 
				+  Data Science Tools, Analytics Infrastructure, and Execution Infrastructure
			
 
				+Genetically program better people. We're better than god!!
			
 
				+Don't change diapers sober
			
 
				+Data Science = analysis of data to extract insight
			
 
				+Data Products = what you actually do with that insight
			
 
				+Getting spied on by your car is good
			
 
				+Many model comparison graphs look like: x-axis for dataset size, y-axis for success/accuracy, one line per approach
			
 
				+  more data more better
			
 
				+Stereotypes are funny and we should embrace them
			
 
				+Rogelio wouldn't like our prof because Rogelio likes "data ethics", while our prof gets frustrated by our inability to use all the data
			
 
				+
			
 
				+MapReduce:
			
 
				+Physical procedure:
			
 
				+Master node interprets input and schedules/coordinates workers (YARN queues, basically)
			
 
				+Master splits up the input files and sends the splits to workers; workers map and write results to local disk; those results get sent to reducers (combined by key), then reduced
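Rough sketch of the procedure above as a word count, all in one process. Assumptions: the function names (split_input, map_words, shuffle, reduce_counts, word_count) are made up for these notes, the "workers" are just list slices, and nothing here is the real Hadoop API. A real cluster would run the map and reduce steps on separate machines and write map output to local disk.

```python
from collections import defaultdict


def split_input(lines, num_workers):
    """Master step: deal the input lines out to workers round-robin."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one worker's chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group intermediate pairs by key so all values for a key end up together."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, num_workers=3):
    chunks = split_input(lines, num_workers)        # master splits the input
    mapped = [map_words(chunk) for chunk in chunks]  # each worker maps its chunk
    return reduce_counts(shuffle(mapped))            # shuffle by key, then reduce


counts = word_count(["big data big", "data science", "big"])
```

Changing num_workers should not change the result, since the shuffle regroups everything by key before reducing.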
			
 
				+Considerations in MapReduce Clusters
			
 
				+  Scale "out", not "up": instead of upgrading to more and more powerful hardware, get a bunch of less powerful machines (it's cheaper)
			
 
				+  Assume machines will fail so redundancy/failsafes must exist
			
 
				+  Process on the cluster: it's faster to move code to the cluster and use its power than to move data to single machines and process it there
			
 
				+  Process data sequentially; random accesses are slow (don't jump around disks or thrash memory)
			
 
				+    This can be done with algorithms that read all the data and simply skip the irrelevant records, as opposed to jumping around a dataset
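Rough sketch of the "read everything, skip what you don't need" idea from the last point. Assumptions: scan_matching and the sample log lines are made up for these notes, and io.StringIO stands in for a large file on disk.

```python
import io


def scan_matching(stream, is_relevant):
    """Read every record once, in order, and yield only the relevant ones."""
    for line in stream:
        record = line.rstrip("\n")
        if is_relevant(record):
            yield record
        # Irrelevant records are simply skipped; we never seek backwards.


# Stand-in for a big log file; a real job would open the file and stream it.
log = io.StringIO("ok 1\nerror 2\nok 3\nerror 4\n")
errors = list(scan_matching(log, lambda record: record.startswith("error")))
```

The scan touches every line exactly once from front to back, which is the access pattern the notes say is fast on disks.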
			
 
				+