Google processes over 20 petabytes of data per day

Google currently processes over 20 petabytes of data per day, running an average of 100,000 MapReduce jobs daily across its massive computing clusters. In September 2007 the average MapReduce job ran across approximately 400 machines, and MapReduce jobs together consumed approximately 11,000 machine years in that single month. These are just some of the facts about the search giant's computational processing infrastructure revealed in an ACM paper by Google Fellows Jeffrey Dean and Sanjay Ghemawat.
Twenty petabytes (20,000 terabytes) per day is a tremendous amount of data processing and a key contributor to Google's continued market dominance. Competing storage and processing systems at Microsoft (Dryad) and Yahoo! (Hadoop) are still playing catch-up to Google's suite of GFS, MapReduce, and BigTable.

MapReduce statistics for different months

Statistic	Aug. 2004	Mar. 2006	Sep. 2007
Number of jobs (1000s)	29	171	2,217
Avg. completion time (secs)	634	874	395
Machine years used	217	2,002	11,081
`map` input data (TB)	3,288	52,254	403,152
`map` output data (TB)	758	6,743	34,774
`reduce` output data (TB)	193	2,970	14,018
Avg. machines per job	157	268	394
Unique `map` implementations	395	1,958	4,083
Unique `reduce` implementations	269	1,208	2,418
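
To put the monthly totals in the table on a per-day footing, the short Python sketch below converts two rows into daily averages and computes growth between the first and last columns. It is only a back-of-the-envelope aid, not something taken from the paper: the function names are made up for illustration, and it assumes calendar month lengths (31 days for August 2004 and March 2006, 30 days for September 2007). Because these are averages over individual past months, they will not necessarily equal the headline figures quoted for current usage.

```python
# Monthly figures copied from the table above.
# "days" is an assumption: the calendar length of each month.
STATS = {
    "Aug. 2004": {"days": 31, "jobs": 29_000, "map_input_tb": 3_288},
    "Mar. 2006": {"days": 31, "jobs": 171_000, "map_input_tb": 52_254},
    "Sep. 2007": {"days": 30, "jobs": 2_217_000, "map_input_tb": 403_152},
}


def per_day(month: str, field: str) -> float:
    """Average daily value of a monthly total (hypothetical helper)."""
    row = STATS[month]
    return row[field] / row["days"]


def growth(field: str, start: str = "Aug. 2004", end: str = "Sep. 2007") -> float:
    """How many times larger the monthly total is at `end` than at `start`."""
    return STATS[end][field] / STATS[start][field]


if __name__ == "__main__":
    for month in STATS:
        print(
            f"{month}: {per_day(month, 'jobs'):,.0f} jobs/day, "
            f"{per_day(month, 'map_input_tb'):,.1f} TB of map input/day"
        )
    print(f"Job count grew {growth('jobs'):.1f}x")
    print(f"Map input grew {growth('map_input_tb'):.1f}x")
```

Dividing by the month length keeps the comparison fair across months of different lengths; the growth factors compare monthly totals directly.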

Google processes its data on a standard machine cluster node consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives, and a gigabit Ethernet link. A machine of this type costs approximately $2,400 through providers such as Penguin Computing or Dell, or approximately $900 a month through a managed hosting provider such as Verio (for startup comparisons).
At roughly 400 machines per job and about $2,400 per machine, the average MapReduce job runs across a hardware cluster worth about $1 million, not including bandwidth fees, datacenter costs, or staffing.
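The $1 million estimate can be reproduced with a few lines of arithmetic. The Python sketch below is only an illustration: it takes the per-machine prices quoted above and the September 2007 average of 394 machines per job from the table, and the constant and function names are invented for this example. Like the estimate itself, it leaves out bandwidth, datacenter, and staffing costs.

```python
# Figures quoted in the article; names are illustrative only.
MACHINE_PRICE_USD = 2_400            # approximate purchase price per node
HOSTED_PRICE_USD_PER_MONTH = 900     # approximate managed-hosting price per node
AVG_MACHINES_PER_JOB = 394           # Sep. 2007 average from the table


def purchase_cost(machines: int, unit_price: int = MACHINE_PRICE_USD) -> int:
    """Up-front hardware cost of buying `machines` nodes."""
    return machines * unit_price


def hosted_cost_per_month(
    machines: int, unit_price: int = HOSTED_PRICE_USD_PER_MONTH
) -> int:
    """Monthly cost of renting `machines` nodes from a managed host."""
    return machines * unit_price


if __name__ == "__main__":
    print(f"Purchase: ${purchase_cost(AVG_MACHINES_PER_JOB):,}")
    print(f"Hosted:   ${hosted_cost_per_month(AVG_MACHINES_PER_JOB):,} per month")
```

Buying 394 machines comes to $945,600, which rounds to the $1 million figure; renting the same number would cost $354,600 a month.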
