Twenty petabytes (20,000 terabytes) per day is a tremendous amount of data processing and a key contributor to Google's continued market dominance. Competing search storage and processing systems at Microsoft (Dryad) and Yahoo! (Hadoop) are still playing catch-up to Google's suite of GFS, MapReduce, and BigTable.
MapReduce statistics for different months

| | Aug. 2004 | Mar. 2006 | Sep. 2007 |
|---|---|---|---|
| Number of jobs (thousands) | 29 | 171 | 2,217 |
| Avg. completion time (secs) | 634 | 874 | 395 |
| Machine years used | 217 | 2,002 | 11,081 |
| Map input data (TB) | 3,288 | 52,254 | 403,152 |
| Map output data (TB) | 758 | 6,743 | 34,774 |
| Reduce output data (TB) | 193 | 2,970 | 14,018 |
| Avg. machines per job | 157 | 268 | 394 |
| Unique map implementations | 395 | 1,958 | 4,083 |
| Unique reduce implementations | 269 | 1,208 | 2,418 |
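The "unique implementations" rows count distinct user-written map and reduce functions. To make that concrete, here is a minimal word-count sketch in Python of the kind of map/reduce pair those rows are counting. This is my own illustration of the programming model, not Google's actual C++ API; the driver is a sequential stand-in for the distributed framework.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for a given word."""
    yield (word, sum(counts))

def run_mapreduce(documents):
    """Sequential driver standing in for the framework:
    run map, shuffle intermediate pairs by key, then run reduce."""
    intermediate = defaultdict(list)
    for doc_name, contents in documents.items():
        for key, value in map_fn(doc_name, contents):
            intermediate[key].append(value)  # shuffle: group by key
    results = {}
    for key, values in sorted(intermediate.items()):
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

if __name__ == "__main__":
    docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
    print(run_mapreduce(docs))  # {'brown': 1, 'dog': 1, 'fox': 1, ...}
```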
Google processes its data on a standard cluster node consisting of two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives, and a gigabit Ethernet link. A machine of this type costs approximately $2,400 from providers such as Penguin Computing or Dell, or approximately $900 a month through a managed hosting provider such as Verio (a useful comparison point for startups).
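For the buy-versus-host question, a quick back-of-the-envelope calculation (my arithmetic, using the figures above and ignoring power, space, and administration on the purchase side) shows the purchase price is recovered in under three months of hosting fees:

```python
# Buy-vs-host break-even using the per-machine figures above.
# Ignores power, cooling, space, and admin costs for the owned machine.
PURCHASE_PRICE = 2400  # dollars, one-time (Penguin Computing / Dell)
HOSTING_RATE = 900     # dollars per month (managed provider, e.g. Verio)

break_even_months = PURCHASE_PRICE / HOSTING_RATE
print(f"Break-even: {break_even_months:.1f} months")  # ~2.7 months
```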
The average MapReduce job therefore runs on roughly $1 million worth of cluster hardware, and that figure excludes bandwidth fees, datacenter costs, and staffing.
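That figure follows directly from the table; a quick sanity check (my arithmetic) using the September 2007 average of 394 machines per job and the approximately $2,400 per-machine price:

```python
# Sanity check on the cluster-cost claim using the Sep. 2007 figures.
MACHINES_PER_JOB = 394   # avg. machines per job, from the table above
COST_PER_MACHINE = 2400  # dollars, purchase price per node

cluster_cost = MACHINES_PER_JOB * COST_PER_MACHINE
print(f"Hardware under the average job: ${cluster_cost:,}")  # $945,600
```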