29-05-2008, 12:33 AM   #1
iinfi
mekalodu
Join Date: Oct 2004
Location: Navi Mumbai
Posts: 1,519
Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest


Year-old database processes 24 billion events a day
Source
Quote:
Interest in raw computational speed waned — sorry, IBM — after data center managers began turning away from super-expensive supercomputers and toward massive grids comprised of cheap PC servers.

Meanwhile, the rise of business intelligence and its even more technical cousin, business analytics, has spurred interest in super-large data warehouses that boost profits by crunching the behavior patterns of millions of consumers at a time.

Take Yahoo Inc.'s 2-petabyte, specially built data warehouse, which it uses to analyze the behavior of its half-billion Web visitors per month. The Sunnyvale, Calif.-based company makes a strong claim that it is not only the world's single-largest database, but also the busiest.

Based on a heavily modified PostgreSQL engine, the year-old database processes 24 billion events a day, according to Waqar Hasan, vice president of engineering in Yahoo's data group.

And the data, all of it constantly accessed and all of it stored in a structured, ready-to-crunch form, is expected to grow into the multiple tens of petabytes by next year.

By comparison, large enterprise databases typically grow no larger than the tens of terabytes. Large databases about which much is publicly known include the Internal Revenue Service's data warehouse, which weighs in at a svelte 150TB.

EBay Inc. reportedly operates databases that process 10 billion records per day and are also able to do deep business analysis. They collectively store more than 6 petabytes of data, though the single largest system is estimated at about 1.4 petabytes or larger.

Even larger than the databases of Yahoo and eBay are the databases of the National Energy Research Scientific Computing Center in Oakland, Calif., whose archives include 3.5 petabytes of atomic energy research data, and the World Data Centre for Climate in Hamburg, Germany, which has 220TB of data (download PDF) in its Linux database but more than 6 petabytes of data archived on magnetic tape.

But Hasan noted that archived data is far different from live, constantly accessed data.

"It's one thing to have data entombed; it's another to have it readily accessible for your queries," he said. He also pointed out that other large databases store unstructured data such as video and sound files. Those can bulk up a database's size without providing easily analyzable data.

Hasan joined Yahoo more than three years ago. At the time, Yahoo already had huge non-SQL databases storing hundreds of terabytes of data. Problem was, the data was in the form of large collections of compressed files that could be accessed only by writing programs in a language such as C++, rather than more easily and quickly via SQL commands, he said.

One of Hasan's first moves was to buy a Seattle database start-up called Mahat Technologies, which had tweaked the open-source PostgreSQL to run as a column-based database rather than a conventional row-based one. Rotating tables 90 degrees, while slowing down the process of writing data to disk, greatly accelerates the reading of it.
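A minimal sketch of the row-versus-column trade-off described above. This is not Yahoo's or Mahat's code; the records, the field names, and the helpers `to_columns` and `append_row` are all hypothetical, chosen only to show why a columnar layout reads less data and writes to more places.

```python
# Hypothetical click-event records; the field names are illustrative only.
rows = [
    {"user_id": 1, "page": "mail", "ms_on_page": 1200},
    {"user_id": 2, "page": "news", "ms_on_page": 300},
    {"user_id": 3, "page": "mail", "ms_on_page": 4500},
]


def to_columns(records):
    """Rotate a row-oriented table into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def append_row(columns, record):
    """Writing one row touches every column, which is the write-side cost."""
    for name, value in record.items():
        columns[name].append(value)


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
total_from_rows = sum(record["ms_on_page"] for record in rows)

# Column layout: only the single column the query needs is read.
total_from_columns = sum(columns["ms_on_page"])

print(total_from_rows, total_from_columns)
```

Both totals agree; the difference is how much unrelated data each layout has to walk past to compute them.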

Yahoo brought the database in-house and continued to enhance it, including tighter data compression, more parallel data processing and more optimized queries. The top layer remains PostgreSQL, however, so that Yahoo can use the many off-the-shelf tools available for it.

The largest tables in the database already comprise "multiple trillions of rows," said Hasan, who helped develop database technology at Informix, Hewlett-Packard and IBM before coming to the user side.

The huge table sizes enable Yahoo to do broader, more complicated analyses, so it can better understand how to make its banner and search ads more effective, enabling it to reap more money from advertisers. They also help the company make its Web sites better for users by, for instance, making its search results more relevant, Hasan said. But loading the data takes several hours, so Yahoo does its real-time analysis with a different data warehouse.

The database requires fewer than 1,000 PC servers hosted at several data centers, said Hasan, who declined to reveal the exact number. He did claim that the number of servers used is one-tenth to one-twentieth of the number that would be needed if the database were a conventional one such as Oracle, IBM's DB2 or NCR's Teradata.

Despite the success of Amazon.com Inc.'s EC2 cloud-based application hosting service, Yahoo has no plans right now to rent access to its database as a Web-based utility, nor to sell licenses of the technology to enterprises that want to install it on their own premises, Hasan said.
__________________
mekalodu

29-05-2008, 01:34 AM   #2
gary4gar
GaurishSharma.com
Join Date: May 2005
Location: Jaipur
Posts: 4,116
Re: Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest

Wonder what the size of Google's database is?
29-05-2008, 12:02 PM   #3
BBThumbHealer
BlackBerry Guru ! :)
Join Date: Dec 2006
Location: New Delhi , NCR
Posts: 1,270
Re: Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest

^ As large as the universe?
__________________
Username Changed - BlackBerry7100g To BBThumbHealer ! :D
29-05-2008, 12:09 PM   #4
ax3
Cool as a CUCUMBAR ! ! !
Join Date: Dec 2003
Posts: 5,052
Re: Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest

Cool... great move by Yahoo. It's a SIZE-offering WAR, isn't it?
__________________
... W H O T ...
29-05-2008, 04:55 PM   #5
dheeraj_kumar
Legen-wait for it-dary!
Join Date: Dec 2004
Location: Chennai
Posts: 2,471
Re: Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest

I think Google's database would be larger than Yahoo's. I mean, it's one of the largest spiders on the web, and it returns more efficient results than Yahoo too.
__________________
If the Start Windows Restart when Windows starts check box is checked Windows Restart will start automatically every time Windows is started. - Actual excerpt from a windows program help file
29-05-2008, 05:58 PM   #6
goobimama
Macboy
Join Date: Sep 2004
Location: Goa
Posts: 4,486
Re: Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest

Yeah. Caching all those pages is certainly going to require a lot of storage (for Google).
__________________
I'm like a bird... :)
29-05-2008, 09:55 PM   #7
naveen_reloaded
!! RecuZant By Birth !!
Join Date: May 2005
Location: In Everyone`s Heart
Posts: 2,985
Re: Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest

Wow ....
__________________
Know My Thoughts..
Visit my Blog @ www.Urssiva.com
Visit My Tech Blog @ www.CloudTechnica.com
29-05-2008, 11:24 PM   #8
iinfi
mekalodu
Join Date: Oct 2004
Location: Navi Mumbai
Posts: 1,519
Re: Size matters: Yahoo claims 2-petabyte database is world's biggest, busiest

A more interesting read:
http://news.yahoo.com/s/cmp/20080522/tc_cmp/207801579
Quote:
The result is a database made possible by both hardware and software innovations. For example, SQL databases are organized as tables, which consist of rows and columns. They are traditionally arranged as rows of data, but Yahoo chose to store its data as distributed columns.

"What we chose to do is organize it as columns," said Hasan. "What that enables, especially with deep analytics queries, is that you can go to only the data that interests you, which makes it very, very effective in terms of reducing the amount of data you have to move through for a particular query."

Yahoo is also using advanced techniques for data compression and parallel vector query processing, a method for using parallel processing more efficiently.

Google's BigTable database also uses commodity hardware clusters, but Hasan said that Yahoo's approach differs in that it is designed for an SQL interface. "What that enables is that you can write your programs very, very cheaply," said Hasan. "Typically with BigTable, you'd be writing a C++ or a Java program. Whereas what we can do is get the same job done with SQL, which is much more productive from a programming perspective."
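To illustrate the productivity point Hasan makes, here is a small sketch that computes the same per-page total twice: once with a hand-written loop and once with a single SQL statement. It uses Python's built-in `sqlite3` module purely as a stand-in; it is not PostgreSQL, BigTable, or Yahoo's engine, and the `events` table, its columns, and both function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (page, milliseconds on page) events.
events = [("mail", 1200), ("news", 300), ("mail", 4500), ("search", 800)]


def totals_by_hand(records):
    """The 'write a program' route: loop and accumulate manually."""
    totals = defaultdict(int)
    for page, ms_on_page in records:
        totals[page] += ms_on_page
    return sorted(totals.items())


def totals_with_sql(records):
    """The SQL route: declare the result and let the engine do the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (page TEXT, ms_on_page INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", records)
    result = conn.execute(
        "SELECT page, SUM(ms_on_page) FROM events GROUP BY page ORDER BY page"
    ).fetchall()
    conn.close()
    return result


print(totals_by_hand(events))
print(totals_with_sql(events))
```

The two functions return the same rows. For a query this small the loop is no harder to write, but joins, filters, and extra groupings each add code to the loop version while the SQL version stays a single statement.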

The reason Yahoo developed its database was that commercial database providers just couldn't meet its needs. Hasan said that the commercial vendors did pretty well up to about 25 terabytes, and could even manage up to 100 terabytes. "Our needs are about 100 times higher than that," he said. "The other part we ran into was if you look at the cost, even at 100 terabytes, our engine is roughly 10 and 20 times more cost effective. That's because we were able to build in specializations for our needs."

Yahoo's data needs are substantial. According to Hasan, the travel industry's Sabre system handles 50 million events per day, credit card company Visa handles 120 million events a day, and the New York Stock Exchange has handled over 225 million events in a day. Yahoo, he said, handles 24 billion events a day, fully two orders of magnitude more than other non-Internet companies.
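The "two orders of magnitude" comparison can be checked with a few lines of arithmetic. The sketch below uses only the daily event counts quoted in the paragraph above; the variable names are illustrative.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

# Daily event counts as quoted in the article.
yahoo_events_per_day = 24_000_000_000
others = {
    "Sabre": 50_000_000,
    "Visa": 120_000_000,
    "NYSE": 225_000_000,
}

# Average rate if the events were spread evenly across the day.
events_per_second = yahoo_events_per_day / SECONDS_PER_DAY

# How many times larger Yahoo's daily count is than each system's.
ratios = {name: yahoo_events_per_day / count for name, count in others.items()}

print(f"Yahoo: about {events_per_second:,.0f} events per second on average")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}x, about {math.log10(ratio):.1f} orders of magnitude")
```

Every ratio comes out above 100, which is consistent with the "two orders of magnitude" claim, with the NYSE figure the closest to that threshold.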

Several years ago, Google and Yahoo fought for bragging rights about which company had the biggest Web index. Google put an end to that game in 2005 when it declared that its index was three times larger than Yahoo's. After that, the debate shifted to search relevance.

Yahoo now is seeking recognition for a different accomplishment: The embattled search company and community portal claims that it has the largest SQL database in a production environment.

"This is the first time, that we know of, that someone has put a one petabyte-plus database into production," said Waqar Hasan, VP of data at Yahoo. "We have built it to scale to tens of petabytes and we intend to get there. Come 2009, we'll be at multiple tens of petabytes."

A petabyte equals one thousand terabytes, one million gigabytes, or 1 billion megabytes. It's an uncommon enough measurement that the word "petabyte" is not yet recognized by Microsoft Word 2007's spell checker.
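For reference, the unit conversions work out as follows. This is a small sketch assuming decimal (SI) units, where each step is a factor of 1,000; the `convert` helper is hypothetical.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
BYTES_PER_UNIT = {
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
}


def convert(value, source_unit, target_unit):
    """Convert a size between the units listed in BYTES_PER_UNIT."""
    return value * BYTES_PER_UNIT[source_unit] / BYTES_PER_UNIT[target_unit]


print(convert(1, "PB", "TB"))    # one petabyte in terabytes
print(convert(1, "PB", "GB"))    # one petabyte in gigabytes
print(convert(1, "PB", "MB"))    # one petabyte in megabytes
print(convert(150, "TB", "PB"))  # the 150TB warehouse from the first article, in petabytes
```

In these units one petabyte is 10^3 terabytes, 10^6 gigabytes, and 10^9 megabytes, so the 150TB warehouse mentioned in the first article is 0.15 of a petabyte.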

"The amount of data that we get is much more than the traditional industry and even in the Internet space is significantly more than other large players," said Hasan. The reason for this, he explained, is that consumers spend twice as long on Yahoo as they do at Google and three times as long on Yahoo as they do at Microsoft's sites. (This, in part, explains Microsoft's interest in acquiring Yahoo.)

The data Yahoo gathers is structured data, as opposed to unstructured data like e-mail and other documents. "It's about how people use our Web site, both from the advertising perspective and from the consumer experience perspective," said Hasan.

Yahoo uses this data to deliver what it hopes will be the best possible experience for its consumers, through personalization, and the most profitable experience for its advertisers, through ad targeting. "Fundamentally, what this is enabling is what we call deep analytics," said Hasan. "Doing deep analytics with a low entry barrier is really what this technology enables."

Yahoo's database is built out of commodity Intel boxes, strung together in large clusters. "The classic industry approach has been to go for big SMP [symmetric multiprocessing] boxes," Hasan explained. "We started from the ground up with the premise that all you get to use is commodity hardware and you get to take lots of little boxes and put them together."

Yahoo's database technology came out of work begun at Mahat Technologies, a Seattle-based start-up that Yahoo quietly acquired in November 2005 for an undisclosed sum.

Yahoo started with the PostgreSQL engine and replaced the query processing layer with code designed for its commodity hardware cluster.
It's bigger than Google's.
__________________
mekalodu