Google: $225B and organizing less than 0.02% of the world’s information

Google’s recent OpenSocial and Open Handset Alliance announcements have pushed the company’s market capitalization north of $225,000,000,000. But at the end of the day, Google has simply added two more products to a portfolio of over 70 others, of which AdWords and AdSense create the bulk of profits, with AdWords being the stronger of the two. In other words, AdWords ads shown in search results drive Google’s earnings, yet Google has organized less than 0.02% of the world’s information. Imagine what the other 99.98% is worth.

~10,000,000 Terabytes of Unique Information Produced Last Year
Berkeley’s School of Information Management and Systems (SIMS) calculated how much unique information was produced in 1999 and again in 2002. The SIMS studies are regarded as the benchmarks in this arena, measuring the volume of unique data created in the world each year and saved to film, disk, optical media, and paper. The 1999 study estimated that between 2,132,238 terabytes (“TB”) and 3,212,731 TB of unique information were produced that year. The later study estimated that between 3,416,281 TB and 5,609,121 TB were produced in 2002, which works out to a ~19% annual growth rate between ’99 and ’02. Assuming the same growth rate to present, somewhere between 6,869,341 TB and 11,278,629 TB of new information were produced in 2006.
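
For readers who want to check the extrapolation, here is a minimal Python sketch of the arithmetic above. The function names are ours, and the flat 19% rate is a rounded assumption, so the projected totals land close to, but not exactly on, the figures quoted in this post.

```python
# SIMS estimates of unique information produced per year, in terabytes.
LOW_1999, HIGH_1999 = 2_132_238, 3_212_731
LOW_2002, HIGH_2002 = 3_416_281, 5_609_121


def annual_growth(start, end, years):
    """Compound annual growth rate between two yearly totals."""
    return (end / start) ** (1 / years) - 1


def project(value, rate, years):
    """Carry a yearly total forward at a constant compound rate."""
    return value * (1 + rate) ** years


# Growth implied by the low and high ends of the two studies (3 years apart).
low_rate = annual_growth(LOW_1999, LOW_2002, 3)
high_rate = annual_growth(HIGH_1999, HIGH_2002, 3)

# Assumption: a single rounded rate of 19% applied to both ends, 2002 -> 2006.
ASSUMED_RATE = 0.19
low_2006 = project(LOW_2002, ASSUMED_RATE, 4)
high_2006 = project(HIGH_2002, ASSUMED_RATE, 4)

print(f"implied growth: {low_rate:.1%} to {high_rate:.1%}")
print(f"projected 2006 output: {low_2006:,.0f} TB to {high_2006:,.0f} TB")
```

The two ends of the range imply slightly different growth rates, which is why the post settles on a single approximate figure.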

~1085 Terabytes Organized by Google Last Year
In November of 2006, Google released a paper called “Bigtable: A Distributed Storage System for Structured Data” which discussed Google’s proprietary distributed storage system designed to scale to petabytes of data across thousands of commodity servers. The paper also mentioned that Google’s web index at the time was 800 TB and Bigtable as a whole was ~1085 TB. If you take the SIMS numbers as valid, in 2006 Google had organized the equivalent of less than 0.02% of the unique information produced in that year alone (roughly 0.01% to 0.016%, depending on which end of the SIMS range you use).
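
The headline percentage is a simple ratio, sketched below in Python. The constant and function names are ours; the inputs are the Bigtable total and the projected 2006 range quoted in this post.

```python
# Figures quoted in the post, in terabytes.
BIGTABLE_TB = 1_085
PRODUCED_2006_LOW_TB = 6_869_341
PRODUCED_2006_HIGH_TB = 11_278_629


def share_percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


# Dividing by the smaller total gives the larger share, and vice versa.
upper_share = share_percent(BIGTABLE_TB, PRODUCED_2006_LOW_TB)
lower_share = share_percent(BIGTABLE_TB, PRODUCED_2006_HIGH_TB)

print(f"Google's share: {lower_share:.4f}% to {upper_share:.4f}%")
```

Both ends of the range come in under 0.02%, which is where the figure in the title comes from.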

Berkeley has since handed this research off to the ISID at UCSD. The Berkeley studies were co-authored by Peter Lyman, who recently passed away, and Hal R. Varian, who has since become Google’s Chief Economist.

_______________________________

This post got lots of attention on Digg and Reddit - thanks for all the comments! We admit, the post is a bit inflammatory, but what better way to start off a blog than with some linkbait? A couple of thoughts:

1) Eric Schmidt, Google CEO, 10/8/2005: “How much information is there in the world? A study that was done last year indicated roughly five million terabytes. How much is indexable, searchable today? Current estimate: about 170 terabytes. So again we’re back in that two or three percent of the indexed and searchable world.” Check his math: 170 TB is 0.0034% of 5 million TB, not “two or three percent”.

2) We realize that Google is not out to “save” everything on the web. However, Google’s index/cache is information saved by Google as its bots crawl the web. The cache only includes HTML, text, and URL-type data, not photos, movies, etc. A couple of folks made the great point that knowing where a 100MB movie is located doesn’t take 100MB of memory. This is clearly true. But if you know where something is located but you don’t actually know what’s in it, have you really organized it?

3) Several commenters noted that Google may have the most interesting information organized, and that the rest is boring or not even on public servers: stuff like sales receipts, personal emails, and text messages. Google has definitely organized lots of information that is on public servers and is interesting. That said, it’s KnowledgeBid’s thesis that nearly all information has value to someone.
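
The math check in point 1 can be reproduced in a few lines of Python. This is only a sketch using the two numbers from the Schmidt quote; the variable names are ours.

```python
# Numbers taken from the Eric Schmidt quote in point 1, in terabytes.
WORLD_TB = 5_000_000
INDEXED_TB = 170

# Share of the world's information that was indexable, as a percentage.
indexed_share = 100 * INDEXED_TB / WORLD_TB

# For comparison: how large an index "two or three percent" would imply.
implied_low_tb = WORLD_TB * 2 / 100
implied_high_tb = WORLD_TB * 3 / 100

print(f"170 TB is {indexed_share:.4f}% of 5 million TB")
print(f"2-3% would mean {implied_low_tb:,.0f} to {implied_high_tb:,.0f} TB")
```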

Btw, the best comment we got was from CaptainJesusHood on Reddit who pointed us to this great and on-topic Onion article.

30 Responses to “Google: $225B and organizing less than 0.02% of the world’s information”

  1. SteeleR on November 7th, 2007 at 10:57 am

    0.02% is totally incorrect. As the article says, that 10,000,000 TB of data includes music, videos and so on, while the Google index consists of systematized website data: URLs, text and so on. So the two have nothing to do with each other. It’s like comparing a movie to a paper: a 700MB movie is a lot bigger than a 1MB online newspaper, but the info in that newspaper, I think, could be way more than the info in the movie.

  2. Adam on November 7th, 2007 at 10:57 am

    Is it possible that the stored data doesn’t contain full “tags”? eg. Assuming ~ 95% of the HTML source is simply tags and whitespace, that moves their storage percentage to around 0.5%. Not nearly as “minimized” sounding.

  3. Hagrin on November 7th, 2007 at 11:22 am

    Seems like a bad case of apples and oranges. Much of the 10 million TB figure quoted in the second section contains private or inaccessible data that would never be made public to an entity such as Google (payroll data, financial transactions, etc.). If you wanted to make the case that Google hardware search appliances could effectively “organize” (poor word choice IMO, index would have been a better choice) this private data, then maybe these numbers would have some validity, but since they can’t apply the AdWords/AdSense business model to this data, it still doesn’t apply.

  4. Jay Neely on November 7th, 2007 at 11:28 am

    If you’re saying that ~1085 TB is the size of Google’s organizing system for its index of information, and then saying, “that amount is equal to the amount of information Google has organized”… you’ve failed basic logic.

    ~1085 TB is the size of the card catalog, not the library.

  5. brianreeder.com » Google Organizing the World’s Information. Kind Of. on November 7th, 2007 at 11:41 am

    […] Check out the full article. […]

  6. Andrew on November 7th, 2007 at 12:00 pm

    More data does not necessarily mean better data. There’s no guarantee that any of the previously uncovered data would become the top search result for a query, let alone make the first 100 pages of results.

  7. Drew on November 7th, 2007 at 12:20 pm

    Ok, so this is interesting BUT… Lets say you have a video site (porn for example) that has 100gb of “data” in the form of movies. Google will pick up the site (well under 1mb) but not the 100gb. So based on that math, Google can pick up the “information” w/out actually comprehending all 100gb of data(they are picking up 0.001%). Am I wrong here?

  8. Google: market cap $225 billion; world’s info, less than 0.02% - WebProNews Blog on November 7th, 2007 at 12:33 pm

    […] Via KnowledgeBid […]

  9. Sujoy on November 7th, 2007 at 12:49 pm

    That’s a whole lot of data. I wonder how they conducted the research.

  10. Pedro Côrte-Real on November 7th, 2007 at 12:54 pm

    The fact that BigTable is 1085 TB doesn’t mean they’re only organizing that much information. That is the size of the index which can be much smaller than the size of the actual data being indexed.

  11. Scott Clark on November 7th, 2007 at 2:07 pm

    Nicely stated. It’s good to keep things in perspective. Would be interesting to put that against other organizational monoliths, such as the public library system or the Library of Congress.

  12. afaceri on November 7th, 2007 at 2:09 pm

    impressive. dugged!

  13. Thunk Different. on November 7th, 2007 at 3:25 pm

    astronomical.

  14. First Question on November 7th, 2007 at 4:44 pm

    Is the size of videos included?

  15. Google Might Be Indexing The Smallest Fraction Of The Web : SKIRMISHER: News for the hot-blooded, manly geek on November 7th, 2007 at 8:14 pm

    […] In November 2006, Google proudly released a paper called “Bigtable: A Distributed Storage System for Structured Data,” which mentioned in passing that Google has a web index of 1,085 terabytes. But the thing is, there’s a big-ass paper from the Berkeley’s School of Information Management Systems that sort of implies that an equivalent of 10,000,000 terabytes of unique information were produced in 2006 alone. So Google appears to have indexed only 0.02% of the internets. But don’t worry; there’s only 99.98% left. Baby Jesus, as always, will take care of it. via […]

  16. Axi0n on November 7th, 2007 at 10:32 pm

    I wonder how all these Internet statistics, supposed facts and reports are supposed to support and prove any original hypothesis or theory as to modern data storage scaling… What system are they using to determine unique data?? MD5 hashes?? Heuristical content scanning?? SQL Queries? (Select * from master where x looks like y) then determine the percentage likeness to a given source / content, throw away all content where the % likeness is statistically (guestimate) reasonable it is or is largely a dupe.. Leaving a scientifically, by any standard, a completely assinine way of determining nothing but getting Govt / scientific research grant $$$…

    Being an IT admin, sure data is sent around, too much by anyones standards… My issue is that if 2 documents differ only in signature and creation date, it theoretically would be counted as unique data, despite the actual content being identical… So while there might be endless gobs of new data being counted towards this data storage claim. It still holds true I would believe that 90% of the data is actually duped, plagiarized, re-arranged or overall just meddled with enough for this claim to technically be proven almost true…

    If someone could develop an algorithm to index all the data, before it is cached, find all the dupes, and only present / cache one copy of any given content, then feed the end client a dynamically created runtime page of that data from all the referenced sources, I bet storage requirements would be shrunk quite effectively.

    Look at email.. Imagine a system years back that didn’t support Single Instance Storage for attachments… Think of how many lusers at work forward that 15mb attachment of their vacation to the AllUsers distro… Think of the storage scaling nightmare it would be given todays expectations of corporate email storage quotas and info availability… Now change the example to one that does support SIS… The storage is greatly reduced, less info to index, more drive space, no broken IT budgets, no admins leaping from their office windows, and so on…

    I am rambling now… But the old cliche still holds true… Work smarter. Not Harder.

  17. Axi0n on November 7th, 2007 at 10:48 pm

    Re-reading for effect… How can anyone make a general claim like…

    “volume of unique data created in the world each year saved to film, disk, optical, and paper”…

    Did the Berkeley people conducting the study not consider that paper is generally the output of an existing document in today’s world, which negates any possibility of that data being unique? Conversely, if there is indeed a unique piece of paper, in today’s world it goes to the ADF scanner and is digitized to TIFF/PDF or OCR’d to a document. Then it likely gets backed up to tape or otherwise archived. That creates a problem where identical content exists in three media, undermining the validity of this claim altogether.

    I am no math or stats guru but when you google anything and check the content of even say the top 10 relevant hits (ads / commercial sites excluded, I bet a significant volume of said pages is originating from a different original source and is just copied verbatim, but of course is considered unique because its from a different site, different code structure, and surrounding “noise” content… (posts, ads, tags, frames, user posts, etc.), just enough to fool just about any indexing spider, algorithm, or other magical bit of comparative indexing computerese.

    We’re obviously not there yet in any regard… I bet 90% of people get sick of seeing unique hits on essentially dupe data, if humans can see it just by eyeballing and skimming the first sentence of any actual content, then we know somehow this data is a random assumption based on a unknown number of unknown variables,

  18. Brian Kerr | links for 2007-11-08 on November 7th, 2007 at 11:31 pm

    […] KnowledgeBid | Google — $225B and organizing less than 0.02% of the world’s information (tags: google informons valuation organization ? berkeley sims teh-googe) […]

  19. Treai on November 8th, 2007 at 12:14 am

    Thanks. You just made me hate Google. On the upside, could it really be so hard to beat that for a competitor now?

  20. Antymatrix » Archiwum bloga » Maleńki Google on November 8th, 2007 at 1:17 am

    […] that this search engine provides access to 0.02 percent of the information produced, KnowledgeBid concludes. Of course, a purely arithmetic approach to the volume of data does not reflect its value. […]

  21. Blogulate on November 8th, 2007 at 3:05 am

    Google indexes just 0.02% of world’s information…

    The Berkeley’s School of Information Management Systems calculated how much unique information was produced in 1999 and again in 2002 (measuring the volume of unique data created in the world each year saved to film, disk, optical, and paper).
    The …

  22. A D on November 8th, 2007 at 6:59 am

    wow… I better get in search business quickly…. there is still 98% of data available for me to organize…

  23. Jay Gaulard Blog » Blog Archive » Google, Digg, Facebook and Being Social on November 8th, 2007 at 8:44 am

    […] Report: Google Organizing Less Than 0.02% of the World’s Information […]

  24. November 8, 2007 | next media update on November 8th, 2007 at 12:01 pm

    […] Study: Google Organizing Less Than 0.02% of the World’s Information KNOWLEDGEBID Berkeley’s School of Information Management Systems calculated how much unique information was produced in 1999 and again in 2002. The study estimated that between 3,416,281 TB and 5,609,121 TB were produced n 2002, so there was a ~19% annual growth rate between ‘99 and ‘02. Assuming the same growth rate to present, somewhere between 6,869,341 TB and 11,278,629 TB of new information were produced in 2006. In November of 2006, Google released a paper called “Bigtable: A Distributed Storage System for Structured Data” which discussed Google’s proprietary distributed storage system designed to scale to petabytes of data across thousands of commodity servers. The paper also mentioned that Google’s web index at the time was 800 TB and Bigtable as a whole was ~1085 TB. If you take the SIMS numbers as valid, in 2006 Google had organized the equivalent of 0.02% of the unique information produced in that year alone. Source> […]

  25.   Ştiri la început de weekend (09.11.2007) — CTI97:=(Creativitate,Tehnologie,Informatie) on November 9th, 2007 at 12:00 pm

    […] Google organized only 0.2% of the information circulated on the internet in 2006 […]

  26. Matt Ellsworth on November 10th, 2007 at 12:49 pm

    Either way you cut it, those are two huge amounts of data.

  27. Reflexiones sobre Google « Blog de Fermín Serrano on November 13th, 2007 at 4:18 pm

    […] and meanwhile studies tell us that Google indexes only 0.02% of all the information on the Internet. Isn’t it […]

  28. What is that 99.98% worth? | LucaFiligheddu.com on November 14th, 2007 at 2:27 am

    […] Palombi pointed out this wise thought in his blog: AdWords ads shown in search results drive Google’s earnings, but Google searches […]

  29. Whatever-ishere on November 21st, 2007 at 11:32 am

    thanks for the GREAT post! Very useful…

  30. Idetrorce on December 15th, 2007 at 4:12 pm

    very interesting, but I don’t agree with you
    Idetrorce
