Google and AI

I read, a while ago, that Google is all about large amounts of data. Given Google’s scale, that isn’t all that exciting a statement in and of itself. But reading into the article you learn that Google does things specifically to get more data - that Google believes that having access to large amounts of data is often more valuable than creating great algorithms.

When you type in “GM” into Google, we know it’s “General Motors.” If you type in “GM foods” we answer with “genetically modified foods.” Because we’re processing so much data, we have a lot of context around things like acronyms. Suddenly, the search engine seems smart like it achieved that semantic understanding, but it hasn’t really. It has to do with brute force.

Tangentially, this piece on the latest MapReduce stats is illuminating, showing an order of magnitude increase in the usage of Google’s crazy awesome means of computing over their gigantic data set. This is brute force in action - and you can see it speeding up in the Googleplex. To give you a sense of scale, the article says that these jobs read 403,152 terabytes of input in total and that the average job took only about 6 minutes to complete.

Reading the original piece reminded me of the Hutter Prize - an interesting prize whose goal is to achieve Artificial Intelligence through the use of compression over a huge data set (specifically the English version of Wikipedia). That is, Marcus Hutter believes that achieving a certain level of compression will be the same as artificial intelligence.

The notion revolves around the Turing Test, a proposed test to determine whether a machine is intelligent or not:

a human judge engages in a natural language conversation with one human and one machine, each of which try to appear human; if the judge cannot reliably tell which is which, then the machine is said to pass the test.

The rationale for the Hutter Prize is long, technical and interesting. But I found this to be quite relevant:

both the judge and contestant are more likely to say or write “recognize speech” than “reckon eyes peach”, and further, they are more likely to judge the second phrase as unlikely or incorrect, compared to the first. Assume that both people do this consistently given any pair of strings that might occur in human speech or writing. A string might be a dialog, e.g. a sequence of questions and answers. Our two assumptions state that both humans would answer any given question the same way, and that both humans would recognize when an answer is different than the one they would give.

This is basically saying that everyone understands that the probability of saying “recognize speech” is insanely greater than that of saying “reckon eyes peach.” That is, they will not only say “recognize speech” instead of the other, but given something that sounds like “reckon eyes peach”, they will understand “recognize speech.”

And then:

Humans are good at quickly applying language rules and real world knowledge to distinguish between high probability strings like “roses are red” and low probability strings such as “roses are rod” or “roses red are”. This is precisely the hard problem that a text compressor must solve so that it can assign shorter codes to the more likely strings.

And that to me sounds like exactly what Google is doing. It is learning about human language simply through the sheer amount of data it gathers. By indexing all that data and building models to store it, it inherently and automatically knows about the probabilities of strings of words occurring.
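To make the link between string probabilities and compression a bit more concrete, here is a minimal sketch in Python. It is not how Google or any Hutter Prize entry actually works; it just counts word pairs in a tiny corpus I made up for illustration and turns each phrase's probability into an ideal code length of -log2(p) bits. The corpus and the function names (`train_bigrams`, `code_length_bits`) are all assumptions of mine.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; a real system would use vastly more text.
CORPUS = [
    "roses are red",
    "violets are blue",
    "roses are red and violets are blue",
    "the roses are red",
]

def train_bigrams(sentences):
    """Count word pairs, with <s> marking the start of each sentence."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        vocab.update(words[1:])
        contexts.update(words[:-1])            # words that have a successor
        bigrams.update(zip(words, words[1:]))  # (previous word, word) pairs
    return contexts, bigrams, vocab

def code_length_bits(text, contexts, bigrams, vocab):
    """Ideal code length, -log2 P(text), under an add-one smoothed bigram model."""
    slots = len(vocab) + 1  # every known word plus one slot for unseen words
    words = ["<s>"] + text.split()
    bits = 0.0
    for prev, word in zip(words, words[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + slots)
        bits -= math.log2(p)
    return bits

contexts, bigrams, vocab = train_bigrams(CORPUS)
for phrase in ["roses are red", "roses are rod", "roses red are"]:
    bits = code_length_bits(phrase, contexts, bigrams, vocab)
    print(f"{phrase!r}: {bits:.2f} bits")
```

Even with four sentences of data, “roses are red” comes out cheaper to encode than “roses are rod” or “roses red are”. Nothing in the code knows anything about roses; the preference falls out of the counts, and more data just sharpens it.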

Going back to the initial statement - it isn’t a super clever algorithm that Google wrote that can distinguish when GM means General Motors and when it means genetically modified, it’s just that Google pretty much automatically knew it from all the context it gets from its data.
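As a rough illustration of “knowing it from context”, here is a small Python sketch that picks an expansion for “GM” purely by counting which words show up near each full phrase. The documents, the `EXPANSIONS` list and both function names are hypothetical, made up for this example; I have no idea what Google's real pipeline looks like.

```python
from collections import Counter, defaultdict

# Hypothetical mini "web", invented for illustration only.
DOCS = [
    "general motors gm reported strong truck sales",
    "general motors gm stock rose after earnings",
    "gm shares fell as general motors cut production",
    "genetically modified gm foods face new labeling rules",
    "critics of gm crops say genetically modified foods need testing",
]
EXPANSIONS = ["general motors", "genetically modified"]

def build_cooccurrence(docs, expansions):
    """For each expansion, count the words of every document that mentions it."""
    counts = defaultdict(Counter)
    doc_totals = Counter()
    for doc in docs:
        for expansion in expansions:
            if expansion in doc:
                doc_totals[expansion] += 1
                counts[expansion].update(set(doc.split()))
    return counts, doc_totals

def expand_gm(query, counts, doc_totals):
    """Pick the expansion whose documents share the most words with the query."""
    context = [w for w in query.lower().split() if w != "gm"]

    def score(expansion):
        overlap = sum(counts[expansion][w] for w in context)
        # Ties (including a bare "gm") fall back to the more common expansion.
        return (overlap, doc_totals[expansion])

    return max(doc_totals, key=score)

counts, doc_totals = build_cooccurrence(DOCS, EXPANSIONS)
for query in ["GM", "GM foods", "GM stock"]:
    print(query, "->", expand_gm(query, counts, doc_totals))
```

With this toy data a bare “GM” resolves to “general motors” because more documents mention it, while “GM foods” flips to “genetically modified” because “foods” only ever appears alongside that phrase. No cleverness, just counting.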

But maybe someone might argue that instead of saying:

Suddenly, the search engine seems smart like it achieved that semantic understanding, but it hasn’t really. It has to do with brute force.

we should say that the search engine actually got smart all on its own and that, like Turing believed, semantic understanding and the appearance of semantic understanding are really the same. :) HAL 9000 here we come!
