At Language Log, Geoffrey Pullum has drawn attention to another argument for why Google Translate, though often very effective, tells us very little about how language actually works. GT operates on statistical distributions of word sequences, primarily three-word sequences (so-called ‘trigrams’). But this is not a good model of how human knowledge of language works.
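To make the trigram picture concrete, here is a minimal sketch in Python of the kind of model at issue: raw counts of three-word sequences, with "probabilities" read off as relative frequencies. The toy corpus and the function names are invented for illustration; Google's actual system is vastly larger and more elaborate than this.

```python
from collections import Counter

def trigrams(tokens):
    """Yield each successive three-word sequence in a token list."""
    for i in range(len(tokens) - 2):
        yield tuple(tokens[i:i + 3])

# Toy stand-in for a corpus; the real thing is billions of words of web text.
corpus = "the cat sat on the mat and the cat slept on the mat".split()
counts = Counter(trigrams(corpus))

def trigram_probabilities(sentence):
    """Score each trigram in a sentence by its relative frequency in the corpus."""
    total = sum(counts.values())
    return [(t, counts[t] / total) for t in trigrams(sentence.split())]
```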
I have noticed myself that it is extraordinarily easy to take, say, the emails you received this morning, and verify for particular three-word sequences that they seem never to have occurred before in the history of the web. […]
[I]t really is true that the probability for most grammatical sequences of words actually having turned up on the web really is approximately zero, so grammaticality cannot possibly be reduced to probability of having actually occurred. Not even for word trigrams is that a reasonable equation.
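Running the toy model above on a perfectly ordinary sentence shows what he means: most of its trigrams score zero even though the sentence is impeccably grammatical. (The numbers below come from the invented thirteen-word corpus, but Pullum's point is that web-scale counts behave the same way for most novel sentences.)

```python
>>> trigram_probabilities("the dog sat on the rug")
[(('the', 'dog', 'sat'), 0.0),
 (('dog', 'sat', 'on'), 0.0),
 (('sat', 'on', 'the'), 0.09090909090909091),
 (('on', 'the', 'rug'), 0.0)]
```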
As I said in a previous post,
Of course native speakers aren’t carrying n-grams around in their heads. Of course native speakers’ linguistic knowledge doesn’t amount to knowing statistical distributions of collocations of words … right?
Pullum goes on to note that none of this means n-gram data can't help linguists: corpora are valuable sources of data. But we shouldn't confuse the data with the theory.