Monday, January 17, 2011

Google's Ngram viewer and the shortcomings of OCR

I was having a bit of fun playing with Google's Ngram viewer, searching for terms that were coined in the 20th century to see if anybody had happened to use those same words earlier. It turns out that for some of them, most of the results are OCR failures. For instance, here it mistakes "Jan" for "Jazz" in an 1805 essay. Whoops.

