Carnegie Mellon Project Boosts Book Digitization Efforts

  • Question

  • Carnegie Mellon University researchers have developed a way for people to help create digital records of books every time they solve CAPTCHAs, the distorted word puzzles commonly found when registering at a Web site or making an online purchase. Researchers believe that about 60 million CAPTCHA puzzles are solved every day around the world, each taking an average of about 10 seconds to solve and type in. "Humanity is wasting 150,000 hours every day on these," said Carnegie Mellon assistant professor of computer science Luis von Ahn, who helped develop CAPTCHAs about seven years ago. To take advantage of that manpower, von Ahn devised a system that uses CAPTCHAs to help create digital records of books. The many large efforts to digitize books and store them online rely primarily on optical character recognition (OCR), but OCR frequently does not work on older, faded, or distorted texts. The Internet Archive scans about 12,000 books a month and sends von Ahn the images that the computer is unable to recognize. Von Ahn then splits those images into single words that can be used in his reCAPTCHA tests. To ensure people are correctly deciphering the printed text, reCAPTCHA requires users to type two words, one of which the system already knows. If the user types the known word correctly, the system has greater confidence that the unknown word was submitted correctly as well. If several visitors type the same answer for the unknown word, the system knows the word can be archived. Internet Archive director Brewster Kahle said he believes reCAPTCHA is a brilliant idea that utilizes the Internet to correct OCR mistakes. "This is an example of why having open collections in the public domain is important," Kahle said.
    Tuesday, May 29, 2007 12:14 PM
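
The two-word check described in the post can be sketched roughly as follows. This is only an illustration of the idea, not the actual reCAPTCHA code: the names `UnknownWord` and `check_answer`, the case-insensitive comparison, and the threshold of three matching answers are all assumptions, since the article only says "several visitors".

```python
from collections import Counter

# Assumed threshold: the article says "several visitors", not a specific number.
MIN_AGREEING_ANSWERS = 3


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces (an assumption)."""
    return text.strip().lower()


class UnknownWord:
    """One word image that OCR could not read, plus the guesses collected so far."""

    def __init__(self, image_id):
        self.image_id = image_id
        self.votes = Counter()

    def record(self, guess):
        """Count one visitor's guess for this word."""
        self.votes[normalize(guess)] += 1

    def accepted_text(self):
        """Return the agreed transcription, or None if there is no consensus yet."""
        if not self.votes:
            return None
        text, count = self.votes.most_common(1)[0]
        return text if count >= MIN_AGREEING_ANSWERS else None


def check_answer(known_word, typed_known, unknown, typed_unknown):
    """Pass the user only if the known word matches; count the other guess if so."""
    if normalize(typed_known) != normalize(known_word):
        return False  # known word was wrong, so the second guess is discarded
    unknown.record(typed_unknown)
    return True
```

A guess for the unreadable word is counted only when the same visitor got the known word right, and the word is treated as ready to archive once enough visitors have typed the same thing.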


  • Read it, hope you will get some new information.
    Tuesday, May 29, 2007 12:15 PM