
Who would have thought it?

  • Dr Elizabeth Hanson-Smith
    Message 1 of 1, Apr 1, 2011
      Another reason for putting large numbers of human heads together to solve a massive problem.
      Published in the NY Times.
      --Elizabeth H-S
      Deciphering Old Texts,
      One Woozy, Curvy Word at a Time
      By GUY GUGLIOTTA

      Published: March 28, 2011

      In the old days, anybody interested in seeing a Mets game during a trip
      to New York would have to call the team, or write away, or wait to get
      to the city and visit the box office. No more. Now, all it takes is to
      find an online ticket distributor. Sign in, click “Mets,” pick the date
      and pay.

      But before taking the money, the Web site might first present the reader
      with two sets of wavy, distorted letters and ask for a transcription.
      These things are called Captchas, and only humans can read them.
      Captchas ensure that robots do not hack secure Web sites.
      What Web readers do not know, however, is that they have also been
      enlisted in a project to transform an old book, magazine, newspaper or
      pamphlet into an accurate, searchable and easily sortable computer text
      file.
      One of the wavy words quite likely came from a digitized image from an
      old, musty text, and while the original page has already been scanned
      into an online database, the scanning programs made a lot of mistakes.
      Mets fans and other Web site users are correcting them. Buy a ticket to
      the ballgame, help preserve history.
      The set of software tools that accomplishes this feat is called
      reCaptcha and was developed by a team of researchers led by Luis von
      Ahn, a computer scientist at Carnegie Mellon University.
      Its pilot project was to clean up the digitized archive of The New York
      Times. Today it has become the principal method used by Google
      to authenticate text in Google Books, its vast project to digitize and
      disseminate rare and out-of-print texts on the Internet.
      Digitization is normally a three-stage process: create a photographic
      image of the text, also known as a bitmap; encode the text in a compact,
      easily handled and searchable form using optical character recognition
      software, commonly called O.C.R.; and, finally, correct the mistakes.
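
      As a rough illustration of the first two stages, here is a minimal
      sketch in Python, assuming the open-source Tesseract engine through
      the pytesseract wrapper (an assumption for illustration only; these
      are not the tools the article describes). The third stage,
      correction, is exactly what the rest of the article is about.

      # Stages 1 and 2 of digitization: photographic bitmap in, raw O.C.R. text out.
      # Tesseract via pytesseract is used here purely as an illustrative stand-in.
      from PIL import Image
      import pytesseract

      def digitize_page(image_path):
          bitmap = Image.open(image_path)                 # stage 1: the photographic image
          raw_text = pytesseract.image_to_string(bitmap)  # stage 2: optical character recognition
          return raw_text                                 # stage 3, correction, still needed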

      Today’s technology makes the first two steps relatively straightforward.
      The third, however, can be extremely difficult. For vintage
      19th-century texts in English, O.C.R. programs mess up or miss 10
      percent to 30 percent of the words. Only humans can fix the errors. The
      standard method, called key and verify, uses two transcribers to type
      the text independently and compares the results. This is time-consuming
      and extremely expensive.
      But in 2006, Dr. von Ahn’s team figured out a way around this obstacle.
      The ubiquitous Captchas, familiar to even the most casual Web user, were
      the perfect tools. Captchas, short for “completely automated public
      Turing test to tell computers and humans apart,” are impossible for
      machines to decipher, but easy for humans. (The test is named for the
      British computer pioneer Alan
      Turing.)
      Dr. von Ahn’s group estimated that humans around the world decode at
      least 200 million Captchas per day, at 10 seconds per Captcha. This
      works out to about 500,000 hours per day — a lot of applied brainpower
      being spent on what Dr. von Ahn regards as a fundamentally mindless
      exercise.
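
      The arithmetic behind that figure is easy to check: 200 million
      Captchas at 10 seconds apiece is 2 billion seconds a day, or a
      little over 555,000 hours.

      # Back-of-the-envelope check of the figure quoted above.
      captchas_per_day = 200_000_000
      seconds_per_captcha = 10
      hours_per_day = captchas_per_day * seconds_per_captcha / 3600
      print(round(hours_per_day))  # 555556, i.e. roughly 500,000 hours per day
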
      “So we asked, ‘Can we do something useful with this time?’ ” Dr. von Ahn
      recalled in a telephone interview. Instead of making Captchas out of
      random words printed in a woozy way, why not ask Web users to translate
      problem words from archival texts?
      By Dr. von Ahn’s estimate, reCaptcha is being used by 70 percent to 90
      percent of Web sites that have Captchas — including Ticketmaster, Facebook
      and local bank branches.
      Google bought Dr. von Ahn’s start-up in 2009 — he will not say how much
      it paid — and put it to work on Google Books. He says “several million”
      words are being translated every day.
      The Times, published since 1851, had already optically transcribed its
      archive when it contacted Dr. von Ahn. Robert Larson, the
      company’s vice president for search products, said the paper had
      “looked at various ways” to edit the text, “but Luis’s method was faster
      and cheaper.”
      Page images, particularly those printed before 1900, are loaded with
      smudges, stains, watermarks and crooked type, all of which give O.C.R.’s
      the fits. To fix the errors, Dr. von Ahn uses a number of programs,
      which when applied in the proper sequence magically transform troubled
      passages into easy-to-read prose.
      The first step is done in-house. Two different O.C.R. programs scan the
      photographic image. Both will make mistakes, but not necessarily the
      same mistakes.
      ReCaptcha flags as “suspicious” any word that is deciphered differently
      by the two programs or that does not appear in an English dictionary.
      The dictionary catches words that are misspelled the same way by both
      O.C.R.’s. Other programs examine the words on either side of the suspect
      word and make another guess based on that analysis.
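
      That flagging rule can be sketched in a few lines of Python. The
      function and variable names below are illustrative assumptions, not
      reCaptcha's actual code; the sketch assumes the two O.C.R. outputs
      have already been aligned word by word and that a plain set of
      English words stands in for the dictionary.

      def flag_suspicious(ocr_a_words, ocr_b_words, dictionary):
          """Mark a word suspicious if the two O.C.R. programs disagree on it
          or if their agreed reading is not in the English dictionary."""
          flagged = []
          for i, (a, b) in enumerate(zip(ocr_a_words, ocr_b_words)):
              if a.lower() != b.lower() or a.lower() not in dictionary:
                  flagged.append(i)  # position of the suspect word
          return flagged
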
      Then each suspicious word is turned into a Captcha. It is crucial to
      understand that the Captcha is a distorted version of the word as
      printed in the original photographic image. It is not made from the
      O.C.R.’s imagined translation, which is often unintelligible. The
      unknown word is then paired with a second Captcha word whose correct
      translation is already known. This is the “control.”
      Several Web users seeking entry to secure sites are then given both
      words and asked to decipher them separately.
      A correct answer for the control word proves that the user is a human
      and not a machine. Answers for the unknown word are compared with the
      O.C.R. guesses and the context analysis. If the system is satisfied that
      the answer is correct, then the game is over.
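
      A simplified sketch of that pairing-and-voting logic, again with
      invented names and a made-up agreement threshold rather than
      anything from the real system: an answer counts only if the control
      word is typed correctly, and the unknown word is accepted once
      enough independent users agree on the same reading.

      from collections import Counter

      def record_answer(control_text, control_answer, unknown_answer,
                        votes, threshold=3):
          """votes is a Counter of readings seen so far for one unknown word."""
          # A wrong control word means the user is untrusted; discard the answer.
          if control_answer.strip().lower() != control_text.lower():
              return None
          votes[unknown_answer.strip().lower()] += 1
          best, count = votes.most_common(1)[0]
          # Accept the reading once enough independent humans agree on it.
          return best if count >= threshold else None
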
      Dr. von Ahn acknowledged that some words cannot be transcribed, usually
      because the original text is torn or damaged in some other way. If
      enough users fail to identify an unknown, the word is deemed to be
      indecipherable and is marked as such.
      ReCaptcha also fails badly on cursive, Dr. von Ahn said, adding that
      “nobody reads handwriting anymore.” And reCaptcha so far translates only
      English words, even though many reCaptcha Web sites have overseas
      clients whose users are not necessarily English speakers.
      With all these constraints, reCaptcha nevertheless achieves an accuracy
      rate above 99 percent, which compares favorably with professional human
      transcribers. And Dr. von Ahn is convinced that performance will improve
      with experience, of which there will be no shortage.
      “We’ll be going for a long time,” he said. “There’s a lot of printed
      material out there.”



