
nasty text processing problem (long, sorry)

  • Vernee Stevens
    Message 1 of 1 , Jan 27, 2001
      I'm processing some free-form text fields.

      We want to identify and flag names and nasty words from these comments on
      some surveys. The comment fields are $6000. It's sort of the reverse of a
      spell checker... if the word is IN the list, grab it. Spell checkers find
      stuff NOT in the list. Also, we will be doing this in many languages,
      including Asian languages.


      We have a list of names from an employee file. We may also search for
      vulgar words, etc. I'm thinking that George Carlin's list would be useful
      here :-)

      This needs to scale to handle 80,000 to 100,000 comment records if possible.

      * Get unique names (as words), so I end up with a list of single words,
      both surnames and first names.
      * Get all unique words from the comment fields.
      * Match these and get a list of names that actually exist in the data.

      *** THEN, run some kind of search and replace drill through the
      verbatims word by word.

      This has the benefit of not running all 70k names through all the verbatims,
      only the names or curses we know are actually there (800 or so).
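
The three steps above can be sketched as a set intersection. This is a minimal sketch in Python, not the poster's method: the sample names and comments are hypothetical, and it assumes words are separated by spaces or punctuation, which does not hold for Asian languages written without spaces (those would need a separate word segmenter).

```python
import re

# Hypothetical sample data; real input would come from the employee
# file and the survey comment fields.
employee_names = ["Mary Smith", "John O'Brien", "Li Wei"]
comments = [
    "Mary in accounting was very helpful.",
    "The cafeteria food is terrible.",
    "Ask Wei or Smith about the new schedule.",
]

# One word = a run of Unicode letters, optionally with internal apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def words(text):
    """Return the lowercased words found in text."""
    return [w.lower() for w in WORD_RE.findall(text)]


def candidate_terms(names, comment_texts):
    """Return the name words that actually occur somewhere in the comments."""
    name_words = {w for name in names for w in words(name)}
    comment_words = {w for text in comment_texts for w in words(text)}
    return name_words & comment_words


print(sorted(candidate_terms(employee_names, comments)))
```

The result is the short list of terms worth searching for, rather than the full name list.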

      The processing to get the unique words was easily handled by this nifty tool
      I have called TextPipe. http://www.crystalsoftware.com.au/

      So when it comes to the *** THEN step, I still don't know a good
      way to rip through the comments.

      I suspect that Perl might be the answer, but I can't quite zero in on
      the way to do it (being about 24 hrs into Perl). It seems like an iterative
      grep of some sort, or shoving it through a hash table word by word. But I
      don't necessarily want to replace the word; maybe I'd stick some kind of
      special character on the end to search for.
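
One way to picture the "special character on the end" idea is a word-by-word pass that looks each word up in a hash-based set and appends a marker on a match. This is only a sketch in Python (the same lookup maps onto a Perl hash); the function name `mark_terms`, the `~` marker, and the sample terms are all assumptions made for illustration.

```python
import re

# One word = a run of Unicode letters, optionally with internal apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def mark_terms(text, terms, marker="~"):
    """Append marker to every word of text whose lowercase form is in terms."""

    def tag(match):
        word = match.group(0)
        return word + marker if word.lower() in terms else word

    return WORD_RE.sub(tag, text)


# Hypothetical term list and comment.
terms = {"smith", "wei"}
print(mark_terms("Ask Wei or Smith about the new schedule.", terms))
```

The original words stay in place, so a later search for the marker finds every hit without any text having been replaced.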

      I guess I want the record flagged, not the text replaced. I think we should
      have a person decide whether to suppress or not (but would like to flag
      records to review). AND... this survey will cover around 80,000 people, in
      several languages, including Asian languages. So whatever we do, it has to
      scale. Fortunately, this processing won't be done all at once, but will
      trickle in. Still, we don't want this to bog the thing down.
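
Flagging records without touching the text could look roughly like the following. It is a hedged sketch in Python under stated assumptions: the `(record_id, text)` record layout, the function name `flag_records`, and the sample data are hypothetical, and the word pattern again assumes space- or punctuation-delimited words.

```python
import re

# One word = a run of Unicode letters, optionally with internal apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def flag_records(records, terms):
    """Yield (record_id, sorted matched terms) for comments that need review.

    The comment text is never modified; a person decides what to suppress.
    """
    for record_id, text in records:
        hits = {w.lower() for w in WORD_RE.findall(text)} & terms
        if hits:
            yield record_id, sorted(hits)


# Hypothetical term list and records.
terms = {"mary", "smith", "wei"}
records = [
    (101, "Mary in accounting was very helpful."),
    (102, "The cafeteria food is terrible."),
    (103, "Ask Wei or Smith about the new schedule."),
]

for record_id, hits in flag_records(records, terms):
    print(record_id, ", ".join(hits))
```

Each word costs one set lookup, so the work grows with the amount of comment text rather than with the size of the name list, and because the function is a generator it can consume records as they trickle in.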

      Any ideas would be appreciated!

      V. Stevens