nasty text processing problem (long, sorry)
I'm processing some free-form text fields.
We want to identify and flag names and nasty words from these comments on
surveys. The comment fields are $6000. It's sort of the reverse of a
spell checker... if the word is IN the list, grab it. Spell checkers find
stuff NOT in the list. Also, we will be doing this in many languages,
including Asian languages.
We have a list of names from an employee file. We may also search for
vulgar words etc. I'm thinking that George Carlin's list would be useful.
This needs to scale to handle 80,000-100,000 comment records if possible.
My plan so far:
* get unique names (as words), so I end up with a list of single words,
both surnames and first names.
* get all unique words from the comment fields.
* match these to get a list of the names that actually exist in the data.
*** THEN, run some kind of search-and-replace drill through the
verbatims, word by word.
This has the benefit of not running all 70k names through all the verbatims,
only the names or curses we know are actually there (800 or so).
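
To make the first three steps concrete, here is a rough sketch of the idea in Python (not Perl, and not what TextPipe does internally). The sample names and comments are made up, and the word pattern is only my assumption about what counts as a "word"; it relies on words being separated by spaces or punctuation, so it would not work as-is for Asian languages written without spaces.

```python
import re

# Hypothetical sample data; the real lists would come from the employee
# file and the survey comment fields.
employee_names = ["Mary Smith", "John O'Brien", "Mary Jones"]
comments = [
    "My manager Smith never listens.",
    "Great place to work!",
    "john is a darn good supervisor",
]

# Runs of letters (any script), optionally joined by apostrophes (O'Brien).
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def words(text):
    """Return the lowercased words found in text."""
    return [w.lower() for w in WORD_RE.findall(text)]

# Step 1: unique single-word names (first names and surnames together).
name_words = {w for name in employee_names for w in words(name)}

# Step 2: unique words across all comment fields.
comment_words = {w for c in comments for w in words(c)}

# Step 3: names that actually occur somewhere in the comments.
names_present = name_words & comment_words
```

The intersection is what keeps the later pass small: only the words in `names_present` need to be looked for in the verbatims.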
The processing to get the unique words was easily handled by this nifty tool
I have called TextPipe. http://www.crystalsoftware.com.au/
So when it comes to the *** THEN step... I still don't know a good
way to rip through the comments.
I suspect that Perl might be the answer, but I can't quite zero in on
the way to do it (being about 24 hrs into Perl). It seems like an iterative
grep of some sort (or shoving it through a hash table word by word), but I
don't necessarily want to replace the word; maybe stick some kind of special
character on the end to search for.
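
Something like the following is what I picture for the "special character on the end" idea, sketched in Python rather than Perl. The function name `mark_words`, the `~` marker, and the sample sentence are all made up for illustration; the set lookup plays the role of the hash table.

```python
import re

# Runs of letters (any script), optionally joined by apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def mark_words(comment, watch_words, marker="~"):
    """Append marker to every word of comment that is in watch_words.

    watch_words is a set of lowercased words; the original word is kept,
    only the marker is added after it.
    """
    def tag(match):
        word = match.group(0)
        return word + marker if word.lower() in watch_words else word
    return WORD_RE.sub(tag, comment)

marked = mark_words("My manager Smith never listens.", {"smith"})
```

A reviewer could then search the output for the marker character instead of for each word.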
I guess I want the record flagged, not text replaced. I think we should
have a person decide to suppress or not (but would like to flag records to
review). AND... this survey will be done over around 80,000 people, and in
several languages, including Asian languages. So whatever we do, it has to
scale. Fortunately, this processing won't be done all at once, but will
trickle in. Still we don't want this to bog the thing down.
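
For the flag-don't-replace version, a minimal sketch in Python might look like this. The record IDs, comments, and watch list are invented, and `flag_record` is a hypothetical helper. Each record is checked on its own against a set, so records can be processed as they trickle in, and the cost per record depends on the length of the comment rather than on the size of the word list.

```python
import re

# Runs of letters (any script), optionally joined by apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def flag_record(comment, watch_words):
    """Return the sorted watch-list words found in one comment."""
    found = {w.lower() for w in WORD_RE.findall(comment)} & watch_words
    return sorted(found)

# Hypothetical watch list (names plus vulgar words) and incoming records.
watch_words = {"smith", "john", "darn"}
records = [
    (101, "My manager Smith never listens."),
    (102, "Great place to work!"),
    (103, "john is a darn good supervisor"),
]

# Map record ID -> words that triggered the flag, for a person to review.
flagged = {}
for rec_id, comment in records:
    hits = flag_record(comment, watch_words)
    if hits:
        flagged[rec_id] = hits
```

This again assumes words are separated by spaces or punctuation. Text in languages such as Chinese or Japanese would need a word segmenter or a substring search instead of this word split.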
Any ideas would be appreciated!