Dear Sirs,

I have two sentences: S1 = W1 W2 W3 W4 ... Wn and S2 = Y1 Y2 Y3 Y4 ... Yn.

I want to compare whether the sentences S1 and S2 are equal (with fuzzy-like matching) and produce a similarity score using n-grams.

The Levenshtein algorithm works, but only at the word level. I want more accuracy, and I would also like to compare the two methods (Levenshtein and n-grams).

Example:

S1 = "des prétraductions seront proposées aux traducteurs au cours du processus de traduction"

S2 = "une traduction sera proposée aux traducteurs au cours du processus de traduction"

With a word-level model the score is not good, because "prétraductions" and "prétraduction" are not the same word, nor are "seront" and "sera", etc.

But if we compare "prétraductions" and "prétraduction" with n-grams, the result will be better, because the only difference is the final "s".
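To make the contrast concrete, here is a small self-contained sketch (plain Python, not LingPipe code) comparing an exact word match, character-level Levenshtein similarity, and character-bigram Jaccard similarity on the two word forms:

```python
def levenshtein(a, b):
    """Classic edit distance; works on any sequence (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ngram_similarity(a, b, n=2):
    """Jaccard overlap of character n-gram sets."""
    ga = {a[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b[i:i + n] for i in range(len(b) - n + 1)}
    return len(ga & gb) / len(ga | gb)

w1, w2 = "prétraductions", "prétraduction"
exact = 1.0 if w1 == w2 else 0.0                            # word level: 0.0
chars = 1.0 - levenshtein(w1, w2) / max(len(w1), len(w2))   # ≈ 0.93
grams = ngram_similarity(w1, w2)                            # 12/13 ≈ 0.92
```

At the word level the two forms simply fail to match, while both character-level measures score them above 0.9.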

Does anyone have any ideas? I hope someone can help...

Many thanks in advance.

Sincerely,

Youcef

--- Bob Carpenter <carp@...> wrote:

> This somehow slipped through and I just found it


> in my unanswered mail bin. Sorry for taking

> so long to respond.

>

> (Please re-mail questions or mail me directly

> if I don't answer -- it's my intention to answer

> all the mail to our mailing list.)

>

> > Here is what I'm trying to do:

> >

> > I have a database that contains documents. Prior to wanting to cluster

> > the documents, I have parsed Named Entities from them and stored them in

> > the database as well. So for each document, I have Person, Location,

> > and Org entities for each document. Now I would like to figure out how

> > closely related (or even just related) a Person in doc1 is to a

> > Person/Loc/Org in doc2. I figured that I would want to use

> > clustering to determine that first, the two documents are related and I

> > could use that as my basis for saying an entity in doc1 is related to an

> > entity in doc2 with some sort of accuracy.

>

> What does it mean to you for an entity in one doc

> to be related to another? This is usually the

> problem for clustering.

>

> We did use clustering for the John Smith example,

> to figure out which John Smiths were the same.

> But relatedness is a different problem.

>

> We tend to tackle relatedness by co-occurrence in the

> same document. This can be extended with techniques

> like SVD, but that's not in LingPipe (yet -- I'm

> working on it for the next release, 3.2).
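As a sketch of the co-occurrence idea (illustrative Python, not LingPipe; the document sets and entity names are made up): treat each document as the set of entities extracted from it, and count how many documents each pair of entities shares.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """docs: one set of entity mentions per document.
    Returns a Counter over unordered entity pairs."""
    pairs = Counter()
    for entities in docs:
        for a, b in combinations(sorted(entities), 2):
            pairs[(a, b)] += 1
    return pairs

# made-up example documents, each reduced to its extracted entities
docs = [
    {"John Smith", "Acme Corp", "Boston"},
    {"John Smith", "Acme Corp"},
    {"Jane Doe", "Boston"},
]
counts = cooccurrence_counts(docs)
```

Higher counts suggest more strongly related entities; techniques like SVD then smooth these raw counts.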

>

> > So, am I 1) on the right track? and 2) how do I use the clustering

> > algos to do this?

>

> (1) I don't think so, and (2) I'm not sure it's

> possible.

>

> > I am basically using the example code with a few

> > tweaks. Instead of loading the documents from the filesystem, I'm

> > creating the document text from the entities that I have stored. So a

> > document contains just a list of entities.

>

> This may not be enough data for the clustering to

> work. Entities tend to be pretty sparse, so you'll

> get a lot of "completely not related" cases.

>

> > I make only one "cluster" as

> > the reference set because I don't know how the documents are related at

> > all. I then run it through the example and I get the Dendrogram and can

> > print it and all that, but I don't know what to do with it from there.

>

> :-) This is the usual problem with clustering.

>

> > My understanding of what is in the Dendrogram is a bunch of iterations

> > of going through k's until it gets to a max which I think is the number

> > of documents (i.e. i have 10 docs so max(k) is 10). I can iterate

> > through the partitions, but I don't get how I know which one to choose.

> > Do I score each one and then choose one with the best accuracy (which in

> > this case seems to always be k=1 because that is what the reference

> > was)? And what number do I use as the accuracy (P, R, F (what is F?),

>

> F is the harmonic mean of P and R. That probably

> doesn't help much, but it's supposed to be a single

> number that is a blend of precision and recall.
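Concretely, F is the harmonic mean F = 2PR / (P + R); a one-line sketch:

```python
def f_measure(precision, recall):
    # harmonic mean of precision and recall; 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F is dragged down by whichever of precision or recall is lower.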

>

> > accuracy, etc)? I would like to store my results as something like:

> > |doc1ID|doc2ID|correlationValue|

> > And if my correlationValue is high enough, it would be meaningful

> > (higher than like 50% or something).

>

> I'm afraid our hard clusterers don't compute correlation

> at all. Nor do they compute any other kind of confidence

> measures.

>

> > In any case, I'm super new to all this and could use some explanation on

> > how to use clustering. Any help would be appreciated!

>

> You might want to look at some more general references

> on clustering.

>

> What you might also want to look at is things like

> social network analysis. I was quite taken with

> Chris Volinsky's app at AT&T. It creates a graph of

> actors that played in the same movie and then uses

> network algorithms to estimate how related two individuals

> are. Here's a blog entry about Chris's page, with links:

>

> http://www.alias-i.com/blog/?p=38

>

> - Bob Carpenter

> Alias-i

>

youcef bey wrote:

> I have 2 sentences : S1 = W1 W2 W3 W4 ... Wn and S2 = Y1 Y2 Y3 Y4 ... Yn.

> I want to compare if the sentences S1 and S2 are equal (with
> fuzzy-like matching) to produce a similarity score using n-grams.
>
> compare the two methods (Levenshtein and n-grams)
>
> S1 = "des prétraductions seront proposées aux traducteurs au cours du
> processus de traduction"
>
> S2 = "une traduction sera proposée aux traducteurs au cours du processus
> de traduction"
>
> with a word model the score is not good because "prétraductions" and
> "prétraduction" are not the same, nor are "seront" and "sera", etc.
>
> But if we compare "prétraductions" and "prétraduction" with n-grams the
> result will be better because the only difference is the "s".

I've been meaning to write a tutorial on string
comparison, as it's now relevant for all of our

feature-based classifiers. So expect that at some

future date. For now, let me sketch what you need to

do. Let me know if this isn't clear and I can send

some code samples.

The simplest approach is to use one of the predefined

implementations of Distance<CharSequence> (distances

over character sequences, including strings). These

include TF/IDF distance and Jaccard distance where

you define a tokenizer. Such tokenizers can return

n-grams (of various lengths), word-based tokens, soundex

reduced word tokens, stemmed and/or stoplisted tokens, etc.

Just use the tokenizer filters to set up tokenizer

factories of the right kind.
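The tokenizer-parameterized Jaccard idea can be sketched in a few lines of plain Python (illustrative only, not LingPipe's API; the example sentences are shortened forms of the ones above). Swapping the tokenizer changes word-level matching into character-n-gram matching with no other code changes:

```python
def word_tokens(text):
    return set(text.lower().split())

def char_ngrams(n):
    """Returns a tokenizer producing character n-grams of length n."""
    def tokenize(text):
        return {text[i:i + n] for i in range(len(text) - n + 1)}
    return tokenize

def jaccard_similarity(s1, s2, tokenize):
    a, b = tokenize(s1), tokenize(s2)
    return len(a & b) / len(a | b) if a | b else 1.0

s1 = "des prétraductions seront proposées aux traducteurs"
s2 = "une traduction sera proposée aux traducteurs"
word_sim = jaccard_similarity(s1, s2, word_tokens)        # = 0.2
trigram_sim = jaccard_similarity(s1, s2, char_ngrams(3))  # substantially higher
```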

You can also use our (weighted) edit distance, or

simple EditDistance class. There's also Jaro-Winkler

but that won't work for your sentence problem as it's

really intended for words.

Note that TF/IDF distances need to be trained first

by showing them all the instances using the handle()

method.
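A toy illustration of that training step (this is not LingPipe's TfIdfDistance, just the shape of the idea): a handle() method observes one instance at a time to accumulate document frequencies, and similarity is then cosine over TF/IDF-weighted token counts.

```python
import math
from collections import Counter

class TfIdfSimilarity:
    """Toy stand-in: must see training instances before it can compare."""
    def __init__(self, tokenize):
        self.tokenize = tokenize
        self.doc_freq = Counter()
        self.n_docs = 0

    def handle(self, text):
        # training: observe one instance, updating document frequencies
        self.n_docs += 1
        self.doc_freq.update(set(self.tokenize(text)))

    def idf(self, term):
        return math.log((1 + self.n_docs) / (1 + self.doc_freq[term]))

    def similarity(self, s1, s2):
        # cosine over TF/IDF-weighted token counts
        v1, v2 = Counter(self.tokenize(s1)), Counter(self.tokenize(s2))
        dot = sum(v1[t] * v2[t] * self.idf(t) ** 2 for t in v1.keys() & v2.keys())
        n1 = math.sqrt(sum((c * self.idf(t)) ** 2 for t, c in v1.items()))
        n2 = math.sqrt(sum((c * self.idf(t)) ** 2 for t, c in v2.items()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

sim = TfIdfSimilarity(str.split)
for doc in ["a b c", "a b", "a d"]:   # train first, as with handle() above
    sim.handle(doc)
```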

The more complex approach is to build up feature

vectors using the TokenFeatureExtractor in the

tokenizer package. This can be done with any of

the tokenizer factories, including RegEx, IndoEuropean,

or n-grams (with min and max n-gram sizes to include).

Then these need to be converted into sparse vectors

and then compared using one of the distance measures.

You'll almost certainly need a symbol table to produce

the map from dimensions to counts needed to create

a sparse floating-point vector.
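The symbol-table step can be sketched as follows (illustrative Python; the class and function names are hypothetical, not LingPipe's):

```python
class SymbolTable:
    """Hypothetical symbol table: maps each token to a stable integer dimension."""
    def __init__(self):
        self.ids = {}

    def id_for(self, token):
        return self.ids.setdefault(token, len(self.ids))

def sparse_vector(tokens, table):
    """Turn a token list into a sparse {dimension: count} map."""
    vec = {}
    for tok in tokens:
        dim = table.id_for(tok)
        vec[dim] = vec.get(dim, 0.0) + 1.0
    return vec

table = SymbolTable()
v1 = sparse_vector("au cours du processus".split(), table)
v2 = sparse_vector("cours du processus de traduction".split(), table)
```

Sharing one table across all strings is what makes the dimensions comparable between vectors.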

Here we have taxicab distance (Minkowski L1),

Euclidean distance (Minkowski L2), general Minkowski

distances (Ln), cosine distance (length-normalized L2), and
various kernel distances such as Gaussian Radial
Basis (circular only -- no covariance).
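Sketches of a few of those distances over sparse vectors represented as {dimension: value} dicts (illustrative Python, not LingPipe's API):

```python
import math

def minkowski(u, v, p):
    """Ln distance between sparse vectors (dicts of dimension -> value)."""
    dims = set(u) | set(v)
    return sum(abs(u.get(d, 0.0) - v.get(d, 0.0)) ** p for d in dims) ** (1.0 / p)

def cosine_distance(u, v):
    """1 - cosine similarity (a length-normalized comparison)."""
    dot = sum(x * v.get(d, 0.0) for d, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

def gaussian_rbf_distance(u, v, sigma=1.0):
    """1 - k(u, v) for a single-bandwidth (circular) Gaussian RBF kernel."""
    d2 = minkowski(u, v, 2) ** 2
    return 1.0 - math.exp(-d2 / (2.0 * sigma ** 2))
```

Taxicab distance is minkowski(u, v, 1) and Euclidean distance is minkowski(u, v, 2).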

The more general approach plays nicely with feature-extractor

based classifiers like the TfIdf classifier, perceptron

classifier and k-nearest neighbors classifiers, as well

as any of the approaches to clustering based on distance.

> The Levenshtein algorithm works, but only at the word level.

Some folks (notably William Cohen and Mikhail Bilenko) have
defined fuzzy string matching based on combining a token-to-token
similarity model with a TF/IDF model at the token level (they call it
Soft TF/IDF; it uses Jaro-Winkler rather than Levenshtein
at the token level).
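A simplified sketch of the soft-matching idea (illustrative Python; Cohen and Bilenko additionally weight each match by TF/IDF and use Jaro-Winkler, where this sketch substitutes a normalized Levenshtein similarity at the token level):

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def token_sim(a, b):
    """Normalized edit similarity in [0, 1] (stand-in for Jaro-Winkler)."""
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0

def soft_match_score(s1, s2, theta=0.8):
    """Credit each token of s1 with its best match in s2 if above theta.
    (The real Soft TF/IDF additionally weights each match by TF/IDF.)"""
    t1, t2 = s1.split(), s2.split()
    total = sum(best for best in
                (max(token_sim(a, b) for b in t2) for a in t1)
                if best >= theta)
    return total / max(len(t1), len(t2))

s1 = ("des prétraductions seront proposées aux traducteurs "
      "au cours du processus de traduction")
s2 = ("une traduction sera proposée aux traducteurs "
      "au cours du processus de traduction")
score = soft_match_score(s1, s2)  # ≈ 0.74
```

Near-matches like "proposées"/"proposée" earn partial credit instead of counting as complete mismatches.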

- Bob Carpenter

Alias-i

Dear Mr. Bob Carpenter,

A few weeks ago I posted the following email (see above), asking about the possible use of LingPipe for sentence or string comparison. Since then I have read many of the papers and methods that you pointed me to.

As I'm still interested in the problem, I would like to know whether the string-comparison tutorial (that you told me about) is finished. Is it included in the latest version?

You proposed to send me some code samples. In fact, I need your help to get started with some samples; I could not work out how to use these methods from the API alone.

Thank you very much.

---------------------

Youcef BEY



I'm afraid the short answer is no -- I haven't

written the tutorial for string comparison and

don't currently have any code samples to send

you that aren't tied up in proprietary customer

projects.

I'll try to get to it during the next week.

This is definitely something that should go

in our tutorials.

- Bob Carpenter

Alias-i