
Sentence comparison

  • youcef bey
    Message 1 of 7, Oct 9, 2007
      Dear Sirs,

      I have two sentences: S1 = W1 W2 W3 W4 ... Wn and S2 = Y1 Y2 Y3 Y4 ... Yn.

      I want to test whether S1 and S2 are equal (with fuzzy matching) and
      produce a similarity rate using n-grams.

      The Levenshtein algorithm works, but only at the word level. I want more
      accuracy, and I also want to compare the two methods (Levenshtein and n-gram).

      Example:


      S1 = "des prétraductions seront proposées aux traducteurs au cours du processus de traduction"

      S2 = "une traduction sera proposée aux traducteurs au cours du processus de traduction"

      With a word-level model the rate is not good, because "prétraductions" and
      "prétraduction" are not the same; likewise "seront" and "sera", etc.

      But if we compare "prétraductions" and "prétraduction" with n-grams, the
      result will be better because the only difference is the final "s".
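
      For example, with character trigrams: "prétraduction" yields 11 trigrams
      (pré, rét, étr, tra, rad, adu, duc, uct, cti, tio, ion) and "prétraductions"
      yields the same 11 plus "ons", so the Jaccard overlap is 11/12 ≈ 0.92,
      whereas an exact word match scores 0.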

      Does anyone have an idea? I hope someone can help.

      Many thanks in advance.

      Sincerely

      Youcef





      --- Bob Carpenter <carp@...> wrote:

      > This somehow slipped through and I just found it
      > in my unanswered mail bin. Sorry for taking
      > so long to respond.
      >
      > (Please re-mail questions or mail me directly
      > if I don't answer -- it's my intention to answer
      > all the mail to our mailing list.)
      >
      > > Here is what I'm trying to do:
      > >
      > > I have a database that contains documents. Prior to wanting to cluster
      > > the documents, I have parsed Named Entities from them and stored them in
      > > the database as well. So for each document, I have its Person, Location,
      > > and Org entities. Now I would like to figure out how
      > > closely related (or even just related) a Person in doc1 is to a
      > > Person/Loc/Org in doc2. I figured that I would want to use
      > > clustering to determine that first, the two documents are related and I
      > > could use that as my basis for saying an entity in doc1 is related to an
      > > entity in doc2 with some sort of accuracy.
      >
      > What does it mean to you for an entity in one doc
      > to be related to another? This is usually the
      > problem for clustering.
      >
      > We did use clustering for the John Smith example,
      > to figure out which John Smiths were the same.
      > But relatedness is a different problem.
      >
      > We tend to tackle relatedness by co-occurrence in the
      > same document. This can be extended with techniques
      > like SVD, but that's not in LingPipe (yet -- I'm
      > working on it for the next release, 3.2).
      >
      > > So, am I 1) on the right track? and 2) how do I use the clustering
      > > algos to do this?
      >
      > (1) I don't think so, and (2) I'm not sure it's
      > possible.
      >
      > > I am basically using the example code with a few
      > > tweaks. Instead of loading the documents from the filesystem, I'm
      > > creating the document text from the entities that I have stored. So a
      > > document contains just a list of entities.
      >
      > This may not be enough data for the clustering to
      > work. Entities tend to be pretty sparse, so you'll
      > get a lot of "completely not related" cases.
      >
      > > I make only one "cluster" as
      > > the reference set because I don't know how the documents are related at
      > > all. I then run it through the example and I get the Dendrogram and can
      > > print it and all that, but I don't know what to do with it from there.
      >
      > :-) This is the usual problem with clustering.
      >
      > > My understanding of what is in the Dendrogram is a bunch of iterations
      > > of going through k's until it gets to a max which I think is the number
      > > of documents (i.e. i have 10 docs so max(k) is 10). I can iterate
      > > through the partitions, but I don't get how I know which one to choose.
      > > Do I score each one and then choose one with the best accuracy (which in
      > > this case seems to always be k=1 because that is what the reference
      > > was)? And what number do I use as the accuracy (P, R, F (what is F?),
      >
      > F is the harmonic mean of P and R: F = 2PR/(P+R). That probably
      > doesn't help much, but it's supposed to be a single
      > number that is a blend of precision and recall.
      >
      > > accuracy, etc)? I would like to store my results as something like:
      > > |doc1ID|doc2ID|correlationValue|
      > > And if my correlationValue is high enough, it would be meaningful
      > > (higher than like 50% or something).
      >
      > I'm afraid our hard clusterers don't compute correlation
      > at all. Nor do they compute any other kind of confidence
      > measures.
      >
      > > In any case, I'm super new to all this and could use some explanation on
      > > how to use clustering. Any help would be appreciated!
      >
      > You might want to look at some more general references
      > on clustering.
      >
      > What you might also want to look at is things like
      > social network analysis. I was quite taken with
      > Chris Volinsky's app at AT&T. It creates a graph of
      > actors that played in the same movie and then uses
      > network algorithms to estimate how related two individuals
      > are. Here's a blog entry about Chris's work, with links:
      >
      > http://www.alias-i.com/blog/?p=38
      >
      > - Bob Carpenter
      > Alias-i
      >



    • Bob Carpenter
      Message 2 of 7, Oct 9, 2007
        youcef bey wrote:

        > I have two sentences: S1 = W1 W2 W3 W4 ... Wn and S2 = Y1 Y2 Y3 Y4 ... Yn.
        >
        > I want to test whether S1 and S2 are equal (with
        > fuzzy matching) and produce a similarity rate using n-grams.
        >
        > compare the two methods (Levenshtein and n-gram)
        >
        > S1 = "des prétraductions seront proposées aux traducteurs au cours du
        > processus de traduction"
        >
        > S2 = "une traduction sera proposée aux traducteurs au cours du processus
        > de traduction"
        >
        > With a word-level model the rate is not good, because "prétraductions" and
        > "prétraduction" are not the same; likewise "seront" and "sera", etc.
        >
        > But if we compare "prétraductions" and "prétraduction" with n-grams, the
        > result will be better because the only difference is the final "s".

        I've been meaning to write a tutorial on string
        comparison, as it's now relevant for all of our
        feature-based classifiers. So expect one at some
        future date. For now, let me sketch what you need to
        do. Let me know if this isn't clear and I can send
        some code samples.

        The simplest approach is to use one of the predefined
        implementations of Distance<CharSequence> (distances
        over character sequences, including strings). These
        include TF/IDF distance and Jaccard distance, for which
        you supply a tokenizer. Such tokenizers can return
        n-grams (of various lengths), word-based tokens, soundex
        reduced word tokens, stemmed and/or stoplisted tokens, etc.
        Just use the tokenizer filters to set up tokenizer
        factories of the right kind.
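
        For instance, a minimal sketch of this approach might look like the
        following (class names from LingPipe 3.x; untested here, so treat it
        as illustrative rather than definitive):

            import com.aliasi.spell.JaccardDistance;
            import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
            import com.aliasi.tokenizer.NGramTokenizerFactory;
            import com.aliasi.tokenizer.TokenizerFactory;

            public class SentenceSimilarity {
                public static void main(String[] args) {
                    String s1 = "des prétraductions seront proposées aux"
                        + " traducteurs au cours du processus de traduction";
                    String s2 = "une traduction sera proposée aux"
                        + " traducteurs au cours du processus de traduction";

                    // Character n-grams of lengths 2 through 4 as tokens.
                    TokenizerFactory ngramTf = new NGramTokenizerFactory(2, 4);
                    // Plain word tokens for comparison.
                    TokenizerFactory wordTf = new IndoEuropeanTokenizerFactory();

                    // proximity() is 1.0 for identical token sets, 0.0 for disjoint.
                    System.out.println("n-gram: "
                        + new JaccardDistance(ngramTf).proximity(s1, s2));
                    System.out.println("word:   "
                        + new JaccardDistance(wordTf).proximity(s1, s2));
                }
            }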

        You can also use our (weighted) edit distance, or
        simple EditDistance class. There's also Jaro-Winkler
        but that won't work for your sentence problem as it's
        really intended for words.

        Note that TF/IDF distances need to be trained first
        by showing them all the instances using the handle()
        method.
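
        Again as an untested sketch, with the handle() training step as just
        described (here the training data is just the two sentences themselves;
        in practice you'd hand it the whole corpus):

            import com.aliasi.spell.EditDistance;
            import com.aliasi.spell.TfIdfDistance;
            import com.aliasi.tokenizer.NGramTokenizerFactory;

            public class TfIdfSketch {
                public static void main(String[] args) {
                    String s1 = "des prétraductions seront proposées aux traducteurs";
                    String s2 = "une traduction sera proposée aux traducteurs";

                    TfIdfDistance tfIdf =
                        new TfIdfDistance(new NGramTokenizerFactory(2, 4));
                    // Train the IDF weights on every instance first.
                    tfIdf.handle(s1);
                    tfIdf.handle(s2);
                    System.out.println("tf/idf: " + tfIdf.proximity(s1, s2));

                    // Simple edit distance; 'true' counts transposition as one edit.
                    EditDistance ed = new EditDistance(true);
                    System.out.println("edits:  " + ed.distance(s1, s2));
                }
            }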

        The more complex approach is to build up feature
        vectors using the TokenFeatureExtractor in the
        tokenizer package. This can be done with any of
        the tokenizer factories, including RegEx, IndoEuropean,
        or n-grams (with min and max n-gram sizes to include).

        Then these need to be converted into sparse vectors
        and then compared using one of the distance measures.
        You'll almost certainly need a symbol table to produce
        the map from dimensions to counts needed to create
        a sparse floating-point vector.

        Here we have taxicab distance (Minkowski L1),
        Euclidean distance (Minkowski L2), general Minkowski
        distances (Ln), cosine distance (length-normalized L2),
        and various kernel distances such as Gaussian radial
        basis (circular only -- no covariance).
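
        Sketching that pipeline (the MapSymbolTable, SparseFloatVector, and
        CosineDistance signatures here are my best recollection of the 3.x
        API, so double-check them against the Javadoc):

            import com.aliasi.matrix.CosineDistance;
            import com.aliasi.matrix.SparseFloatVector;
            import com.aliasi.matrix.Vector;
            import com.aliasi.symbol.MapSymbolTable;
            import com.aliasi.tokenizer.NGramTokenizerFactory;
            import com.aliasi.tokenizer.TokenFeatureExtractor;
            import java.util.HashMap;
            import java.util.Map;

            public class FeatureVectorSketch {
                public static void main(String[] args) {
                    TokenFeatureExtractor fe =
                        new TokenFeatureExtractor(new NGramTokenizerFactory(2, 4));
                    MapSymbolTable st = new MapSymbolTable();

                    // Feature maps are keyed on token strings with count values.
                    Map<Integer,Number> c1 =
                        toCounts(fe.features("des prétraductions seront proposées"), st);
                    Map<Integer,Number> c2 =
                        toCounts(fe.features("une traduction sera proposée"), st);

                    // Both vectors need the same dimensionality for the distances.
                    int dims = st.numSymbols();
                    Vector v1 = new SparseFloatVector(c1, dims);
                    Vector v2 = new SparseFloatVector(c2, dims);
                    System.out.println("cosine: "
                        + new CosineDistance().distance(v1, v2));
                }

                // The symbol table maps each feature name to a vector dimension.
                static Map<Integer,Number> toCounts(Map<String,? extends Number> feats,
                                                    MapSymbolTable st) {
                    Map<Integer,Number> counts = new HashMap<Integer,Number>();
                    for (Map.Entry<String,? extends Number> e : feats.entrySet())
                        counts.put(st.getOrAddSymbol(e.getKey()), e.getValue());
                    return counts;
                }
            }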

        The more general approach plays nicely with feature-extractor
        based classifiers like the TfIdf classifier, perceptron
        classifier and k-nearest neighbors classifiers, as well
        as any of the approaches to clustering based on distance.

        > The Levenshtein algorithm works, but only at the word level.

        Some folks (notably William Cohen and Mikhail Bilenko) have
        defined fuzzy string matching that combines a token-to-token
        similarity model with TF/IDF weighting at the token level
        (they call it Soft TF/IDF; it uses Jaro-Winkler rather than
        Levenshtein at the token level).
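
        Their exact formulation is more involved, but here's a much-simplified
        sketch of the token-level idea, dropping the TF/IDF weighting and just
        averaging each token's best Jaro-Winkler match:

            import com.aliasi.spell.JaroWinklerDistance;

            public class SoftMatchSketch {
                // For each token of s1, take its best Jaro-Winkler proximity to
                // any token of s2, and average. Real Soft TF/IDF also weights
                // tokens by TF/IDF and applies a match threshold.
                static double softProximity(String[] ts1, String[] ts2) {
                    JaroWinklerDistance jw = JaroWinklerDistance.JARO_WINKLER_DISTANCE;
                    double sum = 0.0;
                    for (String t1 : ts1) {
                        double best = 0.0;
                        for (String t2 : ts2)
                            best = Math.max(best, jw.proximity(t1, t2));
                        sum += best;
                    }
                    return sum / ts1.length;
                }

                public static void main(String[] args) {
                    String[] s1 = "des prétraductions seront proposées aux traducteurs".split(" ");
                    String[] s2 = "une traduction sera proposée aux traducteurs".split(" ");
                    System.out.println("soft: " + softProximity(s1, s2));
                }
            }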

        - Bob Carpenter
        Alias-i
      • youcef bey
        Message 3 of 7, Feb 7, 2008
          Dear Mr. Carpenter,

          A few weeks ago I posted the following email (see below), in which I asked about
          using LingPipe for sentence or string comparison. Since then I have read many of
          the papers and methods that you pointed me to.

          As I am still interested in the problem, I would like to know whether the string
          comparison tutorial you mentioned is finished. Is it included in the latest version?

          You also offered to send some code samples. I need a few to get started; apart
          from the API documentation, I could not work out how to use these methods.

          Thank you very much.
          ---------------------
          Youcef BEY

          > [Youcef's original question and Bob Carpenter's reply of Oct 9, 2007,
          > quoted in full above.]



        • youcef bey
          Message 4 of 7, Feb 7, 2008
          [A verbatim resend of the previous message.]
          • Bob Carpenter
            Message 5 of 7, Feb 8, 2008
              I'm afraid the short answer is no -- I haven't
              written the tutorial for string comparison and
              don't currently have any code samples to send
              you that aren't tied up in proprietary customer
              projects.

              I'll try to get to it during the next week.
              This is definitely something that should go
              in our tutorials.

              - Bob Carpenter
              Alias-i