Some basic questions about linkages / algorithms

  • Lee Romero
    Message 1 of 5, Oct 13, 2008
      Hi all - (Apologies for the length of this - as I got to describing
      what I'm trying to do, I ended up writing more than I thought I
      would....)

      I am currently working on an effort to implement a type of "expertise
      location" function based on generating a profile of someone from
      his/her team/project associations and his/her work products (that's a
      pretty simple description for what I'm doing but it gets the point
      across).

      Part of this results in a set of keywords associated with each person
      in the population - the "expertise location" can then be thought of as
      providing a standard keyword search engine / interface on
      top of the set of keywords associated with the people (there's also
      some navigation in the application based on the keywords but I'm going
      to ignore that for the moment).

      For any one person in the system, the defining characteristics are a
      set of activities / team memberships and, for each of those, there are
      a set of keywords. Collapsing together all of the keywords into a
      "pool" of keywords for a person, you can think of the profile of a
      person as a set of weighted keywords (where the weight for a keyword
      is the number of occurrences of that keyword within the set of
      activities / team memberships).

      There are no specific restrictions about what is a valid keyword
      except that anything you might think of as a common "stop word" in a
      search engine is excluded (things like "the", "of", etc.). So my
      profile might (in part) look like: ("search", 100), ("knowledge", 40),
      ("management", 80), ("engineering", 20), etc.

      Also, because of how the keywords are generated and weighted, there is
      no upper limit on the weight of a keyword for any one person. To
      follow on from my example, I might have an additional 80 keywords with
      a total weight of, say, 5000, while someone else might have a total of
      40 keywords and a total weight of 4500.
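
      Just to make that concrete, a person's profile in this scheme is
      nothing more than a mapping from keyword to weight - something
      like the following Python sketch (the names and numbers are the
      illustrative ones from above, not real data):

      from collections import Counter

      # A person's profile: keyword -> weight, where the weight is the
      # number of occurrences of that keyword across the person's
      # activities / team memberships.
      profile_x = Counter({
          "search": 100,
          "knowledge": 40,
          "management": 80,
          "engineering": 20,
          # ... plus, in this example, roughly 80 more keywords
      })

      # The total profile weight is just the sum of the keyword weights.
      total_weight_x = sum(profile_x.values())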

      All of this is pretty straightforward and, even though the basic idea
      seems pretty simple, the keyword search function across these profiles
      provides a surprisingly high correlation to finding someone with a
      particular skill or expertise. (This makes me happy, as I wasn't sure
      if it would really "work" as expected like this!)

      What I've been considering now is to take these people profiles and
      try to do two additional things which are related: provide a measure
      of "similarity" between two people and, from that measure of
      similarity, try to identify "invisible" communities of interest (by
      identifying pockets of people who have high similarity among
      themselves).

      The idea of a "similarity" metric is intriguing because by itself it
      means that the presentation of a particular person's profile can
      include a means to identify people similar to the one you're looking
      at. Though it's kind of crude to liken this to an ecommerce site, I
      do think of it as similar to the function you see on many sites
      where, when you're looking at a product, you are presented with a
      list of similar products. ("People who have found this person interesting
      might find these other people interesting!" :-) )

      My question: Has anyone done something similar to this before? If
      so, what approach have you taken to defining the similarity
      measurement?

      Here's the quandary I've run into, which has prompted my question:

      * To measure similarity between two people (X and Y), I first match
      the keywords between X and Y.

      * For each keyword (KW) the two people have in common, I credit each
      person with the minimum of the weight of KW for X and the weight of KW
      for Y. So if two people have the keyword "engineering" and one has a
      weight of 20 and another has a weight of 60, their similarity for this
      keyword is 20.

      * I then sum up these keyword weights to get a total "similarity
      weight" between the two people. Let's say it's 800 across all of the
      common keywords for two people.

      * Lastly, in order to reflect how much of that commonality describes
      each of the people, I calculate the percentage of someone's profile
      that is "covered" by the similarity weight to get the overall
      "similarity measure". So if X has a total profile weight of 5000 and
      Y has a total profile weight of 4000, that means that the person X is
      16% (800/5000) similar to Y, while Y is 20% (800/4000) similar to X.
      This asymmetry makes some sense to me because we are comparing
      "different size" profiles (so person Y can seem more like person X
      than person X might seem like person Y). A small sketch of this
      computation follows below.
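
      In rough Python terms, the computation is something like the
      following (purely illustrative - the function name and the
      plain-dict profiles are just my sketch of the steps above):

      def similarity(profile_x, profile_y):
          # Match the keywords the two people have in common.
          common = profile_x.keys() & profile_y.keys()

          # Credit each shared keyword with the smaller of the two
          # weights, then sum to get the total "similarity weight".
          similarity_weight = sum(min(profile_x[kw], profile_y[kw])
                                  for kw in common)

          # Normalize by each person's total profile weight to get the
          # (asymmetric) "similarity measure".
          total_x = sum(profile_x.values())
          total_y = sum(profile_y.values())
          return (similarity_weight,
                  similarity_weight / total_x,  # how similar X is to Y
                  similarity_weight / total_y)  # how similar Y is to X

      With the numbers above (totals of 5000 and 4000 and a shared
      weight of 800), this returns (800, 0.16, 0.20).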

      Now, getting back to my question - I can link people based on either
      of these computations - the "similarity weight" or the "similarity
      measure" - does either make more sense?

      If I use the "similarity weight" then I seem to have an issue where if
      two people both have "heavy" profile weights, they can seem highly
      similar based on their similarity weight even though their similarity
      weight might be a relatively small percentage of their total profile
      weight (say an overlap of 1000 when the weights of the two profiles
      are 8000 and 10000). Conversely, two people who both have small
      profile weights will seem dissimilar even if they have 100% overlap
      in their profiles!

      On the other hand, if I use the "similarity measure", it seems likely
      that anyone with a "heavy" profile weight will seem to have weak links
      because people are likely to have low percentage overlap with them,
      while people with small profiles can seem very similar (and so tightly
      linked) based on just a couple of keywords.
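
      Putting numbers on the two cases (the 8000/10000 figures are from
      above; the totals of 300 are hypothetical, just to illustrate):

      # Heavy profiles, modest absolute overlap: large similarity
      # weight, small similarity measures.
      overlap, total_a, total_b = 1000, 8000, 10000
      print(overlap, overlap / total_a, overlap / total_b)  # 1000 0.125 0.1

      # Small, completely overlapping profiles: tiny similarity weight,
      # but 100% similarity measures.
      overlap, total_c, total_d = 300, 300, 300
      print(overlap, overlap / total_c, overlap / total_d)  # 300 1.0 1.0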


      Any thoughts from ONA practitioners on what might be the best way to
      link people in this situation?

      Sorry for the length - I'm planning to write about this experiment on
      my blog but thought I'd see whether you might have a suggestion for
      how to measure this and how to break the logjam in my head :-)

      Regards
      Lee Romero

      PS - Yes, I am aware of the perils of ONA via data mining - I do not
      ascribe much to the analysis here other than possibly finding these
      "communities of interest" and not so much about how significant
      someone's position in the network might be or anything like that.
    • Valdis Krebs
      Message 2 of 5, Oct 13, 2008
        Lee,

        Who picks the words? Who assigns the words to whom? Who weights each
        word for each person?

        Do people also nominate each other for who they actually go to for
        expertise on A, B or C?

        "X may be a high word & high weight person that no one goes to because
        X is a jerk."

        You can find similarity by attributes [your approach] or links or both.

        See my analysis of political books on Amazon [people that bought this
        also bought that...]

        http://www.orgnet.com/divided.html

        Valdis



      • Lee Romero
        Message 3 of 5, Oct 13, 2008
          Hi Valdis - for the most part, the words are identified through the
          person's activities (so, largely, by the person him/herself).

          Some examples:

          I am a member of a mailing list named "software-development-community"
          and I have posted 20 times to that mailing list in the last year on
          various topics (different subject lines).

          I also may have posted to my blog a number of items with various
          titles, slotted into a set of categories (which I, as the author,
          have assigned).

          Lastly, I have edited, say, two dozen different pages in our wiki
          site with various titles, each assigned various keywords.

          Based on the above, my profile would include keywords generated from:

          1. The names of the mailing lists of which I'm a member with a weight
          equal to the # of posts I've made (so each of "software",
          "development" and "community" are weighted 20 based on 20 posts).

          2. The words from the subject lines of those 20 emails I've posted to
          the software-development-community mailing list, with a weight of 1
          for each occurrence of any given word. So if, say, I have written
          5 of those posts about Eclipse in that span of time (and the word
          "Eclipse" occurs in the subject line of all five of those),
          "Eclipse" would get a weight of 5 from my mailing list posts.

          3. The titles and categories for my blog posts are also broken into
          keywords and assigned a weight of 1 for each occurrence of any given
          word.

          4. The titles and categories of each wiki page are broken down into
          keywords and assigned a weight equal to the number of times I've
          edited any given page (so if I edit a particular page 10 times, each
          keyword from the title or category is assigned a weight of 10 for that
          page).

          And so on. I say that the keywords are self-identified "for the
          most part" because I probably don't have a lot of control over the
          actual names of the projects or teams on which I work - my hope as
          an implementor of this idea is that those names will generally be
          chosen rationally, so that they reflect words someone might
          actually use when searching for someone else.
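
          To illustrate those rules in code, here is a rough sketch of what
          I described above (the helper names and input shapes are made up,
          not the actual implementation):

          from collections import Counter

          STOP_WORDS = {"the", "of", "and", "a", "to"}  # illustrative list

          def keywords(text):
              # Split a list name, subject line or title into
              # non-stop-word keywords.
              return [w for w in text.lower().replace("-", " ").split()
                      if w not in STOP_WORDS]

          def build_profile(mailing_lists, email_subjects, blog_titles,
                            wiki_pages):
              # mailing_lists:  list of (list_name, number_of_posts)
              # email_subjects: list of subject-line strings
              # blog_titles:    list of blog title / category strings
              # wiki_pages:     list of (title_or_category, number_of_edits)
              profile = Counter()
              # 1. Mailing-list names, weighted by posts to that list.
              for name, posts in mailing_lists:
                  for kw in keywords(name):
                      profile[kw] += posts
              # 2. Email subject lines, weight 1 per word occurrence.
              for subject in email_subjects:
                  profile.update(keywords(subject))
              # 3. Blog titles and categories, weight 1 per occurrence.
              for title in blog_titles:
                  profile.update(keywords(title))
              # 4. Wiki page titles / categories, weighted by edit count.
              for title, edits in wiki_pages:
                  for kw in keywords(title):
                      profile[kw] += edits
              return profile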

          I've currently included about a dozen sources of activities or project
          / team memberships in this work - which is not yet sufficient for all
          people who might be included but I think it's sufficient to at least
          validate this approach to doing the "expertise location" that
          originally drove the work.

          So if you write a blog post or send an email or edit a wiki page with
          the word "jerk" in it, that keyword ends up associated with you.
          Because the sources for this are internal (corporate), I generally
          don't think that will be that much of a problem.

          The idea is to try to identify keywords that are relevant to someone
          using a means that directly looks at what you work on or write about
          or are assigned to, etc. Hopefully, it can reduce the need to
          maintain some definition of my own skills in a separate system
          that I (or my manager or perhaps my co-workers) have to keep
          updated.
          Basically, I'm "tagging" myself indirectly through my work. (I think
          that if you had a system that directly stored skills of workers -
          i.e., a "skills inventory database" - that could be treated as nothing
          more than an additional data source for this - probably one with a
          higher weighting than other sources, obviously).

          Does that answer the question?

          Thanks for the pointers, too, Valdis.

          Regards
          Lee

        • Charles Armstrong
          Message 4 of 5, Oct 15, 2008
            hallo lee

            what you describe is very close to what trampoline's sonar technology does. sonar server gobbles up emails, ldap, documents and so forth ("work products" as you term it), deduces each person's expertise and knowledge through statistical language modeling, then calculates the network characteristics for each person using ona techniques. armed with this intelligence sonar can start to identify documents and contacts which are likely to be relevant to a particular user.

            the specific use case you mention of identifying emergent communities of interest is something we've encountered a fair amount of demand for with our flightdeck product. basically it involves identifying people who share a strong interest in a particular field (though their interests may diverge in other areas) but have no identifiable communication with each other. adding a time element to this, up-weighting areas of interest that have only started being identified recently, helps highlight fast-growing trends.
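
            just to illustrate the time element (this is only a sketch, not a description of sonar's actual method): a recency weight that halves every so many days could be applied to each keyword occurrence, e.g.

            from datetime import date

            def recency_weight(event_date, today=None, half_life_days=90):
                # an occurrence observed today counts as 1.0, one observed
                # a half-life ago counts as 0.5, and so on (the 90-day
                # half-life is an arbitrary illustrative choice).
                today = today or date.today()
                age_days = (today - event_date).days
                return 0.5 ** (age_days / half_life_days)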

            i can't answer your specific question about similarity matching as i'm just a humble ethnographer. if you're interested i'd be happy to link you up with someone in the team who knows more about the statistical aspects.

            yours : charles


            chief executive // trampoline systems ltd
            the trampery, 8-15 dereham place, london EC2A 3HJ
            uk cell +44 7792 456807
            usa cell +1 415 728 8656
            http://trampolinesystems.com


          • Lee Romero
            Message 5 of 5, Oct 15, 2008
              Thanks, Charles! That brings a smile to my face.

              I figured my own research musings (which is what this is so far) could
              not have been all that original (though I'd like to think I have some
              original ideas in the mix here). It sounds like your sonar technology
              may validate my own hypothesis, though.

              I haven't checked your site yet - is there information available about
              this tool (product?) that I could read through?

              If I have any other specific questions, I'll take them offlist as well.

              Regards
              Lee Romero


