597 Some basic questions about linkages / algorithms

  • Lee Romero
    Oct 13, 2008
      Hi all - (Apologies for the length of this - as I got to describing
      what I'm trying to do, I ended up writing more than I thought I
      would....)

      I am currently working on an effort to implement a type of "expertise
      location" function based on generating a profile of someone from
      his/her team/project associations and his/her work products (that's a
      pretty simple description for what I'm doing but it gets the point
      across).

      Part of this results in a set of keywords associated with each person
      in the population - the "expertise location" can then be thought of as
      providing a standard keyword search engine / interface on
      top of the set of keywords associated with the people (there's also
      some navigation in the application based on the keywords but I'm going
      to ignore that for the moment).

      For any one person in the system, the defining characteristics are a
      set of activities / team memberships and, for each of those, there is
      a set of keywords. Collapsing together all of the keywords into a
      "pool" of keywords for a person, you can think of the profile of a
      person as a set of weighted keywords (where the weight for a keyword
      is the number of occurrences of that keyword within the set of
      activities / team memberships).
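
      As a rough sketch of that pooling step in Python (the activity names
      and keyword lists below are made up purely for illustration, and
      build_profile is a hypothetical helper, not the actual code):

```python
from collections import Counter

def build_profile(activities):
    """Pool per-activity keyword lists into one weighted keyword profile.

    `activities` maps an activity / team name to its list of keywords.
    The weight of a keyword is its number of occurrences across all of them.
    """
    profile = Counter()
    for keywords in activities.values():
        profile.update(keywords)
    return profile

# Hypothetical data for one person.
activities = {
    "search team": ["search", "engineering", "search"],
    "km project": ["knowledge", "management", "search"],
}

profile = build_profile(activities)
print(profile.most_common())
```

      The total profile weight is then just sum(profile.values()).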

      There are no specific restrictions about what is a valid keyword
      except that anything you might think of as a common "stop word" in a
      search engine is excluded (things like "the", "of", etc.). So my
      profile might (in part) look like: ("search", 100), ("knowledge", 40),
      ("management", 80), ("engineering", 20), etc.

      Also, because of how the keywords are generated and weighted, there is
      no upper limit on the weight of a keyword for any one person. To
      follow on from my example, I might have an additional 80 keywords with
      a total weight of, say, 5000, while someone else might have a total of
      40 keywords and a total weight of 4500.

      All of this is pretty straightforward and, even though the basic idea
      seems pretty simple, the keyword search function across these profiles
      does a surprisingly good job of finding someone with a particular
      skill or expertise. (This makes me happy, as I wasn't sure if it would
      really "work" as expected like this!)

      What I've been considering now is to take these people profiles and
      try to do two additional things which are related: provide a measure
      of "similarity" between two people and, from that measure of
      similarity, try to identify "invisible" communities of interest (by
      identifying pockets of people who have high similarity among
      themselves).

      The idea of a "similarity" metric is intriguing because by itself it
      means that the presentation of a particular person's profile can
      include a means to identify people similar to the one you're looking
      at. Though it's kind of crude to liken this to an ecommerce site, I
      do think of it as similar to the function you see on many sites where
      when you're looking at a product, you are presented with a list of
      similar products. ("People who have found this person interesting
      might find these other people interesting!" :-) )

      My question: Has anyone done something similar to this before? If
      so, what approach have you taken to defining the similarity
      measurement?

      Here's the quandary I've run into, which has prompted my question:

      * To measure similarity between two people (X and Y), I first match
      the keywords between X and Y.

      * For each keyword (KW) the two people have in common, I credit each
      person with the minimum of the weight of KW for X and the weight of KW
      for Y. So if two people have the keyword "engineering" and one has a
      weight of 20 and the other has a weight of 60, their similarity for
      this keyword is 20.

      * I then sum up these keyword weights to get a total "similarity
      weight" between the two people. Let's say it's 800 across all of the
      common keywords for two people.

      * Lastly, in order to reflect how much of that commonality describes
      each of the people, I calculate the percentage of someone's profile
      that is "covered" by the similarity weight to get the overall
      "similarity measure". So if X has a total profile weight of 5000 and
      Y has a total profile weight of 4000, that means that person X is
      16% (800/5000) similar to Y, while Y is 20% (800/4000) similar to X.
      This asymmetry makes some sense to me because we are comparing
      "different size" profiles (so person Y can seem more like person X
      than person X seems like person Y).
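
      The steps above can be sketched in Python roughly as follows. The
      function names and the two small profiles are hypothetical, chosen
      only to keep the arithmetic easy to follow:

```python
def similarity_weight(x, y):
    """Sum, over the keywords X and Y share, of the smaller of the two weights."""
    return sum(min(w, y[kw]) for kw, w in x.items() if kw in y)

def similarity_measure(x, y):
    """Fraction of X's total profile weight covered by the overlap with Y.

    Not symmetric: similarity_measure(x, y) divides by X's total weight,
    similarity_measure(y, x) divides by Y's.
    """
    total = sum(x.values())
    return similarity_weight(x, y) / total if total else 0.0

# Hypothetical profiles: keyword -> weight.
x = {"engineering": 20, "search": 100, "management": 80}   # total 200
y = {"engineering": 60, "search": 50, "knowledge": 40}     # total 150

weight = similarity_weight(x, y)        # min(20, 60) + min(100, 50) = 70
x_to_y = similarity_measure(x, y)       # 70 / 200
y_to_x = similarity_measure(y, x)       # 70 / 150
print(weight, round(x_to_y, 3), round(y_to_x, 3))
```

      The similarity weight is the same in both directions; only the
      similarity measure depends on whose profile is used as the base.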

      Now, getting back to my question - I can link people based on either
      of these computations - the "similarity weight" or the "similarity
      measure" - does either make more sense?

      If I use the "similarity weight" then I seem to have an issue where if
      two people both have "heavy" profile weights, they can seem highly
      similar based on their similarity weight even though their similarity
      weight might be a relatively small percentage of their total profile
      weight (say an overlap of 1000 when the weights of the two profiles
      are 8000 and 10000). Similarly, two people who have a small profile
      weight will seem dissimilar even if they had 100% overlap in their
      profiles!

      On the other hand, if I use the "similarity measure", it seems likely
      that anyone with a "heavy" profile weight will seem to have weak links
      because people are likely to have low percentage overlap with them,
      while people with small profiles can seem very similar (and so tightly
      linked) based on just a couple of keywords.
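
      To make the two failure modes concrete, here is a small Python
      sketch. The profiles are invented to match the numbers above (an
      overlap of 1000 between profiles of 8000 and 10000, versus two tiny
      identical profiles), and the function names are hypothetical:

```python
def similarity_weight(x, y):
    """Sum of min(weight in X, weight in Y) over shared keywords."""
    return sum(min(w, y[kw]) for kw, w in x.items() if kw in y)

def similarity_measure(x, y):
    """Share of X's total profile weight covered by the overlap with Y."""
    total = sum(x.values())
    return similarity_weight(x, y) / total if total else 0.0

# Two "heavy" profiles with a relatively small overlap.
heavy_a = {"shared": 1000, "only_a": 7000}    # total 8000
heavy_b = {"shared": 1000, "only_b": 9000}    # total 10000

# Two "light" profiles that overlap completely.
light_c = {"search": 5, "knowledge": 5}       # total 10
light_d = {"search": 5, "knowledge": 5}       # total 10

heavy_weight = similarity_weight(heavy_a, heavy_b)     # 1000
light_weight = similarity_weight(light_c, light_d)     # 10
heavy_measure = similarity_measure(heavy_a, heavy_b)   # 1000 / 8000
light_measure = similarity_measure(light_c, light_d)   # 10 / 10

# Linking on weight favours the heavy pair; linking on measure favours the light pair.
print(heavy_weight > light_weight, heavy_measure < light_measure)
```

      So the two computations rank these pairs in opposite order, which is
      exactly the tension described above.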


      Any thoughts from ONA practitioners on what might be the best way to
      link people in this situation?

      Sorry for the length - I'm planning to write about this experiment on
      my blog but thought I'd see whether you might have a suggestion for
      how to measure this and how to break the logjam in my head :-)

      Regards
      Lee Romero

      PS - Yes, I am aware of the perils of ONA via data mining - I don't
      read much into the analysis here beyond possibly finding these
      "communities of interest"; I'm not trying to judge how significant
      someone's position in the network might be or anything like that.