Some basic questions about linkages / algorithms
- Oct 13, 2008
Hi all - (Apologies for the length of this - as I got to describing
what I'm trying to do, I ended up writing more than I thought I
would.)
I am currently working on an effort to implement a type of "expertise
location" function based on generating a profile of someone from
his/her team/project associations and his/her work products (that's a
pretty simple description of what I'm doing, but it gets the point
across).
Part of this results in a set of keywords associated with each person
in the population - the "expertise location" can then be thought of as
providing a standard keyword search engine / interface on
top of the set of keywords associated with the people (there's also
some navigation in the application based on the keywords but I'm going
to ignore that for the moment).
For any one person in the system, the defining characteristics are a
set of activities / team memberships and, for each of those, there is
a set of keywords. Collapsing together all of the keywords into a
"pool" of keywords for a person, you can think of the profile of a
person as a set of weighted keywords (where the weight for a keyword
is the number of occurrences of that keyword within the set of
activities / team memberships).
There are no specific restrictions about what is a valid keyword
except that anything you might think of as a common "stop word" in a
search engine is excluded (things like "the", "of", etc.). So my
profile might (in part) look like: ("search", 100), ("knowledge", 40),
("management", 80), ("engineering", 20), etc.
Also, because of how the keywords are generated and weighted, there is
no upper limit on the weight of a keyword for any one person. To
follow on from my example, I might have an additional 80 keywords with
a total weight of, say, 5000, while someone else might have a total of
40 keywords and a total weight of 4500.
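As a rough sketch of the profile described above, the per-activity keyword lists can be collapsed into one weighted pool with a counter. Everything here is assumed for illustration: the `build_profile` name, the tiny `STOP_WORDS` set, and the sample activities are made up and are not the actual implementation.

```python
from collections import Counter

# Hypothetical stop-word list; the real one is whatever the search engine excludes.
STOP_WORDS = {"the", "of", "a", "and"}


def build_profile(activities):
    """Collapse per-activity keyword lists into one pool of weighted keywords.

    The weight of a keyword is its number of occurrences across all
    activities / team memberships; stop words are dropped.
    """
    profile = Counter()
    for keywords in activities:
        profile.update(kw for kw in keywords if kw not in STOP_WORDS)
    return profile


# Made-up activities for one person, each with its own keyword list.
activities = [
    ["search", "the", "knowledge", "management"],
    ["search", "engineering"],
    ["management", "of", "search"],
]

profile = build_profile(activities)
total_weight = sum(profile.values())  # unbounded: grows with every activity added

print(profile.most_common())
print(total_weight)
```

Because the weights are raw counts, the total profile weight depends on how many activities a person has, which is why two people can end up with very different totals.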
All of this is pretty straightforward and, even though the basic idea
seems pretty simple, the keyword search function across these profiles
provides a surprisingly high correlation to finding someone with a
particular skill or expertise. (This makes me happy, as I wasn't sure
if it would really "work" as expected like this!)
What I've been considering now is to take these people profiles and
try to do two additional things which are related: provide a measure
of "similarity" between two people and, from that measure of
similarity, try to identify "invisible" communities of interest (by
identifying pockets of people who have high similarity among
themselves).
The idea of a "similarity" metric is intriguing because by itself it
means that the presentation of a particular person's profile can
include a means to identify people similar to the one you're looking
at. Though it's kind of crude to liken this to an ecommerce site, I
do think of it as similar to the function you see on many sites where
when you're looking at a product, you are presented with a list of
similar products. ("People who have found this person interesting
might find these other people interesting!" :-) )
My question: Has anyone done something similar to this before? If
so, what approach have you taken to defining the similarity?
Here's the quandary I've run into, which has prompted my question:
* To measure similarity between two people (X and Y), I first match
the keywords between X and Y.
* For each keyword (KW) the two people have in common, I credit each
person with the minimum of the weight of KW for X and the weight of KW
for Y. So if two people have the keyword "engineering" and one has a
weight of 20 and another has a weight of 60, their similarity for this
keyword is 20.
* I then sum up these keyword weights to get a total "similarity
weight" between the two people. Let's say it's 800 across all of the
common keywords for two people.
* Lastly, in order to reflect how much of that commonality describes
each of the people, I calculate the percentage of someone's profile
that is "covered" by the similarity weight to get the overall
"similarity measure". So if X has a total profile weight of 5000 and
Y has a total profile weight of 4000, that means that the person X is
16% (800/5000) similar to Y, while Y is 20% (800/4000) similar to X.
This asymmetry makes some sense to me because we are comparing
"different size" profiles (so person Y can seem more similar to person
X than person X is to person Y).
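The steps above can be sketched in a few lines. This is only an illustration of the computation as described: the function names and the two small profiles are invented, and profiles are assumed to be plain keyword-to-weight dictionaries.

```python
def similarity_weight(x, y):
    """Sum, over the keywords both profiles share, of the smaller weight."""
    return sum(min(weight, y[kw]) for kw, weight in x.items() if kw in y)


def similarity_measure(x, y):
    """Fraction of x's total profile weight covered by the overlap with y.

    Asymmetric by design: similarity_measure(x, y) divides by x's total,
    similarity_measure(y, x) divides by y's total.
    """
    total = sum(x.values())
    return similarity_weight(x, y) / total if total else 0.0


# Made-up profiles (keyword -> weight).
person_x = {"engineering": 20, "search": 100, "management": 80}  # total 200
person_y = {"engineering": 60, "search": 40, "knowledge": 50}    # total 150

shared = similarity_weight(person_x, person_y)      # min(20, 60) + min(100, 40)
x_to_y = similarity_measure(person_x, person_y)     # shared / 200
y_to_x = similarity_measure(person_y, person_x)     # shared / 150

print(shared, x_to_y, y_to_x)
```

The similarity weight is the same in both directions; only the percentage changes, because each person's own total is used as the denominator.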
Now, getting back to my question - I can link people based on either
of these computations - the "similarity weight" or the "similarity
measure" - does either make more sense?
If I use the "similarity weight" then I seem to have an issue where if
two people both have "heavy" profile weights, they can seem highly
similar based on their similarity weight even though their similarity
weight might be a relatively small percentage of their total profile
weight (say an overlap of 1000 when the weights of the two profiles
are 8000 and 10000). Similarly, two people who have a small profile
weight will seem dissimilar even if they had 100% overlap in their
keywords.
On the other hand, if I use the "similarity measure", it seems likely
that anyone with a "heavy" profile weight will seem to have weak links
because people are likely to have low percentage overlap with them,
while people with small profiles can seem very similar (and so tightly
linked) based on just a couple of keywords.
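To make the two failure modes concrete, here is a small sketch that compares a "heavy" pair (overlap of 1000 against totals of 8000 and 10000, as in the example above) with a "light" pair whose profiles are identical. The keyword names and the light pair's weights are invented for illustration; the two functions simply restate the computation described earlier.

```python
def similarity_weight(x, y):
    """Sum of the smaller weight for every keyword the two profiles share."""
    return sum(min(weight, y[kw]) for kw, weight in x.items() if kw in y)


def similarity_measure(x, y):
    """Share of x's total weight that the overlap with y accounts for."""
    total = sum(x.values())
    return similarity_weight(x, y) / total if total else 0.0


# Heavy profiles: large totals, modest overlap (hypothetical keywords).
heavy_a = {"search": 1000, "taxonomy": 7000}     # total 8000
heavy_b = {"search": 1000, "logistics": 9000}    # total 10000

# Light profiles: tiny totals, complete overlap.
light_a = {"search": 20, "engineering": 10}      # total 30
light_b = {"search": 20, "engineering": 10}      # total 30

heavy_weight = similarity_weight(heavy_a, heavy_b)
light_weight = similarity_weight(light_a, light_b)
heavy_measure = similarity_measure(heavy_a, heavy_b)
light_measure = similarity_measure(light_a, light_b)

# Ranking by weight favours the heavy pair; ranking by measure favours the light pair.
print("by weight :", heavy_weight, "vs", light_weight)
print("by measure:", heavy_measure, "vs", light_measure)
```

The two scores rank the same pairs in opposite order, which is the quandary in a nutshell: the weight rewards sheer profile size, while the measure rewards small profiles that happen to match.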
Any thoughts from ONA practitioners on what might be the best way to
link people in this situation?
Sorry for the length - I'm planning to write about this experiment on
my blog but thought I'd see whether you might have a suggestion for
how to measure this and how to break the logjam in my head :-)
PS - Yes, I am aware of the perils of ONA via data mining - I do not
ascribe much to the analysis here other than possibly finding these
"communities of interest" and not so much about how significant
someone's position in the network might be or anything like that.