
431 Re: [SearchCoP] Dealing with your company's name in content / search terms

  • Lee Romero
    May 14, 2012
      Thanks for the replies, Seth and Walter.

      Walter - in this situation, if I'm understanding it correctly, tf.idf will be thwarted to some extent because the word in question is going to be very common throughout the whole corpus of content being indexed.  So the weight of a company name is likely very low (I think) because it is in so many documents, even though in any one document it might not be very common (occurring only in cover pages/titles, in footers, etc.).  Can you elaborate in case I'm missing something?

      Seth - the idea of having a stop word apply to the full content and not to the metadata is interesting.  I'll have to see if we can do that with our engine; we're using Coveo - anyone happen to know whether stop words can be defined that only apply to content?

      Thanks again for your help.  Anyone else have any thoughts or ideas?

      Lee Romero

      On Mon, May 14, 2012 at 11:33 AM, Walter Underwood <wunder@...> wrote:

      First, tf.idf does this for you, automatically. IDF weights terms by how common or rare they are in your documents.

      So, you may need to do nothing.
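      To make the IDF point concrete, here is a minimal sketch, assuming the plain log(N / df) form of IDF (engines use various smoothed variants) and a made-up four-document corpus. A term that occurs in every document ends up with a weight of zero, so it contributes nothing to ranking.

```python
import math

# Hypothetical toy corpus: "deloitte" appears in every document.
docs = [
    "deloitte consulting services overview",
    "about deloitte history and leadership",
    "deloitte travel expense policy",
    "deloitte tax audit checklist",
]

def idf(term, documents):
    """Plain inverse document frequency: log(N / df).

    Returns 0.0 for terms that occur in no document (an assumption made
    here only to avoid dividing by zero).
    """
    n = len(documents)
    df = sum(1 for d in documents if term in d.split())
    return math.log(n / df) if df else 0.0

for term in ("deloitte", "consulting", "policy"):
    # "deloitte" is in all 4 documents, so its IDF is log(4 / 4) = 0.0;
    # the rarer terms get a positive weight.
    print(term, idf(term, docs))
```

      With this form of IDF, the company name effectively behaves like a stop word without anyone having to configure one.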

      Second, IDF for phrases is very good for this, in fact, good for relevance in general. If a phrase is rare and selective, it will have a high IDF, even if it is made of common words. I don't know of any engine that has phrase IDF by default. Ultraseek did have it, but I don't think that is sold any more.
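      A rough illustration of phrase IDF, assuming two-word phrases are counted as terms in their own right and using the same plain log(N / df) weighting. The corpus and helper names are made up for the example; this is not how Ultraseek or any particular engine implemented it.

```python
import math

# Hypothetical corpus: "about" and "deloitte" each occur in every document,
# but the exact phrase "about deloitte" occurs in only one.
docs = [
    "about deloitte and our global network",
    "read about the deloitte travel policy",
    "deloitte consulting brochure about strategy",
    "notes about the deloitte audit practice",
]

def bigrams(text):
    """Return the set of adjacent two-word phrases in a text."""
    words = text.split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def idf(df, n):
    """Plain IDF, log(N / df); 0.0 when the term never occurs."""
    return math.log(n / df) if df else 0.0

def word_idf(word):
    df = sum(1 for d in docs if word in d.split())
    return idf(df, len(docs))

def phrase_idf(phrase):
    df = sum(1 for d in docs if phrase in bigrams(d))
    return idf(df, len(docs))

print("about:", word_idf("about"))
print("deloitte:", word_idf("deloitte"))
print("about deloitte:", phrase_idf("about deloitte"))
```

      Both single words score zero because they are everywhere, while the phrase gets a positive weight because only one document contains it.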

      Solr dismax and edismax can be configured to have a higher weight for phrase matches. This may help.
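      As a sketch of what that configuration might look like, the request below uses the edismax parameters qf (fields to search), pf (fields in which a whole-phrase match earns a boost) and ps (phrase slop). The field names "title" and "body" and the boost values are assumptions for illustration; they would need to match your own schema and be tuned against real queries.

```python
from urllib.parse import urlencode

def build_edismax_params(user_query):
    """Build Solr request parameters that boost phrase matches.

    Field names and boost values are hypothetical placeholders.
    """
    return {
        "defType": "edismax",     # use the extended dismax query parser
        "q": user_query,
        "qf": "title^2 body",     # fields searched for individual terms
        "pf": "title^10 body^5",  # extra boost when the whole query matches as a phrase
        "ps": "0",                # phrase slop: 0 requires the words to be adjacent
    }

# Query string that would be appended to a Solr select URL.
query_string = urlencode(build_edismax_params("Deloitte Consulting"))
print(query_string)
```

      With a setup like this, a document containing "Deloitte Consulting" as an adjacent phrase should outrank one that merely contains both words somewhere.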

      If you can handle this algorithmically, do that. Best Bets are labor-intensive, because they need to be re-checked and updated by hand regularly. You do NOT want a stale best bet.

      Walter Underwood
      former Infoseek, Inktomi, Verity, Netflix
      now Chegg Search Guy

      On May 14, 2012, at 5:58 AM, Seth Earley wrote:


      Hi Lee,


      I was thinking Best Bets as I was reading.  That would be my approach.  Either that or a content model with those terms tagged in specific metadata fields, with searches scoped or limited to those fields as opposed to full text (“Deloitte” would be a stop word in full text search).
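      A toy sketch of the field-scoped idea, assuming a simple whitespace tokenizer and two fields, "title" (metadata) and "body" (full text). The stop word list is applied only when analyzing the body; all names here are hypothetical and not tied to any particular engine.

```python
# Stop words applied to full text only; metadata fields keep every term.
BODY_STOP_WORDS = frozenset({"deloitte"})

def analyze(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [t for t in text.lower().split() if t not in stop_words]

def matches(query, doc):
    """Report, per field, whether every query term is present.

    A query that is empty after stop word removal matches nothing.
    """
    title_terms = set(analyze(doc["title"]))
    body_terms = set(analyze(doc["body"], BODY_STOP_WORDS))
    q_title = analyze(query)
    q_body = analyze(query, BODY_STOP_WORDS)
    return {
        "title": bool(q_title) and all(t in title_terms for t in q_title),
        "body": bool(q_body) and all(t in body_terms for t in q_body),
    }

docs = [
    {"title": "About Deloitte", "body": "Deloitte is a global network of member firms"},
    {"title": "Expense policy", "body": "Deloitte employees submit expenses monthly"},
]

for doc in docs:
    print(doc["title"], matches("Deloitte", doc))
```

      Here a search for "Deloitte" only hits the document that carries the name in its metadata, rather than every document that mentions the company in passing.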






      Seth Earley


      Cell: 781-820-8080

      Email: seth@...  

      Web: www.earley.com


      Follow me on twitter: sethearley

      Connect with me on  LinkedIn: www.linkedin.com/in/sethearley   


      From: SearchCoP@yahoogroups.com [mailto:SearchCoP@yahoogroups.com] On Behalf Of Lee Romero
      Sent: Monday, May 14, 2012 7:08 AM
      To: searchcop
      Subject: [SearchCoP] Dealing with your company's name in content / search terms



      Hi all - on an intranet, it's very likely that one of the most common
      words in the content being indexed for your search is the name of your
      company. I work for Deloitte (technically, Deloitte Touche Tohmatsu
      Limited, but it's commonly referred to as "Deloitte" even though
      that's not completely accurate) and so the word "Deloitte" appears in
      pretty much just about every piece of content that's indexed in our

      The effect of this is that in our search, "Deloitte" kind of behaves
      like a stop word. Not technically, but it likely adds so little in
      terms of differentiation of what a user is looking for that it might
      as well be a stop word.

      The challenge I see is that many search terms used by users might also
      contain that word. Often with just one more word - "Deloitte
      Consulting", or "About Deloitte", etc.

      My question: Do you have any good strategies for how to improve the
      search relevance for searches that use very common terms in your
      content, such as your company's name?

      Are manually-managed best bets (or similar functionality) the only alternative?

      Lee Romero
