Re: [webalizer] I have a one question at a method.(changes to source)

  • Bradford L. Barrett
    Message 1 of 5 , Aug 20, 2004
      Note: Search strings are lowercased to increase accuracy, since
      search engines do not do case-sensitive searches. That is, searching
      for "Foo Bar" returns the same results as "foo bar" or "fOo bAr".
      The webalizer translates all of these to 'foo bar' to get an accurate
      count of how many times that string was used.

      Also note that the upper-to-lower translation is NOT done on the
      escaped string. The referrer string is unescaped fairly early in
      processing, and search string analysis is performed on the unescaped
      string, not the escaped string as found in the raw log.
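
      The order of operations described above (unescape first, then
      lowercase) can be sketched roughly as follows. This is not the
      webalizer source: hex_val, unescape_copy and lower_in_place are
      hypothetical helpers written only for illustration. The sketch
      also assumes the sample string from the question is Shift-JIS,
      where the second byte of a two-byte character can fall in the
      ASCII letter range.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: value of one hex digit, or -1 if not a hex digit. */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: copy src to dst, turning each %XX into one raw byte.
   dst must hold at least strlen(src)+1 bytes. Returns the output length. */
static size_t unescape_copy(const char *src, unsigned char *dst)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (*src) {
        int hi = -1, lo = -1;
        if (src[0] == '%') {
            hi = hex_val((unsigned char)src[1]);
            if (hi >= 0) lo = hex_val((unsigned char)src[2]);
        }
        if (hi >= 0 && lo >= 0) {
            dst[n++] = (unsigned char)(hi * 16 + lo);
            src += 3;
        } else {
            dst[n++] = (unsigned char)*src++;   /* literal character */
        }
    }
    dst[n] = 0;
    return n;
}

/* Hypothetical helper: byte-wise lowercase, like a plain tolower() loop. */
static void lower_in_place(unsigned char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) s[i] = (unsigned char)tolower(s[i]);
}
```

      Feeding string A from the question through unescape_copy gives the
      bytes 83 8b 83 70 83 93 8e 4f 90 a2. In the default "C" locale,
      lower_in_place then changes only the 4f ('O') to 6f ('o'). If the
      text really is Shift-JIS, 8e 4f is a single two-byte character, so
      the byte-wise lowercase turns it into a different character.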

      --
      > > Hi, all
      > >
      > > I have one question.
      > >
      > > The search string is URL-encoded in access_log as
      > > follows.
      > >
      > > A) %83%8b%83p%83%93%8eO%90%a2
      > >
      > > Webalizer changes the above string as
      > > follows.
      > >
      > > B) %83%8b%83p%83%93%8eo%90%a2
      > >
      > > The difference between A and B is "O" versus "o".
      > >
      > > Webalizer changes A-Z to a-z.
      > >
      > > Is this the intended method? Or is there a way out?
      > >
      > > Help.
      > >
      --
      Bradford L. Barrett brad@...
      A free electron in a sea of neutrons DoD#1750 KD4NAW

      The only thing Micro$oft has done for society, is make people
      believe that computers are inherently unreliable.
    • enventa2000
      Message 2 of 5 , Aug 21, 2004
        > I have made a small "patch".

        Errrr, my bad. This "patch" was throwing webalizer into an infinite
        loop every time it found a '%' character. Not quite the intended
        result. I have now converted the "else if" block into an "if" block
        and moved it a bit below. Now it works correctly:


        diff webalizer-2.01-10/webalizer.c webalizer-2.01-10_bueno/webalizer.c
        170,172d169
        < int chars_unicode = 0; /* counter for unicode strings */
        < int is_unicode = 0; /* Boolean for unicode strings */
        <
        1823d1819
        < is_unicode=0;
        1833,1839c1829
        < if (*cp1=='%')
        < {
        < is_unicode=1;
        < chars_unicode=0;
        < }
        < chars_unicode++;
        < if ( chars_unicode!=3 && is_unicode!=0 ) *cp2++=tolower(*cp1); /* normal character if not unicode */
        ---
        > *cp2++=tolower(*cp1); /* normal character */
        diff webalizer-2.01-10/webalizer.h webalizer-2.01-10_bueno/webalizer.h
        222,224d221
        < extern int chars_unicode; /* counter for unicode strings */
        < extern int is_unicode; /* Boolean for unicode strings */
        <
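
        For anyone wondering why the first attempt looped forever, here
        is a reduced sketch of the two loop shapes. It is not the
        webalizer code: scan_else_if and scan_plain_if are hypothetical
        names, the '+' and space handling is left out, and max_steps is
        a guard added only so the demonstration terminates. It
        illustrates termination only, not whether the copied output is
        right.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Loop shaped like the first attempt: the '%' branch sets the flags but
   never advances cp1, so the same '%' is seen again on every pass.
   Returns the number of passes, or -1 if the max_steps guard tripped. */
static int scan_else_if(const char *cp1, char *cp2, int max_steps)
{
    int is_unicode = 0, chars_unicode = 0, steps = 0;
    assert(cp1 != NULL && cp2 != NULL);
    while (*cp1 != '&' && *cp1 != 0) {
        if (++steps > max_steps) { *cp2 = 0; return -1; }
        if (*cp1 == '"' || *cp1 == ',' || *cp1 == '?') { cp1++; continue; }
        else if (*cp1 == '%') {
            is_unicode = 1;
            chars_unicode = 0;          /* cp1 is not advanced here */
        } else {
            chars_unicode++;
            if (chars_unicode != 3 && !is_unicode)
                *cp2++ = (char)tolower((unsigned char)*cp1);
            cp1++;
        }
    }
    *cp2 = 0;
    return steps;
}

/* Loop shaped like the second attempt: the '%' test is a plain "if"
   ahead of the copy, so cp1++ is reached on every pass. */
static int scan_plain_if(const char *cp1, char *cp2, int max_steps)
{
    int is_unicode = 0, chars_unicode = 0, steps = 0;
    assert(cp1 != NULL && cp2 != NULL);
    while (*cp1 != '&' && *cp1 != 0) {
        if (++steps > max_steps) { *cp2 = 0; return -1; }
        if (*cp1 == '"' || *cp1 == ',' || *cp1 == '?') { cp1++; continue; }
        if (*cp1 == '%') { is_unicode = 1; chars_unicode = 0; }
        chars_unicode++;
        if (chars_unicode != 3 && is_unicode != 0)
            *cp2++ = (char)tolower((unsigned char)*cp1);
        cp1++;
    }
    *cp2 = 0;
    return steps;
}
```

        With the input "%8eO", scan_else_if never gets past the '%' and
        trips the guard, while scan_plain_if finishes in four passes.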



        --- In webalizer@yahoogroups.com, Enric Naval <enventa2000@y...>
        wrote:
        > Hello:
        >
        > URL encoding (using "%") doesn't distinguish upper
        > case from lower case in the hex digits. So, "%2B" is the
        > same as "%2b". Changing the escapes to lower case doesn't
        > change anything. Mr. Barrett probably does this to make
        > translation faster, but the encoding remains the same.
        >
        >
        > The problem comes with this encoding, am I right?:
        > "%8eo" --- lower o "o"
        > "%8eO" --- upper o "O"
        >
        > I have made a small "patch". I have added some lines,
        > so the line numbers are wrong. You can add the
        > modifications to your source by hand.
        >
        > The first added line defines an int variable
        > (chars_unicode) that counts how many characters we are
        > past the last "%" character found. The second
        > added line defines an int variable (is_unicode) that
        > is used as a boolean. In the patch it is declared as a
        > global and initialized to zero (false). The "else if"
        > block is only entered when a "%" character is found in
        > the string. It sets is_unicode to 1 (true) and
        > resets the counter to zero. Before line 1829 I
        > increase the counter, and in line 1829 itself
        > I have added a condition to prevent lowercasing of the
        > character three positions away from "%". This way
        > the code performs the transformation below, where "%",
        > "8" and "E" have had tolower() applied to them,
        > but "O" hasn't, because chars_unicode is equal to 3:
        >
        > "%8EO" --- "%8eO"
        >
        > This allows the URL encoding to be lowercased, but
        > prevents these unicode strings from being translated
        > too. Please let me know if it worked correctly for
        > you!
        >
        > You can make the changes by hand. I have also made a
        > small patch that makes things a little bit better,
        > defining variables in webalizer.h, so the program
        > doesn't have to define them every time it executes
        > srch_string:
        >
        > http://griho.udl.es/webalizer/unicode.patch.txt
        >
        >
        > 1800 void srch_string(char *ptr)
        > 1801 {
        > int chars_unicode;
        > int is_unicode;
        > 1820 while (*cp1!='&' && *cp1!=0)
        > 1821 {
        > 1822 if (*cp1=='"' || *cp1==',' || *cp1=='?')
        > 1823 { cp1++; continue; }
        >
        > else if (*cp1=='%')
        > {
        > is_unicode=1;
        > chars_unicode=0;
        > }
        > 1824 else
        > 1825 {
        > 1826 if (*cp1=='+') *cp1=' ';
        >
        > 1827 if (sp_flg && *cp1==' ') { cp1++; continue; }
        > 1828 if (*cp1==' ') sp_flg=1; else sp_flg=0;
        >
        > chars_unicode++;
        > 1829 if ( chars_unicode!=3 && !is_unicode )
        > *cp2++=tolower(*cp1);
        > 1830 cp1++;
        > 1831 }
        > 1832 }
        >
        >
        >
        > Here you have the patch if you want to copy&paste:
        >
        > diff webalizer-2.01-10/webalizer.c webalizer-2.01-10_original/webalizer.c
        > 170,172d169
        > < int chars_unicode =0; /* counter for unicode strings */
        > < int is_unicode =0; /* Boolean for unicode strings */
        > <
        > 1823d1819
        > < is_unicode=0;
        > 1828,1832d1823
        > < else if (*cp1=='%')
        > < {
        > < is_unicode=1;
        > < chars_unicode=0;
        > < }
        > 1838,1839c1829
        > < chars_unicode++;
        > < if ( chars_unicode!=3 && !is_unicode ) *cp2++=tolower(*cp1); /* normal character if not unicode */
        > ---
        > > *cp2++=tolower(*cp1); /* normal character */
        > diff webalizer-2.01-10/webalizer.h webalizer-2.01-10_original/webalizer.h
        > 222,224d221
        > < extern int chars_unicode; /* counter for unicode strings */
        > < extern int is_unicode; /* Boolean for unicode strings */
        > <
        >
        >
        >
        >
        >
        >
        > --- hideyuki nakano <hnakano@f...> wrote:
        >
        > > Hi, all
        > >
        > > I have one question.
        > >
        > > The search string is URL-encoded in access_log as
        > > follows.
        > >
        > > A) %83%8b%83p%83%93%8eO%90%a2
        > >
        > > Webalizer changes the above string as
        > > follows.
        > >
        > > B) %83%8b%83p%83%93%8eo%90%a2
        > >
        > > The difference between A and B is "O" versus "o".
        > >
        > > Webalizer changes A-Z to a-z.
        > >
        > > Is this the intended method? Or is there a way out?
        > >
        > > Help.
        > >
        > >
        > >
        > >
        > > ================ webalizer.c(V2.01) ====================
        > > 1796 /*********************************************/
        > > 1797 /* SRCH_STRING - get search strings from ref */
        > > 1798 /*********************************************/
        > > 1799
        > > 1800 void srch_string(char *ptr)
        > > 1801 {
        > > 1820 while (*cp1!='&' && *cp1!=0)
        > > 1821 {
        > > 1822 if (*cp1=='"' || *cp1==',' || *cp1=='?')
        > > 1823 { cp1++; continue; }
        > >
        > > 1824 else
        > > 1825 {
        > > 1826 if (*cp1=='+') *cp1=' ';
        > >
        > > 1827 if (sp_flg && *cp1==' ') { cp1++; continue; }
        > > 1828 if (*cp1==' ') sp_flg=1; else sp_flg=0;
        > > ★★★1829 *cp2++=tolower(*cp1);
        > >
        > > 1830 cp1++;
        > > 1831 }
        > > 1832 }
        > >
        > > ========================================================
        > >
        > >
        > >
        > >
        > >
        >
        >
        > =====
        > Enric Naval
        > Estudiante de Informática de Gestión en la Udl (Lleida)
        > GRIHO webalizer.conf
        > http://griho.udl.es/webalizer/webalizer.conf.txt
      • enventa2000
        Message 3 of 5 , Aug 21, 2004
          Sorry again. This time the patch works correctly, and I have tested it
          in several different logs. This is the last message about this.

          I have been able to see that half the people search for "AIPO" while
          the other half search for "aipo".


          http://griho.udl.es/webalizer/unicode.patch.txt


          diff webalizer-2.01-10_bueno/webalizer.c webalizer-2.01-10_unicode/webalizer.c
          169a170,172
          > int chars_unicode = 0; /* counter for unicode strings */
          > int is_unicode = 0; /* Boolean for unicode strings */
          >
          1819a1823,1824
          > is_unicode=0;
          > chars_unicode=0;
          1829c1834,1841
          < *cp2++=tolower(*cp1); /* normal character */
          ---
          > if (*cp1=='%')
          > {
          > is_unicode=1;
          > chars_unicode=0;
          > }
          > if ( chars_unicode!=3 && is_unicode!=0 ) { *cp2++=tolower(*cp1); } /* normal character if not unicode */
          > else *cp2++=*cp1;
          > chars_unicode++;
          Only in webalizer-2.01-10_unicode/: webalizer.c~
          diff webalizer-2.01-10_bueno/webalizer.h webalizer-2.01-10_unicode/webalizer.h
          221a222,224
          > extern int chars_unicode; /* counter for unicode strings */
          > extern int is_unicode; /* Boolean for unicode strings */
          >
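
          To make the behaviour of this last version easier to check,
          here is the copy step from the diff above pulled out into a
          stand-alone function, as far as it can be read from the
          patch. The function name copy_patched is hypothetical, the
          two counters are locals instead of globals, and the '+' and
          space-collapsing lines of srch_string are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-alone version of the patched copy step.
   dst must hold at least strlen(src)+1 bytes. */
static void copy_patched(const char *src, char *dst)
{
    int is_unicode = 0;     /* set once a '%' has been seen */
    int chars_unicode = 0;  /* characters copied since the last '%' */

    assert(src != NULL && dst != NULL);
    for (; *src != '&' && *src != 0; src++) {
        if (*src == '"' || *src == ',' || *src == '?') continue;
        if (*src == '%') { is_unicode = 1; chars_unicode = 0; }
        if (chars_unicode != 3 && is_unicode != 0)
            *dst++ = (char)tolower((unsigned char)*src);
        else
            *dst++ = *src;  /* third position after '%', or no '%' seen yet */
        chars_unicode++;
    }
    *dst = 0;
}
```

          Read this way, "%8EO" becomes "%8eO" and string A from the
          original question comes out unchanged. A string with no '%'
          at all, such as "AIPO", is also copied unchanged because
          is_unicode never becomes non-zero, which would be consistent
          with "AIPO" and "aipo" being counted separately.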