Re: [webalizer] I have a one question at a method.(changes to source)

  • Enric Naval
    Message 1 of 5 , Aug 20, 2004
      Hello:

      The URL encoding (using "%") doesn't distinguish upper
      case from lower case in its hex digits. So, "%2B" is the
      same as "%2b". Changing the escapes to lower case doesn't
      change anything. Mr. Barrett probably does this to make
      translation faster, but the encoding remains the same.


      The problem comes with this encoding, am I right?
      "%8eo" --- lower-case "o" after the escape
      "%8eO" --- upper-case "O" after the escape

      I have made a small "patch". I have added some lines,
      so the line numbers are wrong. You can add the
      modifications to your source by hand.

      The first added line defines an int variable
      (chars_unicode) that counts how many characters we are
      past the last "%" character found. The second added
      line defines an int variable (is_unicode) that is used
      as a boolean. Both need to start at zero (false); C
      does not zero local variables automatically, so they
      should be initialized explicitly. The "else if" block
      is only entered when a "%" character is found in the
      string. It sets is_unicode to 1 (true) and resets the
      counter to zero. Before line 1829 I increase the
      counter, and in line 1829 itself I have added a
      condition to prevent lowercasing of the character
      three positions after the "%". This way the code
      performs the following transformation, where "%", "8"
      and "E" have had tolower() applied to them but "O"
      hasn't, because chars_unicode is equal to 3:

      "%8EO" --- "%8eO"

      This allows URL encoding to be lowercased, but
      prevents these unicode strings from being translated
      too. Please let me know if it worked correctly for
      you!

      You can make the changes by hand. I have also made a
      small patch that makes things a little bit better,
      defining variables in webalizer.h, so the program
      doesn't have to define them every time it executes
      srch_string:

      http://griho.udl.es/webalizer/unicode.patch.txt


      1800 void srch_string(char *ptr)
      1801 {
              int chars_unicode = 0;
              int is_unicode = 0;
      1820    while (*cp1!='&' && *cp1!=0)
      1821    {
      1822       if (*cp1=='"' || *cp1==',' || *cp1=='?')
      1823       { cp1++; continue; }

                 else if (*cp1=='%')
                 {
                    is_unicode=1;
                    chars_unicode=0;
                 }
      1824       else
      1825       {
      1826          if (*cp1=='+') *cp1=' ';

      1827          if (sp_flg && *cp1==' ') { cp1++; continue; }
      1828          if (*cp1==' ') sp_flg=1; else sp_flg=0;

                    chars_unicode++;
      1829          if ( chars_unicode!=3 && !is_unicode )
                       *cp2++=tolower(*cp1);
      1830          cp1++;
      1831       }
      1832    }



      Here is the patch if you want to copy and paste it:

      diff webalizer-2.01-10/webalizer.c webalizer-2.01-10_original/webalizer.c
      170,172d169
      < int chars_unicode =0; /* counter for unicode strings */
      < int is_unicode =0; /* Boolean for unicode strings */
      <
      1823d1819
      < is_unicode=0;
      1828,1832d1823
      < else if (*cp1=='%')
      < {
      < is_unicode=1;
      < chars_unicode=0;
      < }
      1838,1839c1829
      < chars_unicode++;
      < if ( chars_unicode!=3 && !is_unicode ) *cp2++=tolower(*cp1); /* normal character if not unicode */
      ---
      > *cp2++=tolower(*cp1); /* normal character */
      diff webalizer-2.01-10/webalizer.h webalizer-2.01-10_original/webalizer.h
      222,224d221
      < extern int chars_unicode; /* counter for unicode strings */
      < extern int is_unicode; /* Boolean for unicode strings */
      <






      --- hideyuki nakano <hnakano@...> wrote:

      > Hi, all
      >
      > I have one question.
      >
      > The search string is URL-encoded in access_log as
      > follows.
      >
      > A) %83%8b%83p%83%93%8eO%90%a2
      >
      > The above string is changed by webalizer as
      > follows.
      >
      > B) %83%8b%83p%83%93%8eo%90%a2
      >
      > The difference between A and B is "O" versus "o".
      >
      > Webalizer changes A-Z to a-z.
      >
      > Is this by design, or is there a way around it?
      >
      > Help.
      >
      >
      >
      >
      > ================ webalizer.c(V2.01) ====================
      > 1796 /*********************************************/
      > 1797 /* SRCH_STRING - get search strings from ref */
      > 1798 /*********************************************/
      > 1799
      > 1800 void srch_string(char *ptr)
      > 1801 {
      > 1820 while (*cp1!='&' && *cp1!=0)
      > 1821 {
      > 1822 if (*cp1=='"' || *cp1==',' || *cp1=='?')
      > 1823 { cp1++; continue; }
      >
      > 1824 else
      > 1825 {
      > 1826 if (*cp1=='+') *cp1=' ';
      >
      > 1827 if (sp_flg && *cp1==' ') { cp1++; continue; }
      > 1828 if (*cp1==' ') sp_flg=1; else sp_flg=0;
      > 1829 *cp2++=tolower(*cp1);
      >
      > 1830 cp1++;
      > 1831 }
      > 1832 }
      >
      > ========================================================
      >
      >
      >
      >
      >


      =====
      Enric Naval
      Student of Management Informatics at the UdL (Lleida)
      GRIHO webalizer.conf
      http://griho.udl.es/webalizer/webalizer.conf.txt



    • Bradford L. Barrett
      Message 2 of 5 , Aug 20, 2004
        Note: Search strings are lowercased to increase accuracy, since the
        search engines do not do a case-sensitive search. That is, searching
        for "Foo Bar" returns the same results as "foo bar", or "fOo bAr".
        The webalizer will translate all of these to 'foo bar' to get an accurate
        count of how many times that string was used.

        Also note, the upper-to-lower translation is NOT done on the escaped
        string. The referrer string is unescaped fairly early on in the
        processing, and search string analysis is performed on the unescaped
        string, not the escaped string as found in the raw log.

        --
        > > Hi, all
        > >
        > > I have one question.
        > >
        > > The search string is URL-encoded in access_log as
        > > follows.
        > >
        > > A) %83%8b%83p%83%93%8eO%90%a2
        > >
        > > The above string is changed by webalizer as
        > > follows.
        > >
        > > B) %83%8b%83p%83%93%8eo%90%a2
        > >
        > > The difference between A and B is "O" versus "o".
        > >
        > > Webalizer changes A-Z to a-z.
        > >
        > > Is this by design, or is there a way around it?
        > >
        > > Help.
        > >
        --
        Bradford L. Barrett brad@...
        A free electron in a sea of neutrons DoD#1750 KD4NAW

        The only thing Micro$oft has done for society, is make people
        believe that computers are inherently unreliable.
      • enventa2000
        Message 3 of 5 , Aug 21, 2004
          > I have made a small "patch".

          Errrr, my bad. This "patch" was throwing webalizer into an infinite
          loop every time it found a '%' character. Not quite the intended
          result. I have now converted the "else if" block into an "if" block
          and moved it a bit below. Now it works correctly:


          diff webalizer-2.01-10/webalizer.c webalizer-2.01-10_bueno/webalizer.c
          170,172d169
          < int chars_unicode = 0; /* counter for unicode strings */
          < int is_unicode = 0; /* Boolean for unicode strings */
          <
          1823d1819
          < is_unicode=0;
          1833,1839c1829
          < if (*cp1=='%')
          < {
          < is_unicode=1;
          < chars_unicode=0;
          < }
          < chars_unicode++;
          < if ( chars_unicode!=3 && is_unicode!=0 ) *cp2++=tolower(*cp1); /* normal character if not unicode */
          ---
          > *cp2++=tolower(*cp1); /* normal character */
          diff webalizer-2.01-10/webalizer.h webalizer-2.01-10_bueno/webalizer.h
          222,224d221
          < extern int chars_unicode; /* counter for unicode strings */
          < extern int is_unicode; /* Boolean for unicode strings */
          <



        • enventa2000
          Message 4 of 5 , Aug 21, 2004
            Sorry again. This time the patch works correctly, and I have tested it
            on several different logs. This is the last message about this.

            I have been able to see that half the people search for "AIPO" while
            the other half search for "aipo".


            http://griho.udl.es/webalizer/unicode.patch.txt


            diff webalizer-2.01-10_bueno/webalizer.c webalizer-2.01-10_unicode/webalizer.c
            169a170,172
            > int chars_unicode = 0; /* counter for unicode strings */
            > int is_unicode = 0; /* Boolean for unicode strings */
            >
            1819a1823,1824
            > is_unicode=0;
            > chars_unicode=0;
            1829c1834,1841
            < *cp2++=tolower(*cp1); /* normal character */
            ---
            > if (*cp1=='%')
            > {
            > is_unicode=1;
            > chars_unicode=0;
            > }
            > if ( chars_unicode!=3 && is_unicode!=0 ) { *cp2++=tolower(*cp1); } /* normal character if not unicode */
            > else *cp2++=*cp1;
            > chars_unicode++;
            Only in webalizer-2.01-10_unicode/: webalizer.c~
            diff webalizer-2.01-10_bueno/webalizer.h webalizer-2.01-10_unicode/webalizer.h
            221a222,224
            > extern int chars_unicode; /* counter for unicode strings */
            > extern int is_unicode; /* Boolean for unicode strings */
            >