
I have one question about a method.

  • hideyuki nakano
    Message 1 of 5 , Aug 20, 2004
      Hi, all

      I have one question.

      A search string is URL-encoded in access_log as follows.

      A) %83%8b%83p%83%93%8eO%90%a2

      Webalizer changes the above string as follows.

      B) %83%8b%83p%83%93%8eo%90%a2

      The only difference between A and B is "O" versus "o".

      Webalizer converts A-Z to a-z.

      Is this by design? Or is there a workaround?

      Help.




      ================ webalizer.c(V2.01) ====================
      1796 /*********************************************/
      1797 /* SRCH_STRING - get search strings from ref */
      1798 /*********************************************/
      1799
      1800 void srch_string(char *ptr)
      1801 {
      1820 while (*cp1!='&' && *cp1!=0)
      1821 {
      1822 if (*cp1=='"' || *cp1==',' || *cp1=='?')
      1823 { cp1++; continue; }
      1824 else
      1825 {
      1826 if (*cp1=='+') *cp1=' ';
      1827 if (sp_flg && *cp1==' ') { cp1++; continue; }
      1828 if (*cp1==' ') sp_flg=1; else sp_flg=0;
      1829 *cp2++=tolower(*cp1);
      1830 cp1++;
      1831 }
      1832 }
      ========================================================
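To reproduce the effect outside webalizer, here is a minimal sketch that applies the same per-character tolower() as line 1829 to string A. The helper name lower_copy is made up for this example and is not part of webalizer; the sketch also assumes the string reaches the loop still escaped, exactly as it appears in the log.

```c
#include <ctype.h>

/* Hypothetical helper (not webalizer code): copy src to dst,
 * lowercasing every character the way line 1829 does.
 * dst must have room for strlen(src)+1 bytes. */
void lower_copy(const char *src, char *dst)
{
    while (*src != '\0') {
        /* cast avoids undefined behaviour for negative char values */
        *dst++ = (char)tolower((unsigned char)*src);
        src++;
    }
    *dst = '\0';
}
```

Calling lower_copy() on string A gives string B: the hex digits are already lower case, so the only character that changes is the literal "O" after %8e.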
    • Enric Naval
      Message 2 of 5 , Aug 20, 2004
        Hello:

        The URL encoding (using "%") doesn't distinguish upper
        case from lower case in its hex digits. So, "%2B" is the
        same as "%2b". Lowercasing the escapes themselves doesn't
        change anything. Mr. Barrett probably does this to make
        translation faster, but the encoding remains the same.


        The problem comes with this encoding, am I right?
        "%8eo" --- lower-case "o"
        "%8eO" --- upper-case "O"

        I have made a small "patch". I have added some lines,
        so the line numbers are wrong. You can add the
        modifications to your source by hand.

        The first added line defines an int variable
        (chars_unicode) that counts how many characters we are
        past the last "%" character found. The second added
        line defines an int variable (is_unicode) that is used
        as a boolean. Declared inside the function, neither is
        zeroed automatically in C, so both should start at
        zero (false). The "else if" block is only entered when
        a "%" character is found in the string. It sets
        "is_unicode" to 1 (true) and resets the unicode
        counter to zero. Before line 1829 I increase the
        unicode counter, and in line 1829 itself I have added
        a condition to prevent lowercasing of the character
        three positions away from "%". This way the code
        performs the transformation below, where "%", "8" and
        "E" have had tolower() applied to them, but "O"
        hasn't, because chars_unicode is equal to 3:

        "%8EO" --- "%8eO"

        This allows the URL encoding to be lowercased, but
        prevents these unicode strings from being lowercased
        too. Please let me know if it works correctly for
        you!

        You can make the changes by hand. I have also made a
        small patch that makes things a little bit better,
        defining variables in webalizer.h, so the program
        doesn't have to define them every time it executes
        srch_string:

        http://griho.udl.es/webalizer/unicode.patch.txt


        1800 void srch_string(char *ptr)
        1801 {
        int chars_unicode=0; /* characters seen since the last '%' */
        int is_unicode=0; /* used as a boolean: a '%' has been seen */
        1820 while (*cp1!='&' && *cp1!=0)
        1821 {
        1822 if (*cp1=='"' || *cp1==',' || *cp1=='?')
        1823 { cp1++; continue; }

        else if (*cp1=='%')
        { /* note: cp1 is not advanced in this branch */
        is_unicode=1;
        chars_unicode=0;
        }
        1824 else
        1825 {
        1826 if (*cp1=='+') *cp1=' ';

        1827 if (sp_flg && *cp1==' ') { cp1++; continue; }
        1828 if (*cp1==' ') sp_flg=1; else sp_flg=0;

        chars_unicode++;
        1829 if ( chars_unicode!=3 && !is_unicode )
        *cp2++=tolower(*cp1);
        1830 cp1++;
        1831 }
        1832 }
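The counting idea described above can also be tried on its own, outside webalizer. What follows is only a sketch under my own assumptions: the function name lower_keep_third and the since_pct counter are invented for this example, and the loop ignores the space and punctuation handling that srch_string does. It lowercases everything except the character that sits three positions after a "%".

```c
#include <ctype.h>

/* Hypothetical standalone version of the counting idea.
 * dst must have room for strlen(src)+1 bytes. */
void lower_keep_third(const char *src, char *dst)
{
    int since_pct = -1;              /* -1 means no '%' seen yet */

    for (; *src != '\0'; src++) {
        if (*src == '%')
            since_pct = 0;           /* restart the count at each '%' */
        else if (since_pct >= 0)
            since_pct++;             /* 1 and 2 are the hex digits */

        if (since_pct == 3)
            *dst++ = *src;           /* character after %XX: keep case */
        else
            *dst++ = (char)tolower((unsigned char)*src);
    }
    *dst = '\0';
}
```

With this sketch "%8EO" becomes "%8eO" and a plain term such as "AIPO" still becomes "aipo". One limitation of counting positions only: the letter after any escape keeps its case, so "a%20Foo" stays "a%20Foo" even though %20 is just a space.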



        Here is the patch if you want to copy and paste:

        diff webalizer-2.01-10/webalizer.c webalizer-2.01-10_original/webalizer.c
        170,172d169
        < int chars_unicode = 0; /* counter for unicode strings */
        < int is_unicode = 0; /* Boolean for unicode strings */
        <
        1823d1819
        < is_unicode=0;
        1828,1832d1823
        < else if (*cp1=='%')
        < {
        < is_unicode=1;
        < chars_unicode=0;
        < }
        1838,1839c1829
        < chars_unicode++;
        < if ( chars_unicode!=3 && !is_unicode ) *cp2++=tolower(*cp1); /* normal character if not unicode */
        ---
        > *cp2++=tolower(*cp1); /* normal character */
        diff webalizer-2.01-10/webalizer.h webalizer-2.01-10_original/webalizer.h
        222,224d221
        < extern int chars_unicode; /* counter for unicode strings */
        < extern int is_unicode; /* Boolean for unicode strings */
        <








        =====
        Enric Naval
        Student of Management Computing at the UdL (Lleida)
        GRIHO webalizer.conf
        http://griho.udl.es/webalizer/webalizer.conf.txt



      • Bradford L. Barrett
        Message 3 of 5 , Aug 20, 2004
          Note: Search strings are lowercased to increase accuracy, since the
          search engines do not do a case-sensitive search. That is, searching
          for "Foo Bar" returns the same results as "foo bar" or "fOo bAr".
          The webalizer will translate all of these to 'foo bar' to get an accurate
          count of how many times that string was used.

          Also note, the upper-to-lower translation is NOT done on the escaped
          string: the referrer string is unescaped fairly early on in the
          processing, and search string analysis is performed on the unescaped
          string, not the escaped string as found in the raw log.
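To make the unescape-then-lowercase order concrete, here is a rough sketch and not webalizer's actual code. It assumes the search string is Shift_JIS, which is only a guess from the byte values in the original question, and every function name in it is invented for this example. After unescaping, "%8eO" is the two raw bytes 0x8E 0x4F, so a plain tolower() pass would alter the second byte; the sketch skips the byte that follows a Shift_JIS lead byte (0x81-0x9F or 0xE0-0xFC).

```c
#include <ctype.h>

/* Hypothetical helper: value of one hex digit (caller checks isxdigit). */
static int hexval(int c)
{
    return isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
}

/* Hypothetical helper: decode %XX escapes from src into dst.
 * dst must have room for strlen(src)+1 bytes. */
void unescape_copy(const char *src, char *dst)
{
    while (*src != '\0') {
        if (src[0] == '%' && isxdigit((unsigned char)src[1])
                          && isxdigit((unsigned char)src[2])) {
            *dst++ = (char)(hexval((unsigned char)src[1]) * 16
                          + hexval((unsigned char)src[2]));
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

/* Assumption: the data is Shift_JIS, whose lead bytes lie in these ranges. */
static int is_sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Lowercase single-byte characters in place; leave two-byte pairs alone. */
void lower_sjis_aware(char *s)
{
    unsigned char *p = (unsigned char *)s;

    while (*p != '\0') {
        if (is_sjis_lead(*p) && p[1] != '\0') {
            p += 2;                  /* skip lead byte and its trail byte */
            continue;
        }
        *p = (unsigned char)tolower(*p);
        p++;
    }
}
```

Under those assumptions "Foo%20Bar" still ends up as "foo bar", while the 0x4F byte after 0x8E in string A is left untouched.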

          --
          Bradford L. Barrett brad@...
          A free electron in a sea of neutrons DoD#1750 KD4NAW

          The only thing Micro$oft has done for society, is make people
          believe that computers are inherently unreliable.
        • enventa2000
          Message 4 of 5 , Aug 21, 2004
            > I have made a small "patch".

            Errrr, my bad. This "patch" was throwing webalizer into an infinite
            loop every time it found a '%' character. Not quite the intended
            result. I have now converted the "else if" block into an "if" block
            and moved it a bit below. Now it works correctly:


            diff webalizer-2.01-10/webalizer.c webalizer-2.01-10_bueno/webalizer.c
            170,172d169
            < int chars_unicode = 0; /* counter for unicode strings */
            < int is_unicode = 0; /* Boolean for unicode strings */
            <
            1823d1819
            < is_unicode=0;
            1833,1839c1829
            < if (*cp1=='%')
            < {
            < is_unicode=1;
            < chars_unicode=0;
            < }
            < chars_unicode++;
            < if ( chars_unicode!=3 && is_unicode!=0 ) *cp2++=tolower(*cp1); /* normal character if not unicode */
            ---
            > *cp2++=tolower(*cp1); /* normal character */
            diff webalizer-2.01-10/webalizer.h webalizer-2.01-10_bueno/webalizer.h
            222,224d221
            < extern int chars_unicode; /* counter for unicode strings */
            < extern int is_unicode; /* Boolean for unicode strings */
            <



          • enventa2000
            Message 5 of 5 , Aug 21, 2004
              Sorry again. This time the patch works correctly, and I have tested it
              in several different logs. This is the last message about this.

              I have been able to see that half the people search for "AIPO" while
              the other half search for "aipo".


              http://griho.udl.es/webalizer/unicode.patch.txt


              diff webalizer-2.01-10_bueno/webalizer.c webalizer-2.01-10_unicode/webalizer.c
              169a170,172
              > int chars_unicode = 0; /* counter for unicode strings */
              > int is_unicode = 0; /* Boolean for unicode strings */
              >
              1819a1823,1824
              > is_unicode=0;
              > chars_unicode=0;
              1829c1834,1841
              < *cp2++=tolower(*cp1); /* normal character */
              ---
              > if (*cp1=='%')
              > {
              > is_unicode=1;
              > chars_unicode=0;
              > }
              > if ( chars_unicode!=3 && is_unicode!=0 ) { *cp2++=tolower(*cp1); } /* normal character if not unicode */
              > else *cp2++=*cp1;
              > chars_unicode++;
              Only in webalizer-2.01-10_unicode/: webalizer.c~
              diff webalizer-2.01-10_bueno/webalizer.h webalizer-2.01-10_unicode/webalizer.h
              221a222,224
              > extern int chars_unicode; /* counter for unicode strings */
              > extern int is_unicode; /* Boolean for unicode strings */
              >
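For anyone who wants to check what the new lines decide for a given input without rebuilding webalizer, here is a small standalone model of just the lowercase decision in the diff above. It is a sketch under assumptions: the function name final_patch_model is invented, the two flags are locals instead of globals, and the space and punctuation handling of srch_string is left out.

```c
#include <ctype.h>

/* Hypothetical model of the 1834-1841 lines from the diff.
 * dst must have room for strlen(src)+1 bytes. */
void final_patch_model(const char *src, char *dst)
{
    int is_unicode = 0;      /* becomes 1 once a '%' has been seen */
    int chars_unicode = 0;   /* characters since the last '%' */

    for (; *src != '\0'; src++) {
        if (*src == '%') {
            is_unicode = 1;
            chars_unicode = 0;
        }
        if (chars_unicode != 3 && is_unicode != 0)
            *dst++ = (char)tolower((unsigned char)*src);
        else
            *dst++ = *src;   /* copied unchanged */
        chars_unicode++;
    }
    *dst = '\0';
}
```

Read this way, "%8EO" comes out as "%8eO". A term that contains no "%" at all is copied unchanged, because is_unicode never becomes 1, so "AIPO" and "aipo" would be counted as two different search strings.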