
Webcrawlers and Yahoo Groups search

  • smutterbuggler
    Message 1 of 9 , Nov 7, 2004
      Hi,

      just started looking at my Webalizer output, which resulted in my
      joining this group :-)

      I built a Fedora Core 2 system a few months back, and recently opened
      Port 80 on the NAT firewall to allow external access to the web site.

      Webalizer reports a very wide range of sites - .nl, .nz, .it, .rr.com
      and loads of .net sites all visiting my website.

      As the site isn't publicised anywhere (and only has a couple of dumb
      JavaScript frames) I presume that the access is by web crawlers
      which are just ploughing their way through IP addresses looking for
      anything interesting.

      The number of accesses seems to be climbing too, but there has been
      no massive attack (hmmm...cool name for a group) so I presume they
      aren't finding anything useful.


      The reason for the title of this post - I tried to use Yahoo Groups
      search for articles on web crawlers but the search returned all the
      postings in the group.

      I cross checked searching for 'bumfluff' (well, you never know) which
      as expected turned up no postings.

      So as I can't seem to search the archives, does anyone have any
      useful information on these sites? (or why I can't search
      for 'crawlers' in this group)

      Top user agent seems to be Kostiki Client 2.20.40120.0

      Top host for October with 26 hits in 5 visits was 203.118.42.188
      which doesn't have a DNS reverse lookup.

      All direct requests, no referrers.

      I do like the stats from Webalizer - I presume there is a patch or
      similar to Apache (sorry, httpd) which picks up
      the 'http://localhost/usage' URL as I just blasted my old website
      onto the new box then pointed web root somewhere in the middle.

      Nevertheless Apache still serves the usage page just fine.

      Any info. gratefully received.

      Dave R
    • jd_314159
      Message 2 of 9 , Nov 7, 2004
        Most of the hosts that visit your site are likely to be located in
        China, Singapore and parts of Europe (e.g. Amsterdam is notorious
        for this). For example, the IP you mentioned (203.118.42.188)
        belongs to an ISP in Singapore (www.starhub.com).

        What these folks are usually looking for is security holes in your
        setup. The logic is simple - if your HTTP server just serves default
        pages, then it's very likely that the machine hasn't been secured
        well and can be taken over. You will see the number of such hits
        decrease if you publish anything that doesn't look like an
        under-construction page.

        J.D.

      • Enric Naval
        Message 3 of 9 , Nov 8, 2004
          About the Yahoo Groups search problem: I couldn't
          reproduce it. I got the right results when I
          tried it out. I followed these steps (so you can
          compare them with yours): I went to groups.yahoo.com,
          then clicked on the "webalizer" group, then clicked in
          the search box just above the calendar, typed "web
          crawlers" (without the double quotes), and pressed
          enter. It sent me to this page. You can go there and
          press "next" for more results.

          http://groups.yahoo.com/group/webalizer/messagesearch?query=web%20crawlers



          About the strange visits: Hum, so, if I have
          understood correctly, your site is empty, yet you are
          receiving many visits from weird places?

          There are two possibilities:

          1- automated programs are scanning for vulnerabilities
          and not finding any. They search for things like
          "vti_bin", "vti_inf" or "command.exe". This wouldn't
          explain the climb in visits.

          2- one of those automated programs has already found a
          vulnerability, and more and more people are using your
          server as a proxy or something similar as your IP is
          being propagated in the underground proxy lists....
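To check the first possibility against the raw access log rather than the Webalizer summary, something like the following could be used. This is only a rough sketch: it assumes the log is in Apache's Common or Combined Log Format, the marker list is just the three strings mentioned above, and the sample lines and addresses are made up for illustration.

```python
import re
from collections import Counter

# Substrings named in the message above; extend the tuple for other probes.
PROBE_MARKERS = ("vti_bin", "vti_inf", "command.exe")

# Captures the client host and the request path of a Common/Combined log line.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+)'
)


def count_probes(lines):
    """Return a Counter mapping host -> requests whose path contains a marker."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        path = match.group("path").lower()
        if any(marker in path for marker in PROBE_MARKERS):
            hits[match.group("host")] += 1
    return hits


# Made-up sample lines (documentation address ranges), not real log data.
SAMPLE_LOG = [
    '192.0.2.10 - - [07/Nov/2004:10:00:00 +0000] "GET /_vti_bin/shtml.exe HTTP/1.0" 404 290',
    '192.0.2.10 - - [07/Nov/2004:10:00:01 +0000] "GET /_vti_inf.html HTTP/1.0" 404 288',
    '198.51.100.7 - - [07/Nov/2004:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
]

if __name__ == "__main__":
    for host, count in count_probes(SAMPLE_LOG).most_common():
        print(host, count)
```

To run it on a real log, pass an open file instead of SAMPLE_LOG, e.g. `count_probes(open(path))`, where `path` is wherever your httpd writes its access log. Hosts that only ever request probe paths are scanners; a long list of unrelated URLs would point more towards the second possibility.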


          Could you copy & paste the top URL list in a message?
          A quick look would show whether the visitors
          are malicious or not. The Top KB list could also be
          useful.

          Also: Kostiki is a Russian word. I don't know its
          meaning (I think it is a made-up word). When I
          searched for it in Google I found no results related
          to web crawlers; instead I found a few results related
          to the Counter-Strike videogame and a few other
          results in Russian, some of them lists of members.
          Someone going by the nickname "Kostiki" seems to have
          made his own client, or maybe he has changed the
          User-Agent line of an existing client.

          Also: The Top IP visiting you is based in Singapore.
          Is this normal?

          # whois 203.118.42.188
          [Querying whois.apnic.net]
          [whois.apnic.net]
          [...]
          netname: STARHUBINTERNET-SG
          descr: 19 Taiseng Drive
          descr: SINGAPORE 535222
          [...]




          =====
          Enric Naval
          Management Computing student at the UdL (Lleida)
          GRIHO webalizer.conf
          http://griho.udl.es/webalizer/webalizer.conf.txt



        • smutterbuggler
          Message 4 of 9 , Nov 8, 2004
            --- In webalizer@yahoogroups.com, Enric Naval <enventa2000@y...> wrote:
            > About the yahoo groups search problem. I couldn't
            > reproduce your problem. I got the right results when I
            > tried it out. I followed these steps (so you can
            > compare them with yours): I went to groups.yahoo.com,
            > then clicked in the "webalizer" group, then clicked in
            > the search box just above the calendar and typed "web
            > crawlers" (without the double quotes), and pressed
            > enter. It sent to this page. You can go there and
            > press "next" for more results.
            >
            >
            http://groups.yahoo.com/group/webalizer/messagesearch?query=web%20crawlers

            Hmm..... O.K. the problem seems to have been with the subsequent
            search of the page returned :-(

            There is a 'webcrawler' entry in webalizer.conf, so I get a hit when
            anyone includes a listing of the webalizer.conf with a query.

            For some reason I couldn't find the string with the 'search' function
            of my browser.

            My bad, no doubt.

            Visitor issues in another response :-)
          • smutterbuggler
            Message 5 of 9 , Nov 8, 2004
              --- In webalizer@yahoogroups.com, Enric Naval <enventa2000@y...> wrote:
              <snip>
              > Could you copy & paste the top URL list in a message?
              > A quick look would show whether the visitors
              > are malicious or not. The Top KB list could also be
              > useful.
              <snip>

              What are the risks (if any) of publishing the 'usage' pages on the web
              server for outside viewing?

              To turn this on for a day (say) would save a lot of cutting and
              pasting etc.
            • Enric Naval
              Message 6 of 9 , Nov 8, 2004
                --- smutterbuggler <wibble@...> wrote:

                > What are the risks (if any) of publishing the
                > 'usage' pages on the web
                > server for outside viewing?

                A malicious user can look up the list of Top Users, to
                learn about the usernames that are used to access the
                protected parts.

                The Top URL list could list private pages that you
                access for administration.

                You may get log-spammed, where a visitor to your
                page fakes its referrer so that it holds a URL of a
                commercial page. When Google parses the usage files,
                that commercial page's PageRank increases because
                Google believes that your site is linking to it.
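One way to see whether that is already happening is to tally the external referrers in the raw log. A minimal sketch follows, assuming Apache's Combined Log Format (the one that records the referrer); the hostnames and log lines are invented for illustration, and `own_host` stands for whatever name your site answers to.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Tail of a Combined Log Format line:
#   "request" status bytes "referrer" "user-agent"
COMBINED_TAIL = re.compile(
    r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)" "[^"]*"$'
)


def referrer_hosts(lines, own_host):
    """Count external referrer hostnames, skipping direct and own-site requests."""
    counts = Counter()
    for line in lines:
        match = COMBINED_TAIL.search(line.rstrip())
        if match is None:
            continue  # not a combined-format line
        referrer = match.group("referrer")
        if referrer in ("", "-"):
            continue  # direct request, no referrer sent
        host = urlsplit(referrer).hostname
        if host and host != own_host:
            counts[host] += 1
    return counts


# Made-up sample lines, not real log data.
SAMPLE_LOG = [
    '203.0.113.5 - - [08/Nov/2004:12:00:00 +0000] "GET / HTTP/1.1" 200 512 "http://spam.example/buy" "ExampleAgent/1.0"',
    '203.0.113.5 - - [08/Nov/2004:12:00:02 +0000] "GET / HTTP/1.1" 200 512 "http://spam.example/buy" "ExampleAgent/1.0"',
    '198.51.100.7 - - [08/Nov/2004:12:01:00 +0000] "GET /usage/ HTTP/1.1" 200 2048 "http://www.example.org/index.html" "ExampleAgent/1.0"',
    '192.0.2.10 - - [08/Nov/2004:12:02:00 +0000] "GET / HTTP/1.0" 200 512 "-" "ExampleAgent/1.0"',
]

if __name__ == "__main__":
    for host, count in referrer_hosts(SAMPLE_LOG, "www.example.org").most_common():
        print(host, count)
```

A referrer host that appears many times but has no reason to link to you is a likely log spammer.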

                A good security measure is disabling the "All URLs",
                "All Users", etc. lists.

                For temporary publishing, the safest option is to copy
                just one HTML page (a 1 month stats page), place it in
                an empty folder, then alias that folder as "usage".
                You can hand-delete the confidential information if
                necessary, as it is only one page.

                Alias /usage "/var/www/html/usage"
                Alias /usage/ "/var/www/html/usage/"


                > To turn this on for a day (say) would save a lot of
                > cutting and
                > pasting etc.
                >

                I thought it was actually easy: select the text in
                your browser, then paste it here. It's dirty, and
                the results look ugly, but it works.


              • waldo kitty
                Message 7 of 9 , Nov 8, 2004
                  smutterbuggler wrote:
                  >
                  > <snip>
                  >
                  > What are the risks (if any) of publishing the 'usage' pages on the web
                  > server for outside viewing?

                  what are the risks? logfile spamming for one thing... that's where folk spam stuff not to your site but to your log
                  files for the search engine spiders to find... if they can get enough hits to their site, they will climb in the search
                  engine rankings... the higher their rankings, the more money they can make...

                  > To turn this on for a day (say) would save a lot of cutting and
                  > pasting etc.

                  for a day or so? i couldn't say... i wouldn't unless i had a good idea when the spiders would be around... as an
                  example, google is a regular on my site but M$'s new search engine spider has been a real nuisance since going online as
                  it walks my site most every day...

                  --
                  _\/
                  (@@) Waldo Kitty, Waldo's Place USA
                  __ooO_( )_Ooo_____________________ telnet://bbs.wpusa.dynip.com
                  _|_____|_____|_____|_____|_____|_____ http://www.wpusa.dynip.com
                  ____|_____|_____|_____|_____|_____|_____ ftp://ftp.wpusa.dynip.com
                  _|_Eat_SPAM_to_email_me!_YUM!__|_____|_____ wkitty42 -at- alltel.net
                • Enric Naval
                  Message 8 of 9 , Nov 9, 2004
                    > > To turn this on for a day (say) would save a lot
                    > > of cutting and pasting etc.
                    >
                    > for a day or so? i couldn't say... i wouldn't unless
                    > i had a good idea when the spiders would be
                    > around... as an example, google is a regular on my
                    > site but M$'s new search engine spider has been a
                    > real nuisance since going online as it walks my
                    > site most every day...

                    You can create a robots.txt file in your root folder
                    to prevent robots from crawling certain pages. Most
                    robots obey this standard, including Google and MSN.
                    Some bots will instead use it as an index of the
                    pages you don't want them to see, and crawl them on
                    purpose, but there are very few of them. For just a
                    day, there is very little risk.

                    User-agent: *
                    Disallow: /usage
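If you want to check what a well-behaved robot would make of those two lines before relying on them, Python's standard library includes a robots.txt parser. This is only a small sketch: the `www.example.org` URLs and the "Googlebot" agent name are placeholders, and it shows how a compliant robot reads the rules, not what any given bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the message above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /usage
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a robot that obeys robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.org/index.html",
        "http://www.example.org/usage/index.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Note that Disallow works as a path prefix, so "/usage" also covers anything else whose path starts with those characters.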


                  • Enric Naval
                    Message 9 of 9 , Nov 9, 2004
                      > > To turn this on for a day (say) would save a lot
                      > > of cutting and pasting etc.
                      >
                      > for a day or so? i couldn't say... i wouldn't unless
                      > i had a good idea when the spiders would be
                      > around... as an example, google is a regular on my
                      > site but M$'s new search engine spider has been a
                      > real nuisance since going online as it walks my
                      > site most every day...

                      Silly of me... There is a very safe way to do it. You
                      can protect the directory with a password, then
                      publish the password in this list. This will stop
                      crawlers, bots, etc.

                      You can copy&paste the text below into httpd.conf,
                      inside the appropriate "Directory" container, or into a
                      .htaccess file in the directory you want to protect.
                      If you use a .htaccess file then you need to have an
                      AllowOverride line in the appropriate "Directory"
                      container in httpd.conf, or Apache will refuse to obey
                      the .htaccess instructions. If you didn't add any
                      directory, that would be between these two lines (they
                      are very near to each other):
                      <Directory />
                      </Directory>


                      this is the line to add to httpd.conf:

                      AllowOverride AuthConfig Limit


                      TEXT TO COPY&PASTE
                      #*******************************


                      AuthType Basic
                      AuthName "Usage page"
                      AuthUserFile /tmp/.htpasswd_usage
                      require user LOGIN

                      # To generate a new password file execute:
                      # htpasswd -c /tmp/.htpasswd_usage LOGIN
                      # then type the password you want to use.


                      #*******************************
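For anyone curious what "AuthType Basic" means on the wire, here is a small sketch of the header a browser sends once the user has typed the login and password. The password "secret" is a made-up placeholder and both function names are mine, not part of Apache; the encoding itself (base64 of "login:password") is how the Basic scheme is defined.

```python
import base64


def basic_auth_header(login, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    raw = f"{login}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value):
    """Recover (login, password) from a Basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    login, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return login, password


if __name__ == "__main__":
    header = basic_auth_header("LOGIN", "secret")  # "secret" is a placeholder
    print("Authorization:", header)
    print(decode_basic_auth(header))
```

As the second function shows, the credentials are only encoded, not encrypted. That is enough to keep crawlers and bots out of the usage pages for a day, which is the goal here, but pick a throwaway password.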
