
Re: [PBML] Which Data Structure to Use?

  • Bompa
    Message 1 of 4 , Oct 1, 2001
      Seems to me that you want to know how many times "UID1 /usr/bin"
      occurs in the array, much like building a list of unique values.
      Seems you alluded to this idea in your statement...

      > userid & directory will make it unique. Concatenating the userid &
      > directory and populating a hash with this as the key, (the value
      > being the running count) seemed feasible. But, perhaps also somewhat
      > less than elegant.

      (Personal note: I wish some of you kids would stop worrying about
      elegance, being cool, and 'perlish', and just get the job done, heh)

      Anyway, if you want to concatenate as you say above, I have saved several
      methods of doing this that I saw on this list. Each of these snippets does
      the same thing: it tracks the occurrences of each value in a hash.

      You could go here and read the thread if you have time.
      http://groups.yahoo.com/group/perl-beginner/message/4887


      Bompa


      CREATE LIST OF UNIQUE VALUES

      %seen = ();
      @uniq = ();
      foreach $item (@list) {
          unless ($seen{$item}) {
              # if we get here, we have not seen it before
              $seen{$item} = 1;
              push(@uniq, $item);
          }
      }

      Faster
      %seen = ();
      @uniq = ();
      foreach $item (@list) {
          # post-increment: false only the first time an item is seen
          push(@uniq, $item) unless $seen{$item}++;
      }

      Similar but with user function
      %seen = ();
      foreach $item (@list) {
          some_func($item) unless $seen{$item}++;
      }

      Faster but different (the order of @list is not preserved)
      %seen = ();
      foreach $item (@list) {
          $seen{$item}++;
      }
      @uniq = keys %seen;

      Faster and even more different
      %seen = ();
      @uniq = grep { !$seen{$_}++ } @list;

      by Erik Tank, I think
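
Applied to the question in this thread, the same %seen idiom can double as the counter. What follows is only a minimal sketch, assuming each CWD line has already been split into a userid and a directory. The @hits sample data and the tab used as the key separator are illustrative assumptions, not taken from the real log.

```perl
use strict;
use warnings;

# Hypothetical sample data: [userid, directory] pairs pulled from the CWD lines.
my @hits = (
    [ 'UID1', '/usr/bin' ],
    [ 'UID1', '/usr/opt' ],
    [ 'UID1', '/usr/bin' ],
    [ 'UID3', '/user/bin/perl' ],
);

# Key the hash on userid and directory joined by a tab (assumed separator);
# the value is the running count for that pair.
my %count;
foreach my $hit (@hits) {
    my ($userid, $directory) = @$hit;
    $count{"$userid\t$directory"}++;
}

# Report: split the key back into its two fields when printing.
foreach my $key (sort keys %count) {
    my ($userid, $directory) = split /\t/, $key;
    print "$userid\t$directory\t$count{$key}\n";
}
```

A tab is used here only because it is unlikely to appear inside a userid or a path; any character that cannot occur in the fields would do.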





      msutfin@... wrote:
      >
      > Sysadmin has asked for a report based on our FTP logs. I've parsed
      > the lines in the FTP log based on the action the user took (STOR,
      > DELE, RETR, CWD..) For each action, I now have an array that contains
      > just those lines of the FTP log.
      >
      > There are several 'fields' in each line that I would like to report
      > on. Example: For the change directory (CWD) array, I'm interested in
      > the 3 pieces of info... userid, directory, and a running count for
      > each.
      >
      > I'm stumbling on how to reason (and code) the storage for this data.
      > The end report would look something like this..
      >
      > Userid   Directory        Number of Hits
      > UID1     /usr/bin         25
      > UID1     /usr/opt         3
      > UID3     /user/bin/perl   6
      >
      > So, as I said before, with a foreach loop and a regex, I've built a
      > change directory array (@CWD). Now with another foreach loop and a
      > split, I stored the $userid ($CWD[6]) and directory($CWD[8]) in
      > scalars and have a counter for the hits. Problem number one... How to
      > identify the combined userid/directory to determine which counter to
      > apply the hit to ..
      >
      > Seems like the high level identifier will be userid, but as the same
      > user may CWD to multiple directories with an FTP session, perhaps the
      > userid & directory will make it unique. Concatenating the userid &
      > directory and populating a hash with this as the key, (the value
      > being the running count) seemed feasible. But, perhaps also somewhat
      > less than elegant.
      >
      > A hash of array (userid => [directory, count]) doesn't get the job
      > done because the userid key is not unique. If I understand the
      > examples in Perl Cookbook, the hash of hashes doesn't work for the
      > same reason.
      >
      > If I wasn't sure I was confused and offtrack before I wrote this,
      > after proofreading it, I've convinced myself that I am. Any help will
      > be much appreciated.
      >
      > With a little help understanding what might be a reasonable storage
      > structure/pseudo code approach, I'll take a stab at the code and
      > revisit y'all when I approach the <wall>.
      >
      > TIA,
      > Mark
      >
      >
      >
      >
      > Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/

      --

      http://www.pureperl.org
      http://www.afreeirc.net
    • Charles K. Clarkson
      Message 2 of 4 , Oct 1, 2001
        ----- Original Message -----
        From: <msutfin@...>
        To: <perl-beginner@yahoogroups.com>
        Sent: Monday, October 01, 2001 1:53 PM
        Subject: [PBML] Which Data Structure to Use?


        : Sysadmin has asked for a report based on our FTP logs. I've parsed
        : the lines in the FTP log based on the action the user took (STOR,
        : DELE, RETR, CWD..) For each action, I now have an array that contains
        : just those lines of the FTP log.
        :
        : There are several 'fields' in each line that I would like to report
        : on. Example: For the change directory (CWD) array, I'm interested in
        : the 3 pieces of info... userid, directory, and a running count for
        : each.
        :
        : I'm stumbling on how to reason (and code) the storage for this data.
        : The end report would look something like this..
        :
        : Userid   Directory        Number of Hits
        : UID1     /usr/bin         25
        : UID1     /usr/opt         3
        : UID3     /user/bin/perl   6
        :
        : So, as I said before, with a foreach loop and a regex, I've built a
        : change directory array (@CWD). Now with another foreach loop and a
        : split, I stored the $userid ($CWD[6]) and directory($CWD[8]) in
        : scalars and have a counter for the hits. Problem number one... How to
        : identify the combined userid/directory to determine which counter to
        : apply the hit to ..
        :
        : Seems like the high level identifier will be userid, but as the same
        : user may CWD to multiple directories with an FTP session, perhaps the
        : userid & directory will make it unique. Concatenating the userid &
        : directory and populating a hash with this as the key, (the value
        : being the running count) seemed feasible. But, perhaps also somewhat
        : less than elegant.


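        I like the hash of arrays. The hash would be keyed
        to "$userid.$directory", as you mentioned, which should
        aid in counting and sorting. The values would be an
        array:
        [$userid, $directory, $count].

        Counting would be:
        # store the fields the first time a pair is seen
        $report{"$userid.$directory"} ||= [$userid, $directory, 0];
        $report{"$userid.$directory"}[2]++;

        Sorting might be:
        # $_ is an array reference here, so dereference it with ->
        print "$_->[0]\t$_->[1]\t$_->[2]\n"
            for @report{sort keys %report};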
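
Put together, the hash-of-arrays idea might look something like the sketch below. The @hits sample data is hypothetical, and initialising each entry with `||=` before incrementing is an assumption added here so that the userid and directory slots are filled in before printing.

```perl
use strict;
use warnings;

# Hypothetical sample data: [userid, directory] pairs from the CWD lines.
my @hits = (
    [ 'UID3', '/user/bin/perl' ],
    [ 'UID1', '/usr/bin' ],
    [ 'UID1', '/usr/opt' ],
    [ 'UID1', '/usr/bin' ],
);

my %report;
foreach my $hit (@hits) {
    my ($userid, $directory) = @$hit;
    my $key = "$userid.$directory";

    # First sighting: store the fields alongside a zero count.
    $report{$key} ||= [ $userid, $directory, 0 ];
    $report{$key}[2]++;
}

# A hash slice over the sorted keys returns the array refs in key order.
my @lines = map { join("\t", @$_) . "\n" } @report{ sort keys %report };
print @lines;
```

Keeping the fields in the array means the report never has to pick the key apart again; the key is used only for lookup and ordering.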

        : A hash of array (userid => [directory, count]) doesn't get the job
        : done because the userid key is not unique. If I understand the
        : examples in Perl Cookbook, the hash of hashes doesn't work for the
        : same reason.
        :
        : If I wasn't sure I was confused and offtrack before I wrote this,
        : after proofreading it, I've convinced myself that I am. Any help will
        : be much appreciated.

        Perhaps if we saw a snippet of raw data, we might be able to
        offer different solutions.

        : With a little help understanding what might be a reasonable storage
        : structure/pseudo code approach, I'll take a stab at the code and
        : revisit y'all when I approach the <wall>.
        :
      • Mark Sutfin
        Message 3 of 4 , Oct 2, 2001
          <Charles K. Clarkson wrote>
          : Perhaps if we saw a snippet of raw data, we might be able to offer
          different solutions.

          I should have included a sample yesterday... The attached file contains
          about a dozen lines from our FTP log. I included records for actions I'd
          like to report on. Reading across, the columns (split on \s) are: date,
          time, session id, ftp server, ?(don't know), incoming IP address, userid,
          action, filename, full path/filename. Sysadmin has changed their collective
          minds since yesterday, the report now needs to be sorted by: Incoming IP
          address, userid, action, file, and path.

          So, following your logic of the hash of arrays, the key would now be
          "$IP_addr.$userid.$action.$filename.$path" and the values in the array
          would be [$IP_addr, $userid, $action, $filename, $path, $count].

          Seems like the concatenation of all fields (except count) makes it unique.
          Why would I use an array (repeating the fields in the key), when I have a
          key (albeit a long one) and value (count) as in a "regular" hash? Is this so
          that when I get around to reporting on things, I don't have to split the key
          to print it? If so, sounds reasonable. I'll see... and thanks for any other
          approaches that come to mind once you've seen the sample log file. I'm
          really jazzed about Perl, but being visual, I need to try each approach to
          get a feel for what's going on....

          So, for the dozen rows in the attached file, The report may look something
          like this...

          Incoming IP   Userid   Action   File   Path   Count

          11.17.249.4
            sqlupdate
              STOR
                ctcord.txt
                  d:/ftproot/users/sqlupdate/ctc/   1
              RETR
                ses_upd.mdb
                  d:/ftproot/users/sqlupdate/tld/   1
              DELE
                ses_upd.old
                  d:/ftproot/users/sqlupdate/   1
          11.17.249.59
            sqlupdate
              RETR
                ctcord.txt
                  d:/ftproot/users/sqlupdate/ctc/   1
              STOR
                ctcmem.txt
                  d:/ftproot/users/sqlupdate/ctc/   1
                vax.txt
                  d:/ftproot/users/sqlupdate/ccr/   1
          11.3.177.109
            experian
              .....
          63.163.206.81
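
A grouped report of this shape also falls out fairly naturally from a nested hash (a hash of hashes), one level per field, with the count at the innermost level. This is only a sketch of one possible approach, not something proposed in the thread. The @records rows are hypothetical sample data modelled on the report above, with one row repeated to show the count going up, and report_lines is a helper name made up for the example.

```perl
use strict;
use warnings;

# Hypothetical records: [ip, userid, action, file, path].
my @records = (
    [ '11.17.249.4',  'sqlupdate', 'STOR', 'ctcord.txt',  'd:/ftproot/users/sqlupdate/ctc/' ],
    [ '11.17.249.4',  'sqlupdate', 'STOR', 'ctcord.txt',  'd:/ftproot/users/sqlupdate/ctc/' ],
    [ '11.17.249.4',  'sqlupdate', 'DELE', 'ses_upd.old', 'd:/ftproot/users/sqlupdate/' ],
    [ '11.17.249.59', 'sqlupdate', 'RETR', 'ctcord.txt',  'd:/ftproot/users/sqlupdate/ctc/' ],
);

# One hash level per field; autovivification creates the levels as needed.
my %tree;
foreach my $r (@records) {
    my ($ip, $userid, $action, $file, $path) = @$r;
    $tree{$ip}{$userid}{$action}{$file}{$path}++;
}

# Walk the tree recursively, indenting two spaces per level.
# Inner nodes are hash refs; leaves are the counts.
sub report_lines {
    my ($node, $depth) = @_;
    my @out;
    foreach my $key (sort keys %$node) {
        my $indent = '  ' x $depth;
        if (ref $node->{$key}) {
            push @out, "$indent$key\n", report_lines($node->{$key}, $depth + 1);
        }
        else {
            push @out, "$indent$key   $node->{$key}\n";
        }
    }
    return @out;
}

my @lines = report_lines(\%tree, 0);
print @lines;
```

Because each level is its own hash, no field has to be repeated in a composite key and nothing needs splitting at report time. The plain string sort of the IP level is a simplification.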

          TIA,
          Mark

        • Charles K. Clarkson
          Message 4 of 4 , Oct 3, 2001
            "Mark Sutfin" <msutfin@...>

            : <Charles K. Clarkson wrote>
            : : Perhaps if we saw a snippet of raw data, we might be able to offer
            : different solutions.
            :
            : I should have included a sample yesterday... The
            : attached a file contains about a dozen lines from
            : our FTP log. I included records for actions I'd
            : like to report on. Reading across, the columns
            : (split on \s) are: date, time, session id, ftp
            : server, ?(don't know), incoming IP address,
            : userid, action, filename, full path/filename.
            : Sysadmin has changed their collective minds since
            : yesterday, the report now needs to be sorted by:
            : Incoming IP address, userid, action, file, and
            : path.
            :
            : So, following your logic of the hash of arrays,
            : the key would now be
            : "$IP_addr.$userid.$action.$filename.$path" and
            : the values in the array would be [$IP_addr,
            : $userid, $action, $filename, $path, $count].
            :
            : Seems like the concatenation of all fields
            : (except count) makes it unique.
            : Why would I use an array (repeating the fields
            : in the key), when I have a key (albeit a long
            : one) and value (count) as in a "regular" hash?
            : Is this so that when I get around to reporting
            : on things, I don't have to split the key to
            : print it?

            If you're going to sort on IP addresses, you should
            probably pad them a little.

            11.17.249.4 becomes 011.017.249.004 in the key
            and is still 11.17.249.4 in the array. Also, if you
            use . as the key separator, it may be difficult
            to extract the IP from the key later.
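
The padding idea can be sketched as follows. The pad_ip helper name and the "|" separator are assumptions made for the example, and the addresses are the ones from the sample report.

```perl
use strict;
use warnings;

# Zero-pad each octet to three digits so that string order
# matches numeric order.
sub pad_ip {
    my ($ip) = @_;
    return join '.', map { sprintf '%03d', $_ } split /\./, $ip;
}

my @ips = ('11.17.249.59', '11.3.177.109', '11.17.249.4', '63.163.206.81');

# Compare the padded forms, but keep the original addresses for display.
my @sorted = sort { pad_ip($a) cmp pad_ip($b) } @ips;
print "$_\n" for @sorted;

# Building a key with "|" (assumed never to occur in the fields)
# keeps the dotted IP easy to pull back out.
my $key = join '|', pad_ip('11.17.249.4'), 'sqlupdate';
my ($padded_ip, $userid) = split /\|/, $key;
print "$key\n";
```

Without the padding, a plain string sort would put 11.17.249.4 ahead of 11.3.177.109, because "1" sorts before "3" at the start of the second octet.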

            : If so, sounds reasonable. I'll see... and thanks
            : for any other approaches that come to mind once
            : you've seen the sample log file. I'm really
            : jazzed about Perl, but being visual, I need to
            : try each approach to get a feel for what's going
            : on....

            The Sort::Field module comes to mind for the sort
            (after the counts are in). It would need a delimited
            string though. It does allow for ascending and
            descending alphabetical and numeric mixed sorts on
            multiple fields. You might take a look at it.
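
For comparison, a multi-field sort can also be sketched with nothing but the built-in sort. This does not use or imitate the module mentioned above, whose interface is not shown here. The @rows sample data and the ip_key helper name are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical rows: [ip, userid, action, file].
my @rows = (
    [ '11.17.249.59', 'sqlupdate', 'STOR', 'ctcmem.txt' ],
    [ '11.17.249.4',  'sqlupdate', 'STOR', 'ctcord.txt' ],
    [ '11.17.249.4',  'sqlupdate', 'DELE', 'ses_upd.old' ],
    [ '11.17.249.59', 'sqlupdate', 'RETR', 'ctcord.txt' ],
);

# Pack the four octets into four bytes so that a string comparison
# orders the addresses numerically.
sub ip_key { return pack 'C4', split /\./, $_[0] }

# Each "or" falls through to the next field only when the
# previous comparison returns 0 (a tie).
my @sorted = sort {
       ip_key($a->[0]) cmp ip_key($b->[0])
    or $a->[1] cmp $b->[1]
    or $a->[2] cmp $b->[2]
    or $a->[3] cmp $b->[3]
} @rows;

print join("\t", @$_), "\n" for @sorted;
```

Swapping `$a` and `$b` in any one comparison would make that field sort in descending order.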


            HTH,
            Charles K. Clarkson
            Clarkson Energy Homes, Inc.

            Half a cookie is better than