Re: [PBML] Please help me optimize my Perl keyword parser for large files

  • fergus
    Message 1 of 11 , Mar 8, 2002
      just realised that the server strips the code - i guess this thread is
      pretty much dead but i've attached it again for the archives.

      On 08.03-01:51, fergus wrote:
      > had a look at this - just curious. have a look at these two scripts &
      > sample data. try some benchmarking. please let me know the results, i'm
      > still curious.
      >
      > input file was a text dump of the xlib & xintrinsics manuals (about 2Mb
      > in size - obviously not included in mail). NOTE: i've obviously modified
      > the code so i could do some testing. consequently these scripts
      > will NOT work in your original environment, however, i guess you can easily
      > figure out the changes needed.
      >
      > the first script 'straight.pl' simply parses for all the keywords &
      > keeps a count of the hits. this performs worse with test sample 1 &
      > better with test sample 2.
      > --
      > sample1 2.89 real 2.89 user 0.00 sys
      > sample2 2.81 real 2.78 user 0.02 sys
      > --
      >
      > the second script 'first_find.pl' will only look for an occurrence of the
      > word & then stop looking. obviously the results are opposite to
      > 'straight.pl'.
      > --
      > sample1 1.41 real 1.41 user 0.00 sys
      > sample2 2.88 real 2.87 user 0.01 sys
      > --
      >
      > for comparison, the original script has roughly these times (note this
      > is without the sort -u (i.e. elimination of duplicates) as it was not
      > included & i couldn't be bothered adding it).
      > --
      > sample1 10.67 real 10.55 user 0.12 sys
      > sample2 11.02 real 10.56 user 0.04 sys
      > --
      >
      > the performance is as expected, however, there are some anomalies i'm
      > not happy with. rather than enter a detailed discussion i will just
      > quote an adage;
      >
      > "your mileage may vary on this"
      >
      > good luck & i'd appreciate to know your mileage.
      >
      > On 05.03-18:51, erichawkins_2000 wrote:
      > > I have a working Perl script that looks through a file for keywords.
      > > If it finds a keyword in the file it lets me know. The file is BIG
      > > and the script is SLOW and I am interested in speeding things up. I
      > > run my script using Perl v5.6.1 for sun4-solaris.
      > >
      > > Typically, I parse through an ASCII text file around 100MB in size.
      > > The keywords I search for change, so I load them from a file. Usually
      > > I only parse for, at most, 20 different keywords. Once I find one
      > > keyword I don't care about duplicates. Knowing of the existence of a
      > > keyword in a file is satisfactory.
      > >
      > > I am including a copy of my script. Please be ruthless. I want to
      > > learn how to write quick, efficient, elegant Perl. I learned Perl by
      > > studying coworkers code. They learned Perl the same way. Bad habits
      > > have probably been propagated. I sacrifice my dignity here so I don't
      > > have to do it in front of my design manager.
      > >
      > > >>>>>>>>>>>>>>> BEGIN SCRIPT <<<<<<<<<<<<<<<<<<<
      > >
      > > #!/tooldist/vlsi_local/solrs/perl
      > > eval 'exec perl -S $0 "$@"'
      > > if 0;
      > > # I'm pretty sure that the above is no longer necessary?
      > >
      > > # This is my standard method for getting command line options
      > > use Getopt::Long;
      > >
      > > my @optl = ("help", "infile=s");
      > > GetOptions @optl;
      > >
      > > if (($opt_infile) && (!$opt_help)) {
      > > if (! (-e $opt_infile) ) {
      > > print "$0: ERROR: Can't see $opt_infile, where is it?\n";
      > > exit 2;
      > > } else {
      > > open (INFILE, "<$opt_infile") or die "$0: ERROR: Can't open $opt_infile\n";
      > > }
      > > } else { &print_usage; }
      > >
      > > # The unsupported constructs file is usually very small ~100 bytes
      > > open (SYNOPSYS_UNSUPPORTED, "<.synopsys_unsupported_constructs") ||
      > > die "$0: ERROR: Cannot open .synopsys_unsupported_constructs\n";
      > >
      > > # Should I slurp this, or does it matter?
      > > while(<SYNOPSYS_UNSUPPORTED>) { chomp; push @CONSTRUCTS, $_;}
      > > close SYNOPSYS_UNSUPPORTED;
      > >
      > > # Is the subroutine call okay, or do you think too much overhead?
      > > while (<INFILE>) {
      > > chomp;
      > > next if (/^$/);
      > > $_ = check_sdf($_);
      > > push @OUTFILE, "$_\n" if $_; # store results if nonzero
      > > }
      > >
      > > close INFILE;
      > >
      > > # I want to print all the results. Right now I pipe this through
      > > # a "sort -u" filter to only show me unique occurrences. What is
      > > # the better way from within this script?
      > > foreach $item (0 .. $#OUTFILE) {
      > > print "$0: RESULTS: Unsupported found: $OUTFILE[$item]\n";
      > > }
      > >
      > > # I'm not computer programmer, I'm electrical design engineer. This
      > > # is a pretty lame keyword search - what is better way? Also, how
      > > # do I delete a keyword from the CONSTRUCTS array once I find it?
      > > # I think this would speed things up, but I don't know how to
      > > # do it.
      > > sub check_sdf {
      > > my $input = $_[0];
      > > my $output;
      > > foreach $pinname ( 0 .. $#CONSTRUCTS ) {
      > > $regex = $CONSTRUCTS[$pinname];
      > > $output = ( $input =~ /[() ]$regex[() ]/) ? $regex : 0;
      > > last if $output;
      > > }
      > > $output;
      > > }
      > >
      > > # Just part of my standard way of doing things. I try to include this
      > > # in all my scripts.
      > > sub print_usage {
      > > print <<END_HELP;
      > >
      > > Usage: $0: -infile=<input_file> [-help]
      > >
      > > -infile : Input SDF file
      > > -help : Usage info
      > >
      > >
      > > NOTE: Need to create a .synopsys_unsupported_constructs file
      > >
      > >
      > > END_HELP
      > > exit 1;
      > > }
      > >
      > > >>>>>>>>>>>>>>> END SCRIPT <<<<<<<<<<<<<<<<<<<
      > >
      > > Thanks in advance for your help!
      > >
      > >
      > >
      > > Unsubscribing info is here: http://help.yahoo.com/help/us/groups/groups-32.html
      > >
      > > Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
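
      On the question in the quoted script about replacing the external "sort -u": a common Perl idiom is to filter the results through a hash of already-seen values. The lines below are only a minimal sketch; @outfile and its contents are made-up stand-ins for the @OUTFILE array of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical results, standing in for @OUTFILE in the original script
my @outfile = ( "IOPATH", "SETUP", "IOPATH", "HOLD", "SETUP" );

# keep only the first occurrence of each entry, preserving input order:
# $seen{$_}++ is false (0) the first time a value is met and true afterwards
my %seen;
my @unique = grep { !$seen{$_}++ } @outfile;

print "RESULTS: Unsupported found: $_\n" foreach @unique;
```

      Unlike "sort -u" this keeps the order in which the keywords were first found; wrap the list in sort() if sorted output is wanted.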

      ----------

      #!/usr/bin/env perl -w
      eval 'exec perl -S $0 "$@"' if 0;

      use strict ;
      use English ;

      #START<< functions >>
      use Getopt::Long;

      sub getopts {
      my %opt = @_ ; return undef unless ( %opt ) ; # defined(%hash) is a fatal error in perl 5.22 and later
      my @opt_spec ;
      foreach ( keys %opt ) {
      push ( @opt_spec, $_.$opt{$_} ) ;
      $opt{$_} = undef ;
      } ;
      GetOptions ( \%opt, @opt_spec ) || usage () ;
      return \%opt ;
      } ;

      sub usage {
      print <<USAGE ;
      Usage: $0 --infile=<input_file> --wordfile=<word_file> | --help

      --infile : Input file to parse
      --wordfile : Input word file to parse for
      --help : This message

      NOTE: Need to create a 'word' file
      USAGE
      exit 1 ;
      } ;
      #END<< functions >>

      #START<< main >>
      my %opt = (
      help => '',
      infile => '=s',
      wordfile => '=s'
      ) ;

      %opt = %{ getopts ( %opt ) } ;

      usage() if ( $opt{help} || !$opt{infile} || !$opt{wordfile} ) ;

      # build regex
      open ( WORDS, "<$opt{wordfile}")
      || die ( "opening word source: $!" ) ;
      my @words = <WORDS> ;
      close ( WORDS )
      || die ( "closing word source: $!" ) ;
      chomp ( @words ) ;
      my $re = join ( '|', @words ) ;

      # search
      open ( INFILE, "<$opt{infile}" )
      || die ( "opening source file: $!" ) ;
      my @found ;
      while ( <INFILE> ) {
      if ( /$re/ ) {
      chomp ( $MATCH ) ;
      push ( @found, $MATCH ) ;
      @words = () ; @_ = split ( '\|', $re ) ;
      foreach ( @_ ) {
      push ( @words, $_ ) unless ( $_ eq $MATCH ) ;
      } ;
      $re = join ( '|', @words ) ;
      last if ( length($re) == 0 ) ;
      } ;
      }
      close ( INFILE )
      || die ( "closing source file: $!" ) ;

      foreach ( @found ) {
      print ( "$opt{infile}: unsupported found [ $_ ]\n" ) ;
      } ;
      #END<< main >>
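
      The "stop looking for a word once it has been found" idea in the script above can also be sketched with a hash of wanted words and a precompiled regex. This is a variation under stated assumptions, not the script as benchmarked: build_re and first_finds are hypothetical helper names, the words are escaped with quotemeta (the original interpolates them raw), a capture group replaces $MATCH (which "use English" without -no_match_vars makes available at a cost to every regex on perls older than 5.20), and a line is re-checked after each hit so several different words on one line are all caught.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build one alternation from the words still wanted; quotemeta escapes
# regex metacharacters, and longer words go first so a word that is a
# prefix of another cannot shadow it. returns undef when no words remain
sub build_re {
    my @words = sort { length($b) <=> length($a) } @_;
    return undef unless @words;
    my $alt = join '|', map { quotemeta($_) } @words;
    return qr/($alt)/;
}

# return the words from @$words that occur in the lines read from $fh,
# in the order in which they were first seen
sub first_finds {
    my ( $fh, $words ) = @_;
    my %wanted = map { $_ => 1 } @$words;
    my $re     = build_re( keys %wanted );
    my @found;
    while ( defined $re and defined( my $line = <$fh> ) ) {
        # re-check the same line: it may hold several different words
        while ( defined $re and $line =~ $re ) {
            my $word = $1;
            push @found, $word;
            delete $wanted{$word};              # never look for it again
            $re = build_re( keys %wanted );     # undef once all are found
        }
    }
    return @found;
}

# small in-memory sample instead of a real input file
my $text = "call XRaiseWindow now\n"
         . "the window and the colormap\n"
         . "XRaiseWindow again\n";
open( my $fh, '<', \$text ) or die "opening in-memory file: $!";
my @found = first_finds( $fh, [ 'XRaiseWindow', 'colormap', 'window', 'a.b' ] );
close($fh);

print "unsupported found [ $_ ]\n" foreach @found;
```

      Reading stops as soon as every word has been found, which is where the saving on a large file would come from when all the words turn up early.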

      ----------

      #!/usr/bin/env perl -w
      eval 'exec perl -S $0 "$@"' if 0;

      use strict ;
      use English ;

      #START<< functions >>
      use Getopt::Long;

      sub getopts {
      my %opt = @_ ; return undef unless ( %opt ) ; # defined(%hash) is a fatal error in perl 5.22 and later
      my @opt_spec ;
      foreach ( keys %opt ) {
      push ( @opt_spec, $_.$opt{$_} ) ;
      $opt{$_} = undef ;
      } ;
      GetOptions ( \%opt, @opt_spec ) || usage () ;
      return \%opt ;
      } ;

      sub usage {
      print <<USAGE ;
      Usage: $0 --infile=<input_file> --wordfile=<word_file> | --help

      --infile : Input file to parse
      --wordfile : Input word file to parse for
      --help : This message

      NOTE: Need to create a 'word' file
      USAGE
      exit 1 ;
      } ;
      #END<< functions >>

      #START<< main >>
      my %opt = (
      help => '',
      infile => '=s',
      wordfile => '=s'
      ) ;

      %opt = %{ getopts ( %opt ) } ;

      usage() if ( $opt{help} || !$opt{infile} || !$opt{wordfile} ) ;

      # build regex
      open ( WORDS, "<$opt{wordfile}")
      || die ( "opening word source: $!" ) ;
      my @words = <WORDS> ;
      close ( WORDS )
      || die ( "closing word source: $!" ) ;
      chomp ( @words ) ;
      my $re = join ( '|', @words ) ;

      # search
      open ( INFILE, "<$opt{infile}" )
      || die ( "opening source file: $!" ) ;
      my %found ;
      while ( <INFILE> ) {
      if ( /$re/o ) {
      chomp ( $MATCH ) ;
      $found{$MATCH}++ ;
      } ;
      }
      close ( INFILE )
      || die ( "closing source file: $!" ) ;

      foreach ( keys %found ) {
      print ( "$opt{infile}: $found{$_} occurrences of unsupported found [ $_ ]\n" ) ;
      } ;
      #END<< main >>
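
      Note that the counting loop above registers at most one hit per line, since the match is tried once per line. If every occurrence should be counted, a /g match in a loop is one way to do it. The following is a minimal sketch under that assumption; count_hits is a hypothetical helper name and the sample text is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count every occurrence of every word in the lines read from $fh
sub count_hits {
    my ( $fh, @words ) = @_;
    my %count;
    return %count unless @words;

    # one precompiled alternation, longest words first, metacharacters escaped
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } @words;
    my $re  = qr/($alt)/;

    while ( defined( my $line = <$fh> ) ) {
        # in scalar context /g resumes after the previous match,
        # so the loop body runs once per occurrence on the line
        $count{$1}++ while $line =~ /$re/g;
    }
    return %count;
}

# small in-memory sample instead of a real input file
my $text = "window window colormap\n"
         . "no hits here\n"
         . "XRaiseWindow window\n";
open( my $fh, '<', \$text ) or die "opening in-memory file: $!";
my %count = count_hits( $fh, qw(window colormap XRaiseWindow) );
close($fh);

foreach ( sort keys %count ) {
    print "$count{$_} occurrences of unsupported found [ $_ ]\n";
}
```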

      ----------

      XDestroySubwindows
      Specifies
      XRaiseWindow
      window
      colormap
      properties
      Appendix E
      grabDestroyCallback
      XASKwerqji24l@Jk3

      ----------

      kdlf jj31l
      LKJSDKlkwwekjq32487jg
      dksa234safAD324
      sfkljjlkwekjpqeonxz
      mfsdakjqwSDAT#45
      gfaskj4354350gklKSDKNVBX
      salke23C
      ZZXZwdf
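
      For anyone who wants to repeat the timings without the external time command, the core Benchmark module can compare approaches inside one script. The sketch below is an assumption-laden toy, not the test that produced the figures quoted earlier: the line data is invented, per_word and alternation are hypothetical names, and the results will depend on the perl version, the word list and the input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# made-up test data: 2000 filler lines with one keyword hiding at the end
my @words = qw(XDestroySubwindows XRaiseWindow colormap grabDestroyCallback);
my @lines = ( ("some filler text ( with brackets ) and words\n") x 2000,
              "call to XRaiseWindow here\n" );

# approach 1: one regex per keyword per line, as in the original script
sub per_word {
    my %found;
    foreach my $line (@lines) {
        foreach my $word (@words) {
            $found{$word} = 1 if $line =~ /\Q$word\E/;
        }
    }
    my @keys = sort keys %found;
    return @keys;
}

# approach 2: all keywords in a single precompiled alternation
my $alt = join '|', map { quotemeta($_) } @words;
my $re  = qr/($alt)/;
sub alternation {
    my %found;
    foreach my $line (@lines) {
        $found{$1} = 1 if $line =~ $re;
    }
    my @keys = sort keys %found;
    return @keys;
}

# both approaches must agree before the timings mean anything
my @a = per_word();
my @b = alternation();
die "results differ: [@a] vs [@b]" unless "@a" eq "@b";

# run each approach for at least 1 CPU second and print a comparison table
cmpthese( -1, { per_word => \&per_word, alternation => \&alternation } );
```

      The negative count asks Benchmark to run each sub for at least that many CPU seconds, which avoids picking an iteration count by hand.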

