
Re: Finding out matching and not matching entries between two files !

  • Amit Saxena
    Message 1 of 2 , Jul 16, 2009
      Hi Jim,

      Thanks for the response.

      I was using the approach you described as the fastest one, with nested
      data structures. The problems I faced with that approach were:

      1. The code became too complex because of the many references needed
      to implement the nested data structure.
      2. Finding the matched lines was not much of an issue. However, I also
      wanted to find the unmatched records, and that is where I had to scan
      the files more than twice. Perhaps my algorithm was not efficient, but
      as performance is not a criterion for my script, I can afford that.
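
For point 2, the hash approach can report unmatched records on both sides with a single pass over each file. The following is only a minimal sketch, not the script discussed in this thread: the sub name `compare_files` and its filehandle arguments are made up for illustration, and it assumes that a "record" is a whole line compared as an exact string.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two line-oriented inputs, one pass each.
# $small_fh is the file held in memory, $large_fh is streamed line by line.
sub compare_files {
    my ($small_fh, $large_fh) = @_;

    # key = line from the smaller file, value = 1 once seen in the larger file
    my %seen;
    while (my $line = <$small_fh>) {
        chomp $line;
        $seen{$line} = 0;
    }

    my (@matched, @only_in_large);
    while (my $line = <$large_fh>) {
        chomp $line;
        if (exists $seen{$line}) {
            $seen{$line} = 1;
            push @matched, $line;
        }
        else {
            push @only_in_large, $line;
        }
    }

    # Keys that were never flagged exist only in the smaller file.
    my @only_in_small = sort grep { !$seen{$_} } keys %seen;

    return (\@matched, \@only_in_small, \@only_in_large);
}
```

A flat hash keyed on the line avoids nested references entirely. Note that a line repeated in the larger file is reported once per occurrence in this sketch.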

      Is there any CPAN module (preferably a built-in one) that takes a CSV or
      other character-delimited file as input and automatically generates a
      nested data structure containing the entire file contents? Also, is there
      any module for comparing two files of a similar format?
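
As an illustration of the kind of structure being asked for, here is a rough sketch that uses only core Perl to read a delimited file into an array of hashes keyed by the header row. The sub name `load_delimited` is hypothetical, and the sketch assumes that the first line is a header and that no field contains the delimiter, quotes, or embedded newlines. The Text::CSV module on CPAN (not part of the core distribution) handles those harder cases.

```perl
use strict;
use warnings;

# Hypothetical helper: read a delimited file into an array of hash refs.
# Assumes a header row and simple fields without quoting or escaping.
sub load_delimited {
    my ($fh, $delim) = @_;
    $delim = ',' unless defined $delim;

    my $header = <$fh>;
    return [] unless defined $header;
    chomp $header;
    my @cols = split /\Q$delim\E/, $header, -1;

    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';                        # skip blank lines
        my @fields = split /\Q$delim\E/, $line, -1; # -1 keeps trailing empty fields
        my %row;
        @row{@cols} = @fields;                      # hash slice: column name => value
        push @rows, \%row;
    }
    return \@rows;
}
```

Each record is then reachable as `$rows->[0]{name}` and so on, which keeps the reference handling down to one level.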

      Thanks & Regards,
      Amit Saxena

      On Thu, Jul 16, 2009 at 7:18 AM, Jim Gibson <jimsgibson@...> wrote:

      > At 2:12 AM -0700 7/16/09, Amit Saxena wrote:
      >
      >> Hi all,
      >>
      >> I need help regarding the approach to find out matched and unmatched
      >> entries between two files using perl.
      >>
      >> As the number of lines in the files would be around 10k-50k, I don't
      >> want to load entire file contents into memory.
      >>
      >
      >
      > The fastest approach is usually to load the shorter of the two files into
      > memory, then read the longer of the two files and process each line,
      > recording whether the line matches any record in the shorter file. A hash is
      > best for this method. 50k files should be no problem.
      >
      > If you really don't or can't read one of the files into memory, then a
      > method that still requires only one pass over each of the two files is to
      > sort the files and save the sorted copies. Then, read one line from each
      > file and compare. If they are equal, record this fact and read the next
      > line from each file. If they do not match, record the fact and read a
      > line from the file with the lesser of the two lines, alphabetically
      > speaking, then compare again.
      >
      > --
      > Jim Gibson
      > Jim@...
      >
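
The sorted-file method quoted above might be sketched roughly as follows. This is an assumption-laden illustration rather than Jim's code: the sub names `next_line` and `merge_compare` are invented here, both inputs are assumed to be already sorted in the same order that Perl's `cmp` operator uses, and each file is assumed to contain no duplicate lines.

```perl
use strict;
use warnings;

# Hypothetical helper: read one line and strip the newline; undef at end of file.
sub next_line {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Walk two sorted filehandles in step, holding only one line of each in memory.
sub merge_compare {
    my ($fh_a, $fh_b) = @_;
    my (@both, @only_a, @only_b);

    my $line_a = next_line($fh_a);
    my $line_b = next_line($fh_b);

    while (defined $line_a && defined $line_b) {
        my $order = $line_a cmp $line_b;
        if ($order == 0) {              # same line in both: advance both files
            push @both, $line_a;
            $line_a = next_line($fh_a);
            $line_b = next_line($fh_b);
        }
        elsif ($order < 0) {            # A's line sorts first: it is missing from B
            push @only_a, $line_a;
            $line_a = next_line($fh_a);
        }
        else {                          # B's line sorts first: it is missing from A
            push @only_b, $line_b;
            $line_b = next_line($fh_b);
        }
    }

    # Whatever remains in either file has no partner in the other.
    while (defined $line_a) { push @only_a, $line_a; $line_a = next_line($fh_a); }
    while (defined $line_b) { push @only_b, $line_b; $line_b = next_line($fh_b); }

    return (\@both, \@only_a, \@only_b);
}
```

If the files are sorted by an external tool, its collation has to agree with `cmp`, otherwise the walk can report false mismatches.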

