
Finding out matching and not matching entries between two files !

  • Amit Saxena
    Message 1 of 2 , Jul 16, 2009
      Hi all,

      I need help regarding the approach to find out matched and unmatched entries
      between two files using perl.

      As the number of lines in the files would be around 10k-50k, I don't want to
      load entire file contents into memory.

      The first file (file1, also known as the superset file) contains all the data
      in 4 columns in a format like country, state, city and id. The second file
      (file2, also known as the subset file) contains some of the data from the
      superset file, with the additional condition that it does not contain all 4
      columns. Instead it contains only the first 3 columns (country, state and city).

      The following information is needed from these input files:
      1. Matched file: lists the rows of the superset file which
      match the rows of the subset file.
      2. Unmatched file: given all the ids for a country-state pair from the
      subset file, list all the rows from the superset file which contain
      the same country-state pair but none of those ids.

      The sample files are shown below.

      File 1 (Superset)

      Country1,State1,City111,id1
      Country1,State1,City112,id2
      Country1,State1,City113,id3
      Country1,State1,City114,id4
      Country1,State1,City115,id5
      Country1,State2,City121,id6
      Country1,State2,City122,id7
      Country1,State2,City123,id8
      Country1,State3,City131,id9
      Country1,State3,City132,id10


      File 2 (subset)

      Country1,State1,City111
      Country1,State1,City112
      Country1,State2,City121
      Country1,State3,City131


      Matched file
      ------------

      Country1,State1,City111,id1
      Country1,State1,City112,id2
      Country1,State2,City121,id6
      Country1,State3,City131,id9


      Unmatched file
      --------------


      Country1,State1,City113,id3
      Country1,State1,City114,id4
      Country1,State1,City115,id5
      Country1,State2,City122,id7
      Country1,State2,City123,id8
      Country1,State3,City132,id10


      As of now, I am reading the subset file line by line and, whenever the
      country-state pair changes, I find all records in the superset file
      which satisfy the matching and unmatching conditions.

      Please suggest a better approach for the same.

      Thanks & Regards,
      Amit Saxena
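
The requirement described above can be sketched with two small hashes built from the subset file, after which the superset is streamed one line at a time. This is only an illustration under assumptions not stated in the post: fields contain no embedded commas, and the (country, state, city) key is unique in the superset. The name classify and the in-memory sample data are hypothetical; with real files the second loop would read from a filehandle instead of an array.

```perl
use strict;
use warnings;

# Classify superset lines against a subset that is held in memory.
# Returns two array refs: matched lines and unmatched lines.
sub classify {
    my ($superset, $subset) = @_;
    my (%wanted, %pair_seen);

    # Pass 1: remember every subset key and every country-state pair.
    for my $line (@$subset) {
        chomp(my $copy = $line);
        next unless $copy =~ /\S/;               # skip blank lines
        my ($country, $state, $city) = split /,/, $copy;
        $wanted{"$country,$state,$city"} = 1;
        $pair_seen{"$country,$state"}    = 1;
    }

    # Pass 2: stream the superset and sort each line into a bucket.
    my (@matched, @unmatched);
    for my $line (@$superset) {
        chomp(my $copy = $line);
        next unless $copy =~ /\S/;
        my ($country, $state, $city, $id) = split /,/, $copy;
        if ($wanted{"$country,$state,$city"}) {
            push @matched, $copy;                # row is present in the subset
        }
        elsif ($pair_seen{"$country,$state"}) {
            push @unmatched, $copy;              # same pair, row not in subset
        }
        # Rows whose pair never occurs in the subset go to neither list.
    }
    return (\@matched, \@unmatched);
}

# Demo with the sample data from the post.
my @superset = split /^/, <<'END';
Country1,State1,City111,id1
Country1,State1,City112,id2
Country1,State1,City113,id3
Country1,State1,City114,id4
Country1,State1,City115,id5
Country1,State2,City121,id6
Country1,State2,City122,id7
Country1,State2,City123,id8
Country1,State3,City131,id9
Country1,State3,City132,id10
END
my @subset = split /^/, <<'END';
Country1,State1,City111
Country1,State1,City112
Country1,State2,City121
Country1,State3,City131
END

my ($matched, $unmatched) = classify(\@superset, \@subset);
print "Matched file\n",   map { "$_\n" } @$matched;
print "Unmatched file\n", map { "$_\n" } @$unmatched;
```

Only the subset keys are kept in memory, so the larger superset file is never loaded as a whole, and each file is read exactly once.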


    • Amit Saxena
      Message 2 of 2 , Jul 16, 2009
        Hi Jim,

        Thanks for the response.

        I was already using the approach which you have mentioned as the fastest
        one, with nested data structures. The only problems I faced with that
        approach are:

        1. The code was getting too complex because of the many references
        needed to implement the nested data structure.
        2. Finding the matched lines was not much of an issue; however, I wanted
        to find the unmatched records, and that's where I had to scan the files
        more than twice. Perhaps my algorithm was not efficient, but as performance
        is not a criterion for my script, I can afford that.

        Is there any CPAN module (preferably built-in) which takes a CSV or
        character-delimited file as input and generates a nested data structure
        containing the entire file contents automatically? Also, is there any
        module for comparing two files of similar format?
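
As a rough illustration of the kind of nested structure asked about here, the sketch below builds a country, state, city hash using only core Perl, relying on autovivification so that no explicit reference handling is needed. It assumes simple comma-separated lines without quoted or embedded commas; for real CSV with quoting, the Text::CSV module on CPAN (not part of core Perl) is the usual parser. The name load_nested and the in-memory sample data are hypothetical.

```perl
use strict;
use warnings;

# Build $tree->{country}{state}{city} = id from comma-separated lines
# read from an open filehandle.
sub load_nested {
    my ($fh) = @_;
    my %tree;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;               # skip blank lines
        my ($country, $state, $city, $id) = split /,/, $line;
        # Autovivification creates the inner hashes on first use.
        $tree{$country}{$state}{$city} = $id;
    }
    return \%tree;
}

# Demo using an in-memory filehandle instead of a real file.
my $data = "Country1,State1,City111,id1\nCountry1,State2,City121,id6\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
my $tree = load_nested($fh);
close $fh;

print $tree->{Country1}{State2}{City121}, "\n";             # prints id6
print join(',', sort keys %{ $tree->{Country1} }), "\n";    # prints State1,State2
```

A lookup such as exists $tree->{$country}{$state} then answers whether a country-state pair occurs at all, which is the test needed for the unmatched file.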

        Thanks & Regards,
        Amit Saxena

        On Thu, Jul 16, 2009 at 7:18 AM, Jim Gibson <jimsgibson@...> wrote:

        > At 2:12 AM -0700 7/16/09, Amit Saxena wrote:
        >
        >> Hi all,
        >>
        >> I need help regarding the approach to find out matched and unmatched
        >> entries
        >> between two files using perl.
        >>
        >> As the number of lines in the files would be around 10k-50k, I don't want
        >> to
        >> load entire file contents into memory.
        >>
        >
        >
        > The fastest approach is usually to load the shorter of the two files into
        > memory, then read the longer of the two files and process each line,
        > recording whether the line matches any record in the shorter file. A hash is
        > best for this method. 50k files should be no problem.
        >
        > If you really don't or can't read one of the files into memory, then a
        > method that still requires only one pass over each of the two files is to
        > sort the files and save the sorted copies. Then, read one line from each
        > file and compare. If they are equal, record this fact and read two more
        > lines. If they do not match, record the fact and read a line from the file
        > with the lesser of the two lines, alphabetically speaking, then compare
        > again.
        >
        > --
        > Jim Gibson
        > Jim@...
        >
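
The sorted, one-pass comparison described in the quoted reply could look roughly like the sketch below. It assumes both inputs are already sorted in the same string order on the first three columns, and that keys are unique within each file. It reports every superset line as either matched or not; the extra restriction to country-state pairs that occur in the subset is left out to keep the merge logic visible. The names key_of and merge_compare are hypothetical.

```perl
use strict;
use warnings;

# Key of a line = its first three comma-separated columns.
sub key_of {
    my ($line) = @_;
    chomp $line;
    return join ',', (split /,/, $line)[0 .. 2];
}

# Walk two filehandles whose lines are sorted by key_of().
# Calls $on_match->($line) or $on_miss->($line) for each superset line.
sub merge_compare {
    my ($super_fh, $sub_fh, $on_match, $on_miss) = @_;
    my $super = <$super_fh>;
    my $sub   = <$sub_fh>;
    while (defined $super) {
        # Once the subset is exhausted, every remaining superset line is a miss.
        my $cmp = defined $sub ? key_of($super) cmp key_of($sub) : -1;
        if ($cmp == 0) {              # same key: record the match, advance both
            $on_match->($super);
            $super = <$super_fh>;
            $sub   = <$sub_fh>;
        }
        elsif ($cmp < 0) {            # superset key is smaller: no partner
            $on_miss->($super);
            $super = <$super_fh>;
        }
        else {                        # subset key is smaller: skip it
            $sub = <$sub_fh>;
        }
    }
}

# Demo with in-memory filehandles standing in for the sorted files.
my $super_data = "Country1,State1,City111,id1\n"
               . "Country1,State1,City112,id2\n"
               . "Country1,State1,City113,id3\n";
my $sub_data   = "Country1,State1,City111\n"
               . "Country1,State1,City113\n";
open my $super_fh, '<', \$super_data or die "Cannot open superset: $!";
open my $sub_fh,   '<', \$sub_data   or die "Cannot open subset: $!";

my (@matched, @unmatched);
merge_compare($super_fh, $sub_fh,
    sub { push @matched,   $_[0] },
    sub { push @unmatched, $_[0] });

print "matched:   $_" for @matched;
print "unmatched: $_" for @unmatched;
```

Only one line from each file is held in memory at a time, at the cost of having to sort both files first.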

