
Re: Searching for unknown duplicates

  • Benji Fisher
    Message 1 of 5, Apr 30, 2003
      Piet Delport wrote:
      > On Mon, 28 Apr 2003 at 15:47:51 -0700, Tom Smith wrote:
      >
      >>QUESTION
      >>How do I search a file (a very long file) for duplicate entries?
      >>
      >>SITUATION
      >>I have a very long file that I know has many duplicates of the records
      >>contained within (a text CSV file).
      >>
      >>I would like to search the file for any values over said length for
      >>duplicates--that is, I want to search every word over 5 characters and
      >>compare these words to every other word to see if any match and, if they
      >>do, I want to delete all but one of those words.
      >>
      >>Is this possible?
      >
      >
      > Hmm, tricky.
      >
      > Not sure how to automatically do all deletions at once, but you can get
      > a sorted listing of word frequencies using something like this:
      >
      > perl -pe 's/\s+/\n/g' <your_file>[1] | grep '.\{5,\}' | sort | uniq -c | sort -rn | head
      >
      > Once you have that, you could use Vim to interactively find/delete those
      > words.
      >
      >
      > [1] passed via standard input, or listed as arguments to perl, BTW.
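
      For illustration, here is the quoted pipeline run on a tiny made-up
      sample (the file name and contents are assumptions, not data from the
      thread; the split pattern is also extended to commas, since the file
      in question is CSV):

      ```shell
      # Made-up sample file; name and contents are assumptions.
      printf 'apple,banana,apple\ncherry,banana,apple\n' > sample.csv

      # Same idea as the quoted pipeline, but splitting on commas as
      # well as whitespace; keeps words of 5 or more characters.
      perl -pe 's/[\s,]+/\n/g' sample.csv | grep '.\{5,\}' | sort | uniq -c | sort -rn | head
      ```

      On this sample it lists apple with count 3, banana with count 2, and
      cherry with count 1, most frequent first.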

      You could do something similar in vim, but it is probably a lot faster in
      perl. If you do want to do it in vim, I suggest the following approach:

      1. Copy all words of more than 5 characters to the end of the file, one per line.
      See the Pippo() function in foo.vim (my file of sample vim functions)
      http://www.vim.org/script.php?script_id=72

      2. Sort the newly added lines. This is probably the slowest step, so you might
      want to use an external program, unless your file is small. For a pure vim
      solution, see

      :help eval.txt
      /Sort

      or $VIMRUNTIME/plugin/explorer.vim or maybe Hari Krishna Dara's (sp?)
      genutils.vim at www.vim.org .

      3. Remove non-duplicates. One clever :g command should do it.

      4. For each remaining line (past the end of the original file) remove all lines
      containing that word except for the first occurrence.

      If you want to go this route and need more help, feel free to ask.
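
      If you don't need to do it interactively, the same four steps can be
      collapsed into a single non-vim pass. This is only a sketch of the
      idea: the file name and contents below are made up, and "over 5
      characters" is read literally as length greater than 5.

      ```shell
      # Made-up input; name and contents are assumptions.
      printf 'orange,pear,orange\nbanana,orange,pear\n' > records.csv

      # Keep the first occurrence of each word longer than 5 characters,
      # drop later repeats, and preserve the CSV line structure.
      awk -F, '{
        out = ""
        for (i = 1; i <= NF; i++) {
          if (length($i) > 5 && seen[$i]++) continue   # repeated long word: skip
          out = (out == "" ? $i : out "," $i)
        }
        print out
      }' records.csv
      ```

      On this sample it prints `orange,pear` and then `banana,pear`: the
      second `orange` on line 1 and the repeated `orange` on line 2 are
      dropped, while the short word `pear` is left alone.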

      HTH --Benji Fisher