24196 Re: Find Common Words Among Two Documents

  • Ray Shapp
    Dec 1, 2013
      To all,

      I have further simplified the problem by deleting file extensions and asterisks from all the file names in the first document. Here are the first ten lines of the first document:

      batt32
      bcryptprimitives32
      bitsprx432
      btprn2k32
      btwp32
      BtXpShell32
      cddbcontrol32
      cdfview32
      CNCFLhPL32
      cp21_graphicslarge1632
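
The simplification described above (dropping the file extension and the asterisk from each name) could be sketched as follows. This is a minimal Python sketch rather than a NoteTab clip, and the function names `simplify_name` and `simplify_names` are hypothetical. It assumes one file name per line and that duplicates produced by the simplification should be collapsed.

```python
import re

def simplify_name(name):
    """Drop asterisks and one trailing file extension from a file name."""
    name = name.strip().replace("*", "")
    # Remove a single trailing extension such as ".dll" (assumed form: dot + alphanumerics)
    return re.sub(r"\.[A-Za-z0-9]+$", "", name)

def simplify_names(lines):
    """Simplify every non-blank line; keep first occurrences, ignoring case."""
    seen = set()
    result = []
    for line in lines:
        name = simplify_name(line)
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            result.append(name)
    return result
```

Names that have no extension, such as `cp21_graphicslarge1632`, pass through unchanged.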

      Again, thank you.

      Ray Shapp
      ---


      On Sun, Dec 1, 2013 at 3:20 AM, Ray Shapp <rayshapp@...> wrote:
      To All,

      I've modified the original question:

      I stripped all the extraneous strings from the 59 hits, then I extracted each file name onto a line by itself. Then I sorted the file names. The result is a document of 118 lines, sorted in ascending order, with one file name per line. Here are the first ten lines of the file:

      batt32*.dll
      batt32.dll
      bcryptprimitives32*.dll
      bcryptprimitives32.dll
      bitsprx432*.dll
      bitsprx432.dll
      btprn2k32*.dll
      btprn2k32.dll
      btwp32*.dll
      btwp32.dll 

      Now here's the modified question: Do we have a clip that will compare each line in this document of file names against the contents of the second document, and produce a list of the file names that are also found in the second document? I think it is safe to assume that no file name in the second document is split between two lines. However, the second document could have lots of text surrounding any listed file name, and some file names in the second document may have no whitespace around them. For example, here are five possible lines from the second document:

      svchost.exe
      RandomTextNoSpacesC:\WINDOWS\Explorer.EXE \e
      RandomText SomeSpacesC:\WINDOWS\system32\spoolsv.exe
      C:\Windows\system32\taskeng.exe \sr this line has following text, but no random text prior to the file name
      Random text with lots of spaces C:\Program Files\iTunes\iTunesHelper.exe
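
One way to approach the comparison described above, sketched in Python rather than as a NoteTab clip: search every line of the second document for each file name as a case-insensitive substring, so surrounding text and missing whitespace do not matter. The function names are hypothetical, and treating `*` as "any run of digits" is an assumption based on the "multiple 32 digit additions" note in the original list.

```python
import re

def name_to_regex(name):
    """Build a case-insensitive pattern for one file name.

    Assumption: '*' in a name stands for zero or more extra digits.
    """
    escaped = re.escape(name.strip())
    pattern = escaped.replace(r"\*", r"\d*")
    return re.compile(pattern, re.IGNORECASE)

def find_common_names(names, scan_lines):
    """Return the names (in input order) that occur anywhere in scan_lines."""
    found = []
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip blank lines in the names document
        regex = name_to_regex(name)
        if any(regex.search(line) for line in scan_lines):
            found.append(name)
    return found
```

Because the match is a plain substring search, it also works with names whose extensions have been removed, at the cost of matching longer names that merely contain the shorter one.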

      Thank you for your help.

      Ray Shapp
      ---


      On Sun, Dec 1, 2013 at 12:25 AM, Ray Shapp <rayshapp@...> wrote:
      Hi All,

      Do we have a clip anywhere in the libraries that will find words that are common among two separate documents?

      For example, here are the first six hits for a search for "Sun VirtualBox Guest Additions":

      CLSIDNameFilenameDescriptionStatus
      {********-****-****-****-************}(no name)wmpsrcwp32.dll, wmpsrcwp32*.dll (* = multiple "32" digit additions)Trojan detected by Malwarebytes' Anti-Malware as Trojan.Tracur.X BHO
      {********-****-****-****-************}(no name)fxswzrd32.dll, fxswzrd32*.dll (* = multiple "32" digit additions) Trojan detected by Malwarebytes' Anti-Malware as Trojan.Tracur.X BHO
      {********-****-****-****-************}(no name)fxsdrv32.dll, fxsdrv32*.dll (* = multiple "32" digit additions)Trojan detected by Malwarebytes' Anti-Malware as Trojan.Tracur.X BHO
      {********-****-****-****-************}(no name)efsutil32.dll, efsutil32*.dll (* = multiple "32" digit additions) Trojan detected by Malwarebytes' Anti-Malware as Trojan.Tracur.X BHO
      {********-****-****-****-************}(no name)dsdmo32.dll, dsdmo32*.dll (* = multiple "32" digit additions)Trojan detected by Malwarebytes' Anti-Malware as Trojan.Tracur.X BHO
      {********-****-****-****-************}(no name)dneinobj32.dll, dneinobj32*.dll (* = multiple "32" digit additions)Trojan detected by Malwarebytes' Anti-Malware as Trojan.Tracur.X BHO

      The actual document shows 59 similar hits.

      I want to find any of the words in this document that also occur in a second document which is a scan for malware on a particular user's PC. The scan contains over 2300 words in 390 lines.
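
For the general "common words" question, a minimal Python sketch (not a NoteTab clip) is to split both documents into words and intersect the two sets. The tokenizing rule here is an assumption: a word is a run of letters, digits, underscores, or dots, so that a file name such as `wmpsrcwp32.dll` stays in one piece. The function names are hypothetical.

```python
import re

def words(text):
    """Split text into lower-cased words, keeping dots inside file names."""
    tokens = re.findall(r"[\w.]+", text.lower())
    # Trim sentence-ending dots and drop tokens that were only dots
    return {t.strip(".") for t in tokens if t.strip(".")}

def common_words(doc_a, doc_b):
    """Return the sorted list of words that occur in both documents."""
    return sorted(words(doc_a) & words(doc_b))
```

A set intersection like this scales easily to a scan of a few thousand words, but it only finds whole-word matches, so it would miss a file name that runs into neighboring text without a separator.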

      If necessary, I could strip off the uninteresting strings like "{********-****-****-****-************}(no name)", "(* = multiple "32" digit additions)", and "Trojan detected by Malwarebytes' Anti-Malware as Trojan.Tracur.X BHO" before doing the comparison. As you can see, I'm interested in only the file names.
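
Instead of stripping each uninteresting string, the file names could be pulled out directly. The following Python sketch assumes every file name of interest ends in `.dll`, as in the six hits shown, and may contain one `*` before the extension. The function name `extract_file_names` is hypothetical.

```python
import re

# Assumed shape of a file name: letters, digits, '_' or '-', an optional '*', then ".dll"
FILE_NAME = re.compile(r"[A-Za-z0-9_-]+\*?\.dll", re.IGNORECASE)

def extract_file_names(lines):
    """Collect every distinct .dll name found in the lines, sorted ascending."""
    names = set()
    for line in lines:
        names.update(FILE_NAME.findall(line))
    return sorted(names)
```

Applied to the hit list, this would give one name per entry, with the starred and unstarred forms side by side.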

      Thank you for your help.

      Ray Shapp
      ---

