RE: [Clip] Extracting text from WORD files

  • Timothy Barlow
    Message 1 of 7 , Oct 13, 2003
      When you open Word documents directly in NTB, you get the binary-file message, and even the text you can see may not be the 'real' file. There is plenty of redundancy in the files for some reason.

      Personally, I would not attempt this with clips; I would use the Pasteboard function instead. You can open NTB, set the default document to Pasteboard, then open each of the 26 documents in turn, select all, and copy. It's faster than traditional copy and paste, and if you set a Pasteboard divider text sequence (this CAN be done in a clip), you still have all 26 documents separated by a (hopefully) unique divider. After that, you can manipulate the text as you see fit, using clips to split it up into separate files if you wish.
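
The final splitting step can also be done outside NoteTab. The following is a minimal Python sketch, not part of the original post. It assumes the combined Pasteboard text has been read into a string and that the divider is a line such as `=====DOC-BREAK=====`; the divider value, function names, and output file names are all hypothetical.

```python
from pathlib import Path

# Hypothetical divider; use whatever sequence was configured in NoteTab.
DIVIDER = "=====DOC-BREAK====="


def split_pasteboard(text: str, divider: str = DIVIDER) -> list[str]:
    """Return the non-empty chunks of text found between dividers."""
    chunks = [chunk.strip() for chunk in text.split(divider)]
    return [chunk for chunk in chunks if chunk]


def write_chunks(chunks: list[str], out_dir: Path, stem: str = "doc") -> list[Path]:
    """Write each chunk to out_dir as doc01.txt, doc02.txt, ... and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        path = out_dir / f"{stem}{number:02d}.txt"
        path.write_text(chunk + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

The divider only works if it never occurs in the documents themselves, which is why a long, unusual sequence is the safer choice.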

      As for tables, these will be a problem if they exist. I am pretty sure, however, that when you copy a Word table and paste it into NoteTab, you get each row as a separate line with each 'column' separated by a tab. Don't quote me on this; it has been a long time since I have had to do it.
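
If pasted tables do arrive in that form (one row per line, cells separated by tabs, which the post itself is not certain about), the rows can be recovered with a few lines of code. This is a hedged Python sketch under that assumption; `parse_pasted_table` is a hypothetical helper name.

```python
def parse_pasted_table(text: str) -> list[list[str]]:
    """Split pasted table text into rows of cells.

    Assumes one table row per line and a tab between cells.
    Blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append([cell.strip() for cell in line.split("\t")])
    return rows
```

An empty trailing cell survives as an empty string, so every row keeps its column count as long as Word emitted a tab for each column.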

      Multi-column pages should be fine. I am pretty sure Word treats the text for such layouts as a single stream and adds the columns purely as formatting. In other words, your text should appear in the text file as if it had been formatted as a single-column page.

      Regards,
      Tim.

      -----Original Message-----
      From: Don Passenger [mailto:dpasseng@...]
      Sent: Tuesday, 14 October 2003 10:41 AM
      To: ntb-clips@yahoogroups.com
      Subject: Re: [Clip] Extracting text from WORD files


      I suppose you could:
      a. open document in word via shell command
      b. copy all content in the document
      c. paste in text document
      d. save with like name.txt

      If the documents have something like tables or columns, you might get
      something odd.

      --

      Don Passenger

      If you need an html fix visit
      http://www.htmlfixit.com

      html/perl/php/xhtml/javascript presented
      in easy tutorials with live help and forums
      to fix your problems
      ----- Original Message -----
      From: "Robin Chapple" <robinski@...>
      To: <ntb-clips@yahoogroups.com>
      Sent: Monday, October 13, 2003 6:20 PM
      Subject: [Clip] Extracting text from WORD files


      > I have a task to extract the text from 26 WORD documents. Is this a task
      > that I can achieve with clips?
      >
      > Thanks,
      >
      > Robin Chapple
      > Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
    • Larry Hamilton
      Message 2 of 7 , Oct 13, 2003
        Robin Chapple wrote:
        > I have a task to extract the text from 26 WORD documents. Is this a
        > task that I can achieve with clips?
        >
        > Thanks,
        >
        > Robin Chapple

        Robin,

        This is a task I encountered a couple of years ago with over 100 Word
        documents that contained census readings and all I needed was the text to
        wrap <pre> </pre> tags around for simple HTML files.

        Here is the clip I used, and it gets the headers and footers, if any.

        There were several methods that I encountered from others on the list, but
        those solutions did not quite do what I needed. I even looked for command
        line utilities to extract text, but none of them could do what opening the
        Word doc itself allowed.

        You may need to adjust the delays on the keyboard commands to get them to do
        what you need. For only 26 files, this will be faster than building a new
        clip or doing it by hand.

        HTH,

        Larry Hamilton
        lmh@...
        My Web Site: http://notlimah.tripod.com/
        Webmaster: Hamilton National Genealogical Society, Inc.
        http://www.hamiltongensociety.org/

        <copy below this line>
        ;March 05, 2002 Larry Hamilton lmh@...
        ;Brute force method to open Word document, and use toolbar commands to copy
        ;headers and footers from document. The command-line tools I found do not pull
        ;out the header and footer text. Only Word saving as Text does so.
        ^!ClearVariables
        ^!SetDebug ON

        ;I just hard coded the path to keep it simple.
        ^!Set %File%=^$GetFileFirst("c:\Census";*.doc)$
        ^!ChDir C:\Census
        :LOOP
        ;The following was used for testing to make sure it does what is desired.
        ;^!Info ^%File% > ^$GetName(^%File%)$.txt

        ^!"C:\Program Files\Microsoft Office\Office\WINWORD.EXE" ^$GetShort(^%File%)$

        ^!SetHintInfo ^$GetDate(hh:nn:ss am/pm dddd, mmmm dd, yyyy)$
        ^!FocusApp "Microsoft Word - ^$GetName(^%File%)$"
        ^!IfDiff "^$GetAppTitle$" "Microsoft Word - ^$GetName(^%File%)$" Skip_-2
        ^!StatusClose
        ^!Delay 15
        ;The following Keyboard sequence will save the currently opened document
        ;with the same name in TXT format. It puts the headers & footers at the end
        ;of the file, so it still needs to be cleaned up.
        ^!Keyboard ALT+F A &100 TAB &100 T &100 ENTER


        ^!Set %File%=^$GetFileNext$
        ;Leave the loop once no more files are found.
        ^!IfSame "^%File%" "" DONE
        ^!GoTo LOOP
        :DONE
        ^!CloseFileFind
        </copy above this line>
      • hugo_paulissen
        Message 3 of 7 , Oct 14, 2003
          Robin, Larry,

          I had to do this for a couple of hundred files once. What follows
          is a very quick and dirty clip (warning!), which opens the documents
          in Word (one at a time) and copies the text to NoteTab. The document
          is then saved with the same name plus a .txt extension. Once all files
          have been processed, the clip should stop.

          Please note that you should have Word open - and that there should be
          no document loaded in Word before you start the clip. (This can be
          fixed by changing the FocusApp line...); the following clip assumes
          the title bar of MS Word only shows Microsoft Word.

          Hugo

          ^!Set %path%="C:\WINDOWS\Desktop\OutlookFiles"
          ^!SetArray %Files%=^$GetFiles("^%path%";*.doc)$
          ^!Set %X%=1
          :EXPORT
          ^!If ^%X% > ^%Files0% END
          ^!FocusApp "Microsoft Word"
          ^!Delay 1
          ^!Keyboard CTRL+O
          ^!Delay 1
          ^!Keyboard #^%Files^%X%%# ENTER
          ^!Delay 1
          ^!Keyboard CTRL+A CTRL+C CTRL+W
          ^!ActivateApp
          ^!Select ALL
          ^!InsertText ^$GetClipboard$
          ^!Save AS "^%Files^%X%%.txt"
          ^!INC %X%
          ^!Delay 1
          ^!GoTo EXPORT
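
Because clips like this depend on fixed keyboard delays, it can be worth checking afterwards that every document actually produced a text file. The sketch below is a Python illustration, not part of the original posts. It assumes the naming used by the clip above, where `name.doc` is saved as `name.doc.txt` in the same folder; `missing_text_files` is a hypothetical helper name.

```python
from pathlib import Path


def missing_text_files(folder: Path) -> list[str]:
    """Return names of .doc files in folder with no '<name>.doc.txt' beside them."""
    missing = []
    for doc in sorted(folder.glob("*.doc")):
        # The clip appends ".txt" to the full original file name.
        if not Path(str(doc) + ".txt").exists():
            missing.append(doc.name)
    return missing
```

Calling `missing_text_files(Path(r"C:\WINDOWS\Desktop\OutlookFiles"))` would list any documents that still need converting. A clip that replaces the extension rather than appending to it would need the check adjusted to match.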
        • Don Passenger
          Message 4 of 7 , Oct 14, 2003
            That does what I said, only it has notetab code in it ;-)

            --

            Don Passenger