Re: [Clip] Extracting text from WORD files

  • Don Passenger
    Message 1 of 7 , Oct 13, 2003
      I suppose you could:
      a. open document in word via shell command
      b. copy all content in the document
      c. paste in text document
      d. save with like name.txt

      If the documents have something like tables or columns, you might get
      something odd.

      --

      Don Passenger

      If you need an html fix visit
      http://www.htmlfixit.com

      html/perl/php/xhtml/javascript presented
      in easy tutorials with live help and forums
      to fix your problems
      ----- Original Message -----
      From: "Robin Chapple" <robinski@...>
      To: <ntb-clips@yahoogroups.com>
      Sent: Monday, October 13, 2003 6:20 PM
      Subject: [Clip] Extracting text from WORD files


      > I have a task to extract the text from 26 WORD documents. Is this a task
      > that I can achieve with clips?
      >
      > Thanks,
      >
      > Robin Chapple
    • Larry Thomas
      Message 2 of 7 , Oct 13, 2003
        Hi Robin and Don,

        At 08:40 PM 10/13/03 -0400, you wrote:
        >I suppose you could:
        >a. open document in word via shell command
        >b. copy all content in the document
        >c. paste in text document
        >d. save with like name.txt
        >
        >If the documents have something like tables or columns, you might get
        >something odd.
        >
        >--
        >
        >Don Passenger
        >
        >From: "Robin Chapple" <robinski@...>
        >
        >> I have a task to extract the text from 26 WORD documents. Is this a task
        >> that I can achieve with clips?

        You can also do this from within WORD. Use Save As... and WORD will open a
        Save As dialog window, the same as NoteTab does, which points to the folder
        where the file will be saved. Browse to the NoteTab folder where you want to
        save the file. At the bottom there is a box labeled "Save As Type" with a
        drop-down list. Drop the list down, select .asc (for ascii), and click OK to
        save the file. The file will now be saved in the folder you selected as a plain
        ascii text file, with most of the paragraph spacing/formatting preserved,
        and you can load it into NoteTab. You will not have to convert it, as
        WORD will already have done this if you follow the directions above. This
        is just a bit slow with 26 files, though, and I don't know of a way to speed
        it up.
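
One way to think about speeding this up is to separate the per-file conversion from the loop over the folder. The following is only a minimal Python sketch of that loop, not a NoteTab clip: the function name `convert_folder` and the `convert` callback are hypothetical, and the callback is where whatever actually produces the text (Word's Save As, a clip, or another tool) would be plugged in.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(folder: Path, convert: Callable[[Path, Path], None]) -> List[Path]:
    """Call convert(source, target) for every .doc file in folder.

    Each target keeps the source file name and swaps the extension for .txt.
    Returns the target paths in name order.
    """
    targets = []
    for source in sorted(folder.glob("*.doc")):
        target = source.with_suffix(".txt")
        # The real work happens in the callback (an assumption of this sketch);
        # this function only decides which files to process and what to call them.
        convert(source, target)
        targets.append(target)
    return targets
```

The sketch does no conversion itself; it only shows that the 26 files can be handled by one loop once a single-file conversion step exists.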

        Regards,

        Larry
        lrt@...
      • Timothy Barlow
        Message 3 of 7 , Oct 13, 2003
          When you open Word documents directly in NTB, you get the binary files message, and even the text you can see may not be the 'real' file. There is plenty of redundancy in the files for some reason.

          Personally, I would not attempt this with clips; I would use the Pasteboard function instead. You can open NTB, set the default document as the pasteboard, then open all 26 documents and select all and copy in each, one at a time. It's faster than traditional copy and paste, and if you set a pasteboard divider text sequence (this CAN be done in a clip) then you still have all 26 documents separated by a (hopefully) unique divider. After that, you can manipulate the text as you see fit, using clips to split it up into separate files if you wish.
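
To illustrate the last step, here is a minimal Python sketch (not a clip) of splitting the collected pasteboard text back into one chunk per document. The divider string and the function name `split_pasteboard` are made up for the example; the only assumption is that the divider sits on a line of its own and never occurs in the documents themselves.

```python
from typing import List

# Hypothetical divider; any text sequence that cannot occur in the documents will do.
DIVIDER = "=====NEXT-DOCUMENT====="


def split_pasteboard(text: str, divider: str = DIVIDER) -> List[str]:
    """Split text on lines that consist solely of the divider.

    Leading and trailing blank space is trimmed from each chunk,
    and empty chunks are dropped.
    """
    chunks = []
    current = []
    for line in text.splitlines():
        if line.strip() == divider:
            chunks.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Each returned chunk could then be written to its own file, which is the same job the splitting clip would do inside NoteTab.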

          As for tables, these will be a problem if they exist. I am pretty sure, however, that when you copy a Word table and paste it into NoteTab, you get each row as a separate line with each 'column' separated by a tab. Don't quote me on this; it has been a long time since I have had to do it.
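
If pasted tables really do arrive that way (one line per row, tabs between cells, which is an unverified recollection here), the rows can be recovered with a few lines of Python. This is only a sketch under that assumption, and `parse_pasted_table` is a hypothetical name.

```python
from typing import List


def parse_pasted_table(text: str) -> List[List[str]]:
    """Turn pasted text into a list of rows, each row a list of cell strings.

    Assumes one table row per line and a tab between cells; blank lines are skipped.
    """
    return [line.split("\t") for line in text.splitlines() if line.strip()]
```

Empty cells inside a row survive as empty strings, so the column positions stay aligned.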

          Multi-column pages should be fine. I am pretty sure Word treats the text for such formats as a single stream and adds the columns as a pure formatting function. In other words, your text should appear in the text file as if it were formatted as a single-column page.

          Regards,
          Tim.

        • Don Passenger
          Message 4 of 7 , Oct 13, 2003
            I think you can do it with a shell command if you try very hard.

            --

            Don Passenger

            snip
            > WORD will already have done this it you follow the directions above. This
            > is just a bit slow with 26 files though and I don't know of a way to speed
            > it up.
          • Larry Hamilton
            Message 5 of 7 , Oct 13, 2003
              Robin Chapple wrote:
              > I have a task to extract the text from 26 WORD documents. Is this a
              > task that I can achieve with clips?
              >
              > Thanks,
              >
              > Robin Chapple

              Robin,

              This is a task I encountered a couple of years ago with over 100 Word
              documents that contained census readings and all I needed was the text to
              wrap <pre> </pre> tags around for simple HTML files.

              Here is the clip I used, and it gets the headers and footers, if any.

              There were several methods that I encountered from others on the list, but
              those solutions did not quite do what I needed. I even looked for command
              line utilities to extract text, but none of them could do what opening the
              Word doc itself allowed.

              You may need to adjust the delays on the keyboard commands to get them to do
              what you need. For only 26 files, this will be faster than building a new
              clip or doing it by hand.

              HTH,

              Larry Hamilton
              lmh@...
              My Web Site: http://notlimah.tripod.com/
              Webmaster: Hamilton National Genealogical Society, Inc.
              http://www.hamiltongensociety.org/

              <copy below this line>
              ;March 05, 2002 Larry Hamilton lmh@...
              ;Brute force method to open Word document, and use toolbar commands to copy
              ;headers and footers from document. The commandline tools I found do not pull
              ;out the header and footer text. Only Word saving as Text does so.
              ^!ClearVariables
              ^!SetDebug ON

              ;I just hard coded the path to keep it simple.
              ^!Set %File%=^$GetFileFirst("c:\Census";*.doc)$
              ^!ChDir C:\Census
              :LOOP
              ;Assumption: the file search returns an empty string once no files remain.
              ;Without this test the loop never reaches ^!CloseFileFind.
              ^!IfSame "^%File%" "" DONE
              ;The following was used for testing to make sure it does what is desired.
              ;^!Info ^%File% > ^$GetName(^%File%)$.txt

              ^!"C:\Program Files\Microsoft Office\Office\WINWORD.EXE" ^$GetShort(^%File%)$

              ^!SetHintInfo ^$GetDate(hh:nn:ss am/pm dddd, mmmm dd, yyyy)$
              ^!FocusApp "Microsoft Word - ^$GetName(^%File%)$"
              ^!IfDiff "^$GetAppTitle$" "Microsoft Word - ^$GetName(^%File%)$" Skip_-2
              ^!StatusClose
              ^!Delay 15
              ;The following Keyboard sequence will save the currently opened document
              ;with the same name in TXT format. It puts the headers & footers at the end
              ;of the file, so it still needs to be cleaned up.
              ^!Keyboard ALT+F A &100 TAB &100 T &100 ENTER


              ^!Set %File%=^$GetFileNext$
              ^!GoTo LOOP
              :DONE
              ^!CloseFileFind
              </copy above this line>
            • hugo_paulissen
              Message 6 of 7 , Oct 14, 2003
                Robin, Larry,

                I had to do this for a couple of hundred files once. What follows
                is a very quick and dirty clip (warning!), which opens the documents
                in Word (one at a time), and copies the text to NoteTab. The document
                is then saved with the same name plus a txt-extension... If all files
                are processed the clip should stop.

                Please note that you should have Word open - and that there should be
                no document loaded in Word before you start the clip. (This can be
                fixed by changing the FocusApp line...); the following clip assumes
                the title bar of MS Word only shows Microsoft Word.

                Hugo

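                ;Folder holding the Word documents (hard-coded).
                ^!Set %path%="C:\WINDOWS\Desktop\OutlookFiles"
                ;Collect the .doc files into an array; %Files0% holds the count.
                ^!SetArray %Files%=^$GetFiles("^%path%";*.doc)$
                ^!Set %X%=1
                :EXPORT
                ;Stop once every file has been processed.
                ^!If ^%X% > ^%Files0% END
                ^!FocusApp "Microsoft Word"
                ^!Delay 1
                ;Open the next file through Word's Open dialog.
                ^!Keyboard CTRL+O
                ^!Delay 1
                ^!Keyboard #^%Files^%X%%# ENTER
                ^!Delay 1
                ;Select all, copy, and close the document in Word.
                ^!Keyboard CTRL+A CTRL+C CTRL+W
                ;Back in NoteTab: replace the current text with the clipboard and save it.
                ^!ActivateApp
                ^!Select ALL
                ^!InsertText ^$GetClipboard$
                ^!Save AS "^%Files^%X%%.txt"
                ^!INC %X%
                ^!Delay 1
                ^!GoTo EXPORT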
              • Don Passenger
                Message 7 of 7 , Oct 14, 2003
                  That does what I said, only it has notetab code in it ;-)

                  --

                  Don Passenger