RE: [Clip] Extracting text from WORD files
- When you open Word documents directly in NTB, you get the binary-file message, and even the text that is available to you may not be the 'real' file. There is plenty of redundancy in the files for some reason.
Personally, I would not attempt this with clips; I would use the Pasteboard function instead. You can open NTB, set the default document as the pasteboard, then open all 26 documents and, one at a time, select all and copy. It's faster than traditional copy and paste, and if you set a pasteboard divider text sequence (this CAN be done in a clip) then you still have all 26 documents separated by a (hopefully) unique divider. After that, you can manipulate the text as you see fit, using clips to split it up into separate files if you wish.
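For anyone who would rather do the final split outside NoteTab, here is a minimal Python sketch of that last step. It assumes the whole pasteboard has been saved as one text file and that the divider never occurs inside a document. The divider string, the function names and the output file names are made up for illustration; none of them come from NoteTab.

```python
from pathlib import Path

# Hypothetical divider; use whatever unique sequence you set for the pasteboard.
DIVIDER = "=====NEXT-DOCUMENT====="


def split_on_divider(combined: str, divider: str = DIVIDER) -> list[str]:
    """Return the non-empty chunks of text found between dividers."""
    chunks = [chunk.strip() for chunk in combined.split(divider)]
    return [chunk for chunk in chunks if chunk]


def write_chunks(chunks: list[str], out_dir: Path, stem: str = "document") -> list[Path]:
    """Write each chunk to out_dir as document_01.txt, document_02.txt, ..."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        path = out_dir / f"{stem}_{number:02d}.txt"
        path.write_text(chunk + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

The output files are numbered in paste order, so it is worth copying the documents in a known sequence if the original names need to be matched up afterwards.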
As for tables, these will be a problem if they exist. I am pretty sure, however, that when you copy a Word table and paste it into NoteTab, you get each row as a separate line with each 'column' separated by a tab. Don't quote me on this; it has been a long time since I have had to do it.
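If the pasted tables do come through that way (one row per line, cells separated by tabs, which is only my recollection), reading them back into rows is simple. A rough Python sketch, with a hypothetical function name:

```python
def parse_pasted_table(text: str) -> list[list[str]]:
    """Split pasted text into rows (one per line) and cells (tab-separated)."""
    rows = []
    for line in text.splitlines():
        if line.strip():  # skip blank lines between rows
            rows.append([cell.strip() for cell in line.split("\t")])
    return rows
```

Cells that contain line breaks of their own would not survive this, so check a sample table before trusting it.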
Multi-column pages should be fine. I am pretty sure Word treats the text for such formats as a single stream and adds the columns purely as formatting. In other words, your text should appear in the text file as if it were formatted as a single-column page.
From: Don Passenger [mailto:dpasseng@...]
Sent: Tuesday, 14 October 2003 10:41 AM
Subject: Re: [Clip] Extracting text from WORD files
I suppose you could:
a. open document in word via shell command
b. copy all content in the document
c. paste in text document
d. save with like name.txt
If the documents have something like tables or columns, you might get
----- Original Message -----
From: "Robin Chapple" <robinski@...>
Sent: Monday, October 13, 2003 6:20 PM
Subject: [Clip] Extracting text from WORD files
> I have a task to extract the text from 26 WORD documents. Is this a task
> that I can achieve with clips?
> Robin Chapple
> Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/
- I think you can do it with a shell command if you try very hard.
> WORD will already have done this if you follow the directions above. This
> is just a bit slow with 26 files though and I don't know of a way to speed
> it up.
- Robin Chapple wrote:
> I have a task to extract the text from 26 WORD documents. Is this a
> task that I can achieve with clips?
> Robin Chapple
Robin,
This is a task I encountered a couple of years ago with over 100 Word
documents that contained census readings and all I needed was the text to
wrap <pre> </pre> tags around for simple HTML files.
Here is the clip I used, and it gets the headers and footers, if any.
There were several methods that I encountered from others on the list, but
those solutions did not quite do what I needed. I even looked for command
line utilities to extract text, but none of them could do what opening the
Word doc itself allowed.
You may need to adjust the delays on the keyboard commands to get them to do
what you need. For only 26 files, this will be faster than building a new
clip or doing it by hand.
My Web Site: http://notlimah.tripod.com/
Webmaster: Hamilton National Genealogical Society, Inc.
<copy below this line>
;March 05, 2002 Larry Hamilton lmh@...
;Brute force method to open Word document, and use toolbar commands to copy
;headers and footers from document. The commandline tools I found do not pull
;out the header and footer text. Only Word saving as Text does so.
;I just hard coded the path to keep it simple.
;The following was used for testing to make sure it does what is desired.
;^!Info ^%File% > ^$GetName(^%File%)$.txt
^!"C:\Program Files\Microsoft Office\Office\WINWORD.EXE"
^!SetHintInfo ^$GetDate(hh:nn:ss am/pm dddd, mmmm dd, yyyy)$
^!FocusApp "Microsoft Word - ^$GetName(^%File%)$"
^!IfDiff "^$GetAppTitle$" "Microsoft Word - ^$GetName(^%File%)$" Skip_-2
;The following Keyboard sequence will save the currently opened document
;with the same name in TXT format. It puts the headers & footers at the end
;of the file, so it still needs to be cleaned up.
^!Keyboard ALT+F A &100 TAB &100 T &100 ENTER
</copy above this line>
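For the last step mentioned above, wrapping the extracted text in <pre> </pre> tags for a simple HTML file, here is a small Python sketch. It is not part of the clip, and the function name and page skeleton are assumptions for illustration. It escapes &, < and > first so that stray characters in the text cannot break the page.

```python
import html


def wrap_in_pre(text: str, title: str = "Untitled") -> str:
    """Return a bare-bones HTML page with the text inside <pre> tags."""
    escaped = html.escape(text, quote=False)  # escapes &, < and >
    return (
        "<html>\n<head><title>" + html.escape(title) + "</title></head>\n"
        "<body>\n<pre>\n" + escaped + "\n</pre>\n</body>\n</html>\n"
    )
```

The header and footer text that Word puts at the end of the saved file would still need to be cleaned up by hand before wrapping.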
- Robin, Larry,
I had to do this for a couple of hundred files once. What follows
is a very quick and dirty clip (warning!), which opens the documents
in Word (one at a time), and copies the text to NoteTab. The document
is then saved with the same name plus a .txt extension. Once all files
are processed, the clip should stop.
Please note that you should have Word open - and that there should be
no document loaded in Word before you start the clip. (This can be
fixed by changing the FocusApp line...); the following clip assumes
the title bar of MS Word only shows Microsoft Word.
^!If ^%X% > ^%Files0% END
^!FocusApp "Microsoft Word"
^!Keyboard #^%Files^%X%%# ENTER
^!Keyboard CTRL+A CTRL+C CTRL+W
^!Save AS "^%Files^%X%%.txt"
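Before running a batch like this, it can help to list which files will be processed and what each output will be called. The Python sketch below only does that planning and never touches Word. It assumes all documents sit in one folder and follows the clip's naming (the full original name plus .txt); the function name and the "*.doc" pattern are my own choices.

```python
from pathlib import Path


def plan_outputs(folder: Path, pattern: str = "*.doc") -> list[tuple[Path, Path]]:
    """Pair each Word file with '<original name>.txt', e.g. report.doc -> report.doc.txt."""
    sources = sorted(folder.glob(pattern))
    return [(src, src.with_name(src.name + ".txt")) for src in sources]
```

Printing the pairs first makes it easy to spot a name clash or a missing file before the keyboard macros start firing.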