Converting legacy documents
The following is a message I recently received off-list from a
developer named Damon Butler <dbutler@...>. Damon,
whose job title is "conversion specialist", provides some details
about conversion of legacy documents (in this case, Word
documents) that may interest some of you.
Well, getting XML out of Word is actually the basis of my job here
at Impressions. I'm a self-trained Visual Basic programmer, and
I've created a VBA program, hosted by Word, that extracts
fully-coded text out of entire batches of input Word documents.
For the most part, this software has been used in-house only, but
I just created a slightly different version of it for the
University of California Press Electronic Manuscript System (an
entire suite of VBA macros, also all built by me, that facilitates
the prepping, coding, and editing of Word documents).
As you know, the number and usefulness of the codes such a program
can extract depend upon the regularity and formatting of the
source Word docs. Even so, just extracting information about
italics, bold, etc., and embedded notes and comments is a *huge*
timesaver along the road towards even well-formed XML.
Please forgive me if I begin to "explain" items you are more than
familiar with (and ultimately, I expect the following information
to be more useful to the readers on your list), but I am
*astonished* at the number of individuals, even organizations, who
have failed to make this very simple link:
(1) Word uses coding
(2) VBA can completely control the Word application; thus
(3) One can use VBA to instruct Word to locate its own coding and
mark it with "tags" in XML syntax
So, the first thing to recognize is that Word uses coding on three
levels of granularity: paragraph styles, character styles, and
local formatting (this being just simple application of italic,
bold, and so forth). All one has to do is find a way to express
this Word "coding" as "real" codes in an ASCII text file.
Within each Word document my script processes, I've instructed
Word to:
(1) Locate all instances of every type of local formatting (e.g.,
italic, bold, etc.) and mark the text with simple codes:
<I>...</I> for italic, <B>...</B> for bold, and so forth.
(2) Locate all instances of character styles and mark the text
with codes constructed of the name of the character style. For
example, each range of text marked with Word's "Emphasis"
character style would get marked with <Emphasis>...</Emphasis>.
(3) Iterate through all the paragraphs in the document, marking
each paragraph with codes constructed of the name of the paragraph
style applied to it. For example, each paragraph marked with
Word's "Heading 1" style would get marked with
<Heading 1>...</Heading 1> codes.
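The three passes above can be sketched outside of Word. What follows is
not Damon's VBA program, only a minimal Python illustration of the same
idea. The Run and Paragraph classes are a hypothetical stand-in for
Word's object model, and the nesting order of the tags (local
formatting innermost, then character style, then paragraph style) is an
assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    """Hypothetical stand-in for a stretch of uniformly formatted text."""
    text: str
    italic: bool = False
    bold: bool = False
    char_style: Optional[str] = None  # e.g. "Emphasis"


@dataclass
class Paragraph:
    """Hypothetical stand-in for a paragraph and its paragraph style."""
    style: str  # e.g. "Heading 1"
    runs: List[Run] = field(default_factory=list)


def wrap(tag: str, text: str) -> str:
    """Surround text with an open/close pair built from the tag name."""
    return f"<{tag}>{text}</{tag}>"


def tag_run(run: Run) -> str:
    """Passes (1) and (2): local formatting first, then character style."""
    text = run.text
    if run.italic:
        text = wrap("I", text)
    if run.bold:
        text = wrap("B", text)
    if run.char_style:
        text = wrap(run.char_style, text)
    return text


def tag_paragraph(par: Paragraph) -> str:
    """Pass (3): wrap the tagged runs in the paragraph style name."""
    return wrap(par.style, "".join(tag_run(r) for r in par.runs))


doc = [
    Paragraph("Heading 1", [Run("Converting legacy documents")]),
    Paragraph("Normal", [
        Run("Styles are "),
        Run("very", italic=True),
        Run(" useful, "),
        Run("really", char_style="Emphasis"),
        Run("."),
    ]),
]
tagged = [tag_paragraph(p) for p in doc]
```

Note that a style name containing a space, such as "Heading 1", is not
a legal XML element name, so output like this would still need a
renaming step (and escaping of any literal "<" or "&" in the text)
before it could be called well-formed XML.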
The other major function the script handles is conversion of
special, non-ASCII characters. Since Word 97, the DOC format has
used Unicode to represent all non-ASCII characters. Even if you're
working on a non-Unicode-aware OS, like the Mac or Windows 95,
Word itself still uses Unicode. Word can also search for Unicode
characters, just like it can search for plain old "a"s and "b"s.
On the Internet, I found the ISOPUB tables which cross-referenced
standard SGML character entities with the Unicode positions of the
corresponding glyphs. Thus, I'm able to instruct Word to locate
characters by their Unicode position and replace them with SGML
entities (e.g., &ecaron;).
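The character replacement step can be sketched in the same spirit.
Again, this is not the VBA search-and-replace itself, only a small
Python illustration. The table below is a hand-picked excerpt of a few
ISO entity names, not the full cross-reference tables mentioned above,
and the decimal fallback for unmapped characters is an assumption.

```python
# Tiny illustrative excerpt: Unicode code point -> SGML entity name.
# A real table would be generated from the full ISO entity sets.
ENTITY_BY_CODEPOINT = {
    0x00E9: "eacute",  # e with acute accent
    0x011B: "ecaron",  # e with caron
    0x2014: "mdash",   # em dash
    0x2026: "hellip",  # horizontal ellipsis
    0x201C: "ldquo",   # left double quotation mark
    0x201D: "rdquo",   # right double quotation mark
}


def to_ascii_with_entities(text: str) -> str:
    """Replace every non-ASCII character with an entity reference.

    ASCII characters (including any existing "&") pass through unchanged.
    Characters missing from the table fall back to a decimal numeric
    character reference (an assumption, not part of the original method).
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in ENTITY_BY_CODEPOINT:
            out.append(f"&{ENTITY_BY_CODEPOINT[cp]};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)


converted = to_ascii_with_entities("D\u011bkuji \u2014 caf\u00e9")
```

Driving the replacement from a lookup table keeps the conversion logic
fixed while the list of supported characters grows.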
These are the basic functions that allow one to extract usefully
coded ASCII text from a Word file. <shameless plug>My own scripts
do much more, such as compile lists of converted codes
cross-referenced with the files they were found in, allow embedded
notes to be written to separate files, code Word tables using the
HTML table coding syntax, etc.</shameless plug> As already
mentioned, the more one uses Word styles (even more importantly,
the more *consistently* one does *anything*, including using
styles), the more useful the output coding is.
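One of the extras mentioned above, a list of codes cross-referenced
with the files they were found in, might look roughly like this in
Python. This is a guess at the general shape of such a report, not a
description of the actual scripts. The file names and the index_codes
function are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Matches an opening tag such as <I> or <Heading 1>, but not </I>.
OPEN_TAG = re.compile(r"<([^/<>][^<>]*)>")


def index_codes(tagged_files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each code name to the set of files in which it occurs."""
    index = defaultdict(set)
    for filename, text in tagged_files.items():
        for name in OPEN_TAG.findall(text):
            index[name].add(filename)
    return dict(index)


# Hypothetical converted output for two input documents.
files = {
    "ch01.txt": "<Heading 1>Intro</Heading 1><Normal>Some <I>text</I>.</Normal>",
    "ch02.txt": "<Normal>More <B>text</B>.</Normal>",
}
index = index_codes(files)
```

A report like this makes inconsistent style usage easy to spot: a code
that turns up in only one file out of a batch is often a stray or
misapplied style.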
I hope this information proves helpful. Please feel free to
contact me here at Impressions any time for more information.
Damon Butler <dbutler@...>