Converting legacy documents

  • Michael Smith
    Message 1 of 1, Aug 1, 2000
      The following is a message I recently received off-list from a
      developer named Damon Butler <dbutler@...>. Damon,
      whose job title is "conversion specialist", provides some details
      about conversion of legacy documents (in this case, Word
      documents) that may interest some of you.


      Damon writes:

      Well, getting XML out of Word is actually the basis of my job here
      at Impressions. I'm a self-trained Visual Basic programmer, and
      I've created a VBA program, hosted by Word, that extracts
      fully-coded text out of entire batches of input Word documents.
      For the most part, this software has been used in-house only, but
      I just created a slightly different version of it for the
      University of California Press Electronic Manuscript System (an
      entire suite of VBA macros, also all built by me, that facilitates
      the prepping, coding, and editing of Word documents).

      As you know, the number and usefulness of the codes such a program
      can extract depend upon the regularity and formatting of the
      source Word docs. Even so, just extracting information about
      italics, bold, etc., and embedded notes and comments is a *huge*
      timesaver along the road towards even well-formed XML.

      Please forgive me if I begin to "explain" items you are more than
      familiar with (and ultimately, I expect the following information
      to be more useful to the readers on your list), but I am
      *astonished* at the number of individuals, even organizations, who
      have failed to make this very simple link:

      (1) Word uses coding;
      (2) VBA can completely control the Word application; thus
      (3) one can use VBA to instruct Word to locate its own coding and
      mark it with "tags" in XML syntax.

      So, the first thing to recognize is that Word uses coding on three
      levels of granularity: paragraph styles, character styles, and
      local formatting (this being just simple application of italic,
      bold, and so forth). All one has to do is find a way to express
      this Word "coding" as "real" codes in an ASCII text file.

      Within each Word document my script processes, I've instructed
      Word to:

      (1) Locate all instances of every type of local formatting (e.g.,
      italic, bold, etc.) and mark the text with simple codes:
      <I>...</I> for italic, <B>...</B> for bold, and so forth.
      (2) Locate all instances of character styles and mark the text
      with codes constructed of the name of the character style. For
      example, each range of text marked with Word's "Emphasis"
      character style would get marked with <Emphasis>...</Emphasis>
      codes.
      (3) Iterate through all the paragraphs in the document, marking
      each paragraph with codes constructed of the name of the paragraph
      style applied to it. For example, each paragraph marked with
      Word's "Heading 1" style would get marked with
      <Heading 1>...</Heading 1> codes.
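
      The three steps above can be sketched outside of Word. What
      follows is a minimal illustration in Python, not the VBA program
      described in this message. The Run and Paragraph classes and the
      tag_name helper are hypothetical stand-ins for what Word exposes
      through its object model. One further assumption: XML names
      cannot contain spaces, so the sketch turns a style name such as
      "Heading 1" into Heading_1 before using it as a tag.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Run:
    """Hypothetical stand-in for a stretch of uniformly formatted text."""
    text: str
    italic: bool = False
    bold: bool = False
    char_style: str = ""  # e.g. "Emphasis"; empty means no character style


@dataclass
class Paragraph:
    """Hypothetical stand-in for a paragraph and its paragraph style."""
    style: str  # e.g. "Heading 1"
    runs: list


def tag_name(style_name):
    """Build a legal XML name from a style name (assumed convention)."""
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", style_name.strip())
    # An XML name may not start with a digit, hyphen, or period.
    return name if re.match(r"[A-Za-z_]", name) else "_" + name


def mark_run(run):
    """Steps 1 and 2: wrap local formatting, then the character style."""
    text = escape(run.text)  # protect &, <, > already present in the text
    if run.italic:
        text = f"<I>{text}</I>"
    if run.bold:
        text = f"<B>{text}</B>"
    if run.char_style:
        name = tag_name(run.char_style)
        text = f"<{name}>{text}</{name}>"
    return text


def mark_paragraph(para):
    """Step 3: wrap the whole paragraph in a tag named after its style."""
    name = tag_name(para.style)
    body = "".join(mark_run(run) for run in para.runs)
    return f"<{name}>{body}</{name}>"


heading = Paragraph("Heading 1", [
    Run("Converting "),
    Run("legacy", italic=True),
    Run(" documents"),
])
print(mark_paragraph(heading))
```

      The order of wrapping (local formatting innermost, paragraph
      style outermost) is a choice made for this sketch; any fixed
      order that keeps the tags properly nested would serve.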

      The other major function the script handles is conversion of
      special, non-ASCII characters. Since Word 97, the DOC format has
      used Unicode to represent all non-ASCII characters. Even if you're
      working on a non-Unicode-aware OS, like the Mac or Windows 95,
      Word itself still uses Unicode. Word can also search for Unicode
      characters, just like it can search for plain old "a"s and "b"s.
      On the Internet, I found the ISOPUB tables which cross-referenced
      standard SGML character entities with the Unicode positions of the
      corresponding glyphs. Thus, I'm able to instruct Word to locate
      characters by their Unicode position and replace them with SGML
      entities (e.g., &ecaron;).
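
      The character replacement can be illustrated the same way. This
      is a minimal Python sketch, not the Word search-and-replace
      described above. ENTITY_NAMES is a small hand-picked excerpt
      using standard ISO entity names, not the full tables mentioned
      in the message, and the fallback to a numeric character
      reference for unmapped characters is an assumption of the
      sketch.

```python
# Hypothetical excerpt of a table mapping Unicode positions to entity names.
ENTITY_NAMES = {
    0x00E9: "eacute",  # e with acute accent
    0x011B: "ecaron",  # e with caron
    0x2014: "mdash",   # em dash
    0x201C: "ldquo",   # left double quotation mark
    0x201D: "rdquo",   # right double quotation mark
}


def to_entities(text, table=ENTITY_NAMES):
    """Replace every non-ASCII character with an entity reference."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        elif code_point in table:
            out.append(f"&{table[code_point]};")
        else:
            # Assumed fallback: numeric character reference
            out.append(f"&#{code_point};")
    return "".join(out)


print(to_entities("caf\u00e9 \u2014 \u201cquoted\u201d"))
```

      Whatever the input, the result contains only ASCII characters,
      which is the property the conversion is after.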

      These are the basic functions that allow one to extract usefully
      coded ASCII text from a Word file. <shameless plug>My own scripts
      do much more, such as compile lists of converted codes
      cross-referenced with the files they were found in, allow embedded
      notes to be written to separate files, code Word tables using the
      HTML table coding syntax, etc.</shameless plug> As already
      mentioned, the more one uses Word styles (even more importantly,
      the more *consistently* one does *anything*, including using
      styles), the more useful the output coding is.
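
      As an illustration of the table coding mentioned above, here is
      a minimal Python sketch that writes rows of cell text using HTML
      table syntax. It assumes the table has already been read into a
      list of rows, each a list of strings; the table_to_html name and
      that input shape are hypothetical, and the sketch ignores merged
      cells, header rows, and formatting inside cells.

```python
from html import escape


def table_to_html(rows):
    """Code a table (a list of rows of cell strings) as HTML table markup."""
    lines = ["<table>"]
    for row in rows:
        # Escape each cell so stray <, >, and & cannot break the markup.
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(table_to_html([["Style", "Code"], ["Emphasis", "a<b"]]))
```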

      I hope this information proves helpful. Please feel free to
      contact me here at Impressions any time for more information.

      Damon Butler <dbutler@...>