RE: [NH] Tidy weirdness
- Hmmm...the first thing that pops into my mind is that Word 2000 doesn't
produce HTML, it produces a MS-specification XML page. What you really need
is something to scrub out the XML. There are a couple of XML cleaners on
the Notetab library site at http://www.notetab.com/html.htm that you can
use. Look at the whole page, not just the XML portion of it.
Of course, you could do what I do and refuse to use Word as an HTML editor
;-) Even Front Page Express :-P does a better job of WYSIWYG layout than
Word and provides code that's easier to clean. Or just use Notetab.
I hope this helped a bit, and good luck!
From: swirus@... [mailto:swirus@...]
Sent: Monday, August 13, 2001 9:51 AM
Subject: [NH] Tidy weirdness
This is my first post; I've been a registered Notetab user for over a
year now. I've come up against an odd problem, and I can't see that
it's ever happened before. I don't know if the problem is Notetab or
Basically I've used MS Word 2000 to output a number of files into its
own version of HTML, which is extremely bloaty. The first thing I
want to do when I get them into Notetab is to run HTML Tidy with its
Word-2000 yes option to get rid of all of this gubbins. However, this
does not work. While it will tidy other documents correctly, it
appears to just select all, then nothing happens.
It's really odd behaviour, since I can't even tidy a subsection of
the code copied and pasted into a blank document. HTML Tidy will run
on the code in command line mode, but I find this interface very
unwieldy and am not comfortable using it.
Anybody experienced anything like this before? Am I doing something
Any advice would be great,
- There is a special tidy switch which can be used to clean Word HTML files:
word-2000: [yes|no]
The best way to get all the switch options from within Notetab is to use a
config file.
In my xhtml library (available from the library download repository on the
Notetab site) there are two tidy-related clips
which help generate a complete tidy config file via a Notetab wizard.
The wizard contains all the tidy config options including the 'word-2000: '
switch which you can try.
The two clips stand alone and can be taken out of the general xhtml library.
Included below.
> Basically I've used MS Word 2000 to output a number of files into its
> own version of HTML, which is extremely bloaty. The first thing I
> want to do when I get them into Notetab is to run HTML Tidy with its
> Word-2000 yes option to get rid of all of this gubbins. However, this
> does not work. While it will tidy other documents correctly, it
> appears to just select all, then nothing happens.
> It's really odd behaviour, since I can't even tidy a subsection of
> the code copied and pasted into a blank document. HTML Tidy will run
> on the code in command line mode, but I find this interface very
> unwieldy and am not comfortable using it.
Remember an email client will produce unwanted line breaks.
Each tidy switch option in the wizard will end with a closing bracket ']'.
;Ctrl -- set up config file
;Alt -- skip errors file
;Shift -- Tidy home page
;%Allow% array sets allowable file types to tidy.
^!IfFileExist "^$GetTidyExe$" NEXT
^!IfFileExist "^$GetAppPath$Tidy.cfg" NEXT Else Config
^!IfTrue "^$IsCtrlKeyDown$" Config Else Next
^!IfTrue "^$IsShiftKeyDown$" TidyHome Else Next
^!IF ^%i% > ^%allow^%i%% Exit
^!IFSame "^$GetExt(^$GetDocName$)$" ".^%allow^%i%%" Skip
^!IfTrue "^$IsAltKeyDown$" SKIP
^!Toolbar "New Document"
add-xml-pi: ^?[add-xml-pi: Add the XML processing instruction when outputting XML or XHTML=yes|_no]
add-xml-decl: ^?[add-xml-decl: add the XML declaration when outputting XML or
add-xml-space: ^?[add-xml-space: causes Tidy to add xml:space preserve to elements such as pre, style and script when generating XML.=_yes|no]
assume-xml-procins: ^?[assume-xml-procins: Change the parsing of processing instructions to require ?> as the terminator rather than >. =_yes|no]
break-before-br: ^?[break-before-br: Break before br tag=_yes|no]
char-encoding: ^?[char-encoding: Determines how Tidy interprets character
clean: ^?[clean: Strip out surplus presentational tags=_yes|no]
doctype: ^?[doctype: Adds doctype=omit|auto|_strict|loose]
drop-empty-paras: ^?[drop-empty-paras: Discard empty paragraphs=yes|_no]
drop-font-tags: ^?[drop-font-tags: discard font and center tags=_yes|no]
enclose-text: ^?[enclose-text: Enclose any text it finds in the body element within a p element=_yes|no]
enclose-block-text: ^?[enclose-block-text: insert a p element to enclose any text it finds in any element that allows mixed content for HTML transitional but not HTML strict=_yes|no]
fix-backslash: ^?[fix-backslash: Cause backslash characters "\" in URLs to be replaced by forward slashes "/"=_yes|no]
indent-attributes: ^?[indent-attributes: Begin each attribute on a new
indent-spaces: ^?[(M="0")indent-spaces: Number of spaces to indent
indent: ^?[(T=C)indent: Indent block-level tags=_no|yes|auto]
input-xml: ^?[input-xml: Is input xml=yes|_no]
keep-time: ^?[keep-time: If set, Tidy won't alter the last modified time for files it writes back to=_yes|no]
logical-emphasis: ^?[logical-emphasis: Replace any occurrence of i by em and any occurrence of b by strong=_yes|no]
markup: ^?[(T=C)markup: A pretty printed version of the markup.=_yes|no]
numeric-entities: ^?[numeric-entities: numeric-entities=yes|_no]
output-xhtml: ^?[output-xhtml: Output to xhtml=_yes|no]
output-xml: ^?[output-xml: Output to xml=yes|_no]
quiet: ^?[quiet: Do not output the welcome message or the summary of the numbers of errors and warnings. =yes|_no]
quote-ampersand: ^?[quote-ampersand: Cause unadorned & characters to be written out as &amp;=_yes|no]
quote-marks: ^?[quote-marks: Cause " characters to be written out as
quote-nbsp: ^?[quote-nbsp: Causes non-breaking space characters to be written out as entities=_yes|no]
show-warnings: ^?[show-warnings: Show warnings=_yes|no]
split: ^?[split: Use the input file to create a sequence of slides=yes|_no]
tab-size: ^?[(M="0")tab-size: number of columns between successive tab
tidy-mark: ^?[tidy-mark: Add a meta element to the document head to indicate that the document has been tidied=yes|_no]
word-2000: ^?[clean word 2000: word 2000=_yes|no]
wrap-asp: ^?[wrap-asp: wrap-asp=_yes|no]
wrap-php: ^?[wrap-php: wrap-php=_yes|no]
wrap-script-literals: ^?[wrap-script-literals: Wrap-script-literals=yes|_no]
wrap: ^?[(m=00)wrap: Right margin for line wrapping=0]
^!Save AS ^$GetAppPath$Tidy.cfg
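For anyone who would rather script this outside Notetab, here is a minimal Python sketch of the same idea: write a Tidy config file with 'word-2000: yes' switched on, then run Tidy against a document. The option names are taken from the wizard above, but the particular values, the function names and the "Tidy.cfg" and "tidy" defaults are assumptions chosen for illustration, not part of the clip.

```python
import subprocess
from pathlib import Path

# Option names appear in the wizard above; the values here are illustrative.
WORD_CLEANUP_OPTIONS = {
    "word-2000": "yes",
    "clean": "yes",
    "drop-empty-paras": "yes",
    "drop-font-tags": "yes",
    "output-xhtml": "yes",
    "tidy-mark": "no",
}


def render_tidy_config(options):
    """Return the text of a Tidy config file: one 'name: value' per line."""
    return "".join(f"{name}: {value}\n" for name, value in options.items())


def build_tidy_command(config_path, html_path, tidy_exe="tidy"):
    """Build the command line: -config names the config file, -m edits in place."""
    return [tidy_exe, "-config", str(config_path), "-m", str(html_path)]


def tidy_in_place(html_path, config_path="Tidy.cfg", tidy_exe="tidy"):
    """Write the config file, run Tidy on html_path and return Tidy's exit code."""
    Path(config_path).write_text(render_tidy_config(WORD_CLEANUP_OPTIONS))
    # Tidy exits with 1 for warnings and 2 for errors, so check=True is not used.
    command = build_tidy_command(config_path, html_path, tidy_exe)
    return subprocess.run(command).returncode
```

Calling tidy_in_place("report.html") would overwrite report.html, so try it on a copy first. The file name is hypothetical.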
- --- In ntb-html@y..., Jim Beidle <JBeidle@c...> wrote:
> Hmmm...the first thing that pops into my mind is that Word 2000 doesn't
> produce HTML, it produces a MS-specification XML page. What you really need
> is something to scrub out the XML. There are a couple of XML cleaners on
> the Notetab library site at http://www.notetab.com/html.htm that you can
> use. Look at the whole page, not just the XML portion of it.
I think I've isolated the problem as being that HTML Tidy does not
like the arbitrary line breaks used by Word, which fall in the middle
of tags and such, and who can blame it? Unfortunately I couldn't
join the lines because the documents are very long, and apparently
Notetab could not cope with such a long paragraph (100,000 characters with
all of that useless repeated formatting data). What I did was download
a Microsoft product which strips all of their proprietary XML from
the HTML - I got it at:
With that removed, the code fell to 40,000 characters, which was
small enough to join the lines and then run HTML Tidy, which worked its magic.
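If joining every line into one long paragraph is too much for the editor, an alternative is to join only the breaks that fall inside tags. The following is a rough Python sketch of that approach; the function name is made up for illustration, and it assumes that no attribute value or comment contains a literal '<' or '>', which real Word output may not guarantee.

```python
import re

# Matches one tag, from '<' to the next '>', even when it spans several lines.
TAG = re.compile(r"<[^<>]*>")


def join_breaks_inside_tags(html):
    """Replace line breaks that fall inside a tag with a single space."""
    def fix(match):
        # Collapse each break, plus any surrounding whitespace, to one space.
        return re.sub(r"\s*[\r\n]+\s*", " ", match.group(0))
    return TAG.sub(fix, html)
```

Line breaks in the text between tags are left alone, so the document keeps its line structure and no single line grows very long.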
> Of course, you could do what I do and refuse to use Word as an HTML editor
> ;-) Even Front Page Express :-P does a better job of WYSIWYG layout than
> Word and provides code that's easier to clean. Or just use Notetab.
If I honestly had any choice, I would not be using Word.
> I hope this helped a bit, and good luck!
Unfortunately, Frontpage isn't part of my installation. Mind you, if
I honestly had any choice, I'd be soaking up some rays in the south
of France right now. Notwithstanding my personal bitterness, thanks
for your advice, Jim.
- --- In ntb-html@y..., "Grant" <emerge@p...> wrote:
> There is a special tidy switch which can be used to clean Word HTML files:
> word-2000: [yes|no]
> The best way to get all the switch options from within Notetab is to use a
> config file.
> In my xhtml library (available from the library download repository on the
> Notetab site) there are two tidy-related clips
> which help generate a complete tidy config file via a Notetab wizard.
> The wizard contains all the tidy config options including the 'word-2000: '
> switch which you can try.
> The two clips stand alone and can be taken out of the general xhtml library.
> Included below.
This looks like a much more elegant way of configuring tidy. I have a
solution to my current problems (see other mail) but I shall download
your libraries for future use (I don't trust myself to figure out
where the line breaks go after so many brain frazzling hours of
The trouble with tidy, as with so many things in the computer world,
is that there is a constant battle between power (and HTML Tidy is
powerful) and complexity. What I like about it is that, with the
default options, it generally does a good job. But HTML authoring or
programming is not my main job, so I simply haven't the time to learn
the finer points of configuration. It looks like your scripts take the
edge off this, for which many thanks.
- Hi Jim and Swirus
> Hmmm...the first thing that pops into my mind is that Word 2000 doesn't
> produce HTML, it produces a MS-specification XML page. What you
> really need
> is something to scrub out the XML. There are a couple of XML cleaners on
> the Notetab library site at http://www.notetab.com/html.htm that you can
> use. Look at the whole page, not just the XML portion of it.
There's also the official HTML filter from Microsoft. I picked up my copy
from a magazine cover disk, but try a search for the file "msohtmlf2.exe".
This is v2 of the Microsoft Office HTML filter.
It does a number of things, including placing an "Export to compact HTML"
button on the standard toolbar and adding export options to the File
menu, including one to create a CSS file from your document. HTML Tidy
will still find some garbage to correct, but it's a massive improvement on
the standard output.
> There's also the official HTML filter from Microsoft. I
> picked up my copy from a magazine cover disk, but try a
> search for the file "msohtmlf2.exe". This is v2 of the
> Microsoft Office HTML filter.
It's at http://office.microsoft.com/downloads/2000/Msohtmf2.aspx