RE: [NH] Tidy weirdness

  • Jim Beidle
    Message 1 of 6 , Aug 13, 2001
      Hmmm...the first thing that pops into my mind is that Word 2000 doesn't
      produce HTML, it produces a MS-specification XML page. What you really need
      is something to scrub out the XML. There are a couple of XML cleaners on
      the Notetab library site at http://www.notetab.com/html.htm that you can
      use. Look at the whole page, not just the XML portion of it.

      Of course, you could do what I do and refuse to use Word as an HTML editor
      ;-) Even Front Page Express :-P does a better job of WYSIWYG layout than
      Word and provides code that's easier to clean. Or just use Notetab.

      I hope this helped a bit, and good luck!

      Jim

      -----Original Message-----
      From: swirus@... [mailto:swirus@...]
      Sent: Monday, August 13, 2001 9:51 AM
      To: ntb-html@yahoogroups.com
      Subject: [NH] Tidy weirdness


      Hello,

      This is my first post, although I've been a registered Notetab user for
      over a year now. I've come up against an odd problem, and I can't see
      that it's ever happened before. I don't know if the problem is Notetab
      or HTML Tidy.

      Basically I've used MS Word 2000 to output a number of files into its
      own version of HTML, which is extremely bloaty. The first thing I
      want to do when I get them into Notetab is to run HTML Tidy with its
      Word-2000 yes option to get rid of all of this gubbins. However, this
      does not work. While it will tidy other documents correctly, it
      appears to just select all, then nothing happens.

      It's really odd behaviour, since I can't even tidy a subsection of
      the code copied and pasted into a blank document. HTML Tidy will run
      on the code in command line mode, but I find this interface very
      unwieldy and am not comfortable using it.

      Anybody experienced anything like this before? Am I doing something
      stupid?

      Any advice would be great,
      John




    • Grant
      Message 2 of 6 , Aug 14, 2001
        There is a special Tidy switch which can be used to clean Word HTML files:
        word-2000: [yes|no]
        The best way to get at all the switch options from within Notetab is to
        use a config file.
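
As a rough illustration of the config-file approach, the Python sketch below renders a small Tidy config and assembles a command line that points Tidy at it. The file names (tidy_word.cfg, report.htm) and the helper names are hypothetical. The option names are ones that appear in the wizard later in this message, and -config and -m are the classic command-line Tidy switches for naming a config file and modifying the input file in place. The sketch only builds the command; it does not run Tidy.

```python
# Options taken from this thread; the values are illustrative, not recommendations.
WORD_CLEANUP_OPTIONS = {
    "word-2000": "yes",         # strip Word 2000 specific markup
    "clean": "yes",             # strip surplus presentational tags
    "drop-empty-paras": "yes",  # discard empty paragraphs
}


def render_config(options):
    """Return Tidy config text: one 'name: value' pair per line."""
    return "".join(f"{name}: {value}\n" for name, value in options.items())


def build_command(config_path, html_path, tidy_exe="tidy"):
    """Assemble (but do not run) a Tidy command line that uses a config file."""
    return [tidy_exe, "-config", str(config_path), "-m", str(html_path)]


if __name__ == "__main__":
    # Hypothetical file names, shown only to make the output concrete.
    print(render_config(WORD_CLEANUP_OPTIONS), end="")
    print(" ".join(build_command("tidy_word.cfg", "report.htm")))
```

Keeping the options in a file rather than on the command line is the same design choice the clips make: the file can be generated once and reused for every document.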

        In my xhtml library (available from the library download repository on the
        Notetab site) there are two Tidy-related clips,
        -tidy
        _TidyConfigSetup
        which help generate a complete Tidy config file via a Notetab wizard.
        The wizard contains all the Tidy config options, including the
        'word-2000: ' switch, which you can try.
        The two clips stand alone and can be taken out of the general xhtml
        library. They are included below.





        > Basically I've used MS Word 2000 to output a number of files into its
        > own version of HTML, which is extremely bloaty. The first thing I
        > want to do when I get them into Notetab is to run HTML Tidy with its
        > Word-2000 yes option to get rid of all of this gubbins. However, this
        > does not work. While it will tidy other documents correctly, it
        > appears to just select all, then nothing happens.
        >
        > It's really odd behaviour, since I can't even tidy a subsection of
        > the code copied and pasted into a blank document. HTML Tidy will run
        > on the code in command line mode, but I find this interface very
        > unwieldy and am not comfortable using it.


        Remember that an email client will produce unwanted line breaks, so
        rejoin any wrapped lines before using the clips. Each Tidy switch option
        in the wizard should end with a closing bracket ']'.

        H="-tidy"
        ;Modifier Keys
        ;Ctrl -- set up config file
        ;Alt -- skip errors file
        ;Shift -- Tidy home page
        ;%Allow% array sets allowable file types to tidy.
        ^!IfFileExist "^$GetTidyExe$" NEXT
        ^!IfFileExist "^$GetAppPath$Tidy.cfg" NEXT Else Config
        ^!IfTrue "^$IsCtrlKeyDown$" Config Else Next
        ^!IfTrue "^$IsShiftKeyDown$" TidyHome Else Next
        ^!SetArray %allow%=htm;asp;xml;html;php
        ^!SET %i%=0
        :loop
        ^!INC:%i%
        ;%allow0% holds the number of elements in the %allow% array
        ^!IF ^%i% > ^%allow0% Exit
        ^!IFSame "^$GetExt(^$GetDocName$)$" ".^%allow^%i%%" Skip
        ^!Goto loop
        ^!RunTidy
        ^!Save
        ^!IfTrue "^$IsAltKeyDown$" SKIP
        ^!open ^$GetAppPath$errors.txt
        ^!Jump doc_start
        ^!GoTo EXIT
        :Config
        ^!Clip:TidyConfigSetup
        ^!Goto Exit
        :TidyHome
        ^!URL http://www.w3.org/People/Raggett/tidy
        ^!Goto Exit

        H="_TidyConfigSetup"
        ^!Toolbar "New Document"
        error-file: ^$GetAppPath$errors.txt
        add-xml-pi: ^?[add-xml-pi: Add the XML processing instruction when outputting XML or XHTML=yes|_no]
        add-xml-decl: ^?[add-xml-decl: Add the XML declaration when outputting XML or XHTML=yes|_no]
        add-xml-space: ^?[add-xml-space: Causes Tidy to add xml:space preserve to elements such as pre, style and script when generating XML.=_yes|no]
        assume-xml-procins: ^?[assume-xml-procins: Change the parsing of processing instructions to require ?> as the terminator rather than >. =_yes|no]
        break-before-br: ^?[break-before-br: Break before br tag=_yes|no]
        char-encoding: ^?[char-encoding: Determines how Tidy interprets character streams=raw|_ascii|latin1|utf8|iso2022]
        clean: ^?[clean: Strip out surplus presentational tags=_yes|no]
        doctype: ^?[doctype: Adds doctype=omit|auto|_strict|loose]
        drop-empty-paras: ^?[drop-empty-paras: Discard empty paragraphs=yes|_no]
        drop-font-tags: ^?[drop-font-tags: Discard font and center tags=_yes|no]
        enclose-text: ^?[enclose-text: Enclose any text it finds in the body element within a p element=_yes|no]
        enclose-block-text: ^?[enclose-block-text: Insert a p element to enclose any text it finds in any element that allows mixed content for HTML transitional but not HTML strict=_yes|no]
        fix-backslash: ^?[fix-backslash: Cause backslash characters "\" in URLs to be replaced by forward slashes "/"=_yes|no]
        indent-attributes: ^?[indent-attributes: Begin each attribute on a new line=yes|_no]
        indent-spaces: ^?[(M="0")indent-spaces: Number of spaces to indent content=2]
        indent: ^?[(T=C)indent: Indent block-level tags=_no|yes|auto]
        input-xml: ^?[input-xml: Is input xml=yes|_no]
        keep-time: ^?[keep-time: If set, Tidy won't alter the last modified time for files it writes back to=_yes|no]
        logical-emphasis: ^?[logical-emphasis: Replace any occurrence of i by em and any occurrence of b by strong=_yes|no]
        markup: ^?[(T=C)markup: A pretty printed version of the markup.=_yes|no]
        numeric-entities: ^?[numeric-entities: numeric-entities=yes|_no]
        output-xhtml: ^?[output-xhtml: Output to xhtml=_yes|no]
        output-xml: ^?[output-xml: Output to xml=yes|_no]
        quiet: ^?[quiet: Do not output the welcome message or the summary of the numbers of errors and warnings. =yes|_no]
        quote-ampersand: ^?[quote-ampersand: Cause unadorned & characters to be written out as &amp;=_yes|no]
        quote-marks: ^?[quote-marks: Cause " characters to be written out as &quot;=yes|_no]
        quote-nbsp: ^?[quote-nbsp: Causes non-breaking space characters to be written out as entities=_yes|no]
        show-warnings: ^?[show-warnings: Show warnings=_yes|no]
        split: ^?[split: Use the input file to create a sequence of slides=yes|_no]
        tab-size: ^?[(M="0")tab-size: Number of columns between successive tab stops=2]
        tidy-mark: ^?[tidy-mark: Add a meta element to the document head to indicate that the document has been tidied=yes|_no]
        word-2000: ^?[clean word 2000: word 2000=_yes|no]
        wrap-asp: ^?[wrap-asp: wrap-asp=_yes|no]
        wrap-php: ^?[wrap-php: wrap-php=_yes|no]
        wrap-script-literals: ^?[wrap-script-literals: Wrap-script-literals=yes|_no]
        wrap: ^?[(m=00)wrap: Right margin for line wrapping=0]
        ^!Save AS ^$GetAppPath$Tidy.cfg
        ^!Close
        ^!GoTo EXIT
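
To check what the wizard actually wrote, something like the following Python sketch could read the saved config back and confirm that word-2000 is switched on. It assumes the file is plain "name: value" lines, as in the clip above; the function name and the sample path in SAMPLE are hypothetical.

```python
def parse_tidy_config(text):
    """Parse 'name: value' lines into a dict, skipping lines without a colon."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values such as Windows paths survive.
        name, value = line.split(":", 1)
        options[name.strip()] = value.strip()
    return options


# Hypothetical sample of what the generated Tidy.cfg might contain.
SAMPLE = "error-file: C:\\NoteTab\\errors.txt\nword-2000: yes\nwrap: 0\n"

if __name__ == "__main__":
    config = parse_tidy_config(SAMPLE)
    print("word-2000 enabled:", config.get("word-2000") == "yes")
```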
      • swirus@yahoo.com
        Message 3 of 6 , Aug 14, 2001
          --- In ntb-html@y..., Jim Beidle <JBeidle@c...> wrote:
          > Hmmm...the first thing that pops into my mind is that Word 2000 doesn't
          > produce HTML, it produces a MS-specification XML page. What you really need
          > is something to scrub out the XML. There are a couple of XML cleaners on
          > the Notetab library site at http://www.notetab.com/html.htm that you can
          > use. Look at the whole page, not just the XML portion of it.

          I think I've isolated the problem: HTML Tidy does not like the
          arbitrary line breaks used by Word, which fall in the middle of tags
          and such, and who can blame it? Unfortunately I couldn't join the
          lines because the documents are very long, and apparently Notetab
          would not accept such a long paragraph (100,000 characters with all
          of that useless repeated formatting data). What I did was download
          a Microsoft product which strips all of their proprietary XML from
          the HTML. I got it at:

          http://office.microsoft.com/downloads/2000/Msohtmf2.aspx

          With that removed, the code had fallen to 40,000 characters, small
          enough to join the lines and then run HTML Tidy, which worked its
          magic.
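
If the diagnosis above is right, one alternative to joining the whole document into a single paragraph would be to remove only the line breaks that fall inside tags. The Python sketch below is a minimal illustration of that idea, not a tested fix for Word output: the function name is hypothetical, and the simple pattern assumes no '>' characters inside attribute values, comments or scripts.

```python
import re

# Matches from '<' to the next '>' and may span several lines.
TAG_RE = re.compile(r"<[^<>]*>")


def join_breaks_inside_tags(html):
    """Replace line breaks that fall inside a tag with a single space."""
    def fix(match):
        # Collapse each line break, plus surrounding whitespace, to one space.
        return re.sub(r"\s*[\r\n]+\s*", " ", match.group(0))
    return TAG_RE.sub(fix, html)


if __name__ == "__main__":
    broken = '<p class="MsoNormal"\n   style="margin:0">Hello\nworld</p>'
    print(join_breaks_inside_tags(broken))
```

Line breaks in the text between tags are left alone, so the document keeps short lines and never has to be held as one very long paragraph.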

          > Of course, you could do what I do and refuse to use Word as a HTML editor
          > ;-) Even Front Page Express :-P does a better job of wysiwyg layout than
          > Word and provides code that's easier to clean. Or just use Notetab
          >
          > I hope this helped a bit, and good luck!

          If I honestly had any choice, I would not be using Word.
          Unfortunately, Frontpage isn't part of my installation. Mind you, if
          I honestly had any choice, I'd be soaking up some rays in the south
          of France right now. Notwithstanding my personal bitterness, thanks
          for your advice, Jim.

          John.
        • swirus@yahoo.com
          Message 4 of 6 , Aug 14, 2001
            --- In ntb-html@y..., "Grant" <emerge@p...> wrote:
            > There is a special tidy switch which can be used to clean word html files.
            > word-2000: [yes|no]
            > The best way to get all the switch options from withen notetab is to use a
            > config file.
            >
            > In my xhtml library (available from the library download repository on the
            > notetab site)there are two tidy related clips
            > -tidy
            > _TidyConfigSetup
            > which help generate a complete tidy config file via a notetab wizard.
            > The wizard contains all the tidy config options including the 'word-2000: '
            > switch which you can try.
            > The two clips stand alone and be taken out of the general xhtml library.
            > Included below
            >
            This looks like a much more elegant way of configuring Tidy. I have a
            solution to my current problems (see my other mail), but I shall
            download your libraries for future use (I don't trust myself to
            figure out where the line breaks go after so many brain-frazzling
            hours of correcting MSHTML(TM)).

            The trouble with Tidy, as with so many things in the computer world,
            is that there is a constant battle between power (and HTML Tidy is
            powerful) and complexity. What I like about it is that it generally
            does a good job with the default options. But HTML author or
            programmer is not my main job, so I simply haven't the time to learn
            the finer points of configuration. It looks like your scripts take
            the edge off this, for which many thanks.

            Cheers,
            John
          • Greg Chapman
            Message 5 of 6 , Aug 14, 2001
              Hi Jim and Swirus

              > Hmmm...the first thing that pops into my mind is that Word 2000 doesn't
              > produce HTML, it produces a MS-specification XML page. What you really need
              > is something to scrub out the XML. There are a couple of XML cleaners on
              > the Notetab library site at http://www.notetab.com/html.htm that you can
              > use. Look at the whole page, not just the XML portion of it.

              There's also the official HTML filter from Microsoft. I picked up my copy
              from a magazine cover disk, but try a search for the file "msohtmlf2.exe".
              This is v2 of the Microsoft Office HTML filter.

              It does a number of things, including placing an "Export to compact HTML"
              button on the standard toolbar and adding export options to the File
              menu, one of which creates a CSS file from your document. HTML Tidy
              will still find some garbage to correct, but it's a massive improvement
              on the standard output.

              Greg
            • Bob Janes
              Message 6 of 6 , Aug 14, 2001
                > There's also the official HTML filter from Microsoft. I
                > picked up my copy from a magazine cover disk, but try a
                > search for the file "msohtmlf2.exe". This is v2 of the
                > Microsoft Office HTML filter.

                It's at http://office.microsoft.com/downloads/2000/Msohtmf2.aspx

                Best wishes

                Bob

                --

                Bob Janes
                Organisational Consultant
                +44 (7850) 150133
                PO Box 211 Welwyn AL6 0EX UK
                mailto:bob.janes@...
                www.webster-and-janes.co.uk