RE: The elephant in the room is... (was RE: [caplet] Re: ADsafe, Take 6)

  • Larry Masinter
    Message 1 of 30, Oct 23 8:40 AM
      On standards:

      The benefit of HTTP and XML and HTML is not that they are well-designed protocol and syntax and language, but that there are many different and (more-or-less) interoperable implementations for many operating systems and languages and well-deployed support infrastructure; with enough general agreement about them at the lower layers that you can get on with it defining the next layer up. So a "let's redesign them to be cleaner" effort isn't helpful, really. You'd have to be 10 times or 100 times better before getting traction.

      Protocol designers don't "gloss over" escaping; protocol designers are software developers (or maybe developers-gone-bad) for whom escaping is generally an ugly after-the-fact design addition, or a compromise between allowing everything to be encoded and letting simple cases be encoded concisely. Think of it as Huffman coding at the design level.

      <p>&lt;html&gt;... &lt;/html&gt;</p>

      is how you write html in html, not because &#xxx; and &symbolname; are wonderful quoting mechanisms, but because the &entity; syntax was already there, and inventing another one for what seemed like an uncommon case appeared unnecessary.
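
      A tiny illustration of the mechanism, nothing new in it: a literal "<" can be written as the named reference &lt; or as the numeric reference &#60;, and a minimal escaper is just character-for-character substitution. A rough sketch, with made-up names:

        // Hypothetical helper: escape text for inclusion in HTML element content.
        // Only the characters that can start markup or a reference get replaced.
        function escapeHtmlText(s: string): string {
          return s
            .replace(/&/g, "&amp;")   // first, so we don't re-escape our own output
            .replace(/</g, "&lt;")
            .replace(/>/g, "&gt;");
        }

        // escapeHtmlText("<html> ... </html>")
        //   => "&lt;html&gt; ... &lt;/html&gt;"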

      On quoting:

      No matter what your escaping and encoding system, developers will have problems with them: you either understand the general principle or you don't.

      I know (from ancient experience) that most programmers learning Lisp had trouble thinking about X, (QUOTE X) and (EVAL X). The problem is keeping track of the different layers of interpretation – it isn't the syntax.

      Common Lisp added many other escaping conventions: `(let ((,x "abc\"def") (y ',z)) ,w) so it's hard to claim that S-expressions have consistent delimiters.

      On layering of escaping/quoting:

      The multiple layered quoting systems work well enough, because each layer does its own escaping/encoding and unescaping/unencoding and tools either hide or assist with the process. It's only when you're writing a program trying to process multiple layers simultaneously that you have trouble.
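
      To make "each layer does its own escaping" concrete: a sketch, with made-up names, of a page dropping a user-supplied string into a JavaScript string literal inside an HTML attribute, escaping once per layer:

        // Layer 1 (innermost): escape for a single-quoted JavaScript string literal.
        function escapeJsString(s: string): string {
          return s.replace(/\\/g, "\\\\").replace(/'/g, "\\'").replace(/</g, "\\x3c");
        }

        // Layer 2 (outer): escape for a double-quoted HTML attribute value.
        function escapeHtmlAttr(s: string): string {
          return s.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
        }

        // Compose the layers innermost-first; the browser peels them off in the
        // reverse order (HTML attribute decoding, then the JS string literal).
        const name = "O'Hara <script>";
        const handler = "greet('" + escapeJsString(name) + "')";
        const markup = '<button onclick="' + escapeHtmlAttr(handler) + '">hi</button>';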

      The reason that ADsafe is having trouble is that it is trying to do filtering without actually using the normal layer software for parsing and interpretation, skipping what turns out to be necessary complexity. Try to write a regular expression that determines whether a Lisp program might divide by zero, and you'd have similar problems.


      On checking URLs:

      I think you can't check for invalid URLs by examining a program's syntax because (a) programs can construct URLs, and you can't check for invalid (vs. valid) URLs any more than you can do all array bounds checking at compile time, and (b) the rules for what constitutes a "safe" URL are complicated and evolving. After all, a URL is just a reference to a registry of protocols, which requires the registering body to define some syntax for how the URL might identify something or invoke some protocol or process. Each URL scheme has its own syntax and story for what might be "safe" to execute in different contexts, but that depends as much on the implementation of the URL-interpreter as anything else.

      If you're going to do dynamic URL safety checking, there's not much point in doing syntactic checking, because you'll get lots of false positives ("this is unsafe" when it isn't) and won't catch any more problems syntactically than would be caught by the run-time check.
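
      For what it's worth, the run-time side of that can be small; here's only a sketch, assuming the policy is simply "http and https, nothing else" -- the policy, not the code, is the complicated and evolving part:

        // Hypothetical run-time gate: allow only http(s), reject javascript:,
        // data:, unknown schemes, and anything that doesn't parse at all.
        // Relative URLs resolve against the (made-up) base and are allowed.
        function isAllowedUrl(candidate: string): boolean {
          let parsed: URL;
          try {
            parsed = new URL(candidate, "https://example.invalid/");
          } catch {
            return false;
          }
          return parsed.protocol === "http:" || parsed.protocol === "https:";
        }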

      Larry
    • Mike Samuel
      Message 2 of 30, Oct 23 11:13 AM
        On 23/10/2007, Larry Masinter <lmm@...> wrote:
        >
        > On standards:
        >
        > The benefit of HTTP and XML and HTML is not that they are well-designed protocol and syntax and language, but that there are many different and (more-or-less) interoperable implementations for many operating systems and languages and well-deployed support infrastructure; with enough general agreement about them at the lower layers that you can get on with it defining the next layer up. So a "let's redesign them to be cleaner" effort isn't helpful, really. You'd have to be 10 times or 100 times better before getting traction.
        >
        > Protocol designers don't "gloss over" escaping; protocol designers are software developers (or maybe developers-gone-bad) for whom escaping is generally an ugly after-the-fact design addition, or a compromise between allowing everything to be encoded and letting simple cases be encoded concisely. Think of it as Huffman coding at the design level.

        Maybe I'm being horribly unfair to protocol designers, but implementors do.

        An example is entities in URIs embedded in HTML.
        <a href="foo?bar=a&baz=b">
        is invalid HTML (a literal "&" in an attribute value should be written
        "&amp;"), but browser implementors, faced with a bunch of markup writers
        who don't understand escaping, decided that they should guess what it
        means, so it means something vastly different from
        <a href="foo?bar=a&=c">


        > <p>&lt;html&gt;... &lt;/html&gt;</p>
        >
        > is how you write html in html, not because &#xxx; and &symbolname; are wonderful quoting mechanisms, but because the &entity; syntax was already there, and inventing another one for what seemed like an uncommon case appeared unnecessary.

        That's how it's done today, and that's the source of a huge number of problems.

        And a point I was trying to make (don't remember if I made it) was
        that embedding one structured language in another shouldn't require
        escaping. You should be able to include the parse tree directly using
        some kind of consistent (quote ...) syntax.
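
        Roughly what I mean, as a purely hypothetical shape (this syntax doesn't exist anywhere): the outer document carries the inner language as structured data, so nothing ever needs escaping or re-parsing:

          // Hypothetical: an href carried as a parse tree rather than as an
          // escaped string. No &amp; anywhere, and no second parser needed.
          type UrlNode = {
            kind: "url";
            scheme: string;
            path: string[];
            query: Array<[string, string]>;
          };

          const link: UrlNode = {
            kind: "url",
            scheme: "https",
            path: ["foo"],
            query: [["bar", "a"], ["baz", "b"]],   // literal values
          };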


        > On quoting:
        >
        > No matter what your escaping and encoding system, developers will have problems with them: you either understand the general principle or you don't.

        Agreed. But there's a third possibility:

        I understand them and I've got deadlines and it'll usually work unless
        some lunatic gives their kid a name with apostrophes, so I'll not
        think through it now, and then come back to it once I've gotten some
        sleep which'll happen never.


        > I know (from ancient experience) that most programmers learning Lisp had trouble thinking about X, (QUOTE X) and (EVAL X). The problem is keeping track of the different layers of interpretation – it isn't the syntax.

        I think there's something fundamentally different about embedding by
        escaping and embedding by attaching another language's parse tree to
        your own. Escaping requires clients of your language to also be able to
        parse any embeddable languages, and to agree on the parse tree that any
        given string should produce.


        > Common Lisp added many other escaping conventions: `(let ((,x "abc\"def") (y ',z)) ,w) so it's hard to claim that S-expressions have consistent delimiters.

        Fair enough, but that's only a concern if you are embedding using escaping.

        > On layering of escaping/quoting:
        >
        > The multiple layered quoting systems work well enough, because each layer does its own escaping/encoding and unescaping/unencoding and tools either hide or assist with the process. It's only when you're writing a program trying to process multiple layers simultaneously that you have trouble.

        Quite right. If you're trying to identify a "secure" subset of any
        system (like web applications) that uses multiple languages, then you
        have two choices:
        - identify a safe subset of each language individually
        - deal with all languages at once and try to identify a safe subset of the union

        The first approach is certainly easier but has
        least-common-denominator problems -- you have to exclude things that
        might be allowable under the second approach.

        An example:
        <a href="javascript:foo()">clicky</a>

        There are three languages in play here: Javascript, URIs, and HTML.

        If you consider the three individually, you have to conclude that the
        HTML is safe, the URI is not, and you never consider the javascript
        itself.

        If you deal with all languages at once, then you can apply your
        javascript verification recursively to the URI.

        To do that you have to parse all three languages and then hope that
        your parse trees agree with browsers' interpretations. My point is
        that's easier if you have a unified parse tree representation for all
        languages that appear in a document.
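
        To make the recursive idea concrete, a sketch only; the node shape and the stubbed-out JavaScript verifier below are made up, not anything ADsafe actually provides:

          // Hypothetical unified tree: every node is tagged with the language
          // it belongs to, so one walker can hand each piece to the right checker.
          type LangNode =
            | { lang: "html"; tag: string; attrs: Record<string, LangNode>; children: LangNode[] }
            | { lang: "url"; scheme: string; body: string }
            | { lang: "js"; source: string };

          // Stub standing in for a real JavaScript verifier (whatever checker
          // you trust); rejecting everything is the safe default for a sketch.
          function verifyJs(source: string): boolean {
            return false;
          }

          function verify(node: LangNode): boolean {
            switch (node.lang) {
              case "js":
                return verifyJs(node.source);
              case "url":
                // A javascript: URL gets its payload verified as JavaScript
                // instead of being rejected just for being a javascript: URL.
                if (node.scheme === "javascript") return verifyJs(node.body);
                return node.scheme === "http" || node.scheme === "https";
              case "html":
                return Object.values(node.attrs).every(verify) &&
                       node.children.every(verify);
            }
          }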


        > The reason that ADsafe is having trouble is that it is trying to do filtering without actually using the normal layer software for parsing and interpretation, skipping what turns out to be necessary complexity. Try to write a regular expression that determines whether a Lisp program might divide by zero, and you'd have similar problems.
        >
        >
        > On checking URLs:
        >
        > I think you can't check for invalid URLs by examining a program's syntax because (a) programs can construct URLs, and you can't check for invalid (vs. valid) URLs any more than you can do all array bounds checking at compile time, and (b) the rules for what constitutes a "safe" URL are complicated and evolving. After all, a URL is just a reference to a registry of protocols, which requires the registering body to define some syntax for how the URL might identify something or invoke some protocol or process. Each URL scheme has its own syntax and story for what might be "safe" to execute in different contexts, but that depends as much on the implementation of the URL-interpreter as anything else.

        I disagree. URLs are more than pointers into a registry of handlers.
        Specifically, javascript: and data: URLs contain data that can be
        classified.

        Again, you'll get fewer false negatives if you do that classification
        in the context of the larger document.
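
        The payload is sitting right there in the URL; a rough sketch of pulling it out for classification (simple textual forms only, nothing like a full RFC 2397 parser, and the names are made up):

          // Hypothetical extraction of classifiable content from javascript:
          // and data: URLs; anything else carries no content of its own.
          function embeddedContent(url: string): { kind: string; body: string } | null {
            if (url.startsWith("javascript:")) {
              return { kind: "js", body: decodeURIComponent(url.slice(11)) };
            }
            const m = /^data:([^,]*?)(;base64)?,(.*)$/s.exec(url);
            if (m) {
              const body = m[2] ? atob(m[3]) : decodeURIComponent(m[3]);
              return { kind: m[1] || "text/plain", body };
            }
            return null;
          }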


        > If you're going to do dynamic URL safety checking, there's not much point in doing syntactic checking, because you'll get lots of false positives ("this is unsafe" when it isn't) and won't catch any more problems syntactically than would be caught by the run-time check.

        There's always a reason to do static checking since it lets you skip
        runtime checks :) But runtime checks are out of scope for ADsafe.


        > Larry