Re: Some data related to the frequency of cache-busting

  • Benjamin Franz
    Message 1 of 25, Dec 1, 1996
      On Sat, 30 Nov 1996, David W. Morris wrote:


      > Other than carefully defining the difference between a History mechanism
      > and Caching, we did NOTHING! A protocol mechanism is needed so that
      > the server (applications) can influence browser history presentation.
      > The caching subgroup explicitly chose to defer this issue.

      Yup. This problem will not clear until the author has the ability to
      distinguish between 'This page *MUST NOT* ever be displayed from a
      history', 'This page *MAY* be redisplayed from a history but *MUST NOT* be
      refetched when displayed from history', 'This page *MAY* be redisplayed
      from a history, but *MUST* be refetched first', and 'This page *MAY* be
      redisplayed from history, unconditionally.' The last two cases can be
      handled by properly implementing the existing Expires and cache control
      directives. But the first two cases cannot be done at all right now - and
      are both quite important.
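
      As a purely illustrative sketch (Python, hypothetical helper names), these
      are roughly the response headers a server can aim at cases 3 and 4 with
      today; whether a browser's history mechanism pays any attention to them is
      exactly what the rest of the thread disputes:

        # Sketch only: headers for cases 3 and 4.  Nothing here constrains
        # history behaviour unless browsers choose to honor it.
        import time
        from email.utils import formatdate

        def case3_headers():
            # Case 3: may be redisplayed, but MUST be refetched/validated first.
            return {"Expires": formatdate(time.time() - 1, usegmt=True),
                    "Cache-Control": "no-cache"}

        def case4_headers(ttl=3600):
            # Case 4: may be redisplayed unconditionally (here, for an hour).
            return {"Expires": formatdate(time.time() + ttl, usegmt=True),
                    "Cache-Control": "max-age=%d" % ttl}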

      --
      Benjamin Franz
    • Larry Masinter
      Message 2 of 25, Dec 1, 1996
        You said of


        1 'This page *MUST NOT* ever be displayed from a history'
        2 'This page *MAY* be redisplayed from a history but *MUST NOT* be
        refetched when displayed from history'
        3 'This page *MAY* be redisplayed from a history, but *MUST* be
        refetched first'
        4 'This page *MAY* be redisplayed from history, unconditionally.'

        that

        # The last two cases can be handled by properly implementing the
        # existing Expires and cache control directives.

        but I don't believe there are ANY http directives that place any
        requirements on the handling of history lists, to the point where HTTP
        _only_ requires 4.

        In fact, there are some browsers where doing much of anything else
        doesn't make much sense. For example, there was a two-dimensional
        infinite-plane browser where the 'history' was always completely
        visible, albeit in perspective.

        However, I'm a little fuzzy on why lack-of-controls of history makes
        'cache-busting' more of a problem, or lessens the value of hit
        metering.

        Larry
      • Shel Kaphan
        Message 3 of 25, Dec 1, 1996
          Larry Masinter writes:
          > You said of
          >
          >
          > 1 'This page *MUST NOT* ever be displayed from a history'
          > 2 'This page *MAY* be redisplayed from a history but *MUST NOT* be
          > refetched when displayed from history'
          > 3 'This page *MAY* be redisplayed from a history, but *MUST* be
          > refetched first'
          > 4 'This page *MAY* be redisplayed from history, unconditionally.'
          >
          > that
          >
          > # The last two cases can be handled by properly implementing the
          > # existing Expires and cache control directives.
          >
          > but I don't believe there are ANY http directives that place any
          > requirements on the handling of history lists, to the point where HTTP
          > _only_ requires 4.
          >
          > In fact, there are some browsers where doing much of anything else
          > doesn't make much sense. For example, there was a two-dimensional
          > infinite-plane browser where the 'history' was always completely
          > visible, albeit in perspective.
          >

          (I know I'm repeating myself here, so bear with me):

          We just need to define the difference between a cache, which is used
          exclusively for performance improvements and is supposed to be
          semantically transparent, and __any other client-local storage of
          fetched results__, which may be used for whatever purpose desired by the
          client (this includes "history"). This was done, to an extent, for
          1.1. The issue is that the rules for controlling the cache should not
          be mixed up with the rules for the other local storage. The design
          problem is that nobody wants to constrain browser design more than
          necessary to make services predictable and reliable, and that to even
          talk about this kind of thing we have to go beyond "bits on the wire".

          > However, I'm a little fuzzy on why lack-of-controls of history makes
          > 'cache-busting' more of a problem, or lessens the value of hit
          > metering.
          >
          > Larry
          >
          >

          Use of extra-protocol solutions like unique URLs is a problem for
          caching, especially if they can't be combined with appropriate cache
          controls for fear of making browsers act badly. These types of
          solutions may not be a problem for hit metering, except they make
          accumulating statistics more complex, because now many different URLs
          as seen by clients are actually "the same" URL from the server
          statistics point of view.

          Since caches and other local storage are typically mixed up, certain
          uses of certain HTTP headers will have unintended consequences. So,
          people resort to solutions that are outside the protocol, e.g. unique
          URLs.

          To repeat again the oft-repeated example, let's say a service author
          wants to send out a document that must always be refetched on "new"
          requests, but should be displayed from a locally stored copy if
          someone wants to view previous results. You set it up to expire
          immediately, or you set it up so that it is not cachable. That's
          fine, but what happens when someone hits the BACK button in their
          browser to go to this page? If the history buffer and cache system
          are mixed up, hitting BACK will result in the page being re-fetched,
          when the service author's goal was to have it be redisplayed from
          local storage. Some browsers can be a bit nasty about it, depending
          how the page was generated, and may display results like "DATA
          MISSING". This is no good from a UI perspective, and it will really
          freak out naive users, to the point that authors such as myself will
          simply avoid using the headers that cause this, and find other ways,
          outside the protocol, to approximate the desired result of causing new
          requests to get new pages but allow local browser history functions to
          work, too. The problem with this is that using these techniques is
          even worse than avoiding caching altogether -- it can cause pages that
          should never be cached in the first place to be cached, possibly
          displacing usefully cached pages.
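
          As a hypothetical illustration of the setup described above (a sketch,
          not anyone's actual service), the CGI output would look something like
          this, and the open question is what a browser that conflates history
          and cache does with it when the user hits BACK:

            #!/usr/bin/env python
            # Sketch: a response that must be refetched on every *new* request.
            # The author's intent is that BACK still redisplays the local copy;
            # browsers that mix history and cache may refetch instead, or show
            # "DATA MISSING".
            import sys, time
            from email.utils import formatdate

            body = "<html><body>Results generated at %s</body></html>" % time.ctime()
            sys.stdout.write("Content-Type: text/html\r\n")
            sys.stdout.write("Expires: %s\r\n" % formatdate(usegmt=True))  # expires now
            sys.stdout.write("Content-Length: %d\r\n\r\n" % len(body))
            sys.stdout.write(body)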

          It's also a pain from the service design perspective, since you have
          to think about all kinds of weird interactions in browsers before
          using seemingly obvious and straightforward controls like Expires and
          Cache-control. In the long run this may be worse than a little
          caching inefficiency.

          --Shel
        • Koen Holtman
            Message 4 of 25, Dec 2, 1996
            Larry Masinter:
            >
            >However, I'm a little fuzzy on why lack-of-controls of history makes
            >'cache-busting' more of a problem,

            With some current browsers, if you use caching directives, this has weird
            side-effects on how the history mechanism works. HTTP/1.1 states that there
            should be no side effects on the history buffer, but not every current
            browser conforms to that.

            As long as these side-effects on the history mechanism remain, service
            authors who do not want the side effects (and there are many reasons for
            not wanting them) cannot use the caching directives. So these service
            authors will have to resort to one-time-URL cache busting techniques if they
            want to prevent the users from seeing stale data.

            Cache busting will remain with us to some extent until this unwanted
            coupling between history buffers and caches goes away.

            I have some hope that the language in 1.1 will make the coupling go away.
            If not, introducing explicit history control headers is my best bet for
            getting browsers to offer at least the option of not coupling cache
            and history. Even though history control headers would not affect the bytes
            on the wire, they would affect the caching options for these bytes, so I
            feel that I could make a strong case for the http-wg getting involved in
            this area.

            >Larry

            Koen.
          • Jeffrey Mogul
            Message 5 of 25, Dec 2, 1996
              Benjamin Franz points out several ways in which my simplistic trace
              analysis might have overestimated the number of possibly cache-busted
              responses seen at our proxy.

              In particular, he suggests that some of the non-query
              possibly-cachable references that I counted might actually
              have been CGI output, which should not have been included
              in the set of "possibly cache-busted responses". (I will
              note, however, that one of the examples he gave would NOT
              have been counted as such by my analysis, because that URL
              included the string "cgi-bin". I explicitly did not count
              such URLs.)

              If someone would like to propose a *feasible* filter on URLs
              and/or response headers (i.e., something that I could implement
              in a few dozen lines of C) that would exclude other CGI
              output (i.e., besides URLs containing "?" or "cgi-bin", which
              I already exclude), then I am happy to re-run my analysis.

              Drazen Kacar pointed out that I should probably have
              excluded .shtml URLs from this category, as well, because
              they are essentially the same thing as CGI output. I checked
              and found that 354 of the references in the trace were to .shtml
              URLs, and hence 10075, instead of 10429, of the references
              should have been categorized as possibly cache-busted. (This
              is a net change of less than 4%.)

              > I would say the only *confirmable* deliberate cache busting done
              > are the 28 pre-expired responses. And they are an insignificant
              > (almost unmeasurable) percentage of the responses.

              If I was writing a scientific paper whose thesis was that a
              significant fraction of the responses are cache-busted, then
              you are right that I would not have a rigorous proof regarding
              anything but these 28 pre-expired responses. And, no matter
              how much more filtering I do on the data, I would not expect
              to be able to construct a rigorous proof based on such a trace.

              On the other hand, I don't believe that this trace could provide
              a rigorous proof of the converse hypothesis, that no deliberate
              cache-busting is done. Nor do I believe that any trace-based
              analysis could prove this, given the frequency with which I
              found responses that leave the question ambiguous.

              In short, if we are looking for a rigorous, scientific *proof*
              that cache-busting is either prevalent or negligible, I don't
              think we are going to find it in traces, and I can't think of
              where else one might look.

              But we are engaged in what fundamentally is an *engineering*
              process, rather than a scientific one. This means that, from
              time to time, we are going to have to infer future reality from
              an imprecise view of current reality, and that the future is
              in large part determined by the result of our engineering, not
              independent of it.

              I welcome other sources of data that might help make this inference
              more reliable. Certainly we should not base everything on five
              hours of trace data from one site. On the other hand, it's
              foolish to dismiss the implications of the data simply because
              it fails to rigorously prove a particular hypothesis (pace the
              Tobacco Institute, which has taken about 30 years to admit that
              there might in fact be a connection between smoking and cancer.)

              > As you noted - much more study is needed. This one is utterly
              > inconclusive. You conclude from your numbers that significant
              > savings can be found.

              I wouldn't say I concluded that. I said "there does seem to
              be some potential here."

              > I conclude from the same numbers that the extra overhead of the hit
              > metering in fact is *higher* than the losses to deliberate cache
              > busting. You would have more network traffic querying for hit meter
              > results than the savings for such a tiny number of cache busted
              > responses.

              This mystifies me. What overhead of hit-metering are you talking about?

              There are three kinds of overhead in our proposed scheme:

              (1) additional bytes of request headers
              (a) for agreeing to hit-meter
              (b) for reporting usage-counts
              (2) additional bytes of response headers
              (3) additional HEAD request/response transactions for
              "final reports"

              Overheads of types #1(b), #2, and #3 are *only* invoked if the origin
              server wants a response to be hit-metered (or usage-limited,
              but that's not relevant to this analysis). This means that
              if hit-metering were not useful to the origin-server, it would
              not be requested, and so these overheads would not be seen.
              (I'm assuming a semi-rational configuration of the server!)

              Note that #3 can *only* happen instead of a full request
              on the resource, and is likely to elicit a smaller (no-body)
              response, so it's not really clear that this should be
              counted as an "overhead".

              What remains is the overhead (type #1(a)) of a proxy telling
              a server that it is willing to meter. I'll ignore the obvious
              choice that a proxy owner could make, which is to disable this
              function if statistics showed that hit-metering increases overheads
              in reality, and assume that the proxy is run by someone of less
              than complete understanding of the tradeoffs.

              So, once per connection, the proxy would send
              Connection: meter
              which is 19 bytes, by my count. If each connection carried just
              one request, then (assuming that the mean request size stays
              at about 309 bytes, which is what I found for all of the requests
              I traced, and this does not include any IP or TCP headers!), then
              this is about a 6% overhead. (But at one request/connection,
              and with a mean request size smaller than 576 bytes, there would
              probably be almost no increase in packet count.)

              However, since hit-metering can only be used with HTTP/1.1 or
              higher, and persistent connections are the default in HTTP/1.1,
              and because we defined this aspect of a connection to be "sticky"
              in our proposal, one has to divide the calculated overhead by
              the expected number of requests per connection. As far as I know,
              nobody has done any quantitative study of this since my SIGCOMM '95
              paper, which is presumably somewhat out of date, but (using simulations
              based on traces of real servers) I was expecting on the order of 10
              requests/connection. It might even be higher, given the growing
              tendency to scatter little bits of pixels throughout every web page.
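
              For concreteness, the arithmetic sketched above (using only the figures
              already quoted: a 19-byte header, a 309-byte mean request, and a guess
              of 10 requests per connection):

                # Back-of-the-envelope version of the overhead estimate above.
                header_bytes = len("Connection: meter\r\n")    # 19, CRLF included
                per_request = header_bytes / 309.0              # 309-byte mean request
                print("1 request/connection:   %.1f%%" % (100 * per_request))        # ~6.1%
                print("10 requests/connection: %.1f%%" % (100 * per_request / 10))   # ~0.6%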

              Anyway, I wouldn't presume to put a specific number on this, because
              I'm already basing things on several layers of speculation. But I
              would appreciate seeing an analysis based on real data that supports
              your contention, that "the extra overhead of the hit metering in fact
              is *higher* than the losses to deliberate cache busting."

              -Jeff
            • Drazen Kacar
              Message 6 of 25, Dec 3, 1996
                Jeffrey Mogul wrote:
                >
                > If someone would like to propose a *feasible* filter on URLs
                > and/or response headers (i.e., something that I could implement
                > in a few dozen lines of C) that would exclude other CGI
                > output (i.e., besides URLs containing "?" or "cgi-bin", which
                > I already exclude), then I am happy to re-run my analysis.

                You can check for everything that ends with ".cgi" and ".nph" as well
                as everything that starts with "nph-". Don't forget that CGIs can
                have trailing path info.

                > Drazen Kacar pointed out that I should probably have
                > excluded .shtml URLs from this category, as well, because
                > they are essentially the same thing as CGI output. I checked
                > and found that 354 of the references in the trace were to .shtml
                > URLs, and hence 10075, instead of 10429, of the references
                > should have been categorized as possibly cache-busted. (This
                > is a net change of less than 4%.)

                There is a short (3 char) extension as well. I don't know which one.
                I think it's ".shm", but I'm not sure. You'll get an additional percent
                or two if you include all of these.

                > I would say the only *confirmable* deliberate cache busting done
                > are the 28 pre-expired responses. And they are an insignificant
                > (almost unmeasurable) percentage of the responses.

                Some of them are probably due to the HTTP 1.0 protocol and could have been
                cacheable if the server could count on the Vary header being recognized by
                the client.

                > In short, if we are looking for a rigorous, scientific *proof*
                > that cache-busting is either prevalent or negligible, I don't
                > think we are going to find it in traces, and I can't think of
                > where else one might look.

                I can. On-line advertising mailing lists. I'm subscribed to one of those
                not because it's my job, but to stay in touch with the web things. I'm
                just a lurker there (OK, I'm a lurker here as well, but not because I
                want to. I can't find time to read the drafts and I'm at least two
                versions behind with those I did read.)

                People on the list are professionals and experts in their field, but
                not in HTML or HTTP. A month ago somebody posted "a neat trick" which
                had these constructs in HTML source:

                <FONT FACE="New Times Roman" "Times Roman" "Times" SIZE=-1>...</FONT>
                <A HREF=...><TABLE>...</TABLE></A>

                Then somebody else pointed out that Netscape won't make the whole table
                clickable if it's contained in an anchor. The answer from the original author
                started with "For some reason (and I don't know why) it seems that
                Netscape can't...". I let that one pass to see if anyone would mention
                DTDs, syntax, validators or anything at all. No one did. This is viewed
                as a lack of functionality in NSN, and not as truly horrible HTML.
                To be fair, I must mention that most of them know a thing or two about
                ALT attributes and are actively fighting for its usage. They probably
                don't know it's required in AREA, but IMG is a start. My eternal
                gratitude to people who are fighting on comp.infosystems.www.html. I stopped
                years ago.

                Another example is HTTP related. There was talk about search engines and
                one person posted that cheating them is called "hard working". Then there
                was a rush of posts saying that this is not ethical and that a page whose
                text contains repeated keywords could come up at the top of the list, but
                it would look horrible when the customer really requests the page. No one
                mentioned that you can deliver one thing to the search engine and another
                to the browser.

                To conclude, marketing people are clueless about HTML and (even more) HTTP
                and they can't participate on this list. It's not that they would not
                want to. They have some needs and if those are not met with HTTP, responses
                will be made uncacheable as soon as they find out how to do it.
                I'm doing the same thing because of charset problems. It's much more
                important for the information provider that users get the right code page
                than to let a proxy cache the wrong one. OK, I'm checking for HTTP 1.1 things
                that indicate that I can let the entity body be cacheable, but those are
                not coming right now and (reading the wording in the HTTP 1.1 spec) I doubt
                they will.

                A few examples of what's needed...

                Suppose I need high quality graphics for the page, but it's not mandatory.
                I'll make two versions of the pictures; one will have small files and the
                other will (can't do anything about it) have big files. I can determine
                via feature negotiation whether the user's hardware and software can display
                high quality pictures, but not whether the user wants them, i.e. whether the
                bandwidth is big enough or the user is prepared to wait.
                So, I'll display low res pictures by default and put a link to the same
                page with high res graphics. The user's preference will be sent back to him in
                the cookie. It's really, really hard and painful to maintain two versions
                of pages just for this, and I'd want my server to select the appropriate picture
                based on the URL and the particular cookie. What happens with the proxy?
                I can send "Vary: set-cookie", but this is not enough. There'll be other
                cookies. On a really commercial site there'll be one cookie for each user.
                People are trying to gather information about their visitors. I can't
                blame them, although I have some ideas about preventing this. (Will have
                to read the state management draft, it seems.) Anyway, this must be made
                non-cacheable. Counting on LOWSRC is not good enough.
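
                For illustration only, a sketch of the kind of cookie-based selection being
                described (hypothetical names; shown with "Vary: Cookie", naming the request
                header the response actually varies on):

                  #!/usr/bin/env python
                  # Hypothetical CGI: pick an image variant from a "quality" cookie.
                  # With a per-user cookie on every request, varying on the cookie makes
                  # the response effectively uncacheable in a shared cache anyway.
                  import os, sys

                  cookies = dict(p.strip().split("=", 1)
                                 for p in os.environ.get("HTTP_COOKIE", "").split(";")
                                 if "=" in p)
                  variant = "hires.jpg" if cookies.get("quality") == "high" else "lowres.jpg"

                  data = open(variant, "rb").read()
                  sys.stdout.write("Content-Type: image/jpeg\r\n")
                  sys.stdout.write("Vary: Cookie\r\n")
                  sys.stdout.write("Content-Length: %d\r\n\r\n" % len(data))
                  sys.stdout.flush()
                  sys.stdout.buffer.write(data)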

                Another thing is ad banners. Some people are trying not to display the
                same banner more than 5 or 6 times to a particular user. The information
                about visits is stored in (surprise, surprise) a cookie. The same thing, again.

                I think that technical experts should ask the masses what's needed. Don't
                expect the response in the form of an Internet Draft, though.

                --
                Life is a sexually transmitted disease.

                dave@...
                dave@...
              • Benjamin Franz
                Message 7 of 25, Dec 3, 1996
                  On Tue, 3 Dec 1996, Drazen Kacar wrote:

                  > Jeffrey Mogul wrote:
                  > >
                  > > If someone would like to propose a *feasible* filter on URLs
                  > > and/or response headers (i.e., something that I could implement
                  > > in a few dozen lines of C) that would exclude other CGI
                  > > output (i.e., besides URLs containing "?" or "cgi-bin", which
                  > > I already exclude), then I am happy to re-run my analysis.
                  >
                  > You can check for everything that ends with ".cgi" and ".nph" as well
                  > as everything that starts with "nph-". Don't forget that CGIs can
                  > have trailing path info.
                  >
                  > > Drazen Kacar pointed out that I should probably have
                  > > excluded .shtml URLs from this category, as well, because
                  > > they are essentially the same thing as CGI output. I checked
                  > > and found that 354 of the references in the trace were to .shtml
                  > > URLs, and hence 10075, instead of 10429, of the references
                  > > should have been categorized as possibly cache-busted. (This
                  > > is a net change of less than 4%.)
                  >
                  > There is a short (3 char) extension as well. I don't know which one.
                  > I think it's ".shm", but I'm not sure. You'll get an additional percent
                  > or two if you include all of these.

                  It's worse than that. The world has started using many different TLA
                  extensions for CGI type stuff. .dll is used on the Microsoft site with a
                  path segment of 'isapi'. I have also seen .ast, .asm, .asp, .nsf, .exe,
                  .phtml, and of course .pl, .cgi and even .tcl. To add to the problems,
                  some people configure Apache to do server-side parsing on .html files. One
                  general rule of thumb is 'anything except a widely known file extension in
                  the standard mime.conf file plus some common others (.mpg, .mov, .fli,
                  .avi, .wav, .mp2, .mp3, .png, .htm, .pdf, .java, .class) is probably being
                  generated dynamically'.
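
                  Mogul asked for something he could implement in a few dozen lines of C;
                  as a language-neutral sketch of the combined heuristic (the suffix rules
                  above plus the "unknown extension is probably dynamic" rule of thumb,
                  with an illustrative, not exhaustive, whitelist):

                    # Sketch of a "probably dynamically generated" URL filter.
                    STATIC_EXTENSIONS = {
                        ".html", ".htm", ".txt", ".gif", ".jpg", ".jpeg", ".png", ".pdf",
                        ".mpg", ".mov", ".fli", ".avi", ".wav", ".mp2", ".mp3",
                        ".java", ".class",
                    }

                    def looks_dynamic(url):
                        path = url.split("?", 1)[0]
                        if "?" in url or "cgi-bin" in path or "isapi" in path:
                            return True
                        segments = [s for s in path.split("/") if s]
                        if any(s.startswith("nph-") for s in segments):
                            return True
                        # CGIs may carry trailing path info, so test every segment.
                        dynamic_suffixes = (".cgi", ".nph", ".shtml", ".phtml", ".pl",
                                            ".tcl", ".dll", ".asp", ".exe", ".nsf")
                        if any(s.lower().endswith(dynamic_suffixes) for s in segments):
                            return True
                        # Rule of thumb: an unrecognised extension is probably generated.
                        last = segments[-1] if segments else ""
                        dot = last.rfind(".")
                        return dot != -1 and last[dot:].lower() not in STATIC_EXTENSIONS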

                  > I can send "Vary: set-cookie", but this is not enough. There'll be other
                  > cookies. On a really commercial site there'll be one cookie for each user.

                  Yep.

                  --
                  Benjamin Franz
                • Robert S. Thau
                  Message 8 of 25, Dec 3, 1996
                    Jeffrey Mogul writes:

                    > Drazen Kacar pointed out that I should probably have
                    > excluded .shtml URLs from this category, as well, because
                    > they are essentially the same thing as CGI output. I checked
                    > and found that 354 of the references in the trace were to .shtml
                    > URLs, and hence 10075, instead of 10429, of the references
                    > should have been categorized as possibly cache-busted. (This
                    > is a net change of less than 4%.)

                    Unfortunately, use of server-side-include type schemes (what .shtml is
                    typically meant to invoke) is not always so easy to detect --- the
                    Apache web server, for instance, has hooks which allow the same sort
                    of processing to be applied to *.html files with certain unusual Unix
                    permission bit settings (XBitHack), and there are people who run the
                    server configured to treat *all* *.html files as (potentially)
                    containing server-side includes. Deliberate cache-busting (e.g., to
                    enable collection of better metrics) may not be the intent of these
                    setups, but they currently have something of that effect...

                    rst
                  • Anawat Chankhunthod
                    Message 9 of 25, Dec 3, 1996
                      Last time I looked, PointCast objects (when you use PointCast behind an
                      HTTP proxy) are cachable by header and URL, but every URL is unique.
                      I guess we can categorize them as cache busting too.

                      Anawat
                    • Jeffrey Mogul
                      Message 10 of 25, Dec 3, 1996
                        > There's another category of cache-busting that you did not mention in
                        > the statistics you reported. This is the use of unique URL
                        > components, which may be "once-only" URLs, or are at least unique for
                        > a single user.

                        Right you are. I should have been more explicit in the title of
                        my message, and I didn't explain it clearly enough in the body
                        of the message, but this analysis was only aimed at finding instances
                        of cache-busting that might easily be avoided through use of our
                        hit-metering proposal. I thought it would be more realistic to
                        look for cache-busting that is done without using the unique-URL
                        technique.

                        It's not clear to me whether the users of once-only URLs would
                        switch to a more cache-friendly approach if our hit-metering
                        proposal were available. (Clearly, anyone that requires
                        cache-busting to provide usable results in the face of broken
                        history mechanisms is not going to switch, at least not until
                        virtually all browsers have fixed their history support.) So
                        I therefore assumed that none of the once-only URLs would be
                        amenable to hit-metering, and so I did not try to include these
                        URLs in my category of "possibly cache-busted responses."

                        On the other hand, it's not clear that I could have identified them
                        from their names. If they were pre-expired or had no last-modified
                        date, and they did not match my CGI filter, I would have included
                        them in my category of "possibly cache-busted responses" by mistake.

                        When I am ready to re-do the analysis, I'll try a version that is
                        limited to URLs for which the trace contains at least two status-200
                        responses. Presumably this will avoid any once-only URLs, right?
                        However, it will decrease the sample size by a large factor, which
                        means that the significance of the results may be weakened.

                        -Jeff
                      • Shel Kaphan
                        Message 11 of 25, Dec 3, 1996
                          Jeffrey Mogul writes:
                          > There's another category of cache-busting that you did not mention in
                          > the statistics you reported. This is the use of unique URL
                          > components, which may be "once-only" URLs, or are at least unique for
                          > a single user.
                          >
                          > Right you are. I should have been more explicit in the title of
                          > my message, and I didn't explain it clearly enough in the body
                          > of the message, but this analysis was only aimed at finding instances
                          > of cache-busting that might easily be avoided through use of our
                          > hit-metering proposal. I thought it would be more realistic to
                          > look for cache-busting that is done without using the unique-URL
                          > technique.
                          >

                          Yes, sure. You'd have to resort to unreliable heuristic techniques to
                          pick out such URLs. In fact, you're likely to have already considered
                          them in one of your other categories, since they are more likely to
                          show up as invocations of CGI programs and the like, rather than
                          static ".html" URLs -- *something* on the server end has to interpret
                          or strip off the unique part of the URL. Unless the http server
                          itself has been hacked, it will be a CGI program or the moral
                          equivalent.

                          > It's not clear to me whether the users of once-only URLs would
                          > switch to a more cache-friendly approach if our hit-metering
                          > proposal were available. (Clearly, anyone that requires
                          > cache-busting to provide usable results in the face of broken
                          > history mechanisms is not going to switch, at least not until
                          > virtually all browsers have fixed their history support.) So
                          > I therefore assumed that none of the once-only URLs would be
                          > amenable to hit-metering, and so I did not try to include these
                          > URLs in my category of "possibly cache-busted responses."
                          >

                          They're mainly not amenable to hit metering because it's impossible to
                          algorithmically determine the "equivalence class" of once-only URLs --
                          all the superficially distinct URLs that fetch "the same" resource
                          look like different URLs. Anyway I'd have to guess that the
                          overwhelming majority of servers that work using unique URLs do it
                          more for semantics than explicitly for cache-busting.
                          One question that must be asked about this: is this technique
                          prevalent enough to be worth worrying much about? I see it a lot, but
                          then, I pay attention to sites that do stuff like this.

                          > On the other hand, it's not clear that I could have identified them
                          > from their names. If they were pre-expired or had no last-modified
                          > date, and they did not match my CGI filter, I would have included
                          > them in my category of "possibly cache-busted responses" by mistake.
                          >
                          but that "mistake" is actually OK, right?

                          > When I am ready to re-do the analysis, I'll try a version that is
                          > limited to URLs for which the trace contains at least two status-200
                          > responses. Presumably this will avoid any once-only URLs, right?

                          It will avoid true "once-only" URLs, but you still might see
                          some matches on "per-session" URLs -- ones that track a user through a
                          session. These per-session URLs are also fairly pointless to cache in a
                          shared cache, since they're only relevant to one user, but that user
                          might ask for the same thing more than once. Based purely
                          on anecdotal evidence I think per-session URLs are a lot more common than
                          true once-only URLs.

                          > However, it will decrease the sample size by a large factor, which
                          > means that the significance of the results may be weakened.
                          >
                          > -Jeff
                          >
                          >


                          --Shel
                        • Drazen Kacar
                          Message 12 of 25, Dec 3, 1996
                            Shel Kaphan wrote:

                            > Yes, sure. You'd have to resort to unreliable heuristic techniques to
                            > pick out such URLs. In fact, you're likely to have already considered
                            > them in one of your other categories, since they are more likely to
                            > show up as invocations of CGI programs and the like, rather than
                            > static ".html" URLs -- *something* on the server end has to interpret
                            > or strip off the unique part of the URL. Unless the http server
                            > itself has been hacked, it will be a CGI program or the moral
                            > equivalent.

                            There are servers you don't have to hack. They were written by hackers.
                            Phttpd, for example, has this nice thing called URL rewriting. Basically,
                            the first thing the server does is check whether the requested URL is
                            matched by one of the rewriting patterns and, if so, change it according to
                            the rewriting rule. For example, I can have this pair:

                            /*/xexe/*.html /cgi-bin/script/%{-}

                            which will translate http://my.host/~dave/xexe/cacheme.html to
                            http://my.host/cgi-bin/script/~dave/xexe/cacheme.html
                            The user agent will always see the URL before rewriting, and the server
                            will invoke a CGI which will receive the original URL via the PATH_INFO
                            (and PHTTPD_ORIG_URL as well :) env variable(s). Phttpd is being run on a
                            little more than 100 hosts, so I suppose you won't encounter this often,
                            but I think that Apache 1.1 can do a crippled version of the magick with
                            the Action directive.

                            --
                            Life is an uncacheable sexually transmitted disease.

                            dave@...
                            dave@...
                          • Andrew Daviel
                            Message 13 of 25, Dec 4, 1996
                              Why can't (shouldn't) one cache a CGI response? It seems to me more
                              rational to flush the cache based on the frequency of hits. For example, the
                              "help" page at AltaVista is CGI-generated from a query, but as far as I
                              know it's static. It's perfectly reasonable to generate static pages
                              from a database using CGI or otherwise, and it's quite possible to
                              set all the headers Last-Modified, Expires, Content-Length etc. in
                              an appropriate manner. I use a Squid cache set to reject "/imagemap"
                              and not much else (though not to pass cgi-bin or ? to the parent).
                              Perhaps 5% of queries are cache hits, compared to around 16% of images.

                              If someone looks up "Soccer in Latvia" in a search engine, is it really
                              going to change in ten minutes? A day? More so than
                              http://www.obscure.org/some/really/obscure/page.html ?

                              Re. charsets, content negotiation, etc. in HTTP 1.0 - I decided
                              as a compromise that using a redirect CGI is "mostly harmless".
                              True, if the origin server can't be reached you're stuck, but the
                              big text files can be cached. I think Microsoft's doing something like
                              this, but as someone pointed out to me, they use Set-Cookie with a path
                              of /, which strictly speaking makes the whole site uncacheable.

                              Andrew Daviel mailto:advax@...
                              http://vancouver-webpages.com/CacheNow/ - campaign for Proxy Cache
                            • Andrew Daviel
                              Message 14 of 25, Dec 4, 1996
                                On Tue, 3 Dec 1996, Robert S. Thau wrote:

                                > Unfortunately, use of server-side-include type schemes (what .shtml is
                                > typically meant to invoke) is not always so easy to detect --- the
                                > Apache web server, for instance, has hooks which allow the same sort
                                > of processing to be applied to *.html files with certain unusual Unix
                                > permission bit settings (XBitHack), and there are people who run the
                                > server configured to treat *all* *.html files as (potentially)
                                > containing server-side includes. Deliberate cache-busting (e.g., to
                                > enable collection of better metrics) may not be the intent of these
                                > setups, but they currently have something of that effect...

                                XBitHack is set to allow caching of SHTML. If the server has this
                                enabled, and the file has the group execute bit set, then the page is
                                served with the last-modified date of the container file. It's up to the
                                webmaster to update the modification date of the container file when
                                any included files change. I use this extensively (user-configurable
                                pages where the included file grows, setting PICS headers, common header
                                and footer information on thousands of pages, etc.).

                                Anawat has recently pointed out to me, though, that Content-Length does
                                not get set, which will cause a persistent connection to be dropped at
                                the end of the document.
                                I confess that persistent connections are something I haven't really looked
                                at - they're not supported in many of the browsers and agents I've used.

                                Re. Action - in Apache one can define a new suffix and pseudo-mime-type
                                which will redirect to a CGI script. I made one so that
                                /some/doc.lang would go to /cgi-bin/redirect-lang/doc.lang and thence
                                to /some/english.html, /some/french.html etc. as a semi-cacheable
                                alternative to the Apache content negotiation using the .var suffix, which
                                plain doesn't work with caches in HTTP 1.0.
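
                                A hypothetical sketch of that sort of redirect script (names and the
                                language mapping invented for illustration):

                                  #!/usr/bin/env python
                                  # Sketch of a /cgi-bin/redirect-lang style script: send the client
                                  # to a plain per-language .html page, which caches can store normally.
                                  import os, sys

                                  accept = os.environ.get("HTTP_ACCEPT_LANGUAGE", "en").lower()
                                  page = "french.html" if accept.startswith("fr") else "english.html"
                                  base = os.path.dirname(os.environ.get("PATH_INFO", "/some/doc.lang"))

                                  sys.stdout.write("Status: 302 Moved Temporarily\r\n")
                                  sys.stdout.write("Location: %s/%s\r\n\r\n" % (base or "/some", page))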

                                Andrew
                              • Jeffrey Mogul
                                Message 15 of 25, Dec 5, 1996
                                  Andrew Daviel writes:
                                  Why can't (shouldn't) one cache a CGI response ? It seems to me
                                  more rational to flush cache based on the frequency of hits.

                                  The HTTP/1.1 specification, in fact, does not specifically say
                                  that proxies and clients should not cache the results of a CGI
                                  response. In fact, section 13.4 (Response Cachability) says

                                  Unless specifically constrained by a Cache-Control
                                  directive, a caching system may always store a successful response
                                  as a cache entry, may return it without validation if it
                                  is fresh, and may return it after successful validation. If there is
                                  neither a cache validator nor an explicit expiration time associated
                                  with a response, we do not expect it to be cached [...]

                                  In other words, if the server supplies a response with either a
                                  Last-Modified header, or an Expires header (or "Cache-control: max-age")
                                  that gives an expiration time in the future, then the response
                                  *should* be cached.

                                  However, because most existing caches were designed before HTTP/1.1,
                                  and do not expect servers to generate Expires headers (most servers
                                  apparently do not), they often cache responses that have neither
                                  a Last-Modified header or an Expires header. This is not really
                                  such a great idea, but it "usually" works. The two well-known cases
                                  that it often does not work in are those where the URL includes a "?"
                                  and those where it includes "cgi-bin" (or a few similar strings).
                                  So it's normal practice for proxies to not cache responses to such
                                  URLs.

                                  Note that section 13.9 says, regarding URLs with "?" in them,
                                  caches MUST NOT treat responses to such URLs as fresh unless
                                  the server provides an explicit expiration time.
                                  There is a general consensus (but not unanimity) that it is better
                                  to err on the side of caution in this case. I.e., since there are
                                  many such URLs for which caching would cause seriously wrong results,
                                  it's better to not cache any of these responses (and thus give up
                                  the ability to cache certain responses that are cachable), rather
                                  than to risk occasionally returning wrong answers.
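
                                  Putting the rules just described into one place (a rough sketch of the
                                  proxy-side decision, not any particular cache's actual logic; header
                                  names are assumed to be lowercased already):

                                    def may_cache(url, headers):
                                        # Sketch of the cachability rule described above.
                                        cc = headers.get("cache-control", "")
                                        if any(d in cc for d in ("no-store", "no-cache", "private")):
                                            return False
                                        explicit_expiry = "expires" in headers or "max-age" in cc
                                        if "?" in url or "cgi-bin" in url:
                                            # Section 13.9 for "?": not fresh without an explicit
                                            # expiration time; most proxies treat "cgi-bin" the same way.
                                            return explicit_expiry
                                        # Otherwise a validator or an explicit expiry makes it worth storing.
                                        return explicit_expiry or "last-modified" in headers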

                                  However, I think everyone agrees with you that it's both possible and
                                  desirable for origin servers to explicitly mark all responses
                                  as either non-cachable or cachable, since then the proxies don't
                                  have to play guessing games based on the URL. E.g., if you are
                                  writing a server that uses CGI or "?" URLs, and you know that
                                  some of these are cachable, if you simply add a Last-Modified
                                  or Expires (in the future) header to the response, then a well
                                  designed proxy will cache the response. Conversely, if you
                                  mark the response as Expires "in the past", then no well designed
                                  cache should cache it (without at least sending you a conditional
                                  GET to see if the value has changed).

                                  As to why the AltaVista people haven't done this: I don't know.
                                  Some of them work in our building, but I don't have much to
                                  do with their design decisions (and they didn't invite me
                                  for a ride in the blimp!).

                                  It's probably too hard to decide automatically that a response
                                  on a query for "Soccer in Latvia" would be more stable than
                                  a query for "Cool Site of the day", but it should certainly
                                  be possible to set an expiration time reflecting the expected
                                  time between database updates.

                                  -Jeff
                                • jg@zorch.w3.org
                                  Message 16 of 25, Dec 5, 1996
                                    I have talked with Louis Monier (and I think I mentioned
                                    this to Mike Burrows as well) about doing stuff for marking
                                    CGI responses cachable in AltaVista. I told them that
                                    it didn't make any real difference to do it before 1.1 clients
                                    and proxies were available, which hasn't quite happened yet.
                                    (And they have had more than enough things to do in the first
                                    place that asking them to do so before it would actually help anyone
                                    seemed pointless.) So Louis at least has it on his list of things to
                                    do for A.V. in finite time (unless he has forgotten).

                                    I'll poke at them again next week if I can.

                                    BTW, we should have interesting HTTP/1.1 performance data available
                                    by next week (the IETF meeting); we're scrambling here to take
                                    the data now we have implementations we actually think work correctly
                                    that do pipelining (not just persistent connections, which we've
                                    had running for a long time). When the write up is done, we'll
                                    post a note to the working group (with luck, you might see it
                                    over the weekend).

                                    Preliminary results look very nice :-). Saves lots of packets,
                                    and runs faster. :-). But then again, that's what we expected....
                                    - Jim Gettys
                                  • Ho John Lee
                                    Message 17 of 25, Dec 5, 1996
                                      > However, I think everyone agrees with you that it's both possible and
                                      > desirable for origin servers to explictly mark all responses
                                      > as either non-cachable or cachable, since then the proxies don't
                                      > have to play guessing games based on the URL. E.g., if you are
                                      > writing a server that uses CGI or "?" URLs, and you know that
                                      > some of these are cachable, if you simply add a Last-Modified
                                      > or Expires (in the future) header to the response, then a well
                                      > designed proxy will cache the response. Conversely, if you
                                      > mark the response as Expires "in the past", then no well designed
                                      > cache should cache it (without at least sending you a conditional
                                      > GET to see if the value has changed).

                                      Do you know offhand which proxies currently interpret the
                                      Last-Modified header in a way that would cache CGI/"?" URLs?

                                      The interactive imaging protocol we're currently developing
                                      *should* be cacheable, as the image tiles being sent don't
                                      change, and the initial image view will generally be the same
                                      each time it's requested. It would be nice if we could get
                                      *some* systems to cache something while we work out a better
                                      scheme for making the rest of the tile information cacheable
                                      by proxies.

                                      --hjl
                                    • Koen Holtman
                                      Message 18 of 25, Dec 6, 1996
                                        Jeffrey Mogul:
                                        [...]
                                        >However, because most existing caches were designed before HTTP/1.1,
                                        >and do not expect servers to generate Expires headers (most servers
                                        >apparently do not), they often cache responses that have neither
                                        ^^^^^
                                        >a Last-Modified header or an Expires header.

                                        I think there are very few existing 1.0 proxies that cache responses without
                                        a Last-Modified header. Doing so would cause problems with a large fraction of
                                        all CGI-based stuff, and this would get noticed very quickly by the cache
                                        maintainer.

                                        I believe the AOL cache does (or did at some point) cache everything for a
                                        few minutes at least, no matter what the headers, but proxies on the `real'
                                        internet generally tend to err on the conservative side.

                                        >-Jeff

                                        Koen.