forwarded message from connolly@pixel.convex.com

  • Jean-Francois Groff
    Dec 3, 1991
      WWW folks may like to comment on this, posted to wais-talk and
      cni-arch... Sorry if you've already read it there !

      -- Jean-Francois

      ------- Start of forwarded message -------

      From: connolly@...
      To: wais-talk@...
      Cc: cni-arch@...
      Subject: Re: Document identifiers
      Date: Mon, 02 Dec 91 01:32:36 CST


      >The Coalition for Networked Information
      >Architectures & Standards Working Group
      >
      I don't like the direction this technology is headed.

      What is the desired functionality of these identifiers?

      If you want an identifier that uniquely identifies a file,
      why not use a checksum, such as returned by the unix
      sum command?
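
As a rough illustration of the checksum idea, here is a minimal sketch of the 16-bit rotating checksum traditionally computed by BSD-style `sum`. The function name `bsd_sum` and the sample bytes are assumptions made for illustration; a given system's `sum` may use a different algorithm (System V variants do).

```python
def bsd_sum(data: bytes) -> int:
    """16-bit rotating checksum in the style of the BSD `sum` command."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit accumulator right by one bit.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        # Add the next byte and keep only the low 16 bits.
        checksum = (checksum + byte) & 0xFFFF
    return checksum


if __name__ == "__main__":
    # Hypothetical document contents, used only to show the call.
    document = b"A full discussion of the BLURF protocol\n"
    print(f"{bsd_sum(document):05d}")
```

The value depends only on the bytes, so two copies of a file on different hosts get the same identifier. A 16-bit value can distinguish at most 65536 files, so a practical identifier would presumably need a longer checksum.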

      Let's see how a checksum solves these issues, and then see
      what functionality I'd like to see instead.

      >1. The need for identifiers, as distinct from location
      >information. This is best handled by a number (much like an
      >ISSN or ISBN), but the system must accommodate multiple
      >number-assigning agencies. Thus, the identifier is proposed
      >as <numbering-authority>,<identifier> where numbering
      >authorities are registered.
      >
      There's no location info in a checksum. Done deal.
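
For reference, the `<numbering-authority>,<identifier>` form quoted above could be parsed along these lines. This is only a sketch: the registry contents, the `DocumentId` type, and `parse_document_id` are hypothetical names, not part of the proposal.

```python
from typing import NamedTuple

# Hypothetical registry of numbering authorities; the entries are illustrative only.
REGISTERED_AUTHORITIES = {"ISBN", "ISSN"}


class DocumentId(NamedTuple):
    authority: str
    identifier: str


def parse_document_id(text: str) -> DocumentId:
    """Split '<numbering-authority>,<identifier>' at the first comma."""
    authority, separator, identifier = text.partition(",")
    authority = authority.strip().upper()
    identifier = identifier.strip()
    if not separator or not authority or not identifier:
        raise ValueError(f"not of the form authority,identifier: {text!r}")
    if authority not in REGISTERED_AUTHORITIES:
        raise ValueError(f"unregistered numbering authority: {authority!r}")
    return DocumentId(authority, identifier)


if __name__ == "__main__":
    print(parse_document_id("ISBN,0-13-590126-X"))
```

Note that nothing in the parsed result says where the document lives, which is the point made above.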

      >2. The pointers must be representable as an ASCII string to
      >facilitate inclusion in a wide range of material, including
      >documents and electronic mail.
      >
      Check.

      >3. Location information must support multiple Locations for
      >the document, including the "location of record" and one or
      >more redistribution centers, local caches, etc. The means of
      >specifying a location should be sufficiently general to span
      >at least the set of networks covered under the Internet
      >Domain Naming system (DNS).
      >
      Ah! Now we want to be able to get location info out of the
      identifier. Checksums don't help. Well, in fact, they help
      no more or less than <numbering authority>-<id> helps, unless
      a numbering authority implies a location. I'm not clear on
      this at all.

      >4. Objects may be retrieved by a variety of access
      >mechanisms from servers, including FTP, LISTSERV, Z39.50,
      >and perhaps FTAM and SQL-based database access, as well as
      >requests for paper copies. The location information should
      >be sufficiently general to include information about these
      >different types of access techniques, and extensible to
      >include new access methods that may develop in future.
      >
      Hmmm... now it looks like the doc id should tell how to
      get the document... but not exactly. What we're really looking
      for is some client software that interprets these numbers
      and queries servers. Checksums look as good as anything again.

      >5. Perhaps the location identifier should include some
      >information about the format and size of the object; on the
      >other hand, perhaps it should not. Discussion?
      >
      Checksums do not contain type/size info. If that's what we want,
      the checksum idea is no good.

      >6. It should be possible to further qualify a reference to a
      >"sublocation" within an object (which would have meaning
      >only to the server that houses it). This is needed, for
      >example, for hypertext-type links. Such a sublocation might
      >be the 25th paragraph of a text, for a hypertext-type
      >pointer.
      >
      Now we raise the question: just what does a document identifier
      identify? Until this item, it appeared that a document was
      a file. Now it's not so clear. Perhaps a document should be anything
      from a single character to a paragraph to a file to a chapter to
      a book to an encyclopedia to a library. That would be a good trick.
      Is that what we're after?

      >7. Indirection should be supported. In other words, one
      >should be able to format the location as the name of a
      >server that can be passed the identifier and which would
      >return location information. The protocol mechanism(s) for
      >doing this need to be specified as well.
      >
      Ah. Now the objectives of the location info become more clear.
      Sounds to me like the location is a TCP connection, or enough
      information on how to establish one.

      >8. While full rights and permissions data would seem to be
      >outside the scope of such a pointer, it might be useful to
      >include at least some basic information. This might be an
      >indication that the object is not copyrighted and can be
      >freely distributed, that it is copyrighted but can be freely
      >distributed, that it can be redistributed for noncommercial
      >use, or that restrictions apply to redistribution. Also, it
      >might make sense to include a pointer of some sort (an
      >e-mail address? a host address?) for further information
      >about rights.
      >
      Ack! This stuff seems totally orthogonal to the rest of the
      stuff, but in practice, this looks like a crucial issue.
      I don't have any good ideas here.

      >9. Perhaps there might be some type of checksum that can be
      >calculated on the retrieved object to ensure that the
      >pointer and the object have not gotten out of synch?
      >
      This is what sparked the checksum idea.
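
A minimal sketch of that integrity check, assuming the pointer simply carries a checksum next to the location. The `Pointer` type and the helper names are hypothetical, and CRC-32 from Python's `zlib` stands in for whatever checksum would actually be chosen.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    location: str   # wherever the object is supposed to live
    checksum: int   # checksum of the object when the pointer was made


def make_pointer(location: str, content: bytes) -> Pointer:
    """Record the location together with a CRC-32 of the current content."""
    return Pointer(location, zlib.crc32(content))


def in_sync(pointer: Pointer, retrieved: bytes) -> bool:
    """True if the retrieved bytes still match the checksum in the pointer."""
    return zlib.crc32(retrieved) == pointer.checksum


if __name__ == "__main__":
    pointer = make_pointer("frob.mit.edu:/pub/protos/blurf-proto.tex", b"draft 1")
    print(in_sync(pointer, b"draft 1"))  # same bytes as when the pointer was made
    print(in_sync(pointer, b"draft 2"))  # the object changed after the pointer was made
```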


      My response to all this:

      I don't think we need [yet another] document identifier format.
      If you want location info, use an internet address; if you want
      data integrity, use a checksum; if you want format, we are lacking
      a standard here; if you want copyright info, ditto.

      What we need is some nifty client software to glue all the parts
      together. I guess there is some room for standardization, but please:
      LET'S LEVERAGE EXISTING SYSTEMS!

      Where these systems are robust, I think we should support them. I'd
      also like to see support for ad-hoc document identifiers. Here's
      an example to clarify:

      I'm browsing some email, netnews, or a README file from somewhere.
      I see a reference to more info:

      A full discussion of the BLURF protocol is available via
      anonymous FTP from frob.mit.edu as blurf-proto.tex
      in the directory /pub/protos.

      I select some or all of that text, and I click one of the buttons
      in my document retrieval tool:

      make ftp id -- extract the relevant information and display
          a well-formed identifier acceptable to some
          existing FTP client (I've heard of something
          called ange FTP. Another idea is to make
          a shell script that would do the retrieval:
              ftp frob.mit.edu
              cd /pub/protos
              get blurf-proto.tex
          )

      make wais id -- get enough info to make a WAIS doc ID
          [scrap this unless it stabilizes]
      make WWW id -- same thing for World Wide Web HTTP addresses.
      make NNTP id -- same thing for USENET news message id's.
      make LISTSERV id -- you get the idea
          Rather than making up a new format, these id's
          are instructions to EXISTING clients to retrieve
          a document.

      verify id -- connect to the necessary server(s) and verify
          that the id references an existing document.
          Append to the id a "verification date," which
          is the last time a server acknowledged the
          existence of the document.

      get id info -- connect to the necessary server(s) and get about
          1K of miscellaneous info: document size in bytes,
          date of last modification, available formats,
          short summary, etc.

      retrieve raw -- connect and retrieve the document in whatever
          format is convenient to the server, e.g.
          a compressed tar archive of C and troff sources.

      retrieve text -- connect and retrieve the document as
          plain text [defined, e.g. as the body of an
          RFC-822 mail message]

      retrieve... -- the user or the supporting client software
          specifies the supported information formats,
          (compression schemes, archiving formats,
          image file formats, typesetting languages)
          the client and the server hash over their options,
          [perhaps with user intervention]
          and the server sends the most desirable version
          of the document it has available.
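
Two of the buttons above, "make ftp id" and "retrieve...", could be sketched roughly as follows. This is an illustration under stated assumptions, not a design: the regular expression only understands phrasing like the BLURF example, and `FtpId`, `make_ftp_id`, `retrieval_script`, and `negotiate_format` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional, Sequence


class FtpId(NamedTuple):
    host: str
    directory: str
    filename: str


# Matches wording like "FTP from HOST as FILE in the directory DIR".
# Other wordings would need other patterns.
_FTP_REFERENCE = re.compile(
    r"FTP\s+from\s+(?P<host>[\w.-]+)\s+as\s+(?P<filename>\S+)"
    r"\s+in\s+the\s+directory\s+(?P<directory>/\S*)",
    re.IGNORECASE,
)


def make_ftp_id(selected_text: str) -> Optional[FtpId]:
    """Extract host, directory and filename from the selected text, if present."""
    match = _FTP_REFERENCE.search(selected_text)
    if match is None:
        return None
    return FtpId(
        host=match["host"],
        # Drop a sentence-ending period that got captured with the path.
        directory=match["directory"].rstrip("."),
        filename=match["filename"],
    )


def retrieval_script(ftp_id: FtpId) -> str:
    """Render the id as commands for an existing FTP client."""
    return "\n".join(
        [f"ftp {ftp_id.host}", f"cd {ftp_id.directory}", f"get {ftp_id.filename}"]
    )


def negotiate_format(
    client_prefers: Sequence[str], server_offers: Sequence[str]
) -> Optional[str]:
    """Return the client's most preferred format that the server can send."""
    offered = set(server_offers)
    for fmt in client_prefers:
        if fmt in offered:
            return fmt
    return None


if __name__ == "__main__":
    text = (
        "A full discussion of the BLURF protocol is available via\n"
        "anonymous FTP from frob.mit.edu as blurf-proto.tex\n"
        "in the directory /pub/protos."
    )
    ftp_id = make_ftp_id(text)
    if ftp_id is not None:
        print(retrieval_script(ftp_id))
    # Format names here are made up for the example.
    print(negotiate_format(["dvi", "text"], ["tar.Z", "text"]))
```

Here the negotiation is a single pass over the client's preference list; a real exchange might take several round trips or ask the user.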

      If we add a few buttons, we begin to encompass the scope of many existing
      systems:

      expand -- change the doc id to reference the "document"
          containing it. In the ftp example, rather than
          "get blurf-proto.tex," it would have "ls."
          Click again and get "cd ..; ls."
          Obviously, this operation depends on the access
          mechanism. For WAIS documents, the expansion of
          a document is the source that contains it.

      select -- narrow the document to some of its parts. For a
          text file, select some of the characters/paragraphs;
          for a WAIS source, select some of the documents.
          For a WWW node, select a neighboring node. For
          a directory, select some files.
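
For the FTP and plain-text cases, "expand" and "select" might look something like this. It is a sketch only: it assumes an FTP id boils down to a POSIX-style path and that paragraphs are separated by blank lines, and the function names are hypothetical.

```python
import posixpath
from typing import List, Sequence


def expand(path: str) -> str:
    """Return the path of the container of `path`; the root expands to itself."""
    parent = posixpath.dirname(path.rstrip("/"))
    return parent or "/"


def select_paragraphs(text: str, indexes: Sequence[int]) -> List[str]:
    """Narrow a text document to the paragraphs at the given positions."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [paragraphs[i] for i in indexes]


if __name__ == "__main__":
    path = "/pub/protos/blurf-proto.tex"
    # Each click on "expand" moves one level up, like "ls" and then "cd ..; ls".
    while path != "/":
        path = expand(path)
        print(path)
    print(select_paragraphs("one\n\ntwo\n\nthree", [0, 2]))
```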

      I guess my point is, let's think about how folks are going to use this
      document referencing technology, and let's see how well existing systems
      meet these needs.

      I guess some groups have come to the conclusion that the existing systems
      don't cut it. I'm beginning to agree.

      I guess we'd all agree that we should decide how we're going to use these
      doc id's and let that drive the design of the format, i.e. let's decide
      on the methods of this object before we decide on its representation.
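
One way to read "methods before representation" is as an abstract interface that each access mechanism implements in its own way. The sketch below assumes three of the operations listed earlier; `DocumentReference` and the in-memory stand-in are hypothetical and talk to no real server.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict


class DocumentReference(ABC):
    """Operations a doc id should support, whatever its representation."""

    @abstractmethod
    def verify(self) -> bool:
        """Does the id still reference an existing document?"""

    @abstractmethod
    def info(self) -> Dict[str, object]:
        """Miscellaneous info such as the size in bytes."""

    @abstractmethod
    def retrieve_raw(self) -> bytes:
        """The document in whatever format the server finds convenient."""


@dataclass
class InMemoryReference(DocumentReference):
    """Toy implementation backed by a dict instead of a network server."""

    store: Dict[str, bytes]
    name: str

    def verify(self) -> bool:
        return self.name in self.store

    def info(self) -> Dict[str, object]:
        return {"size": len(self.store[self.name])}

    def retrieve_raw(self) -> bytes:
        return self.store[self.name]


if __name__ == "__main__":
    ref = InMemoryReference({"blurf-proto.tex": b"\\documentstyle{article}"}, "blurf-proto.tex")
    if ref.verify():
        print(ref.info(), ref.retrieve_raw())
```

An FTP-backed or WAIS-backed class could then implement the same three methods without the client caring how the id is written down.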

      [an idea: for syntax, the WAIS folks chose LISP. What about using
      something akin to RFC-822 syntax? I think it works well: define a bunch
      of standard headers; require some, allow some, disregard others; allow
      free-form text in the body. examples:

      ISBN: 0-13-590126-X
      or
      MESSAGE-ID: usenet-thing
      or
      FTP-HOST: frob.mit.edu
      USER: anonymous
      or
      WAIS-PORT: 8001@...


      This would allow us to leverage all the email technology out there, plus
      the emerging multi-part mail format.
      (and it would allow me to use PERL on these beasties! :-)
      ]
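
To show how much existing mail tooling such a syntax would get for free, here is a sketch that reads a header-style doc id with Python's standard `email` package. The header names come from the examples above; the body text and the helper name `doc_id_headers` are made up for the illustration.

```python
from email import message_from_string
from typing import Dict


def doc_id_headers(text: str) -> Dict[str, str]:
    """Parse an RFC-822 style doc id and return its headers, names upper-cased."""
    message = message_from_string(text)
    return {name.upper(): value for name, value in message.items()}


if __name__ == "__main__":
    sample = (
        "FTP-HOST: frob.mit.edu\n"
        "USER: anonymous\n"
        "\n"
        "Free-form description of the document goes in the body.\n"
    )
    print(doc_id_headers(sample))
    # Header lookup on the parsed message is case-insensitive.
    print(message_from_string(sample)["ftp-host"])
```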

      Another thing I hope folks are keeping in mind: I don't think any one
      client can meet the information-retrieval needs of everybody. We need
      to support multiple platforms, for one thing. But I hope other folks are
      considering using multiple clients at the same time! I'd like to use
      one slick X-windows front end to the whole ball of wax, in some ways like
      emacs does for programming, and in some ways like the mac GUI does for
      office-productivity applications. But I'm going to be using POST mail
      servers, NNTP servers, WAIS servers, FTP servers, etc, and I don't
      expect one client to do it all. The crucial trick is to make all this
      intuitive and interactive, i.e. to support hypertext browsing, fulltext
      retrieval, USENET news reading, and maybe email correspondence, all in
      one environment. Let's get started!

      Dan

      ------- End of forwarded message -------