19587 Re: [rest-discuss] Media Type for resource archives

  • Jørn Wildt
    Jan 13, 2014
      Hmmm, and what about the TAR format? http://www.fileformat.info/format/tar/corion.htm
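
To make the TAR idea concrete, here is a minimal sketch using Python's standard `tarfile` module. TAR members are stored in the order they were written, so a manifest written first can be read first without seeking. The member name `manifest.json` and the manifest layout are assumptions made for illustration, not something proposed in this thread.

```python
import io
import json
import tarfile

MANIFEST_NAME = "manifest.json"  # hypothetical well-known name


def write_bundle(resources: dict[str, bytes]) -> bytes:
    """Write a tar archive whose first member is the manifest."""
    manifest = json.dumps({"entries": sorted(resources)}).encode("utf-8")
    members = [(MANIFEST_NAME, manifest)] + sorted(resources.items())
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_bundle(stream) -> tuple[dict, dict[str, bytes]]:
    """Read the archive front to back; mode "r|" never seeks backwards."""
    manifest = None
    bodies = {}
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            data = tar.extractfile(member).read()
            if manifest is None:
                if member.name != MANIFEST_NAME:
                    raise ValueError("manifest must be the first entry")
                manifest = json.loads(data)
            else:
                bodies[member.name] = data
    if manifest is None:
        raise ValueError("empty archive")
    return manifest, bodies
```

The stream mode (`"r|"`) is the relevant part: the reader only needs a `read()` method on its input, so the same loop should work on a socket or an HTTP response body rather than a file on disk.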


      On Mon, Jan 13, 2014 at 11:19 AM, Edward Summers <ehs@...> wrote:
      Hi Jan,

      Have you run across the WARC format yet [1]? It was built for serializing representations of resources for the Web archiving domain (Internet Archive, etc) but it seems like it might have some relevance for your use case? Basically a WARC is a concatenation of HTTP responses, but you can also layer in the requests that generated them, DNS lookups, etc. Each WARC record has an id, which amounts to a manifest, and you have the ability to layer in arbitrary metadata if necessary.

      WARC is ISO 28500:2009 [2] and ISO make you pay for the spec :-( But implementors generally know you can get the latest draft from before it went to ISO for free from the Bibliothèque nationale de France, who (along with a lot of other national libraries) also use it for their Web archiving efforts [3]. ArchiveTeam also have a decent list of software packages that support WARC [4].

      The ResourceSync effort might also be of interest, in particular their idea of a Resource Dump [5], although I believe work on ResourceSync is ongoing and may be in flux. Last time I looked, ResourceSync had added some extensions to Google Sitemaps that let you point at a file in a ZIP archive and list its media type, byte length, and hash … which sounds a bit like what you might want out of a manifest?

      I’d be interested to hear what you come up with, whether you use either of these options or not.


      [1] https://en.wikipedia.org/wiki/Web_ARChive
      [2] http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717
      [3] http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
      [4] http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem
      [5] http://www.openarchives.org/rs/0.9.1/resourcesync#ResourceDump

      On Jan 10, 2014, at 5:35 PM, Jan Algermissen <jan.algermissen@...> wrote:

      > Hi,
      > I am thinking about a media type for bundling together a bunch of resources[1] into a single file. With these files I want to store a manifest file.
      > One option would be to just use a zip-based format and a manifest file with a well-known name.
      > The problem with this is that useful stream processing of such a file can only be done by ensuring that the manifest is the first entry when unzipping. Apparently it requires some stunts to control the ordering of the zip entries, and who knows whether the other end uses a compatible implementation.
      > The solution would be to unpack to disk first and go from there. Not nice.
      > A possible alternative would be to use a multipart format where I can simply require the manifest to be the first part. Then just zip that file or rely on transfer encoding to reduce the bytes on the wire.
      > Nice things about that:
      > - Ordering is guaranteed
      > - Full support for per-part MIME headers
      > - Content-Length enables fast splitting of the parts
      > - cid: URIs make for natural, standard URI-references inside the file
      > - stream processing without temporary storage
      > I am interested in reactions to the two alternatives or any ideas beyond that.
      > Jan
      > [1] Well, obviously their entities at some point in time
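
Jan's multipart alternative, with the manifest required as the first part and `cid:` references into the remaining parts, could be sketched roughly as follows with Python's standard `email` package. The Content-ID values, the JSON manifest layout, and the choice of `multipart/related` are assumptions made for illustration only.

```python
import json
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart

MANIFEST_CID = "manifest@bundle.example"  # hypothetical Content-ID


def build_bundle(resources):
    """resources: list of (content_id, application subtype, payload bytes)."""
    manifest = {"parts": [{"href": f"cid:{cid}"} for cid, _, _ in resources]}
    bundle = MIMEMultipart("related")
    # The manifest is attached first, so it is the first part on the wire.
    first = MIMEApplication(json.dumps(manifest).encode("utf-8"), _subtype="json")
    first.add_header("Content-ID", f"<{MANIFEST_CID}>")
    bundle.attach(first)
    for cid, subtype, payload in resources:
        part = MIMEApplication(payload, _subtype=subtype)
        part.add_header("Content-ID", f"<{cid}>")
        bundle.attach(part)
    return bundle.as_bytes()


def read_bundle(raw):
    """Parse the bundle and insist that the manifest is the first part."""
    msg = message_from_bytes(raw)
    if not msg.is_multipart():
        raise ValueError("expected a multipart bundle")
    parts = msg.get_payload()
    if parts[0].get("Content-ID") != f"<{MANIFEST_CID}>":
        raise ValueError("manifest must be the first part")
    manifest = json.loads(parts[0].get_payload(decode=True))
    bodies = {
        p.get("Content-ID").strip("<>"): p.get_payload(decode=True)
        for p in parts[1:]
    }
    return manifest, bodies
```

Note that this parser buffers the whole message in memory and splits on the MIME boundary, so it only illustrates the layout and the ordering check. A reader that really avoids temporary storage would have to consume the parts incrementally.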