Versioning Bags

I’ve been doing some work recently with BagIt, a specification that gives a directory of files some additional semantics, including a manifest with checksums and some minimal metadata.  I like the specification a lot.  Recently I have become interested in being able to version a bag (“bag” is what you call a directory that conforms to BagIt).  CDL has a versioning specification called ReDD that I’m riffing on pretty liberally here.  Specifically, I want four things: the versioning information should be checksummed; the bag proper (i.e. the data/ directory) should contain the current (think HEAD) version of the bag; starting to version a bag should not add another layer of directories; and versions should be identified by timestamps instead of an internal numbering system.

BagIt puts the contents of the directory into a sub-directory called “data”.  In the mockup below, a directory called “reverse-deltas” sits at the same level as the data directory.  Inside it are directories named with a timestamp, each of which is a valid bag in its own right.  That matters because the tools for passing around bags can then be used to pass around reverse deltas and to keep track of their fixity.  Each reverse delta can be used to move the bag backward in time by deleting the files listed in delete.txt and adding all the files inside the add/ directory, if it is present:
|-bag-info.txt
|-manifest-md5.txt
|-data/
  |-file1.txt
  |-file2.txt
  |-file3.txt
|-reverse-deltas/
  |-README               # explains the deltas
  |-2009-10-21-23-59-59 
    |-bag-info.txt
    |-manifest-md5.txt   # changes are checksummed
    |-data/
      |-delete.txt       # files to be deleted
      |-add/             # files to be added
        |-file1.txt
        |-file2.txt
  |-2009-10-10-12-01-02  # another earlier version
    |-bag-info.txt
    |-manifest-md5.txt
    |-data/
      |-delete.txt
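
To make the rollback step concrete, here is a minimal Python sketch of applying one reverse delta to the layout in the mockup.  The function name apply_reverse_delta is hypothetical, and two details are assumptions the mockup does not settle: that delete.txt lists one path per line relative to the parent bag's data/ directory, and that add/ mirrors the layout of data/.  It does not regenerate the parent bag's manifest-md5.txt, which would have to happen afterwards.

```python
import shutil
from pathlib import Path


def apply_reverse_delta(bag_dir, delta_name):
    """Roll bag_dir/data back one step using reverse-deltas/<delta_name>."""
    bag = Path(bag_dir)
    data = bag / "data"
    delta_data = bag / "reverse-deltas" / delta_name / "data"

    # Assumption: delete.txt holds one path per line, relative to data/.
    delete_list = delta_data / "delete.txt"
    if delete_list.is_file():
        for line in delete_list.read_text().splitlines():
            rel = line.strip()
            if rel:
                (data / rel).unlink(missing_ok=True)

    # Assumption: add/ mirrors the directory layout of data/.
    add_dir = delta_data / "add"
    if add_dir.is_dir():
        for src in add_dir.rglob("*"):
            if src.is_file():
                dest = data / src.relative_to(add_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
```

Calling apply_reverse_delta("mybag", "2009-10-21-23-59-59") would delete whatever that delta's delete.txt names and then copy file1.txt and file2.txt from its add/ directory over the current ones.  Deleting first and adding second means a file that appears in both places ends up with its older content.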

In terms of the toolchain, I think this presents some interesting possibilities.  The most recent version of a bag is always in the data/ directory, and can be grabbed with any BagIt-compliant tool.  BagIt doesn’t place any restrictions on the other directories at the top level.

However, if one were going to build a service on top of a versioned bag, it would be fairly straightforward to provide an identifier scheme that would get back to any previous version of a bag (e.g. http://davidbrunton.com/bag/1?date=2009-10-21).  One could even assume an earliest version that specifies deletion of all the files, possibly yielding a 404.
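
As a rough sketch of how such a service might map a requested date onto reverse deltas, the Python below picks which delta directories to apply and in what order.  The function name deltas_to_apply is hypothetical.  It also assumes a reading the mockup leaves open: that a delta named with timestamp T restores the bag as it stood at T, so reaching a requested moment means applying, newest first, every delta stamped at or after that moment.

```python
from datetime import datetime

# Timestamp layout used for the delta directory names in the mockup.
STAMP = "%Y-%m-%d-%H-%M-%S"


def deltas_to_apply(delta_names, requested):
    """Return delta names to apply, newest first, to reach `requested`."""
    stamped = []
    for name in delta_names:
        try:
            stamped.append((datetime.strptime(name, STAMP), name))
        except ValueError:
            continue  # skip entries that are not deltas, e.g. README
    stamped.sort(reverse=True)  # newest delta first
    # Assumption: a delta stamped T restores the bag as it stood at T.
    return [name for stamp, name in stamped if stamp >= requested]
```

With the two deltas from the mockup, a request for 2009-10-21 (taken as midnight at the start of that day) selects only 2009-10-21-23-59-59, while a request later than every delta selects nothing and the service would simply serve data/ as it is.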
I’m curious to hear what use cases this meets and which it fails to meet.  Anyone?

6 thoughts on “Versioning Bags”

  1. I'd love to see you post this to the digital-curation Google group if you haven't already. There've been discussions there about both BagIt and ReDD during the past 5-6 months or so.

    Interesting idea, though I have too many meetings today to chew on it.

    One question: by "changes are checksummed", do you just mean that every file under the data directory is checksummed? That is, its checksum behavior is the exact same as a vanilla bag?

  2. The latter: "checksum behavior is the exact same as a vanilla bag" – with the caveat that it's "not part of" the parent bag. I.e., you need to interact with it intentionally.

  3. I like this idea a lot David. I guess it's important to point out that this isn't an academic exercise, but something we have a real need for at $work :-)

    I would be interested to know to what degree the CDL folks are using ReDD already, and if they (and others) would be interested in collaborating on an IETF RFC.

  4. FWIW, it seems the CDL folks are still working out the kinks on ReDD, DFlat, etc. So my sense, which may be totally off, is that they'd be receptive to your diffs.

    But only to the extent that they do not know how much you suck eggs.
