Vote update timing

  • Neil Drumm
    Message 1 of 4 , Dec 13, 2007
      What is the schedule for updating roll vote XML files, as in
      /data/us/110/rolls/...? In particular, when is the bill element added?

      Our algorithm currently goes:
      1. Get http://www.govtrack.us/congress/votes_download_xml.xpd and look
      for new votes.
      2. For each new vote
      2a. Get the roll vote XML file to determine what bill to update.
      2b. Fully update the bill.

      A few weeks ago we missed a couple votes, but they worked when the
      bill update was manually triggered. I did not catch the XML quickly
      enough to verify, but I think the bill element might have been
      missing, causing step 2a to fail. Or at least, that is the simplest
      explanation.
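
A minimal sketch of step 2a that reports a missing bill element instead of failing silently, so the vote can be retried on a later run. The element and attribute names (`bill`, `session`, `type`, `number`), the sample XML, and the function names are assumptions for illustration; they are not taken from the actual roll vote files.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples; the real roll vote XML layout is assumed here.
SAMPLE_WITH_BILL = '<roll><bill session="110" type="h" number="6"/></roll>'
SAMPLE_WITHOUT_BILL = '<roll></roll>'


def bill_for_roll(xml_text):
    """Return (session, type, number) for the vote's bill, or None if absent."""
    root = ET.fromstring(xml_text)
    bill = root.find("bill")
    if bill is None:
        return None
    return (bill.get("session"), bill.get("type"), bill.get("number"))


def process_votes(votes):
    """Split (vote_id, xml_text) pairs into bills to update and votes to retry."""
    to_update, retry = [], []
    for vote_id, xml_text in votes:
        bill = bill_for_roll(xml_text)
        if bill is None:
            # No bill element yet: remember the vote rather than dropping it.
            retry.append(vote_id)
        else:
            to_update.append(bill)
    return to_update, retry
```

Keeping the retry list between runs would let a vote that appeared without its bill element be picked up automatically once the file is corrected.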

      --
      Neil Drumm
      http://delocalizedham.com
    • Josh Tauberer
      Message 2 of 4 , Dec 13, 2007
        Neil Drumm wrote:
        > What is the schedule for updating roll vote XML files, as in
        > /data/us/110/rolls/...? In particular, when is the bill element added?

        The files are updated every 15 min. now, or something.

        The bill element comes in immediately --- it is information detected
        from the (official) source data file. So if it's missing, the source
        data page may have a mistake, or there could be a parsing mistake on my
        end. If you check the source and see a bill clearly identified but no
        bill element, let me know.

        > Our algorithm currently goes:
        > 1. Get http://www.govtrack.us/congress/votes_download_xml.xpd and look
        > for new votes.

        Wow, you could do that, but if you're going to ping by HTTP regularly,
        I'd much prefer you just fetch http://www.govtrack.us/data/us/110/rolls
        and parse the directory listing, since it involves much less processor
        overhead.
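
A rough sketch of what parsing the directory listing could look like, using only the Python standard library. The sample HTML and the file names in it are made up for illustration, and the listing markup on the real server may differ; fetching the page (for example with `urllib.request.urlopen`) is left out.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect href targets ending in .xml from a directory listing page."""

    def __init__(self):
        super().__init__()
        self.files = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip sort links and the parent-directory link; keep only XML files.
        if href and href.endswith(".xml"):
            self.files.append(href)


def xml_files_in_listing(html_text):
    """Return the .xml file names linked from the listing, in page order."""
    parser = ListingParser()
    parser.feed(html_text)
    return parser.files


def new_votes(html_text, already_seen):
    """Return listed files that are not in the already_seen set."""
    return [f for f in xml_files_in_listing(html_text) if f not in already_seen]


# Hypothetical listing fragment (file names are invented).
SAMPLE = """<html><body><pre>
<a href="?C=N;O=D">Name</a>
<a href="/data/us/110/">Parent Directory</a>
<a href="h2007-1.xml">h2007-1.xml</a> 04-Jan-2007 12:00 3.1K
<a href="s2007-2.xml">s2007-2.xml</a> 05-Jan-2007 09:30 2.8K
</pre></body></html>"""
```

Calling `new_votes(SAMPLE, {"h2007-1.xml"})` returns only the file not seen before.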

        > 2. For each new vote
        > 2a. Get the roll vote XML file to determine what bill to update.
        > 2b. Fully update the bill.

        Again, your best bet for updating bills is, besides rsync, parsing the
        directory listing at http://www.govtrack.us/data/us/110/bills.

        Starting very soon I think I am going to cut down severely on all of my
        government-transparency time, so I would normally offer to find a better
        solution than parsing directory listing pages, but now I won't.

        If you or anyone wanted to offer a Perl script that I could put in place
        to output, for instance, a machine-readable directory listing with
        last-modified times, I could use that.

        > A few weeks ago we missed a couple votes, but they worked when the
        > bill update was manually triggered. I did not catch the XML quickly
        > enough to verify, but I think the bill element might have been
        > missing, causing step 2a to fail. Or at least, that is the simplest
        > explanation.

        I'm not sure what might have happened. It's possible the vote appeared
        before the bill did.

        --
        - Josh Tauberer
        - GovTrack.us

        http://razor.occams.info

        "Yields falsehood when preceded by its quotation! Yields
        falsehood when preceded by its quotation!" Achilles to
        Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)
      • Josh Tauberer
        Message 3 of 4 , Dec 13, 2007
          Sam Smith wrote:
          > On Thu, 13 Dec 2007, Josh Tauberer wrote:
          >> If you or anyone wanted to offer a Perl script that I could put in place
          >> to output, for instance, a machine-readable directory listing with
          >> last-modified times, I could use that.
          >
          > cron this at the top level:
          >
          > ls -lR | gzip > ls-lR.gz

          I'm not sure that really improves on the Apache directory listing. The
          time format, for instance, varies depending on the age of the file.
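
One way to avoid the varying time format would be a small script that writes every last-modified time in a single fixed format. The sketch below is in Python rather than the Perl that was asked for, and the function name and the tab-separated output layout are assumptions, not an agreed format.

```python
import os
from datetime import datetime, timezone


def listing_lines(top):
    """Yield 'UTC-mtime<TAB>relative/path' for every file under top."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # sort in place so the traversal order is deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            # Always the same ISO 8601 UTC format, whatever the file's age.
            stamp = datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ"
            )
            rel = os.path.relpath(path, top).replace(os.sep, "/")
            yield f"{stamp}\t{rel}"
```

Run from cron, the lines from `listing_lines(".")` could be written to a file at the top level, and a client would only need to split each line on the tab.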

          --
          - Josh Tauberer
          - GovTrack.us

          http://razor.occams.info

          "Yields falsehood when preceded by its quotation! Yields
          falsehood when preceded by its quotation!" Achilles to
          Tortoise (in "Gödel, Escher, Bach" by Douglas Hofstadter)
        • Corey Gilmore
          Message 4 of 4 , Dec 13, 2007
            --- In govtrack@yahoogroups.com, Josh Tauberer <tauberer@...> wrote:
            >
            > Neil Drumm wrote:
            <snip>
            > > Our algorithm currently goes:
            > > 1. Get http://www.govtrack.us/congress/votes_download_xml.xpd and look
            > > for new votes.
            >
            > Wow, you could do that, but if you're going to ping by HTTP regularly,
            > I'd much prefer you just fetch http://www.govtrack.us/data/us/110/rolls
            > and parse the directory listing, since it involves much less processor
            > overhead.
            >
            > > 2. For each new vote
            > > 2a. Get the roll vote XML file to determine what bill to update.
            > > 2b. Fully update the bill.
            >
            > Again, your best bet for updating bills is, besides rsync, parsing the
            > directory listing at http://www.govtrack.us/data/us/110/bills.
            >
            > Starting very soon I think I am going to cut down severely on all of my
            > government-transparency time, so I would normally offer to find a better
            > solution than parsing directory listing pages, but now I won't.
            >
            > If you or anyone wanted to offer a Perl script that I could put in place
            > to output, for instance, a machine-readable directory listing with
            > last-modified times, I could use that.
            >

            If it's at all possible for you I'd highly recommend switching to
            rsync. My process is:
            * Run rsync on the votes dir, get the latest votes
            * Load a copy of my local votes directory listing into memory
            (www.php.net/scandir)
            * Grab the IDs of votes I've imported from my db (essentially select
            concat(vote_id, '.xml') as vote from votes) and put them into an array
            * $import = array_diff($files, $votes); (www.php.net/array_diff)
            * Process the list of votes to import. It's not something you're
            running that often, and array_diff isn't that CPU intensive with the
            relatively few votes you see in a typical year.
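
The diff step above can be sketched in Python rather than PHP as a plain set difference. The directory path and the list of imported vote IDs are placeholders; in practice the IDs would come from the database query described above, after rsync has refreshed the local directory.

```python
import os


def votes_to_import(votes_dir, imported_ids):
    """Return XML file names present locally but not yet imported."""
    # Local copy of the votes directory, as filled in by rsync.
    local_files = {n for n in os.listdir(votes_dir) if n.endswith(".xml")}
    # Same naming as concat(vote_id, '.xml') in the query.
    imported_files = {f"{vote_id}.xml" for vote_id in imported_ids}
    return sorted(local_files - imported_files)
```

A set difference like this stays cheap even for every vote in a session.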