259 Re: [govtrack] Two Project Ideas [bill versioning]
- Nov 9 4:04 AM
yahoogroups-backupemail@... wrote:
> On Wed, 8 Nov 2006, Joshua Tauberer / GovTrack.us wrote:
>> The first step is to convert it to text -- you can see the text versions
>> (from "pdftotext -layout -nopgbrk") that GovTrack makes at the same
>> addresses, just replace .pdf with .txt. Without "-layout" you get a
>> differently formatted text version that could be more useful for this.
> there's a fork of pdftotext (also free) which has very
> useful -html and -xml output flags which might be a
> better place to start from if you don't have tools already.
> http://pdftohtml.sourceforge.net/

Ahha, I think that could be useful. Thanks for the pointer. (It's
actually been integrated in the poppler-utils RPM for Fedora Core 6, if
that's useful for anyone.)
For reference, the two PDFs in HTML with pdftohtml are:
It's not getting the alignment of lines quite right, splitting up things
that are on the same line, but that might not affect the task, since
differing line breaks between versions have to be ignored anyway.
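
To illustrate the point about ignoring line breaks, here is a rough
sketch in Python of comparing two text versions word by word instead of
line by line. It uses the standard difflib module. The function names
(words, word_diff) and the sample sentences are made up for
illustration and are not anything GovTrack actually runs.

```python
import difflib


def words(text):
    """Split text on any whitespace, so line breaks and spacing are discarded."""
    return text.split()


def word_diff(old_text, new_text):
    """Return (tag, old_words, new_words) for each run of words that differs."""
    old, new = words(old_text), words(new_text)
    # autojunk=False keeps difflib from treating frequent words as junk
    # in long documents.
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes


if __name__ == "__main__":
    # Hypothetical sample text, wrapped differently in each version.
    old = "The Secretary shall submit\na report to Congress."
    new = "The Secretary shall\nsubmit an annual report to Congress."
    for change in word_diff(old, new):
        print(change)
```

Because the comparison works on a flat list of words, two versions that
differ only in where the lines wrap produce no changes at all, which is
the behavior wanted here.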
- Joshua Tauberer
"Strike up the klezmer and start acting like a man. You're
about to have a truth-mitzvah." -- The Colbert Report