Re: "ocaml_beginners"::[] line_input from gzipped files

  • Richard Jones
    Message 1 of 4 , Dec 1, 2006
      On Thu, Nov 30, 2006 at 03:05:07PM +0200, Johann Spies wrote:
      > I want to be able to parse very large .gz and .bz2 files and it would be
      > nice if I can avoid (b|g)unzipping them before parsing them on a
      > line-input basis. There is no line_in function in the Zip-libraries as
      > far as I can see.
      >
      > I know it is easy to handle this when the file can be read into memory
      > as a string which can then be split into a string list with "\n"
      > as separator. But how do I do that when I read a file into a string
      > buffer which does not contain the whole file?

      As with the first reply, the answer is to use Unix.open_process_in to
      g/bunzip the file.

      If you take a look at our Weblogs library
      (http://merjis.com/developers/weblogs) you'll see some code in there I
      wrote which can sniff the file type of an external file and then
      automatically read the file in line-by-line, either through an
      external g/bunzip program, or directly if the file isn't compressed.
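
A minimal sketch of that approach, not the actual Weblogs code: it sniffs the first bytes of the file, then reads lines either through a pipe from an external decompressor or directly. The function names (decompressor_of_magic, read_magic, iter_lines) are made up for illustration, and it assumes gzip and bzip2 are on the PATH and that the program is linked against the unix library.

```ocaml
(* Guess a decompression command from the first bytes of a file.
   gzip data starts with bytes 0x1f 0x8b; bzip2 data starts with "BZh". *)
let decompressor_of_magic (magic : string) : string option =
  let starts_with prefix =
    String.length magic >= String.length prefix
    && String.sub magic 0 (String.length prefix) = prefix
  in
  if starts_with "\x1f\x8b" then Some "gzip -dc"
  else if starts_with "BZh" then Some "bzip2 -dc"
  else None

(* Read up to 3 bytes from the start of the file. *)
let read_magic (path : string) : string =
  let ic = open_in_bin path in
  let buf = Bytes.create 3 in
  let n = input ic buf 0 3 in
  close_in ic;
  Bytes.sub_string buf 0 n

(* Call [f] on every line of [path], decompressing through a pipe
   when the file looks like gzip or bzip2. *)
let iter_lines (f : string -> unit) (path : string) : unit =
  let read_all ic =
    try
      while true do f (input_line ic) done
    with End_of_file -> ()
  in
  match decompressor_of_magic (read_magic path) with
  | None ->
      (* Not compressed (as far as we can tell): read it directly. *)
      let ic = open_in path in
      read_all ic;
      close_in ic
  | Some cmd ->
      (* Filename.quote protects spaces and shell metacharacters. *)
      let ic = Unix.open_process_in (cmd ^ " " ^ Filename.quote path) in
      read_all ic;
      ignore (Unix.close_process_in ic)
```

Only one line is held in memory at a time, so the size of the file does not matter. A call such as iter_lines print_endline "access.log.gz" would echo the decompressed lines; the file name is only an example.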

      Rich.

      --
      Richard Jones, CTO Merjis Ltd.
      Merjis - web marketing and technology - http://merjis.com
      Internet Marketing and AdWords courses - http://merjis.com/courses - NEW!
      Merjis blog - http://blog.merjis.com - NEW!
    • Johann Spies
      Message 2 of 4 , Dec 4, 2006
        On Fri, Dec 01, 2006 at 09:12:10AM +0000, Richard Jones wrote:
        > On Thu, Nov 30, 2006 at 03:05:07PM +0200, Johann Spies wrote:
        > > I want to be able to parse very large .gz and .bz2 files and it would be
        > > nice if I can avoid (b|g)unzipping them before parsing them on a
        > > line-input basis. There is no line_in function in the Zip-libraries as
        > > far as I can see.
        > >
        > > I know it is easy to handle this when the file can be read into memory
        > > as a string which can then be split into a string list with "\n"
        > > as separator. But how do I do that when I read a file into a string
        > > buffer which does not contain the whole file?
        >
        > As with the first reply, the answer is to use Unix.open_process_in to
        > g/bunzip the file.
        >
        > If you take a look at our Weblogs library
        > (http://merjis.com/developers/weblogs) you'll see some code in there I
        > wrote which can sniff the file type of an external file and then
        > automatically read the file in line-by-line, either through an
        > external g/bunzip program, or directly if the file isn't compressed.

        Thanks Rich and all the others who replied. I was looking at
        Unix.open_process_in but was uncertain how to use it in this
        case. I will look at your code.

        It would be easy to just read the files from the command line with
        something like "zcat *.gz | ocamlprogram_to_parse_files" but the header of
        each log file is different and I need to parse that separately to be
        able to identify the correct data.
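
One way to keep each file's header separate is to open one pipe per file instead of concatenating everything with zcat. The sketch below assumes, purely for illustration, that the header is the first line of each file; with_header and process_gz are hypothetical names, gzip is assumed to be on the PATH, and the program must be linked against the unix library.

```ocaml
(* Read the first line of [ic] as the header, parse it, then pass the
   parsed header along with every remaining line to [handle_line]. *)
let with_header (ic : in_channel)
    (parse_header : string -> 'h)
    (handle_line : 'h -> string -> unit) : unit =
  match input_line ic with
  | exception End_of_file -> ()  (* empty input: nothing to do *)
  | header_line ->
      let header = parse_header header_line in
      (try
         while true do handle_line header (input_line ic) done
       with End_of_file -> ())

(* Decompress one .gz file through its own pipe, so that its header
   is seen before any of its data lines. *)
let process_gz parse_header handle_line (path : string) : unit =
  let ic = Unix.open_process_in ("gzip -dc " ^ Filename.quote path) in
  with_header ic parse_header handle_line;
  ignore (Unix.close_process_in ic)
```

Calling process_gz once per file name (for example with List.iter over the command-line arguments) gives each log its own header, at the cost of starting one gzip process per file.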

        Regards
        Johann
        --
        Johann Spies Telefoon: 021-808 4036
        Informasietegnologie, Universiteit van Stellenbosch

        "Behold, happy is the man whom God correcteth.
        Therefore despise thou not the chastening of the
        Almighty." Job 5:17