
Looking for advice/directions regarding the design of a library

  • rixed@happyleptic.org
Message 1 of 4, Jun 4, 2012
      Hello!

I have a very simple problem to solve: given some CSV files, read the
records, perform some simple functions over them, and finally store
the resulting values in some files, with these constraints:

- it should run as fast as possible
- it should be easy to define the incoming format, the operations
to perform, and at what stage(s) to store the records, either
directly as OCaml code or in a dedicated config language
- possible operations on fields must include SQL-like
aggregate/group-by functions

The purpose is to store data samples (CSV of various basic types) in
a custom database with various levels of detail (for instance: raw
samples, averages per minute, averages per 20 minutes with less
information, and averages per day with even less information).

My initial plan was to implement every operation (starting from consing
of basic values up to computing the aggregation functions) as a functor,
and let the user configure everything by composing those functors;
the database would then be a succession of stages, each one taking a
value and either serializing it to some file or feeding it as input to
some function.

      For instance:

(* Some straightforward functors for composing types: *)
module type TYPE = sig type t ... (* serializers... *) end
module ConsOf (T1 : TYPE) (T2 : TYPE) = struct type t = T1.t * T2.t ... end

(* The input CSV values, composed of an int, a date, a string and an int: *)
module InputType =
  ConsOf (Int) (ConsOf (Timestamp) (ConsOf (String) (Int)))

module type GROUPBY = sig
  module InType : TYPE
  module OutType : TYPE
  ...
end
(* and module type AGGR is almost the same *)

(* The first 'group-by' function: round the timestamp to 1 minute
   and group by only the first two fields: *)
module Group1 : GROUPBY with module InType = InputType =
  GroupConsOf (Keep) (GroupConsOf (TSRound) (GroupConsOf (Skip) (Skip)))

(* and the first aggregate function: averaging the last int and
   ignoring the string: *)
module Aggr1 : AGGR with module InType = InputType =
  AggrConsOf (Skip) (AggrConsOf (Skip) (AggrConsOf (Skip) (Avg (Int))))

(* The result of the group-by + aggregation has type
   (Group1.OutType.t * Aggr1.OutType.t). *)
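
(For concreteness, here is a minimal compilable version of the type
combinators, with a single to_string serializer standing in for the real
serializers; this is a sketch, not my actual code:)

module type TYPE = sig
  type t
  val to_string : t -> string   (* stand-in for the real serializers *)
end

module Int : TYPE with type t = int = struct
  type t = int
  let to_string = string_of_int
end

module ConsOf (T1 : TYPE) (T2 : TYPE)
  : TYPE with type t = T1.t * T2.t = struct
  type t = T1.t * T2.t
  let to_string (a, b) = T1.to_string a ^ "," ^ T2.to_string b
end

With these definitions, InputType.t above unifies with
int * (timestamp * (string * int)), which is exactly the nested shape
discussed below.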

This basically works, but the configuration is not very user friendly,
since after the first stage the record type, which started as:

int * (timestamp * (string * int))

has evolved into the more subtle:

(int * (timestamp * (unit * unit))) * (unit * (unit * (unit * int)))

which itself should be fed into Group2 and Aggr2 to compute the second
level of detail:

module Group2 : GROUPBY with module InType = ... =
  GroupConsOf
    (GroupConsOf (Keep) (GroupConsOf (TSRound2) (GroupConsOf (Skip) (Skip))))
    (Skip)

      and so on.

Figuring out how to compose the functors so that a stage's InType matches
the OutType of the stage above requires a lot of guessing (at this point,
error messages are not very helpful, as they give no more detail than
"This.InType is not included in That.OutType"). And this is a very
simplified example (in real life, records could have tens of fields).
I've used some tricks to reduce the size and number of required functor
applications (such as truncating tuples instead of skipping up to the
end), but this does not help much. Also, I'm a little worried by the
fact that accessing the Nth field requires dereferencing a chain of n
pointers.

So it seems to me that the solution would be to flatten the record type,
i.e. to have a*b*c*d instead of a*(b*(c*d)). But I can't possibly write
by hand all the composition functors required for so many fields (the
equivalent of BatTuple for tens of fields, anyone?), so it seems I'm
left with only one option: code generation (luckily at compile time),
but for some reason it feels like overkill for such a simple problem.
I probably missed the good ideas, didn't I?
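
(To illustrate what flattening means here: a flat combinator for arity 4
would look something like the hypothetical Tuple4Of below, and one such
module per arity is exactly the boilerplate that code generation would
have to produce:)

(* Hypothetical flat combinator for arity 4, reusing the TYPE signature
   above; fields become direct components of one tuple instead of a
   chain of nested pairs. *)
module Tuple4Of (A : TYPE) (B : TYPE) (C : TYPE) (D : TYPE)
  : TYPE with type t = A.t * B.t * C.t * D.t = struct
  type t = A.t * B.t * C.t * D.t
  let to_string (a, b, c, d) =
    String.concat ","
      [A.to_string a; B.to_string b; C.to_string c; D.to_string d]
end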
    • Gabriel Scherer
Message 2 of 4, Jun 5, 2012
I wouldn't look for a single idea to work for your three stages of parsing,
computing and output.
Couldn't you more simply:
1. define parsers for your input format, essentially producing a stream of
rows of data
2. define a library of "computations" on streams of rows
3. define printers on streams of results, producing output as soon as possible

I'm thinking of something like this:

type input = {
  name : string;
  age : int;
  score : float;
}

let input_row = Input.parse_row (fun row ->
  let name = Input.string row in
  let age = Input.int row in
  let score = Input.float row in
  { name; age; score }
)

type output1 = input
type output2 = { age2 : int; avg_score2 : float }

let print1 =
  let header ppf =
    Print.field ppf "name";
    Print.field ppf "age";
    Print.field ppf "score"
  in
  let row ppf o =
    Print.string ppf o.name;
    Print.int ppf o.age;
    Print.float ppf o.score
  in
  (header, row)

let print2 =
  let header ppf =
    Print.field ppf "age";
    Print.field ppf "avg_score"
  in
  let row ppf o =
    Print.int ppf o.age2;
    Print.float ppf o.avg_score2
  in
  (header, row)

let process1 = Process.copy
let process2 =
  Process.average
    ~key:(fun o -> o.age)
    ~value:(fun o -> o.score)   (* 'val' is an OCaml keyword, hence 'value' *)
    ~return:(fun key avg -> { age2 = key; avg_score2 = avg })

let () =
  Process.traverse (Input.parse_file "foo.bar" input_row) [
    (process1, print1);
    (process2, print2);
  ]

Process.traverse takes an input and a list of (process, result_printer)
pairs, one for each different computation -- the idea is to compute several
different results (streams of rows) in a single traversal of the input.
Process descriptions are built using an imaginary DSL, and the "traverse"
function is free to handle consumption of the input and production of the
output in an efficient way. For example, it could do buffered processing:
request a set of input rows from the file at one time, and run each process
on that set (instead of paying the cost of traversing the process list on
each row). If some processes can emit output early (this is trivially the
case for the "copy" process, but also for "average" if you assume that the
input is sorted by the grouping key -- you may want to have versions of
"average" that do and do not make this assumption), traverse should do so
(again, in a buffered way) instead of accumulating the results in memory.
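
(To make the single-traversal point concrete, here is one rough way
traverse could be typed, assuming each (process, printer) pair has been
packaged beforehand as a plain consumer of input rows, so a single list
can hold computations whose output types differ -- all names invented:)

type 'row consumer = {
  feed : 'row -> unit;     (* called on every input row *)
  finish : unit -> unit;   (* flush whatever is still buffered *)
}

let traverse (rows : 'row list) (consumers : 'row consumer list) =
  List.iter (fun row -> List.iter (fun c -> c.feed row) consumers) rows;
  List.iter (fun c -> c.finish ()) consumers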

Have you considered more drastic solutions, like putting all those CSV
files in a small SQLite database and running SQL queries on it? I also
recently heard of a specialized data processing language oriented towards
raw speed, r17. Never tested it though.
www.rseventeen.com/

      • rixed@happyleptic.org
Message 3 of 4, Jun 5, 2012
Computing several results from the same input set is required, but the
interesting requirement is that computations can be piped into one another,
so that one can compute the second level of detail based on the output of
the first level of detail (and so on).

So, to adapt your example, I guess the result_printer functions should
return rows instead of merely printing (deferring the actual write to a
dedicated stage that returns no rows but writes as a side effect). But then
it is no longer typeable (as is).
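
(The heterogeneous list is what breaks: composing stages into closures
before they ever go into a list typechecks fine, as in this rough sketch,
where all names are invented and 'None' stands for "group not complete
yet":)

type ('a, 'b) stage = 'a -> 'b option

(* Composing two stages hides the intermediate row type, so only the
   end-to-end type remains visible. *)
let pipe (f : ('a, 'b) stage) (g : ('b, 'c) stage) : ('a, 'c) stage =
  fun x -> match f x with
    | None -> None
    | Some y -> g y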

The alternative is to pass the rows from one stage to the next as strings
(serializing with the printer and deserializing with the parse_row
function), but I'd really like to avoid resorting to this trick to pass
data from stage to stage: first for better performance (although I haven't
checked the generated code, I believe the functor method allows the inliner
to make almost all non-essential code vanish), and then for type safety
(since users will be required to write these compositions, I'd rather be
sure they get them right).

I have the same concern with the Process library of computations, as I fail
to see how Process.average could work for any type of key, input and value.
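
(As far as I can tell, polymorphism in the row and key types would come
for free from the ~key and ~value projections; it is the value type that
has to be pinned down, e.g. to float, or abstracted behind explicit
numeric operations. A rough sketch, assuming unsorted input:)

(* Polymorphic in the row and key types; the value type is fixed to float
   since the sum/divide needs numeric operations (passing those in
   explicitly would recover full polymorphism). *)
let average ~key ~value ~return rows =
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun row ->
      let k = key row in
      let sum, n = try Hashtbl.find tbl k with Not_found -> (0., 0) in
      Hashtbl.replace tbl k (sum +. value row, n + 1))
    rows;
  Hashtbl.fold
    (fun k (sum, n) acc -> return k (sum /. float_of_int n) :: acc)
    tbl []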

As for using a regular relational database: I'm running one right now, and
the point is to replace it, since the data load now exceeds what's
acceptable for such tools.

I had never looked at r17; thank you for the link. At first sight it does
not seem to handle the various levels of detail I want (it's targeted at
data mining, while I also want to be able to store a long history of
degraded data, as with round-robin databases), but I can probably build
one r17 DB per level of detail. I will probably give it a try, but I'd
still like to find an elegant solution in OCaml, as this proof-of-concept
project is also meant to show OCaml's strengths ;)
        • rixed@happyleptic.org
Message 4 of 4, Jun 6, 2012
I've implemented a kind of mix between the previous functorial mess and
your solution based on explicit functions (datatypes are still functorized,
so that the computations are built from them), and I replaced the
supposedly slow chain of Cons by flat tuples; I can't tell the difference
performance-wise. This is more lines of code than the previous attempt,
but so much easier to grasp in the end.

I'm also testing r17, and it's 6 times faster than my library at
sequentially scanning a huge load of data, but I suspect that's mostly
because r17 uses faster IO than the OCaml stdlib's. Anyway, a very nice
library with a nice API. I just can't read C++ code (there are some bits
of interesting artisanal code generation as well; see for instance:
https://raw.github.com/matthewnourse/r17/6c4f08b47d534bcc28bc165211edb2ba7ee03cd6/np1/preproc.hpp
:))