## Re: "ocaml_beginners"::[] A few beginner's questions

Expand Messages
• ... What s wrong with match x with ... ? Since, if the first two guard conditions are not satisfied, n must be greater than 20. -- Matt Gushee
Message 1 of 16 , Jun 8, 2004
• 0 Attachment
On Tue, Jun 08, 2004 at 08:23:11PM -0700, Bradford Carpenter wrote:

> Also got an error I'm curious about. If you use pattern matching with
> guards for a situation like this:
>
> match x with
> | n when (n <= 10) -> dothis
> | n when (n > 10 && n <= 20) -> dothat
> | n when (n > 20) -> dotheother
>
> it will not compile, giving an error message about "Bad style: no
> values without guards" or something like that. Is there a way to
> specify this set of cases *without* using guards?

What's wrong with

match x with
| n when (n <= 10) -> dothis
| n when (n > 10 && n <= 20) -> dothat
| n -> dotheother

? Since, if the first two guard conditions are not satisfied, n must be
greater than 20.

--
Matt Gushee When a nation follows the Way,
Englewood, Colorado, USA Horses bear manure through
mgushee@... its fields;
http://www.havenrock.com/ When a nation ignores the Way,
Horses bear soldiers through
its streets.

--Lao Tzu (Peter Merel, trans.)
• ... Of course, you re right. Was too busy trying to actually match the specific conditions to step back and consider the fall-through logic, I guess. Thanks
Message 2 of 16 , Jun 8, 2004
• 0 Attachment
On Tue, 08 Jun 2004 21:35:06 -0600, Matt Gushee wrote:

>> Also got an error I'm curious about. If you use pattern matching with
>> guards for a situation like this:
>>
>> match x with
>>> n when (n <= 10) -> dothis
>>> n when (n > 10 && n <= 20) -> dothat
>>> n when (n > 20) -> dotheother
>>
>> it will not compile, giving an error message about "Bad style: no
>> values without guards" or something like that. Is there a way to
>> specify this set of cases *without* using guards?
>
> What's wrong with
>
> match x with
> | n when (n <= 10) -> dothis
> | n when (n > 10 && n <= 20) -> dothat
> | n -> dotheother
>
> ? Since, if the first two guard conditions are not satisfied, n must be
> greater than 20.

Of course, you're right. Was too busy trying to actually match the
specific conditions to step back and consider the fall-through logic, I
guess. Thanks for providing some perspective!

Best regards,
• On Tue, 8 Jun 2004, Bradford Carpenter wrote: ... The pattern matching proceeds in order - if a given case satisifies two or more patterns, the first one
Message 3 of 16 , Jun 8, 2004
• 0 Attachment
On Tue, 8 Jun 2004, Bradford Carpenter wrote:

> Also got an error I'm curious about. If you use pattern matching with
> guards for a situation like this:
>
> match x with
> | n when (n <= 10) -> dothis
> | n when (n > 10 && n <= 20) -> dothat
> | n when (n > 20) -> dotheother

The pattern matching proceeds "in order"- if a given case satisifies two
or more patterns, the first one listed takes precendence. So the above
function should be written:
match x with
| n when (n <= 10) -> dothis
| n when (n <= 20) -> dothat
| n (* for all other n *) -> dotheother

With which Ocaml has no problems.

> Hope these questions are not too basic!

Considering that I can only answer one of the three off the top of my
head, no, they weren't too basic.

--
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
- Gene Spafford
Brian
• From: Bradford Carpenter Subject: ocaml_beginners ::[] A few beginner s questions Date: Tue, 8 Jun 2004 20:23:11 -0700 ... In OCaml, The whole
Message 4 of 16 , Jun 9, 2004
• 0 Attachment
Subject: "ocaml_beginners"::[] A few beginner's questions
Date: Tue, 8 Jun 2004 20:23:11 -0700

> After starting to write some functions on my own, discovered existing
> versions in various libraries and decided to try using those. After
> adding two library references (ExtLib and Nums) to the compile line,
> the size of my final native-code binary increased from 260K to 468K,
> this to use only three functions contained in the libraries. Simply
> adding functions to my source code causes no such increase, so I'm
> wondering if I'm linking to the libraries incorrectly (I was under the
> impression only referenced functions from the library were included) or
> if simply copying and pasting a few functions would be more efficient.
> My command line for the compile looks like this:
>
> ocamlopt.opt -I /opt/local/lib/ocaml/site-lib/pcre -I
> /opt/local/lib/ocaml/site-lib/extlib -o wcheck unix.cmxa pcre.cmxa
> nums.cmxa extLib.cmxa wcheck2.ml

In OCaml, The whole code of modules are linked to the executable. The
size increase you experienced is normal.

> Found a few OCaml implementations of standard unix tools (cat, grep,
> wc, etc) and wondered if there is a definitive source for OCaml tools
> like this. In particular, I'm using a system call to "sort -u | comm
> -23 - /opt/local/etc/wordlist" to do a quick count of unknown words,
> since I know nothing about the workings of comm. Seems like it would be
> faster using a native OCaml version without using a pipe (since this is
> run on every incoming email, speed is important). Is this a correct
> assumption or am I better off just piping through GNU comm?

Obviously, you could store the words into a Hashtbl.t, and compare
them to the words in email.

Textutil library would be interesting. Has anyone tried that?

--
Yamagata Yoriyuki
• ... That s not exactly true. I think the true rule is the following : Pattern matching can be reordered by the ocaml compiler but it will preserve the guards
Message 5 of 16 , Jun 9, 2004
• 0 Attachment
> I can answer this question:
>
> > Also got an error I'm curious about. If you use pattern matching with
> > guards for a situation like this:
> >
> > match x with
> > | n when (n <= 10) -> dothis
> > | n when (n > 10 && n <= 20) -> dothat
> > | n when (n > 20) -> dotheother
>
> The pattern matching proceeds "in order"- if a given case satisifies two
> or more patterns, the first one listed takes precendence. So the above
> function should be written:
> match x with
> | n when (n <= 10) -> dothis
> | n when (n <= 20) -> dothat
> | n (* for all other n *) -> dotheother
>
> With which Ocaml has no problems.

That's not exactly true.
I think the true rule is the following :

Pattern matching can be reordered by the ocaml compiler but it will preserve
the guards ordering, so you can safely do bounds checks on an integer.

Regards,
Nicolas Cannasse
• ... It is normal AFAIK, that linking increases the size of the binary. Of course on such tiny apps you may prefer to copy and paste the very functions that you
Message 6 of 16 , Jun 9, 2004
• 0 Attachment
Il Wednesday 09 June 2004 05:23, Bradford Carpenter ha scritto:
> Haven't worked with a compiled language before (just shell scripts, a
> little Perl, and what not), but have been trying to find a programming
> language I feel comfortable with for eons it seems. Just discovered
> OCaml a couple of weeks ago; not sure why, but I find it much more
> compelling than anything else I've tried. Still trying to get the
> basics though.
>
> Tried writing a simple command line tool for counting various features
> in emails (to feed into a Bayesian filter) as an initial learning
> exercise. Was able to answer most of my questions using the OCaml
> manual and Google, but here's a few lingering questions.
>
> After starting to write some functions on my own, discovered existing
> versions in various libraries and decided to try using those. After
> adding two library references (ExtLib and Nums) to the compile line,
> the size of my final native-code binary increased from 260K to 468K,
> this to use only three functions contained in the libraries. Simply
> adding functions to my source code causes no such increase, so I'm
> wondering if I'm linking to the libraries incorrectly (I was under the
> impression only referenced functions from the library were included) or
> if simply copying and pasting a few functions would be more efficient.
> My command line for the compile looks like this:
>
> ocamlopt.opt -I /opt/local/lib/ocaml/site-lib/pcre -I
> /opt/local/lib/ocaml/site-lib/extlib -o wcheck unix.cmxa pcre.cmxa
> nums.cmxa extLib.cmxa wcheck2.ml

It is normal AFAIK, that linking increases the size of the binary. Of course
on such tiny apps you may prefer to copy and paste the very functions that
you need: but if you're not making any modification to the functions, this
shall be done when you're quite sure about your way, because things become
unmantainable afterwards. Optimization is the LAST move to take AFAIK.

BTW, did you make a profiling of your programs?

>
> Found a few OCaml implementations of standard unix tools (cat, grep,
> wc, etc) and wondered if there is a definitive source for OCaml tools
> like this. In particular, I'm using a system call to "sort -u | comm
> -23 - /opt/local/etc/wordlist" to do a quick count of unknown words,
> since I know nothing about the workings of comm. Seems like it would be
> faster using a native OCaml version without using a pipe (since this is
> run on every incoming email, speed is important). Is this a correct
> assumption or am I better off just piping through GNU comm?
>

System calls from Ocaml have some problems, that are rather generic Unix
programming issues. For what concerns pipes and redirections, there is a
detailed paper by Gerd Stolpmann for his Equeue program ( see
http://ocaml-programming.de/packages/documentation/equeue/refman-shell/Shell.html
).
Sort and comm can both be addressed with parsing newline-separated elements
and list programming, so if you already master writing efficient code for
using lists (say, tail-rec code), go for it, otherwise step backwards.

Trivial enough, I would say that efficiency comes (far) after precision. You
don't need just efficient functions if they don't do exactly what you want,
see it the other way, you can use existing functions as far as you _know_
that they do _exactly_ what you want... this happens seldom, maybe because
generally one does not know a borrowed function as deep as one that he has
written personally. If you're not in a hurry, in my experience everything
that _you_ can express as a clear problem (parsing... lists... strings...) is
worth being re-written. When functions have a lot of previous knowledge or
interfacing inside, it is a waste of time and interest, and very error-prone,
to open their hood when they are not broken.

It is always very interesting to me to see how often problems in "informatics"
do not come out of complexity itself, but from errors in the EASY things.
Personally, I start writing programs by working on their type system, not on
the functions.

Ciao

Ernesto
• ... Ah. Then I would think that the extra libraries are relevant if you plan to make extended use of the functions they contain, not if you plan to call one
Message 7 of 16 , Jun 9, 2004
• 0 Attachment
On Wed, 09 Jun 2004 16:25:05 +0900 (JST), Yamagata Yoriyuki wrote:

>> ocamlopt.opt -I /opt/local/lib/ocaml/site-lib/pcre -I
>> /opt/local/lib/ocaml/site-lib/extlib -o wcheck unix.cmxa pcre.cmxa
>> nums.cmxa extLib.cmxa wcheck2.ml
>
> In OCaml, The whole code of modules are linked to the executable. The
> size increase you experienced is normal.

Ah. Then I would think that the extra libraries are relevant if you
plan to make extended use of the functions they contain, not if you
plan to call one function a single time in a small command line tool as
I'm doing. Thank you for the clarification.

>> Found a few OCaml implementations of standard unix tools (cat, grep,
>> wc, etc) and wondered if there is a definitive source for OCaml tools
>> like this. In particular, I'm using a system call to "sort -u | comm
>> -23 - /opt/local/etc/wordlist" to do a quick count of unknown words,
>> since I know nothing about the workings of comm. Seems like it would be
>> faster using a native OCaml version without using a pipe (since this is
>> run on every incoming email, speed is important). Is this a correct
>> assumption or am I better off just piping through GNU comm?
>
> Obviously, you could store the words into a Hashtbl.t, and compare
> them to the words in email.
>
> Textutil library would be interesting. Has anyone tried that?

Haven't come across textutil library...is there one available? Actually
this check is something of an experiment: I'm comparing the unique
words of an email to a list of approximately 385,000 level 80 words
taken from SCOWL (Spelling-Checker-Oriented Word Lists) as a way to
detect spam with blocks of random character strings. I'm following up
on an idea posted on the Spamprobe users list, and it seems somewhat
promising so far. An actual spell checker like aspell is far too slow,
but I was quite amazed at how fast comm could pull out unidentified
words from the email. Of course the natural questions are can Hashtbl.t
work with lists of this size and can it process it faster than GNU
comm? If a check like this can be made very fast, it might be useful,
otherwise it would probably bog down the mail server too much just to
catch a certain type of spam that is somewhat problematic for a
Bayesian filter.

Best regards,
• ... I hope that when you say a list of approximately 385,000 you mean something structured like a tree, or in any case with a non sequential access...
Message 8 of 16 , Jun 9, 2004
• 0 Attachment
Il Wednesday 09 June 2004 10:04, Bradford Carpenter ha scritto:
> I'm comparing the unique
> words of an email to a list of approximately 385,000 level 80 words
> taken from SCOWL (Spelling-Checker-Oriented Word Lists) as a way to
> detect spam with blocks of random character strings.

I hope that when you say "a list of approximately 385,000" you mean something
structured like a tree, or in any case with a non sequential access...
...your speed will depend _a lot_ on the dictionary structure and searching
algorithm, hashtables won't solve all of this problem and frankly speaking
they are a very poor way to build a dictionary.

You've been using lexical trees (see for instance
http://caml.inria.fr/oreilly-book/html/book-ora020.html#toc29 ), didn't you?

Ernesto
• From: Bradford Carpenter Subject: Re: ocaml_beginners ::[] A few beginner s questions Date: Wed, 9 Jun 2004 01:04:44 -0700 ... I ve heard
Message 9 of 16 , Jun 9, 2004
• 0 Attachment
Subject: Re: "ocaml_beginners"::[] A few beginner's questions
Date: Wed, 9 Jun 2004 01:04:44 -0700

> Haven't come across textutil library...is there one available? Actually
> this check is something of an experiment: I'm comparing the unique
> words of an email to a list of approximately 385,000 level 80 words
> taken from SCOWL (Spelling-Checker-Oriented Word Lists) as a way to
> detect spam with blocks of random character strings. I'm following up
> on an idea posted on the Spamprobe users list, and it seems somewhat
> promising so far. An actual spell checker like aspell is far too slow,
> but I was quite amazed at how fast comm could pull out unidentified
> words from the email. Of course the natural questions are can Hashtbl.t
> work with lists of this size and can it process it faster than GNU
> comm? If a check like this can be made very fast, it might be useful,
> otherwise it would probably bog down the mail server too much just to
> catch a certain type of spam that is somewhat problematic for a
> Bayesian filter.

I've heard Hathtbl.t can handle > 1M entries, and building such a
large table in each invocation would be a problem.

If your dictonary is sorted list of words (as in the case of Unix
dict.), then you could use the divide-and-conquer method. i.e. first
you choose the middle word in the dictionary, and split words in the
email to the word before an after it. Repeat the same procedure to
each half, and so on. In this way, you do not need to load the whole
dictionary into the memory.

Of course you could use a more advanced method, like DART.

--
Yamagata Yoriyuki
• From: Bradford Carpenter Subject: Re: ocaml_beginners ::[] A few beginner s questions Date: Wed, 9 Jun 2004 01:04:44 -0700 ... I meant, has
Message 10 of 16 , Jun 9, 2004
• 0 Attachment
Subject: Re: "ocaml_beginners"::[] A few beginner's questions
Date: Wed, 9 Jun 2004 01:04:44 -0700

> > Textutil library would be interesting. Has anyone tried that?

I meant, "has anyone tried to make that." There is no textutil
library as far as I know.

Sorry for the confusing statement.
--
Yamagata Yoriyuki
• ... First, one thing you can do to reduce the size of your executables -- valid for native code only -- is to strip them after they re built. For instance:
Message 11 of 16 , Jun 9, 2004
• 0 Attachment
On Tue, Jun 08, 2004 at 08:23:11PM -0700, Bradford Carpenter wrote:
> After starting to write some functions on my own, discovered existing
> versions in various libraries and decided to try using those. After
> adding two library references (ExtLib and Nums) to the compile line,
> the size of my final native-code binary increased from 260K to 468K,
> this to use only three functions contained in the libraries. Simply
> adding functions to my source code causes no such increase, so I'm

First, one thing you can do to reduce the size of your executables --
valid for native code only -- is to strip them after they're built. For
instance:

strip -s foo

As for the linking question, here's what I've heard. I am not certain
it is accurate. :-)

Please note: this technique works *only* for ocamlopt-generated
executables, and not for ocamlc-generated executables (even if they use
-custom).

When linking a .cmo/.cmx (a module object file), the entire file is

When linking a .cma/.cmxa (a library), only the component .cmo/.cmx
files that are actually used are linked in.

Though I have not seen its source, I would guess that a library such as
Extlib uses code from itself in its functions. So, you may only use one
or two List-related functions from Extlib, but those functions may make
calls to enums, or several other Extlib modules internally. Thus, those
other modules must be linked in as well.

-- John
• ... That would be great. If nobody is working on this, we should at least add it to the project proposals corner of the humps at
Message 12 of 16 , Jun 9, 2004
• 0 Attachment
On Wed, 9 Jun 2004, Yamagata Yoriyuki wrote:

> Subject: Re: "ocaml_beginners"::[] A few beginner's questions
> Date: Wed, 9 Jun 2004 01:04:44 -0700
>
> > > Textutil library would be interesting. Has anyone tried that?
>
> I meant, "has anyone tried to make that." There is no textutil
> library as far as I know.

That would be great.
If nobody is working on this, we should at least add it to the project
proposals corner of the humps at
http://caml.inria.fr/humps/proposals_latest.html

Martin
• ... No, haven t tried that yet, but since speed is an important issue for this particular tool, it would probably be good to see which sections take the most
Message 13 of 16 , Jun 9, 2004
• 0 Attachment
On Wed, 09 Jun 2004 09:42:13 +0200, Stalkern 2 wrote:

> It is normal AFAIK, that linking increases the size of the binary. Of
> course
> on such tiny apps you may prefer to copy and paste the very functions that
> you need: but if you're not making any modification to the functions, this
> shall be done when you're quite sure about your way, because things become
> unmantainable afterwards. Optimization is the LAST move to take AFAIK.
>
> BTW, did you make a profiling of your programs?

No, haven't tried that yet, but since speed is an important issue for
this particular tool, it would probably be good to see which sections
take the most time to execute. Except for the wordlist check, all tests
pretty much use regular expressions and pcre-ocaml, so I wouldn't
expect a lot of variation.

>> Found a few OCaml implementations of standard unix tools (cat, grep,
>> wc, etc) and wondered if there is a definitive source for OCaml tools
>> like this. In particular, I'm using a system call to "sort -u | comm
>> -23 - /opt/local/etc/wordlist" to do a quick count of unknown words,
>> since I know nothing about the workings of comm. Seems like it would be
>> faster using a native OCaml version without using a pipe (since this is
>> run on every incoming email, speed is important). Is this a correct
>> assumption or am I better off just piping through GNU comm?
>>
>
> System calls from Ocaml have some problems, that are rather generic Unix
> programming issues. For what concerns pipes and redirections, there is a
> detailed paper by Gerd Stolpmann for his Equeue program ( see
>
http://ocaml-programming.de/packages/documentation/equeue/refman-shell/Shell.html
> ).

I'd looked at equeue, but opted for adapting Martin Jambon's "feed"
functions for external filtering (from
http://martin.jambon.free.fr/toolbox.html), since they seemed much
simpler and I could actually understand what they were doing.

> Sort and comm can both be addressed with parsing newline-separated elements
> and list programming, so if you already master writing efficient code for
> using lists (say, tail-rec code), go for it, otherwise step backwards.

Getting a mix of feedback on this. Sort and comm seem like fairly
optimized tools, but it sounds like something might be put together
within OCaml that fits the task a bit better. Again, this is my *first*
attempt at writing something with OCaml; still trying to master
writing code that actually *works*, but in this case, efficiency is
definitely an issue.

> Trivial enough, I would say that efficiency comes (far) after
> precision. You
> don't need just efficient functions if they don't do exactly what you want,
> see it the other way, you can use existing functions as far as you _know_
> that they do _exactly_ what you want... this happens seldom, maybe because
> generally one does not know a borrowed function as deep as one that he has
> written personally. If you're not in a hurry, in my experience everything
> that _you_ can express as a clear problem (parsing... lists...
> strings...) is
> worth being re-written. When functions have a lot of previous knowledge or
> interfacing inside, it is a waste of time and interest, and very
> error-prone,
> to open their hood when they are not broken.

Discovering this. Turns out, for example, that a rounding function I
needed (to a single significant figure) was much easier (and probably
more efficient) to implement as a custom function from scratch than as
an adaption any of the existing rounding functions in libraries I could
find. Examining my use of libraries more closely now.

> It is always very interesting to me to see how often problems in
> "informatics"
> do not come out of complexity itself, but from errors in the EASY things.
> Personally, I start writing programs by working on their type system,
> not on
> the functions.
• ... At the moment, as per comm s requirements, the word list is simply a linear list that has been pre-sorted with the sort command. The sort command is then
Message 14 of 16 , Jun 9, 2004
• 0 Attachment
On Wed, 09 Jun 2004 10:28:26 +0200, Stalkern 2 wrote:

>> I'm comparing the unique
>> words of an email to a list of approximately 385,000 level 80 words
>> taken from SCOWL (Spelling-Checker-Oriented Word Lists) as a way to
>> detect spam with blocks of random character strings.
>
> I hope that when you say "a list of approximately 385,000" you mean
> something
> structured like a tree, or in any case with a non sequential access...
> ...your speed will depend _a lot_ on the dictionary structure and searching
> algorithm, hashtables won't solve all of this problem and frankly speaking
> they are a very poor way to build a dictionary.

At the moment, as per comm's requirements, the word list is simply a
linear list that has been pre-sorted with the sort command. The sort
command is then applied to the email text for comparison. All I'm
really trying to do to determine the number of unique words in the
email that are not in the wordlist.

> You've been using lexical trees (see for instance
> http://caml.inria.fr/oreilly-book/html/book-ora020.html#toc29 ), didn't you?

answers to the exercises! Using a lexical tree is one option I'll need
to look into.

Best regards,
• ... Hadn t heard of this one, but then I m new to trying to compile my own code. Thanks for the tip; I ll try to track down some info on the strip command. ...
Message 15 of 16 , Jun 9, 2004
• 0 Attachment
On Wed, 09 Jun 2004 08:12:41 -0500, John Goerzen wrote:

>> After starting to write some functions on my own, discovered existing
>> versions in various libraries and decided to try using those. After
>> adding two library references (ExtLib and Nums) to the compile line,
>> the size of my final native-code binary increased from 260K to 468K,
>> this to use only three functions contained in the libraries. Simply
>> adding functions to my source code causes no such increase, so I'm
>
> First, one thing you can do to reduce the size of your executables --
> valid for native code only -- is to strip them after they're built. For
> instance:
>
> strip -s foo

Hadn't heard of this one, but then I'm new to trying to compile my own
code. Thanks for the tip; I'll try to track down some info on the strip
command.

> As for the linking question, here's what I've heard. I am not certain
> it is accurate. :-)
>
> Please note: this technique works *only* for ocamlopt-generated
> executables, and not for ocamlc-generated executables (even if they use
> -custom).
>
> When linking a .cmo/.cmx (a module object file), the entire file is
>
> When linking a .cma/.cmxa (a library), only the component .cmo/.cmx
> files that are actually used are linked in.
>
> Though I have not seen its source, I would guess that a library such as
> Extlib uses code from itself in its functions. So, you may only use one
> or two List-related functions from Extlib, but those functions may make
> calls to enums, or several other Extlib modules internally. Thus, those
> other modules must be linked in as well.

Think you may be right about Extlib; I notice the enum module is
referenced by most other modules in one way or another. I'm finding for
this particular use, simply writing custom functions, rather than using
libraries, seems to be the way to go, but good to have some
understanding of what's involved with linking in libraries for when I
do need them. Thanks!

Best regards,