Re: [patch] improved equivalent classes in regular expressions

  • Bram Moolenaar
    Message 1 of 11, Jan 17, 2013
      Christian Brabandt wrote:

      > Bram,
      > I recently discovered, that using equivalence classes in regular
      > expressions did not match all expected characters. Also I think, the
      > current implementation does not work as expected, since searching for
      > [[=Ä=]] does only match Ä and neither A nor any other A like character.
      >
      > So I looked into the standard¹ and found that apparently not all
      > characters are matched according to it.

      I don't think the documentation says that it works according to any
      standard. If we go this way, we need to make sure we are actually using
      the right standard for this functionality.

      > I wrote a testfile² that contains all character codes that need to match
      > for /[[=A=]]. If you search for /[[=A=]]$ you'll see, that some
      > characters are skipped.
      >
      > So I threw together a small vim script³, that parses the given standard
      > file and generates a huge switch statement to be used in the function
      > reg_equi_class() of the regexp.c in the Vim source.
      >
      > Using this generated code in regexp.c, I created this patch⁴, which
      > successfully matches all expected characters from that testfile. It also
      > adds equivalence classes for the 10 digits 0-9 (and added some missing
      > equivalence classes, e.g. for 'Q')
      >
      > However, some characters are now missing from the equivalence classes,
      > like most notably U01E4 U01E5 U0149 U0166 U0167 U01B5 U01B6 since they
      > are defined to have a different primary weight than their ASCII
      > counterparts (G g n T t Z z), so I removed those chars from test44

      Hmm, doesn't this indicate the standard is not right for this purpose?


      --
      Bravely bold Sir Robin, rode forth from Camelot,
      He was not afraid to die, Oh Brave Sir Robin,
      He was not at all afraid to be killed in nasty ways
      Brave, brave, brave, brave Sir Robin.
      "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

      /// Bram Moolenaar -- Bram@... -- http://www.Moolenaar.net \\\
      /// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
      \\\ an exciting new programming language -- http://www.Zimbu.org ///
      \\\ help me help AIDS victims -- http://ICCF-Holland.org ///

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    • Christian Brabandt
      Message 2 of 11, Jan 17, 2013
        Hi Bram!

        On Thu, 17 Jan 2013, Bram Moolenaar wrote:

        >
        > Christian Brabandt wrote:
        >
        > > Bram,
        > > I recently discovered, that using equivalence classes in regular
        > > expressions did not match all expected characters. Also I think, the
        > > current implementation does not work as expected, since searching for
        > > [[=Ä=]] does only match Ä and neither A nor any other A like character.
        > >
        > > So I looked into the standard¹ and found that apparently not all
        > > characters are matched according to it.
        >
        > I don't think the documentation says that it works according to any
        > standard. If we go this way, we need to make sure we are actually using
        > the right standard for this functionality.
        >
        > > I wrote a testfile² that contains all character codes that need to match
        > > for /[[=A=]]. If you search for /[[=A=]]$ you'll see, that some
        > > characters are skipped.
        > >
        > > So I threw together a small vim script³, that parses the given standard
        > > file and generates a huge switch statement to be used in the function
        > > reg_equi_class() of the regexp.c in the Vim source.
        > >
        > > Using this generated code in regexp.c, I created this patch⁴, which
        > > successfully matches all expected characters from that testfile. It also
        > > adds equivalence classes for the 10 digits 0-9 (and added some missing
        > > equivalence classes, e.g. for 'Q')
        > >
        > > However, some characters are now missing from the equivalence classes,
        > > like most notably U01E4 U01E5 U0149 U0166 U0167 U01B5 U01B6 since they
        > > are defined to have a different primary weight than their ASCII
        > > counterparts (G g n T t Z z), so I removed those chars from test44
        >
        > Hmm, doesn't this indicate the standard is not right for this purpose?

        It should be the correct standard. It's all about collation order and
        that is what defines the equivalence classes. Anyhow, here is an updated
        patch, that includes the few missing characters from before.
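A minimal Python sketch of why a few stroke letters need separate handling. It does not read the collation table discussed in this thread. It assumes, purely for illustration, that Unicode canonical decomposition (from the standard `unicodedata` module) is a rough proxy for "same base letter". The helper name `base_letter` is hypothetical.

```python
import unicodedata

def base_letter(ch: str) -> str:
    """Return the first code point of the canonical decomposition of ch."""
    return unicodedata.normalize("NFD", ch)[0]

# Accented letters decompose into a base letter plus combining marks,
# so U+00C4 (A with diaeresis) reduces to plain "A".
print(base_letter("\u00c4"))

# Stroke letters have no canonical decomposition and map to themselves,
# so decomposition alone cannot tie them to G or L.
for cp in (0x01E4, 0x0141):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.name(ch), base_letter(ch) == ch)
```

Under this assumption both U+01E4 and U+0141 stay outside the classes of G and L, so any equivalence between them and their base letters has to come from an explicit table rather than from decomposition.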

        Best regards
        Christian
        --
        Hello, cream slice eaters!

      • Christian Brabandt
        Message 3 of 11, Jan 21, 2013
          Hi Dominique!

          On Wed, 16 Jan 2013, Dominique Pellé wrote:

          > When using an equivalence class [[=x=]], I realized that what I
          > generally want is to use it on full strings rather than on
          > single characters. Searching for "foobar" with...
          >
          > /[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
          >
          > ... works but is rather unpleasant. I wish there was a flag
          > such as \q to switch on equivalence classes, which would
          > work like \c for case insensitivity. So instead of the above
          > regexp, I could search for:
          >
          > /\qfoobar
          >
          > As far as I know \q is unused in Vim regexp, so
          > that should not break compatibility.
          >
          > Maybe there could also be a function normalize({expr})
          > (any better name?) that, given a string with diacritics
          > "fňóbâr", returns "foobar" in a similar way to tolower({expr}),
          > which returns a lowercase version of the string.
          >
          > Before I spend time trying to do that, would it be useful
          > and accepted?

          Indeed, that looks like a useful addition.
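The two ideas quoted above can be sketched in a few lines of Python. This is an illustration only, not Vim code: `expand_equiv` and `strip_diacritics` are hypothetical names, the first mimicking what a \q flag would have to expand to, the second approximating the proposed normalize() by dropping combining marks after Unicode canonical decomposition. The sample string is chosen for the example and is an assumption.

```python
import unicodedata

def expand_equiv(word: str) -> str:
    """Wrap every character of word in a [[=x=]] equivalence class.

    Characters that are special inside a bracket expression are not
    escaped here; the sketch only handles plain letters.
    """
    return "".join(f"[[={ch}=]]" for ch in word)

def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(expand_equiv("foobar"))
print(strip_diacritics("f\u00f3\u00f4b\u00e2r"))  # "fóôbâr"
```

Letters without a canonical decomposition, such as the stroke letters mentioned elsewhere in this thread, pass through `strip_diacritics` unchanged, so a real implementation would need a table on top of this.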

          I have another idea with regards to equivalence classes:
          When searching for /[[=ß=]] this should translate into /sz. But that is
          more complicated, since a search for /[s][z] wouldn't match ß (eszett)
          anymore.

          > Regarding the few characters that are no longer equivalent,
          > I find it odd from a user point of view. For example U+01e4
          > (LATIN CAPITAL LETTER G WITH STROKE) was equivalent
          > to uppercase G but it is no longer equivalent to G.
          > Yet some other letters with stroke are still equivalent.
          > For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
          > is still equivalent to L. It seems inconsistent, even if that's
          > what the ISO standard says. Previous behavior made more
          > sense to me for U+01e4 at least.

          Fixed with the latest patch.

          Best regards
          Christian
          --
          Alcoholism: poison and antidote are identical.
          -- Gerhard Uhlenbruck

        • Dominique Pellé
          Message 4 of 11, Jan 21, 2013
            Christian Brabandt wrote:

            > Hi Dominique!
            >
            > On Wed, 16 Jan 2013, Dominique Pellé wrote:
            >
            >> When using an equivalence class [[=x=]], I realized that what I
            >> generally want is to use it on full strings rather than on
            >> single characters. Searching for "foobar" with...
            >>
            >> /[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
            >>
            >> ... works but is rather unpleasant. I wish there was a flag
            >> such as \q to switch on equivalence classes, which would
            >> work like \c for case insensitivity. So instead of the above
            >> regexp, I could search for:
            >>
            >> /\qfoobar
            >>
            >> As far as I know \q is unused in Vim regexp, so
            >> that should not break compatibility.
            >>
            >> Maybe there could also be a function normalize({expr})
            >> (any better name?) that, given a string with diacritics
            >> "fňóbâr", returns "foobar" in a similar way to tolower({expr}),
            >> which returns a lowercase version of the string.
            >>
            >> Before I spend time trying to do that, would it be useful
            >> and accepted?
            >
            > Indeed, that looks like a useful addition.

            I have no time now for that unfortunately, but maybe in a few weeks.

            > I have another idea with regards to equivalence classes:
            > When searching for /[[=ß=]] this should translate into /sz. But that is
            > more complicated, since a search for /[s][z] wouldn't match ß (eszett)
            > anymore.

            You obviously speak better German than me, but isn't the German
            ess-zett equivalent to ss rather than sz? I'm curious why /sz.

            >> Regarding the few characters that are no longer equivalent,
            >> I find it odd from a user point of view. For example U+01e4
            >> (LATIN CAPITAL LETTER G WITH STROKE) was equivalent
            >> to uppercase G but it is no longer equivalent to G.
            >> Yet some other letters with stroke are still equivalent.
            >> For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
            >> is still equivalent to L. It seems inconsistent, even if that's
            >> what the ISO standard says. Previous behavior made more
            >> sense to me for U+01e4 at least.
            >
            > Fixed with the latest patch.

            Yes, I saw that. Thanks!

          • Christian Brabandt
            Message 5 of 11, Jan 23, 2013
              Hi Dominique!

              On Mon, 21 Jan 2013, Dominique Pellé wrote:

              > You obviously speak better German than me, but isn't the German
              > ess-zett equivalent to ss rather than sz? I'm curious why /sz.

              You got me ;)
              Of course esszett is, despite its name, equivalent to ss and that is
              what the standard actually demands (Although the Swiss might think
              otherwise). Sorry for the confusion.

              regards,
              Christian
              --
              Time is what you read off the clock.
              -- Albert Einstein

            • Joachim Schmitz
              Message 6 of 11, Jan 24, 2013
                Christian Brabandt wrote:
                > Hi Dominique!
                >
                > On Mon, 21 Jan 2013, Dominique Pellé wrote:
                >
                >> You obviously speak better German than me, but isn't the German
                >> ess-zett equivalent to ss rather than sz? I'm curious why /sz.
                >
                > You got me ;)
                > Of course esszett is, despite its name, equivalent to ss and that is
                > what the standard actually demands (Although the Swiss might think
                > otherwise). Sorry for the confusion.


                But still, while ß is equivalent to ss, the opposite is not true: only a
                few ss are equivalent to ß.
                The same goes for ä, ö, ü and ae, oe, ue: equivalent in one direction but
                not the other.

                Bye, Jojo


              • Christian Brabandt
                Message 7 of 11, Jan 24, 2013
                  Hi Joachim!

                  On Thu, 24 Jan 2013, Joachim Schmitz wrote:

                  > But still, while ß is equivalent to ss, the opposite is not true,
                  > only a few ss are equivalent to ß.
                  > Same for ä,ö,ü and ae, oe, ue, equivalent in one direction but not
                  > the other.

                  Indeed, but when we are talking about equivalence classes regarding
                  regular expressions, then ss and ß are equal.
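The one-directional relation Joachim describes, and the symmetric one Christian means, can both be shown with a short Python sketch. It uses Python's `re` module rather than Vim's regexp engine, and `eszett_pattern` is a hypothetical helper written for this example: it expands each ß in a search word into an alternation, while a literal "ss" is left untouched.

```python
import re

def eszett_pattern(word: str) -> re.Pattern:
    """Compile word so that every ß also matches the two letters "ss"."""
    parts = ["(?:ß|ss)" if ch == "ß" else re.escape(ch) for ch in word]
    return re.compile("".join(parts))

pat = eszett_pattern("Maße")
# Searching with ß finds both spellings.
print(bool(pat.fullmatch("Maße")), bool(pat.fullmatch("Masse")))

# Searching with a literal "ss" does not find ß: one direction only.
print(bool(eszett_pattern("Masse").fullmatch("Maße")))

# str.casefold() maps ß to "ss", which makes the comparison symmetric.
print("Maße".casefold() == "Masse".casefold())
```

The expansion turns one character into two, which is why it does not fit a single-character bracket expression such as [[=ß=]] and would need support at the pattern level.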

                  regards,
                  Christian
                  --
                  The best part of beauty is that which a picture cannot
                  express.
                  -- Francis Bacon

                • Tony Mechelynck
                  Message 8 of 11, Jan 24, 2013
                    On 23/01/13 22:08, Christian Brabandt wrote:
                    > Hi Dominique!
                    >
                    > On Mon, 21 Jan 2013, Dominique Pellé wrote:
                    >
                    >> You obviously speak better German than me, but isn't the German
                    >> ess-zett equivalent to ss rather than sz? I'm curious why /sz.
                    >
                    > You got me ;)
                    > Of course esszett is, despite its name, equivalent to ss and that is
                    > what the standard actually demands (Although the Swiss might think
                    > otherwise). Sorry for the confusion.
                    >
                    > regards,
                    > Christian
                    >

                    What do you mean, "the Swiss may think otherwise"? IIUC, in the de_CH
                    standard the eszett is not used, it is always replaced by ss, because
                    the Swiss have no room for it on their trilingual (well, quadrilingual,
                    even) typewriter keyboards. Hence the well-known slur against them:

                    — Wie trinken die Schweizer Bier?
                    ("How do the Swiss drink beer?")
                    — In Masse.
                    ("massively", where for any other German-speaking country except maybe
                    Liechtenstein it would of course be "in Maße", "in moderation").


                    Best regards,
                    Tony.
                    --
                    Speak roughly to your little boy,
                    And beat him when he sneezes:
                    He only does it to annoy
                    Because he knows it teases.

                    Wow! wow! wow!

                    I speak severely to my boy,
                    And beat him when he sneezes:
                    For he can thoroughly enjoy
                    The pepper when he pleases!

                    Wow! wow! wow!
                    -- Lewis Carroll, "Alice in Wonderland"

                  • Christian Brabandt
                    Message 9 of 11, Jan 24, 2013
                      Hi Tony!

                      On Thu, 24 Jan 2013, Tony Mechelynck wrote:

                      > What do you mean, "the Swiss may think otherwise"? IIUC, in the
                      > de_CH standard the eszett is not used, it is always replaced by ss,
                      > because the Swiss have no room for it on their trilingual (well,
                      > quadrilingual, even) typewriter keyboards. Hence the well-known slur
                      > against them:
                      >
                      > — Wie trinken die Schweizer Bier?
                      > ("How do the Swiss drink beer?")
                      > — In Masse.
                      > ("massively", where for any other German-speaking country except
                      > maybe Liechtenstein it would of course be "in Maße", "in
                      > moderation").

                      I thought the Swiss used to replace ß by sz but that is apparently
                      wrong, as you pointed out correctly.

                      Best regards
                      Christian