Re: [patch] improved equivalent classes in regular expressions

  • Christian Brabandt
    Message 1 of 11 , Jan 17, 2013
      Hi Bram!

      On Thu, 17 Jan 2013, Bram Moolenaar wrote:

      >
      > Christian Brabandt wrote:
      >
      > > Bram,
      > > I recently discovered that using equivalence classes in regular
      > > expressions does not match all expected characters. I also think the
      > > current implementation does not work as expected, since searching for
      > > [[=Ä=]] only matches Ä and neither A nor any other A-like character.
      > >
      > > So I looked into the standard¹ and found that apparently not all
      > > characters are matched according to it.
      >
      > I don't think the documentation says that it works according to any
      > standard. If we go this way, we need to make sure we are actually using
      > the right standard for this functionality.
      >
      > > I wrote a testfile² that contains all character codes that need to match
      > > for /[[=A=]]. If you search for /[[=A=]]$ you'll see, that some
      > > characters are skipped.
      > >
      > > So I threw together a small vim script³, that parses the given standard
      > > file and generates a huge switch statement to be used in the function
      > > reg_equi_class() of the regexp.c in the Vim source.
      > >
      > > Using this generated code in regexp.c, I created this patch⁴, which
      > > successfully matches all expected characters from that testfile. It also
      > > adds equivalence classes for the 10 digits 0-9 (and added some missing
      > > equivalence classes, e.g. for 'Q')
      > >
      > > However, some characters are now missing from the equivalence classes,
      > > most notably U+01E4 U+01E5 U+0149 U+0166 U+0167 U+01B5 U+01B6, since they
      > > are defined to have a different primary weight than their ASCII
      > > counterparts (G g n T t Z z), so I removed those chars from test44.
      >
      > Hmm, doesn't this indicate the standard is not right for this purpose?

      It should be the correct standard. It's all about collation order, and
      that is what defines the equivalence classes. Anyhow, here is an updated
      patch that includes the few missing characters from before.
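
To make the idea of an equivalence class concrete, here is a minimal Python sketch. It is not the patch and does not use the collation table discussed above. As a stand-in it groups characters by Unicode canonical decomposition (NFD), which is an assumption of this example and gives different results in places: letters with a stroke such as U+01E4 or U+0141 have no canonical decomposition, so they end up in no class here. The function name `decomposition_classes` and the code point range are hypothetical.

```python
import unicodedata
from collections import defaultdict

def decomposition_classes(first=0x00C0, last=0x024F):
    """Group accented Latin letters under the ASCII letter they decompose to."""
    classes = defaultdict(list)
    for cp in range(first, last + 1):
        ch = chr(cp)
        base = unicodedata.normalize("NFD", ch)[0]
        # Keep only characters whose canonical decomposition starts with an
        # ASCII letter, e.g. U+00C4 decomposes to "A" + U+0308.
        if base != ch and base.isascii() and base.isalpha():
            classes[base].append(ch)
    return classes

classes = decomposition_classes()
print("".join(classes["A"]))      # the A-like characters in the scanned range
print("\u01e4" in classes["G"])   # G WITH STROKE does not decompose
```

A table built from primary collation weights, as in the patch, would be the authoritative source. The sketch only shows the grouping step.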

      Kind regards
      Christian
      --
      Hello, cream-slice eater!

      --
      You received this message from the "vim_dev" maillist.
      Do not top-post! Type your reply below the text you are replying to.
      For more information, visit http://www.vim.org/maillist.php
    • Christian Brabandt
      Message 2 of 11 , Jan 21, 2013
        Hi Dominique!

        On Wed, 16 Jan 2013, Dominique Pellé wrote:

        > When using an equivalence class [[=x=]], I realized that what I
        > generally want is to use it on full strings rather than on
        > single characters. Searching for "foobar" with...
        >
        > /[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
        >
        > ... works but is rather unpleasant. I wish there was a flag
        > such as \q to switch on equivalence classes, which would
        > work like \c for case insensitivity. So instead of the above
        > regexp, I could search for:
        >
        > /\qfoobar
        >
        > As far as I know \q is unused in Vim regexp, so
        > that should not break compatibility.
        >
        > Maybe there could also be a function normalize({expr})
        > (any better name?) that given a string with diacritics
        > "fňóbâr" returns "foobar" in a similar way to tolower({expr})
        > which returns a lowercase version of the string.
        >
        > Before I spend time trying to do that, would it be useful
        > and accepted?

        Indeed, that looks like a useful addition.
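
Until something like \q exists, the long pattern can be generated rather than typed. The following Python sketch shows the expansion a \q flag would stand for. The helper name `equivalence_pattern` is hypothetical, and the sketch assumes the search term contains no regexp metacharacters (non-letters are copied through unescaped).

```python
def equivalence_pattern(text):
    """Wrap every letter of text in a [[=x=]] equivalence class."""
    parts = []
    for ch in text:
        # Only letters get a class; everything else is copied through unchanged.
        parts.append(f"[[={ch}=]]" if ch.isalpha() else ch)
    return "".join(parts)

print("/" + equivalence_pattern("foobar"))
# /[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
```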

        I have another idea with regard to equivalence classes:
        When searching for /[[=ß=]] this should translate into /sz. But that is
        more complicated, since a search for /[s][z] wouldn't match ß (eszett)
        anymore.

        > Regarding the few characters that are no longer equivalent,
        > I find it odd from a user point of view. For example U+01e4
        > (LATIN CAPITAL LETTER G WITH STROKE) was equivalent
        > to uppercase G but it is no longer equivalent to G.
        > Yet some other letters with stroke are still equivalent.
        > For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
        > is still equivalent to L. It seems inconsistent, even if that's
        > what the ISO standard says. Previous behavior made more
        > sense to me for U+1e4 at least.

        Fixed with the latest patch.

        Kind regards
        Christian
        --
        Alcoholism: poison and antidote are identical.
        -- Gerhard Uhlenbruck

      • Dominique Pellé
        Message 3 of 11 , Jan 21, 2013
          Christian Brabandt wrote:

          > Hi Dominique!
          >
          > On Wed, 16 Jan 2013, Dominique Pellé wrote:
          >
          >> When using an equivalence class [[=x=]], I realized that what I
          >> generally want is to use it on full strings rather than on
          >> single characters. Searching for "foobar" with...
          >>
          >> /[[=f=]][[=o=]][[=o=]][[=b=]][[=a=]][[=r=]]
          >>
          >> ... works but is rather unpleasant. I wish there was a flag
          >> such as \q to switch on equivalence classes, which would
          >> work like \c for case insensitivity. So instead of the above
          >> regexp, I could search for:
          >>
          >> /\qfoobar
          >>
          >> As far as I know \q is unused in Vim regexp, so
          >> that should not break compatibility.
          >>
          >> Maybe there could also be a function normalize({expr})
          >> (any better name?) that given a string with diacritics
          >> "fňóbâr" returns "foobar" in a similar way to tolower({expr})
          >> which returns a lowercase version of the string.
          >>
          >> Before I spend time trying to do that, would it be useful
          >> and accepted?
          >
          > Indeed, that looks like a useful addition.

          I have no time now for that unfortunately, but maybe in a few weeks.
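
For reference, the behaviour I have in mind for such a normalize() function can be sketched in a few lines of Python. This is only an illustration under the assumption that "removing diacritics" means dropping combining marks after canonical decomposition. The name `strip_diacritics` is hypothetical and nothing here is Vim code.

```python
import unicodedata

def strip_diacritics(text):
    """Return text with combining marks removed after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("fóöbâr"))  # foobar
```

Letters that have no canonical decomposition, such as ß or U+01E4, pass through unchanged, so a real implementation would still need a table for those.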

          > I have another idea with regard to equivalence classes:
          > When searching for /[[=ß=]] this should translate into /sz. But that is
          > more complicated, since a search for /[s][z] wouldn't match ß (eszett)
          > anymore.

          You obviously speak better German than me, but isn't the German
          ess-zett equivalent to ss rather than sz? I'm curious why /sz.

          >> Regarding the few characters that are no longer equivalent,
          >> I find it odd from a user point of view. For example U+01e4
          >> (LATIN CAPITAL LETTER G WITH STROKE) was equivalent
          >> to uppercase G but it is no longer equivalent to G.
          >> Yet some other letters with stroke are still equivalent.
          >> For example, U+0141 (LATIN CAPITAL LETTER L WITH STROKE)
          >> is still equivalent to L. It seems inconsistent, even if that's
          >> what the ISO standard says. Previous behavior made more
          >> sense to me for U+1e4 at least.
          >
          > Fixed with the latest patch.

          Yes, I saw that. Thanks!

        • Christian Brabandt
          Message 4 of 11 , Jan 23, 2013
            Hi Dominique!

            On Mon, 21 Jan 2013, Dominique Pellé wrote:

            > You obviously speak better German than me, but isn't the German
            > ess-zett equivalent to ss rather than sz? I'm curious why /sz.

            You got me ;)
            Of course esszett is, despite its name, equivalent to ss and that is
            what the standard actually demands (Although the Swiss might think
            otherwise). Sorry for the confusion.

            regards,
            Christian
            --
            Time is what you read off a clock.
            -- Albert Einstein

          • Joachim Schmitz
            Message 5 of 11 , Jan 24, 2013
              Christian Brabandt wrote:
              > Hi Dominique!
              >
              > On Mon, 21 Jan 2013, Dominique Pellé wrote:
              >
              >> You obviously speak better German than me, but isn't the German
              >> ess-zett equivalent to ss rather than sz? I'm curious why /sz.
              >
              > You got me ;)
              > Of course esszett is, despite its name, equivalent to ss and that is
              > what the standard actually demands (Although the Swiss might think
              > otherwise). Sorry for the confusion.


              But still, while ß is equivalent to ss, the opposite is not true: only a few ss
              are equivalent to ß.
              The same goes for ä, ö, ü and ae, oe, ue: equivalent in one direction but not the
              other.

              Bye, Jojo


            • Christian Brabandt
              Message 6 of 11 , Jan 24, 2013
                Hi Joachim!

                On Thu, 24 Jan 2013, Joachim Schmitz wrote:

                > But still, while ß is equivalent to ss, the opposite is not true,
                > only few ss are equivalent to ß.
                > Same for ä,ö,ü and ae, oe, ue, equivalent in one direction but not
                > the other.

                Indeed, but when we are talking about equivalence classes regarding
                regular expressions, then ss and ß are equal.
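
As a rough illustration of treating ss and ß as equal in a search, here is a small Python sketch. It is not how Vim implements equivalence classes. It simply rewrites a search term so that each "ss" or "ß" matches either spelling, and the helper name `eszett_pattern` is hypothetical.

```python
import re

def eszett_pattern(term):
    """Build a regex in which every "ss" or "ß" in term matches either spelling."""
    pieces = re.split(r"(ss|ß)", term)
    return "".join(
        "(?:ss|ß)" if piece in ("ss", "ß") else re.escape(piece)
        for piece in pieces
    )

pattern = re.compile(eszett_pattern("Strasse"))
print(bool(pattern.fullmatch("Straße")))   # True
print(bool(pattern.fullmatch("Strasse")))  # True
```

This deliberately ignores Joachim's point: it also lets ß match in words where only ss is a valid spelling, which is acceptable for searching but not for spelling.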

                regards,
                Christian
                --
                That is the best part of beauty, which a picture cannot express.
                -- Francis Bacon

              • Tony Mechelynck
                Message 7 of 11 , Jan 24, 2013
                  On 23/01/13 22:08, Christian Brabandt wrote:
                  > Hi Dominique!
                  >
                  > On Mon, 21 Jan 2013, Dominique Pellé wrote:
                  >
                  >> You obviously speak better German than me, but isn't the German
                  >> ess-zett equivalent to ss rather than sz? I'm curious why /sz.
                  >
                  > You got me ;)
                  > Of course esszett is, despite its name, equivalent to ss and that is
                  > what the standard actually demands (Although the Swiss might think
                  > otherwise). Sorry for the confusion.
                  >
                  > regards,
                  > Christian
                  >

                  What do you mean, "the Swiss may think otherwise"? IIUC, in the de_CH
                  standard the eszett is not used, it is always replaced by ss, because
                  the Swiss have no room for it on their trilingual (well, quadrilingual,
                  even) typewriter keyboards. Hence the well-known slur against them:

                  — Wie trinken die Schweizer Bier?
                  ("How do the Swiss drink beer?")
                  — In Masse.
                  ("massively", where for any other German-speaking country except maybe
                  Liechtenstein it would of course be "in Maße", "in moderation").


                  Best regards,
                  Tony.
                  --
                  Speak roughly to your little boy,
                  And beat him when he sneezes:
                  He only does it to annoy
                  Because he knows it teases.

                  Wow! wow! wow!

                  I speak severely to my boy,
                  And beat him when he sneezes:
                  For he can thoroughly enjoy
                  The pepper when he pleases!

                  Wow! wow! wow!
                  -- Lewis Carroll, "Alice in Wonderland"

                • Christian Brabandt
                  Message 8 of 11 , Jan 24, 2013
                    Hi Tony!

                    On Thu, 24 Jan 2013, Tony Mechelynck wrote:

                    > What do you mean, "the Swiss may think otherwise"? IIUC, in the
                    > de_CH standard the eszett is not used, it is always replaced by ss,
                    > because the Swiss have no room for it on their trilingual (well,
                    > quadrilingual, even) typewriter keyboards. Hence the well-known slur
                    > against them:
                    >
                    > — Wie trinken die Schweizer Bier?
                    > ("How do the Swiss drink beer?")
                    > — In Masse.
                    > ("massively", where for any other German-speaking country except
                    > maybe Liechtenstein it would of course be "in Maße", "in
                    > moderation").

                    I thought the Swiss used to replace ß by sz but that is apparently
                    wrong, as you pointed out correctly.

                    Kind regards
                    Christian