Regular Expression Problem

  • Greg Chapman
    Message 1 of 20, Sep 28, 2007
      Hi Folks,

      I have a simple address book file in the form:

      user@domain[TAB]Firstname Lastname (Comment)[TAB]group

      I wish to change it to:

      "Firstname Lastname" <username@domain>[TAB]Comment[TAB]group

      Actually, there are a handful of entries where the second field
      contains variations on the "Firstname Lastname (Comment)" format, so I
      am happy to leave the second field untouched in the revised version
      and do the final refinements manually.

      Obviously, I'll need to check the regular expression box, but what
      should I enter in the Search/Replace boxes to catch all forms of valid
      mail address and collect the first two words from the second field to
      place them in the first, with the required punctuation round the name
      and address?

      Greg
    • Tony Mc
      Message 2 of 20, Sep 28, 2007
        On Fri, 28 Sep 2007 13:46:29 +0100, you wrote:

        > Hi Folks,
        >
        > I have a simple address book file in the form:
        >
        > user@domain[TAB]Firstname Lastname (Comment)[TAB]group
        >
        > I wish to change it to:
        >
        > "Firstname Lastname" <username@domain>[TAB]Comment[TAB]group
        >
        > Actually, there are a handful of entries where the second field
        > contains variations on the "Firstname Lastname (Comment)" format, so I
        > am happy to leave the second field untouched in the revised version
        > and do the final refinements manually
        >
        > Obviously, I'll need to check the regular expression box, but what
        > should I enter in the Search/Replace boxes to catch all forms of valid
        > mail address and collect the first two words from the second field to
        > place them in the first, with the required punctuation round the name
        > and address?


        Hi Greg,

        if the comments are really enclosed in parentheses, as your example
        suggests, you should use:

        Search for: ^(.*@.*)\t(.*?)(\(.*?\))\t(.*)$
        Replace with: $2 <$1>\t$3\t$4

        This will keep the parentheses around the comment field. If you would
        prefer the second field to remain as "Firstname Lastname (comment)" as
        your "Actually..." suggests, then use:

        Search for: ^(.*?)\t(.*?)\t(.*)$
        Replace with: $2\t$1\t$3

        In each case the (.*?)\t finds a field up to the next tab and saves it
        in one of the $i variables. The replacement field then uses these $i
        separated by tabs in the order you require. In the second case above,
        your manual tweaking would be on the $2 replacement.

        Best,
        Tony
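For anyone who wants to try Tony's second pattern outside NoteTab, here is a minimal Python sketch. It assumes Python's `re` module, where a backreference in the replacement is written `\1` rather than `$1`. The sample line and the address `user@example.com` are made up for illustration.

```python
import re

# Tony's second pattern: three tab-separated fields, the first two captured lazily.
FIELDS = re.compile(r'^(.*?)\t(.*?)\t(.*)$', re.MULTILINE)

def swap_first_two_fields(text):
    # \2\t\1\t\3 emits the name field first, then the address, then the group.
    return FIELDS.sub(r'\2\t\1\t\3', text)

# Hypothetical record in the format Greg described.
sample = "user@example.com\tFirstname Lastname (Comment)\tgroup"
swapped = swap_first_two_fields(sample)
print(swapped)
```

Each lazy `(.*?)\t` stops at the first tab it can, which is what keeps the fields from bleeding into one another.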
      • Don - HtmlFixIt.com
        Message 3 of 20, Sep 28, 2007
          Greg Chapman wrote:
          > Hi Folks,
          >
          > I have a simple address book file in the form:
          >
          > user@domain[TAB]Firstname Lastname (Comment)[TAB]group
          >
          > I wish to change it to:
          >
          > "Firstname Lastname" <username@domain>[TAB]Comment[TAB]group
          >

          Try this ... so long as domains are all actual domains and not ip
          addresses ...

          ^([a-z|0-9|\.|_]*)@([a-z|0-9|\.|_]*)\.([a-z]{1,4})\t(.*)\((.*)\)(.*)

          \"$4\" <$1@$2.$3>\t$5\t$6
        • Marcelo de Castro Bastos
          Message 4 of 20, Sep 28, 2007
            Interviewed by CNN on 28/9/2007 09:46, Greg Chapman told the world:
            > Hi Folks,
            >
            > I have a simple address book file in the form:
            >
            > user@domain[TAB]Firstname Lastname (Comment)[TAB]group
            >
            > I wish to change it to:
            >
            > "Firstname Lastname" <username@domain>[TAB]Comment[TAB]group
            >
            > Actually, there are a handful of entries where the second field
            > contains variations on the "Firstname Lastname (Comment)" format, so I
            > am happy to leave the second field untouched in the revised version
            > and do the final refinements manually
            >
            > Obviously, I'll need to check the regular expression box, but what
            > should I enter in the Search/Replace boxes to catch all forms of valid
            > mail address and collect the first two words from the second field to
            > place them in the first, with the required punctuation round the name
            > and address?
            >
            >
            Try this:

            SEARCH
            ^([^\t]+)\t([^\s]+) ([^\s]+)([^\t]+)\t([^\r]+)

            REPLACE WITH
            "$2 $3" <$1>\t$4\t$5

            It may need a little more tweaking, especially concerning the CR/LF
            string at the end of the line, but it's a start.

            (The trick here is using *negative* matches, like "match anything that
            is NOT a tab")

            Marcelo
            -=-=-
            Could crop circles be the work of a cereal killer?
            * TagZilla 0.066 on Seamonkey 1.1.4
          • Don - HtmlFixIt.com
            Message 5 of 20, Sep 28, 2007
              > Try this ... so long as domains are all actual domains and not ip
              > addresses ...
              >
              > ^([a-z|0-9|\.|_]*)@([a-z|0-9|\.|_]*)\.([a-z]{1,4})\t(.*)\((.*)\)(.*)
              >
              > \"$4\" <$1@$2.$3>\t$5\t$6
              >
              I suppose this part:
              [a-z]{1,4}
              could actually be
              [a-z]{2,4}
              since there are no one letter domain extensions.


              To better understand my whack at this:
              ^ anchors it to start of line

              ([a-z|0-9|\.|_]*) any combination of letters, numbers, dots or
              underscores; you may wish to add a hyphen, I suppose. The dot is escaped
              so that it is really a dot, because in regex a dot not escaped with a
              backslash means any character

              @ the at sign

              ([a-z|0-9|\.|_]*) the domain name, consisting of any letters or numbers.
              For domain names you should definitely add a hyphen to this, as there are
              many hyphenated domain names

              \. the dot in the domain names

              ([a-z]{2,4}) the top level domain, which to my knowledge must be 2-4
              alphabetic characters, from ca to info for example.

              \t tab

              (.*) any characters for the name (the * means unlimited in number)

              \( the opening ( on the comment stops the unlimited characters ... I
              assume there really was a (

              (.*) any characters in the comment

              \) stops at the ) at the end of the comment

              (.*) any characters for the group
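To see which kinds of record the pattern described above accepts, here is a rough Python sketch. The four records are hypothetical, not lines from Greg's real file, and Python's `re` stands in for NoteTab's PCRE engine. Note that inside `[...]` the `|` is a literal pipe rather than an "or".

```python
import re

# Don's first pattern as posted. The classes list lowercase letters, digits,
# dot, underscore and (unintentionally) the literal "|" character.
DON_V1 = re.compile(
    r'^([a-z|0-9|\.|_]*)@([a-z|0-9|\.|_]*)\.([a-z]{1,4})\t(.*)\((.*)\)(.*)'
)

# Hypothetical records for illustration only.
records = [
    "user@example.com\tFirstname Lastname (Comment)\tgroup",    # the posted sample
    "user@example.com\tFirstname Lastname\tgroup",              # no comment
    "User@Example.com\tFirstname Lastname (Work)\tgroup",       # uppercase letters
    "first-last@example.com\tFirstname Lastname (Home)\tgroup", # hyphen in address
]

matched = [DON_V1.match(record) is not None for record in records]
print(matched)
```

Only the first record matches: the others fail on the missing parentheses, the uppercase letters, or the hyphen.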
            • Sheri
              Message 6 of 20, Sep 28, 2007
                --- In notetab@yahoogroups.com, Marcelo de Castro Bastos
                <mcblista@...> wrote:
                >

                > ^([^\t]+)\t([^\s]+) ([^\s]+)([^\t]+)\t([^\r]+)

                > It may need a little more tweaking, especially concerning the CR/LF
                > string at the end of the line, but it's a start.

                Hi Marcelo,

                I would suggest to include both \r and \n in all your negated
                character classes unless you specifically want to allow for crossing
                linebreaks, and to never include one without the other.

                >
                > (The trick here is using *negative* matches, like "match anything
                > that is NOT a tab")

                What harm to match instead that which is not a tab, carriage return or
                linefeed?

                As is, if it hits a line that's missing the tabs, it eats multiple
                pieces of lines.

                Regards,
                Sheri
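The effect Sheri describes can be reproduced with a small Python sketch. The two-line text is invented, with its first line deliberately lacking a tab, and Python's `re` is assumed to behave like PCRE for these simple classes.

```python
import re

# Hypothetical text: the first line has no tab at all.
text = "a line with no tabs\nuser@example.com\tFirstname Lastname\tgroup"

# [^\t]+ also matches the line feed, so the first field swallows the broken line.
loose = re.search(r'^([^\t]+)\t', text, re.MULTILINE)

# Excluding \r and \n as well keeps the field on a single line.
strict = re.search(r'^([^\t\r\n]+)\t', text, re.MULTILINE)

loose_field = loose.group(1)
strict_field = strict.group(1)
print(repr(loose_field))
print(repr(strict_field))
```

With the stricter class, the line without tabs is simply skipped instead of being glued onto the next record.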
              • Sheri
                Message 7 of 20, Sep 28, 2007
                  --- In notetab@yahoogroups.com, "Don - HtmlFixIt.com" <don@...> wrote:

                  Hi Don,

                  Just wanted to point out that a dot has no special meaning inside a
                  character class. Escaping it with a backslash there isn't necessary
                  (but doesn't hurt anything).

                  Also, you might want to be careful with asterisk. It doesn't just
                  match any number of whatever is referenced, it also matches none (a
                  quantity of zero) of them.

                  Regards,
                  Sheri
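Both points can be checked quickly in Python. This is only a sketch using Python's `re` module; the simplified class `[a-z0-9._]` and the address `@example.com` are assumptions chosen for the demonstration.

```python
import re

# Inside a character class the dot is literal whether or not it is escaped,
# so neither class matches the letter "x".
unescaped_rejects_x = re.fullmatch(r'[.]', 'x') is None
escaped_rejects_x = re.fullmatch(r'[\.]', 'x') is None

# * accepts zero repetitions, so an address with an empty user part still matches.
star_accepts_empty = re.match(r'^([a-z0-9._]*)@', '@example.com') is not None

# + insists on at least one character, so the same input is rejected.
plus_accepts_empty = re.match(r'^([a-z0-9._]+)@', '@example.com') is not None

print(unescaped_rejects_x, escaped_rejects_x, star_accepts_empty, plus_accepts_empty)
```

Switching from `*` to `+` is the usual way to require at least one character in a field.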
                • Don - HtmlFixIt.com
                  Message 8 of 20, Sep 28, 2007
                    Sheri wrote:
                    > --- In notetab@yahoogroups.com, "Don - HtmlFixIt.com" <don@...> wrote:
                    >
                    > Hi Don,
                    >
                    > Just wanted to point out that a dot has no special meaning inside a
                    > character class. Escaping it with a backslash there isn't necessary
                    > (but doesn't hurt anything).
                    >
                    > Also, you might want to be careful with asterisk. It doesn't just
                    > match any number of whatever is referenced, it also matches none (a
                    > quantity of zero) of them.
                    >
                    > Regards,
                    > Sheri

                    Thanks Sheri,

                    Good points. I guess I have always escaped dots so never realized.
                    I guess I hadn't thought about zero on the asterisk. I should have used
                    a plus sign for one or more then, correct?
                  • Greg Chapman
                    Message 9 of 20, Sep 28, 2007
                      Hi Marcelo and Don,

                      Thanks for both having a go at this!

                      On 28 Sep 07 14:28 Marcelo de Castro Bastos <mcblista@...>
                      said:
                      > > I have a simple address book file in the form:
                      > >
                      > > user@domain[TAB]Firstname Lastname (Comment)[TAB]group
                      > >
                      > > I wish to change it to:
                      > >
                      > > "Firstname Lastname" <username@domain>[TAB]Comment[TAB]group
                      ...
                      > Try this:
                      >
                      > SEARCH
                      > ^([^\t]+)\t([^\s]+) ([^\s]+)([^\t]+)\t([^\r]+)
                      >
                      > REPLACE WITH
                      > "$2 $3" <$1>\t$4\t$5
                      >
                      > It may need a little more tweaking, especially concerning the CR/LF
                      > string at the end of the line, but it's a start.

                      No problem at all with the CR/LF and your code did more than I asked
                      and actually attempted the full job.

                      However, the last letter of the lastname was missing from the first
                      field... only to be found left behind sitting there all lonely in the
                      second field.

                      Even reading the PCRE help file I'm struggling to work out your code
                      and see how to edit it to pick up the final letter.
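One plausible explanation for the stray letter, sketched in Python under the assumption that the affected records have no "(Comment)" after the name: the fourth group `([^\t]+)` must match at least one character, so the engine backtracks and takes the last letter away from the third group. The record below is hypothetical.

```python
import re

# Marcelo's first pattern as posted.
MARCELO_V1 = re.compile(r'^([^\t]+)\t([^\s]+) ([^\s]+)([^\t]+)\t([^\r]+)')

# Hypothetical record with no "(Comment)" after the name.
line = "user@example.com\tFirstname Lastname\tgroup"

# Group 3 first grabs "Lastname", but group 4 needs one non-tab character,
# so group 3 gives back its final letter.
groups = MARCELO_V1.match(line).groups()
print(groups)

# Python writes backreferences as \1 where NoteTab uses $1.
result = MARCELO_V1.sub(r'"\2 \3" <\1>\t\4\t\5', line)
print(repr(result))
```

Making the fourth group optional, for example `([^\t]*)`, would let the third group keep the whole last name.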

                      Don,
                      I regret that your version was far less effective. It only found 8
                      changes when it should have found around 138.

                      Thanks again!

                      Greg
                    • Don - HtmlFixIt.com
                      Message 10 of 20, Sep 28, 2007
                        > Don,
                        > I regret that your version was far less effective. It only found 8
                        > changes when it should have found around 138.
                        >
                        > Thanks again!
                        >
                        > Greg
                        Hi Greg,

                        Well I am a self admitted Regex failure ... but it doesn't stop me from
                        trying.

                        Garbage in and garbage out ... But I didn't have the garbage to work
                        with :-) hence no garbage out I guess.

                        If I had the patterns I was looking for I could no doubt figure out why
                        I missed them. If you want to send me that file off list, I'd be happy
                        to look. I bet the ones missed didn't fit your sample pattern.

                        I checked to see if it matched what you gave me (the sample) and it did.
                        So your sample is in essence bad in that it obviously doesn't fairly
                        represent what you are actually searching.

                        Happy to have attempted even in failure.

                        Don
                      • Marcelo de Castro Bastos
                        Message 11 of 20, Sep 28, 2007
                          Interviewed by CNN on 28/9/2007 12:41, Greg Chapman told the world:
                          > No problem at all with the CR/LF and your code did more than I asked
                          > and actually attempted the full job.
                          >
                          > However,the last letter of the lastname was missing from the first
                          > field... only to be found left behind sitting there all lonely in the
                          > second field.
                          >
                          > Even reading the PCRE help file I'm struggling to work out your code
                          > and see how to edit it to pick up the final letter.
                          >
                          Hmmm, it might have something to do with whatever is the expected next
                          character. I also forgot to get rid of the parentheses.

                          Maybe a slightly changed version will do it:

                          ^([^\t]+)\t([^\s]+) ([^\(]+)\(([^\)]+\))\t([^\r]+)


                          Marcelo
                          -=-=-
                          This mind intentionally left blank.
                          * TagZilla 0.066 on Seamonkey 1.1.4
                        • Sheri
                          Message 12 of 20, Sep 28, 2007
                            --- In notetab@yahoogroups.com, Greg Chapman <gregchapmanuk@...> wrote:
                            >
                            > > SEARCH
                            > > ^([^\t]+)\t([^\s]+) ([^\s]+)([^\t]+)\t([^\r]+)

                            Most likely Greg did not pick up the space between the two ([^\s]+)
                            groupings.

                            Best to use \x20 for real spaces

                            BTW, [^\s] excludes more than just real spaces, e.g., it also excludes
                            tabs, carriage returns, line feeds and other whitespace characters.

                            Regards,
                            Sheri
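The difference between `[^\s]` and a class that excludes only the space character can be seen in a short Python sketch. The sample string is invented, and Python's `re` is assumed to treat `\s` and `\x20` the same way PCRE does here.

```python
import re

# Hypothetical field containing a tab followed later by a real space.
field = "Firstname\tLastname and more"

# [^\s]+ stops at any whitespace character, including the tab.
until_whitespace = re.match(r'[^\s]+', field).group()

# [^\x20]+ stops only at a real space, so it runs straight through the tab.
until_space = re.match(r'[^\x20]+', field).group()

print(repr(until_whitespace))
print(repr(until_space))
```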
                          • Alex Plantema
                            Message 13 of 20, Sep 28, 2007
                              On Friday 28 September 2007 14:46, Greg Chapman wrote:

                              > I have a simple address book file in the form:
                              >
                              > user@domain[TAB]Firstname Lastname (Comment)[TAB]group
                              >
                              > I wish to change it to:
                              >
                              > "Firstname Lastname" <username@domain>[TAB]Comment[TAB]group

                              I tried this, which seems to work:

                              search:
                              (.*)\t(.*) \((.*)\)\t(.*)

                              replace with:
                              "$2" <$1>\t$3\t$4

                              Alex.
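A quick way to check Alex's pair outside NoteTab is the Python sketch below. It assumes Python's `re` module, uses `\1` in place of `$1`, and runs on two made-up records, one with a comment and one without.

```python
import re

# Alex's search pattern as posted.
ALEX = re.compile(r'(.*)\t(.*) \((.*)\)\t(.*)')

def convert(line):
    # Quote the name, wrap the address in angle brackets, drop the parentheses.
    return ALEX.sub(r'"\2" <\1>\t\3\t\4', line)

# Hypothetical records for illustration.
with_comment = convert("user@example.com\tFirstname Lastname (Comment)\tgroup")
without_comment = convert("user@example.com\tFirstname Lastname\tgroup")

print(repr(with_comment))
print(repr(without_comment))
```

A record without a parenthesised comment does not match, so it is left exactly as it was.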
                            • Sheri
                              Message 14 of 20, Sep 29, 2007
                                --- In notetab@yahoogroups.com, Greg Chapman <gregchapmanuk@...> wrote:
                                >
                                > Hi Folks,
                                >
                                > I have a simple address book file in the form:
                                >
                                > user@domain[TAB]Firstname Lastname (Comment)[TAB]group
                                >
                                > I wish to change it to:
                                >
                                > "Firstname Lastname" <username@domain>[TAB]Comment[TAB]group
                                >

                                There's obviously a lot of ways to structure it. Here's one using
                                named substrings which should make it easier to tweak various
                                components if you want to only process lines that have valid info in
                                every field, or if you want to rearrange the output fields differently.

                                As shown below, it does require at least one character in every field.

                                Search:
                                ^(?<domain>.+?)\t(?<first>.+?)\x20(?<last>.+?)\x20\((?<comment>.+?)\)\t(?<group>.+)$

                                Replace:
                                "$<first>\x20$<last>"\x20$<domain>\t$<comment>\t$<group>


                                \x20 is a space, \x22 is a double quote.

                                .+? is ungreedy, one or more characters. Ungreedy means it stops when
                                it gets to the character which is supposed to follow it, e.g., the tab.

                                Regards,
                                Sheri
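The same named-substring idea can be tried in Python, with two syntax changes that are specific to Python's `re` module: groups are declared as `(?P<name>...)` and referenced in the replacement as `\g<name>`. This is a sketch on a made-up record, and the address is wrapped in angle brackets to match the format Greg asked for.

```python
import re

# Sheri's pattern with (?<name>...) rewritten as Python's (?P<name>...).
NAMED = re.compile(
    r'^(?P<domain>.+?)\t(?P<first>.+?)\x20(?P<last>.+?)\x20'
    r'\((?P<comment>.+?)\)\t(?P<group>.+)$',
    re.MULTILINE,
)

def convert(text):
    # \g<name> is Python's spelling of $<name>.
    return NAMED.sub(
        r'"\g<first> \g<last>" <\g<domain>>\t\g<comment>\t\g<group>', text
    )

# Hypothetical record for illustration.
sample = "user@example.com\tFirstname Lastname (Comment)\tgroup"
fields = NAMED.match(sample).groupdict()
converted = convert(sample)
print(fields)
print(repr(converted))
```

Because each piece has a name, reordering the output only means moving the `\g<...>` references around.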
                              • Sheri
                                Message 15 of 20, Sep 29, 2007
                                  Warning, if viewed on Yahoo, in the regular expression (Search) I
                                  posted in the previous message, Yahoo has rendered a backslash after
                                  <group> which shouldn't be there. The search criteria should all be on
                                  one line without that particular backslash. It came through my email
                                  without that addition.

                                  Regards,
                                  Sheri
                                • Greg Chapman
                                  Message 16 of 20, Sep 29, 2007
                                    Hi Marcelo,

                                    On 28 Sep 07 17:09 Marcelo de Castro Bastos <mcblista@...>
                                    said:

                                    > Hmmm, it might have something to do with whatever is the expected
                                    > next character. I also forgot to get rid of the parentheses.
                                    >
                                    > Maybe a slightly changed version will do it:
                                    >
                                    > ^([^\t]+)\t([^\s]+) ([^\(]+)\(([^\)]+\))\t([^\r]+)

                                    Mmmm! That found nothing at all!

                                    But you seem to be searching for parentheses there, and there aren't
                                    any in my file. There doesn't seem to be a change to reflect the
                                    "expected next character" though?

                                    But with these clues I shall try and see if I can puzzle the thing out
                                    myself!

                                    Greg
                                  • Greg Chapman
                                    Message 17 of 20, Sep 29, 2007
                                      Hi Tony,

                                      On 28 Sep 07 14:10 Tony Mc <afmcc@...> said:
                                      >
                                      > if the comments are really enclosed in parentheses, as your example
                                      > suggests, you should use:

                                      Yes the comments really are in parentheses, where they exist. Most
                                      records do not include a comment. I have a few names to which I have
                                      added (Work) or (Home). Once the names are taken out of the "Comment"
                                      field I don't need the parentheses round the comment any more.
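Given that description (a comment only on some records, parentheses not wanted in the output), one possible approach is a pattern in which the "(Comment)" part is optional. The Python sketch below is hypothetical: the pattern was not posted by anyone in this thread, and the records are invented.

```python
import re

# Hypothetical pattern: the "(Comment)" part is optional, and no field is
# allowed to cross a tab or a line break.
OPTIONAL_COMMENT = re.compile(
    r'^([^\t\r\n]+)\t([^\t\r\n(]+?)\x20*(?:\(([^)\r\n]*)\))?\t([^\t\r\n]*)$',
    re.MULTILINE,
)

def convert(text):
    # When the optional group did not take part in the match, Python (3.5+)
    # substitutes an empty string for \3.
    return OPTIONAL_COMMENT.sub(r'"\2" <\1>\t\3\t\4', text)

# Hypothetical records for illustration.
with_comment = convert("user@example.com\tFirstname Lastname (Work)\tgroup")
without_comment = convert("user@example.com\tFirstname Lastname\tgroup")

print(repr(with_comment))
print(repr(without_comment))
```

Records without a comment end up with an empty middle column, so all three columns stay aligned.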

                                      > Search for: ^(.*@.*)\t(.*?)(\(.*?\))\t(.*)$
                                      > Replace with: $2 <$1>\t$3\t$4

                                      This suggestion seems not to tackle the main part of the problem -
                                      getting the first field to wrap the newly inserted name in quotes and
                                      the address in <>.

                                      For a better understanding of the problem, take a look at:
                                      http://www.npopsupport.org.uk/addrbook.htm

                                      There is a new release coming within a few days, and I have decided to
                                      update my address book in line with what the programmers are now
                                      suggesting. Note that this doesn't quite follow any of the examples
                                      that I offer for the format of the address field. (If you're looking
                                      at the page updated on 13 July 2007)

                                      Greg
                                    • Don - HtmlFixIt.com
                                      Message 18 of 20, Sep 29, 2007
                                        ^([a-z|A-Z|0-9|.|_|\-]+?)@([a-z|A-Z|0-9|.|_|\-]+?)\.([a-z|A-Z|0-9|.|_|\-]{2,4})\t(.*?)\t(.*)

                                        This missed just three of the examples you sent me directly, and for
                                        explainable reasons.
                                        The expression should be 93 characters long.

                                        I'm sure someone will reduce it to nothing :-)
                                      • Tony Mc
                                        Message 19 of 20, Sep 30, 2007
                                          On Sat, 29 Sep 2007 22:50:39 +0100, you wrote:

                                          > HI Tony,
                                          > > Search for: ^(.*@.*)\t(.*?)(\(.*?\))\t(.*)$
                                          > > Replace with: $2 <$1>\t$3\t$4
                                          >
                                          > This suggestion seems not to tackle the main part of the problem -
                                          > getting the first field to wrap the newly inserted name in quotes and
                                          > the address in <>.
                                          >

                                          Hi Greg,

                                          sorry, I missed the bit about the name being enclosed in quotes,
                                          though the replacement above will enclose the email address in angle
                                          brackets (why do you think it doesn't?). So, to do the wrapping of the
                                          name, try this:

                                          Search for: ^(.*@.*)\t(.*?)\s+(\(.*?\))\t(.*)$
                                          Replace with: "$2" <$1>\t$3\t$4

                                          The search string is the same as before except that there is a \s+
                                          between the name and comment fields, which prevents white space from
                                          being included in the name field. Anyway, let me know if this does
                                          what you want.

                                          Best,
                                          Tony
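Tony's revised pair can be checked with the Python sketch below. It assumes Python's `re` module, writes `\1` for `$1`, and applies the pattern to one made-up record at a time.

```python
import re

# Tony's revised pattern: \s+ sits between the name and the comment, so the
# space before "(" is not captured as part of the name.
TONY_V2 = re.compile(r'^(.*@.*)\t(.*?)\s+(\(.*?\))\t(.*)$')

def convert(line):
    # The comment keeps its parentheses because they are inside group 3.
    return TONY_V2.sub(r'"\2" <\1>\t\3\t\4', line)

# Hypothetical records for illustration.
with_comment = convert("user@example.com\tFirstname Lastname (Comment)\tgroup")
without_comment = convert("user@example.com\tFirstname Lastname\tgroup")

print(repr(with_comment))
print(repr(without_comment))
```

As with the other patterns that require parentheses, a record without a comment is left untouched.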
                                        • Greg Chapman
                                          Message 20 of 20, Oct 4, 2007
                                            Hi Don,

                                            Apologies to you and all the others who responded for not getting back
                                            to you sooner. Other pressures mean I've not been reading this list
                                            for a few days.

                                            Thanks so much for this...

                                            On 29 Sep 07 23:00 "Don - HtmlFixIt.com" <don@...> said:
                                            >
                                            >
                                            >
                                            > ^([a-z|A-Z|0-9|.|_|\-]+?)@([a-z|A-Z|0-9|.|_|\-]+?)\.([a-z|A-Z|0-9|.|_|\-]{2,4})\t(.*?)\t(.*)
                                            >
                                            > missed just three of your examples you sent me directly and for
                                            > explainable reasons
                                            > should be 93 characters long

                                            Yes it certainly would have been a tall order to get it 100% right.

                                            Most of those who had a go for me (and thanks to all) concentrated too
                                            much on my attempt at a summary of the typical entry and didn't read
                                            all my accompanying notes, which did rather alter the problem.

                                            Checking back, I also realised that in one of my followups I provided
                                            a conflicting bit of information, so that didn't help.

                                            You didn't tell me what I should use as a replace string, so I used
                                            the original one, but that left me with a trailing $6 on the end of
                                            each entry, so I deleted that and tried again. Also the stuff in
                                            column three shifted into column two. However, that was easy to sort
                                            out with a regular search for a ^T and replace with ^T^T.
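The two manual fixes described above (dropping the stray `$6` and doubling the tab) can be folded into a single replacement string. The Python sketch below is an assumption about how that would look, not something posted in the thread; it uses Don's five-group pattern as posted and invented records.

```python
import re

# Don's final pattern as posted: user, domain, top-level part, name, group.
DON_V2 = re.compile(
    r'^([a-z|A-Z|0-9|.|_|\-]+?)@([a-z|A-Z|0-9|.|_|\-]+?)'
    r'\.([a-z|A-Z|0-9|.|_|\-]{2,4})\t(.*?)\t(.*)',
    re.MULTILINE,
)

def convert(text):
    # Only five groups exist, so there is no \6; the doubled tab leaves an
    # empty comment column and keeps the group in column three.
    return DON_V2.sub(r'"\4" <\1@\2.\3>\t\t\5', text)

# Hypothetical records for illustration.
plain = convert("user@example.com\tFirstname Lastname\tgroup")
commented = convert("user@example.com\tFirstname Lastname (Comment)\tgroup")

print(repr(plain))
print(repr(commented))
```

The second result shows why a few entries still needed hand editing: the "(Comment)" ends up inside the quoted name.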

                                            After that it was just the three entries you mentioned and the handful
                                            of entries which by then had the (Comment) as part of the name. It
                                            certainly saved me a mass of time.

                                            Thanks again!

                                            Now I really must set aside some time to work out exactly how that all
                                            worked!

                                            Greg