
Re: [Clip] Re: regex help ....

  • Don
    Message 1 of 6 , May 3, 2010
      Wish I had a clue what you just did ...
      it worked great ... except there are some other formats I guess ... that
      did not convert ... periods or hyphens, it appears.
      Day. ThurgooScott, Ja'Naye3 12.77q
      Ft. Wayne WaPerry, Jasmine4 12.82q
      Tol. BowsherFranklin, Joy6 12.91q
      Trotwood-MadDavis, Alexis7 12.93q
      Ft. Wayne PaHardy, Victoria8 13.04q
      Day. DunbarCherry, Meghan9 13.14
      Det. MumfordGreen, Rosalynd10 13.16

      > I think I'd indeed do just what you suggested.
      >
      > Search for:
      > "^([\w \.\-]+?)(\p{Lu}\p{Ll}+), (\pL+)(\d+) ([\d\.]+)\pL?"
      >
      > Replace with:
      > "$1\t$2\t$3\t$4\t$5"
      >
      >
    • diodeom
      Message 2 of 6 , May 3, 2010
        I see that you've already edited the first capturing pattern to account for possible periods and dashes in school names. To handle apostrophes in first names (Ja'Naye) the third pattern (\pL+) could be broadened as well into ([\pL']+).

        So the whole search part would read:
        "^([\w \.\-]+?)(\p{Lu}\p{Ll}+), ([\pL']+)(\d+) ([\d\.]+)\pL?"

        To break it down, find:
        at a line's beginning
        ^
        only as many "word" characters, spaces and specified punctuation
        ([\w \.\-]+?)
        until you stumble upon a capitalized word (single uppercase letter followed by a maximum number of lowercase letters) just before a comma and a space
        (\p{Lu}\p{Ll}+),
        followed by a maximum number of letters and/or apostrophes
        ([\pL']+)
        followed by a maximum number of digits before a space,
        (\d+)
        and after this space, max number of digits and/or dots
        ([\d\.]+)
        followed by an optional single letter
        \pL?

        Parentheses specify substrings to capture (and refer to later with $1, $2, etc.), so any undesirable fluff can be eliminated in the replacement.
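
        For readers who want to try the breakdown above outside the editor, here is a minimal sketch in Python. It is an approximation, not the pattern from the thread verbatim: Python's built-in re module does not support \p{Lu}, \p{Ll} or \pL, so the ASCII classes [A-Z], [a-z] and [A-Za-z] stand in for them, and $1 is written \1. The name to_tabs is a hypothetical helper.

```python
import re

# ASCII stand-in for the thread's pattern (assumption: names use A-Z/a-z only).
PATTERN = re.compile(
    r"^([\w .\-]+?)"     # $1 school: shortest run of word chars, spaces, dots, dashes
    r"([A-Z][a-z]+), "   # $2 last name: one capital plus lowercase letters, then ", "
    r"([A-Za-z']+)"      # $3 first name: letters and/or apostrophes
    r"(\d+) "            # $4 place: digits, then a space
    r"([\d.]+)"          # $5 time: digits and/or dots
    r"[A-Za-z]?",        # optional trailing letter such as "q"; matched but not captured
    re.MULTILINE,
)


def to_tabs(text):
    """Rewrite each matching line as five tab-separated fields."""
    return PATTERN.sub(r"\1\t\2\t\3\t\4\t\5", text)


sample = "Day. ThurgooScott, Ja'Naye3 12.77q\nDet. MumfordGreen, Rosalynd10 13.16"
print(to_tabs(sample))
```

        Because the trailing letter sits outside every pair of parentheses, it is matched but left out of the replacement, which is how the "q" disappears.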

        I'm afraid this pattern could yet grow quite a bit, e.g. if we were to consider a possible case of a dashed school's name (Trotwood-Mad) and similarly dashed runner's last name (Vasques-Ramirez)...
        I think that Flo's solution could avoid these issues altogether, though (as is) it doesn't look like it would know NOT to split e.g. "Oak Ridge" with a tab.
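
        The dashed-name concern can be seen in a small Python sketch. The input line is hypothetical (it is built from the two names mentioned above and is not from Don's data), and the ASCII classes again stand in for \p{Lu}, \p{Ll} and \pL, which Python's built-in re module lacks.

```python
import re

# Same ASCII approximation of the search pattern as discussed in this message.
PATTERN = re.compile(
    r"^([\w .\-]+?)([A-Z][a-z]+), ([A-Za-z']+)(\d+) ([\d.]+)[A-Za-z]?"
)

# Hypothetical line: dashed school name plus dashed last name.
line = "Trotwood-MadVasques-Ramirez, Ana5 12.99q"

m = PATTERN.match(line)
school, last_name = m.group(1), m.group(2)

# The second group only accepts one capitalized word right before ", ",
# so the first half of the last name ends up attached to the school.
print(repr(school))     # 'Trotwood-MadVasques-'
print(repr(last_name))  # 'Ramirez'
```

        Allowing a dash inside the second group would fix this case, but the pattern would then have no way to tell "Mad" + "Vasques-Ramirez" from other splits without more rules.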

        --- Don <don@...> wrote:
        >
        > Wish I had a clue what you just did ...
        > it worked great ... except there are some other formats I guess ... that
        > did not convert ... periods or hyphens, it appears.
        > Day. ThurgooScott, Ja'Naye3 12.77q
        > Ft. Wayne WaPerry, Jasmine4 12.82q
        > Tol. BowsherFranklin, Joy6 12.91q
        > Trotwood-MadDavis, Alexis7 12.93q
        > Ft. Wayne PaHardy, Victoria8 13.04q
        > Day. DunbarCherry, Meghan9 13.14
        > Det. MumfordGreen, Rosalynd10 13.16
        >
        > > I think I'd indeed do just what you suggested.
        > >
        > > Search for:
        > > "^([\w \.\-]+?)(\p{Lu}\p{Ll}+), (\pL+)(\d+) ([\d\.]+)\pL?"
        > >
        > > Replace with:
        > > "$1\t$2\t$3\t$4\t$5"
        > >
        > >
        >
      • diodeom
        Message 3 of 6 , May 3, 2010
          I wrote:
          >
          > I think that Flo's solution could avoid these issues altogether, though (as is) it doesn't look like it would know NOT to split e.g. "Oak Ridge" with a tab.
          >

          This seems to work:

          Search for:
          "\p{Ll}\K(?=\p{Lu})|, |\pL\K(?=\d)|\d\K (?=\d)"

          Replace (WARS) with:
          "\t"

          It just leaves an occasional "q" at the end of some lines to clean.
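
          A rough Python equivalent of this boundary-based approach, for anyone testing it elsewhere. Two substitutions are assumptions on my part: Python's built-in re module has neither \K nor \p{..} classes, so each "X\K" becomes a lookbehind "(?<=X)" and the letter classes become ASCII ranges. It needs Python 3.7 or later for the empty matches to be replaced as expected. The names split_fields and strip_trailing_letter are hypothetical, and the second one is just one possible way to do the cleanup mentioned above.

```python
import re

# Insert a tab at each field boundary instead of matching the whole line.
BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])"      # lowercase directly followed by uppercase
    r"|, "                      # the comma and space after the last name
    r"|(?<=[A-Za-z])(?=\d)"     # letter directly followed by a digit
    r"|(?<=\d) (?=\d)"          # a space between two digits
)


def split_fields(line):
    return BOUNDARY.sub("\t", line)


def strip_trailing_letter(line):
    # Optional cleanup: drop a single letter that follows the final digit.
    return re.sub(r"(?<=\d)[A-Za-z]$", "", line)


row = split_fields("Day. ThurgooScott, Ja'Naye3 12.77q")
print(repr(row))                          # the "q" is still there
print(repr(strip_trailing_letter(row)))

# Hypothetical line: the space in "Oak Ridge" is not a boundary, so it stays.
print(repr(split_fields("Oak RidgeSmith, Ann5 12.50")))
```

          A space is only replaced when it sits between two digits, which is why multi-word school names survive intact.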