## regular expression: word boundary

Message 1 of 3 , May 30, 2006
Hello everyone:

I have trouble understanding how the word boundaries \b and \B work.

I searched the internet and found a good explanation of word boundaries:

There are four different positions that qualify as word boundaries:

a) Before the first character in the string, if the first
character is a word character.
b) After the last character in the string, if the last character is
a word character.
c) Between a word character and a non-word character following
right after the word character.
d) Between a non-word character and a word character following
right after the non-word character.

I understand a and b, but have trouble figuring out c and d.

If you can give me an illustration of how cases c) and d) work I will
###############################################
The examples I use for cases a and b are:

1. "Charles the Brit raced his moped through the park."
2. "The Park Ranger watched Charles do this."

var reg3 = /\bt/; // " the" and " through" but not "Brit" or "watched"
var reg4 = /\Bt/; // "Brit" or "watched" but not " the" or " through"

\bt - the pattern matches a word boundary followed by the character 't'.
the - matches because there is white space before the 't' (a separation
point between a non-word and a word character).
Brit - won't match because the 'i' that precedes the 't' is a word
character, so there is no word boundary before the 't'.

\Bt - the pattern matches a non-word-boundary followed by the character 't'.

the - won't match because the position between the white space and the
't' is a word boundary, so it fails.

Brit - matches because 'i' is a word character, so the position before
the 't' is not a word boundary.
Message 2 of 3 , May 30, 2006
On 5/30/06, flashqa1 <flashqa1@...> wrote:
> c)Between a word character and a non-word character following
> right after the word character.
> d)Between a non-word character and a word character following
> right after the non-word character.

c) alert("x+".match(/\w\b\W/));
d) alert("+x".match(/\W\b\w/));

--
Jonas Raoni Soares Silva
http://www.jsfromhell.com
Message 3 of 3 , May 30, 2006
On 30/05/06, flashqa1 <flashqa1@...> wrote:
> Hello everyone:
> There are four different positions that qualify as word boundaries:
>
> a) Before the first character in the string, if the first
> character is a word character.
> b)After the last character in the string, if the last character is
> a word character.
> c)Between a word character and a non-word character following
> right after the word character.
> d)Between a non-word character and a word character following
> right after the non-word character.
>
> I understand first a and b, but have trouble to figure out C and D.

Let's try explaining it:
First, something about strings... You've been taught string characters
are 0-indexed, right? The first character is character number 0?
Forget that. The index doesn't represent a character - the index
represents the place between characters. Index 0 is the place before
any characters in the string. Index 1 is the place between the first
character and the second, and so on. The last index is the same as the
length of the string. This index is the place after the last character
in the string.
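
To make the "index is a place between characters" model concrete, here is a minimal sketch. The helper name describePositions is made up for this illustration; it just lists every index of a string together with the character on each side of it (null where there is none).

```javascript
// Hypothetical helper (not from the thread): list every position in a
// string along with the characters to its left and right.
function describePositions(str) {
  const out = [];
  // Note "<=": a string of length n has n + 1 positions (0 .. n).
  for (let i = 0; i <= str.length; i++) {
    out.push({
      index: i,
      left: i > 0 ? str[i - 1] : null,         // nothing left of index 0
      right: i < str.length ? str[i] : null,   // nothing right of the last index
    });
  }
  return out;
}

const positions = describePositions("test");
for (const p of positions) {
  console.log(p.index, JSON.stringify(p.left), JSON.stringify(p.right));
}
```

For "test" this gives five positions, 0 through 4, and the last one equals the length of the string.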

"Why should I be thinking like this? I thought the old model worked
just fine..."
Well, the thing is, with regex you can match either a character before
or after the index - or you can match the index. In other words you
can match the place between characters. Start of string, end of
string, start of line, end of line, word boundary. These are all
examples of matching between characters.
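
One way to see that these patterns match a place rather than a character is to replace the (empty) match with a visible marker. This is only a sketch; the sample string "ab cd" and the marker characters are arbitrary choices for the illustration.

```javascript
// Each of these patterns matches a position, so the match is empty and
// replace() inserts the marker without consuming any character.
const marked = "ab cd".replace(/\b/g, "|");  // every word boundary
const startMarked = "ab cd".replace(/^/, ">"); // start of string
const endMarked = "ab cd".replace(/$/, "<");   // end of string

console.log(marked);      // |ab| |cd|
console.log(startMarked); // >ab cd
console.log(endMarked);   // ab cd<
```

No character of the original string is lost in any of the three results, which is what you would expect if only the gaps are being matched.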

So, we're matching the places between characters and not the actual
characters. How does the matching work? Well it's a boolean test
really:

isWordBoundary(index) = isWordCharacter(characterLeftOf(index))
XOR isWordCharacter(characterRightOf(index))

Which, translated to English, reads: if, and only if, exactly one of
the character to the left of the index and the character to the right
of the index is a word character, this index is a word boundary. If
both or neither are word characters, this index is not a word
boundary.

And a results table for that would look like this:

                  Left is \w    Left isn't \w
Right is \w       \B            \b
Right isn't \w    \b            \B
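
The XOR test above can be written out directly and compared against what the regex engine reports. This is a sketch under my own assumptions: the function names mirror the pseudo-formula but are not a built-in API, the sample string "x+ +x" is arbitrary, and String.prototype.matchAll needs a reasonably recent engine.

```javascript
// Start and end of string have no character there (undefined),
// which counts as "isn't \w".
function isWordCharacter(ch) {
  return ch !== undefined && /^\w$/.test(ch);
}

// A position is a word boundary when exactly one neighbour is a word character.
function isWordBoundary(str, index) {
  const left = isWordCharacter(str[index - 1]);
  const right = isWordCharacter(str[index]);
  return left !== right; // boolean XOR
}

// Every index (0 .. length) that passes the XOR test.
function boundaryIndexes(str) {
  const result = [];
  for (let i = 0; i <= str.length; i++) {
    if (isWordBoundary(str, i)) result.push(i);
  }
  return result;
}

// The same question asked of the regex engine.
function regexBoundaryIndexes(str) {
  return [...str.matchAll(/\b/g)].map((m) => m.index);
}

const byHand = boundaryIndexes("x+ +x");
const byRegex = regexBoundaryIndexes("x+ +x");
console.log(byHand);  // [0, 1, 4, 5]
console.log(byRegex); // [0, 1, 4, 5]
```

Indexes 1 and 4 are cases c and d (between 'x' and '+', and between '+' and 'x'); indexes 0 and 5 are cases a and b.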

> If you can give me a illustration how case c) and d) works I will
> ###############################################
> The example I use for case a and b is:

These examples actually show cases c and d. Cases a and b only
occur if the character you're looking at is the first or last
character of the string, respectively.

> 1. "Charles the Brit raced his moped through the park."
> 2. "The Park Ranger watched Charles do this."
>
> var reg3 = /\bt/; // " the" and " through" but not "Brit" or "watched"

/\bt/ = <A word boundary followed by a 't' character>
Since 't' is a word character you can translate this into:
/\bt/ = <A 't' character which directly follows a non-word character>

Space is non-word character, thus the 't' characters in "the" and
"through" both match. 'i' and 'a' are word characters though, thus the
't' characters in "Brit" and "watched" don't match.
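
To check that translation on the two example sentences, here is a small sketch that collects the index of every match. The helper matchIndexes is made up for this illustration, the g flag is added so that all matches are found rather than only the first, and the lookbehind form (?<!\w) assumes an engine that supports lookbehind.

```javascript
const sentences = [
  "Charles the Brit raced his moped through the park.",
  "The Park Ranger watched Charles do this.",
];

// Hypothetical helper: the start index of every match of a global regex.
function matchIndexes(re, str) {
  return [...str.matchAll(re)].map((m) => m.index);
}

// /\bt/ : a word boundary followed by 't'.
const withBoundary = sentences.map((s) => matchIndexes(/\bt/g, s));
// The "translated" form: a 't' that is not preceded by a word character.
const withLookbehind = sentences.map((s) => matchIndexes(/(?<!\w)t/g, s));

console.log(withBoundary);   // [[8, 33, 41], [35]]
console.log(withLookbehind); // [[8, 33, 41], [35]]
```

In the first sentence the matches are the 't' of "the", "through" and the second "the"; in the second sentence only "this" matches, because the pattern is case sensitive and "The" starts with a capital 'T'.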

> var reg4 = /\Bt/; // "Brit" or "watched" but not " the" or " through"

/\Bt/ = <A non-word-boundary followed by a 't' character>
Since 't' is a word character you can translate this into:
/\Bt/ = <A 't' character which directly follows a word character>

'i' and 'a' are word characters, thus the 't' characters in "Brit" and
"watched" match. Space is a non-word character though, thus the 't'
characters in "the" and "through" don't match.
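
As a rough cross-check of both patterns, the sketch below splits each example sentence into words and sorts them by which pattern they satisfy. Splitting on /\W+/ and the helper name wordsMatching are my own choices for the illustration, not something from the thread.

```javascript
const sentences = [
  "Charles the Brit raced his moped through the park.",
  "The Park Ranger watched Charles do this.",
];

// Hypothetical helper: the words of a sentence that a pattern matches.
// Each word is tested on its own, so the start of the word plays the
// same role as the space in front of it in the full sentence.
function wordsMatching(re, sentence) {
  return sentence.split(/\W+/).filter((word) => re.test(word));
}

const startsWithT = sentences.map((s) => wordsMatching(/\bt/, s));
const innerT = sentences.map((s) => wordsMatching(/\Bt/, s));

console.log(startsWithT); // [["the", "through", "the"], ["this"]]
console.log(innerT);      // [["Brit"], ["watched"]]
```

The two lists never overlap here, since each 't' in these sentences either follows a word character or it doesn't.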

Cases a and b:
var
str='test',
re0=/\bt/,
re1=/\Bt/,
re2=/t\b/,
re3=/t\B/;

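function report(arr){
var a=[],i;
// collect every property of the exec() result as "key: value"
for(i in arr)
a.push(i+': '+arr[i]);
alert(a.join(' ')); // show the collected pairs
}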

report(re0.exec(str)); // => 0:'t' index:0 input:'test'
report(re1.exec(str)); // => 0:'t' index:3 input:'test'
report(re2.exec(str)); // => 0:'t' index:3 input:'test'
report(re3.exec(str)); // => 0:'t' index:0 input:'test'

Here you see that start of string and end of string are entered in the
results table I listed above as "isn't \w". Since 't' is a word
character, /\bt/ and /t\b/ match where the 't' is preceded by
start of string or followed by end of string, respectively. /\Bt/ and
/t\B/ are the opposite and don't match where the 't' is preceded by
start of string or followed by end of string, respectively.
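
A compact way to see the same thing, as a sketch: String.prototype.search returns the index of the first match, or -1 when there is none. The single-character string "t" is an extra case I added, since its only 't' touches both the start and the end of the string.

```javascript
// First-match index for each pattern on 'test' (t e s t, indexes 0..3).
const onTest = {
  bt: "test".search(/\bt/), // 't' preceded by start of string
  Bt: "test".search(/\Bt/), // 't' preceded by 's'
  tb: "test".search(/t\b/), // 't' followed by end of string
  tB: "test".search(/t\B/), // 't' followed by 'e'
};

// On "t" the only 't' sits between start and end of string, so the
// \B patterns have nowhere to match.
const onSingleT = {
  bt: "t".search(/\bt/),
  Bt: "t".search(/\Bt/),
  tb: "t".search(/t\b/),
  tB: "t".search(/t\B/),
};

console.log(onTest);    // { bt: 0, Bt: 3, tb: 3, tB: 0 }
console.log(onSingleT); // { bt: 0, Bt: -1, tb: 0, tB: -1 }
```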
--