Problem with lazy regexp

Problem with lazy regexp

T Lee Davidson
According to http://gambaswiki.org/wiki/doc/pcre , using "*?" in a regular expression should lazily match 0 or more characters.
However, it appears to act greedily.

I am trying to do some very simple HTML tag stripping with 'RegExp.Replace(sText, "<.*?>", "")', and it takes out way more than
just the tags.
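
For reference, the lazy behaviour the documentation describes can be sketched outside Gambas. This is only an illustration in Python, whose `re` module follows the same Perl-style quantifier rules and has no ungreedy option; the sample string is made up and nothing here uses gb.pcre.

```python
import re

# Made-up sample input, not taken from the attached project.
html = '<tag abc="xyz">content</tag>'

# Greedy: ".*" runs to the last ">" in the string, so everything is removed.
greedy = re.sub(r"<.*>", "", html)

# Lazy: ".*?" stops at the first ">" after each "<", so only the tags go.
lazy = re.sub(r"<.*?>", "", html)

print(repr(greedy))  # ''
print(repr(lazy))    # 'content'
```

If a replace call removes "way more than just the tags", the quantifier is being applied greedily, whatever the pattern text says.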

Have I misunderstood the documentation?

(Project attached.)

--
Lee

[System]
Gambas=3.9.2
OperatingSystem=Linux
Kernel=4.4.57-18.3-default
Architecture=x86_64
Distribution=SuSE NAME="openSUSE Leap"
VERSION="42.2"
ID=opensuse
ID_LIKE="suse"
VERSION_ID="42.2"
PRETTY_NAME="openSUSE Leap 42.2"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:42.2"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
Desktop=KDE5
Theme=QtCurve
Language=en_US.UTF-8
Memory=3951M

[Libraries]
DBus=libdbus-1.so.3.8.14
OpenGL=libGL.so.1.2.0

[Environment]
(redacted)

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Gambas-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gambas-user

lazyregextest-0.0.1.tar.gz (16K) Download Attachment

Re: Problem with lazy regexp

Tobias Boege-2
On Sun, 23 Apr 2017, T Lee Davidson wrote:

> According to http://gambaswiki.org/wiki/doc/pcre , using "*?" in a regular
> expression should lazily match 0 or more characters. However, it appears to
> act greedily.
>
> I am trying to do some very simple HTML tag stripping with
> 'Regex.Replace(sText, "<.*?>", "")', and it takes out way more than just the
> tags.
>
> Have I misunderstood the documentation?
>

I believe you are correct. I get the same greedy behaviour from "<.*?>".
The Gambas wiki page seems to be copied from the libpcre documentation [1]
and the point, under QUANTIFIERS:

  *?          0 or more, lazy

hardly gives room for misinterpretation. I just tried the following line:

  RegExp.Replace("<tag abc=\"xyz\">content</tag>", "<.*>", "", RegExp.Ungreedy)

which correctly delivers "content", if you are interested in a workaround.
If no one else does it, I can (try to remember to) try to have a look at
gb.pcre this evening.

Regards,
Tobi

[1] http://www.pcre.org/current/doc/html/pcre2syntax.html

--
"There's an old saying: Don't change anything... ever!" -- Mr. Monk

Re: Problem with lazy regexp

Tobias Boege-2
On Mon, 24 Apr 2017, Tobias Boege wrote:

> On Sun, 23 Apr 2017, T Lee Davidson wrote:
> > According to http://gambaswiki.org/wiki/doc/pcre , using "*?" in a regular
> > expression should lazily match 0 or more characters. However, it appears to
> > act greedily.
> >
> > I am trying to do some very simple HTML tag stripping with
> > 'Regex.Replace(sText, "<.*?>", "")', and it takes out way more than just the
> > tags.
> >
> > Have I misunderstood the documentation?
> >
>
> I believe you are correct. I get the same greedy behaviour from "<.*?>".
> The Gambas wiki page seems to be copied from the libpcre documentation [1]
> and the point, under QUANTIFIERS:
>
>   *?          0 or more, lazy
>
> hardly gives room for misinterpretation. I just tried the following line:
>
>   RegExp.Replace("<tag abc=\"xyz\">content</tag>", "<.*>", "", RegExp.Ungreedy)
>
> which correctly delivers "content", if you are interested in a workaround.
> If no one else does it, I can (try to remember to) try to have a look at
> gb.pcre this evening.
>

It's still before noon, but I saw that the RegExp.Replace() routine always
automatically adds the RegExp.Ungreedy flag to the regular expression. With
that in mind, I tried

  RegExp.Replace(sText, "<.*>", "")

and it worked ungreedily. In fact, since the compilation options are always
OR'd, my successful pattern above with RegExp.Ungreedy was just an accident
and the setting of RegExp.Ungreedy was redundant. The PCRE documentation [1]
mentions a fact that escapes the Gambas documentation [2]:

  PCRE2_UNGREEDY           Invert greediness of quantifiers

(The Gambas documentation reads as if the flag makes everything ungreedy.)

So the greediness you get is explained; I'll add some bits to the
documentation later. Basically, RegExp.Replace() is always ungreedy.
You can still get greedy quantifiers by using ungreedy ones in your
pattern...
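
The inversion described above can be modelled as a small truth table. The sketch below is in Python rather than Gambas, and `effective_mode` is a hypothetical helper, not part of gb.pcre or PCRE; it assumes only what the PCRE documentation states, namely that the ungreedy option inverts each quantifier. The second half shows a pattern whose result does not depend on greediness at all, which may be a more robust way to strip tags.

```python
import re


def effective_mode(has_question_mark: bool, ungreedy_flag: bool) -> str:
    """Model of the ungreedy option: the flag inverts each quantifier."""
    return "lazy" if has_question_mark != ungreedy_flag else "greedy"


# Print all four combinations of flag and quantifier form.
for flag in (False, True):
    for has_q in (False, True):
        quant = "*?" if has_q else "*"
        print(f"ungreedy={flag!s:5} quantifier={quant:2} -> {effective_mode(has_q, flag)}")

# A negated character class cannot cross ">", so it strips the same
# text whether its quantifier ends up greedy or lazy.
html = "<p>Hello <b>world</b></p>"  # made-up sample input
as_greedy = re.sub(r"<[^>]*>", "", html)
as_lazy = re.sub(r"<[^>]*?>", "", html)
print(as_greedy == as_lazy)
```

Under this model, a call that always sets the flag treats a plain "*" as lazy and "*?" as greedy, which matches what the thread observed.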

Regards,
Tobi

[1] http://www.pcre.org/current/doc/html/pcre2_compile.html
[2] http://gambaswiki.org/wiki/comp/gb.pcre/regexp/ungreedy

--
"There's an old saying: Don't change anything... ever!" -- Mr. Monk

Re: Problem with lazy regexp

T Lee Davidson
On 04/24/2017 04:25 AM, Tobias Boege wrote:
> You can still get greedy quantifiers by using ungreedy ones in your
> pattern...

LOL. To get ungreedy behavior, use greedy quantifiers. That's logical.


Thank you very much, Tobi, for digging that up and updating the documentation.

Perhaps, though, more intuitive behavior would be achieved if the RegExp.Ungreedy flag were not set by default. Just a thought.

Thanks again.

--
Lee

