Regex - expert opinion requested

Fernando Cabral
This is only for those who like to work with regular expressions.
It is a performance issue. I am using 26 different regular expressions of
this kind:

txt = RegExp.Replace(TextoBruto, NaoNumerais, "&1\n", RegExp.UTF8)
txt = RegExp.Replace(Txt, "\n\n+?", "\n", RegExp.UTF8)
txt = RegExp.Replace(Txt, "^\n+?", "", RegExp.UTF8)
txt = RegExp.Replace(Txt, "\n+?$", "", RegExp.UTF8)

Those are pretty fast. Less than one second for a text with 415KB (about
six thousand lines).

But the following code is quite slow. About 27 seconds each:

ttDigitos = String.Len(RegExp.Replace(TextoBruto, "[^0-9]", "", RegExp.UTF8)) ' 27 seconds
ttPontuacao = String.Len(RegExp.Replace(TextoBruto, "[^.:;,?!]", "", RegExp.UTF8)) ' 27 seconds
ttBrancos = String.Len(RegExp.Replace(TextoBruto, "[^ \t]", "", RegExp.UTF8)) ' 27 seconds
Print "Especial antigo", Now
'ttEspeciais = String.Len(RegExp.Replace(TextoBruto, "[^-[\\](){}\"@#$%&*_+=<>/\\\\|ºª§“”‘’]", "", RegExp.UTF8)) ' 27 seconds
Print "Especial novo", Now
ttEspeciais = String.Len(RegExp.Replace(TextoBruto, "[-aeiouãáéíóúâõàbcçdfghjlmnpqrstvxyz ,.:;!?()0-9êôwkèìòùäÄÁÉÍÓÚÀÈÌÒÙÂÔÂÊÔÇABCDEFGHIJKLMNOPQRSTUVWXYZ]", "", RegExp.UTF8)) ' 27 seconds
Print "fim especial novo", Now

Quite slow. The whole program takes 2 minutes to run. The above lines
alone consume 108 seconds (108 of 120).
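
For finer-grained timing than printing Now around each statement, a single call can be wrapped in a small helper. The following is only a sketch: TimeReplace is a hypothetical name, and it assumes the gb.pcre component is loaded and that the built-in Timer function returns the seconds elapsed since program start as a Float.

```gambas
' Hypothetical helper: times one RegExp.Replace call (requires gb.pcre).
Public Sub TimeReplace(sText As String, sPattern As String)

  Dim fStart As Float
  Dim iLen As Integer

  ' Timer is assumed to return the seconds elapsed since the program started.
  fStart = Timer
  ' Same call shape as in the post: erase every match, then measure what is left.
  iLen = String.Len(RegExp.Replace(sText, sPattern, "", RegExp.UTF8))
  Print sPattern; ": "; iLen; " characters left, "; Format(Timer - fStart, "0.000"); " s"

End
```

It could be called as TimeReplace(TextoBruto, "[^0-9]") for each pattern in turn.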

I tried some variations. For instance, ttEspeciais = .... has two versions.
One negates what to leave in, the other describes what to take out. End
result is the same. And so is the time spent.

I have also written much longer code that does the same thing using loops
and searching for the characters I want in or want out. The whole thing
runs in about 5 seconds (but this code took me much, much longer to write).

I wonder if any of you could suggest potentially faster RegExp that could
replace the specimens above.

Regards

- fernando
--
Fernando Cabral
Blog: http://fernandocabral.org
Twitter: http://twitter.com/fjcabral
e-mail: [hidden email]
Facebook: [hidden email]
Telegram: +55 (37) 99988-8868
Wickr ID: fernandocabral
WhatsApp: +55 (37) 99988-8868
Skype:  fernandojosecabral
Landline: +55 (37) 3521-2183
Mobile: +55 (37) 99988-8868

As long as there is a single person in the world without a home or food,
no politician or scientist can boast of anything.
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Gambas-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gambas-user

Re: Regex - expert opinion requested

Tobias Boege-2
On Wed, 31 May 2017, Fernando Cabral wrote:

> I wonder if any of you could suggest potentially faster RegExp that could
> replace the specimens above.
>

This sounds interesting, because for one thing I can't imagine a pipe chain
of "sed" invocations to take this long on just 500 KiB input (but I could
be wrong).

Also, in case you didn't know, the IDE has a very handy profiler
(Debug > Activate profiling menu). It lets you take a somewhat closer look
at where your code spends its time, but it may not be of much help here.

About your regular expressions: I think the key point is that you are really
just erasing characters of character classes. Your expressions are extremely
simple in that regard. You mentioned that avoiding regular expressions gives
you a big speedup but the code took you longer to write. I don't see why.
You should be able to write a general function

  Private Function EraseClass(sStr As String, sClass As String) As String

which erases from sStr every character that is in sClass, using a simple
loop and String.InStr().
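
A minimal sketch of such a function, assuming Gambas 3 and its UTF-8 aware String class. The local names iPos, sChar and aKept are illustrative, and sClass is treated as a plain list of characters, so ranges such as 0-9 have to be spelled out:

```gambas
' Sketch: returns sStr without any character that occurs in sClass.
Private Function EraseClass(sStr As String, sClass As String) As String

  Dim iPos As Integer
  Dim sChar As String
  Dim aKept As New String[]

  ' Walk the string one UTF-8 character at a time.
  For iPos = 1 To String.Len(sStr)
    sChar = String.Mid(sStr, iPos, 1)
    ' String.InStr returns 0 when sChar is not found in sClass: keep it.
    If String.InStr(sClass, sChar) = 0 Then aKept.Add(sChar)
  Next

  ' Glue the kept characters back together.
  Return aKept.Join("")

End
```

With this sketch, EraseClass("a1b2c3", "0123456789") would return "abc". Whether it beats RegExp.Replace on a large text would have to be measured.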

You can probably even abuse the Split() function for this. To remove any
single character in sClass from the string sStr, do:

  Split(sStr, sClass).Join("")

Split() probably won't behave well with multibyte characters, though, such
as the UTF-8 you require above. With both attempts it is harder to implement
the "[^...]" inverse character class syntax.
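
Since the original post only needs the number of characters left after each replacement (it takes String.Len of the result), the inverse class can be sidestepped: count the characters that belong to the class instead of erasing the ones that do not. This is only a sketch; CountClass is a hypothetical name, and sClass is again a plain list of characters without ranges:

```gambas
' Sketch: counts the characters of sStr that occur in sClass.
Private Function CountClass(sStr As String, sClass As String) As Integer

  Dim iPos As Integer
  Dim iCount As Integer

  ' Walk the string one UTF-8 character at a time.
  For iPos = 1 To String.Len(sStr)
    ' A non-zero position means the current character is in the class.
    If String.InStr(sClass, String.Mid(sStr, iPos, 1)) > 0 Then Inc iCount
  Next

  Return iCount

End
```

For example, ttDigitos = CountClass(TextoBruto, "0123456789") and ttPontuacao = CountClass(TextoBruto, ".:;,?!") would stand in for the "[^0-9]" and "[^.:;,?!]" replacements.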

Regardless, I would be a little interested in getting a sample project which
includes your regular expressions and such a text file, to see for myself
where the time is exactly spent. Can you send a version of your project that
contains only the parts relevant to these regular expressions?

Regards,
Tobi

--
"There's an old saying: Don't change anything... ever!" -- Mr. Monk


Re: Regex - expert opinion requested

Fernando Cabral
Tobi,

> This sounds interesting, because for one thing I can't imagine a pipe chain
> of "sed" invocations to take this long on just 500 KiB input (but I could
> be wrong).

"sed" does it lightning fast.
For instance, sed 's/[^.:;,?!]//g' <FILE takes no more than 0m0.098s, while
the Gambas equivalent,

ttPontuacao = String.Len(RegExp.Replace(TextoBruto, "[^.:;,?!]", "", RegExp.UTF8))

takes 27 seconds. That makes "sed" about 275 times faster!

What is more notable: in Gambas we are working with a variable
("TextoBruto") which has the same contents as FILE in the "sed" example.
Since in Gambas this operation does not require disk I/O, it should be
faster.

> About your regular expressions: I think the key point is that you are really
> just erasing characters of character classes. Your expressions are extremely
> simple in that regard. You mentioned that avoiding regular expressions gives
> you a big speedup but the code took you longer to write. I don't see why.
> You should be able to write a general function.

Yes, that's what amazes me the most: the REs are pretty simple, yet
slow when using PCRE.

On the other hand, yes, I have written a general function (see the
attached code). It is just a very simple parser that breaks the text into
chars, non-chars, syllables, words and sentences.
In the end, this solution is about 31 times faster than the solution based
on the RegExp.Replace function.
That's a lot faster than both gambas.

> Regardless, I would be a little interested in getting a sample project which
> includes your regular expressions and such a text file, to see for myself
> where the time is exactly spent. Can you send a version of your project that
> contains only the parts relevant to these regular expressions?

I am attaching the two versions, the one based on RE, and the other one I
built from scratch. Neither is an example of good code. I am trying to
learn Gambas, so I certainly have not used the best possible alternatives
available. Many things I did by trial and error because I don't have any
experience with Gambas. So, pardon me for the low code quality.

I have also attached the test file. Its text is not in good shape. This
means it has a lot of broken things, like missing or wrong punctuation,
blank lines, dangling words, unpaired parentheses and quotes. This is one
of the reasons I am using it as a test platform. It allows me to test the
code's robustness.

To run and compare the two versions timewise, do:

$ time ./Legibilidade-odt PauloCoelho.odt
$ time ./AnalisaSentenca PauloCoelho.odt

You will have to install "unoconv" on your machine. Just in case you don't
have it and you do not want to install it, I am also sending a txt version
of the same file. You can use it to test the RegExp using sed, and you
can also change the code to skip the conversion phase (conversion from ODT
to TXT).

Results generated by the two programs are very similar, but the time spent
by each of them is quite different.

Since the list administrator did not allow me to send the files, I am
sending you the links so you can get them from Dropbox:

https://www.dropbox.com/s/6prpw8l7bir177f/AnalisaSentenca-0.0.665.tar.gz?dl=0
https://www.dropbox.com/s/82adoan7ojbwvbn/Legibilidade-odt-0.0.354.tar.gz?dl=0
https://www.dropbox.com/s/3n637e7g8rwqzfd/PauloCoelho.odt?dl=0



Regards

- fernando
PS - Neither of the two programs is good stuff. They are not finished
and some of the algorithms and functions are only crude versions of what
they could be.

2017-05-31 19:21 GMT-03:00 Tobias Boege <[hidden email]>:

> [...]