Help needed from regexp gurus

Fernando Cabral
Still beating my head against the wall due to my lack of knowledge about
the PCRE methods and properties... Because of this, I have progressed not
only very slowly but also -- I feel -- in a very inelegant way. So perhaps
you guys who are more acquainted with PCRE might be able to point me to a
better solution.

I want to search a long string that can contain a sentence, a paragraph or
even a full text. I want to find and isolate every word it contains. A word
is defined as any sequence of alphabetic characters followed by a
non-alphabetic character.

The sample code below does work, but I don't feel it is as elegant and as
fast as it could and should be, especially the way I am traversing the
string from the beginning to the end. It looks awkward and slow. There must
be a more efficient way, like working only with offsets and lengths instead
of copying the string again and again.

Dim Alphabetics As String = "abc...zyzABC...ZYZ"
Dim re As New RegExp
Dim matches As New String[]
Dim RawText As String
Dim i As Integer

re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)", RegExp.UTF8)
RawText = "abc12345def ghi jklm mno p1"

Do While RawText
     re.Exec(RawText)
     If re.Offset = -1 Then Break ' no further match: stop instead of looping forever
     matches.Add(re[1].Text)
     ' Drop the matched word and the non-alphabetic run that follows it
     RawText = String.Mid(RawText, String.Len(re.Text) + 1)
Loop

For i = 0 To matches.Count - 1
  Print matches[i]
Next


The above code correctly finds "abc, def, ghi, jklm, mno, p". But the tricks I
have used are cumbersome (like advancing with String.Mid() and resorting to
re[1].Text and re.Text).

--
Fernando Cabral
Blog: http://fernandocabral.org
Twitter: http://twitter.com/fjcabral
e-mail: [hidden email]
Facebook: [hidden email]
Telegram: +55 (37) 99988-8868
Wickr ID: fernandocabral
WhatsApp: +55 (37) 99988-8868
Skype:  fernandojosecabral
Landline: +55 (37) 3521-2183
Mobile: +55 (37) 99988-8868

As long as there is a single person in the world without a home or without food,
no politician or scientist can boast of anything.
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Gambas-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gambas-user
Re: Help needed from regexp gurus

Tobias Boege-2
On Sat, 17 Jun 2017, Fernando Cabral wrote:

> Still beating my head against the wall due to my lack of knowledge about
> the PCRE methods and properties... Because of this, I have progressed not
> only very slowly but also -- I feel -- in a very inelegant way. So perhaps
> you guys who are more acquainted with PCRE might be able to point me to a
> better solution.
>
> I want to search a long string that can contain a sentence, a paragraph or
> even a full text. I want to find and isolate every word it contains. A word
> is defined as any sequence of alphabetic characters followed by a
> non-alphabetic character.
>

The Mathematician in me can't resist pointing this out: you hopefully wanted
to define "word in a string" as "a *longest* sequence of alphabetic characters
followed by a non-alphabetic character (or the end of the string)". Using your
definition above, the words in "abc:" would be "c", "bc" and "abc", whereas
you probably only wanted "abc" (the longest of those).
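
To make the difference between the two definitions concrete, here is a small
sketch. It is written in Python rather than Gambas, purely so that it is easy
to run, and the function names naive_words and maximal_words are made up for
this illustration. The first function applies the literal definition (any
alphabetic run that is immediately followed by a non-alphabetic character),
the second keeps only the longest runs.

```python
import re


def naive_words(text):
    """Every alphabetic substring immediately followed by a non-alphabetic character."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text)):
            # text[start:end] is the candidate, text[end] is the character after it
            if text[start:end].isalpha() and not text[end].isalpha():
                found.append(text[start:end])
    return found


def maximal_words(text):
    """Only the longest alphabetic runs (ASCII letters, to keep the sketch simple)."""
    return re.findall(r"[A-Za-z]+", text)


print(naive_words("abc:"))    # ['abc', 'bc', 'c']
print(maximal_words("abc:"))  # ['abc']
```

A regular expression engine scanning from left to right with a greedy
quantifier reports only the maximal run, which is why the code in this thread
never sees "bc" or "c" in practice.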

> The sample code below does work, but I don't feel it is as elegant and as
> fast as it could and should be, especially the way I am traversing the
> string from the beginning to the end. It looks awkward and slow. There must
> be a more efficient way, like working only with offsets and lengths instead
> of copying the string again and again.
>

You think worse of String.Mid() than it deserves, IMHO. Gambas strings
are triples of a pointer to some data, a start index and a length, and
the built-in string functions take care not to copy a string when it's
not necessary. The plain Mid$() function (dealing with ASCII strings only)
is implemented as a constant-time operation which simply takes your input
string and adjusts the start index and length to give you the requested
portion of the string. The string doesn't even have to be read, much less
copied, to do this.

Now, the String.Mid() function is somewhat more complicated, because
UTF-8 strings have variable-width characters, which makes it difficult
to map byte indices to character positions. To implement String.Mid(),
your string has to be read, but, again, not copied.

Extracting a part of a string is a non-destructive operation in Gambas
and no copying takes place. (Concatenating strings, on the other hand,
will copy.) So, there is some reading overhead (if you need UTF-8 strings),
but it's smaller than you probably thought.
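
The reading overhead mentioned above can be illustrated with a short sketch.
It is in Python, not Gambas, and it does not show how the Gambas interpreter
is actually implemented; the helper name char_to_byte_offset is hypothetical.
It only demonstrates the general idea: finding the n-th character of a UTF-8
string means scanning the bytes, while taking a view of the remainder does
not require copying them.

```python
def char_to_byte_offset(data: bytes, char_index: int) -> int:
    """Return the byte offset at which the char_index-th (0-based) character starts."""
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes, not character starts.
        if byte & 0xC0 != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(data)  # position just past the last character
    raise IndexError("character index out of range")


data = "ação é".encode("utf-8")
offset = char_to_byte_offset(data, 3)   # character 3 is "o"
rest = memoryview(data)[offset:]        # a view on the same buffer, no bytes copied
print(offset, bytes(rest).decode("utf-8"))  # 5 o é
```

With plain ASCII every character is one byte, so the scan disappears and the
offset is simply the character index.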

> Dim Alphabetics As String = "abc...zyzABC...ZYZ"
> Dim re As New RegExp
> Dim matches As New String[]
> Dim RawText As String
> Dim i As Integer
>
> re.Compile("([" & Alphabetics & "]+?)([^" & Alphabetics & "]+)", RegExp.UTF8)
> RawText = "abc12345def ghi jklm mno p1"
>
> Do While RawText
>      re.Exec(RawText)
>      If re.Offset = -1 Then Break ' no further match: stop instead of looping forever
>      matches.Add(re[1].Text)
>      RawText = String.Mid(RawText, String.Len(re.Text) + 1)
> Loop
>
> For i = 0 To matches.Count - 1
>   Print matches[i]
> Next
>
>
> The above code correctly finds "abc, def, ghi, jklm, mno, p". But the tricks I
> have used are cumbersome (like advancing with String.Mid() and resorting to
> re[1].Text and re.Text).
>

Well, I think you can't use PCRE alone to solve your problem, if you want
to capture a variable number of words in your submatches. I did a bit of
reading and from what I gather [1][2] capturing group numbers are assigned
based on the verbatim regular expression, i.e. the number of submatches
you can receive is limited by the number of "(...)" constructs in your
expression; and the (otherwise very nifty) recursion operator (?R) does
not give you an unlimited number of capturing groups, sadly.
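
The limitation can be seen in a few lines. The sketch below uses Python's re
module instead of gb.pcre, only because it is easy to run; as far as I know
PCRE behaves the same way here. The pattern has a single "(...)" construct
inside a repeated group, so only one submatch is available afterwards, and it
holds the text of the last repetition.

```python
import re

subject = "abc12345def ghi jklm mno p1"

# One capturing group, repeated by the surrounding non-capturing group.
match = re.fullmatch(r"(?:([A-Za-z]+)[^A-Za-z]+)+", subject)

# The whole subject matched, but only the last word survives in the group.
print(match.groups())  # ('p',)
```

So a single Exec-style call cannot hand back all six words as submatches;
some kind of loop over successive matches is still needed.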

Anyway, I think by changing your regular expression, you can let PCRE take
care of the string advancement, like so:

  #!/usr/bin/gbs3

  Use "gb.pcre"

  Public Sub Main()
    Dim r As New RegExp
    Dim s As String

    ' Group 1: the next word; group 2: everything after the separator run
    r.Compile("([[:alpha:]]+)[[:^alpha:]]+(.*$)", RegExp.UTF8)
    s = "abc12345def ghi jklm mno p1"
    Print "Subject:";; s
    Do
      r.Exec(s)
      If r.Offset = -1 Then Break ' no match left
      Print " ->";; r[1].Text
      s = r[2].Text ' continue with the remainder of the subject
    Loop While s
  End

Output:

  Subject: abc12345def ghi jklm mno p1
   -> abc
   -> def
   -> ghi
   -> jklm
   -> mno
   -> p

But, I think, this is less efficient than using String.Mid(). The trailing
group (.*$) _may_ make the PCRE library read the entire subject every time.
And I believe gb.pcre will copy your submatch string when returning it.
If you care deeply about this, you'll have to trace the code in gb.pcre
and main/gbx (the interpreter) to see what copies strings and what doesn't.

Regards,
Tobi

[1] http://www.regular-expressions.info/recursecapture.html (Capturing Groups Inside Recursion or Subroutine Calls)
[2] http://www.rexegg.com/regex-recursion.html (Groups Contents and Numbering in Recursive Expressions)

--
"There's an old saying: Don't change anything... ever!" -- Mr. Monk

Re: Help needed from regexp gurus

Fernando Cabral
Thank you, Tobi, for taking the time to comment on my issues. I will ponder
the following.

2017-06-17 18:06 GMT-03:00 Tobias Boege <[hidden email]>:

> On Sat, 17 Jun 2017, Fernando Cabral wrote:
> > Still beating my head against the wall due to my lack of knowledge about
> > the PCRE methods and properties... Because of this, I have progressed not
> > only very slowly but also -- I feel -- in a very inelegant way. So perhaps
> > you guys who are more acquainted with PCRE might be able to point me to a
> > better solution.
> >
> > I want to search a long string that can contain a sentence, a paragraph or
> > even a full text. I want to find and isolate every word it contains. A word
> > is defined as any sequence of alphabetic characters followed by a
> > non-alphabetic character.
>
> The Mathematician in me can't resist pointing this out: you hopefully wanted
> to define "word in a string" as "a *longest* sequence of alphabetic characters
> followed by a non-alphabetic character (or the end of the string)". Using your
> definition above, the words in "abc:" would be "c", "bc" and "abc", whereas
> you probably only wanted "abc" (the longest of those).

Right, the longest sequence. But I can't see why my definition is not
equivalent to yours, even though it is simpler. "A word is defined as any
sequence of alphabetic characters followed by a non-alphabetic character" has
to be the longest, no matter what. See, in "abc", "a" and "ab" are not
followed by a non-alphabetic character, so you have to keep advancing. "abc"
is followed by a non-alphabetic character, so it complies with the definition.

So I think we can do without stating it has to be the longest sequence. If
I am wrong, I still can't see why.


> > The sample code below does work, but I don't feel it is as elegant and as
> > fast as it could and should be, especially the way I am traversing the
> > string from the beginning to the end. It looks awkward and slow. There must
> > be a more efficient way, like working only with offsets and lengths instead
> > of copying the string again and again.
>
> You think worse of String.Mid() than it deserves, IMHO. Gambas strings
> are triples of a pointer to some data, a start index and a length, and
> the built-in string functions take care not to copy a string when it's
> not necessary. The plain Mid$() function (dealing with ASCII strings only)
> is implemented as a constant-time operation which simply takes your input
> string and adjusts the start index and length to give you the requested
> portion of the string. The string doesn't even have to be read, much less
> copied, to do this.
>
> Now, the String.Mid() function is somewhat more complicated, because
> UTF-8 strings have variable-width characters, which makes it difficult
> to map byte indices to character positions. To implement String.Mid(),
> your string has to be read, but, again, not copied.

Right. Since I am working with Portuguese, it has to be UTF-8. So I can't
avoid using String.Mid().

But I still believe it has to be copied because I am doing a

str = String.Mid(str, HowMany)

In this case I would guess it has to be copied because the original
content is shrunk, which happens again and again, until nothing is left to
be scanned. I understand Gambas does not do garbage collection as old BASIC
used to do, but still, I suppose it will eventually have to recover unused
memory.




> Extracting a part of a string is a non-destructive operation in Gambas
> and no copying takes place. (Concatenating strings, on the other hand,
> will copy.) So, there is some reading overhead (if you need UTF-8 strings),
> but it's smaller than you probably thought.

As per above, in this case it is not only extracting, but overwriting the
content itself.


> Well, I think you can't use PCRE alone to solve your problem, if you want
> to capture a variable number of words in your submatches. I did a bit of
> reading and from what I gather [1][2] capturing group numbers are assigned
> based on the verbatim regular expression, i.e. the number of submatches
> you can receive is limited by the number of "(...)" constructs in your
> expression; and the (otherwise very nifty) recursion operator (?R) does
> not give you an unlimited number of capturing groups, sadly.

What I need is to grab one word at a time. The reason I am using two
submatches, "([[:alpha:]]+)([[:^alpha:]]+)", is that I don't care about the
non-alpha part. This way I can ignore the second submatch, but it still helps
me skip to the next word (since Len(re.Text) comprises the length of both
submatches).

> Anyway, I think by changing your regular expression, you can let PCRE take
> care of the string advancement, like so:

For the time being, I will use the loop the way you proposed. It
seems cleaner than my solution. As to the performance, later I'll check
which one is faster.

Thanks a lot

- fernando


Re: Help needed from regexp gurus

Fernando Cabral
In reply to this post by Tobias Boege-2
Tobi

One more thing about the way I wish it could work (I remember having done
this in C perhaps 30 years ago). The pseudo-code below is pretty
schematic, but I think it will clarify the issue.

Let p and l be arrays of integers and s be the string "abc defg hijkl"

So, after traversing the string we would have the following result:
p[0] = offset of "a" (0)
l[0] = length of "abc" (3)
p[1] = offset of "d" (4)
l[1] = length of "defg" (4)
p[2] = offset of "h" (9)
l[2] = length of "hijkl" (5).

After this, each word could be retrieved in the following manner:

For i = 0 To 2
    Print Mid(s, p[i] + 1, l[i]) ' Mid() positions are 1-based, the offsets above are 0-based
Next

I think this would be the most efficient way to do it. But I can't find how
to do it in Gambas using Regex.
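
For reference, the offsets-and-lengths idea sketched above looks like this
when the regex library can iterate over successive matches in one subject.
This is Python, not Gambas, and the function name word_spans is made up; it
says nothing about what gb.pcre offers beyond the Offset property already
used in this thread. The subject string is never shortened or reassigned,
only two integer arrays are filled.

```python
import re

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# i.e. a letter, including accented ones such as "ç" or "é".
WORD = re.compile(r"[^\W\d_]+")


def word_spans(text):
    """Return two parallel lists: start offsets and lengths of every word."""
    positions, lengths = [], []
    for match in WORD.finditer(text):
        positions.append(match.start())
        lengths.append(match.end() - match.start())
    return positions, lengths


s = "abc defg hijkl"
p, l = word_spans(s)
print(p, l)  # [0, 4, 9] [3, 4, 5]

for pos, length in zip(p, l):
    print(s[pos:pos + length])
```

The offsets here are character offsets; a library that reports byte offsets
would need the UTF-8 caveat discussed earlier in the thread.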

Regards

- fernando


Re: Help needed from regexp gurus

Jussi Lahtinen
I think I would do something like:

  Dim ii As Integer
  Dim sStr As String = "abc defg hijkl"
  Dim sWords As String[]

  sWords = Split(sStr, " ")

  For ii = 0 To sWords.Max
   Print sWords[ii]
  Next




Jussi

Re: Help needed from regexp gurus

Jussi Lahtinen
Oh, sorry... this way of course:

  Dim sStr As String = "abc. def!!!     ghi?   jkl:  (mno)"
  Dim sWords As String[]
  Dim ii As Integer

  ' The fourth argument (IgnoreVoid) drops the empty strings.
  sWords = Split(sStr, " .!?:()", "", True) ' Expand as you will.

  For ii = 0 To sWords.Max
    Print sWords[ii]
  Next
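
For readers without Gambas at hand, the same idea can be sketched in Python. This is only an illustration, not part of the Gambas code above: split on any single character from a separator set and throw away the empty pieces, which is the job done by passing True as the fourth argument to Split(). The helper name split_words is made up for this sketch.

```python
import re

def split_words(text, separators=" .!?:()"):
    """Split text on any single separator character and drop empty pieces."""
    # re.escape makes every separator literal inside the character class.
    pattern = "[" + re.escape(separators) + "]"
    return [piece for piece in re.split(pattern, text) if piece]

print(split_words("abc. def!!!     ghi?   jkl:  (mno)"))
```

Runs of separators such as "!!!     " produce empty strings between them, so the filter at the end matters as much as the split itself.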



Jussi


Re: Help needed from regexp gurus / String.Mid() and Mid$() implementation

Tobias Boege-2
In reply to this post by Fernando Cabral
On Sat, 17 Jun 2017, Fernando Cabral wrote:

> Tobi
>
> One more thing about the way I wish it could work (I remember having done
> this in C perhaps 30 years ago). The pseudo-code below is pretty
> schematic, but I think it will clarify the issue.
>
> Let p and l be arrays of integers and s be the string "abc defg hijkl"
>
> So, after traversing the string we would have the following result:
> p[0] = offset of "a" (0)
> l[0] = length of "abc" (3)
> p[1] = offset of "d" (4)
> l[1] = length of "defg" (4)
> p[2] = offset of "h" (9)
> l[2] = length of "hijkl" (5).
>
> After this, each word could be retrieved in the following manner:
>
> for i = 0 to 2
>     print mid(s, p[i], l[i])
> next
>
> I think this would be the most efficient way to do it. But I can't find how
> to do it in Gambas using Regex.
>
As I said before, the Gambas String.Mid() and Mid$() functions do just that.
The internal representation of a string is some base data (which is usually
shared among many strings, via reference counting), an offset and a length.
If you apply String.Mid() or Mid$() to a string, no copying takes place, only
the offset and length members of the Gambas string structure are adjusted.
This is why Gambas strings are sometimes called "read-only" in the wiki (the
same string base data is shared by many strings, so you can't have external
libraries modify the data behind a Gambas string). Even the statement

  s = String.Mid$(s, 10, 20)

will *not* require a copy operation. You simply advance the offset member of
the string structure to the 10th UTF-8 position (Gambas string positions
start at 1) and set the length member to 20 (UTF-8 positions), or to the
remaining length of s if that is smaller than 20.
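
To make the "base data, offset, length" idea concrete, here is a small Python model of it. It is only a sketch under stated assumptions: the class StrView and its methods are invented for illustration, they are not the Gambas structure or its API, and the model counts whole characters, so it ignores the extra work String.Mid() has to do to locate UTF-8 character boundaries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StrView:
    """Hypothetical stand-in for a (base data, offset, length) string triple."""
    base: str     # shared character data, never modified
    offset: int   # 0-based start of the visible part within base
    length: int   # number of characters visible through this view

    def mid(self, start, count=None):
        """Return a narrower view; start is 1-based, as with Mid$()."""
        skip = min(max(start - 1, 0), self.length)
        remaining = self.length - skip
        count = remaining if count is None else max(0, min(count, remaining))
        # Only the two integers change; base is shared, not copied.
        return StrView(self.base, self.offset + skip, count)

    def text(self):
        """Materialise the visible characters (the only step that copies)."""
        return self.base[self.offset:self.offset + self.length]

s = StrView("abcdefghijklmnopqrstuvwxyz", 0, 26)
t = s.mid(10, 20)
print(t.offset, t.length, t.base is s.base, t.text())
```

Taking a sub-view costs two integer updates however long the underlying text is, which is the property claimed for Mid$() above.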

String.Mid() and Mid$() are implemented exactly by manipulating offsets and
lengths, like you want to do. In fact there are multiple places in the Gambas
source tree where those two are used in place of a C-style

  for (i = 0; i < len; i++)
    do something with str[i];

loop. I suggest you look at the implementations yourself if you don't
believe it:

  String datatype: https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/trunk/main/share/gambas.h#l126
  Mid$():          https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/trunk/main/gbx/gbx_exec_loop.c#l3820
  String.Mid():    https://sourceforge.net/p/gambas/code/HEAD/tree/gambas/trunk/main/gbx/gbx_c_string.c#l399

(I recommend downloading the source tree and using ctags or something to
navigate through it, of course, instead of the SF web interface.)

You should also try the following: create a console project with this code
in the Main.module:

  ' Gambas module file

  Public Sub Main()
    Dim s As String = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    Dim i As Integer

    For i = 1 To 5
      s = String.Mid$(s, i, 2*i)
    Next
    s &= "a"
  End

It will call String.Mid$() multiple times. Now compile and run this program
through callgrind:

  $ cd /path/to/project
  $ gbc3 -ga
  $ valgrind --tool=callgrind gbx3

and use kcachegrind to visualise the callgraph. I'll attach the two
interesting graphs to this mail. One shows that the single invocation
of SUBR_cat (the &= operation at the end) needed a malloc() and hence
did something like copying the string, whereas the multiple invocations
of String_Mid do not lead to malloc or any other means of allocating
memory, meaning no copy operation takes place.

Assuming you aren't prematurely optimising here and performance is actually
poor with Gambas code, you should look into doing it in C and possibly
avoiding regular expressions altogether. If you always just want all the
words in a given string, you can do it in a single linear pass through your
text.
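
As a rough illustration of such a single pass, here is a sketch in Python rather than C, so that it can be run directly. It records the offset and length of every maximal run of letters, which is the p[] and l[] layout from the quoted pseudo-code. The function name word_spans is invented for this example, and "letter" here means whatever Python's str.isalpha() accepts, which is an assumption about the desired alphabet.

```python
def word_spans(text):
    """Return (offset, length) pairs, one per maximal run of letters."""
    spans = []
    start = None  # offset of the run currently being scanned, if any
    for i, ch in enumerate(text):
        if ch.isalpha():
            if start is None:
                start = i          # a new word begins here
        elif start is not None:
            spans.append((start, i - start))
            start = None           # the word ended just before position i
    if start is not None:          # the text ended inside a word
        spans.append((start, len(text) - start))
    return spans

s = "abc defg hijkl"
for offset, length in word_spans(s):
    print(offset, length, s[offset:offset + length])
```

Each character is inspected exactly once and no substring is built until a word is actually printed.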

But honestly, I would be surprised if you have bad performance by using
String.Mid$(), since it is really just using a map of offsets and lengths
on a single shared base string.

Regards,
Tobi

PS: Again about the definition of "word in a string". My point was that if
you say "a word in a string is a sequence of alphabetic characters followed
by a non-alphabetic character", then "c", "bc" and "abc" will be words in
the string "abc.", right? "c" is a (length-1) sequence of alphabetic
characters which is followed by the non-alphabetic character "." in the
string. But you don't want to call "c" alone a word because there are further
alphabetic characters in front of it. You want any *longest* sequence of
alphabetic characters which is followed by a non-alphabetic one, which in
the string above, would only be "abc".
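
The difference between the two definitions can be checked mechanically. The following Python sketch is an illustration only: both function names are made up, and it restricts "alphabetic" to ASCII letters to keep the two functions consistent. The first function applies the loose definition literally, the second keeps only the longest sequences.

```python
import re
import string

def any_sequences(text):
    """Every run of ASCII letters that is immediately followed by a non-letter."""
    found = []
    for end in range(1, len(text)):
        # 'end' must index a non-letter that comes right after a letter.
        if text[end] in string.ascii_letters or text[end - 1] not in string.ascii_letters:
            continue
        start = end - 1
        while start >= 0 and text[start] in string.ascii_letters:
            found.append(text[start:end])
            start -= 1
    return found

def longest_sequences(text):
    """Only the maximal runs: greedy matching that begins at the first letter of each run."""
    return re.findall(r"[A-Za-z]+(?=[^A-Za-z])", text)

print(any_sequences("abc."))
print(longest_sequences("abc."))
```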

Attachments: SUBR_cat-graph.png (37K), String_Mid-graph.png (33K)

Re: Help needed from regexp gurus / String.Mid() and Mid$() implementation

Fernando Cabral
Tobi, I have been learning a lot with your comments and suggestions.

Usually, I don't stress performance too much. In this case, I have tried
to do things a little faster because I am using Gambas as a kind of add-on
to LibreOffice. I use it to check readability, wordiness, sentences that
are too long, things like that. In order to avoid coding, I resorted to RE.
My first try at it resulted in very slow code. In fact, a 150-page
document took about two and a half minutes to check. That's too slow to
encourage anyone to run the code repeatedly. That's when I tried
introducing RegExp.Replace() to canonicalize the input text and then
applying Split(). Split() is fast, but canonicalizing the input text proved
slow. Then I resorted to RegExp.Compile() and RegExp.Exec(). Way faster than
RegExp.Replace(), but still slow. That's when I asked for help.

Today, following Jussi's suggestion, I went back to Split(). Without the
canonicalization phase, the same 150-page document was processed in 1.5
seconds. That's 100 times faster than the original version. That is great.
Nevertheless, there are a few issues I have not been able to solve in an
elegant way.

Anyway, for this magnitude of performance gain, I am quite willing to go
back to Split() and then do some massaging in those situations where
results are less than perfect.

Thank you for the many hints you've provided me with.

Regards

- fernando







Re: Help needed from regexp gurus

Fernando Cabral
In reply to this post by Tobias Boege-2
This is mostly to thank Tobi and Jussi for their help in solving some
issues that were making me unhappy.

With three lines of code I have solved what used to take me twenty or so.
Better yet, execution time fell from 2 min 30 s to 1.5 seconds, and the code
is much more transparent. The three lines below are the heart and brain of
the program:

MatchedWords = Split(RawText, " \"'`[]{}+-_:#$%&.!?:(),;-\n", "", True)
MatchedSentences = Split(RegExp.Replace(RawText, "([.])|([!])|([?])|(;\n)|([:]\n)", "&1&2&3&4&5\x00", RegExp.UTF8), "\x00", "", True)
MatchedParagraphs = Split(RawText, "\n", "", True)

These three lines take an entire text file (read into the variable
RawText) and split it into words, sentences and paragraphs. They take ONE
SECOND to process a 150-page text file of 414,961 bytes, tallying 69,196
words, 4,626 sentences and 2,409 paragraphs.

I am impressed!

In this last (and fast) version I have depended very little on RegExp, but
I have still used it to do some massaging of the original text. The line
"MatchedSentences = ..." above shows an example. The characters ".", "?" and
"!" and the strings ";\n" and ":\n" signal the end of a sentence. However,
I cannot use them as separators for Split(), because Split() would drop
them as it does with all separators, and I need them later. So I used
RegExp.Replace() to insert a \x00 after each of them and then used \x00 as
the only sentence separator. This preserved the punctuation marks I needed
at the end of each sentence.
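
The sentinel trick is not specific to Gambas. As a hedged illustration, here is the same idea in Python (the name split_sentences is invented for this sketch, and Python's re module stands in for gb.pcre): mark the end of every sentence with a character that cannot occur in the text, then split on that character only.

```python
import re

SENTINEL = "\x00"  # assumed never to occur in the input text

def split_sentences(raw_text):
    """Split after '.', '!', '?', ';\\n' or ':\\n', keeping the terminator."""
    # Re-emit each terminator followed by the sentinel instead of dropping it.
    marked = re.sub(r"([.!?]|[;:]\n)", lambda m: m.group(1) + SENTINEL, raw_text)
    return [part for part in marked.split(SENTINEL) if part]

for sentence in split_sentences("One. Two!\nThree:\nfour? Five"):
    print(repr(sentence))
```

Each piece keeps its closing punctuation but also any leading whitespace, so some trimming afterwards is still needed, much like the additional processing mentioned below.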

After running those three lines I still need to do some additional
processing with the resulting arrays, but that only consumes another half a
second for the same 150-page long document.

Now I am happy and I feel stimulated to complete the code and do some
polishing.

Thank you, Tobi and Jussi. You have helped a lot.
