[PyQt] Problem with regex in my code

Maziar Parsijani maziar.parsijani at gmail.com
Sun Sep 2 20:06:42 BST 2018


Thank you for the clarification.
I am very glad that you are answering my questions, and it is kind of you
to explain much more than a bare answer would.
I have been trying to work out the method that this site, http://tanzil.net/,
uses for search. As you said, I got confused, and that is why I posted the
question.
Your comments are useful to me and I will use them. Programming with such
languages has many problems, and this is one of them. For searching Arabic
words, I am thinking of building a dictionary: for example, when someone
types a word, I search for the other possible forms of it. If you want to
know more about that last character, it can take different forms in that
position, for example: " يَوْمَ - يَوْمٍ - يَوْمُ - يَوْمٌ "
All of these must be searched for and found when I search for
"یوم". In my SQLite database I can find them because I remove these
characters, called tashkeel or harakat, but showing and highlighting the
matches in a QTextEdit with the original spelling is going to be a
challenge for me.
I really, really thank you for your useful comments.
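
The highlighting problem described above (search without harakat, display with them) can be sketched as follows. This is only a minimal sketch under stated assumptions, not the method tanzil.net uses: the helper names strip_marks_with_map and find_ignoring_marks are hypothetical, "harakat" is approximated as every code point in Unicode category Mn, and letter variants such as Persian yeh (U+06CC) versus Arabic yeh (U+064A) are not unified.

```python
import unicodedata


def strip_marks_with_map(text):
    """Return (stripped, index_map); index_map[i] is the offset in `text`
    of the i-th character kept in `stripped`."""
    kept = []
    index_map = []
    for offset, ch in enumerate(text):
        # Assumption: category Mn (nonspacing mark) covers fatha, damma,
        # kasra, sukun, tanwin and similar marks.
        if unicodedata.category(ch) == "Mn":
            continue
        kept.append(ch)
        index_map.append(offset)
    return "".join(kept), index_map


def find_ignoring_marks(text, query):
    """Return (start, end) offsets in the ORIGINAL text for every occurrence
    of `query`, ignoring marks in both strings."""
    stripped, index_map = strip_marks_with_map(text)
    needle, _ = strip_marks_with_map(query)
    spans = []
    if not needle:
        return spans
    pos = stripped.find(needle)
    while pos != -1:
        start = index_map[pos]
        end = index_map[pos + len(needle) - 1] + 1
        # Extend over the marks attached to the last matched letter.
        while end < len(text) and unicodedata.category(text[end]) == "Mn":
            end += 1
        spans.append((start, end))
        pos = stripped.find(needle, pos + len(needle))
    return spans


# Two vocalised forms of the same word, written with escapes:
# yeh+fatha, waw+sukun, meem+fatha  -  yeh+fatha, waw+sukun, meem+kasratan
text = ("\u064A\u064E\u0648\u0652\u0645\u064E - "
        "\u064A\u064E\u0648\u0652\u0645\u064D")
print(find_ignoring_marks(text, "\u064A\u0648\u0645"))  # [(0, 6), (9, 15)]
```

Each (start, end) pair could then be applied to the document with QTextCursor.setPosition(start) followed by setPosition(end, QTextCursor.KeepAnchor). That assumes the offsets line up with the positions in toPlainText(), which should hold for text made only of characters in the Basic Multilingual Plane.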

On Sun, Sep 2, 2018 at 9:04 PM Maurizio Berti <maurizio.berti at gmail.com>
wrote:

> Well, that's exactly what I meant when asking for a bit of context.
> As somebody who doesn't know anything about Arabic, to me the last
> character in "قَالَ" is "لَ", as "لَ" is treated as a complete character
> when moving around or selecting with the cursor; my answer worked because
> nobody told me about that.
> The fatha, as I understand it, is a diacritic glyph, but in most western
> languages those glyphs are normally part of another letter, so "à" is
> usually a single precomposed character in Unicode (even though "`" also
> exists as a character on its own).
> For example, u'à' is the single code point u'\xe0' (b'\xc3\xa0' once
> encoded as UTF-8), so it has the same len() as 'a' without any accent,
> while u'لَ' is two code points, u'\u0644\u064e' (b'\xd9\x84\xd9\x8e' in
> UTF-8). You can easily do a len() to see that. Unfortunately, character
> selection and cursor position treat 'لَ' as a single one just like 'à',
> hence the misunderstanding. I assume that's similar to how Japanese
> kanji characters are treated.
>
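
The difference between code points, UTF-8 bytes and combining marks described above can be checked directly. This is a small sketch for Python 3; the helper name describe is hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point count, UTF-8 byte count, combining mark count)."""
    # unicodedata.combining() returns a non-zero class for combining marks.
    marks = sum(1 for ch in text if unicodedata.combining(ch))
    return len(text), len(text.encode("utf-8")), marks


precomposed = "\u00e0"      # 'a with grave' as one precomposed code point
lam_fatha = "\u0644\u064e"  # lam followed by the combining fatha

print(describe(precomposed))  # (1, 2, 0)
print(describe(lam_fatha))    # (2, 4, 1)

# The accented Latin letter can also be decomposed into base + mark,
# which makes it two code points as well.
decomposed = unicodedata.normalize("NFD", precomposed)
print(len(decomposed))        # 2
```

So a plain len() counts the fatha separately, even though the cursor moves over the letter and its mark in one step.
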
> So, the problem here is that, since those glyphs are treated as separate
> characters in strings, you'll need a somewhat more complex and more
> robust regex - and that's where you'll need to better understand how
> regexes work, since they were created mostly for western languages, and
> their usage with extended languages (and then Unicode) requires a deeper
> knowledge.
>
> When I tried your example I first tested the regex on its own, and the
> results were confirmed, so I assumed it was ok.
> You probably need to add a further step to better understand what's not
> working: first, be completely sure that the expression you use
> actually matches what you want. You can just print the results to the
> console (with character positions and so on), and only _then_ apply the
> results to the QTextDocument and see, by iterating over them carefully,
> how the QTextCursor is behaving. If you still don't understand what's going
> on after that, that's when you send that code here with the aforementioned
> context. We're eager to help, but we can't do everything :-)
> And there's the possibility that, while doing so, you'll find what is
> wrong; it's called "rubber duck debugging", and it's when you have to
> explain your code and your case, almost line by line, possibly to somebody
> who knows absolutely nothing about it. I solved tons of problems by
> starting to ask a question and writing example code, and finding the
> solution myself even before finishing the question because I was forced to
> think about the problem and explain it in a "different" and thorough way.
>
> Anyway, I wasn't saying you should never ask those things, but that you
> probably should ask somewhere else, some forum/website/list where you would
> probably find more people who are actually able to give you an answer, and
> for that Stack Overflow might be much more helpful. About that, have a look
> at this:
> https://stackoverflow.com/questions/11323596/regular-expression-for-arabic-language
> Knowing the Unicode blocks could be helpful to better track word boundaries
> in expressions.
>
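
As an illustration of using a Unicode block to track word boundaries, here is a minimal sketch with Python's re module rather than QRegExp. It assumes that "part of a word" means any code point in the Arabic block U+0600-U+06FF, which covers the letters and harakat used in this thread; the helper name whole_word_spans and the shortened sample string are hypothetical. In Python's re, \b is defined in terms of \w, and combining marks such as fatha are not matched by \w, so explicit lookarounds are used instead.

```python
import re

# Assumption: anything in the Arabic block counts as part of a word.
ARABIC = "\u0600-\u06FF"


def whole_word_spans(text, word):
    """Return (start, end) spans where `word` is not touched by another
    Arabic letter or mark on either side."""
    pattern = re.compile(
        "(?<![{0}]){1}(?![{0}])".format(ARABIC, re.escape(word))
    )
    return [m.span() for m in pattern.finditer(text)]


qala = "\u0642\u064E\u0627\u0644\u064E"  # qaf+fatha, alef, lam+fatha
qalat = qala + "\u062A\u0652"            # same stem + teh+sukun
qalata = qala + "\u062A\u064E\u0627"     # same stem + teh+fatha, alef

# Shortened sample built from words in the thread's test string.
sample = " ".join([qalat, qala, qalata, qala])

print(len(re.findall(re.escape(qala), sample)))  # 4 plain substring hits
print(whole_word_spans(sample, qala))            # [(8, 13), (23, 28)]
```

Because the harakat are inside the block, a query typed without its final fatha would not match the fully vocalised word here; that case needs the marks to be stripped or made optional first.
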
>
> I understand that documentation about these topics is not as common or
> easy to find as in other languages, but I'm also pretty sure you're not the
> first to have this kind of issue, so the trick is to be patient and to ask
> in, and look for, the right places :-)
> If you need to test your expressions, you can also use regex101.com,
> which features a very good interface and provides deep insights into regular
> expressions as they are typed.
>
> So, I'd suggest you try to simplify your code, make step-by-step tests
> against your regular expressions and, if everything looks ok but you still
> have problems with the QTextCursor, then send us an example with all
> possible test cases _already included_ in the code, and I'd be happy to
> help if I can.
>
> Good luck!
> Maurizio
>
> 2018-09-02 10:12 GMT+02:00 Maziar Parsijani <maziar.parsijani at gmail.com>:
>
>> Hi Maurizio and David,
>> You are both correct, but look closely at the code: when you copy
>> "قَالَ " completely, including its last character (we call it fatha,
>> " ــَ "), and put it in lineEdit.text(), then it is found correctly
>> without (r'\b{}\b').
>> The problem here is with that last character, which is sometimes
>> different. For example, if I change the string to this:
>>
>> m = "قَالَتْ قَالَ فَقَالُوا۟ قَالُوا۟ قَالَ قَالُوٓا۟ قَالَتَا ٱلْقَالِينَ  قَال  قَالِ "
>> then we cannot use (r'\b{}\b').
>> Dear Maurizio, I will take your advice, but the problem here may be
>> related to PyQt's QtGui.QTextCursor and QTextEdit.textCursor().
>> If you say that it is not related, then I will not ask anything else. And
>> as you know, the Arabic help texts for Python are not so helpful.
>> Thanks anyway, but please let me know, so that I don't post other
>> problems like this.
>>
>>
>>
>> On Sun, Sep 2, 2018 at 12:33 AM Maurizio Berti <maurizio.berti at gmail.com>
>> wrote:
>>
>>> Simple regexes like that one will match every occurrence of that sequence
>>> of characters; if you want to find those characters as a single "word",
>>> you will need to use word boundaries.
>>>
>>> Change the pattern init to this and it works (at least, according to the
>>> string you gave):
>>>
>>> pattern = QtCore.QRegExp(r'\b{}\b'.format(self.lineEdit.text()))
>>>
>>> Have a look at this, too:
>>> https://stackoverflow.com/questions/40731058/regex-match-arabic-keyword
>>>
>>> As a side note, forgive my bluntness, but you might want to read some
>>> more documentation about regex and right-to-left language practices.
>>> The latest questions you asked were neither Python nor Qt related at all,
>>> and, as you might understand, most of the people on this mailing list don't
>>> even know how the Arabic language works; a bit of context might be useful,
>>> so that even people not experienced with those languages might help too.
>>>
>>> Also, try to simplify examples by limiting code to what is really
>>> related to your issue: those geometry, font and widget settings won't help
>>> anyone reading your code. The same example could have been written in fewer
>>> than half the lines, and would have been much more readable: easier to read
>>> is easier to understand and easier to help.
>>> Finally, avoid mixing the way you import modules. You should import from
>>> the main module _OR_ from the submodules. While it doesn't change much in
>>> terms of computation, it decreases the possibility of bugs and improves
>>> readability, which is better for everybody reading your code, including you
>>> :-)
>>>
>>> Maurizio
>>>
>>>
>>>
>>> 2018-09-01 20:33 GMT+02:00 Maziar Parsijani <maziar.parsijani at gmail.com>
>>> :
>>>
>>>> I want to select the 2 occurrences of "قَالَ" in m = "قَالَتْ قَالَ فَقَالُوا۟ قَالُوا۟
>>>> قَالَ قَالُوٓا۟ قَالَتَا ٱلْقَالِينَ  ", but with the code below it
>>>> selects all words which contain "قَالَ".
>>>> Now what is the problem here?
>>>> I tried putting spaces before and after pattern1 for this, but it
>>>> doesn't work for me.
>>>>
>>>> pattern1 = " {0} ".format(self.lineEdit.text())
>>>>
>>>>
>>>> from PyQt5 import QtCore, QtGui, QtWidgets
>>>> from PyQt5.QtWidgets import QApplication, QTextEdit
>>>> from PyQt5.QtGui import QTextDocument, QTextDocumentFragment
>>>> from PyQt5 import QtCore, QtGui, QtWidgets
>>>> import sys
>>>> from PyQt5.QtWidgets import QDialog, QApplication
>>>> class AppWindow(QDialog):
>>>>     def __init__(self):
>>>>         super().__init__()
>>>>         self.setObjectName("Dialog")
>>>>         self.resize(800, 600)
>>>>         self.lineEdit = QtWidgets.QLineEdit(self)
>>>>         self.lineEdit.setGeometry(QtCore.QRect(70, 70, 211, 21))
>>>>         self.lineEdit.setObjectName("lineEdit")
>>>>         self.pushButton = QtWidgets.QPushButton(self)
>>>>         self.pushButton.setGeometry(QtCore.QRect(130, 110, 83, 28))
>>>>         self.pushButton.setObjectName("pushButton")
>>>>         self.SearchResults = QtWidgets.QTextEdit(self)
>>>>         self.SearchResults.setGeometry(QtCore.QRect(130, 140, 500, 400))
>>>>         font = QtGui.QFont()
>>>>         font.setFamily("Amiri")
>>>>         font.setPointSize(12)
>>>>         self.SearchResults.setFont(font)
>>>>         self.SearchResults.setToolTipDuration(0)
>>>>         self.SearchResults.setReadOnly(True)
>>>>         self.SearchResults.setAutoFormatting(QtWidgets.QTextEdit.AutoAll)
>>>>         self.SearchResults.setObjectName("SearchResults")
>>>>
>>>>         self.retranslateUi(self)
>>>>         QtCore.QMetaObject.connectSlotsByName(self)
>>>>     def find1(self):
>>>>             m = "  قَالَتْ  قَالَ فَقَالُوا۟ قَالُوا۟ قَالَ  قَالُوٓا۟ قَالَتَا  ٱلْقَالِينَ    "
>>>>             self.SearchResults.append('{0} '.format(m))
>>>>
>>>>
>>>>             cursor = self.SearchResults.textCursor()
>>>>             format = QtGui.QTextCharFormat()
>>>>             format.setForeground(QtGui.QBrush(QtGui.QColor("red")))
>>>>
>>>>             pattern1 = "{0}".format(self.lineEdit.text())
>>>>             regex = QtCore.QRegExp(pattern1)
>>>>             pos = 0
>>>>             index = regex.indexIn(self.SearchResults.toPlainText(), pos)
>>>>             tedad = 0
>>>>             while (index != -1):
>>>>                 cursor.setPosition(index)
>>>>                 cursor.movePosition(QtGui.QTextCursor.WordLeft, QtGui.QTextCursor.KeepAnchor)
>>>>                 cursor.mergeCharFormat(format)
>>>>                 pos = index + regex.matchedLength()
>>>>                 index = regex.indexIn(self.SearchResults.toPlainText(), pos)
>>>>                 if regex.isValid():
>>>>                     tedad += 1
>>>>             nmayesh = ("{}".format(tedad))
>>>>             self.SearchResults.append("{}".format(tedad))
>>>>
>>>>     def retranslateUi(self, Dialog):
>>>>         _translate = QtCore.QCoreApplication.translate
>>>>         self.setWindowTitle(_translate("Dialog", "Dialog"))
>>>>         self.pushButton.setText(_translate("Dialog", "PushButton"))
>>>>         self.pushButton.clicked.connect(self.find1)
>>>>
>>>> app = QApplication(sys.argv)
>>>> w = AppWindow()
>>>> w.show()
>>>> sys.exit(app.exec_())
>>>>
>>>
>>>
>>>
>>> --
>>> È difficile avere una convinzione precisa quando si parla delle ragioni
>>> del cuore. - "Sostiene Pereira", Antonio Tabucchi
>>> http://www.jidesk.net
>>>
>>
>