[PyQt] Search method for Arabic text

Maziar Parsijani maziar.parsijani at gmail.com
Wed Aug 29 03:20:14 BST 2018


Thanks for your professional and advanced answer.

On Wed, Aug 29, 2018 at 10:07 AM Zachary Scheuren <angryjaga at gmail.com>
wrote:

> Yes, that's what I was afraid of. You lose the index when you strip the
> marks so you need to keep track of them somehow. One way is to split on
> spaces and track it that way. Something like this with PyQt5. You'll need
> to fill in your pieces, but the main idea is here:
>
> # setup your QApplication and QTextEdit, etc.
> from pyarabic import araby
> from PyQt5.QtGui import QColor, QTextCharFormat, QTextCursor
>
> # textedit is your QTextEdit
> text = textedit.toPlainText()  # the source text
> text2 = araby.strip_tashkeel(text)
> search_string = 'السماء'
> matches = []
>
> # to be more accurate about the match you can use re.finditer() instead of
> # just "if search_string in" to get the start index, but then you should
> # track that as well as the list index so you can be more accurate later.
> # this is just a quick proof of concept so I didn't go all the way
> # split(' ') rather than split(), so ' '.join() reproduces the exact offsets
> words = text.split(' ')
> for i, c in enumerate(text2.split(' ')):
>     if search_string in c:
>         matches.append(i)
>
> # Now use the match indices to highlight the original text
> cursor = textedit.textCursor()
> highlight_color = QColor('yellow')  # whatever color you want
> fmt = QTextCharFormat()
> fmt.setBackground(highlight_color)
>
> for match_index in matches:
>     # offset of the word: length of everything before it, plus one space
>     # (there is no leading space before the first word)
>     start = len(' '.join(words[:match_index])) + (1 if match_index else 0)
>     end = start + len(words[match_index])
>     cursor.setPosition(start, QTextCursor.MoveAnchor)
>     cursor.setPosition(end, QTextCursor.KeepAnchor)
>     cursor.setCharFormat(fmt)
>
> This works, but it can highlight some parts that aren't really part of the
> match. To avoid that you could check the index offset within each matched
> item in the list to make sure you get the right start index. It's all
> possible. Just takes some more code. I hope this helps get you closer.
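
One way to get exact start and end offsets, rather than whole words, might be
to search the original (vocalized) text directly with a regular expression
that tolerates marks after every letter. The sketch below is not from the
thread and does not use pyarabic: the names MARKS, ALEF_CLASS, build_pattern
and match_spans are hypothetical, the character ranges are an assumed and
probably incomplete set of Arabic marks, and the search string is assumed to
be unvocalized.

```python
import re

# Assumed set of marks that may follow a letter: tashkeel, superscript alef,
# Quranic annotation signs, and tatweel. Extend it for the text at hand.
MARKS = "[\u064b-\u065f\u0670\u06d6-\u06dc\u06df-\u06e4\u06e7\u06e8\u06ea-\u06ed\u0640]*"
# Bare Alef in the search string may match any of these Alef forms.
ALEF_CLASS = "[\u0627\u0622\u0623\u0625\u0671]"


def build_pattern(search_string):
    """Compile a regex matching search_string with optional marks after each letter."""
    parts = []
    for ch in search_string:
        letter = ALEF_CLASS if ch == "\u0627" else re.escape(ch)
        parts.append(letter + MARKS)
    return re.compile("".join(parts))


def match_spans(search_string, text):
    """Return (start, end) offsets of every match in the original text."""
    return [m.span() for m in build_pattern(search_string).finditer(text)]
```

Each (start, end) pair could then be passed to cursor.setPosition() in place
of the word-based offsets, assuming the string searched is exactly the text
held by the QTextEdit.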
>
>
>
> On Tue, Aug 28, 2018 at 12:57 AM, Maziar Parsijani <
> maziar.parsijani at gmail.com> wrote:
>
>> Hi Zachary Scheuren,
>> Thanks a lot for your answer. The reason I emailed the PyQt list is that I
>> use its widgets, and I forgot to say that I use QTextEdit. In more detail:
>> I had found the pyarabic library before, and as you said I can remove the
>> marks with its strip_tashkeel(text) function. But if you look at the
>> example below I think you will see what I want to do. I can also refer you
>> to http://tanzil.net: if you search there for "السماء" it will find and
>> highlight "ٱلسَّمَآءِ". I want to know whether that is possible in Python.
>> I use an SQLite database and can find the results this way, but I cannot
>> highlight them.
>> Example:
>> search for: "السماء"
>> but I want to show the Quranuthmani text, find "ٱلسَّمَآءِ" in it, and
>> highlight it.
>> I can find the matches with no problem, because my SQLite table holds the
>> different Quran texts. The problem is highlighting them: I use a regex,
>> and "السماء" and "ٱلسَّمَآءِ" are not the same string, so I cannot
>> highlight them.
>> Please accept my apologies for asking my question without first asking
>> your permission.
>>
>> On Mon, Aug 27, 2018 at 10:14 AM Zachary Scheuren <angryjaga at gmail.com>
>> wrote:
>>
>>> This isn't really a PyQt question. You can do all that in basic Python,
>>> but it can help if you have something like the pyarabic library. With that
>>> you can strip out the vocalization before comparing strings. You also need
>>> to consider all the possible Alefs like in str1 you have Alef with Wasla,
>>> but str2 only has Alef. pyarabic can also help there with araby.ALEFAT
>>> which is a list of all possible Alefs with marks. You need to manually
>>> check that because Alef with Wasla has no Unicode decomposition and the
>>> wasla isn't encoded as a separate mark. There have been Unicode proposals
>>> for that, but nothing has happened so far. Anyway, I did a quick little
>>> test with your strings...
>>>
>>> import re
>>> from pyarabic import araby
>>> str3_nomarks = araby.separate(str3)[0]  # strips all diacritics
>>> for c in araby.ALEFAT:  # replace any Alef with a mark by base Alef
>>>     str3_nomarks = str3_nomarks.replace(c, araby.ALEF)
>>>
>>> re.findall(str2, str3_nomarks)
>>>
>>> Something like that will get you matches, but if you need to track the
>>> position in a string you'll have to do some more work since dropping the
>>> diacritics will throw off the index.
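
The extra work mentioned above could look roughly like the following: strip
the marks while recording, for each kept character, where it sat in the
original string, then map match positions back. This is only a sketch under
stated assumptions. It swaps in the standard library's unicodedata for
pyarabic (treating every nonspacing mark, category "Mn", as a diacritic), and
the names strip_with_map, find_spans and ALEF_VARIANTS are hypothetical.

```python
import unicodedata

ALEF = "\u0627"
# Assumed Alef forms folded to bare Alef: madda, hamza above, hamza below, wasla.
ALEF_VARIANTS = {"\u0622", "\u0623", "\u0625", "\u0671"}
TATWEEL = "\u0640"


def strip_with_map(text):
    """Return (stripped, index_map) where stripped[i] came from text[index_map[i]]."""
    stripped, index_map = [], []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Mn" or ch == TATWEEL:
            continue  # drop diacritics (nonspacing marks) and tatweel
        stripped.append(ALEF if ch in ALEF_VARIANTS else ch)
        index_map.append(i)
    return "".join(stripped), index_map


def find_spans(needle, text):
    """Yield (start, end) offsets in the ORIGINAL text for each match of needle."""
    stripped, index_map = strip_with_map(text)
    needle, _ = strip_with_map(needle)
    pos = stripped.find(needle)
    while pos != -1 and needle:
        end = index_map[pos + len(needle) - 1] + 1
        # extend over the marks attached to the last matched letter
        while end < len(text) and unicodedata.category(text[end]) == "Mn":
            end += 1
        yield index_map[pos], end
        pos = stripped.find(needle, pos + 1)
```

With the strings from the original question, list(find_spans(str2, str3))
would give offsets into str3 itself, so both the bare and the vocalized forms
can be located without losing their positions.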
>>>
>>>
>>>
>>> On Wed, Aug 22, 2018 at 12:43 AM, Maziar Parsijani <
>>> maziar.parsijani at gmail.com> wrote:
>>>
>>>> Hi
>>>> I have some Arabic strings in my database, and I want to search them
>>>> like this:
>>>>
>>>>   str1 = "ٱلْمُفْلِحُونَ"
>>>>   str2 = "المفلحون"
>>>> As you can see, str1 is the same word as str2, but in the Arabic text
>>>> str1 has extra characters (the diacritical marks).
>>>> Is there any way to search for str2 and find both of them in a string
>>>> like:
>>>>  str3 = " المفلحون ٱلْمُفْلِحُونَ ٱلنَّاسُ المفلحون ٱلْمُفْلِحُونَ
>>>> المفلحون ٱلنَّاسُ المفلحون ٱلنَّاسُ "
>>>>
>>>> _______________________________________________
>>>> PyQt mailing list    PyQt at riverbankcomputing.com
>>>> https://www.riverbankcomputing.com/mailman/listinfo/pyqt
>>>>
>>>
>>>
>