[PyQt] PyQt cannot transform QString into str when reading emoji symbol from QClipboard
Kovid Goyal
kovid at kovidgoyal.net
Fri Jan 23 02:52:50 GMT 2015
That's wrong for Python <= 3.2.
On Windows and OS X, Python <= 3.2 uses narrow builds. That means unicode
strings are UTF-16 internally, and having surrogates is perfectly
correct. Surrogate pairs are the only possible way to represent non-BMP
characters (code points > 0xffff) in such a build.
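To make the surrogate arithmetic concrete, here is a minimal sketch of how a non-BMP code point maps to its UTF-16 pair. The helper name to_surrogate_pair is hypothetical, chosen only for illustration; the arithmetic itself is the standard UTF-16 encoding rule.

```python
def to_surrogate_pair(codepoint):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above 0xffff need surrogates")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


# U+1F637 is the first emoji used as a test value later in this thread.
high, low = to_surrogate_pair(0x1F637)
print(hex(high), hex(low))  # 0xd83d 0xde37
```

On a narrow build, that emoji is therefore stored as the two code units 0xd83d and 0xde37, and len() reports 2.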
Indeed, on Windows and OS X it is not possible to build Python as a wide
build, because that would break lots of extension modules that interface
with the Windows and OS X APIs and assume Python unicode objects are
UTF-16. Therefore, on Python <= 3.2, on Windows and OS X, returning
surrogate pairs is correct.
In Python >= 3.3, strings use a variable-width representation internally
(PEP 393), which can be Latin-1 (with ASCII-only strings as a special
case), UCS-2 (not UTF-16) or UCS-4.
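A quick way to see the difference from a narrow build, as a sketch that assumes it is run on Python >= 3.3:

```python
import sys

face = '\U0001f637'

# On Python >= 3.3 there are no narrow builds any more.
print(sys.maxunicode == 0x10FFFF)

# One code point, regardless of platform.
print(len(face))

# The same character still needs two 16-bit code units in UTF-16.
utf16_units = len(face.encode('utf-16-le')) // 2
print(utf16_units)
```

So a str on these versions should never need surrogates to hold a non-BMP character; they only appear when something hands Python raw UTF-16 code units as if they were code points.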
IIRC, QString internally stores unicode strings as UTF-16, so having surrogate
pairs in QString is correct. http://qt-project.org/wiki/QtStrings
When PyQt auto-converts a QString containing surrogate pairs to a Python
string, it should convert it using
PyUnicode_DecodeUTF16()
on the internal buffer of the QString.
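Until that happens in PyQt itself, the same decoding can be approximated at the Python level. This is only a sketch of the idea, not what PyQt does internally: the function name join_surrogates is hypothetical, and it assumes Python >= 3.4, where the 'surrogatepass' error handler is accepted by the UTF-16 codecs.

```python
def join_surrogates(text):
    """Recombine UTF-16 surrogate pairs held in a str into real code points.

    Encoding with 'surrogatepass' writes each surrogate out as a plain
    16-bit code unit; decoding the result as UTF-16 then pairs them up.
    """
    return text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


# What a conversion that copies code units verbatim would hand back:
broken = '\ud83d\ude87'
fixed = join_surrogates(broken)
print(len(broken), len(fixed))  # 2 1
print(hex(ord(fixed)))          # 0x1f687
```

Strings that are already valid pass through unchanged, including ones that contain real non-BMP characters. An unpaired surrogate still raises UnicodeDecodeError in the decode step, which seems preferable to passing it along silently.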
For consistency, PyQt could use PyUnicode_DecodeUTF16() on all versions
of Python, though I don't know whether memcopying the internal buffer of a
QString would be faster on narrow builds for Python <= 3.2. That depends on
the internal implementation of PyUnicode_DecodeUTF16(), which I don't
have time to check right now.
Kovid.
On Thu, Jan 22, 2015 at 03:57:46PM -0500, Pavel Roskin wrote:
> This would decode surrogates!
>
> import array
> string = QApplication.clipboard().text()
> # string = '\U0001f637'
> # string = '\ufeff\ud83d\ude87'
> try:
>     # sane case - valid unicode
>     string.encode('utf-8')
> except UnicodeEncodeError:
>     # insane case - need to decode surrogates
>     string = array.array('H', map(ord, list(string))).tobytes().decode('utf-16')
> print(string)
>
> The string is split into characters, converted to integers, packed as
> 16-bit unsigned int, converted to bytes and decoded as UTF-16. Real
> characters over 0xffff would raise OverflowError in that expression.
> That's why it's a fallback if UTF-8 encoding doesn't work.
>
> Of course it's a workaround. QApplication.clipboard().text() should
> not return surrogates.
>
> --
> Regards,
> Pavel Roskin
--
_____________________________________
Dr. Kovid Goyal
http://www.kovidgoyal.net
http://calibre-ebook.com
_____________________________________