[PyQt] PyQt cannot transform QString into str when reading emoji symbol from QClipboard

Kovid Goyal kovid at kovidgoyal.net
Fri Jan 23 02:52:50 GMT 2015


That's wrong for python <= 3.2.
On Windows and OS X, python <= 3.2 uses narrow builds. That means unicode
strings are UTF-16 internally and having surrogates is perfectly
correct. That is the only possible way to represent non-BMP characters
(codepoint > 0xffff).
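To make the surrogate arithmetic concrete, here is a minimal sketch of how a
codepoint above 0xffff is split into a UTF-16 high/low pair. The helper name
to_surrogate_pair is mine, purely for illustration; it is not a Python or
PyQt API.

```python
def to_surrogate_pair(codepoint):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F637)
print(hex(high), hex(low))  # 0xd83d 0xde37
```

These are the same two 16-bit units that '\U0001f637'.encode('utf-16-be')
produces, which is why a narrow build exposes them as two separate
"characters".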

Indeed, on windows and OS X it is not possible to build it as a wide
build, because that would break lots of extension modules that interface
with the windows and OS X APIs that assume python unicode objects are
UTF-16.  Therefore, on python <= 3.2, on windows and OS X, returning
surrogate pairs is correct.

In python >= 3.3 (PEP 393), python strings use a variable-width
representation internally, chosen per string: Latin-1 (with a compact form
for pure ASCII), UCS-2 (not UTF-16) or UCS-4.
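A rough way to observe this from Python itself, assuming CPython >= 3.3
(sys.getsizeof reports implementation-specific sizes, so other interpreters
may behave differently). The helper bytes_per_char is a name I made up for
this sketch:

```python
import sys


def bytes_per_char(ch):
    """Estimate CPython's per-character storage for strings made of `ch`."""
    # Growing the string by one character grows the object by the width
    # of the internal representation chosen for that string.
    return sys.getsizeof(ch * 2) - sys.getsizeof(ch)


print(bytes_per_char("a"))           # 1-byte representation
print(bytes_per_char("\u20ac"))      # BMP character, UCS-2
print(bytes_per_char("\U0001f637"))  # non-BMP character, UCS-4
print(len("\U0001f637"))             # one code point, no surrogates
```

The point is that a non-BMP character is a single element of the string on
these versions, so a surrogate pair showing up in a str is no longer the
expected representation.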

IIRC, QString internally stores unicode strings as UTF-16, so having surrogate
pairs in QString is correct. http://qt-project.org/wiki/QtStrings

When PyQt automatically converts a QString containing surrogate pairs to a
python string, it should convert it using

PyUnicode_DecodeUTF16()

on the internal buffer of the QString.
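Until the conversion is fixed in PyQt itself, the same idea can be sketched
at the Python level: treat the string as a sequence of 16-bit units and
decode those units as UTF-16. This is only an analogue of the C call above,
not what PyQt does; it assumes python >= 3.4, where the utf-16 codecs accept
the 'surrogatepass' error handler, and merge_surrogates is a hypothetical
helper name.

```python
def merge_surrogates(text):
    """Recombine UTF-16 surrogate pairs that arrived as separate code points."""
    # 'surrogatepass' lets the encoder write each surrogate as a raw
    # 16-bit unit; the plain utf-16-le decoder then pairs them up.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


broken = "\ud83d\ude37"          # what the clipboard text looks like
fixed = merge_surrogates(broken)
print(len(broken), len(fixed))   # 2 1
print(fixed == "\U0001f637")     # True
```

A string with an unpaired surrogate still fails at the decode step with
UnicodeDecodeError, and a string that is already valid passes through
unchanged.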

For consistency, PyQt could use PyUnicode_DecodeUTF16() on all versions
of python, though I don't know whether memcopying the internal buffer of a
QString would be faster on narrow builds for python <= 3.2. That depends on
the internal implementation of PyUnicode_DecodeUTF16(), which I don't
have time to check right now.

Kovid.


On Thu, Jan 22, 2015 at 03:57:46PM -0500, Pavel Roskin wrote:
> This would decode surrogates!
> 
> import array
> string = QApplication.clipboard().text()
> # string = '\U0001f637'
> # string = '\ufeff\ud83d\ude87'
> try:
>     # sane case - valid unicode
>     string.encode('utf-8')
> except UnicodeEncodeError:
>     # insane case - need to decode surrogates
>     string = array.array('H', map(ord, list(string))).tobytes().decode('utf-16')
> print(string)
> 
> The string is split into characters, converted to integers, packed as
> 16-bit unsigned int, converted to bytes and decoded as UTF-16. Real
> characters over 0xffff would raise OverflowError in that expression.
> That's why it's a fallback if UTF-8 encoding doesn't work.
> 
> Of course it's a workaround. QApplication.clipboard().text() should
> not return surrogates.
> 
> -- 
> Regards,
> Pavel Roskin

-- 
_____________________________________

Dr. Kovid Goyal 
http://www.kovidgoyal.net
http://calibre-ebook.com
_____________________________________

