[PyQt] PyQt cannot transform QString into str when reading emoji symbol from QClipboard

Fri Jan 23 04:55:39 GMT 2015

And just for completeness, here is some code from calibre that converts
a UTF-16 buffer (from ICU) to a python unicode string (this is for
python <= 3.2 only).

https://github.com/kovidgoyal/calibre/blob/91afce9f4c6b8df8512110d543e8b27eb9d968f0/src/calibre/utils/icu_calibre_utils.h#L99

It uses PyUnicode_DecodeUTF16() on wide python (UCS-4) builds and 
PyUnicode_FromUnicode() on narrow builds (for performance).

Kovid.

On Fri, Jan 23, 2015 at 08:22:50AM +0530, Kovid Goyal wrote:
> That's wrong, for python <= 3.2
> On Windows and OS X, python <= 3.2 uses narrow builds. That means unicode
> strings are UTF-16 internally and having surrogates is perfectly
> correct. That is the only possible way to represent non-BMP characters
> (codepoint > 0xffff).
> 
> Indeed, on windows and OS X it is not possible to build it as a wide
> build, because that would break lots of extension modules that interface
> with the windows and OS X APIs that assume python unicode objects are
> UTF-16.  Therefore, on python <= 3.2, on windows and OS X, returning
> surrogate pairs is correct.
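
A minimal sketch of the surrogate-pair arithmetic described above, in case
it helps to see why a codepoint > 0xffff needs two 16-bit units. The helper
names to_surrogates() and from_surrogates() are made up for illustration;
they are not part of Python, Qt or PyQt.

```python
def to_surrogates(codepoint):
    """Split a non-BMP codepoint into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above 0xffff need surrogates")
    value = codepoint - 0x10000          # 20 bits remain
    high = 0xD800 + (value >> 10)        # top 10 bits
    low = 0xDC00 + (value & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F637)
print(hex(high), hex(low))               # 0xd83d 0xde37
print(hex(from_surrogates(high, low)))   # 0x1f637
```
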
> 
> In python >= 3.3, python strings use a variable-width representation
> internally (PEP 393), which can be Latin-1 (ASCII being a special case),
> UCS-2 (not UTF-16) or UCS-4.
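
To make the consequence for python >= 3.3 concrete, here is a small sketch
(assuming CPython 3.3 or later) showing that a real non-BMP character and
the same character spelled as two surrogates are different strings, and
that only the first one can be encoded as UTF-8. The variable names are
just for illustration.

```python
import sys

real = '\U0001f637'        # one codepoint above 0xffff
pair = '\ud83d\ude37'      # the same character as two surrogate codepoints

# Since python 3.3 there are no narrow builds any more.
print(sys.maxunicode == 0x10FFFF)   # True
print(len(real), len(pair))         # 1 2
print(real == pair)                 # False

try:
    pair.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    # Surrogate codepoints are not valid in UTF-8.
    encodable = False
print(encodable)                    # False
```
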
> 
> IIRC, QString internally stores unicode strings as UTF-16, so having surrogate
> pairs in QString is correct. http://qt-project.org/wiki/QtStrings
> 
> When PyQt auto converts a QString containing surrogate pairs to a python 
> string, it should convert it using
> 
> PyUnicode_DecodeUTF16() 
> 
> on the internal buffer of a QString. 
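
As a rough Python-level illustration of that conversion (not PyQt's actual
code): the sketch below packs a list of 16-bit code units, standing in for
the internal buffer of a QString, and decodes it with the utf-16-le codec,
which I am assuming here as the Python-level counterpart of
PyUnicode_DecodeUTF16(). utf16_units_to_str() is a hypothetical helper.

```python
import struct


def utf16_units_to_str(units):
    """Decode a sequence of 16-bit UTF-16 code units into a python str."""
    # Pack the code units as little-endian unsigned 16-bit integers,
    # then let the codec join surrogate pairs into single codepoints.
    raw = struct.pack('<%dH' % len(units), *units)
    return raw.decode('utf-16-le')


# 'A' followed by the surrogate pair for U+1F637
text = utf16_units_to_str([0x0041, 0xD83D, 0xDE37])
print(len(text), hex(ord(text[-1])))   # 2 0x1f637
```
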
> 
> For consistency, PyQt could use PyUnicode_DecodeUTF16() on all versions
> of python, though I don't know if memcopying the internal buffer of a
> QString would be faster on narrow builds for python <= 3.2. Depends on
> the internal implementation of PyUnicode_DecodeUTF16(), which I don't
> have time right now to check.
> 
> Kovid.
> 
> 
> On Thu, Jan 22, 2015 at 03:57:46PM -0500, Pavel Roskin wrote:
> > This would decode surrogates!
> > 
> > import array
> > string = QApplication.clipboard().text()
> > # string = '\U0001f637'
> > # string = '\ufeff\ud83d\ude87'
> > try:
> >     # sane case - valid unicode
> >     string.encode('utf-8')
> > except UnicodeEncodeError:
> >     # insane case - need to decode surrogates
> >     string = array.array('H', map(ord, list(string))).tobytes().decode('utf-16')
> > print(string)
> > 
> > The string is split into characters, converted to integers, packed as
> > 16-bit unsigned int, converted to bytes and decoded as UTF-16. Real
> > characters over 0xffff would raise OverflowError in that expression.
> > That's why it's a fallback if UTF-8 encoding doesn't work.
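
A possible variant of the workaround above, sketched under the assumption
of python >= 3.4, where the 'surrogatepass' error handler is accepted by
the utf-16 codecs. It round-trips through UTF-16 instead of array.array(),
so a string that mixes real characters over 0xffff with surrogate pairs
does not hit the OverflowError. merge_surrogates() is a made-up name.

```python
def merge_surrogates(text):
    """Return text with UTF-16 surrogate pairs joined into real codepoints."""
    try:
        # Sane case: valid unicode, nothing to repair.
        text.encode('utf-8')
        return text
    except UnicodeEncodeError:
        # 'surrogatepass' lets the surrogates through as plain 16-bit
        # code units; decoding then joins each pair into one codepoint.
        # A lone (unpaired) surrogate still raises UnicodeDecodeError here.
        raw = text.encode('utf-16-le', 'surrogatepass')
        return raw.decode('utf-16-le')


print(merge_surrogates('\ud83d\ude37') == '\U0001f637')   # True
```

With the two sample strings from the snippet above, '\U0001f637' comes
back unchanged and '\ufeff\ud83d\ude87' becomes '\ufeff\U0001f687'.
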
> > 
> > Of course it's a workaround. QApplication.clipboard().text() should
> > not return surrogates.
> > 
> > -- 
> > Regards,
> > Pavel Roskin
> > _______________________________________________
> > PyQt mailing list    PyQt at riverbankcomputing.com
> > http://www.riverbankcomputing.com/mailman/listinfo/pyqt
> 
> -- 
> _____________________________________
> 
> Dr. Kovid Goyal 
> http://www.kovidgoyal.net
> http://calibre-ebook.com
> _____________________________________
