[PyQt] PyQt cannot transform QString into str when reading emoji symbol from QClipboard
Kovid Goyal
kovid at kovidgoyal.net
Fri Jan 23 04:55:39 GMT 2015
And just for completeness, here is some code from calibre that converts
a UTF-16 buffer (from ICU) to a python unicode string (this is for
python <= 3.2 only):
https://github.com/kovidgoyal/calibre/blob/91afce9f4c6b8df8512110d543e8b27eb9d968f0/src/calibre/utils/icu_calibre_utils.h#L99
It uses PyUnicode_DecodeUTF16() on wide python (UCS-4) builds and
PyUnicode_FromUnicode() on narrow builds (for performance).
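
For anyone who wants to see which kind of build they are on from Python
itself, here is a minimal sketch. It is only a Python-level analogue of the
choice described above, not the calibre code; the helper name
is_narrow_build is made up for illustration.

```python
import sys


def is_narrow_build():
    """Return True if this interpreter stores str as 16-bit code units."""
    # sys.maxunicode is 0xFFFF on narrow builds of python <= 3.2.
    # It is 0x10FFFF on wide builds and on every python >= 3.3.
    return sys.maxunicode == 0xFFFF


# On a narrow build a non-BMP character occupies two code units (a
# surrogate pair), so len() reports 2. Elsewhere it reports 1.
print(is_narrow_build(), len('\U0001f637'))
```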
Kovid.
On Fri, Jan 23, 2015 at 08:22:50AM +0530, Kovid Goyal wrote:
> That's wrong for python <= 3.2.
> On Windows and OS X, python <= 3.2 uses narrow builds. That means unicode
> strings are UTF-16 internally and having surrogates is perfectly
> correct. That is the only possible way to represent non-BMP characters
> (codepoint > 0xffff).
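
To make the surrogate representation concrete, here is a small sketch of
the standard UTF-16 arithmetic for a codepoint above 0xffff. The function
names are hypothetical, chosen for this example only.

```python
def to_surrogates(codepoint):
    """Split a non-BMP codepoint into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError('codepoint is not outside the BMP')
    offset = codepoint - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F637)
print(hex(high), hex(low))
```

For U+1F637 this gives the pair 0xd83d, 0xde37, which is what a narrow
build stores for that character.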
>
> Indeed, on windows and OS X it is not possible to build it as a wide
> build, because that would break lots of extension modules that interface
> with the windows and OS X APIs that assume python unicode objects are
> UTF-16. Therefore, on python <= 3.2, on windows and OS X, returning
> surrogate pairs is correct.
>
> In python >= 3.3 python strings use a variable-width representation
> internally (PEP 393), which can be Latin-1 (pure ASCII being a special
> case), UCS-2 (not UTF-16) or UCS-4.
>
> IIRC, QString internally stores unicode strings as UTF-16, so having surrogate
> pairs in QString is correct. http://qt-project.org/wiki/QtStrings
>
> When PyQt auto converts a QString containing surrogate pairs to a python
> string, it should convert it using
>
> PyUnicode_DecodeUTF16()
>
> on the internal buffer of a QString.
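
As a rough Python-level illustration of that conversion (not what PyQt
does internally, which would be the C call above), the sketch below packs a
list of 16-bit code units, standing in for a QString's buffer, and decodes
them as UTF-16. The helper name decode_utf16_units is an assumption made
for this example.

```python
import struct


def decode_utf16_units(units):
    """Decode an iterable of 16-bit UTF-16 code units into a str."""
    units = list(units)
    # '<' forces little-endian with standard sizes, so each 'H' is
    # exactly two bytes regardless of platform.
    raw = struct.pack('<%dH' % len(units), *units)
    # Surrogate pairs in the buffer are combined into single codepoints.
    return raw.decode('utf-16-le')


text = decode_utf16_units([0xD83D, 0xDE37])
print(len(text), hex(ord(text)))
```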
>
> For consistency, PyQt could use PyUnicode_DecodeUTF16() on all versions
> of python, though I don't know if memcopying the internal buffer of a
> QString would be faster on narrow builds for python <= 3.2. Depends on
> the internal implementation of PyUnicode_DecodeUTF16(), which I don't
> have time right now to check.
>
> Kovid.
>
>
> On Thu, Jan 22, 2015 at 03:57:46PM -0500, Pavel Roskin wrote:
> > This would decode surrogates!
> >
> > import array
> > string = QApplication.clipboard().text()
> > # string = '\U0001f637'
> > # string = '\ufeff\ud83d\ude87'
> > try:
> >     # sane case - valid unicode
> >     string.encode('utf-8')
> > except UnicodeEncodeError:
> >     # insane case - need to decode surrogates
> >     string = array.array('H', map(ord, list(string))).tobytes().decode('utf-16')
> > print(string)
> >
> > The string is split into characters, converted to integers, packed as
> > 16-bit unsigned int, converted to bytes and decoded as UTF-16. Real
> > characters over 0xffff would raise OverflowError in that expression.
> > That's why it's a fallback if UTF-8 encoding doesn't work.
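
A possible alternative to the workaround described above, sketched under
the assumption of python >= 3.4 (where, as far as I know, the
'surrogatepass' error handler is accepted by the UTF-16 codecs). It
round-trips the string through UTF-16, so it should not need the try/except
fallback, because real characters over 0xffff are simply encoded as pairs
rather than raising OverflowError. The name join_surrogates is made up for
this example.

```python
def join_surrogates(text):
    """Combine UTF-16 surrogate pairs in text into single codepoints."""
    # 'surrogatepass' lets the codec write surrogates as ordinary 16-bit
    # units instead of raising UnicodeEncodeError.
    raw = text.encode('utf-16-le', 'surrogatepass')
    # Decoding merges each high/low pair into one non-BMP character.
    # 'surrogatepass' here is meant to leave unpaired surrogates alone.
    return raw.decode('utf-16-le', 'surrogatepass')


fixed = join_surrogates('\ud83d\ude87')
print(len(fixed), hex(ord(fixed)))
```

Because the explicit 'utf-16-le' codec is used, a leading '\ufeff' is kept
in the result instead of being consumed as a byte order mark.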
> >
> > Of course it's a workaround. QApplication.clipboard().text() should
> > not return surrogates.
> >
> > --
> > Regards,
> > Pavel Roskin
> > _______________________________________________
> > PyQt mailing list PyQt at riverbankcomputing.com
> > http://www.riverbankcomputing.com/mailman/listinfo/pyqt
>
> --
> _____________________________________
>
> Dr. Kovid Goyal
> http://www.kovidgoyal.net
> http://calibre-ebook.com
> _____________________________________
--
_____________________________________
Dr. Kovid Goyal
http://www.kovidgoyal.net
http://calibre-ebook.com
_____________________________________