[PyQt] UnicodeDecodeError with output from Windows OS command

Florian Bruhin me at the-compiler.org
Fri Dec 1 16:19:08 GMT 2017


(Re-adding the mailinglist again, until someone there tells me they've heard
enough horror stories about encodings :D)

On Fri, Dec 01, 2017 at 01:46:39PM +0000, J Barchan wrote:
> On 1 December 2017 at 12:18, Florian Bruhin <me at the-compiler.org> wrote:
> 
> >
> >
> > Yes, thinking about it again, I was wrong there - sorry! If you can't force
> > robocopy to output UTF-8, the best (and probably correct) guess is
> > locale.getpreferredencoding():
> > https://docs.python.org/3/library/locale.html#locale.getpreferredencoding
> >
> > The documentation for subprocess mentions that it's using
> > locale.getpreferredencoding(False) if no encoding was given. I'm still
> > not sure
> > what that argument (do_setlocale) means exactly, but you might want to
> > just do
> > the same.
> >
> 
> Dear Florian,
> 
> This is *almost* what I was always looking for, which nobody has come up
> with till now...!!!  But read on....
> 
> All the people elsewhere who have been saying use utf-8 or utf-16 or the
> file system encoding or all the other things have been doing my head in...
> :(

Can't blame them for the wrong answers, this stuff is hard.
If you thought this was bad enough, things are going to get worse in the rest
of this mail.

You were asking how other tools manage to do this (like Visual Studio).
I haven't tried, but either they've been written by people who fully understand
this madness, or they do the wrong thing. Probably the latter. :D

> After I had penned my long epistle to you earlier, I discovered that,
> although some of the solutions had *appeared* to work in that I no longer
> got the conversion error, it turns out they were then *displaying* that £
> character differently, e.g. as the "oe" character or other symbol.

See? There's no easy answer if you want a correct result ;-)

> I discovered for myself that *only* decode("cp850") caused it to display as
> the desired £.  And that's because UK/Western Europe is *Code Page 850*.
> So it all fits together now!
> 
> Now, I *assumed* your suggestion of locale.getpreferredencoding() would
> return cp850 under Windows (remember, I do not have Windows python to
> test).  But I checked with my Windows stakeholder, and it does not --- it
> returns the "windows_1252" type string, which we know does *not* get the £
> right....
> 
> Then, I got him to open a Command Prompt, and just enter the command:
> 
>     c:\Tmp> chcp
>     Active code page: 850
>
> 
> That response from chcp of 850 as the CP is precisely what I am looking
> for....!!!
> 
> *Soooo*, I have a little request-task for you!  Assuming you have Windows +
> Python 3, can you find a native Qt/PyQt/Python call which returns that
> under Windows, please, please??  If you're not UK, obviously your CP value
> may vary, but you know what I'm seeking.  Thank you so much!

So... further down the rabbit hole. I've learned things about Windows and
encodings, and I wish I hadn't looked :D

Codepage 850 is apparently what Windows (or rather, DOS) used before there was
windows-1252 (which in theory is *also* replaced by a Unicode encoding
nowadays, but yeah... in theory).

Money quote from https://en.wikipedia.org/wiki/Code_page_850 :

> Systems largely replaced code page 850 with, first, Windows-1252 (often
> mislabeled as ISO-8859-1), and later with UCS-2, and finally with UTF-16.

Note it was introduced in 1987...

Anyways - locale.getpreferredencoding() returns windows-1252 because that's
what the part of Windows which is a bit less ancient uses. Turns out the
console is even *more* horrible than that.

Can you please try this (Windows only):

    import ctypes
    # kernel32 functions use the stdcall convention, so load it via windll
    ctypes.windll.kernel32.GetConsoleOutputCP()

See https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp
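
A slightly more defensive way to make that call, as a rough sketch only: the
helper name console_output_codepage is made up for this example, and treating
a return value of 0 as "no usable console" is an assumption based on my reading
of the Microsoft documentation linked above.

```python
import ctypes
import sys


def console_output_codepage():
    """Return the console output code page number, or None if unavailable."""
    if sys.platform != 'win32':
        # GetConsoleOutputCP only exists in the Windows API.
        return None
    cp = ctypes.windll.kernel32.GetConsoleOutputCP()
    # Assumption: 0 means the call failed, e.g. no console is attached
    # to the process (which can happen in a GUI application).
    return cp or None


print(console_output_codepage())
```

On anything other than Windows this prints None; on Windows it should print a
number such as 850 when run from a Command Prompt.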

BUT! We're not done there yet, things get *more* horrible. (Yes, I didn't know
that was possible either).

That returns "437" for me. That is an *even older* codepage, from 1981.
It *does* happen to have the pound sign at the right position though :D

See https://en.wikipedia.org/wiki/Code_page_437

I originally wanted to blame Robocopy for doing horrible things, but it's
really Windows in general:

    >>> import subprocess
    >>> p = subprocess.run(['cmd', '/C', 'echo £'], stdout=subprocess.PIPE)
    >>> p.stdout
    b'\x9c\r\n'
    >>> p.stdout.decode('cp850')
    '£\r\n'
    >>> p.stdout.decode('cp437')
    '£\r\n'

What happens with characters which aren't in cp437 is left as an exercise to
the reader, I have no idea.
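
The decoding side can at least be checked on any platform, without Windows.
This small sketch takes the single byte 0x9c from the output above and decodes
it with a few candidate encodings; it also shows why windows-1252 gives the
"oe" character instead of the pound sign. The last part only shows how
*Python's* cp437 codec treats a character it doesn't have (the euro sign), which
says nothing about what cmd.exe itself would do.

```python
raw = b'\x9c'  # the byte cmd.exe produced for the pound sign

# Decode the same byte with the code pages discussed in this thread.
decoded = {name: raw.decode(name) for name in ('cp850', 'cp437', 'cp1252')}
for name, text in decoded.items():
    print(name, repr(text))

# A lone 0x9c byte is not valid UTF-8, hence the UnicodeDecodeError.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print('utf-8 fails:', exc.reason)

# Python-side behaviour only: the euro sign (U+20AC) is not in cp437,
# so encoding it with errors='replace' yields a question mark.
replaced = '\u20ac'.encode('cp437', errors='replace')
print(replaced)
```

Both cp850 and cp437 give '£', while cp1252 gives 'œ'.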

> BTW, in http://doc.qt.io/qt-5/qprocess.html I do not see any mention of
> "encoding at all"?  Where were you getting your second paragraph above from?

I was looking at Python's subprocess module:
https://docs.python.org/3/library/subprocess.html

QProcess is probably the better choice if you're running things from a GUI,
though.

So yeah, I'd recommend something like:

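    import ctypes
    import locale
    import sys

    if sys.platform == 'win32':
        # the API returns a number (e.g. 850); Python's codec is named 'cp850'
        encoding = 'cp{}'.format(ctypes.windll.kernel32.GetConsoleOutputCP())
    else:
        encoding = locale.getpreferredencoding(False)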
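
Wrapped up as a helper, that could look roughly like the sketch below. The
function names guess_console_encoding and decode_output are invented for this
example. It assumes that GetConsoleOutputCP() returning 0 means there is no
usable console (in which case it falls back to the locale encoding), and it
decodes with errors='replace' so that a wrong guess gives odd characters
rather than a UnicodeDecodeError.

```python
import ctypes
import locale
import sys


def guess_console_encoding():
    """Best-effort guess at the encoding of a console program's output."""
    if sys.platform == 'win32':
        cp = ctypes.windll.kernel32.GetConsoleOutputCP()
        if cp:  # assumption: 0 means failure, e.g. no console attached
            return 'cp{}'.format(cp)
    return locale.getpreferredencoding(False)


def decode_output(raw, encoding=None):
    """Decode bytes from a child process without raising on bad bytes."""
    return raw.decode(encoding or guess_console_encoding(), errors='replace')


# Passing the encoding explicitly makes the result reproducible anywhere.
print(decode_output(b'\x9c\r\n', 'cp850'))
```

Note that not every Windows code page number necessarily has a matching Python
codec, so a real program might want to catch LookupError as well.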

And now I'll need to do something other than reading about encodings for a
while... :D

Florian

-- 
https://www.qutebrowser.org  | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072  | https://the-compiler.org/pubkey.asc
         I love long mails!  | https://email.is-not-s.ms/