[PyQt] UnicodeDecodeError with output from Windows OS command

J Barchan jnbarchan at gmail.com
Fri Dec 1 16:52:49 GMT 2017


On 1 December 2017 at 16:19, Florian Bruhin <me at the-compiler.org> wrote:

> (Re-adding the mailing list again, until someone there tells me they've
> heard enough horror stories about encodings :D)
>
> On Fri, Dec 01, 2017 at 01:46:39PM +0000, J Barchan wrote:
> > On 1 December 2017 at 12:18, Florian Bruhin <me at the-compiler.org> wrote:
> >
> > >
> > >
> > > Yes, thinking about it again, I was wrong there - sorry! If you can't
> > > force robocopy to output UTF-8, the best (and probably correct) guess is
> > > locale.getpreferredencoding():
> > > https://docs.python.org/3/library/locale.html#locale.getpreferredencoding
> > >
> > > The documentation for subprocess mentions that it's using
> > > locale.getpreferredencoding(False) if no encoding was given. I'm still
> > > not sure
> > > what that argument (do_setlocale) means exactly, but you might want to
> > > just do
> > > the same.
> > >
> >
> > Dear Florian,
> >
> > This is *almost* what I was always looking for, which nobody has come up
> > with till now...!!!  But read on....
> >
> > All the people elsewhere who have been saying use utf-8 or utf-16 or the
> > file system encoding or all the other things have been doing my head in...
> > :(
>
> Can't blame them for the wrong answers, this stuff is hard.
> If you thought this was bad enough, things are going to get worse in the
> rest of this mail.
>
> You were asking how other tools manage to do this (like Visual Studio).
> I haven't tried, but either they've been written by people who fully
> understand this madness, or they do the wrong thing. Probably the latter. :D
>
> > After I had penned my long epistle to you earlier, I discovered that,
> > although some of the solutions had *appeared* to work in that I no longer
> > got the conversion error, it turns out they were then *displaying* that £
> > character differently, e.g. as the "oe" character or other symbol.
>
> See? There's no easy answer if you want a correct result ;-)
>
> > I discovered for myself that *only* decode("cp850") caused it to display as
> > the desired £.  And that's because UK/Western Europe is *Code Page 850*.
> > So it all fits together now!
> >
> > Now, I *assumed* your suggestion of locale.getpreferredencoding() would
> > return cp850 under Windows (remember, I do not have Windows python to
> > test).  But I checked with my Windows stakeholder, and it does not --- it
> > returns the "windows_1252" type string, which we know does *not* get the £
> > right....
> >
> > Then, I got him to open a Command Prompt, and just enter the command:
> >
> >     c:\Tmp> chcp
> >     Active code page: 850
> >
> > That response from chcp of 850 as the CP is precisely what I am looking
> > for....!!!
> >
> > *Soooo*, I have a little request-task for you!  Assuming you have Windows +
> > Python 3, can you find a native Qt/PyQt/Python call which returns that
> > under Windows, please, please??  If you're not UK, obviously your CP value
> > may vary, but you know what I'm seeking.  Thank you so much!
>
> So... further down the rabbit hole. I've learned things about Windows and
> encodings, and I wish I hadn't looked :D
>
> Codepage 850 is apparently what Windows (or rather, DOS) used before there
> was windows-1252 (which in theory is *also* replaced by a Unicode encoding
> nowadays, but yeah... in theory).
>
> Money quote from https://en.wikipedia.org/wiki/Code_page_850 :
>
> > Systems largely replaced code page 850 with, first, Windows-1252 (often
> > mislabeled as ISO-8859-1), and later with UCS-2, and finally with UTF-16.
>
> Note it was introduced in 1987...
>
> Anyways - locale.getpreferredencoding() returns windows-1252 because that's
> what the part of Windows which is a bit less ancient uses. Turns out the
> console is even *more* horrible than that.
>
> Can you please try this (Windows only):
>
>     import ctypes
>     ctypes.cdll.kernel32.GetConsoleOutputCP()
>
> See https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp
>
> BUT! We're not done there yet, things get *more* horrible. (Yes, I didn't
> know that was possible either).
>
> That returns "437" for me. That is an *even older* codepage, from 1981.
> It *does* happen to have the pound sign at the right position though :D
>
> See https://en.wikipedia.org/wiki/Code_page_437
>
> I originally wanted to blame Robocopy for doing horrible things, but it's
> really Windows in general:
>
>     >>> import subprocess
>     >>> p = subprocess.run(['cmd', '/C', 'echo £'], stdout=subprocess.PIPE)
>     >>> p.stdout
>     b'\x9c\r\n'
>     >>> p.stdout.decode('cp850')
>     '£\r\n'
>     >>> p.stdout.decode('cp437')
>     '£\r\n'
>
> What happens with characters which aren't in cp437 is left as an exercise to
> the reader, I have no idea.
>
> > BTW, in http://doc.qt.io/qt-5/qprocess.html I do not see any mention of
> > "encoding" at all?  Where were you getting your second paragraph above from?
>
> I was looking at Python's subprocess module:
> https://docs.python.org/3/library/subprocess.html
>
> QProcess is probably the better choice if you're running things from a GUI,
> though.
>
> So yeah, I'd recommend something like:
>
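>     import ctypes
>     import locale
>     import sys
>     if sys.platform == 'win32':
>         # GetConsoleOutputCP() returns an int (e.g. 850); the codec is 'cp850'
>         encoding = 'cp%d' % ctypes.cdll.kernel32.GetConsoleOutputCP()
>     else:
>         encoding = locale.getpreferredencoding(False)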
>
> And now I'll need to do something other than reading about encodings for a
> while... :D
>
> Florian
>
> --
> https://www.qutebrowser.org  | me at the-compiler.org (Mail/XMPP)
>    GPG: 916E B0C8 FD55 A072  | https://the-compiler.org/pubkey.asc
>          I love long mails!  | https://email.is-not-s.ms/
>


Hi Florian,

You're a hero!  On mine, ctypes.cdll.kernel32.GetConsoleOutputCP() does
indeed return 850 !!  (I only just reluctantly installed Python on my
Windows a few minutes ago, else I couldn't have done this!)  So it's
considered something not worthy of being available natively from Python,
without going down the cdll.kernel32 rabbit-hole, I wonder why? ;-)
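
For the record, here is a minimal sketch of how that call could be wrapped so that it yields a codec name usable with decode(). The function names codepage_to_codec and console_encoding are made up for illustration, and treating a return value of 0 as "no usable answer" (for instance when no console is attached) is an assumption on my part rather than something tested here.

```python
import codecs
import locale
import sys


def codepage_to_codec(codepage: int) -> str:
    """Map a Windows code page number (e.g. 850) to a Python codec name."""
    if codepage == 65001:  # Windows' identifier for UTF-8
        return "utf-8"
    name = f"cp{codepage}"
    codecs.lookup(name)  # raises LookupError if Python has no such codec
    return name


def console_encoding() -> str:
    """Best guess at the encoding that console programs write their output in."""
    if sys.platform == "win32":
        import ctypes  # only needed on Windows

        codepage = ctypes.windll.kernel32.GetConsoleOutputCP()
        if codepage:  # assumption: 0 is treated as "no usable answer"
            try:
                return codepage_to_codec(codepage)
            except LookupError:
                pass  # unknown code page, fall through to the locale default
    return locale.getpreferredencoding(False)
```

On my machine this would be expected to give 'cp850', and on Florian's 'cp437'; on non-Windows platforms it simply falls back to locale.getpreferredencoding(False).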

I've been thinking about all this too, in view of how messy it is.

I'm coming to the likely conclusion that this issue & its solution only
arise when running "DOS" (deliberately in quotes) commands.  We
know dir or echo are only internal-aliases within cmd.exe, and while
robocopy is a genuine external program it's probably a DOS-sy rather than a
Windows-y program. (I don't know what that means exactly, but it'll
probably be to do with how the executable is marked "Windows" or not, I
recall that being some kind of option in the linker.)

Now I think all this dealing with £ character as 0x9c --- which is the root
of the problem, and should really be a "oe" character anyway --- is at the
DOS side, because for example if I insert a £ into Notepad and save the
file it is stored as a 0xa3 if ANSI or as 0xc2,0xa3 if UTF-8.  So I'm
thinking that the whole cp850 Code Page has nothing to do with
Windows-Unicode-or-whatever-level, and is only coming into play because I'm
using things which happen to still be DOS-sy, such as cmd.exe or
robocopy.exe.
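
The byte values in question can be checked from Python itself, on any platform, since the codecs ship with the standard library. This is only a small sketch; the variable names encoded and decoded_9c are mine.

```python
import unicodedata

POUND = "\u00a3"  # the pound sign

# Bytes produced for the pound sign by each encoding discussed in this thread.
encoded = {
    codec: POUND.encode(codec)
    for codec in ("cp437", "cp850", "cp1252", "utf-8")
}
for codec, raw in encoded.items():
    print(f"{codec:7} {raw.hex()}")

# The single byte 0x9c means different things depending on the code page
# used to decode it; print the Unicode character names to keep output ASCII.
decoded_9c = {
    codec: b"\x9c".decode(codec) for codec in ("cp437", "cp850", "cp1252")
}
for codec, char in decoded_9c.items():
    print(f"{codec:7} {unicodedata.name(char)}")
```

So the two DOS code pages agree on 0x9c for the pound sign, while windows-1252 reads that same byte as the "oe" ligature, which matches the wrong character I was seeing.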

I guess the filename is not *really* being stored with an 0x9c in it, but
some Windows-Unicode-or-other-character, but it appears to "DOS-sy"
programs like it is a 0x9c.  If they were true, native "Windows"
applications they wouldn't be seeing or outputting 0x9c for the £, but some
Unicode or whatever, and none of this would have arisen.

*However*, "true Windows-y programs" which can be invoked purely from the
command-line, don't show a GUI, and report the files they are backing up on
stdout/stderr as they progress for reading into my window *probably* are
not going to be very common... :)

[ BTW, I do have to use Qt QProcess for spawning, I use the
readyReadStandardOutput signal to fetch the output progress filenames into
the scrollable window for my user to see as it goes along, the Python os
module is not going to let me do that.  This is why I'm posting to a PyQt
forum! ]
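
Since the output arrives in arbitrary chunks with this approach, the decoding has to cope with a line (or its "\r\n" terminator) being split across two reads. A rough sketch of how that might look, kept free of Qt so it can be tried anywhere; the class name ChunkDecoder is hypothetical, and errors="replace" is just one possible choice so that an unexpected byte never raises.

```python
import codecs


class ChunkDecoder:
    """Decode byte chunks incrementally and hand back only complete lines."""

    def __init__(self, encoding: str, errors: str = "replace"):
        # An incremental decoder keeps state between calls, which matters
        # for multi-byte encodings where a character may span two chunks.
        self._decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
        self._pending = ""  # text after the last newline seen so far

    def feed(self, data: bytes) -> list:
        text = self._pending + self._decoder.decode(data)
        # Everything before the final "\n" is complete; the rest is held back.
        *complete, self._pending = text.split("\n")
        return [line.rstrip("\r") for line in complete]
```

In a slot connected to readyReadStandardOutput, the idea would be to call something like decoder.feed(bytes(process.readAllStandardOutput())) and append each returned line to the window, with the decoder created once per process using the console code page (cp850 in my case).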

This whole thing has done my head in.  If you have further observations, or
agree or disagree with what I am saying above, let me know, I'm always
interested.

Thank you *so much* for your time, effort & explanations.  I respect people
who know their stuff.  Doubtless you'll hear from me with other PyQt topics
in due course, you have been warned... :)  Have a good weekend.

Kindest,
Jonathan