<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 30 November 2017 at 22:59, Florian Bruhin <span dir="ltr"><<a href="mailto:me@the-compiler.org" target="_blank">me@the-compiler.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">(I assume you accidentally removed the list from the reply, so I re-added it)<br>
<span class="gmail-"><br>
On Thu, Nov 30, 2017 at 10:28:58PM +0000, J Barchan wrote:<br>
> Did I mention that I have learnt now what output is causing the error? It<br>
> turns out it's if robocopy --- which simply reports filenames as it goes<br>
> --- encounters a filename with a "£" (UK pound sterling) character in its<br>
> name. It would happen just as much if the command were, say, "dir" instead<br>
> of "robocopy". That pound character is a single byte of 0x9c in the<br>
> output, which decode('utf-8') is barfing on.<br>
<br>
</span>Right, because it is not output encoded as utf-8, so utf-8 is the wrong<br>
encoding to decode it with ;-)<br>
<span class="gmail-"><br>
> In extensive investigations, all I came across was<br>
> <a href="https://riverbankcomputing.com/pipermail/pyqt/2010-January/025564.html" rel="noreferrer" target="_blank">https://riverbankcomputing.<wbr>com/pipermail/pyqt/2010-<wbr>January/025564.html</a>:<br>
><br>
> > >> > really often I have this kind of code in my application when it comes<br>
> > >> > to<br>
> > >> > converting a QByteArray to a string under Python 3.1.<br>
> > >> ><br>
> > >> > s = bytes(QByteArray).decode()<br>
> > >> ><br>
> > >> > In Python 2 one could use<br>
> > >> ><br>
> > >> > s = unicode(QByteArray)<br>
> > >> ><br>
> > >> > to get the same result. Did I miss something or could QByteArray get<br>
> a<br>
> > >> > decode() method to make it similar to a Python3 bytes or bytearray<br>
> > >> > type?<br>
> > >><br>
> > >> The Python3 way to do it is...<br>
> > >><br>
> > >> s = str(QByteArray, encoding='ascii')<br>
> > >><br>
> > >> ...or whatever encoding is used.<br>
> > >><br>
> > >> It would be possible to change things so that...<br>
> > >><br>
> > >> s = str(QByteArray)<br>
> > >><br>
> > >> ...automatically uses the default encoding. However that would then<br>
> make<br>
> > >> it<br>
> > >> inconsistent with the behaviour of...<br>
> > >><br>
> > >> s = str(bytes)<br>
> > >><br>
> > >> ...and I'm not sure that that is a good idea.<br>
><br>
</span>> See, that guy is saying in Python 3 "*...or whatever encoding is used*.".<br>
> But in Python 2 he says it "*automatically uses the default encoding*".<br>
<br>
That person seems to be talking about ascii encoding only, which is the most<br>
simple case.<br>
<br>
> *I just want the Python 2 behaviour: what was the "default encoding" used<br>
> there, which meant I didn't have to specify one explicitly?* I want the Python 3<br>
> equivalent.<br>
<br>
It's picking ascii and erroring out if that doesn't work, which is basically<br>
what you're seeing ;-)<br>
<br>
<a href="https://docs.python.org/2/howto/unicode.html#the-unicode-type" rel="noreferrer" target="_blank">https://docs.python.org/2/<wbr>howto/unicode.html#the-<wbr>unicode-type</a><br>
<br>
The first argument is converted to Unicode using the specified encoding; if<br>
you leave off the encoding argument, the ASCII encoding is used for the<br>
conversion, so characters greater than 127 will be treated as errors: [...]<br>
<span class="gmail-"><br>
> Did it just use bytearray.decode('utf-8', 'replace') like you say for the<br>
> C++?<br>
<br>
</span>No, the equivalent of bytearray.decode('ascii', 'strict').<br>
<br>
What you probably want in this case is:<br>
<br>
bytearray.decode(sys.<wbr>getfilesystemencoding())<br>
<br>
Assuming that robocopy doesn't somehow re-encode the filenames it gets.<br>
<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
Florian<br>
<br>
--<br>
<a href="https://www.qutebrowser.org" rel="noreferrer" target="_blank">https://www.qutebrowser.org</a> | <a href="mailto:me@the-compiler.org">me@the-compiler.org</a> (Mail/XMPP)<br>
GPG: 916E B0C8 FD55 A072 | <a href="https://the-compiler.org/pubkey.asc" rel="noreferrer" target="_blank">https://the-compiler.org/<wbr>pubkey.asc</a><br>
I love long mails! | <a href="https://email.is-not-s.ms/" rel="noreferrer" target="_blank">https://email.is-not-s.ms/</a><br>
</div></div></blockquote></div><br><div style="font-family:tahoma,sans-serif" class="gmail_default">Firstly, I admit I get lost/confused when replying in my Gmail client. I've seen some people here reply to me only, reply to pyqt only, or reply to some combination. Anyway, this time I'm using "Reply All"....</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">Secondly, thanks to your suggestions, at the end of the day I can decode via <span style="font-family:monospace,monospace">"latin-1"</span> or <span style="font-family:monospace,monospace">"windows_1252"</span> and it works in my customer's case here in the UK (I do not have access to the Windows running code or their filenames, I develop just under Linux!), so I am no longer "stuck".</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">However, I would still like to do this in the "best"/"most robust"/"most portable"/"correct" way if I can, plus I like to understand things properly. Assuming you are still interested & helpful (!?), I still have a couple of things to pursue with you.</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">[Among the places I have been asking to try to resolve this, you have been the most "on the ball" and provided responses which actually show an understanding of the issue. So I'd be ever so grateful if you wouldn't mind taking your time to look through what I'm writing below. I know it's a bit long, but I hope you'll see I'm just laying out a very logical argument of what's going on here. There are 3 "Parts". 
Plus a "thank you" at the end....]<br></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><b>FIRST PART:</b><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">You suggest:</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="font-family:tahoma,sans-serif" class="gmail_default">What you probably want in this case is:<br>
<br>
bytearray.decode(sys.<wbr>getfilesystemencoding())</div></blockquote><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">Now, from <a href="https://docs.python.org/3/library/sys.html">https://docs.python.org/3/library/sys.html</a> "<b>PEP 529 -- Change Windows filesystem encoding to UTF-8</b>":<br></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><ul class="gmail-simple"><li>On Mac OS X, the encoding is <code class="gmail-docutils gmail-literal"><span class="gmail-pre">'utf-8'</span></code>.</li><li>On Unix, the encoding is the locale encoding.</li><li>On Windows, the encoding may be <code class="gmail-docutils gmail-literal"><span class="gmail-pre">'utf-8'</span></code> or <code class="gmail-docutils gmail-literal"><span class="gmail-pre">'mbcs'</span></code>, depending
on user configuration.</li></ul></blockquote><p>And from the <a href="https://www.python.org/dev/peps/pep-0529/">https://www.python.org/dev/peps/pep-0529/</a> referenced from there:</p><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><p>This PEP proposes changing the default filesystem encoding on Windows to utf-8</p></blockquote><p>and:</p><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><p>Currently the default filesystem encoding is 'mbcs', which is a meta-encoder
that uses the active code page. <br></p><p>The use of utf-8 will not be configurable, except for the provision of a
"legacy mode" flag to revert to the previous behaviour.</p><h2><a class="gmail-toc-backref" href="https://www.python.org/dev/peps/pep-0529/#id7">Update sys.getfilesystemencoding</a></h2><p>Remove the default value for <tt class="gmail-docutils gmail-literal">Py_FileSystemDefaultEncoding</tt> and set it in
<tt class="gmail-docutils gmail-literal">initfsencoding()</tt> to utf-8, or if the legacy-mode switch is enabled to mbcs.</p></blockquote>
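<p>A minimal sketch for checking what Python actually reports on a given machine, rather than inferring it from the docs. The helper name <code>report_encodings</code> is hypothetical, and none of these values is guaranteed to match the code page a console sub-process writes its output in:</p>

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for this machine."""
    return {
        # What sys.getfilesystemencoding() returns (utf-8 or mbcs on Windows).
        "filesystem": sys.getfilesystemencoding(),
        # The locale's preferred encoding, without changing the locale.
        "preferred_locale": locale.getpreferredencoding(False),
        # Encoding of this process's own stdout, if it has one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(name, "=", value)
```

<p>Running this on the customer's Windows machine would show whether any of the three is a plausible argument for <code>decode()</code>.</p>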
<p>So, (remembering that I cannot test this under Windows...) I understand that <span style="font-family:monospace,monospace">sys.</span><wbr><span style="font-family:monospace,monospace">getfilesystemencoding()</span> is going to return <span style="font-family:monospace,monospace">utf-8</span> there, and we already know that this will be wrong when it encounters the 0x9c byte, since that is exactly the error I'm reporting....<br></p></div></div><div class="gmail_extra"><div style="font-family:tahoma,sans-serif" class="gmail_default"></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><b>SECOND PART:</b></div><div style="font-family:tahoma,sans-serif" class="gmail_default"></div><div style="font-family:tahoma,sans-serif" class="gmail_default">You, and others, keep talking about the behaviour of the particular <span style="font-family:monospace,monospace">robocopy</span> command I happen to get the problem with. I don't see that it is the cause of, or even relevant to, the issue. My code is for generic issuing of an unknown OS command and reading its output.<br></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">So, let's get rid of <span style="font-family:monospace,monospace">robocopy</span> completely! Certainly here in the UK, on my keyboard/with my UK Windows I can go into a Command Prompt and type:</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail_default"><span style="font-family:monospace,monospace">echo £</span></div></blockquote><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">and I get <span style="font-family:monospace,monospace">£</span> as output. 
And if I examine the output it has a single byte of <span style="font-family:monospace,monospace">0x9c</span> for the <span style="font-family:monospace,monospace">£</span> character.</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">So, let's say that --- or some program which outputs same --- is my sub-process program. <u><i>Note that now it has nothing to do with Windows filenames!!</i></u> (Nor anything to do with <span style="font-family:monospace,monospace">robocopy</span>.) The choice of anything to do with <span style="font-family:monospace,monospace">sys.</span><wbr><span style="font-family:monospace,monospace">getfilesystemencoding()</span> now seems "inappropriate" to me, as the issue has nothing to do with file system file names. Which is how it always seemed to me.</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">[BTW: I notice <a href="https://stackoverflow.com/questions/15530635/why-is-sys-getdefaultencoding-different-from-sys-stdout-encoding-and-how-does">https://stackoverflow.com/questions/15530635/why-is-sys-getdefaultencoding-different-from-sys-stdout-encoding-and-how-does</a>. While that is Python2, it may be relevant to this discussion. Though please don't get hung up on it!]<br></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">My "spawner" app cannot possibly know anything about which program is being run, nor should it.</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">I wish to display the output much as, say, it would appear in the Command Prompt console. 
I just want the output in a scrollable Qt window (<span style="font-family:monospace,monospace">QTextEdit</span>) for the user to examine. Now, there are many programs out there which do this sort of thing. For example, Qt Creator itself allows the spawning of arbitrary OS commands and displays the output in some scrollable control, as does say Visual Studio, and has no knowledge of what the sub-process might be doing in the way of text encoding, yet they manage fine. I want the same! So how do they do it?!</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">My <i>thought</i> is: When I install Windows here, I am asked what "locale" I am in, I pick "UK", and at that point (as I understand it) the system is set so that my keyboard input and my whatever output (certainly for the console) is set to something UK-ish, which includes that <span style="font-family:monospace,monospace">£</span> character. Your mileage may vary, e.g. if you're in the US, or Germany with all its funny characters. 
I'm thinking/wondering whether there is something from there I can get at which is what I should be using if I have to pass an explicit parameter to <span style="font-family:monospace,monospace">decode()</span>.</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">For example, I see from <a href="http://doc.qt.io/qt-5/qtextcodec.html#details">http://doc.qt.io/qt-5/qtextcodec.html#details</a> :<br></div></div><div class="gmail_extra"><div style="font-family:tahoma,sans-serif" class="gmail_default"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><pre class="gmail-cpp gmail-prettyprint gmail-prettyprinted"><span class="gmail-type"><a href="http://doc.qt.io/qt-5/qbytearray.html"><span class="gmail-typ">QByteArray</span></a></span><span class="gmail-pln"> encodedString </span><span class="gmail-operator"><span class="gmail-pun">=</span></span><span class="gmail-pln"> </span><span class="gmail-string"><span class="gmail-str">"..."</span></span><span class="gmail-pun">;</span><span class="gmail-pln">
</span><span class="gmail-type"><a href="http://doc.qt.io/qt-5/qtextcodec.html#QTextCodec"><span class="gmail-typ">QTextCodec</span></a></span><span class="gmail-pln"> </span><span class="gmail-operator"><span class="gmail-pun">*</span></span><span class="gmail-pln">codec </span><span class="gmail-operator"><span class="gmail-pun">=</span></span><span class="gmail-pln"> </span><span class="gmail-type"><a href="http://doc.qt.io/qt-5/qtextcodec.html#QTextCodec"><span class="gmail-typ">QTextCodec</span></a></span><span class="gmail-operator"><span class="gmail-pun">::</span></span><span class="gmail-pln">codecForName</span><span class="gmail-pun">(</span><span class="gmail-string"><span class="gmail-str">"KOI8-R"</span></span><span class="gmail-pun">);</span><span class="gmail-pln">
</span><span class="gmail-type"><a href="http://doc.qt.io/qt-5/qstring.html"><span class="gmail-typ">QString</span></a></span><span class="gmail-pln"> </span><span class="gmail-kwd">string</span><span class="gmail-pln"> </span><span class="gmail-operator"><span class="gmail-pun">=</span></span><span class="gmail-pln"> codec</span><span class="gmail-operator"><span class="gmail-pun">-</span></span><span class="gmail-operator"><span class="gmail-pun">></span></span><span class="gmail-pln">toUnicode</span><span class="gmail-pun">(</span><span class="gmail-pln">encodedString</span><span class="gmail-pun">);</span></pre></blockquote>and:</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="font-family:tahoma,sans-serif" class="gmail_default"><h3 class="gmail-fn" id="gmail-codecForLocale"><span class="gmail-type"><a href="http://doc.qt.io/qt-5/qtextcodec.html#QTextCodec">QTextCodec</a></span> *QTextCodec::<span class="gmail-name">codecForLocale</span>()</h3>
<p>Returns a pointer to the codec most suitable for this locale.</p>
<p>On Windows, the codec will be based on a system locale. On Unix systems, the codec might fall back to using the <i>iconv</i> library if no builtin codec for the locale can be found.</p>
<p>Note that in these cases the codec's name will be "System".</p></div></blockquote><div style="font-family:tahoma,sans-serif" class="gmail_default">So is <i>that</i> the "correct" approach to use, do you think?<br></div><br></div><div class="gmail_extra"><div style="font-family:tahoma,sans-serif" class="gmail_default"><b>THIRD PART:</b></div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">Elsewhere I have been told that the content of the error I see:</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="font-family:tahoma,sans-serif" class="gmail_default"><span style="font-family:monospace,monospace">Unhandled Exception:<br><br>'utf-8' codec can't decode byte 0x9c in position 32: invalid start byte<br><br><class 'UnicodeDecodeError'><br>File "C:\HJinn\widgets\<wbr>messageboxes.py", line 289, in processReadyReadStandardOutput<br>output = output.data().decode('utf-8')</span></div></blockquote><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">is specifically coming from <i>Python</i>. (The <span style="font-family:monospace,monospace">QByteArray.decode()</span> is a Python-only/PyQt function, not present in Qt C++, is that correct??) 
If that's true, are you sure that the whole issue has nothing to do with Python's handling of the 0x9c byte specifically, as per the Stack Overflow post dedicated to that very byte: <a href="https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c">https://stackoverflow.com/questions/12468179/unicodedecodeerror-utf8-codec-cant-decode-byte-0x9c</a> ?</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">But again you may say it has nothing to do with Python <i>per se</i>, it's just people noticing it when transitioning from Python 2 to Python 3 behaviour....</div><div style="font-family:tahoma,sans-serif" class="gmail_default"><br></div><div style="font-family:tahoma,sans-serif" class="gmail_default">I see various suggested solutions there, such as <span class="gmail-comment-copy"><code>decode('cp1252')</code></span> (that must be "Code Page", and be equivalent to <span style="font-family:monospace,monospace">windows_1252</span>, ah ha!) and<code> <span class="gmail-pln">decode</span><span class="gmail-pun">(</span><span class="gmail-str">'unicode_escape'</span><span class="gmail-pun">)</span> </code>....</div><br></div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra"><div style="font-family:tahoma,sans-serif" class="gmail_default"><u><i><b>Well, if you've read this far I'm really grateful, and would really appreciate any suggestions. Unless you're fed up with me/have lost the will to live... ;-)</b></i></u></div><br></div></div>
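<div style="font-family:tahoma,sans-serif" class="gmail_default">P.S. A minimal sketch of what the single byte 0x9c turns into under the codecs mentioned in this thread. It assumes the console output is in an OEM code page such as cp850 or cp437 (running <code>chcp</code> on the Windows machine would confirm which one); the function names and the <code>cp850</code> default are hypothetical choices for illustration only:</div>

```python
def show_candidates(raw):
    """Decode the same bytes with several codecs; None marks a failure."""
    results = {}
    for name in ("cp850", "cp437", "cp1252", "latin-1", "utf-8"):
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


def decode_output(raw, encoding="cp850"):
    """Decode sub-process bytes, substituting U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # ascii() keeps the printout safe on consoles that cannot show the result.
    for codec, text in show_candidates(b"\x9c").items():
        print(codec, ascii(text))
    print(ascii(decode_output(b"echo \x9c")))
```

<div style="font-family:tahoma,sans-serif" class="gmail_default">Only the two OEM code pages give the pound sign. cp1252 decodes the byte without error but yields a different character (U+0153), and latin-1 yields a control character, so "does not raise" is not the same as "decodes correctly". Passing <code>errors="replace"</code> at least keeps the reader slot from throwing when the guess is wrong.</div>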