[Eric] Issue with non-English text in Django plugin

Sun Jun 20 12:42:21 BST 2010

> There might be two different issues.
>
> 1. The encoding was set to utf-8-default, which means, no suitable encoding
> was detected and eric simply chose the default setting. My question is, was
> enhanced encoding detection activated on the Editor->Filehandling page of the
> config dialog? What is the correct encoding of the file? Would you please send
> it.
>
> 2. The styling may be wrong. Please check whether selecting "Alternative:
> Django/Jinja" as the lexer language (via the editor context menu) gives
> correct highlighting.
>
> Regards,
> Detlev
>

1. The correct file encoding is utf-8 (the system-wide locale), and utf-8 is
also set on the Filehandling page. Moreover, I have tried turning off encoding
detection, with no effect. Here is my file: http://www.mediafire.com/?fdytlqdb51z

2. Because I don't have the Jinja plugin installed, I tried other lexers,
such as HTML/PHP, Python and others. All of them work fine with the same
file(s): there is no nice Django highlighting, of course, but the string
mangling is gone too.

3. Finally, I'd like to draw your attention to my previous messages: the
problem seems to go away if I use the str() function instead of unicode().
According to http://boodebr.org/main/python/all-about-python-and-unicode ,
Python may return unexpected values from len() when it is called on unicode
strings. Now I see that str() works only because it encodes the unicode
string to ASCII, so it likely won't work with, for example, a Japanese
locale, which has no ASCII representation.
Look at the short demo I prepared:
http://img571.imageshack.us/img571/4782/snapshot2g.png (I show only the first
"block" tag; the text below it is unimportant):

   1. Everything works great until I use Russian.
   2. For example, English works fine.
   3. I typed one Russian character. At this point the lexer did not fully
   highlight the closing tag; note that exactly one character is highlighted
   wrongly.
   4. I typed a second character, and you can see that the lexer now leaves
   two characters unhighlighted.
   5. Each added Russian character causes the lexer to "forget" to highlight
   one more character in the closing tag.
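
The steps above can be reproduced outside the editor with a small sketch. It
assumes that the buffer is stored as UTF-8 and that styling lengths are taken
from character counts; unstyled_tail and block are hypothetical helpers, not
part of eric or Pygments.

```python
def unstyled_tail(text, encoding="utf-8"):
    """Return how many bytes stay unstyled if lengths are counted in characters."""
    return len(text.encode(encoding)) - len(text)


def block(body):
    # Hypothetical template fragment, similar to the one in the screenshot.
    return u"{% block %}" + body + u"{% endblock %}"


# u"\u044f" is the Cyrillic letter "ya", which takes two bytes in UTF-8.
for body in (u"abc", u"\u044f", u"\u044f\u044f"):
    print(len(body), unstyled_tail(block(body)))
```

The shortfall is 0 for the English body and grows by one byte for each
Russian letter, which matches steps 2 to 5.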

So, here is my explanation:
One Russian letter in my case takes two bytes.
len(unicode(one_russian_letter)) returns, as expected, 1. But the lexer
apparently assumes that one letter takes one byte, so we get a one-byte
shift and character corruption (did you notice that the lexer corrupts only
an odd number of characters?). In other words, the lexer misinterprets the
length of non-English strings. If I replace unicode() with str(), the strings
are encoded as one-byte ASCII and the lexer works fine.
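
The difference between the two lengths can be checked directly. This is only
a sketch, assuming the text is encoded as UTF-8; char_and_byte_len is a
hypothetical helper name.

```python
def char_and_byte_len(text, encoding="utf-8"):
    """Return (number of characters, number of encoded bytes) for text."""
    return len(text), len(text.encode(encoding))


# u"\u044f" is the Cyrillic letter "ya"; u"z" is plain ASCII.
russian = char_and_byte_len(u"\u044f")
english = char_and_byte_len(u"z")
print(russian, english)
```

For the Russian letter the two numbers differ (one character, two bytes),
while for the English letter they are the same.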
So, this leads to two conclusions:

   1. Eric's lexer subsystem treats strings as plain ASCII strings, not
   unicode.
   2. Other lexers (Python, HTML/PHP) work fine because they convert strings
   to ASCII.
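
If the first conclusion holds, one possible way to keep unicode() would be to
convert each token's length to bytes before styling. This is a minimal
sketch, assuming the editor counts styling lengths in bytes of a UTF-8
buffer; byte_lengths and the token list are hypothetical, and the token names
only imitate what a Pygments-style lexer might yield.

```python
def byte_lengths(tokens, encoding="utf-8"):
    """Yield (token_type, length_in_bytes) for (token_type, text) pairs."""
    for token_type, txt in tokens:
        yield token_type, len(txt.encode(encoding))


# Hypothetical token stream for a block containing two Russian letters.
tokens = [
    ("Tag", u"{% block %}"),
    ("Text", u"\u044f\u044f"),
    ("Tag", u"{% endblock %}"),
]

lengths = list(byte_lengths(tokens))
total_chars = sum(len(txt) for _, txt in tokens)
total_bytes = sum(n for _, n in lengths)
print(lengths, total_chars, total_bytes)
```

Here the character total is 27 and the byte total is 29; the two-byte
difference is exactly the part of the closing tag that stays unhighlighted.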

Please correct me if I'm wrong.

P.S.: Exactly the same thing happens if I don't replace unicode() with str()
but add .encode() to it in the styleText method, so that the line now looks
like "for token, txt in
self.__lexer.get_tokens(unicode(self.editor.text()).encode()[:end + 1]):"
(without the quotes).
Also, please excuse my not-so-good English and my habit of writing long,
boring text.