Using reference to unicode function on unicode string containing non-ASCII characters throws UnicodeDecodeError


Calling unicode on a unicode string containing non-ASCII characters is supposed to return the same unicode string. This indeed works properly in both CPython and IronPython. However, when using a reference to unicode, an exception is thrown instead. The attached file contains asserts that pass under CPython but throw a UnicodeDecodeError on IronPython.
Example code to replicate behaviour:
IronPython 2.6.2 (2.6.10920.0) on .NET 2.0.50727.4952
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode_reference = unicode
>>> unicode_reference(u'\xe9')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: ('unknown', u'\xe9', 0, 1, '')
Under CPython, unicode and unicode_reference can be used interchangeably.
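The attached file is not reproduced here, but a check along these lines would exercise the same behaviour. This is only a sketch: the names sample, direct_result and alias_result, and the shim for interpreters without a unicode builtin, are assumptions added for illustration and are not part of the original report. On CPython both assertions should pass; on the affected IronPython versions the second call is where the UnicodeDecodeError would be expected.

```python
# Sketch of a reproduction check; this is not the original attachment.
try:
    unicode  # builtin on CPython 2 and IronPython 2.x
except NameError:
    unicode = str  # assumption: shim so the sketch also runs on Python 3

unicode_reference = unicode  # plain alias, as in the report
sample = u'\xe9'  # the non-ASCII character shown in the reported error

# Direct call: reported to work on both CPython and IronPython.
direct_result = unicode(sample)
assert direct_result == sample

# Call through the alias: reported to raise UnicodeDecodeError on IronPython.
alias_result = unicode_reference(sample)
assert alias_result == sample
```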
Workaround: Use unicode_reference = lambda x: unicode(x) instead.
Note: Using Jinja2 on IronPython with non-ASCII characters in a template will result in this error (at least as of Jinja2 v2.5). Replace to_string = unicode with to_string = lambda x: unicode(x) near the top of runtime.py to work around this issue.

Closed Dec 9, 2014 at 7:52 PM by jdhardy
Migrated to GitHub.


jdhardy wrote Nov 23, 2010 at 6:49 AM

IronPython special-cases calls to unicode directly, which is why the alias behaves differently; see the mailing list thread starting from http://lists.ironpython.com/htdig.cgi/users-ironpython.com/2010-February/012063.html for an idea of why. Basically, either option is broken.

One workaround for Jinja could be: to_string = unicode if str is not unicode else lambda x: unicode(x); in IronPython, str is unicode.
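Spelled out as a standalone snippet, that conditional might look like the following. The try/except shim and the converted variable are assumptions added so the sketch runs on its own; to_string is the name quoted from runtime.py above. Whether the lambda branch avoids the error on every IronPython version is not confirmed in this thread.

```python
try:
    unicode  # builtin on CPython 2 and IronPython 2.x
except NameError:
    unicode = str  # assumption: shim for Python 3, not part of the suggestion

# Where str and unicode are distinct types (CPython 2), keep the bare alias.
# Where str is unicode (as on IronPython), wrap the call in a lambda so that
# unicode is called directly instead of through a reference.
to_string = unicode if str is not unicode else (lambda x: unicode(x))

converted = to_string(u'\xe9')
assert converted == u'\xe9'
```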

pacrook wrote Jan 24, 2014 at 5:39 PM

I hit the same issue trying to use unicode with map. The following works in Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)]
>>> print '\t'.join(map(unicode, [u'\xe5', 1.0]))
å 1.0
but in IronPython 2.7.4 ( on .NET 4.0.30319.34003 (64-bit)
>>> print '\t'.join(map(unicode, [u'\xe5', 1.0]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: ('unknown', u'\xe5', 0, 1, '')
The problem arises from the map(unicode, ...) call itself, e.g.
>>> print map(unicode, [u'\xe5', 1.0])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: ('unknown', u'\xe5', 0, 1, '')
Introducing a lambda wrapper as suggested above fixes the IronPython version of the code:
>>> print '\t'.join(map(lambda x:unicode(x), [u'\xe5', 1.0]))
å 1.0
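A list comprehension may be another way to keep the call to unicode direct, since the name is then called in place rather than passed to map as a reference. This is an assumption based on jdhardy's explanation above and has not been verified on IronPython in this thread; the names values and joined, and the shim for interpreters without a unicode builtin, are illustrative.

```python
try:
    unicode  # builtin on CPython 2 and IronPython 2.x
except NameError:
    unicode = str  # assumption: shim so the sketch also runs on Python 3

values = [u'\xe5', 1.0]

# unicode is called directly on each element instead of being handed to map.
joined = u'\t'.join([unicode(v) for v in values])
```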