Unicode characters in filenames are not preserved #1443

Closed
rstelzleni opened this issue Feb 4, 2021 · 11 comments

Comments

@rstelzleni
Contributor

Description of Issue

Recently we've run into some issues with USD files that have non-ASCII characters in their filenames. We initially found that a USD file with an umlaut in the filename opened fine on Linux but failed to open on Windows. I noticed that ArchOpenFile calls fopen, which might explain that difference: on Linux, fopen accepts a utf8 filename, but the Windows version doesn't.

Based on that, I wasn't sure whether USD is expected to handle utf8 filenames. I did some testing, and it seems you can use utf8 filenames in sublayer lists and in references, and things work. However, if I create such a file by hand, open it, and save it out, any non-ASCII characters get stripped. For example, this:

#usda 1.0

def "Sphere" (
    references = @Grüße/Löwenbräu.usd@</Sphere>
)
{
}

becomes this:

#usda 1.0

def "Sphere" (
    references = @Gre/Lwenbru.usd@</Sphere>
)
{
}

I didn't test this with string valued attributes or asset paths in USD files.
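The saved path above is what you would get if every byte outside the 7-bit ASCII range were simply dropped from the UTF-8 string. A minimal sketch of that behaviour follows; `StripNonAscii` is a hypothetical helper written only to model the symptom, not code taken from USD, and the actual cause inside the writer may differ.

```cpp
#include <string>

// Hypothetical helper (not part of USD): drops every byte >= 0x80.
// In UTF-8, all bytes of a non-ASCII character have the high bit set,
// so this removes each such character entirely.
std::string StripNonAscii(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char c : utf8) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Applied to the UTF-8 bytes of `Grüße/Löwenbräu.usd`, this returns `Gre/Lwenbru.usd`, matching the reference that was written out.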

I posted this to the mailing list, and Alex Mohr suggested filing an issue for tracking this case.

Steps to Reproduce

  1. Create a USD file and give it a name containing some non-ASCII (utf8) characters.
  2. Attempt to open it on Windows; the open will fail.
  3. On Linux, it will open. Create another USD file that references it.
  4. Check that you can open that second USD file and that the reference works.
  5. Save that USD file; the non-ASCII characters are stripped from the reference.

System Information (OS, Hardware)

Tested on Windows 10 and Ubuntu 20.04

Package Versions

USD 21.02

@jilliene

jilliene commented Feb 9, 2021

Filed as internal issue #USD-6560

@rstelzleni
Contributor Author

Hello! When I originally filed this issue it sounded like Pixar was already working on utf8 support. I just wanted to check in to see if there is a plan for that. Would we expect to see it in 21.04, for instance?

Thanks!

@spiffmon
Member

Hi @rstelzleni ! Yes, we are working on it, and had hoped to have it ready for 21.05; however, some pretty intense production priorities have intervened, which pushes this out to 21.08.

@rstelzleni
Contributor Author

Sounds good, thanks Spiff. If there are changes that we could cherry pick and test out sooner we'd be happy to get some testing done. I didn't see anything in the current public repo. We could also help with implementation under Pixar's direction if that helps, just let us know!

@DDoS

DDoS commented Apr 15, 2021

Hi, we've also had problems with non-ASCII characters in paths when using this library on Windows. Enabling the UTF-8 codepage by default as described here seems to fix the problems, although this affects the entire application. For our case this was acceptable. Hope this helps.

@rstelzleni
Contributor Author

I'm excited to see the new support rolling out soon! I did a few builds off of dev and tested out opening files on different platforms. I can open and save files with utf8 names, and containing utf8 characters on Mac and Linux, and so far everything seems to work. On Windows, files containing utf8 characters open correctly and the contents save out correctly, but I've found that I can't open files with unicode characters in the filename. I tried opening directly, and also as a reference.

I have mostly been trying with Python scripts and in usdview. I suspect it might work if I did what DDoS suggested in the previous comment. Have you been able to open files on Windows with Unicode filenames? I didn't see any in the tests, but I might have just missed them.

@aloysbaillet
Contributor

Thanks to @gitamohr 's changes asset paths can now indeed be utf-8, and that's great!

Unfortunately, the changes are not quite sufficient to actually open USD files that contain Unicode characters in their file names on Windows (it seems fine on Linux!).

On a recent windows box (and as described here: https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) one can patch python.exe (using the described mt -manifest utf8manifest.xml -outputresource:python.exe;#1 command) and get usdcat to also work on referenced unicode paths, but this manifest-based "code page" change is not very portable.
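To tell at runtime whether the manifest (or a global setting) took effect, one option is to query the active ANSI code page. The sketch below assumes the Win32 `GetACP` function and the `CP_UTF8` constant (65001); the helper name `NarrowApisUseUtf8` is hypothetical, and returning `true` on other platforms is an assumption based on paths being passed through as raw bytes there.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reports whether char-based (A-suffix) Win32 calls
// would interpret their string arguments as UTF-8.
bool NarrowApisUseUtf8()
{
#ifdef _WIN32
    // GetACP returns the process's active ANSI code page identifier.
    return GetACP() == CP_UTF8;
#else
    // Assumption: non-Windows platforms pass path bytes through unchanged.
    return true;
#endif
}
```

A check like this only diagnoses the situation; it does not make the code portable to machines where the code page has not been changed.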

The proper fix is probably to define UNICODE and change all the Win32 calls to pass wchar_t, using the functions with the W suffix instead of the A suffix (or _wfopen instead of fopen).
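As a rough illustration of that approach, here is a minimal sketch of a file-open wrapper that takes a UTF-8 path and calls `_wfopen` on Windows and `fopen` elsewhere. The names `OpenUtf8File` and `Utf8ToWide` are hypothetical and this is not how ArchOpenFile is actually written; the conversion uses the Win32 `MultiByteToWideChar` call, and the Windows branch is untested here.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Convert UTF-8 to UTF-16 via Win32; returns an empty string on failure.
static std::wstring Utf8ToWide(const std::string& s)
{
    if (s.empty()) return std::wstring();
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      s.data(), static_cast<int>(s.size()),
                                      nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring w(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        s.data(), static_cast<int>(s.size()), &w[0], n);
    return w;
}
#endif

// Hypothetical wrapper: opens a file whose path is UTF-8 encoded.
// Returns nullptr if the path cannot be converted or the open fails.
FILE* OpenUtf8File(const std::string& utf8Path, const std::string& mode)
{
#ifdef _WIN32
    const std::wstring wpath = Utf8ToWide(utf8Path);
    const std::wstring wmode = Utf8ToWide(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // POSIX fopen takes the path bytes as-is.
    return std::fopen(utf8Path.c_str(), mode.c_str());
#endif
}
```

The advantage over the manifest route is that the behaviour no longer depends on the process's code page.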

@spiffmon could you tell us if this is something that you are already planning on doing? I've just started looking at the issue, and it seems doable, but there might be performance concerns when converting all filenames using this snippet:

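#include <codecvt>
#include <locale>
#include <string>

// Convert a UTF-8 string to a wide string for the W-suffix Win32 functions.
std::wstring GetWideStr(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes(s);  // throws std::range_error on invalid input
}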
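Note that `std::wstring_convert` and `std::codecvt_utf8_utf16` have been deprecated since C++17. One possible alternative is a small hand-written decoder, sketched below. `Utf8ToUtf16` is a hypothetical name, and it produces `std::u16string` rather than `std::wstring` so that it behaves the same on platforms where `wchar_t` is 32 bits; on Windows the result would still need to be handed to the wide APIs as `wchar_t`. No performance measurements were made for it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 code units.
// Returns false on malformed input (truncated or overlong sequences,
// encoded surrogates, or code points above U+10FFFF).
bool Utf8ToUtf16(const std::string& in, std::u16string* out)
{
    out->clear();
    out->reserve(in.size());  // UTF-16 never needs more units than UTF-8 bytes
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;      // decoded code point
        std::size_t extra = 0;     // number of continuation bytes expected
        std::uint32_t minCp = 0;   // smallest code point valid for this length
        if (b0 < 0x80)                { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; minCp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > n - i - 1) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        if (cp < 0x10000) {
            out->push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out->push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out->push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return true;
}
```

Since file paths are typically short, a single pass like this is unlikely to dominate the cost of the file system call that follows, though that would need measuring.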

@spitzak

spitzak commented Jul 26, 2021 via email

@spiffmon
Member

@gitamohr will be following up on this shortly, @aloysbaillet , and thanks for the info, @spitzak !

@aloysbaillet
Contributor

Thanks @spitzak. Indeed, the current code can be used successfully if the Windows ANSI code page is changed, either globally or through the manifest on specific executables. Alternatively, fopen can be made to accept UTF-8 by adding , ccs=UTF-8 at the end of the flag string, which might be preferable to converting to wchar_t.
However, the TfPathExists function that USD uses to check whether a file path exists calls Tf_HasAttribute, which in turn uses the Win32 GetFileAttributes function. When compiled without #define UNICODE, that resolves to GetFileAttributesA, which only supports UTF-8 if the code page is changed (globally or by the executable manifest) and does not support long paths. So I would argue this one needs changing either way?
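For illustration, an existence check that does not depend on whether UNICODE is defined could call the W variant explicitly. The sketch below is a hypothetical helper, `PathExistsUtf8`, not the actual TfPathExists or Tf_HasAttribute code. It assumes the Win32 `MultiByteToWideChar` and `GetFileAttributesW` calls, falls back to `stat` on other platforms, and does not add the `\\?\` prefix that very long paths may need.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/stat.h>
#endif

// Hypothetical helper: returns true if the UTF-8 encoded path exists.
bool PathExistsUtf8(const std::string& utf8Path)
{
    if (utf8Path.empty()) return false;
#ifdef _WIN32
    const int len = static_cast<int>(utf8Path.size());
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8Path.data(), len, nullptr, 0);
    if (n <= 0) return false;  // not valid UTF-8
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8Path.data(), len, &wide[0], n);
    // Explicit W call: independent of the UNICODE macro and the code page.
    return GetFileAttributesW(wide.c_str()) != INVALID_FILE_ATTRIBUTES;
#else
    struct stat st;
    return ::stat(utf8Path.c_str(), &st) == 0;
#endif
}
```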

@aloysbaillet
Contributor

Here are the changes that make unicode work well on windows: #1580
FYI @gitamohr
