
The problem is not forcing strings to be UTF-8; the problem is treating filenames as strings.

Filenames are opaque blobs that can be lossily converted to strings for display if you know or can guess at the encoding.
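
To make that concrete, here is a minimal Python sketch, assuming a POSIX-style filename held as raw bytes and a guessed encoding of UTF-8. The helper names `display_name` and `roundtrip_name` are made up for illustration. The first is lossy and only fit for showing to a user; the second uses Python's `surrogateescape` error handler so the original bytes can be recovered.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Lossily decode a raw filename for display only (hypothetical helper)."""
    # Undecodable bytes become U+FFFD, so the original bytes cannot be recovered.
    return raw.decode(encoding, errors="replace")


def roundtrip_name(raw: bytes) -> str:
    """Decode reversibly: undecodable bytes are smuggled through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


raw = b"report-\xff.txt"            # a legal POSIX filename that is not valid UTF-8
shown = display_name(raw)           # safe to print, but lossy
kept = roundtrip_name(raw)          # awkward to print, but reversible
restored = kept.encode("utf-8", errors="surrogateescape")
```

Only `raw` (or `restored`) should go back to the filesystem; `shown` is for human eyes.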



Opaque, except for '\0', '/', and (to some extent) '.'.


Even those details are platform-specific though. If you want to be truly portable, you can't even assume that paths are byte arrays.

On Windows, the path separator is '\' and paths are arrays of 16-bit integers.


Windows is tricky. You can't have certain names like "con" (or "con.txt", "con.png", etc.), and some symbols aren't allowed either, like * and ?. Also, names can't end with a dot.
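
A rough sketch of those checks in Python, for a single path component. The function name `win32_name_problems` is hypothetical, and the full reserved-name and forbidden-character lists are taken from my reading of Microsoft's file-naming documentation rather than from this thread, so treat it as illustrative and not exhaustive.

```python
from typing import List

# Assumption: reserved device names as listed in Microsoft's naming docs.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
FORBIDDEN = set('<>:"/\\|?*')


def win32_name_problems(name: str) -> List[str]:
    """Return reasons why `name` is rejected by Win32 naming rules (sketch)."""
    problems = []
    # "con.txt" is reserved too: only the part before the first dot matters.
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    if stem in RESERVED:
        problems.append(f"reserved device name: {stem}")
    bad = sorted(set(name) & FORBIDDEN)
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    if any(ord(ch) < 32 for ch in name):
        problems.append("control character")
    if name.endswith((".", " ")):
        problems.append("ends with a dot or space")
    return problems
```

For example, `win32_name_problems("console.txt")` returns an empty list, while `"con.txt"`, `"what?.txt"` and `"name."` each return at least one reason.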

Other than some explicit exclusions, any wchar is valid whether or not it's valid Unicode. After all, NTFS and Windows date back to the times of UCS-2, when 16 bits was enough for any character™.

EDIT: Though I should hasten to add that it's a very strong convention that all paths be UTF-16 encoded. So much so that many official docs assert this to be true even though it technically isn't.


NTFS doesn't care if you have a file called "con", e.g. in PowerShell you can do:

    New-Item -ItemType File -Path "\\?\d:\con"
and get "D:\con", where you can't create it directly as "D:\con". It's the Win32 API which intercepts "con" for backwards compatibility, because it was a meaningful name in MS-DOS. But it's fine as a filesystem path.

There are other fun Windows/NTFS path things here: https://news.ycombinator.com/item?id=17307023 and in Google Project Zero's deep dive into Win32 and NT path handling: https://googleprojectzero.blogspot.com/2016/02/the-definitiv...


> So much so that many official docs assert this to be true even though it technically isn't.

Do you have any links for that? I've been working with winapi recently and have had a hell of a time getting some clear concrete statements about exactly what encoding (if any) is used in file paths.


https://docs.microsoft.com/en-us/windows/desktop/FileIO/nami...

> the file system treats path and file names as an opaque sequence of WCHARs.

In essence, I think you should use UTF-16 encoded strings when creating file paths. However, when reading them you can't assume any encoding (aside from the special characters mentioned in that article). For accessing the filesystem, just treat paths as an opaque blob of data. When displaying a name to the user, assume UTF-16 encoding but handle any decoding errors (e.g. by using replacement characters where necessary).
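
Roughly what that looks like in Python, as a sketch only. It assumes the name arrives as a sequence of 16-bit code units (the way a WCHAR buffer would hold it); the helper names `units_to_bytes` and `display_wide_name` are made up for illustration.

```python
import struct
from typing import Sequence


def units_to_bytes(units: Sequence[int]) -> bytes:
    """Pack 16-bit code units into little-endian bytes (hypothetical helper)."""
    return struct.pack(f"<{len(units)}H", *units)


def display_wide_name(units: Sequence[int]) -> str:
    """Decode for display only; ill-formed UTF-16 becomes replacement characters."""
    return units_to_bytes(units).decode("utf-16-le", errors="replace")


valid = [0x0061, 0xD83D, 0xDE00, 0x0062]   # "a", surrogate pair for U+1F600, "b"
broken = [0x0061, 0xD800, 0x0062]          # lone high surrogate: not valid UTF-16

shown_valid = display_wide_name(valid)
shown_broken = display_wide_name(broken)

# A strict decode of the broken name fails, which is why display code needs a fallback.
try:
    units_to_bytes(broken).decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The original `units` are what should be handed back to the filesystem; the decoded string is only for showing to the user.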


Oh, I meant, did you have any links from official docs that said UTF-16 was used?

Your advice is fine, but when the rest of the world is UTF-8 (including the regex engine), things become quite a bit trickier!


Oh I see. UTF-16 is the preferred encoding for all new applications: https://docs.microsoft.com/en-us/windows/desktop/intl/unicod...

Basically, in Windows land, Unicode means UTF-16 unless code pages are mentioned: https://docs.microsoft.com/en-us/windows/desktop/intl/code-p...


On Windows the path separator is U+005C. It's only displayed as a backslash in most code pages, not all: https://devblogs.microsoft.com/oldnewthing/20051014-20/?p=33... which links to a dead page; there's a copy here: http://archives.miloush.net/michkap/archive/2005/09/17/46994...

That doesn't change just because Unicode renders individual code pages obsolete. It's now special-cased into Windows that Japanese and Korean systems display U+005C as a currency symbol instead of a backslash.

There's also [System.IO.Path]::AltDirectorySeparatorChar which is `/` because Windows is often fine with / as a path separator as well.
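
Python exposes the same idea through its standard `ntpath` and `posixpath` modules, which can be imported on any OS. A small sketch, with made-up sample paths:

```python
import ntpath
import posixpath

# Windows rules: '\' is the separator and '/' is accepted as an alternate.
win_sep, win_alt = ntpath.sep, ntpath.altsep
# POSIX rules: '/' is the only separator; there is no alternate.
posix_sep, posix_alt = posixpath.sep, posixpath.altsep

# Under Windows rules both spellings split the same way...
head_fwd, tail_fwd = ntpath.split("C:/dir/file.txt")
head_back, tail_back = ntpath.split(r"C:\dir\file.txt")
# ...and normpath rewrites the alternate separator to the primary one.
normalized = ntpath.normpath("C:/dir/sub/../file.txt")

# Under POSIX rules a backslash is just part of the name.
posix_head, posix_tail = posixpath.split(r"C:\dir\file.txt")
```

So code that splits on one hard-coded separator is wrong on Windows in both directions: it misses `/` if it only looks for `\`, and vice versa.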



