P2845R2: Formatting of std::filesystem::path

Published Proposal,

Author:
Audience:
SG16
Project:
ISO/IEC JTC1/SC22/WG21 14882: Programming Language — C++

"The Tao is constantly moving, the path is always changing." ― Lao Tzu

1. Introduction

[P1636] "Formatters for library types" proposed adding a number of std::formatter specializations, including the one for std::filesystem::path. However, SG16 recommended removing it because of quoting and localization concerns. The current paper addresses these concerns and proposes adding an improved std::formatter specialization for path.

2. Changes from R1

3. Changes from R0

4. Problems

[P1636] proposed defining a formatter specialization for path in terms of the ostream insertion operator, which in turn formats the native representation wrapped in std::quoted. For example:

std::cout << std::format("{}", std::filesystem::path("/usr/bin"));

would print "/usr/bin" with quotes being part of the output.

Unfortunately, this has a number of problems, some of which were raised in the LWG discussion of the paper.

First, std::quoted only escapes the delimiter (") and the escape character itself (\). As a result the output may not be usable if the path contains control characters such as newlines. For example:

std::cout << std::format("{}", std::filesystem::path("multi\nline"));

would print

"multi
line"

which is not a valid string literal in C++ or in many other languages, most importantly shell languages. Such output is pretty much unusable and interferes with the formatting of ranges of paths.

Another problem is encoding. The native member function returns basic_string<value_type> where

value_type is a typedef for the operating system dependent encoded character type used to represent pathnames.

value_type is normally char on POSIX and wchar_t on Windows.

This function may perform encoding conversion per [fs.path.type.cvt].

On POSIX, when the target code unit type is char no conversion is normally performed:

For POSIX-based operating systems path::value_type is char so no conversion from char value type arguments or to char value type return values is performed.

This usually gives the desired result.

On Windows, when the target code unit type is char, the encoding conversion would result in invalid output. For example, trying to print the following path in Belarusian

std::print("{}\n", std::filesystem::path(L"Шчучыншчына"));

would result in the following output in the Windows console even though all code pages and localization settings are set to Belarusian and both the source and literal encodings are UTF-8:

"�����������"

The problem is that, although print and path both support Unicode, the intermediate conversion goes through CP1251 (the code page used for Belarusian), which is not even valid for printing in the console, which uses legacy CP866. This has been discussed at length in [P2093] "Formatted output".

5. Proposal

Both of the problems discussed in the previous section have already been solved. The escaping mechanism that can handle invalid code units has been introduced in [P2286] "Formatting Ranges" and encoding issues have been addressed in [P2093] and other papers. We apply those solutions to the formatting of paths.

This paper proposes adding a formatter specialization for path that does escaping similarly to [P2286] and Unicode transcoding on Windows. Additionally, it proposes giving the user control over escaping via format specifiers. The debug format (?) gives the escaped representation, while the default is unescaped and minimally processed, with only invalid code units substituted with replacement characters if necessary. This is consistent with the formatting of strings. The default format can be useful for displaying paths in a UI and gives the user control over whether and how to handle special characters. The debug format is useful for displaying paths as parts of a larger structure such as a range and prevents interfering with its formatting.

Comparison of [P1636] and this proposal:

Code:
  auto p = std::filesystem::path("/usr/bin");
  std::cout << std::format("{}", p);
P1636:
  "/usr/bin"
This proposal:
  /usr/bin

Code:
  auto p = std::filesystem::path("multi\nline");
  std::cout << std::format("{}", p);
P1636:
  "multi
  line"
This proposal:
  multi
  line

Code:
  auto p = std::filesystem::path("multi\nline");
  std::cout << std::format("{:?}", p);
P1636:
  ill-formed
This proposal:
  "multi\nline"

Code:
  // On Windows with UTF-8 as a literal encoding.
  auto p = std::filesystem::path(L"Шчучыншчына");
  std::print("{}\n", p);
P1636:
  "�����������"
This proposal:
  Шчучыншчына

This leaves only one question: how to handle invalid Unicode. Plain strings handle it by formatting ill-formed code units as hexadecimal escapes, e.g.

// invalid UTF-8, s has value: ["\x{c3}("]
std::string s = std::format("[{:?}]", "\xc3\x28");

This is useful because it doesn’t lose any information. But in the case of paths it is a bit more complicated because the string is in a different form and the mapping between ill-formed code units in one form and another may not be well-defined.

The current paper proposes applying hexadecimal escapes to the original ill-formed data when escaping because it gives a more intuitive result and doesn’t require non-standard mappings such as WTF-8 ([WTF]).

For example:

auto p = std::filesystem::path(L"\xd800"); // a lone surrogate
std::print("{:?}\n", p);

prints

"\u{d800}"

When not escaping, we propose substituting invalid code units with replacement characters, which is the recommended Unicode practice ([UNICODE-SUB]).

For example:

auto p = std::filesystem::path(L"\xd800"); // a lone surrogate
std::print("{}\n", p);

prints

�

6. Wording

Add to "Header <filesystem> synopsis" [fs.filesystem.syn]:

// [fs.path.fmt], formatter
template<class charT> struct formatter<filesystem::path, charT>;

Add a new section "Formatting" [fs.path.fmt] under "Class path" [fs.class.path]:

template<class charT> struct formatter<filesystem::path, charT> {
  constexpr format_parse_context::iterator parse(format_parse_context& ctx);

  template<class FormatContext>
    typename FormatContext::iterator
      format(const filesystem::path& p, FormatContext& ctx) const;
};

formatter<filesystem::path, charT> is debug-enabled ([format.formatter.spec]).

constexpr format_parse_context::iterator parse(format_parse_context& ctx);

Effects: Parses the format specifier as a path-format-spec and stores the parsed specifiers in *this.

path-format-spec:
  fill-and-align(opt) width(opt) ?(opt)

where the productions fill-and-align and width are described in [format.string].

Returns: An iterator past the end of the path-format-spec.

template<class FormatContext>
  typename FormatContext::iterator
    format(const filesystem::path& p, FormatContext& ctx) const;

Effects: Let s be

Writes s into ctx.out(), adjusted according to the path-format-spec. If charT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8, with invalid code units substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard, Chapter 3.9, U+FFFD Substitution in Conversion. If charT and path::value_type are the same, then no transcoding is performed. Otherwise, transcoding is implementation-defined.

Returns: An iterator past the end of the output range.

7. Implementation

The proposed formatter for std::filesystem::path has been implemented in the open-source {fmt} library ([FMT]).

8. Acknowledgements

Thanks to Mark de Wever for reviewing an early version of the paper and suggesting a number of fixes and improvements.

References

Informative References

[FMT]
Victor Zverovich; et al. The {fmt} library. URL: https://github.com/fmtlib/fmt
[P1636]
Lars Gullik Bjønnes. Formatters for library types. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1636r2.pdf
[P2093]
Victor Zverovich. Formatted output. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html
[P2286]
Barry Revzin. Formatting Ranges. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2286r8.html
[UNICODE-SUB]
The Unicode Consortium. The Unicode Standard Version 13.0 – Core Specification, Chapter 3.9, U+FFFD Substitution of Maximal Subparts. URL: https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf
[WTF]
Simon Sapin. The WTF-8 encoding. URL: https://simonsapin.github.io/wtf-8/