
Should I upload my .html files in binary mode?

Note: Another post courtesy of Jukka Korpela

When you transfer in binary mode, your file will be exactly the same on the destination computer as it is on the source, byte for byte, so the file sizes will be the same. In ASCII mode, the transfer converts line endings to match the conventions of the destination computer. On classic Mac OS, lines of text are terminated with a carriage return (CR), ASCII character #13. On Unix systems, lines are terminated with the line feed (LF, "new line") character, #10. On Windows, they're terminated by both: CR LF.
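
As a rough illustration of the line-ending conversion described above, here is a minimal Python sketch. The function name `convert_line_endings` and the labels "mac", "unix" and "windows" are made up for this example; a real FTP client does the equivalent work internally, and not necessarily this way.

```python
# Byte values named in the text: CR is character 13, LF is character 10.
CR, LF = b"\r", b"\n"

# Hypothetical labels for the three conventions described above.
CONVENTIONS = {"mac": CR, "unix": LF, "windows": CR + LF}


def convert_line_endings(data: bytes, target: str) -> bytes:
    """Rewrite every CR, LF, or CR LF line ending in the target convention."""
    eol = CONVENTIONS[target]
    # Collapse CR LF first so a pair is not counted as two line endings.
    unified = data.replace(CR + LF, LF).replace(CR, LF)
    return unified.replace(LF, eol)


unix_text = b"<html>\n<body>\n</body>\n</html>\n"
windows_text = convert_line_endings(unix_text, "windows")

# The converted copy is one byte longer per line, which is why file sizes
# can differ after an ASCII-mode transfer but never after a binary one.
print(len(unix_text), len(windows_text))
```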

For example, if you create a document on a Mac, it will most probably be in Mac's own character encoding. You won't notice anything if you use Ascii characters only, but if you use accented letters or other non-Ascii stuff, the encoding really matters. Ascii mode of transfer in FTP should take care of this too - so if the document contains, say, letter o with umlaut in Mac encoding, and the file is transferred to a Unix or Windows server, the octet (byte) will be replaced by a value that represents o with umlaut in iso-8859-1 encoding. And everyone will be happy. Well, assuming the FTP client performs the _right_ conversion.
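
To make the octet replacement concrete, here is a small Python sketch of such a conversion. It assumes that "Mac encoding" means Mac OS Roman (Python's `mac_roman` codec); the helper name `mac_to_latin1` is invented for this example, and whether a given FTP client does anything like this is a separate question.

```python
def mac_to_latin1(data: bytes) -> bytes:
    """Re-encode octets from Mac OS Roman to ISO-8859-1."""
    # Assumption: the source really is Mac OS Roman. A character that has no
    # ISO-8859-1 equivalent makes encode() raise UnicodeEncodeError.
    return data.decode("mac_roman").encode("iso-8859-1")


mac_bytes = "ö".encode("mac_roman")
latin1_bytes = mac_to_latin1(mac_bytes)

# Same letter, different octet in each encoding.
print(mac_bytes.hex(), latin1_bytes.hex())

# The Ascii part is the same in both encodings, so it passes through as is.
print(mac_to_latin1(b"<title>Hello</title>"))
```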

(This illustrates why "ASCII mode" is a misnomer. The Ascii character code has been and is so widely used that "ASCII" is often a synonym for "text", or often "plain text". The character code conversions performed in "ASCII mode" are mostly relevant for _non_-Ascii characters only, since the Ascii part is usually invariant between encodings - Ebcdic probably being the only exception still alive.)

To see why this matters, retrieve a text file from a Unix server in binary mode and put it on a Windows machine. Open the file in Notepad and notice that you end up with very long lines, and instead of line breaks, you get boxes. Windows is looking for CR LF sequences, but all it's finding is LF, so it never breaks the line. Using binary mode suppressed the conversion, so you got the file in Unix format instead of Windows format.

Or do the opposite, putting a file created on Windows onto a Unix server in binary mode - this is what I have seen people have problems with. An editor like Emacs will show ^M at the end of every line, since the editor sees a Carriage Return ("control-M") right before Line Feed, which is the Unix newline. In fact people could just ignore that, since the Carriage Returns won't hurt in browsing, but all those ^M look inconvenient. (It's easy to remove them using Emacs commands but it's a bit clumsy to have to do that.)
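
If you are unsure which convention a transferred file ended up with, you can count the line endings. The following Python sketch does that; the function name `count_line_endings` and the labels in the result are made up for illustration.

```python
import re

# CR LF is listed first so that a pair is matched as one line ending.
EOL_PATTERN = re.compile(rb"\r\n|\r|\n")
LABELS = {b"\r\n": "CR LF", b"\n": "LF", b"\r": "CR"}


def count_line_endings(data: bytes) -> dict:
    """Count how many line endings of each kind occur in the data."""
    counts = {"CR LF": 0, "LF": 0, "CR": 0}
    for match in EOL_PATTERN.finditer(data):
        counts[LABELS[match.group()]] += 1
    return counts


# A Windows file sent to Unix in binary mode keeps its CR LF pairs,
# which is what Emacs displays as ^M at the end of each line.
print(count_line_endings(b"<p>one</p>\r\n<p>two</p>\r\n"))
```

A file that reports only "LF" on a Windows machine, or only "CR LF" on a Unix server, was most likely moved in binary mode.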

Technically, HTML does care in the sense that if we regard HTML as an "SGML application" (which wasn't ever much more than just lip service), then the correct method would be to start every line with Line Feed and end every line with Carriage Return. This would resemble the Windows convention with an additional leading LF and with just CR (not CR LF) at the end of the last line. But this is just the theory; even the HTML 2.0 specification first said (in 4.2.2):

"SGML specifies that a text entity is a sequence of records, each beginning with a record start character and ending with a record end character (code positions 10 and 13 respectively)

[MIME] specifies that a body of type `text/*' is a sequence of lines, each terminated by CRLF, that is, octets 13, 10."

which makes it clear that there is a contradiction; and the spec then presented the syncretistic practical solution:

"In practice, HTML documents are frequently represented and transmitted using an end of line convention that depends on the conventions of the source of the document; frequently, that representation consists of CR only, LF only, or a CR LF sequence. Hence the decoding of the octets will often result in a text entity with some missing record start and record end characters.

Since there is no ambiguity, HTML user agents are encouraged to infer the missing record start and end characters.

An HTML user agent should treat end of line in any of its variations as a word space in all contexts except preformatted text. Within preformatted text, an HTML user agent should treat any of the three common representations of end-of-line as starting a new line."
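
The rule in that last quoted paragraph can be sketched in a few lines of Python. This is only an illustration of the recommendation, not how any particular browser is implemented; the function name `interpret_eol` and its `preformatted` flag are invented for the example.

```python
import re

# Any of the three common end-of-line representations; CR LF is tried first.
EOL = re.compile(r"\r\n|\r|\n")


def interpret_eol(text: str, preformatted: bool = False) -> str:
    """Treat end of line as a word space, or as a new line in preformatted text."""
    return EOL.sub("\n" if preformatted else " ", text)


mixed = "one\r\ntwo\rthree\nfour"
print(interpret_eol(mixed))
print(interpret_eol(mixed, preformatted=True))
```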

And luckily it seems that browsers obey those recommendations. But, as stated, the newline representation can still be relevant when _editing_ files, or processing them with software other than "HTML user agents", so "Ascii mode" of transfer is advisable.
