Project C.3: Binary-to-text conversion

Suppose you have binary data that you would like to transmit in an e-mail. E-mail can contain only printable characters, and yet you may wish to send a binary file (e.g., a photograph). Similarly, PDF is a text-based format, but it must be able to store, for example, binary images.

The process of storing binary files as printable ASCII characters is called binary-to-text conversion. There are 95 printable ASCII characters: 0 through 31 are non-printable characters, 32 is the first printable character (the Space character ' '), '~' at 126 is the last, and 127 is the DEL or delete character. Now, four bytes can take on $256^4 = (2^8)^4 = 2^{32}$, or approximately 4.3 billion, different values. As $95^5$ is approximately $7.7$ billion, it is possible to store all $4$-byte binary values as five printable ASCII characters.

A development team at Adobe realized that $84^5 < 2^{32} < 85^5$, so all that was really needed was 85 characters instead of 95. They chose the characters from 33 ('!') through 117 ('u'). Thus, a Space will never appear in this conversion.
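
The inequality above can be checked at compile time. The following is only a minimal sketch: the helper name power and the constant two_to_32 are hypothetical, and 64-bit arithmetic is used because $85^5$ does not fit into 32 bits.

```cpp
#include <cstdint>

// Hypothetical helper: computes base^exponent using 64-bit arithmetic
constexpr std::uint64_t power( std::uint64_t base, unsigned int exponent ) {
    std::uint64_t result{ 1 };

    for ( unsigned int k{ 0 }; k < exponent; ++k ) {
        result *= base;
    }

    return result;
}

// 2^32 = 4294967296, the number of different four-byte values
constexpr std::uint64_t two_to_32{ std::uint64_t{ 1 } << 32 };

// 84^5 = 4182119424 is too small to cover every four-byte value...
static_assert( power( 84, 5 ) < two_to_32, "84^5 should be less than 2^32" );

// ...but 85^5 = 4437053125 is large enough
static_assert( power( 85, 5 ) > two_to_32, "85^5 should exceed 2^32" );
```

If either inequality were false, this code would not compile.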

Consequently, this conversion treats every four bytes as one 32-bit unsigned integer and writes that integer as five base-85 digits, each of which is stored as one printable character.

To convert an unsigned int to five base-85 characters, we do the following:

	unsigned int n{};

	// 'n' is assigned a value
	unsigned int b85_0{n % 85};
	n /= 85;
	unsigned int b85_1{n % 85};
	n /= 85;
	unsigned int b85_2{n % 85};
	n /= 85;
	unsigned int b85_3{n % 85};
	n /= 85;
	unsigned int b85_4{n % 85};

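	// Now, each of 'b85_4' (the most significant)
	// through 'b85_0' (the least significant) is a
	// value between 0 and 84

	// Cast each sum to 'char'; otherwise, the sum is an
	// 'unsigned int' and is printed as a number
	std::cout << static_cast<char>( '!' + b85_4 )
	          << static_cast<char>( '!' + b85_3 )
	          << static_cast<char>( '!' + b85_2 )
	          << static_cast<char>( '!' + b85_1 )
	          << static_cast<char>( '!' + b85_0 );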

Once the entire binary file is printed, Adobe chose to denote the end by using the '~' character.

This encoding is known as "Ascii85".
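
As a worked example, the conversion of one block can be wrapped in a function. This is only a sketch under stated assumptions: the names pack_bytes and encode_block are hypothetical, unsigned int is assumed to be 32 bits, and the four bytes are assumed to be combined with the first byte as the most significant, which the description above does not specify. Under these assumptions, the four characters of "Man " (77, 97, 110 and 32) form the integer 1298230816, the base-85 digits of which are 24, 73, 80, 78 and 61, giving the five characters 9jqo^.

```cpp
#include <array>

// Hypothetical helper: combines four bytes into one unsigned integer,
// assuming the first byte is the most significant
constexpr unsigned int pack_bytes( unsigned char b0, unsigned char b1,
                                   unsigned char b2, unsigned char b3 ) {
    return (static_cast<unsigned int>( b0 ) << 24)
         | (static_cast<unsigned int>( b1 ) << 16)
         | (static_cast<unsigned int>( b2 ) <<  8)
         |  static_cast<unsigned int>( b3 );
}

// Hypothetical helper: converts one 32-bit value into five printable
// characters with the most significant base-85 digit first
constexpr std::array<char, 5> encode_block( unsigned int n ) {
    unsigned int const b85_0{ n % 85 };
    n /= 85;
    unsigned int const b85_1{ n % 85 };
    n /= 85;
    unsigned int const b85_2{ n % 85 };
    n /= 85;
    unsigned int const b85_3{ n % 85 };
    n /= 85;
    unsigned int const b85_4{ n % 85 };

    // Each digit is between 0 and 84, so each character
    // is between '!' (33) and 'u' (117)
    return {{ static_cast<char>( '!' + b85_4 ),
              static_cast<char>( '!' + b85_3 ),
              static_cast<char>( '!' + b85_2 ),
              static_cast<char>( '!' + b85_1 ),
              static_cast<char>( '!' + b85_0 ) }};
}
```

For example, encode_block( pack_bytes( 'M', 'a', 'n', ' ' ) ) returns the five characters '9', 'j', 'q', 'o' and '^'.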

Your problem:

First, write a function

char *to_ascii( char *binary_string, std::size_t length );

If length is not a multiple of four, throw an exception (alternatively, you may read how this format deals with this here). Otherwise, allocate memory for an array of size 5*(length/4) + 2. Each block of four bytes is converted to five printable ASCII characters as described above, and these are stored in the array in order. The second-to-last entry must be the tilde character '~' and the last must be the null character '\0'. Return the address of that array.

If you print the returned character array with std::cout, it should print the ASCII string.
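
The size of the array can be illustrated with a short sketch. The name ascii_capacity is hypothetical, and std::invalid_argument is only one possible choice of exception, as the description above does not say which exception to throw.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: the number of characters allocated for
// 'length' bytes of binary data, including the '~' and '\0'
constexpr std::size_t ascii_capacity( std::size_t length ) {
    if ( length % 4 != 0 ) {
        // Assumption: any exception type would satisfy the requirement
        throw std::invalid_argument{ "length must be a multiple of four" };
    }

    // Five characters per four bytes, plus '~' and '\0'
    return 5*(length/4) + 2;
}

// Eight bytes become ten printable characters, followed by '~' and '\0'
static_assert( ascii_capacity( 8 ) == 12, "8 bytes require 12 characters" );
```

With eight bytes, entries 0 through 9 of the array hold the two encoded blocks, entry 10 holds '~' and entry 11 holds '\0'.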

Second, write a function

char *to_binary( char *ascii_string, std::size_t length );

If length is not two more than a multiple of five (that is, if length % 5 != 2), throw an exception. If the second-to-last character is not '~', also throw an exception. Otherwise, allocate memory for an array of size 4*(length/5). Each block of five printable ASCII characters is then converted to four bytes of binary data. You will, of course, ignore the '~' and '\0' characters. Return the address of that array.
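
Converting one block of five characters back to an integer reverses the steps used for encoding. The following is a minimal sketch: the name decode_block is hypothetical, unsigned int is assumed to be 32 bits, and the two checks (for characters outside '!' through 'u', and for blocks that represent a value of $2^{32}$ or more) are assumptions that go beyond what is required above. A 64-bit accumulator is used because the largest five-digit base-85 value, $85^5 - 1$, does not fit into 32 bits.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: converts five characters from '!' through 'u'
// (most significant base-85 digit first) back into the value they encode
constexpr unsigned int decode_block( char const *block ) {
    std::uint64_t n{ 0 };

    for ( int k{ 0 }; k < 5; ++k ) {
        if ( (block[k] < '!') || (block[k] > 'u') ) {
            throw std::invalid_argument{ "character outside '!' through 'u'" };
        }

        // Horner's rule: shift the digits so far by one base-85
        // position and add the next digit
        n = 85*n + static_cast<std::uint64_t>( block[k] - '!' );
    }

    if ( n > 0xFFFFFFFFu ) {
        throw std::range_error{ "block does not encode a 32-bit value" };
    }

    return static_cast<unsigned int>( n );
}
```

For example, decode_block( "9jqo^" ) returns 1298230816, which is the four bytes 77, 97, 110 and 32 if the first byte is taken to be the most significant.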

Side project

It happens that $93^{11} < 2^{8 \cdot 9} < 94^{11}$, so it is possible to store nine bytes using eleven base-94 digits. Replacing four bytes with five characters requires $\frac{5}{4} \cdot 100 \% = 125 \%$ of the memory, but replacing nine bytes with eleven characters requires only $\frac{11}{9} \cdot 100 \% \approx 122.222 \%$ of the memory. Thus, using Ascii85, one megabyte (1024 KiB) would require an additional 256 KiB of memory, while this technique would require only approximately 228 KiB, thus saving 28 KiB per megabyte.
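
The additional memory can be computed directly. This is a small sketch with hypothetical names (extra_kib, ascii85_extra, base94_extra and saving); it takes one megabyte to be 1024 KiB and ignores the character that marks the end of the message.

```cpp
// Hypothetical helper: the additional KiB needed to encode 1024 KiB of
// binary data when every 'bytes_in' bytes become 'chars_out' characters
constexpr double extra_kib( double bytes_in, double chars_out ) {
    return 1024.0*(chars_out - bytes_in)/bytes_in;
}

// Four bytes become five characters: exactly 256 KiB extra
constexpr double ascii85_extra{ extra_kib( 4.0, 5.0 ) };

// Nine bytes become eleven characters: approximately 227.6 KiB extra
constexpr double base94_extra{ extra_kib( 9.0, 11.0 ) };

// The difference is approximately 28.4 KiB per 1024 KiB
constexpr double saving{ ascii85_extra - base94_extra };
```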

Design question: 94 characters are required to represent a number in base 94, so as there are only 95 printable characters, which character would you choose to denote the end of the ASCII-encoded message?

The obvious choices are Space or ~, as these are at either end of the range of printable characters; however, if you wanted to always start a text-encoded binary message with a ", it may make sense to use all printable characters except ", thus allowing you to also end with a double quote.

By analogy with Ascii85, which uses 85 characters, this encoding could be called Ascii94.