A few years ago I worked on a project for a company that used a base64 class to encode a binary object into XML. It was a class only in the loosest of senses, being a simple wrapper around a pair of static C functions. It also made some assumptions that were painfully obvious as soon as I saw them, the biggest of which was that the output buffer would always be large enough to hold the resulting stream.
static int encode(const char* bytes_to_encode, unsigned int in_len, char* encoded_bytes, unsigned int out_len);
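To make the buffer assumption concrete: padded base64 turns every group of up to three input bytes into exactly four characters, so the space needed can be worked out before encoding. The following is only a sketch; the names encoded_size, fits and require_capacity are mine for illustration and were not part of the class at work.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of characters that padded base64 produces
// for in_len input bytes (4 characters per started group of 3 bytes).
constexpr std::size_t encoded_size(std::size_t in_len)
{
    return ((in_len + 2) / 3) * 4;
}

// Hypothetical check: would the encoded text fit in out_len characters?
inline bool fits(std::size_t in_len, std::size_t out_len)
{
    return encoded_size(in_len) <= out_len;
}

// Hypothetical guard a caller could run before a C-style encode() call.
inline void require_capacity(std::size_t in_len, std::size_t out_len)
{
    assert(fits(in_len, out_len) && "output buffer too small for the base64 text");
}
```

A caller holding a fixed buffer could call require_capacity(in_len, out_len) first instead of trusting that there is always enough room.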
Also it was painful to use: the code we were writing had std::string and std::vector objects all over the place, and passing them as pointers with offsets was a pain in the proverbial. So I threw together some unit tests with the help of Wikipedia. Once I had the tests in place I set about improving the interface. This became templated with an input container and an output container. The result looked something like:
template<class InContainer, class OutContainer>
static int encode(InContainer& bytes_to_encode, OutContainer& encoded_bytes);
Since this was the property of the company I was working for and I wanted one for myself, I sat down and wrote one from scratch, different to the one I had used at work (which to be honest looked like it could have been grabbed from any one of a dozen places on the ‘Net).
Complete, this sat in my Subversion repository until recently. Dusting it off I saw that the code could be improved to look more like the STL algorithm library with iterators instead of containers. The first pass looked like this:
template<class InputIt, class OutputIt>
static OutputIt encode(InputIt init_first, InputIt init_last, OutputIt outit_first);
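This is very similar to the transform function from the STL. Could I create one that allows us to use the transform function itself? Checking what transform does against what RFC 4648 says about base64/32/16, we can easily see they are incompatible: transform writes exactly one output element per input element, whereas base64 turns every three input bytes into four output characters and may add padding at the end. However, we can work with the for_each() function.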
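As a rough illustration of why for_each() fits, here is a minimal sketch of a stateful function object that for_each can feed one byte at a time. It is not the code from my repository: the names b64_sink, finish and to_base64 are made up for this example, and it handles the standard base64 alphabet only. The point to notice is that for_each returns the function object it used, so the leftover bytes and padding can be flushed afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical functor: collects bytes until it has a group of three,
// then writes four base64 characters to the output iterator.
template <class OutputIt>
class b64_sink
{
public:
    explicit b64_sink(OutputIt out) : out_(out), buf_(0), count_(0) {}

    // Called by std::for_each once per input byte.
    void operator()(unsigned char byte)
    {
        buf_ = (buf_ << 8) | byte;
        if (++count_ == 3) { emit(4); buf_ = 0; count_ = 0; }
    }

    // Flushes a trailing group of one or two bytes and adds '=' padding.
    OutputIt finish()
    {
        if (count_ != 0) {
            const int chars = count_ + 1;   // 1 byte -> 2 chars, 2 bytes -> 3 chars
            buf_ <<= 8 * (3 - count_);      // left-align the bytes within 24 bits
            emit(chars);
            for (int i = chars; i < 4; ++i) *out_++ = '=';
            buf_ = 0; count_ = 0;
        }
        return out_;
    }

private:
    // Writes the first `chars` 6-bit groups of the 24-bit buffer.
    void emit(int chars)
    {
        static const char alphabet[] =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        assert(chars >= 1 && chars <= 4);
        for (int i = 0; i < chars; ++i)
            *out_++ = alphabet[(buf_ >> (18 - 6 * i)) & 0x3F];
    }

    OutputIt out_;
    unsigned long buf_;
    int count_;
};

// Hypothetical convenience wrapper around the for_each call.
inline std::string to_base64(const std::string& in)
{
    typedef b64_sink<std::back_insert_iterator<std::string> > sink_type;
    std::string out;
    // for_each hands back the functor it used, which holds any leftover bytes.
    sink_type sink = std::for_each(in.begin(), in.end(), sink_type(std::back_inserter(out)));
    sink.finish();
    return out;
}
```

With this sketch, to_base64("foo") gives "Zm9v" and to_base64("f") gives "Zg==", matching the RFC 4648 test vectors.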
Rather than hack together a bit of code I decided to rewrite it from the ground up. I started off with a helper function. It originally looked like this:
template < class OutContainer >
encoder_<OutContainer> encoder(OutContainer& c)
But I eventually changed it to use an output iterator. The change allows the encoded stream to be injected directly into the transport stream, rather than only being pushed onto a container afterwards via an insert_iterator.
template< class OutputIt >
encoder_<OutputIt> encode(OutputIt ref)
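To show what the output iterator buys us, here is a small sketch in which the same templated code writes either into a std::string or straight into a stream. I have used base16 only because it is the shortest of the RFC 4648 encodings; hex_encode, hex_to_string and hex_to_stream are hypothetical names for this example, and the ostringstream merely stands in for a real transport stream.

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical iterator-based base16 encoder: two characters per input byte.
template <class InputIt, class OutputIt>
OutputIt hex_encode(InputIt first, InputIt last, OutputIt out)
{
    static const char digits[] = "0123456789ABCDEF";
    for (; first != last; ++first) {
        const unsigned char byte = static_cast<unsigned char>(*first);
        *out++ = digits[byte >> 4];    // high nibble
        *out++ = digits[byte & 0x0F];  // low nibble
    }
    return out;
}

// Destination 1: append to a container through a back_insert_iterator.
inline std::string hex_to_string(const std::string& in)
{
    std::string out;
    hex_encode(in.begin(), in.end(), std::back_inserter(out));
    return out;
}

// Destination 2: write directly into a stream, between other output.
inline std::string hex_to_stream(const std::string& in)
{
    std::ostringstream transport;  // assumption: stands in for the real transport
    transport << "<blob>";
    hex_encode(in.begin(), in.end(), std::ostreambuf_iterator<char>(transport));
    transport << "</blob>";
    assert(transport.good());
    return transport.str();
}
```

Nothing in hex_encode knows where the characters end up, which is the property I wanted from the real encoder.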
I decided to call it encoder and place it in the encoder namespace, as I may wish in the future to extend it with a Motorola SREC encoder.
Before we go much further we need to write a few tests to verify that what we are doing is correct. As mentioned earlier, both the Wikipedia page and RFC 4648 are very good places to get ‘inspiration’ for the test matter.
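For example, section 10 of RFC 4648 lists base64 test vectors built from the word "foobar". A minimal, framework-free sketch of a table-driven check might look like the following; test_vector, count_failures and require_all_pass are hypothetical names, and the encoder is passed in as any callable that takes and returns a std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct test_vector { const char* plain; const char* encoded; };

// Base64 test vectors from section 10 of RFC 4648.
static const test_vector rfc4648_base64[] = {
    { "",       ""         },
    { "f",      "Zg=="     },
    { "fo",     "Zm8="     },
    { "foo",    "Zm9v"     },
    { "foob",   "Zm9vYg==" },
    { "fooba",  "Zm9vYmE=" },
    { "foobar", "Zm9vYmFy" }
};

// Runs the given encoder over every vector and returns the number of mismatches.
template <class Encode>
std::size_t count_failures(Encode encode)
{
    const std::size_t n = sizeof(rfc4648_base64) / sizeof(rfc4648_base64[0]);
    std::size_t failures = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (encode(std::string(rfc4648_base64[i].plain)) != rfc4648_base64[i].encoded)
            ++failures;
    }
    return failures;
}

// Hypothetical one-line test: aborts if any vector is encoded incorrectly.
template <class Encode>
void require_all_pass(Encode encode)
{
    assert(count_failures(encode) == 0);
}
```

The same table can be reused for the decoder by swapping the roles of the two columns.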
RFC 4648 is the standard not only for the base64 encodings, but also for the lesser known base32 and base16 encodings, plus the alternative alphabets for each. Most encoders on the ‘Net only work with base64. Can we provide support for the others? Actually yes we can, and fairly easily too. First of all we create a bunch of alphabet objects: simple const strings, one for each encoding type. For each of these we need to create an encoding object. These can be simple structs. I think they are referred to as traits in the STL but I might be using them slightly differently to how traits are used.
struct base64 { static const std::string alphabet_; };
struct base64url { static const std::string alphabet_; };
const std::string base64::alphabet_ = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
const std::string base64url::alphabet_ = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
These become a template parameter to both the function and the class. We specify the standard base64 alphabet as the default, giving us the new signature:
template< typename Alphabet = base64, class OutputIt >
encoder_<Alphabet, OutputIt> encoder(OutputIt it)
{
    return encoder_<Alphabet, OutputIt>(it);
}
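The following sketch shows how the defaulted Alphabet parameter plays out at the call site. It assumes C++11, which is what allows a default template argument on a function template ahead of a deduced parameter. put_symbol is a hypothetical helper that writes a single symbol; it only stands in for the real encoder.

```cpp
#include <cassert>
#include <iterator>
#include <string>

struct base64    { static const std::string alphabet_; };
struct base64url { static const std::string alphabet_; };

const std::string base64::alphabet_ =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
const std::string base64url::alphabet_ =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Hypothetical helper: writes the symbol for one 6-bit value.
// Alphabet defaults to base64; OutputIt is deduced from the argument.
template <typename Alphabet = base64, class OutputIt>
OutputIt put_symbol(unsigned value, OutputIt out)
{
    assert(value < Alphabet::alphabet_.size());
    *out++ = Alphabet::alphabet_[value];
    return out;
}

// Usage: the caller names an alphabet only when it differs from the default.
inline std::string last_symbols_standard()
{
    std::string s;
    put_symbol(63, put_symbol(62, std::back_inserter(s)));
    return s;
}

inline std::string last_symbols_url()
{
    std::string s;
    put_symbol<base64url>(63, put_symbol<base64url>(62, std::back_inserter(s)));
    return s;
}
```

The two alphabets differ only in their last two symbols, which is why the usage functions look at values 62 and 63.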
So now we write some tests that use both alphabets. To support base32 and base16 we add further information to each alphabet struct: the masks, shifts and modulus.
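To give a feel for how one code path can serve all three encodings, here is a sketch in which each struct carries the number of bits per output character and the padded group length. These are my stand-ins for the masks, shifts and modulus; the real structs may be laid out differently. The names base64_traits, base32_traits, base16_traits and encode_with are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct base64_traits {
    static const char* alphabet() { return "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; }
    enum { bits = 6, group = 4 };   // 6 bits per character, output padded to 4
};
struct base32_traits {
    static const char* alphabet() { return "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"; }
    enum { bits = 5, group = 8 };   // 5 bits per character, output padded to 8
};
struct base16_traits {
    static const char* alphabet() { return "0123456789ABCDEF"; }
    enum { bits = 4, group = 2 };   // 4 bits per character, never needs padding
};

template <class Traits>
std::string encode_with(const std::string& in)
{
    const int bits = Traits::bits;
    const std::size_t group = Traits::group;
    assert(bits > 0 && bits < 8);
    const unsigned long mask = (1ul << bits) - 1ul;

    std::string out;
    unsigned long acc = 0;  // bit accumulator
    int pending = 0;        // number of bits in acc not yet written out
    for (std::size_t i = 0; i < in.size(); ++i) {
        acc = (acc << 8) | static_cast<unsigned char>(in[i]);
        pending += 8;
        while (pending >= bits) {
            pending -= bits;
            out += Traits::alphabet()[(acc >> pending) & mask];
        }
        acc &= (1ul << pending) - 1ul;  // keep only the unwritten bits
    }
    if (pending > 0)  // final partial character, zero-filled on the right
        out += Traits::alphabet()[(acc << (bits - pending)) & mask];
    while (out.size() % group != 0)
        out += '=';
    return out;
}
```

Only the numbers in the structs change between encodings; the loop itself is shared.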
Eventually we are left with a working encoder. The decoder is slightly harder but not much. Again, with a single code base we can decode base64/32/16 with different alphabets. The only difference is the handling of section 3.3 of the RFC:
Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise.
This is handled by another parameter class, one which decides whether to allow or to reject non-alphabet characters. By default we should reject (I throw). We are left with this as an interface:
template< typename Alphabet = base64, typename InvalidAlphabet = invalid_data_throw<Alphabet>, class OutputIt >
decoder_<Alphabet, InvalidAlphabet, OutputIt> decode(OutputIt ref)
{
    return decoder_<Alphabet, InvalidAlphabet, OutputIt>(ref);
}
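A stripped-down sketch of the policy idea follows. It is not the real decoder: it is a string-to-string function for the standard base64 alphabet only, it treats the first '=' as the end of the data without validating the padding, and the names throw_on_invalid, skip_invalid and decode64 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Policy 1 (the default): reject characters outside the alphabet, as section 3.3 requires.
struct throw_on_invalid {
    static void handle(char) { throw std::invalid_argument("character outside the base64 alphabet"); }
};

// Policy 2: silently skip them, for specifications that allow e.g. line breaks.
struct skip_invalid {
    static void handle(char) {}
};

template <class OnInvalid = throw_on_invalid>
std::string decode64(const std::string& in)
{
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    unsigned long acc = 0;  // bit accumulator
    int pending = 0;        // number of bits in acc not yet written out
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];
        if (c == '=') break;  // simplification: padding ends the data
        const std::string::size_type pos = alphabet.find(c);
        if (pos == std::string::npos) {
            OnInvalid::handle(c);  // throws, or returns so the character is skipped
            continue;
        }
        acc = (acc << 6) | pos;
        pending += 6;
        if (pending >= 8) {
            pending -= 8;
            out += static_cast<char>((acc >> pending) & 0xFF);
        }
        acc &= (1ul << pending) - 1ul;  // keep only the unwritten bits
    }
    assert(pending < 8);
    return out;
}
```

Calling decode64("Zm9v") uses the throwing policy, while decode64<skip_invalid>("Zm\n9v") tolerates the embedded newline.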
While we have nice looking code, it doesn’t help if it isn’t fast. I did a simple Google search and grabbed the first C++ base64 encoder I came across. A quick bit of hacking later, the result was not good: I am slightly more than 2x slower on the encode. Why is this? Looking at the code by Rene, he is working on data in blocks of three while I am doing it byte by byte, for which I would expect my code to be 3x slower.
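For comparison, this is roughly what working in blocks of three looks like. It is my own sketch of the general technique, not Rene's code, and encode_blocks is a hypothetical name. Each pass of the main loop reads three bytes and writes four characters with no per-byte state to maintain, and the output is sized once up front.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline std::string encode_blocks(const std::string& in)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((in.size() + 2) / 3) * 4);  // one allocation for the whole result

    std::size_t i = 0;
    // Main loop: every complete group of 3 bytes becomes 4 characters.
    for (; i + 3 <= in.size(); i += 3) {
        const unsigned long n =
            (static_cast<unsigned long>(static_cast<unsigned char>(in[i])) << 16) |
            (static_cast<unsigned long>(static_cast<unsigned char>(in[i + 1])) << 8) |
             static_cast<unsigned long>(static_cast<unsigned char>(in[i + 2]));
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += alphabet[(n >> 6) & 0x3F];
        out += alphabet[n & 0x3F];
    }

    // Tail: one or two leftover bytes, padded with '='.
    const std::size_t rest = in.size() - i;
    if (rest != 0) {
        unsigned long n = static_cast<unsigned long>(static_cast<unsigned char>(in[i])) << 16;
        if (rest == 2)
            n |= static_cast<unsigned long>(static_cast<unsigned char>(in[i + 1])) << 8;
        out += alphabet[(n >> 18) & 0x3F];
        out += alphabet[(n >> 12) & 0x3F];
        out += (rest == 2) ? alphabet[(n >> 6) & 0x3F] : '=';
        out += '=';
    }
    assert(out.size() == ((in.size() + 2) / 3) * 4);
    return out;
}
```

I have not benchmarked this sketch; it is here only to show the shape of the block-wise inner loop.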
Encoding is only half the story. We need to decode too. In this we are faster, but only just. When it comes to base32 (and base16) mine wins:-)
So I have pretty much done as much as I wanted to do with it. There is a little more I could do – better code coverage, tidier code, different encodings – but for now I will upload it to GitHub. See it here.