Revisiting C++

I was a C++ expert twice in the past. The first time around was because I was doing some work for Taligent, and their whole operating system was written in C++. With that system I got knee deep into the finer details of templates, and exceptions, to a degree that will likely never be seen on the planet earth.

The second time around, was because I was programming on the BeOS. Not quite as crazy as the Taligent experience, but C/C++ were all the rage.

Then I drifted into Microsoft, and C# was born. For the past 15 years, it’s been a slow rise to dominance with C# in certain quarters of Microsoft. It just so happens that this corresponds to the rise of the virus attacks on Windows, as well as the shift in programming skills of college graduates. In the early days of spectacular virus attacks, you could attribute most of them to buffer overruns, which allowed code to run on the stack. This was fairly easily plugged by C#, and security coding standards.

Today, I am working on a project where once again I am learning C++. This time around it’s C++ 11, which is decidedly more mature than the C++ I learned while working on Taligent. It’s not so dramatically different as say the difference between Lisp and Cobol, but it gained a lot of stuff over the years.

I thought I would jot down some of the surface differences I have noticed since I’ve been away.

First, to compare C++ to Lua, there are some surface differences. Most of the languages I program in today have their roots in Algol, so they largely look the same. But, there are some simple dialect differences. C++ is full of curly braces ‘{}’, semi-colons ‘;’, and parenthesis ‘()’. Oh my god with the parens and semis!! With Lua, parens are optional, semis are optional, and instead of curlies, there are ‘do’, ‘end’, or simply ‘end’. For loops are different, array indices are different (unless you’re doing interop with the FFI), and do/while is repeat/until.

These are all minor differences, like say the differences between Portuguese and Spanish. You can still understand the other if you speak one. Perhaps not perfectly, but there is a relatively easy translation path.

Often times in language wars, these are the superficial differences that people talk about. Meh, not interesting enough to drive me one way or another.

But then, there’s this other stuff, which is truly the essence of the differences. Strong typing/duck typing, managed memory, dynamic code execution. I say ‘Lua’ here, but really that could be a standin for C#, node.js, Python, or Ruby. Basically, there are a set of modern languages which exhibit a similar set of features which are different enough from C/C++ that there is a difference in the programming models.

To illustrate, here’s a bit of C++ code that I have written recently. The setup is this, I receive a packet of data, typically the beginning of a HTTP conversation. From that packet of data, I must be able to ‘parse’ the thing, determine whether it is http/https, pull out headers, etc. I need to build a series of in-place parsers, which keep the amount of memory allocated to a minimum, and work fairly quickly. So, the first piece is this thing called a AShard_t:

#pragma once

#include "anl_conf.h"

class  DllExport AShard_t  {
public:
	uint8_t *	m_Data;
	size_t	m_Length;
	size_t	m_Offset;

	// Constructors
	AShard_t();
	AShard_t(const char *);
	AShard_t(uint8_t *data, size_t length, size_t offset);

	// Virtual Destructor
	virtual ~AShard_t() {};

	// type cast
	operator uint8_t *() {return getData();}

	// Operator Overloads
	AShard_t & operator= (const AShard_t & rhs);

	// Properties
	uint8_t *	getData() {return &m_Data[m_Offset];};
	size_t		getLength() {return m_Length;};

	// Member functions
	AShard_t &	clear();
	AShard_t &	first(AShard_t &front, AShard_t &rest, uint8_t delim) const;
	bool		indexOfChar(const uint8_t achar, size_t &idx) const;
	bool		indexOfShard(const AShard_t &target, size_t &idx);
	bool 		isEmpty() const;
	void		print() const;
	bool		rebase();
	char *		tostringz() const;
	AShard_t &	trimfrontspace();

};

OK, so it’s actually a fairly simple data structure. Assuming you have a buffer of data, a shard is just a pointer into that buffer. It contains the pointer, an offset, and a length. You might say that the pointer/offset combo is redundant, you probably don’t need both. The offset could be eliminated, assuming the pointer is always at the base of the structure. But, there might be a design choice that makes this useful later.

At any rate, there’s a lot going on here for such a simple class. First of all, there’s that ‘#pragma once’ at the top. Ah yes, good ol’ C preprocessor, needs to be told not to load stuff it’s already loaded before. There’s there’s class vs struct, not to be confused with ‘typedef struct’. Public/Protected/Private, copy constructor or ‘operator=’. And heaven forbid you forget to make a public default constructor. You will not be able to create an array of these things without it!

These are not mere dialectual differences, these are the differences between Spanish and Hungarian. You MUST know about the default constructor thing, or things just won’t work.

As far as implementation is concerned, I did a mix of things here, primarily because the class is so small. I’ve inserted some simple “string” processing right into the class, because I found them to be constantly useful. ‘first’, ‘indexOfChar’, and ‘indexOfShard’ turn out to be fairly handy when you’re trying to parse through something in place. ‘first’ is like in Lisp, get the first element off the list of elements. In this case you can specify a single character delimiter. ‘indexOfChar’, is like strchr() function from C, except in this case it’s aware of the length, and it doesn’t assume a ‘null’ terminated string. ‘indexOfShard’ is like ‘strstr’, or ‘strpbrk’. With these in hand, you can do a lot of ‘tokenizing’.

Here’s an example of parsing a URL:

bool parseUrl(const AShard_t &uriShard)
{
  AShard_t shard = uriShard;
  AShard_t rest;
	
  AShard_t scheme;
  AShard_t url;
  AShard_t authority;
  AShard_t hostname;
  AShard_t port;
  AShard_t resquery;
  AShard_t resource;
  AShard_t query;

  // http:
  shard.first(scheme, rest, ':');

  // the 'rest' represents the resource, which 
  // includes the authority + query
  // so try and separate authority from query if the 
  // query part exists
  shard = rest;
  // skip past the '//'
  shard.m_Offset += 2;
  shard.m_Length -= 2;

  // Now we have the url separated from the scheme
  url = shard;

  // separate the authority from the resource based on '/'
  url.first(authority, rest, '/');
  resquery = rest;

  // Break the authority into host and port
  authority.first(hostname, rest, ':');
  port = rest;

  // Back to the resource.  Split it into resource/query
  parseResourceQuery(resquery, resource, query);


  // Print the shards
  printf("URI: "); uriShard.print();
  printf("  Scheme: "); scheme.print();
  printf("  URL: "); url.print();
  printf("    Authority: "); authority.print();
  printf("      Hostname: "); hostname.print();
  printf("      Port: "); port.print();
  printf("    Resquery: "); resquery.print();
  printf("      Resource: "); resource.print();
  printf("      Query: "); query.print();
  printf("\n");

  return true;
}

AShard_t url0("http://www.sharmin.com:8080/resources/gifs/bunny.gif?user=willynilly&password=funnybunny");
parseUrl(url0);

Of course, I’m leaving out error checking, but even for this simple tokenization, it’s fairly robust because in most cases, if a ‘first’ fails, you’ll just gen an empty ‘rest’, but definitely not a crash.

So, how does this fair against my beloved LuaJIT? Well, at this level things are about the same. In Lua, I could create exactly the same structure, using a table, and perform exactly the same operations. Only, if I wanted to do it without using the ffi, I’d have to stuff the data into a Lua string object (which causes a copy), then use the lua string.char, count from 1, etc. totally doable, and probably fairly optimized. There is a bit of a waste though because in Lua, everything interesting is represented by a table, so that’s a much bigger data structure than this simple AShard_t. It’s bigger in terms of memory footprint, and it’s probably slower in execution because it’s a generalized data structure that can serve many wonderful purposes.

For memory management, at this level of structure, things are relatively easy. Since the shard does not copy the data, it doesn’t actually do any allocations, so there’s relatively little to cleanup. The most common use case for shards is that they’ll either be stack based, or they’ll be stuffed into a data structure. In either case, their lifetime is fairly short and well managed, so memory management isn’t a big issue. If they are dynamically allocated, then of course there’s something to be concerned with.

Well, that touches the ice berg. I’ve re-attached to C++, and so far the gag reflex hasn’t driven me insane, so I guess it’s ok to continue.

Next, I’ll explore how insanely great the world becomes when shards roam the earth.

Advertisements


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s