Transforming Low Value Data

The Whole World’s a Database!  Or at least, a lot more things would become easier if more data were readily available, from a programmer’s perspective.  There are tons of little bits and pieces of data out there in the programming firmament that would be easier to deal with if they were already in a form that was readily programmable.

One example is mime type information.  Briefly, mime types are used to roughly describe some chunk of data.  The most familiar these days might be “text/html”, which describes HTML data.  One task you might want to do as a programmer is quickly check whether a particular mime type is actually a valid registered mime type.

The official registry of mime type information is kept by IANA.  As an observer, it’s useful to go look at the web page, click on links, and check things out.  But this is hardly the readily available information source I need as a programmer.

There is a public domain source of mime type information that is maintained by the Apache community.  The data is held in one large file that looks like this:

# MIME type (lowercased)			Extensions
# ============================================	==========
# application/1d-interleaved-parityfec
# application/3gpp-ims+xml
# application/activemessage
application/andrew-inset			ez
# application/applefile
application/applixware				aw
application/atom+xml				atom
application/atomcat+xml				atomcat
# application/atomicmail

It’s probably easy enough to write a little parser, or regular expression, that will go through this file and separate the mime type information out into other, more convenient formats. As a programmer, when I look at this file, I’m thinking “ok, one barrier to entry is writing that little engine that will parse this data”. It’s not much, but it’s yet another task, unrelated to whatever my primary task is, that will take time.
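To give a feel for how small that little engine can be, here is a minimal sketch in Lua, assuming every data line looks like the excerpt above. The function names parseline and torow are my own, made up for illustration. Commented-out entries are treated as types with no extensions, while the header and separator lines are skipped.

```lua
-- Parse one line of the listing shown above.
-- Returns type, subtype, and a (possibly empty) list of extensions,
-- or nil for blank, header, and separator lines.
local function parseline(line)
  -- Commented-out entries are still types, just without extensions,
  -- so strip a leading "#" instead of skipping the line.
  local body = line:gsub("^#%s*", "")
  local mtype, subtype, rest = body:match("^([%w%-]+)/(%S+)%s*(.*)$")
  if not mtype then
    return nil
  end
  local extensions = {}
  for ext in rest:gmatch("%S+") do
    extensions[#extensions + 1] = ext
  end
  return mtype, subtype, extensions
end

-- Turn one line into a row of Lua source text, or nil if the line
-- carries no mime type.
local function torow(line)
  local mtype, subtype, extensions = parseline(line)
  if not mtype then
    return nil
  end
  local fields = { mtype, subtype }
  for _, ext in ipairs(extensions) do
    fields[#fields + 1] = ext
  end
  return '{"' .. table.concat(fields, '","') .. '"};'
end

print(torow("application/andrew-inset\t\t\tez"))
print(torow("# application/applefile"))
```

Run over every line of the file, something like this would produce the rows of a Lua table. A real file may contain comment lines this pattern does not anticipate, so the output deserves a quick look before you trust it.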

This kind of data exists all over the place. Ideally, rather than being in this particular form, it would at least be in Comma Separated Value (csv) format, where there is already code readily available to deal with it. But alas, the file was probably created at a time when someone was just trying to document mime types in a human readable form.

Here’s a format that I find to be more useful:

--[[
 {"<type>","<subtype>" [,extensions]}
============================================ ==========
--]]
local mimetypes = {
{"application","1d-interleaved-parityfec"};
{"application","3gpp-ims+xml"};
{"application","activemessage"};
{"application","andrew-inset","ez"};
{"application","applefile"};
{"application","applixware","aw"};
{"application","atom+xml","atom"};
{"application","atomcat+xml","atomcat"};
{"application","atomicmail"};
}

In this case, using Lua, as soon as the file is loaded the data is already in a Lua table and can be queried. Separating the type from the subtype (leaving out the ‘/’ character) allows for greater flexibility in terms of what can now be done with the data.
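The task I started with, checking whether a mime type is one we know about, becomes a single table lookup once the rows are indexed. This is only a sketch: the three rows are a small excerpt in the same row format, and buildindex and isknownmimetype are names I made up for illustration.

```lua
-- A small excerpt in the row format shown above; the full table is
-- assumed to follow the same shape.
local mimetypes = {
  {"application","andrew-inset","ez"};
  {"application","applefile"};
  {"application","atom+xml","atom"};
}

-- Build a table keyed by "type/subtype" so a validity check is one lookup.
local function buildindex(rows)
  local index = {}
  for _, row in ipairs(rows) do
    index[row[1] .. "/" .. row[2]] = row
  end
  return index
end

local known = buildindex(mimetypes)

-- Returns true when the name is present in the table.
-- The table is lowercased, so lowercase the query before looking it up.
local function isknownmimetype(name)
  return known[name:lower()] ~= nil
end

print(isknownmimetype("application/atom+xml"))
print(isknownmimetype("Application/AppleFile"))
print(isknownmimetype("application/bogus"))
```

Keeping type and subtype as separate fields and joining them only for the index means the same rows can still be searched by either part.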

If you know one bit of the puzzle, let’s say the subtype “html”, then you can go through the data and try to find the other part, the “type”, and any associated extensions.

function findmimetype(subtype)
  for i,row in ipairs(mimetypes) do
    if row[2] == subtype then
      return row
    end
  end
  -- didn't find it
  return nil
end

local mtype = findmimetype("html")
if mtype then
  print(string.format("%s/%s", mtype[1], mtype[2]))
end
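The search above stops at the first hit, but a subtype such as “xml” can show up under more than one type. A variation, sketched here with made-up sample rows and hypothetical function names (findall, findbyextension), collects every match, and the same idea works for looking up a file extension.

```lua
-- Illustrative sample rows in the same format; not the full table.
local mimetypes = {
  {"application","xml","xml","xsl"};
  {"text","html","html","htm"};
  {"text","xml"};
}

-- Return every row whose subtype matches, not just the first one.
local function findall(rows, subtype)
  local matches = {}
  for _, row in ipairs(rows) do
    if row[2] == subtype then
      matches[#matches + 1] = row
    end
  end
  return matches
end

-- Return the first row that lists the given extension, or nil.
-- Extensions start at the third field of each row.
local function findbyextension(rows, ext)
  for _, row in ipairs(rows) do
    for i = 3, #row do
      if row[i] == ext then
        return row
      end
    end
  end
  return nil
end

for _, row in ipairs(findall(mimetypes, "xml")) do
  print(row[1] .. "/" .. row[2])
end

local byext = findbyextension(mimetypes, "htm")
if byext then
  print(string.format("%s/%s", byext[1], byext[2]))
end
```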

Of course, you can do more interesting things like load the table of mime information into a database, and then perform queries from there.

This is the kind of data that no one will really bother cleaning up. It’s already in a form that makes it somewhat readable, and most people will just write their own purpose-built code to read the original file from Apache, or something like that.

But if you actually spend some time with a fairly capable text editor, or if you parse that file once, then you can put it into a format that is far more flexible. Once it’s in the Lua table form, or something equivalent, you could then generate other forms fairly easily.
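As an example of generating another form, here is a rough sketch that writes the rows back out as csv. The function name tocsv and the column layout (extensions space-separated in the last column) are my own choices, and the code assumes no field contains a comma or a quote, which holds for the rows shown here.

```lua
-- A couple of rows in the table format used above.
local mimetypes = {
  {"application","andrew-inset","ez"};
  {"application","applefile"};
}

-- Produce csv text with a header line and one line per row.
-- Extensions (fields 3 and up) are joined with spaces in the last column.
local function tocsv(rows)
  local lines = { "type,subtype,extensions" }
  for _, row in ipairs(rows) do
    -- table.concat returns "" when the row has no extensions
    local exts = table.concat(row, " ", 3)
    lines[#lines + 1] = string.format("%s,%s,%s", row[1], row[2], exts)
  end
  return table.concat(lines, "\n")
end

print(tocsv(mimetypes))
```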

It might even be true that this “low value” data is exactly the type of data that should be converted to a more machine readable form, rather than a human readable form. Ultimately, I’ll be saving a lot more compute cycles since I won’t have to constantly transform the human readable forms into machine readable forms.

And so it goes. I now have a mime type table as part of the LAPHLibs.
