Shiny Ideas: Erlang Dabbling

General Observations

I decided to try teaching myself Erlang after reading an article on Slashdot to the effect that functional languages are the new hotness. The idea of a language that parallelizes easily (or, even better, automatically) is compelling; I've certainly found myself wanting to make things multi-thread more often as the number of cores in readily-available systems increases. When you've only one core there's no point in parallelization. With two cores you write your program to hog one, leaving the other available for system tasks. But when you've four or eight cores, as is the case now that dual/dual and dual/quad systems are standard, the ability to use multiple cores simultaneously becomes something more than just a party trick.

So far I'm pretty impressed; Erlang has a lot of features which make it stand out from the crowd. Things which have impressed me so far:

Automagic parallelism: Various features/idiosyncrasies of the language allow the Erlang Runtime System (ERTS) to automatically spread concurrently running processes across the available cores.
Robust error handling: Yeah, every language claims that it has robust error handling, but it actually seems to be true in Erlang's case. And you don't have to write a lot of extra code to take advantage of it.
Integrated support for distributed operation: The ERTS has built-in facilities for spawning processes on multiple nodes and passing messages between them.

All of the above make Erlang a prime candidate when you're writing long-running application code that needs to handle errors gracefully. This isn't surprising given that it was originally developed by Ericsson for running telco switches and such. To that end the ERTS has a bunch of operationally useful features such as on-the-fly code replacement, the ability to start and stop processes remotely, and so on. Such things are either difficult or impossible with languages such as Perl or Java.

I would specifically not use Erlang as a replacement for a scripting language. It just isn't adapted to automating system administration tasks in the same way that Perl or bash are. While it's possible to get options off of the command line, execute system commands, parse text, etc., the syntax hasn't been tuned to make these things quick and easy. Erlang is also lacking some other niceties which, while not critical, would definitely be helpful. There's no support for opaque, user-defined types (much less objects or classes), which is a bit of a challenge if you've grown used to having such functionality available to help structure your code. Nor does it have any sort of metaprogramming support¹; if you need a dozen getter/setter pairs you gotta write them all by hand. None of these are deal-breakers by any means, but you should be aware that Erlang is fairly specialized.

The standard toolset is about what you'd expect in a modern programming language. Embedded documentation is available via edoc², unit testing is available via eunit, there's a pretty good debugger, and so on. The one complaint I have so far is that the stock make package is très primitive. I highly recommend using GNU Make instead if it's available.

A Binary Buffer

So what have I actually been doing with Erlang? Glad you asked. I'm working on a parallel log-processor (of a sort), more details of which I hope to share as soon as I have something worthwhile. Until then, why don't I give you a tour of a supporting module which I've already written. Take a look at buffer.erl, which implements a FIFO buffer for binary data as a stand-alone process.

The first thing you're probably asking is why such a thing is even useful. Well, one of Erlang's idiosyncrasies is that it has "write-once" variables; you can't modify the value of a variable once it's been assigned. This is one of the reasons why the ERTS can automatically execute operations in parallel; the use of write-once variables greatly reduces the occurrence of synchronization issues which prevent parallelization in other languages. One of the downsides of this architecture, however, is that you can't modify data structures in place. Functions which need to change data structures in some fashion usually return modified copies of the original which have been built on the fly. In most circumstances this isn't that big a deal, but it can make code more complicated/ugly and may force you to violate compartmentalization/information hiding.
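To make the write-once rule concrete, here's a minimal sketch. The module name write_once_demo and the functions append/2 and demo/0 are made up for illustration and are not part of buffer.erl. It shows a function "changing" a binary by returning a new one while the original binding stays untouched.

```erlang
-module(write_once_demo).
-export([append/2, demo/0]).

%% Hypothetical helper: returns a NEW binary made of Buffer followed by Chunk.
%% Buffer itself is never modified.
append(Buffer, Chunk) when is_binary(Buffer), is_binary(Chunk) ->
    <<Buffer/binary, Chunk/binary>>.

demo() ->
    B0 = <<"abc">>,
    B1 = append(B0, <<"def">>),
    %% Writing B0 = <<"xyz">> here would fail with a badmatch error,
    %% because B0 is already bound to <<"abc">>.
    {B0, B1}.
```

Calling write_once_demo:demo() hands back both the old and the new value, which is exactly the bookkeeping that callers end up doing themselves when a data structure can't be updated in place.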

One workaround for the above is to do all your copying and state tracking behind the scenes via a separate process. I got the idea from observing how Erlang handles file operations which, on casual observation, appear to violate the write-once rule because they don't ever return a modified file handle. However, if you dig a little you find out that a file handle in Erlang isn't some sort of compound data structure but rather a handle to another process (called a process ID, or PID, in Erlang parlance). The calling process communicates with the file-serving process by sending messages to the associated PID and never need concern itself with how file handling is implemented behind the scenes. I adopted this paradigm in writing the buffer module for just this reason.
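The general shape of that workaround can be sketched with something much simpler than a buffer. The counter below is a made-up example (the module name counter_demo and its message formats are assumptions of mine, not anything from the file module): the state lives in a separate process, and callers only ever hold on to a PID.

```erlang
-module(counter_demo).
-export([new/0, increment/1, value/1, stop/1]).

%% Spawn a process whose only job is to hold the current count.
new() ->
    spawn(fun() -> loop(0) end).

%% Fire-and-forget: the caller's "handle" (the PID) never changes.
increment(Pid) ->
    Pid ! increment,
    ok.

%% Ask for the current value, passing self() as the return address.
value(Pid) ->
    Pid ! {value, self()},
    receive
        {counter_value, Pid, N} -> N
    after 1000 ->
        {error, timeout}
    end.

stop(Pid) ->
    Pid ! stop,
    ok.

%% The state is the argument of a tail-recursive loop.
loop(N) ->
    receive
        increment ->
            loop(N + 1);
        {value, From} ->
            From ! {counter_value, self(), N},
            loop(N);
        stop ->
            ok
    end.
```

From the caller's point of view the counter looks mutable, even though no variable is ever rebound; every "update" is just a new argument to loop/1 inside the other process.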

Commentary on the source, by line number, follows for those of you who care:

1 - 33: Embedded documentation for the module as a whole. See the edoc documentation for details.
35: Declare the module. This creates the buffer namespace and probably does a bunch more behind the scenes.
39: Including the eunit header enables unit testing functionality.
43 - 47: The export declaration specifies which functions are accessible outside of the lexical scope of the module. The number after each slash qualifies the function name with an arity; there may be multiple definitions of a function, each with a different arity.
49 - 71: Additional error checking code which will be compiled if the value debug is set at compile time.
74 - 83: The clear function empties the buffer. Here you see the first occurrence of an Erlang-ism, "BufferPid ! clear", which means "Send a message consisting of the atom³ clear to the process specified by BufferPid".
85 - 107: The definition for the dequeue function. There are a couple of interesting things happening here. On line 101 a message is sent to the buffer process containing a tuple⁴ which will tell the buffer process to dequeue data and send it to the appropriate PID (self() returns the calling process' PID and is frequently used to provide a "return address" when communicating with other processes). This is followed by a receive clause on lines 102 - 107 which causes the function to either wait for one of two messages (presumably from the buffer process) or time out after 1000 ms with an error.
109 - 136: More function definitions. There's another interesting Erlang-ism on line 130, the <<>> operator. One of Erlang's minor strengths is that it has a special syntax for composing/decomposing binary data. I've found this feature extremely useful for parsing binary files.
138 - 145: The init function primes the main processing loop (implemented, appropriately enough, as a function called loop) with an empty binary string representing an empty buffer.
147 - 164: The loop function is where all the action takes place. We enter the function at the top with the value Data, which represents the current contents of the buffer. The function then waits for a message generated by one of the buffer functions (enqueue, dequeue, etc.) and responds appropriately. Note that the last step for everything but the stop message is a tail recursive call; this is how event loops are implemented in Erlang.
166 - 170: new creates the buffer process via a call to spawn_opt, returning the PID of the spawned process. The link option causes process linking, another cool Erlang feature which creates a bi-directional relationship between two processes for the purpose of error handling. If the process on one end of a link terminates the process on the other end is automatically sent a special signal to that effect. The buffer process doesn't trap the signal and so will automatically terminate when the calling process terminates.
172 - 263: A bunch of unit tests. Including the eunit.hrl header file makes macros such as ?assert available and works some magic behind the scenes (via parse_transform) so that functions ending in _test() are understood to be unit tests.
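For readers without buffer.erl in front of them, here's a rough sketch of the pieces described above. To be clear, this is not the original module: the name buffer_sketch, the message formats, the byte-count argument to dequeue, and the insufficient_data error are all assumptions on my part. It only aims to show the same ingredients in one place: the ! send, a receive with a 1000 ms timeout, the <<>> binary syntax, a tail-recursive loop, and spawn_opt with the link option.

```erlang
-module(buffer_sketch).
-export([new/0, enqueue/2, dequeue/2, clear/1, stop/1]).
-export([init/0]).

%% Create the buffer process, linked to the caller.
new() ->
    spawn_opt(?MODULE, init, [], [link]).

%% Prime the loop with an empty binary, i.e. an empty buffer.
init() ->
    loop(<<>>).

enqueue(BufferPid, Bin) when is_binary(Bin) ->
    BufferPid ! {enqueue, Bin},
    ok.

%% Ask for Count bytes; self() is the return address.
dequeue(BufferPid, Count) when is_integer(Count), Count >= 0 ->
    BufferPid ! {dequeue, Count, self()},
    receive
        {dequeued, BufferPid, Bin} -> {ok, Bin};
        {dequeue_error, BufferPid, Reason} -> {error, Reason}
    after 1000 ->
        {error, timeout}
    end.

clear(BufferPid) ->
    BufferPid ! clear,
    ok.

stop(BufferPid) ->
    BufferPid ! stop,
    ok.

%% Data is the current buffer contents. Every clause except stop
%% ends in a tail call with the next state.
loop(Data) ->
    receive
        {enqueue, Bin} ->
            loop(<<Data/binary, Bin/binary>>);
        {dequeue, Count, From} when Count =< byte_size(Data) ->
            %% Split off the first Count bytes with the binary syntax.
            <<Head:Count/binary, Rest/binary>> = Data,
            From ! {dequeued, self(), Head},
            loop(Rest);
        {dequeue, _Count, From} ->
            From ! {dequeue_error, self(), insufficient_data},
            loop(Data);
        clear ->
            loop(<<>>);
        stop ->
            ok
    end.
```

In the Erlang shell, something like B = buffer_sketch:new(), buffer_sketch:enqueue(B, <<"hello">>), buffer_sketch:dequeue(B, 3) would exercise it. Dequeueing by byte count is just one plausible interface; the real module may well slice the data differently.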

The Virtue of Erlang

If I'm counting correctly buffer.erl, excluding the unit tests, consists of 81 lines of code. Stop and consider that in light of what the module actually does. It spawns a subprocess for managing the buffer, provides a message-based interface for communicating with it, and has fairly robust error handling to boot. That's pretty fricking efficient.

And that, I think, is the primary virtue of Erlang: It's a very high level language which insulates the programmer from a lot of unnecessary tedium⁵. I've found that, even when writing fairly complex code, I'm more likely to get it right the first time around. I attribute that to the fact that most operations are fairly abstract; you spend a lot of time doing things like pattern matching or recursing through lists. There's generally no need to worry about the gory details of data structures, pointers, IPC, etc.; the ERTS does that all for you. So, while I wouldn't recommend that people switch to Erlang wholesale, they should at least give it its day in court.

1 There's Smerl, but it's not part of the standard distribution and doesn't give the impression that it's actively being maintained.
2 Though I had to patch edoc_lib.erl to get it to work reliably on my Windows laptop. Why am I using a Windows laptop? 'Cause that's what happens to be available.
3 Atoms are a fundamental data type, equivalent to symbols in Ruby. They stand for themselves and nothing else.
4 Another fundamental data type used to encapsulate an ordered, fixed number of items.
5 Though that can be a double-edged sword at times. If you do need to get under the hood for some reason it can be difficult-to-impossible to do so.

Thursday, January 21, 2010
