Have fun with Unix

main() { printf(&unix["\021%six\012\0"],(unix)["have"]+"fun"-0x60);}

Just one line of code, but lots of confusion.  What does this program do?

Who wrote this code?

The code won the “Best One-Liner” prize at the International Obfuscated C Code Contest (IOCCC) in 1987.  It was written by David Korn, who is also the author of the Korn shell (ksh).

Let’s run it

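The code compiles just fine with gcc on Linux, giving a couple of harmless warnings about the implicit declaration of printf and the omitted return type of main().  (GCC 14 and later treat both as errors by default, so an older dialect such as -std=gnu89 has to be requested there.)  When run, it prints a single word: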

% ./korn
unix

Where does this come from?  A quick glance at the source shows what appear to be some NUL characters, plus "six", "have" and "fun".  The string "unix" is nowhere to be found: the only unix in the code looks like an undeclared variable, not a character string.

What is UNIX and where does it come from?

The obvious and boring answer is that UNIX is an operating system that comes from Bell Labs, but what we’re looking for is the symbol unix and its value in the program.  Let’s run the code through the preprocessor now.

% cpp korn.c
# 1 "korn.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "korn.c"
 main() { printf(&1["\021%six\012\0"],(1)["have"]+"fun"-0x60);}

The output shows that the preprocessor has substituted unix with 1 in the code.  But why is it doing that?

The online GNU documentation for cpp says “it is common to find unix defined on Unix systems”.  The reason is historical: it lets code use clauses like #ifdef unix  … #endif for conditional compilation.  Luckily, cpp can be asked to print every macro it has defined, using the -dM option.

% cpp -dM korn.c | grep unix
#define __unix__ 1
#define __unix 1
#define unix 1

Here’s the confirmation — unix is a system-specific predefined macro that has the value of 1. Let’s make the substitution in the source too:

int main() {
  printf(&1["\021%six\012\0"],
      (1)["have"]+"fun"-0x60);
}
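The conditional-compilation use of the macro can be sketched as follows.  This is only an illustration: the function name legacy_unix_macro is made up for this example, and the fallback value -1 is an arbitrary choice.

```c
/* Reports how the legacy `unix` macro is set for this compilation.
   The function name is hypothetical, chosen for this sketch. */
int legacy_unix_macro(void)
{
#ifdef unix
    return unix;   /* expands to the macro's value, 1 in the cpp run above */
#else
    return -1;     /* macro absent, e.g. strict ISO mode or a non-Unix target */
#endif
}
```

With gcc on Linux in its default GNU dialect this should return 1.  In a strict ISO mode such as -std=c99, gcc leaves unix undefined and keeps only the reserved spellings __unix and __unix__, so the function returns -1.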

Why so many NULs?

We will begin by clearing up the confusion about the apparent "\0"s in the string.  It turns out that "\021" does not mean a NUL followed by a '2' and a '1'; it is a single escape sequence denoting the byte whose value is 21 in octal (base 8).  The same is true of "\012".  These two are better written in hexadecimal as 0x11 and 0x0A: the former is the ASCII control character “Device Control 1” and (as we will see soon) is not really important here, while the latter is just a newline, "\n".  Only the final "\0" is a real NUL.

int main() {
  printf(&1["\x11%six\n\0"],
      (1)["have"]+"fun"-0x60);
}
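The escape-sequence arithmetic can be checked with a small sketch.  The function names are invented for this example, and the comparison with '\n' assumes an ASCII system.

```c
#include <stddef.h>

/* Returns 1 if the octal escapes denote the bytes claimed in the text. */
int octal_escapes_match(void)
{
    return '\021' == 0x11      /* octal 21 = 2*8 + 1 = 17 */
        && '\012' == 0x0A      /* octal 12 = 1*8 + 2 = 10 */
        && '\012' == '\n';     /* holds on ASCII systems */
}

/* Size of the original literal in bytes, including the implicit terminator. */
size_t format_literal_size(void)
{
    return sizeof "\021%six\012\0";
}
```

format_literal_size() returns 8: six bytes for \021, %, s, i, x and \012, one for the explicit \0, and one more for the terminator the compiler appends to every string literal.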

Confusing pointers

The code uses the commutativity of addition to cleverly obfuscate the character strings passed to the printf() function.  Let’s take a closer look at the technique now.

C lets you access the elements of an array using square brackets: array[index].  Since an array name decays to a pointer to its first element when used in an expression, the elements can also be accessed like this: *(array+index).  Now, as addition is commutative, it is also possible to write the latter as *(index+array), and, as a consequence, as index[array].  Just this trick alone is enough to confuse programmers not used to seeing constructs like 5["abcdef"], and here it is wrapped in yet another layer of obfuscation.
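A minimal sketch of the equivalence, with function names made up for this example:

```c
/* Each function reads the same element, written three different ways. */
char by_subscript(const char *s, int i) { return s[i]; }
char by_pointer(const char *s, int i)   { return *(s + i); }
char by_swapped(const char *s, int i)   { return i[s]; }

/* The construct from the text: index first, string literal second. */
char sixth_letter(void) { return 5["abcdef"]; }
```

All three accessors return the same character for the same arguments, and sixth_letter() returns 'f', the element at index 5 of "abcdef".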

Knowing all this, the subscript in the first parameter, 1["\x11%six\n\0"], can be written more clearly as "\x11%six\n\0"[1].  Since strings in C are just zero-indexed character arrays, [1] selects the second character and skips the 0th character '\x11' altogether, so the result is the percent sign.  The ampersand & binds less tightly than the subscript, so it takes the address of that character, and this pointer is what gets passed to printf().  In the end, the format string passed as the first parameter is "%six\n\0".
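The first argument can be evaluated on its own, as in this sketch.  The function names are hypothetical, and unix has already been replaced by 1.

```c
#include <string.h>

/* Returns the format string that printf() actually receives. */
const char *format_argument(void)
{
    /* & applies to the whole subscript expression, so this is
       &("\021%six\012\0"[1]): a pointer to the '%' character. */
    return &1["\021%six\012\0"];
}

/* Returns 1 when that pointer reads as the plain format "%six\n". */
int format_is_plain(void)
{
    return strcmp(format_argument(), "%six\n") == 0;
}
```

strcmp() stops at the first NUL, so the explicit "\0" and the implicit terminator after it make no difference to the comparison.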

Time to take the second argument apart now.  The %s at the beginning of the format string says the next argument is going to be a standard character string, null-terminated.

After what we have gone through, the expression (1)["have"]+"fun"-0x60 is pretty easy to take apart.  First there is (1)["have"], which can be rewritten as "have"[1] and then as 'a'.  Next there are "fun" and 0x60.  A look at the ASCII table shows that 'a' has the value 0x61, so the expression can be simplified to "fun"+0x61-0x60, and then to "fun"+1, which points at the second character of "fun" and therefore evaluates to "un".
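The second argument can be worked through the same way.  This sketch assumes ASCII and uses invented function names.  It also regroups the arithmetic: the original adds 0x61 to the pointer first and subtracts 0x60 afterwards, whereas here the integer offset is computed first so that no intermediate pointer leaves the string.

```c
#include <string.h>

/* Returns the string that %s receives, assuming an ASCII character set. */
const char *string_argument(void)
{
    const char *fun = "fun";
    int offset = (1)["have"] - 0x60;   /* 'a' - 0x60 = 0x61 - 0x60 = 1 */

    /* Regrouped as "fun" + offset so that every intermediate pointer
       stays inside the three-letter string. */
    return fun + offset;
}

/* Returns 1 when the result is the expected "un". */
int string_argument_is_un(void)
{
    return strcmp(string_argument(), "un") == 0;
}
```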

The final version

int main() {
  printf("%six\n\0", "un");
}

This is the code with all the obfuscations removed.  There’s no “fun” at all here; what remains is just an “un”, the first half of the “unix” that’s written to the standard output.
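To confirm the result without reading a terminal, the simplified call can be formatted into a buffer instead of printed.  This is a sketch only: the function name is made up, and snprintf() stands in for the printf() call of the original.

```c
#include <stdio.h>
#include <string.h>

/* Formats the de-obfuscated call into buf instead of printing it.
   Returns the number of characters written, as snprintf() reports it. */
int format_final_version(char *buf, size_t size)
{
    return snprintf(buf, size, "%six\n", "un");
}
```

Called with a buffer of 16 bytes, it returns 5 and leaves "unix\n" in the buffer.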

An exercise for the reader

Can you have fun without UNIX? What happens when the code is compiled on a non-UNIX machine? What if unix is defined as 0?

3 thoughts on “Have fun with Unix”

  1. What is the purpose of the “\0” at the end of the format string? Surely it’s redundant because C string literals are implicitly null-terminated…?

    • Perhaps that’s the case with a modern compiler, but back in 1987 there was no C standard. It wasn’t until 1989 with C89 that there was any general consensus on how to write C compilers. It’s very possible that a non-terminated string could cause problems with unorthodox compilers. It’s also possible he included it just for the sake of being more confusing, who knows!
