Core language concepts

2013-01-18

Computer languages, like human languages, come in many forms. This post aims to give an overview of the most common programming ideas. It’s meant to be read as you learn a particular programming language, to help you understand your experience in a more general context. I’m writing for conceptual learners, who delight in the underlying structure and rules of a system.

Many of these concepts have varying (and conflicting) names. I’ve tried to include alternates wherever possible, so you can search this post when you run into an unfamiliar word.

Syntax

Every program has two readers: the computer, and the human. Your job is to communicate clearly to both. Programs are a bit like poetry in that regard–there can be rules about the rhythm of words, how punctuation works, whether adjectives precede nouns, and so forth.

Every program is made up of expressions, organized in a tree. You can think of an expression like a sentence: it has some internal structure, and can contain other expressions as clauses.

English                  | JavaScript         | Clojure
One plus one.            | 1 + 1              | (+ 1 1)
                         |                    |
One plus one,            | (1 + 1) / 3        | (/ (+ 1 1) 3)
divided by three.        |                    |
                         |                    |
Zoe kicks the ball.      | zoe.kick(ball)     | (kick zoe ball)
                         |                    |
The ball which Zoe kicks | deshawn.catch(     | (catch deshawn
is caught by DeShawn.    |   zoe.kick(ball))  |   (kick zoe ball))

All of these expressions have the same syntax tree, but phrase it in different ways.

Every expression is equivalent to something. (+ 1 1) is equal to 2. We call 2 the value of the expression. The computer’s job is to evaluate expressions, converting them gradually to values.

(/ (+ 2 4) 3)
(/ 6 3)
2

Most languages evaluate the deepest expression first. We had to evaluate (+ 2 4) before we could divide it by three. Most languages also have a way to evaluate sequences of expressions in order, usually from top to bottom. We call these “statements”, but they’re really just expressions where we don’t care about the return value. If expressions are clauses, statements are sentences.

cat.pounce(mouse);
1 + (3.0 / 5)

cat.pounce(mouse) is an expression, so it has a value. We just didn’t do anything with it; once evaluated, we forgot about its value and moved on to the next statement. Some languages have statement terminators. Like the period in a sentence, the semicolon in JavaScript ends a statement. Some languages, like Ruby, put each statement on a separate line, and the semicolon is optional. Other languages use commas, or indentation.

In Lisps, statements are just a special kind of expression:

(do
  (pounce cat mouse)
  (+ 1 (/ 3.0 5)))

Every language you learn will involve picking up a new syntax, which helps you build the syntax tree the computer uses to run your program.
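
To make the tree idea concrete, here’s a minimal sketch of an evaluator in JavaScript. It assumes a made-up encoding where every expression is either a number or an array of [operator, left, right]. Real languages use richer trees, and the names tree and evaluate are mine, not part of any language.

```javascript
// A hypothetical syntax tree for (/ (+ 2 4) 3), written as nested arrays.
const tree = ["/", ["+", 2, 4], 3];

function evaluate(expr) {
  // A bare number is already a value.
  if (typeof expr === "number") {
    return expr;
  }
  // Otherwise, evaluate the inner expressions first, then combine them.
  const [op, left, right] = expr;
  const a = evaluate(left);
  const b = evaluate(right);
  switch (op) {
    case "+": return a + b;
    case "-": return a - b;
    case "*": return a * b;
    case "/": return a / b;
    default: throw new Error("Unknown operator: " + op);
  }
}

console.log(evaluate(tree)); // 2
```

Notice that evaluate has to finish the inner (+ 2 4) before it can divide, which is the deepest-first order described above.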

Values and identity

Values are the things in the world. The desk I’m typing at right now, made of wood and steel, with particular scratches on it, that’s a value. My desk is an identity. Maybe this desk today, and maybe tomorrow a different desk entirely. My body, with a particular pattern of cells and fluids frozen in time, is a value. Kyle is an identity, which points to a different body every second. Identities are the fixed names for changing values. The values themselves never change, but identities do.

We say that values are immutable, because they never change. We say that identities are mutable, and their changing values over time are called state. Many languages call identities variables and values constants.

// x is a variable, an identity, and its current value is 5.
x = 5;

// Now we print x to the screen. The number five appears.
print(x);

// We can change the value x points to. Now it's six.
x = 6;

// This time, we print the number six.
print(x);

Different languages have different conventions about which things are immutable. Numbers, like 2, 1/5, and 3.1415, are always immutable. Java says strings, like "hi there", are immutable, but Ruby has mutable strings: you can change a string’s contents in place. Collections like [1, 2, 3] are typically mutable, but some languages like Haskell, Erlang, and Clojure consider collections immutable too.

Why does mutability matter? Your program needs to talk about the real world, and in the real world things change. Identities help us understand change. At the same time, when things change, you can’t rely on them any more. Someone might hide your keys, or switch out the meal you’re enjoying. Immutability lets you guarantee that things won’t change over time.
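
Here’s a small JavaScript sketch of that tradeoff. The names meal, myPlate, and safeMeal are invented for illustration. Object.freeze is a standard JavaScript function, though it only freezes the outermost collection, not anything nested inside it.

```javascript
// Two names for the same mutable array.
const meal = ["soup", "bread"];
const myPlate = meal;

// Someone changes the meal through one name...
meal[0] = "gravel";
// ...and the change shows up through the other.
console.log(myPlate[0]); // gravel

// Object.freeze makes the array itself immutable (shallowly).
const safeMeal = Object.freeze(["soup", "bread"]);
try {
  // Throws a TypeError in strict mode; silently ignored otherwise.
  safeMeal[0] = "gravel";
} catch (e) {
  // Either way, the frozen array is untouched.
}
console.log(safeMeal[0]); // soup
```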

Functions

Functions are the verbs of programming. Given some arguments (also called parameters), they return a value. When you call a function, the computer evaluates the function’s expressions, using the parameters you specify, to come up with a return value. In Clojure:

(defn fly [bird]
  (println "The" bird "is flying!"))

defn means “define function”. The function’s name is fly, and it takes one argument, called bird. When fly is called, the computer evaluates the println expression within. The argument bird will stand for a specific bird. In JavaScript:

function fly(bird) {
  console.log("The " + bird + " is flying!");
}

There’s a critical distinction between the function itself, and calling it. For example, think about the verb “fly”. It’s the potentiality of flight, and we can talk about flying without actually doing it. But to really fly, we connect the verb with subjects and objects: “Fly, swan!”. fly, by itself, is a function. But calling fly with “swan” will evaluate the function’s code, and return a new value.

(fly "swan")

fly("swan");
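
A short sketch of that distinction in JavaScript. I’ve assumed a variant of fly that returns its sentence instead of printing it, so there’s a value to look at; verb and sentence are names made up for this example.

```javascript
// A variant of fly that returns the sentence rather than printing it.
function fly(bird) {
  return "The " + bird + " is flying!";
}

// Naming the function without parentheses does not run it.
const verb = fly;
console.log(typeof verb); // function

// Adding parentheses and an argument actually calls it.
const sentence = verb("swan");
console.log(sentence); // The swan is flying!
```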

Purity

Some functions, given the same values, always do the same thing. For instance,

function add(a, b) {
  return a + b;
}

…will always return the same sum for any pair of numbers. We say that add is pure. We know that add(1,5) is always 6, and nothing can ever change that. That means that anywhere we see add(1,5), we don’t even have to run the function. We know exactly what the consequences will be, so we can speed up the program by omitting needless work. This kind of optimization happens at various levels, from the physical chip to the language itself.

Impure functions, on the other hand, don’t do the same thing every time. They might have side effects.

function add(a, b) {
  alert("I'm adding " + a + " and " + b);
  return a + b;
}

This function is not safe to optimize away, because it prints a message to the screen when invoked. We have to run it every time.

Where possible, programmers try to write pure functions. They’re easy to test, because they always do the same thing. They’re easy to reason about, because they won’t interact in sneaky ways with the world when you’re not looking. You can run functions in any order, or skip their evaluation when their output is never used. They’re also safe to run in parallel.
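
One way to see why purity permits skipping work is a cache, sometimes called memoization. This is a rough sketch, not how any particular language optimizes code. memoize is a hypothetical helper I’ve written for the example, and the calls counter exists only so we can watch how often add really runs.

```javascript
// memoize is a hypothetical helper, not a built-in.
// It remembers results for argument pairs it has already seen.
function memoize(fn) {
  const cache = new Map();
  return function (a, b) {
    const key = a + "," + b;
    if (!cache.has(key)) {
      cache.set(key, fn(a, b));
    }
    return cache.get(key);
  };
}

// calls is here only to observe how many times add actually runs.
let calls = 0;
function add(a, b) {
  calls += 1;
  return a + b;
}

const fastAdd = memoize(add);
console.log(fastAdd(1, 5)); // 6
console.log(fastAdd(1, 5)); // 6
console.log(calls);         // 1
```

The second call never reaches add, and nobody can tell from the return value. The counter, though, is itself a tiny side effect, and the cache silently dropped it. That’s exactly why this trick is only safe for pure functions.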

Pure functions and immutable values work together, but the real world requires side effects and change. A big challenge in organizing code is controlling the use of impure functions and mutable identities, to balance performance and reasonable-ness.

Types

Types are the kinds of values in the world, the taxonomy of creatures. A given rabbit is a member of the family Leporidae, within the order Lagomorpha. By extension, it is also a mammal, a vertebrate, and an animal. The number 2 is an integer, and by extension a number. Every language has a type system, which is the taxonomy (called a “type hierarchy”) and the rules about how different types interact.

If you know the type of two objects, you can make assertions about how they work together. You can add any two numbers, so the integer 2 plus the decimal 0.5 is 2.5. But what is 2 plus an apple? Plus operates with numbers, so the expression doesn’t make sense. We call this a type error.

Some languages have strict rules about the types of values and identities. You must declare the types you’re working with in advance. In these statically typed languages, the computer can prove (to varying degrees) that the program is correct before it even runs.

Other languages have flexible rules about types. In a dynamically typed language, you don’t have to know what kind of value you’re working with until you actually evaluate an expression. Then the computer checks to see if your values can interact in that way.
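
JavaScript is dynamically typed, so we can sketch a runtime type check by hand. plus is a hypothetical function written for this example. I’m using it because JavaScript’s own + is unusually forgiving: 2 + "apple" quietly produces the string "2apple" instead of raising an error.

```javascript
// plus is a hypothetical, stricter version of +.
// It checks its arguments' types at the moment it is called.
function plus(a, b) {
  if (typeof a !== "number" || typeof b !== "number") {
    throw new TypeError("plus expects two numbers");
  }
  return a + b;
}

console.log(plus(2, 0.5)); // 2.5

let message = null;
try {
  plus(2, "apple");
} catch (e) {
  message = e.message;
}
console.log(message); // plus expects two numbers
```

In a statically typed language, the mistake in plus(2, "apple") would typically be reported before the program ever ran.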

Common types

You’ll encounter the same types over and over again in different languages.

Integers are numbers like -1, 0, 1, 1242354, and so forth. Floats are numbers with decimal points, like 0.5, -1.999, etc. Most integers and floats only encompass a limited range of numbers before degrading in some way: signed 32-bit integers can only talk about numbers from -2147483648 to 2147483647. Floats have a limited number of significant digits, so they can’t talk about big numbers with high precision. There are special types for large or very precise numbers. Some languages have rational types, like 2/3, which can express fractions perfectly.
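
You can watch these limits in action in JavaScript, whose ordinary numbers are all 64-bit floats. This is just a handful of probes using standard features of the language (Number.MAX_SAFE_INTEGER, the bitwise | operator, and BigInt), not a complete tour.

```javascript
// Floats store a limited number of binary digits, so some decimals are approximate.
console.log(0.1 + 0.2 === 0.3); // false
console.log(0.1 + 0.2);         // 0.30000000000000004

// Past 2^53, neighboring integers become indistinguishable.
console.log(Number.MAX_SAFE_INTEGER); // 9007199254740991
console.log(2 ** 53 === 2 ** 53 + 1); // true

// The bitwise | operator converts to a signed 32-bit integer, which wraps around.
console.log((2147483647 + 1) | 0); // -2147483648

// BigInt is a special type for arbitrarily large integers.
console.log(2n ** 64n); // 18446744073709551616n
```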

Strings are lists of characters, like "hi there" or "音韻体系".

Keywords (called symbols in Ruby and atoms in Erlang) are lightweight strings. Not every language has these. Sometimes they’re written with a colon in front, like :cat. There’s another sense of the word “keyword”, which refers to special words in the language like if and function. That’s different.

Lists are ordered collections of things, like (6, 4, 2). Getting the first element is fast, as is going over each of the elements in order. However, getting elements towards the end of the list takes more time.

Arrays, also called vectors, are ordered collections of things where you can get the element at any given position quickly. They’re often written as [6, 4, 2].

Maps, like {"cat": "meow", "dog": "woof"}, are dictionaries. Given a key, like “cat”, you can look up a corresponding value, like “meow”, quickly. You’ll also see them called hashmaps, associative arrays, or objects.
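
Here’s how those three collections might look side by side in JavaScript. Arrays and Map are built in. JavaScript has no built-in linked list, so the {head, tail} cells and the nth helper below are my own stand-ins for illustration.

```javascript
// A linked list of 6, 4, 2, built from hypothetical {head, tail} cells.
const list = { head: 6, tail: { head: 4, tail: { head: 2, tail: null } } };

// Reaching position n means walking past n cells, one at a time.
function nth(cell, n) {
  let current = cell;
  for (let i = 0; i < n; i++) {
    current = current.tail;
  }
  return current.head;
}

// An array jumps straight to any position.
const array = [6, 4, 2];

// A map looks up a value by its key.
const sounds = new Map([["cat", "meow"], ["dog", "woof"]]);

console.log(nth(list, 2));      // 2
console.log(array[2]);          // 2
console.log(sounds.get("cat")); // meow
```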

Functions are values, and they have a type. In JS, a function which adds two numbers together could be written as function(a, b) { return a + b; }.

Identities are a type, too. In fact, every type in the system has a corresponding type of identity. There’s a special type for “identities which point to floats”, and a type for “identities which point to lists”. Most languages let you pretend identities just are their values.

Organizing code

One of the hardest problems in writing large systems is managing complexity. You need to reduce the problem into manageable chunks–pieces you can reason about individually.

Functions (often called methods) are your first and most broadly useful tool. Every function should do one thing, and (like everything else) should have a short, meaningful name. If your function is longer than thirty lines or so, it’s too big. Find the logical borders or distinct phases and break them up into their own functions.

Above functions, languages vary in how they organize code. A common pattern is to group functions into a namespace, or module, or package. Inside the namespace, you use short names to refer to the functions. Outside the namespace, you refer to (or import, or require) functions and values from other namespaces as necessary. Think of namespaces like papers, each with citations to draw in ideas and proofs from other papers. Namespaces are usually hierarchical; they can be nested in other namespaces to break up large projects.
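
JavaScript’s real module system spreads namespaces across files, which is hard to show in one snippet. As a rough stand-in, this sketch builds a namespace out of a function and a plain object. The names geometry, area, perimeter, and summary are all invented for the example.

```javascript
// A hypothetical "geometry" namespace, built from a function that runs once
// and hands back only the names it wants to share.
const geometry = (function () {
  function area(width, height) {
    return width * height;
  }

  function perimeter(width, height) {
    return 2 * (width + height);
  }

  function summary(width, height) {
    // Inside the namespace, short names work.
    return "area " + area(width, height) + ", perimeter " + perimeter(width, height);
  }

  return { area, perimeter, summary };
})();

// Outside, we go through the namespace's name.
console.log(geometry.area(3, 4));    // 12
console.log(geometry.summary(3, 4)); // area 12, perimeter 14

// The short name alone means nothing out here.
console.log(typeof area); // undefined
```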

Object-oriented languages have the concept of classes of objects. An object is a map of keys to values, and some functions which operate on that map. The class defines what types of data are stored in an object, and defines the functions (also called methods) on the object. Each individual object (called an instance) has different data, but the same functions.

Objects and classes are typically bound up with the type system in some way. You might have an instance of the Rabbit class, which defines methods like hop. Rabbit might be a subclass of Animal, which has functions allowing rabbits and other kinds of animals to eat. The problem of organizing classes is… the subject of much debate.
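
The Rabbit and Animal example might be sketched like this in JavaScript. The name and meals fields, and the instance names flopsy and mopsy, are assumptions I’ve added so the objects have some data to hold.

```javascript
class Animal {
  constructor(name) {
    // Each instance gets its own data.
    this.name = name;
    this.meals = [];
  }

  // Every kind of animal can eat.
  eat(food) {
    this.meals.push(food);
    return this.name + " eats " + food;
  }
}

// Rabbit is a subclass: it inherits eat, and adds hop.
class Rabbit extends Animal {
  hop() {
    return this.name + " hops";
  }
}

const flopsy = new Rabbit("Flopsy");
const mopsy = new Rabbit("Mopsy");

console.log(flopsy.eat("clover")); // Flopsy eats clover
console.log(flopsy.hop());         // Flopsy hops

// Same functions, different data.
console.log(flopsy.meals.length); // 1
console.log(mopsy.meals.length);  // 0

// The class hierarchy is part of the type system.
console.log(flopsy instanceof Animal); // true
```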

Libraries are distinct collections of code geared towards solving a particular problem, like “working with geography” or “parsing natural language”. A library usually keeps its code inside a distinct namespace, so you can use it in other projects. Every language comes with a standard library built in, which defines the basic datatypes and functions everybody needs. There’s usually a package manager which helps you download other libraries from the internet and integrate them into your code.

Frameworks are giant libraries which provide a skeleton for your code to fill out. Rails is a framework for serving web pages, written in the Ruby language. Frameworks take care of organizing code and solving common problems for you, so you can focus on solving a particular problem. Like skeletons, frameworks shape your code in a particular way. You can’t build an elephant around a human skeleton; it’s just not the right shape to support that problem. When you choose a framework, it’s important to find one that’s designed for the problem you’re trying to solve.

Symbols

So far we’ve spoken in concrete terms, but solving a real problem requires abstraction. It requires names for things. In a program, you build up complex ideas from smaller pieces, by using symbols (also called identifiers or variables) to name them.

I want to make clear that symbols are a different level of a language from values. Before, we talked about real things: actual swans, the concept of flight. Now I’m talking about the words themselves, on the paper: the word “swan”, the word “fly”. In a sense, symbols are the pronouns of a language. Behind a symbol is a value. The word “she” can mean “Amelia Earhart”, or “Grace Hopper”. We infer the specific meaning of the symbol “she” from context.

In code, we need to talk about many ideas at once, and so our range of pronouns is essentially infinite.

subtotal = 5.25 + 1.40;
tax = subtotal * 0.07;
total = subtotal + tax;

5.25 is a literal value. It’s the number 5.25. subtotal is a symbol, a pronoun which refers to the value of 5.25 + 1.40. In the next line, we can use subtotal to stand for 5.25 + 1.40. tax and total are symbols too. Choosing simple, descriptive symbols helps us understand the meaning of the code.

In some languages like Clojure, Erlang, and Haskell, symbols refer directly to values. subtotal can’t change, because it is immutable. In languages like Ruby, JavaScript, and C, symbols refer to identities, which point to values. subtotal can change in those languages, because it’s an identity. It’s mutable.
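
Modern JavaScript lets you pick either behavior per symbol: let declares one that can be rebound, and const declares one that can’t. A small sketch, reusing the names from the example above (the 0.08 rate and the error variable are my additions):

```javascript
// const: the symbol is tied to this value for good.
const subtotal = 5.25 + 1.40;

// let: the symbol is an identity, and may point somewhere new later.
let tax = subtotal * 0.07;
tax = subtotal * 0.08; // Fine: tax now refers to a different value.

// Trying to rebind a const symbol fails when the line runs.
let error = null;
try {
  subtotal = 10;
} catch (e) {
  error = e;
}
console.log(error instanceof TypeError); // true
```

One caveat: const only fixes which value the symbol refers to. If that value is a mutable collection, its contents can still change.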

A symbol without a value is unbound: it represents the abstract potential for some value to come along. A symbol which has taken on a specific value is called bound. Now it stands for something. This is how functions work!

function add(a, b) {
  return a + b;
}

In this function, a and b are unbound symbols. They have no specific value, but that’s OK, because the computer doesn’t need to evaluate the function yet. When we call add(3, 5), we provide values for a and b. The computer evaluates a + b with a bound to 3 and b bound to 5.

Scope

In English, “he” is usually bound to the most recent male person in the text. If you start off a book with “he devoured the mouse whole”, we have no idea who did the devouring! The scope of a symbol is the region of text where that symbol is bound to a value.

Global scope means a symbol is bound everywhere. Depending on your Bible, capitalized “He” or “She” means God, no matter where in the text it appears. We might say that “He” is a global variable.

Most modern languages also use lexical scope, which means that a symbol is bound within an expression.

function add(a, b) {
  // a and b are bound in this function expression
  if (a > 2) {
    // And also in nested expressions
    return function() {
      // For instance, this new function
      return a + b;
    }
  }
}

// However, a and b are *not* bound here!
a + b; // Wrong!
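
To see the nested function actually use those bindings, here’s a sketch that calls add and then calls the function it hands back. The name later is made up; the add function is the one from above, trimmed of comments.

```javascript
function add(a, b) {
  if (a > 2) {
    return function () {
      // a and b are still bound here, even after add has returned.
      return a + b;
    };
  }
}

// add returns a function; nothing has been summed yet.
const later = add(3, 5);
console.log(typeof later); // function

// Calling it evaluates a + b with a bound to 3 and b bound to 5.
console.log(later()); // 8

// Out here, a has no binding at all.
console.log(typeof a); // undefined
```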

Lexical scope only applies to the code as written. Lexical symbols are not bound in other parts of the text. For instance:

function trouble() {
  // x isn't bound here
  return x + x; // Wrong!
}

function double(x) {
  // x is bound here
  return trouble();
}
double(2);

Inside double, x is bound to 2. double calls trouble–but because trouble is defined outside the scope of x, x is no longer bound. This program doesn’t work.

There is another kind of scope called dynamic scope, where this program does work. Dynamic scope means a symbol is bound anywhere in an expression–and also within any function calls that expression makes. Dynamic scope means you may not know where a variable comes from, which makes it harder to reason about. It’s used with restraint.
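
JavaScript is lexically scoped, so it can’t run the dynamic version directly. What follows is a rough emulation, not a real language feature: I keep a stack of bindings by hand, and the helpers bindings, lookup, and withBindings are all hypothetical names for this sketch.

```javascript
// A hand-rolled stack of dynamic bindings; the newest frame is last.
const bindings = [];

// Find the most recent binding for a name, searching newest to oldest.
function lookup(name) {
  for (let i = bindings.length - 1; i >= 0; i--) {
    if (Object.prototype.hasOwnProperty.call(bindings[i], name)) {
      return bindings[i][name];
    }
  }
  throw new ReferenceError(name + " is unbound");
}

// Bind some names for the duration of body, including any calls it makes.
function withBindings(frame, body) {
  bindings.push(frame);
  try {
    return body();
  } finally {
    bindings.pop();
  }
}

function trouble() {
  // x is found through whoever called us, not through where we were written.
  return lookup("x") + lookup("x");
}

function double(x) {
  return withBindings({ x: x }, trouble);
}

console.log(double(2)); // 4
```

Once double returns, the binding is popped, and calling trouble() on its own raises a ReferenceError. You can also see the downside: reading trouble by itself tells you nothing about where x comes from.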

If you learn an object-oriented language, you’ll hear about instance variables and class variables. These symbols are available within a particular instance of a class, or shared across the whole class itself. If you think of a class definition as being a class expression, surrounding an instance expression, then class and instance variables are just lexically scoped. The language may not write it that way, though.

Wrapping up

This guide doesn’t teach you how to program, but hopefully it explains why languages work the way they do. As you explore a specific language, keep an eye out for how the language writes and combines expressions. What are the basic types, and how do they work together? Look for functions, and how they’re named and organized. How are the symbols named, and do they refer to identities or values? What are the rules around scope? And don’t worry if this doesn’t make sense just yet. As you copy existing code, make a few small changes, and gradually write your own programs from scratch, these concepts will solidify. :)
