Introduction to data structures, complexity and algorithms in PHP

I don’t really know my complexity and data structures. So, as to mitigate two holes in my knowledge that shouldn’t be so deep I will explain complexity and relate it to some data structures.

Doesn’t O(n) or O(1) seem familiar? Big O is used to describe how complexity grows dependant on the number of items n in data for an algorithm.
O(n) describes a linear complexity. If you remember your early math classes then this means a straight growing line.

O(2*n) is also linear, only growing twice as fast. O(1) means that the size of n is of no importance — there is no added complexity for 1000 items than for 1. O(n²) means that the complexity grows by the square product, or in other words n=3 (9) is a lot worse than n=2 (4).

When I was even less confident of big O notation I got confused by what the heck O() would mean, just think of it as a symbol. It only tells you that the mathematical expression in the parenthesis is a description of how a given algorithm uses resources given size n data.

Lets look at complexity in conjunction with some data structures.

Array

<?php
$foo = array(
    0 => 'bar',
    1 => 'baz'
);

If you know the key in an array, what complexity does performing a lookup have?
Given a web page with a dropdown with $foo as its options:

<?php
function valueByKey($key) { return $foo[$key]; }
$value = valueByKey($_POST['foo']);

We have now demonstrated O(1) complexity! This algorithm will have a constant complexity regardless the size of $foo

O(n)
What if you have the value and need to do a search for the key?

<?php
for ($i = 0,$n = count($foo); $i < $n; $i++) {
    if ($value == $_POST['foo']) return $key;
}

As you can see this could potentially end in $n iterations, giving us O(n) complexity.
For inserts array list gives O(n) because inserts at any point except at the tail forces index shuffling potentialy n elements.

Linked List

Consider this crude implementation of linked list.

<?php
class Item {
    public $value, $key, $next = null;
    public function __construct($key, $value) {
        $this->key = $key;
        $this->value = $value;
    }
    public function next() {
        return $this->next;
    }
}

class LinkedList {
    protected $items = array();
    public $first=null,$last=null;
    public function append($key, $value) {
        $item = new Item($key, $value);
        $this->items[] = $item;
        if ($this->last) $this->last->next = $item;
        $this->last = $item;
        if ($this->first === null) $this->first = $item;
        return $this;
    }
    public function prepend($key, $value) {
        $item = new Item($key, $value);
        array_unshift($this->items, $item);
        if ($this->first) $item->next = $this->first;
        $this->first = $item;
        if ($this->last === null) $this->last = $item;
        return $this;
    }
}


class LinkedListTest extends PHPUnit_Framework_TestCase {
    public function testLinkedList() {
        $list = new LinkedList;
        $data = array('foo','bar');
        foreach ($data as $k => $v) $list->append($k, $v);
        $this->assertEquals($data[0], $list->first->value);
        $this->assertEquals($data[1], $list->last->value);
        $this->assertNull($list->last->next());
        // Prepend data
        $prepend = array('a','b');
        foreach (array_reverse($prepend) as $k => $v) $list->prepend($k, $v);
        $item = $list->first;
        foreach (array_merge($prepend, $data) as $value) {
            $this->assertEquals($value, $item->value);
            $item = $item->next();
        }
    }
}

Do you notice that a key lookup is O(n)? And the same for value search? So it performs worse than an array for lookup, and equal for search — working with PHP this seems like a useless data structure right? And mostly it makes little sense to use more complex data structures in high level languages, but inserting data into an array at a given index requires moving every value after that index. While in a linked list we only need to change the references of two objects. So array inserts are O(n) while linked list boosts O(1) inserts.

One could also argue that being able to iterate over an ordered list given only a member by doing next() is prettier — but thats irrelevant for this discussion.

Binary search

Both array list and linked list suffers when used for searching. Linear complexity is quite bad for searching long datasets. Imagine you tried to search the yellow pages for a name among millions of millions of names that are actually sorted. Surely you wouldn’t need to search through a,b,c,… when looking for something on z? This is where a binary search can come in handy. This is not a data structure, but its a good example for looking at complexity. You could use different datastructures as ordered list or binary search tree to search on.

Consider this ordered list: $list = array(0,1,2,5,7,9,12,15,19);

What if it contained 1000 items? 10000? 1000000? Its time to showcase a better level of complexity, even though it looks more complex!

O(log n) Logarithmic complexity. I find the easiest way to think about logarithms is to think of powers. Log n is the power you need to raise the logarithms base to get n. In computer science log uses base 2, thus log(8) is “The power we need to raise 2 to to get 8″. 2³ = 8. Log(8) == 3!

How does this relate to complexity you ask? Consider the following sizes of n:

  • 8: O(1) == 1, O(n) == 8, O(log n) == 3
  • 16 : O(1) == 1, O(n) == 16, O(log n) == 4
  • 128 : O(1) == 1, O(n) == 128, O(log n) == 7

Binary search has O(log n). The way it works is that it takes the middle element and inspects if that is either higher, lower or the one wanted. If is the one, return, else redo the split-check on the higher or lower part only. So where linear complexity chops of 1 item each time, logarithmic complexity chops of half the stack! So 1000 items go to 500 in one fell swoop!

You can see an example binary search implementation here.

Summary

So, we have hopefully learnt what the heck complexity and Big O is, and we have linked it to some algorithms / data structures. How do you use this? Every time you need to perform operations on data you should keep this in the back of your mind. Dont use an unordered list for searching if your data set is likely to be big and you expect more reading than writing, now you have the power to keep it ordered and make you searches potentially many, many times faster! And you can use this as vocabulary when discussing implementation with colleagues.

References: