Hash-Tables

Hash-tables are extremely useful data-structures and you should definitely prepare well this topic before your interview. It is the most likely possibility that in the interview there will be at least a question (if not more) that can be solved easily with hash-tables. The most two important sections that you should review before your interview are hash-tables and dynamic-programming.

Now that we pointed out the importance of this section, let us explain more how hash-tables work. You have seen them many times, and you have use them many times in your programs. In Python hash-tables are called dictionaries. In this article, we will take a high level look ‘under the hood’ to see how exactly they work as well as where and why are most useful.

A hash-table is similar to an array/list. We want to support an immediate random access. For a list ListA this means that if we want to get its third element we just type ListA[2], if we want to change the 17-th element of the same list we just assign ListA[16] to the new value that we want it to have, both of these happen in a constant time by the machine without depending on the size of the list or the position of the element that we want to interact with.

Now imagine that we want to create a phone book catalog with names corresponding to telephone numbers. We would like that given a name to have access to the corresponding phone number in constant time. One solution would be to enumerate all possible names that someone could put as an input. Then, we can create a list such that the corresponding telephone number of a given name would be at the index given from the previous enumeration. However, this list would be huge and in most practical cases it would be impossible to make it. Let us constrain ourselves with only English names using exactly 10 characters without using any other character except of a letter in the English alphabet, this would require a list of \(26^{10}\) elements. This an impractical big number for a list and it is without taking into account many more cases that we would have if we were building a real phone-book catalog (e.g. non-English names, white space between first and last names, varying size of input names etc). It is easy with so many combinations of possible inputs that no-modern machine could support such a big list. However, in most cases, we won’t need to store more than a couple of thousands names.

The solution is to “hash” the input names and the result of the “hash-function” would determine where to store the telephone number. The correct implementation of hash-function is tricky and we will avoid explain the details here. The idea is that it maps a big universe of possible inputs, all possible names in our previous example, to a much smaller universe. Take for example the function that takes an integer and returns the remainder of the integer divided by \(1000\). The input integers can be as many as we want, but the result of the function has only \(1000\) possible outcomes. The possible outcomes of a hash-function should be roughly the same as the number of elements that we want to store, i.e. the number of names in the phone-book example. Notice that because the possible inputs are many more than the possible outputs of the hash-function two different inputs can have the same output, this is called a collision and the implementation of the hash-table is responsible of dealing with with issue (which we avoid to explain here). The other observation is that the complexity of the hash-function does not depend on the size of the elements that we want to store, it depends only on the input of the hash-function which in most cases it is relative small so we can consider it to be constant.

Now that we have a good idea how hash-tables work it is time to see an example using it. Let imagine that we have a list of \(n\) integers and we want to find if there is a pair among these integers that their sum is equal to another given number \(k\) . The straightforward solution would be to calculate the sum of all possible pairs and check if any of them is equal to \(k\). All possible pairs are \(\frac{n\cdot (n-1)}{2}\), then the previous algorithm has complexity of \(O(n^2)\). Another, a bit more clever approach, is to first sort the list of \(n\) integers, and then for every of these integers \(m\) we do a binary-search for finding \(k-m\) (if we find such an integer then the sum of these two is \(k\)). The complexity of this algorithm is \(O(n\cdot \log(n))\) for the initial sorting and followed by \(n\) binary searches, with each binary search is \(O(\log(n))\). Hence the complexity of this algorithm is \(O(n\cdot\log(n))\).

We can notice that the second solution has many binary searches to find a given value of a list. Very fast look-up is the main benefit of hash-tables, this should make us think of using hash-tables. This algorithm takes all the \(n\) integers and stores them in a hash-table. Then, for each integer \(m\) we look if the integer \(k-m\) is the the hash-table. The complexity of this algorithm is \(O(n)\) for creating the hash-table and \(O(n)\) for the \(n\) searches, hence, the algorithm’s complexity is \(O(n)\). A good improvement over the initial algorithm. This speed up in the algorithm came from very fast look ups that we got by using hash-tables.