Topological Sort – A Complete Overview

This article is about a sorting algorithm on a directed graph’s vertices that commonly pops up in interview questions. If you need a refresher on graphs please refer to our intro to graphs article.

Formal Definition

Topological Sort is an ordering algorithm on a DAG’s (directed acyclic graph) vertices. Intuitively, we want to label n vertices with the numbers 1, 2, 3,..., n such that the “descendants” of a vertex (not necessarily direct descendants) are labeled a larger number than the parent. More formally:

A topological sort for a directed graph G=(V,E) with n vertices is a labeling function L: V \rightarrow [n] such that if the edge u \rightarrow v  exists then L[u] < L[v].

Examples

Valid Sort

A valid topological sort

The above example is a valid labeling of the vertices. You will notice that every vertex points to a vertex that has a higher number labeled than itself.

Invalid Sort

An invalid topological sort

The above example is an invalid labeling of the vertices. You can see that we have edges where the vertices are labeled 2 \rightarrow 0, 2 \rightarrow 1, 1 \rightarrow 0. In each of those we have a parent vertex that is pointing to a child vertex with a smaller number in the labeling. Just one of these is enough to cause this labeling to be invalid.

The Labeling Is Not Necessarily Unique

Note that you can possibly find more than one possible labeling.
For example here are 2 different topological sorts for the same graph:

2 different valid labelings for the same graph

Does a Topological Sort Always Exist?

A topological sort for a graph G=(V,E) exists if and only if G is a DAG (Directed Acyclic Graph).

In layman’s terms, if G has a cycle \Rightarrow no topological sort. (and vice versa)

To understand why let’s prove one direction: G has a cycle \Rightarrow no topological sort.
Let’s assume by contradiction that there is a valid labeling function L for the vertices. We know that G has a cycle. This means that there is a path v_1 \rightarrow v_2 \rightarrow v_3 \rightarrow ... \rightarrow v_1 that starts from some node v_1 and also ends at v_1.
Because L is a valid labeling function, we know this means that L[v_1] < L[v_2] < L[v_3] < ... < L[v_1] \Rightarrow L[v_1] < L[v_1], and this is impossible.

We’ll leave the proof of the opposite direction for now :).

Algorithm

import numpy as np

def topological_sort(graph: List[List[int]]):
    in_degrees = np.zeros(len(graph), dtype=int)

    # Go over all vertices and calculate the in degress
    for vertex_neighbor_list in graph:
        for vertex_neighbor in vertex_neighbor_list:
            in_degrees[vertex_neighbor] += 1

    # Create a list of sources
    sources = list(np.where(in_degrees == 0)[0])

    final_labeling = []
        
    while sources:
        curr_source = sources.pop()
        final_labeling.append(curr_source)

        for source_neighbor in graph[curr_source]:
            # Update the in degree of all the neighbors and add them to the sources list if they have no more incoming edges
            in_degrees[source_neighbor] -= 1
            if in_degrees[source_neighbor] == 0:
                sources.append(source_neighbor)

    return final_labeling

To understand the algorithm, we first have to understand that the algorithm maintains 2 data structures during the run.

  1. A sources data structure – A source is a vertex that has an in degree of zero. In our case we are using a Python list for this data structure, but we can use any data structure that let’s us add elements and remove an element (regardless of position) in O(1) time complexity.
  2. An in degrees array – This array has the same number of elements as the vertices of the graph. Each cell contains the in degree of a specific vertex. Note that the graph is directed, so there is a difference between in degree and out degree of a vertex.

The algorithm can be devided into 2 phases:

  1. Initialization phase – Calculate the in degrees and store in the array. Calculate the sources in the graph and store in the sources list.
  2. Iterative phase – We repeatedly remove an element from the source list. The removed element will be the next element in the toplogical sort. In addition, we remove the vertex from the graph and reduce the in degrees of all the vertexes it points to by 1. If there is a new source now, we add it to the sources list.

The algorithm ends when there are no more sources to remove. Note that the fact that the algorithm ended doesn’t mean that we were successful. If there are vertexes left in the graph that were not removed “as sources” if means we couldn’t find a topological sort.

To understand this let’s visualize the successful and unsuccessful case:

Successful Run

We first initalize the in degree array and sources list.
We removed a source (A) and updated the in degrees. We could also have removed B instead of A.
After removing B, C and D become new sources available for removal.
We could have also chosen C instead of D first.
Success.

Unsuccessful Run

D is the only source.
No more sources. We are stuck.

Runtime and Memory Complexity

Let’s analyze the different parts of the alogrithm in terms of runtime and memory complexity.

Runtime complexity:

  1. Initialize an in degrees array with zero values. O(V)
  2. Calculating the in degrees array – We are going over all of the vertices and all the edges for each vertex. In total we are going over all vertices and edges in the graph. O(V+E)
  3. Calculating and populating the source data structure by going over the in degrees array. O(V)
  4. In the final phase of the algorithm we are going over all sources in the data structure (potentially we can add all the vertices there) and for each one we are removing of its edges. O(V+E)

The total runtime complexity is: O(V+E).

Memory complexity:

  1. In degrees array which has the size of |V|. O(V)
  2. Sources data structure which potentially will contain all vertices. O(V)

The total memory complexity is: O(V).

Example Interview Questions

Cycle Detection

Given a directed graph G, write a function that detects whether G has a cycle or not.

Solution:

def has_cycle(graph)
    labeling = topological_sort(graph)
    return len(labeling) != len(graph)

We learned that a topological sort exists if and only if the directed graph has no cycle. We can check if our sorting algorithm succeed. Success in this case means the topological sort has labeled all the vertices in the graph and didn’t stop beforehand because it could not find any more sources. If that is the case, the graph has no cycle and we can return the result accordingly.

Runtime Complexity: As mentioned, the sorting has a runtime compexity of O(V+E).

Planning Tasks With Dependencies Completion

We are given a list of tasks that need to be done T_1, T_2, T_3, ..., T_n with m dependencies between them. We are given the dependencies as a list of tuples. The tuple (T_i, T_j) means T_i has to be completed before completing T_j.
Write an algorithm that schedules the right order for the tasks to be done.

Solution:

We can represent the problem as a graph using adjency lists. We build a graph where the tasks are the vertexes and the edges are the dependencies between them. This means an edge from T_i to T_j exists only if and only if T_i needs to be completed before completing T_j.
Running a topological sort on this graph will give us a possible scheduling for the tasks to be completed.

Formally:
G=(V,E)
V=\{T_1, T_2, T_3, ..., T_n\}
E=\{T_i \rightarrow T_j | T_i needs to be completed before T_j \}

def build_graph(num_vertices: int, edges: List[List[int]]):
    graph = [[] for _ in range(num_vertices)]

    for source, dest in edges:
        graph[source].append(dest)
    return graph

def task_planning(num_tasks, dependencies):
    graph = build_graph(num_tasks, dependencies)
    return topological_sort(graph)

Runtime complexity:

  1. Building the graph – Going over the tasks and the dependencies. O(n+m)
  2. Topological sort – O(V+E) and in the case of the question, O(n+m)

In total: O(n+m)