
I've written a very basic function that generates a frequency count for a given list:

myList = ["hello", "apply", "big", "apple", "tall", "apply"] -- input

myList = [("hello", 1), ("apply", 2), ("big", 1), ("apple", 1), ("tall", 1)] -- output

My function:

import Data.List (findIndices)

frequency :: (Eq a) => [a] -> ([(a, Int)] -> [(a, Int)]) -> [(a, Int)]
frequency li removeDup = removeDup $ map (\el -> (el, indexes el)) li
   where
      indexes el = length $ findIndices (== el) li

removeDuplicates :: (Eq a) => [(a, Int)] -> [(a, Int)]
removeDuplicates [] = []
removeDuplicates ((x1, x2) : xs) = (x1, x2) : removeDuplicates (filter (\(y1, y2) -> x1 /= y1) xs)

I'm trying to figure out the run-time complexity of my function. My initial thought was that it's O(n), as I'm first getting the 'indexes' value, then mapping, and finally filtering.

  1. Is this conclusion correct?

  2. Is there an O(1) method of performing the same function?


1 Answer


I will assume that you pass removeDuplicates to frequency, and will work with this slightly modified version of your code:

import Data.List (findIndices)

frequency :: (Eq a) => [a] -> [(a, Int)]
frequency li = removeDuplicates $ map (\el -> (el, indexes el)) li
    where
      -- indexes el: how many times el occurs in li
      indexes el = length $ findIndices (== el) li

removeDuplicates :: (Eq a) => [(a, Int)] -> [(a, Int)]
removeDuplicates [] = []
removeDuplicates ((x1, x2) : xs) =
    (x1, x2) : removeDuplicates (filter (\(y1, y2) -> x1 /= y1) xs)
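
For reference, loading this version into GHCi and applying it to the example list from your question gives the output you expected:

ghci> frequency ["hello", "apply", "big", "apple", "tall", "apply"]
[("hello",1),("apply",2),("big",1),("apple",1),("tall",1)]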

Let's look at what each part of frequency is doing:

map (\el -> (el, indexes el)) li

As you allude to, map f li is, in principle, O(n) in the length of the list li. That, however, only holds if the complexity of f does not depend on li. For that reason, we need to double-check the function being mapped:

\el -> (el, indexes el)

Substituting the definition of indexes, we get:

\el -> (el, length $ findIndices (== el) li)

findIndices is O(n) in the length of the list, as it needs to test every element, so this function is at least O(n) in the length of li. length is linear as well, but in the length of its own argument: the list of found indices, which in the worst case (when all elements are equal to el) also has n elements. Since findIndices is already O(n), the extra length pass does not change the overall complexity. Finally, building the pair is a constant-time step and unproblematic.

We can thus conclude \el -> (el, indexes el) is O(n) in the length of li. That being so, map (\el -> (el, indexes el)) li is actually O(n^2) in the length of li, as it performs an O(n) operation n times.
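
To make that concrete: for the six-element example list in the question, each of the six calls to indexes walks all six elements, giving 36 comparisons; doubling the input to twelve elements would give 144, quadrupling the work.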

removeDuplicates

Let's focus on the recursive case:

(x1, x2) : removeDuplicates (filter (\(y1, y2) -> x1 /= y1) xs)

The key operation here is the filtering, which is O(n) in the length of xs. The filtering is done once per element of li. Now, even though xs gets shorter as we move towards the end of the list, the average length of xs is proportional to the length of li. That being so, we are once more performing an O(n) operation (in the length of li) n times, which means removeDuplicates is O(n^2) -- just like nub from Data.List. (Another way of reaching the same conclusion would be noticing that removeDuplicates compares each element with every other element, resulting in n*(n-1)/2 comparisons.)
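
Concretely, for the six pairs produced from the example list, the successive filters perform at most 5 + 4 + 3 + 2 + 1 = 15 comparisons, which matches n*(n-1)/2 for n = 6.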

frequency li = removeDuplicates $ map (\el -> (el, indexes el)) li

frequency consists of an O(n^2) operation followed by another O(n^2) operation; therefore, it is O(n^2) in the length of the list.


Is there an O(1) method of performing the same function?

O(1) is impossible, as there is no getting around the need to do something with each element of the list. It is certainly possible to do better than O(n^2), though. For instance, by sorting the list you would avoid the need to compare each element with all the others (as happens both in map (\el -> (el, indexes el)) li and in removeDuplicates), since in a sorted list only elements next to each other can possibly be equal. For a concrete example, this function...

group . sort

... is O(n*log(n)) (sort from Data.List is O(n*log(n)), and group is O(n), as it only needs to compare each element to the next one).
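
As a minimal sketch of that idea (note the stronger Ord constraint that sorting requires, and that the result comes out in sorted order rather than in order of first appearance; the name frequencySort is just for illustration):

import Data.List (group, sort)

-- Sorting brings equal elements together, so each run produced by
-- group can be tallied with a single length pass.
frequencySort :: Ord a => [a] -> [(a, Int)]
frequencySort = map (\g -> (head g, length g)) . group . sort

ghci> frequencySort ["hello", "apply", "big", "apple", "tall", "apply"]
[("apple",1),("apply",2),("big",1),("hello",1),("tall",1)]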

P.S.: This is probably beside the point for what you are trying to do, but for something entirely different, you might want to experiment with using a dictionary to keep track of the tallies. That would make an effectively linear frequency possible, which should pay off performance-wise if you need to handle large input lists.
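
For instance, here is a sketch using Data.Map.Strict from the containers package (frequencyMap is a name chosen just for this example; fromListWith is O(n*log(n)), and a hash-based map, such as the one in unordered-containers, would get closer to truly linear):

import qualified Data.Map.Strict as Map

-- Pair each element with a count of 1, then let the map add up
-- the counts of equal keys as it is built.
frequencyMap :: Ord a => [a] -> [(a, Int)]
frequencyMap = Map.toList . Map.fromListWith (+) . map (\el -> (el, 1))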

  • I like this answer! Do you know of any good resources that discuss techniques for identifying and improving run-time performance? I've gone over the basics of big O, but haven't found anything that discusses strategies for maximizing run-time efficiency. – Babra Cunningham Nov 21 '16 at 16:59
  • 1
  • @BabraCunningham [1/2] That can be a pretty broad topic. A good slice of it amounts to knowing your algorithms and, crucially, your data structures. Relevant sources range from the documentation of libraries and brief overviews of common choices of data structures (such as [this](https://en.wikibooks.org/wiki/Haskell/Libraries/Data_structures_primer) and [this](http://stackoverflow.com/q/9611904/2751851)), to [in-depth discussion](http://stackoverflow.com/q/1990464/2751851) of how said data structures are implemented. – duplode Nov 21 '16 at 19:17
  • 1
  • @BabraCunningham [2/2] Another related topic is the assortment of practical tricks you will want to use when, at some point in the future, you find yourself having to debug performance issues. [Chapter 25 of *Real World Haskell*](http://book.realworldhaskell.org/read/profiling-and-optimization.html) and [*Anatomy of a thunk leak*](http://blog.ezyang.com/2011/05/anatomy-of-a-thunk-leak/) (plus related posts in Edward Z. Yang's blog) are two links you will want to save for later. – duplode Nov 21 '16 at 19:22