Most real-world data is not uniformly random; it is usually already nearly sorted. A sorting algorithm that always takes O(n log n) time, even when its input is nearly sorted, will not perform as well on real-world data as one that takes advantage of the existing order.
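As a rough illustration (a minimal Python sketch, not any particular library's implementation), insertion sort is a classic adaptive algorithm: its cost is roughly proportional to n plus the number of out-of-place pairs, so a nearly sorted input costs close to O(n) while a random one costs O(n^2):

```python
def insertion_sort(items):
    """Adaptive sort: cost grows with n plus the number of inversions,
    so nearly sorted input finishes in close to linear time."""
    a = list(items)
    for i in range(1, len(a)):
        current = a[i]
        j = i - 1
        # Shift larger elements right; on nearly sorted data this inner
        # loop rarely runs more than a few steps per element.
        while j >= 0 and a[j] > current:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = current
    return a
```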
Take, for example, sorting a log file by the datetime of each entry. Because entries are written as events happen, most of them are already close to their correct position, with only a few out of place due to concurrent writes. A log file can be extremely large, on the order of gigabytes or more, so a sorting algorithm that does not take advantage of the nearly sorted state of the file is far less efficient than it could be.
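A sketch of that case (the log format and field layout here are invented for illustration): sorting log lines by their timestamp with Python's built-in `sorted` already benefits from this, because CPython's sort is Timsort, which detects existing sorted runs and only does extra work where entries are out of place:

```python
from datetime import datetime

def sort_log_lines(lines):
    """Sort log lines by their leading ISO-8601 timestamp.

    Python's built-in sort (Timsort) is adaptive: if most lines are
    already in order, it finds the long sorted runs and finishes in
    close to linear time instead of a full O(n log n) sort.
    """
    def timestamp(line):
        # Assumes each line starts with e.g. "2024-05-01T12:00:00 ..."
        return datetime.fromisoformat(line.split(" ", 1)[0])
    return sorted(lines, key=timestamp)

# Example: entries are mostly in order, with one late arrival.
log = [
    "2024-05-01T12:00:00 service started",
    "2024-05-01T12:00:02 request handled",
    "2024-05-01T12:00:01 config reloaded",   # slightly out of place
    "2024-05-01T12:00:03 request handled",
]
print(sort_log_lines(log))
```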
Log files raise another case: in a distributed system, multiple machines produce log entries concurrently. Each individual log file is sorted (or nearly sorted), but you want to merge them into a single linear log containing every event from every system. You can simply concatenate the logs, and if the sorting algorithm recognizes that the data consists of long spans of already sorted entries, it can perform a merge that is close to O(n) for a small number of files rather than a full O(n log n) sort.
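A sketch of the merge case, assuming each input file is already sorted by a timestamp prefix (the file names below are hypothetical): Python's `heapq.merge` does a streaming k-way merge in O(n log k), which for a handful of files is effectively linear and never re-sorts the combined data:

```python
import heapq

def merge_sorted_logs(*log_paths):
    """Lazily merge several individually sorted log files into one
    chronologically ordered stream.

    heapq.merge performs a k-way merge: each of the n total entries
    passes through a heap of size k, costing O(n log k) rather than
    the O(n log n) of re-sorting the concatenated logs.
    """
    streams = [open(path, "r") for path in log_paths]
    try:
        # Assumes each line starts with a sortable timestamp, so plain
        # string comparison gives chronological order.
        for line in heapq.merge(*streams):
            yield line
    finally:
        for f in streams:
            f.close()

# Usage (hypothetical file names):
# for entry in merge_sorted_logs("server-1.log", "server-2.log"):
#     print(entry, end="")
```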