What influences the sort order of mrjob output?

Question

I have a project based on mrjob, with automated tests. One test runs mrjob locally against known input, and asserts the actual output matches expected output.

The issue is that the test passes in the development environment but fails in continuous integration. The failure is due to the sort order of the output lines.

What can I do to make sure that the output is sorted consistently across environments (without sorting the files in bash manually)? I already sort the input files consistently.

I checked the following between dev and CI: the OS versions are the same (well, almost: Ubuntu 14.04.3 vs 14.04.2), the Python versions are the same (2.7.6), and the locales are the same (en_US.UTF-8).

FWIW I start the job programmatically like so:

mr_job = myMrjob(args=args)
with mr_job.make_runner() as runner, open(output_filename, 'w') as fout:
    runner.run()
    for line in runner.stream_output():
        fout.write(line)

[Warning: Do not let your tests depend on the input lines being processed in a certain order. Input is divided nondeterministically by the local, hadoop, and emr runners.](https://pythonhosted.org/mrjob/guides/testing.html) — Patrick Maupin, Aug 10 '15 at 21:56
Not familiar with mrjob, but a search of their docs for "sort" yielded [SORT_VALUES](https://pythonhosted.org/mrjob/job.html?highlight=sort#mrjob.job.MRJob.SORT_VALUES) and [SECONDARY_SORT](https://pythonhosted.org/mrjob/job.html?highlight=sort#secondary-sort) — Chris Montanaro, Aug 10 '15 at 22:43
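
Following the warning quoted in the first comment, one option is to make the test itself independent of line order rather than trying to force a deterministic order out of the runner. The sketch below is only an illustration: the helper names `same_lines_any_order` and `sorted_lines` are hypothetical and not part of mrjob, and it assumes the output fits in memory and that each line can be compared as a plain string.

```python
from collections import Counter


def _strip_newline(line):
    # Drop only the trailing line terminator so "a\n" and "a" compare equal.
    return line.rstrip("\r\n")


def same_lines_any_order(actual_lines, expected_lines):
    """Return True if both iterables contain the same lines, ignoring order.

    Counter is used instead of set so that duplicated lines must appear
    the same number of times on both sides.
    """
    actual = Counter(_strip_newline(line) for line in actual_lines)
    expected = Counter(_strip_newline(line) for line in expected_lines)
    return actual == expected


def sorted_lines(lines):
    """Return the lines sorted with Python's default string comparison.

    The default comparison works on code points (or byte values), so it
    does not depend on the locale of the machine running the test.
    """
    return sorted(_strip_newline(line) for line in lines)
```

In the snippet from the question, the lines yielded by `runner.stream_output()` could be collected into a list and passed as `actual_lines`, with the lines of the expected-output file as `expected_lines`. Alternatively, `sorted_lines` could be applied before writing to `fout`, so the file on disk has the same order in every environment.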

0 Answers