
I have a 300-line jq program which runs for literally hours on the files I deal with (plain lists of 200K-2.5M JSON objects, 500MB-6GB in size).

At first glance the code looks linear in complexity, but I could easily be missing something.

What are the most common traps to be aware of in terms of code complexity in jq? And are there tools to identify the key bottlenecks in my code?

I'm a bit reluctant to make my code public, because of its size and complexity on one hand, and its somewhat proprietary nature on the other.

PS. Trimming the input file to keep only the most relevant objects, AND pre-deflating it to keep only the fields I need, are obvious steps towards optimizing my processing flow. I'm wondering what can be done specifically on the query-complexity side.

wass rubleff

2 Answers


Since you are evidently not a beginner, the likelihood of your making beginners' mistakes seems small, so if you cannot figure out a way to share some details about your program and data, you might try breaking up the program so you can see where the computing resources are being consumed. Well-placed debug statements can be helpful in that regard.
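For instance, a small checkpoint filter along these lines (the name checkpoint is mine, not a builtin, and .records and .key stand in for your own filters) makes it easy to bracket suspect sections: each marker goes to stderr while the data passes through untouched.

# Log a marker to stderr, then pass the input through unchanged.
def checkpoint($msg): ($msg | debug | empty), .;

.records
| checkpoint("after selection")
| group_by(.key)
| checkpoint("after group_by")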

The following filters for computing the elapsed clock time might also be helpful:

def time(f):
  now as $start
  | f as $out
  # log the elapsed wall-clock seconds to stderr; `empty` keeps the timing out of stdout
  | ((now - $start) | stderr | empty), $out;

def time(f; $msg):
  now as $start
  | f as $out
  | ("\(now - $start): \($msg)" | stderr | empty), $out;

Example

def ack(m;n):
  m as $m | n as $n
  | if $m == 0 then $n + 1
    elif $n == 0 then ack($m-1; 1)
    else ack($m-1; ack($m; $n-1))
    end ;

time( ack(3;7) | debug)

Output:

["DEBUG:",1021]
0.7642250061035156
1021
peak

Often, a program that takes longer than expected is also producing incorrect results, so perhaps the first thing to check is that the results are correct. If they are, then the following might be worth checking:

  • avoid slurping (i.e., use input and/or inputs in preference) — see the first sketch after this list;
  • beware of functions with arity greater than 0 that call themselves;
  • avoid recomputing intermediate results unnecessarily, e.g. by storing them in $-variables, or by including them in a filter's input;
  • use functions with "short-circuit" semantics when possible, notably any and all (sketch below);
  • use limit/2, first/1, and/or foreach as appropriate;
  • the implementation of index/1 on arrays can be a problem for large arrays, as it first computes all the indices;
  • remember that unique and group_by should be used carefully, since both involve a sort;
  • use bsearch for insertion and for binary search for an item in a sorted array (sketch below);
  • using JSON objects as dictionaries is generally a good idea (sketch below).
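On the first point, a minimal sketch of the difference, assuming the file is a stream of top-level objects and that big.json and the .amount field are illustrative. Slurping builds one giant in-memory array before any work starts:

jq -s 'map(.amount) | add' big.json

whereas -n with inputs folds the objects one at a time, holding only one in memory:

jq -n 'reduce inputs as $obj (0; . + $obj.amount)' big.json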
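On the short-circuit point, any/2 and all/2 take a generator and stop as soon as the answer is known, whereas building an array first forces every element to be evaluated (the .status test is illustrative):

any(.[]; .status == "error")            # stops at the first match

[ .[] | .status == "error" ] | any      # evaluates everything first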
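As for bsearch, it expects a sorted array: a non-negative result is the index of a match, and a negative result encodes the insertion point as (-1 - ix):

[2,5,8,13] | bsearch(8)    # => 2

[2,5,8,13] | bsearch(9)    # => -4, i.e. the insertion point is 3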
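And on the last point, the builtin INDEX/1 (available since jq 1.6) builds such a dictionary in a single pass, after which each access is an object lookup rather than a linear scan (the .id field and the key "user42" are illustrative; INDEX stringifies its keys):

INDEX(.id) as $byId
| $byId["user42"]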

Note also that the streaming parser (invoked with the --stream option) is designed to make the tradeoff between time and space in favor of the latter. It succeeds!
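For example, the following well-known idiom uses the streaming parser to turn a single top-level array into a stream of its elements, without ever holding the whole array in memory (huge.json is illustrative):

jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' huge.json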

Finally, jq is stream-oriented, and using streams is sometimes more efficient than using arrays.
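As a final sketch of that contrast (the .price field is illustrative):

[ .[] | .price ] | add                       # materializes the whole intermediate array

reduce (.[] | .price) as $p (0; . + $p)      # folds each value as it is produced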

peak
  • _avoid recomputing intermediate results unnecessarily, e.g. by storing them in $-variables, or by including them in a filter's input._ -- I wonder what counts as "recomputing". I DO store intermediate results as $-variables a lot; they are small in size, but some theoretically require a full scan of the entire input, and variables calculated later depend on values prepared before them. Any suggestions here beyond limit/2? – wass rubleff May 19 '21 at 08:32
  • My original input is guaranteed to be sorted by the key I select by (using index/1 and select(key==value / >value / – wass rubleff May 19 '21 at 08:35
  • Does `first/0` work any faster than `.[0]`? – wass rubleff May 19 '21 at 08:42
  • No. See builtin.jq (https://github.com/stedolan/jq/blob/master/src/builtin.jq) for the defs of built-in functions. – peak May 19 '21 at 08:44
  • Be aware of the built-in bsearch. See also the updates regarding unique, group_by, index, and stream-orientation. – peak May 19 '21 at 08:50
  • _group_by should be used carefully since ... involve a sort_ -- That's my case: I `group_by` the entire input by the very key it is pre-sorted on, then keep only a small number N of groups from the head / tail of the resulting array (but there is no way to estimate how large the input part should be in order to definitely have N unique elements). Is there any way to give `group_by` a hint that its input is pre-sorted? – wass rubleff May 19 '21 at 09:01
  • You could use GROUPS_BY as defined at https://stackoverflow.com/questions/48321235 ; you might also wish to review some of the recipes in the jq Cookbook (google terms: `jq` `cookbook`) – peak May 19 '21 at 14:53
  • For now, I solved the problem by trimming the input before processing down to the 2000 most relevant records. I will try GROUPS_BY in the future if trimming turns out to be a bad choice in some situations. Thank you again! – wass rubleff May 19 '21 at 17:33