
From my Ruby script I am spawning a PhantomJS process. That process will return a JSON string to Ruby, and Ruby will parse it.

Everything works fine, but when the PhantomJS script returns a huge JSON string (more than 10,000 entries), Ruby doesn't seem to handle it: it doesn't receive the whole string; the output gets cut off.

An entry in the JSON looks like this (but some entries can have about 5 more attributes):

{"id" : 3, "nSteps" : 5, "class" : "class-name", "width" : 300, "height" : 500, "source" : "this is a big string", "nMove" : 10, "id-name" : "this is a big string", "checked" : true, "visible" : false}

This is the code I have right now:

@pid = Process.spawn("phantomjs",
                       "myparser.js",
                       :out => pipe_cmd_out, :err => pipe_cmd_out
                      )
  Timeout.timeout(400) do
    Process.wait(@pid)
    pipe_cmd_out.close
    output = pipe_cmd_in.read
    return JSON.parse(output)
  end

Is there any way I can read the JSON by chunks or somehow increase the buffer limit of the pipe?

EDIT:

In order to send the data from PhantomJS to Ruby, I have the following at the very end of my PhantomJS script:

console.log(JSON.stringify(data));
phantom.exit();

If I launch the PhantomJS script from the terminal I get the JSON correctly. However, when I do it from within Ruby, the response gets cut.

The size of the string being passed to console.log when it breaks is 132648 characters.

EDIT:

I think I found the exact problem. When I specify an :out when spawning the process, and the returned JSON is big (132648 characters), Ruby can't read all of it. So when doing this:

reader, writer = IO.pipe
pid = Process.spawn("phantomjs",
                    "script.js",
                    :out => writer
                    )
Timeout.timeout(100) do
  Process.wait(pid)
  writer.close
  output = reader.read
  json_output = JSON.parse(output)
end

It won't work.

But if I let PhantomJS just write to its default stdout, it outputs the JSON correctly. So doing this:

reader, writer = IO.pipe
pid = Process.spawn("phantomjs",
                    "script.js"
                    )
Timeout.timeout(100) do
  Process.wait(pid)
  writer.close
  output = reader.read
  json_output = JSON.parse(output)
end

It will print the results to the terminal correctly. So I believe the problem is that either the big JSON somehow isn't written to the pipe correctly, or the Ruby reader doesn't know how to read it.

    Define "big" and "huge". – the Tin Man Jul 17 '14 at 15:08
  • @theTinMan more than 10,000 entries in the JSON. – Hommer Smith Jul 17 '14 at 15:10
  • That doesn't tell us how big the string is, only how many entries there are. How big is an entry? – the Tin Man Jul 17 '14 at 15:11
  • Edited with a sample of an entry @theTinMan. I have not been able to find what the buffer limit is in the docs, though. – Hommer Smith Jul 17 '14 at 15:15
  • Ruby's strings can be *HUGE*, [as large as available memory](http://stackoverflow.com/a/3638854/128421). Creating, or filling, a string that size can take a while, but it's easily tested: `foo = 'a' * 1_000_000_000; foo.size # => 1000000000`. That's not the maximum, but sufficiently large to show it's most likely NOT a buffer-size problem. – the Tin Man Jul 17 '14 at 15:28
  • @theTinMan but even though Ruby is able to handle big strings, what about the pipe that connects Ruby and the spawned process. Where can I find what limits the buffer has? – Hommer Smith Jul 17 '14 at 15:32

1 Answer


I suspect the problem isn't Ruby, but either how or when you're reading.

It could be that PhantomJS hasn't finished sending the output before you read, leading to a partial response. Try routing the output to a file to determine its size in bytes. That will tell you whether PhantomJS is completing its task and closing the JSON correctly, and will let you know how many bytes you can expect to see in the buffer.
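For example, a quick sketch of that check (the output path here is just an illustration):

pid = Process.spawn("phantomjs", "myparser.js", :out => "/tmp/phantom_out.json")
Process.wait(pid)
puts File.size("/tmp/phantom_out.json") # compare against the 132648 you saw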


...what about the pipe that connects Ruby and the spawned process. Where can I find what limits the buffer has?

Digging around in the dark corners of my memory to root out how this'd work...

This should be reasonably accurate from what I remember: the receiving side buffers incoming data until its buffer is full, then tells the sender to stop. Once the buffer is clear, because the script has read it, the sender is told to resume. The same flow control applies to the local pipe connecting Ruby to the spawned process, except there the buffer is a fixed-size kernel buffer (commonly 64 KB on Linux), and a writer that fills it simply blocks until the reader drains it. So, even if there are multiple GB of data pending, it won't all be sent at once unless the script is reading continuously from the buffer to keep it clear.

When the script does a read, read doesn't grab only what's in the buffer; it wants to see an EOF, and it shouldn't see that until the sender has finished and every write end of the pipe is closed. The Ruby I/O subsystem reads the data in chunks as it arrives and appends them to the string. In other words, your script should pause and wait until all the data has been transferred, since read is a blocking call. One consequence of this: if you call Process.wait before you read, and the child has more output than fits in the pipe buffer, the child blocks writing while you block waiting for it to exit, and neither side makes progress.
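A minimal sketch of the safe ordering, assuming the same IO.pipe setup from the question: drain the pipe first, then reap the child:

require 'json'

reader, writer = IO.pipe
pid = Process.spawn("phantomjs", "script.js", :out => writer)
writer.close           # close the parent's copy of the write end, or read never sees EOF
output = reader.read   # drain the pipe *before* waiting; returns once PhantomJS closes its stdout
Process.wait(pid)      # the child can now exit; reap it
json_output = JSON.parse(output)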

There are different ways of handling I/O. You're slurping the data, which is what we call it when we read everything in one step. That's not scalable because, if the incoming data is larger than your available memory, you've got a problem. Possibly you should be doing incremental reads, storing the data in a temporary file, then streaming that into a JSON parser that accepts an IO, such as YAJL, but the first step is to determine whether you're actually getting the complete JSON string.
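For instance, a sketch using the yajl-ruby gem, which can parse straight from an IO instead of a slurped string (assuming the gem is installed):

require 'yajl' # gem install yajl-ruby

reader, writer = IO.pipe
pid = Process.spawn("phantomjs", "script.js", :out => writer)
writer.close
data = Yajl::Parser.parse(reader) # reads from the pipe in chunks rather than slurping
Process.wait(pid)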

An alternate way of dealing with the problem is to request smaller sets of data, then reassemble them in your script. Just as requesting every record from a database via SQL is a bad idea, because it isn't scalable and beats up the DBM, maybe you should request your JSON data in pages or blocks, and only process the immediately necessary results.
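As a rough sketch of that idea (the page argument is hypothetical; your PhantomJS script would have to honor it and emit an empty array when it runs out of data):

require 'json'

results = []
page = 0
loop do
  # IO.popen collects the child's entire stdout for us
  output = IO.popen(["phantomjs", "myparser.js", page.to_s], &:read)
  entries = JSON.parse(output)
  break if entries.empty?
  results.concat(entries)
  page += 1
end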


This might be it...

@pid = Process.spawn("phantomjs",
                       "myparser.js",
                       :out => pipe_cmd_out, :err => pipe_cmd_out
                      )
  Timeout.timeout(400) do
    Process.wait(@pid)
    pipe_cmd_out.close
    output = pipe_cmd_in.read
    return JSON.parse(output)
  end

The STDOUT and STDERR of PhantomJS are both being redirected to pipe_cmd_out, but you close that stream with pipe_cmd_out.close, then try to read pipe_cmd_in, which isn't defined. That all seems wrong. I think you should do a pipe_cmd_in.close, then pipe_cmd_out.read:

@pid = Process.spawn(
  "phantomjs",
  "myparser.js",
  :in => pipe_cmd_in,
  :out => pipe_cmd_out, 
  :err => pipe_cmd_out
)
Timeout.timeout(400) do
  pipe_cmd_in.close
  Process.wait(@pid)
  output = pipe_cmd_out.read
  return JSON.parse(output)
end
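The snippet assumes the pipes already exist. One way to wire them up, remembering that IO.pipe returns a [reader, writer] pair and that the parent must close the child's ends after spawning:

in_read, pipe_cmd_in = IO.pipe    # parent writes pipe_cmd_in  -> child's STDIN
pipe_cmd_out, out_write = IO.pipe # child's STDOUT -> parent reads pipe_cmd_out
pipe_cmd_err, err_write = IO.pipe # child's STDERR -> parent reads pipe_cmd_err

@pid = Process.spawn("phantomjs", "myparser.js",
                     :in => in_read, :out => out_write, :err => err_write)

# Close the child's ends in the parent, or the parent's reads will never see EOF.
[in_read, out_write, err_write].each(&:close)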

Be careful trying to parse the STDERR output, though. It's likely not to be JSON, and JSON.parse will raise a JSON::ParserError when the parser throws up.

We close the output of our script/the input to the command-line application because a lot of command-line tools that read from STDIN will hang until their STDIN is closed. That's what pipe_cmd_in.close does: it closes PhantomJS's STDIN and signals that it should begin processing. Then, when PhantomJS writes to its STDOUT, your script can see that via the stream available in pipe_cmd_out.

And, rather than double up the output for STDOUT and STDERR into one variable, I'd probably use:

@pid = Process.spawn(
  "phantomjs",
  "myparser.js",
  :in => pipe_cmd_in,
  :out => pipe_cmd_out, 
  :err => pipe_cmd_err
)
Timeout.timeout(400) do
  pipe_cmd_in.close
  Process.wait(@pid)
  output = pipe_cmd_out.read
  if output.empty?
    pipe_cmd_err.read
  else
    JSON.parse(output)
  end
end

The code that calls the above code would need to check whether the return value is a String, an Array, or a Hash. If it's a String, an error occurred; if it's one of the latter two, the call succeeded and you can iterate over the results.
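For example (run_phantom here is a hypothetical wrapper around the block above):

result = run_phantom

case result
when String
  warn "PhantomJS reported an error: #{result}"
when Array, Hash
  result.each { |entry| process(entry) } # process is hypothetical too
end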

  • Interesting. From the phantom side I actually have the finished string and I output it like this: `console.log(JSON.stringify(parsedData));` and right after that I exit the process, so nothing else is written there. So maybe it's actually a problem within console.log... – Hommer Smith Jul 17 '14 at 15:28
  • I tried running the PhantomJS script from the terminal and it outputs the JSON correctly (not cut), but when spawning the same script from Ruby, the output gets cut. So it's not the console.log, but something with the pipe, right? – Hommer Smith Jul 17 '14 at 15:39
  • I don't fully understand why I need to close the output stream earlier. Notice that pipe_cmd_out is also the :err: `:err => pipe_cmd_out` – Hommer Smith Jul 17 '14 at 16:08
  • I made your changes but I get this error ``spawn': wrong exec redirect action (ArgumentError)`` when doing this: https://gist.github.com/anonymous/51373c40e0520c4a8e3c I guess it's because IO.pipe just returns two objects. How would you define the err pipe? – Hommer Smith Jul 17 '14 at 16:36
  • And without assigning the pipes when spawning the process: https://gist.github.com/anonymous/7bd76ea81c5c7d611da7 works for all cases except when the JSON is big. – Hommer Smith Jul 17 '14 at 16:45
  • I also realize I can't close the writer pipe before the child process has finished. Otherwise, it would not be able to write to the pipe... – Hommer Smith Jul 17 '14 at 17:22
  • Please don't store code on other sites, especially code that shows your debugging or progress. *WHEN* the link breaks your code will be lost to those hoping to get solutions to the same problem. Instead, append the code to your question by editing it. – the Tin Man Jul 17 '14 at 17:35
  • I added some edits. I think the problem arises when a specific :out is passed. PhantomJS works well if it writes to its default stdout, but somehow things get messed up if it has to write to a specific pipe. – Hommer Smith Jul 17 '14 at 18:18