
This is my first Catalyst app and I'm not sure how to solve the following problem.

The user enters some data in a form and selects a file (up to 100MB) for uploading. After submitting the form, the actual computation takes up to 5 minutes and the results are stored in a DB.

What I want to do is run this process (and maybe also the file upload) in the background to avoid a server timeout. There should be some kind of feedback to the user (like a message "Job has been started" or a progress bar). The form should be blocked while the job is still running. A result page should be displayed once the job has finished.

In hours of reading I stumbled upon concepts like asynchronous requests, job queues, daemons, Gearman, or Catalyst::Plugin::RunAfterRequest.

How would you do it? Thanks for helping a web dev novice!

PS: In my current local app the work is done in parallel with Parallel::ForkManager. For the real app, would it be advisable to use a cloud computing service like Amazon EC2? Or just find a hoster who offers multi-core servers?

sega.dev
  • Doing the upload as an asynchronous request would make sense. Return a job ID and have the Action set a flag in a Model when it's done. Then have your page poll the backend asynchronously at regular intervals (like every 10s) and if it gets a _done_, refresh the page. I'll type up an answer in a bit. – simbabque Jan 19 '17 at 13:29
  • With regards to your hoster question it really depends on the use case. Amazon or other cloud services have the advantage of being easy to scale up if necessary, but may be more expensive than having your own server, in addition to other considerations. That question should probably best be posted separately and elsewhere. – bytepusher Jan 19 '17 at 22:14
  • @simbabque If you have the time, hints for useful tools/plugins or some example code would be very helpful. Thanks bytepusher, I'll consider that when the app goes into production – sega.dev Jan 20 '17 at 11:05
  • The answer from Julien below is helpful in terms of the pure backend. But we need to know a bit more. You are talking about parallel processing and stuff like that. Does your application have an academic research background? Is it ok if it takes a while because there are not many users? Or is this commercial and should be as fast as possible? – simbabque Jan 20 '17 at 11:09
  • It is a bioinformatics app for research purposes. Maybe later also for commercial use. Basically it searches for text strings in a large file, extracts sequences, stores results in the DB, and visualizes them. In the initial phase, the app will mostly be used by only 1 or 2 people simultaneously. Anyway, in my local dev environment I use 32 threads with ForkManager so that one run takes approx. 1 minute, which is ok. The run time increases sevenfold if the task isn't parallelized. – sega.dev Jan 20 '17 at 11:33

2 Answers


I couldn't figure out how to use File::Queue. For non-blocking parallel execution, I ended up using a combination of TheSchwartz and Parallel::Prefork, as implemented in the Foorum Catalyst app. Basically, there are 5 important elements. Maybe this summary will be helpful to others.

1) TheSchwartz DB

2) A client (DB handle) for the TheSchwartz DB

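package MyApp::TheSchwartz::Client;

use strict;
use warnings;
use TheSchwartz;
use Exporter 'import';
our @EXPORT_OK = qw(theschwartz);    # allows: use MyApp::TheSchwartz::Client qw/theschwartz/;

# Returns a new TheSchwartz client connected to the job database
sub theschwartz {
    my $theschwartz = TheSchwartz->new(
        databases => [ {
            dsn  => 'dbi:mysql:theschwartz',
            user => 'user',
            pass => 'pass',
        } ],
        verbose => 1,
    );
    return $theschwartz;
}

1;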

3) A job worker (where the actual work is done)

package MyApp::TheSchwartz::Worker::Test;

use strict;
use warnings;
# The client above uses plain TheSchwartz, so the worker inherits from the matching base class
use base qw( TheSchwartz::Worker );
use MyApp::Model::DB;      # Catalyst DB connect_info
use MyApp::Schema;         # Catalyst DB schema

sub work {
    my $class = shift;
    my $job = shift;
    my ($args) = $job->arg;    # the array ref that was passed to insert()
    my ($arg1, $arg2) = @$args;

    # re-use Catalyst DB schema
    my $connect_info = MyApp::Model::DB->config->{connect_info};
    my $schema = MyApp::Schema->connect($connect_info);

    # do the heavy lifting

    # mark the job as finished so that it is removed from the job table
    $job->completed();
}

1;

4) A worker process TheSchwartzWorker.pl that monitors the job table non-stop

use strict;
use warnings;
use MyApp::TheSchwartz::Client qw/theschwartz/;    # db connection
use MyApp::TheSchwartz::Worker::Test;
use Parallel::Prefork;

my $client = theschwartz();

my $pm = Parallel::Prefork->new({
    max_workers  => 16,
    trap_signals => {
        TERM => 'TERM',
        HUP  => 'TERM',
        USR1 => undef,
    }
});

while ($pm->signal_received ne 'TERM') {
    $pm->start and next;

    $client->can_do('MyApp::TheSchwartz::Worker::Test');
    my $delay = 10;    # seconds to sleep when no job is available
    $client->work( $delay );

    $pm->finish;
}    
$pm->wait_all_children();

5) In the Catalyst controller: insert a new job into the job table and pass some arguments

use MyApp::TheSchwartz::Client qw/theschwartz/;

sub start : Chained('base') PathPart('start') Args(0) {
    my ($self, $c) = @_;

    # $arg1, $arg2 and $name are assumed to have been read from the submitted form (not shown)
    my $client = theschwartz();
    $client->insert('MyApp::TheSchwartz::Worker::Test', [ $arg1, $arg2 ]);

    $c->response->redirect(
        $c->uri_for(
            $self->action_for('archive'),
            {mid => $c->set_status_msg("Run '$name' started")}
        )
    );
}

The new run is greyed out on the "archive" page until all results are available in the database.

sega.dev

Put the job in a queue and do it in a different process, outside of the web application. While your Catalyst process is busy, even if it uses Catalyst::Plugin::RunAfterRequest, it cannot process other web requests.

There are very simple queuing systems, like File::Queue. Basically, you assign a job ID to the document and put it in the queue. Another process checks the queue and picks up new jobs.
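To make the idea concrete, here is a minimal sketch of such a queue using only core Perl modules. It is not File::Queue's API but a swapped-in technique: a spool directory with one file per job, where a worker claims a job by renaming its file. The names `enqueue` and `claim_next`, the spool location, and the ID format are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use JSON::PP qw(encode_json decode_json);

# Hypothetical spool directory; a real app would use a fixed path that
# both the web application and the worker process can access.
my $spool   = tempdir(CLEANUP => 1);
my $counter = 0;

# Web side: write the job payload to "<id>.job" and return the job ID.
sub enqueue {
    my ($payload) = @_;
    my $id  = sprintf '%d-%d-%04d', time, $$, ++$counter;
    my $tmp = File::Spec->catfile($spool, "$id.tmp");
    open my $fh, '>', $tmp or die "Cannot write $tmp: $!";
    print {$fh} encode_json($payload);
    close $fh or die "Cannot close $tmp: $!";
    # Rename last, so a worker never sees a half-written job file.
    rename($tmp, File::Spec->catfile($spool, "$id.job"))
        or die "Cannot rename $tmp: $!";
    return $id;
}

# Worker side: claim the first waiting job (in file name order) by renaming
# its file, so that two workers cannot pick up the same job.
# Returns (id, payload), or an empty list when the queue is empty.
sub claim_next {
    opendir my $dh, $spool or die "Cannot read $spool: $!";
    my @jobs = sort grep { /\.job\z/ } readdir $dh;
    closedir $dh;
    for my $file (@jobs) {
        (my $id = $file) =~ s/\.job\z//;
        my $claimed = File::Spec->catfile($spool, "$id.work");
        # If the rename fails, another worker was faster; try the next file.
        next unless rename(File::Spec->catfile($spool, $file), $claimed);
        open my $fh, '<', $claimed or die "Cannot read $claimed: $!";
        my $json = do { local $/; <$fh> };
        close $fh;
        # Simplification: a real worker would delete the file only after
        # the job has been processed successfully.
        unlink $claimed;
        return ($id, decode_json($json));
    }
    return;
}
```

The controller would call `enqueue` with the form data and show the returned ID to the user, while a separate long-running script calls `claim_next` in a loop and sleeps for a few seconds whenever it returns nothing.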

You can save the job status in a database, or anything else accessible by the web application. On the front end, you can poll the job status every X seconds or minutes to give feedback to the user.
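A rough sketch of the status side, again in plain Perl. The in-memory hash stands in for a status table in the database, and the state names (`queued`, `running`, `done`, `failed`) as well as the helpers `set_status`, `get_status` and `poll_until_finished` are assumptions, not part of any module. In a real Catalyst app the polling would typically be done by the browser calling a small status action; the loop below only shows the same logic in one language.

```perl
use strict;
use warnings;

# Stand-in for a status table; a real app would query its database here.
my %status_of;    # job ID => 'queued' | 'running' | 'done' | 'failed'

# Worker side: record the current state of a job.
sub set_status {
    my ($id, $status) = @_;
    $status_of{$id} = $status;
    return;
}

# Web side: look up the state of a job; unknown IDs yield 'unknown'.
sub get_status {
    my ($id) = @_;
    return $status_of{$id} // 'unknown';
}

# Call $check every $interval seconds until it reports a final state
# ('done' or 'failed') or $max_tries checks have been made.
# Returns the last status seen.
sub poll_until_finished {
    my ($check, $interval, $max_tries) = @_;
    my $status = 'unknown';
    for my $try (1 .. $max_tries) {
        $status = $check->();
        return $status if $status eq 'done' || $status eq 'failed';
        sleep $interval if $try < $max_tries;
    }
    return $status;
}
```

For example, `poll_until_finished(sub { get_status($job_id) }, 10, 60)` would check every 10 seconds for up to about 10 minutes. Capping the number of tries avoids polling forever if a worker dies without ever setting a final state.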

You have to figure out how much memory and CPU you need. A multi-core CPU or multiple CPUs may not be required, even if you have several processes running. Choosing between a dedicated server and a cloud service like EC2 is mostly a trade-off between flexibility (resizing, snapshots, etc.) and price.

Julien
  • This is pretty much what I was going to say too. – simbabque Jan 20 '17 at 11:10
  • Thanks for your advice Julien. I'm new to using Perl for "complex" web apps. I couldn't figure out how to use File::Queue. Would you mind writing some example code for queuing and polling out of Catalyst? Apart from that, do you think I could adapt this approach using TheSchwartz: http://fayland.me/perl/2007/10/04/use-theschwartz-job-queue-to-handle/ ? – sega.dev Jan 20 '17 at 11:20
  • The doc for File::Queue has examples: http://search.cpan.org/perldoc?File%3A%3AQueue – Julien Jan 20 '17 at 20:49