
I'm using the Perl WWW::Mechanize module to fetch and process data from some websites. My usual workflow is as follows:

  1. Fetch a webpage

    $mech->get("$url");

  2. Save the webpage contents in a variable (BTW, I'm not sure whether it's right to store this much text in a scalar, which, as far as I know, is supposed to hold a single value)

    my $list = $mech->content();

  3. Use a subroutine that I've written to write the contents of the variable to a text file. (The writeToFile subroutine includes a few more features, like path and existing-file validation.)

    writeToFile("$filename.tmp","$path",$list);

  4. Process the text in the file created in the previous step by creating an additional file and saving the processed content there (then deleting the initial temporary file).

What I wonder is whether it is possible to perform the processing directly on the $list variable, before storing the text in a file. The whole process works as expected, but I don't really like the logic behind it, and it seems a bit inefficient as well, since I have to rewrite the same file multiple times.

EDIT: Just to give a bit more information about what I'm actually after when I process the variable contents: the data I fetch from the website in this case is a list of items separated by blank lines, and the first line is irrelevant to me. So while processing this data I do two things:

  1. Remove the empty (CRLF) lines
  2. Remove the first line if it includes a particular text.

Ideally, I want to save the processed list (no blank lines and first line removed) in a file without creating any additional files along the way. To save the file I would like to use the writeToFile sub I wrote, since it also checks whether such a file already exists (if a file is saved before final processing, writeToFile will always overwrite the existing file).

Hope it makes sense.

Alexander
Eugene S
  • Of course it is. What exactly aren't you managing? – Mat Mar 24 '13 at 11:10
  • @Mat Hi. My problem seems to be reading the text inside the variable line by line, processing each line according to some conditions, and then saving the output somewhere, just like I do with a file. With a file I read each line, check it, and write the processed output into another file. Thank you. – Eugene S Mar 24 '13 at 11:16
  • Look at the answers here: http://stackoverflow.com/questions/1445426/how-can-i-process-a-multi-line-string-one-line-at-a-time-in-perl-with-use-strict, especially http://stackoverflow.com/a/1445732/635608 – Mat Mar 24 '13 at 11:19
  • The question lacks detail about what you want to accomplish by parsing line by line. I smell an XY problem. – daxim Mar 24 '13 at 17:28
  • @daxim I have added some details in my question. Hope it makes my question clearer. Thank you! – Eugene S Mar 25 '13 at 05:35
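
For reading a string line by line "just like a file", as described in the comments above, one common Perl idiom is to open a filehandle on a reference to the scalar. The following is only a minimal sketch of that idiom; it is not necessarily the technique used in the linked answers, and the sample text in `$list` is made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for the page text held in a scalar; the contents are made up.
my $list = "first\n\nsecond\nthird\n";

# Opening a reference to a scalar yields an in-memory filehandle,
# so the usual line-by-line read loop works unchanged.
open my $fh, '<', \$list or die "Cannot open in-memory handle: $!";

my @seen;
while (my $line = <$fh>) {
    chomp $line;             # strips the trailing "\n" only
    next if $line eq '';     # skip blank lines
    push @seen, $line;
}
close $fh;

print "$_\n" for @seen;
```

Note that `chomp` removes only `"\n"` by default, so data with CRLF line endings would still carry a trailing `"\r"`; in that case something like `s/\R\z//` (Perl 5.10 or later) could be used instead.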

1 Answer


You're looking for split. The pattern depends on what you need: use (?<=\n) to split after each newline character and keep it at the end of each piece. If that doesn't matter, use \R to match all sorts of line breaks.

foreach my $line (split qr/\R/, $mech->content) {
    …
}
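
Applied to the two processing steps described in the question (dropping blank lines, and dropping the first line if it contains a particular text), the loop above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample text in `$content` stands in for `$mech->content`, the `$marker` string is hypothetical, and the final call to the asker's own `writeToFile` is left as a comment because that sub is not shown.

```perl
use strict;
use warnings;

# Stand-in for $mech->content; the header text and items are made up.
my $content = "Item list header\r\n\r\napple\r\n\r\nbanana\r\n\r\ncherry\r\n";

# Hypothetical text that identifies the irrelevant first line.
my $marker = 'header';

# Split on any line break (\R matches LF, CRLF, and others; Perl 5.10+)
# and keep only lines that contain at least one non-whitespace character.
my @lines = grep { /\S/ } split /\R/, $content;

# Remove the first line only if it contains the marker text.
shift @lines if @lines && index($lines[0], $marker) >= 0;

# Rebuild a single string, one item per line.
my $processed = join "\n", @lines;
$processed .= "\n" if @lines;

print $processed;

# The result could then be written once, with no temporary file, e.g.:
# writeToFile($filename, $path, $processed);   # the asker's own sub
```

Because everything happens on the in-memory string, the file is written only once, after processing is complete.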

Now the obligatory HTML-parsing-with-regex admonishment: if you get HTML source with Mechanize, parsing it line-by-line does not make much sense. You probably want to process the HTML-stripped text version of the document instead, or pass the HTML source to a parser such as Web::Query to declaratively get at the pieces you need.

daxim
  • Thank you for your answer. I was able to incorporate your suggestion into my code. But could you please explain this `qr/\R/` pattern? You mentioned that `\R` is a pattern for line breaks, but what about the `qr`? Thank you – Eugene S Mar 28 '13 at 16:26
  • http://p3rl.org/rebackslash#%5cR http://p3rl.org/qr http://p3rl.org/op#qr%2fSTRING%2fmsixpodual The `qr` operator makes a pattern. `split`'s first argument is a pattern. – daxim Mar 28 '13 at 20:12
  • Thanks a lot for the links and explanation! – Eugene S Mar 28 '13 at 20:13