
I want to reset one of my Google Groups (a class discussion), but I would like to retain the existing discussion for reference. There aren't many posts (maybe 50), so I could copy them by hand, but is there a way to do it with Google Apps Script or Python?

I found two possibilities, but neither is in a language I'm familiar with (though I might be able to translate them):

This link: http://saturnboy.com/2010/03/scraping-google-groups/

This Perl code:

#!/usr/bin/perl
# groups2csv.pl
# Google Groups results exported to CSV suitable for import into Excel.
# Usage: perl groups2csv.pl < groups.html > groups.csv

# The CSV Header.
print qq{"title","url","group","date","author","number of articles"\n};

# The base URL for Google Groups.
my $url = "http://groups.google.com";

# Rake in those results.
my($results) = (join '', <>);

# Perform a regular expression match to glean individual results.
while ( $results =~ m!<a href=(/groups[^\>]+?rnum=[0-9]+)>(.+?)</a>.*?
<br>(.+?)<br>.*?<a href="?/groups.+?class=a>(.+?)</a> - (.+?) by 
(.+?)\s+.*?\(([0-9]+) article!mgis ) {
    my($path, $title, $snippet, $group, $date, $author, $articles) =
        ($1||'',$2||'',$3||'',$4||'',$5||'',$6||'',$7||'');
    $title =~ s!"!""!g; # double escape " marks
    $title =~ s!<.+?>!!g; # drop all HTML tags
    print qq{"$title","$url$path","$group","$date","$author","$articles"\n\n};
}
Srik
wgw
  • You can definitely scrape in Python. It doesn't sound like you want to screen scrape in this instance, though; it sounds like you just want to take a backup of the discussion. – chucksmash Aug 21 '12 at 20:18
  • My eyes are watering from reading the regular expression. – Burhan Khalid Aug 22 '12 at 06:40

1 Answer


Take a look at the HTTrack utility mentioned in this webapps question and in this forum discussion.

Note: I'm assuming you don't actually want to screen scrape and process the data, but merely want a copy of the discussion for future reference.

EDIT: If you actually want to screen scrape, you can do that too, but writing a script for it can be a significant time sink. Screen scraping is more about extracting specific pieces of data from an HTML document than about grabbing the entire document. An example where you might need to screen scrape: you are looking at the Jeopardy website and want to grab individual questions, their point values, who answered them correctly, which game they occurred in, etc., for insertion into a database.
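To make the distinction concrete, here is a minimal Python sketch of that kind of extraction, using only the standard library's `html.parser`. The markup and the class names `post`, `author`, and `body` are made up for illustration; they are not what Google Groups actually serves, so treat this as a pattern to adapt rather than a working scraper for your group.

```python
from html.parser import HTMLParser

# Hypothetical markup: the class names "post", "author", and "body" are
# assumptions for illustration, not Google Groups' real page structure.
SAMPLE_HTML = """
<div class="post"><span class="author">alice</span>
<p class="body">When is the essay due?</p></div>
<div class="post"><span class="author">bob</span>
<p class="body">Friday, I think.</p></div>
"""


class PostExtractor(HTMLParser):
    """Collect author/body pairs from the assumed markup above."""

    def __init__(self):
        super().__init__()
        self.posts = []      # one dict per post found so far
        self._field = None   # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "post":
            # A new post starts: open an empty record for it.
            self.posts.append({"author": "", "body": ""})
        elif css_class in ("author", "body") and self.posts:
            self._field = css_class

    def handle_endtag(self, tag):
        # The author/body elements in the sample are <span> and <p>.
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        # Keep text only while inside an author or body element.
        if self._field:
            self.posts[-1][self._field] += data.strip()


def extract_posts(html):
    """Return a list of {"author": ..., "body": ...} dicts."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


if __name__ == "__main__":
    for post in extract_posts(SAMPLE_HTML):
        print(post["author"], "->", post["body"])
```

Note how the script throws away everything except the two fields it cares about. That is the extra work (and the fragility) of scraping compared with simply mirroring the pages.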

Community
chucksmash
  • Thanks! I think that is the way to go. It might do a bit more than I want, but if it is configured properly (contained), it should work. (You are right: I don't want to scrape, I just want the posts!) – wgw Aug 21 '12 at 21:55
  • Gave it a try... was too confused by the group interface (I don't think it understands link parameters). Looks like I will need some kind of browser mechanization (like mechanize). But for the moment, it will have to be done by hand... :( Google groups needs some data liberation... – wgw Aug 21 '12 at 23:15