2

I have some files in .sgm format and I have to evaluate them (apply a language model and obtain the perplexity of the text).

The main problem is that I need these files in plain text, i.e. in .txt format. However, I have searched the internet for an online converter or some kind of script that does this and could not find one.

Besides this, a teacher of mine sent me this command in perl:

perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < file.sgm > file

I have never worked with Perl and, honestly, have no idea how it works. I think I have Perl installed:

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2013, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

By the way, I am using Mac OS X.

Sample .sgm file:

<srcset setid="newsdiscusstest2015" srclang="any">
<doc sysid="ref" docid="39-Guardian" genre="newsdiscuss" origlang="en">
<p>
<seg id="1">This is perfectly illustrated by the UKIP numbties banning people with HIV.</seg>
<seg id="2">You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome.</seg>
<seg id="3">You raise a straw man and then knock it down with thinly veiled homophobia.</seg>

Output .txt file:

This is perfectly illustrated by the UKIP numbties banning people with HIV. You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome. You raise a straw man and then knock it down with thinly veiled homophobia.

lucasrodesg
  • Please post sample `.sgm` file and what it should look like after conversion to `.txt` – bart Apr 24 '16 at 18:57
  • I have edited the post @bart – lucasrodesg Apr 24 '16 at 19:06
  • Is *perplexity* really a measurement of something? I want to be involved! – Borodin Apr 25 '16 at 09:30
  • Hi @Borodin, yes, _perplexity_ is a widely used measure for evaluating a language model (LM). An LM is obtained by analyzing a huge training text. To evaluate it, you apply the LM to several test texts. To quantify how accurate the LM is, you compute the perplexity of each test text. The lower the perplexity, the better your LM. If, for instance, you use the Bible as the training corpus and WhatsApp group logs as the test corpus, the perplexity will be high, since the training and test corpora are quite different in nature. Refer to "Natural Language Modeling" for more info :-) – lucasrodesg Apr 27 '16 at 10:54

3 Answers

7

You can try using this script to strip the SGML tags from the file:

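#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Parser;

# Path of the SGML file to strip, taken from the command line.
my $file = $ARGV[0] or die "Usage: $0 file.sgm\n";

# Ignore every parser event by default; print only the text between tags.
HTML::Parser->new(default_h => [""],
    text_h => [ sub { print shift }, 'text' ]
  )->parse_file($file) or die "Failed to parse $file: $!";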

Use it as follows:

./strip_sgml.pl file.sgm > file.txt
bart
1

Ok, I have found a solution:

Rename the file from "file.sgm" to "file.html". Then open the HTML file in a text editor and add the line <meta charset="utf-8"> at the top, so that all characters are displayed correctly. Finally, open the file in a web browser and copy the content into a new text file.

lucasrodesg
0

For a Python solution, the answer from user Hugo in this question will remove all tags from the document (Python/BeautifulSoup - how to remove all tags from an element?).

TL;DR: Use the get_text() function from Beautiful Soup.
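
If installing Beautiful Soup is not an option, roughly the same result can probably be had from Python's standard library. The following is only a minimal sketch, not Beautiful Soup's get_text(): it uses html.parser.HTMLParser instead, and the names SegTextExtractor and sgm_to_text are made up for illustration. It assumes the text of interest sits inside <seg> elements, as in the sample file from the question, and joins the segments with single spaces to match the desired output.

```python
from html.parser import HTMLParser


class SegTextExtractor(HTMLParser):
    """Collect the text inside <seg> elements (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.segments = []    # one cleaned string per <seg> element
        self._in_seg = False  # True while the parser is inside a <seg>
        self._buf = []        # text pieces of the current segment

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <SEG> is matched as well.
        if tag == "seg":
            self._in_seg = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "seg" and self._in_seg:
            self._in_seg = False
            # Collapse runs of whitespace and trim both ends.
            self.segments.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._in_seg:
            self._buf.append(data)


def sgm_to_text(sgm):
    """Return the text of all <seg> elements, joined by single spaces."""
    parser = SegTextExtractor()
    parser.feed(sgm)
    parser.close()
    return " ".join(parser.segments)
```

To convert a file, one could read it with open("file.sgm", encoding="utf-8") (assuming the file is UTF-8), pass the contents to sgm_to_text, and write the returned string to file.txt.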

brch