4

I have a large number of text files (1000+) each containing an article from an academic journal. Unfortunately each article's file also contains a "stub" from the end of the previous article (at the beginning) and from the beginning of the next article (at the end).

I need to remove these stubs in preparation for running a frequency analysis on the articles because the stubs constitute duplicate data.

There is no simple field that marks the beginning and end of each article in all cases. However, the duplicate text does seem to be formatted the same and to fall on the same lines in both cases.

A script that compared each file to the next one and then removed one copy of the duplicate text would be perfect. This seems like it would be a pretty common issue in programming, so I am surprised that I haven't been able to find anything that does this.

The file names sort in order, so a script that compares each file to the next sequentially should work. For example,

bul_9_5_181.txt
bul_9_5_186.txt

are two articles, one starting on page 181 and the other on page 186. Both of these articles are included below.

There are two volumes of test data located at [http://drop.io/fdsayre][1]

Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years of experience with Linux and can usually figure things out as I go.

Thanks for your help

FILENAME: bul_9_5_181.txt

SYN&STHESIA

ISI

the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition. REFERENCES

DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets. / . de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.

AFFECTIVE PHENOMENA — EXPERIMENTAL BY PROFESSOR JOHN F. .SHEPARD University of Michigan

Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings, they are accompanied by characteristic feelings in each subject. He records the chest breathing, and curves from a sphygmograph and a water plethysmograph. Each experiment began with a normal record, then the stimulus was given, and this was followed by a contrast stimulus; lastly, another normal was taken. The length and depth of breathing were measured (no time line was recorded), and the relation of length of inspiration to length of expiration was determined. The length and height of the pulsebeats were also measured. Tabular summaries are given of the number of times the author finds each quantity to have been increased or decreased during a reaction period with each type of feeling. The feeling state accompanying a given rhythm is always complex, but the result is referred to that dimension which seemed to be dominant. Only a few disconnected extracts from normal and reaction periods are reproduced from the records. The author states that excitement gives increase in the rate and depth of breathing, in the inspiration-expiration ratio, and in the rate and size of pulse. There are undulations in the arm volume. In so far as the effect is quieting, it causes decrease in rate and depth of

182

JOHN F. SHEPARD

breathing, in the inspiration-expiration ratio, and in the pulse rate and size. The arm volume shows a tendency to rise with respiratory waves. Agreeableness shows

fdsayre
  • Is the beginning and end of the actual article in each file not marked in some way? –  Apr 06 '09 at 21:59
  • No. The closest thing is the title and author names which start each article proper. These have the following format: NAME OF ARTICLE, then BY FIRSTNAME LASTNAME. But there are other all-caps fragments (the running heads), although not the combination of both the title and author name on sequential lines. – fdsayre Apr 06 '09 at 22:09
  • @fdsayre — I made some minor formatting changes so that your examples will (hopefully) stand out better. Hope you don't mind. :-) – Ben Blank Apr 06 '09 at 23:23
  • @ben-blank That does look better, thank you. – fdsayre Apr 06 '09 at 23:46
  • @fdsayre -- added file name retention. – MarkusQ Apr 10 '09 at 16:25
  • @fdsayre -- see notes in comments on my answer. – MarkusQ Apr 14 '09 at 00:06
  • Why can't the page numbers be used as markers? Information about this can even be extracted from the filenames. – Waylon Flinn Apr 15 '09 at 20:50
  • Yeah... I just realized that. The ironic thing is that I already extract all the page numbers to use as metadata anyway and they are just sitting in text files and a database linked with the file names and content. – fdsayre Apr 15 '09 at 20:51

7 Answers

4

It looks like a much simpler solution would actually work.

No one seems to be using the information provided by the filenames. If you do make use of this information, you may not have to do any comparisons between files to identify the area of overlap. Whoever wrote the OCR probably put some thought into this problem.

The last number in the file name tells you what the starting page number for that file is. This page number appears on a line by itself in the file as well. It also looks like this line is preceded and followed by blank lines. Therefore for a given file you should be able to look at the name of the next file in the sequence and determine the page number at which you should start removing text. Since this page number appears in your file just look for a line that contains only this number (preceded and followed by blank lines) and delete that line and everything after. The last file in the sequence can be left alone.

Here's an outline for an algorithm

  1. choose a file; call it: file1
  2. look at the filename of the next file; call it: file2
  3. extract the page number from the filename of file2; call it: pageNumber
  4. scan the contents of file1 until you find a line that contains only pageNumber
  5. make sure this line is preceded and followed by a blank line.
  6. remove this line and everything after
  7. move on to the next file in the sequence
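
Not from the original answer, but here is a minimal sketch of that outline in Perl. Assumptions: the files are passed on the command line already in page order, the page number is the last underscore-separated field of the filename, output goes to "<name>.trimmed" rather than overwriting, and only the preceding blank line (not the following one) is checked:

#!/usr/bin/perl
# For each file, find the line holding the next file's starting page number
# and drop that line plus everything after it.
use strict;
use warnings;

my @files = @ARGV;    # pass the files already sorted into page order

for my $i (0 .. $#files - 1) {
    # starting page of the next article, taken from its filename
    # (e.g. bul_9_5_186.txt -> 186)
    my ($page) = $files[$i + 1] =~ /_(\d+)\.txt$/;
    next unless defined $page;

    open my $in,  '<', $files[$i]           or die "$files[$i]: $!";
    open my $out, '>', "$files[$i].trimmed" or die "$files[$i].trimmed: $!";

    my $prev_blank = 1;    # treat the start of the file as following a blank line
    while (my $line = <$in>) {
        # stop at a line containing only that page number, preceded by a blank line
        last if $prev_blank && $line =~ /^\s*$page\s*$/;
        print {$out} $line;
        $prev_blank = $line =~ /^\s*$/ ? 1 : 0;
    }
    close $in;
    close $out;
}
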
Waylon Flinn
  • Huh. Last night I realized the filenames were useful in this context, but I hadn't thought about the page numbers located WITHIN the file. That's rather clever. – fdsayre Apr 15 '09 at 20:48
  • Reposting my comment from the question: "Yeah... I just realized that. The ironic thing is that I already extract all the page numbers to use as metadata anyway and they are just sitting in text files and a database linked with the file names and content." – fdsayre Apr 15 '09 at 21:53
3

You should probably try something like this (I've now tested it on the sample data you provided):

#!/usr/bin/ruby

class A_splitter
    Title   = /^[A-Z]+[^a-z]*$/
    Byline  = /^BY /
    Number = /^\d*$/
    Blank_line = /^ *$/
    attr_accessor :recent_lines,:in_references,:source_glob,:destination_path,:seen_in_last_file
    def initialize(src_glob,dst_path=nil)
        @recent_lines = []
        @seen_in_last_file = {}
        @in_references = false
        @source_glob = src_glob
        @destination_path = dst_path
        @destination = STDOUT
        @buffer = []
        split_em
        end
    def split_here
        if destination_path
            @destination.close if @destination
            @destination = nil
          else
            print "------------SPLIT HERE------------\n" 
          end
        print recent_lines.shift
        @in_references = false
        end
    def at_page_break
        ((recent_lines[0] =~ Title  and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Number) or
         (recent_lines[0] =~ Number and recent_lines[1] =~ Blank_line and recent_lines[2] =~ Title))
        end
    def print(*args)
        (@destination || @buffer) << args
        end
    def split_em
        Dir.glob(source_glob).sort.each { |filename|
            if destination_path
                @destination.close if @destination
                @destination = File.open(File.join(@destination_path,filename),'w')
                print @buffer
                @buffer.clear
              end
            in_header = true
            File.foreach(filename) { |line|
                line.gsub!(/\f/,'')
                if in_header and seen_in_last_file[line]
                    #skip it
                  else 
                    seen_in_last_file.clear if in_header
                    in_header = false
                    recent_lines << line
                    seen_in_last_file[line] = true
                  end
                3.times {recent_lines.shift} if at_page_break
                if recent_lines[0] =~ Title and recent_lines[1] =~ Byline
                    split_here
                  elsif in_references and recent_lines[0] =~ Title and recent_lines[0] !~ /\d/
                    split_here
                  elsif recent_lines.length > 4
                    @in_references ||= recent_lines[0] =~ /^REFERENCES *$/
                    print recent_lines.shift
                  end
                }
            } 
        print recent_lines
        @destination.close if @destination
        end
    end

A_splitter.new('bul_*_*_*.txt','test_dir')

Basically, run through the files in order, and within each file run through the lines in order, omitting from each file the lines that were present in the preceding file and printing the rest to STDOUT (from which it can be piped), unless a destination directory is specified ('test_dir' in the example; see the last line), in which case files are created in that directory with the same name as the file that contained the bulk of their contents.

It also removes the page-break sections (journal title, author, and page number).

It does two split tests:

  • a test on the title/byline pair
  • a test on the first title-line after a reference section

(it should be obvious how to add tests for additional split-points).

Retained for posterity:

If you don't specify a destination directory it simply puts a split-here line in the output stream at the split point. This should make it easier for testing (you can just less the output) and when you want them in individual files just pipe it to csplit (e.g. with

csplit -f abstracts - '/---SPLIT HERE---/' '{*}'

or something) to cut it up.

MarkusQ
  • This looks interesting. If by "split" you mean keeping the files separate, then I do need them split. I did a quick test and it seems to work, but without keeping each article intact it's difficult to compare. Thanks. – fdsayre Apr 09 '09 at 22:23
  • So this uses title and byline to determine the proper start (and thus end) of each article? If so, unfortunately it won't work as there is no specific field that uniquely identifies the starting point of every article. Some use title/by but others (reviews/etc.) do not have an author field. – fdsayre Apr 09 '09 at 23:29
  • Thus I believe that the script needs to compare the beginnings/ends of each file with the next file and remove one set of duplicates. The stubs should only be in the approx. 1/2 page on either side and only with the previous/next file. Sorry this is so complex; it may not be solvable. – fdsayre Apr 09 '09 at 23:30
  • Clever to ignore order on the duplicate lines. Works with high probability. – Norman Ramsey Apr 10 '09 at 01:12
  • @fdsayre -- It could be any number of tests; the title and byline are just examples. I'll add another pattern I've noticed, as an example. – MarkusQ Apr 10 '09 at 01:34
  • Okay, so just so I understand, title/byline/ref are only used to determine splits, not to determine what to remove, right? the actual removal is done via a test for duplicate lines between files. I have no experience with ruby (or lua) but it looks easy to add split points. – fdsayre Apr 10 '09 at 02:20
  • ahhh. sorry man. I need to keep the original filenames intact, and as far as I can tell that's impossible with the split... damn. Thanks for all your help. – fdsayre Apr 10 '09 at 03:46
  • Are the split tests required if the script outputs to the original filename? I ask because the splits happen in many different ways (no coherent pattern) and right now the outputted files are not working with high probability... – fdsayre Apr 10 '09 at 21:24
  • Yes, the split tests are needed. If there is no coherent pattern (which I doubt, having spotted two patterns in the data you provided) you are out of luck, since there's no way to automate such a task. Post an example of something that doesn't split right and I'll see if I can spot a pattern. – MarkusQ Apr 10 '09 at 23:03
  • @MarkusQ Can I send you some data via email? – fdsayre Apr 10 '09 at 23:40
  • @fdsayre -- It would be better if you could just post them somewhere--that way others could see them too. – MarkusQ Apr 11 '09 at 02:56
  • @fdsayre -- I see them, but it doesn't help much. The problem is you are wanting it to split at some points that apparently aren't obvious, but what those points are isn't obvious. – MarkusQ Apr 14 '09 at 00:03
  • @fdsayre -- Maybe if you could find some places where you think it should be split and post several of those (as an edit to the question) someone could spot the additional pattern(s). But the raw data doesn't help much without knowing what you are wanting it to do. – MarkusQ Apr 14 '09 at 00:04
  • @fdsayre -- Also, at least some of those appear (to me at least) to be one file per page, not one file per article, so it may be that you are wanting to join files as well as split them. – MarkusQ Apr 14 '09 at 00:05
  • The split thing is a problem. I don't think there are any coherent split patterns between all files. The "stubs" are there because many articles end and/or start mid-page, and when that happens that page is duplicated in both articles' original files. – fdsayre Apr 14 '09 at 00:29
  • Unfortunately I need to keep the original file names/content intact (this is a more important requirement than 100% accuracy on the removal of duplicate data, or for that matter, which copy of the duplicate data is removed). – fdsayre Apr 14 '09 at 00:31
  • It seems to be a complex problem but I originally thought group programming tools (diff, etc.) would help. I may have to deal with this problem statistically by estimating the amount of duplicate data and correcting my results appropriately, but obviously I would rather proceed empirically. – fdsayre Apr 14 '09 at 00:34
  • @MarkusQ The only thing I can think of - if removing the duplicate data and the split process are separate - is adding a split point to the top of each file before running the script, which would allow putting the files back together again with perfect accuracy once the dup. data is removed. – fdsayre Apr 14 '09 at 00:41
2

You have a nontrivial problem. It is easy to write code to find the duplicate text at the end of file 1 and the beginning of file 2. But you don't want to delete the duplicate text; you want to split it where the second article begins. Getting the split right might be tricky: one marker is the all-caps title, another is the "BY" at the start of the next line.

It would have helped to have examples from consecutive files, but the script below works on one test case. Before trying this code, back up all your files. The code overwrites existing files.

The implementation is in Lua. The algorithm is roughly:

  1. Ignore blank lines at the end of file 1 and the start of file 2.
  2. Find a long sequence of lines common to end of file 1 and start of file 2.
    • This works by trying a sequence of 40 lines, then 39, and so on
  3. Remove sequence from both files and call it overlap.
  4. Split overlap at title
  5. Append first part of overlap to file1; prepend second part to file2.
  6. Overwrite contents of files with lists of lines.

Here's the code:

#!/usr/bin/env lua

local ext = arg[1] == '-xxx' and '.xxx' or ''
if #ext > 0 then table.remove(arg, 1) end  

local function lines(filename)
  local l = { }
  for line in io.lines(filename) do table.insert(l, (line:gsub('\f', ''))) end  -- strip form feeds (^L)
  assert(#l > 0, "No lines in file " .. filename)
  return l
end

local function write_lines(filename, lines)
  local f = assert(io.open(filename .. ext, 'w'))
  for i = 1, #lines do
    f:write(lines[i], '\n')
  end
  f:close()
end

local function lines_match(line1, line2)
  io.stderr:write(string.format("%q ==? %q\n", line1, line2))
  return line1 == line2 -- could do an approximate match here
end

local function lines_overlap(l1, l2, k)
  if k > #l2 or k > #l1 then return false end
  io.stderr:write('*** k = ', k, '\n')
  for i = 1, k do
    if not lines_match(l2[i], l1[#l1 - k + i]) then
      if i > 1 then
        io.stderr:write('After ', i-1, ' matches: FAILED <====\n')
      end
      return false
    end
  end
  return true
end

function find_overlaps(fname1, fname2)
  local l1, l2 = lines(fname1), lines(fname2)
  -- strip trailing and leading blank lines
  while l1[#l1]:find '^[%s]*$' do table.remove(l1)    end
  while l2[1]  :find '^[%s]*$' do table.remove(l2, 1) end
  local matchsize  -- # of lines at end of file 1 that are equal to the same 
                   -- # at the start of file 2
  for k = math.min(40, #l1, #l2), 1, -1 do
    if lines_overlap(l1, l2, k) then
      matchsize = k
      io.stderr:write('Found match of ', k, ' lines\n')
      break
    end
  end

  if matchsize == nil then
    return false -- failed to find an overlap
  else
    local overlap = { }
    for j = 1, matchsize do
      table.remove(l1) -- remove line from first set
      table.insert(overlap, table.remove(l2, 1))
    end
    return l1, overlap, l2
  end
end

local function split_overlap(l)
  for i = 1, #l-1 do
    if l[i]:match '%u' and not l[i]:match '%l' then -- has caps but no lowers
      -- io.stderr:write('Looking for byline following ', l[i], '\n')
      if l[i+1]:match '^%s*BY%s' then
        local first = {}
        for j = 1, i-1 do
          table.insert(first, table.remove(l, 1))
        end
        -- io.stderr:write('Split with first line at ', l[1], '\n')
        return first, l
      end
    end
  end
end

local function strip_overlaps(filename1, filename2)
  local l1, overlap, l2 = find_overlaps(filename1, filename2)
  if not l1 then
    io.stderr:write('No overlap in ', filename1, ' and ', filename2, '\n')
    return
  end
  -- (the posted answer is truncated here; the rest of this function and the
  --  driver loop below are a sketch of steps 4-6 of the outline above,
  --  not the author's original code)
  local first, second = split_overlap(overlap)
  if not first then first, second = overlap, { } end          -- no title found: keep overlap with file 1
  for i = 1, #first do table.insert(l1, first[i]) end          -- append first part to file 1
  for i = #second, 1, -1 do table.insert(l2, 1, second[i]) end -- prepend second part to file 2
  write_lines(filename1, l1)
  write_lines(filename2, l2)
end

-- walk consecutive pairs of the files given on the command line
for i = 1, #arg - 1 do
  strip_overlaps(arg[i], arg[i + 1])
end
Norman Ramsey
  • Well I'm glad I didn't miss some obvious answer but "nontrivial" doesn't sound good. This script looks good. Unfortunately it exits with "no overlap" right now. I've uploaded a sample of files to: http://dl.getdropbox.com/u/239647/bul_9_5_181.txt http://dl.getdropbox.com/u/239647/bul_9_5_186.txt – fdsayre Apr 07 '09 at 03:20
  • Two issues: your files are inconsistent in how they use ^L, and my overlap detector needs to be improved. How long do the files get? – Norman Ramsey Apr 08 '09 at 03:53
  • The largest file will be around 250KB, but that is abnormal. The vast majority are under 100KB. The mean is probably 30KB. These are all academic articles, so while a few are large reports, most are a couple pages. Thanks. – fdsayre Apr 08 '09 at 05:25
  • OK, I have improved things to the point where it works on your two test files. It finds at most 40 lines of overlap. Let me know how it goes.... – Norman Ramsey Apr 09 '09 at 01:53
  • This looks really good. Just tested on a few files, but will give it a workout tonight/tomorrow morning. THANKS. – fdsayre Apr 09 '09 at 03:09
  • Strange. Works fine on a directory with a couple dozen files; when scaling up it gives:
    lua: ./s1.lua:71: attempt to index field '?' (a nil value)
    stack traceback:
     ./s1.lua:71: in function 'split_overlap'
     ./s1.lua:88: in function 'strip_overlaps'
     ./s1.lua:102: in main chunk
     [C]: ?
    
    – fdsayre Apr 09 '09 at 19:45
  • This looks _way_ more complicated than it needs to be. – MarkusQ Apr 09 '09 at 22:42
  • @fdsayre: my bad, line 68 should loop to #l-1, not to #l – Norman Ramsey Apr 10 '09 at 01:07
  • @MarkusQ: Strangers solve problems for free writing code late at night, and all you can do is complain? :-) Test your code, then we'll talk :-) – Norman Ramsey Apr 10 '09 at 01:09
  • @Norman Ramsey -- I wasn't complaining, just kibitzing. When I'm writing code for free late at night I always like to go with the simplest solution possible, – MarkusQ Apr 10 '09 at 01:40
  • I just like people who write code late at night! @markusQ I'm testing this now. – fdsayre Apr 10 '09 at 01:47
  • Oops... @ramsey: I am testing this now – fdsayre Apr 10 '09 at 01:49
  • @MarkusQ: You're a better man. When it's late at night, I can't make anything simple. The 'have I seen it before' test is ingenious. Wrong, but it will never be wrong on a real input. – Norman Ramsey Apr 10 '09 at 02:02
  • @Norman Ramsey -- My program is now tested (and an embarrassing typo fixed) as per your guilt trip above. Thanks for goading me. – MarkusQ Apr 10 '09 at 02:12
  • @MarkusQ Yeah, the slashes confused me at first too. – fdsayre Apr 10 '09 at 02:15
  • Re:slashes. I'm multitasking and evidently don't have good enough wetware trapping of cross project memory contamination. – MarkusQ Apr 10 '09 at 16:24
  • @Ramsey: it's returning "lua: ./s1.lua:49: '=' expected near 'for'". I would love to get this to work, as so far this script returns the best results. – fdsayre Apr 10 '09 at 21:31
  • @Sayre: maybe we have a transcription error -- what's on line 49? I've put a current version in http://www.cs.tufts.edu/~nr/drop/rm-overlaps. Maybe you can post a zip file containing your texts? Or are they proprietary? – Norman Ramsey Apr 11 '09 at 01:15
  • Added two volumes of data at http://drop.io/fdsayre. The data in vol1 seems to run fine (although the script doesn't seem to consistently change the original files; maybe I'm executing it wrong). The data in vol27 seems to exit with the same error as above. – fdsayre Apr 13 '09 at 23:56
2

Here is the beginning of another possible solution in Perl (it works as is, but could probably be made more sophisticated if needed). It sounds as if all you are concerned about is removing duplicates across the corpus, and you don't really care if the last part of one article is in the file for the next one, as long as it isn't duplicated anywhere. If so, this solution will strip out the duplicate lines, leaving only one copy of any given line in the set of files as a whole.

You can either just run the file in the directory containing the text files with no argument or alternately specify a file name containing the list of files you want to process in the order you want them processed. I recommend the latter as your file names (at least in the sample files you provided) do not naturally list out in order when using simple commands like ls on the command line or glob in the Perl script. Thus it won't necessarily compare the correct files to one another as it just runs down the list (entered or generated by the glob command). If you specify the list, you can guarantee that they will be processed in the correct order and it doesn't take that long to set it up properly.

The script simply opens two files and makes note of the first three lines of the second file. It then opens a new output file (original file name + '.new') for the first file and writes out all the lines from the first file into the new output file until it finds the first three lines of the second file. There is an off chance that the first three lines of the second file are not all present in the first one, but in all the files I spot-checked they were, because of the journal name header and page numbers. One line definitely wasn't enough, as the journal title was often the first line and that would cut things off early.

I should also note that the last file in your list of files entered will not be processed (i.e. have a new file created based off of it) as it will not be changed by this process.

Here's the script:

#!/usr/bin/perl
use strict;

my @files;
my $count = @ARGV;
if ($count>0){
    open (IN, "$ARGV[0]");
    @files = <IN>;
    close (IN);
} else {
    @files = glob "bul_*.txt";
}
$count = @files;
print "Processing $count files.\n";

my $lastFile="";
foreach(@files){
    if ($lastFile ne ""){
        print "Processing $_\n";
        open (FILEB,"$_");
        my @fileBLines = <FILEB>;
        close (FILEB);
        my $line0 = $fileBLines[0];
            if ($line0 =~ /\(/ || $line0 =~ /\)/){
                    $line0 =~ s/\(/\\\(/g;
                    $line0 =~ s/\)/\\\)/g;
            }
        my $line1 = $fileBLines[1];
        my $line2 = $fileBLines[2];
        open (FILEA,"$lastFile");
        my @fileALines = <FILEA>;
        close (FILEA);
        my $newName = "$lastFile.new";
        open (OUT, ">$newName");
        my $i=0;
        my $done = 0;
        while ($done != 1 and $i < @fileALines){
            if ($fileALines[$i] =~ /$line0/ 
                && $fileALines[$i+1] eq $line1
                && $fileALines[$i+2] eq $line2) {
                $done=1;
            } else {
                print OUT $fileALines[$i];
                $i++;
            }
        }
        close (OUT);
    }
    $lastFile = $_;
}

EDIT: Added a check for parentheses in the first line, which goes into the regex used for the duplicate check later on; if found, they are escaped so that they don't break the match.

dagorym
  • this looks really good and worked when tested on a small sample. any advice for quickly generating the list? I'm playing with sort and find but they really don't seem to like the fields, especially the 11 and 1 in the second field. – fdsayre Apr 15 '09 at 02:27
  • got it: sort -t "_" -n -k2,2 -k3,3 -k4,4 – fdsayre Apr 15 '09 at 04:52
  • error: Processing bul_2_6_200.txt Unmatched ) in regex; marked by <-- HERE in m/PROCEEDINGS OF THE MEETING OF THE NORTH CENTRAL SECTK) <-- HERE N OF THE AMERICAN PSYCHOLOGICAL ASSOCIATION. / at ./x line 34. – fdsayre Apr 15 '09 at 04:58
  • line 34 = "if ($fileALines[$i] =~ /$line0/" – fdsayre Apr 15 '09 at 04:59
  • I'll take a look at it. Is that file (bul_2_6_200.txt) and its preceding file in the sets of files you provided? – dagorym Apr 15 '09 at 15:04
  • I just checked and it doesn't look like they are. It would be useful to have the files that caused the error to try to diagnose the problem. – dagorym Apr 15 '09 at 15:17
  • Don't need the files. Closer inspection revealed that the problem is coming from the first line of the second file having an unmatched parenthesis. Added a bit of code after the assignment to the $line0 variable to escape the () to prevent the problem. Try it now. – dagorym Apr 15 '09 at 17:32
  • Added file (along with some neighbours) to http://drop.io/fdsayre as more-test-material.tar.gz Thanks... I've looked at the file but cannot see anything that should be causing this problem. – fdsayre Apr 15 '09 at 17:49
  • Would it help to just remove all punctuation before running the script? I don't really need punctuation, all I am interested in is the word frequency. – fdsayre Apr 15 '09 at 17:51
  • Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE 4 8 / at ./x line 38. – fdsayre Apr 15 '09 at 17:56
  • never mind. Some weird characters in some of these files, but only a couple and the script's output makes it easy to fix these by hand. Am testing how well it works now. Thanks – fdsayre Apr 15 '09 at 19:42
  • Fantastic. Thank you so much. I'll cite back to this page in bibliography unless you would rather I cite to another page/person. Mind if I contact you just to clarify what this script does (I should probably understand how it works and not just that it works) – fdsayre Apr 15 '09 at 20:42
  • Yes you can contact me at thomas.stephens@nasa.gov. If you want, I can send you a heavily commented version describing what is happening and why. – dagorym Apr 15 '09 at 21:44
0

Are the stubs identical to the end of the previous file? Or do they differ in line endings/OCR mistakes?

Is there a way to discern an article's beginning? Maybe an indented abstract? Then you could go through each file and discard everything before the first and after (including) the second title.

Tobias
  • The OCR is good, so they are for all practical purposes identical. See my comment on the OP about what marks the beginning of each article proper; I used all 300 chars there. :) – fdsayre Apr 06 '09 at 22:11
0

Are the titles & author always on a single line? And does that line always contain the word "BY" in uppercase? If so, you can probably do a fair job with awk, using those criteria as the begin/end marker.
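
Not part of the original answer, but here is a rough sketch of that marker idea (in Perl rather than awk, to match the other answers here), assuming the all-caps title/author line contains "BY":

#!/usr/bin/perl
# Keep only the text from the first title/BY marker up to (but not including)
# the second one, which would be the start of the next article's stub.
use strict;
use warnings;

my @lines  = <>;    # one article file via stdin or as an argument
my @starts = grep { $lines[$_] =~ /^[A-Z][^a-z]*\bBY\b/ } 0 .. $#lines;

if (@starts) {
    my $from = $starts[0];
    my $to   = @starts >= 2 ? $starts[1] - 1 : $#lines;
    print @lines[$from .. $to];
} else {
    print @lines;    # no marker found, so pass the file through untouched
}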

Edit: I really don't think that using diff is going to work as it is a tool for comparing broadly similar files. Your files are (from diff's point of view) actually completely different - I think it will get out of sync immediately. But then, I'm not a diff guru :-)

  • The title and author's name are usually on separate, consecutive lines in all caps. Unfortunately these do not always mark the beginning of the article; for example, some articles (reviews) do not have title/author names, so something with DIFF would probably work best. – fdsayre Apr 06 '09 at 22:42
0

A quick stab at it, assuming that the stub is strictly identical in both files:

#!/usr/bin/perl

use strict;

use List::MoreUtils qw/ indexes all pairwise /;

my @files = @ARGV;

my @previous_text;

for my $filename ( @files ) {
    open my $in_fh,  '<', $filename          or die;
    open my $out_fh, '>', $filename.'.clean' or die;

    my @lines = <$in_fh>;
    print $out_fh destub( \@previous_text, @lines );
    @previous_text = @lines;
}


sub destub {
    my @previous = @{ shift() };
    my @lines = @_;

    my @potential_stubs = indexes { $_ eq $lines[0] } @previous;

    for my $i ( @potential_stubs ) {
        # check if the two documents overlap for that index
        my @p = @previous[ $i.. $#previous ];
        my @l = @lines[ 0..$#previous-$i ];

        return @lines[ $#previous-$i + 1 .. $#lines ]
                if all { $_ } pairwise { $a eq $b } @p, @l;

    }

    # no stub detected
    return @lines;
}
Yanick