I would like to count the documents in which two strings appear within a set distance of each other, say within 10 words. Let's say 'German*' and 'War'. I do not want the total number of times they co-occur, only the number of documents in which the pair appears (a document counts once, however many times the pair occurs in it).

I know how to count documents that contain a word. But I am not sure whether I need to extract 10-grams and see whether the two words appear and then count this per document, or if there is a more efficient way.

Melvin Wevers
  • How large are the documents? My first idea is to hold each document in a list as one single string, then use a regex to grep every phrase running from 'german' to 'war'. Then split the result into words and count them. – M. Siwik Aug 25 '16 at 08:11
  • They are rather big (up to 500MB). – Melvin Wevers Aug 25 '16 at 08:24
  • So I guess each line of a document is in a different string? Then grep for your keywords; I guess your data will then be a lot smaller. After this, if the strings are close to each other, you could join them and count the words between 'german' and 'war'. – M. Siwik Aug 25 '16 at 08:26
  • @Melvin could you maybe share some text as in your data structure? – 000andy8484 Aug 25 '16 at 09:33

1 Answer

Here is a small function that tests whether two words occur within 100 characters of each other in a text.

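isclose <- function(text) {
  limit <- 100 # maximum distance in characters
  # Start positions of every match; gregexpr() returns -1 when there is none
  match1 <- gregexpr('war', text)[[1]]
  match2 <- gregexpr('German', text)[[1]]
  if (match1[1] == -1 || match2[1] == -1) return(FALSE)

  # TRUE as soon as any pair of match positions is closer than the limit
  for (i in seq_along(match1)) {
    for (j in seq_along(match2)) {
      if (abs(match1[i] - match2[j]) < limit) return(TRUE)
    }
  }
  FALSE
}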

It works, but it measures the distance in characters; it should be improved to count words instead.
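One way to measure the distance in words is to split the text into tokens and compare token positions. The following is only a sketch under some assumptions: the name `is_close_words`, the patterns `^german` and `^war$`, the tokenisation on non-alphanumeric characters, and the reading of "within 10 words" as "word positions differ by at most 10" are all illustrative choices, not something taken from the question.

```r
# Sketch (hypothetical helper): TRUE if some token matching pattern1 and some
# token matching pattern2 are at most max_words positions apart.
is_close_words <- function(text, pattern1 = "^german", pattern2 = "^war$",
                           max_words = 10) {
  # Lowercase, then split on runs of non-alphanumeric characters
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]

  # Token positions of each term (grep() returns them in increasing order)
  pos1 <- grep(pattern1, words)
  pos2 <- grep(pattern2, words)
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)

  # For every pos1, look up its nearest pos2 on either side
  idx <- findInterval(pos1, pos2)
  before <- pos2[pmax(idx, 1)]
  after <- pos2[pmin(idx + 1, length(pos2))]
  min(pmin(abs(pos1 - before), abs(pos1 - after))) <= max_words
}

# Toy documents (made up for illustration)
docs <- c(
  "The German army entered the war in 1914.",
  paste("German", paste(rep("filler", 20), collapse = " "), "war"),
  "No relevant terms here."
)

# Each document counts at most once
hits <- vapply(docs, is_close_words, logical(1))
n_docs <- sum(hits)
n_docs  # 1
```

Using `findInterval()` avoids comparing every pair of positions, which may matter for very large documents with many matches.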

JohnBee
  • I am now using a regex to find it, and this works well. For instance, for Amerika and cigarettes: `\b(?:ameri[kc]a[a-z]*\W+(?:\w+\W+){1,10}?[sc]igaret[a-z]*|[sc]igaret[a-z]*\W+(?:\w+\W+){1,10}?ameri[kc]a[a-z]*)\b` – Melvin Wevers Aug 25 '16 at 10:54
  • Can you post a MWE? – JohnBee Aug 25 '16 at 12:13
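The regex approach from the comments can be turned into a per-document count with `grepl()`, which returns one TRUE/FALSE per document regardless of how many times the pattern matches. This is a minimal sketch for the 'German*' / 'war' case from the question; the pattern, the `{0,9}` bound (at most nine intervening words, so the two terms are at most 10 word positions apart) and the toy documents are assumptions for illustration.

```r
# Either order: german* ... war, or war ... german*, with up to 9 words between
pattern <- paste0(
  "\\b(?:german\\w*\\W+(?:\\w+\\W+){0,9}?war",
  "|war\\W+(?:\\w+\\W+){0,9}?german\\w*)\\b"
)

# Toy documents (made up for illustration)
docs <- c(
  "The German army entered the war in 1914.",
  paste("German", paste(rep("filler", 20), collapse = " "), "war"),
  "No relevant terms here."
)

# One logical per document, so repeated matches in a document count once
hits <- grepl(pattern, docs, perl = TRUE, ignore.case = TRUE)
n_docs <- sum(hits)
n_docs  # 1
```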