I am using an algorithm to lemmatize a text vector. The output is a .txt file stored in the format shown below.

The original word is listed in the first column, whilst the various lemmas are listed in the second column, followed by some grammatical classifications. I want to read this into R, but have no idea how to do this. I have tried various forms of separators, but none seem to work.

Ideally, the data frame in R should look like the wanted structure at the end of this question, where I keep only the first occurrence of each lemma.

Perhaps the best option would be to read the data, keep only the first occurrence (i.e. da da adv), then do something like text-to-columns and keep only the first two columns.

Output from lemmatization algorithm:

"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres <aux1/perf_part>
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3

Wanted structure:

da      da
dette   dette
er      være
den     den
Oda Ned
  • Hi, posting data as an image is not the best way to share it. Since the data is in a .txt file, you could paste your example as text (preserving the spaces, etc.), so that we can copy it into a .txt file and reproduce your situation on our computers. – s__ Apr 30 '20 at 08:41
  • Thank you for the comment, you are of course entirely correct... I have updated the question now. – Oda Ned Apr 30 '20 at 08:52
  • Your text doesn't match your image... which is correct? – rg255 Apr 30 '20 at 08:58
  • I think they should match now... – Oda Ned Apr 30 '20 at 09:01

2 Answers

Here's an interesting result: You can read the file quite nicely with read.table:

s <- '"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres <aux1/perf_part>
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3
'

x <- read.table(sep='', text=s, colClasses=c('character','character'), flush=TRUE, fill=TRUE)

> x
        V1    V2   V3
1     <da>           
2       da   adv     
3       da   sbu     
4       da subst fork
5  <dette>           
6    dette   det  dem
7    dette  pron nøyt
8    dette  verb  inf
9     <er>           
10    være  verb pres
11   <den>           
12     den   det  dem
13     den   det  dem
14     den  pron mask
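
Only three columns come back because, with fill=TRUE, read.table decides the number of columns from the first five lines of input, and flush=TRUE then drops the extra fields on longer lines. That is harmless here, since only V1 is needed. If the grammatical tags should be kept as well, one option is to count the fields first and pass explicit col.names. A minimal sketch on the first two groups of the sample; the names `sample_lines`, `n_fields` and `x_full` are chosen for illustration only:

```r
# First two groups of the sample, one element per line of the file.
sample_lines <- c(
  '"<da>"',
  '    "da" adv',
  '    "da" sbu',
  '    "da" subst fork',
  '"<dette>"',
  '    "dette" det dem nøyt ent',
  '    "dette" pron nøyt ent pers 3',
  '    "dette" verb inf'
)

# Largest number of whitespace-separated fields on any line.
con <- textConnection(sample_lines)
n_fields <- max(count.fields(con, sep = ''))
close(con)

# Explicit col.names make read.table allocate that many columns up front,
# instead of guessing the width from the first five lines.
x_full <- read.table(
  text = sample_lines, sep = '', fill = TRUE,
  colClasses = 'character',
  col.names = paste0('V', seq_len(n_fields))
)
```

For a file on disk, the same two calls should work with the file name in place of the text connection and the `text` argument.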

Using packages dplyr and tidyr, we can unpack it into:

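library(dplyr)
library(tidyr)

# Rows containing '<' start a new word; cumsum() turns them into group ids.
(y <- x %>% mutate(a=grepl('<', V1, fixed=TRUE), b=cumsum(a)) %>% 
  group_by(b) %>% 
  summarise(verbs=list(t(unique(V1)))) %>% 
  unnest(cols=c(verbs)))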
# A tibble: 4 x 2
      b verbs[,1] [,2] 
  <int> <chr>     <chr>
1     1 <da>      da   
2     2 <dette>   dette
3     3 <er>      være 
4     4 <den>     den  

result <- y$verbs
result[,1] <- gsub('(<|>)', '', result[,1])
result

    [,1]    [,2]   
[1,] "da"    "da"   
[2,] "dette" "dette"
[3,] "er"    "være" 
[4,] "den"   "den"
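
The same result can also be reached in base R, without stacking per-group matrices. This may help when the matrices end up with different widths, which I assume happens when a word has several distinct lemmas. A minimal sketch that rebuilds `x` from the sample; `lines`, `is_header`, `group`, `first_lemma`, `words` and `result` are illustrative names, and it assumes no lemma itself starts with '<':

```r
# The sample from the question, one element per line of the file.
lines <- c(
  '"<da>"',
  '    "da" adv',
  '    "da" sbu',
  '    "da" subst fork',
  '"<dette>"',
  '    "dette" det dem nøyt ent',
  '    "dette" pron nøyt ent pers 3',
  '    "dette" verb inf',
  '"<er>"',
  '    "være" verb pres <aux1/perf_part>',
  '"<den>"',
  '    "den" det dem fem ent',
  '    "den" det dem mask ent',
  '    "den" pron mask fem ent pers 3'
)
x <- read.table(text = lines, sep = '', fill = TRUE, flush = TRUE,
                colClasses = 'character')

# Header rows look like <da>; every header starts a new group.
is_header <- startsWith(x$V1, '<')
group <- cumsum(is_header)

# First lemma of each group, taken from the non-header rows only.
first_lemma <- tapply(x$V1[!is_header], group[!is_header], function(v) v[1])

# Look lemmas up by group id, so a header without readings yields NA.
words <- x$V1[is_header]
result <- data.frame(
  word  = gsub('[<>]', '', words),
  lemma = as.vector(first_lemma[as.character(seq_along(words))]),
  stringsAsFactors = FALSE
)
```

Taking `v[1]` rather than `unique(v)` keeps exactly one lemma per word, which matches the "first occurrence" requirement in the question.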
MrGumble
  • Very nice import, more straightforward than mine; it gets my upvote. – s__ Apr 30 '20 at 09:28
  • This is a very elegant solution, thanks! I am however getting the following error message when running the second part (mutate etc): Error in vec_rbind(!!!x, .ptype = ptype) : Internal error in `vec_assign()`: `value` should have been recycled to fit `x`. Any suggestions for how to solve this? I'm still relatively new to R, so any help is really appreciated! – Oda Ned Apr 30 '20 at 13:09
  • I was able to find a work-around using the same intuition as your suggestion, but using aggregate instead. Thank you so much for your help! – Oda Ned Apr 30 '20 at 13:59

This worked for me when I copy-pasted the text into a text file:

# Read the data
data <- readLines('temp.txt')
# Flag the lines that start a new group: a line without leading
# whitespace is treated as a group header such as "<da>".
groups <- !startsWith(data, ' ')
# Each group holds its header first and its readings after it, so the
# second element of a group carries the first lemma.
value <- tapply(data, cumsum(groups), function(x) 
                     sub('"(\\w+).*', '\\1', trimws(x[2])))
# Create a data frame with the group name and the first lemma.
data.frame(groups = data[groups], value)


#    groups value
#1    "<da>"    da
#2 "<dette>" dette
#3    "<er>"  være
#4   "<den>"   den
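
The group names above still carry the quotes and angle brackets from the file. A possible variant that strips them is sketched below on an inline vector instead of temp.txt; `lines`, `is_header`, `first_reading` and `result` are illustrative names. It takes the lemma as everything between the first pair of double quotes instead of relying on \\w+, so a lemma containing a hyphen would be kept whole:

```r
# Two groups from the sample; readLines('temp.txt') would give the same shape.
lines <- c(
  '"<da>"',
  '    "da" adv',
  '    "da" sbu',
  '"<er>"',
  '    "være" verb pres <aux1/perf_part>'
)

# A header is any line that does not start with whitespace (space or tab).
is_header <- !grepl('^\\s', lines)

# Second element of each group = first reading (NA if a header has none).
first_reading <- tapply(lines, cumsum(is_header), function(g) g[2])

result <- data.frame(
  # Drop the leading "< and the trailing >" from the header.
  word  = gsub('^"<|>"$', '', lines[is_header]),
  # Keep only the text between the first pair of double quotes.
  lemma = sub('^\\s*"([^"]*)".*$', '\\1', as.vector(first_reading)),
  stringsAsFactors = FALSE
)
```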
Ronak Shah