How can I split a text at upper-case names followed by (blabla) – in R?

Question

I have text data that include parliamentary speeches like this:

 df <- "MEHMET ALİ ÇELEBİ (İzmir) – Teşekkürler Sayın Başkan.

26-30 Haziran Özel Güvenlik Görevlileri Haftası’ndayız, kutluyorum.

Bugün ülkemizde 350 bini aşkın özel güvenlik görevlimiz esnek ve güvencesiz çalışmanın en ağır koşullarına muhataptır.

Bir: Maaş, özlük hakları, çalışma şartları ve risk tazminatları iyileştirilmelidir. İki: Görev tanımı dışında çalıştırılmaları engellenmelidir. Üç: Yıpranma hakkı, ödüllendirme, şehit ve gazilik talepleri karşılanmalıdır. Dört: Onlara yönelik yeni bir işçi sağlığı ve iş güvenliği düzenlemesi yapılmalıdır. Beş: Adli vakalarda avukat desteği verilmelidir. Altı: Taşeronda değil, çalıştıkları kurum bünyesinde istihdam edilmelidirler. Yedi: Belediye şirketlerine geçen özel güvenliklerimizin iş kollarının belirsizliği giderilmelidir. “Özel güvenlik her yerde, görmezden gelme!” diyorum, yüce Meclisi saygıyla selamlıyorum. 

BAŞKAN – Sayın Sazak… 

METİN NURULLAH SAZAK (Eskişehir) – Teşekkürler Başkanım.

Türk sinemasının değerli ismi, Eskişehirli hemşehrim Cüneyt Arkın’a Allah’tan rahmet; ailesine, sevenlerine sabırlar dilerim. Türk sinemasının başı sağ olsun. Cüneyt Arkın, oynamış olduğu filmlerde, Türk tarihinin önemli kahramanlarını gençliğe sevdirmiş; sadece sinemada değil, yaşadığı hayatta da duruşuyla takdir toplamıştır. Ruhu şad, mekânı cennet olsun. 

BAŞKAN – Sayın Aycan…

SEFER AYCAN (Kahramanmaraş) – Sayın Başkan, şehirlerimiz büyümektedir; bu nedenle de yeni imar planlarına, imar bölgelerine ihtiyaç doğmaktadır.

Sağlıklı şehirleşme, imar planı doğrultusunda alt yapısı tamamlanarak yeni imar bölgeleri oluşturmaktan geçmektedir; kentsel dönüşüm, sağlıklı şehirleşme ve güvenli bina için de buna ihtiyaç vardır. Şehrim Kahramanmaraş’ın merkezi de konut ihtiyacı açısından tıkanmıştır, yeni imar planına ihtiyacı vardır. Güneyi tarım arazileridir; buralara zarar vermemek, imara kapamak gerekmektedir.

Diğer taraftan, eski mahallelerde kentsel dönüşüm zorunlu hâle gelmiştir; bu nedenle, eski mahallelerde, özellikle Dulkadiroğlu bölgesinde kentsel dönüşümün teşvik edilmesi, kart sayısının 2’den 4’e hatta 6 veya 8 katlara çıkarılması gerekmektedir. Kahramanmaraş’ta kentsel dönüşüm teşvik edilmelidir; böylece, yeniden şehirleşme sağlanmalıdır; böylece, şehrin merkezinde konut ihtiyacı karşılanmış olacaktır.

Saygılarımla."

I want to split the text by speaker using the strsplit function and end up with something like this:

#      [Speaker]                            [text]                           
# [1,] "MEHMET ALİ ÇELEBİ"                  "Teşekkürler Sayın Başkan. 26-30 Haziran..."
# [2,] "METİN NURULLAH SAZAK"               "Teşekkürler Başkanım. Türk sinemasının..."   
# [3,] "SEFER AYCAN"                        "Sayın Başkan, şehirlerimiz büyümektedir..."

I tried the code below but could not get this result. I also have a list of the speakers, in case regular expressions are not enough.

pat <- r"{(?>\p{Lu}+?\s?)+\(?\p{Lu}+\)?\K(:)|(?<!\w)(\s)(?=\p{Lu}{2,})}"
tmp <- trimws(el(strsplit(df, pat, perl=TRUE)))[-1]
res <- matrix(tmp, ncol=2, byrow=TRUE)
res

Could you help me? I am fairly new to R. Thanks in advance.

Is the pattern `NAME (location) - text` always the same? If so, it would be much easier to use the brackets plus the `-` as separators — kabr, Jul 22 '22 at 10:41
Yes, it is always the same, except for the president of the parliament, whose turns always look like "BAŞKAN -". The actual data follow this order: SPEAKER (Location) - ...... BAŞKAN - ..... SPEAKER (Location) — noetherlaw, Jul 22 '22 at 10:44
Maybe [this solution](https://ideone.com/jTyO1X) will be enough? — Wiktor Stribiżew, Jul 22 '22 at 10:46
I have now corrected the df that includes "BAŞKAN -......" too. — noetherlaw, Jul 22 '22 at 10:48
Actually, my data include much more than this, such as reports or roll call rates, but 90% of the df follows the "SPEAKER (Location) - ...... BAŞKAN - ..... SPEAKER (Location)...BAŞKAN - ....." order. — noetherlaw, Jul 22 '22 at 10:54

score 0 · Answer 1 · edited Jul 22 '22 at 10:59

One workflow is to look for the `(location) –` pattern, insert a placeholder right after it, and then extract the parts on either side of the placeholder.

Example:

library(dplyr)   # tibble(), mutate(), %>%
library(stringr) # str_replace(), str_extract()

df <- tibble(text = c("MEHMET ALİ ÇELEBİ (İzmir) – Teşekkürler Sayın Başkan. 26-30 Haziran Özel Güvenlik Görevlileri Haftası’ndayız, kutluyorum. Bugün ülkemizde 350 bini aşkın özel güvenlik görevlimiz esnek ve güvencesiz çalışmanın en ağır koşullarına muhataptır. Bir: Maaş, özlük hakları, çalışma şartları ve risk tazminatları iyileştirilmelidir. İki: Görev tanımı dışında çalıştırılmaları engellenmelidir. Üç: Yıpranma hakkı, ödüllendirme, şehit ve gazilik talepleri karşılanmalıdır. Dört: Onlara yönelik yeni bir işçi sağlığı ve iş güvenliği düzenlemesi yapılmalıdır. Beş: Adli vakalarda avukat desteği verilmelidir. Altı: Taşeronda değil, çalıştıkları kurum bünyesinde istihdam edilmelidirler. Yedi: Belediye şirketlerine geçen özel güvenliklerimizin iş kollarının belirsizliği giderilmelidir. “Özel güvenlik her yerde, görmezden gelme!” diyorum, yüce Meclisi saygıyla selamlıyorum.", "METİN NURULLAH SAZAK (Eskişehir) – Teşekkürler Başkanım. Türk sinemasının değerli ismi, Eskişehirli hemşehrim Cüneyt Arkın’a Allah’tan rahmet; ailesine, sevenlerine sabırlar dilerim. Türk sinemasının başı sağ olsun. Cüneyt Arkın, oynamış olduğu filmlerde, Türk tarihinin önemli kahramanlarını gençliğe sevdirmiş; sadece sinemada değil, yaşadığı hayatta da duruşuyla takdir toplamıştır. Ruhu şad, mekânı cennet olsun.","SEFER AYCAN (Kahramanmaraş) – Sayın Başkan, şehirlerimiz büyümektedir; bu nedenle de yeni imar planlarına, imar bölgelerine ihtiyaç doğmaktadır. Sağlıklı şehirleşme, imar planı doğrultusunda alt yapısı tamamlanarak yeni imar bölgeleri oluşturmaktan geçmektedir; kentsel dönüşüm, sağlıklı şehirleşme ve güvenli bina için de buna ihtiyaç vardır. Şehrim Kahramanmaraş’ın merkezi de konut ihtiyacı açısından tıkanmıştır, yeni imar planına ihtiyacı vardır. Güneyi tarım arazileridir; buralara zarar vermemek, imara kapamak gerekmektedir."))


df_with_split <- df %>% 
  mutate(text_helper = text,
         # look for (LOCATION) - and replace with SPLIT_HERE
         text_helper = str_replace(text_helper, "(?<=\\p{L}.\\)\\s–)", "SPLIT_HERE"),
         left = trimws(str_extract(text_helper, ".*(?=SPLIT_HERE)")),
         right = trimws(str_extract(text_helper, "(?<=SPLIT_HERE).*")))

df_with_split

# A tibble: 3 × 4
  text                                                                                                             text_helper left  right
  <chr>                                                                                                            <chr>       <chr> <chr>
1 MEHMET ALİ ÇELEBİ (İzmir) – Teşekkürler Sayın Başkan. 26-30 Haziran Özel Güvenlik Görevlileri Haftası’ndayız, k… MEHMET ALİ… MEHM… Teşe…
2 METİN NURULLAH SAZAK (Eskişehir) – Teşekkürler Başkanım. Türk sinemasının değerli ismi, Eskişehirli hemşehrim C… METİN NURU… METİ… Teşe…
3 SEFER AYCAN (Kahramanmaraş) – Sayın Başkan, şehirlerimiz büyümektedir; bu nedenle de yeni imar planlarına, imar… SEFER AYCA… SEFE… Sayı…
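
The placeholder idea can also be applied to one long string, as in the question, rather than to a tibble that already has one speech per row. Below is a minimal base-R sketch under several assumptions: the session uses UTF-8, every turn starts at the beginning of a line, and `txt`, `header` and the `SPLIT_HERE` marker are illustrative names (the sample is a shortened, hypothetical miniature of the transcript). `\p{Lu}` stands for any upper-case letter, so Turkish letters such as Ş and İ are covered, and the `(Location)` group is optional so that the `BAŞKAN –` lines are matched as well.

```r
# Hypothetical miniature of the transcript: one string, several turns.
# "\u2013" is the en dash used in the data; "\u2026" is the ellipsis.
txt <- paste(
  "MEHMET ALİ ÇELEBİ (İzmir) \u2013 Teşekkürler Sayın Başkan.",
  "BAŞKAN \u2013 Sayın Sazak\u2026",
  "METİN NURULLAH SAZAK (Eskişehir) \u2013 Teşekkürler Başkanım.",
  sep = "\n\n"
)

# A header at the start of a line: upper-case words, an optional
# "(Location) ", then an en dash and a space.
header <- "(?m)^([\\p{Lu} ]+(?:\\([^)]*\\) )?\u2013 )"

# Step 1: put a marker in front of every header (the header itself is kept).
marked <- gsub(header, "SPLIT_HERE\\1", txt, perl = TRUE)

# Step 2: split on the marker and drop the empty leading piece.
pieces <- strsplit(marked, "SPLIT_HERE", fixed = TRUE)[[1]]
pieces <- trimws(pieces[nzchar(pieces)])

pieces
```

Each element of `pieces` still begins with its own header, so speaker and text can be separated in a second step.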

Note that [`[A-z]` matches more than just ASCII letters](http://stackoverflow.com/questions/29771901/why-is-this-regex-allowing-a-caret/29771926#29771926). `\p{L}` matches any letter. — Wiktor Stribiżew, Jul 22 '22 at 10:59
Thanks for your response. However, my raw data include much more than this, such as reports or roll call rates, but 90% of the df follows the "SPEAKER (Location) - ...... BAŞKAN - ..... SPEAKER (Location)...BAŞKAN - ....." order. I updated the relevant part above. — noetherlaw, Jul 22 '22 at 11:12
Sure, I expected that these are only the standard cases. I did a big project on text data from the German parliament a few years ago. Text can be messy. My approach and @Rémy Pétremand's give you pointers in the right direction. You could then check whether the (location) part exists and, if it does not, do something different. The text you sent is quite messy, and structuring it in one (!) step is probably impossible. I suggest a step-by-step approach — kabr, Jul 22 '22 at 11:50
Thanks again. Actually, all the speech parts of these raw texts include the pattern here: "SPEAKER (Location) - ...... BAŞKAN - ..... SPEAKER (Location) - ......BAŞKAN - .....SPEAKER (Location) - ......BAŞKAN - .....". Is it still not possible to isolate them? I could not get results from these approaches. — noetherlaw, Jul 22 '22 at 12:33
Of course you can. Just include `BAŞKAN -` in your regex, e.g. `text_helper = str_replace(text_helper, "(?<=\\p{L}.\\)\\s–)|(?<=BAŞKAN -)", "SPLIT_HERE")` — kabr, Jul 22 '22 at 13:13

score 0 · Answer 2 · answered Jul 22 '22 at 11:04

I have found a simple way to obtain the same result by using the extract function from the tidyr library.

For each speech I use the regex "(.*) [(](.*)[)] – (.*)", whose three () groups extract the author, the location and the text of the speech, respectively.

# Load libraries
library(dplyr) 
library(tidyr)

# Get a data.frame with one row per non-empty line (one speech per row)
speaches <- trimws(strsplit(x = df, split = "\n")[[1]], which = "both")
speaches <- speaches[speaches != ""]
speaches <- data.frame("speaches_raw" = speaches)

# Extract the author, the location and the speech into separate columns
res <- speaches %>% 
  extract(speaches_raw, c("author", "location", "speach"), "(.*) [(](.*)[)] – (.*)", remove = FALSE)

And the resulting data.frame looks like:

Thanks for your response. However, my raw data include much more than this, such as reports or roll call rates, but 90% of the df follows the "SPEAKER (Location) - ...... BAŞKAN - ..... SPEAKER (Location)...BAŞKAN - ....." order. I updated the relevant part above. — noetherlaw, Jul 22 '22 at 11:16

score 0 · Answer 3 · answered Jul 22 '22 at 12:43

I'm not sure if this is of help given the OP's comments to the previous answers (which seem to imply that the data given is not entirely representative of the actual data). With the data as posted, this solution works:

Note that, for simplicity's sake, I've included in the character classes the two upper-case characters in the sample that fall outside A-Z, namely İ and Ç; if the Turkish alphabet has more such letters, they may need to be included as well:

library(dplyr) # mutate()
library(tidyr)
data.frame(df) %>%
  # separate into rows:
  separate_rows(df, sep = "\\n\\n(?=[A-ZİÇ ]+\\(.*?\\) – )") %>%
  # remove new-line character:
  mutate(df = gsub("\\n", "", df)) %>%
  # extract into columns:
  extract(df,
          into = c("Speaker", "Text"),
          regex = "^([A-ZİÇ ]+) [^–]+– (.*)")
# A tibble: 3 × 2
  Speaker              Text                                                                                                           
  <chr>                <chr>  
1 MEHMET ALİ ÇELEBİ    "Teşekkürler Sayın Başkan.26-30 Haziran Özel Güvenlik Görevlileri Haftası’ndayız, kutluyorum.Bugün ülkemizde 3…
2 METİN NURULLAH SAZAK "Teşekkürler Başkanım.Türk sinemasının değerli ismi, Eskişehirli hemşehrim Cüneyt Arkın’a Allah’tan rahmet; ai…
3 SEFER AYCAN          "Sayın Başkan, şehirlerimiz büyümektedir; bu nedenle de yeni imar planlarına, imar bölgelerine ihtiyaç doğmakt…
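
With the hard-coded class `[A-ZİÇ ]`, a header such as `BAŞKAN –` would not be recognised, because Ş is not in the class (the Turkish alphabet also has Ğ, Ö and Ü). A possible variant, sketched here in base R, uses `\p{Lu}` for any upper-case letter, treats the `(Location)` part as optional, assigns every paragraph to the most recent header above it, and then drops the chair's turns. This is only a sketch under assumptions: the session uses UTF-8, headers always start a paragraph, and `txt`, `paras`, `header` and `res` are illustrative names operating on a shortened, hypothetical sample.

```r
# Hypothetical miniature of the transcript ("\u2013" is the en dash).
txt <- paste(
  "MEHMET ALİ ÇELEBİ (İzmir) \u2013 Teşekkürler Sayın Başkan.",
  "Bugün kutluyorum.",
  "BAŞKAN \u2013 Sayın Sazak\u2026",
  "METİN NURULLAH SAZAK (Eskişehir) \u2013 Teşekkürler Başkanım.",
  sep = "\n\n"
)

# Step 1: one element per non-empty paragraph.
paras <- trimws(strsplit(txt, "\n+")[[1]])
paras <- paras[nzchar(paras)]

# Step 2: flag paragraphs that open with "NAME (Location) – " or "NAME – ".
header <- "^([\\p{Lu} ]*\\p{Lu})(?: \\([^)]*\\))? \u2013 "
is_head <- grepl(header, paras, perl = TRUE)

# Step 3: each paragraph belongs to the latest header seen so far.
group <- cumsum(is_head)
keep <- group > 0  # drops any preamble before the first header
speech <- unname(vapply(split(paras[keep], group[keep]),
                        paste, character(1), collapse = " "))

# Step 4: separate speaker from text, then drop the chair's turns.
res <- data.frame(
  Speaker = sub(paste0(header, ".*$"), "\\1", speech, perl = TRUE),
  Text = sub(header, "", speech, perl = TRUE),
  stringsAsFactors = FALSE
)
res <- res[res$Speaker != "BAŞKAN", ]
res
```

Paragraphs that are not speeches at all (reports, roll calls) would still be attached to the preceding speaker here, so real transcripts would probably need an extra filtering step.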

Chris Ruehlemann

Thanks for your response. Actually, all the speech parts of these raw texts include the pattern here: "SPEAKER (Location) - ...... BAŞKAN - ..... SPEAKER (Location) - ......BAŞKAN - .....SPEAKER (Location) - ......BAŞKAN - .....". Is it still not possible to isolate them? I could not get results from these approaches. – noetherlaw Jul 22 '22 at 12:55
Can you please post the data in a form that reflects its actual shape? It does not make sense and is annoying for ppl who want to help to be given one set of data only to learn that the solution they've elaborated for it does not work on the actual data. – Chris Ruehlemann Jul 22 '22 at 16:49
Sorry for the misunderstandings. This is my first project. "https://www5.tbmm.gov.tr/tutanak/donem27/yil5/ham/b10901h.htm" is an example of the htm links that I used for web scraping. There are 2430 similar texts in my data frame. – noetherlaw Jul 22 '22 at 17:00
No that doesn't work like that. Can you please post the output of `dput(head(YOURDATA))`? – Chris Ruehlemann Jul 22 '22 at 17:50
"https://drive.google.com/file/d/1PYm-f3K6AjHess834O9sTsfD77JrPRcg/view?usp=sharing" I created an RDS file for the results of dput(head(YOURDATA)) – noetherlaw Jul 22 '22 at 18:24
No sorry, that doesn't work either. You need to post the output of `dput(head(YOURDATA))` **as part of** your question so that we can copy and paste it from there into R to work with it. – Chris Ruehlemann Jul 23 '22 at 08:00
Thanks for your patience. I could not do it because of the character limit: the output of dput(head(YOURDATA)) was too long. I still can't, because dput(DATA) on any randomly selected item produces more characters than Stack Overflow allows. So I do not know how to make it work. – noetherlaw Jul 23 '22 at 08:44
Then try to assemble manually some sample data that is representative of the variability in your actual data and post that. – Chris Ruehlemann Jul 23 '22 at 10:16
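
If the `dput()` output of a whole transcript exceeds the post length limit, one option is to cut a short excerpt from a single document before calling `dput()`. A minimal sketch, where `raw_text` is a hypothetical stand-in for one scraped transcript and the 300-character cut-off is arbitrary:

```r
# Hypothetical stand-in for one long scraped transcript.
raw_text <- paste(rep("BAŞKAN \u2013 Sayın Sazak\u2026", 200), collapse = "\n\n")

# Keep only the first 300 characters so the dput() output stays short.
excerpt <- substr(raw_text, 1, 300)

# Prints R code that recreates `excerpt`; this is what can be pasted into a post.
dput(excerpt)
```

The excerpt should still contain the header variants that occur in the real data, so the cut-off may need adjusting.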
