How to apply (semi-)supervised methods (structural topic models, seeded LDA) to corpora with only one topic and aggregate their results per year in R?

Question

I am currently working on a project in which I am interested in the prevalence of one topic (social inequality) in German plenary debates and newspaper articles. I am using quantitative text analysis tools to generate measures from the texts that I can include in regression analyses alongside economic measures such as the Gini coefficient. So far, I have created a dictionary to represent social inequality, which has allowed me to generate count and frequency data aggregated on a yearly basis. However, given the known limitations of dictionary analysis, I would also like to use (semi-)supervised methods such as seeded LDA or structural topic models.

In this context, I face two difficulties. First, I struggle to implement any of these methods in R, as they seem designed to work with multiple topics rather than to focus on just one. Second, I don't know which of the output objects I should use for the regression analyses, or how to aggregate them on a yearly basis.

So far I have tried seeded LDA from the "seededlda" package, as it seems to be the most suitable. I have two text corpora: one consisting of 2296 documents (plenary debates), the other of 190404 documents (newspaper articles). However, when I try to fit the seeded LDA model, R returns no result and freezes for both corpora.

I have created the following dictionary:

require(quanteda)
dict <- dictionary(list(Alles = c("soziale_sich*", "soziale_teilhabe*", "soziale_verunsich*", 
 "soziale_zwänge", "soziale_ängste", "sozialer_absturz", "sozialer_sprengstoff", "sozialer_zwang",
 "soziales_schicksal", "sozialrisik*", "sozialschmarotzer", "steigende_haushaltseinkommen",
 "superreiche", "unterprivilegierte_gruppe*", "verarmung*", "vermögenskonzentration*", 
 "wirtschaftliche_und_soziale_auswirkung*", "wohlfahrtsdiktatur", "ärmere_hälfte_der_bevölkerung",
 "ärmsten_haushalte", "ökonomische_sachzwänge*", "abzug_von_vorsorgeaufwendungen", "agenda_2010", 
 "allgemeinverbindliche_tarifverträge", "arbeitnehmersparzulage", "arbeitsbedingung*", 
 "arbeitsmarktinklusion*", "arbeitsmarktpolitische_maßnahme*", "bafög*", "besteuerung_von_erbschaften",
 "besteuerung_von_kapitalerträgen", "bildungs-_und_teilhabepaket", "bildungsbenachteiligung*",
 "bildungsfern*", "bildungspaket*", "bildungsungleichheit*", "daseinsvorsorge", "einkommen_entlasten",
 "empowerment", "entlastung_niedriger_und_mittlerer_einkommen", "entlastung_von_einkommen", 
 "erbschaftsbesteuerung", "erbschaftssteuer", "erhöhung_des_Mindestlohns", "freibetrag", "hartz_IV*",
 "lohnentwicklung*", "mehr_soziale_rechte", "mehr_umverteilung*", "mindestlohn", "mindestlöhne", 
 "mini_job*", "neue_soziale_frage", "reichensteuer", "sozialabgaben", "sozialbudget",
 "soziale_integration*", "soziale_sanktion*", "sozialgesetzgebung*", "sozialhilfeempfänger*", 
 "sozialintegration", "sozialleistungsquote*", "sozialpolitische_maßnahme*", 
 "sozialpolitische_reformmaßnahme*", "sozialstaat*", "sozialtarife", "sozialversicherung*", 
 "staatliche_förderung_der_vermögensbildung", "steuerlicher_ausgleich", "steuervergünstigung*", 
 "stärkung_der_tarifbindung*", "teilhabeleistung", "teilhabepaket", "transferleistung*", 
 "vermögensbesteuerung*", "verteilungspolitik", "weniger_umverteilung*", "wohlfahrt*",  
 "wohlfahrtsstaat*", "wohlstandsverlust*", "wohngeld*", "wohnungsbauprämie*", 
 "Verteilungs-_und_Stratifikationsaspekte", "aufstiegsmobilität", "einkommensgrenz*", 
 "einkommensschere", "einkommensschwach*", "einkommensungleich*", "einkommensungleichheit*", 
 "einkommensunterschied*", "einkommensverteilung*", "gesellschaftliche_mobilität*", 
 "gesellschaftlicher_abstieg*", "gesellschaftlicher_aufstieg*", "gini*", 
 "gleichheit_der_lebensverhältnisse", "konkurrent_um_soziale_güter", "konsumungleichheit*", 
 "lohngerechtigkeit*", "pay_gap", "schonvermögen*", "steuergerechtigkeit*", "steuerungerechtigkeit*",
 "ungleiche_einkommensverteilung", "ungleiche_verteilung_der_vermögen*", 
 "ungleichheit_der_markteinkommen", "ungleichheit_der_vermögen*", "ungleichheit_von_vermögen", 
 "ungleichverteilung*", "vermögensungleichheit*", "verteilung_der_einkommen*", 
 "verteilung_von_einkommen", "verteilungsverhältnis*", "wirtschaftliche_kluft",
 "wirtschaftlicher_aufstieg*", "ökonomischer_aufstieg*","arbeit_muss_sich_lohnen", 
 "egalitäre_gesellschaft*", "einkommensgerecht*", "entsolidarisierung", "faire_löhne", 
 "generationengerechtigkeit", "gerechte_löhne", "gerechter_lohn*",  "gesellschaftlich_gleich*", 
 "gesellschaftliche_chance*", "gesellschaftliche_emanzipation", "gesellschaftliche_gleich*", 
 "gesellschaftliche_kohäsion", "gesellschaftliche_schieflage*", "gesellschaftliche_solidarität",
 "gesellschaftliche_verantwort*", "gesellschaftlicher_zusammenhalt", "gleiche_gesellschaft*", 
 "lohngerechtigkeit*", "politische_gleichheit*", "schlechte_soziale_verhältnisse",
 "solidarische_ökonomie", "sozial_gerecht*", "sozial_ungerecht*", "soziale_aufstieg*", 
 "soziale_chancen*", "soziale_chancengerechtigkeit", "soziale_chancengleichheit", 
 "soziale_emanzipation", "soziale_gerechtigkeit*", "soziale_kohäsion*",
 "soziale_schieflage*", "soziale_solidarität", "soziale_ungerechtigkeit*", "soziales_wohl*", 
 "sozialgerecht*", "verteilungsgerecht*", "wirtschaftlich_gerecht*", 
 "wirtschaftliche_chancengleichheit", "wirtschaftliche_emanzipation", "wirtschaftliche_solidarität", 
 "ökonomisch_gerecht*", "ökonomische_emanzipation", "abstiegsangst", "abstiegsängste", 
 "allgemeine_wirtschaftliche_entwicklung*", "arbeitslos*", "arm", "arme", "armut*", 
 "einkommenslücke*", "einkommensreiche", "einkommensreichtum", "existenzsich*", 
 "geringverdiener*", "gesellschaftliche_absich*", "gesellschaftliche_ausgrenz*", 
 "gesellschaftliche_folge*", "gesellschaftliche_gräben", "gesellschaftliche_kluft", 
 "gesellschaftliche_lage*", "gesellschaftliche_perspektiv*", "gesellschaftliche_sich*", 
 "gesellschaftliche_spalt*", "gesellschaftliche_teilhabe*", "gesellschaftliche_ungleich*", 
 "gesellschaftliche_verunsich*", "gesellschaftliche_zwänge", "gesellschaftlicher_ungleich*", 
 "gesellschaftlicher_zwang", "großverdiener*", "hohe_vermögen*", "hoher_wohlstand", 
 "hohes_wohlstandsniveau", "marktfundamentalismus",  "marktradikalismus", "mehr_einkommen*", 
 "mehrverdiener*", "mittellos*", "niedriglohn*", "niedriglöhne*", "obdachlos*",  "plutokrat*", 
 "politisch_ungleich", "politische_ungleich*", "prekariat", "prekär*", "putzfrau", "reichste_prozent", 
 "reichtumsgrenze", "schere_zwischen*", "sinkende_haushaltseinkommen", "sozial_abgehängt*", 
 "sozial_bedürf*", "sozial_bedürftig*", "sozial_bessergestellt*", "sozial_dring*", "sozial_exklusiv*", 
 "sozial_explosiv*", "sozial_gleich*", "sozial_isoliert*", "sozial_schlecht*", "sozial_schwach*", 
 "sozial_selektiv*", "sozial_ungleich*", "sozial_unterprivilegiert*", "sozial-wirtschaftliche_lage",
 "soziale_absich*", "soziale_angst", "soziale_ausgrenz*", "soziale_bedürf*", "soziale_differenz*",
 "soziale_disparität*", "soziale_dring*",  "soziale_entwicklung*", "soziale_exklusion", 
 "soziale_fehlentwicklung*", "soziale_folge*", "soziale_frage*", "soziale_gleich*", 
 "soziale_grenzlinie*", "soziale_gräbe*", "soziale_herausforderung*", "soziale_herkunft", 
 "soziale_hinder*", "soziale_hürde*", "soziale_isolation*", "soziale_kluft", "soziale_konkurrenz",
 "soziale_lage*", "soziale_mobilität*", "soziale_perspektiv*", "soziale_rahmen*", 
 "soziale_reproduktion*", "soziale_risiken", "soziale_schicksale", "soziale_selektiv*", 
 "soziale_sich*", "soziale_spalt*", "soziale_ungleich*", "soziale_ungleichgewichte", 
 "soziale_verzerrung*", "sozialer_abstieg*", "sozialer_aufstieg*", "sozialer_bedürf*", 
 "sozialer_diff*", "sozialer_disparität*", "sozialer_entwick*", "sozialer_folg*", "sozialer_rahm*",
 "sozialer_ungleich*", "soziales_ungleich*", "soziales_ungleichgewicht", "sozialordnung", 
 "sozialverträglich*", "sozio-ökonomische_entwicklung*", "sozioökonomische_folge*", 
 "sozioökonomische_frage*", "ungleichheitsdebatte*", "ungleichheitsentwicklung*", "unterschicht*", 
 "vermögensungleich*", "verteilungsdiskussion*", "von_der_gesellschaft_abgehängt*", 
 "weniger_einkommen*", "wirtschaftlich_abgehängt*", "wirtschaftlich_bedürf*", 
 "wirtschaftliche_ausgrenz*", "wirtschaftliche_disparität*", "wirtschaftliche_schwierigkeit*", 
 "zwischen_arm_und_reich", "ökonomisch_ungleich*", "ökonomische_absich*", "ökonomische_sich*",
 "ökonomische_ungleich*")))

I apply this dictionary to a dfm of one of my corpora, which I create as follows:

spiegel_dfm <- spiegel_corpus %>% 
 tokens(remove_punct = TRUE) %>% 
 tokens_compound(pattern = phrase(multiwords)) %>%
 dfm()

I then try to fit the model as follows:

require(seededlda)
tmod_slda <- textmodel_seededlda(spiegel_dfm, dictionary = dict, max_iter = 2000, weight = 0.01)

This is my session info, if the information is needed.

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] seededlda_0.5.1 quanteda_2.1.2 

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5         rstudioapi_0.13    magrittr_2.0.1     usethis_2.0.0      stopwords_2.1     
[6] tidyselect_1.1.0   munsell_0.5.0      colorspace_2.0-0   lattice_0.20-41    R6_2.5.0          
[11] rlang_0.4.9        fastmatch_1.1-0    dplyr_1.0.2        tools_4.0.3        grid_4.0.3        
[16] data.table_1.13.4  gtable_0.3.0       xfun_0.19          tinytex_0.28       ellipsis_0.3.1    
[21] yaml_2.2.1         RcppParallel_5.0.2 tibble_3.0.4       lifecycle_0.2.0    crayon_1.3.4      
[26] Matrix_1.2-18      purrr_0.3.4        ggplot2_3.3.2      fs_1.5.0           vctrs_0.3.6       
[31] glue_1.4.2         stringi_1.5.3      compiler_4.0.3     pillar_1.4.7       generics_0.1.0    
[36] scales_1.1.1       pkgconfig_2.0.3

So I am unsure whether I have applied seeded LDA correctly, whether there is a (semi-)supervised method better suited to isolating a single topic, and how to aggregate the results on a yearly basis.

I would be very grateful for any assistance!

Regarding `R` freezing: have you tried fitting the model with a much smaller dictionary (5 keywords or so)? Yours has a lot of seed words (even more once patterns like `word*` are expanded), more than I've seen in tutorials or `seededlda` papers, or used myself. Maybe the model cannot converge with so many constraints (seed words). Also, always monitor RAM usage; that could be the wall you are hitting. — mpaladino, Jan 05 '21 at 16:32
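A minimal sketch of the suggestion above, run on a toy corpus rather than the real one. The texts, the object names `dfm_small`, `dict_small` and `tmod_small`, and the trimming threshold are all assumptions made for illustration. The sketch also assumes that the installed `seededlda` version accepts `residual = TRUE`, which adds a catch-all "other" topic so that a single seeded topic has something to compete with; newer releases may expect a count of residual topics instead, so check the documentation of the installed version.

```r
library(quanteda)
library(seededlda)

# Toy stand-in for the real corpus (hypothetical texts, for illustration only)
txt <- c(
  d1 = "Armut und Ungleichheit wachsen, der Mindestlohn bleibt niedrig",
  d2 = "Der Mindestlohn soll Armut verringern und Ungleichheit begrenzen",
  d3 = "Die Mannschaft gewinnt das Spiel und das Turnier",
  d4 = "Das Spiel der Mannschaft entscheidet das Turnier"
)

# Tokenise, drop German stopwords, and build the document-feature matrix
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("de"))
dfm_small <- dfm(toks)

# Drop rare features to shrink the matrix; the threshold is a placeholder
# and would need to be much higher for ~190,000 newspaper articles
dfm_small <- dfm_trim(dfm_small, min_termfreq = 2)

# A deliberately short seed list with a single key (assumed seed words)
dict_small <- dictionary(list(
  ungleichheit = c("armut*", "ungleich*", "mindestlohn")
))

# One seeded topic plus one residual ("other") topic
set.seed(1)
tmod_small <- textmodel_seededlda(
  dfm_small,
  dictionary = dict_small,
  residual = TRUE,   # assumption: logical form is accepted by this version
  max_iter = 200
)

terms(tmod_small, 3)   # top terms per topic
tmod_small$theta       # document-by-topic proportions
```

If this runs quickly on a small sample, the seed list and the corpus size can be increased step by step to find out where memory or run time becomes the bottleneck.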
Regarding fully supervised models for your problem: yes, Naive Bayes or penalized regression directly address binary classification and could be a better fit. The issue is that you'll have to hand-code **a lot** of texts to train your model, and that's very time-consuming. Also, if your texts cover more than one topic, you may have to fit the model on sentences rather than full texts, which means even more hand-coding beforehand. — mpaladino, Jan 05 '21 at 16:43
My take from my own practice: beyond tutorial-grade examples, it is very hard to match your theoretical topics one-to-one with model results. Corpora are complicated, and topics aren't always clear-cut (even for human coders). Explore your corpus: fit an unsupervised LDA model with many topics to gain insight into how your corpus interacts with the method and what a reasonable *k* is. Try a different tool: `keyATM` allows an arbitrary number of "other" categories, which could help isolate your topic of interest. — mpaladino, Jan 05 '21 at 16:59
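On the second part of the question (which output to use and how to aggregate it per year), one possible approach is sketched below in base R. It assumes that the fitted model exposes a document-by-topic proportion matrix (in `seededlda` this is the `theta` element of the model object) and that each document has a known year. The numbers, the topic names and the `ntoken` weights are made up for illustration; with real data, `theta` would come from the fitted model and `year` from a document variable (a hypothetical `year` docvar).

```r
# Hypothetical document-by-topic proportions (rows sum to 1)
theta <- matrix(
  c(0.8, 0.2,
    0.6, 0.4,
    0.1, 0.9,
    0.3, 0.7),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("ungleichheit", "other"))
)

# Hypothetical document metadata: publication year and document length
year   <- c(2001, 2001, 2002, 2002)
ntoken <- c(100, 300, 200, 200)

# One row per document: year, share of the topic of interest, length
doc_level <- data.frame(
  year   = year,
  share  = theta[, "ungleichheit"],
  ntoken = ntoken
)

# Unweighted yearly mean: every document counts equally
yearly_mean <- aggregate(share ~ year, data = doc_level, FUN = mean)

# Length-weighted yearly mean: longer documents count more
yearly_weighted <- sapply(
  split(doc_level, doc_level$year),
  function(d) weighted.mean(d$share, d$ntoken)
)

yearly_mean
yearly_weighted
```

The resulting yearly series could then be merged by year with the economic measures (for example the Gini coefficient) and used in a regression. Whether to weight by document length is a design choice: the unweighted mean describes the typical document, while the weighted mean is closer to the topic's share of all text published in that year.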

