Converting journal titles to their abbreviated form

Question

Good morning my hero!

I have a list of journal titles in English, Spanish and Portuguese that I want to convert to their abbreviated form. The official abbreviation dictionary for journal titles is the List of Title Word Abbreviations found on the ISSN website.

# example of my data
journal_names <- c("peste revista psicanalise sociedade", "abanico veterinario", "abcd arquivos brasileiros cirurgia digestiva sao paulo", "academo asuncion", "accion psicologica", "acimed", "acta academica", "acta amazonica", "acta bioethica", "acta bioquimica clinica latinoamericana")

I have split each title into its individual words. So currently I have a list in which each element is a character vector holding the words of one title.

[[1]]
[1] "peste"       "revista"     "psicanalise" "sociedade"  

[[2]]
[1] "abanico"     "veterinario"
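For reference, a split like the one shown above can be produced with base R's `strsplit`. This is only a minimal sketch, assuming the titles are already lower-cased, free of stop words, and separated by single spaces; `title_words` is an illustrative name.

```r
# Two of the titles from the example data
journal_names <- c("peste revista psicanalise sociedade", "abanico veterinario")

# Split each title on single spaces; the result is a list with one
# character vector of words per title
title_words <- strsplit(journal_names, " ", fixed = TRUE)

print(title_words)
```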

Once I remove the stop words (as seen above), I need to match any relevant words to the suffixes or prefixes in the LTWA and then convert them to the abbreviation. I have converted the LTWA words so that they have regular expressions and can be used to search for a match easily with a package like stringi.

This is an excerpt from the data frame I created from the LTWA. The ABBREVIATIONS_NA column replaces n.a. with the original word, and the REXP column holds the prefix/suffix as a regular expression.

WORDS,ABBREVIATIONS,LANGUAGES,REXP,ABBREVIATIONS_NA
proofreader,proofread.,eng,proofreader,proofread.
prophylact-,prophyl.,eng,^prophylact.*,prophyl.
propietario,prop.,spa,propietario,prop.
propriedade,propr.,por,propriedade,propr.
prostético,prostét.,spa,prostético,prostét.
protecção,prot.,por,protecção,prot.
proteccion-,prot.,spa,^proteccion.*,prot.
prototyping,prototyp.,eng,prototyping,prototyp.
provisional,n.a.,eng,provisional,provisional
provisóri-,n.a.,por,^provisóri.*,provisóri-
proyección,proyecc.,spa,proyección,proyecc.
psicanalise,psicanal.,por,psicanalise,psicanal.
psicoeduca-,psicoeduc.,spa,^psicoeduca.*,psicoeduc.
psicosomat-,psicosom.,spa,^psicosomat.*,psicosom.
psicotecni-,psicotec.,spa,^psicotecni.*,psicotec.
psicoterap-,psicoter.,spa,^psicoterap.*,psicoter.
psychedelic,n.a.,eng,psychedelic,psychedelic
psychoanal-,psychoanal.,eng,^psychoanal.*,psychoanal.
psychodrama,n.a.,eng,psychodrama,psychodrama
psychopatha,n.a.,por,psychopatha,psychopatha
pteridolog-,pteridol.,eng,^pteridolog.*,pteridol.
publicitar-,public.,spa,^publicitar.*,public.
puericultor,pueric.,spa,puericultor,pueric.
Puerto Rico,P. R.,spa,Puerto Rico,P. R.
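The REXP column in the excerpt appears to follow a simple rule: an entry ending in a hyphen becomes an anchored prefix pattern, and any other entry is kept as is. A minimal sketch of that rule, assuming only trailing hyphens need handling (suffix entries with a leading hyphen are not covered here):

```r
# A few WORDS entries taken from the excerpt above
words <- c("prophylact-", "propietario", "psicoterap-")

# Entries ending in "-" are prefixes: drop the hyphen, anchor at the start
# of the word, and allow any trailing characters. Other entries are kept.
rexp <- ifelse(
  endsWith(words, "-"),
  paste0("^", sub("-$", "", words), ".*"),
  words
)

print(rexp)
```

Note that whole-word entries such as `propietario` are left unanchored here, as in the excerpt, so they would also match longer words that contain them.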

The search and conversion need to go from the longest prefix/suffix to the shortest, and a word that has already been processed must not be processed again.

The issue: I would like to convert each title word to its proper abbreviation. However, a word like 'latinoamericano' should only match the prefix 'latinoameri-' and be converted to 'latinoam.'. The problem is that it also matches 'latin-' and then gets converted to 'latin.'. How can I make sure that each word is processed only once?
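To make the requirement concrete, one possible approach is to sort the LTWA rows from longest to shortest entry and, for each word, stop at the first pattern that matches. The following is a minimal sketch under assumptions, not a tested solution for the full LTWA: the toy `ltwa` table holds only the two 'latin' entries mentioned above plus two rows from the excerpt, and `abbreviate_word` and `abbreviate_title` are hypothetical helper names, not functions from any package.

```r
# Toy LTWA excerpt using the column names from the question
ltwa <- data.frame(
  WORDS = c("latin-", "latinoameri-", "psicanalise", "provisional"),
  REXP = c("^latin.*", "^latinoameri.*", "psicanalise", "provisional"),
  ABBREVIATIONS_NA = c("latin.", "latinoam.", "psicanal.", "provisional"),
  stringsAsFactors = FALSE
)

# Longest entries first, so "latinoameri-" is tried before "latin-"
ltwa <- ltwa[order(-nchar(ltwa$WORDS)), ]

# Return the abbreviation of the first (longest) matching entry;
# a word without any match is returned unchanged
abbreviate_word <- function(word, dict) {
  for (i in seq_len(nrow(dict))) {
    if (grepl(dict$REXP[i], word)) {
      return(dict$ABBREVIATIONS_NA[i])
    }
  }
  word
}

# Apply the lookup to every word of one title
abbreviate_title <- function(words, dict) {
  vapply(words, abbreviate_word, character(1), dict = dict, USE.NAMES = FALSE)
}

titles <- list(
  c("peste", "revista", "psicanalise", "sociedade"),
  c("acta", "bioquimica", "clinica", "latinoamericana")
)
result <- lapply(titles, abbreviate_title, dict = ltwa)
print(result)
```

Because each word is looked up on its own and the loop returns at the first hit, no word is abbreviated twice, and words with no LTWA match pass through unchanged. Looping over every pattern for every word may be slow with the full table of about 12,000 entries, so this is meant to show the ordering idea rather than an efficient implementation.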

Also note that my LTWA database only has about 12,000 words in total, so there will be words that don't have a match at all.

I have gotten up to this point, but I am not sure where to go from here. So far, I have only come up with very clunky solutions that do not work perfectly.

Thank you!

What does the input data look like? Could you make this a reproducible example with example data and requested output? What does prefix/suffix mean in this context? Is there more to it than just looking up an abbreviation per word? — Bernhard, Aug 03 '21 at 16:33
Hi! Depending on what files you are working with (please clarify this in the question), package [journalabbr](https://github.com/zoushucai/journalabbr) may be worth checking. — Henrik, Aug 03 '21 at 16:33
Hi @Henrik! I will check out this package. I am not working with bib files though as these journal names come from an XML database that I parsed. — Amalia, Aug 03 '21 at 17:04
If you have the _official_ full journal names, perhaps it's easy to create dummy bib data and parse with `journalabbr`. Or simply (?) grab their lookup table and join with your data - it seems to be available, see the "Access internal data" on their github. — Henrik, Aug 03 '21 at 17:14
@Henrik I will look into this now and report back - thank you! — Amalia, Aug 03 '21 at 17:16
