Convert a data dictionary from Word to Excel with R

Question

I got a data dictionary from the data provider. It covers hundreds of variables spread across different Word files and looks like this:

To add this dictionary to my current dataset, I need to convert it to a particular format in Excel. For example, for the first variable, "intarm_actual", I would like to create these columns in a spreadsheet: "variable" holds the words at the top left; "label" holds the content of "label" (NA for this variable, but "tpe_lab" for the second one); "type" holds "string(str2)"; "value" holds "4"; "missing" holds "46/102"; and "tabulation" holds 46 "", 14 "RO", 14 "RV", 14 "TO", 14 "TV". Ideally, it should look like this:

Could anyone who has done this before offer some suggestions? I would appreciate any pointers: which package to use, related posts or articles to read, or similar code I can learn from. Can the R package "labelled" handle this type of task? Thanks a lot!

Update 1:

I used the package qdapTools to import one of the docx files; it looks like this:

How can I retrieve the words I need and assign them to the right place in my spreadsheet? Thanks!

Update 2:
The issue has been solved another way.

In case someone encounters a similar situation: 1) this type of codebook file is generated by Stata; 2) instead of reading this complex text file, an alternative is to use the R package "codebook" to generate a new .csv codebook, which contains this information and more.

Is this a real format, or just something the provider threw together? If it's the former, we should be looking for a tool to read and parse the format; if it's the latter, it's a question about parsing a complex text file. In the former case, any info on the format (names, what program generated it, extensions, etc.) would be useful, while in the latter you should include a minimal reproducible example of the data file with, perhaps, the first few records, ideally copied into the body of the question — divibisan, Feb 07 '22 at 21:37
Thanks for your response @divibisan. I think this is a real format generated by some software. I am not sure which one, but when I used SAS I remember producing a similar "card-like" dictionary in a Word file. Since every variable has the same format, I doubt it is the latter situation. I will contact the data provider to find out which software they used to generate this file and then come back here to discuss further. Thanks! — Rstudyer, Feb 07 '22 at 21:51

score 1 · Answer 1 · answered Feb 07 '22 at 20:07

Assuming you are starting from scratch, I would recommend getting started with regular expressions in R. I often use the R package stringr to work with regular expressions, and the stringr cheat sheet is a good starting point. Regular expressions will allow you to, e.g., select the word following a ":".
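As a rough sketch of that idea, assuming the Word text has already been read into a character vector with one line per element: the sample lines and the helper `value_after` below are hypothetical, and the real codebook layout may well differ.

```r
library(stringr)

# Hypothetical lines imitating one variable block from the dictionary.
block <- c(
  "intarm_actual",
  "type: string(str2)",
  "unique values: 4",
  "missing: 46/102"
)

# Return the text that follows "<key>:" on the first line starting with that key.
value_after <- function(lines, key) {
  pattern <- paste0("^\\s*", key, ":\\s*(.*)$")
  hits <- str_match(lines, pattern)[, 2]   # column 2 holds the captured group
  str_trim(hits[!is.na(hits)][1])
}

# One spreadsheet row per variable.
entry <- data.frame(
  variable = str_trim(block[1]),
  type     = value_after(block, "type"),
  value    = value_after(block, "unique values"),
  missing  = value_after(block, "missing"),
  stringsAsFactors = FALSE
)
print(entry)
```

Rows built this way for each variable could be combined with `rbind()` and saved with `write.csv()`, which Excel can open.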

I have never worked with Word documents in R, but I guess there are packages out there that let you read them into R. Just Google them. :) I am sure they also come with good instructions on how to use them.
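One such package is officer; this is a suggestion and not something tested against the asker's files. The minimal sketch below first builds a small throwaway .docx so that it runs on its own, and the paragraph texts are invented stand-ins for a real codebook page.

```r
library(officer)

# Build a small throwaway .docx so the example is self-contained.
path <- tempfile(fileext = ".docx")
doc <- read_docx()                         # empty document from officer's default template
doc <- body_add_par(doc, "intarm_actual")
doc <- body_add_par(doc, "type: string(str2)")
print(doc, target = path)                  # writes the file to disk

# Read the file back; docx_summary() returns a data frame describing the content.
content <- docx_summary(read_docx(path))

# Keep the non-empty text entries as a plain character vector.
lines <- content$text[!is.na(content$text) & nzchar(content$text)]
print(lines)
```

With a real file, only the last three statements are needed, with `path` pointing at the provider's document.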

Another issue you might encounter is encoding. If the text is not read in correctly, e.g. you see strange character combinations, encoding is most likely the source of the problem.
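A small illustration of what such a problem looks like and one way to repair it in base R. The Latin-1 source encoding is an assumption made for the example; the actual encoding of the provider's files is unknown.

```r
# Bytes for "café" in Latin-1; the lone byte 0xe9 is not valid UTF-8.
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
print(validUTF8(latin1_text))

# Re-encode to UTF-8, stating the (assumed) source encoding explicitly.
fixed <- iconv(latin1_text, from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
print(fixed)
```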

Once you have looked at these things and started working on your own code, you will be able to ask more precise questions.
