I would like to import a very large text file as a dataframe into R. The file is produced by the "National Institute for Health and Disability Insurance" in Belgium. It contains individual data on health professionals in Belgium.
It's a fixed width text file, and I usualy import that kind of data with "read_fwf". The problem is that the structure of the file is a bit tricky :
- The individual data are divided into eight "types" ("type 11", "type 12", "type 13", etc.) that are on different rows. So each individual has several rows (= different type of information).
- Each row begins with a "prefix part" of the same information about these individuals, which allows the data to be joined by individual. This "prefix part" is therefore repeated from row to row for the same individual.
- Each part (each "type"/row and the "prefix part") has its own fixed width structure. For exemple, the prefix part has 54 characters, in which the type is indicated in the 53rd and 54th characters ; type 11 has 249 characters divided into 18 variables ; type 12 has 78 characters divided into 5 variables ; type 13 has 110 characters divided into 12 variables ; etc.
- BUT, the number of "types" (and therefore the number of rows) per individual is variable. Some individuals have 3 types/row, others 4, 5, 6, 7 or 8. This means that there is a variable number of types/rows per individual.
Here's a fictive exemple of the structure, very simplified. The "prefixed part" is the 0000xxx part. The 5 first characters indicates the individual. The 2 last characters of the "prefixed part" specifies the "type" of the row. Each row contains data specific to that "type", divided into variables (the width structure of the variables is indicated in comment) :
0000111abaabacd
0000112abbbba
0000113ccaxyzzghj5
0000211acatbacz
0000212zbabba
0000311tyyyuacd
0000312xbbiop
0000411pkggbacz
#type 11 has 3 variables. Width for each : 2,3,3
#type 12 has 2 variables. Width for each : 3,3
#type 13 has 4 variables. Width for each : 3,3,1,4
So, what i'd like to do is to import into a dataframe in R each individual into one unique row, with all the variables of each type into separated columns. If the variables are missing (because there is no type/row), i'd like indicate empty data (NA) for each variable.
id var1 var2 var3 var4 var5 var6 var7 var8 var9
00001 ab aab acd abb bba cca xyz z ghj5
00002 ac atb acz zba bba NA NA NA NA
00003 ty yyu acd xbb iop NA NA NA NA
00004 pk ggb acz NA NA NA NA NA NA
I'm a bit lost, I don't know where to start, because the structure is really not usual for me. Do you have ideas or advices ?
Thank you very much !