How can I match the header of a .CSV file with dynamic field names

Question

I hope this cut-down version of my previous question sounds much better than earlier. I have a .CSV file (frequency - weekly) whose header contains the year-week value in two of the columns which keeps changing every week. For an instance please see below.

store nbr,city,state,region,item nbr,category,201420 sales,201420 sold qty
1001,Kansas City,MO,2,30945232,7,$11,1
1003,Kansas City,KS,3,30945742,6,$40,4
2004,New York,NY,6,30995232,11,$72,12
6003,Sacramento,CA,4,3025742,3,$24,6
9003,Cleveland,OH,9,3057432,12,$44,11
6009,Los Angeles,CA,4,3098294,42,$35,7

I have to check if the data is placed under the right column and then proceed loading that data into the datamart. For this reason, I tried to create a standard header to compare it with the header of this file, which would be something like below:

store nbr,city,state,region,item nbr,category,XXXXXX sales,XXXXXX sold qty

If my file (that I am about to load) is of the format above, my process should proceed loading the data. If the columns are not in their places, lets say as below

store nbr,region,city,state,category,item nbr,201420 sold qty,201420 sales
1001,2,Kansas City,MO,7,30945232,1,$11
1003,3,Kansas City,KS,6,30945742,4,$40
2004,6,New York,NY,11,30995232,12,$72
6003,4,Sacramento,CA,3,3025742,6,$24
9003,9,Cleveland,OH,12,3057432,11,$44
6009,4,Los Angeles,CA,42,3098294,7,$35

then the process should fail. My main concern is - due to the dynamic nature of the fields 7 and 8, I am not able to create/match the data with the standard header every week. There is no date in the field name and I cannot use sysdate (as I might have to use old files with very old dates in the 7th and 8th fields). Could you please suggest me how can I achieve this either using cat/awk/sed or by using any wildcards for the Year-week value in the header?

It is not entirely clear what you are asking. From what I have read can you not just replace the new header with the old as the number of columns and the placement of data will be the same? — rob, Jul 29 '14 at 15:14
I will never know what data I am loading if I replace the header. I have seen some files whose column sequence were not in the same order every time. In such cases, my process is not supposed to load any data from the file. For this reason, I have to depend on the file header and the column sequence. — Dhruuv, Jul 29 '14 at 15:26
So would the solution be to identify each column from the header on each import and for some of those columns names to be based on the date? And are the date column names computable based on "the date", "file name"? — rob, Jul 29 '14 at 15:29
There will be no date in the file name and I cannot use Sysdate... Is there a way I can match those characters with some wildcards? I am not sure... — Dhruuv, Jul 29 '14 at 15:33
I suggest you show us a cut-down version of your data, with 6 columns (say) instead of 16. Also show some sample data, for maybe from two differnet weeks, with three records each week, and show the desired output for each of the inputs. Then maybe we can help you. At the moment, there is too little precise information and too much extraneous information in the question. — Jonathan Leffler, Jul 30 '14 at 02:23
I edited the question for better understanding. Sorry for not being clear in my previous version. Let me know if you have any suggestion. — Dhruuv, Jul 31 '14 at 14:52

skamazin · Accepted Answer · 2014-08-01T15:44:34.543

1

If by "dynamic nature of the field 14 and 15" you mean that the week value keeps incrementing by 1 then you should use some regex to find it.

I'm not sure if this will work because I don't know the actual values of Column1-13 but if you use \d{6} it will match the string of 6 digits (ie YYYYWW). As long as the other Columns don't share this form, you can always grab the year-week value with that regex and then do what you will with it.

Or if you don't want to use regex you can use something like cut -d, -f14 file.csv | cut -d" " -f1 to print out year-week value of column 14, and then change -f14 to -f15 to give you the year-week value for column 15.

After entering a chat room I finally understood the exact problem that OP had

By using egrep and replacing the year-week value with [0-9]+ and using the exit codes ($?) we were able to check if a header was properly formed or malformed.

edited Aug 01 '14 at 15:44

answered Jul 29 '14 at 15:40

skamazin

757
5
12

I am sorry but the above solutions did not work for me for the previous version of my question :( I edited my question to make it more clear, can you check and let me know how to do it? – Dhruuv Aug 01 '14 at 13:17
@Dhruuv I can have `sed` follow this regex so that it will only print out lines that match the format. Meaning that it will print the first half out and then you can process that, but it won't print the second half thereby failing to process. – skamazin Aug 01 '14 at 14:03
Thank you so much for providing the regex solution and introducing me to regex101. This is very helpful and deep. If the file header matches the standard header, the data under those columns will be correct. I do not have to cross check the data. But your solution provides extra check for that too which is great. How can I use this regex as part of my shell? Or can you help me with a solution for `sed` where it checks only for the header and ignoring the digits in 7th and 8th fields? Thanks in advance – Dhruuv Aug 01 '14 at 14:41
So you want to only check for fields 1-6 and not check 7 or 8 at all? So something [like this](http://regex101.com/r/nA8mR3/2)? – skamazin Aug 01 '14 at 14:49
I need to check all the fields, but (if possible) by ignoring the year-week value in fields 7th and 8th. I also need to check 7 and 8 because the fields might change their order. For example, I can check for the header of the file with that of the standard header that I created above by either ignoring the year-week or replacing the year-week with **`XXXXXX`** and check the rest as to make sure the fields are in the right place. Sorry for all the hassle but I really hope you understand my question. – Dhruuv Aug 01 '14 at 14:56
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/58500/discussion-between-skamazin-and-dhruuv). – skamazin Aug 01 '14 at 14:57

How can I match the header of a .CSV file with dynamic field names

1 Answers1