
I am a newbie in Hadoop and I have to load data into a table in Hive. I have data from the FIX 4.4 protocol, something like this...

8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>
8=FIX.4.4<SHO>9=69<SHO>35=A<SHO>34=1<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>98=0<SHO>108=30<SHO>10=093<SHO>
8=FIX.4.4<SHO>9=66<SHO>35=2<SHO>34=2<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>7=1<SHO>16=0<SHO>10=174<SHO>
8=FIX.4.4<SHO>9=110<SHO>35=5<SHO>34=525<SHO>49=SSGMdemo<SHO>52=20150410-15:25:58.164<SHO>56=Trumid<SHO>58=MsgSeqNum too low, expecting 361 but received 1<SHO>10=195<SHO>

Firstly, what I want is: in 8=FIX.4.4, 8 should become the column name and FIX.4.4 the value of that column; in 9=66, 9 should be the column name and 66 the value, and so on. There are many rows like this in the raw file.

Secondly, the same should happen for the next raw row, and that data should be appended as the next row of the table in Hive.

Right now I cannot figure out what I should do.

Any help would be appreciated.

  • How are you usually bringing in the data? Which tools/programming language do you use? Using a rather simple regex you could probably filter out column and value pairs, and from those pairs it shouldn't be too hard to create a CSV which you can easily import into Hive. – spijs Feb 08 '17 at 13:51
  • How do I make a regex for this file and convert it into CSV? @spijs – Unnat Pandya Feb 09 '17 at 13:16
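
For what it's worth, the pair extraction suggested in the comment can be done with a single regular expression in Python; a minimal sketch, using the first sample message from the question:

import re

# One raw FIX message, delimited by the literal <SHO> marker (sample from the question)
fix_line = "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>"

# Capture "tag=value" pairs: digits, '=', then anything up to the next <SHO>
pairs = re.findall(r'(\d+)=(.*?)<SHO>', fix_line)

print(pairs)  # [('8', 'FIX.4.4'), ('9', '85'), ('35', 'A'), ...]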

2 Answers


I would first create a tab-separated file containing this data. I suggested using a regex in the comments, but if that is not your strong suit you can simply split on the <SHO> tag and on =. Since you did not specify the language you want to use, I will suggest a 'solution' in Python. The code below shows how to write one of your input lines to a TSV file. It can easily be extended to handle multiple lines or to append lines to the file once it has been created.

import csv

line = "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>"

fields = line.split('<SHO>')[:-1]  # Don't include the last element since it's empty
list_of_pairs = [tuple(f.split('=', 1)) for f in fields]  # [('8', 'FIX.4.4'), ('9', '85'), ...]
d = dict(list_of_pairs)

with open('test.tsv', 'w', newline='') as c:
    cw = csv.writer(c, delimiter='\t')
    cw.writerow(d.keys())    # Comment this out if you don't want a header
    cw.writerow(d.values())

This code first splits the input line on <SHO>, which gives a list of col=val strings. It then builds a list of (col, val) tuples and turns that into a dictionary, which is not strictly necessary but might help if you want to extend the code to more lines. Finally, it writes a tab-separated file test.tsv containing a header line and the values on the following line.
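
A sketch of that extension, assuming the raw messages sit in a local file called raw_fix.txt (a placeholder name) and using a fixed column order taken from the FIX tags in the sample messages:

import csv

# Fixed column order taken from the FIX tags in the sample messages
columns = ['8', '9', '35', '34', '49', '52', '56', '58', '7', '16', '98', '108', '554', '10']

with open('raw_fix.txt') as raw, open('test.tsv', 'w', newline='') as out:
    cw = csv.writer(out, delimiter='\t')
    cw.writerow(columns)                       # header row with the tag numbers
    for line in raw:
        line = line.strip()
        if not line:
            continue                           # skip blank lines
        fields = line.split('<SHO>')[:-1]      # drop the trailing empty element
        d = dict(f.split('=', 1) for f in fields)
        # Tags missing from a particular message become empty cells
        cw.writerow([d.get(col, '') for col in columns])

Keeping the column order fixed is what lets every row line up with the same Hive columns, even when a message omits some tags.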

This means now you have a file which Hive can understand. I am sure you can find a lot of articles on importing CSV or tab-separated-value files, but I will give you an example of a generic Hive query you can use to import this file once it is in HDFS.

 CREATE TABLE IF NOT EXISTS [database].[table] (
   [col1] INT,
   [col2] INT,
   [col3] STRING,
   ...
 )
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 LINES TERMINATED BY '\n'
 TBLPROPERTIES('skip.header.line.count'='1');

 LOAD DATA INPATH '[HDFS path]'
 OVERWRITE INTO TABLE [database].[table];
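
To get the generated file into HDFS in the first place, you can call the hdfs command line from the same script; a minimal sketch, assuming the hdfs client is on your PATH and using /user/hive/fixdata as a purely illustrative target directory:

import subprocess

# Create the target directory (illustrative path) and upload the TSV, overwriting any old copy
subprocess.run(['hdfs', 'dfs', '-mkdir', '-p', '/user/hive/fixdata'], check=True)
subprocess.run(['hdfs', 'dfs', '-put', '-f', 'test.tsv', '/user/hive/fixdata/'], check=True)

'[HDFS path]' in the LOAD DATA statement above would then be '/user/hive/fixdata/test.tsv'.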

Hope this gives you a better idea on how to proceed.

spijs
  • Well, after asking the team leads, I must use Java only, but I managed to convert the above logic into Java. BTW thank you. – Unnat Pandya Feb 10 '17 at 06:36
  • If this solves your question, consider accepting the answer so others can also see it is useful :). Glad to help – spijs Feb 10 '17 at 07:30

Copy the file to HDFS and create an external table with a single column (c8), then use the select statement below to extract each column.

create external table tablename(
c8 string )
STORED AS TEXTFILE
location 'HDFS path';


select regexp_extract(c8,'8=(.*?)<SHO>',1)  as c8,
regexp_extract(c8,'9=(.*?)<SHO>',1)  as c9,
regexp_extract(c8,'35=(.*?)<SHO>',1)  as c35,
regexp_extract(c8,'34=(.*?)<SHO>',1)  as c34,
regexp_extract(c8,'49=(.*?)<SHO>',1)  as c49,
regexp_extract(c8,'52=(.*?)<SHO>',1)  as c52,
regexp_extract(c8,'56=(.*?)<SHO>',1)  as c56,
regexp_extract(c8,'98=(.*?)<SHO>',1)  as c98,
regexp_extract(c8,'108=(.*?)<SHO>',1)  as c108,
regexp_extract(c8,'554=(.*?)<SHO>',1)  as c554,
regexp_extract(c8,'10=(.*?)<SHO>',1)  as c10
from tablename;
user7343922