2

I want to create an external table with set of text files. Each row should be one text files. Example of one text file is as below and there can be multiple text files.(files are stored in HDFS)

thanking 
you 
for 
the 
participation 

Lines are terminated by /n. I want to create an external table with the above text files and data in a text file should be in one row(one cell).

I tried the following Create table statement.

Create External table if not exists sample_email(
  email STRING
)
STORED AS TEXTFILE
LOCATION '/tmp/txt/sample/';

It will give create table as follows.

+--------------------------------------+
+   email                              +
+--------------------------------------+
+ thanking                             +
+--------------------------------------+
+ you                                  +
+--------------------------------------+
+ for                                  +
+--------------------------------------+
+ the                                  +
+--------------------------------------+
+participation                         +
+--------------------------------------+
+please                                +
+--------------------------------------+
+find                                  +
+--------------------------------------+
+the                                   +
+--------------------------------------+
+discussed                             +
+--------------------------------------+
+points                                +
+--------------------------------------+

But I want as follows.

+--------------------------------------+
+   email                              +
+--------------------------------------+
+ thanking you for the participation   +
+--------------------------------------+
+ please find the discussed points     +
+--------------------------------------+

How to overcome my issue? Thank you in advance

Manura Omal
  • 1,005
  • 3
  • 10
  • 26

3 Answers3

1
select      concat_ws(' ',collect_list(email))  as emails
from        sample_email
group by    input__file__name

+------------------------------------+
|               emails               |
+------------------------------------+
| thanking you for the participation |
| please find the discussed points   |
+------------------------------------+
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88
  • What I want is to create a table as I mentioned above. not to select from that table – Manura Omal Mar 24 '17 at 05:06
  • i am going to save txt files in HDFS and files will be saved daily. So I want to create an external table to read those files and do text analysis. – Manura Omal Mar 24 '17 at 05:18
  • Actually I uploaded one email(not a email thread) as a text file to HDFs. After executing this query, it gave 3 dupliate rows – Manura Omal Mar 24 '17 at 05:28
  • If it returned 3 duplicated rows then you have 3 duplicated files. Add `input__file__name` to the SELECT clause to verify. – David דודו Markovitz Mar 24 '17 at 05:31
  • Yes. you are correct. I have same file in different folders. Thank you very much. Can you guide me how to create external table from this query – Manura Omal Mar 24 '17 at 05:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138891/discussion-between-manura-omal-and-dudu-markovitz). – Manura Omal Mar 24 '17 at 06:12
0

Use tr to remove \n from files.

hadoop fs -cat file.txt |  tr -d '\n' | hadoop fs -put - new_file.txt
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • This is a bad solution. **(1)** you are streaming all the data from the distributed HDFS cluster to a local file system (and back) **(2)** you are merging all the files to a single file while loosing the information about the source files – David דודו Markovitz Mar 23 '17 at 15:36
  • Yes, I agree about (1),it's not optimal but still working and can be used for small files. It's an example for single file, not all of them. – leftjoin Mar 23 '17 at 15:40
0
set textinputformat.record.delimiter='\0';

select  translate(email,'\n',' ') as emails 
from    sample_email

+-------------------------------------+
|               emails                |
+-------------------------------------+
| thanking you for the participation  |
| please find the discussed points    |
+-------------------------------------+

Unfortunately, I still don't know how to set textinputformat.record.delimiter back to newline within the same session.

How to reset textinputformat.record.delimiter to its default value within hive cli / beeline?

Community
  • 1
  • 1
David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88