
In Spark 2.0.1 with Hadoop 2.6.0, I have many files delimited with '!@!\r' instead of the usual newline \n, for example:

=========================================

2001810086  rongq   2001    810!@!
2001810087  hauaa   2001    810!@!
2001820081  hello   2001    820!@!
2001820082  jaccy   2001    820!@!
2002810081  cindy   2002    810!@!

=========================================

I tried to extract the data by setting textinputformat.record.delimiter in Spark, with set textinputformat.record.delimiter='!@!\r'; or set textinputformat.record.delimiter='!@!\n'; but I still cannot extract the data.

In spark-sql, I do this:

create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';

load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;

The result is 5, but when I run set textinputformat.record.delimiter='!@!\r'; and then select count(*) from ceshi; the result is 1. The delimiter does not take effect.

I also checked the source of Hadoop 2.6.0, in the RecordReader created by TextInputFormat.java. I noticed that the default textinputformat.record.delimiter is null, in which case LineReader.java uses the readDefaultLine method to read a line terminated by one of CR, LF, or CRLF (CR = '\r', LF = '\n').
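
To illustrate the mechanism I am trying to use, here is a minimal spark-shell sketch; the path /data.txt comes from the load above, and passing a per-job Configuration to newAPIHadoopFile is just my assumption of where the delimiter should be picked up:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// TextInputFormat consults textinputformat.record.delimiter in the job
// Configuration when it builds its record reader; if it is unset, the
// readDefaultLine path (CR/LF/CRLF) described above is used instead.
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "!@!\r")

// newAPIHadoopFile accepts that Configuration explicitly.
val records = sc
  .newAPIHadoopFile("/data.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }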

JayForest

1 Answer


You should use the sparkContext's hadoopConfiguration API to set textinputformat.record.delimiter:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!@!\r")

Then read the text file using the sparkContext:

sc.textFile("the input file path")

You should be fine.
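
As a quick sanity check on the sample file above (the path is a placeholder; if count still returns 1, see the update below):

val sb = sc.textFile("the input file path")
sb.first()  // should be a single record, e.g. the 2001810086 row
sb.count()  // should be 5, one per "!@!"-terminated record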

Update

I have noticed that when a text file with a \r delimiter is saved, the delimiter is changed to \n.

So the following should work for you, as it did for me:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!@!\n")

import spark.implicits._  // needed for .toDF (the spark-shell imports this for you)

val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF

A case class called ceshi is needed:

case class ceshi(id: Int, name: String, year: String, major :String)

which should give a dataframe as

+----------+-----+-----+-----+
|id        |name |year |major|
+----------+-----+-----+-----+
|2001810086|rongq| 2001|810  |
|2001810087|hauaa| 2001|810  |
|2001820081|hello| 2001|820  |
|2001820082|jaccy| 2001|820  |
|2002810081|cindy| 2002|810  |
+----------+-----+-----+-----+

Now you can call the count function:

import org.apache.spark.sql.functions._
df.select(count("*")).show(false)

which would give the output

+--------+
|count(1)|
+--------+
|5       |
+--------+
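
If you prefer the original spark-sql query, a minimal sketch (assuming the df built above and the spark session of the spark-shell) is to register the DataFrame as a temporary view:

// Register under the table name used in the question
df.createOrReplaceTempView("ceshi")
spark.sql("select count(*) from ceshi").show(false)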
Ramesh Maharjan
hi, according to your help, I tried to set the parameter in spark-shell, using the people.txt from the examples, but the result is not good:

scala> sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!@!\r")
scala> val sb = sc.textFile("file:///data/people.txt")
sb: org.apache.spark.rdd.RDD[String] = file:///home/mr/ych/data/people.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> sb.first()
res1: String = "Michael, 29!@! Andy, 30!@! Justin, 19!@! "
scala> sb.count()
res2: Long = 1

– JayForest Jul 23 '17 at 07:38