
I am trying to read a TSV file in PySpark using the spark-csv package. My Spark version is 1.6.3. In my dataset, two columns have the same name. I am using the following code to read the data:

temp = sqlContext.read.load(data_file,
                            format='com.databricks.spark.csv',
                            header='true',
                            delimiter='\t',
                            mode='FAILFAST',
                            codec="org.apache.hadoop.io.compress.GzipCodec").cache()

When I read the file using the code above, I get the following exception:

pyspark.sql.utils.IllegalArgumentException: u"The header contains a duplicate entry: 'member_id' in [member_status, md5_hash_email, member_id, first_name, last_name, email_daily_double, email_personal_coupon_reminder, email_personal_shopping_offers, email_site_wide_sales, email_hot_deals_daily_newsletter, is_referral, traffic_source, traffic_source_type, traffic_source_subtype, signup_date_id, email_domain_group, first_order_date, first_shopping_date, is_mobile, is_tablet, is_pc, first_order_id, member_engaged, last_visit_date, last_order_date, last_shopping_date, total_order_amount, total_commission_amount, total_rebate_amount, total_cash_payments, number_of_cash_payments, life_cycle_stage, total_orders, member_id]"

So, I would like to know if there is some way to drop or rename the duplicate column up front. I know that I can specify the schema beforehand, but I want this to be dynamic so that I can handle any schema at run time. Thanks.

manojlds
mc29
1 Answer


This has been fixed in recent versions of Spark: https://issues.apache.org/jira/browse/SPARK-16896

If you cannot upgrade, you will have to construct the header yourself.
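One way to construct the header dynamically is to rename duplicates before applying the column names. The helper below is a hypothetical sketch (not part of spark-csv); it appends a numeric suffix to repeated names so every column becomes unique:

```python
def dedupe_columns(names):
    """Rename duplicates by appending _1, _2, ... to repeated names."""
    counts = {}
    out = []
    for name in names:
        if name in counts:
            counts[name] += 1
            out.append("%s_%d" % (name, counts[name]))
        else:
            counts[name] = 0
            out.append(name)
    return out

# The duplicated 'member_id' from the error message:
print(dedupe_columns(["member_status", "member_id", "first_name", "member_id"]))
# ['member_status', 'member_id', 'first_name', 'member_id_1']
```

With the names cleaned up, one (untested) approach is to read the file with `header='false'`, split the first line on `'\t'` yourself, and apply the deduplicated names via `df.toDF(*dedupe_columns(raw_header))`, filtering out the header row afterwards.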

manojlds