3

I am trying to load a CSV (tab delimited) using spark csv - by scala.

what I observed is , if a column contains new line character LF (\n) spark considering it as end of the line even though we have double quotes on both sides of the column in the csv file.

any body encountered similar problem ?

this is the scala code I have used

but spark recognizing it as a end of line

after @ashraful suggestion i tried below query :

but the result remains the same here is the record i am trying to load

`but after loading to spark data frame i am trying check the record but rest of the columns are loaded as null after the LS character.

enter image description here

same row is being split into two enter image description here

this is the data : note there is LF character end of the first line.

cuenta_id   fecha_creacion  saldo   descripcion saldo_inicial   fecha_ultima_actualizacion  usuario_id  saldo_anterior  saldo_reserva   tipo_cuenta pais_iso_3166   moneda_iso_4217 valor_comision_retiro   retiro_en_proceso   plan_id grupo_comision_id   documento   tipo_documento  operacion_local estado_cuenta   nombre  telefonos   fax direccion   nombre_contacto email_contacto  url ciudad  ventas_mensuales    comentarios modelo_pagos    modo_prueba tiene_fondo_reserva porcentaje_fondo_reserva    aceptar_tc_internacionales  dias_fondo_reserva  telefono_servicio_cliente   email_soporte_cliente   pagina_web_soporte_cliente  nombre_comercial    fuente_creacion procesar_sin_cvv2   productos_servicios usuario_modificacion_id sesion_modificacion_id  saldo_congelado perfil_usuario_id   grupo_perfil_cobranza_id    limite_creacion_transacciones   limite_procesamiento_transacciones  codigo_mcc_id   url_atencion_cliente    ubicacion_regional  direccion_linea2    direccion_linea3    direccion_codigo_postal grupo_perfil_cobranza_personalizado_id  logo    migrada_acreditacion    origen_comercial
"500181"    "2012-04-13 17:14:47.958"   "0.00"  "CTA POL [MARISOL GONDOLA ROMERO  PE]"  "0.00"      "500106"    "0.00"  "0.00"  "CHECK_ACCOUNT" "PE"    "PEN"   "0.00"  "f" "1" "9" "3742396"   "ID"    "f" "ENABLED"   "MARISOL GONDOLA ROMERO"    "5073984473"    ""  "CLL SALISBURY CASA 226 URBANIZACION COLINAS DEL GOLF CORREGIMIENTO DE JOSE D ESPINAR"  "ALEJANDRO JUAN COMPARADA"  "pol.no.responder@gmail.com"        "LIMA"      "VALIDAR TIPO DE DOCUMENTO Y TIPO DE CUENTA
"   "AGGREGATOR"    "f" "f"     "t"     "5073984473"    "pol.no.responder@gmail.com"                "f"     "2" "C64C7A54C942D496DCFCA2C24671FD46.admin-nodo1"  "0" "MG-PRICING_302_157077" "MG-PRICING_303_285226"         "5311"                              "f" 

following is the output:

cala> sqlcontext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").option("quote", "\"").option("charset","UTF-8").load("/home/goutham/Work/data/account.csv").show()
+--------------+--------------------+-----+--------------------+-------------+--------------------------+----------+--------------------+-------------+-------------+-------------+---------------+---------------------+-----------------+--------------------+-----------------+--------------------+--------------------+---------------+-------------+--------------------+----------+----+--------------------+--------------------+--------------------+----+------+----------------+--------------------+------------+-----------+-------------------+------------------------+--------------------------+------------------+-------------------------+---------------------+--------------------------+----------------+---------------+-----------------+-------------------+-----------------------+----------------------+---------------+-----------------+------------------------+-----------------------------+----------------------------------+-------------+--------------------+------------------+----------------+----------------+-----------------------+--------------------------------------+----+--------------------+----------------+
|     cuenta_id|      fecha_creacion|saldo|         descripcion|saldo_inicial|fecha_ultima_actualizacion|usuario_id|      saldo_anterior|saldo_reserva|  tipo_cuenta|pais_iso_3166|moneda_iso_4217|valor_comision_retiro|retiro_en_proceso|             plan_id|grupo_comision_id|           documento|      tipo_documento|operacion_local|estado_cuenta|              nombre| telefonos| fax|           direccion|     nombre_contacto|      email_contacto| url|ciudad|ventas_mensuales|         comentarios|modelo_pagos|modo_prueba|tiene_fondo_reserva|porcentaje_fondo_reserva|aceptar_tc_internacionales|dias_fondo_reserva|telefono_servicio_cliente|email_soporte_cliente|pagina_web_soporte_cliente|nombre_comercial|fuente_creacion|procesar_sin_cvv2|productos_servicios|usuario_modificacion_id|sesion_modificacion_id|saldo_congelado|perfil_usuario_id|grupo_perfil_cobranza_id|limite_creacion_transacciones|limite_procesamiento_transacciones|codigo_mcc_id|url_atencion_cliente|ubicacion_regional|direccion_linea2|direccion_linea3|direccion_codigo_postal|grupo_perfil_cobranza_personalizado_id|logo|migrada_acreditacion|origen_comercial|
+--------------+--------------------+-----+--------------------+-------------+--------------------------+----------+--------------------+-------------+-------------+-------------+---------------+---------------------+-----------------+--------------------+-----------------+--------------------+--------------------+---------------+-------------+--------------------+----------+----+--------------------+--------------------+--------------------+----+------+----------------+--------------------+------------+-----------+-------------------+------------------------+--------------------------+------------------+-------------------------+---------------------+--------------------------+----------------+---------------+-----------------+-------------------+-----------------------+----------------------+---------------+-----------------+------------------------+-----------------------------+----------------------------------+-------------+--------------------+------------------+----------------+----------------+-----------------------+--------------------------------------+----+--------------------+----------------+
|        500181|2012-04-13 17:14:...| 0.00|CTA POL [MARISOL ...|         0.00|                      null|    500106|                0.00|          0.0|CHECK_ACCOUNT|           PE|            PEN|                  0.0|                f|                   1|                9|             3742396|                  ID|              f|      ENABLED|MARISOL GONDOLA R...|5073984473|null|CLL SALISBURY CAS...|ALEJANDRO JUAN CO...|pol.no.responder@...|null|  LIMA|            null|VALIDAR TIPO DE D...|        null|       null|               null|                    null|                      null|              null|                     null|                 null|                      null|            null|           null|             null|               null|                   null|                  null|           null|             null|                    null|                         null|                              null|         null|                null|              null|            null|            null|                   null|                                  null|null|                null|            null|
|"  "AGGREGATOR"|                   f|    f|                null|            t|                      null|5073984473|pol.no.responder@...|         null|         null|         null|              f|                 null|                2|C64C7A54C942D496D...|                0|MG-PRICING_302_15...|MG-PRICING_303_28...|           null|         null|                5311|      null|null|                null|                null|                null|null|  null|               f|                null|        null|       null|               null|                    null|                      null|              null|                     null|                 null|                      null|            null|           null|             null|               null|                   null|                  null|           null|             null|                    null|                         null|                              null|         null|                null|              null|            null|            null|                   null|                                  null|null|                null|            null|
+--------------+--------------------+-----+--------------------+-------------+--------------------------+----------+--------------------+-------------+-------------+-------------+---------------+---------------------+-----------------+--------------------+-----------------+--------------------+--------------------+---------------+-------------+--------------------+----------+----+--------------------+--------------------+--------------------+----+------+----------------+--------------------+------------+-----------+-------------------+------------------------+--------------------------+------------------+-------------------------+---------------------+--------------------------+----------------+---------------+-----------------+-------------------+-----------------------+----------------------+---------------+-----------------+------------------------+-----------------------------+----------------------------------+-------------+--------------------+------------------+----------------+----------------+-----------------------+--------------------------------------+----+--------------------+----------------+
Goutham
  • 97
  • 2
  • 10

1 Answers1

0

identified the issue. latest Spark 2.1.0 is not recognising this. I have tried the same using Spark 2.0.2 its working.

anybody please let me know how to report this ?

update : Hi All , we have a PR lodged last month for this issue and it has been fixed.

JIRA

new update : its confirmed that this would be fixed in coming Spark 2.2.0 version. GIT Issue report

Goutham
  • 97
  • 2
  • 10