
I'm cleaning up an 11M+ row dataset that contains very noisy company names. I want to apply some regex cleaning (removing punctuation, whitespace, digits, accents and a few other issues) and then apply an approximate string matching algorithm such as TF-IDF or BM25, but that's a separate problem.
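For reference, this is roughly the kind of normalization I have in mind. It's only a minimal sketch; the exact rules (what to keep, what to drop) are still open:

import re
import unicodedata

def normalize_name(name: str) -> str:
    # Decompose accented characters and drop the combining marks (é -> e).
    name = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Drop digits and punctuation (anything that is not a letter or a space).
    name = re.sub(r"[^A-Za-z ]+", "", name)
    # Collapse runs of whitespace, trim and upper-case.
    return re.sub(r"\s+", " ", name).strip().upper()

normalize_name("  PUMA  SPORTS LA  S.A. 123 ")  # -> 'PUMA SPORTS LA SA'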

EDIT

These are the first five rows (including the header).

from pyspark.sql import SparkSession

FILE_PATH = "my_folder/dataset.csv/"
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
my_rdd = sc.textFile(FILE_PATH)
my_rdd.take(5)

['cuit_impor,prove_dest,sim_sptos,porg',
 '34557619099,PUMA SPORTS LA SA                                 ,61099000111N,337',
 '34557619099,PUMA SPORTS LA SA                                 ,61102000110R,306',
 '34557619099,PUMA SPORTS LA SA                                 ,61102000110R,310',
 '34557619099,PUMA SPORTS LA SA                                 ,61103000912D,337']

It comes back as a list of strings, but with the following code I split each string into a list of four elements.

my_rdd_splitted = my_rdd.map(lambda x: x.split(',')).cache()
my_rdd_splitted.take(5)

[['cuit_impor', 'prove_dest', 'sim_sptos', 'porg'],
 ['34557619099',
  'PUMA SPORTS LA SA                                 ',
  '61099000111N',
  '337'],
 ['34557619099',
  'PUMA SPORTS LA SA                                 ',
  '61102000110R',
  '306'],
 ['34557619099',
  'PUMA SPORTS LA SA                                 ',
  '61102000110R',
  '310'],
 ['34557619099',
  'PUMA SPORTS LA SA                                 ',
  '61103000912D',
  '337']]

My columns, as you can see, are cuit_impor, prove_dest, sim_sptos and porg. The second one, prove_dest, is the one that contains the names. To reach my goal of cleaning up the dataset, I first extracted all the distinct "dirty" names, so I read the dataset with pyspark and ran:

# In the map I split the string and select the element at position 1,
# then apply distinct.
unique_names = my_rdd.map(lambda x: x.split(",")[1]).distinct().cache()
print(unique_names.count())

The output was 2,292,034.
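As a possible sanity check (a sketch I have not run on the full file), the same count could be done through the DataFrame reader, which deals with the header row and quoting on its own:

df = spark.read.option("header", "true").csv(FILE_PATH)
df.select("prove_dest").distinct().count()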

To check that everything was OK, I counted the distinct values again using DB Browser for SQLite, version 3.11.2, with these statements:

CREATE TABLE "dataset" (
    "cuit_impor"    REAL,
    "prove_dest"    TEXT,
    "sim_sptos" TEXT,
    "porg"  INTEGER
);

SELECT COUNT(DISTINCT prove_dest)
FROM dataset;

The output was 2,312,252.

The number of rows read is the same in both cases: 11,474,580, so both tools are working on the same dataset and there is no problem with the data reading.
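This is roughly how the two row counts could be compared from Python; the SQLite database path is just a placeholder:

import sqlite3

my_rdd.count()  # counts the header line too if the CSV has one

con = sqlite3.connect("my_folder/dataset.db")
con.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
con.close()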

It seems that SQLite identifies as distinct some strings that pyspark does not. Has anyone run into this problem before? How can I solve it?

Nachengue
1 Answer


In the SQL query you are doing the distinct on one column (prove_dest), but in pyspark you are doing the distinct on all the columns of the rdd. That's why you get different results.
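A toy illustration of the difference described here (made-up data, reusing the sc from the question):

rows = sc.parallelize([
    ("1", "ACME SA", "A"),
    ("2", "ACME SA", "B"),
])

rows.distinct().count()                      # 2 distinct full rows
rows.map(lambda r: r[1]).distinct().count()  # 1 distinct value in that column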

dasilva555
  • Could you please take a look at the edits? I think that with my_rdd.map(lambda x: x.split(",")[1]).distinct() I'm first selecting the elements at position 1 (that is, `prove_dest`) and then performing a distinct only on those values. – Nachengue May 31 '21 at 17:04
  • I suggest you join the two sets of distinct values and try to understand which values are missing from each side (a sketch of such a comparison is below). In addition, use a CSV validator to check your CSV file; maybe the file is corrupted. @Nachengue – dasilva555 Jun 01 '21 at 15:38
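A possible sketch of that comparison, assuming the set of differing names is small enough to collect and using a placeholder path for the SQLite database:

import sqlite3

# Pull the distinct names out of SQLite (placeholder path).
con = sqlite3.connect("my_folder/dataset.db")
sqlite_names = [row[0] for row in con.execute("SELECT DISTINCT prove_dest FROM dataset")]
con.close()

sqlite_rdd = sc.parallelize(sqlite_names)

# Names that only one of the two tools reports as distinct values.
only_in_sqlite = sqlite_rdd.subtract(unique_names).collect()
only_in_spark = unique_names.subtract(sqlite_rdd).collect()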