I'm cleaning an 11M+ row dataset that contains very noisy company names. I want to do some regex cleaning (removing punctuation, whitespace, digits, accents and some other issues) and then apply an approximate string matching algorithm like TF-IDF or BM25, but that's another problem.
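For reference, this is roughly the kind of cleaning I have in mind, using the DataFrame API. It's just a sketch: the column name name_clean is something I made up, the patterns are placeholders rather than my final rules, and the accent mapping only covers a few common characters.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sketch only: read the CSV with its header and normalize the name column.
df = spark.read.csv("my_folder/dataset.csv", header=True)

cleaned = (
    df
    .withColumn("name_clean", F.lower(F.col("prove_dest")))                      # lowercase
    .withColumn("name_clean", F.translate("name_clean", "áéíóúñü", "aeiounu"))   # partial accent stripping
    .withColumn("name_clean", F.regexp_replace("name_clean", r"[0-9]", ""))      # remove digits
    .withColumn("name_clean", F.regexp_replace("name_clean", r"[^\w\s]", ""))    # remove punctuation (placeholder pattern)
    .withColumn("name_clean", F.regexp_replace("name_clean", r"\s+", " "))       # collapse whitespace
    .withColumn("name_clean", F.trim(F.col("name_clean")))                       # trim leading/trailing spaces
)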
EDIT
These are my first five rows (including the header row).
FILE_PATH = "my_folder/dataset.csv/"
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
my_rdd = spark.sparkContext.textFile(FILE_PATH)
my_rdd.take(5)
['cuit_impor,prove_dest,sim_sptos,porg',
'34557619099,PUMA SPORTS LA SA ,61099000111N,337',
'34557619099,PUMA SPORTS LA SA ,61102000110R,306',
'34557619099,PUMA SPORTS LA SA ,61102000110R,310',
'34557619099,PUMA SPORTS LA SA ,61103000912D,337']
It is presented as a list of strings, but with the following code I split each string into a list of four elements.
my_rdd_splitted = my_rdd.map(lambda x: x.split(',')).cache()
my_rdd_splitted.take(5)
[['cuit_impor', 'prove_dest', 'sim_sptos', 'porg'],
['34557619099',
'PUMA SPORTS LA SA ',
'61099000111N',
'337'],
['34557619099',
'PUMA SPORTS LA SA ',
'61102000110R',
'306'],
['34557619099',
'PUMA SPORTS LA SA ',
'61102000110R',
'310'],
['34557619099',
'PUMA SPORTS LA SA ',
'61103000912D',
'337']]
My columns, as you can see, are cuit_impor, prove_dest, sim_sptos and porg. The second one, prove_dest, is the one that has the names. To reach my goal of cleaning up the dataset, I first got all the distinct "dirty" names, so I read the dataset with pyspark and ran:
# in the map I split each string and take the element at position 1 (prove_dest),
# then apply distinct
unique_names = my_rdd.map(lambda x: x.split(",")[1]).distinct().cache()
print(unique_names.count())
The output was 2,292,034.
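As a sanity check on that number, I may also count it through the DataFrame API, something like the sketch below. With header=True the header row stays out of the data, while the RDD version above still includes the string 'prove_dest' from the header line in its distinct set, so the two counts can differ by one.

from pyspark.sql import functions as F

# Same count through the DataFrame API, letting the CSV reader handle the header.
df = spark.read.csv(FILE_PATH, header=True)
df.select(F.countDistinct("prove_dest")).show()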
To check that everything was OK, I looked for the distinct values again using DB Browser for SQLite, Version 3.11.2, with the following statements:
CREATE TABLE "dataset" (
"cuit_impor" REAL,
"prove_dest" TEXT,
"sim_sptos" TEXT,
"porg" INTEGER
);
SELECT COUNT(DISTINCT prove_dest)
FROM dataset;
The output was 2,312,252.
The number of rows read is the same in both cases: 11,474,580, so both tools are working on the same dataset and there is no problem with reading the data.
It seems that SQLite identifies as distinct some strings that pyspark does not. Has anyone run into this problem? How can I solve it?
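In case it helps, this is how I'm thinking of comparing the two sets of distinct values to see exactly which strings the two tools disagree on (a sketch; my_database.db is just a placeholder for the file that holds the dataset table):

import sqlite3

# Distinct names according to SQLite (my_database.db is a placeholder name).
con = sqlite3.connect("my_database.db")
sqlite_names = {row[0] for row in con.execute("SELECT DISTINCT prove_dest FROM dataset")}
con.close()

# Distinct names according to pyspark (the unique_names RDD from above).
spark_names = set(unique_names.collect())

# Strings that only one of the two tools reports as distinct.
print(len(sqlite_names - spark_names))
print(len(spark_names - sqlite_names))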