
I merged three different CSV Netflow datasets (D1, D2, D3) into one big dataset (df) and applied KMeans clustering to it. To merge them I did not use pd.concat because of a memory error; I solved it from the Linux terminal instead.

df = pd.read_csv('D.csv')
#D is already created in a Linux machine from terminal

........
KMeans Clustering
........
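The clustering step itself is elided above; as a minimal sketch, assuming scikit-learn's KMeans with three clusters (the toy frame below is a hypothetical stand-in for the real 12-column df):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for df = pd.read_csv('D.csv'); any all-numeric frame works
df = pd.DataFrame({'bytes':   [1, 2, 100, 101, 50, 51],
                   'packets': [1, 1, 10, 10, 5, 5]})

# Fit KMeans on all numeric columns and store one cluster label per row
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df['clusters'] = kmeans.fit_predict(df)
```

The `clusters` column added here is what the splitting code below indexes on.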

As a result of the clustering, I separated the clusters into dataframes
and then created a CSV file for each:
cluster_0 = df[df['clusters'] == 0]
cluster_1 = df[df['clusters'] == 1]
cluster_2 = df[df['clusters'] == 2]

cluster_0.to_csv('cluster_0.csv')
cluster_1.to_csv('cluster_1.csv')
cluster_2.to_csv('cluster_2.csv')

#My goal is to understand the number of same rows with clusters
#and D1-D2-D3
D1 = pd.read_csv('D1.csv')
D2 = pd.read_csv('D2.csv')
D3 = pd.read_csv('D3.csv')

All these datasets contain the same column names; each has 12 columns (all numerical values).

Example expected result:

cluster_0 has xxxx rows in common with D1, xxxxx rows in common with D2, and xxxxx rows in common with D3.

    I recommend you share a sample showing what columns your dataframes(D1, D2, D3, df) contain. It will help to better know how to assist you. – DF.Richard May 12 '22 at 15:06
  • @linuxpanther can you share the code where `D1`, `D2`, and `D3`, get converted to `df`? – artemis May 12 '22 at 15:48
  • I do not exactly remember but it was something like : awk 'FNR > 1' D1.csv D2.csv D3.csv> D.csv https://predictivehacks.com/?all-tips=how-to-concatenate-multiple-csv-files-in-linux – linuxpanther May 12 '22 at 15:55
  • @linuxpanther I can't help you without seeing the full example as DF.Richard said. I would think you just read in the files using `.read_csv()`, add a column for which `.csv` they came from and exclude that from your modeling, and you're done. But I can't post that solution because I don't have the rest of your code. – artemis May 12 '22 at 16:08
  • There is **still** not sufficient information in the question. Please read https://stackoverflow.com/help/minimal-reproducible-example. There is no way for us to reproduce this. – artemis May 13 '22 at 19:48

2 Answers

# Inner merge on all shared columns keeps only rows present in both frames
cluster0_D1 = pd.merge(D1, cluster_0, how='inner')
number_of_rows_D1 = len(cluster0_D1)

cluster0_D2 = pd.merge(D2, cluster_0, how='inner')
number_of_rows_D2 = len(cluster0_D2)

cluster0_D3 = pd.merge(D3, cluster_0, how='inner')
number_of_rows_D3 = len(cluster0_D3)

print("How many samples belong to D1, D2, D3 for cluster_0?")
print("D1: ", number_of_rows_D1)
print("D2: ", number_of_rows_D2)
print("D3: ", number_of_rows_D3)
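The three near-identical blocks can also be collapsed into a loop. A sketch with small toy stand-ins for the real D1/D2/D3 and cluster_0 frames:

```python
import pandas as pd

# Toy stand-ins for the real D1/D2/D3 and cluster_0 DataFrames
D1 = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
D2 = pd.DataFrame({'a': [3, 4], 'b': [30, 40]})
D3 = pd.DataFrame({'a': [5, 6], 'b': [50, 60]})
cluster_0 = pd.DataFrame({'a': [1, 3, 4], 'b': [10, 30, 40]})

# Inner merge on all shared columns; the row count is the overlap size
for name, D in [('D1', D1), ('D2', D2), ('D3', D3)]:
    overlap = len(pd.merge(D, cluster_0, how='inner'))
    print(f"{name}: {overlap}")  # prints D1: 1, D2: 2, D3: 0
```

One caveat: an inner merge on all columns counts every matching row combination, so if the Netflow data contains exact duplicate rows, call `drop_duplicates()` on both frames first or the counts will be inflated.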

I think this solved my problem.


I do not believe there is enough information in the question to cover edge cases, but this should work if I understand correctly.

# Read in the 3, and add a column called "file" so we know which file they came from
D1 = pd.read_csv('D1.csv')
D1['file'] = 'D1.csv'
D2 = pd.read_csv('D2.csv')
D2['file'] = 'D2.csv'
D3 = pd.read_csv('D3.csv')
D3['file'] = 'D3.csv'

# Concatenate them row-wise, which is what the "awk" command was doing
df = pd.concat([D1, D2, D3], axis=0, ignore_index=True)

# Save off the series showing which file each row belongs to
files = df['file']
# Drop it so it doesn't get included in your analysis
df.drop('file', inplace=True, axis=1)

"""
There is no code in the question to show the KMeans clustering
"""

# Add the filename back
df['filename'] = files

We will avoid using the awk command and instead opt for pd.concat.
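With the source file carried through on every row, the per-cluster counts the question asks for fall out of a single `pd.crosstab`. A sketch with a hypothetical `clusters` column already assigned:

```python
import pandas as pd

# Toy stand-in for df after clustering, with the filename column restored
df = pd.DataFrame({
    'clusters': [0, 0, 1, 1, 2, 0],
    'filename': ['D1.csv', 'D2.csv', 'D1.csv', 'D3.csv', 'D2.csv', 'D1.csv'],
})

# One row per cluster, one column per source file, cell = row count
counts = pd.crosstab(df['clusters'], df['filename'])
print(counts)
```

Each cell answers "how many rows of cluster i came from file j", with absent combinations filled with 0.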

artemis
  • The problem with pd.concat is that I get a memory error because of the size. They are quite big CSVs, which is why it does not run; Linux solved it quickly. My question is not actually about merging them. Right now I am trying to count the same rows in two CSV files, for example between D1 and cluster_0. – linuxpanther May 13 '22 at 20:03