
What I have:

I have a DataFrame (df) with 2 columns.

In df["Words"] I have some Persian/Farsi words.

Words Counts
سلام
کشور زیبا ؟
28 % ایران
ایران طلا
طلا ایران
سلام ایران

What I want:

I would like to split the text into words and count the frequency of every single word in the column "Words":

Words Counts
سلام 2
کشور 1
زیبا 1
؟ 1
ایران 4
طلا 2
% 1

What I did:

df.Words.str.get_dummies(sep=' ').mul(df['count'], axis=0).sum()

What I received from Python:

Words Counts
سلام NAN
کشور NAN
زیبا NAN
؟ NAN
ایران NAN
طلا NAN
% NAN

Is the problem the formatting or the code?

Jsmoka

1 Answer


This handles " " and "." (at the end of a sentence). I am not sure whether there are any other separators in Farsi. If you need more, just add them to the "separators" string.

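import pandas as pd
import re

# Characters that separate words; add more characters to this string if needed.
separators = ". "
# Build a character class so every separator is matched literally
# (a bare "." in a regex would match any character).
pattern = "[" + re.escape(separators) + "]+"
df = pd.DataFrame({"Words": ["hi you there", "hello all"]})

def get_word_len(words: str) -> int:
    # Drop empty strings left behind by leading or trailing separators.
    return len([w for w in re.split(pattern, words) if w])

df["Counts"] = df.Words.apply(get_word_len)

print(df)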

Thank you for your feedback. I misunderstood the task slightly. This should solve your problem (of course, df should be replaced with your DataFrame):

import pandas as pd

df = pd.DataFrame({"Words": ["hi you there", "hello all hi"]})

# Collect every word from every row into one flat list.
words = []
for text in df["Words"]:
    words.extend(text.split(" "))

# Count how often each word occurs.
df_a = pd.DataFrame({"words": words})
print(df_a["words"].value_counts())

result:

hi       2
there    1
all      1
hello    1
you      1
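
The same idea can be written without the explicit loop. The following is only a sketch: it assumes the column is called "Words" as in the question, that words are separated by whitespace, and it rebuilds the question's sample rows by hand. The output column names "Words" and "Counts" are chosen to match the desired table.

```python
import pandas as pd

# Sample rows copied from the question (assumed to be whitespace-separated).
df = pd.DataFrame({"Words": [
    "سلام",
    "کشور زیبا ؟",
    "28 % ایران",
    "ایران طلا",
    "طلا ایران",
    "سلام ایران",
]})

counts = (
    df["Words"]
    .str.split()        # split each row on whitespace into a list of tokens
    .explode()          # one token per row
    .value_counts()     # frequency of each token
    .rename_axis("Words")
    .reset_index(name="Counts")
)
print(counts)
```

Note that every whitespace-separated token is counted, so "28", "%" and "؟" each get their own row. If some tokens should be ignored, they would have to be filtered out before value_counts().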
Semmel
  • Unfortunately, that is not what I want. I would like the code to: 1. split the text into words, 2. count the words/symbols, 3. show me the count of each word/symbol. – Jsmoka Feb 02 '21 at 12:29