
I'm trying to apply word tokenization to a Pandas DataFrame column as the step before POS tagging. The source (raw) column is 'sent' (already sentence-tokenized) and the destination column is 'word'. Here's the code, including the max-column-width setting:

import pandas as pd
import nltk

pd.set_option('display.max_colwidth', None)

LC_HD_df['word'] = LC_HD_df['sent'].apply(lambda x: nltk.tokenize.word_tokenize(str(x)))

This appears to work... except each cell in 'word' only has the first 101 tokens from the corresponding 'sent' cell. Why is it truncating at 101 tokens, and how do I fix it?

The 101 words end with "...". Does that suggest they have been tokenized but just don't appear for some reason? (That doesn't make sense to me.)

Attached is a picture of the first row.

[Screenshot: one row, two columns, one with the source words, one with the 101 word tokens]

I searched for related questions to no avail: many were generally related, but none addressed the truncation problem. This is probably an easy fix that I just don't know, but once I know the solution, I'll never forget it.

Thanks in advance for your assistance.

ddormer
  • How did you confirm that each 'word' cell only has the first 101 tokens? Your "..." indicates to me that you used the print function, which in pandas often makes the output pretty by not printing everything. Did you confirm by actually printing the length of each 'word'? – sev Jun 27 '22 at 17:55
  • Sorry for the slow response; I was offline for the past two weeks. Back now. I counted two ways. First, I copied the contents of several cells under 'word' into a Word doc and did a simple word count: every cell comes in at 101. Second, I tried counting the items in 'word' with LC_HD_df['word_count'] = LC_HD_df['word'].apply(lambda x: x.count(',') + 1), but this gave me various results (50, 16, 26, etc.) and I have no idea what these numbers represent. Any assistance would be so much appreciated. – ddormer Jul 12 '22 at 17:53

1 Answer


I don't think your 'word' cells only have 101 tokens in them; it's just that only that many are being printed.
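If I had to guess at the mechanism (an assumption on my part, but it matches your symptom): pandas pretty-prints long sequences such as token lists with at most display.max_seq_items elements, 100 by default, and marks the omission with "...". A minimal self-contained demo:

import pandas as pd

# One cell holding a 150-element list
df = pd.DataFrame({'word': [[f'tok{i}' for i in range(150)]]})

pd.set_option('display.max_colwidth', None)
print(df)  # the list is displayed truncated, ending with "..."

# Lift the sequence display limit and the full list appears
pd.set_option('display.max_seq_items', None)
print(df)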

I assume your function nltk.tokenize.word_tokenize(str(x)) is a more elaborate version of x.split(): it takes a string and returns a list of strings.

To check the length of this list in each of the cells, you could use any of the methods mentioned in this post: How to determine the length of lists in a pandas dataframe column, e.g.: LC_HD_df['word_count'] = LC_HD_df['word'].str.len()
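A self-contained sketch of that check, using your column names but made-up one-row data:

import pandas as pd
import nltk

nltk.download('punkt', quiet=True)  # tokenizer model (newer NLTK versions may ask for 'punkt_tab' instead)

LC_HD_df = pd.DataFrame({'sent': ['Hello, world!']})
LC_HD_df['word'] = LC_HD_df['sent'].apply(lambda x: nltk.tokenize.word_tokenize(str(x)))

# Each 'word' cell is a Python list; .str.len() returns its true length
LC_HD_df['word_count'] = LC_HD_df['word'].str.len()
print(LC_HD_df['word'].iloc[0])        # ['Hello', ',', 'world', '!']
print(LC_HD_df['word_count'].iloc[0])  # 4 -- punctuation gets its own tokens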

I don't think you will arrive at 101 with this method.
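As for the comma-counting attempt in the comments: after the apply, each 'word' cell is a Python list, not a string, so x.count(',') counts the list elements that are exactly ',' (the comma tokens the tokenizer split out), not separators between words. That's why the results (50, 16, 26, etc.) looked arbitrary: each is the number of comma tokens in that cell, plus one.

tokens = ['Hello', ',', 'world', '!']  # what a 'word' cell actually holds
print(tokens.count(',') + 1)           # 2: one comma token, plus one
print(len(tokens))                     # 4: the real token count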

sev
  • Yup, this worked, thank you. One interesting thing: the number of tokens is greater than the number of words I get from Word's word count. I think that's because tokens are created for punctuation. Assuming so, I'll clean that up when I run the preprocessor. All is good. Thanks again. – ddormer Jul 14 '22 at 18:41
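For the cleanup step mentioned in that last comment, one possible sketch (my own, not necessarily how the asker's preprocessor does it) drops tokens that contain no alphanumeric characters before counting:

# Keep only tokens containing at least one letter or digit
def drop_punct_tokens(tokens):
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

LC_HD_df['word_clean'] = LC_HD_df['word'].apply(drop_punct_tokens)
LC_HD_df['clean_count'] = LC_HD_df['word_clean'].str.len()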