I'm analyzing a Twitter dataset in Python and trying to find every quote. The code is supposed to produce a .csv file listing all tweets and their quotes. I found code on GitHub where someone did the same thing with data from a website, and I adapted my dataset to fit that code. The tweets are all in an .xml file like this:

<articles>
     <article>
          <paragraph>Tweet text is here.</paragraph>
          <paragraph>Tweet text is here.</paragraph>
     </article>
</articles>

My dataset has 1,000,000 tweets. When I analyze a sample of 50,000 tweets, everything works as expected. When I analyze the full dataset, I get this error:

Traceback (most recent call last):
  File "C:/xxx.py", line 16, in <module>
    count = text.count("\'")
AttributeError: 'NoneType' object has no attribute 'count'

Why do I get this when I analyze the whole dataset but not when I analyze the sample?

Here's my code:

import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np

tree = ET.parse('tweets.xml')
articles = tree.getroot()

paragraphs_with_quotes = []
paragraphs_with_double_quotes = []
quotes = []
extracted_paragraphs = []

for article in articles:
    for paragraph in article.findall('paragraph'):
        text = paragraph.text
        count = text.count("\'")
        indexes = []
        if count > 1:
            paragraphs_with_quotes.append(text)
            index = text.index("\'")
            while count > 0:

                if text[index - 1] == " " or index == len(text) - 1 or text[index + 1] in " .,":
                    indexes.append(index)
                if count > 1:
                    index = text.index("\'", index + 1)
                count -= 1
            for i in range(0, len(indexes), 2):
                start = indexes[i]
                end = indexes[min(len(indexes) - 1, i + 1)]
                print(text)

                quotes.append(text[indexes[i]:indexes[min(len(indexes) - 1, i + 1)] + 1])
                extracted_paragraphs.append(text)

                print("Quote:" + quotes[len(quotes) - 1])
                print()

d = {'Paragraph:': extracted_paragraphs, 'Quote:': quotes}
quote_data = pd.DataFrame(d)
quote_data.to_csv('quote_data.csv')

for i in range(1):
    print()

print(len(paragraphs_with_quotes))

Thank you!

klassetyp

1 Answer

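My guess would be that you have a paragraph with no text, such as an empty <paragraph></paragraph> or <paragraph/>. ElementTree sets .text to None for such an element, so text.count fails. An article with no paragraph at all would not cause this, because findall would simply return an empty list. The 50,000-tweet sample presumably happened not to contain an empty paragraph.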
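
To check this behavior, here is a minimal sketch using a small made-up XML string (the sample data is hypothetical, not taken from the real dataset). It suggests that empty paragraph elements yield None, while an article without any paragraph contributes nothing:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one normal paragraph, two empty paragraphs,
# and one article that has no paragraph at all.
xml_data = """<articles>
    <article>
        <paragraph>Tweet text is here.</paragraph>
        <paragraph></paragraph>
        <paragraph/>
    </article>
    <article></article>
</articles>"""

root = ET.fromstring(xml_data)

# Collect the .text attribute of every paragraph in every article.
texts = [p.text for article in root for p in article.findall('paragraph')]
print(texts)  # ['Tweet text is here.', None, None]
```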

You need to handle the case where text is None:

if text is not None:
    count = text.count("\'")
    # the rest of the per-paragraph logic goes inside this block
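
As a rough sketch of how that check could fit into the loop from the question, one option is a small generator that skips paragraphs without text. The helper name paragraph_texts and the sample XML are assumptions made for illustration only:

```python
import xml.etree.ElementTree as ET


def paragraph_texts(root):
    """Yield the text of every <paragraph>, skipping those whose text is None."""
    for article in root:
        for paragraph in article.findall('paragraph'):
            text = paragraph.text
            if text is None:
                # Empty element such as <paragraph/>: nothing to search for quotes.
                continue
            yield text


# Hypothetical sample with one quoted tweet and one empty paragraph.
sample = ET.fromstring(
    "<articles><article>"
    "<paragraph>He said 'hello' to me.</paragraph>"
    "<paragraph/>"
    "</article></articles>"
)

# Count apostrophes only in paragraphs that actually contain text.
counts = [text.count("'") for text in paragraph_texts(sample)]
print(counts)  # [2]
```

With the real file, the root would come from ET.parse('tweets.xml').getroot() as in the question, and the quote extraction would run on each yielded text.
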
Mads Hansen