I'm analyzing a Twitter dataset in Python and trying to find every quote. The code is supposed to give me a .csv file listing all tweets and their quotes. I found code on GitHub where someone did the same thing with data from a website, and I adapted my dataset to fit that code. The tweets are all in an .xml file like this:
<articles>
    <article>
        <paragraph>Tweet text is here.</paragraph>
        <paragraph>Tweet text is here.</paragraph>
    </article>
</articles>
My dataset has 1,000,000 tweets. When I analyze a sample of 50,000 tweets, everything works as expected. When I analyze the full dataset, I get this message:
Traceback (most recent call last):
  File "C:/xxx.py", line 16, in <module>
    count = text.count("\'")
AttributeError: 'NoneType' object has no attribute 'count'
Why do I get this when I analyze the whole dataset but not when I analyze the sample?
Here's my code:
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np

tree = ET.parse('tweets.xml')
articles = tree.getroot()

paragraphs_with_quotes = []
paragraphs_with_double_quotes = []
quotes = []
extracted_paragraphs = []

for article in articles:
    for paragraph in article.findall('paragraph'):
        text = paragraph.text
        count = text.count("\'")
        indexes = []
        if count > 1:
            paragraphs_with_quotes.append(text)
            index = text.index("\'")
            while count > 0:
                if text[index - 1] == " " or index == len(text) - 1 or text[index + 1] in " .,":
                    indexes.append(index)
                if count > 1:
                    index = text.index("\'", index + 1)
                count -= 1
            for i in range(0, len(indexes), 2):
                start = indexes[i]
                end = indexes[min(len(indexes) - 1, i + 1)]
                print(text)
                quotes.append(text[indexes[i]:indexes[min(len(indexes) - 1, i + 1)] + 1])
                extracted_paragraphs.append(text)
                print("Quote:" + quotes[len(quotes) - 1])
                print()

d = {'Paragraph:': extracted_paragraphs, 'Quote:': quotes}
quote_data = pd.DataFrame(d)
quote_data.to_csv('quote_data.csv')
for i in range(1):
    print()
print(len(paragraphs_with_quotes))
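One possible explanation, which I have not confirmed against the full file: ElementTree sets `.text` to `None` for an element that has no text content, so a single empty `<paragraph>` anywhere in the 1,000,000 tweets would trigger this error even if the 50,000-tweet sample happens to contain none. The sketch below is only an illustration under that assumption. The sample XML and the helper name `split_paragraphs` are made up for the example and are not part of my script.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second and third paragraphs have no text content.
SAMPLE_XML = """<articles>
  <article>
    <paragraph>He said 'hello there' to me.</paragraph>
    <paragraph></paragraph>
    <paragraph/>
  </article>
</articles>"""


def split_paragraphs(root):
    """Return (texts, empty_count) for all <paragraph> elements under root."""
    texts = []
    empty_count = 0
    for article in root:
        for paragraph in article.findall('paragraph'):
            text = paragraph.text  # None when the element has no text content
            if text is None:
                # Skip empty paragraphs instead of calling .count() on None.
                empty_count += 1
                continue
            texts.append(text)
    return texts, empty_count


root = ET.fromstring(SAMPLE_XML)
texts, empty_count = split_paragraphs(root)
print(texts)        # paragraphs that are safe to pass to text.count(...)
print(empty_count)  # number of paragraphs that would have raised the error
```

If that assumption holds for my data, the equivalent change in the loop above would be an `if text is None: continue` directly after `text = paragraph.text`.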
Thank you!