
I have some xml that is formatted like this:

  <Paragraph Type="Character">
   <Text>
    TED
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    I thought we had a rule against that.
   </Text>
  </Paragraph>
  <Paragraph Type="Character">
   <Text>
    ANNIE
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    ...oh.
   </Text>
  </Paragraph>

I'm trying to extract the data so that it looks like this:

Character   Dialogue

TED         I thought we had a rule against that.
ANNIE       ...oh. 

I've been trying with:

soup.find(Type = "Character").get_text()
soup.find(Type = "Dialogue").get_text()

Each call returns only the first match. When I try to get every match with soup.find_all, i.e.:

soup.find_all(Type = "Character").get_text()

I get the error:

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

I understand that find_all() returns a list-like ResultSet of elements (thanks to this previous answer: https://stackoverflow.com/a/21997788/8742237), and that get_text() has to be called on a single element of it, but I would like to get all of the elements into the format I showed above.

Jeremy K.

3 Answers


Have you tried looping over the results and getting the text from each element?

[x.get_text() for x in soup.find_all(Type = "Character")]

The ResultSet returned by find_all() has no get_text() method, but each element in it does.
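For anyone who wants to run this, here is a minimal, self-contained sketch of the loop. It assumes the built-in html.parser, which lowercases attribute names, so the filter is spelled attrs={"type": ...} here; with an XML parser that preserves case, the Type= keyword from the question should apply as written. The names xml_data and characters are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample data from the question, condensed onto single lines.
xml_data = """
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Character"><Text>ANNIE</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
"""

soup = BeautifulSoup(xml_data, "html.parser")

# find_all() returns a ResultSet (a list of tags); call get_text() on each tag.
# html.parser stores the attribute as "type", hence the lowercase key.
matches = soup.find_all(attrs={"type": "Character"})
characters = [tag.get_text(strip=True) for tag in matches]

print(characters)
```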

Anna Nevison
  • That works very well, and I will look into extracting everything via loops. Would you have any suggestions on how to pair the `Dialogue` text with the `Character` text that immediately precedes it, as in the desired output I showed above? – Jeremy K. Aug 13 '19 at 20:09
  • I wouldn't use `find_all()` in that case because you now have two separate lists of paragraphs and these might not be perfectly correlated. Instead, use `find()` to get the first paragraph, then use `next_sibling` to move to the next in a `while` loop. Check the type of each paragraph to decide how to format it. – kindall Aug 13 '19 at 20:12
  • Just create a character list and a dialogue list, then do list(zip(character list, dialogue list)), assuming the lists are the same length. This pairs them. – Anna Nevison Aug 13 '19 at 20:14
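kindall's suggestion of walking the paragraphs in document order might look roughly like the sketch below. It is only one possible reading of that comment: it swaps next_sibling for find_next_sibling(), which skips the whitespace strings between tags, and it assumes html.parser, which lowercases tag and attribute names (hence "paragraph" and "type"). The variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample data from the question, condensed onto single lines.
xml_data = """
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Character"><Text>ANNIE</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
"""

soup = BeautifulSoup(xml_data, "html.parser")

rows = []
current_character = None

# Start at the first paragraph and step through its siblings in order.
node = soup.find("paragraph")
while node is not None:
    kind = node.get("type")
    text = node.get_text(strip=True)
    if kind == "Character":
        # Remember the speaker until the next Character paragraph.
        current_character = text
    elif kind == "Dialogue":
        rows.append((current_character, text))
    # find_next_sibling() returns the next tag, or None at the end.
    node = node.find_next_sibling("paragraph")

for character, dialogue in rows:
    print(f"{character:<10} {dialogue}")
```

Because each Dialogue is attached to the most recent Character seen, the pairing does not depend on the two kinds of paragraph occurring in equal numbers.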

To get pairs of Character and Dialogue, you can use the built-in zip() function:

html_data = '''  <Paragraph Type="Character">
   <Text>
    TED
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    I thought we had a rule against that.
   </Text>
  </Paragraph>
  <Paragraph Type="Character">
   <Text>
    ANNIE
   </Text>
  </Paragraph>
  <Paragraph Type="Dialogue">
   <Text>
    ...oh.
   </Text>
  </Paragraph>
  '''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_data, 'html.parser')

print('{: <10} {}'.format('Character', 'Dialogue'))
print()
for character, dialogue in zip(soup.select('[Type="Character"]'), soup.select('[Type="Character"] + [Type="Dialogue"]')):
    print('{: <10} {}'.format( character.get_text(strip=True), dialogue.get_text(strip=True)) )

Prints:

Character  Dialogue

TED        I thought we had a rule against that.
ANNIE      ...oh.

The CSS selector [Type="Character"] + [Type="Dialogue"] uses the adjacent sibling combinator (+): it selects each tag with Type="Dialogue" that immediately follows a sibling tag with Type="Character".

More here: CSS Selectors Reference
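One caveat, shown as a hedged sketch rather than a fix to the code above: zipping two separately selected lists can drift out of step if some Character paragraph has no Dialogue right after it. A possible alternative is to select only the dialogues and look back one sibling with find_previous_sibling(). The WAITER paragraph below is invented purely to demonstrate the effect, and the names pairs and naive are illustrative.

```python
from bs4 import BeautifulSoup

# Sample from the question plus one made-up Character (WAITER) with no Dialogue.
data = """
<Paragraph Type="Character"><Text>TED</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>I thought we had a rule against that.</Text></Paragraph>
<Paragraph Type="Character"><Text>WAITER</Text></Paragraph>
<Paragraph Type="Character"><Text>ANNIE</Text></Paragraph>
<Paragraph Type="Dialogue"><Text>...oh.</Text></Paragraph>
"""

soup = BeautifulSoup(data, "html.parser")
dialogue_selector = '[Type="Character"] + [Type="Dialogue"]'

# Select each Dialogue that directly follows a Character, then step back
# one tag to reach that Character, so every pair is adjacent by construction.
pairs = []
for dialogue in soup.select(dialogue_selector):
    character = dialogue.find_previous_sibling()
    pairs.append((character.get_text(strip=True), dialogue.get_text(strip=True)))

# For comparison: zipping the two independent selections pairs by position.
naive = [
    (c.get_text(strip=True), d.get_text(strip=True))
    for c, d in zip(soup.select('[Type="Character"]'), soup.select(dialogue_selector))
]

print(pairs)
print(naive)
```

With this input, the sibling lookup keeps ANNIE with her own line, while the positional zip assigns that line to WAITER.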

Andrej Kesely
  • For anyone struggling with this question in the future, the CSS Selectors Reference link above is excellent. – Jeremy K. Aug 13 '19 at 20:24
  • I'm still trying to get my head around this one, as I'm still learning and Googling every command. If I want to save the output in a list or zip object is there an easy way to do that? – Jeremy K. Aug 14 '19 at 03:39
  • It's okay, I've figured it out. Thank you. – Jeremy K. Aug 14 '19 at 04:02

The answer from Andrej Kesely is exactly what I was looking for: https://stackoverflow.com/a/57484760/8742237

In case anyone looking at this question in the future is a beginner, this is my attempt to break it down:

list1 = [x.get_text(strip = True) for x in soup.select('[Type="Character"]')]
print(list1)

list2 = [x.get_text(strip = True) for x in soup.select('[Type="Dialogue"]')]
print(list2)

zip1 = zip(list1, list2)
print(list(zip1))
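
On saving the result: a zip object is a one-shot iterator, so it is empty after the first pass over it. A small sketch of that behavior, using plain lists in place of the parsed text (the variable names are illustrative):

```python
# Stand-ins for list1 and list2 above, so this runs without parsing anything.
characters = ["TED", "ANNIE"]
dialogues = ["I thought we had a rule against that.", "...oh."]

# Materialize the pairs once if they will be used more than once.
pairs = list(zip(characters, dialogues))

# A bare zip object can only be consumed a single time.
zipped = zip(characters, dialogues)
first_pass = list(zipped)
second_pass = list(zipped)  # already exhausted, so this is empty

# The saved list can be reused, e.g. to build the aligned table rows.
lines = [f"{name:<10} {line}" for name, line in pairs]

print(pairs)
print(second_pass)
print("\n".join(lines))
```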
Jeremy K.