0

I'm trying to process the data extracted from the web. The decoded raw data was bytes of xml file. I had some old code that just magically worked. However, I'm not sure what they were doing since it's been a while.

import urllib, urllib.request
url = 'http://export.arxiv.org/api/query?search_query=all:electron+OR+query?id_list=hep-th/9112001&start=0&max_results=2'
data = urllib.request.urlopen(url)

which could be decoded with

data.read().decode('utf-8')

which had a file of the form

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dall%3Aelectron%20OR%20query%3Fid_list%3Dhep-th%2F9112001%26id_list%3D%26start%3D0%26max_results%3D2" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=all:electron OR query?id_list=hep-th/9112001&amp;id_list=&amp;start=0&amp;max_results=2</title>
  <id>http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY</id>
  <updated>2023-03-20T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">194139</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">2</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/abs/cond-mat/0102536v1</id>
    <updated>2001-02-28T20:12:09Z</updated>
    <published>2001-02-28T20:12:09Z</published>
    <title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
    <summary>  The effect of the electron-electron cusp on the convergence of configuration
interaction (CI) wave functions is examined. By analogy with the
pseudopotential approach for electron-ion interactions, an effective
electron-electron interaction is developed which closely reproduces the
scattering of the Coulomb interaction but is smooth and finite at zero
electron-electron separation. The exact many-electron wave function for this
smooth effective interaction has no cusp at zero electron-electron separation.
We perform CI and quantum Monte Carlo calculations for He and Be atoms, both
with the Coulomb electron-electron interaction and with the smooth effective
electron-electron interaction. We find that convergence of the CI expansion of
the wave function for the smooth electron-electron interaction is not
significantly improved compared with that for the divergent Coulomb interaction
for energy differences on the order of 1 mHartree. This shows that, contrary to
popular belief, description of the electron-electron cusp is not a limiting
factor, to within chemical accuracy, for CI calculations.
</summary>
    <author>
      <name>David Prendergast</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
    </author>
    <author>
      <name>M. Nolan</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation>
    </author>
    <author>
      <name>Claudia Filippi</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
    </author>
    <author>
      <name>Stephen Fahy</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
    </author>
    <author>
      <name>J. C. Greer</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation>
    </author>
    <arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1063/1.1383585</arxiv:doi>
    <link title="doi" href="http://dx.doi.org/10.1063/1.1383585" rel="related"/>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">11 pages, 6 figures, 3 tables, LaTeX209, submitted to The Journal of
  Chemical Physics</arxiv:comment>
    <arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">J. Chem. Phys. 115, 1626 (2001)</arxiv:journal_ref>
    <link href="http://arxiv.org/abs/cond-mat/0102536v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/cond-mat/0102536v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cond-mat.str-el" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cond-mat.str-el" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/astro-ph/0608371v1</id>
    <updated>2006-08-17T14:05:46Z</updated>
    <published>2006-08-17T14:05:46Z</published>
    <title>Electron thermal conductivity owing to collisions between degenerate
  electrons</title>
    <summary>  We calculate the thermal conductivity of electrons produced by
electron-electron Coulomb scattering in a strongly degenerate electron gas
taking into account the Landau damping of transverse plasmons. The Landau
damping strongly reduces this conductivity in the domain of ultrarelativistic
electrons at temperatures below the electron plasma temperature. In the inner
crust of a neutron star at temperatures T &lt; 1e7 K this thermal conductivity
completely dominates over the electron conductivity due to electron-ion
(electron-phonon) scattering and becomes competitive with the the electron
conductivity due to scattering of electrons by impurity ions.
</summary>
    <author>
      <name>P. S. Shternin</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Ioffe Physico-Technical Institute</arxiv:affiliation>
    </author>
    <author>
      <name>D. G. Yakovlev</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Ioffe Physico-Technical Institute</arxiv:affiliation>
    </author>
    <arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1103/PhysRevD.74.043004</arxiv:doi>
    <link title="doi" href="http://dx.doi.org/10.1103/PhysRevD.74.043004" rel="related"/>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">8 pages, 3 figures</arxiv:comment>
    <arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">Phys.Rev. D74 (2006) 043004</arxiv:journal_ref>
    <link href="http://arxiv.org/abs/astro-ph/0608371v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/astro-ph/0608371v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="astro-ph" scheme="http://arxiv.org/schemas/atom"/>
    <category term="astro-ph" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>

From the web tutorial, the xml language is a tree structure data, with contents enclosed by

<a></a>

very similar to that of the html, which is why

import xml.etree.ElementTree as ET

could be very useful. However, when I used the code from the python standard library

#https://docs.python.org/3/library/xml.etree.elementtree.html
root=ET.fromstring(data.read().decode('utf-8'))
for child in root:
    print(child.tag, child.attrib)

the returned result was not what desired:

{http://www.w3.org/2005/Atom}link {'href': 'http://arxiv.org/api/query?search_query%3Dall%3Aelectron%20OR%20query%3Fid_list%3Dhep-th%2F9112001%26id_list%3D%26start%3D0%26max_results%3D100', 'rel': 'self', 'type': 'application/atom+xml'}
{http://www.w3.org/2005/Atom}title {'type': 'html'}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}updated {}
{http://a9.com/-/spec/opensearch/1.1/}totalResults {}
{http://a9.com/-/spec/opensearch/1.1/}startIndex {}
{http://a9.com/-/spec/opensearch/1.1/}itemsPerPage {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}

This was obviously wrong. I suspected this was because there's an extra line of code

<?xml version="1.0" encoding="UTF-8"?>

in front of the body context

<feed xmlns="http://www.w3.org/2005/Atom">

from a bit of reading, this seemed to be the "namespaces", but I'm not sure how to use it.

Could you explain what should I do so that I could search for the title of the article, and the summary of that article, please?

For example,

<title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
<summary>  The effect of the elec... </summary>

were obviously under the same node, how to identify and collect those nodes and attributes to two list

[node1, node2, ..., nodeN] #This is to identify and label the <entry></entry>
[id, updated, tile,...,category]

such that some functions can be used to extract the context, i.e.

ET.{somefunction}(node1,id)=http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY

Since the website use http://www.w3.org/2005/Atom, can this "namespace" provide some sort of convention or shortcut?

mzjn
  • 48,958
  • 13
  • 128
  • 248
  • 1
    See https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces. There are many existing questions about processing XML with namespaces. For example: https://stackoverflow.com/q/20435500/407651. – mzjn Mar 20 '23 at 08:23

1 Answers1

1

This page is a feed. You can do this to scrap the title and summary:

import xml.etree.ElementTree as ET
import requests
url = 'http://export.arxiv.org/api/query?search_query=all:electron+OR+query?id_list=hep-th/9112001&start=0&max_results=2'
data = requests.get(url)
#print(data.text)

root = ET.fromstring(data.content)

for elem in root.iter():
    if elem.tag =="{http://www.w3.org/2005/Atom}title":
        print("Title: ",elem.text,"\n")
    if elem.tag =="{http://www.w3.org/2005/Atom}summary":
        print("Summary: ",elem.text,"\n")

Output (Extract):

...
Title:  Electron thermal conductivity owing to collisions between degenerate
  electrons 

Summary:    We calculate the thermal conductivity of electrons produced by
electron-electron Coulomb scattering in a strongly degenerate electron gas
taking into account the Landau ...
Hermann12
  • 1,709
  • 2
  • 5
  • 14
  • Footnotes: It might be be better to use `for link in root.findall('./{http://www.w3.org/2005/Atom}entry'):` to list the context nodes and then iterate through the sun elements, because `ArXiv Query: search_query=all:electron OR query?id_list=hep-th/9112001&id_list=&start=0&max_results=2 http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY` was in the `"http://www.w3.org/2005/Atom"` namespaces as well. – ShoutOutAndCalculate Mar 21 '23 at 21:35
  • 1
    @ShoutOutAndCalculate I think iter() will be faster, you iterate once over all elements. With findall() you generate first a list and after that you go again through the list and iterate about the parts. – Hermann12 Mar 22 '23 at 16:15