I'm trying to process the data extracted from the web. The decoded raw data was bytes of xml file. I had some old code that just magically worked. However, I'm not sure what they were doing since it's been a while.
import urllib, urllib.request
url = 'http://export.arxiv.org/api/query?search_query=all:electron+OR+query?id_list=hep-th/9112001&start=0&max_results=2'
data = urllib.request.urlopen(url)
which could be decoded with
data.read().decode('utf-8')
which had a file of the form
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<link href="http://arxiv.org/api/query?search_query%3Dall%3Aelectron%20OR%20query%3Fid_list%3Dhep-th%2F9112001%26id_list%3D%26start%3D0%26max_results%3D2" rel="self" type="application/atom+xml"/>
<title type="html">ArXiv Query: search_query=all:electron OR query?id_list=hep-th/9112001&id_list=&start=0&max_results=2</title>
<id>http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY</id>
<updated>2023-03-20T00:00:00-04:00</updated>
<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">194139</opensearch:totalResults>
<opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
<opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">2</opensearch:itemsPerPage>
<entry>
<id>http://arxiv.org/abs/cond-mat/0102536v1</id>
<updated>2001-02-28T20:12:09Z</updated>
<published>2001-02-28T20:12:09Z</published>
<title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
<summary> The effect of the electron-electron cusp on the convergence of configuration
interaction (CI) wave functions is examined. By analogy with the
pseudopotential approach for electron-ion interactions, an effective
electron-electron interaction is developed which closely reproduces the
scattering of the Coulomb interaction but is smooth and finite at zero
electron-electron separation. The exact many-electron wave function for this
smooth effective interaction has no cusp at zero electron-electron separation.
We perform CI and quantum Monte Carlo calculations for He and Be atoms, both
with the Coulomb electron-electron interaction and with the smooth effective
electron-electron interaction. We find that convergence of the CI expansion of
the wave function for the smooth electron-electron interaction is not
significantly improved compared with that for the divergent Coulomb interaction
for energy differences on the order of 1 mHartree. This shows that, contrary to
popular belief, description of the electron-electron cusp is not a limiting
factor, to within chemical accuracy, for CI calculations.
</summary>
<author>
<name>David Prendergast</name>
<arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
</author>
<author>
<name>M. Nolan</name>
<arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation>
</author>
<author>
<name>Claudia Filippi</name>
<arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
</author>
<author>
<name>Stephen Fahy</name>
<arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
</author>
<author>
<name>J. C. Greer</name>
<arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation>
</author>
<arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1063/1.1383585</arxiv:doi>
<link title="doi" href="http://dx.doi.org/10.1063/1.1383585" rel="related"/>
<arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">11 pages, 6 figures, 3 tables, LaTeX209, submitted to The Journal of
Chemical Physics</arxiv:comment>
<arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">J. Chem. Phys. 115, 1626 (2001)</arxiv:journal_ref>
<link href="http://arxiv.org/abs/cond-mat/0102536v1" rel="alternate" type="text/html"/>
<link title="pdf" href="http://arxiv.org/pdf/cond-mat/0102536v1" rel="related" type="application/pdf"/>
<arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cond-mat.str-el" scheme="http://arxiv.org/schemas/atom"/>
<category term="cond-mat.str-el" scheme="http://arxiv.org/schemas/atom"/>
</entry>
<entry>
<id>http://arxiv.org/abs/astro-ph/0608371v1</id>
<updated>2006-08-17T14:05:46Z</updated>
<published>2006-08-17T14:05:46Z</published>
<title>Electron thermal conductivity owing to collisions between degenerate
electrons</title>
<summary> We calculate the thermal conductivity of electrons produced by
electron-electron Coulomb scattering in a strongly degenerate electron gas
taking into account the Landau damping of transverse plasmons. The Landau
damping strongly reduces this conductivity in the domain of ultrarelativistic
electrons at temperatures below the electron plasma temperature. In the inner
crust of a neutron star at temperatures T < 1e7 K this thermal conductivity
completely dominates over the electron conductivity due to electron-ion
(electron-phonon) scattering and becomes competitive with the the electron
conductivity due to scattering of electrons by impurity ions.
</summary>
<author>
<name>P. S. Shternin</name>
<arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Ioffe Physico-Technical Institute</arxiv:affiliation>
</author>
<author>
<name>D. G. Yakovlev</name>
<arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Ioffe Physico-Technical Institute</arxiv:affiliation>
</author>
<arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1103/PhysRevD.74.043004</arxiv:doi>
<link title="doi" href="http://dx.doi.org/10.1103/PhysRevD.74.043004" rel="related"/>
<arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">8 pages, 3 figures</arxiv:comment>
<arxiv:journal_ref xmlns:arxiv="http://arxiv.org/schemas/atom">Phys.Rev. D74 (2006) 043004</arxiv:journal_ref>
<link href="http://arxiv.org/abs/astro-ph/0608371v1" rel="alternate" type="text/html"/>
<link title="pdf" href="http://arxiv.org/pdf/astro-ph/0608371v1" rel="related" type="application/pdf"/>
<arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="astro-ph" scheme="http://arxiv.org/schemas/atom"/>
<category term="astro-ph" scheme="http://arxiv.org/schemas/atom"/>
</entry>
</feed>
From the web tutorial, the xml language is a tree structure data, with contents enclosed by
<a></a>
very similar to that of the html, which is why
import xml.etree.ElementTree as ET
could be very useful. However, when I used the code from the python standard library
#https://docs.python.org/3/library/xml.etree.elementtree.html
root=ET.fromstring(data.read().decode('utf-8'))
for child in root:
print(child.tag, child.attrib)
the returned result was not what desired:
{http://www.w3.org/2005/Atom}link {'href': 'http://arxiv.org/api/query?search_query%3Dall%3Aelectron%20OR%20query%3Fid_list%3Dhep-th%2F9112001%26id_list%3D%26start%3D0%26max_results%3D100', 'rel': 'self', 'type': 'application/atom+xml'}
{http://www.w3.org/2005/Atom}title {'type': 'html'}
{http://www.w3.org/2005/Atom}id {}
{http://www.w3.org/2005/Atom}updated {}
{http://a9.com/-/spec/opensearch/1.1/}totalResults {}
{http://a9.com/-/spec/opensearch/1.1/}startIndex {}
{http://a9.com/-/spec/opensearch/1.1/}itemsPerPage {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
{http://www.w3.org/2005/Atom}entry {}
This was obviously wrong. I suspected this was because there's an extra line of code
<?xml version="1.0" encoding="UTF-8"?>
in front of the body context
<feed xmlns="http://www.w3.org/2005/Atom">
from a bit of reading, this seemed to be the "namespaces", but I'm not sure how to use it.
Could you explain what should I do so that I could search for the title of the article, and the summary of that article, please?
For example,
<title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
<summary> The effect of the elec... </summary>
were obviously under the same node, how to identify and collect those nodes and attributes to two list
[node1, node2, ..., nodeN] #This is to identify and label the <entry></entry>
[id, updated, tile,...,category]
such that some functions can be used to extract the context, i.e.
ET.{somefunction}(node1,id)=http://arxiv.org/api/hNIXPXLfJXds3VmSJQ2mnDpmElY
Since the website use http://www.w3.org/2005/Atom
, can this "namespace" provide some sort of convention or shortcut?