3

I have a program which retrieves a list of PubMed publications and wish to build a co-authorship graph, meaning that for each article I want to add each author (if not already present) as a vertex and add an undirected edge (or increase its weight) between every pair of coauthors.

I managed to write the first part of the program, which retrieves the list of authors for each publication, and I understand I could use the NetworkX library to build the graph (and then export it to GraphML for Gephi), but I cannot wrap my head around how to transform the "list of lists" into a graph.

Here follows my code. Thank you very much.

### if needed install the required modules
### python3 -m pip install biopython
### python3 -m pip install numpy

from Bio import Entrez
from Bio import Medline
Entrez.email = "rja@it.com"
handle = Entrez.esearch(db="pubmed", term='("lung diseases, interstitial"[MeSH Terms] NOT "pneumoconiosis"[MeSH Terms]) AND "artificial intelligence"[MeSH Terms] AND "humans"[MeSH Terms]', retmax="1000", sort="relevance", retmode="xml")
records = Entrez.read(handle)
ids = records['IdList']
h = Entrez.efetch(db='pubmed', id=ids, rettype='medline', retmode='text')
#now h holds all of the articles and their sections
records = Medline.parse(h)
# initialize an empty vector for the authors
authors = []
# iterate through all articles
for record in records:
    # for each article (record) get the authors list (empty if 'AU' is missing)
    au = record.get('AU', [])
    # now from the author list iterate through each author
    for a in au:
        if a not in authors:
            authors.append(a)
# following is just to show the list of all non-repeating
# authors sorted alphabetically (these should become my graph nodes)
authors.sort()
print('Authors: {0}'.format(', '.join(authors)))
Robert Alexander
  • 1
    If you want some help in building the graph, a sample of the nested list would be more helpful than this code – yatu Feb 09 '19 at 18:30
  • @yatu thank you for your comment. I thought providing a runnable code would be better but understand your suggestion. – Robert Alexander Feb 10 '19 at 14:47

1 Answer

6

Cool - the code was running, so the data structures are clear! As an approach, we build the connectivity matrix for both articles/authors and authors/co-authors.

List of authors: to describe the relation between the articles and the authors, I think you need the author list of each article

authors = []
author_lists = []              # <--- new
for record in records:
    au = record.get('AU', [])  # empty list if 'AU' is missing, so no '?' author is added
    author_lists.append(au)    # <--- new
    for a in au: 
        if a not in authors: authors.append(a)
authors.sort()
print(authors)
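Once `author_lists` exists, the sorted list of unique authors can also be derived from it directly. This is only a minimal sketch: the names below are made-up sample data standing in for the parsed Medline records.

```python
# Hypothetical sample data standing in for the 'AU' fields of the records.
author_lists = [["Smith J", "Doe A"], ["Doe A", "Lee K"], []]

# A set comprehension drops duplicates; sorted() returns an alphabetical list.
authors = sorted({name for names in author_lists for name in names})
print(authors)
```

Using a set avoids the repeated `a not in authors` list scan, which gets slow when a query returns many articles.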

numpy, pandas and matplotlib: this is just the way I am used to working

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

AU = np.array(authors)        # authors as np-array
NA = AU.shape[0]              # number of authors

NL = len(author_lists)        # number of articles/author lists
AUL = author_lists            # kept as a plain list: the author lists are ragged, which np.array rejects on recent NumPy

print('NA, NL', NA,NL)

Connectivity articles/authors

CON = np.zeros((NL,NA),dtype=int) # initializes connectivity matrix
for j in range(NL):               # run through the article's author list 
    aul = np.array(AUL[j])        # get a single author list as np-array
    z = np.zeros((NA),dtype=int)
    for k in range(len(aul)):     # get a single author
        z += (AU==aul[k])         # get its position in AU, add it up
    CON[j,:] = z                  # insert the result in the connectivity matrix

#---- graphics --------
fig = plt.figure(figsize=(20,10))
plt.spy(CON, marker ='s', color='chartreuse', markersize=5)
plt.xlabel('Authors'); plt.ylabel('Articles'); plt.title('Authors of the articles', fontweight='bold')
plt.show()
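The inner loop over the single authors can probably be replaced by `np.isin`, which tests every entry of `AU` against one author list at once. A minimal sketch with made-up names; note that it yields 0/1 entries, whereas the loop above would count an author twice if a name were repeated within one article.

```python
import numpy as np

# Hypothetical sample data: three articles and their author lists.
author_lists = [["Smith J", "Doe A"], ["Doe A", "Lee K"], ["Lee K"]]
authors = sorted({a for names in author_lists for a in names})

AU = np.array(authors)
# For each article, np.isin marks which entries of AU occur in its author list.
# One row per article, one column per author.
CON = np.array([np.isin(AU, names) for names in author_lists], dtype=int)
print(CON)
```

There is still one Python-level loop over the articles (the list comprehension), but the per-author comparison is done by NumPy.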


Connectivity authors/co-authors, the resulting matrix is symmetric

df = pd.DataFrame(CON)          # let's use pandas for the following step
ACON = np.zeros((NA,NA))         # initialize the connectivity matrix
for j in range(NA):              # run through the authors
    df_a = df[df.iloc[:, j] >0]  # give all rows with author j involved
    w = np.array(df_a.sum())     # sum the rows, store it in np-array 
    ACON[j] = w                  # insert it in the connectivity matrix

#---- graphics --------
fig = plt.figure(figsize=(10,10))
plt.spy(ACON, marker ='s', color='chartreuse', markersize=3)
plt.xlabel('Authors'); plt.ylabel('Authors'); plt.title('Authors that are co-authors', fontweight='bold')
plt.show()
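When the article/author matrix only contains 0 and 1, the authors/co-authors matrix is the matrix product of its transpose with itself, so the pandas loop is not strictly needed. A minimal sketch on a small made-up matrix:

```python
import numpy as np

# Hypothetical connectivity matrix: 3 articles (rows) x 3 authors (columns).
CON = np.array([[1, 0, 1],
                [1, 1, 0],
                [0, 1, 0]])

# Entry (i, j) counts the articles that list both author i and author j;
# the diagonal holds the number of articles of each author.
ACON = CON.T @ CON
print(ACON)
```

The off-diagonal entries are the edge weights of the co-authorship graph, and the result is symmetric by construction.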


For the graphics with NetworkX, I think you need a clear idea of what you want to represent, because there are many points and many possibilities too (perhaps you could post an example?). Only a few author circles are plotted below.

import networkx as nx

def set_edges(Q):
    case = 'A'
    if case=='A':
        Q1 = np.roll(Q,shift=1)
        Edges = np.vstack((Q,Q1)).T
    return Edges

Q = nx.Graph()
Q.clear()

AT = np.triu(ACON)                        # only the upper triangle is needed
fig = plt.figure(figsize=(7,7))
for k in range(9):                        # only the first 9 authors are plotted
    iA = np.argwhere(AT[k]>0).ravel()     # get the indices with AT[k]>0
    Edges = set_edges(iA)                 # select the involved nodes and connect them in a ring
    Q.add_edges_from(Edges)               # with_labels is an option of nx.draw, not of add_edges_from
nx.draw(Q, alpha=0.5)
plt.title('Co-author-ship', fontweight='bold')
plt.show()
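For the weighted graph described in the question (one vertex per author, one edge per pair of co-authors, weight increased for every shared article), the list of author lists can also be fed to NetworkX directly. This is a minimal sketch, not the code that produced the plot above; the names and the output file name are made up.

```python
from itertools import combinations

import networkx as nx

# Hypothetical sample data standing in for author_lists.
author_lists = [["Smith J", "Doe A", "Lee K"], ["Doe A", "Lee K"], ["Kim S"]]

G = nx.Graph()
for names in author_lists:
    unique = sorted(set(names))          # guard against a name listed twice
    G.add_nodes_from(unique)             # single authors still become vertices
    for a, b in combinations(unique, 2): # every pair of co-authors
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1       # one more shared article
        else:
            G.add_edge(a, b, weight=1)

print(G.number_of_nodes(), G.number_of_edges())
# nx.write_graphml(G, "coauthors.graphml")  # hypothetical file name, for Gephi
```

Unlike the ring-shaped edges drawn above, this connects every pair of authors of an article, so each article contributes a complete subgraph.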


pyano
  • @pyano thank you soooooo much!!! Of course this is a lot more interesting when the query term retrieves a lot more articles. If instead of the current query you replace it with ("HIV"[Mesh]) and draw the graph you will find out many subgraphs which represent scientists working with each other and will discover "groups" that do not necessarily belong to the same institutions and do all sorts of Social Network Analysis. Thank you very much again. – Robert Alexander Feb 10 '19 at 14:47
  • @Raymond Hettinger and @Scott Boston : Thank you for the flowers ! I still have a question: In the section `Connectivity articles/authors` I tried with `z = AU[AU==aul]`. But this has not been successful and I ended up with `for k in range(len(aul)): z += (AU==aul[k]) `. Any idea for a vectorized form, so as to get rid of the `for`-statement ? – pyano Feb 10 '19 at 22:34
  • @Robert Alexander: (1) I have learned a lot from your unlocking of the medline/pubmed data base, so I have been motivated to continue your effort.- thank you ! (2) If you have more data, prefer numpy/pandas and avoid `for`- and `if`-statements whenever possible. (3) if my answer is useful, you might click the green 'solved' button, so the result gets better visible, tx – pyano Feb 10 '19 at 22:48
  • @pyano again thanks. no green solved button for me, maybe I still do not have enough points here? :( I hope to learn enough from your example to slowly polish the code and remove as much iteration I can :) (I programmed in APL many eons ago and gave a run to R recently and love the matrix operators). Just a minor warning in your code: MatplotlibDeprecationWarning: isinstance(..., numbers.Number) if cb.is_numlike(alpha): – Robert Alexander Feb 11 '19 at 06:35
  • 1
    @Robert Alexander: Such warnings can happen (I have not got one). Usually they disappear when newer releases are installed. - On the left, below the up-vote/down-vote signs (in grey/orange) there is usually a "done/o.k." sign (in grey; it becomes green when clicked) – pyano Feb 11 '19 at 07:28
  • @pyano aw ok the grey "checkmark", found it and now it's green. In the next few days I hope to master your code and do a lot of interesting analysis. Thank you and take care. – Robert Alexander Feb 11 '19 at 08:16
  • Have a look here: https://github.com/DataScienceUB/introduction-datascience-python-book/blob/master/ch08_Network_Analysis.ipynb – pyano Feb 11 '19 at 08:58