
I have a 20 GB database file (*.db) that contains a lot of information on scientific articles, acquired from Scopus. Most of the information is stored in the responses table within the database file. I only want to obtain the subject areas of each published article, but I have no idea how to get them from the database file.

Each row in the database stores a lot of information about the author, the article, and also the subject areas. A snippet looks like this:

{"abstracts-retrieval-response":{"coredata":{"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/85012897283","dc:identifier":"SCOPUS_ID:85012897283","eid":"2-s2.0-85012897283","language":{"@xml:lang": "eng"},"authkeywords":{"author-keyword": [{"@_fa": "true", "$" :"austerity policies"},{"@_fa": "true", "$" :"housing policy"},{"@_fa": "true", "$" :"Italy"},{"@_fa": "true", "$" :"Mediterranean welfare regime"},{"@_fa": "true", "$" :"Neoliberalism"},{"@_fa": "true", "$" :"Spain"}]},"idxterms":null,"subject-areas":{"subject-area": [{"@_fa": "true", "@abbrev": "SOCI", "@code": "3303", "$" :"Development"},{"@_fa": "true", "@abbrev": "SOCI", "@code": "3322", "$" :"Urban Studies"},{"@_fa": "true", "@abbrev": "ENVI", "@code": "2308", "$" :"Management, Monitoring, Policy and Law"}]}}

From this large table (in reality much larger than the snippet suggests), I am only interested in getting the dc:identifier and the (multiple) subject areas. Ideally I would get them into a *.csv file.

Is there a straightforward way to obtain this information from the *.db file using Python? Using sqlite3 I seem to be able to get access to the database with the following code:

import sqlite3

# Connect to the SQLite database file.
conn = sqlite3.connect('C:/responses.db')
c = conn.cursor()

It's not clear to me how I can now get only the dc:identifier and subject-areas information out of the database file.

Tom Bailey
  • The data appears to be JSON, so after reading the data from the database (e.g., `c.execute("select * from responses;")`), iterate through the results and use Python's `json` library to load each row into a Python data structure, from which you can extract the information you want. You can then store that in whatever other format you want for analysis or export. – rd_nielsen Sep 27 '17 at 22:59
  • Thanks a lot! I'm still a beginner with Python (and JSON), so now I'm stuck on how to iterate through the whole file. How do I load each row into a data structure using json? Any suggestions or ideas would be extremely helpful. – Tom Bailey Sep 28 '17 at 09:12
  • See answer added below. – rd_nielsen Sep 28 '17 at 14:08

1 Answer


Using the json.loads() function will convert the input JSON string to a Python object consisting of a hierarchical set of dictionaries and lists. You then need to extract the information from it using standard Python operations. Below is an example applied to your data, though I'm not sure it pulls the correct document ID. It is also untested, so you may need to tweak or correct it. Also, your example input has unbalanced curly braces, so the interpretation of the structure used below may not be quite right.
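To make the dictionary-and-list structure concrete first, here is a tiny self-contained illustration (the sample string is invented for demonstration and is not your actual data):

import json

# A miniature stand-in for one row's JSON text.
text = '{"coredata": {"dc:identifier": "SCOPUS_ID:123"}, "subjects": ["Development", "Urban Studies"]}'
obj = json.loads(text)

# Nested values are reached with ordinary indexing.
print(obj["coredata"]["dc:identifier"])   # SCOPUS_ID:123
print(obj["subjects"][1])                 # Urban Studies

With that in mind, the full example: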

import sqlite3
import json

conn = sqlite3.connect('C:/responses.db')       
c = conn.cursor()

# Initialize the output list.
subjectlist = []

# Get the data from SQLite.
c.execute("select * from responses;")

# Iterate over all the rows of data
for row in c:
    # Extract and save the subject information.
    article = json.loads(row[0])
    doc_id = article["abstracts-retrieval-response"]["coredata"]["dc:identifier"]
    subjects = [s["$"] for s in article["abstracts-retrieval-response"]["coredata"]["subject-areas"]["subject-area"]]
    for s in subjects:
        subjectlist.append([doc_id, s])

At the conclusion of this code, subjectlist will be a list of two-element lists, each consisting of the document ID and one subject area. You can then use the csv library to export this, or perhaps push it back into the database as a new table to make further querying easier.
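For the export step, a minimal sketch using the standard csv module might look like this (the output path is just an example):

import csv

# Write one row per (document ID, subject area) pair.
with open('C:/subject_areas.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['dc:identifier', 'subject-area'])  # header row
    writer.writerows(subjectlist)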

rd_nielsen
  • This is really, really great -- thanks so much. I'm almost getting there now. The issue is that the relevant info seems to be in `row[6]`, so the command should use `json.loads(row[6])`. I do get the first author ID, but then the script stops working. You've been of great help, so feel free to stop helping; it's just much more complex than I thought. – Tom Bailey Sep 28 '17 at 15:25
  • You could modify the SELECT statement to get only the column you need (i.e., the 7th) instead of `*`, which might be a little faster and use a little less memory. Instead of `for row in c:` you could do `rows = c.fetchall()` followed by `for row in rows:` -- this would allow you to inspect the data returned. You might also need to specify an encoding with the `json.loads()` command. – rd_nielsen Sep 28 '17 at 15:30
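A sketch of that refinement follows. It is untested and makes two assumptions: that the table really is named responses, and that the 7th column is the one holding the JSON; the PRAGMA step is used here to look up that column's actual name so the SELECT can name it instead of using `*`.

import sqlite3
import json

conn = sqlite3.connect('C:/responses.db')
c = conn.cursor()

# Discover the column names; each table_info row is
# (cid, name, type, notnull, dflt_value, pk).
c.execute("PRAGMA table_info(responses);")
columns = [info[1] for info in c.fetchall()]
json_column = columns[6]  # the 7th column, per the comment above

# Fetch only that column; double quotes guard against odd identifiers.
c.execute('select "{}" from responses;'.format(json_column))
rows = c.fetchall()  # allows inspecting the data before parsing

for row in rows:
    article = json.loads(row[0])
    # ... extract doc_id and subjects as in the answer above ...

Note that fetchall() loads every row into memory at once, which is convenient for inspection but may be impractical for a 20 GB table; iterating over the cursor directly, as in the answer above, avoids holding everything in memory.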