Reading substring within quotes from a massive string in Python

Question

I have the following string:

{"name":"INPROCEEDINGS","__typename":"PublicationConferencePaper"},"hasPermiss
ionToLike":true,"hasPermissionToFollow":true,"publicationCategory":"researchSu
mmary","hasPublicFulltexts":false,"canClaim":false,"publicationType":"inProcee
dings","fulltextRequesterCount":0,"requests":{"__pagination__":
[{"offset":0,"limit":1,"list":[]}]},"activeFiguresCount":0,"activeFigures":
{"__pagination__":[{"offset":0,"limit":100,"list":
[]}]},"abstract":"Heterogeneous Multiprocessor System-on-Chip (MPSoC) are 
progressively becoming predominant in most modern mobile devices. These 
devices are required to perform processing of applications within thermal,
 energy and performance constraints. However, most stock power and thermal
 management mechanisms either neglect some of these constraints or rely on 
frequency scaling to achieve energy-efficiency and temperature reduction on 
the device. Although this inefficient technique can reduce temporal thermal
 gradient, but at the same time hurts the performance of the executing task.
 In this paper, we propose a thermal and energy management mechanism which 
achieves reduction in thermal gradient as well as energy-efficiency through 
resource mapping and thread-partitioning of applications with online 
optimization in heterogeneous MPSoCs. The efficacy of the proposed approach is 
experimentally appraised using different applications from Polybench benchmark 
suite on Odroid-XU4 developmental platform. Results show 28% performance 
improvement, 28.32% energy saving and reduced thermal variance of over 76%
 when compared to the existing approaches. Additionally, the method is able to
 free more than 90% in memory storage on the MPSoC, which would have been 
previously utilized to store several task-to-thread mapping 
configurations.","hasRequestedAbstract":false,"lockedFields"

I am trying to fetch the substring between "abstract":" and ","hasRequestedAbstract". For that I am using the following code:

    import re
    import requests
    #some more codes here........
    to_visit_url = 'https://www.researchgate.net/publication/328749434_TEEM_Online_Thermal-_and_Energy-Efficiency_Management_on_CPU-GPU_MPSoCs'
    this_page = requests.get(to_visit_url)
    content = str(this_page.content, encoding="utf-8")
    abstract = re.search('\"abstract\":\"(.*)\",\"hasRequestedAbstract\"', content)
    print('Abstract:\n' + str(abstract))

But the abstract variable holds the value None. What could be the issue, and how can I fetch the substring described above?

Note: Although the text looks like it could be read as a JSON object, that is not an option: the sample above is only a small part of the complete HTML content, from which it is very difficult to extract the JSON object.

P.S. The full content of the page, i.e. page.content, can be downloaded from here: https://docs.google.com/document/d/1awprvKsLPNoV6NZRmCkktYwMwWJo5aujGyNwGhDf7cA/edit?usp=sharing

Alternatively, the source can be downloaded directly from the URL: https://www.researchgate.net/publication/328749434_TEEM_Online_Thermal-_and_Energy-Efficiency_Management_on_CPU-GPU_MPSoCs

`content[content.index("abstract:")+9:content.index("hasRequestedAbstract")]`? — hilberts_drinking_problem, Dec 07 '18 at 09:50
The first thing you need to know is that search() returns a match object, or None if no part of the string matches. So this means that your regex doesn't find a string that matches your pattern. — Iulian, Dec 07 '18 at 10:01
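
The slicing idea from the first comment can be sketched without a regex. This is a minimal illustration, not code from the thread: the helper name `between` and the shortened `sample` string are assumptions made for the example, and the markers are the literal delimiters quoted in the question.

```python
def between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # skip past the opening marker
    end = text.find(end_marker, start)  # look for the close only after it
    if end == -1:
        return None
    return text[start:end]


# Shortened stand-in for the page content (hypothetical, not the real page).
sample = '"activeFigures":{},"abstract":"Heterogeneous MPSoC are common.","hasRequestedAbstract":false'

print(between(sample, '"abstract":"', '","hasRequestedAbstract"'))
```

Unlike str.index(), str.find() returns -1 instead of raising ValueError, so a page that lacks either marker yields None rather than an exception.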

score 1 · Answer 1 · answered Dec 07 '18 at 09:58

re.search doesn't return a list of parsed results. It returns a match object (SRE_Match), or None when nothing matches. If you want a list of the captured groups, you need to use the re.findall method.

Tested Code

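    import re
    import requests

    # Capture everything between "abstract":" and ","hasRequestedAbstract"
    test_pattern = re.compile(r'"abstract":"(.*)","hasRequestedAbstract"')
    test_requests = requests.get("https://www.researchgate.net/publication/328749434_TEEM_Online_Thermal-_and_Energy-Efficiency_Management_on_CPU-GPU_MPSoCs")

    # findall returns a list of captured groups; [0] raises IndexError if the list is empty
    print(test_pattern.findall(test_requests.text)[0])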

Result

'Heterogeneous Multiprocessor System-on-Chip (MPSoC) are progressively becoming predominant in most modern mobile devices. These devices are required to perform processing of applications within thermal, energy and performance constraints. However, most stock power and thermal management mechanisms either neglect some of these constraints or rely on frequency scaling to achieve energy-efficiency and temperature reduction on the device. Although this inefficient technique can reduce temporal thermal gradient, but at the same time hurts the performance of the executing task. In this paper, we propose a thermal and energy management mechanism which achieves reduction in thermal gradient as well as energy-efficiency through resource mapping and thread-partitioning of applications with online optimization in heterogeneous MPSoCs. The efficacy of the proposed approach is experimentally appraised using different applications from Polybench benchmark suite on Odroid-XU4 developmental platform. Results show 28% performance improvement, 28.32% energy saving and reduced thermal variance of over 76% when compared to the existing approaches. Additionally, the method is able to free more than 90% in memory storage on the MPSoC, which would have been previously utilized to store several task-to-thread mapping configurations.'
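
For comparison, here is a sketch of how re.search can also be used, by reading the captured group from the match object. It runs on an inline sample rather than the live page; `ABSTRACT_RE`, `extract_abstract` and `sample` are names invented for this illustration. Two choices differ from the pattern above and are assumptions about what the page may need: a non-greedy `(.*?)` so the match stops at the first closing marker, and re.DOTALL so `.` also matches line breaks (by default it does not, which is one possible cause of a None result on text that contains newlines).

```python
import re

# Non-greedy capture; DOTALL lets "." span line breaks inside the abstract.
ABSTRACT_RE = re.compile(r'"abstract":"(.*?)","hasRequestedAbstract"', re.DOTALL)


def extract_abstract(content):
    """Return the captured abstract text, or None if the pattern is absent."""
    match = ABSTRACT_RE.search(content)
    if match is None:
        return None
    return match.group(1)  # group(1) is the parenthesised part of the pattern


# Hypothetical sample with a line break and a second, later pair of keys.
sample = ('{"abstract":"First line\nsecond line.","hasRequestedAbstract":false,'
          '"abstract":"other","hasRequestedAbstract":true}')

print(extract_abstract(sample))
print(extract_abstract("<html></html>"))
```

Checking for None before calling group() avoids the AttributeError that a failed search would otherwise cause.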

This relies on both keys being present in the JSON object, and being in the order posted. Obviously it works in this case, but in general I would personally work with it as JSON rather than using a regex. — Stael, Dec 07 '18 at 10:10
@kde713 I am getting the following error --> abstract = test_pattern.findall(test_requests.text)[0] Traceback (most recent call last): File "", line 1, in IndexError: list index out of range — TheCoder, Dec 07 '18 at 14:14
@TheCoder Can you post the result of `print(test_requests.text)`? I think your request result is different from mine. — Dongeon Kim, Dec 09 '18 at 05:03
@kde713 Please have a look at my updated question. You can download the content from there. — TheCoder, Dec 09 '18 at 13:44
@TheCoder I can't find the abstract part in your Google Docs file. Could you highlight the target part in your document? — Dongeon Kim, Dec 10 '18 at 05:43
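
The suggestion to treat the data as JSON can be followed even when the JSON is buried in a larger string, by decoding only the value that follows the key. The sketch below uses json.JSONDecoder.raw_decode from the standard library; the helper name `json_value_after_key` and the `sample` text are assumptions made for the illustration. It takes the first textual occurrence of the key, which may not be the intended one on a real page.

```python
import json


def json_value_after_key(text, key):
    """Decode the JSON value that directly follows the first '"key":' in text."""
    marker = json.dumps(key) + ":"      # e.g. '"abstract":'
    pos = text.find(marker)
    if pos == -1:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever trails it, so the surrounding HTML does not matter.
    value, _end = json.JSONDecoder().raw_decode(text, pos + len(marker))
    return value


# Hypothetical fragment: JSON embedded in markup, with escaped quotes.
sample = r'junk<script>{"id":1,"abstract":"A \"quoted\" word, 28% faster.","hasRequestedAbstract":false}'

print(json_value_after_key(sample, "abstract"))
```

Because the value is parsed as JSON, escape sequences such as `\"` are decoded, and the result does not depend on which key comes next.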

score 1 · Accepted Answer · answered Dec 10 '18 at 21:40

This answer does not use a regex (regular expression), but it does the job:

    import requests

    def fetch_abstract(url = "https://www.researchgate.net/publication/328749434_TEEM_Online_Thermal-_and_Energy-Efficiency_Management_on_CPU-GPU_MPSoCs"):
        test_requests = requests.get(url)
        index = 0
        inner_count = 0
        while index < len(test_requests.text):
            # find the next "[Show full abstract]" marker at or after index
            index = test_requests.text.find('[Show full abstract]</a><span class="lite-page-hidden', index)
            if index == -1:
                break
            inner_count += 1
            if inner_count == 4:
                # extract the abstract from here -->
                temp = test_requests.text[index-1:]
                index2 = temp.find('</span></div><a class="nova-e-link nova-e-link--color-blue')
                quote_index = temp.find('">')
                abstract = test_requests.text[index + quote_index + 2 : index - 1 + index2]
                print(abstract)
            # step past the current marker before searching again
            index += 52

    if __name__ == '__main__':
        fetch_abstract()

Result:

Heterogeneous Multiprocessor System-on-Chip (MPSoC) are progressively becoming predominant in most modern mobile devices. These devices are required to perform processing of applications within thermal, energy and performance constraints. However, most stock power and thermal management mechanisms either neglect some of these constraints or rely on frequency scaling to achieve energy-efficiency and temperature reduction on the device. Although this inefficient technique can reduce temporal thermal gradient, but at the same time hurts the performance of the executing task. In this paper, we propose a thermal and energy management mechanism which achieves reduction in thermal gradient as well as energy-efficiency through resource mapping and thread-partitioning of applications with online optimization in heterogeneous MPSoCs. The efficacy of the proposed approach is experimentally appraised using different applications from Polybench benchmark suite on Odroid-XU4 developmental platform. Results show 28% performance improvement, 28.32% energy saving and reduced thermal variance of over 76% when compared to the existing approaches. Additionally, the method is able to free more than 90% in memory storage on the MPSoC, which would have been previously utilized to store several task-to-thread mapping configurations.
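
The loop above boils down to "find the n-th occurrence of a marker, then slice up to a closing marker". A hedged sketch of that idea as two small helpers follows; `find_nth`, `text_after_nth` and the toy `sample` are hypothetical names and data, not part of the answer, and the offsets are derived from len() instead of hard-coded numbers.

```python
def find_nth(text, marker, n):
    """Return the index of the n-th (1-based) occurrence of marker, or -1."""
    index = -1
    for _ in range(n):
        index = text.find(marker, index + 1)
        if index == -1:
            return -1
    return index


def text_after_nth(text, open_marker, close_marker, n):
    """Return the text between the n-th open_marker and the next close_marker."""
    start = find_nth(text, open_marker, n)
    if start == -1:
        return None
    start += len(open_marker)
    end = text.find(close_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Toy stand-in for the page markup.
sample = "<a>one</a><a>two</a><a>three</a>"

print(text_after_nth(sample, "<a>", "</a>", 2))
```

Which occurrence holds the abstract (the fourth, in the answer) depends on the page layout and would have to be checked against the actual HTML.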

score 0 · Answer 3 · answered Dec 07 '18 at 10:07

When you call requests.get(...), you get back a Response object.

These objects are really clever: when the response body is JSON, you can use the built-in .json() method to get the string you've posted in the question as a Python dictionary.

However, I note that the link you posted doesn't point to anything like that, but to a full HTML document. If you're trying to parse a website like that, you should look at BeautifulSoup instead (https://www.crummy.com/software/BeautifulSoup/).
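
BeautifulSoup is a third-party package, so as a dependency-free illustration of the same tag-level approach, here is a sketch using the standard library's html.parser. It assumes, without confirmation from the thread, that the JSON sits inside a <script> element; `ScriptCollector` and the `sample` document are invented for the example.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical page: JSON embedded in a script tag.
sample = '<html><body><p>Hi</p><script>{"abstract":"text"}</script></body></html>'

parser = ScriptCollector()
parser.feed(sample)
data = json.loads(parser.scripts[0])
print(data["abstract"])
```

Once the script body is isolated, json.loads can parse it as a whole, which is sturdier than matching neighbouring keys with a regex.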
