
Parsing the data from Wikipedia takes an unacceptably long time. Instead of one thread/process, I want to run at least 5. After googling I found that Python 3.5 has async for.

Below is a "very short" version of the current "synchronous" code to show the whole process (with comments to quickly understand what the code does).

def update_data(region_id=None, country__inst=None, upper_region__inst=None):
    all_ids = []

    # Get data about countries or regions or subregions
    countries_or_regions_dict = OSM().get_countries_or_regions(region_id)

    # Loop that I want to make async
    for osm_id in countries_or_regions_dict:
        names = countries_or_regions_dict[osm_id]['names']

        if 'wiki_uri' in countries_or_regions_dict[osm_id]:
            wiki_uri = countries_or_regions_dict[osm_id]['wiki_uri']

            # PARSER: From Wikipedia gets translations of countries or regions or subregions
            translated_names = Wiki().get_translations(wiki_uri, osm_id)

            if not region_id:  # Means it is a country
                country__inst = Countries.objects.update_or_create(osm_id=osm_id,
                                                                   defaults={**countries_or_regions_dict[osm_id]})[0]

            else:  # Means it is a region/subregion (in case of recursion)
                upper_region__inst = Regions.objects.update_or_create(osm_id=osm_id,
                                                                      country=country__inst,
                                                                      region=upper_region__inst,
                                                                      defaults={**countries_or_regions_dict[osm_id]})[0]
            # Add to DB translated names from wiki
            for lang_code in names:
                ...  # (code that saves translated_names to the DB omitted)

            # RECURSION: If country has regions or region has subregions, start recursion
            if 'divisions' in countries_or_regions_dict[osm_id]:
                regions_list = countries_or_regions_dict[osm_id]['divisions']

                for division_id in regions_list:
                    all_regions_osm_ids = update_data(region_id=division_id, country__inst=country__inst,
                                                      upper_region__inst=upper_region__inst)

                    all_ids += all_regions_osm_ids

    return all_ids

I realized that I need to change def update_data to async def update_data and, accordingly, for osm_id in countries_or_regions_dict to async for osm_id in countries_or_regions_dict,

but I could not find information on whether it is necessary to use get_event_loop() in my case (and where), or how/where to specify how many iterations of the loop can run simultaneously. Could someone please help me make the for loop asynchronous?

TitanFighter
1 Answer


The asyncio module doesn't create multiple threads/processes; it runs code in one thread, in one process, but it can handle situations with I/O blocks (if you write your code in a special way). Read up on when you should use asyncio.
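
Just to make that concrete, the asyncio version would look roughly like the sketch below, but note that it only helps if the HTTP call itself is non-blocking (which requests, for example, is not), which is why I suggest threads below. The names fetch_one, fetch_all, items and async_get_translations are placeholders, and Semaphore(5) is only there because you mentioned wanting at least 5 at a time:

import asyncio

async def fetch_one(sem, wiki_uri, osm_id):
    async with sem:  # at most 5 downloads are in flight at the same time
        # a non-blocking HTTP call would be needed here (e.g. aiohttp);
        # calling requests here would block the whole event loop
        return await async_get_translations(wiki_uri, osm_id)  # placeholder

async def fetch_all(items):  # items: list of (wiki_uri, osm_id) pairs
    sem = asyncio.Semaphore(5)
    tasks = [fetch_one(sem, uri, oid) for uri, oid in items]
    return await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()  # this is where get_event_loop() comes in
translations = loop.run_until_complete(fetch_all(items))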

Since your code is synchronous in nature, I would suggest using threads instead of asyncio. Create a ThreadPoolExecutor and use it to parse Wikipedia in multiple threads.
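
A minimal sketch of that idea, assuming you first collect the (wiki_uri, osm_id) pairs from countries_or_regions_dict into a list (called items here) and that Wiki().get_translations() can safely be called from several threads; fetch_translations and max_workers=5 are placeholders for your own choices:

from concurrent.futures import ThreadPoolExecutor

def fetch_translations(items):
    # items: list of (wiki_uri, osm_id) pairs collected beforehand
    with ThreadPoolExecutor(max_workers=5) as executor:
        # up to 5 Wikipedia pages are downloaded at the same time; each call
        # is the same Wiki().get_translations() as in the synchronous version
        results = executor.map(lambda pair: Wiki().get_translations(*pair), items)
        return {osm_id: translations
                for (_, osm_id), translations in zip(items, results)}

The update_or_create() calls can then stay in the main thread and simply look the translations up in the returned dict, so the database part of update_data does not have to change.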

Mikhail Gerasimov
  • Your post pushed me to read a lot :) Now I see that asyncio does everything in one thread/process. Could you clarify a few things please? 1) ThreadPoolExecutor (TPE) can start different threads, but just one thread can be used at a given time, so it is similar to asyncio. Am I right? 2) And in my case I can use both methods (because both solve the problem of IO-bound work, in my case 'parsing'), but TPE is easier because only a few changes to my code are required? – TitanFighter Mar 25 '16 at 01:50
  • 3) The method 'get_translations' uses 'requests'. If 'requests' takes resources to download pages, and TPE uses just one thread at a time, where is the benefit? Is the execution of IO operations outside the threads' scope? – TitanFighter Mar 25 '16 at 01:55
  • @TitanFighter, 1) Yes 2) Yes 3) You should run multiple 'get_translations' calls in the thread executor. TPE will use multiple threads. Although only one of them can run at a time, as soon as it hits an I/O operation it transfers control to another thread instead of just wasting time waiting for the I/O to finish. This works because network I/O operations are asynchronous themselves. Read this link about I/O: http://stackoverflow.com/a/16528847/1113207 You can also read this article with code examples (especially "Getting Started" and below): http://chriskiehl.com/article/parallelism-in-one-line/ – Mikhail Gerasimov Mar 25 '16 at 05:09
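
In the spirit of the "parallelism in one line" article linked in the last comment, the same idea can be sketched with a thread-backed Pool; items is the same hypothetical list of (wiki_uri, osm_id) pairs as in the sketch above:

from multiprocessing.dummy import Pool as ThreadPool  # threads, not processes

def get_one(pair):
    wiki_uri, osm_id = pair
    return Wiki().get_translations(wiki_uri, osm_id)

pool = ThreadPool(5)                      # 5 worker threads
translations = pool.map(get_one, items)   # blocks until every page is fetched
pool.close()
pool.join()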