
I am writing a Python program that gets the IP address of a website using the socket module. I have a list of dicts containing n websites and their numbers.

Here's some sample data:

data_list = [{'website': 'www.google.com', 'n': 'n1'}, {'website': 'www.yahoo.com', 'n': 'n2'}, {'website': 'www.bing.com', 'n': 'n3'}, {'website': 'www.stackoverflow.com', 'n': 'n4'}, {'website': 'www.smackcoders.com', 'n': 'n5'}, {'website': 'www.zoho.com', 'n': 'n6'}, {'website': 'www.quora.com', 'n': 'n7'}, {'website': 'www.elastic.co', 'n': 'n8'}, {'website': 'www.google.com', 'n': 'n9'}, {'website': 'www.yahoo.com', 'n': 'n10'}, {'website': 'www.bing.com', 'n': 'n11'}, {'website': 'www.stackoverflow.com', 'n': 'n12'}, {'website': 'www.smackcoders.com', 'n': 'n13'}, {'website': 'www.zoho.com', 'n': 'n14'}, {'website': 'www.quora.com', 'n': 'n15'}, {'website': 'www.elastic.co', 'n': 'n16'}, {'website': 'www.google.com', 'n': 'n17'}, {'website': 'www.yahoo.com', 'n': 'n18'}, {'website': 'www.bing.com', 'n': 'n19'}, {'website': 'www.stackoverflow.com', 'n': 'n20'}]

Here's my program:

import socket
import time


data_list = [{'website': 'www.google.com', 'n': 'n1'}, {'website': 'www.yahoo.com', 'n': 'n2'}, {'website': 'www.bing.com', 'n': 'n3'}, {'website': 'www.stackoverflow.com', 'n': 'n4'}, {'website': 'www.smackcoders.com', 'n': 'n5'}, {'website': 'www.zoho.com', 'n': 'n6'}, {'website': 'www.quora.com', 'n': 'n7'}, {'website': 'www.elastic.co', 'n': 'n8'}, {'website': 'www.google.com', 'n': 'n9'}, {'website': 'www.yahoo.com', 'n': 'n10'}, {'website': 'www.bing.com', 'n': 'n11'}, {'website': 'www.stackoverflow.com', 'n': 'n12'}, {'website': 'www.smackcoders.com', 'n': 'n13'}, {'website': 'www.zoho.com', 'n': 'n14'}, {'website': 'www.quora.com', 'n': 'n15'}, {'website': 'www.elastic.co', 'n': 'n16'}, {'website': 'www.google.com', 'n': 'n17'}, {'website': 'www.yahoo.com', 'n': 'n18'}, {'website': 'www.bing.com', 'n': 'n19'}, {'website': 'www.stackoverflow.com', 'n': 'n20'}]

field = "website"
action = "append"
max_retry = 1
hit_cache_size = 10
cache = []
d1 = []

for data in data_list:
    temp={}
    for item in data:
        if item ==field:
            if data[item]!="Not available":
                try:
                    ad=socket.gethostbyname(data[item])
                    if len(cache)<hit_cache_size:
                        cache.append({data[item]:ad})
                    else:
                        cache=[]
                    if action=="replace":
                        temp[item]=ad
                    elif action=="append":
                        temp[item]=str([data[item],ad])
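                except socket.error:
                    count=0
                    while(True):
                        try:
                            ad=socket.gethostbyname(data[item])
                        except socket.error:
                            count+=1
                            if count==max_retry:
                                if action=="replace":
                                    temp[item]="Unknown"
                                elif action=="append":
                                    temp[item]=str([data[item],"Unknown"])
                                break
                        else:
                            # Retry succeeded: store the result and stop looping
                            if action=="replace":
                                temp[item]=ad
                            elif action=="append":
                                temp[item]=str([data[item],ad])
                            break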
            else:
                temp[item]="Not available"
        else:
            temp[item]=data[item]
    temp['timestamp']=time.ctime()   
    d1.append(temp)
print(d1)

Here, data_list can contain millions of websites, so my code takes a long time. I created a cache to store some websites together with their IPs; its size is defined by hit_cache_size. If a website address appears again in the list, the program should check the cache first instead of calling the socket module, and if the address is there, take the IP from the cache and save it. I tried a few approaches using arrays, but they still take a long time. How can I achieve this?

Smack Alpha
  • I would start by using different variable names instead of `i`, `j` and so on; they make the code really hard to read. Use expressive variable names, then people here can answer more quickly, and it also helps if you pass the code to other programmers. – uphill Apr 24 '19 at 07:57
  • Changed variables in code to make it easier to read.... – Smack Alpha Apr 24 '19 at 08:10
  • Why are you limiting the cache size to 10 elements, and why are you resetting the cache when it reaches 10 elements? – Erik Cederstrand Apr 24 '19 at 12:19
  • This is just sample code. In reality, the cache size will be 5000 or more. This is a temporary cache; it will be reset after a certain limit. I just want to check the cache instead of the socket module to process it faster, but I don't know how to achieve it. – Smack Alpha Apr 24 '19 at 13:21

2 Answers


In general, a cache should be a data structure with faster lookups than a list. In the worst case, searching a list takes as many iterations as it has entries (n); see https://wiki.python.org/moin/TimeComplexity .

E.g.: if you look up the mapping of 'c' here it will take 3 iterations.

entries = [('a', 1), ('b', 2), ('c', 3)]
result = None
for key, val in entries:
   if key == 'c':
      result = val
print(result)

If you want to speed up access to a cache, use a Python dict. A dict lookup takes O(1) time on average, compared with O(n) for scanning a list, which is much better. Nice side effect: it is much easier to read as well.

entries = {'a': 1, 'b': 2, 'c': 3}
result = entries['c']
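
Applied to the question, the same idea might look like the minimal sketch below. It keeps the question's "reset the cache when it is full" policy and also caches failed lookups as "Unknown", which is a design choice rather than a requirement. The names `resolve_all`, `resolver` and `max_cache_size` are illustrative assumptions, not part of the original program; `resolver` is a parameter only so the function can be tried without network access.

```python
import socket


def resolve_all(hostnames, resolver=socket.gethostbyname, max_cache_size=5000):
    """Resolve each hostname, checking a dict cache before calling the resolver.

    All names here are hypothetical; they do not come from the original program.
    """
    cache = {}  # hostname -> IP address (or "Unknown" for failed lookups)
    results = []
    for host in hostnames:
        if host not in cache:  # average O(1) membership test
            if len(cache) >= max_cache_size:
                cache.clear()  # same "reset when full" policy as the question
            try:
                cache[host] = resolver(host)
            except OSError:  # socket.gaierror is a subclass of OSError
                cache[host] = "Unknown"
        results.append((host, cache[host]))
    return results
```

With this shape, each distinct hostname triggers at most one real lookup until the cache is reset; repeated hostnames are answered from the dict.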
uphill

You mentioned that you could have millions of websites, so one way to handle this would be to use a framework that specializes in caching. One such example is Redis.

Installing and getting started with redis

Below is just a sample code to SET and GET the data.

import redis

# step 2: define our connection information for Redis
# Replace with your configuration information
redis_host = "localhost"
redis_port = 6379
redis_password = ""


def hello_redis():
    """Example Hello Redis Program"""

    # step 3: create the Redis Connection object
    try:

        # The decode_responses flag here directs the client to convert the responses from Redis into Python strings
        # using the default encoding utf-8.  This is client specific.
        r = redis.StrictRedis(host=redis_host, port=redis_port, password=redis_password, decode_responses=True)

        # step 4: Set the hello message in Redis 
        r.set("msg:hello", "Hello Redis!!!")

        # step 5: Retrieve the hello message from Redis
        msg = r.get("msg:hello")
        print(msg)        

    except Exception as e:
        print(e)


if __name__ == '__main__':
    hello_redis()

Using the above example, you can implement this in your codebase. Below is an example I have written that you can plug in with minimal changes.

def operate_on_cache(operation, **value):
    """Operate on Redis Cache"""
    try:

        # The decode_responses flag here directs the client to convert the responses from Redis into Python strings
        # using the default encoding utf-8.  This is client specific.
        r = redis.StrictRedis(host=redis_host, port=redis_port, password=redis_password, decode_responses=True)

        # Set the key value pair
        if operation == 'set':
            msg = r.set("{}:ip".format(value['site_name']), value['ip'])

        # Retrieve the key
        elif operation == 'get':
            msg = r.get('{}:ip'.format(value['site_name']))
        # Any other operation is unsupported.
        else:
            raise ValueError("unsupported operation: {}".format(operation))
        return msg
    except Exception as e:
        print(e)


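# Snippet showing where this could plug into your code.


if data[item] != "Not available":
    try:
        # Check the cache first; only resolve on a cache miss.
        ad = operate_on_cache('get', site_name=data[item])
        if not ad:
            ad = socket.gethostbyname(data[item])
            operate_on_cache('set', site_name=data[item], ip=ad)
    except OSError:
        # socket.gaierror is a subclass of OSError
        ad = "Unknown"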

This is just the basics of how you could use Redis for caching. If you are looking for a pure Python implementation, try out

cachetools Example of cachetools
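
For a bounded in-process cache without any extra dependency, the standard library's `functools.lru_cache` is a comparable option. Note that this is a different tool from cachetools, swapped in here only to keep the sketch dependency-free. The function name `cached_gethostbyname` is an illustrative assumption, and the size of 5000 is taken from the comments under the question.

```python
import socket
from functools import lru_cache


@lru_cache(maxsize=5000)  # evicts the least recently used entry when full
def cached_gethostbyname(hostname):
    """Resolve `hostname`; repeat calls with the same name come from the cache.

    Hypothetical helper name. Failed lookups raise an exception and are
    not cached by lru_cache, so they are retried on the next call.
    """
    return socket.gethostbyname(hostname)
```

Calling `cached_gethostbyname.cache_info()` reports hits and misses, which may help when choosing a cache size.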

vdkotian