
I have a text file which contains a list of URLs, and I want to write the contents of each URL to another text file, with the URL as a header. I have used the Wikipedia-API package (https://pypi.org/project/Wikipedia-API/) to extract the content, but I would have to enter the links one by one, which is not practical since my list is huge, with at least 3000 links per text file.
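
For a single link, what I have been doing is roughly the following (the example URL is only an illustration, and depending on the installed version the wikipediaapi constructor may also require a user agent):

import wikipediaapi  # the Wikipedia-API package from PyPI

wiki = wikipediaapi.Wikipedia('en')

url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'  # example link only
title = url.rsplit('/', 1)[-1]  # the page title is the last part of the URL
page = wiki.page(title)

if page.exists():
    print(page.text)  # plain-text content of the article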

Can anyone help me with this? It would be highly appreciated.

EDIT:

I have tried this in the following way, but there is no content in the output txt file.

import urllib
import datetime as dt
from datetime import datetime

import time

linklist = []
with open("test.txt", 'r', encoding='utf-8') as wikitxt:
    #content = wikitxt.read().splitlines()
    for i in wikitxt:
        linklist.append(i)

output = open('Wikipedia_content.txt', 'w', encoding='utf-8')

startTime = time.time()
endTime = time.time()
runTime = endTime - startTime
print("Runtime is %3f seconds" % runTime)

Here is the txt file that I have used: https://pastebin.com/Y4bwsHGB, and this is the text file that I need to use: https://pastebin.com/SXDAu8jV.

Thanks in advance.

PROBLEM:

Traceback (most recent call last):
  File "C:/Users/suva_/Desktop/Project specification/data/test2.py", line 13, in <module>
    output_file.write((urlopen(link).read()))
  File "D:\Python 36\lib\urllib\request.py", line 228, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Python 36\lib\urllib\request.py", line 531, in open
    response = self._open(req, data)
  File "D:\Python 36\lib\urllib\request.py", line 554, in _open
    'unknown_open', req)
  File "D:\Python 36\lib\urllib\request.py", line 509, in _call_chain
    result = func(*args)
  File "D:\Python 36\lib\urllib\request.py", line 1389, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: https>

FINAL FIX:

import urllib.request
import datetime as dt
from datetime import datetime
import requests
import time
import re
import html2text

startTime = time.time()

def text_opener():
    # Read the links from test.txt into a list, one URL per line.
    linklist = []
    with open("test.txt", 'r', encoding='utf-8') as wikitxt:
        #content = wikitxt.read().splitlines()
        for i in wikitxt:
            try:
                linklist.append(i.strip())
            except UnicodeEncodeError as error:
                linklist.append("")
    return linklist

linklist = text_opener()  # put the links in a list

'''
This is a string of characters which I wanted to remove from the URL content

rejectedChar = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~0123456789')
rejectedChar.append("\t")
special = "\t"
regexWords = r"[\w']+"
'''

'''STOPWORDS LIST WHICH CONTAINS A BUNCH OF WORDS WHICH I DON'T NEED TO BE PRINTED --- ONLY FOR LARGE FILES
#stopwords = []
#with open('stopwords.txt', 'r', encoding='utf-8') as inFile:
#    for i in inFile:
#        stopwords.append(i.strip())
'''

content = ""
count = 0

for i in linklist:
    print(count, "   ", i.encode('utf-8'))
    count += 1
    try:
        f = urllib.request.urlopen(i).read()
        content += str(f)   # append the raw response of each URL
    except Exception as e:
        continue            # skip links that cannot be opened

#print((linklist[0:4000]).encode('utf-8'))

#combinedstops = rejectedChar + stopwords  # combining them together

#for item in combinedstops:
#    content = content.replace(item, "")  # now these items are removed from the content

def output_file(content):
    with open('June_wikipedia_content.txt', 'w', encoding='utf-8') as output:
        output.write(str(content))

##    try:
##        output_file(content)
##    except UnicodeEncodeError as error:
##        print("Got lost in the game")
#sky = open("sky.txt", 'w')
#sky.write(str(content))
output_file(content)

#print("hahahahahaha", stopwords)

#for i in content:
#    i = re.findall(regexWords, i)
#    i = [i for i in i if i in stopwords]

endTime = time.time()
runTime = endTime - startTime
print("Runtime is %3f seconds" % runTime)
S_Chakra

1 Answer


You can use the following snippet to open the text file and store all the links in a list:

with open('links.txt') as f:
    content = f.read().splitlines()

The variable content is a list in which each element is the string of one URL. This will only work, though, if your links.txt has the URLs arranged line by line, i.e.:

www.google.co.in
www.wikipedia.co.in
www.youtube.co.in 

Once you get this list you can iterate through it with a simple for loop and do what you desire.
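
For example, something along these lines (just a skeleton; what you do inside the loop is up to you):

for link in content:
    if not link.strip():  # skip any blank lines
        continue
    print(link)           # replace this with whatever you want to do with each URL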

If you want a more detailed answer I suggest posting an example text file of the links.

EDIT:

This works, but it dumps all the data into the file and the data is not formatted correctly. Is this what you need?

from urllib.request import urlopen

with open('links.txt') as f:
    content = f.read().splitlines()

with open('Wikipedia_content.txt', 'w') as output_file:
    for link in content:
        output_file.write(link)
        output_file.write(urlopen(link).read().decode('utf-8', errors='replace'))
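
If you also want each URL written as a header above its page, and broken links skipped, then a variant along these lines should be closer to what you described (I'm assuming the pages are UTF-8 encoded):

from urllib.request import urlopen

with open('links.txt') as f:
    content = f.read().splitlines()

with open('Wikipedia_content.txt', 'w', encoding='utf-8') as output_file:
    for link in content:
        link = link.strip()
        if not link:
            continue  # skip blank lines
        try:
            body = urlopen(link).read().decode('utf-8', errors='replace')
        except Exception:
            continue  # skip links that cannot be opened
        output_file.write(link + '\n')    # the URL as a header line
        output_file.write(body + '\n\n')  # the page content below it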
Rohan R
  • Hello there! First of all, thank you for the help. I have tried the way that you showed, but it is not showing me anything (I am outputting it to another txt file). – S_Chakra Oct 15 '18 at 19:20
  • In your code you aren't accessing the URL anywhere! – Rohan R Oct 16 '18 at 04:01
  • It is giving me errors as shown above; I have tried removing the package and reinstalling it (after trying your code). – S_Chakra Oct 16 '18 at 05:55
  • Since I am not able to reproduce that error, I'll find it difficult to help directly. You can try referring to [link](https://stackoverflow.com/questions/27115803/urllib-error-urlerror-urlopen-error-unknown-url-type-https). Also, if you are trying the URLs here: https://pastebin.com/SXDAu8jV, then that error is expected, as they are not in a proper format; I don't know what characters like '8g0g85,' mean. Did you try it with the URLs here: https://pastebin.com/Y4bwsHGB? – Rohan R Oct 16 '18 at 06:05
  • Hey, I was able to fix it. I will upload it after I finalize it. – S_Chakra Oct 18 '18 at 05:30