I have downloaded a web page (charset=iso-8859-1) using curl:
curl "webpage_URL" > site.txt
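(Side note: to double-check which charset the server actually declares, I believe the response headers can be inspected like this; "webpage_URL" is a placeholder, as above:)

curl -sI "webpage_URL" | grep -i content-type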
The encoding of my terminal is utf-8. Here is what I get when I try to check the encoding of the file:
file -i site.txt
site.txt: regular file
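If I read the man page right, on BSD/macOS -i does not mean "print the MIME type" (that is -I there), which might be why no charset is reported at all. Assuming that is the case on this machine, the MIME-printing variants would be:

file --mime site.txt   # GNU file (and recent BSD file)
file -I site.txt       # BSD/macOS spelling of the same option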
Now the strange thing: if I open the file with nano, I find all the words that are visible in a normal browser. But when I use:
cat site.txt
some words are missing. This made me curious, and after some hours of research I still couldn't figure out why.
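My working theory is that the file contains ISO-8859-1 bytes that are being printed to a UTF-8 terminal. If that is right, converting the file first should make cat show everything (a sketch, assuming the page really is ISO-8859-1):

iconv -f ISO-8859-1 -t UTF-8 site.txt > site_utf8.txt
cat site_utf8.txt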
In Python, too, it doesn't find all the words:
import re
import subprocess
from bs4 import BeautifulSoup

def function(url):
    # Download the page with curl and capture its stdout
    p = subprocess.Popen(["curl", url], stdout=subprocess.PIPE)
    output, err = p.communicate()
    print output
    soup = BeautifulSoup(output)
    return soup.body.find_all(text=re.compile('common_word'))
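From what I understand, decoding the raw bytes before handing them to BeautifulSoup should avoid any charset guessing. A sketch of that idea (function_decoded is a name I made up, and the hard-coded charset is an assumption that the server's declaration is correct):

import re
import subprocess
from bs4 import BeautifulSoup

def function_decoded(url):
    p = subprocess.Popen(["curl", url], stdout=subprocess.PIPE)
    output, err = p.communicate()
    # assumption: the charset declared by the server (iso-8859-1) is right
    text = output.decode('iso-8859-1')
    soup = BeautifulSoup(text)
    return soup.body.find_all(text=re.compile('common_word'))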
I also tried urllib2, but had no success.
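What I attempted there looked roughly like this (a reconstructed sketch; fetch is a placeholder name, and pulling the charset out of the Content-Type header is my assumption about the proper way to do it):

import urllib2

def fetch(url):
    resp = urllib2.urlopen(url)
    # fall back to the charset from the question if the header omits it
    charset = resp.headers.getparam('charset') or 'iso-8859-1'
    return resp.read().decode(charset)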
What am I doing wrong?