Word count from web text document results in 0

Question

I tried the Python code from Rasha Ashraf's article "Scraping EDGAR with Python". The article uses urllib2, which does not exist in Python 3, so I changed it to urllib.request.

I can retrieve the EDGAR page below. However, the word counts come out as 0 no matter how I change the code. Please help me fix this problem. FYI, I checked the page manually: "ADDRESS", "TYPE", and "transaction" occur 5, 9, and 49 times respectively. Nevertheless, my script reports 0 for all three words.

Here is Rasha Ashraf's Python code with my changes (only the urllib part and the URL). The original URL points to a very large document, so I switched to a simpler page.

import time
import csv
import sys

CIK = '0001018724'
Year= '2013'
string_match1= 'edgar/data/1018724/000112760220028651/0001127602-20-028651.txt'
url3= 'http://www.sec.gov/Archives/'+string_match1

import urllib.request
 
response3= urllib.request.urlopen(url3)
#output = response3.read()
#print(output)
words=  ['ADDRESS','TYPE', 'transaction']
count= {}
for elem in words:
    count[elem]= 0
    
for line in response3:
    elements= line.split()
    for word in words:
       count[word]= count[word] + elements.count(word)

print (CIK)
print (Year)
print (url3)
print (count)

=> The output of my code so far

0001018724
2013
http://www.sec.gov/Archives/edgar/data/1018724/000112760220028651/0001127602-20-028651.txt
{'ADDRESS': 0, 'TYPE': 0, 'transaction': 0}

You need to clarify: when you say you are looking for, say, "transaction", do you mean the word itself or the string? As a standalone word it appears only once; the other 48 appearances are part of tags like `transactionTimeliness`. – Jack Fleeting, Nov 12 '20 at 22:42
Thank you for your comment. I mean only the standalone word. I have 2 years of Python coding experience as a master's student, but this is my first time analyzing text from a web page, so I was a little confused. Thank you for correcting me. – Jason SJ Yim, Nov 12 '20 at 23:06
Well, this leads you into a separate issue (linguistics, really) of defining the word "word". The string `TYPE` does appear 9 times, as you said, but unlike "transaction" it never appears entirely standalone: 6 times it is part of a compound word like `documentType` (which clearly doesn't count), twice it is followed by a colon (`TYPE:`), and once it is inside a tag (``). Does either or both of these count? This will lead you into an endless maze of rules about how to tokenize strings, etc. It's an interesting topic in NLP, but not one you can solve here... – Jack Fleeting, Nov 12 '20 at 23:18
It seems to me that the author's original Python code counts every word regardless of its form, because the original URL in the code (not my changed URL) points to a 10-K report with an almost endless number of lines. Sorry for my limited understanding of this field. At this point, the first thing I'd like to know is why the code returns 0 for every word count. The author's word list contains 'anticipate', 'believe', 'uncertain' and so forth; I just made a short word list of my own for my new URL. – Jason SJ Yim, Nov 12 '20 at 23:45
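
The distinction raised in the comments above can be made concrete. This is a minimal sketch, assuming that a "standalone word" means a match delimited by regex word boundaries (`\b`), which is only one of many possible definitions. The sample text and the helper names `substring_count` and `standalone_count` are made up for illustration and are not taken from the filing.

```python
import re

def substring_count(text, term):
    """Count non-overlapping occurrences of term anywhere in text, ignoring case."""
    return text.lower().count(term.lower())

def standalone_count(text, term):
    """Count occurrences of term delimited by regex word boundaries, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up sample text (an assumption for illustration), not the actual filing
sample = "TYPE: 4 <documentType>4</documentType> one transaction <transactionCode>S</transactionCode>"

for term in ("type", "transaction"):
    print(term, substring_count(sample, term), standalone_count(sample, term))
# type 3 1
# transaction 3 1
```

Under this definition `TYPE:` counts as a standalone match because the colon is not a word character, while `documentType` and `transactionCode` do not. A different rule (for example, requiring surrounding whitespace) would give different numbers, which is exactly the ambiguity described above.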

score 0 · Accepted Answer · answered Nov 12 '20 at 23:53

To get the correct count of the number of times each of your 3 strings (not words!) appears in the filing, try something like this:

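import requests

url = "http://www.sec.gov/Archives/edgar/data/1018724/000112760220028651/0001127602-20-028651.txt"
req = requests.get(url)

# Search strings are lowercase; the filing is lowercased below so matching is case-insensitive
words = ['address','type','transaction']
filing = req.text  # response body decoded to str
for word in words:
    # str.count counts non-overlapping substring occurrences, not standalone words
    print(word,': ',filing.lower().count(word))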

Output:

address :  5
type :  9
transaction :  49
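
As for why the code in the question printed zeros, the likely cause is a type mismatch: in Python 3, iterating over the object returned by `urllib.request.urlopen` yields `bytes` lines, so `line.split()` produces `bytes` tokens, and `elements.count('ADDRESS')` compares `bytes` against `str`, which never match. Below is a minimal sketch of the question's token-based loop with a decoding step added. It uses made-up byte lines in place of the network response; the helper name `count_tokens` and the UTF-8 encoding are assumptions.

```python
def count_tokens(lines, words, encoding="utf-8"):
    """Count whitespace-delimited tokens equal to each word in an iterable of bytes lines."""
    counts = {word: 0 for word in words}
    for raw in lines:
        # Decode bytes to str so the tokens can equal the str search terms
        tokens = raw.decode(encoding, errors="replace").split()
        for word in words:
            counts[word] += tokens.count(word)
    return counts

# Made-up lines standing in for the HTTP response body (illustrative assumption)
fake_response = [b"BUSINESS ADDRESS here\n", b"one transaction and TYPE\n"]

# Without decoding, a bytes token never equals a str term
print([b"ADDRESS"].count("ADDRESS"))  # 0

print(count_tokens(fake_response, ["ADDRESS", "TYPE", "transaction"]))
# {'ADDRESS': 1, 'TYPE': 1, 'transaction': 1}
```

This approach matches whole whitespace-delimited tokens only and is case-sensitive, so `TYPE:` or `transactionTimeliness` would not be counted. Its totals will therefore differ from the substring counts shown above.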

answered Nov 12 '20 at 23:53

Jack Fleeting


It worked marvelously. Thank you for saving my day. As you recommend, I will be careful with the terms "words" and "strings", and I will study the difference and the topic you mentioned in your 2nd comment. Before asking this question, I tried to solve this problem for two days and, given my limited ability and experience, felt frustrated. Now that you have shed light on the matter, I will keep going, and I will definitely come back to the topic from your 2nd comment. Once again, thank you so much. – Jason SJ Yim Nov 13 '20 at 00:31
@JasonSJYim Glad it worked for you! If we're done, don't forget to accept the answer, please. – Jack Fleeting Nov 13 '20 at 02:11
Uhhh. How can I accept the answer? I've only been reading Stack Overflow for 2 years, and this is my first time taking part. Would you please tell me how to accept your answer? – Jason SJ Yim Nov 13 '20 at 02:38
@JasonSJYim "To mark an answer as accepted, click on the check mark beside the answer to toggle it from greyed out to filled in." – Jack Fleeting Nov 13 '20 at 03:00
Thank you for telling me how to accept an answer. I did it. This is the first day I posted a question and got answers, and now I also know how to mark an answer as accepted. What a day! I owe it all to you, Jack! – Jason SJ Yim Nov 13 '20 at 05:33
Jack, you already helped another person with the same issue here: https://stackoverflow.com/questions/57103640/text-scrapping-from-edgar-10k-amazon-code-not-working. I prefer your solution over there because the result format is just what I need. That person and I ran into the same situation and both got help from you. – Jason SJ Yim Nov 13 '20 at 21:02
