
I am trying to extract the business descriptions of multiple firms from their 10-K reports using the R package edgar, specifically its getBusinDescr function.

Because I want the business descriptions of many firms (1000+), I created a vector of the firms' CIK identifiers and passed it to the function. The problem is that R downloads the filings I want (10-K reports) without issue, but then fails to extract the section I am interested in. The extraction stopped at 61% for year 2007 and at 31% for year 2011. For year 2010, however, it completed at 100%.

To sum up, the extraction works for certain years but not for others. I would like to know where this error comes from. Is it a matter of data availability (i.e., certain firms do not have a business description for some years), or is it some incidental error caused by repeated scraping attempts? Please help me interpret and, hopefully, deal with the error.

Just fyi, I am using the latest R on my Mac.

The code I use is:

# using edgar package on R
library(edgar)

# cikvector is a vector of multiple firms' identifier codes

# for year 2007
filings.BusinDes.2007 <- getBusinDescr( cik.no=cikvector, filing.year=2007)
# for year 2008
filings.BusinDes.2008 <- getBusinDescr( cik.no=cikvector, filing.year=2008)

The ideal results are as follows:

Downloading fillings. Please wait...              
100%
Extracting 'Item 1' section...
100%
Business descriptions are stored in 'Business descriptions text' directory.

The error I encounter is as follows (Downloading the whole reports is done without any problem, though):

Downloading fillings. Please wait...     
100%
Extracting 'Item 1' section...                                                                                                             
|  31%Error in (grep("<DOCUMENT>", filing.text, ignore.case = TRUE)[1]):    (grep("</DOCUMENT>",  : 
NA/NaN argument
Rouje

1 Answer


I got the same error, and found that simply commenting out the problematic lines in the function's code fixed the problem.

So, you need to edit the function getBusinDescr from the edgar package. One easy way to do this in RStudio is to run:

fix(getBusinDescr)

Next, you need to find the following lines:

    filing.text <- filing.text[(grep("<DOCUMENT>", filing.text, 
                                 ignore.case = TRUE)[1]):(grep("</DOCUMENT>", filing.text, 
                                                               ignore.case = TRUE)[1])]

and add a # at the beginning of each line to remove them from the function (i.e. comment them out). Then, when you run the function it should work fine.
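A less invasive variant of the same idea, as a minimal sketch: rather than dropping the subsetting step entirely, keep it but apply it only when both tags are actually found. Here `trim_to_document` is a hypothetical helper name, not part of the edgar package, and the sketch assumes `filing.text` is a character vector with one element per line, as in the lines quoted above.

```r
# Hypothetical helper (not part of edgar): trim a filing to its first
# <DOCUMENT> ... </DOCUMENT> block, or return it unchanged if a tag is missing.
trim_to_document <- function(filing.text) {
  start <- grep("<DOCUMENT>", filing.text, ignore.case = TRUE)[1]
  end   <- grep("</DOCUMENT>", filing.text, ignore.case = TRUE)[1]
  # grep() returns integer(0) when nothing matches, so [1] yields NA,
  # and NA:NA is what raises "NA/NaN argument".
  if (is.na(start) || is.na(end) || end < start) {
    return(filing.text)
  }
  filing.text[start:end]
}

# Made-up example inputs, one with tags and one without.
with_tags    <- c("header", "<DOCUMENT>", "Item 1. Business", "</DOCUMENT>", "footer")
without_tags <- c("header", "Item 1. Business", "footer")

trimmed   <- trim_to_document(with_tags)     # only the tagged block is kept
untouched <- trim_to_document(without_tags)  # returned as-is, no error
```

The same `is.na()` guard could presumably be pasted in place of the three lines inside the editor opened by fix(getBusinDescr), so that filings which do contain the tags are still trimmed as before.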

The problem began for me a few weeks ago, and I am sure the function ran perfectly before then on the exact same underlying data. My best guess as to why this happened is that the SEC changed their markup slightly, so that the "<DOCUMENT>" and "</DOCUMENT>" tags no longer appear in some of the raw files. When a tag is missing, grep() finds no match, the [1] index returns NA, and the NA:NA range fails with the "NA/NaN argument" error. I haven't tested this theory, but it would make sense.
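One way the theory could be checked, as a rough sketch: scan the filings that were already downloaded and list those lacking either tag. `find_untagged_filings` is a hypothetical helper, and the directory passed to it is an assumption; you would need to point it at wherever edgar saved the full-text filings on your machine. The demo below uses two made-up files in a temporary directory so that it runs on its own.

```r
# Hypothetical helper: return the paths of .txt filings under `filings.dir`
# that lack a <DOCUMENT> tag, a </DOCUMENT> tag, or both.
find_untagged_filings <- function(filings.dir) {
  files <- list.files(filings.dir, pattern = "\\.txt$", recursive = TRUE,
                      full.names = TRUE)
  has_both_tags <- vapply(files, function(f) {
    txt <- readLines(f, warn = FALSE)
    # useBytes = TRUE avoids warnings on filings with unusual encodings.
    any(grepl("<DOCUMENT>", txt, ignore.case = TRUE, useBytes = TRUE)) &&
      any(grepl("</DOCUMENT>", txt, ignore.case = TRUE, useBytes = TRUE))
  }, logical(1))
  files[!has_both_tags]
}

# Self-contained demo on two made-up files in a temporary directory.
demo.dir <- file.path(tempdir(), "filings_demo")
dir.create(demo.dir, showWarnings = FALSE)
writeLines(c("<DOCUMENT>", "Item 1. Business", "</DOCUMENT>"),
           file.path(demo.dir, "tagged.txt"))
writeLines("Item 1. Business", file.path(demo.dir, "untagged.txt"))

untagged <- find_untagged_filings(demo.dir)
```

If the returned list is non-empty for the failing years and empty for a year that works, that would support the missing-tag explanation; if it is empty everywhere, the cause lies elsewhere.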