
I am using NSRegularExpression to pick out image URLs from HTML. However, when trying to actually use it, I get the following error:

*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSRegularExpression enumerateMatchesInString:options:range:usingBlock:]: nil argument'

I have looked at other Stack Overflow answers on this error, but the question I found uses an NSMatchingOption and I do not, and its answer does not explain what is wrong in my situation.

Here is the code that is crashing:

NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(<img\\s[\\s\\S]*?src\\s*?=\\s*?['\"](.*?)['\"][\\s\\S]*?>)+?" options:NSRegularExpressionCaseInsensitive error:nil];
NSString *source = [NSString stringWithContentsOfURL:[NSURL URLWithString:object[@"link"]] encoding:NSUTF8StringEncoding error:nil];

NSArray *imageResults = [regex matchesInString:source options:0 range:NSMakeRange(0, source.length)];
NSURL *link = [imageResults.firstObject URL];
UIImage *img = [UIImage imageWithData:[NSData dataWithContentsOfURL:link]];
if (img)
{
    [self.images setObject:img forKey:object[@"link"]];
    dispatch_async(dispatch_get_main_queue(), ^{
        cell.imageView.image = img;
        [cell layoutSubviews];
    });
}

The crash itself occurs on the line where imageResults is instantiated.

Does anyone know what is wrong with this code?

erdekhayser
  • Is `source` good? Have you examined it, NSLog'ed it? Is it nil? – zaph Oct 03 '14 at 02:22
  • @Zaph `source` is good for every URL except the one that crashes. The URL that currently triggers the crash (there have been different ones in the past) is http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHxbyTylS9C3u-udvR_GxAPKlwMZg&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778622450743&ei=qAouVNqON4btgAfbzICACw&url=http://espn.go.com/new-york/nfl/story/_/id/11626444/michael-vick-not-perfect-pick-new-york-jets – erdekhayser Oct 03 '14 at 02:33
  • The encoding is incorrect, it should be `NSISOLatin1StringEncoding`. From the header: "charset=iso-8859-1". This is why a network analyzer such as Charles Proxy is invaluable – zaph Oct 04 '14 at 12:49
  • @Zaph Is there a way to determine the encoding first? – erdekhayser Oct 04 '14 at 12:52
  • Yes, use Charles and look at the Response Headers, which show "Content-type text/html; charset=iso-8859-1". Looking up iso-8859-1 will lead you to "Latin1String" and then to the `NSString` encoding `NSISOLatin1StringEncoding`. Note that by adding the `error` parameter you will get an encoding error message: "The file “url” couldn’t be opened using text encoding Unicode (UTF-8)." That is why ignoring errors is a bad idea. Note also that some data that is valid `NSISOLatin1StringEncoding` may also be valid `NSUTF8StringEncoding`. – zaph Oct 04 '14 at 13:41
  • There are other methods to obtain URL data that will also provide the Response. – zaph Oct 04 '14 at 13:48
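
As a rough illustration of the last two comments, the sketch below picks a string encoding from the charset reported in the response instead of hard-coding one. `EncodingForResponse` and `StringFromBody` are hypothetical helper names, not part of the original code, and the sketch assumes the body and response were obtained from some loading API that exposes an `NSURLResponse` (for example an `NSURLSession` data task). The Latin-1 fallback is an assumption chosen because every byte sequence decodes under it.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: map the charset named in the response
// (e.g. "iso-8859-1") to an NSStringEncoding. Returns `fallback`
// when the response names no charset or names an unknown one.
static NSStringEncoding EncodingForResponse(NSURLResponse *response,
                                            NSStringEncoding fallback) {
    NSString *name = response.textEncodingName;
    if (name.length == 0) {
        return fallback;
    }
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return fallback;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}

// Hypothetical helper: decode a response body using the encoding the
// server declared. Returns nil if there is no body or decoding fails.
static NSString *StringFromBody(NSData *body, NSURLResponse *response) {
    if (body == nil) {
        return nil;
    }
    NSStringEncoding encoding =
        EncodingForResponse(response, NSISOLatin1StringEncoding);
    return [[NSString alloc] initWithData:body encoding:encoding];
}
```

The caller still has to check the result for nil before handing it to `NSRegularExpression`, since decoding can fail when the declared charset does not match the actual bytes.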

2 Answers

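There is a problem: matchesInString:options:range: returns an array of NSTextCheckingResult objects, so the URL text has to be extracted from each result's capture-group range.

Example (error checking still needs to be added):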

NSString *regExp = @"<img\\s+src=[\"']([^\"']+)";
NSError *error;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:regExp options:NSRegularExpressionCaseInsensitive error:&error];

NSString *source = @"leading<img src=\"news.google.com/news/…\" alt=\"Smiley face\">more";

NSArray *matchResults = [regex matchesInString:source options:0 range:NSMakeRange(0, source.length)];
NSTextCheckingResult *result0 = matchResults[0];
NSRange imgRange = [result0 rangeAtIndex:1];
NSLog(@"imgRange: %@, '%@'", NSStringFromRange(imgRange), [source substringWithRange:imgRange]);

Output:

imgRange: {17, 22}, 'news.google.com/news/…'

See: ICU User Guide Regular Expressions
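
For illustration, here is one way the missing error checking might look. This is only a sketch: `FirstImageSource` is a hypothetical helper name, not something from the question or from Foundation. It guards against the case reported in the exception, where the string handed to the regular expression is nil, and against pages that contain no match at all.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the src value of the first <img> tag
// in `html`, or nil when `html` is nil/empty or nothing matches.
static NSString *FirstImageSource(NSString *html) {
    // Passing a nil string to the NSRegularExpression matching methods
    // raises NSInvalidArgumentException ("nil argument"), so bail out early.
    if (html.length == 0) {
        return nil;
    }

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<img\\s+src=[\"']([^\"']+)"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"regex error: %@", error);
        return nil;
    }

    NSTextCheckingResult *match =
        [regex firstMatchInString:html options:0 range:NSMakeRange(0, html.length)];
    if (match == nil) {
        return nil;  // no <img src=...> in this page
    }

    // Capture group 1 holds the text between the quotes.
    NSRange srcRange = [match rangeAtIndex:1];
    if (srcRange.location == NSNotFound) {
        return nil;
    }
    return [html substringWithRange:srcRange];
}
```

With the sample string above, `FirstImageSource(source)` would return the same text as the logged capture group, and `FirstImageSource(nil)` returns nil instead of crashing.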

zaph
  • This solves a problem I didn't even know I had! But the app still crashes the way it did before. – erdekhayser Oct 03 '14 at 03:08
  • Please supply the shortest source string that causes the crash. NSLog the crashing source string, then add it to the test code. Keep making it shorter until it is as short as possible and still causes the crash. Then post that in your question. For debugging one needs to reduce the problem as much as possible to locate the error. – zaph Oct 03 '14 at 03:13
  • I ended up just wrapping the line that is crash-prone in a try-catch, and now everything loads fine (except for the images for a few pages, which is fine). Your regex works better than the original one actually. Thanks for helping me out! – erdekhayser Oct 03 '14 at 03:19
  • Wrapping in a try/catch just hides the error. If you don't understand the error the code will likely produce incorrect results occasionally. Would you be OK if another developer did that? Further: try/catch is only used for unrecoverable programming errors in Objective-C. – zaph Oct 03 '14 at 03:22
  • So you are saying that this is not a solution, just a temporary workaround that I need to replace? – erdekhayser Oct 03 '14 at 03:23
  • Yes. Find the shortest source string that causes the problem and post it, you should get help--but not from me until tomorrow. ;-) – zaph Oct 03 '14 at 03:25
  • About try/catch: in Objective-C it is not guaranteed to work and is known not to work correctly across stack frames. – zaph Oct 03 '14 at 03:27
  • Apparently, there was a problem with loading the source from certain URLs that I did not catch before. The URLs follow this pattern: `http://news.google.com/news/url?...(lots of chars and numbers)...url=http://espn.go.com/...`. I believe this causes it to load a new page, but gives nothing for my source variable to use. – erdekhayser Oct 03 '14 at 20:06
  • Full URL here: http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHVXXEN0DG2pblU2_FBFfeS3klRVw&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778623354837&ei=ugAvVMDVIsLGwAHPtYG4CA&url=http://espn.go.com/new-york/nba/story/_/id/11634537/cleveland-cavaliers-open-regularly-resting-lebron-james-season – erdekhayser Oct 03 '14 at 20:06

This answer is specific to the question and the URL provided in a comment on the previous answer. It assumes that there are multiple image URLs and that all of them are wanted.
Note 1: The HTML encoding is NSISOLatin1StringEncoding.
Note 2: The regular expression was changed so that "src=" no longer has to be the first attribute of the tag.

NSString *urlString = @"http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHVXXEN0DG2pblU2_FBFfeS3klRVw&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778623354837&ei=ugAvVMDVIsLGwAHPtYG4CA&url=http://espn.go.com/new-york/nba/story/_/id/11634537/cleveland-cavaliers-open-regularly-resting-lebron-james-season";
NSURL *url = [NSURL URLWithString:urlString];
NSError *error = nil;

// The page is served as iso-8859-1, so decode it as Latin-1 rather than UTF-8.
NSString *source = [NSString stringWithContentsOfURL:url encoding:NSISOLatin1StringEncoding error:&error];
if (source.length) {
    // Lazily skip any attributes before src=, then capture the quoted value.
    NSString *regExp = @"<img.*?\\s+src=[\"']([^\"']+)";
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:regExp options:NSRegularExpressionCaseInsensitive error:&error];
    NSRange matchRange = NSMakeRange(0, source.length);

    [regex enumerateMatchesInString:source
                            options:0
                              range:matchRange
                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
         // Capture group 1 is the src value.
         NSRange imgRange = [result rangeAtIndex:1];
         NSLog(@"imgRange: %@, '%@'", NSStringFromRange(imgRange), [source substringWithRange:imgRange]);
    }];

}
else {
    // source is nil or empty: report why the load or decode failed.
    NSLog(@"error: %@", error);
}

Output:

imgRange: {18793, 68},  'http://a.espncdn.com/espncitysites/newyork/prod/assets/sub_ny_r3.png'
imgRange: {19784, 172}, 'http://ad.Doubleclick.net/ad/espn.local.newyork.com/nba;pgtyp=story;sp=nba;tm=cle;pl=1015;pl=1966;pl=2028618;pl=215;pl=2419;objid=11634537;col=mcmenamin_dave;sz=150x45,1x1;'
imgRange: {22162, 182}, 'http://ad.Doubleclick.net/ad/espn.local.newyork.com/nba;pgtyp=story;sp=nba;tm=cle;pl=1015;pl=1966;pl=2028618;pl=215;pl=2419;objid=11634537;col=mcmenamin_dave;sz=1280x946,200x800,1x1;'
imgRange: {23470, 186}, 'http://ad.Doubleclick.net/ad/espn.local.newyork.com/nba;pgtyp=story;sp=nba;tm=cle;pl=1015;pl=1966;pl=2028618;pl=215;pl=2419;objid=11634537;col=mcmenamin_dave;sz=728x90,970x66,924x50,1x1;'
imgRange: {29706, 36},  'http://a.espncdn.com/icons/in_15.png'
imgRange: {30352, 103}, 'http://a.espncdn.com/media/motion/2014/1003/dm_141003_nba_schwartz_bron/dm_141003_nba_schwartz_bron.jpg'
imgRange: {31339, 37},  'http://a.espncdn.com/icons/video2.png'
imgRange: {34098, 65},  'http://a.espncdn.com/photo/2014/1001/nba_a_lebron01jr_300x300.jpg'
imgRange: {35987, 55},  'http://a.espncdn.com/i/columnists/windhorst_brian_m.jpg'
imgRange: {38249, 79},  'http://a.espncdn.com/combiner/i?img=/photo/2014/0926/nba_a_james_mb_203x114.jpg'
imgRange: {41787, 36},  'http://a.espncdn.com/icons/in_15.png'
imgRange: {42698, 87},  'http://a.espncdn.com/combiner/i?img=%2fi%2fcolumnists%2fmcmenamin_dave_35.jpg&w=35&h=48'
imgRange: {48148, 68},  'http://a.espncdn.com/photo/2014/1002/nba_garnett_wiggins_203x114.jpg'
imgRange: {48834, 33},  'http://a.espncdn.com/icons/in.gif'
imgRange: {50157, 181}, 'http://ad.Doubleclick.net/ad/espn.local.newyork.com/nba;pgtyp=story;sp=nba;tm=cle;pl=1015;pl=1966;pl=2028618;pl=215;pl=2419;objid=11634537;col=mcmenamin_dave;sz=300x600,300x250,1x1;'
imgRange: {51105, 45},  '/photo/2014/1003/mlb_g_martinez_b1_110x62.jpg'
imgRange: {51801, 41},  '/photo/2014/1003/ny_u_geno2_js_110x62.jpg'
imgRange: {52491, 43},  '/photo/2014/1002/nhl_g_fleury_b3_110x62.jpg'
imgRange: {53201, 44},  '/photo/2014/1002/ny_g_betances_js_110x62.jpg'
imgRange: {53902, 42},  '/photo/2014/1003/ny_g_murphy_js_110x62.jpg'
imgRange: {54986, 66},  'http://a.espncdn.com/i/Integrators/shop.lebron.welcome.300x100.jpg'
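
Several of the captured values above are site-relative paths (for example '/photo/2014/1003/ny_u_geno2_js_110x62.jpg'), so they cannot be loaded as they stand. A minimal sketch of resolving them follows. `AbsoluteImageURL` is a hypothetical helper name, and the sketch assumes the caller knows the URL of the page that was actually served (here the espn.go.com article rather than the Google redirect link).

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: turn a captured src value into an absolute URL.
// Absolute src values pass through unchanged; relative ones are
// resolved against the URL of the page they came from.
static NSURL *AbsoluteImageURL(NSString *src, NSURL *pageURL) {
    if (src.length == 0 || pageURL == nil) {
        return nil;
    }
    // URLWithString:relativeToURL: returns nil for strings it cannot parse.
    NSURL *resolved = [NSURL URLWithString:src relativeToURL:pageURL];
    return resolved.absoluteURL;
}
```

Inside the enumeration block, the substring for capture group 1 could be passed through a helper like this before any image data is requested, with nil results simply skipped.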
zaph