0

I'm stuck with a problem. I'm not an Excel expert.

I found some threads how to use regex in Excel, like:

But I don't know how to filter several things out of a data source to a newly created column.

Example:

The data source is mostly HTML-sourcecode in one cell.

I built RegEx for matching links (ahref), youtube, facebook videos, infogr.am links and other stuff.

I want to:

Extract/filter the links from the html to a new column, like:

New column "Links"

 <a href="url">link text</a> \n
 <a href="url">link text</a> \n

(every filtered link a new line in the cell)

Optional: paste another column next to it, which counts the paragraphs/links (counting the lines).

Then the video links (one line per filtered data again) + optional counting column.

And so on.

I found Openrefine. Which seems very nice and professional. But I didn't find out how to do it there.

I don't have to work with Excel, later the results can be converted to *.csv - it will be data for the database.

Though I think Excel is nice for the first steps, as the optional counting paragraph thing.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459

1 Answers1

0

To create a new column from the HTML with each links in the same cell, separated by a newline:

forEach(value.parseHtml().select('a'), e, e).join('\n')

You should see something like this:

enter image description here

Or if you want only the text of the links:

forEach(value.parseHtml().select('a'), e, e.htmlText()).join('\n')

Or if you want only the urls:

forEach(value.parseHtml().select('a'), e, e.htmlAttr('href')).join('\n')

To create a new column from the HTML that count the number of links:

forEach(value.parseHtml().select('a'), e, e).length()

There is no picture or video in your sample, so I can't show you how to find them. To extract the infogr.am links :

forEach(value.parseHtml().select('iframe[src*=infogr.am]'), e, e.htmlAttr('src')).join('\n')

OpenRefine uses JSoup selectors, that are very similar to CSS selectors. Simply use the right selector inside the .select() function.

Ettore Rizza
  • 2,800
  • 2
  • 11
  • 23
  • Thanks again! It works great and i think i understand it a bit. In all cases i just need the link/image/video-URL. Filtered .... with that: forEach(value.parseHtml().select('img'), e, e.htmlAttr('src')).join('\n') Works nice. Image Count with that: forEach(value.parseHtml().select('umg'), e, e.htmlAttr('src')).length() The Link Filter of course also contains the image and videos, but that's fine the link counter is to be expected to count all clickable links. – timmyfromspace Oct 26 '18 at 13:46
  • Looks like this in HTML: https://pastebin.com/FikRy1nS My RegEx: \ – timmyfromspace Oct 26 '18 at 14:08