I tried to extract the URLs from a Firefox bookmarks file (the exported JSON version) with these attempts:

cat boo.json | grep '\"uri\"\:\"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}\"'
cat boo.json | grep '"uri"\:"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}'
cat boo.json | grep '"uri"\:"^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}"'

I tried a few others too, but they all fail. The JSON bookmarks file looks like this:

.........."uri":"http://www.google.com/?"......"uri":"http://stackoverflow.com/"

So, the output should be like this:

"uri":"http://www.google.com/?"
"uri":"http://stackoverflow.com/"

What is missing from my regular expression?

UPDATE:

URLs in the bookmarks file end with one of these special characters:

/, ex: "uri":"http://stackoverflow.com/"

", ex: "uri":"http://stackoverflow.com/questions/13148794/parsing-firefox-bookmarks-using-regular-expression"

}, ex: "uri":"https://fr.add-ons.mozilla.com/fr/firefox/bookmarks/"}

With this modified regular expression:

$ egrep -o "(http|https)://([^ ]*).(*\/)"  boo.json

Result:

http://fr.fxfeeds.mozilla.com/fr/firefox/headlines.xml"},{"name":"livemark/siteURI","flags":0,"expires":4,"mimeType":null,"type":3,"value":"http://www.lemonde.fr/"}],"type":"text/x-moz-place-container","children":[]}]},{"index":2,"title":"Tags","id":4,"parent":1,"dateAdded":1344432674984000,"lastModified":1344432674984000,"type":"text/
http://stackoverflow.com/questions/13148794/parsing-firefox-bookmarks-using-regular-expression","charset":"UTF-8"},{"index":29,"title":"adrusi/
http://stackoverflow.com/
...

But this still doesn't give me only the URLs.

SIFE
  • I'm unfamiliar with JSON format but from the very small snippet you posted it LOOKS like it'd be a very brief, simple awk script to pull out the URLs. If you posted a bit more sample input (say a 10-line file) and expected output, I'd take a look. – Ed Morton Dec 08 '12 at 09:03

3 Answers

Have you tried JSON.sh? It works great!

https://github.com/dominictarr/JSON.sh

Nicholas Terry
  • What do you mean it doesn't work? Did you use it the way the README says to? It'll still need some parsing, but this will get you a standard output for any JSON from which to extract values – Nicholas Terry Oct 31 '12 at 16:21

I use this regex to extract URLs, and it works great:

cat *.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort | uniq
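
For the exact output asked for in the question (each "uri":"..." pair on its own line), a minimal sketch along the same lines might look like this. It assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX), and the function name extract_uri_pairs is made up for illustration. The original attempts most likely fail because grep without -E treats + and {2,3} as literal characters, the ^ in the middle of the pattern does not anchor anything, and without -o the whole line is printed. Excluding the closing quote with [^"]* also keeps the match from running past the URL, which is what happens with [^ ]*.

```shell
# Hypothetical helper: print each "uri":"..." pair on its own line.
# -o prints only the matched text; -E enables extended regex syntax.
# [^"]* stops at the closing quote, so the match cannot run past the URL.
extract_uri_pairs() {
    grep -oE '"uri":"https?://[^"]*"'
}

# Sample input modelled on the snippet in the question (not a real export).
printf '%s\n' '{"uri":"http://www.google.com/?"},{"uri":"http://stackoverflow.com/"}' \
    | extract_uri_pairs
```

On a real export this would be run as extract_uri_pairs < boo.json. A URL that contains an escaped quote would be cut short, so a real JSON parser is the safer choice when that matters.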

BuGaU0

Jeff Atwood posted an article about the problem with URLs. With his proposed regular expression, I managed to extract all the URLs from my Firefox bookmarks:

egrep -o "\(?\bhttp://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]"  my-bookmark.json
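
Note that the pattern as written only matches http:// links, while the question also shows an https:// bookmark. A possible variant, sketched under the assumption of GNU grep (\b is a GNU extension) and with the made-up function name extract_urls, accepts both schemes and drops duplicates:

```shell
# Hypothetical helper: print every distinct http/https URL found on stdin.
# "https?" makes the trailing "s" optional, so both schemes match.
# LC_ALL=C keeps the sort order byte-wise and independent of the locale.
extract_urls() {
    grep -oE "\(?\bhttps?://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]" \
        | LC_ALL=C sort -u
}

# Sample input modelled on the snippets in the question (not a real export).
printf '%s\n' '{"uri":"http://stackoverflow.com/"},{"uri":"https://fr.add-ons.mozilla.com/fr/firefox/bookmarks/"},{"uri":"http://stackoverflow.com/"}' \
    | extract_urls
```

For a real file this would be extract_urls < my-bookmark.json.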
SIFE