
The following command extracts all the domain names from a website crawled by httrack and stores them in the variable $domain.

domain=$(httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)')

The value of $domain looks like this...

googlesyndication.com facebook.com facebook.com ilovefreestuff.com ilovefreestuff.com facebook.com facebook.com ilovefreestuff.com ilovefreestuff.com peadig.com facebook.net ilovefreestuff.com w3.org ilovefreestuff.com yoast.com ilovefreestuff.com

I have my database set up, and this command works perfectly.

mysql -pPASSWORD -e "INSERT INTO domains.domains (domains) VALUES ($domain)"

I’m trying to insert each individual domain within the variable $domain into my MySQL database on the fly from the httrack data stream, all in one combined command line. So in my mind it should look something like the command below. Unfortunately, this doesn’t work: I get no output, just another bash prompt.

domain=$(httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)') | mysql -pPASSWORD -e "INSERT INTO domains.domains (domains) VALUES ($domain)"

I’m not sure how to cut up the $domain variable into individual domains so I can enter one domain per MySQL data cell, and I’m not certain how to pipe the result to MySQL given that the command is a data stream. Maybe I need a for loop and a cut command?


1 Answer


Maybe I need a for loop and a cut command?

You said it. It won’t be a one-liner, but this should work:

domain=$(httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)')

# word splitting on the unquoted $domain yields one domain per iteration
for single_domain in $domain
do
  # single quotes around the value so MySQL treats it as a string literal
  mysql -pPASSWORD -e "INSERT INTO domains.domains (domains) VALUES ('$single_domain')"
done
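
If the crawl runs for a long time and you would rather insert each domain as soon as it is discovered, instead of waiting for $domain to be filled, a rough variant (an untested sketch; it assumes GNU grep for --line-buffered and the same credentials and table as above) pipes the stream straight into a while read loop:

# untested sketch: insert each domain as soon as grep emits it
httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" |
  grep --line-buffered -iEo '[[:alnum:]-]+\.(com|net|org)' |
  while read -r single_domain; do
    mysql -pPASSWORD -e "INSERT INTO domains.domains (domains) VALUES ('$single_domain')"
  done

The --line-buffered flag keeps grep from batching its output, so rows show up in the table while httrack is still crawling.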
  • Yes, this works. The problem with this solution is that the httrack data stream is very large, and the httrack command can run for hours or days on some websites. Sometimes it never completes. Therefore I was hoping to import the domains as it discovers them, in real time. Is there a better way for me to achieve my desired result? – Wyatt Jackson May 25 '14 at 17:13
  • @user3672303 Fair enough. You basically want the inserts to happen on each domain in realtime & not in bulk, correct? – Giacomo1968 May 25 '14 at 17:15
  • Yes, this is the solution I'm looking for. – Wyatt Jackson May 25 '14 at 17:18
  • @user3672303 Maybe the best solution is to save the data to a text file or a CSV in a crawling stage & then have another stage where that crawled data is processed. – Giacomo1968 May 25 '14 at 17:20
  • FWIW: can be written in one line, too ;-) – Kaii May 25 '14 at 17:23
  • @Kaii Fair enough. Please post it as a one liner to help the original poster & if it is good, it will be up voted. – Giacomo1968 May 25 '14 at 17:26
  • This was my initial solution as I simply dumped the result to a text file. Unfortunately, I received so much data mostly in the form of duplicate domains that it was difficult to sort/unique. Some of these text files were 50GB or more. Therefore my solution was to setup a mysql database with a unique cell and import on the fly so I wouldn't have to deal with so much data. However, I guess I could simply import my data from txt file into a unique mysql data cell after httrack is complete to remove duplicates. – Wyatt Jackson May 25 '14 at 17:28
  • @user3672303 If you have duplicates it might be easier for you to just do post-process cleanup with the text file to shake out dupes & then insert. There’s no glory in doing this as one single line. And the issues you are facing—and the solutions attempted—are basic engineering concepts. – Giacomo1968 May 25 '14 at 17:38
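
For the duplicate domains discussed in the comments above, a possible sketch (the table and column names follow the question; the VARCHAR length and the CREATE TABLE statement are assumptions) is to give the domains column a UNIQUE key and switch to INSERT IGNORE so repeats are skipped instead of stored:

# assumption: the domains.domains table can be (re)created with a UNIQUE key on the column
mysql -pPASSWORD -e "CREATE TABLE IF NOT EXISTS domains.domains (domains VARCHAR(255) NOT NULL, UNIQUE KEY (domains))"

# INSERT IGNORE silently skips rows that would violate the UNIQUE key
mysql -pPASSWORD -e "INSERT IGNORE INTO domains.domains (domains) VALUES ('facebook.com')"

With that in place, the loop above can use INSERT IGNORE as well, so duplicate domains from the crawl never pile up in the table or in a 50GB text file.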