0

This code searches through website html files and extracts a list of domain names...

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | grep -iEo '[[:alnum:]-]+\.(com|net|org)'

The result looks like this.

  • domain1.com
  • domain2.com
  • domain3.com

I plan to use this code on very large websites, therefore this will generate a very large list of domain names. In addition, the above code generates a lot of duplicate domain names. Therefore, I setup a mysql database with a unique field so duplicates will not be inserted.

Using my limited knowledge of programming I hacked together this line below, but this is not working. When I execute the command, I get no error, just a new command prompt of > and a blinking cursor. I assume I'm not using the correct syntax or methodology, and/or maybe what I want to do is not possible via command line. Any help is much appreciated.

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | domain=“$(grep -iEo '[[:alnum:]-]+\.(com|net|org)’)” | mysql -pPASSWORD -e "INSERT INTO domains.domains (domains) VALUES ($domain)”

And yes, my database name is domains, and my table name is domains, and my field name is domains.

Giacomo1968
  • 25,759
  • 11
  • 71
  • 103
Wyatt Jackson
  • 303
  • 1
  • 2
  • 11
  • Ok, will do, thanks... When I execute the command, I get no error, just a new command prompt of > and a blinking cursor. – Wyatt Jackson May 24 '14 at 22:57
  • Not much to work with. I would organize the code a little different,to create the `insert ....` statements just to echo to std-out. Then, worst case you can cut paste a block of them into a gui-editor. But first, have you done basic tests like `echo "select getdate()" | mysql -pPS -dDB` to confirm the most basic stuff works? (use what ever is the mysql equivalent for `getdate()`). And, yes, this shouldn't be problem to get working. Try searching here for small examples. Good luck. – shellter May 24 '14 at 23:05
  • Ok. After your suggestion, I took out the mysql part for time being, and still just get a new command prompt upon execution. Apparently, the problem may be with my attempt at setting of the domain variable.. httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" | domain=“$(grep -iEo '[[:alnum:]-]+\.(com|net|org)’)” | echo $domain Any idea on what is wrong with this? – Wyatt Jackson May 24 '14 at 23:11
  • best to work like `echo "1_line_Sample_from_httrak_with_domain" | egrep -iEo '[[:alnum:]-]+\.(com|net|org\)'` until you get your regExp working. What you have seems like it should work. So also eliminate odd chars that might be throwing off grep with `tr -d '[:alnum:]-]' smallTestCleanFile`, or the above like `echo "1_line_Sample_from_httrak_with_domain" | tr -d '[:alnum:]-]' | egrep -iEo '[[:alnum:]-]+\.(com|net|org\)'`. Going out for the evening. Good luck! – shellter May 24 '14 at 23:48

1 Answers1

0

Judging from the MySQL syntax for INSERT:

INSERT [LOW_PRIORITY | DELAYED | HIGH_PRIORITY] [IGNORE]
    [INTO] tbl_name
    [PARTITION (partition_name,…)] 
    [(col_name,…)]
    {VALUES | VALUE} ({expr | DEFAULT},…),(…),…
    …

you need to convert the domain names into parenthesized, quoted, comma separated items:

('domain1.com'),('domain2.com'),…

and then attach this list to the end of the INSERT statement you generated.

httrack --skeleton http://www.ilovefreestuff.com -V "cat \$0" |
grep -iEo '[[:alnum:]-]+\.(com|net|org)’ |
sort -u |
sed -e "s/.*/,('&')/" -e '1s/,/INSERT IGNORE INTO domains.domains(domain) VALUES /' |
mysql -pPASSWORD

The sort -u ensures that the names are unique. The first -e to sed converts the contents of a line (e.g. domain1.com) into ,('domain1.com); the second -e removes the comma of the first line (added by the first -e) and replaces it with the INSERT prefix. The IGNORE in the INSERT statement means that if a domain is already in the table, the new entry will be ignored.

Clearly, if the number of domains generated is too large for a valid SQL statement in MySQL, you'll have to do some splitting of the data, but you're likely to be able to process a few thousand domains at a time.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278