
I am trying to collect as much text as possible from web pages in the Uzbek language (for my research). What is the best way to do it?

I found Common Crawl, but I'm not sure whether it's easy to extract text in a specific language from it.

  • Some portals put the language name in the URL, e.g. `../gb/...`, or pass it as a parameter, e.g. `?lang=gb`. They can also keep it in a cookie. The web browser sends the [Accept-Language](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language) header with the language(s) set in the browser's settings, and a portal can use that information too. So every portal can use a different method (see the sketch after these comments). – furas Apr 05 '19 at 13:39
  • Since August 2018 the Common Crawl archives [provide language annotations](http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/), which make it easy to find pages in a specific language. Every month about 300,000 Uzbek pages ([0.01% of all pages](https://commoncrawl.github.io/cc-crawl-statistics/plots/languages)) are crawled. There are samples in [Java](https://github.com/commoncrawl/cc-index-table) and [Python](https://github.com/commoncrawl/cc-pyspark) showing how to extract content by language via SQL and Spark. – Sebastian Nagel Apr 05 '19 at 15:31
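
To illustrate the Accept-Language approach from the first comment: a minimal Python sketch that requests a page with an Uzbek Accept-Language header (the URL is a placeholder, and whether the header is honored depends entirely on the portal):

    import requests

    # Placeholder URL -- substitute a portal that actually serves multiple languages.
    url = "https://example.com/news"

    # Ask for Uzbek first, falling back to English. Many sites ignore this header,
    # so also check URL patterns like /uz/ or ?lang=uz as described above.
    headers = {"Accept-Language": "uz;q=1.0, en;q=0.5"}
    response = requests.get(url, headers=headers)

    print(response.status_code)
    print(response.text[:500])  # first 500 characters of the returned HTML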

2 Answers


There are a number of ways you can achieve this. For example, I recently created a crawler using the Java library Jsoup and extracted content in multiple languages. I analyzed the URL patterns, which contain locales such as en-GB, en-US, etc.

Each URL contains a locale, so if you want pages in only a specific language, check the locale against your required language and create a filter that only keeps the matching links (see the sketch below).
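
A rough sketch of that locale filter in Python (the answer used Java's Jsoup; here requests and BeautifulSoup stand in, and the locale pattern and seed URL are assumptions):

    import re
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    # Assumed pattern: the site encodes the locale as a path segment, e.g. /uz/ or /uz-UZ/.
    LOCALE_PATTERN = re.compile(r"/uz(-UZ)?/", re.IGNORECASE)

    def uzbek_links(seed_url):
        """Collect links from seed_url whose path carries the Uzbek locale."""
        html = requests.get(seed_url).text
        soup = BeautifulSoup(html, "html.parser")
        links = (urljoin(seed_url, a["href"]) for a in soup.find_all("a", href=True))
        return [link for link in links if LOCALE_PATTERN.search(link)]

    # Example with a placeholder site:
    # print(uzbek_links("https://example.com/"))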


I extracted all Telugu-language pages from the Common Crawl data with a single command.

$ duckdb -c """
    LOAD httpfs;
    LOAD parquet;

    SET s3_region='us-east-1';
    SET s3_access_key_id='<your_access_key_id>';
    SET s3_secret_access_key='<your_secret_access_key>';

    -- content_languages holds ISO 639-3 codes; 'tel' is Telugu
    COPY (
        SELECT *
        FROM PARQUET_SCAN('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/*.parquet')
        WHERE content_languages ILIKE '%tel%'
    ) TO 'telugu.csv' (DELIMITER ',', HEADER TRUE);
"""

Common Crawl started providing language annotations in its index files. DuckDB can read Parquet files, remote files, and a series of Parquet files at once.

With the parquet and httpfs extensions, we can scan the entire Common Crawl index with the single command above. For Uzbek, change the filter to `content_languages ILIKE '%uzb%'` ('uzb' is the ISO 639-3 code for Uzbek).
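
The index only points at records inside WARC files; to get the page text itself you still need to fetch each record. A minimal Python sketch of that step, assuming the CSV kept the standard index columns warc_filename, warc_record_offset, and warc_record_length, and using the warcio library (my addition, not part of the original command):

    import io
    import requests
    from warcio.archiveiterator import ArchiveIterator

    def fetch_warc_record(filename, offset, length):
        """Fetch one WARC record from Common Crawl via an HTTP range request."""
        url = f"https://data.commoncrawl.org/{filename}"
        headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        # The byte range is a complete gzipped WARC record; warcio can iterate over it.
        for record in ArchiveIterator(io.BytesIO(resp.content)):
            return record.content_stream().read()  # raw HTML bytes of the page

    # Example with values taken from one row of telugu.csv:
    # html = fetch_warc_record("crawl-data/CC-MAIN-2022-40/.../file.warc.gz", 1234, 5678)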

Before running the command, install DuckDB and its extensions:

$ brew install duckdb

$ duckdb -c 'INSTALL parquet;'
$ duckdb -c 'INSTALL httpfs;'

I wrote a detailed blog post on extracting a subset of the Common Crawl data as well.
