I am trying to collect as much text as possible from web pages in the Uzbek language (for my research). What is the best way to do it?
I found Common Crawl, but I am not sure whether it is easy to extract text in a specific language from it.
There are a number of ways you can achieve this. For example, I recently built a crawler using Java Jsoup that extracted content in multiple languages. I analyzed the URL patterns, which contain a locale: en-GB, en-US, etc.
Each URL contains a locale, so if you only want pages in a specific language, check that the locale matches your required language and create a filter that only follows the links you want.
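As a rough sketch of that filtering idea (not the original crawler; the seed URL, the uz-UZ locale marker, and the class name are assumptions for illustration), a Jsoup version might look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class LocaleFilterCrawler {
    // Hypothetical locale marker; many sites put it in the path or query string, e.g. /uz-UZ/ or ?lang=uz
    private static final String LOCALE = "uz-UZ";

    public static void main(String[] args) throws IOException {
        // Hypothetical seed page; replace with the site you want to crawl
        Document doc = Jsoup.connect("https://example.com/uz-UZ/").get();

        // Visible text of the current page
        System.out.println(doc.body().text());

        // Follow only links whose URL contains the desired locale
        for (Element link : doc.select("a[href]")) {
            String next = link.absUrl("href");
            if (next.contains(LOCALE)) {
                System.out.println("queue for crawling: " + next);
            }
        }
    }
}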
I extracted all Telugu language pages from Common Crawl data using a single command.
$ duckdb -c """
LOAD httpfs;
LOAD parquet;
SET s3_region='us-east-1';
SET s3_access_key_id='s3_access_key_id';
SET s3_secret_access_key='s3_secret_access_key';
COPY (select * from PARQUET_SCAN('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/*.parquet') where content_languages ilike '%tel%') TO 'telugu.csv' (DELIMITER ',', HEADER TRUE);
"""
Common Crawl now provides language annotations in its index files. DuckDB can read Parquet files, remote files, and a whole series of Parquet files at once. With the parquet and httpfs extensions, we can query the entire Common Crawl index with the command above.
Before running the command, install duckdb and its extensions.
$ brew install duckdb
$ duckdb -c 'INSTALL parquet;'
$ duckdb -c 'INSTALL httpfs;'
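For the Uzbek case in the question, the same query should work with the language filter changed; assuming the content_languages column holds ISO-639-3 codes (as with 'tel' for Telugu above), the code for Uzbek would be 'uzb':
$ duckdb -c """
LOAD httpfs;
LOAD parquet;
SET s3_region='us-east-1';
SET s3_access_key_id='s3_access_key_id';
SET s3_secret_access_key='s3_secret_access_key';
COPY (select * from PARQUET_SCAN('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-40/subset=warc/*.parquet') where content_languages ilike '%uzb%') TO 'uzbek.csv' (DELIMITER ',', HEADER TRUE);
"""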
I wrote a detailed blog post on extracting a subset of CC data as well.
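Note that the index only returns page URLs and pointers into the WARC archives (the warc_filename, warc_record_offset, and warc_record_length columns), not the page text itself. As a minimal sketch, one record can be fetched with an HTTP range request; the filename, offset, and length below are placeholders that should come from a row of the CSV:
$ warc_filename='crawl-data/CC-MAIN-2022-40/.../file.warc.gz'
$ offset=123456
$ length=7890
$ # Each WARC record is its own gzip member, so the byte range can be decompressed on its own
$ curl -s -r "${offset}-$((offset + length - 1))" "https://data.commoncrawl.org/${warc_filename}" | gzip -dc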