
I would like to load a huge amount of data that is compressed (.gz), and I don't know how to handle it. My dataset is Wikipedia pageviews.

My goal is to compute basic statistical measures in order to analyse them.

I found an article that uses the same dataset, but I don't know how to load the dataset with the Python script shown in step 1.

I assume that analysing such a large dataset on a local computer is not the right approach, hence the idea to use Google Cloud.

Merix
  • What have you tried so far? Please research yourself, write some code/try to upload and then ask specific questions when you're stuck. See also [idownvotedbecau.se/noattempt/](http://idownvotedbecau.se/noattempt/). – IonicSolutions Jul 03 '18 at 14:57

1 Answer


A tremendously huge dataset.

To copy files to Google Cloud Storage, just follow this: Cloud Storage > Documentation > Uploading Objects
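If you do go the upload route, a minimal sketch with the `google-cloud-storage` Python client could look like the following (the bucket and object names are made up for illustration, and default GCP credentials are assumed to be configured):

```python
from google.cloud import storage

# Assumes default GCP credentials are available in the environment.
client = storage.Client()

# Hypothetical bucket/object names -- replace with your own.
bucket = client.bucket("my-pageviews-bucket")
blob = bucket.blob("pageviews/pageviews-20170101-000000.gz")

# Upload a local .gz dump as-is; there is no need to decompress it first.
blob.upload_from_filename("pageviews-20170101-000000.gz")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```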

I wouldn't recommend trying that, considering the costs, but anyway, you're in luck for the goal you have: Wikipedia's pageviews dataset has been integrated into Google BigQuery, and it's available here:

https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia_v2.pageviews_2017?tab=details

Where:

  • "Google pays for the storage of these datasets and provides public access to the data via a project."

  • "You pay only for the queries that you perform on the data (the first 1 TB per month is free)."

See https://cloud.google.com/bigquery/public-data/ for more details.
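To keep query costs down, restrict the date range you scan. A minimal sketch with the `google-cloud-bigquery` Python client might look like this (the column names `datehour`, `wiki`, `title`, and `views` are my assumption from the table's details page; verify the schema before running, and note you still need a GCP project even for free-tier queries):

```python
from google.cloud import bigquery

# Assumes a GCP project with default credentials; queries against public
# datasets count against your project's quota (first 1 TB/month is free).
client = bigquery.Client()

# Column names (datehour, wiki, title, views) are assumptions -- check
# the table schema in the BigQuery UI before running.
query = """
SELECT title, SUM(views) AS total_views
FROM `fh-bigquery.wikipedia_v2.pageviews_2017`
WHERE datehour >= TIMESTAMP '2017-01-01'
  AND datehour <  TIMESTAMP '2017-01-02'   -- one day only, to limit bytes scanned
  AND wiki = 'en'
GROUP BY title
ORDER BY total_views DESC
LIMIT 10
"""

for row in client.query(query).result():
    print(row.title, row.total_views)
```

If the table is partitioned on `datehour` (the details tab should say so), a narrow range like this also keeps the bytes scanned, and therefore the cost, small.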

Wiil
  • @Will: Do you recommend any way to process pageviews or other huge datasets? – Merix Jul 03 '18 at 23:01
  • pageviews_2017 is 2.2 TB. Anything other than what Google has made available on BigQuery involves loads of cash and coding hours. Don't bother with it; just click on "Query Editor", write some queries, test them on a limited date range to stay within the free tier, and you will have your answers. No? – Wiil Jul 04 '18 at 22:13