
Question

Is there a way to grep through the text documents stored in Google Cloud Storage?

Background

I am storing over 10 thousand documents (txt files) on a VM, and they are using up space. Before the disk reaches its limit I want to move the documents to an alternative location. Currently, I am considering moving them to Google Cloud Storage on GCP.

Issues

I sometimes need to grep the documents for specific keywords. Is there any way I can grep through the documents uploaded to Google Cloud Storage? I checked the gsutil docs; it seems ls, cp, mv, and rm are supported, but I don't see grep.

llompalles
tetsushi awano

6 Answers


Unfortunately, there is no grep-like command in gsutil.

The closest command is gsutil cat.

I suggest you create a small VM and grep there; running the search on the cloud will be faster and cheaper.

gsutil cat gs://bucket/* | grep "what you want to grep"
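One drawback of piping everything through a single gsutil cat is that grep can no longer tell you which object matched. A sketch of a per-object loop that keeps the name visible; local files stand in for GCS objects here so the pipeline shape is runnable, and with real data the glob would be `gsutil ls gs://bucket/` and `cat` would be `gsutil cat`:

```shell
# Local files stand in for GCS objects; with real objects, replace the
# glob with `gsutil ls gs://bucket/` and `cat "$f"` with `gsutil cat "$f"`.
DIR=$(mktemp -d)
printf 'needle here\n' > "$DIR/a.txt"
printf 'nothing here\n' > "$DIR/b.txt"

for f in "$DIR"/*; do
  # --label makes grep report the object name instead of "(standard input)".
  cat "$f" | grep -H --label="$f" "needle" || true
done
```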
howie
    Thank you for your reply. I tried `gsutil cat` and it works if I don't have many files on Google Cloud Storage (GCP). Considering scalability, though, `gsutil cat` is definitely not best. Let me check the performance of grep on the small VM as suggested. Thank you again!!! – tetsushi awano Mar 05 '19 at 03:48

@howie's answer is good. I just want to mention that Google Cloud Storage is a product intended to store files; it does not care about their contents. Also, it is designed to be massively scalable, and the operation you are asking for is computationally expensive, so it is very unlikely to be supported natively in the future.

In your case, I would consider creating an index of the text files and triggering an update to it every time a new file is uploaded to GCS.
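A minimal sketch of that idea, assuming the documents are plain text; the names and the in-memory dictionary here are illustrative (a real setup would persist the index and feed it from a GCS upload trigger). It builds a simple inverted index mapping each word to the files that contain it, so a keyword search becomes a lookup instead of a full scan:

```python
import re
from collections import defaultdict

def update_index(index, filename, text):
    """Add one document's words to an inverted index (word -> set of filenames)."""
    for word in set(re.findall(r"\w+", text.lower())):
        index[word].add(filename)

# Illustrative documents standing in for objects downloaded from GCS.
index = defaultdict(set)
update_index(index, "doc1.txt", "Alpha beta gamma")
update_index(index, "doc2.txt", "beta delta")

# "Grep" by keyword is now a dictionary lookup instead of scanning every file.
print(sorted(index["beta"]))   # -> ['doc1.txt', 'doc2.txt']
print(sorted(index["gamma"]))  # -> ['doc1.txt']
```

Each new upload only costs one call to update_index for that file, which is what makes the trigger-on-upload approach cheap compared with re-scanning the whole bucket per query.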

llompalles

I found the answer to this issue: gcsfuse solved the problem.

Mount the Google Cloud Storage bucket to a local directory, and you can grep from there.

https://cloud.google.com/storage/docs/gcs-fuse
https://github.com/GoogleCloudPlatform/gcsfuse
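A sketch of that workflow; the bucket and mount-point names are hypothetical, and a local temporary directory stands in for the mount below so the grep step itself is runnable:

```shell
# With gcsfuse installed, mounting looks like (hypothetical names):
#   gcsfuse my-bucket /mnt/docs
# A local stand-in directory plays the role of the mount point here.
MOUNT=$(mktemp -d)
printf 'alpha beta\n'  > "$MOUNT/doc1.txt"
printf 'gamma delta\n' > "$MOUNT/doc2.txt"

# Ordinary recursive grep works on a mounted bucket exactly like this;
# -l prints only the names of matching files.
grep -rl "gamma" "$MOUNT"
```

Keep in mind that every file grep touches through the mount is downloaded from GCS behind the scenes, so a full recursive search still reads the whole bucket.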

tetsushi awano

I have another suggestion. You might want to consider using Google Dataflow to process the documents. You can just move them, but more importantly, you can transform the documents using Dataflow.

Jay

I've written a Linux native binary, mrgrep (for Ubuntu 18.04), that does exactly this: https://github.com/romange/gaia/releases/tag/v0.1.0. It reads directly from GCS and, as a bonus, it handles compressed files and is multi-threaded.

Roman

You can try this Python script in the cloud console, like:
python script_file_name bucket_name pattern directory_if_any

from google.cloud import storage
import re
import sys

client = storage.Client()
BUCKET_NAME = sys.argv[1]
PATTERN = re.compile(sys.argv[2])  # compile the pattern once, up front
# Optional third argument: restrict the search to a "directory" prefix.
PREFIX = sys.argv[3] if len(sys.argv) > 3 else ""

def walk(bucket_name, prefix=""):
    """Yield every non-directory blob under the given prefix."""
    bucket = client.bucket(bucket_name)
    for blob in bucket.list_blobs(prefix=prefix):
        if not blob.name.endswith("/"):
            yield blob

for blob in walk(BUCKET_NAME, prefix=PREFIX):
    # download_as_bytes replaces the deprecated download_as_string
    text = blob.download_as_bytes().decode("utf-8")
    if PATTERN.search(text):
        print(blob.name)
Aniket singh