1

I have collected a large Twitter dataset (>150 GB) that is stored in a number of text files. Currently I retrieve and manipulate the data using custom Python scripts, but I am wondering whether it would make sense to use a database technology to store and query this dataset, especially given its size. If anybody has experience handling Twitter datasets of this size, please share your experiences, especially any suggestions as to which database technology to use and how long the import might take. Thank you.

AG100

2 Answers

0

I recommend using a database for this, especially considering its size (this is without knowing anything about what the dataset holds). That being said, for this and for future questions of this nature I suggest using the Software Recommendations site, and adding more detail about what the dataset looks like.

As for recommending one database in particular, I suggest doing some research into what each one offers, but for something that just holds data with no relations, any of them will do and could show a great query improvement over plain text files: queries can be cached, and data is faster to retrieve because of how databases store and look up records, whether through hashed values, indexes, or other structures.

Some popular databases:

MySQL, PostgreSQL - Relational Databases (simple, fast, and easy to use/set up, but they require some knowledge of SQL)

MongoDB - NoSQL Database (also easy to use and set up, with no SQL needed; you access the DB through dict-like documents via its API. It is also memory-mapped, so it can be faster than a relational database, but you need enough RAM to hold the indexes.) A minimal pymongo sketch is shown below.

ZODB - a NoSQL object database written entirely in Python (kind of like MongoDB, but pure Python)

These are very light and brief descriptions of each DB; be sure to do your research before using them, as they each have their pros and cons. Also, remember these are just a few of many popular and widely used databases. There are also TinyDB and PickleDB, which are pure Python, and SQLite (which ships with Python's standard library); these are generally meant for small applications.
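To make the MongoDB option concrete, here is a minimal sketch of loading and querying tweets with pymongo. The file name `tweets.jsonl`, the `twitter`/`tweets` database and collection names, and the assumption of one tweet JSON object per line are all hypothetical; adjust them to your data.

```python
import json

from pymongo import ASCENDING, MongoClient

# Hypothetical connection and names -- adjust to your setup.
client = MongoClient("mongodb://localhost:27017")
collection = client["twitter"]["tweets"]

# Assumes one tweet JSON object per line, as returned by the Twitter API.
with open("tweets.jsonl", encoding="utf-8") as fh:
    batch = []
    for line in fh:
        batch.append(json.loads(line))
        if len(batch) >= 1000:  # insert in batches to limit memory use
            collection.insert_many(batch)
            batch = []
    if batch:
        collection.insert_many(batch)

# Index the fields you query on; the indexes should fit comfortably in RAM.
collection.create_index([("user.id", ASCENDING)])
collection.create_index([("created_at", ASCENDING)])

# Example query: all tweets by a given (hypothetical) user ID.
for tweet in collection.find({"user.id": 12345}):
    print(tweet.get("created_at"), tweet.get("text"))
```

Because the documents keep the tweets' original JSON structure, you query them with the same dict-style field names you already use in your Python scripts.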

My experience is mainly with PostgreSQL, TinyDB, and MongoDB, my favorites being MongoDB and PostgreSQL. For you, I'd look at either of those, but don't limit yourself: there's a slew of them, plus many drivers that help you write easier/less code if that's what you want. Remember, Google is your friend! And welcome to Stack Overflow!

Edit

If your dataset is, and will remain, fairly simple but just large, and you want to stick with text files, consider pandas together with a JSON or CSV format and library. It can greatly increase efficiency when querying and managing data like this from text files, and it uses less memory, since reading in chunks means it never needs the entire dataset in memory at once.
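For example, here is a minimal sketch of chunked reading with pandas, assuming the tweets are stored as line-delimited JSON in a hypothetical file `tweets.jsonl` with a nested `user` object per tweet (the same idea works with `read_csv` on a CSV extract):

```python
import pandas as pd

# chunksize (together with lines=True) yields DataFrames one chunk at a time,
# so the full dataset never has to fit in memory.
tweets_per_user = {}
for chunk in pd.read_json("tweets.jsonl", lines=True, chunksize=100_000):
    # Work on one chunk at a time, e.g. count tweets per user ID.
    counts = chunk["user"].apply(lambda u: u["id"]).value_counts()
    for user_id, n in counts.items():
        tweets_per_user[user_id] = tweets_per_user.get(user_id, 0) + n

print(f"{len(tweets_per_user)} distinct users seen")
```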

Jab
  • Thank you for your suggestions. As for the contents of the dataset, it is a dataset of tweets, as mentioned in the original post. – AG100 Jan 12 '19 at 00:23
  • Right, I mean are you only storing tweet IDs, or tweet contents, likes, dislikes, retweets...? I don't know much about Twitter, just wondering about the complexity is all. I do believe I heard Twitter uses a graph database, but that's just from memory and I highly doubt you'd need one. – Jab Jan 12 '19 at 00:26
  • I'm downloading tweets with all their associated metadata, using the Twitter API. I have extracted specific pieces of data (e.g. tweet date and user ID) from the whole dataset and imported them into a single Postgres table, but importing what I had extracted (around 8 GB of data) took about 12 hours. So I was wondering whether there are better ways of importing the whole thing and doing it faster. Thank you. – AG100 Jan 12 '19 at 20:26
  • Refer to [this](https://stackoverflow.com/questions/5233321/how-to-load-data-from-a-text-file-in-a-postgresql-database), also you're welcome and good luck with your project! – Jab Jan 12 '19 at 21:38
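Following up on the linked answer, here is a minimal sketch of a bulk load into PostgreSQL using COPY via psycopg2, which is usually much faster than row-by-row INSERTs. The connection string, the `tweets` table and its columns, and the `tweets.csv` extract (e.g. tweet ID, date, and user ID produced by your existing scripts) are all hypothetical.

```python
import psycopg2

# Hypothetical connection details and table definition -- adjust to your setup.
conn = psycopg2.connect("dbname=twitter user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS tweets (
            tweet_id   bigint PRIMARY KEY,
            created_at timestamptz,
            user_id    bigint
        )
        """
    )
    # COPY streams the whole file in one statement instead of issuing
    # one INSERT per row, avoiding most of the per-row overhead.
    with open("tweets.csv", encoding="utf-8") as fh:
        cur.copy_expert(
            "COPY tweets (tweet_id, created_at, user_id) FROM STDIN WITH (FORMAT csv)",
            fh,
        )
conn.close()
```

Loading into a table without secondary indexes and adding the indexes afterwards also tends to shorten large imports considerably.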
-2

You can try using any NoSQL DB. MongoDB would be a good place to start.