
I am new here, so please bear with me. I am a beginner in data processing and analysis, and I would like to ask for help with my task.

I have three datasets (logs) in JSON format. Each of them is approximately 1.5 GB in size, and they all have the same attributes.

Next, I would like to analyze these datasets together (statistics and graphs for various attributes). Later, I would also like to be able to detect patterns, trends, and relationships in the data.

How can I do this effectively? What are good practices? How can I deal with data of this size? I tried the "pandas" library, but it is very time-consuming. I prefer "Python", but I'm open to other solutions :)
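
For reference, a simplified sketch of the kind of pandas loading I have been doing (the file name is a placeholder and the real code differs a bit):

```python
import pandas as pd

# Placeholder file name; the real log is about 1.5 GB.
# read_json parses the whole file into memory in one go,
# which is the step that takes very long for me.
# (lines=True would be needed if the records are newline-delimited.)
df = pd.read_json("logs_part1.json")

print(df.shape)
print(df.describe())
```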

I am asking you for help. This is very important for me.

Thank you in advance for any help.

AWL
  • pandas is a Python Library. – Giorgos Myrianthous Mar 01 '19 at 21:31
  • 1) resample your logs and keep a subset for testing, 2) explore and prototype, 3) scale up to process your whole dataset (a small sketch of step 1 follows after these comments). If you're new to the field, it will be time-consuming no matter what you try, because you're learning. Invest in knowledge while you can: read docs, tutorials, etc. Also, this question would be better suited to another forum, likely [Data Science](https://datascience.stackexchange.com/), as you are not seeking answers about your code but general guidance on processing methodology and tools. `pandas` is very good :) – FabienP Mar 01 '19 at 21:41
  • In general, JSON with its structure is not a great format for large datasets. – Klaus D. Mar 01 '19 at 21:45
  • @FabienP thank you for your answer and advice. I know I'm learning and I have to read a lot about it. In my question I asked about the approach / language / library that would make the data processing the most efficient and effective. I currently use the pandas library, but with data of this size the calculations are very time-consuming, which makes the entire analysis difficult. Nevertheless, thank you very much for your help. – AWL Mar 03 '19 at 00:20
  • @AWL: `pandas` is backed by C libraries from `numpy`, so there are plenty of parallelised operations that can speed up your processing. Nevertheless, beyond a certain size it can make sense to use a distributed or parallel computation framework. Spark is famous for this, but [`dask`](https://docs.dask.org/en/latest/) could help too, and it uses `pandas.DataFrame`, so it should be familiar to you (a minimal dask sketch follows below). – FabienP Mar 03 '19 at 01:23
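
To make step 1 from FabienP's comment concrete, here is a minimal sketch of cutting out a small test subset. It assumes the logs are newline-delimited JSON (one record per line); the file names and the sample size are placeholders:

```python
import json
from itertools import islice

SAMPLE_SIZE = 100_000  # placeholder: number of records to keep for prototyping

# Assumes newline-delimited JSON: one JSON object per line.
# Copies the first SAMPLE_SIZE records into a small file that pandas
# can load quickly while exploring and prototyping.
with open("logs_part1.json", encoding="utf-8") as src, \
     open("logs_sample.json", "w", encoding="utf-8") as dst:
    for line in islice(src, SAMPLE_SIZE):
        json.loads(line)  # fails fast if a line is not valid JSON
        dst.write(line)
```

A random sample would be more representative than just the first records, but this is the simplest way to get something small to prototype on.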
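
And following up on the `dask` suggestion, a minimal sketch of what that could look like, again assuming newline-delimited JSON; the file pattern and the column names (`status`, `user_id`, `duration`) are only illustrative:

```python
import dask.dataframe as dd

# Read all three logs lazily; blocksize splits each file into ~64 MB partitions,
# so the full ~4.5 GB never has to fit in memory at once.
# (blocksize only works with line-delimited JSON.)
df = dd.read_json("logs_part*.json", lines=True, blocksize=64_000_000)

# Familiar pandas-style operations, evaluated lazily and in parallel;
# the column names here are only illustrative.
print(df["status"].value_counts().compute())
print(df.groupby("user_id")["duration"].mean().compute())
```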

0 Answers