I know that MemoryError is a common problem when using various functions of the pandas library. I would like help in several areas. My questions are formulated below, after a description of the problem.
My OS is Ubuntu 18, my workspace is a Jupyter notebook under Anaconda, and I have 8 GB of RAM.
The task I am solving:
I have over 100,000 dictionaries containing data on users' site visits, like this one:
{'meduza.io': 2, 'google.com': 4, 'oracle.com': 2, 'mail.google.com': 1, 'yandex.ru': 1, 'user_id': 3}
I need to build a DataFrame from this data. At first I used the append method to add the dictionaries to the DataFrame row by row:
for i in tqdm_notebook(data):
    real_data = real_data.append([i], ignore_index=True)
But even a toy dataset showed that this approach takes far too long. I then tried to create the DataFrame directly, by passing the whole list of dictionaries:
real_data = pd.DataFrame(data=data, dtype='int')
Converting a small amount of data this way is fast enough, but when I pass the complete dataset to the function, a MemoryError appears. I tracked RAM consumption: the function never starts executing and does not consume memory. I tried enlarging the swap file, but that did not help; the function does not use it.
I understand that to solve my particular problem I could split the data into parts and then combine them, but I am not sure this is the most effective way to solve the problem.
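For what it's worth, here is a minimal sketch of the split-and-combine idea I mean (the function name, the chunk size, and the int32 downcast are my own assumptions, not code from my project): build a small DataFrame per chunk, then concatenate once at the end instead of appending row by row.

```python
import pandas as pd

def frame_from_dicts(dicts, chunk_size=10_000):
    """Build one DataFrame from a long list of visit dicts in chunks.

    Each chunk is converted separately; missing sites become NaN,
    which forces float64, so we fill with 0 and downcast to int32
    to keep the per-chunk memory footprint small.
    """
    chunks = []
    for start in range(0, len(dicts), chunk_size):
        chunk = pd.DataFrame(dicts[start:start + chunk_size])
        chunks.append(chunk.fillna(0).astype('int32'))
    # Concatenating chunks with different column sets reintroduces
    # NaN (and float64) for sites absent from a chunk, so fill and
    # downcast once more after the final concat.
    out = pd.concat(chunks, ignore_index=True, sort=False)
    return out.fillna(0).astype('int32')
```

This keeps only one chunk-sized intermediate alive at a time, whereas the row-by-row append copies the whole accumulated frame on every iteration.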
I want to understand how pandas calculates the amount of memory an operation needs. Judging by the number of questions on this topic, a MemoryError occurs when reading, merging, etc. Is it possible to bring a swap file into play to solve this problem?
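My rough understanding (an assumption on my part, not something I found documented for my exact case) is that a dense DataFrame needs on the order of rows × columns × 8 bytes, because every missing site becomes a float64 NaN. One can sanity-check this on a small sample with memory_usage; the sample dicts below are invented for illustration:

```python
import pandas as pd

# Hypothetical small sample standing in for the real visit dicts.
sample = [
    {'meduza.io': 2, 'google.com': 4, 'user_id': 3},
    {'google.com': 1, 'yandex.ru': 5, 'user_id': 7},
]
df = pd.DataFrame(sample)

# Actual bytes used by this sample frame (including the index).
sample_bytes = df.memory_usage(deep=True).sum()

# Back-of-the-envelope dense estimate: rows * columns * 8 bytes,
# since absent sites are stored as float64 NaN.
est_bytes = df.shape[0] * df.shape[1] * 8
print(sample_bytes, est_bytes)
```

Scaling est_bytes to 100,000 rows and however many distinct sites appear across all users suggests why the full construction fails up front: the allocation is requested as one contiguous block before any rows are filled in.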
How can I implement the conversion of dictionaries into a DataFrame more efficiently? append works, but too slowly; creating the DataFrame from the complete dataset is faster, but leads to the error. I do not understand the implementation of these processes, but I want to figure out the most efficient way to convert data like mine.
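One direction I am considering (a sketch under my own assumptions, not a known-good solution): since each user visits only a handful of the many distinct sites, the frame is mostly zeros, so a sparse dtype that stores only non-zero counts might avoid the dense allocation entirely.

```python
import pandas as pd

# Hypothetical visit dicts standing in for the real `data` list.
data = [
    {'meduza.io': 2, 'google.com': 4, 'user_id': 3},
    {'google.com': 1, 'yandex.ru': 5, 'user_id': 7},
]

# Build once from the full list (never append in a loop), fill the
# NaN gaps, downcast to int32, then convert every column to a
# sparse dtype whose fill value 0 is not physically stored.
df = (pd.DataFrame(data)
        .fillna(0)
        .astype('int32')
        .astype(pd.SparseDtype('int32', 0)))
print(df.memory_usage(deep=True).sum())
```

In a real run one would probably keep user_id as a dense column and apply the sparse dtype only to the site-count columns, but the idea is the same: pay for the non-zero cells, not for the full rows × columns grid.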