
I received a data dump of the SQL database.

The data is formatted in an .sql file and is quite large (3.3 GB). I have no idea where to go from here. I have no access to the actual database and I don't know how to handle this .sql file in Python.

Can anyone help me? I am looking for specific steps to take so I can use this SQL file in Python and analyze the data.

TL;DR: Received an .sql file and have no clue how to process/analyze the data in it with Python. Need help with the necessary steps to make the .sql file usable in Python.

jurre
  • What kind of database? If it's MySQL, you need to install MySQL on your local machine, create a new db, load the dump, and then query it like you would the original database (using PyMySQL, for example). Similar steps exist for any other brand. Just google "restore from dump" and that should get you started (a sketch of this workflow follows these comments) – Ben Jan 10 '19 at 15:37
  • 1
    Possible duplicate of [How to parse MySQL built-code from within Python?](https://stackoverflow.com/questions/6035972/how-to-parse-mysql-built-code-from-within-python) – chb Jan 10 '19 at 15:39
  • I have emailed the company that provided the dump to ask what kind of database was used. I do think that this is a viable solution, @Ben. Thank you for your reply. I will give an update once I have tried your option. – jurre Jan 10 '19 at 16:52
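
A minimal sketch of the MySQL route Ben describes, assuming the dump really is MySQL; the database name, credentials, and table name are placeholders. Restore the dump with the mysql command-line client first, then query it from Python with PyMySQL:

# Restore the dump into a fresh local database first, e.g.:
#   mysql -u root -p -e "CREATE DATABASE dumpdb"
#   mysql -u root -p dumpdb < dump.sql

import pymysql

# Placeholder connection details, adjust to your local setup.
conn = pymysql.connect(host='localhost', user='root',
                       password='secret', database='dumpdb')
try:
    with conn.cursor() as cur:
        # See what tables were restored, then sample one of them.
        cur.execute('SHOW TABLES')
        print(cur.fetchall())
        # 'some_table' is a placeholder for one of the restored tables.
        cur.execute('SELECT * FROM some_table LIMIT 5')
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()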

3 Answers


It would be an extraordinarily difficult process to construct any sort of Python program capable of parsing the SQL syntax of a dump file like this and doing anything useful with it.

"No. Absolutely not. Absolute nonsense." (And I have over 30 years of experience, including senior management.) You need to go back to your team, and/or to your manager, and look for a credible way to achieve your business objective ... because, "this isn't it."

The only credible thing that you can do with this file is to load it into another MySQL database ... and, well, "couldn't you have just accessed the database from which this dump came?" Maybe so, maybe not, but "one wonders."

Anyhow – your team and its management need to "circle the wagons" and talk about your credible options. Because, the task that you've been given, in my professional opinion, "isn't one." Don't waste time – yours, or theirs.

Mike Robinson
  • I'm with Mike on this one. – P.Salmon Jan 10 '19 at 16:10
  • The argument for not giving me access to their database is because they have data of multiple companies in their SQL database, including competing companies. My management has no idea of what I'm doing and how this all works. I am on my own in my company on this project and I've been given carte blanche to figure out how to do this. Is there any other solution of handling this? I want to analyze the data that they have from us, but I can't access the SQL database directly due to their "policy". – jurre Jan 10 '19 at 16:53

Eventually I had to install MAMP to create a local MySQL server. I imported the SQL dump with a program like SQLyog that lets you edit SQL databases.

This made it possible to load the data into Python using SQLAlchemy, MySQL Connector and Pandas.
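
For reference, that last step could look roughly like this; a sketch, not the exact code, with placeholder credentials (MAMP commonly runs MySQL as root/root on port 8889) and a placeholder table name:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder URL; adjust user, password, port and database name
# to match your local MAMP setup.
engine = create_engine(
    'mysql+mysqlconnector://root:root@localhost:8889/dumpdb'
)

# 'some_table' is a placeholder for one of the imported tables.
df = pd.read_sql('SELECT * FROM some_table', engine)
print(df.head())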

jurre

The sqlparse module does a pretty good job at this. For example:

import sqlparse
import collections
import pandas as pd

with open('dump.sql', 'r') as sqldump:

    # parsestream lazily yields one parsed statement at a time
    parser = sqlparse.parsestream(sqldump)
    headers = {}
    contents = collections.defaultdict(list)

    for statement in parser:

        if statement.get_type() == 'INSERT':

            sublists = statement.get_sublists()
            # first sublist: the table name with its column list
            table_info = next(sublists)
            table_name = table_info.get_name()

            headers[table_name] = [
                col.get_name()
                for col in table_info.get_parameters()
            ]

            # next sublist: the VALUES records; strip quotes from each field
            contents[table_name].extend(
                tuple(
                    s.value.strip('"\'')
                    for s in next(rec.get_sublists()).get_identifiers()
                )
                for rec in next(sublists).get_sublists()
            )

data = {
    name: pd.DataFrame.from_records(table, columns=headers[name])
    for name, table in contents.items()
}

It is slow, but it does the job, probably up to a file size of a few GB. It works even better if you extract the tables one by one (lower memory use) and seek in the file object to the first INSERT statement of the table of interest, so that the other huge statements never go through the sqlparse lexer (a sketch of this follows).
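
A rough sketch of that seek idea, assuming the dump uses mysqldump's usual INSERT INTO `table` form (the exact marker string is an assumption): scan cheaply line by line, then hand the stream to sqlparse only from the first matching INSERT onward.

import sqlparse

def inserts_for(path, table_name):
    """Yield parsed INSERT statements starting at the first one for
    table_name, without lexing the preceding statements."""
    marker = 'INSERT INTO `%s`' % table_name   # assumed mysqldump-style syntax
    with open(path, 'r') as sqldump:
        # Cheap line-by-line scan; readline() keeps tell() usable for seeking.
        pos = sqldump.tell()
        line = sqldump.readline()
        while line:
            if line.startswith(marker):
                sqldump.seek(pos)  # rewind to the start of the matched line
                for statement in sqlparse.parsestream(sqldump):
                    if statement.get_type() == 'INSERT':
                        # The caller can stop iterating once the table changes.
                        yield statement
                return
            pos = sqldump.tell()
            line = sqldump.readline()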

deeenes