0

I need help creating database tables from html tables. Right now I'm just manually doing it. I have a html doc that has all the data but I don't know what the best way to extract the data.

<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>ATOMS Definition for Type tom.service.soc.SocRecord</title>
<style type="text/css">
body
{
    line-height: 1.6em;
    font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
    font-size: 14px;
    margin: 45px;
}
#box-table-a
{
    font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
    font-size: 12px;
    margin: 5%;
    width: 90%;
    text-align: left;
    border-collapse: collapse;
}
#box-table-a th
{
    font-size: 13px;
    font-weight: normal;
    padding: 8px;
    background: #b9c9fe;
    border-top: 4px solid #aabcfe;
    border-bottom: 1px solid #fff;
    color: #039;
}
#box-table-a td
{
    padding: 8px;
    background: #e8edff;
    border-bottom: 1px solid #fff;
    color: #669;
    border-top: 1px solid transparent;
}
#box-table-a tr:hover td
{
    background: #d0dafd;
    color: #339;
}
</style>

</head>
<body>
<table id="box-table-a" summary="Definition for tom.service.soc.SocRecord">
    <thead>
        <tr><th colspan="2">tom.service.soc.SocRecord</th></tr>
    </thead>
    <tbody>
        <tr>
            <td>Version</td>
            <td>1</td>
        </tr>
        <tr>
            <td>Description</td>
            <td>[type is UNCLASSIFIED] Temporary dummy test object for SOC</td>
        </tr>
    </tbody>
</table>
<table id="box-table-a" summary="Fields Definition for Type tom.service.soc.SocRecord">
    <thead>
        <tr>
            <th scope="col">Index</th>
            <th scope="col">Name</th>
            <th scope="col">Type</th>
            <th scope="col">Range</th>
            <th scope="col">Default</th>
            <th scope="col" width="50%">Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1</td>
            <td>socID</td>
            <td>String</td>
            <td>           
-
            </td>
            <td>""</td>
            <td>
                [ ] The UUID of the tracked object -- String for transmission purposes  
            </td>
        </tr>
        <tr>
            <td>2</td>
            <td>satID</td>
            <td><a href="../../../../../tom/state/vcm/SatNumberType.html">SatNumberType</a></td>
            <td>
            </td>
            <td></td>
            <td>
                [ ] The ID of the tracked object -- copy of the satelliteId in the VCM  
            </td>
        </tr>
    </tbody>
</table>

</body></html>

Here is the html. I would like help making a create postgres database table script like this from the html. And if it has a href then that links to another table.

CREATE TABLE soc.SocRecord( 
    socId TEXT, --[ ] The UUID of the tracked object -- String for transmission purposes
    satId UUID, --[ ] The ID of the tracked object -- copy of the satId in the VCM
    commonName TEXT, --[ ] The name of the tracked object -- may be blank - 
                     --This field is optional in the current version of the message, check the set attribute before use.);
e417927
  • 19
  • 6
  • afaik, there is no way to parse html directly and put it into sql. you will need a general purpose programming language (e.g. javascript, python, .net, java, etc) equipped with html parsing library to parse the html and then put the data into database using ORM or something. – Bagus Tesa Jul 13 '23 at 04:46
  • also, you could pay people to do menial works.. just saying. we are not free code generator, show us what have you tried and where did you stuck in. – Bagus Tesa Jul 13 '23 at 04:47

1 Answers1

1

Edit Tried out some more variants and found that using zip resulted in no errors and returned the sql script.

for header, value in zip(headers, values):

I've updated the code bellow as well.


So your approach could constitute something like:

read .html file parse for table parse for theader and so on

I find idea pretty interesting, so I tried it out in python.

from bs4 import BeautifulSoup

# Specify the path to your HTML file
html_file_path = 'path/to/your/file.html'

here you could write another script to read all .html files in a folder and execute against that instead.

# Read the contents of the HTML file
with open(html_file_path, 'r') as file:
    html = file.read()

# Find all the tables in the HTML
tables = soup.find_all('table')

# Iterate over the tables
for table in tables:
    # Find the table's ID attribute
    table_id = table.get('id')

    # Extract the table headers
    headers = [th.get_text() for th in table.find('thead').find_all('th')]

    # Create a dictionary to store the table data
    table_data = {}

    # Iterate over the table rows
    for row in table.find('tbody').find_all('tr'):
        # Extract the row cells
        cells = row.find_all('td')

        # Extract the cell values
        values = [cell.get_text().strip() for cell in cells]

        # Store the values with their corresponding headers in the dictionary
        for header, value in zip(headers, values):
            if header not in table_data:
                table_data[header] = []
            table_data[header].append(value)

    # Generate the PostgreSQL table script
    create_table_script = f"CREATE TABLE {table_id} (\n"
    for header, values in table_data.items():
        # Handle column names with spaces or special characters
        column_name = header.lower().replace(' ', '_').replace('.', '_')

        # Combine the column values into a comma-separated string
        column_values = ', '.join([f"'{value}'" if isinstance(value, str) else str(value) for value in values])

        # Append the column definition to the script
        create_table_script += f"    {column_name} {column_values},\n"

    create_table_script = create_table_script.rstrip(',\n') + "\n);\n"

    # Print the table script
    print(create_table_script)

The result of the above is probably not going to be exactly what you want, but it gets you started.

aleksandar
  • 186
  • 12