Could someone please guide me on how to extract the contents of a .docx file and load them into a database using an ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform) tool?

Assuming that the .docx file contains mostly unstructured data, shouldn't I go for an ELT tool rather than an ETL tool?

The ETL and ELT tools I've found so far don't support MS Word files. What other way is there to extract the content of a .docx file and store it in a database?

My requirement is to:

  1. Extract the data inside the .docx file,
  2. Convert it into meaningful data, and
  3. Store it in a data lake so I can perform data analysis and make productive decisions based on the results.
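
To make the three steps concrete, here is a minimal sketch using only the Python standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`. Everything else is an assumption made for illustration: the function names, the `paragraphs` table, the word count as the "meaningful" field, and SQLite standing in for the real database or data lake. Tables and text boxes get no special handling.

```python
import sqlite3
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(docx_path):
    """Step 1: pull the text of each non-empty paragraph out of a .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t")).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def transform(paragraphs, source):
    """Step 2 (placeholder): attach simple, queryable fields to each paragraph."""
    return [(source, i, text, len(text.split())) for i, text in enumerate(paragraphs)]


def load(rows, conn):
    """Step 3: store the rows; SQLite is only a stand-in for the real target."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paragraphs "
        "(source TEXT, position INTEGER, body TEXT, word_count INTEGER)"
    )
    conn.executemany("INSERT INTO paragraphs VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(docx_path, conn):
    """Run extract, transform, and load for a single file."""
    load(transform(extract_paragraphs(docx_path), str(docx_path)), conn)
```

With a hypothetical file name, this could be called as `run_pipeline("report.docx", sqlite3.connect("docs.db"))`. The `transform` step is where real categorisation or text analysis would go.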

It's just like how e-commerce companies convert customer reviews into meaningful data so they can make decisions that boost their sales. In my case, it's Word files I need to analyze.

I'm asking this because I've looked through so many ETL and ELT tools but couldn't find any that support Word files. Maybe it's because I haven't been searching for the right tool or the right way to do it?

If somebody knows a way, please guide me through the process. What should I start looking for? A tool, or a way to code the entire thing?

I've been looking for an answer for weeks now but haven't found a helpful one. It's getting really frustrating to see tools that support every other source, such as social media or MongoDB, EXCEPT Word files.

KDeven
  • You can use Python: https://stackoverflow.com/questions/49617178/word-file-to-json-in-python – Wouter Sep 10 '21 at 14:39

1 Answer

You have to do this in 2 steps:

  1. Extract the data from the .docx file to TXT or XML.
  2. Import the result with SSIS (or Azure Data Factory if you are in the cloud).
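
Step 1 can be scripted for a whole folder rather than done file by file. The sketch below is one possible approach using only the Python standard library; the function names and the one-`.txt`-per-document layout are assumptions, not part of any particular tool. It reads `word/document.xml` from each .docx (which is a ZIP archive) and also reports the embedded files under `word/media/`, since plain text cannot carry images.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return (plain text, names of embedded media files) for one .docx file."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
        media = [n for n in archive.namelist() if n.startswith("word/media/")]
    # One output line per paragraph; empty paragraphs become blank lines.
    lines = [
        "".join(t.text or "" for t in p.iter(W_NS + "t"))
        for p in root.iter(W_NS + "p")
    ]
    return "\n".join(lines), media


def convert_folder(src_dir, out_dir):
    """Write one .txt per .docx in src_dir; return {docx name: media list}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = {}
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip the lock files Word creates while a document is open
        text, media = docx_to_text(docx_path)
        (out / (docx_path.stem + ".txt")).write_text(text, encoding="utf-8")
        report[docx_path.name] = media
    return report
```

The resulting `.txt` files can then be picked up by the import tool in step 2. The media names in the report could be used to pull images out of the archive separately with `zipfile`, if they matter for the analysis.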
Francesco Mantovani
  • Is there really no ELT tool that supports Word files? Doing everything without the help of a tool is a huge amount of work. I'm focusing on `Big Data` here, so it doesn't seem practical to extract and import loads of Word files into a database one by one. Even if I extract the Word files to `txt` or similar, that format doesn't support content like images. What should I do then? – KDeven Sep 10 '21 at 17:44
  • Also, what about the `data transformation` part that should happen before importing data into the database? For example, identifying similar kinds of data and sorting them into categories. Should everything be done manually? – KDeven Sep 10 '21 at 17:45